
I’ve been promised Virtual Reality and Artificial Intelligence (AI) all my life. I bought the Power Glove. It was mostly hype and small toys that never stuck. But the current iterations?
What we are seeing now with Large Language Models (LLMs such as GPT) and Stable Diffusion (image generation) is nothing short of a change in how we use computers. Models, weights, and LoRAs are now the “Programs” we run.
I’ve spent the last month with products like InvokeAI and Ollama. They are wonderful, but they aren’t even close to where we will be in two years for a consumer. Still, I can’t help but think of giant foundational models trained on the entire human corpus being compressed into tiny chips that can be queried anywhere, some kind of “Holographic” computing.
I can understand why that one Google engineer freaked out talking to an internal chatbot.
“I think, therefore I am” is a famous philosophical statement made by René Descartes in his Meditations on First Philosophy. The statement is best known in its Latin form, “Cogito, ergo sum,” and it is meant to express the idea that the very act of doubting one’s existence serves as proof of one’s own consciousness.
In other words, if you are able to think about whether or not you exist, then you must exist in some form in order to have that thought. This statement has been interpreted and debated by philosophers for centuries, but at its core, it is a powerful reminder of the connection between thinking and being.
Mistral24b
Philosophy aside, let’s talk hardware
We are in the “Mainframe” era of AI. These models literally take up rooms (rows of racks in a datacenter), and it seems it’s One Big GPU per User at a time. I’m trying to fathom what it takes to support Copilot or ChatGPT at Microsoft’s and OpenAI’s scale.
As you can see, the problem with size is a combination of power and memory bandwidth, but these will be solved with tricks, like they always are. INT8 quantization is such a hilariously simple optimization that I’m quite shocked it works so well.
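Why does such a blunt trick matter so much? Because generating text is mostly memory-bound: every new token has to stream essentially all of the weights through the memory bus. Here’s a rough back-of-envelope sketch (illustrative numbers of my own, not benchmarks) of the decode-speed ceiling you get from bandwidth alone:

```python
# Back-of-envelope: autoregressive decoding is roughly memory-bound, so the ceiling
# on tokens/sec is about (memory bandwidth) / (bytes of weights read per token).
# All numbers below are illustrative assumptions, not measurements.

def tokens_per_sec_ceiling(params_billion: float, bytes_per_weight: float,
                           bandwidth_gb_per_s: float) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_weight
    return (bandwidth_gb_per_s * 1e9) / model_bytes

# Hypothetical 70B-parameter model on a GPU with ~1 TB/s of memory bandwidth:
for label, bytes_per_weight in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    tps = tokens_per_sec_ceiling(70, bytes_per_weight, 1000)
    print(f"{label}: ~{tps:.0f} tokens/sec ceiling")
```

Halve the bytes per weight and you roughly double that ceiling, which is why quantization is the first trick everyone reaches for.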
Overview of quantization that I will spend the rest of the month fact checking…
Compressing the weights of neural networks from floating-point to int8 (also known as quantization) can significantly reduce the computational requirements and memory footprint of deep learning models, making them more suitable for deployment on low-power hardware such as mobile devices or embedded systems.
The general idea behind weight compression is to represent the weights of a neural network using fewer bits than their original floating-point representation. By reducing the precision of the weights, it is possible to significantly reduce the storage requirements and computational complexity of the model. In particular, int8 quantization involves representing the weights as 8-bit integers rather than 32-bit or 64-bit floating-point numbers.
There are several steps involved in compressing neural network weights from floating-point to int8:
- Weight normalization: This step involves scaling the weights of the neural network so that they fit within the range of an 8-bit integer. This is typically done by computing the maximum absolute value of the weights and then scaling them accordingly.
- Quantization: This step involves converting the normalized weights from floating-point to int8 format. There are several methods for doing this, including linear quantization and logarithmic quantization.
- Decomposition: In some cases, it may be possible to decompose the weight matrix of a neural network into smaller matrices that can be more easily compressed. This is known as low-rank decomposition or tensor decomposition.
- Pruning: This step involves removing unnecessary connections between neurons in the neural network, further reducing the computational requirements and memory footprint of the model.
- Codebook generation: In some cases, it may be possible to generate a codebook that maps int8 values back to their original floating-point values. This can help improve the accuracy of the compressed model.
- Model fine-tuning: After compressing the weights of a neural network, it is typically necessary to fine-tune the model to ensure that its performance is not significantly degraded. This can be done using techniques such as knowledge distillation or iterative quantization.
Overall, compressing neural network weights from floating-point to int8 format can greatly improve the performance and energy efficiency of deep learning models on low-power hardware. However, it requires careful consideration of factors such as weight normalization, quantization, decomposition, pruning, codebook generation, and model fine-tuning in order to ensure that the compressed model retains its accuracy and performance.
Mistral24b
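To see the “hilariously simple” part for myself, here’s a minimal sketch of symmetric (absmax) int8 quantization of a single weight matrix. The function names and the toy matrix are mine; real frameworks do this per-channel or per-group, with calibration and fine-tuning on top.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric absmax quantization: scale weights into [-127, 127] and round."""
    scale = np.abs(w).max() / 127.0          # the "weight normalization" step
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate floating-point values."""
    return q.astype(np.float32) * scale

# Toy example: a random fp32 weight matrix, quantized and then reconstructed.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory: fp32", w.nbytes, "bytes -> int8", q.nbytes, "bytes")
print("mean abs error:", np.abs(w - w_hat).mean())
```

One scale factor per tensor, a round, and a clip, and the weights take a quarter of the memory of fp32 (half of fp16). That really is most of the magic.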
Google sells the Coral TPU, a USB accelerator capable of a few TOPS (tera-operations per second, trillions of int8 operations rather than floating-point FLOPS). Great for old-school CNN-style networks, but pretty much useless for the current generation of transformer models that want giant amounts of video RAM (24GB+ of VRAM).
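For a sense of scale, here’s some napkin math (my own assumptions on model sizes) for what just storing the weights costs at different precisions; it’s why a small accelerator with a few TOPS and megabytes of onboard memory doesn’t move the needle for LLMs:

```python
# Rough weight-storage math for a few illustrative model sizes (assumed, not measured).
GB = 1024 ** 3

def weight_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return params_billion * 1e9 * bytes_per_weight / GB

for params in (7, 13, 70):
    fp16 = weight_gb(params, 2.0)
    int8 = weight_gb(params, 1.0)
    int4 = weight_gb(params, 0.5)
    print(f"{params}B params: fp16 ~{fp16:.1f} GB, int8 ~{int8:.1f} GB, int4 ~{int4:.1f} GB")
```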
I’m awaiting the LLM/Stable Diffusion version of the Coral TPU or NVIDIA’s Jetson Nano.
Make sure your board is waxed; there is a giant wave of VR and AI coming in the next three years.