Meta Releases Quantized Versions of Llama 3.2 1B / 3B Models

After open-sourcing the 1B and 3B models of Llama 3.2 in September, Meta released quantized versions of these models on October 24th. The quantized models reduce model size by an average of 56% and RAM usage by an average of 41%, run 2 to 4 times faster, and consume less power, allowing them to be deployed on a wider range of mobile devices. Model quantization is the process of converting a floating-point model to a lower-precision representation (such as fixed-point or integer values), which compresses the model's parameters and reduces its computational complexity so that it can run on lighter platforms.
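To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy. It illustrates the general principle only, not Meta's actual scheme (the announcement describes more sophisticated techniques, namely SpinQuant and QLoRA-based quantization-aware training); the function names here are illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights to int8 in [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0  # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Example: quantize a small random weight matrix
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by scale / 2
print("storage: 4 bytes -> 1 byte per weight (75% smaller)")
```

Each weight now occupies one byte instead of four, at the cost of a small, bounded rounding error; production schemes like Meta's go further (e.g., lower bit widths and groupwise scales) to push the size and speed gains reported above.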

These quantized models are faster, use less RAM, and consume less power than the non-quantized Llama BF16 models, while maintaining nearly the same accuracy.
Although the quantized Llama 3.2 1B and 3B models only support a context length of 8,000 tokens (compared to 128,000 tokens for the original models), Meta's testing found that both quantized variants, Llama QLoRA and Llama SpinQuant, deliver benchmark results quite comparable to the original Llama BF16 version.

So far, Meta has tested these quantized models on mobile platforms such as the OnePlus 12, Samsung Galaxy S24+/S22, and Apple iOS devices (no specific models announced), with "good results," and the researchers plan to further improve the performance of these quantized models with neural processing units (NPUs) in the future.

Author: Hans
