ByteDance Unveils 1.58-bit Quantized FLUX Model
Introduction
Text-to-image (T2I) generation models such as DALL-E 3 and Adobe Firefly 3 showcase remarkable generative capabilities, but their billions of parameters demand large amounts of memory. This makes them hard to deploy on resource-constrained platforms such as mobile devices. To tackle this issue, researchers from ByteDance and POSTECH have developed a post-training quantization approach that produces the 1.58-bit FLUX model.
Research Background
The research team explored extremely low-bit quantization of T2I models. They chose the FLUX.1-dev model because it is publicly available and performs strongly. The researchers applied 1.58-bit quantization, which compresses the vision transformer weights into only three discrete values: {-1, 0, +1}. The method does not require access to any image data and relies solely on self-supervision from the FLUX.1-dev model. Unlike BitNet b1.58, which trains a large language model from scratch with ternary weights, this approach is a post-training quantization solution applied to an existing model.
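To make the three-value idea concrete, here is a minimal sketch of ternary weight quantization in plain Python. The article does not say which rounding rule 1.58-bit FLUX uses, so the absmean scaling below (the rule described for BitNet b1.58) is an assumption, and the function names `ternarize` and `dequantize` are hypothetical. The sketch also shows where the "1.58" comes from: a weight with three possible values carries log2(3) ≈ 1.58 bits of information.

```python
import math

def ternarize(weights, eps=1e-8):
    """Map floats to values in {-1, 0, +1} plus one shared scale factor.

    Assumption: absmean scaling (as described for BitNet b1.58).
    This is an illustration, not the 1.58-bit FLUX procedure itself.
    """
    # Scale by the mean absolute value so typical weights land near +/-1.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate the original weights from ternary values and the scale."""
    return [t * scale for t in ternary]

weights = [0.42, -0.03, -0.57, 0.08, 0.91, -0.24]
ternary, scale = ternarize(weights)
print(ternary)                 # [1, 0, -1, 0, 1, -1]

# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))   # 1.58
```

Small weights collapse to 0 and large ones saturate at ±1, which is why quantization at this level needs some form of calibration or supervision to keep output quality.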
Achievements of the 1.58-bit FLUX Model
By applying this quantization method, the researchers reduced the model's storage requirements by a factor of 7.7. The 1.58-bit weights are stored as 2-bit signed integers, a substantial compression from the standard 16-bit precision. To further improve performance, the team developed a custom kernel optimized for low-bit computation, which reduced inference memory usage by more than 5.1 times and also improved inference latency. Evaluations on the GenEval and T2I CompBench benchmarks showed that the 1.58-bit FLUX model not only increased computational efficiency but also maintained generation quality comparable to the full-precision FLUX model.
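Storing ternary weights as 2-bit signed integers means four weights fit in one byte. The sketch below illustrates one way to do that packing in plain Python, using 2-bit two's-complement fields. The bit layout and the names `pack_ternary` and `unpack_ternary` are assumptions for illustration; the article does not describe the layout used by the actual checkpoint or kernel.

```python
def pack_ternary(values):
    """Pack values from {-1, 0, +1} four to a byte as 2-bit signed fields."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4 in this sketch")
    packed = bytearray()
    for i in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[i:i + 4]):
            if v not in (-1, 0, 1):
                raise ValueError(f"not a ternary value: {v}")
            # Two's complement in 2 bits: -1 -> 0b11, 0 -> 0b00, +1 -> 0b01.
            byte |= (v & 0b11) << (2 * slot)
        packed.append(byte)
    return bytes(packed)

def unpack_ternary(packed):
    """Recover the ternary values from bytes produced by pack_ternary."""
    values = []
    for byte in packed:
        for slot in range(4):
            field = (byte >> (2 * slot)) & 0b11
            # Sign-extend the 2-bit field back to a Python int.
            values.append(field - 4 if field >= 2 else field)
    return values

values = [1, 0, -1, 0, 1, -1, 0, 0]
packed = pack_ternary(values)
print(len(values), "weights ->", len(packed), "bytes")
assert unpack_ternary(packed) == values
```

A 2-bit field can hold four values, so one code point (-2 here) goes unused. That is the gap between the 2 bits actually stored and the 1.58 bits of information per weight.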
Performance Metrics
The quantization process compressed 99.5% of the vision transformer parameters (approximately 11.9 billion) in the FLUX model to 1.58 bits, markedly lowering the storage requirements. Experimental results indicated that the 1.58-bit FLUX model performed similarly to the original FLUX model on both the T2I CompBench and GenEval benchmarks. Notably, the model also ran inference faster on lower-performance GPUs such as the L20 and A10.
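A back-of-envelope calculation shows how these figures relate to the reported storage reduction. The sketch below makes several assumptions that the article does not confirm: that 11.9 billion is the total parameter count of the transformer, that the unquantized 0.5% stays at 16 bits, that quantized weights occupy 2 bits each, and that per-layer scale factors and other checkpoint contents are negligible. The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(quantized_fraction, full_bits=16, packed_bits=2):
    """Estimate the storage reduction when only part of a model is quantized."""
    avg_bits = (quantized_fraction * packed_bits
                + (1 - quantized_fraction) * full_bits)
    return full_bits / avg_bits

total_params = 11.9e9            # assumed total, from the article's figure
ratio = compression_ratio(0.995)

full_gb = total_params * 16 / 8 / 1e9   # 16-bit weights, in gigabytes
packed_gb = full_gb / ratio

print(f"estimated reduction: {ratio:.2f}x")
print(f"16-bit size: {full_gb:.1f} GB, estimated quantized size: {packed_gb:.1f} GB")
```

Under these assumptions the estimate comes out close to the reported 7.7 times rather than the ideal 8 times, because the small unquantized remainder still costs 16 bits per parameter. The reported figure is a measurement, so this is only a consistency check.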
Conclusion
The 1.58-bit FLUX model is a significant step toward the practical deployment of high-quality T2I models on devices with tight memory and latency constraints. Some limitations remain in speed improvements and in rendering fine detail in high-resolution images, but the model's potential to improve efficiency and reduce resource consumption is expected to inspire future research in this field.
Key Improvements
- Model Compression: The model's storage space has been reduced by 7.7 times.
- Memory Optimization: Inference memory usage has decreased by over 5.1 times.
- Performance Retention: The 1.58-bit FLUX model maintains performance levels similar to the full-precision FLUX model in benchmark evaluations.
- No Image Data Required: The quantization method does not rely on any image data, utilizing the model's self-supervision instead.
- Custom Kernel: A specialized kernel for low-bit computation has been implemented, enhancing overall inference efficiency.
Additional Resources
- Project Page: 1.58-bit FLUX
- Paper Link: Research Paper
- Model Link: Model Access