ByteDance Launches Revolutionary 1.58-bit FLUX Model
Artificial intelligence (AI) driven text-to-image (T2I) generation models have garnered attention for their exceptional capabilities. Notable examples such as DALL·E 3 and Adobe Firefly 3 show strong potential in a range of real-world applications. However, these models typically contain billions of parameters and require substantial memory, making deployment on resource-constrained platforms, such as mobile devices, a significant challenge.
To tackle these issues, researchers from ByteDance and POSTECH have explored extremely low-bit quantization of T2I models. Among the available candidates, FLUX.1-dev was chosen for its public availability and strong performance. The researchers employed a 1.58-bit quantization method that compresses the vision transformer weights in FLUX to just three values, {-1, 0, +1}. The approach relies solely on self-supervision from the FLUX.1-dev model itself and requires no access to image data, distinguishing it from the earlier BitNet b1.58 method, which trains a large language model from scratch.
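The article does not spell out the exact calibration recipe, so the following is only a minimal sketch of ternary (1.58-bit ≈ log₂3) weight quantization, assuming a per-tensor absmean scale in the spirit of BitNet b1.58; the function name is illustrative:

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8):
    """Quantize a float weight tensor to {-1, 0, +1} with an absmean scale.

    Returns ternary weights plus the scale needed to dequantize:
    w ≈ scale * w_ternary. This mirrors the absmean scheme of BitNet
    b1.58; the actual 1.58-bit FLUX calibration may differ in detail.
    """
    scale = w.abs().mean().clamp(min=eps)         # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)  # snap to {-1, 0, +1}
    return w_ternary.to(torch.int8), scale

# Quantize a random matrix and check the reconstruction error.
w = torch.randn(4096, 4096)
w_t, s = ternary_quantize(w)
rel_err = (w - s * w_t.float()).norm() / w.norm()
print(f"values: {w_t.unique().tolist()}, relative error: {rel_err:.3f}")
```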
Implementing this quantization reduced the model's storage requirements by 7.7 times: the 1.58-bit weights are stored as 2-bit signed integers, down from the conventional 16-bit precision. To improve inference efficiency, the researchers also developed a custom kernel optimized for low-bit computation, reducing inference memory usage by more than 5.1 times and improving latency.
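The storage layout of the released kernels is not specified here, but the arithmetic is consistent with 2-bit packing: with roughly 99.5% of parameters at 2 bits and the remaining 0.5% left at 16-bit precision, the average cost is about 0.995 × 2 + 0.005 × 16 ≈ 2.07 bits per weight, i.e. a 16 / 2.07 ≈ 7.7× reduction. A hypothetical packing scheme that stores four 2-bit two's-complement codes per byte could look like this:

```python
import torch

def pack_ternary(w_t: torch.Tensor) -> torch.Tensor:
    """Pack int8 ternary weights {-1, 0, +1} into 2-bit codes, four per byte.

    Uses 2-bit two's complement (-1 -> 0b11, 0 -> 0b00, +1 -> 0b01), so
    each 16-bit weight shrinks to 2 bits of storage.
    """
    flat = w_t.flatten()
    assert flat.numel() % 4 == 0, "pad to a multiple of 4 in practice"
    codes = (flat & 0b11).to(torch.uint8).view(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def unpack_ternary(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Inverse of pack_ternary: recover int8 values in {-1, 0, +1}."""
    shifts = torch.tensor([0, 2, 4, 6], dtype=torch.uint8)
    codes = ((packed.unsqueeze(1) >> shifts) & 0b11).to(torch.int8)
    vals = torch.where(codes == 0b11, codes.new_tensor(-1), codes)
    return vals.flatten()[:numel]

# Round-trip check on a small ternary tensor.
w_t = torch.tensor([-1, 0, 1, -1, 1, 1, 0, 0], dtype=torch.int8)
assert torch.equal(unpack_ternary(pack_ternary(w_t), w_t.numel()), w_t)
```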
Benchmark evaluations on GenEval and T2I-CompBench showed that the 1.58-bit FLUX model significantly improves computational efficiency while maintaining generation quality comparable to the full-precision FLUX model. The researchers quantized 99.5% of the vision transformer's 11.9 billion parameters to 1.58 bits, considerably lowering storage requirements. Notably, 1.58-bit FLUX also delivered marked inference speedups on lower-performance GPUs, such as the NVIDIA L20 and A10.
In conclusion, the introduction of the 1.58-bit FLUX model represents a significant step toward deploying high-quality T2I models on devices with limited memory and strict latency requirements. Although the model still faces limitations in further speed gains and in rendering fine detail at high resolutions, its gains in efficiency and resource consumption are expected to provide valuable insights for future research in the field.
Key Improvements
- Model Compression: Storage space reduced by 7.7 times.
- Memory Optimization: Inference memory usage decreased by over 5.1 times.
- Performance Retention: Maintained performance comparable to the full-precision FLUX model in benchmarks.
- No Image Data Required: Quantization does not rely on image data, utilizing the model's self-supervision.
- Custom Kernel: An optimized kernel for low-bit computation enhances inference efficiency (a reference sketch of the computation it fuses follows below).

For more details, visit the Project Page, Paper Link, and Model Link.
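The custom kernel itself is not published in this article. As a reference point only, here is a plain PyTorch sketch of the computation such a kernel would fuse, reusing the hypothetical `ternary_quantize`, `pack_ternary`, and `unpack_ternary` helpers from the sketches above:

```python
import torch

def ternary_linear(x: torch.Tensor, packed: torch.Tensor, scale: torch.Tensor,
                   out_features: int, in_features: int) -> torch.Tensor:
    """Reference forward pass for a ternary-quantized linear layer.

    A real low-bit kernel would unpack the 2-bit codes and accumulate in
    a single fused GPU pass; for clarity this sketch unpacks to int8,
    dequantizes with the absmean scale, and falls back to a dense matmul.
    """
    w_t = unpack_ternary(packed, out_features * in_features)  # helper from above
    w = scale * w_t.view(out_features, in_features).to(x.dtype)
    return x @ w.t()

# Example: quantize, pack, and run a batch through the layer.
w = torch.randn(256, 512)
w_t, s = ternary_quantize(w)   # helper from the first sketch
packed = pack_ternary(w_t)
x = torch.randn(8, 512)
y = ternary_linear(x, packed, s, out_features=256, in_features=512)
print(y.shape)  # torch.Size([8, 256])
```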