ByteDance Unveils 1.58-bit FLUX Model for Enhanced Efficiency
Artificial intelligence (AI) has transformed text-to-image (T2I) generation, with models like DALL·E 3 and Adobe Firefly 3 showcasing remarkable capabilities. However, these models typically contain billions of parameters and require substantial memory, which complicates their deployment on resource-limited platforms such as mobile devices.
To tackle this challenge, researchers from ByteDance and POSTECH have developed techniques for extremely low-bit quantization of T2I models. Their focus was the FLUX.1-dev model, notable for its public availability and strong performance. Using a technique known as 1.58-bit quantization, the researchers constrained the vision transformer weights of the FLUX model to just three values: {-1, 0, +1}. The quantization approach requires no access to image data, relying solely on the model's self-supervision. Unlike the BitNet b1.58 method, it is applied as post-training quantization rather than requiring a large language model to be trained from scratch.
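The article does not spell out the exact quantization rule used by the authors. As a rough illustration only, a BitNet b1.58-style "absmean" mapping to {-1, 0, +1} with a per-tensor scale might look like the sketch below; the function name and the use of `torch` are assumptions for this example, not the paper's implementation.

```python
import torch

def quantize_ternary(weight: torch.Tensor, eps: float = 1e-8):
    """Illustrative sketch: map a full-precision weight tensor to {-1, 0, +1}
    plus a per-tensor scale, in the spirit of BitNet b1.58-style absmean
    quantization. This is not the actual 1.58-bit FLUX scheme."""
    # Per-tensor scale: mean absolute value of the weights.
    scale = weight.abs().mean().clamp(min=eps)
    # Scale, round to the nearest integer, and clip to the ternary set.
    w_ternary = (weight / scale).round().clamp(-1, 1)
    return w_ternary, scale

# Dequantization for inference is simply w_ternary * scale.
```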
This quantization yielded a 7.7-fold reduction in model storage, since the 1.58-bit weights are stored as 2-bit signed integers instead of the standard 16-bit precision. To further improve inference efficiency, the team developed a custom kernel optimized for low-bit computation, which reduced inference memory usage by more than 5.1 times and improved latency. Evaluations on the GenEval and T2I CompBench benchmarks showed that 1.58-bit FLUX substantially improves computational efficiency while maintaining generation quality comparable to the full-precision FLUX model.
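The 2-bit storage behind those savings can be sketched as follows. The helpers `pack_ternary_2bit` and `unpack_ternary_2bit` are hypothetical and only show how four ternary codes fit into one byte; they are not the authors' custom low-bit kernel.

```python
import numpy as np

def pack_ternary_2bit(w_ternary: np.ndarray) -> np.ndarray:
    """Sketch: pack ternary weights {-1, 0, +1} into 2-bit codes, four weights
    per byte, which is where the ~8x saving over 16-bit weights comes from."""
    # Map {-1, 0, +1} -> {0, 1, 2} so each weight fits in 2 bits.
    codes = (w_ternary.astype(np.int8) + 1).astype(np.uint8)
    codes = codes.reshape(-1, 4)  # assumes the weight count is a multiple of 4
    # Pack four 2-bit codes into one byte.
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_ternary_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_ternary_2bit: recover the ternary weights."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8) - 1
```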
The researchers quantized 99.5% of the vision transformer parameters (11.9 billion in total) in the FLUX model to 1.58 bits, drastically lowering storage requirements. Experimental results indicated that the 1.58-bit FLUX model performed on par with the original FLUX model on the T2I CompBench and GenEval benchmarks. Notably, 1.58-bit FLUX delivered larger inference-speed gains on lower-performance GPUs such as the L20 and A10.
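As a back-of-the-envelope check (an assumption-laden estimate, not a figure from the paper), the reported 7.7-fold storage reduction is roughly what one would expect from storing 99.5% of the weights at 2 bits while the remaining 0.5% stay at 16 bits:

```python
# Rough sanity check of the ~7.7x storage reduction, assuming 99.5% of weights
# move from 16-bit to 2-bit storage and the rest stay at 16 bits
# (per-tensor scales and other metadata are ignored).
quantized_fraction = 0.995
avg_bits = quantized_fraction * 2 + (1 - quantized_fraction) * 16  # ~2.07 bits
compression_ratio = 16 / avg_bits
print(f"{compression_ratio:.1f}x")  # ~7.7x
```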
In conclusion, the 1.58-bit FLUX model is a significant step toward making high-quality T2I models practically deployable on devices with tight memory and latency budgets. While it has limitations in speed gains and in rendering fine detail at high resolutions, its potential for improving model efficiency and reducing resource consumption is substantial and may offer valuable insights for future research.
Key Improvements:
- Model Compression: Model storage space reduced by 7.7 times.
- Memory Optimization: Inference memory usage reduced by over 5.1 times.
- Performance Retention: 1.58-bit FLUX maintained performance comparable to the full-precision FLUX model on the GenEval and T2I CompBench benchmarks.
- No Image Data Required: The quantization process does not require access to any image data, relying solely on the model's self-supervision.
- Custom Kernel: A custom kernel optimized for low-bit computation was adopted, enhancing inference efficiency.
Project Page: 1.58-bit FLUX Project
Paper Link: Research Paper
Model Link: Model Repository