Things are moving fast, getting weird, and staying exciting. FlexGen dropped on GitHub on February 20, 2023. It's a game changer: you can now run ChatGPT-like large language models on a single graphics card, where you used to need 10 GPUs to get the same performance.
FlexGen is an open source collaboration among the LMSys community, Stanford, UC Berkeley, TOGETHER, and ETH Zürich.
The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference to a single commodity GPU (e.g., a T4 or RTX 3090) and to allow flexible deployment across various hardware setups. The key technique behind FlexGen is trading off latency for throughput: it uses offloading strategies that increase the effective batch size a single GPU can process.
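To make that trade-off concrete, here is a minimal, purely illustrative PyTorch sketch of the general offloading idea (not FlexGen's actual implementation): weights stay in CPU RAM, each layer is streamed to the GPU in turn, and a large batch is pushed through it so the transfer cost is amortized across many inputs.

```python
# Illustrative sketch of offloaded, large-batch inference.
# Keep weights in CPU RAM, stream one layer at a time to the GPU,
# and run a big batch through that layer before moving on.
import torch

num_layers, hidden, batch = 8, 1024, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

# Weights live in CPU memory (pinned memory speeds up host-to-GPU copies).
cpu_layers = [
    torch.randn(hidden, hidden).pin_memory() if device == "cuda"
    else torch.randn(hidden, hidden)
    for _ in range(num_layers)
]

x = torch.randn(batch, hidden, device=device)
for w_cpu in cpu_layers:
    w_gpu = w_cpu.to(device, non_blocking=True)  # stream this layer's weights in
    x = torch.relu(x @ w_gpu)                    # run the whole batch through it
    del w_gpu                                    # free GPU memory before the next layer
print(x.shape)
```

The larger the batch, the more work each weight transfer buys, which is exactly why per-token latency goes up while overall throughput goes up with it.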
The key features of FlexGen include:
⚡ High-Throughput, Large-Batch Offloading
Higher-throughput generation than other offloading-based systems (e.g., Hugging Face Accelerate, DeepSpeed Zero-Inference). The key innovation is a new offloading technique that can effectively increase the batch size. This can be useful for batch inference scenarios, such as benchmarking (e.g., HELM) and data wrangling.
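For throughput-oriented workloads, the usage pattern is simple: chunk your documents into large batches and let the offloading system churn through them. The helper below is a generic sketch; `generate_batch` is a hypothetical stand-in for whatever offloading-based generation call you use, not a FlexGen API.

```python
# Hypothetical batch-processing loop for a throughput-oriented workload,
# such as labeling or summarizing many documents.
from typing import Callable, List

def run_in_batches(docs: List[str],
                   generate_batch: Callable[[List[str]], List[str]],
                   batch_size: int = 64) -> List[str]:
    """Feed documents to the model in large batches to maximize throughput."""
    outputs: List[str] = []
    for start in range(0, len(docs), batch_size):
        outputs.extend(generate_batch(docs[start:start + batch_size]))
    return outputs

if __name__ == "__main__":
    # Dummy generator standing in for a real offloading-based model call.
    dummy = lambda batch: [f"summary of: {d[:20]}" for d in batch]
    print(run_in_batches(["doc one", "doc two", "doc three"], dummy, batch_size=2))
```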
📦 Extreme Compression
Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
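As a rough illustration of the idea (the group size and rounding scheme here are assumptions, not FlexGen's exact algorithm), group-wise 4-bit quantization stores each group of values as 4-bit integers plus a per-group scale and offset:

```python
# Sketch of group-wise 4-bit quantization for weights or the attention (KV) cache.
import torch

def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    """Quantize a tensor to 4-bit codes per group, with per-group min and scale."""
    flat = x.reshape(-1, group_size)
    mins = flat.min(dim=1, keepdim=True).values
    maxs = flat.max(dim=1, keepdim=True).values
    scales = (maxs - mins).clamp(min=1e-8) / 15.0  # 4 bits -> 16 levels
    codes = ((flat - mins) / scales).round().clamp(0, 15).to(torch.uint8)
    return codes, mins, scales

def dequantize_4bit(codes, mins, scales, shape):
    return (codes.float() * scales + mins).reshape(shape)

w = torch.randn(1024, 1024)
codes, mins, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, mins, scales, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```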
🚀 Scalability
Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are available.
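Conceptually, pipeline parallelism splits the model's layers into stages, one per device, and micro-batches flow through the stages in order. The sketch below is only an outline of that idea, not FlexGen's distributed runtime:

```python
# Conceptual sketch of pipeline parallelism across two devices.
import torch
import torch.nn as nn

hidden, layers_per_stage = 512, 4
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

# One stage (a slice of the model's layers) per device.
stages = [
    nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(layers_per_stage)]).to(dev)
    for dev in devices
]

def pipeline_forward(micro_batches):
    outputs = []
    for mb in micro_batches:          # a real runtime overlaps stages in time
        x = mb
        for stage, dev in zip(stages, devices):
            x = stage(x.to(dev))      # hand activations to the next stage's device
        outputs.append(x)
    return torch.cat(outputs)

batch = torch.randn(128, hidden)
print(pipeline_forward(batch.chunk(4)).shape)
```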
❌ Limitations
As an offloading-based system running on weak GPUs, FlexGen has its limitations. Its throughput is significantly lower than what you get with enough powerful GPUs to hold the whole model, especially at small batch sizes. FlexGen is mostly optimized for throughput-oriented batch processing on a single GPU (e.g., classifying or extracting information from many documents in batches).
Learn more by checking out the FlexGen GitHub project and reading the supporting paper.