Things are moving fast, getting weird, and staying exciting. FlexGen dropped on GitHub on February 20, 2023. It's a game changer: you can now run ChatGPT-like large language models on a single graphics card. You used to need 10 GPUs to reach comparable performance.
The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. FlexGen lowers the resource requirements of LLM inference to a single commodity GPU (e.g., T4, 3090) and allows flexible deployment across various hardware setups. The key idea behind FlexGen is to trade off latency for throughput: by offloading weights and cache to CPU memory and disk, it can run a much larger effective batch size than would otherwise fit on the GPU.
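Why does a larger batch raise throughput when offloading? With offloading, every decoding step must stream the layer weights onto the GPU, and that I/O cost is fixed per step regardless of how many sequences share it. The toy cost model below (pure Python; the timing constants are hypothetical, not FlexGen measurements) sketches this amortization, assuming one weight-streaming pass per step:

```python
# Toy cost model for offloading-based inference (illustrative numbers only).
# A decoding step pays a fixed weight-streaming cost plus per-sequence compute;
# a larger batch amortizes the fixed cost, raising throughput but also raising
# per-step latency.

WEIGHT_IO_SEC = 2.0      # hypothetical: time to stream weights for one step
COMPUTE_PER_SEQ = 0.01   # hypothetical: compute time per sequence per step

def step_latency(batch_size: int) -> float:
    """Seconds for one decoding step over the whole batch."""
    return WEIGHT_IO_SEC + COMPUTE_PER_SEQ * batch_size

def throughput(batch_size: int) -> float:
    """Sequences processed per second at this batch size."""
    return batch_size / step_latency(batch_size)

for bs in (1, 32, 1024):
    print(f"batch={bs:5d}  latency={step_latency(bs):6.2f}s  "
          f"throughput={throughput(bs):7.2f} seq/s")
```

The numbers are made up, but the shape of the trade-off is the point: throughput grows steeply with batch size while single-request latency gets worse, which is exactly why FlexGen targets batch workloads rather than interactive chat.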
The key features of FlexGen include:
⚡ High-Throughput, Large-Batch Offloading
Higher-throughput generation than other offloading-based systems (e.g., Hugging Face Accelerate, DeepSpeed Zero-Inference). The key innovation is a new offloading technique that can effectively increase the batch size. This can be useful for batch inference scenarios, such as benchmarking (e.g., HELM) and data wrangling.
📦 Extreme Compression
Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
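The compression FlexGen describes is fine-grained group-wise quantization. The sketch below (pure Python; the group size, seed, and function names are my illustrative choices, not FlexGen's implementation) shows the basic mechanics: each small group of weights gets its own min/max scale, so 4-bit codes stay close to the original values.

```python
import random

def quantize_group_4bit(values):
    """Quantize a group of floats to 4-bit codes (0..15) using per-group
    min/max scaling. Returns (codes, scale, zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid divide-by-zero for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from 4-bit codes."""
    return [c * scale + lo for c in codes]

# Hypothetical weight group; group-wise schemes use small groups
# (e.g., 64 elements) so each scale fits its local value range tightly.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

codes, scale, lo = quantize_group_4bit(weights)
recon = dequantize_group(codes, scale, lo)

# Reconstruction error is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, recon))
print(f"max abs error: {max_err:.5f} (quantization step {scale:.5f})")
```

Because the scale is computed per group rather than per tensor, outliers in one group do not blow up the error everywhere else, which is one reason 4 bits can be nearly lossless in practice.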
Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are available.
As an offloading-based system running on weak GPUs, FlexGen also has its limitations. Its throughput is significantly lower than when you have enough powerful GPUs to hold the whole model, especially at small batch sizes. FlexGen is mostly optimized for throughput-oriented batch-processing settings (e.g., classifying or extracting information from many documents in batches) on a single GPU.