Things are moving fast, getting weird, and staying exciting. FlexGen dropped on GitHub on February 20, 2023. It's a game changer: you can now run ChatGPT-like large language models on a single graphics card, where you used to need 10 GPUs to get the same performance.
FlexGen is an open source collaboration among the LMSys community, Stanford, UC Berkeley, TOGETHER, and ETH Zürich.
The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. FlexGen aims to lower the resource requirements of LLM inference to a single commodity GPU (e.g., a T4 or RTX 3090) and to allow flexible deployment across various hardware setups. The key technique behind FlexGen is trading off latency for throughput: it uses offloading strategies that increase the effective batch size a single GPU can process.
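To make that trade-off concrete, here is a minimal, purely illustrative PyTorch sketch of the general offloading idea (not FlexGen's actual implementation): weights stay in CPU RAM, each layer is streamed to the GPU in turn, and a large batch is pushed through it so the transfer cost is amortized across many inputs.

```python
# Illustrative sketch of offloaded, large-batch inference.
# Keep weights in CPU RAM, stream one layer at a time to the GPU,
# and run a big batch through that layer before moving on.
import torch

num_layers, hidden, batch = 8, 1024, 256
device = "cuda" if torch.cuda.is_available() else "cpu"

# Weights live in CPU memory (pinned memory speeds up host-to-GPU copies).
cpu_layers = [
    torch.randn(hidden, hidden).pin_memory() if device == "cuda"
    else torch.randn(hidden, hidden)
    for _ in range(num_layers)
]

x = torch.randn(batch, hidden, device=device)
for w_cpu in cpu_layers:
    w_gpu = w_cpu.to(device, non_blocking=True)  # stream this layer's weights in
    x = torch.relu(x @ w_gpu)                    # run the whole batch through it
    del w_gpu                                    # free GPU memory before the next layer
print(x.shape)
```

The larger the batch, the more work each weight transfer buys, which is exactly why per-token latency goes up while overall throughput goes up with it.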
The key features of FlexGen include:
⚡ High-Throughput, Large-Batch Offloading
Higher-throughput generation than other offloading-based systems (e.g., Hugging Face Accelerate, DeepSpeed Zero-Inference). The key innovation is a new offloading technique that can effectively increase the batch size. This can be useful for batch inference scenarios, such as benchmarking (e.g., HELM) and data wrangling.
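For throughput-oriented workloads, the usage pattern is simple: chunk your documents into large batches and let the offloading system churn through them. The helper below is a generic sketch; `generate_batch` is a hypothetical stand-in for whatever offloading-based generation call you use, not a FlexGen API.

```python
# Hypothetical batch-processing loop for a throughput-oriented workload,
# such as labeling or summarizing many documents.
from typing import Callable, List

def run_in_batches(docs: List[str],
                   generate_batch: Callable[[List[str]], List[str]],
                   batch_size: int = 64) -> List[str]:
    """Feed documents to the model in large batches to maximize throughput."""
    outputs: List[str] = []
    for start in range(0, len(docs), batch_size):
        outputs.extend(generate_batch(docs[start:start + batch_size]))
    return outputs

if __name__ == "__main__":
    # Dummy generator standing in for a real offloading-based model call.
    dummy = lambda batch: [f"summary of: {d[:20]}" for d in batch]
    print(run_in_batches(["doc one", "doc two", "doc three"], dummy, batch_size=2))
```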
📦 Extreme Compression
Compress both the parameters and attention cache of models, such as OPT-175B, down to 4 bits with negligible accuracy loss.
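As a rough illustration of the idea (the group size and rounding scheme here are assumptions, not FlexGen's exact algorithm), group-wise 4-bit quantization stores each group of values as 4-bit integers plus a per-group scale and offset:

```python
# Sketch of group-wise 4-bit quantization for weights or the attention (KV) cache.
import torch

def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    """Quantize a tensor to 4-bit codes per group, with per-group min and scale."""
    flat = x.reshape(-1, group_size)
    mins = flat.min(dim=1, keepdim=True).values
    maxs = flat.max(dim=1, keepdim=True).values
    scales = (maxs - mins).clamp(min=1e-8) / 15.0  # 4 bits -> 16 levels
    codes = ((flat - mins) / scales).round().clamp(0, 15).to(torch.uint8)
    return codes, mins, scales

def dequantize_4bit(codes, mins, scales, shape):
    return (codes.float() * scales + mins).reshape(shape)

w = torch.randn(1024, 1024)
codes, mins, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, mins, scales, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```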
🚀 Scalability
Comes with a distributed pipeline parallelism runtime to allow scaling if more GPUs are available.
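Conceptually, pipeline parallelism splits the model's layers into stages, one per device, and micro-batches flow through the stages in order. The sketch below is only an outline of that idea, not FlexGen's distributed runtime:

```python
# Conceptual sketch of pipeline parallelism across two devices.
import torch
import torch.nn as nn

hidden, layers_per_stage = 512, 4
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

# One stage (a slice of the model's layers) per device.
stages = [
    nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(layers_per_stage)]).to(dev)
    for dev in devices
]

def pipeline_forward(micro_batches):
    outputs = []
    for mb in micro_batches:          # a real runtime overlaps stages in time
        x = mb
        for stage, dev in zip(stages, devices):
            x = stage(x.to(dev))      # hand activations to the next stage's device
        outputs.append(x)
    return torch.cat(outputs)

batch = torch.randn(128, hidden)
print(pipeline_forward(batch.chunk(4)).shape)
```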
❌ Limitations
As an offloading-based system running on weak GPUs, FlexGen has its limitations. Its throughput is significantly lower than what you get with enough powerful GPUs to hold the whole model, especially at small batch sizes. FlexGen is mostly optimized for throughput-oriented batch processing on a single GPU (e.g., classifying or extracting information from many documents in batches).
Learn more by checking out the FlexGen GitHub project and reading the supporting paper.