Really really fast LLM serving

August 16, 2024

I keep coming across scattered resources on optimizing LLMs and making them faster, so I’m just going to build out a running list of the articles and resources I find. All summaries are LLM-generated. This is not guaranteed to be up to date.


vLLM: Serve LLMs at scale

A single 3090 can serve Llama 3 to thousands of users

This article describes a high-performance deployment environment for vLLM, a serving engine optimized for large language models at scale. It outlines the key features of the setup, including a pre-configured vLLM server, an OpenAI-compatible API endpoint, and configuration through environment variables, and walks through usage, configuration options, built-in benchmarking tools, advanced configuration, log viewing, and custom SSL certificates. The aim is a complete guide to setting up and running the vLLM server for efficient large language model serving.
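
To make the OpenAI-compatible endpoint concrete, here is a minimal client sketch (mine, not the article’s): it assumes a vLLM server already running on localhost:8000 and serving a Llama 3 model; the host, port, and model name are placeholders you would swap for your own deployment.

    from openai import OpenAI

    # vLLM's OpenAI-compatible server normally exposes its API under /v1.
    # The api_key is ignored unless the server was started with --api-key.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the model the server loaded
        messages=[{"role": "user", "content": "Explain paged attention in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)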


LLM Compressor is Here: Faster Inference with vLLM

We’re excited to introduce LLM Compressor, a library to compress LLMs for faster inference with vLLM.

This article announces the release of LLM Compressor, a unified library for creating compressed models to enable faster inference with vLLM (a serving engine for large language models). Key points include:

  1. LLM Compressor allows users to create compressed, accurate versions of models using various techniques like activation and weight quantization, and weight pruning.

  2. It enables activation quantization in vLLM, which can significantly improve performance for compute-heavy workloads, especially in production serving deployments.

  3. The article provides benchmarks comparing different quantization methods (unquantized FP16, INT8 weight and activation quantization, and INT4 weight-only quantization) for the Llama 3.1 70B model, demonstrating improved performance and resource efficiency with activation quantization.

  4. LLM Compressor integrates with Hugging Face models and vLLM for easy use in the open-source ecosystem.

  5. The article includes a code snippet demonstrating how to use LLM Compressor to quantize a model and run inference with vLLM (a rough sketch of that workflow follows this summary).

  6. Future plans for LLM Compressor include expanding model support, adding new algorithms, supporting non-Nvidia hardware, and implementing additional compression techniques.

  7. Neural Magic offers nm-vllm, an enterprise distribution of vLLM with additional features and support for production deployments.

The article emphasizes the potential of LLM Compressor to improve the efficiency and performance of large language models in various deployment scenarios.
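
For reference, here is a rough sketch of the quantize-then-serve workflow the announcement describes. It is not the article’s exact snippet: the model name, dataset, and recipe settings are illustrative, and module paths or argument names may differ between llmcompressor versions, so treat this as an outline and check the official docs.

    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
    from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

    # Load the base model, then apply INT8 weight + activation (W8A8) quantization
    # via a one-shot calibration pass over a small dataset.
    model = SparseAutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3.1-8B-Instruct", device_map="auto", torch_dtype="auto"
    )
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
    ]
    oneshot(
        model=model,
        dataset="open_platypus",
        recipe=recipe,
        output_dir="Meta-Llama-3.1-8B-Instruct-W8A8",
        max_seq_length=2048,
        num_calibration_samples=512,
    )

    # The compressed checkpoint can then be loaded directly by vLLM.
    from vllm import LLM

    llm = LLM("Meta-Llama-3.1-8B-Instruct-W8A8")
    print(llm.generate("What does activation quantization buy you?")[0].outputs[0].text)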


Best LLM Inference Engines and Servers to Deploy LLMs in Production

This article discusses the best LLM (Large Language Model) inference engines and servers for deploying AI applications in production. It highlights the importance of optimizing throughput and latency when serving LLMs. The piece compares several solutions, including vLLM, TensorRT-LLM, Hugging Face Text Generation Inference, RayLLM with RayServe, and Triton Inference Server. Each option offers unique features like efficient memory management, continuous batching, and specialized attention algorithms to enhance performance. The article emphasizes that choosing the right solution depends on specific use cases, model sizes, and hardware requirements, with some options being limited to certain GPU types. It concludes by stressing the importance of high-performance AI infrastructure for optimal LLM deployment.
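
As a small illustration of the batching-for-throughput theme (my own sketch, not from the article): vLLM’s offline API takes a whole list of prompts and handles batching and KV-cache memory management internally. The model name and sampling settings below are just placeholders.

    from vllm import LLM, SamplingParams

    prompts = [
        "Summarize the benefits of continuous batching.",
        "What is paged attention?",
        "Why does KV-cache memory matter for serving throughput?",
    ]
    sampling = SamplingParams(temperature=0.7, max_tokens=64)

    # The engine schedules these requests together (continuous batching),
    # which is where most of the throughput gain over naive one-at-a-time
    # generation comes from.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model choice
    for output in llm.generate(prompts, sampling):
        print(output.prompt, "->", output.outputs[0].text.strip())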


GPU Utilization is a Misleading Metric

The article “GPU Utilization is a Misleading Metric” argues that the commonly used GPU Utilization metric is a poor proxy for true GPU performance in machine learning workloads. The authors, working with a foundation model company, found that 100% GPU utilization could coexist with only 20% Model FLOPS Utilization (MFU). They explain that GPU Utilization only measures whether a kernel is executing, not how efficiently it uses the GPU’s cores. The article recommends tracking SM Efficiency and MFU for a more accurate picture of GPU performance. By implementing optimizations such as fusing layers in transformer blocks, the team achieved a 4x speedup in training time and increased MFU from 20% to 38%. The piece concludes by suggesting that AI teams monitor SM Efficiency alongside GPU Utilization, and looks forward to future compiler optimizations that might automate these improvements.
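
For intuition, MFU is just the FLOPS a model actually achieves divided by the hardware’s theoretical peak. A back-of-the-envelope sketch (the ~6 * N FLOPs-per-token rule of thumb and the hardware numbers below are my assumptions, not figures from the article):

    def model_flops_utilization(n_params: float, tokens_per_second: float, peak_flops: float) -> float:
        # Approximate MFU for dense-transformer training, using the common
        # ~6 * N FLOPs-per-token estimate for a forward + backward pass.
        achieved_flops = 6.0 * n_params * tokens_per_second
        return achieved_flops / peak_flops

    # Hypothetical numbers: a 7B-parameter model training at 2,800 tokens/s
    # on a GPU with ~312 TFLOPS peak BF16 throughput (roughly an A100).
    mfu = model_flops_utilization(n_params=7e9, tokens_per_second=2_800, peak_flops=312e12)
    print(f"MFU = {mfu:.1%}")  # ~37.7%, even while nvidia-smi could be showing 100% GPU utilization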