You suck at deploying A.I. models

August 25, 2024

> If I had more time, I would have written a longer rant. ~Some Dude

If you deploy AI models to production, you need to give a shit. Please, I beg of you. Stop wasting the compute, stop over-provisioning extremely expensive GPU nodes, stop wasting our time, stop wasting my time, and stop throwing money into a furnace.

I just found out that GPU utilization can sit at 100% while doing zero actual work. This confirms a long-standing theory of mine that streaming multiprocessors play a much bigger part in AI inference than most people realize, and that relying on VRAM and GPU utilization as your sole metrics is worse than useless – it's misleading. It makes you think everything is hunky-dory when in reality you are probably using only 10% of the compute you are paying for.
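
The reason is that the headline "GPU utilization" number only reports the fraction of the sampling window in which at least one kernel was resident on the device; a single tiny kernel keeping one SM busy reads exactly the same as a kernel saturating all of them. Here is a minimal sketch of how to see that for yourself, assuming an NVIDIA driver plus the `pynvml` bindings are installed; the polling loop and formatting are just illustrative:

```python
# Minimal sketch: poll the same counters nvidia-smi reports and note what
# they actually mean. Assumes an NVIDIA driver plus the pynvml bindings
# (nvidia-ml-py); device index 0 is an arbitrary choice.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # util.gpu: percent of the sample window in which *any* kernel was running.
    # It says nothing about how many streaming multiprocessors were busy.
    # util.memory: percent of time the memory controller was being read or
    # written, not how full VRAM is.
    print(
        f"gpu_util={util.gpu}%  mem_util={util.memory}%  "
        f"vram={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB"
    )
    time.sleep(1)

pynvml.nvmlShutdown()
```

Run that next to a toy kernel that launches a single block in a loop and it will happily report 100% while nearly every SM sits idle. For per-SM occupancy you need tooling like DCGM or Nsight, not nvidia-smi.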

> I paid for the whole computer, I'm going to use the whole computer!

When you are deploying LLMs to production, this is your north star, your guiding light, your mantra. Provisioning GPU nodes on AWS runs hundreds of dollars per month for a small node and thousands per month for larger ones. LLM deployment is the art of pegging every single resource on that box as close to 100% as you can to maximize the useful work you get out of it. Using GPU resources efficiently matters. Failing to understand the underlying architecture is lighting money on fire – and a lot of it. LLM deployment pushes the bounds of what computers are capable of; you cannot ignore the machine's limitations the way you can with normal software.
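
To put rough numbers on it (every figure below is an illustrative assumption, not a quote from any provider's pricing page):

```python
# Back-of-the-envelope waste calculation. Every number here is an assumed,
# illustrative figure, not a real AWS quote or a measurement.
monthly_node_cost = 2_500      # USD/month for one mid-size GPU node (assumed)
effective_utilization = 0.10   # fraction of the compute doing useful work (assumed)
nodes = 4                      # size of a small inference fleet (assumed)

total_spend = monthly_node_cost * nodes
wasted = total_spend * (1 - effective_utilization)
print(f"paying: ${total_spend:,}/month, wasted: ${wasted:,.0f}/month")
# paying: $10,000/month, wasted: $9,000/month
```

At 10% effective utilization you are buying ten units of compute for every one you actually use. That gap is what this whole post is about.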

Most people deploy an LLM to a web server using something like Hugging Face and FastAPI and call it a day. That is fine if you have virtually no users and don't need redundancy, but if you're building something serious you cannot be content with it. Deploying AI models is a minefield, so let me kickstart your journey by offering some questions to ask yourself as you improve your API (a sketch of that naive baseline follows the list):

  • How do you set up redundancy?
  • How are requests routed between instances of the model?
  • How many models can you load simultaneously?
  • Can you load multiple models per Docker container or do you need to split them out?
  • Can multiple Docker containers reserve the same GPU? What about in Kubernetes?
  • Do you need a GPU base image?
  • Why does PyTorch come with CUDA?
  • If you have a host machine with CUDA installed, a base image with another version of CUDA installed, and the PyTorch version of CUDA installed, which does the model use?
  • How do you quickly load data onto and off of the GPU?
  • What is the relationship between latency of loading info onto the GPU and processing time on the GPU? How does this change as batch/message size changes?
  • What is the time to first token and how do you speed that up?
  • How does the model parameter precision level affect loading speed, inference speed, quality of inference, disk space, and space in VRAM? (specifically, using calculated numbers, not in a hand-wavy kind of way)
  • How do you calculate VRAM usage and inference speed based on the properties of the model and GPU?
  • How does inferencing with a small batch compare to a large one? Does the GPU usage spike to 100% or does it only use a fraction of it?
  • How is GPU utilization actually calculated? (HINT: It isn’t real, it’s an estimate based on many different factors)
  • What is the role of streaming multiprocessors?
  • How long does it take to unload a batch from the GPU? Is there a way to speed that up?
  • Why do tried-and-true tools like nvitop aggregate information rather than showing the raw numbers?
  • Why does nvidia-smi lie about the version of CUDA installed?
  • What version of CUDA is your special snowflake of an AI inference engine compatible with?
  • Why can’t multiple child processes run inference with the same model?
  • How does interconnect affect efficiency?
  • How does PCIe bandwidth affect efficiency for multi-GPU inference?
  • Does your CPU even have enough PCIe lanes available?
  • How does the number of CUDA cores affect the speed of inference?
  • Why is a $1500 consumer grade graphics card faster at inferencing a model than a $380,000 enterprise server?
  • How do you prevent memory leaks from happening when the batches are taking too long to process?
  • How do cards with and without tensor cores differ in model speed? And more importantly, why? How do you calculate the difference between GPU types?
  • How do you do continuous batching?
  • What about streaming responses?
  • Is your model type compatible with the engine you want to use? What about the inference mode?
  • How is encoding vs decoding done on the GPU and how can it be optimized in a cluster?
  • What parts of your specific model architecture could be improved by writing a custom CUDA kernel?
  • How does ALL OF THE ABOVE affect training vs inference?

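For reference, the naive baseline mentioned above (load a checkpoint with Hugging Face `transformers`, wrap it in FastAPI, call it a day) looks roughly like this. It is a minimal sketch: the model name, endpoint shape, and generation parameters are placeholders, not recommendations.

```python
# Minimal sketch of the "naive" deployment described above: one model,
# one process, no batching. Model ID, endpoint, and max_new_tokens are
# illustrative placeholders.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; swap in whatever checkpoint you actually serve

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(
    "cuda" if torch.cuda.is_available() else "cpu"
)
model.eval()

app = FastAPI()


class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 64


@app.post("/generate")
def generate(prompt: Prompt):
    # Tokenize on the CPU, copy to the GPU, generate, copy back, decode.
    inputs = tokenizer(prompt.text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
    return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}
```

It works, and for a demo that's the end of the story. Notice everything it doesn't address from the list above: no batching, no request queueing, no redundancy, no routing, and no visibility into whether the GPU behind it is doing useful work or just reporting that it is.
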
This field is so far from standardized. The ground is shifting beneath our feet as we build, and failure to understand your tools will result in a half-baked implementation that wastes tons of money and time. The reason I’m so obsessed with pushing LLMs to their limits is all the waste I’ve seen in the industry due to lack of knowledge and, worse, lack of caring. So many people just don’t care that the AWS bill is ballooning in ways that dwarf all other team spend.

You don’t need to know the answer to all of these questions. I don’t know the answers to all of them myself. You can pretend they don’t exist, but you will be subjected to their limiting effects nonetheless.