The Memory Anatomy of Large Language Models: A Surgeon's Guide

Picture this: you’re about to deploy a shiny new 70-billion-parameter language model, and your colleague asks the dreaded question: “How much GPU memory do we need?” You pause, calculator in hand, realizing that the answer isn’t just “big model needs big memory,” nor is it simply “Mind your own business, Steve.” The truth is far more nuanced and depends entirely on what you’re planning to do with your digital patient.

Just as a surgeon needs to understand whether they’re dealing with a routine checkup or open-heart surgery, we need to distinguish between two fundamentally different scenarios: inference and training. The memory requirements between these two modes are so dramatically different that treating them the same way is like prescribing aspirin for brain surgery.

The Calm of Inference: When Your Model is Just Thinking

During inference, your language model is in its most zen-like state. It’s not learning anything new, just processing inputs and generating outputs. The memory footprint here is surprisingly predictable, dominated by two main components that we can calculate with mathematical precision.

The first and most obvious memory consumer is the model itself. Every parameter in your neural network needs to live somewhere in memory, and the math is beautifully straightforward. Take your parameter count, multiply by the number of bytes per parameter based on your chosen precision, and you have your baseline. A 7-billion parameter model stored in 16-bit precision occupies roughly 14 gigabytes. Double the parameters, double the memory. It’s linear, it’s predictable, and it’s the easy part.

So if N is the number of parameters and B is the number of bytes used to store each parameter, the model size S will be:

$$S = N \cdot B$$
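As a quick sanity check, the formula can be written as a few lines of Python. This is a minimal sketch: the `BYTES_PER_PARAM` table and the function name are illustrative choices, and sizes are reported in decimal gigabytes (1 GB = 1e9 bytes), which is the convention the figures in this article follow.

```python
# Bytes used to store one parameter at a few common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def model_size_gb(n_params: float, precision: str = "bf16") -> float:
    """Weight memory S = N * B, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for n_params in (7e9, 70e9):
    for precision in ("fp32", "bf16", "int8"):
        size = model_size_gb(n_params, precision)
        print(f"{n_params / 1e9:.0f}B parameters @ {precision}: {size:.0f} GB")
```

Under these assumptions, the 70-billion parameter model from the introduction needs 140 GB for its weights alone in 16-bit precision.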

But here’s where things get interesting: the model parameters are just the beginning. As your model processes text, it builds up what we call the KV cache, a kind of working memory that stores the computed attention keys and values for each token it has seen. This cache is what allows the model to avoid recomputing the same values over and over again as it generates each new token in a sequence.[1]

The KV cache grows with every token, and its memory footprint can be calculated precisely. For any transformer model, the KV cache memory requirement follows this formula:

$$\text{KV\_cache} = \text{bs} \cdot \text{seq\_len} \cdot 2 \cdot \text{n\_layers} \cdot d \cdot B$$

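where bs is the batch size, seq_len the number of tokens held in the cache, n_layers the number of transformer layers, d the hidden dimension[2], and B the number of bytes per stored value. The factor of 2 accounts for storing both keys and values.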

For example, consider LLaMA-7B (32 layers, hidden size 4096) processing a single 8K-token sequence in BF16:

$$\text{KV\_cache} = 1 \cdot 8192 \cdot 2 \cdot 32 \cdot 4096 \cdot 2 \approx 4.3\,\text{GB}$$

Notice how quickly this adds up. Double the sequence length and the KV cache doubles as well. Serve 8 concurrent users with 8K contexts each, and you're looking at roughly 34 GB just for KV cache storage, more than twice the 14 GB occupied by the weights themselves. The relationship is linear in both batch size and sequence length, making these the two most critical factors in production memory planning.
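The KV cache formula is easy to turn into a small helper. The sketch below assumes standard multi-head attention, where every layer caches keys and values of the full hidden size; models that share keys and values across heads (grouped-query attention, for instance) cache less than this. The function name and defaults are illustrative.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, n_layers: int,
                   hidden_size: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for standard multi-head attention."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    return (batch_size * seq_len * keys_and_values
            * n_layers * hidden_size * bytes_per_value)


# LLaMA-7B-like shape: 32 layers, hidden size 4096, BF16 cache entries.
one_user = kv_cache_bytes(batch_size=1, seq_len=8192, n_layers=32, hidden_size=4096)
eight_users = kv_cache_bytes(batch_size=8, seq_len=8192, n_layers=32, hidden_size=4096)

print(f"1 x 8K context: {one_user / 1e9:.2f} GB")
print(f"8 x 8K context: {eight_users / 1e9:.2f} GB")
```

Because every factor enters as a plain product, doubling either the batch size or the sequence length doubles the result.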

However, modern serving frameworks like vLLM have evolved sophisticated strategies to manage this memory explosion. Rather than allowing the KV cache to grow indefinitely until your GPU runs out of memory, these frameworks operate on a budget-based approach. You specify a maximum VRAM allocation upfront, and the system implements a best-effort caching strategy within that constraint. When memory pressure builds up, older or less frequently accessed cache entries are evicted, forcing partial recomputation for those sequences. This transforms the memory scaling from a hard constraint into a performance trade-off, where exceeding your memory budget results in slower inference rather than complete failure.
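To get a feel for what a fixed budget buys, here is a rough back-of-the-envelope sketch. It is not how vLLM computes its cache size internally; it only divides whatever memory is left after loading the weights by the per-token KV cost. The 80 GB device, the 0.9 memory fraction (loosely analogous to vLLM's `gpu_memory_utilization` setting) and the function name are all assumptions made for illustration, and activation workspace and framework overhead are ignored.

```python
def kv_token_capacity(gpu_mem_gb: float, mem_fraction: float, weights_gb: float,
                      n_layers: int, hidden_size: int,
                      bytes_per_value: int = 2) -> int:
    """Approximate number of tokens whose keys and values fit in the budget."""
    budget_bytes = gpu_mem_gb * 1e9 * mem_fraction - weights_gb * 1e9
    bytes_per_token = 2 * n_layers * hidden_size * bytes_per_value  # K and V
    return max(0, int(budget_bytes // bytes_per_token))


# Hypothetical 80 GB GPU serving a 7B model in BF16 (14 GB of weights).
capacity = kv_token_capacity(gpu_mem_gb=80, mem_fraction=0.9, weights_gb=14,
                             n_layers=32, hidden_size=4096)
print(f"Tokens that fit in the KV budget: {capacity:,}")
print(f"Concurrent 8K-token sequences:   {capacity // 8192}")
```

Once the combined length of all active sequences exceeds this capacity, something has to give, which is where eviction and recomputation come in.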

This is why inference memory planning requires thinking beyond just model size. You're not just storing a static neural network; you're running a dynamic system where memory grows predictably but can be managed intelligently. Understanding this scaling relationship is essential for production deployment, but modern frameworks give you the tools to operate within fixed memory budgets while maintaining reasonable performance.

The Intensity of Training: When Your Model is Learning

If inference is like a patient resting comfortably, then training is major surgery with multiple specialists working simultaneously. Everything that seemed manageable during inference suddenly multiplies by factors that can seem almost absurd to newcomers.

During training, every single parameter in your model travels with an entourage. The parameter itself is just one member of a group that includes its gradient (direction and magnitude of the update needed for the parameter) and various optimizer state variables. For example, if you’re using the popular Adam optimizer, each parameter is accompanied by momentum and variance estimates. What started as one number per parameter becomes four numbers per parameter, quadrupling your parameter-related memory consumption before you’ve even started thinking about the rest.

Another crucial component to account for is activation memory, which is the intermediate results computed during the forward pass and stored for backpropagation. These activations represent every intermediate computation your model performs as it processes a batch of training data through every layer of the network.

Think of it this way: during inference, your model computes values, uses them immediately and throws them away. During training, it has to keep almost everything it computes because the backpropagation algorithm needs to trace backward through every calculation to update the parameters correctly. This creates a memory accumulation effect that grows with batch size, sequence length, model size and depth.

The mathematical relationship governing activation memory reveals why training becomes so memory-intensive. For transformer models, activation memory approximately follows:

$$\text{Activation\_Memory} \approx \text{bs} \cdot \text{seq\_len} \cdot d \cdot \text{n\_layers} \cdot B \cdot \text{multiplier}$$

The multiplier accounts for all the intermediate tensors: attention matrices, feed-forward activations, layer normalization buffers, and residual connections. Most critically, standard attention materializes score matrices of size batch_size × num_heads × sequence_length², so that part of the memory scales quadratically with sequence length.

Let’s examine this with concrete numbers. Training LLaMA-7B with a batch size of 8 and sequence length of 2048 using BF16 precision:

  • Parameters: 7B × 2 bytes = 14 GB
  • Gradients: 7B × 4 bytes[3] = 28 GB
  • Optimizer states: 7B × 8 bytes[4] = 56 GB
  • Activations[5]: 8 × 2048 × 4096 × 32 × 2 bytes × 16 (multiplier) ≈ 69 GB

That adds up to roughly 165 GB for what’s considered a “small” model.
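The breakdown above can be reproduced with a short script, which makes it easy to try other batch sizes or precisions. Treat it as a rough estimator rather than a measurement: the function name is illustrative, and `act_multiplier` is an assumed empirical factor. A value of 16 is what lands near the activation figure of just under 70 GB used in this article; real frameworks will differ.

```python
def training_memory_gb(n_params: float, batch_size: int, seq_len: int,
                       hidden_size: int, n_layers: int,
                       param_bytes: int = 2,   # BF16 weights
                       grad_bytes: int = 4,    # FP32 gradients
                       optim_bytes: int = 8,   # Adam: two FP32 states per parameter
                       act_bytes: int = 2,     # BF16 activations
                       act_multiplier: int = 16) -> dict:
    """Rough per-component training memory in decimal GB."""
    breakdown = {
        "parameters": n_params * param_bytes / 1e9,
        "gradients": n_params * grad_bytes / 1e9,
        "optimizer_states": n_params * optim_bytes / 1e9,
        "activations": (batch_size * seq_len * hidden_size * n_layers
                        * act_bytes * act_multiplier) / 1e9,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = training_memory_gb(n_params=7e9, batch_size=8, seq_len=2048,
                              hidden_size=4096, n_layers=32)
for component, gigabytes in estimate.items():
    print(f"{component:>16}: {gigabytes:6.1f} GB")
```

Only the activation term depends on batch size and sequence length; the other three are fixed by the parameter count and the chosen precisions.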

Notice that the formula above is linear in sequence length, so it does not capture the attention term. With standard attention, moving from 2K to 4K tokens doubles most activations but quadruples the attention score matrices[6], so activation memory grows faster than the sequence length does. This is why training with long contexts requires disproportionately more resources.
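To see the quadratic term in isolation, the sketch below sizes only the attention score matrices for a single layer, assuming they are fully materialized in BF16. The head count of 32 matches LLaMA-7B but does not appear elsewhere in this article, so treat it as an assumption; the function name is illustrative too.

```python
def attention_scores_bytes(batch_size: int, n_heads: int, seq_len: int,
                           bytes_per_value: int = 2) -> int:
    """Bytes for one layer's attention scores: batch x heads x seq_len x seq_len."""
    return batch_size * n_heads * seq_len ** 2 * bytes_per_value


at_2k = attention_scores_bytes(batch_size=8, n_heads=32, seq_len=2048)
at_4k = attention_scores_bytes(batch_size=8, n_heads=32, seq_len=4096)

print(f"2K tokens: {at_2k / 1e9:.2f} GB per layer")
print(f"4K tokens: {at_4k / 1e9:.2f} GB per layer")
print(f"Growth factor: {at_4k / at_2k:.0f}x")
```

Attention kernels that never materialize the full score matrix avoid this term entirely, which is part of why the estimate in this section should be read as an upper bound.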

When One GPU Isn't Enough: Memory Distribution Rules of Thumb

Eventually, every machine learning practitioner encounters models that simply won’t fit on a single device. The key insight is understanding which parallelism techniques reduce which memory components, allowing you to target your specific bottleneck[7].

Data Parallelism doesn't provide any per-device memory savings, since it simply replicates the entire model across GPUs. Each device still needs the full 165 GB for our LLaMA-7B example. What you gain is a larger effective batch size across the cluster, which can improve training throughput.

Pipeline Parallelism reduces memory by distributing layers across devices. For a 32-layer model split across 4 GPUs, each device handles 8 layers, reducing parameter-related memory by roughly 75% in the ideal case[8]. The 98 GB of parameter-related memory (14 GB of parameters, 28 GB of gradients, and 56 GB of optimizer states) becomes ~25 GB per device. However, activation memory reduction is less predictable, as pipeline buffers and micro-batching can create additional overhead.

Tensor Parallelism shards individual layer computations across devices. For our 7B parameter model split across 4 GPUs, parameter-related memory drops to ~25 GB per device (98 GB / 4). Activation memory also scales down roughly proportionally, making this highly effective for both parameter and activation bottlenecks. The trade-off is increased communication overhead between devices, which can become significant, especially in a multi-node environment[9].
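The three strategies described so far can be compared side by side with a small idealized calculator. This is a sketch under simplifying assumptions: it uses this article's round figures (98 GB of parameter-related state, about 67 GB of activations), ignores communication buffers, and conservatively leaves activations undivided for pipeline parallelism because their reduction is hard to predict. The function and constant names are illustrative.

```python
PARAM_STATE_GB = 98.0  # parameters + gradients + Adam states (14 + 28 + 56)
ACTIVATION_GB = 67.0   # approximate activation estimate used in this article


def per_device_gb(strategy: str, n_devices: int) -> dict:
    """Idealized per-device memory for one parallelism strategy."""
    if strategy == "data":
        # Full replica on every device.
        states, activations = PARAM_STATE_GB, ACTIVATION_GB
    elif strategy == "pipeline":
        # Layers are split; activations left undivided as a conservative guess.
        states, activations = PARAM_STATE_GB / n_devices, ACTIVATION_GB
    elif strategy == "tensor":
        # Both states and activations are sharded within each layer.
        states, activations = PARAM_STATE_GB / n_devices, ACTIVATION_GB / n_devices
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {"states": states, "activations": activations,
            "total": states + activations}


for strategy in ("data", "pipeline", "tensor"):
    result = per_device_gb(strategy, n_devices=4)
    print(f"{strategy:>8}: {result['total']:.1f} GB per device")
```

The numbers are best-case values; real deployments should expect extra buffers on top.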

Fully Sharded Data Parallel (FSDP) and its successor FSDP2 represent the most sophisticated approaches to memory distribution. FSDP partitions parameters, gradients, and optimizer states across all available devices, then dynamically gathers the necessary shards for computation. For our LLaMA-7B example across 8 GPUs:

  • Parameter memory: 14 GB / 8 = 1.75 GB per device
  • Gradient memory: 28 GB / 8 = 3.5 GB per device
  • Optimizer state memory: 56 GB / 8 = 7 GB per device
  • Total parameter-related memory: ~12 GB per device (down from 98 GB)
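The per-device figures above follow from dividing each component by the number of devices, as in this minimal sketch. It covers only the persistent sharded state; the full parameters that FSDP temporarily gathers for the layer currently being computed come on top and are not modeled here. The function name and defaults are illustrative.

```python
def fsdp_per_device_gb(n_devices: int, params_gb: float = 14.0,
                       grads_gb: float = 28.0, optim_gb: float = 56.0) -> dict:
    """Persistent per-device memory when all three components are fully sharded."""
    shard = {
        "parameters": params_gb / n_devices,
        "gradients": grads_gb / n_devices,
        "optimizer_states": optim_gb / n_devices,
    }
    shard["total"] = sum(shard.values())
    return shard


for n_devices in (1, 8, 64):
    total = fsdp_per_device_gb(n_devices)["total"]
    print(f"{n_devices:>2} devices: {total:.2f} GB of parameter-related state each")
```

Since the sharded state shrinks in proportion to the device count, adding GPUs keeps helping until activations or the transient gathered layers become the dominant cost.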

FSDP2 improves upon the original with better memory efficiency and reduced communication overhead, particularly for models with heterogeneous layer sizes. It also provides more fine-grained control over which parameters to shard and when to perform the gather operations.

The activation memory bottleneck requires different strategies. Activation checkpointing trades computation for memory by recomputing activations during the backward pass rather than storing them. This can reduce activation memory by 75-90% at the cost of roughly 25% more compute time. Sequence parallelism specifically targets the memory consumed by layer normalization and dropout operations, which traditionally can't be tensor-parallelized but can be partitioned along the sequence dimension.
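Applied to the running example, the trade-off looks like the sketch below. The 75-90% savings and the 25% compute overhead are the rough figures quoted above, not measurements, and the 67 GB starting point is this article's approximate activation estimate. The 10-second baseline step time is purely hypothetical.

```python
def checkpointed_activation_gb(activation_gb: float, reduction: float) -> float:
    """Activation memory left after checkpointing saves `reduction` of it."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return activation_gb * (1.0 - reduction)


ACTIVATION_GB = 67.0      # approximate estimate for LLaMA-7B, batch 8, 2K tokens
BASELINE_STEP_S = 10.0    # hypothetical step time without checkpointing
COMPUTE_OVERHEAD = 0.25   # rough recomputation cost quoted in the text

for reduction in (0.75, 0.90):
    remaining = checkpointed_activation_gb(ACTIVATION_GB, reduction)
    print(f"{reduction:.0%} savings: {remaining:.1f} GB of activations remain")

step_time = BASELINE_STEP_S * (1.0 + COMPUTE_OVERHEAD)
print(f"Step time: {BASELINE_STEP_S:.1f} s -> {step_time:.1f} s")
```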

Rule of Thumb for Memory Planning:

  • Use FSDP/FSDP2 when parameter memory (98 GB in our example) is your bottleneck
  • Combine with activation checkpointing when activation memory (67 GB) becomes limiting
  • Add tensor parallelism when both parameter and activation memory need aggressive reduction
  • Pipeline parallelism works well for very large models where other techniques hit communication limits

The most effective approach often combines multiple techniques. A common pattern for large-scale training uses FSDP2 for parameter sharding, activation checkpointing for memory efficiency, and sequence parallelism for the remaining bottlenecks.

The Special Case of Reinforcement Learning

Reinforcement learning introduces its own memory considerations that go beyond standard supervised training. These techniques typically require multiple model copies serving different roles in the learning process, creating memory multiplication effects that can catch practitioners off guard.

Proximal Policy Optimization (PPO) represents the most memory-intensive RL approach. PPO maintains multiple model copies: the actor model being trained (full training memory), a frozen reference copy for KL divergence (inference memory only), a critic model that's also being trained, and often a separate reward model. For our LLaMA-7B example:

  • Actor model (training): 165 GB
  • Reference model (inference): 14 GB
  • Critic model (training): ~165 GB (similar size to actor)
  • Reward model (inference): ~14 GB
  • Total: ~360 GB

The memory pressure is further intensified by rollout buffers containing generated sequences, advantage calculations, and policy probabilities, which can add another 20-30 GB depending on batch size and sequence length.

Direct Preference Optimization (DPO) offers a more memory-efficient alternative by eliminating the need for separate critic and reward models. DPO requires only the training model and a frozen reference copy:

  • Training model: 165 GB
  • Reference model (inference): 14 GB
  • Total: ~180 GB

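Group Relative Policy Optimization (GRPO) removes the critic model by estimating advantages from group-relative comparisons among several completions sampled for the same prompt. As originally proposed, GRPO still applies a KL penalty against a frozen reference model; some setups drop that term, and in that configuration no reference copy is needed, which makes it the most memory-efficient option here:

  • Training model: 165 GB
  • Total: ~165 GB (essentially the same as supervised fine-tuning; add ~14 GB if the reference model is kept)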

Constitutional AI and Self-Training approaches vary widely depending on the specific implementation. A typical setup might include:

  • Generator model (training): 165 GB
  • Multiple critic models (inference): 14 GB each
  • Revision model (training): ~165 GB
  • Total: ~350-400 GB depending on the number of critics

Practical RL Memory Planning:

  • PPO: Budget ~360 GB (2.2× base training memory)
  • DPO: Budget ~180 GB (1.1× base training memory)
  • GRPO: Budget ~165 GB (same as base training memory)
  • Constitutional AI: Budget ~350-400 GB (2-2.5× base training memory)
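These budgets come from counting how many model copies are trained and how many are only used for inference. The sketch below redoes that count with the article's round figures of 165 GB per trained copy and 14 GB per frozen copy. Rollout buffers and other overheads are left out, and the function and dictionary names are illustrative.

```python
TRAIN_GB = 165.0  # one model copy with full training state
INFER_GB = 14.0   # one frozen copy used for inference only


def rl_budget_gb(n_trained: int, n_frozen: int) -> float:
    """Memory for a setup with the given number of trained and frozen copies."""
    return n_trained * TRAIN_GB + n_frozen * INFER_GB


# (trained copies, frozen copies) as budgeted in this section.
SETUPS = {
    "PPO": (2, 2),   # actor + critic trained; reference + reward model frozen
    "DPO": (1, 1),   # policy trained; reference frozen
    "GRPO": (1, 0),  # policy trained; no frozen copies counted
}

for name, (n_trained, n_frozen) in SETUPS.items():
    budget = rl_budget_gb(n_trained, n_frozen)
    print(f"{name:>4}: {budget:.0f} GB ({budget / TRAIN_GB:.1f}x base training memory)")
```

Changing the two constants is enough to redo the comparison for a different model size or a cheaper training configuration.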

The key insight for RL practitioners is that memory optimization often determines which techniques are feasible for a given model size. GRPO's memory efficiency makes it practical for larger models where PPO becomes prohibitive, while techniques like constitutional AI may require dropping to smaller base models or aggressive parallelization strategies.

Planning for the Unexpected

The most important lesson in LLM memory management is that the theoretical minimum is never the practical requirement. Memory usage during training can spike unpredictably due to gradient accumulation, framework overhead, temporary allocations, and the countless small inefficiencies that accumulate in complex software systems.

Successful deployment requires thinking like a surgeon planning a complex operation: you prepare for complications, keep emergency procedures ready, and never assume everything will go according to the ideal plan. Gradient checkpointing (the activation checkpointing described earlier) can trade computation for memory when you're running close to limits. Mixed precision training can roughly halve the memory taken by activations and any other tensors kept in 16-bit, usually with minimal impact on model quality. Quantization techniques can dramatically reduce inference memory footprint for deployment.

The landscape of LLM memory optimization continues to evolve rapidly. New techniques for efficient attention computation, better gradient checkpointing strategies, and more sophisticated parallelism approaches appear regularly. But the fundamental principles remain constant: understand your memory anatomy, plan for the worst case, and always keep your optimization tools sharp and ready.

Memory management in the world of large language models isn't just about fitting models into available hardware. It's about understanding the intricate relationship between model architecture, training dynamics, and hardware constraints. Master this understanding, and you'll find yourself capable of pushing the boundaries of what's possible with the resources at your disposal.

Notes

  1. KV cache has been on my writing backlog for months now. Would an in-depth article on this topic be helpful for you?
  2. I use d for hidden size here instead of my usual h to avoid confusion with the number of heads.
  3. We assume that gradients are stored in FP32, even though modern setups can store them in BF16/FP16.
  4. I am assuming Adam with its two state tensors (momentum and variance) stored in FP32, but there are also optimizers that reduce this memory requirement by a significant margin.
  5. This is an overestimate: it assumes that all activations are materialized at full precision simultaneously. In practice, frameworks adopt several optimization techniques to reduce this memory requirement.
  6. This holds true for standard Transformer attention, and it is the main reason why some models nowadays don’t use it.
  7. Discussing parallelization methodologies in detail is out of the scope of this article. Let me know if you’d like a dedicated blogpost about that!
  8. This is an ideal value. In practice, there are extra buffers involved in this technique, so a more realistic estimate would be around 50-60% per-device memory savings.
  9. If you want to know why this is true, ask me to write about GPU comms!