Hands-on LoRA implementation from scratch: building, training, merging, and inspecting low-rank adapters with PyTorch.
In my last post, I walked through SFT, DPO, and RLHF for fine-tuning LLMs. Throughout that entire post, LoRA kept showing up in every code example, every training config, every LoraConfig(r=16, lora_alpha=32) call. I used it the way most of us do: copy the config from a tutorial, set r=16 because that's what everyone uses, set lora_alpha to double the rank because... reasons, and move on. The model trains, the loss goes down, the outputs improve. Ship it.
But a few days ago I got into a discussion with a colleague about fine-tuning efficiency: how much memory we were actually saving with LoRA, whether we could push the rank lower without hurting quality, whether it even mattered which layers we targeted. I had opinions on all of this, but when I tried to back them up with anything beyond "it worked last time," I realized I was hand-waving. I knew what LoRA did at a high level (low-rank matrices, fewer parameters, memory efficient), but I couldn't actually explain why those specific numbers mattered. What does rank even mean in this context? Why does lora_alpha scale the way it does? What's actually happening to the weight matrices during training? I'd been treating LoRA like a black box with good defaults, and that bothered me.
So I blocked out a weekend, pulled up the original paper, and went through the math line by line. What follows is what I wish someone had explained to me before I started using LoRA in production.
Let's start with why LoRA exists. A model like Llama 3.1 8B has roughly 8 billion parameters. Full fine-tuning means updating all of them: every weight in every layer gets a gradient plus Adam's two moment buffers. That's 3x the model size in training state beyond the weights themselves. On a Llama 8B in float32, that's roughly 8 billion × 4 bytes × 3 ≈ 96 GB.
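A quick back-of-envelope check (float32 everywhere, Adam's two moment buffers plus one gradient per weight):

```python
params = 8_000_000_000       # Llama 3.1 8B, roughly
bytes_fp32 = 4

gradients = params * bytes_fp32         # one gradient per weight: 32 GB
adam_moments = 2 * params * bytes_fp32  # Adam's m and v buffers: 64 GB

overhead_gb = (gradients + adam_moments) / 1e9
print(f"~{overhead_gb:.0f} GB of training state")  # ~96 GB
```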
Add the 32 GB of model weights themselves, plus activations, and you're looking at needing multiple A100 80GB GPUs just for fine-tuning. For most teams, that's impractical.
LoRA's insight: when you fine-tune a large model on a specific task, the weight updates don't use the full dimensionality of the weight matrices. The change in weights during fine-tuning is low-rank. It lies in a much smaller subspace than the original weights. So instead of updating a giant matrix, you can decompose the update into two small matrices and only train those.
Here's the key equation. For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains the update $\Delta W$ to be a low-rank decomposition:

$$W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
That's it. That's the whole trick. Instead of learning a matrix of updates (potentially millions of parameters), you learn two smaller matrices whose product has the same shape but far fewer total parameters.
Let me make this concrete. Say you have a weight matrix in a transformer attention layer with $d = 4096$ and $k = 4096$:
With $r = 8$, you're training 65,536 parameters instead of 16.7 million — a 256x reduction for this single layer. Across the entire model, LoRA typically trains 0.1-1% of the total parameters.
During a forward pass, the original weight and the LoRA update combine like this. For an input $x$:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$
Here's what that looks like step by step: the input $x$ flows through the frozen pretrained weights ($W_0 x$); in parallel, $A$ projects $x$ down to $r$ dimensions and $B$ projects it back up to the output dimension; the low-rank path's output is then added to the frozen path's output.
The pretrained weights stay completely frozen: no gradients, no optimizer states, no memory overhead. Only $B$ and $A$ receive gradients. This is why LoRA is so memory-efficient: you only store optimizer states for the tiny adapter matrices, not the full model.
Let's implement this from scratch in PyTorch so you can see exactly what's happening:
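Here's a minimal sketch of the wrapper, starting with just the frozen base layer and the scaling factor. (The `LoRALinear` name and structure are mine, not the peft implementation.)

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        # Freeze the pretrained layer: no gradients, no optimizer state
        for p in self.base.parameters():
            p.requires_grad = False
        self.r = r
        self.scaling = alpha / r  # more on this factor shortly
```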
A few things to notice here. The original layer is frozen (requires_grad = False). And there's a scaling factor that we'll come back to shortly. Now the adapter matrices:
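Sketched standalone for a 4096×4096 layer, following the $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ shapes:

```python
import math
import torch
import torch.nn as nn

d, k, r = 4096, 4096, 8

# A (r x k): small random values via Kaiming uniform, to break symmetry
lora_A = nn.Parameter(torch.empty(r, k))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))

# B (d x r): all zeros, so the update BA starts out as exactly zero
lora_B = nn.Parameter(torch.zeros(d, r))
```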
This initialization is critical. $B$ starts at zero, which means $\Delta W = BA = 0$ at the beginning of training. The model starts producing exactly the same outputs as the pretrained model. Training then gradually learns the update. $A$ uses Kaiming uniform initialization to break symmetry.
The forward pass puts it all together:
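Here's the complete minimal version (again, my own from-scratch sketch rather than the peft implementation):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A: r x in_features (random init), B: out_features x r (zeros)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B(Ax)
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return self.base(x) + update * self.scaling
```

Because $B$ starts at zero, a freshly wrapped layer produces exactly the same outputs as the frozen base layer.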
Two separate matrix multiplications through the bottleneck: $A$ compresses the input down to rank $r$, then $B$ projects it back up, followed by the scaling factor. Let's see the parameter savings in action:
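The arithmetic for one 4096×4096 layer:

```python
d, k, r = 4096, 4096, 8

full_params = d * k          # what full fine-tuning would train
lora_params = d * r + r * k  # B plus A

print(f"full: {full_params:,}")                      # 16,777,216
print(f"lora: {lora_params:,}")                      # 65,536
print(f"reduction: {full_params // lora_params}x")   # 256x
```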
The rank $r$ is LoRA's most important hyperparameter, and it's worth building intuition about what it controls.
In linear algebra, the rank of a matrix is the number of linearly independent rows (or equivalently, columns). A rank-$r$ matrix can be expressed as the sum of $r$ rank-1 outer products. Think of it as the number of "independent directions" the matrix can push information through.
When we constrain the update to $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, the product $BA$ has rank at most $r$. This means the weight update can only modify the model's behavior along $r$ independent directions in the weight space.
The original LoRA paper found something surprising: even $r = 1$ or $r = 2$ works reasonably well for many tasks. The weight updates during fine-tuning really are low-rank. Here's an intuition for why: when you fine-tune on a specific task (like marketing copy), you're not rewiring the model's entire understanding of language. You're making a targeted adjustment: "write in this style" or "prefer these patterns." That adjustment occupies a small subspace of what the model's weights can represent.
Here's a practical way to see this. Let's create a weight update, compute its singular values, and see how the energy concentrates:
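A scaled-down sketch (1024×1024 instead of 4096×4096, for speed — the random-matrix percentages come out higher at this size, but the contrast is the point). The "fine-tuning-like" update here is simulated as a strong rank-4 component plus noise, which is an assumption for illustration:

```python
import torch

torch.manual_seed(0)
d = 1024

# A purely random "update": energy spread across all directions
random_update = torch.randn(d, d)

# A simulated fine-tuning update: strong rank-4 structure plus small noise
true_rank = 4
structured_update = (torch.randn(d, true_rank) @ torch.randn(true_rank, d)
                     + 0.05 * torch.randn(d, d))

def energy_captured(W: torch.Tensor, r: int) -> float:
    """Fraction of squared Frobenius norm in the top-r singular values."""
    s = torch.linalg.svdvals(W)
    return (s[:r] ** 2).sum().item() / (s ** 2).sum().item()

for r in (4, 8, 64):
    print(f"r={r:3d}  random: {energy_captured(random_update, r):.1%}  "
          f"structured: {energy_captured(structured_update, r):.1%}")
```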
A random matrix spreads its energy uniformly across all singular values, which is why even $r = 64$ only captures ~2%. But real fine-tuning updates aren't random. They concentrate on a few directions that matter for the task. In practice, $r = 8$ or $r = 16$ captures the meaningful signal while ignoring noise.
The sweet spot for most tasks is $r = 8$ to $r = 16$. Going higher adds parameters without proportional improvement. Going lower risks underfitting complex tasks.
If you've ever stared at lora_alpha=32 in a config and wondered what it does, here's the answer. The LoRA forward pass applies a scaling factor of $\frac{\alpha}{r}$ to the update:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

where $\alpha$ is lora_alpha and $r$ is the rank. This scaling serves a critical purpose: it decouples the learning rate from the rank.
Without this scaling, changing the rank would change the magnitude of the LoRA update. If you double $r$, you'd roughly double the norm of $BA$ (more parameters contributing to the output), and you'd need to halve the learning rate to compensate. The factor $\alpha / r$ normalizes this away.
Here's the practical implication. When lora_alpha = 2 * r (the common convention), the scaling factor is $\alpha / r = 2$: the LoRA update gets amplified by 2x, regardless of which rank you chose.
You can think of lora_alpha as a "volume knob" for the LoRA update. Higher alpha amplifies the adapter's effect. The convention of alpha = 2 * r works well in practice, but you can tune it, especially if you notice training instability (lower alpha) or the model not learning fast enough (higher alpha).
Let's see this in action:
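A small sketch with fixed adapter matrices ($B$ is random here, unlike at init, so the update is visible):

```python
import torch

torch.manual_seed(0)
d, r = 256, 8
x = torch.randn(1, d)

A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01

# Same B and A each time; only alpha changes
for alpha in (8, 16, 32):
    scaling = alpha / r
    update = (x @ A.T) @ B.T * scaling
    print(f"alpha={alpha:2d}  update norm: {update.norm():.4f}")
```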
Linear relationship: double the alpha, double the output magnitude. The learning rate and scaling factor interact, which is why the convention of fixing alpha = 2r and tuning only the learning rate is the pragmatic approach.
In a transformer, LoRA is typically applied to the attention projection matrices. Looking at a standard multi-head attention block:
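A stripped-down sketch of where those projections live (a real Llama-style block adds multi-head reshapes, RoPE, and grouped-query attention; this just names the four linear layers):

```python
import torch.nn as nn

class LlamaStyleAttention(nn.Module):
    """Simplified: only the four projections LoRA typically targets."""
    def __init__(self, d_model: int = 4096):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)  # queries
        self.k_proj = nn.Linear(d_model, d_model, bias=False)  # keys
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # values
        self.o_proj = nn.Linear(d_model, d_model, bias=False)  # output
```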
The original paper applied LoRA only to $W_q$ and $W_v$, but modern practice targets all four attention projections. Some people also include the MLP layers (gate_proj, up_proj, down_proj), though the marginal benefit varies.
Here's the config you'll see in most production setups:
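Something like this, using peft's LoraConfig (exact fields and defaults can vary between peft versions):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,  # the 2x-rank convention
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```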
And if you want to be more aggressive:
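The same config with the MLP projections added:

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```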
Let's count the parameter difference across a full model:
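A rough count with approximate Llama 3.1 8B shapes (the layer and projection dimensions below are assumptions for illustration, including the smaller k/v width from grouped-query attention):

```python
# Approximate Llama 3.1 8B shapes (assumed for illustration)
n_layers, d_model, d_ff = 32, 4096, 14336
d_kv = 1024  # k/v projections are narrower due to grouped-query attention
r = 16
total_params = 8_000_000_000  # rough

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r

attn = (lora_params(d_model, d_model, r)     # q_proj
        + lora_params(d_model, d_kv, r)      # k_proj
        + lora_params(d_model, d_kv, r)      # v_proj
        + lora_params(d_model, d_model, r))  # o_proj
mlp = (lora_params(d_model, d_ff, r) * 2     # gate_proj, up_proj
       + lora_params(d_ff, d_model, r))      # down_proj

attn_only = n_layers * attn
all_proj = n_layers * (attn + mlp)
print(f"attention only:  {attn_only:,} ({attn_only / total_params:.2%})")
print(f"all projections: {all_proj:,} ({all_proj / total_params:.2%})")
```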
Even the aggressive "all projections" approach trains less than 1% of the model. That's LoRA's superpower.
Let's put all the pieces together with a real training example. We'll fine-tune a small model so you can actually run this, and inspect the LoRA matrices at each stage.
First, let's create a minimal dataset and load a model with LoRA:
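I can't reproduce the full model-plus-dataset run inline, but here's a self-contained, scaled-down stand-in built on the from-scratch wrapper idea. The trainable fraction comes out larger than the 0.34% you'd see on a real 7B-8B model, because the toy layer is tiny:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

class LoRALinear(nn.Module):
    """From-scratch LoRA wrapper (illustrative names, not the peft API)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

model = LoRALinear(nn.Linear(512, 512), r=8)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")
print("A is random:", bool(model.lora_A.abs().sum() > 0))
print("B is zero:  ", bool((model.lora_B == 0).all()))
```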
Only 0.34% of parameters are trainable. Let's inspect what the LoRA matrices look like before training:
Exactly as expected. $A$ is initialized with random values, $B$ is all zeros, so $BA = 0$. The model starts as if no adapter exists.
Now let's train it on a few examples and see how the matrices change:
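A runnable toy version of that loop, fitting a single wrapped layer to a random target. The objective is invented purely to make the matrices move; on a real model you'd be minimizing language-modeling loss instead:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r, scaling = 64, 8, 2.0

base = nn.Linear(d, d)
for p in base.parameters():
    p.requires_grad = False

lora_A = nn.Parameter(torch.empty(r, d))
nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))
lora_B = nn.Parameter(torch.zeros(d, r))

def forward(x):
    # frozen path + scaled low-rank path
    return base(x) + (x @ lora_A.T) @ lora_B.T * scaling

# Invented toy objective: pull the layer's outputs toward a random target
x = torch.randn(32, d)
target = torch.randn(32, d)
opt = torch.optim.AdamW([lora_A, lora_B], lr=1e-2)

print("|B| before:", lora_B.norm().item())  # 0.0
for _ in range(100):
    loss = ((forward(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("|B| after: ", round(lora_B.norm().item(), 4))  # nonzero
```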
After training, let's check the matrices again:
$B$ is no longer zero. Training has learned a low-rank update. The model's behavior has shifted, but only along 8 independent directions in the weight space.
This is where things get practically interesting. You've trained your LoRA adapter. Now what? You have two options: keep the adapter separate, or merge it into the base model. The choice has real implications for serving.
Merging is just matrix addition. You take the pretrained weight $W_0$ and permanently add the LoRA update:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$$
After merging, the model is a regular model again: no adapter, no separate matrices, no extra computation at inference time.
Here's how you do it in code:
Let's verify the merge is mathematically correct:
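A self-contained check that the two-path adapter forward and the single-matmul merged forward agree:

```python
import torch

torch.manual_seed(0)
d, r, scaling = 64, 8, 2.0
W0 = torch.randn(d, d)
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01
x = torch.randn(5, d)

# Adapter-style forward: two paths
h_adapter = x @ W0.T + (x @ A.T) @ B.T * scaling
# Merged forward: a single matmul
W_merged = W0 + scaling * (B @ A)
h_merged = x @ W_merged.T

print(torch.allclose(h_adapter, h_merged, atol=1e-5))  # True
```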
Just matrix addition with scaling. Nothing mysterious.
If you skip the merge, the LoRA adapter stays separate from the base model. This isn't just an academic distinction; it affects both performance and flexibility.
Inference overhead. Without merging, every forward pass computes two paths: the base model path and the LoRA path. For a single request, the overhead is small. But at scale, those extra matrix multiplications add up:
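A rough CPU micro-benchmark of the two forward paths. Absolute numbers are meaningless across hardware, and real serving involves batching and kernel fusion; the point is just that the unmerged path does strictly more work:

```python
import time
import torch

torch.manual_seed(0)
d, r, batch = 1024, 16, 64
W0 = torch.randn(d, d)
A, B = torch.randn(r, d), torch.randn(d, r)
x = torch.randn(batch, d)
W_merged = W0 + 2.0 * (B @ A)

def bench(fn, iters=50):
    fn()  # warmup
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

t_merged = bench(lambda: x @ W_merged.T)
t_unmerged = bench(lambda: x @ W0.T + (x @ A.T) @ B.T * 2.0)
print(f"merged:   {t_merged * 1e6:.1f} us/iter")
print(f"unmerged: {t_unmerged * 1e6:.1f} us/iter")
```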
The exact overhead depends on hardware, but expect 5-15% extra latency on the forward pass. Not catastrophic, but not free.
Multi-adapter serving. Here's the flip side: not merging is actually a feature when you need to serve multiple adapters. If you have one base model and 50 brand-specific LoRA adapters (like the marketing scenario from the previous post), you can keep a single copy of the base model in GPU memory and hot-swap the tiny adapters per request.
Each adapter is a few megabytes. The base model is tens of gigabytes. Without merging, you store one base model + N tiny adapters instead of N full model copies. That's the difference between needing 1 GPU and needing 50.
Let's trace exactly what merge_and_unload does under the hood. It's simple but worth understanding:
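A simplified sketch of the idea. The real peft code also handles dtypes, quantization, and multiple adapters, and the function name here is mine:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora_module(base: nn.Linear, lora_A, lora_B, scaling) -> nn.Linear:
    """Fold a LoRA update into a plain nn.Linear and return it."""
    merged = nn.Linear(base.in_features, base.out_features,
                       bias=base.bias is not None)
    # W_merged = W0 + scaling * B @ A  -- plain matrix addition
    merged.weight.copy_(base.weight + scaling * (lora_B @ lora_A))
    if base.bias is not None:
        merged.bias.copy_(base.bias)
    return merged
```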
The merged weight is a regular matrix. No special structure, no adapter overhead. But you lose the ability to "un-merge"; the adapter's contribution is baked into the weights permanently.
A few things I've learned the hard way that the paper doesn't tell you:
Start with r=8 and alpha=16. This is a good default for 7B-13B parameter models on most tasks. Only increase rank if you see clear signs of underfitting (training loss not decreasing fast enough despite reasonable learning rate).
Learning rate matters more than rank. The learning rate for LoRA should typically be 5-10x higher than what you'd use for full fine-tuning. This is because you're only updating a small subset of parameters, so they need to move more per step to have the same overall effect. Start with 2e-4 and adjust from there.
Dropout is your friend for small datasets. lora_dropout=0.05 is a common starting point, but if you're training on fewer than 1000 examples, bump it to 0.1. The low-rank bottleneck is already a form of regularization, but it's not always enough.
Save adapters, not merged models, at least during development. A LoRA adapter for a 7B model is ~10-50 MB. A merged model is ~14 GB. When you're running dozens of experiments, that storage difference matters.
Double-check your target modules. Different model families have different linear layer names. Llama uses q_proj, k_proj, v_proj, o_proj. Other models might use query, key, value, or qkv_proj. Check with:
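For example (with a toy stand-in here; on a real checkpoint you'd iterate model.named_modules() the same way):

```python
import torch.nn as nn

# Toy stand-in for a loaded model
model = nn.ModuleDict({
    "q_proj": nn.Linear(64, 64),
    "k_proj": nn.Linear(64, 64),
    "v_proj": nn.Linear(64, 64),
    "o_proj": nn.Linear(64, 64),
})

# Print every linear layer's name to find valid target_modules entries
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
```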
LoRA's elegance is in how simple it actually is once you see the math. Freeze the pretrained weights, learn a low-rank update decomposed into two small matrices, and add it to the forward pass with a scaling factor. That's the whole algorithm. The rest is engineering: choosing which layers to target, setting the rank and scaling, deciding whether to merge for serving or keep adapters separate for flexibility.
The next time you write LoraConfig(r=16, lora_alpha=32), you'll know exactly what those numbers mean and why they matter. And when someone on your team asks "can we make r bigger?" you'll be able to explain what it actually changes in the weight space, not just whether to do it.
Originally published on AI Terminal.
Tags: lora, peft, merging, low-rank, fine-tuning