LoRA Explained: Fine-Tuning LLMs Without the Price Tag

2026.07.02 · LoRA · QLoRA · fine-tuning · LLM · edge AI

Full fine-tuning of a large language model used to mean renting a cluster of enterprise GPUs and updating every single weight in the network. Low-Rank Adaptation (LoRA) changed the economics: you freeze the model and train a small set of added adapter weights, often on the order of 0.1% of the model's parameter count. The figures commonly quoted are 90-95% of full fine-tune quality, reached on a single consumer GPU in an afternoon, though the exact numbers depend on the model, the task, and the rank you choose.

This post explains what LoRA actually does, how it works under the hood, the 2026 variants worth knowing, and — just as importantly — when not to reach for it.

Why LoRA matters

The headline numbers are what make LoRA hard to ignore:

  • Roughly 0.1% as many trainable parameters as the full model: the base weights stay frozen and only the small adapter matrices are updated.
  • 90-95% of full fine-tune quality: for many tasks the gap is hard to notice.
  • ~75% less VRAM for the base weights with QLoRA (4-bit instead of 16-bit): fine-tuning fits on hardware you already own.
  • One afternoon on a consumer GPU: no cluster, no cloud bill.

The result is that fine-tuning stops being a capital expense and becomes a weekend experiment. That shift is what has pushed so much specialised, small-model work out to the edge.

How LoRA works

LoRA rests on a simple hypothesis: the weight change a model needs during fine-tuning has low intrinsic rank. You do not need to move every weight independently; you only need to nudge each weight matrix along a small number of directions. Three steps make that practical.
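To put numbers on the low-rank idea before walking through the steps, here is a minimal sketch that counts trainable parameters for a single weight matrix. The 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # full fine-tuning: every entry of W is trainable
    lora = r * (d_out + d_in)    # LoRA: B is d_out x r, A is r x d_in
    return full, lora


# Illustrative shape only: a square 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 16,777,216  lora: 65,536  ratio: 0.39%
```

That ratio is for one adapted matrix. Because adapters are usually attached to only some of a model's matrices, the fraction of the whole model that is trainable comes out smaller still.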

1. Freeze the base model

All existing weights stay locked, so no gradients or optimizer state need to be kept for them. There is no expensive full retraining, and because the base weights never change, removing the adapter gives you back the original model exactly.

2. Inject low-rank matrices

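Two small matrices, conventionally called A and B, are added alongside selected weight matrices, most often the attention projections. Their product BA has the same shape as the original weight matrix but rank at most r, and it forms the low-rank "delta" that absorbs the updates the model would otherwise make across its full weight matrices.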

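from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name; substitute the model you are adapting.
base_model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

config = LoraConfig(
    r=8,                # rank of the A/B matrices
    lora_alpha=32,      # scaling numerator; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture
    lora_dropout=0.05,  # dropout on the adapter path during training
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)  # freezes the base, adds the adapters
model.print_trainable_parameters()          # reports trainable vs total counts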
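What the injected matrices do at run time can be sketched in a few lines of dependency-free Python. This is a toy illustration rather than how PEFT implements it: the helper names `matvec` and `lora_forward` are made up for this example, the shapes are tiny, and the initialisation (small random A, zero B) and the alpha / r scaling follow the scheme described by Hu et al.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x) without ever forming B A."""
    base = matvec(W, x)                 # frozen path
    delta = matvec(B, matvec(A, x))     # adapter path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4      # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [1.0] * d_in

# With B at zero the adapter contributes nothing, so training starts
# from exactly the base model's behaviour.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```

The zero initialisation of B is the design choice worth noticing: the first training step begins from the unmodified model, and the delta grows only as B moves away from zero.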

3. Train only the adapters

Only the new matrices are trained, a tiny fraction of the parameters. For inference you can merge the adapters into the base weights by adding the scaled product BA to each adapted matrix, which means no added latency and no runtime overhead compared to the original model.
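Merging is a single matrix addition, which is why it costs nothing at inference time. The sketch below uses hand-picked toy matrices and hypothetical helper names (`matmul`, `matvec`, `merge_lora`) to check that the merged weight gives the same output as running the base and adapter paths separately.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A, a single matrix with the same shape as W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


# Toy example: a 2 x 3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]        # r x d_in
B = [[0.5], [-1.0]]          # d_out x r
alpha, r = 2, 1
x = [1.0, 1.0, 1.0]

W_merged = merge_lora(W, A, B, alpha, r)
adapter_out = [b + (alpha / r) * d
               for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# One matrix multiply with the merged weight matches base path + adapter path.
assert matvec(W_merged, x) == adapter_out
```

In the PEFT library the corresponding operation is exposed as `merge_and_unload()` on the wrapped model; the arithmetic above is only meant to show what that step amounts to.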

Quality vs size: the QLoRA ladder

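QLoRA keeps the frozen base model in quantised form and trains LoRA adapters on top of it; that is where the VRAM savings come from. As described by Dettmers et al., QLoRA trains against a 4-bit (NF4) base. The Q8, Q4_K_M and Q2 labels below are GGUF quantisation levels from llama.cpp, which usually come into play when you export the tuned model for local inference. Either way, there is a trade-off between how aggressively you quantise and how much quality you keep. Treat the percentages as rough rules of thumb relative to a full fine-tune, not as benchmark results:

  • LoRA (full precision): ~95% quality, same size as the base model.
  • QLoRA (Q8): ~90% quality, close to the unquantised LoRA result.
  • QLoRA (Q4_K_M): ~85% quality. The sweet spot for most consumer setups.
  • QLoRA (Q2): ~70% quality. A last resort when memory is truly scarce.

For most real projects, Q4_K_M is the pragmatic default: a large drop in memory footprint for a small, usually acceptable, drop in quality.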
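A back-of-the-envelope calculation shows where the memory saving from quantising the base model comes from. The 7-billion-parameter size is a hypothetical example, and the function counts only the weights themselves: activations, adapter weights, optimizer state and the small per-block metadata that real quantised formats store are all left out.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter model
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(n_params, bits)
    saving = 1 - bits / 16  # reduction relative to 16-bit weights
    print(f"{label:>6}: {gb:4.1f} GB  ({saving:.0%} less than 16-bit)")
```

Going from 16-bit to 4-bit weights takes the example from 14 GB to 3.5 GB, which is the roughly 75% reduction usually quoted for QLoRA.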

2026 variants worth knowing

LoRA is no longer a single technique — it's a family. Three variants are worth having in your toolkit:

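  • QLoRA: LoRA applied on top of a 4-bit quantised base model. Cuts the memory taken by the base weights by around 75% compared with 16-bit and is the best fit for consumer hardware.
  • DoRA: decomposes each pretrained weight matrix into a magnitude and a direction, and applies the low-rank update to the direction. Liu et al. report that this improves on vanilla LoRA's learning capacity and training stability. Enable it in PEFT with use_dora=True.
  • RS-LoRA: rank-stabilised LoRA. It scales the adapter by alpha / sqrt(r) instead of alpha / r, which keeps training stable as the rank grows, so higher ranks remain useful on complex datasets. Enable it in PEFT with use_rslora=True.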
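The rank-stabilised variant is easiest to see as arithmetic. As described by Kalajdzievski (2023), it replaces LoRA's alpha / r scaling factor with alpha / sqrt(r). The sketch below compares the two for a few ranks; alpha = 16 and the function names are arbitrary choices for illustration.

```python
import math


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilised scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 16  # illustrative value
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  lora={lora_scale(alpha, r):.4f}  rslora={rslora_scale(alpha, r):.4f}")
```

With the standard factor the adapter's contribution shrinks quickly as the rank grows; the square-root version shrinks it far more slowly, which is the property the method relies on when training at higher ranks.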

When to use LoRA (and when not to)

Fine-tuning is not always the right tool. The clearest way to decide is to ask whether you are teaching the model a behaviour or feeding it facts.

Reach for LoRA when you need:

  • Brand voice or style consistency.
  • Strict output formats — JSON, schemas, structured responses.
  • Domain knowledge that is stable over time (legal, medical, architecture/engineering/construction).
  • Edge deployment — small, specialised models that run locally.

Reach for RAG + prompting instead when you need:

  • Factual grounding from live or changing data.
  • Knowledge that updates frequently.
  • One-off tasks that don't justify a training run.
  • General-purpose assistant behaviour.

A useful rule of thumb: LoRA teaches the model how to respond; retrieval tells it what is true right now. Many production systems use both — a LoRA-tuned model for tone and format, RAG for up-to-date facts.

The takeaway

LoRA turned fine-tuning from an infrastructure project into something you can run on the GPU already in your machine. Freeze the base, train a pair of small adapter matrices for each adapted layer, optionally quantise the base with QLoRA, and you get a specialised model at a fraction of the cost, with quality that, for many tasks, is close enough to a full fine-tune that the gap is hard to notice.

If your problem is about behaviour — voice, format, domain style — LoRA is very likely the cheapest good answer.


References: LoRA (Hu et al., 2022) · QLoRA (Dettmers et al., 2023) · DoRA (Liu et al., 2024) · RS-LoRA (Kalajdzievski, 2023).