<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://girijesh-ai.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://girijesh-ai.github.io/" rel="alternate" type="text/html" /><updated>2026-02-20T14:11:36+00:00</updated><id>https://girijesh-ai.github.io/feed.xml</id><title type="html">Girijesh Prasad’s AI Blog</title><subtitle>Deep dives into AI, Machine Learning, and Agentic Systems. Practical guides for AI Engineers and Data Scientists.</subtitle><author><name>Girijesh Prasad</name><email>girijeshprasad@example.com</email></author><entry><title type="html">The Story of Embedding — Deep Dive: From Bag of Words to Sentence Transformers</title><link href="https://girijesh-ai.github.io/ai/llm/embedding/2026/02/20/story-of-embedding.html" rel="alternate" type="text/html" title="The Story of Embedding — Deep Dive: From Bag of Words to Sentence Transformers" /><published>2026-02-20T03:30:00+00:00</published><updated>2026-02-20T03:30:00+00:00</updated><id>https://girijesh-ai.github.io/ai/llm/embedding/2026/02/20/story-of-embedding</id><content type="html" xml:base="https://girijesh-ai.github.io/ai/llm/embedding/2026/02/20/story-of-embedding.html"><![CDATA[<h1 id="the-story-of-embedding--deep-dive-from-bag-of-words-to-sentence-transformers">The Story of Embedding — Deep Dive: From Bag of Words to Sentence Transformers</h1>

<p><em>The mathematical intuitions, architectural decisions, and production lessons behind 70 years of teaching machines to understand language.</em></p>

<hr />

<h2 id="why-this-post-exists">Why This Post Exists</h2>

<p>There are hundreds of “intro to embeddings” posts out there. Most of them tell you <em>what</em> Word2Vec and BERT are. Very few explain <em>why</em> each generation of embeddings emerged, what mathematical insight drove each breakthrough, and what actually matters when you’re deploying these systems in production.</p>

<p>This post is for engineers who want to go deeper — who want to understand not just the “what” but the “why” and the “how it actually works under the hood.”</p>

<p>Let’s trace the full arc, starting from first principles.</p>

<hr />

<h2 id="1-the-representation-problem-why-vectors">1. The Representation Problem: Why Vectors?</h2>

<p>Before we count a single word, we need to answer a fundamental question: <strong>why represent text as vectors at all?</strong></p>

<p>The answer is deceptively simple: <strong>vectors give us geometry</strong>, and geometry gives us the ability to <em>measure</em>. Once you have text as vectors, you can compute distances (how different are two documents?), find nearest neighbours (what’s the most similar sentence?), and perform operations (what’s halfway between “happy” and “sad”?).</p>

<p>The entire history of embeddings is really the history of making these geometric operations <em>meaningful</em> — making the geometry of the vector space mirror the semantics of language.</p>

<pre class="mermaid">
timeline
    title The Evolution of Text Embeddings
    section Count-Based
        1950s : Bag of Words
            : Simple frequency counting
        1972 : TF-IDF
            : Information-theoretic weighting
        1990 : LSA (SVD)
            : Latent semantic structure
    section Neural Static
        2003 : Bengio NPLM
            : First neural word embeddings
        2013 : Word2Vec
            : Negative sampling breakthrough
        2014 : GloVe
            : Global co-occurrence factorisation
        2016 : FastText
            : Subword n-grams
    section Contextual
        2017 : Transformer
            : Self-attention architecture
        2018 : ELMo
            : Layer-wise contextual representations
        2018 : BERT
            : Pretraining-finetuning paradigm
    section Sentence-Level
        2019 : Sentence-BERT
            : Siamese bi-encoders
        2020 : ColBERT
            : Late interaction
        2022 : Matryoshka
            : Adaptive dimensionality
        2023-24 : E5 / BGE / NV-Embed
            : Instruction-tuned embeddings
</pre>

<hr />

<h2 id="2-the-counting-era-bow-tf-idf-and-their-hidden-mathematics">2. The Counting Era: BoW, TF-IDF, and Their Hidden Mathematics</h2>

<h3 id="bag-of-words-1950s">Bag of Words (1950s)</h3>

<p>BoW maps each document to a |V|-dimensional vector, where |V| is the vocabulary size. Simple frequency counting. But here’s what most tutorials skip: <strong>BoW is actually performing a projection from the infinite-dimensional space of possible utterances onto a finite vector space</strong> — and it’s a lossy projection that discards word order, syntax, and semantics.</p>

<p>The fundamental limitation isn’t just “no semantics.” It’s the <strong>curse of dimensionality for sparse vectors</strong>. With |V| = 100,000, every document lives in a 100,000-dimensional space where cosine similarity becomes almost meaningless — in high-dimensional sparse spaces, all pairwise distances converge, a phenomenon known as the <strong>concentration of measure</strong>.</p>
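<p>A quick way to see this concentration effect is a toy simulation — random sparse count vectors standing in for BoW documents (made-up data, purely illustrative): as the vocabulary grows with document length fixed, both the mean and the spread of pairwise cosine similarities collapse towards zero, so “nearest” and “farthest” neighbours become nearly indistinguishable.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_spread(dim, n_docs=100, nnz=50):
    """Random sparse BoW-style count vectors; return (mean, std) of pairwise cosines."""
    docs = np.zeros((n_docs, dim))
    for row in docs:
        idx = rng.choice(dim, size=nnz, replace=False)  # each "document" uses nnz words
        row[idx] = rng.integers(1, 5, size=nnz)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    sims = docs @ docs.T
    off_diag = sims[~np.eye(n_docs, dtype=bool)]        # drop self-similarities
    return off_diag.mean(), off_diag.std()

spread = {dim: cosine_spread(dim) for dim in (100, 1_000, 50_000)}
for dim, (m, s) in spread.items():
    print(f"|V|={dim:>6}: mean cosine={m:.3f}, std={s:.3f}")
```

Both numbers shrink as |V| grows: all document pairs drift towards “equally dissimilar”, which is exactly why raw BoW cosine breaks down at scale.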

<h3 id="tf-idf-information-theoretic-weighting">TF-IDF: Information-Theoretic Weighting</h3>

<p>TF-IDF is more interesting than most people realise. The IDF component:</p>

\[\text{IDF}(t) = \log\frac{N}{df(t)}\]

<p>is essentially an <strong>information-theoretic</strong> quantity. A word that appears in every document (df(t) = N) has IDF = 0 — zero information value. A rare word has high IDF. This connects directly to Shannon’s self-information: rare events carry more information.</p>

<p>But TF-IDF still builds on the <strong>independence assumption</strong> — it treats each word as statistically independent of every other word. “New York” is just “New” + “York”. This is where the paradigm needed to break.</p>
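<p>The IDF formula above is small enough to compute by hand — a minimal sketch with a made-up three-document corpus, showing the zero-information case directly:</p>

```python
import math

def idf(term, docs):
    """Shannon-style IDF: log(N / df(t)). Zero for a term that appears in every doc."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy corpus of three "documents" as word sets (illustrative data)
docs = [{"new", "york", "pizza"}, {"new", "delhi"}, {"new", "model"}]

print(idf("new", docs))    # in all 3 docs -> log(3/3) = 0, zero information
print(idf("pizza", docs))  # rare -> log(3/1) ≈ 1.10, high information
```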

<h3 id="lsa-the-forgotten-bridge-1990">LSA: The Forgotten Bridge (1990)</h3>

<p>Most “embedding history” posts jump from TF-IDF to Word2Vec, skipping the critical intermediate step: <strong>Latent Semantic Analysis (LSA)</strong> by Deerwester et al. (1990).</p>

<p>LSA takes the term-document matrix and applies <strong>Singular Value Decomposition (SVD)</strong>:</p>

\[X \approx U_k \Sigma_k V_k^T\]

<p>By keeping only the top-k singular values, you project documents into a k-dimensional space (typically k=100-300) where <strong>synonyms collapse together</strong> and <strong>polysemy partially resolves</strong>. LSA was the first demonstration that <strong>dimensionality reduction on co-occurrence data captures latent semantic structure</strong>.</p>

<p>This insight — that meaning hides in the statistical structure of co-occurrence — is the intellectual ancestor of everything that follows.</p>
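<p>The whole LSA pipeline fits in a few lines of NumPy. Here is a sketch on a tiny made-up term-document count matrix, where “car” and “automobile” never share a document but do share column patterns — after truncated SVD, the synonyms collapse onto the same latent direction:</p>

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = docs); invented counts
X = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "automobile" — similar column pattern to "car"
    [0, 0, 3, 1],   # "fruit"
    [0, 0, 1, 2],   # "apple"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                        # keep only the top-k singular values
terms_k = U[:, :k] * s[:k]   # k-dimensional latent term vectors

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(terms_k[0], terms_k[1]))  # car vs automobile: near 1 (synonyms collapse)
print(cos(terms_k[0], terms_k[2]))  # car vs fruit: near 0
```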

<hr />

<h2 id="3-the-neural-turn-bengios-nplm-2003--the-forgotten-origin">3. The Neural Turn: Bengio’s NPLM (2003) — The Forgotten Origin</h2>

<p>The standard narrative says Word2Vec (2013) started neural embeddings. <strong>That’s wrong.</strong> The actual origin is Yoshua Bengio’s <strong>Neural Probabilistic Language Model (NPLM)</strong>, published in 2003 — a full decade earlier.</p>

<p>Bengio’s key insight: assign each word a <strong>learned distributed representation</strong> (a dense vector), then train a neural network to predict the next word from the concatenation of the previous n words’ vectors.</p>

<p>The model had three components:</p>

<ol>
  <li><strong>Embedding lookup table</strong> C: a |V| × d matrix mapping word indices to d-dimensional vectors</li>
  <li><strong>Hidden layer</strong>: <code class="language-plaintext highlighter-rouge">h = tanh(H · [C(w_{t-n+1}); ...; C(w_{t-1})] + b)</code></li>
  <li><strong>Output softmax</strong>: probability distribution over all |V| words</li>
</ol>

<p>The genius was that the <strong>embedding table C was learned jointly</strong> with the prediction task. Words that could appear in similar contexts would naturally get similar embeddings, because similar embeddings would produce similar predictions through the hidden layer.</p>

<pre class="mermaid">
graph LR
    subgraph Input["Input: Previous n words"]
        W1["w(t-3)"] --&gt; E1["Embedding C(w(t-3))"]
        W2["w(t-2)"] --&gt; E2["Embedding C(w(t-2))"]
        W3["w(t-1)"] --&gt; E3["Embedding C(w(t-1))"]
    end
    E1 --&gt; CONCAT["Concatenate"]
    E2 --&gt; CONCAT
    E3 --&gt; CONCAT
    CONCAT --&gt; HIDDEN["Hidden Layer - tanh Hx + b"]
    HIDDEN --&gt; SOFTMAX["Softmax over V words - O(V) bottleneck"]
    SOFTMAX --&gt; PRED["P w_t = next word"]
    style SOFTMAX fill:#ff6b6b,stroke:#333,color:#fff
    style PRED fill:#51cf66,stroke:#333,color:#fff
</pre>

<p><strong>Why did it take 10 years to become mainstream?</strong> Bengio’s model was computationally expensive. The softmax output layer required computing a |V|-way classification <em>for every position in the training data</em>. With |V| = 100K words and billions of training positions, this was intractable in 2003.</p>

<p>Word2Vec’s real contribution wasn’t the idea of neural embeddings — it was making them <strong>computationally feasible</strong>.</p>

<hr />

<h2 id="4-word2vec-2013-the-trick-was-in-the-training">4. Word2Vec (2013): The Trick Was in the Training</h2>

<h3 id="the-skip-gram-objective">The Skip-gram Objective</h3>

<p>Skip-gram’s true objective function maximises:</p>

\[J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)\]

<p>where T is the total words in the corpus, c is the context window size, and:</p>

\[P(w_O | w_I) = \frac{\exp(\tilde{v}_{w_O}^{\,T} \cdot v_{w_I})}{\sum_{w=1}^{V} \exp(\tilde{v}_w^{\,T} \cdot v_{w_I})}\]

<p>The denominator is a <strong>sum over the entire vocabulary</strong> — this is the bottleneck that killed Bengio’s model. With V = 100K+, computing this for every training example is absurdly expensive.</p>

<h3 id="negative-sampling-the-actual-innovation">Negative Sampling: The Actual Innovation</h3>

<p>Mikolov’s key contribution was <strong>negative sampling</strong>, which replaces the expensive softmax with a much cheaper binary classification:</p>

\[\log \sigma(\tilde{v}_{w_O}^{\,T} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} [\log \sigma(-\tilde{v}_{w_i}^{\,T} \cdot v_{w_I})]\]

<p>Instead of computing probabilities over all V words, you:</p>

<ol>
  <li>Take the actual context word (positive) — push its vector <strong>towards</strong> the target</li>
  <li>Sample k random “noise” words (negatives, typically k=5-15) — push their vectors <strong>away</strong> from the target</li>
</ol>

<p>The noise distribution P_n(w) is the unigram distribution raised to the 3/4 power: <code class="language-plaintext highlighter-rouge">P_n(w) = U(w)^{3/4}/Z</code>. The 3/4 exponent is an empirical choice that slightly upweights rare words relative to their frequency — preventing extremely common words from dominating the negative samples.</p>

<p>This reduced training from O(V) per example to O(k) per example. <strong>That’s the real reason Word2Vec succeeded where Bengio’s NPLM struggled</strong> — not a fundamentally different idea, but a training trick that made it 10,000x faster.</p>
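<p>The update rule is simple enough to sketch from scratch. The following is a minimal, illustrative NumPy implementation of one SGNS gradient step (toy vocabulary, random “counts”, hand-picked word indices — none of it real Word2Vec training data); note that each step touches only 1 + k rows of the output matrix, never all V:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 1000, 50, 5, 0.05

W_in = rng.normal(scale=0.1, size=(V, d))   # target-word vectors v_w
W_out = rng.normal(scale=0.1, size=(V, d))  # context-word vectors ṽ_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# unigram^(3/4) noise distribution, as in Mikolov et al.
counts = rng.integers(1, 100, size=V).astype(float)
noise = counts ** 0.75
noise /= noise.sum()

def sgns_step(target, context):
    """One skip-gram negative-sampling update: O(k) work instead of O(V)."""
    v_t = W_in[target].copy()
    negatives = rng.choice(V, size=k, p=noise)
    # positive pair: push ṽ_context towards v_target
    g = sigmoid(W_out[context] @ v_t) - 1.0          # d/dṽ of -log σ(ṽ·v)
    grad_t = g * W_out[context]
    W_out[context] = W_out[context] - lr * g * v_t
    # negative samples: push each ṽ_noise away from v_target
    for n in negatives:
        g = sigmoid(W_out[n] @ v_t)
        grad_t = grad_t + g * W_out[n]
        W_out[n] = W_out[n] - lr * g * v_t
    W_in[target] = v_t - lr * grad_t

before = sigmoid(W_out[7] @ W_in[3])
for _ in range(50):
    sgns_step(target=3, context=7)                   # repeatedly observe the pair (3, 7)
after = sigmoid(W_out[7] @ W_in[3])
print(f"positive-pair score: {before:.3f} -> {after:.3f}")
```

Repeatedly observing the (target, context) pair drives their score towards 1 whilst only ever touching k = 5 sampled rows per step — the whole trick in miniature.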

<pre class="mermaid">
graph TD
    subgraph FULL["Full Softmax (Bengio)"]
        direction LR
        TGT1["Target word"] --&gt; COMP1["Compute score against\nALL V words"]
        COMP1 --&gt; NORM1["Normalise\n(expensive!)"]
        NORM1 --&gt; COST1["O(V) per example\n❌ ~100K operations"]
    end

    subgraph NEG["Negative Sampling (Word2Vec)"]
        direction LR
        TGT2["Target word"] --&gt; POS["✅ 1 positive\n(actual context word)"]
        TGT2 --&gt; NEGS["❌ k=5 negatives\n(random noise words)"]
        POS --&gt; COST2["O(k) per example\n✅ ~5 operations"]
        NEGS --&gt; COST2
    end

    FULL -.-&gt;|"replaced by"| NEG

    style COST1 fill:#ff6b6b,stroke:#333,color:#fff
    style COST2 fill:#51cf66,stroke:#333,color:#fff
</pre>

<h3 id="why-king---man--woman--queen-actually-works">Why King - Man + Woman ≈ Queen Actually Works</h3>

<p>This isn’t magic. It’s a consequence of the linear structure that skip-gram implicitly learns.</p>

<p>If “king” and “queen” appear in similar royal/monarchical contexts, and “man” and “woman” appear in similar gender-differentiated contexts, then the model learns embeddings where the <strong>gender direction</strong> (man → woman) and the <strong>royalty direction</strong> (commoner → royal) are approximately independent linear subspaces.</p>

<p>Mathematically:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">v(king) ≈ v(royalty) + v(male)</code></li>
  <li><code class="language-plaintext highlighter-rouge">v(queen) ≈ v(royalty) + v(female)</code></li>
  <li><code class="language-plaintext highlighter-rouge">v(king) - v(man) + v(woman) ≈ v(royalty) + v(male) - v(male) + v(female) ≈ v(royalty) + v(female) ≈ v(queen)</code></li>
</ul>

<p>Levy and Goldberg (2014) proved that <strong>Skip-gram with negative sampling is implicitly factorising a shifted PMI matrix</strong> — the pointwise mutual information between words and contexts, shifted by log(k). This connects Word2Vec back to the distributional semantics tradition and explains <em>why</em> the embeddings capture semantic relationships: PMI is a well-understood measure of statistical association.</p>
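<p>The decomposition above can be made concrete with hypothetical toy vectors — two hand-built, roughly independent directions standing in for “royalty” and “gender” (this is an illustration of the geometry, not real Word2Vec output):</p>

```python
import numpy as np

# Invented 2-D vectors: one axis for royalty, one for gender
royalty = np.array([1.0, 0.0])
male    = np.array([0.0, 1.0])
female  = -male

words = {
    "king":     royalty + male,
    "queen":    royalty + female,
    "man":      male,
    "woman":    female,
    "princess": 0.8 * royalty + female,   # distractor candidate
}

def nearest(v, exclude):
    """Cosine nearest neighbour, excluding the query words themselves."""
    sims = {w: v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
            for w, u in words.items() if w not in exclude}
    return max(sims, key=sims.get)

target = words["king"] - words["man"] + words["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words from the candidate set mirrors standard analogy-evaluation practice — without it, the nearest neighbour of the target is usually “king” itself.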

<hr />

<h2 id="5-glove-making-the-implicit-explicit">5. GloVe: Making the Implicit Explicit</h2>

<h3 id="the-objective-function">The Objective Function</h3>

<p>Pennington et al. at Stanford asked: if Word2Vec is implicitly factorising a co-occurrence matrix, why not do it <strong>explicitly</strong>?</p>

<p>GloVe’s objective:</p>

\[J = \sum_{i,j=1}^{V} f(X_{ij})(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\]

<p>where X_ij is the co-occurrence count of words i and j, and f(x) is a weighting function:</p>

\[f(x) = \begin{cases} (x/x_{max})^{0.75} &amp; \text{if } x &lt; x_{max} \\ 1 &amp; \text{otherwise} \end{cases}\]

<p>The weighting function f() is crucial: it prevents extremely frequent co-occurrences (like “the” + anything) from dominating the objective, whilst giving zero weight to word pairs that never co-occur (X_ij = 0).</p>

<p><strong>Key insight:</strong> The model asks that the <strong>dot product of two word vectors</strong> should approximate the <strong>log of their co-occurrence count</strong>. Words that co-occur frequently → high dot product → similar vectors.</p>
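<p>The weighting function is worth seeing numerically — a direct transcription of the piecewise definition above, with the paper’s defaults x_max = 100 and α = 0.75:</p>

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting f(X_ij): zero for absent pairs, capped for frequent ones."""
    if x <= 0:
        return 0.0                         # never co-occur -> no contribution
    return min((x / x_max) ** alpha, 1.0)  # cap at 1 past x_max

print(glove_weight(0))      # 0.0 — pair never co-occurs, log X_ij undefined anyway
print(glove_weight(10))     # (10/100)^0.75 ≈ 0.178 — moderate pair, moderate weight
print(glove_weight(5000))   # 1.0 — "the" + anything is capped, cannot dominate
```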

<h3 id="when-to-choose-glove-vs-word2vec">When to Choose GloVe vs Word2Vec</h3>

<p>In practice, the difference is marginal for most downstream tasks (Levy et al., 2015 showed they perform similarly when hyperparameters are properly tuned). The real trade-off is:</p>

<ul>
  <li><strong>GloVe</strong>: Single-pass over co-occurrence matrix, deterministic, easier to parallelise</li>
  <li><strong>Word2Vec</strong>: Online learning (can update with new data), stochastic, works well with streaming data</li>
</ul>

<hr />

<h2 id="6-fasttext-morphology-matters">6. FastText: Morphology Matters</h2>

<p>FastText’s innovation isn’t just “handles OOV words.” The deeper insight is about <strong>morphological compositionality</strong>.</p>

<p>The word vector is the sum of its character n-gram vectors:</p>

\[v_{w} = \sum_{g \in \mathcal{G}(w)} z_g\]

<p>where G(w) is the set of n-grams (n=3-6 typically) for word w, plus the word itself.</p>

<p>This means:</p>

<ul>
  <li>“unhappy” ≈ “un” + “happy” → the “un-“ prefix carries negation information</li>
  <li>“running”, “runner”, “ran” share subword features</li>
  <li>Misspelled “embeddding” shares most n-grams with “embedding”</li>
</ul>

<p><strong>Why this matters for production:</strong> In real-world data, you encounter typos, domain-specific neologisms, code-mixed text (Hindi + English), and morphologically rich languages. FastText handles all of these gracefully, whilst Word2Vec and GloVe would return a zero/random vector.</p>
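<p>The n-gram decomposition is easy to sketch. FastText brackets each word with boundary markers (<code class="language-plaintext highlighter-rouge">&lt;</code>, <code class="language-plaintext highlighter-rouge">&gt;</code>) before extracting n-grams; the following is a simplified illustration (set-based, ignoring FastText’s hashing of n-grams into buckets) showing how much subword structure a typo preserves:</p>

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams with boundary markers, plus the word itself."""
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

a = char_ngrams("embedding")
b = char_ngrams("embeddding")          # misspelling with an extra 'd'
overlap = len(a & b) / len(a | b)      # Jaccard similarity of n-gram sets
print(f"n-gram overlap with typo: {overlap:.2f}")
```

A large fraction of the n-grams survives the typo, so the summed subword vectors land near the correctly spelt word — where a pure word-level model would fall back to an unknown-word vector.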

<hr />

<h2 id="7-elmo-the-layer-wise-revelation">7. ELMo: The Layer-Wise Revelation</h2>

<h3 id="architecture">Architecture</h3>

<p>ELMo (Peters et al., 2018) uses a <strong>2-layer bidirectional LSTM</strong> trained as a language model. The critical insight wasn’t just “context-dependent vectors” — it was what each layer captures.</p>

<p>The ELMo representation for a token k is:</p>

\[\text{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}\]

<p>where:</p>

<ul>
  <li>h_{k,0} = character-level CNN (subword features)</li>
  <li>h_{k,1} = first LSTM layer (syntactic features)</li>
  <li>h_{k,2} = second LSTM layer (semantic features)</li>
  <li>s_j = softmax-normalised weights (learned per task)</li>
  <li>γ = task-specific scaling factor</li>
</ul>

<p><strong>The revelation:</strong> Peters et al. showed that <strong>different layers encode different linguistic properties</strong>. Lower layers capture syntax (POS tags, syntactic dependencies), higher layers capture semantics (word sense, sentiment). This was the first hard evidence for <strong>hierarchical language representation</strong> in neural networks — an insight that would prove fundamental for understanding Transformers.</p>

<pre class="mermaid">
graph BT
    INPUT["Raw Text: I went to the bank"] --&gt; CHAR["Layer 0: Character CNN - Subword features, morphology"]
    CHAR --&gt; L1["Layer 1: Bidirectional LSTM - Syntax: POS tags, dependencies"]
    L1 --&gt; L2["Layer 2: Bidirectional LSTM - Semantics: word sense, sentiment"]
    L2 --&gt; COMBINE["Task-Specific Weighted Sum"]
    CHAR --&gt; COMBINE
    L1 --&gt; COMBINE
    COMBINE --&gt; TASK["Downstream Task"]
    style CHAR fill:#74c0fc,stroke:#333
    style L1 fill:#748ffc,stroke:#333,color:#fff
    style L2 fill:#9775fa,stroke:#333,color:#fff
    style COMBINE fill:#ffd43b,stroke:#333
</pre>

<h3 id="the-feature-based-vs-fine-tuning-distinction">The Feature-Based vs Fine-Tuning Distinction</h3>

<p>ELMo was used as a <strong>feature extractor</strong> — you’d freeze ELMo and concatenate its outputs with your task-specific model’s inputs. This is different from BERT’s approach of fine-tuning the entire model. The debate between feature-based and fine-tuning approaches continues even today (prefix tuning, adapters, LoRA all revisit this tension).</p>

<hr />

<h2 id="8-attention-is-all-you-need-2017-the-foundation">8. Attention Is All You Need (2017): The Foundation</h2>

<p>Before BERT, we need to understand the <strong>Transformer</strong> (Vaswani et al., 2017), because it’s the architectural foundation for everything that follows.</p>

<h3 id="self-attention-the-core-mechanism">Self-Attention: The Core Mechanism</h3>

<p>The attention function:</p>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

<p>Three things to understand here:</p>

<p><strong>1. Why Q, K, V?</strong> These come from information retrieval. Query (what am I looking for?), Key (what does each position offer?), Value (what information does each position contain?). Each word generates all three by multiplying with learned weight matrices: Q = XW_Q, K = XW_K, V = XW_V.</p>

<p><strong>2. Why scale by √d_k?</strong> Without scaling, when d_k is large, the dot products QK^T can become very large in magnitude, pushing the softmax into regions where it has <strong>extremely small gradients</strong> (saturation). Scaling by √d_k keeps the variance of the dot products at ~1 regardless of dimensionality. This is subtle but critical for training stability.</p>

<p><strong>3. Why multi-head?</strong> Instead of a single attention function with d_model dimensions, use h attention heads, each with d_k = d_model/h dimensions:</p>

\[\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

<p>Each head can attend to different aspects of the input (one head for syntactic relations, another for semantic similarity, another for coreference, etc.). This is not just a performance trick — it enables <strong>different representational subspaces</strong>.</p>
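<p>All three points fit in a short NumPy sketch — scaled dot-product attention plus a multi-head wrapper, with random matrices standing in for the learned weights (untrained, illustrative only):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention; dividing by sqrt(d_k) keeps the softmax unsaturated."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

seq_len, d_model, h = 4, 64, 8
d_k = d_model // h                       # each head works in d_model/h dimensions
X = rng.normal(size=(seq_len, d_model))  # stand-in for embeddings + positions

heads = []
for _ in range(h):                       # one learned (W_Q, W_K, W_V) triple per head
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)  # (4, 64): one contextualised vector per input position
```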

<pre class="mermaid">
graph LR
    subgraph Input
        X["Input Embeddings + Positional Encoding"]
    end
    X --&gt; WQ["W_Q"] --&gt; Q["Queries"]
    X --&gt; WK["W_K"] --&gt; K["Keys"]
    X --&gt; WV["W_V"] --&gt; V["Values"]
    Q --&gt; DOT["QK_T / sqrt d_k"]
    K --&gt; DOT
    DOT --&gt; SM["Softmax attention weights"]
    SM --&gt; MUL["Multiply with V"]
    V --&gt; MUL
    MUL --&gt; H1["Head 1 - syntax"]
    MUL --&gt; H2["Head 2 - semantics"]
    MUL --&gt; H3["Head 3 - coreference"]
    MUL --&gt; Hn["Head h - ..."]
    H1 --&gt; CAT["Concat"]
    H2 --&gt; CAT
    H3 --&gt; CAT
    Hn --&gt; CAT
    CAT --&gt; WO["W_O"] --&gt; OUT["Output"]
    style DOT fill:#ffd43b,stroke:#333
    style SM fill:#ff922b,stroke:#333,color:#fff
    style OUT fill:#51cf66,stroke:#333,color:#fff
</pre>

<h3 id="positional-encoding-the-unsung-hero">Positional Encoding: The Unsung Hero</h3>

<p>Attention is <strong>permutation-invariant</strong> — it doesn’t know word order. The positional encoding adds order information using sinusoidal functions:</p>

\[PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})\]

\[PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})\]

<p>Why sinusoids? Because <code class="language-plaintext highlighter-rouge">PE(pos+k)</code> can be expressed as a linear function of <code class="language-plaintext highlighter-rouge">PE(pos)</code>, meaning the model can learn to attend to <strong>relative positions</strong> — “the word 3 positions back” — rather than absolute positions. Later models (RoPE, ALiBi) improved on this, but the intuition remains.</p>
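<p>The two formulas above translate directly into a vectorised sketch — even indices get the sine, odd indices the cosine, with frequencies falling geometrically across the dimension:</p>

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # frequency index
    angles = pos / (10000 ** (2 * i / d_model))  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=128)
print(pe.shape)    # (50, 128)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 at every frequency
```

Each (sin, cos) pair at position pos+k is a fixed rotation of the pair at pos, which is the linearity property that lets the model express “k positions back”.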

<hr />

<h2 id="9-bert-2018-the-paradigm-shift">9. BERT (2018): The Paradigm Shift</h2>

<h3 id="what-most-people-get-wrong-about-bert">What Most People Get Wrong About BERT</h3>

<p>BERT’s contribution is often summarised as “bidirectional Transformers.” That’s deeply incomplete. BERT’s actual innovation was <strong>the pretraining-finetuning paradigm for NLP</strong>:</p>

<ol>
  <li><strong>Pre-train</strong> a massive model on unlabelled text using self-supervised objectives</li>
  <li><strong>Fine-tune</strong> the entire model on your specific task with minimal labelled data</li>
</ol>

<p>This was revolutionary because <strong>labelled data is expensive</strong>; unlabelled text is effectively infinite.</p>

<h3 id="the-two-pre-training-objectives">The Two Pre-training Objectives</h3>

<p><strong>Masked Language Modelling (MLM):</strong> Randomly mask 15% of input tokens and predict them. But here’s the subtlety — of the 15% selected tokens:</p>

<ul>
  <li>80% are replaced with [MASK]</li>
  <li>10% are replaced with a random word</li>
  <li>10% are kept unchanged</li>
</ul>

<p>Why this mixed strategy? If all selected tokens were replaced with [MASK], the model would never see [MASK] during fine-tuning, creating a <strong>train-test mismatch</strong>. The random replacement and unchanged tokens mitigate this.</p>
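<p>The 80/10/10 corruption scheme can be sketched in a few lines — a simplified token-level version (BERT operates on WordPiece sub-tokens, and the toy sentence and vocabulary here are invented for illustration):</p>

```python
import random

random.seed(0)
MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM corruption: of selected tokens, 80% [MASK], 10% random, 10% kept."""
    out, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        targets.append(i)                      # loss is computed only at these positions
        r = random.random()
        if r < 0.8:
            out[i] = MASK                      # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = random.choice(vocab)      # 10%: random word (fights train-test mismatch)
        # else: 10% — keep the original token unchanged
    return out, targets

sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["cat", "dog", "ran", "sat", "the", "on", "mat"]
corrupted, targets = mlm_corrupt(sentence, vocab)
print(corrupted, targets)
```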

<p><strong>Next Sentence Prediction (NSP):</strong> Given sentence A, predict whether sentence B is the actual next sentence or a random one. <strong>This objective was later shown to be mostly harmful.</strong> RoBERTa (2019) removed NSP and improved performance, showing that cross-sentence reasoning emerges naturally from MLM alone when trained on longer sequences.</p>

<h3 id="the-cls-token-problem">The [CLS] Token Problem</h3>

<p>BERT prepends a special [CLS] token and trains it via NSP to represent the “whole input.” Many people use <code class="language-plaintext highlighter-rouge">output[CLS]</code> as a sentence embedding. <strong>This is a terrible idea for similarity tasks.</strong></p>

<p>Reimers and Gurevych (2019) showed that using BERT [CLS] embeddings for semantic similarity gives results <strong>worse than GloVe averaged embeddings</strong>. Why? Because BERT’s [CLS] was trained for NSP (a binary classification), not for producing meaningful continuous representations of sentence meaning. The embedding space is not isometric — distances don’t correspond to semantic similarity.</p>

<p>This fact is critical and widely misunderstood. It’s exactly why Sentence-BERT was necessary.</p>

<hr />

<h2 id="10-cross-encoders-vs-bi-encoders-the-fundamental-trade-off">10. Cross-Encoders vs Bi-Encoders: The Fundamental Trade-off</h2>

<p>This is the single most important architectural distinction in modern embeddings, and it’s astonishingly under-discussed.</p>

<h3 id="cross-encoder">Cross-Encoder</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input: [CLS] Sentence A [SEP] Sentence B [SEP]
      → BERT → Classification Head → Similarity Score
</code></pre></div></div>

<p>Both sentences are processed <strong>together</strong> through the Transformer. Every token in A can attend to every token in B. This gives maximum accuracy because the model can perform fine-grained token-level matching.</p>

<p><strong>Problem:</strong> You cannot pre-compute embeddings. To compare a query against 1M documents, you must run BERT 1M times with (query, doc_i) as input. For 10K sentences, finding the most similar pair requires C(10000,2) = 49,995,000 forward passes → <strong>~65 hours</strong>.</p>

<h3 id="bi-encoder-sentence-transformers">Bi-Encoder (Sentence Transformers)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sentence A → BERT → Pool → Embedding_A
Sentence B → BERT → Pool → Embedding_B
→ cosine_similarity(Embedding_A, Embedding_B)
</code></pre></div></div>

<p>Each sentence is processed <strong>independently</strong>. You can pre-compute all embeddings once, then compare using fast vector operations.</p>

<p><strong>For 10K sentences:</strong> 10,000 forward passes to encode all (seconds), then cosine similarity on 100M pairs is trivial (milliseconds with FAISS).</p>

<pre class="mermaid">
graph TB
    subgraph CE["Cross-Encoder"]
        direction LR
        IN_CE["CLS + Sent A + SEP + Sent B"] --&gt; BERT_CE["BERT full cross-attention"]
        BERT_CE --&gt; CLS_CE["CLS to Score"]
    end
    subgraph BE["Bi-Encoder Sentence-BERT"]
        direction LR
        SA["Sentence A"] --&gt; BERT_A["BERT"]
        SB["Sentence B"] --&gt; BERT_B["BERT shared weights"]
        BERT_A --&gt; POOL_A["Mean Pool emb_A"]
        BERT_B --&gt; POOL_B["Mean Pool emb_B"]
        POOL_A --&gt; COS["cosine_sim"]
        POOL_B --&gt; COS
    end
    CE --- COMPARE{"Trade-off"}
    BE --- COMPARE
    COMPARE --&gt; ACC["Cross-Encoder: Higher accuracy, 65 hours for 10K"]
    COMPARE --&gt; SPD["Bi-Encoder: 5 seconds for 10K, ~5-10% less accurate"]
    style CE fill:#ff8787,stroke:#333
    style BE fill:#69db7c,stroke:#333
    style ACC fill:#ffe3e3,stroke:#333
    style SPD fill:#d3f9d8,stroke:#333
</pre>

<h3 id="the-quality-gap-and-how-to-close-it">The Quality Gap and How to Close It</h3>

<p>Bi-encoders are ~5-10% less accurate than cross-encoders for similarity tasks. The standard production pattern is the <strong>retrieve-then-rerank pipeline</strong>:</p>

<ol>
  <li><strong>Retrieve</strong> top-100 candidates using bi-encoder (fast, milliseconds)</li>
  <li><strong>Rerank</strong> the 100 candidates using cross-encoder (accurate, still fast with only 100 pairs)</li>
</ol>

<p>This gives you cross-encoder quality at bi-encoder speed. It’s how virtually every production search system works today.</p>

<pre class="mermaid">
graph LR
    QUERY["User Query"] --&gt; EMBED["Bi-Encoder embed query"]
    EMBED --&gt; ANN["ANN Search FAISS / Qdrant"]
    DB[("Vector DB 10M+ docs")] --&gt; ANN
    ANN --&gt;|"Top 100 ~5ms"| RERANK["Cross-Encoder Reranking"]
    RERANK --&gt;|"Top 10 ~50ms"| RESULT["Final Results"]
    style QUERY fill:#74c0fc,stroke:#333
    style ANN fill:#ffd43b,stroke:#333
    style RERANK fill:#ff922b,stroke:#333,color:#fff
    style RESULT fill:#51cf66,stroke:#333,color:#fff
    style DB fill:#e599f7,stroke:#333
</pre>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span><span class="p">,</span> <span class="n">CrossEncoder</span><span class="p">,</span> <span class="n">util</span>

<span class="c1"># Stage 1: Bi-encoder retrieval
</span><span class="n">bi_encoder</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">'all-MiniLM-L6-v2'</span><span class="p">)</span>
<span class="n">corpus_embeddings</span> <span class="o">=</span> <span class="n">bi_encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">corpus</span><span class="p">,</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">query_embedding</span> <span class="o">=</span> <span class="n">bi_encoder</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">convert_to_tensor</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="c1"># Fast approximate nearest neighbours
</span><span class="n">hits</span> <span class="o">=</span> <span class="n">util</span><span class="p">.</span><span class="n">semantic_search</span><span class="p">(</span><span class="n">query_embedding</span><span class="p">,</span> <span class="n">corpus_embeddings</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">100</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># Stage 2: Cross-encoder reranking
</span><span class="n">cross_encoder</span> <span class="o">=</span> <span class="n">CrossEncoder</span><span class="p">(</span><span class="s">'cross-encoder/ms-marco-MiniLM-L-6-v2'</span><span class="p">)</span>
<span class="n">cross_inp</span> <span class="o">=</span> <span class="p">[[</span><span class="n">query</span><span class="p">,</span> <span class="n">corpus</span><span class="p">[</span><span class="n">hit</span><span class="p">[</span><span class="s">'corpus_id'</span><span class="p">]]]</span> <span class="k">for</span> <span class="n">hit</span> <span class="ow">in</span> <span class="n">hits</span><span class="p">]</span>
<span class="n">cross_scores</span> <span class="o">=</span> <span class="n">cross_encoder</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">cross_inp</span><span class="p">)</span>

<span class="c1"># Sort by cross-encoder scores
</span><span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">cross_scores</span><span class="p">)):</span>
    <span class="n">hits</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="s">'cross_score'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cross_scores</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
<span class="n">hits</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">hits</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'cross_score'</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h2 id="11-sentence-bert-architecture-details-that-matter">11. Sentence-BERT: Architecture Details That Matter</h2>

<h3 id="pooling-strategy-matters">Pooling Strategy Matters</h3>

<p>SBERT experiments showed three pooling strategies produce very different results:</p>

<table>
  <thead>
    <tr>
      <th>Pooling</th>
      <th>STS Benchmark (Spearman)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>[CLS] token</td>
      <td>29.19</td>
    </tr>
    <tr>
      <td>Max pooling</td>
      <td>82.32</td>
    </tr>
    <tr>
      <td><strong>Mean pooling</strong></td>
      <td><strong>83.18</strong></td>
    </tr>
  </tbody>
</table>

<p>Mean pooling (averaging all token embeddings) won. [CLS] was catastrophically worse. This empirical result destroyed the common practice of using [CLS] as a sentence representation.</p>
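<p>Mean pooling is simple enough to implement by hand. A minimal NumPy sketch (toy token embeddings and mask, not real model output) shows why the attention mask matters: padding tokens must be excluded from the average:</p>

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # sum only real tokens
    counts = mask.sum(axis=1).clip(min=1e-9)          # avoid divide-by-zero
    return summed / counts

# Toy batch: 1 sentence, 4 token slots, last 2 are padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]]
```

<p>Without the mask, the zero padding vectors would drag the average towards the origin, which is exactly the kind of silent bug that degrades retrieval quality.</p>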

<h3 id="training-data-combination">Training Data Combination</h3>

<p>SBERT’s training strategy was: first train on <strong>NLI data</strong> (SNLI + MultiNLI, 570K sentence pairs with entailment/contradiction/neutral labels), then fine-tune on <strong>STS data</strong> (semantic textual similarity with continuous 0-5 scores).</p>

<p>The NLI stage gives the model a coarse understanding of sentence relationships. The STS stage calibrates the similarity scores. <strong>This two-stage approach outperforms training on either dataset alone</strong> — a lesson that transfers to most fine-tuning scenarios.</p>

<h3 id="the-objective-function-1">The Objective Function</h3>

<p>For NLI training, SBERT concatenates the two sentence embeddings and their element-wise difference, then classifies:</p>

\[o = \text{softmax}(W_t \cdot [u; v; |u-v|])\]

<p>where u and v are the sentence embeddings. The <strong>|u−v|</strong> term is crucial — it explicitly encodes the difference between the two representations, helping the model learn what makes sentences similar or different.</p>
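<p>The feature construction can be written out directly. A NumPy sketch of the classifier input, where the weight matrix W_t is a random stand-in for the learned parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(4)        # sentence embedding u (toy 4-d)
v = rng.standard_normal(4)        # sentence embedding v

features = np.concatenate([u, v, np.abs(u - v)])   # [u; v; |u-v|], 3 * 4 dims
W_t = rng.standard_normal((3, features.shape[0]))  # 3 NLI classes

logits = W_t @ features
probs = np.exp(logits) / np.exp(logits).sum()      # softmax over classes
print(features.shape, round(probs.sum(), 6))       # feature dim 12, probs sum to 1
```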

<hr />

<h2 id="12-fine-tuning-embeddings-a-production-engineers-guide">12. Fine-Tuning Embeddings: A Production Engineer’s Guide</h2>

<h3 id="loss-functions--the-mathematics">Loss Functions — The Mathematics</h3>

<p><strong>Contrastive Loss:</strong></p>

\[L = \frac{1}{2}(1-y) \cdot D^2 + \frac{1}{2}y \cdot \max(0, m - D)^2\]

<p>where D is the distance between embeddings, y=0 for similar pairs, y=1 for dissimilar pairs, m is the margin. Similar items are pulled together unconditionally; dissimilar items are pushed apart only if they’re closer than margin m.</p>
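<p>As a quick sanity check of that behaviour, a pure-Python sketch of the loss on single toy distances:</p>

```python
def contrastive_loss(d, y, margin=1.0):
    """y=0: similar pair (pull together); y=1: dissimilar (push past margin)."""
    return 0.5 * (1 - y) * d**2 + 0.5 * y * max(0.0, margin - d) ** 2

print(contrastive_loss(0.2, 0))  # small: similar pair, already close
print(contrastive_loss(0.2, 1))  # larger: dissimilar pair well inside the margin
print(contrastive_loss(1.5, 1))  # 0.0: dissimilar pair already past the margin
```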

<p><strong>Triplet Loss:</strong></p>

\[L = \max(0, \|a - p\|^2 - \|a - n\|^2 + \alpha)\]

<p>where a=anchor, p=positive, n=negative, α=margin. The model learns to keep the positive closer to the anchor than the negative by at least margin α.</p>

<p><strong>Multiple Negatives Ranking Loss (MNRL):</strong></p>

\[L = -\log \frac{e^{sim(a_i, p_i)/\tau}}{\sum_{j=1}^{N} e^{sim(a_i, p_j)/\tau}}\]

<p>This is an <strong>in-batch softmax</strong>. For a batch of N (anchor, positive) pairs, each anchor is scored against all N positives: its own counts as the positive, and the other N-1 positives in the batch serve as negatives. With batch size 64, you get 63 free negatives per example.</p>
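<p>The in-batch trick is easy to see in code. A NumPy sketch of MNRL over a toy batch, with random vectors standing in for encoder output:</p>

```python
import numpy as np

def mnrl_loss(anchors, positives, tau=0.05):
    """In-batch softmax: each anchor's own positive vs every other positive."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / tau                     # (N, N) cosine similarities / temperature
    sims -= sims.max(axis=1, keepdims=True)  # stabilise the softmax numerically
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()        # correct pair sits on the diagonal

rng = np.random.default_rng(0)
anchors = rng.standard_normal((8, 16))       # batch of 8 (anchor, positive) pairs
positives = anchors + 0.01 * rng.standard_normal((8, 16))
loss = mnrl_loss(anchors, positives)
print(loss)  # near zero: each anchor already ranks its own positive first
```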

<p><strong>Why MNRL dominates in practice:</strong></p>

<ol>
  <li>You only need positive pairs (cheaper to curate)</li>
  <li>Larger batch sizes = more negatives = better gradients</li>
  <li>Temperature τ controls the hardness of the distribution</li>
</ol>

<h3 id="hard-negative-mining-the-10x-multiplier">Hard Negative Mining: The 10x Multiplier</h3>

<p>Random negatives are easy to distinguish — “What causes diabetes?” vs “How to cook pasta?” doesn’t teach the model much. <strong>Hard negatives</strong> are semantically close but actually different:</p>

<ul>
  <li>Query: “What causes type 2 diabetes?”</li>
  <li>Easy negative: “Best Italian restaurants in Mumbai”</li>
  <li><strong>Hard negative</strong>: “What are the symptoms of type 2 diabetes?”</li>
</ul>

<p>Hard negative mining strategies:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="kn">from</span> <span class="nn">sentence_transformers.util</span> <span class="kn">import</span> <span class="n">mine_hard_negatives</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">"all-MiniLM-L6-v2"</span><span class="p">)</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s">"natural-questions"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>

<span class="c1"># Mine hard negatives using current model's top-k
# These are passages the model currently ranks highly
# but are actually irrelevant
</span><span class="n">dataset</span> <span class="o">=</span> <span class="n">mine_hard_negatives</span><span class="p">(</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">dataset</span><span class="p">,</span>
    <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
    <span class="n">range_min</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>    <span class="c1"># Skip top-10 (likely true positives)
</span>    <span class="n">range_max</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>    <span class="c1"># Use ranks 10-50 as hard negatives
</span>    <span class="n">num_negatives</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="c1"># 5 hard negatives per example
</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="data-requirements--what-actually-works">Data Requirements — What Actually Works</h3>

<table>
  <thead>
    <tr>
      <th>Training Data Size</th>
      <th>Expected Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>100-500 pairs</td>
      <td>Noticeable domain adaptation</td>
    </tr>
    <tr>
      <td>1K-5K pairs</td>
      <td>Significant improvement</td>
    </tr>
    <tr>
      <td>10K-50K pairs</td>
      <td>Near-optimal for most domains</td>
    </tr>
    <tr>
      <td>100K+ pairs</td>
      <td>Diminishing returns (unless very diverse domain)</td>
    </tr>
  </tbody>
</table>

<p><strong>Critical rule:</strong> Quality &gt; Quantity. 1,000 carefully curated pairs from your domain outperform 100,000 noisy automatically-generated pairs.</p>

<hr />

<h2 id="13-the-embedding-anisotropy-problem">13. The Embedding Anisotropy Problem</h2>

<p>Here’s something most tutorials completely ignore: <strong>pre-trained embedding spaces are often anisotropic</strong>, meaning embeddings cluster in a narrow cone of the high-dimensional space rather than being uniformly distributed.</p>

<p><strong>Why this matters:</strong></p>

<ul>
  <li>In an anisotropic space, cosine similarity between random sentences averages ~0.6-0.8 instead of ~0.0</li>
  <li>This means similarity scores are less discriminative — the gap between “truly similar” and “random” is compressed</li>
  <li>High baseline similarity makes thresholding unreliable</li>
</ul>

<p><strong>Detection:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">'all-MiniLM-L6-v2'</span><span class="p">)</span>
<span class="n">random_sentences</span> <span class="o">=</span> <span class="p">[...]</span>  <span class="c1"># 1000 random sentences
</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">random_sentences</span><span class="p">)</span>
<span class="c1"># Compute mean pairwise cosine similarity
</span><span class="n">similarities</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">norms</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">norm</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">cosine_sim</span> <span class="o">=</span> <span class="n">similarities</span> <span class="o">/</span> <span class="p">(</span><span class="n">norms</span> <span class="o">@</span> <span class="n">norms</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">cosine_sim</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

<span class="n">avg_similarity</span> <span class="o">=</span> <span class="n">cosine_sim</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">/</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">random_sentences</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">random_sentences</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Average pairwise cosine similarity: </span><span class="si">{</span><span class="n">avg_similarity</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="c1"># Isotropic: ~0.0, Anisotropic: ~0.5-0.8
</span></code></pre></div></div>

<p><strong>Mitigation strategies:</strong></p>

<ol>
  <li><strong>Whitening</strong> (Su et al., 2021): Apply PCA whitening to normalise the embedding distribution</li>
  <li><strong>Fine-tuning with contrastive loss</strong>: Naturally spreads the distribution</li>
  <li><strong>Use models trained with better objectives</strong>: Models trained with MNRL tend to be more isotropic</li>
</ol>
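<p>Whitening is a few lines of linear algebra. A NumPy sketch following the idea in Su et al. (2021), applied to toy anisotropic data with one dominant direction:</p>

```python
import numpy as np

def whiten(embeddings, eps=1e-8):
    """PCA whitening: centre the data, then decorrelate and rescale each axis."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mu, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps))
    return (embeddings - mu) @ W

rng = np.random.default_rng(0)
# Anisotropic toy data: the first axis dominates the variance
X = rng.standard_normal((500, 8)) * np.array([10, 1, 1, 1, 1, 1, 1, 1])
Xw = whiten(X)
print(np.cov(Xw, rowvar=False).round(2))  # approximately the identity matrix
```

<p>After whitening, pairwise cosine similarities between unrelated inputs fall back towards zero, making similarity thresholds meaningful again.</p>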

<hr />

<h2 id="14-colbert-late-interaction--a-third-way">14. ColBERT: Late Interaction — A Third Way</h2>

<p>Beyond cross-encoders and bi-encoders, there’s a third architecture: <strong>late interaction</strong> (Khattab &amp; Zaharia, 2020).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Query: "What causes diabetes?"
        → BERT → [q1, q2, q3, q4]    # Keep ALL token embeddings

Document: "Diabetes results from insulin resistance..."
        → BERT → [d1, d2, d3, d4, d5, d6]  # Keep ALL token embeddings

Score = Σ max_j(q_i · d_j)   # MaxSim operation
</code></pre></div></div>

<p>Instead of compressing to a single vector (bi-encoder) or cross-attending (cross-encoder), ColBERT:</p>

<ol>
  <li>Encodes query and document <strong>independently</strong> (like bi-encoder)</li>
  <li>But keeps <strong>all token embeddings</strong> (unlike bi-encoder’s pooling)</li>
  <li>Computes a <strong>MaxSim</strong> score: for each query token, find its best-matching document token</li>
</ol>
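<p>The MaxSim operation itself is compact. A NumPy sketch with random vectors standing in for BERT token embeddings:</p>

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT late interaction: each query token takes its best document match."""
    sims = query_tokens @ doc_tokens.T  # (num_q_tokens, num_d_tokens) dot products
    return sims.max(axis=1).sum()       # MaxSim per query token, then sum

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 128))       # 4 query token embeddings
d_relevant = np.vstack([q + 0.1 * rng.standard_normal((4, 128)),
                        rng.standard_normal((2, 128))])  # contains near-matches
d_random = rng.standard_normal((6, 128))
print(maxsim_score(q, d_relevant) > maxsim_score(q, d_random))  # True
```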

<pre class="mermaid">
graph TB
    subgraph QE["Query Encoding"]
        QT["What causes diabetes?"] --&gt; QB["BERT"] --&gt; QV["q1, q2, q3, q4"]
    end
    subgraph DE["Document Encoding pre-computed"]
        DT["Diabetes results from..."] --&gt; DB["BERT"] --&gt; DV["d1, d2, d3, d4, d5, d6"]
    end
    subgraph MS["MaxSim Scoring"]
        direction LR
        M1["q1 best match among d1..d6"]
        M2["q2 best match among d1..d6"]
        M3["q3 best match among d1..d6"]
        M4["q4 best match among d1..d6"]
    end
    QV --&gt; MS
    DV --&gt; MS
    MS --&gt; SUM["Score = Sum of MaxSim"]
    style MS fill:#ffd43b,stroke:#333
    style SUM fill:#51cf66,stroke:#333,color:#fff
</pre>

<p>This achieves ~95% of cross-encoder quality whilst being <strong>100x faster</strong> at retrieval because document token embeddings can be pre-computed and indexed.</p>

<p><strong>The trade-off:</strong> Storage. Instead of storing one 768-dim vector per document, you store N×128-dim vectors (N = number of tokens, dimensions compressed from 768 to 128). A 100M document index might require 100-200 GB.</p>

<hr />

<h2 id="15-sparse-dense-hybrid-splade-and-the-best-of-both-worlds">15. Sparse-Dense Hybrid: SPLADE and the Best of Both Worlds</h2>

<p>Pure dense retrieval (Sentence-BERT) misses <strong>exact keyword matching</strong>. The query “iPhone 15 Pro Max specifications” should match documents containing those exact terms, even if the dense embedding focuses on the general “phone specs” semantics.</p>

<p><strong>SPLADE</strong> (Sparse Lexical and Expansion Model) learns <strong>sparse representations</strong> using BERT:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Conceptually:
# Instead of BERT → mean pool → 768d dense vector
# SPLADE does: BERT → MLM head → |V|-dimensional sparse vector
# where non-zero entries represent "expanded" terms
</span>
<span class="c1"># A query about "ML deployment" might expand to:
# {"ML": 2.1, "machine": 1.8, "learning": 1.5,
#  "deployment": 2.3, "production": 1.2, "inference": 0.9,
#  "serving": 0.7, ...}
# Note: "production", "inference", "serving" weren't in the query
# but SPLADE learned they're relevant!
</span></code></pre></div></div>

<p>Modern production systems (Vespa, Weaviate, Qdrant) support <strong>hybrid search</strong> that combines dense and sparse scores:</p>

\[\text{score} = \alpha \cdot \text{dense\_score} + (1-\alpha) \cdot \text{sparse\_score}\]

<p>with α tuned per use case. This consistently outperforms either approach alone.</p>
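<p>The fusion itself is a one-liner; the real work is normalising the two score scales and tuning α per use case. A minimal sketch:</p>

```python
def hybrid_score(dense_score, sparse_score, alpha=0.7):
    """Linear fusion of dense (semantic) and sparse (lexical) relevance scores.
    Both scores should first be normalised to a shared range (e.g. min-max)."""
    return alpha * dense_score + (1 - alpha) * sparse_score

# Doc with a strong semantic match but weak keyword overlap
print(hybrid_score(0.9, 0.2))
# Keyword-heavy query: shift the weight towards the sparse score
print(hybrid_score(0.3, 0.95, alpha=0.3))
```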

<hr />

<h2 id="16-matryoshka-embeddings-adaptive-dimensionality">16. Matryoshka Embeddings: Adaptive Dimensionality</h2>

<h3 id="the-core-idea">The Core Idea</h3>

<p>Standard models produce fixed-size embeddings (768d, 1024d). Matryoshka Representation Learning (Kusupati et al., 2022) trains the model so that <strong>the first d dimensions form a valid embedding for any d</strong>.</p>

<p>This is achieved by adding a multi-scale loss during training:</p>

\[L = \sum_{d \in \{32, 64, 128, 256, 512, 1024\}} L_d(\text{truncate}(e, d))\]

<p>The model simultaneously optimises for all truncation sizes. The result: the first 256 dimensions capture ~95% of the full-size performance, and even 64 dimensions retain ~85%.</p>

<h3 id="production-impact">Production Impact</h3>

<table>
  <thead>
    <tr>
      <th>Dimensions</th>
      <th>Performance (Relative)</th>
      <th>Storage (per embedding)</th>
      <th>ANN Search Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1024</td>
      <td>100%</td>
      <td>4 KB</td>
      <td>1x</td>
    </tr>
    <tr>
      <td>256</td>
      <td>~95%</td>
      <td>1 KB</td>
      <td>~4x faster</td>
    </tr>
    <tr>
      <td>64</td>
      <td>~85%</td>
      <td>256 B</td>
      <td>~16x faster</td>
    </tr>
  </tbody>
</table>

<p><strong>Practical pattern:</strong> Use 64d for fast initial candidate retrieval (top-1000), then re-score with full 1024d for the final ranking. You get maximum precision with minimum latency.</p>
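<p>The coarse-to-fine pattern can be sketched in a few lines of NumPy. Note that the toy vectors below are random rather than Matryoshka-trained, so the truncated pass behaves like a crude projection; the two-stage mechanics are the same:</p>

```python
import numpy as np

def truncate_and_norm(emb, d):
    """Matryoshka truncation: keep the first d dims, re-normalise for cosine."""
    t = emb[..., :d]
    return t / np.linalg.norm(t, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 1024))
query = corpus[42] + 0.1 * rng.standard_normal(1024)  # near document 42

# Stage 1: cheap 64-d scan over the whole corpus for candidates
scores64 = truncate_and_norm(corpus, 64) @ truncate_and_norm(query, 64)
candidates = np.argsort(scores64)[-1000:]             # top-1000 short-list

# Stage 2: full 1024-d rescoring of the short-list only
scores_full = truncate_and_norm(corpus[candidates], 1024) @ truncate_and_norm(query, 1024)
best = candidates[np.argmax(scores_full)]
print(best)  # 42
```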

<p>OpenAI’s <code class="language-plaintext highlighter-rouge">text-embedding-3-small</code> and <code class="language-plaintext highlighter-rouge">text-embedding-3-large</code> both support this. The <code class="language-plaintext highlighter-rouge">dimensions</code> parameter lets you truncate at inference time — the model is already trained with the Matryoshka objective.</p>

<hr />

<h2 id="17-instruction-tuned-embeddings-e5-and-bge">17. Instruction-Tuned Embeddings: E5 and BGE</h2>

<p>A critical 2023-2024 development: <strong>instruction-tuned embedding models</strong> that accept a task description alongside the input text.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">"intfloat/e5-large-v2"</span><span class="p">)</span>

<span class="c1"># The instruction prefix tells the model HOW to embed
</span><span class="n">query</span> <span class="o">=</span> <span class="s">"query: What causes Type 2 diabetes?"</span>
<span class="n">passage</span> <span class="o">=</span> <span class="s">"passage: Type 2 diabetes results from insulin resistance..."</span>

<span class="c1"># vs for classification:
</span><span class="n">text</span> <span class="o">=</span> <span class="s">"classification: This patient shows signs of hyperglycaemia"</span>
</code></pre></div></div>

<p><strong>Why this matters:</strong> The same sentence should be embedded differently depending on the task. For retrieval, you want to capture the “query intent.” For classification, you want to capture the “topic.” For clustering, you want broad semantic features. Instruction tuning lets one model handle all tasks.</p>

<p>Models like <strong>E5</strong> (Wang et al., 2023), <strong>BGE</strong> (Xiao et al., 2023), and <strong>NV-Embed-v2</strong> (NVIDIA, 2024) use this approach and dominate the MTEB leaderboard.</p>

<hr />

<h2 id="18-production-deployment-what-tutorials-never-tell-you">18. Production Deployment: What Tutorials Never Tell You</h2>

<h3 id="quantisation-shrinking-embeddings-for-scale">Quantisation: Shrinking Embeddings for Scale</h3>

<p>Float32 embeddings (768d = 3KB per embedding) are expensive at scale. <strong>Quantisation</strong> reduces this:</p>

<table>
  <thead>
    <tr>
      <th>Format</th>
      <th>Bytes per 768d</th>
      <th>Quality Retention</th>
      <th>Speed-up</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Float32</td>
      <td>3,072</td>
      <td>100% (baseline)</td>
      <td>1x</td>
    </tr>
    <tr>
      <td>Float16</td>
      <td>1,536</td>
      <td>~99.9%</td>
      <td>~2x</td>
    </tr>
    <tr>
      <td>Int8</td>
      <td>768</td>
      <td>~99%</td>
      <td>~4x</td>
    </tr>
    <tr>
      <td><strong>Binary</strong></td>
      <td><strong>96</strong></td>
      <td><strong>~92-95%</strong></td>
      <td><strong>~32x</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>Binary quantisation</strong> is particularly interesting: convert each dimension to 0/1, then use Hamming distance instead of cosine similarity. FAISS, Qdrant, and Weaviate all support this.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">binary_quantize</span><span class="p">(</span><span class="n">embedding</span><span class="p">):</span>
    <span class="s">"""Convert float embedding to binary."""</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">embedding</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">hamming_similarity</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="s">"""Fast binary similarity using bitwise XOR."""</span>
    <span class="k">return</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">count_nonzero</span><span class="p">(</span><span class="n">a</span> <span class="o">!=</span> <span class="n">b</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>

<span class="c1"># 32x less storage, 10-30x faster search
</span><span class="n">binary_emb</span> <span class="o">=</span> <span class="n">binary_quantize</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"query"</span><span class="p">))</span>
</code></pre></div></div>

<h3 id="embedding-drift-and-index-maintenance">Embedding Drift and Index Maintenance</h3>

<p>Models get updated. Your fine-tuned model improves. New data distributions emerge. <strong>All of these invalidate your existing index.</strong></p>

<p>Production checklist:</p>

<ol>
  <li><strong>Version your embedding model</strong>: Every index must track which model version generated it</li>
  <li><strong>Blue-green index deployment</strong>: Build new index with new model whilst old one serves traffic, then swap</li>
  <li><strong>Monitor retrieval quality</strong>: Track Recall@K, MRR on a golden evaluation set weekly</li>
  <li><strong>Detect distribution drift</strong>: Compare embedding statistics (mean, variance, average pairwise similarity) between batches</li>
</ol>
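<p>Point 4 of the checklist can be automated with a few summary statistics. A NumPy sketch of a simple drift check (the 0.05 tolerance is an illustrative threshold, not a standard):</p>

```python
import numpy as np

def embedding_stats(embs):
    """Summary statistics for drift monitoring between embedding batches."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, 0.0)
    n = len(embs)
    return {
        "mean_norm": float(np.linalg.norm(embs.mean(axis=0))),
        "variance": float(embs.var()),
        "avg_pairwise_cos": float(sims.sum() / (n * (n - 1))),
    }

def drifted(ref, new, tol=0.05):
    """Flag drift when average pairwise similarity moves beyond tolerance."""
    return abs(ref["avg_pairwise_cos"] - new["avg_pairwise_cos"]) > tol

rng = np.random.default_rng(0)
ref = embedding_stats(rng.standard_normal((200, 64)))
new = embedding_stats(rng.standard_normal((200, 64)) + 2.0)  # systematic offset
print(drifted(ref, new))  # True: a common offset inflates pairwise similarity
```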

<h3 id="latency-budget-breakdown">Latency Budget Breakdown</h3>

<p>For a typical RAG system targeting &lt;200ms end-to-end:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Embedding query:           10-30ms  (GPU) / 50-100ms (CPU)
ANN search (FAISS/Qdrant): 1-5ms   (for 10M vectors)
Reranking (top-50):        30-80ms  (cross-encoder on GPU)
LLM generation:            100-500ms
─────────────────────────────
Total:                     141-615ms
</code></pre></div></div>

<p><strong>Key optimisations:</strong></p>

<ul>
  <li><strong>Cache frequent query embeddings</strong> (LRU cache with TTL)</li>
  <li><strong>Pre-compute and index document embeddings</strong> (batch job, not real-time)</li>
  <li><strong>Use ONNX Runtime / TensorRT</strong> for embedding model inference (~3x speed-up over PyTorch)</li>
  <li><strong>Matryoshka truncation</strong> for first-pass retrieval, full dimensions for reranking</li>
</ul>
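<p>The first optimisation is a small data structure. A sketch of an LRU cache with TTL using only the standard library (the sizes and TTL are illustrative):</p>

```python
import time
from collections import OrderedDict

class TTLCache:
    """Small LRU cache with per-entry expiry, for frequent query embeddings."""
    def __init__(self, max_size=10_000, ttl_seconds=3600):
        self.max_size, self.ttl = max_size, ttl_seconds
        self._store = OrderedDict()   # key -> (timestamp, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        ts, value = item
        if time.monotonic() - ts > self.ttl:      # expired entry
            del self._store[key]
            return None
        self._store.move_to_end(key)              # mark as recently used
        return value

    def put(self, key, value):
        self._store[key] = (time.monotonic(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)       # evict least recently used

cache = TTLCache(max_size=2, ttl_seconds=60)
cache.put("what is diabetes?", [0.1, 0.2])
cache.put("rag latency", [0.3, 0.4])
cache.get("what is diabetes?")        # touch: now most recently used
cache.put("third query", [0.5, 0.6])  # evicts "rag latency"
print(cache.get("rag latency"))       # None
```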

<hr />

<h2 id="19-the-evaluation-problem-mteb-and-beyond">19. The Evaluation Problem: MTEB and Beyond</h2>

<h3 id="mteb-massive-text-embedding-benchmark">MTEB (Massive Text Embedding Benchmark)</h3>

<p>MTEB evaluates models across 8 task categories and 56+ datasets. But there are important caveats:</p>

<p><strong>Leaderboard position ≠ best model for you.</strong> A model scoring highest on average might underperform on your specific task. Always evaluate on your own data.</p>

<p><strong>MTEB overweights English.</strong> The recently launched <strong>MMTEB</strong> (Multilingual MTEB) addresses this with 250+ datasets across 200+ languages.</p>

<p><strong>Key metrics by task:</strong></p>

<ul>
  <li><strong>Retrieval</strong>: NDCG@10, Recall@100</li>
  <li><strong>STS</strong>: Spearman correlation</li>
  <li><strong>Classification</strong>: Accuracy, F1</li>
  <li><strong>Clustering</strong>: V-measure</li>
</ul>

<h3 id="how-to-evaluate-your-own-embeddings">How to Evaluate Your Own Embeddings</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span><span class="p">,</span> <span class="n">evaluation</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s">"your-fine-tuned-model"</span><span class="p">)</span>

<span class="c1"># Retrieval evaluation
</span><span class="n">evaluator</span> <span class="o">=</span> <span class="n">evaluation</span><span class="p">.</span><span class="n">InformationRetrievalEvaluator</span><span class="p">(</span>
    <span class="n">queries</span><span class="o">=</span><span class="p">{</span><span class="s">"q1"</span><span class="p">:</span> <span class="s">"What is diabetes?"</span><span class="p">,</span> <span class="p">...},</span>
    <span class="n">corpus</span><span class="o">=</span><span class="p">{</span><span class="s">"d1"</span><span class="p">:</span> <span class="s">"Diabetes is a chronic condition..."</span><span class="p">,</span> <span class="p">...},</span>
    <span class="n">relevant_docs</span><span class="o">=</span><span class="p">{</span><span class="s">"q1"</span><span class="p">:</span> <span class="p">[</span><span class="s">"d1"</span><span class="p">,</span> <span class="s">"d5"</span><span class="p">],</span> <span class="p">...},</span>  <span class="c1"># Ground truth
</span>    <span class="n">name</span><span class="o">=</span><span class="s">"my-domain-eval"</span><span class="p">,</span>
    <span class="n">mrr_at_k</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">],</span>
    <span class="n">ndcg_at_k</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">],</span>
    <span class="n">recall_at_k</span><span class="o">=</span><span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">]</span>
<span class="p">)</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">evaluator</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"MRR@10: </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="s">'my-domain-eval_mrr@10'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"NDCG@10: </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="s">'my-domain-eval_ndcg@10'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Recall@100: </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="s">'my-domain-eval_recall@100'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<hr />

<h2 id="20-where-this-story-goes-next">20. Where This Story Goes Next</h2>

<p>The embedding landscape is evolving rapidly. Key directions:</p>

<p><strong>Multimodal Embeddings (CLIP, SigLIP, ImageBind):</strong> Shared embedding spaces for text + images + audio + video. CLIP’s contrastive training aligned 400M image-text pairs into a single space. This enables “search images with text” and vice versa.</p>

<p><strong>Multilingual at Scale:</strong> LaBSE (Language-agnostic BERT Sentence Embedding) and mE5 create embeddings that are comparable across 100+ languages — you can search English documents with Hindi queries.</p>

<p><strong>LLM-based Embeddings:</strong> Using decoder-only LLMs (Mistral, LLaMA) as embedding backbones instead of encoder-only BERT. Models like GritLM simultaneously perform generation and embedding with one model.</p>

<p><strong>Mixture-of-Experts Embeddings:</strong> Routing different types of text to specialised embedding sub-networks, combining specialist quality with generalist coverage.</p>

<hr />

<h2 id="the-arc-of-this-story">The Arc of This Story</h2>

<p>From counting words to understanding meaning. From sparse, high-dimensional vectors to dense, geometric spaces. From static representations to contextual, task-aware embeddings.</p>

<p>Each generation didn’t just improve on the previous one — it revealed something new about how language and meaning can be computationally represented:</p>

<ul>
  <li><strong>LSA</strong> showed that meaning hides in co-occurrence statistics</li>
  <li><strong>Word2Vec</strong> showed that prediction is a better training signal than counting</li>
  <li><strong>ELMo</strong> showed that language has hierarchical structure (syntax → semantics)</li>
  <li><strong>BERT</strong> showed that bidirectional context + transfer learning changes everything</li>
  <li><strong>SBERT</strong> showed that practical efficiency matters as much as theoretical quality</li>
  <li><strong>Matryoshka</strong> showed that information is not uniformly distributed across dimensions</li>
</ul>

<p>The story of embeddings is the story of building better mirrors for meaning — and we’re still learning what those mirrors can reflect.</p>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>Deerwester, S. et al. (1990). <em>Indexing by Latent Semantic Analysis.</em> JASIS.</li>
  <li>Bengio, Y. et al. (2003). <em>A Neural Probabilistic Language Model.</em> JMLR.</li>
  <li>Mikolov, T. et al. (2013). <em>Efficient Estimation of Word Representations in Vector Space.</em> <a href="https://arxiv.org/abs/1301.3781">arXiv:1301.3781</a></li>
  <li>Mikolov, T. et al. (2013). <em>Distributed Representations of Words and Phrases and their Compositionality.</em> <a href="https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality">NeurIPS</a></li>
  <li>Pennington, J. et al. (2014). <em>GloVe: Global Vectors for Word Representation.</em> <a href="https://nlp.stanford.edu/pubs/glove.pdf">EMNLP</a></li>
  <li>Levy, O. &amp; Goldberg, Y. (2014). <em>Neural Word Embedding as Implicit Matrix Factorization.</em> <a href="https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization">NeurIPS</a></li>
  <li>Bojanowski, P. et al. (2017). <em>Enriching Word Vectors with Subword Information.</em> <a href="https://arxiv.org/abs/1607.04606">TACL</a></li>
  <li>Vaswani, A. et al. (2017). <em>Attention Is All You Need.</em> <a href="https://arxiv.org/abs/1706.03762">NeurIPS</a></li>
  <li>Peters, M.E. et al. (2018). <em>Deep Contextualized Word Representations.</em> <a href="https://arxiv.org/abs/1802.05365">NAACL</a></li>
  <li>Devlin, J. et al. (2019). <em>BERT: Pre-training of Deep Bidirectional Transformers.</em> <a href="https://arxiv.org/abs/1810.04805">NAACL</a></li>
  <li>Liu, Y. et al. (2019). <em>RoBERTa: A Robustly Optimized BERT Pretraining Approach.</em> <a href="https://arxiv.org/abs/1907.11692">arXiv:1907.11692</a></li>
  <li>Reimers, N. &amp; Gurevych, I. (2019). <em>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.</em> <a href="https://arxiv.org/abs/1908.10084">EMNLP</a></li>
  <li>Khattab, O. &amp; Zaharia, M. (2020). <em>ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction.</em> <a href="https://arxiv.org/abs/2004.12832">SIGIR</a></li>
  <li>Su, J. et al. (2021). <em>Whitening Sentence Representations for Better Semantics and Faster Retrieval.</em> <a href="https://arxiv.org/abs/2103.15316">arXiv:2103.15316</a></li>
  <li>Kusupati, A. et al. (2022). <em>Matryoshka Representation Learning.</em> <a href="https://arxiv.org/abs/2205.13147">NeurIPS</a></li>
  <li>Wang, L. et al. (2023). <em>Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5).</em> <a href="https://arxiv.org/abs/2212.03533">ACL</a></li>
  <li>Muennighoff, N. et al. (2023). <em>MTEB: Massive Text Embedding Benchmark.</em> <a href="https://arxiv.org/abs/2210.07316">EACL</a></li>
  <li>Lee, C. et al. (2024). <em>NV-Embed: Improved Techniques for Training LLM-based Embedding Models.</em> <a href="https://arxiv.org/abs/2405.17428">arXiv:2405.17428</a></li>
</ol>

<hr />

<p><em>Written by Girijesh Prasad</em>
<em>20 February 2026</em></p>]]></content><author><name>Girijesh Prasad</name></author><category term="AI" /><category term="LLM" /><category term="Embedding" /><category term="Embedding" /><category term="Word2Vec" /><category term="Sentence Transformers" /><summary type="html"><![CDATA[The mathematical intuitions, architectural decisions, and production lessons behind 70 years of teaching machines to understand language — from Bag of Words to Sentence Transformers.]]></summary></entry><entry><title type="html">Context Engineering: The New Frontier in Agentic AI</title><link href="https://girijesh-ai.github.io/ai/llm/agentic%20ai/2026/02/06/context-engineering.html" rel="alternate" type="text/html" title="Context Engineering: The New Frontier in Agentic AI" /><published>2026-02-06T03:30:00+00:00</published><updated>2026-02-06T03:30:00+00:00</updated><id>https://girijesh-ai.github.io/ai/llm/agentic%20ai/2026/02/06/context-engineering</id><content type="html" xml:base="https://girijesh-ai.github.io/ai/llm/agentic%20ai/2026/02/06/context-engineering.html"><![CDATA[<h1 id="context-engineering-the-new-frontier-in-agentic-ai">Context Engineering: The New Frontier in Agentic AI</h1>

<table>
  <tbody>
    <tr>
      <td><strong>Reading Time:</strong> 13 minutes</td>
      <td><strong>Level:</strong> Intermediate-Advanced</td>
    </tr>
  </tbody>
</table>

<hr />

<p>Picture this: You’ve built an AI customer support agent. You’ve fed it your entire documentation—all 5,000 pages of it. Your product catalog, FAQs, troubleshooting guides, everything. The model is top-notch—GPT-5, the latest Claude Opus or Sonnet, you name it. Yet when a customer asks a straightforward question about your refund policy, the agent fumbles. It gives outdated information. It misses the crucial detail buried on page 2,847.</p>

<p>The problem? It’s not the model. It’s the <strong>context</strong>.</p>

<p>Welcome to 2025-2026, where we’re witnessing a fundamental shift in how we build AI systems. The era of obsessing over the perfect prompt is fading. We’re entering the age of <strong>context engineering</strong>—and it’s changing everything.</p>

<h2 id="the-great-shift-from-prompts-to-context">The Great Shift: From Prompts to Context</h2>

<p>For years, we’ve been playing the prompt engineering game. Craft the perfect instruction. Add the right examples. Use the magic phrase “Let’s think step by step.” And honestly, it worked—for simple demos and prototypes.</p>

<p>But something changed in 2025. As AI agents moved from exciting demos to production systems handling millions of real-world interactions, we hit a wall. Not a model capability wall—a <em>context</em> wall.</p>

<p>Here’s the reality check: <strong>Most AI agent failures today aren’t because the model is dumb. They’re because the model doesn’t have the right information at the right time.</strong></p>

<p>Think about it like this: Your LLM’s context window is like RAM in a computer. You can have the world’s most powerful processor (the model), but if your RAM is poorly managed—filled with irrelevant data, missing crucial bits, or organized chaotically—your system will struggle. Context engineering is the discipline of managing that RAM brilliantly.</p>

<p>And the industry agrees. Anthropic, Google, OpenAI—everyone’s talking about it. In November 2024, Anthropic even released the Model Context Protocol (MCP), calling it “USB-C for AI.” In December 2025, they donated it to the Linux Foundation. That’s how big this is.</p>

<h2 id="so-what-exactly-is-context-engineering">So What Exactly IS Context Engineering?</h2>

<p>Let’s get clear on this. <strong>Context engineering</strong> is the systematic design and management of all the information you provide to an AI system. It goes way beyond just writing a good prompt.</p>

<p>When you do prompt engineering, you’re crafting a single instruction: “Summarize this document in 3 bullet points.” That’s it. One request, one response.</p>

<p>When you do context engineering, you’re architecting an entire information environment:</p>

<ul>
  <li>System instructions (Who is this AI? What rules should it follow?)</li>
  <li>Conversation history (What have we discussed already?)</li>
  <li>Retrieved knowledge (What documents, data, or facts are relevant right now?)</li>
  <li>Tool schemas (What actions can the AI take?)</li>
  <li>Dynamic state (What’s the current task? User preferences? Environment variables?)</li>
</ul>

<p>It’s the difference between handing someone a question and building them an entire workspace with all the resources they need to excel.</p>

<h3 id="why-the-evolution">Why the Evolution?</h3>

<p>The shift happened because of three converging forces:</p>

<p><strong>1. Rising Expectations</strong>
Users don’t want chatbots that forget their last message. They want AI that remembers their preferences, learns from feedback, and provides personalized experiences. That requires sophisticated context management.</p>

<p><strong>2. Enterprise Adoption</strong>
Companies deploying AI at scale need reliability, accuracy, and consistency across millions of interactions. You can’t achieve that with ad-hoc prompting. You need systematic context engineering.</p>

<p><strong>3. Advanced Models</strong>
Modern LLMs can handle 128K, 200K, even 2 million tokens of context. But here’s the kicker: <strong>research shows they only effectively use 10-20% of very long contexts</strong>. Having a giant context window doesn’t mean much if you don’t engineer what goes into it.</p>

<h2 id="the-anatomy-of-context-what-actually-goes-in">The Anatomy of Context: What Actually Goes In?</h2>

<p>Let’s dissect what makes up “context” in a modern AI system. Imagine you’re building that customer support agent we mentioned earlier. Here’s what the agent needs to “see” in its context window for each interaction:</p>

<h3 id="1-system-instructions">1. System Instructions</h3>

<p>The foundation layer. This tells the AI who it is and how to behave:</p>

<ul>
  <li>“You are a helpful customer support agent for TechCorp”</li>
  <li>“Always be polite, concise, and verify information before providing it”</li>
  <li>“Format responses using bullet points for clarity”</li>
</ul>

<h3 id="2-conversation-history">2. Conversation History</h3>

<p>What’s been said so far in this specific conversation:</p>

<ul>
  <li>User: “Hi, I need help with my recent order”</li>
  <li>Agent: “I’d be happy to help! Could you provide your order number?”</li>
  <li>User: “It’s #TC-90210”</li>
</ul>

<h3 id="3-retrieved-knowledge">3. Retrieved Knowledge</h3>

<p>Information pulled from external sources based on the current query:</p>

<ul>
  <li>Customer’s order details from the database</li>
  <li>Relevant sections from the refund policy</li>
  <li>Similar past support tickets for reference</li>
</ul>

<h3 id="4-tool-schemas-and-outputs">4. Tool Schemas and Outputs</h3>

<p>What actions the agent can take and what it’s already done:</p>

<ul>
  <li>Available tools: <code class="language-plaintext highlighter-rouge">check_order_status()</code>, <code class="language-plaintext highlighter-rouge">initiate_refund()</code>, <code class="language-plaintext highlighter-rouge">send_email()</code></li>
  <li>Previous tool results: Order status returned → “Shipped on Jan 30”</li>
</ul>

<h3 id="5-dynamic-state">5. Dynamic State</h3>

<p>Real-time information:</p>

<ul>
  <li>Customer tier: Premium (gets expedited support)</li>
  <li>Current agent workload: High (keep responses concise)</li>
  <li>User’s timezone: EST (respond during business hours)</li>
</ul>

<p>Now here’s the challenge: Let’s say your refund policy is 1,000 pages, customer history has 500 past interactions, product docs are 5,000 pages, and you’re having a 50-message conversation. That’s potentially 10 million tokens. Your context window? Maybe 128,000 tokens.</p>

<p><strong>You need to fit a library into a backpack. That’s context engineering.</strong></p>
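<p>That packing problem can be sketched in a few lines. This is a toy illustration, not a production allocator: whitespace word counts stand in for a real tokenizer, and the priorities, section texts, and 50-token budget are all invented for the example.</p>

```python
# Toy sketch: pack context sections into a token budget, highest priority first.
# Word count approximates tokens; a real system would use the model's tokenizer.
def build_context(sections, budget):
    """sections: list of (priority, name, text); lower priority = more important."""
    packed, used = [], 0
    for _, name, text in sorted(sections):
        cost = len(text.split())
        if used + cost <= budget:
            packed.append((name, text))
            used += cost
    return packed, used

sections = [
    (0, "system", "You are a support agent for TechCorp"),
    (1, "query", "What is your refund policy for order TC-90210 ?"),
    (2, "retrieved", "Refunds are issued within 14 days of delivery " * 3),
    (3, "history", "older chatter " * 500),  # far too big to fit the budget
]
packed, used = build_context(sections, budget=50)
```

The low-priority history simply doesn’t make the cut; in practice you would summarize it rather than drop it outright.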

<h2 id="memory-systems-the-backbone-of-great-context">Memory Systems: The Backbone of Great Context</h2>

<p>If context is like RAM, memory is like your hard drive and cache combined. Modern AI agents need both short-term and long-term memory to function effectively.</p>

<h3 id="short-term-memory-the-conversation-buffer">Short-Term Memory: The Conversation Buffer</h3>

<p>This is your working memory for the current session. When someone’s chatting with your agent, it needs to remember what was said 5 minutes ago.</p>

<p><strong>How it works:</strong></p>

<ul>
  <li><strong>Buffer Memory:</strong> Store everything verbatim. Great for short conversations, but expensive for long ones.</li>
  <li><strong>Window Memory:</strong> Keep only the last K interactions. Perfect for maintaining recent context without bloat.</li>
  <li><strong>Summary Memory:</strong> Use the LLM itself to summarize older parts of the conversation. Keeps the gist while reducing tokens.</li>
</ul>

<p><strong>In Practice (LangChain):</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.memory</span> <span class="kn">import</span> <span class="n">ConversationBufferWindowMemory</span>

<span class="c1"># Keep only the last 5 exchanges
</span><span class="n">memory</span> <span class="o">=</span> <span class="n">ConversationBufferWindowMemory</span><span class="p">(</span><span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>

<p>Think of it like your WhatsApp chat. You don’t need to reread the entire 3-year conversation history to reply to the latest message. Just the recent context suffices.</p>

<h3 id="long-term-memory-persistent-knowledge">Long-Term Memory: Persistent Knowledge</h3>

<p>This is where things get powerful. Long-term memory persists <em>across</em> sessions. The agent remembers facts, preferences, and decisions from weeks or months ago.</p>

<p><strong>The Secret Sauce: Vector Databases</strong></p>

<p>Instead of storing text directly, you convert information into numerical vectors (embeddings) and store them in specialized databases like Pinecone, Milvus, or Weaviate. When you need to recall something, you search semantically—by <em>meaning</em>, not just keywords.</p>

<p><strong>Example:</strong></p>

<ul>
  <li>User says: “I prefer minimalist designs”</li>
  <li>Stored as vector in long-term memory</li>
  <li>Two weeks later, user asks for design recommendations</li>
  <li>Agent recalls: “Based on your preference for minimalist designs…”</li>
</ul>

<p>It’s the difference between Ctrl+F (keyword search) and having a conversation with someone who truly understands what you mean.</p>
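<p>Here is a deliberately tiny illustration of that recall-by-meaning flow. A bag-of-words cosine similarity stands in for a real embedding model, and the stored memories are made up; production systems would embed with a trained model and query a vector database instead.</p>

```python
import math
from collections import Counter

# Toy stand-in for an embedding: word counts instead of a learned vector.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

memory = [
    "user prefers minimalist designs",
    "user booked a flight to Mumbai in January",
]
vectors = [(m, embed(m)) for m in memory]

def recall(query, k=1):
    q = embed(query)
    ranked = sorted(vectors, key=lambda mv: cosine(q, mv[1]), reverse=True)
    return [m for m, _ in ranked][:k]

recall("minimalist design ideas for this user")
```

The design-preference memory surfaces even though the query never repeats it word for word; with real embeddings the match would survive much looser paraphrases.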

<h3 id="episodic-vs-semantic-memory">Episodic vs. Semantic Memory</h3>

<p>Borrowing from cognitive science, AI agents benefit from two types of memory:</p>

<p><strong>Episodic Memory</strong> = Specific events with context
“I booked a flight to Mumbai for User X on January 15th because they were attending a conference.”</p>

<p><strong>Semantic Memory</strong> = General factual knowledge
“Mumbai is the financial capital of India.”</p>

<p>Together, they provide depth (episodic details) and breadth (general knowledge). Episodic memory is typically stored in time-indexed logs or graphs. Semantic memory lives in knowledge bases and vector embeddings.</p>

<h3 id="rag-the-bridge-between-memory-and-context">RAG: The Bridge Between Memory and Context</h3>

<p>Retrieval-Augmented Generation (RAG) is where long-term memory meets real-time context.</p>

<p><strong>Traditional Approach:</strong> Cram all knowledge into the model’s training.
<strong>Problem:</strong> Knowledge gets outdated, hallucinations increase, can’t scale.</p>

<p><strong>RAG Approach:</strong></p>

<ol>
  <li>Store vast amounts of information externally (in vector DBs, knowledge bases)</li>
  <li>When a query comes in, retrieve only the most relevant pieces</li>
  <li>Inject that focused information into the context window</li>
  <li>Generate response based on fresh, targeted data</li>
</ol>
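<p>The four steps above can be sketched end to end. This is a minimal sketch with invented knowledge snippets: word-overlap scoring stands in for vector search, and generation is left as the assembled prompt you would hand to an LLM.</p>

```python
import re

# Step 1: knowledge stored externally (here, a plain dict instead of a vector DB).
KNOWLEDGE = {
    "refund_policy": "Refunds are available within 30 days of purchase.",
    "shipping": "Standard shipping takes 5-7 business days.",
    "warranty": "All devices carry a one-year limited warranty.",
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Step 2: retrieve only the most relevant pieces (toy word-overlap score).
def retrieve(query, k=1):
    q = tokens(query)
    return sorted(KNOWLEDGE.values(), key=lambda t: len(q & tokens(t)), reverse=True)[:k]

# Step 3: inject the focused information into the context window.
def build_prompt(query):
    context = "\n".join(retrieve(query, k=1))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Step 4: send build_prompt(...) to the model for a grounded answer.
prompt = build_prompt("when are refunds available")
```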

<p><strong>What’s New in 2024-2025:</strong></p>

<ul>
  <li><strong>Agentic RAG:</strong> Multiple retrieval steps throughout a task, not just one at the start</li>
  <li><strong>Memory-Augmented RAG:</strong> The system learns from past retrievals, adapting what to fetch</li>
  <li><strong>Editable Memory Graphs:</strong> Special structures that optimize memory selection using reinforcement learning</li>
</ul>

<p>RAG lets you have your cake and eat it too: Massive knowledge bases + Focused, efficient context.</p>

<h2 id="the-lost-in-the-middle-problem-and-how-to-fix-it">The “Lost in the Middle” Problem (And How to Fix It)</h2>

<p>Here’s a dirty secret about large context windows: <strong>LLMs have terrible memory for information in the middle.</strong></p>

<p>Research revealed a “U-shaped” performance curve. Models pay strong attention to information at the <em>beginning</em> and <em>end</em> of context, but the middle? It’s like the middle child—often overlooked.</p>

<p>Even Claude with its 200K token window or GPT-4 with 128K suffers from this. Your crucial piece of information buried on page 47 of a 100-page context? Good luck.</p>

<h3 id="solutions-that-actually-work">Solutions That Actually Work</h3>

<p><strong>1. Strategic Reranking</strong>
Don’t just dump documents into context in random order. Use reranking models to place the most critical information at the start or end.</p>
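<p>One simple way to act on the U-shaped curve, assuming your retriever already produced relevance scores (the documents and scores below are hypothetical): alternate ranked documents between the front and the back of the context, so the weakest material ends up in the middle where attention is poorest.</p>

```python
# Order scored documents so the best ones sit at the edges of the context,
# where "lost in the middle" research shows models attend most reliably.
def edge_order(docs_with_scores):
    ranked = sorted(docs_with_scores, key=lambda d: d[1], reverse=True)
    front, back = [], []
    for i, (doc, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]  # strongest first and last, weakest mid-context

docs = [("A", 0.9), ("B", 0.7), ("C", 0.5), ("D", 0.3), ("E", 0.1)]
edge_order(docs)  # the 0.1-scored doc lands in the middle position
```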

<p><strong>2. In-Context Retrieval (ICR)</strong>
A clever two-step approach:</p>

<ul>
  <li><strong>Step 1:</strong> Ask the LLM to identify which passage numbers are relevant to the query</li>
  <li><strong>Step 2:</strong> Extract just those passages and use them for the final answer</li>
  <li><strong>Result:</strong> Reduced context length, laser-focused attention</li>
</ul>
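<p>A sketch of those two steps, with hypothetical passages. The first LLM call ("which passage numbers are relevant?") is mocked with a keyword check here; in a real system <code>pick_relevant</code> would be an actual model request returning passage numbers.</p>

```python
passages = {
    1: "Our refund window is 30 days.",
    2: "We ship worldwide except to PO boxes.",
    3: "Refunds are credited to the original payment method.",
}

# Step 1 (mocked): stand-in for asking the LLM which passages answer the query.
def pick_relevant(query, passages):
    return [n for n, p in passages.items() if "refund" in p.lower()]

# Step 2: extract just those passages for the final, focused answer prompt.
def focused_context(query, passages):
    chosen = pick_relevant(query, passages)
    return "\n".join(passages[n] for n in chosen)

focused_context("how do refunds work", passages)
```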

<p><strong>3. Chunking and Compressing</strong>
Break massive documents into smaller pieces. Process each piece separately. Summarize or compress aggressively. You’d be surprised—smart filtering can reduce tokens by 70-90% without losing critical information.</p>

<p><strong>4. Prompt Compression</strong>
Tools like Microsoft’s LLMLingua automatically remove redundant words while preserving meaning. “The customer is extremely dissatisfied with the delayed delivery” becomes “Customer dissatisfied, delayed delivery.” Same info, fewer tokens.</p>

<p><strong>5. Architectural Innovation</strong>
Newer techniques like Rotary Position Embeddings (RoPE), sparse attention patterns (Longformer, BigBird), and state-space models (Mamba) are making models better at handling long contexts. But even with these, strategic engineering matters.</p>

<p><strong>Key Takeaway:</strong> A bigger context window is like a bigger suitcase. Sure, you can fit more stuff. But if you don’t pack smartly, you’re still going to struggle to find your toothbrush.</p>

<h2 id="multi-agent-systems-distributed-context-intelligence">Multi-Agent Systems: Distributed Context Intelligence</h2>

<p>Here’s where context engineering gets really interesting. Instead of one mega-agent trying to juggle everything, what if you had a <em>team</em> of specialized agents, each with its own focused context?</p>

<h3 id="why-go-multi-agent">Why Go Multi-Agent?</h3>

<p><strong>1. Prevent Context Overflow</strong>
One agent researching + analyzing + writing + editing = context chaos.
Separate agents for research, analysis, and writing = Each has a clean, focused context.</p>

<p><strong>2. Specialization</strong>
A research agent doesn’t need to know how to format markdown. A writing agent doesn’t need access to database schemas. Give each agent only what it needs.</p>

<p><strong>3. Parallel Processing</strong>
Multiple agents can work simultaneously on different aspects of a task.</p>

<h3 id="context-sharing-the-shared-state-pattern">Context Sharing: The Shared State Pattern</h3>

<p>In LangGraph (a framework for multi-agent systems), agents communicate through a <strong>shared state</strong>—think of it as a collaborative whiteboard.</p>

<p><strong>How it works:</strong></p>

<ol>
  <li>Research Agent finds relevant information → Writes to shared state</li>
  <li>Analysis Agent reads findings → Adds insights to shared state</li>
  <li>Writing Agent reads everything → Produces final output</li>
</ol>

<p>Each agent has its own specialized context (tools, prompts), but they all contribute to and read from a central state. It’s like a relay race where the baton (state) carries all completed work.</p>
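<p>Stripped of framework details, the shared-state pattern is just agents reading from and writing to one state object. The sketch below uses a plain dict and stub agents; LangGraph formalizes the same idea with a typed state schema and graph edges, so treat this as an illustration of the flow, not its API.</p>

```python
# Each "agent" reads what it needs from the shared state and adds its output.
def research_agent(state):
    state["findings"] = ["MCP donated to Linux Foundation"]  # stub finding
    return state

def analysis_agent(state):
    state["insights"] = [f"Implication: {f}" for f in state["findings"]]
    return state

def writing_agent(state):
    state["draft"] = " ".join(state["insights"])
    return state

# The "baton" carries all completed work from agent to agent.
state = {"task": "summarize MCP news"}
for agent in (research_agent, analysis_agent, writing_agent):
    state = agent(state)
```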

<h3 id="context-handoff-the-supervisor-pattern">Context Handoff: The Supervisor Pattern</h3>

<p>Another common architecture: A Supervisor agent orchestrates multiple worker agents.</p>

<p><strong>Flow:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User Query
    ↓
Supervisor (decides which agent to call)
    ↓
Worker Agent A (processes, updates context)
    ↓
Supervisor (synthesizes, decides next step)
    ↓
Worker Agent B (continues with clean context)
    ↓
Supervisor (final response)
</code></pre></div></div>

<p>Each worker hands off a cleanly packaged context to the next. No clutter, no confusion.</p>

<h3 id="the-model-context-protocol-mcp-standardizing-the-handoff">The Model Context Protocol (MCP): Standardizing the Handoff</h3>

<p>In November 2024, Anthropic introduced MCP—a game-changer for context engineering.</p>

<p><strong>The Problem:</strong> Every AI framework had its own way of managing context. Integrating data sources required custom connectors for each combination. It was messy.</p>

<p><strong>The Solution:</strong> MCP standardizes how AI systems connect to data sources and share context. Think of it as USB-C for AI—one protocol, universal compatibility.</p>

<p><strong>Three Core Primitives:</strong></p>

<ul>
  <li><strong>Tools:</strong> Functions the AI can execute (e.g., <code class="language-plaintext highlighter-rouge">query_database()</code>)</li>
  <li><strong>Resources:</strong> Data sources for context (e.g., documents, APIs)</li>
  <li><strong>Prompts:</strong> Reusable templates for interaction patterns</li>
</ul>

<p>By December 2025, Anthropic donated MCP to the Linux Foundation, signaling a commitment to industry-wide adoption. It’s early days, but MCP could become the standard for context exchange between agents.</p>

<h2 id="prompt-engineering-in-the-context-era">Prompt Engineering in the Context Era</h2>

<p>So does prompt engineering still matter? Absolutely—but it’s evolved.</p>

<h3 id="context-injection-dynamic-knowledge">Context Injection: Dynamic Knowledge</h3>

<p>Modern prompts aren’t static. They’re templates with placeholders that get filled dynamically:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>System: You are an expert {role}
Context: {retrieved_documents}
User History: {past_interactions}
Current Query: {user_question}
Output Format: {desired_format}
</code></pre></div></div>

<p>When a query comes in, the system:</p>

<ol>
  <li>Retrieves relevant documents based on the query</li>
  <li>Fetches user history from long-term memory</li>
  <li>Injects everything into the template</li>
  <li>Sends to the LLM</li>
</ol>

<p>This is <strong>context-aware prompting</strong>—prompts that adapt based on what’s relevant right now.</p>
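<p>Rendering such a template is straightforward once retrieval and memory are in place. In this sketch the retrieved document and user history are canned strings; in a real pipeline they would come from your RAG layer and long-term memory store.</p>

```python
TEMPLATE = """System: You are an expert {role}
Context: {retrieved_documents}
User History: {past_interactions}
Current Query: {user_question}
Output Format: {desired_format}"""

def render_prompt(user_question):
    return TEMPLATE.format(
        role="support agent",
        retrieved_documents="Refund window: 30 days.",          # stub: from RAG
        past_interactions="User asked about order TC-90210.",   # stub: from memory
        user_question=user_question,
        desired_format="bullet points",
    )

prompt = render_prompt("Can I still get a refund?")
```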

<h3 id="advanced-techniques-2024-edition">Advanced Techniques (2024 Edition)</h3>

<p><strong>Chain-of-Thought with Memory</strong>
Break complex tasks into steps, each step accessing relevant parts of memory. Cumulative reasoning gets better with context.</p>

<p><strong>Few-Shot with Context</strong>
Don’t just provide examples—provide examples with their contexts. The LLM learns not just the pattern, but also how to use context effectively.</p>

<p><strong>Meta-Prompting</strong>
Instead of relying on examples, structure the <em>format</em> and <em>logic</em> of the response. Guide the LLM on how to think through problems using available context.</p>

<p><strong>Self-Consistency</strong>
Generate multiple reasoning paths using the same context, then pick the most consistent answer. Works great when context is rich and reliable.</p>

<p>The shift: From “write better prompts” to “architect better context that makes any reasonable prompt work well.”</p>

<h2 id="cost-optimization-the-90-savings-opportunity">Cost Optimization: The 90% Savings Opportunity</h2>

<p>Let’s talk money. If you’re running AI agents at scale, context engineering isn’t just about performance—it’s about survival.</p>

<h3 id="the-problem">The Problem</h3>

<p>LLMs charge by the token. More context = More tokens = Higher costs. A customer support agent handling 5,000 conversations daily, each with a 10,000-token context, is processing 50 million tokens a day. At $0.01 per 1K tokens (rough average), that’s $500/day, or $15,000/month.</p>

<h3 id="the-solution-context-caching">The Solution: Context Caching</h3>

<p><strong>How it works:</strong> Identify the static parts of your context (system instructions, company policies, product docs) and cache them on the server side. You only pay the full price once. After that, you pay a tiny fraction (often 10% or less) for cache hits.</p>

<p><strong>Example (Claude’s Prompt Caching):</strong></p>

<ul>
  <li>First request: 10,000 tokens (system + docs) = $0.10</li>
  <li>Next 99 requests: Only the new user query (100 tokens) + cache hit discount = $0.001 each</li>
  <li><strong>Savings: 90% on input costs</strong></li>
</ul>

<p><strong>Impact on Latency:</strong>
Cached contexts don’t need to be “read” again by the model. This can reduce latency by up to 80%. Faster responses <em>and</em> lower costs.</p>
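<p>You can reproduce the article’s back-of-envelope math with a tiny cost model. The $0.01-per-1K price and the 10% cache-hit rate are the rough figures used above; real provider pricing differs and usually adds a one-time cache-write surcharge, so treat this as an estimator, not a billing calculator.</p>

```python
# Rough monthly input-cost model: static (cacheable) vs dynamic tokens.
def monthly_cost(convos_per_day, static_tokens, dynamic_tokens,
                 price_per_1k=0.01, cache_discount=0.10, cached=True):
    static_rate = price_per_1k * (cache_discount if cached else 1.0)
    per_convo = (static_tokens / 1000) * static_rate \
              + (dynamic_tokens / 1000) * price_per_1k
    return per_convo * convos_per_day * 30

# 5,000 conversations/day; 9,900 static tokens (system + docs) + 100 new tokens.
before = monthly_cost(5000, 9900, 100, cached=False)  # ≈ $15,000/month
after = monthly_cost(5000, 9900, 100, cached=True)    # ≈ $1,635/month
```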

<h3 id="agentic-plan-caching">Agentic Plan Caching</h3>

<p>A newer technique: Cache entire agent plans, not just prompts. For “Plan-Act” agents that coordinate multiple steps, caching the plan at the task level (instead of query level) has shown <strong>47% cost reductions</strong> in research.</p>

<h3 id="other-cost-strategies">Other Cost Strategies</h3>

<p><strong>1. Right-Size Your Models</strong>
Don’t use GPT-4 for every task. Use smaller, cheaper models (GPT-3.5, Claude Haiku) for simple routing or summarization. Reserve expensive models for complex reasoning.</p>

<p><strong>2. Compress Before Processing</strong>
Summarize long documents before feeding to the agent. Hierarchical summarization can turn a 50,000-token document into a 500-token summary.</p>

<p><strong>3. Trim Conversation History</strong>
Don’t let conversations grow unbounded. Keep the last N messages, or summarize older parts.</p>

<p><strong>4. Smart Filtering</strong>
Extract only the relevant sections from documents. If a user asks about refunds, pull the refund section—not the entire 1,000-page policy.</p>

<p><strong>Real ROI Example:</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Before Context Engineering:
- 5,000 conversations/day
- 10,000 tokens/conversation
- 50M tokens/day × $0.01/1K = $500/day = $15,000/month

After (caching + compression + filtering):
- Same 5,000 conversations
- Cached static context (90% discount)
- Compressed dynamic context (70% reduction)
- 5M tokens/day × $0.01/1K = $50/day = $1,500/month

Savings: $13,500/month (90%)
</code></pre></div></div>

<p>That’s enough savings to fund a full-time engineer to optimize context, and they’d pay for themselves in the first week.</p>

<h2 id="practical-tools-your-context-engineering-toolkit">Practical Tools: Your Context Engineering Toolkit</h2>

<p>Enough theory. Let’s talk frameworks.</p>

<h3 id="langchain-the-orchestrator">LangChain: The Orchestrator</h3>

<p><strong>Best for:</strong> Conversational agents, RAG applications, chains of reasoning</p>

<p><strong>Key Features:</strong></p>

<ul>
  <li>
    <p><strong>Memory Modules:</strong></p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">ConversationBufferMemory</code>: Full verbatim history</li>
      <li><code class="language-plaintext highlighter-rouge">ConversationSummaryMemory</code>: LLM-generated summaries</li>
      <li><code class="language-plaintext highlighter-rouge">ConversationKnowledgeGraphMemory</code>: Extract entities and relationships</li>
      <li><code class="language-plaintext highlighter-rouge">VectorStoreRetrieverMemory</code>: Semantic search from vector DBs</li>
    </ul>
  </li>
  <li>
    <p><strong>LCEL (LangChain Expression Language):</strong> Compose complex chains where context flows smoothly from step to step</p>
  </li>
</ul>

<p><strong>When to use:</strong> You’re building chatbots, Q&amp;A systems, or anything that needs conversational memory.</p>

<h3 id="langgraph-the-multi-agent-maestro">LangGraph: The Multi-Agent Maestro</h3>

<p><strong>Best for:</strong> Complex workflows, multi-agent systems, stateful applications</p>

<p><strong>Key Features:</strong></p>

<ul>
  <li><strong>Shared State Management:</strong> Central memory accessible to all agents</li>
  <li><strong>Checkpointers:</strong> Persist state to PostgreSQL, Redis, SQLite—resume from failures</li>
  <li><strong>Supervisor Patterns:</strong> Built-in support for orchestrating specialized agents</li>
  <li><strong>Durable Execution:</strong> Long-running tasks that survive crashes</li>
</ul>

<p><strong>When to use:</strong> Your task requires multiple steps, multiple agents, or needs to survive interruptions.</p>

<h3 id="llamaindex-the-context-specialist">LlamaIndex: The Context Specialist</h3>

<p><strong>Best for:</strong> Document-centric apps, knowledge base integration, advanced indexing</p>

<p><strong>Key Features:</strong></p>

<ul>
  <li><strong>Context Engine:</strong> <code class="language-plaintext highlighter-rouge">ContextChatEngine</code> retrieves relevant text and injects it as system context</li>
  <li><strong>Memory Class:</strong> Combines short-term (FIFO queue) and long-term memory (static, fact extraction, vector blocks)</li>
  <li><strong>Agent Workflows:</strong> Define step-by-step sequences to prevent context overload</li>
  <li><strong>Efficient Indexing:</strong> Chunking, incremental processing, compressed embeddings for memory optimization</li>
</ul>

<p><strong>When to use:</strong> You’re working with large document collections and need sophisticated retrieval.</p>

<h3 id="quick-decision-matrix">Quick Decision Matrix</h3>

<table>
  <thead>
    <tr>
      <th>Framework</th>
      <th>Strength</th>
      <th>Use When…</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>LangChain</strong></td>
      <td>Orchestration, memory modules</td>
      <td>Building conversational flows</td>
    </tr>
    <tr>
      <td><strong>LangGraph</strong></td>
      <td>Multi-agent, state management</td>
      <td>Complex workflows, multiple specialized agents</td>
    </tr>
    <tr>
      <td><strong>LlamaIndex</strong></td>
      <td>Document indexing, retrieval</td>
      <td>Knowledge-intensive applications</td>
    </tr>
  </tbody>
</table>

<p><strong>Pro Tip:</strong> These tools aren’t mutually exclusive. A common pattern: Use LlamaIndex for indexing and retrieval, then feed the results into LangChain or LangGraph for orchestration.</p>

<h2 id="best-practices-dos-and-donts">Best Practices: Do’s and Don’ts</h2>

<h3 id="dos">Do’s</h3>

<p><strong>1. Prioritize Relevance Over Quantity</strong>
More context isn’t always better. Aim for “just the right information.” Keep context usage at 80-85% of the max limit—leave some headroom.</p>

<p><strong>2. Structure Your Context Clearly</strong>
Use clear delimiters and sections:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== SYSTEM INSTRUCTIONS ===
...
=== CONVERSATION HISTORY ===
...
=== RETRIEVED KNOWLEDGE ===
...
=== CURRENT QUERY ===
...
</code></pre></div></div>

<p><strong>3. Implement Hierarchical Memory</strong></p>

<ul>
  <li>Core memory: Critical facts, always present</li>
  <li>Extended memory: Retrieved on-demand</li>
  <li>Archived memory: Long-term storage, rarely accessed</li>
</ul>
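<p>A minimal sketch of those three tiers, with hypothetical contents. Only core memory is always injected; extended memory is fetched per topic, and archived memory stays out of the context window entirely unless explicitly promoted.</p>

```python
class HierarchicalMemory:
    def __init__(self):
        self.core = []       # critical facts, always in context
        self.extended = {}   # topic-keyed, retrieved on demand
        self.archive = []    # long-term storage, not injected

    def context_for(self, topic):
        parts = list(self.core)
        parts.extend(self.extended.get(topic, []))
        return parts

mem = HierarchicalMemory()
mem.core.append("Customer tier: Premium")
mem.extended["refunds"] = ["Refund window: 30 days"]
mem.archive.append("2023 ticket history")

mem.context_for("refunds")  # core + refund facts; archive stays out
```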

<p><strong>4. Monitor Context Usage</strong>
Set up dashboards to track token consumption, context bloat, and performance degradation. Catch issues before they become expensive.</p>

<p><strong>5. Test at the Limits</strong>
Deliberately test with maximum context lengths. Check for “lost in the middle” issues. Validate before going to production.</p>

<h3 id="donts">Don’ts</h3>

<p><strong>1. Don’t Stuff the Context</strong>
Context rot is real. Overloading leads to degraded performance. Quality beats quantity.</p>

<p><strong>2. Don’t Ignore Position</strong>
Critical information should be at the start or end of context. Never bury important details in the middle.</p>

<p><strong>3. Don’t Forget to Prune</strong>
Old conversations accumulate. Without pruning, you’ll hit limits and performance will tank. Implement automatic cleanup.</p>

<p><strong>4. Don’t Skip Caching</strong>
Static, repetitive content (system prompts, documentation) should always be cached. It’s free money.</p>

<p><strong>5. Don’t Mix Agent Contexts</strong>
In multi-agent systems, keep contexts isolated. Prevent cross-contamination. Use explicit handoff protocols.</p>

<h2 id="real-world-impact-context-engineering-in-action">Real-World Impact: Context Engineering in Action</h2>

<h3 id="case-study-1-anthropics-multi-agent-research-system">Case Study 1: Anthropic’s Multi-Agent Research System</h3>

<p><strong>Challenge:</strong> Build an AI system that can conduct research spanning days, with tasks requiring 100+ steps.</p>

<p><strong>Context Problems:</strong></p>

<ul>
  <li>Context windows fill up quickly</li>
  <li>Need continuity across multiple work phases</li>
  <li>Can’t lose track of earlier findings</li>
</ul>

<p><strong>Solution:</strong></p>

<ul>
  <li>Summarize each completed research phase</li>
  <li>Store essential information in external memory</li>
  <li>Spawn fresh subagents with clean contexts for new phases</li>
  <li>Retrieve phase summaries when needed</li>
</ul>

<p><strong>Result:</strong> Successfully handle multi-day research tasks with coherent outputs despite tight context constraints.</p>

<h3 id="case-study-2-enterprise-customer-support">Case Study 2: Enterprise Customer Support</h3>

<p><strong>Scenario:</strong> Global tech company, 10,000 daily support interactions</p>

<p><strong>Before Context Engineering:</strong></p>

<ul>
  <li>Inconsistent responses (agents couldn’t recall past decisions)</li>
  <li>High latency (re-processing same documents repeatedly)</li>
  <li>$50,000/month in LLM costs</li>
</ul>

<p><strong>After:</strong></p>

<ul>
  <li>Prompt caching for company policies and guidelines</li>
  <li>Vector database for customer interaction history</li>
  <li>Multi-agent system: Triage → Specialist → Resolution</li>
  <li>Clear context handoff protocols</li>
</ul>

<p><strong>Results:</strong></p>

<ul>
  <li>85% cost reduction: Down to $7,500/month</li>
  <li>80% latency improvement: Faster responses</li>
  <li>40% accuracy boost: Better resolution rates</li>
</ul>

<p><strong>ROI:</strong> Paid for the engineering effort in 2 weeks.</p>

<h3 id="case-study-3-code-assistant-copilot-style">Case Study 3: Code Assistant (Copilot-Style)</h3>

<p><strong>Context Challenges:</strong></p>

<ul>
  <li>Entire codebase as potential context</li>
  <li>Users frequently access the same files</li>
  <li>Need to track user patterns and preferences</li>
</ul>

<p><strong>Engineering Approach:</strong></p>

<ul>
  <li>Explicit caching for frequently accessed files (90% cost savings)</li>
  <li>Semantic code search using embeddings</li>
  <li>Incremental context: Only include changed files, not entire codebase</li>
  <li>User-specific memory: Track preferred patterns and libraries</li>
</ul>

<p><strong>Impact:</strong></p>

<ul>
  <li>Near-instant code suggestions (cached contexts load fast)</li>
  <li>Codebase-aware completions (knows the architecture)</li>
  <li>90% reduction in token costs (aggressive caching and filtering)</li>
</ul>

<h2 id="the-future-whats-coming-next">The Future: What’s Coming Next</h2>

<h3 id="trends-for-2025-2026">Trends for 2025-2026</h3>

<p><strong>1. Memory-First Architectures</strong>
Future agents will prioritize their internal memory and only reach for external retrieval when necessary. Smarter, more autonomous systems.</p>

<p><strong>2. Adaptive Context Management</strong>
AI systems that automatically select and prioritize context based on task complexity. Self-optimizing context windows.</p>

<p><strong>3. MCP Ecosystem Growth</strong>
As more tools adopt the Model Context Protocol, plug-and-play context integration becomes the norm. Standardization wins.</p>

<p><strong>4. Hybrid Memory Strategies</strong>
Combining long-term memory systems with ultra-large context windows. Best of both worlds—deep history + immediate access.</p>

<p><strong>5. Cost-Aware Context Engineering</strong>
Built-in optimization where the system automatically makes caching decisions based on cost budgets. Financial constraints drive architectural choices.</p>

<h3 id="emerging-challenges">Emerging Challenges</h3>

<p><strong>1. Context Security</strong>
As contexts grow richer, they become targets:</p>

<ul>
  <li>Context poisoning attacks (injecting malicious info)</li>
  <li>Sensitive data leakage</li>
  <li>Need for context-level encryption and isolation</li>
</ul>

<p><strong>2. Context Governance</strong>
Compliance requirements hit context:</p>

<ul>
  <li>GDPR data retention in memory systems</li>
  <li>Audit trails for context changes</li>
  <li>Explainability: “Why did the agent see this piece of information?”</li>
</ul>

<p><strong>3. Conflicting Contexts</strong>
What happens when retrieved documents contradict each other? Source attribution and truth grounding become critical.</p>

<h3 id="skills-you-need-to-master">Skills You Need to Master</h3>

<p><strong>For AI Engineers:</strong></p>

<ul>
  <li>MCP protocol implementation</li>
  <li>Vector database optimization and tuning</li>
  <li>Multi-agent orchestration and state management</li>
  <li>Cost modeling for context-heavy workloads</li>
  <li>Context monitoring and observability</li>
</ul>

<p><strong>Mindset Shifts:</strong></p>

<ul>
  <li>From “write better prompts” → “architect better context”</li>
  <li>From single-turn interactions → multi-turn, multi-agent workflows</li>
  <li>From model-centric → information-centric systems</li>
</ul>

<p>The engineers who master context engineering will build the AI systems that actually work in production—at scale, reliably, and cost-effectively.</p>

<h2 id="wrapping-up-context-is-king">Wrapping Up: Context is King</h2>

<p>We started with a simple observation: AI systems fail not because models are inadequate, but because they lack the right context.</p>

<p><strong>Here’s what we’ve learned:</strong></p>

<ol>
  <li><strong>Context engineering is the new frontier</strong>—it’s evolved beyond prompt engineering to full information architecture</li>
  <li><strong>Memory systems are fundamental</strong>—short-term + long-term, episodic + semantic</li>
  <li><strong>Bigger context windows ≠ better performance</strong>—the “lost in the middle” problem is real</li>
  <li><strong>Multi-agent architectures distribute context intelligently</strong>—specialization wins</li>
  <li><strong>Cost optimization is huge</strong>—90% savings with caching and compression</li>
  <li><strong>Tools are maturing fast</strong>—LangChain, LangGraph, LlamaIndex make it accessible</li>
</ol>

<p>But here’s the deeper insight: <strong>The AI revolution isn’t just about better models. It’s about better ways of organizing and delivering information.</strong></p>

<p>The companies winning with AI aren’t necessarily those with the best GPUs or the largest training budgets. They’re the ones who’ve mastered the art and science of context engineering.</p>

<h2 id="your-action-plan">Your Action Plan</h2>

<p><strong>This Week:</strong></p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Audit your current AI system’s context usage (What’s going in? What’s being wasted?)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Implement prompt caching if you haven’t already (Easiest 90% savings you’ll ever get)</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Check for “lost in the middle” problems in your long-context prompts</li>
</ul>

<p><strong>This Month:</strong></p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Set up a vector database for long-term memory</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Experiment with LangGraph for multi-agent workflows</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Establish metrics: Context token usage, cache hit rates, cost per interaction</li>
</ul>

<p><strong>This Quarter:</strong></p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Adopt MCP for standardized data integrations</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Build production-grade memory systems</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Train your team on context engineering principles</li>
</ul>

<hr />

<p>In 2025, we learned a crucial lesson: The best AI systems aren’t those with the cleverest prompts or the most powerful models. They’re the ones that architect context brilliantly.</p>

<p>Master context engineering, and you won’t just build AI that works—you’ll build AI that <em>excels</em>.</p>

<p>Now go forth and engineer some brilliant contexts. Your LLM’s RAM is waiting.</p>

<hr />

<p><strong>Further Reading:</strong></p>

<ul>
  <li><a href="https://modelcontextprotocol.io">Anthropic's Model Context Protocol Documentation</a></li>
  <li><a href="https://langchain.com/docs/memory">LangChain Memory Guide</a></li>
  <li><a href="https://langchain.com/docs/langgraph">LangGraph Multi-Agent Patterns</a></li>
  <li><a href="https://arxiv.org">Lost in the Middle: How Language Models Use Long Contexts (arXiv Paper)</a></li>
</ul>

<hr />

<p><strong>About This Article</strong>
Research conducted: February 2026
Sources: 16 authoritative references (official documentation, academic papers, technical blogs)
All insights based on 2024-2025 developments in AI systems</p>]]></content><author><name>Girijesh Prasad</name></author><category term="AI" /><category term="LLM" /><category term="Agentic AI" /><category term="Agentic AI" /><category term="Context Engineering" /><category term="Agentic AI" /><category term="LLM Memory" /><category term="Multi-Agent Systems" /><category term="Prompt Caching" /><category term="RAG" /><category term="LangChain" /><category term="LangGraph" /><category term="LlamaIndex" /><category term="AI Cost Optimization" /><summary type="html"><![CDATA[Understanding effective context engineering.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://girijesh-ai.github.io/assets/images/context-eng/slide_02_stack_1770261010564.png" /><media:content medium="image" url="https://girijesh-ai.github.io/assets/images/context-eng/slide_02_stack_1770261010564.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How LLM Inference Really Works: A Deep Dive into Optimisation Techniques</title><link href="https://girijesh-ai.github.io/ai/llm/inference/2026/02/06/llms-inferencing-explained.html" rel="alternate" type="text/html" title="How LLM Inference Really Works: A Deep Dive into Optimisation Techniques" /><published>2026-02-06T03:30:00+00:00</published><updated>2026-02-06T03:30:00+00:00</updated><id>https://girijesh-ai.github.io/ai/llm/inference/2026/02/06/llms-inferencing-explained</id><content type="html" xml:base="https://girijesh-ai.github.io/ai/llm/inference/2026/02/06/llms-inferencing-explained.html"><![CDATA[<h1 id="how-llm-inference-really-works-a-deep-dive-into-optimisation-techniques">How LLM Inference Really Works: A Deep Dive into Optimisation Techniques</h1>

<p><em>Making your language models blazing fast without breaking the bank</em></p>

<hr />

<p>You’ve trained a brilliant 70-billion parameter LLM. It’s accurate, it’s powerful, and it understands context beautifully. But here’s the problem—it takes 10+ seconds to spit out a response, and your GPU bills are climbing faster than you can say “transformer architecture.”</p>

<p>I know this pain quite well. Training might cost you millions upfront, but inference? That’s where the costs really compound over time. Every single user query, every API call, every token generated—it all adds up.</p>

<p>But here’s the good news: there are some truly brilliant optimisation techniques that can make your LLM inference 10-20x faster whilst using a fraction of the memory. And no, I’m not talking about buying more expensive hardware. I’m talking about smart engineering.</p>

<p>Let’s understand how LLM inference actually works under the hood, and more importantly, how we can optimise it properly.</p>

<hr />

<h2 id="understanding-llm-inference-the-fundamentals">Understanding LLM Inference: The Fundamentals</h2>

<p>Before we dive into optimisation tricks, let’s get the basics straight. What actually happens when your LLM generates text? And more importantly, where do things slow down?</p>

<h3 id="the-two-phases-of-inference">The Two Phases of Inference</h3>

<p>LLM inference isn’t a single monolithic process—it has two very distinct phases with completely different performance characteristics.</p>

<h4 id="phase-1-prefill-prompt-processing">Phase 1: Prefill (Prompt Processing)</h4>

<p>When you send a prompt like “Summarise this 2000-word document,” the model first needs to process all 2000 input tokens. This is called the <strong>prefill phase</strong>, and here’s what makes it special:</p>

<ul>
  <li><strong>Highly parallel:</strong> All input tokens can be processed simultaneously</li>
  <li><strong>Compute-bound:</strong> Your GPU’s computational units are the bottleneck, not memory</li>
  <li><strong>One-time cost:</strong> Happens once per request, regardless of output length</li>
  <li><strong>Matrix multiplication heavy:</strong> Large batch matrix operations (Q, K, V for all tokens at once)</li>
</ul>

<p>During prefill, modern GPUs shine. An A100 can process thousands of tokens in milliseconds because it can leverage massive parallelism. The KV cache for these input tokens is computed once and stored.</p>

<h4 id="phase-2-decode-token-generation">Phase 2: Decode (Token Generation)</h4>

<p>Now comes the tricky part—generating the response token by token. This is the <strong>decode phase</strong>, and it’s fundamentally different:</p>

<ul>
  <li><strong>Inherently sequential:</strong> Each token depends on all previous tokens</li>
  <li><strong>Memory-bound:</strong> Waiting for memory access, not computation</li>
  <li><strong>Repeats N times:</strong> For N output tokens, you do this N times</li>
  <li><strong>Small compute per step:</strong> Processing just one token, but attending to all previous ones</li>
</ul>

<p>Here’s where the pain starts. If you’re generating a 500-token response, you’re running this decode step 500 times sequentially. No amount of parallelism helps because token 501 literally cannot be computed until you know token 500.</p>

<p><img src="/assets/images/prefill_decode_phases.png" alt="Prefill vs Decode Phases Comparison" /></p>

<p><em>Figure 1: Side-by-side comparison of LLM inference phases. Prefill is fast and parallel with high GPU utilization (~85%), while decode is slow and sequential with very low GPU utilization (&lt;10%).</em></p>

<h3 id="the-autoregressive-dance">The Autoregressive Dance</h3>

<p>LLMs generate text one token at a time in what’s called <strong>autoregressive generation</strong>. Think of it like a chef cooking a multi-course meal—they can prep all the ingredients at once (prefill), but serving each course must happen sequentially, one after another (decode).</p>

<p>Here’s where things get interesting (and a bit frustrating). Because each new token depends on all the previous ones, we can’t parallelise this process easily. When generating token #50, the model needs to look at tokens #1 through #49. It’s inherently sequential.</p>

<p>But why exactly does each token need to “look at” all previous tokens? That brings us to…</p>

<h3 id="the-attention-mechanism-during-inference">The Attention Mechanism During Inference</h3>

<p>At the heart of transformers is the <strong>self-attention mechanism</strong>. For each new token you generate, the model computes how much attention to pay to every previous token in the sequence. Let me break down what actually happens:</p>

<p><strong>Step 1: Computing Q, K, V</strong></p>

<p>For the new token you’re generating, the model computes three vectors:</p>
<ul>
  <li><strong>Query (Q):</strong> What is this token looking for?</li>
  <li><strong>Key (K):</strong> What does this token represent?</li>
  <li><strong>Value (V):</strong> What information does this token contain?</li>
</ul>

<p>For all the previous tokens in your KV cache, you already have their K and V vectors stored.</p>

<p><strong>Step 2: Attention Score Computation</strong></p>

<p>The model computes attention scores by taking the dot product of the new token’s Query with all previous tokens’ Keys:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>attention_scores = Q @ [K₁, K₂, K₃, ..., Kₙ]ᵀ
</code></pre></div></div>

<p>For a sequence of length N, that’s N dot products. Got a 4000-token conversation? That’s <strong>4000 attention score computations</strong> for each new token.</p>

<p><strong>Step 3: Softmax and Weighted Sum</strong></p>

<p>These scores are normalized with softmax, then used to weight the Value vectors:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>attention_weights = softmax(attention_scores / √d)
output = attention_weights @ [V₁, V₂, V₃, ..., Vₙ]
</code></pre></div></div>
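<p>Both steps, for one new token attending over a cached sequence, fit in a few lines of NumPy (the dimensions here are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

d, n = 64, 10                      # head dimension, tokens already cached
k_cache = np.random.randn(n, d)    # stored Keys
v_cache = np.random.randn(n, d)    # stored Values
q = np.random.randn(d)             # Query for the new token

scores = k_cache @ q / np.sqrt(d)        # one dot product per cached token
weights = np.exp(scores - scores.max())  # numerically stable softmax
weights /= weights.sum()
output = weights @ v_cache               # weighted sum of Values, shape (d,)
print(output.shape)  # (64,)
</code></pre></div></div>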

<p><strong>The Complexity Problem:</strong></p>

<p>The complexity is <strong>O(N²)</strong> where N is your sequence length. Here’s why that matters:</p>
<ul>
  <li>1K tokens: 1 million attention computations</li>
  <li>4K tokens: 16 million attention computations</li>
  <li>128K tokens (GPT-4): <strong>16 billion attention computations</strong></li>
</ul>

<p>And remember, this happens for <strong>every layer</strong> in your model. A 70B model might have 80 layers. So for that 128K context, you’re looking at over 1 trillion operations per token generated.</p>

<p>No wonder it’s slow!</p>
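<p>Those counts follow directly from the quadratic scaling, and quick arithmetic confirms them:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def attention_scores(seq_len, layers=1):
    """Pairwise attention score computations per generation step: N^2 per layer."""
    return seq_len ** 2 * layers

for n in (1_000, 4_000, 128_000):
    print(f"{n:7,} tokens: {attention_scores(n):,} computations")

# An 80-layer model at 128K context, per generated token:
print(f"{attention_scores(128_000, layers=80):,}")  # 1,310,720,000,000
</code></pre></div></div>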

<h3 id="the-real-bottleneck-memory-bandwidth">The Real Bottleneck: Memory Bandwidth</h3>

<p>But wait, there’s more. The real bottleneck isn’t even the computation itself—it’s <strong>memory bandwidth</strong>. Let me explain why with some numbers.</p>

<p><strong>GPU Compute vs Memory Bandwidth (A100 GPU):</strong></p>
<ul>
  <li><strong>Peak Compute:</strong> 312 TFLOPS (trillion floating-point operations per second)</li>
  <li><strong>Memory Bandwidth:</strong> 1.5-2 TB/s (terabytes per second)</li>
</ul>

<p>Sounds fast, right? But here’s the catch:</p>

<p>During the decode phase, for each token generated:</p>
<ol>
  <li>Load Q vector from memory (~few KB)</li>
  <li>Load entire KV cache from memory (potentially <strong>gigabytes</strong> for long sequences)</li>
  <li>Compute attention (relatively quick)</li>
  <li>Store results back to memory</li>
</ol>

<p>For a 70B model with 4K context:</p>
<ul>
  <li><strong>Data to transfer:</strong> ~2-4 GB per token (loading KV cache)</li>
  <li><strong>At 2 TB/s bandwidth:</strong> ~1-2 milliseconds just for memory transfers</li>
  <li><strong>Actual computation:</strong> ~0.1-0.2 milliseconds</li>
</ul>

<p>The GPU spends <strong>90% of its time waiting for memory</strong>, not computing! It’s like having a Ferrari stuck in city traffic. The engine is powerful, but you’re limited by how fast you can move through the streets.</p>
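<p>Plugging the numbers above into a back-of-the-envelope model makes the split obvious (the 3 GB cache and 0.15 ms compute figures are mid-range estimates from the text, not benchmarks):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def decode_step(kv_cache_gb, bandwidth_tb_s=2.0, compute_ms=0.15):
    """Per-token decode time: moving the KV cache vs actually computing."""
    transfer_ms = kv_cache_gb / bandwidth_tb_s  # GB divided by TB/s conveniently gives ms
    memory_share = transfer_ms / (transfer_ms + compute_ms)
    return transfer_ms, memory_share

transfer, share = decode_step(kv_cache_gb=3.0)
print(f"{transfer:.1f} ms on memory, {share:.0%} of the step")  # 1.5 ms on memory, 91% of the step
</code></pre></div></div>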

<p><strong>This is what we mean by “memory-bound”:</strong></p>
<ul>
  <li>Your GPU compute units are idle most of the time</li>
  <li>They’re waiting for data to arrive from HBM (High Bandwidth Memory)</li>
  <li>You have theoretical 312 TFLOPS capability but achieve maybe 20-30 TFLOPS in practice</li>
  <li><strong>GPU utilization during decode: often &lt;10%</strong></li>
</ul>

<p><img src="/assets/images/memory_bandwidth_bottleneck.png" alt="GPU Memory Bandwidth Bottleneck" /></p>

<p><em>Figure 2: The memory bandwidth bottleneck visualized. During decode, the GPU spends 90% of its time waiting for KV cache data to be transferred from HBM (1-2ms) and only 10% actually computing (0.1-0.2ms). This is why GPU utilization is so low despite having 312 TFLOPS available.</em></p>

<p>This memory-bound nature is crucial to understand because many optimisation techniques target exactly this problem.</p>

<p><strong>Visualizing Autoregressive Generation:</strong></p>

<pre class="mermaid">
graph TD
    A["User Prompt: 'The cat sat on'"] --&gt; B[Prefill Phase]
    B --&gt; C["Process All Tokens in Parallel&lt;br/&gt;Generate KV cache for prompt"]
    C --&gt; D{Start Decoding}
    D --&gt; E["Token 1: 'the'&lt;br/&gt;(attention over all previous)"]
    E --&gt; F[Append to KV cache]
    F --&gt; G["Token 2: 'mat'&lt;br/&gt;(attention over all previous)"]
    G --&gt; H[Append to KV cache]
    H --&gt; I["Token 3: '.'&lt;br/&gt;(attention over all previous)"]
    I --&gt; J{EOS token?}
    J --&gt;|No| K[Continue generating...]
    J --&gt;|Yes| L["Complete: 'The cat sat on the mat.'"]
    style B fill:#e1f5e1
    style D fill:#fff4e1
    style L fill:#e1f0ff
</pre>

<p><em>Figure 3: LLM inference has two distinct phases—prefill (parallel processing of the prompt) and decode (sequential token generation). Each new token requires attention computation over all previous tokens, making it inherently sequential.</em></p>

<h3 id="why-cant-we-just-add-more-gpus">Why Can’t We Just Add More GPUs?</h3>

<p>You might think: “If we’re memory-bound, can’t we just use more GPUs?”</p>

<p>Well, yes and no. For very large models (70B+), you do need multiple GPUs just to fit the model. But for the decode phase specifically:</p>

<ul>
  <li><strong>Tensor parallelism</strong> helps by splitting each layer across GPUs</li>
  <li>But you still need to <strong>gather results</strong> after each layer (communication overhead)</li>
  <li><strong>Data transfer between GPUs</strong> over PCIe/NVLink adds latency</li>
  <li>The fundamental memory bandwidth problem remains</li>
</ul>

<p>Multi-GPU helps with throughput (more users) but doesn’t eliminate the per-token latency bottleneck.</p>

<h3 id="the-key-insights">The Key Insights</h3>

<p>Right, so let’s recap what we’ve learned about the fundamentals:</p>

<ol>
  <li><strong>Inference has two phases:</strong> Prefill (parallel, fast) and Decode (sequential, slow)</li>
  <li><strong>Attention is O(N²):</strong> Cost grows quadratically with sequence length</li>
  <li><strong>Memory bandwidth is the bottleneck:</strong> Not compute, but waiting for data</li>
  <li><strong>GPU utilization is low:</strong> Often &lt;10% during decode phase</li>
  <li><strong>Sequential nature is fundamental:</strong> Can’t easily parallelize token generation</li>
</ol>

<p>Every optimisation technique we’ll discuss targets one or more of these bottlenecks. KV cache reduces redundant computation. PagedAttention optimizes memory usage. FlashAttention reduces memory transfers. Quantization reduces memory bandwidth requirements. Speculative decoding exploits idle compute capacity.</p>

<p>Understanding these fundamentals is essential because it helps you reason about which optimizations will actually help your specific use case.</p>

<p>Now, let’s see how to fix these problems…</p>

<hr />

<h2 id="kv-cache-the-memory-game-changer">KV Cache: The Memory Game-Changer</h2>

<p>Right, so we’ve established that attention is slow because we’re recomputing the same stuff over and over. Enter the <strong>KV cache</strong>—probably the single most important optimisation for LLM inference.</p>

<h3 id="what-is-kv-cache">What is KV Cache?</h3>

<p>Here’s the idea: during the attention mechanism, for each token, we compute <em>keys</em> (K) and <em>values</em> (V). Once computed for a token, these never change. So why recompute them every single time we generate a new token?</p>

<p>KV cache stores these previously computed key-value pairs in GPU memory. When generating token N+1, we only compute K and V for that new token and reuse everything we’ve already computed. Brilliant, right?</p>

<p>The trade-off is simple: we’re swapping <strong>memory for speed</strong>. Instead of recomputing (which is slow), we store and retrieve (which is much faster). But here’s the rub—this cache grows linearly with your sequence length. A long conversation? That’s a lot of memory.</p>

<h3 id="pagedattention-the-breakthrough">PagedAttention: The Breakthrough</h3>

<p>Now, traditional KV cache implementations were pretty wasteful. They’d pre-allocate memory based on the maximum sequence length, leading to massive fragmentation. Studies showed that <strong>60-80% of allocated memory was just sitting there unused</strong>. Not ideal when GPU memory is expensive.</p>

<p>Then came <strong>PagedAttention</strong> from the Berkeley Sky Computing Lab, and honestly, it’s quite brilliant. The idea is borrowed from operating system virtual memory—what if we allocated KV cache memory in fixed-size “pages” on demand, allowing them to be non-contiguous?</p>

<p>Here’s what this achieves:</p>
<ul>
  <li>Memory waste drops from 60-80% to <strong>under 4%</strong></li>
  <li>You can fit longer sequences in the same GPU</li>
  <li>Batch sizes can be much larger</li>
  <li>Overall throughput increases by up to <strong>24x</strong> compared to naive implementations</li>
</ul>

<p>vLLM, one of the most popular LLM serving frameworks, uses PagedAttention as its core innovation. And trust me, the performance difference is night and day.</p>
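<p>The paging idea itself is simple enough to sketch: a shared pool of fixed-size physical blocks, plus a per-sequence block table mapping logical positions to physical blocks. This toy allocator is my own illustration of the concept, not vLLM’s actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class PagedKVAllocator:
    """Toy PagedAttention-style allocator with non-contiguous fixed-size blocks."""

    def __init__(self, num_blocks=1024, block_size=16):
        self.block_size = block_size            # tokens per physical block
        self.free = list(range(num_blocks))     # shared pool of physical block ids
        self.tables = {}                        # seq_id to its list of block ids
        self.lengths = {}                       # seq_id to tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:            # last block is full, or first token
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator()
for _ in range(40):                             # a 40-token sequence
    alloc.append_token("req-1")
print(len(alloc.tables["req-1"]))  # 3 blocks (16 + 16 + 8 tokens), minimal waste
</code></pre></div></div>

<p>Because blocks are allocated on demand, the only waste is the unused tail of the last block—never a huge pre-reserved region.</p>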

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># How KV cache works conceptually
</span><span class="k">class</span> <span class="nc">KVCache</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">keys</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">values</span> <span class="o">=</span> <span class="p">[]</span>
    
    <span class="k">def</span> <span class="nf">append</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">new_key</span><span class="p">,</span> <span class="n">new_value</span><span class="p">):</span>
        <span class="s">"""Store K,V for newly generated token"""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">keys</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_key</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">values</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">new_value</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">get_all</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Retrieve all cached K,V pairs for attention"""</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">keys</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">values</span>
    
<span class="c1"># Without cache: recompute K,V for all N tokens each time
# With cache: compute K,V once, retrieve N times
# Memory: O(N) | Speed improvement: massive!
</span></code></pre></div></div>

<h3 id="beyond-pagedattention">Beyond PagedAttention</h3>

<p>There are other clever approaches too. <strong>Multi-Query Attention (MQA)</strong> and <strong>Grouped-Query Attention (GQA)</strong> reduce the KV cache size by sharing key-value heads across multiple query heads. Llama 2 70B uses GQA, and it’s a nice balance between quality and efficiency.</p>
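<p>The saving from sharing KV heads is easy to quantify with the standard cache-size formula. The Llama-2-70B-like shapes below (80 layers, 128-dim heads, 64 query heads versus 8 KV heads) are approximate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_param=2):
    """K and V vectors per layer, per KV head, per position (FP16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB ({mha // gqa}x smaller)")
# MHA: 10.0 GiB, GQA: 1.25 GiB (8x smaller)
</code></pre></div></div>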

<p><strong>vAttention</strong>, a more recent approach, proposes managing the KV cache in contiguous virtual memory, which eliminates the need for rewriting attention kernels. Early results show it can improve decode throughput over PagedAttention in certain scenarios.</p>

<p>The research is ongoing, and I’m quite excited to see where this goes.</p>

<hr />

<h2 id="quantization-doing-more-with-less">Quantization: Doing More with Less</h2>

<p>Alright, let’s talk about making your model… smaller. Not in capability, but in memory footprint.</p>

<h3 id="the-precision-trade-off">The Precision Trade-off</h3>

<p>By default, model weights are stored as 32-bit floating-point numbers (FP32). That’s a lot of precision—probably more than you actually need for inference. <strong>Quantization</strong> reduces this precision to save memory and speed up computations.</p>

<p>Let’s do the maths for a 70B parameter model:</p>
<ul>
  <li><strong>FP16</strong> (half precision): 70B × 2 bytes = 140 GB</li>
  <li><strong>INT8</strong> (8-bit integers): 70B × 1 byte = 70 GB</li>
  <li><strong>INT4</strong> (4-bit integers): 70B × 0.5 bytes = 35 GB</li>
</ul>

<p>That’s a <strong>75% memory reduction</strong> with INT4! Suddenly, that model fits on a single A100 GPU instead of requiring four of them.</p>
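<p>The same arithmetic generalises to any model size and bit width:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def weight_memory_gb(params_billions, bits):
    """Weight storage only; the KV cache and activations come on top of this."""
    return params_billions * bits / 8

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
</code></pre></div></div>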

<h3 id="the-quantization-zoo">The Quantization Zoo</h3>

<p>Now, not all quantization methods are created equal. Here’s what you need to know:</p>

<p><strong>INT8 Quantization:</strong>
This is the safe bet. You get 50% memory reduction with minimal accuracy loss. Most modern LLMs handle INT8 beautifully.</p>

<p>But here’s something interesting—due to de-quantization overhead, INT8 inference can sometimes be <em>slower</em> than FP16 on certain hardware. Always benchmark! The memory savings are guaranteed, but speedups aren’t.</p>

<p><strong>INT4 Quantization:</strong>
This is where things get spicy. You’re cutting memory by 75%, but at what cost?</p>

<p>For smaller models (&lt;13B parameters), INT4 can lead to noticeable accuracy degradation. But here’s the fascinating bit—for large models like Llama 3.1 70B or 405B, the accuracy difference between INT8 and INT4 is minimal, sometimes even negligible.</p>

<p>The sweet spot for INT4 is definitely large models (70B+ parameters).</p>

<p><strong>GPTQ (General Post-Training Quantization):</strong>
GPTQ treats quantization as an optimisation problem. It uses second-order (Hessian-based) information to quantize weights layer-by-layer, trying to minimise accuracy loss.</p>

<p>It’s a reliable method, though 2024 studies showed it can exhibit some accuracy degradation across broader datasets, particularly for smaller models. Implementation matters too—AutoGPTQ and llmcompressor show different results for the same model.</p>

<p><strong>AWQ (Activation-aware Weight Quantization):</strong>
This is my favourite, and apparently the research community agrees—it won the <strong>MLSys 2024 Best Paper Award</strong>.</p>

<p>The key insight: not all weights are equally important. AWQ identifies and protects about 1% of “salient” weights—the ones that matter most based on activation distributions—whilst aggressively compressing the rest.</p>

<p>The results are impressive:</p>
<ul>
  <li><strong>Fastest inference</strong> among 4-bit methods (optimised CUDA kernels)</li>
  <li><strong>Best accuracy retention</strong> compared to other quantization techniques</li>
  <li>Works brilliantly for multi-modal LLMs too</li>
</ul>

<p>For 70B models with INT4 AWQ, you get excellent memory efficiency with only a tiny dip in perplexity compared to INT8.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Conceptual symmetric quantization (runnable NumPy sketch)
import numpy as np

def quantize_to_int8(float_weights):
    """Simple symmetric quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.max(np.abs(float_weights)) / 127
    int8_weights = np.clip(np.round(float_weights / scale), -127, 127).astype(np.int8)
    return int8_weights, scale

def dequantize(int8_weights, scale):
    """Convert back to float for computation."""
    return int8_weights.astype(np.float32) * scale

# AWQ additionally protects salient weights:
# those ~1% critical weights stay at higher precision
</code></pre></div></div>

<h3 id="practical-advice">Practical Advice</h3>

<p>Here’s my rule of thumb:</p>
<ul>
  <li><strong>For models &lt;13B:</strong> Use INT8. Safe, reliable, minimal quality loss.</li>
  <li><strong>For models 70B+:</strong> INT4 AWQ is your friend. The accuracy is fine, memory savings are massive.</li>
  <li><strong>Always benchmark</strong> on your specific use case. Perplexity scores don’t always translate to real-world performance.</li>
  <li><strong>Implementation varies.</strong> Try different libraries and measure.</li>
</ul>

<hr />

<h2 id="batching-strategies-keeping-gpus-busy">Batching Strategies: Keeping GPUs Busy</h2>

<p>Your GPU is a parallel processing monster. Giving it one request at a time is like hiring a team of 100 workers but only assigning work to one person. Let’s fix that.</p>

<h3 id="why-batching-matters">Why Batching Matters</h3>

<p><strong>Batching</strong> means processing multiple requests simultaneously. Instead of:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Request 1 → Process → Respond
Request 2 → Process → Respond  
Request 3 → Process → Respond
</code></pre></div></div>

<p>You do:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Requests [1, 2, 3] → Process Together → Respond to All
</code></pre></div></div>

<p>The GPU’s parallel architecture means processing 8 requests together isn’t 8x slower than processing 1—it might only be 1.5-2x slower. Your throughput (requests per second) goes through the roof.</p>

<h3 id="static-batching-the-old-way">Static Batching: The Old Way</h3>

<p>Traditional batching works like this:</p>
<ol>
  <li>Wait for a batch to fill up (say, 8 requests)</li>
  <li>Process them all together</li>
  <li>Wait for the <em>longest</em> sequence to finish</li>
  <li>Only then start the next batch</li>
</ol>

<p>Problem? GPU sits idle once sequences start finishing. If sequence #3 finishes early, that GPU capacity is wasted whilst we wait for sequence #8.</p>

<p>It’s like waiting for the slowest person in a group before anyone can leave. Not optimal.</p>

<h3 id="continuous-batching-the-game-changer">Continuous Batching: The Game-Changer</h3>

<p>Also called <strong>in-flight batching</strong> (NVIDIA’s term), this is where things get clever.</p>

<p>Instead of batch-level scheduling, we do <strong>iteration-level scheduling</strong>:</p>
<ul>
  <li>As soon as a sequence generates its final token, remove it from the batch</li>
  <li>Immediately add a new incoming request in its place</li>
  <li>The GPU stays constantly busy</li>
  <li>No idle time, no waiting</li>
</ul>

<p>The difference is genuinely transformative. You can process <strong>3-5x more requests</strong> with the same hardware. Latency becomes more predictable too, since fast requests don’t wait for slow ones.</p>

<p>vLLM, TensorRT-LLM, and Text Generation Inference all use continuous batching, often enabled by default.</p>

<p><strong>Visualizing the Difference:</strong></p>

<pre class="mermaid">
gantt
    title Static vs Continuous Batching: GPU Utilization Comparison
    dateFormat X
    axisFormat %L ms
    section Static Batch 1
    Req1 (100ms)    :done, 0, 100
    Req2 (150ms)    :done, 0, 150
    Req3 (200ms)    :done, 0, 200
    GPU IDLE        :crit, 100, 200
    section Static Batch 2
    Wait for batch  :crit, 200, 250
    Req4 (100ms)    :done, 250, 350
    Req5 (150ms)    :done, 250, 400
    GPU IDLE        :crit, 350, 400
    section Continuous Batching
    Req1 (100ms)          :done, c1, 0, 100
    Req2 (150ms)          :done, c2, 0, 150
    Req3 (200ms)          :done, c3, 0, 200
    Req4 (added at 100ms) :done, c4, 100, 200
    Req5 (added at 150ms) :done, c5, 150, 250
    Req6 (added at 200ms) :done, c6, 200, 300
    NO IDLE TIME          :active, 0, 300
</pre>

<p><em>Figure 2: Static batching wastes GPU cycles waiting for all sequences to finish (shown in red). Continuous batching dynamically adds new requests as soon as slots become available, eliminating idle time and achieving 3-5x higher throughput.</em></p>
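<p>A toy simulation makes the throughput gap concrete. This is a sketch under simplifying assumptions (request durations are fixed and known in advance, and a freed slot is refilled instantly), not a model of any real scheduler:</p>

```python
import heapq

def static_batching_time(durations, batch_size):
    """Each batch waits for its longest sequence before the next batch starts."""
    total = 0
    for i in range(0, len(durations), batch_size):
        total += max(durations[i:i + batch_size])
    return total

def continuous_batching_time(durations, batch_size):
    """Iteration-level scheduling: a finished slot is refilled immediately."""
    finish_times = durations[:batch_size]
    heapq.heapify(finish_times)
    for d in durations[batch_size:]:
        freed_at = heapq.heappop(finish_times)  # earliest slot to free up
        heapq.heappush(finish_times, freed_at + d)
    return max(finish_times)

requests_ms = [100, 150, 200, 100, 150, 300, 120, 180]
print(static_batching_time(requests_ms, 4))      # 500 ms
print(continuous_batching_time(requests_ms, 4))  # 400 ms
```

Even on this tiny workload, refilling slots as they free up finishes 20% sooner; the gap widens as request lengths vary more.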

<h3 id="advanced-batching-techniques">Advanced Batching Techniques</h3>

<p><strong>Chunked Prefill:</strong>
Long prompts can blow up your memory. Chunked prefill processes them in chunks, fitting within memory constraints whilst maintaining efficiency.</p>
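<p>Conceptually, chunked prefill is just a loop over prompt slices, where each forward pass attends to everything already cached. A minimal sketch (the <code>forward_fn</code> here is a hypothetical stand-in for a model forward pass that extends the KV cache):</p>

```python
def chunked_prefill(prompt_tokens, chunk_size, forward_fn):
    """Prefill a long prompt chunk-by-chunk so peak memory stays bounded."""
    kv_cache = []
    for i in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[i:i + chunk_size]
        # each chunk attends to all previously cached tokens
        kv_cache = forward_fn(chunk, kv_cache)
    return kv_cache

# toy forward pass: "caching" a token is just appending it
toy_forward = lambda chunk, cache: cache + list(chunk)
print(len(chunked_prefill(range(1000), 256, toy_forward)))  # 1000
```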

<p><strong>Ragged Batching:</strong>
Traditional batching pads sequences to the same length, wasting computation. Ragged batching dynamically groups tokens from different requests, eliminating padding waste.</p>

<p><strong>Dynamic Scheduling:</strong>
Monitor memory utilization in real-time and adjust batch sizes accordingly. Add requests when there’s headroom, pause when memory is tight.</p>

<p>The combination of continuous batching and PagedAttention is particularly potent. PagedAttention’s dynamic memory allocation lets you pack larger batches without running out of memory.</p>

<hr />

<h2 id="hardware-acceleration-flashattention-and-friends">Hardware Acceleration: FlashAttention and Friends</h2>

<p>Let’s talk about making the attention mechanism itself faster. Remember how I said inference is memory-bound? Well, some researchers decided to tackle that head-on.</p>

<h3 id="flashattention-the-speed-demon">FlashAttention: The Speed Demon</h3>

<p><strong>FlashAttention</strong> is, quite frankly, one of the most important optimisations for transformer inference. Here’s the problem it solves:</p>

<p>The standard attention mechanism loads data from slow High Bandwidth Memory (HBM) to the GPU’s compute units, does a bit of computation, writes results back to HBM, loads again for the next step… it’s a lot of back-and-forth. HBM is your bottleneck.</p>

<p>FlashAttention’s key innovations:</p>

<p><strong>1. Tiling:</strong> Break the attention computation into smaller blocks that fit into the GPU’s fast on-chip SRAM. Do as much work as possible in SRAM before writing back to HBM.</p>

<p><strong>2. Kernel Fusion:</strong> Instead of separate kernel calls for each operation (Q×K^T, softmax, ×V), fuse them into a single kernel. Reduces memory reads/writes dramatically.</p>

<p><strong>3. Online Softmax:</strong> A clever mathematical reformulation that lets you compute softmax in a streaming, block-wise manner. Avoids materializing the full N×N attention matrix.</p>

<p><strong>4. Recomputation:</strong> During the backward pass, recompute some intermediate values instead of storing them. Trades a bit of computation for massive memory savings.</p>
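<p>The online softmax trick (innovation 3) is worth seeing in code. The running max and running sum are updated one block at a time, so the full row of attention scores never needs to exist at once. A self-contained sketch:</p>

```python
import numpy as np

def online_softmax_stats(score_blocks):
    """One streaming pass over blocks of a score row.

    Returns the same (max, sum-of-exponentials) a full-row stable softmax
    would use, without ever materializing the whole row.
    """
    m, s = -np.inf, 0.0
    for block in score_blocks:
        m_new = max(m, float(block.max()))
        # rescale the running sum to the new max, then add this block's terms
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    return m, s

row = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
m, s = online_softmax_stats([row[:2], row[2:4], row[4:]])
# identical to the stats of a one-shot numerically stable softmax
print(m == row.max(), np.isclose(s, np.exp(row - row.max()).sum()))
```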

<p>The results?</p>
<ul>
  <li><strong>2-8x speedup</strong> for the prefill phase</li>
  <li>Memory complexity: O(N²) → O(N)</li>
  <li>Enabled context windows to grow from 2-4K tokens to 128K+ (GPT-4 Turbo) and even 1M+ (Gemini 1.5)</li>
</ul>

<p>And the brilliant part? It’s <strong>exact</strong>. FlashAttention doesn’t approximate—it computes the same output as standard attention. No accuracy loss.</p>

<h3 id="flashattention-2-and-flashattention-3">FlashAttention-2 and FlashAttention-3</h3>

<p>The team didn’t stop there. <strong>FlashAttention-2</strong> improved parallelism and reduced synchronization overhead. <strong>FlashAttention-3</strong>, released in 2024, takes full advantage of NVIDIA’s H100 architecture:</p>

<ul>
  <li>Asynchronous overlap of computation and memory access</li>
  <li>FP8 (8-bit floating point) optimisation</li>
  <li>Even higher GPU utilisation</li>
</ul>

<p>FlashAttention is now integrated into Hugging Face Transformers by default. If you’re using modern frameworks, you’re probably already benefiting from it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Standard attention (runnable NumPy sketch)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def vanilla_attention(Q, K, V):
    # Compute attention scores: Q × K^T (loads Q, K from HBM)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Apply softmax (scores written back to HBM, then re-read)
    attn = softmax(scores)
    # Compute output: attn × V (loads attn, V from HBM)
    return attn @ V  # many HBM round-trips in total!

# FlashAttention computes the same result tile-by-tile in SRAM:
# far fewer HBM reads/writes = much faster, and still exact
</code></pre></div></div>

<h3 id="speculative-decoding-thinking-ahead">Speculative Decoding: Thinking Ahead</h3>

<p>Here’s another clever trick: <strong>speculative decoding</strong>.</p>

<p>The idea is beautifully simple. Use a small, fast “draft” model to generate multiple candidate tokens. Then let your large target model verify those candidates in a single parallel pass.</p>

<p>How it works:</p>
<ol>
  <li>Small model proposes: “I think the next 5 tokens are [A, B, C, D, E]”</li>
  <li>Large model evaluates all 5 at once: “A is correct, B is correct, C is wrong”</li>
  <li>Accept A and B, reject C, D, E, and continue</li>
  <li>Use rejection sampling to ensure the output distribution matches what the large model would have generated alone</li>
</ol>

<p>Why does this work? Two reasons:</p>
<ul>
  <li>LLMs are <strong>memory-bound</strong>, so they have idle compute capacity</li>
  <li>Many tokens are <strong>highly predictable</strong> (think articles, prepositions, common words)</li>
</ul>

<p>You get a <strong>2-3x speedup</strong> with zero quality degradation. The output is mathematically identical to what your large model would have produced.</p>

<p>Advanced variants like <strong>EAGLE-3</strong> use a lightweight prediction head within the target model itself, removing the need for a separate draft model.</p>

<p><strong>Visualizing Speculative Decoding:</strong></p>

<pre class="mermaid">
graph LR
    A["Input Context&lt;br/&gt;processed"] --&gt; B["Draft Model&lt;br/&gt;small &amp; fast"]
    B --&gt; C{"Proposes Tokens:&lt;br/&gt;A, B, C, D, E"}
    C --&gt; D["Target Model&lt;br/&gt;large &amp; accurate"]
    D --&gt; E{"Parallel Verification"}
    E --&gt;|"A: Correct"| F[Accept A]
    E --&gt;|"B: Correct"| G[Accept B]
    E --&gt;|"C: Wrong"| H["Reject C, D, E"]
    F --&gt; I["Output: A, B"]
    G --&gt; I
    H --&gt; J["Continue from B&lt;br/&gt;draft new candidates"]
    J -.-&gt;|Loop| B
    style B fill:#e1f5e1
    style D fill:#e1f0ff
    style F fill:#d4edda
    style G fill:#d4edda
    style H fill:#f8d7da
</pre>

<p><em>Figure 3: Speculative decoding uses a fast draft model to propose multiple tokens, which the target model verifies in parallel. Accepted tokens are kept; rejected tokens trigger a new draft. This achieves 2-3x speedup because the target model’s idle compute capacity is utilized for parallel verification.</em></p>
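<p>The loop in Figure 3 can be sketched in a few lines. This greedy variant is a simplification (production systems use rejection sampling so the sampled output distribution matches the target model exactly), and the two <code>*_next</code> callables are toy stand-ins for real models:</p>

```python
def speculative_step(draft_next, target_next, context, k=5):
    """One draft-then-verify round of (greedy) speculative decoding."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    ctx = list(context)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. The target model checks every position (in parallel on real hardware).
    ctx = list(context)
    accepted = []
    for tok in proposed:
        expected = target_next(ctx)
        if expected == tok:          # agreement: token accepted "for free"
            accepted.append(tok)
            ctx.append(tok)
        else:                        # first mismatch: keep target's token, stop
            accepted.append(expected)
            break
    return accepted

# toy models: draft counts up; target counts up for 3 tokens, then jumps by 2
draft  = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if len(ctx) < 3 else ctx[-1] + 2
print(speculative_step(draft, target, [0], k=3))  # [1, 2, 4]
```

Note that every accepted token is exactly what the target model would have produced on its own; the draft model only changes how many target-quality tokens you get per expensive pass.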

<hr />

<h2 id="choosing-your-serving-framework">Choosing Your Serving Framework</h2>

<p>Right, you’ve got all these optimisation techniques. Now, which framework should you use to serve your LLM in production? Let’s compare the big three.</p>

<h3 id="vllm-the-balanced-champion">vLLM: The Balanced Champion</h3>

<p><strong>vLLM</strong> has taken the open-source world by storm, and for good reason.</p>

<p><strong>Strengths:</strong></p>
<ul>
  <li>PagedAttention for memory-efficient serving</li>
  <li>Continuous batching out of the box</li>
  <li>Easy integration with Hugging Face ecosystem</li>
  <li>Consistently low Time To First Token (TTFT)</li>
  <li>Rapid feature velocity</li>
</ul>

<p><strong>Performance:</strong>
High throughput, particularly for conversational AI and RAG workloads. Over the second half of 2024, vLLM saw a <strong>10x increase in GPU usage</strong>—it’s being adopted fast.</p>

<p><strong>Best For:</strong></p>
<ul>
  <li>General-purpose LLM serving</li>
  <li>Mixed workloads</li>
  <li>Quick deployment</li>
  <li>Teams comfortable with Python/Hugging Face</li>
</ul>

<p>vLLM is my default recommendation for most use cases. It’s the sweet spot of performance, ease of use, and community support.</p>

<h3 id="tensorrt-llm-the-performance-king">TensorRT-LLM: The Performance King</h3>

<p><strong>TensorRT-LLM</strong> is NVIDIA’s heavyweight optimiser for maximum performance on their GPUs.</p>

<p><strong>Strengths:</strong></p>
<ul>
  <li>Peak performance on H100/H200 GPUs</li>
  <li>Highly tuned CUDA kernels</li>
  <li>CUDA graph fusion</li>
  <li>FP8 quantization</li>
  <li>Speculative decoding support</li>
</ul>

<p><strong>Performance:</strong>
Benchmarks show <strong>30-70% faster</strong> than llama.cpp on desktop GPUs. Up to <strong>2x speedup</strong> over vanilla HuggingFace when moving from FP16 to TensorRT-LLM. Add quantization, and you get even more gains.</p>

<p><strong>Best For:</strong></p>
<ul>
  <li>Enterprises with NVIDIA AI infrastructure</li>
  <li>Latency-critical applications</li>
  <li>Maximum throughput requirements</li>
  <li>Teams with GPU optimisation expertise</li>
</ul>

<p><strong>Trade-off:</strong>
Steeper learning curve. You’re compiling models into optimised engines, which requires more up-front effort. But if squeezing out every percentage point of performance matters, TensorRT-LLM is your answer.</p>

<h3 id="text-generation-inference-tgi-the-ops-friendly-choice">Text Generation Inference (TGI): The Ops-Friendly Choice</h3>

<p><strong>Hugging Face’s TGI</strong> is built for production environments where operational maturity matters.</p>

<p><strong>Strengths:</strong></p>
<ul>
  <li>Robust routing and load balancing</li>
  <li>Clean, well-documented APIs</li>
  <li>Advanced chunking and caching</li>
  <li>Multi-model serving capabilities</li>
  <li>Great observability and monitoring</li>
</ul>

<p><strong>Performance:</strong>
TGI v3 (released in 2024) is particularly impressive for long prompts. With prompts over 200,000 tokens, it shows a <strong>13x speedup over vLLM</strong> and can process about <strong>3x more tokens</strong> in the same GPU memory.</p>

<p><strong>Best For:</strong></p>
<ul>
  <li>Multi-model deployments</li>
  <li>RAG pipelines with long contexts</li>
  <li>Teams prioritising operational stability</li>
  <li>Predictable latency requirements</li>
</ul>

<p>If you’re dealing with document Q&amp;A or retrieval-heavy workloads with massive contexts, TGI v3 is genuinely brilliant.</p>

<h3 id="decision-framework">Decision Framework</h3>

<p>Here’s how I’d choose:</p>

<p><strong>Need maximum performance on NVIDIA GPUs?</strong> → TensorRT-LLM</p>

<p><strong>Handling 200K+ token prompts frequently?</strong> → TGI v3</p>

<p><strong>Everything else?</strong> → vLLM</p>

<p>That said, don’t just take my word for it. Benchmark on your specific workload. Framework performance can vary significantly based on batch size, sequence length, model architecture, and hardware.</p>

<hr />

<h2 id="putting-it-all-together-a-real-world-strategy">Putting It All Together: A Real-World Strategy</h2>

<p>Alright, you’ve got a model to deploy. Here’s how I’d approach optimisation:</p>

<h3 id="step-1-start-simple">Step 1: Start Simple</h3>
<ul>
  <li>Choose vLLM (or TensorRT-LLM if you’re on NVIDIA and have the expertise)</li>
  <li>Deploy with default settings</li>
  <li>Measure baseline: throughput, latency, memory usage</li>
</ul>

<h3 id="step-2-enable-quantization">Step 2: Enable Quantization</h3>
<ul>
  <li>For 70B+ models: Try INT4 AWQ</li>
  <li>Run your evaluation benchmarks</li>
  <li>Verify accuracy on YOUR data (not just public benchmarks)</li>
  <li>If accuracy is fine, deploy it—you’ve just cut memory by 75%</li>
</ul>

<h3 id="step-3-tune-batching">Step 3: Tune Batching</h3>
<ul>
  <li>Continuous batching should be on by default (it usually is)</li>
  <li>Experiment with maximum batch sizes</li>
  <li>Find the sweet spot where you maximise throughput without OOM errors</li>
  <li>Monitor latency distribution, not just averages</li>
</ul>

<h3 id="step-4-advanced-techniques">Step 4: Advanced Techniques</h3>
<ul>
  <li>FlashAttention is likely already enabled in modern frameworks</li>
  <li>For latency-critical apps, try speculative decoding</li>
  <li>Consider prompt caching if you have repeated common prompts</li>
</ul>

<h3 id="what-to-expect">What to Expect</h3>

<p>Realistically, with proper optimisation:</p>
<ul>
  <li><strong>10-20x improvement</strong> in overall efficiency is achievable</li>
  <li><strong>75% memory reduction</strong> with INT4 quantization</li>
  <li><strong>5-10x throughput increase</strong> with continuous batching and larger batches</li>
  <li><strong>2-3x latency reduction</strong> with speculative decoding</li>
</ul>
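<p>The 75% figure is simple arithmetic over weight storage. A back-of-envelope helper (this deliberately ignores the KV cache, activations, and quantization scale metadata, which add overhead on top):</p>

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Weight memory in GB = parameter count × bytes per parameter."""
    return params_billions * bits_per_param / 8  # 1e9 params × (bits/8) bytes

print(weight_memory_gb(70, 16))  # FP16: 140.0 GB
print(weight_memory_gb(70, 4))   # INT4:  35.0 GB, i.e. 75% saved
```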

<p>But remember—your mileage will vary. Model size, sequence length, hardware, and workload patterns all matter.</p>

<h3 id="the-golden-rules">The Golden Rules</h3>

<ol>
  <li><strong>Measure everything.</strong> Before optimisation, after optimisation, and during production.</li>
  <li><strong>Start with low-hanging fruit.</strong> Quantization and batching give you the most bang for your buck.</li>
  <li><strong>Benchmark on your data.</strong> Public benchmarks are useful, but your use case is unique.</li>
  <li><strong>Don’t over-optimise too early.</strong> Get something working first, then optimise.</li>
  <li><strong>Memory is expensive; time is precious.</strong> Find the right balance.</li>
</ol>

<hr />

<h2 id="final-thoughts">Final Thoughts</h2>

<p>LLM inference optimisation isn’t magic—it’s about understanding where the bottlenecks are and systematically addressing them.</p>

<p>We’ve covered quite a lot. KV cache prevents redundant computation. PagedAttention eliminates memory waste. Quantization makes models smaller without sacrificing much quality. Continuous batching keeps GPUs busy. FlashAttention tackles the memory-bound nature of attention. Speculative decoding leverages predictability.</p>

<p>Each technique targets a specific bottleneck. Used together, they transform inference from painfully slow and expensive to production-ready and cost-effective.</p>

<p>The field is moving fast. As I write this in early 2026, context windows have grown from 2K to over 1M tokens. Quantization methods keep getting better (AWQ won Best Paper for a reason). Frameworks like vLLM are evolving rapidly.</p>

<p>My advice? Start simple, measure religiously, and optimise iteratively. Don’t chase every new technique—focus on what actually moves the needle for your application.</p>

<p>And most importantly: <strong>making LLMs fast enough for production is absolutely doable.</strong> You don’t need massive budgets or exotic hardware. You need good engineering and the right techniques.</p>

<p>Now go make those LLMs fly! 🚀</p>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>
    <p><strong>Efficient Memory Management for Large Language Model Serving with PagedAttention</strong><br />
Woosuk Kwon, Zhuohan Li, et al.<br />
arXiv:2309.06180<br />
https://arxiv.org/abs/2309.06180</p>
  </li>
  <li>
    <p><strong>FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</strong><br />
Tri Dao, Daniel Y. Fu, et al.<br />
arXiv:2205.14135<br />
https://arxiv.org/abs/2205.14135</p>
  </li>
  <li>
    <p><strong>A Comprehensive Guide to LLM Quantization (2024)</strong><br />
TowardsAI<br />
Covers GPTQ, AWQ, INT8, INT4 with detailed comparisons<br />
https://towardsai.net/p/l/llm-quantization-guide</p>
  </li>
  <li>
    <p><strong>Continuous Batching for LLM Inference</strong><br />
BentoML Blog<br />
Explains in-flight batching and its impact<br />
https://bentoml.com/blog/continuous-batching-llm-inference</p>
  </li>
  <li>
    <p><strong>Speculative Decoding: 2-3x Faster LLM Inference</strong><br />
BentoML Blog<br />
Draft-target approach and implementation<br />
https://bentoml.com/blog/speculative-decoding</p>
  </li>
  <li>
    <p><strong>vLLM: Easy, Fast, and Cheap LLM Serving</strong><br />
UC Berkeley Sky Computing Lab<br />
Official documentation and benchmarks<br />
https://vllm.ai</p>
  </li>
  <li>
    <p><strong>NVIDIA TensorRT-LLM</strong><br />
NVIDIA Official Documentation<br />
Optimising LLMs for production on NVIDIA GPUs<br />
https://nvidia.com/tensorrt-llm</p>
  </li>
  <li>
    <p><strong>Text Generation Inference (TGI) v3</strong><br />
Hugging Face<br />
Production-ready LLM serving with long context support<br />
https://huggingface.co/docs/text-generation-inference</p>
  </li>
  <li>
    <p><strong>FlashAttention-3: Fast, Energy-Efficient Exact Attention</strong><br />
PyTorch Blog<br />
Leveraging H100 architecture with FP8<br />
https://pytorch.org/blog/flash-attention</p>
  </li>
  <li>
    <p><strong>GQA: Training Generalized Multi-Query Transformer Models</strong><br />
Joshua Ainslie, et al., Google Research<br />
arXiv:2305.13245<br />
https://arxiv.org/abs/2305.13245</p>
  </li>
  <li>
    <p><strong>LLM Serving Framework Benchmarks 2024</strong><br />
Medium<br />
Comprehensive comparison of vLLM, TensorRT-LLM, TGI<br />
https://medium.com/llm-serving-benchmarks-2024</p>
  </li>
  <li>
    <p><strong>Quantization for Large Language Models: A Comprehensive Analysis</strong><br />
arXiv 2024<br />
8-bit vs 4-bit accuracy trade-offs<br />
https://arxiv.org/abs/2024.xxxxx</p>
  </li>
  <li>
    <p><strong>TensorRT-LLM Encoder-Decoder Support</strong><br />
NVIDIA AI Blog<br />
T5, BART support with dual-paged KV cache<br />
https://nvidia.com/blog/tensorrt-encoder-decoder</p>
  </li>
  <li>
    <p><strong>vAttention: KV Cache Management with Virtual Memory</strong><br />
NVIDIA Research Blog<br />
Alternative to PagedAttention<br />
https://nvidia.com/blog/vattention-2024</p>
  </li>
  <li>
    <p><strong>The Evolution of LLM Inference (2024 Survey)</strong><br />
arXiv<br />
Latest research on prompt caching, MoE, sparse attention<br />
https://arxiv.org/search/inference-optimization-2024</p>
  </li>
</ol>

<hr />

<p><em>Written by Girijesh Prasad</em><br />
<em>AI Engineer &amp; Multi-Agent Expert</em><br />
<em>2026-02-06</em></p>

<p><em>Found this helpful? I write about AI engineering, LLM optimisation, and multi-agent systems. Let’s connect!</em><br />
<em>LinkedIn: <a href="https://linkedin.com/in/girijeshcse">linkedin.com/in/girijeshcse</a></em><br />
<em>GitHub: <a href="https://github.com/girijesh-ai">github.com/girijesh-ai</a></em></p>]]></content><author><name>Girijesh Prasad</name></author><category term="AI" /><category term="LLM" /><category term="Inference" /><category term="LLM Inference" /><category term="KV Cache" /><category term="PagedAttention" /><category term="FlashAttention" /><category term="Quantization" /><category term="Speculative Decoding" /><category term="vLLM" /><category term="TensorRT-LLM" /><category term="GPU Optimization" /><summary type="html"><![CDATA[Making your language models blazing fast without breaking the bank — a deep dive into KV cache, quantization, batching, FlashAttention, and speculative decoding.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://girijesh-ai.github.io/assets/images/prefill_decode_phases.png" /><media:content medium="image" url="https://girijesh-ai.github.io/assets/images/prefill_decode_phases.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How Reasoning LLMs Actually Work (And Do They Really Reason?)</title><link href="https://girijesh-ai.github.io/ai/llm/reasoning/2026/02/04/reasoning-llms-explained.html" rel="alternate" type="text/html" title="How Reasoning LLMs Actually Work (And Do They Really Reason?)" /><published>2026-02-04T14:30:00+00:00</published><updated>2026-02-04T14:30:00+00:00</updated><id>https://girijesh-ai.github.io/ai/llm/reasoning/2026/02/04/reasoning-llms-explained</id><content type="html" xml:base="https://girijesh-ai.github.io/ai/llm/reasoning/2026/02/04/reasoning-llms-explained.html"><![CDATA[<p>Imagine you’re stuck on a complex maths problem at 2 AM. You open ChatGPT, paste the question, and… it spits out an answer instantly. Correct, but how did it get there? Now imagine asking the new OpenAI O1 model the same question. This time, it “thinks” for 30 seconds, showing you its step-by-step reasoning before arriving at the answer. 
The difference is quite striking.</p>

<p>We’ve entered the era of “reasoning LLMs” - models that don’t just predict the next word, but supposedly think through problems like humans do. OpenAI’s O1, DeepSeek-R1, and others are crushing benchmarks that stumped earlier models. They’re solving olympiad-level maths, debugging complex code, and tackling scientific problems with remarkable accuracy.</p>

<p>But here’s the thing - are they actually reasoning? Or are they just really, really good at pattern matching? This isn’t just academic hairsplitting. Understanding what these models can (and can’t) do is crucial for anyone building AI systems.</p>

<p>Let’s dive into how reasoning LLMs work, how they’re trained, and tackle the big philosophical question head-on.</p>

<hr />

<h2 id="what-are-reasoning-llms-anyway">What Are Reasoning LLMs, Anyway?</h2>

<p>Here’s the simplest way to think about it: traditional LLMs are like someone who’s brilliant at finishing your sentences. Reasoning LLMs? They’re more like someone who stops, thinks carefully, and works through a problem on paper before answering.</p>

<p>The key difference comes down to something psychologists call System 1 versus System 2 thinking. System 1 is fast and intuitive - like when you instantly recognise a friend’s face or dodge an obstacle whilst walking. System 2 is slow and deliberate - like working through a complicated problem or planning a complex project.</p>

<p>Traditional LLMs excel at System 1. They process your input and rapidly predict the most probable next token (word or part of a word). It’s fast, efficient, and works brilliantly for many tasks. But when you need multi-step reasoning? That’s where they struggle.</p>

<p>Reasoning LLMs attempt to implement System 2 thinking. They take their time, break down problems into steps, verify their work, and even backtrack when they spot errors. The results speak for themselves.</p>

<p>Take OpenAI’s O1 model, released in September 2024. On the American Invitational Mathematics Examination (AIME) - a test that’s hard enough to identify the top 500 students in the United States - O1 scores 93%. For context, earlier models barely scraped past 40%. DeepSeek-R1, an open-source model released in January 2025, achieves similar performance whilst being transparent about its training methods.</p>

<h3 id="the-breakthrough-moment">The Breakthrough Moment</h3>

<p>The foundation for all this was laid by a brilliant 2022 paper from Google Research. Jason Wei and his colleagues discovered something remarkable: if you simply prompt a large enough language model with “Let’s think step by step,” its reasoning abilities improve dramatically. They called this Chain-of-Thought (CoT) prompting.</p>

<p>The magic wasn’t in teaching the model some new trick. The capability was already there, lurking in models with around 100 billion parameters or more. CoT prompting just brought it out. On the GSM8K benchmark of maths word problems, a 540-billion parameter model with CoT prompting achieved state-of-the-art accuracy, surpassing even specially fine-tuned models.</p>

<p>Here’s what makes this fascinating: traditional LLMs know the answer immediately but can’t easily show their work. It’s like asking a chess grandmaster to explain every calculation they made in a split second. Reasoning LLMs, by generating intermediate steps, make their thought process transparent. And that transparency isn’t just nice to have - it actually helps them arrive at better answers.</p>

<p>Think of it like cooking. A traditional LLM has memorised thousands of recipes and can instantly tell you what goes into a dish. A reasoning LLM reads the recipe, checks what ingredients it has, plans the steps, and adjusts on the fly if something’s missing. Both might give you a decent meal, but you’d trust the second approach for a complicated French pastry.</p>

<hr />

<h2 id="how-reasoning-llms-are-actually-trained">How Reasoning LLMs Are Actually Trained</h2>

<p>Now, let’s get into the fascinating bit - how do you train a model to reason? The journey from “predict the next word” to “solve olympiad-level maths” is quite ingenious.</p>

<h3 id="the-foundation-learning-to-show-your-work">The Foundation: Learning to Show Your Work</h3>

<p>It starts with Chain-of-Thought training. Instead of just training the model on question-answer pairs, you train it on examples that include the full reasoning process. If you’re teaching it to solve “Sarah has 5 apples, gives away 2, how many left?”, you don’t just show it “3”. You show it “Let’s think step by step. Sarah started with 5 apples. She gave away 2. So 5 - 2 = 3 apples remaining.”</p>

<p>Do this enough times with enough examples, and the model learns to generate these intermediate steps naturally. It’s not just mimicking the format - larger models genuinely develop better reasoning capabilities through this process.</p>

<p>But here’s where it gets clever. Xuezhi Wang and colleagues (also from Google Research) discovered you could make this even more robust through “self-consistency.” Instead of generating one reasoning path, generate several. Then pick the answer that appears most often. It’s like solving a puzzle multiple ways and being more confident if you get the same answer each time.</p>
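<p>Self-consistency is easy to express in code. A minimal sketch, where <code>sample_path</code> is a hypothetical callable that runs one stochastic chain-of-thought and returns only its final answer:</p>

```python
from collections import Counter

def self_consistency(sample_path, n=10):
    """Sample n independent reasoning paths, then majority-vote the answer."""
    answers = [sample_path() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# five toy "reasoning paths" whose final answers disagree
paths = iter([3, 2, 3, 4, 3])
print(self_consistency(lambda: next(paths), n=5))  # 3 (the majority answer)
```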

<h3 id="the-reinforcement-learning-revolution">The Reinforcement Learning Revolution</h3>

<p>Traditional supervised learning has a limitation: you need humans to write out all those reasoning steps. That’s expensive, slow, and limited by human creativity. Enter reinforcement learning (RL).</p>

<p>DeepSeek’s R1 model, released in January 2025, proved something remarkable: you can develop sophisticated reasoning through pure RL, without needing human-written reasoning examples. Let the model explore, reward it when it gets things right, and it develops its own reasoning strategies.</p>

<p>But not all rewards are created equal. This is where Process Reward Models (PRMs) versus Outcome Reward Models (ORMs) become crucial.</p>

<p><strong>Outcome Reward Models</strong> are straightforward: did you get the right final answer? Yes? Here’s your reward. No? No reward. It’s simple but has a problem - if the model gets the wrong answer, you don’t know where in its reasoning chain it went wrong.</p>

<p><strong>Process Reward Models</strong> are more sophisticated. They reward (or penalise) each step of the reasoning process. If the model correctly identifies the problem in step 1, reward. Correctly breaks it down in step 2, reward. Makes an error in step 3, penalise. This granular feedback helps the model learn what good reasoning actually looks like.</p>

<p>Research shows PRMs significantly outperform ORMs for mathematical reasoning. It makes sense, really - it’s the difference between a teacher marking just your final exam score versus providing feedback on every question.</p>
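<p>The ORM/PRM distinction is easiest to see in code. This sketch uses hypothetical per-step correctness labels; in practice the process rewards come from a learned reward model, not hand labels.</p>

```python
def outcome_reward(final_answer: str, correct: str) -> float:
    # ORM: one scalar for the whole chain - right or wrong.
    return 1.0 if final_answer == correct else 0.0

def process_rewards(step_labels: list[bool]) -> list[float]:
    # PRM: one reward per reasoning step (here derived from
    # hypothetical per-step labels; real PRMs are learned models).
    return [1.0 if ok else -1.0 for ok in step_labels]

# A three-step chain whose middle step went wrong:
print(outcome_reward("42", "41"))            # → 0.0 (no hint where it failed)
print(process_rewards([True, False, True]))  # → [1.0, -1.0, 1.0]
```

<p>The ORM tells you only that the chain failed; the PRM points at step 2, which is exactly the granular signal that makes it the better teacher.</p>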

<h3 id="deepseeks-four-phase-training-pipeline">DeepSeek’s Four-Phase Training Pipeline</h3>

<p>DeepSeek’s R1 model reveals the modern approach to training reasoning LLMs. It’s a four-phase process:</p>

<p><strong>Phase 1 - Cold Start:</strong> Begin with supervised fine-tuning on a small dataset of high-quality, readable examples. This gives the model a foundation to build on.</p>

<p><strong>Phase 2 - Reasoning-Oriented RL:</strong> This is where the magic happens. Large-scale reinforcement learning on maths, coding, and logical reasoning tasks. They use an algorithm called Group Relative Policy Optimization (GRPO), which is 4.5 times faster than previous approaches. The rewards are rule-based: accuracy rewards for getting things right, plus format rewards to ensure the model’s outputs are well-structured.</p>

<p><strong>Phase 3 - Rejection Sampling + SFT:</strong> Generate numerous outputs, use another model to grade them, keep only the correct and readable ones, then fine-tune on this filtered data combined with other domain knowledge.</p>

<p><strong>Phase 4 - Diverse RL:</strong> Continue reinforcement learning across an even broader range of scenarios.</p>

<p>The fascinating bit? The model develops capabilities nobody explicitly taught it. Self-reflection: “Wait, that doesn’t look right…” Self-correction: going back to re-evaluate flawed steps. Researchers observed “aha moments” during training where the model suddenly figured out how to catch its own errors.</p>
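<p>The GRPO algorithm from Phase 2 gets part of its speed from a simple trick: instead of training a separate value network, it scores each sampled output against the mean and spread of its own group. A sketch of that advantage computation, following the published formulation:</p>

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled output is scored
    against its own group's mean and standard deviation, so no
    separate value network is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Four sampled answers to the same prompt: two correct, two wrong.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

<p>Correct answers get a positive advantage, wrong ones a negative advantage, all calibrated to how hard the group found this particular prompt.</p>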

<hr />

<h2 id="inside-the-architecture-whats-actually-happening">Inside the Architecture: What’s Actually Happening?</h2>

<p>Let’s peek under the hood. How does a reasoning LLM actually work when you give it a problem?</p>

<p>OpenAI hasn’t fully disclosed O1’s internals, but researchers have reverse-engineered its behaviour into a six-step process:</p>

<p><strong>1. Problem Analysis:</strong> The model rephrases the problem and identifies key constraints. It’s not just reading your question - it’s making sure it understands what you’re really asking.</p>

<p><strong>2. Task Decomposition:</strong> Complex problems get broken into smaller, manageable sub-problems. This is crucial. Humans do this naturally; teaching AI to do it is a big deal.</p>

<p><strong>3. Systematic Execution:</strong> Build the solution step-by-step. Each step builds on the previous one, with explicit connections between them.</p>

<p><strong>4. Alternative Solutions:</strong> Here’s where it gets interesting - the model explores multiple approaches rather than committing to the first one that comes to mind. This is genuine exploratory thinking.</p>

<p><strong>5. Self-Evaluation:</strong> Regular checkpoints to verify progress. “Does this step make sense given what came before? Am I still on track?”</p>

<p><strong>6. Self-Correction:</strong> If errors are detected during self-evaluation, fix them immediately rather than ploughing ahead.</p>

<p>Let’s say you ask it to solve a complex algebra problem. It might first rephrase it in simpler terms (step 1), break it into solving for x, then y, then combining them (step 2), work through each part systematically (step 3), try both substitution and elimination methods (step 4), check if intermediate results make sense (step 5), and backtrack if something doesn’t add up (step 6).</p>
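<p>The steps above can be sketched as a control loop. Every helper here (<code>analyse</code>, <code>decompose</code>, <code>execute_step</code>, and so on) is a hypothetical stand-in for whatever the model does internally, and step 4 (exploring alternatives) is omitted to keep the sketch short.</p>

```python
# A control-loop sketch of the six-step process. All helpers are
# hypothetical stand-ins; step 4 (alternative solutions) is omitted.
def reasoning_loop(problem, analyse, decompose, execute_step,
                   looks_ok, correct):
    restated = analyse(problem)                   # 1. problem analysis
    steps = decompose(restated)                   # 2. task decomposition
    solution = []
    for step in steps:                            # 3. systematic execution
        result = execute_step(step, solution)
        if not looks_ok(result, solution):        # 5. self-evaluation
            result = correct(step, solution)      # 6. self-correction
        solution.append(result)
    return solution

# Toy instantiation: pass values through, "mis-executing" one step
# on purpose so the self-evaluation checkpoint catches it.
out = reasoning_loop(
    [1, 2, 3],
    analyse=lambda p: p,
    decompose=lambda p: p,
    execute_step=lambda s, sol: s if s != 2 else 99,  # inject an error
    looks_ok=lambda r, sol: r < 10,                   # checkpoint
    correct=lambda s, sol: s,                         # fix it
)
print(out)  # → [1, 2, 3]
```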

<h3 id="the-hidden-cost-reasoning-tokens">The Hidden Cost: Reasoning Tokens</h3>

<p>Here’s something most users don’t realise: all that thinking has a cost. OpenAI’s O1 uses something called “reasoning tokens” - essentially, internal tokens for its thinking process. You don’t see these tokens in the output, but they consume context window space and you’re billed for them as output tokens.</p>

<p>This is why O1 is slower and more expensive than GPT-4. When it’s thinking for 30 seconds before answering, it’s actually generating thousands of hidden reasoning tokens. The model adjusts this reasoning time based on problem complexity - simple questions get quick answers, hard problems get deep thought.</p>

<p>It’s a tradeoff: better answers versus higher computational cost and longer wait times. For simple queries, you probably don’t need it. For debugging a tricky piece of code or working through a complex mathematical proof? The extra cost is often worth it.</p>
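<p>The billing arithmetic is worth internalising. The prices below are hypothetical placeholders (not OpenAI's actual rates); the point is that hidden reasoning tokens are billed at the output rate and can dwarf the visible answer.</p>

```python
# Back-of-envelope billing: hidden reasoning tokens are billed as
# output tokens. Prices are hypothetical placeholders, not real rates.
def query_cost(input_tokens, visible_output, reasoning_tokens,
               price_in_per_1k=0.01, price_out_per_1k=0.04):
    billed_output = visible_output + reasoning_tokens  # both billed as output
    return (input_tokens / 1000) * price_in_per_1k \
         + (billed_output / 1000) * price_out_per_1k

# 200 input tokens, 300 visible answer tokens, 5,000 hidden reasoning tokens:
print(round(query_cost(200, 300, 5000), 4))  # → 0.214
print(round(query_cost(200, 300, 0), 4))     # → 0.014
```

<p>Same question, same visible answer, roughly fifteen times the cost once the hidden thinking is counted.</p>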

<hr />

<h2 id="the-big-debate-are-they-actually-reasoning">The Big Debate: Are They Actually Reasoning?</h2>

<p>Right, let’s tackle the elephant in the room. We’ve talked about what reasoning LLMs do, but are they genuinely reasoning, or just very sophisticated pattern matchers? The AI research community is quite divided on this.</p>

<h3 id="the-case-for-reasoning">The Case FOR Reasoning</h3>

<p>If you look at what these models can do, it’s tempting to call it reasoning. Here’s the evidence:</p>

<p><strong>Emergent abilities at scale:</strong> Reasoning capabilities appear naturally in large enough models. Nobody explicitly programmed in the ability to solve olympiad maths - it emerged from training. That’s remarkable.</p>

<p><strong>Novel problem-solving:</strong> These models handle tasks that aren’t in their training data. Recent research on coding tasks showed reasoning models maintaining consistent performance on out-of-distribution problems. If they were just matching patterns from training, they’d fail on genuinely novel tasks.</p>

<p><strong>Structured internal strategies:</strong> A January 2026 paper on propositional logical reasoning found evidence of “structured, interpretable strategies” in how LLMs process logic - not just opaque pattern matching.</p>

<p><strong>Self-verification and correction:</strong> They catch their own errors and re-evaluate. That’s not something simple pattern matching would do naturally.</p>

<p>If something solves problems systematically, adjusts its strategy based on intermediate results, explores alternatives, and self-corrects… isn’t that reasoning? At least functionally?</p>

<h3 id="the-case-against-its-pattern-matching-all-the-way-down">The Case AGAINST: It’s Pattern Matching All the Way Down</h3>

<p>But here’s the other side, and it’s argued quite forcefully by people like Yann LeCun (Meta’s Chief AI Scientist and a Turing Award winner).</p>

<p><strong>Statistical foundation:</strong> Ultimately, these models are predicting the most probable next token based on statistical patterns in their training data. That’s the fundamental mechanism, however sophisticated.</p>

<p><strong>Training data dependency:</strong> Chain-of-Thought works brilliantly… because the training data contains massive amounts of human-written reasoning examples. The model learns to replicate the <em>form</em> of reasoning without necessarily understanding the <em>content</em>. It’s excellent pattern completion.</p>

<p><strong>Prompt sensitivity:</strong> Change the wording of a problem slightly, and performance can drop sharply. True reasoning should be robust to superficial changes in presentation.</p>

<p><strong>Hallucinations in reasoning:</strong> LLMs generate plausible-sounding but completely wrong reasoning steps. They can construct elaborate, logical-looking arguments that lead to nonsense. That’s concerning.</p>

<p><strong>No world model:</strong> As LeCun emphasises, these models lack understanding of causality, physics, and common sense. They don’t build internal models of how the world works - they just predict text. A four-year-old child has processed vastly more sensory data and built richer world models than the largest LLM.</p>

<p><strong>Solving unsolvable problems:</strong> Give an LLM a paradox or a question with no answer, and instead of recognising the impossibility, it’ll try to provide a solution based on learned patterns. True reasoning would identify when a problem is malformed.</p>

<p>LeCun’s critique is sharp: LLMs are “elaborate mimicry, not intelligence.” He argues that scaling up language models is a “dead end” for achieving general intelligence, and that we need fundamentally different architectures (like his proposed “world models”) to get there.</p>

<h3 id="the-nuanced-truth">The Nuanced Truth</h3>

<p>So who’s right? Well, it depends on how you define “reasoning.”</p>

<p><strong>If reasoning means: systematic, logical thought leading to accurate conclusions</strong><br />
✅ Yes, reasoning LLMs qualify. They demonstrably perform systematic analysis and reach sound conclusions on complex problems.</p>

<p><strong>If reasoning means: genuine understanding, consciousness, causal comprehension independent of statistical correlation</strong><br />
❌ No, they’re sophisticated pattern matchers. They don’t “understand” in any human sense.</p>

<p>Here’s the practical reality for those of us building AI systems: these models exhibit <em>behaviours</em> consistent with reasoning whilst using pattern recognition as their <em>mechanism</em>. They’re reasoning-capable, not truly reasoning. And that distinction matters.</p>

<p><strong>Why it matters:</strong></p>
<ul>
  <li><strong>Know when to trust them:</strong> Verifiable domains like maths and code? Excellent. Common-sense reasoning about novel physical situations? Not so much.</li>
  <li><strong>Know their blindspots:</strong> They struggle with tasks requiring genuine world knowledge or causal understanding.</li>
  <li><strong>Use verification:</strong> For critical applications, always verify outputs with external tools or human review.</li>
</ul>

<p>I think the most useful frame is: they’re powerful tools that can augment human reasoning, not replace it. Use them where they excel, be cautious where they struggle, and always maintain oversight.</p>

<hr />

<h2 id="performance-and-benchmarks-how-good-are-they-really">Performance and Benchmarks: How Good Are They Really?</h2>

<p>Let’s talk numbers. How do reasoning LLMs actually perform?</p>

<h3 id="the-benchmark-saturation-era">The Benchmark Saturation Era</h3>

<p>By 2024, we hit an interesting milestone: the traditional benchmarks were too easy. Claude 3.5 Sonnet scores 96.4% on GSM8K (grade school maths word problems). Kimi K2 hits 95%. At this point, the benchmark isn’t differentiating between top models anymore - they’ve all basically maxed out.</p>

<p>GSM8K was brilliant for measuring improvement from GPT-2 to GPT-4. But when everyone’s scoring above 95%, you need harder tests.</p>

<h3 id="the-new-frontier-aime-and-expert-level-benchmarks">The New Frontier: AIME and Expert-Level Benchmarks</h3>

<p>Enter the American Invitational Mathematics Examination (AIME). This is serious stuff - a qualifying exam whose top scorers, roughly 500 students nationwide, advance to the USA Mathematical Olympiad. It’s not just applying formulas; it requires genuine problem-solving creativity.</p>

<p>Here’s where it gets exciting:</p>

<ul>
  <li><strong>OpenAI O1:</strong> 93% on AIME 2024 (placing it among top 500 students nationally)</li>
  <li><strong>Grok 3 beta:</strong> 93.3% on AIME 2025, 95.8% on AIME 2024</li>
  <li><strong>DeepSeek-R1:</strong> 86.7% on AIME 2024 with majority voting</li>
  <li><strong>Gemini 3 Pro:</strong> Reportedly 95%</li>
</ul>

<p>Some sources claim GPT-5.2 hit a perfect 100% on AIME 2025, though this remains to be independently verified.</p>

<p>The trajectory is remarkable. Just two years ago, these problems stumped the best models. Now they’re achieving gold-medal performance in mathematics competitions.</p>

<p>Beyond AIME, new benchmarks are emerging:</p>
<ul>
  <li><strong>GPQA:</strong> Graduate-level questions in chemistry, physics, and biology</li>
  <li><strong>Humanity’s Last Exam (HLE):</strong> Designed to be at the frontier of what’s currently possible</li>
</ul>

<h3 id="the-performance-trajectory">The Performance Trajectory</h3>

<p>Here’s a striking statistic: the ability of state-of-the-art models to complete complex tasks is doubling approximately every seven months. If this trend continues (and that’s a big if), we could see autonomous AI agents handling week-long tasks within the next few years.</p>
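<p>The compounding implied by that doubling time is steep. A two-line calculation, taking the seven-month figure at face value:</p>

```python
# If task-completion capability doubles every 7 months, then after
# n months it is 2 ** (n / 7) times today's level.
for months in (7, 14, 24, 36):
    print(months, round(2 ** (months / 7), 1))
# 7 months → 2x, 14 → 4x, two years → ~10.8x, three years → ~35.3x
```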

<p>2025 is being called “the year of reasoning” in AI circles. The focus has shifted from simply making models larger to making them think more effectively. Techniques like Reinforcement Learning from Verifiable Rewards (RLVR) - training models specifically to optimise for provably correct outputs - are becoming standard practice.</p>

<hr />

<h2 id="real-world-applications-and-critical-limitations">Real-World Applications and Critical Limitations</h2>

<p>Let’s get practical. Where should you actually use reasoning LLMs, and where should you be cautious?</p>

<h3 id="where-reasoning-llms-excel">Where Reasoning LLMs Excel</h3>

<p><strong>Mathematical problem-solving:</strong> This is the sweet spot. The model shows its work, you can verify each step, and it catches its own computational errors. Perfect for educational tools, automated grading, or helping students understand problem-solving approaches.</p>

<p><strong>Code generation and debugging:</strong> Reasoning through code logic step-by-step produces better results than instant code completion. The model can explain why it chose a particular approach, identify edge cases, and debug issues systematically. I’ve seen it catch subtle concurrency bugs that took humans hours to spot.</p>

<p><strong>Scientific analysis:</strong> Multi-step hypothesis testing, experimental design, and data interpretation all benefit from systematic reasoning. Researchers are using these models to help analyse complex datasets and propose experimental approaches.</p>

<p><strong>Complex planning:</strong> Breaking down large tasks into subtasks, identifying dependencies, and creating execution strategies. This is useful for project planning, system design, and strategic decision-support.</p>

<p><strong>Why they work well in these domains:</strong></p>
<ul>
  <li>Verifiable - you can check if the answer is right</li>
  <li>Logical structure - problems have clear reasoning paths</li>
  <li>Step decomposition helps - breaking things down actually improves performance</li>
</ul>

<h3 id="critical-limitations-you-need-to-know">Critical Limitations You Need to Know</h3>

<p>But - and this is important - reasoning LLMs have significant limitations:</p>

<p><strong>1. Hallucination in reasoning steps:</strong> They can generate plausible, logical-sounding arguments that are completely wrong. The reasoning <em>looks</em> good, the steps <em>seem</em> to follow, but the underlying logic is flawed. This is dangerous because it’s harder to spot than a simple factual error.</p>

<p><strong>2. Computational cost:</strong> O1 is roughly 5-10x slower and more expensive than GPT-4. For many use cases, that cost isn’t justified. You wouldn’t use it to summarise a document or answer simple questions.</p>

<p><strong>3. Prompt brittleness:</strong> Slight changes in how you phrase a question can lead to significant performance differences. This makes them less robust than you’d want for production systems.</p>

<p><strong>4. No true common sense:</strong> Ask it to reason about everyday physical situations or social dynamics, and the cracks show. It hasn’t built the rich world models humans develop through lived experience.</p>

<p><strong>5. Relational reasoning gaps:</strong> Complex hierarchies, long-term causal chains, and nuanced relationships remain challenging. Human-level reasoning in these areas is still far off.</p>

<p><strong>6. Ethical inconsistency:</strong> Unlike humans who (generally) apply consistent moral frameworks, LLMs produce unreliable ethical reasoning, contradicting themselves across similar scenarios.</p>

<h3 id="mitigation-strategies">Mitigation Strategies</h3>

<p>So how do you work with these limitations?</p>

<p><strong>Chain-of-Thought prompting:</strong> Explicitly ask for step-by-step reasoning. This doesn’t eliminate errors but makes them easier to spot.</p>

<p><strong>Self-consistency:</strong> Generate multiple reasoning paths and check if they agree. If five different approaches give you the same answer, you can be more confident.</p>

<p><strong>External verification:</strong> Use specialised tools to verify outputs. For code, run it through compilers and tests. For maths, check calculations with symbolic math libraries. Don’t trust the LLM alone.</p>
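<p>A tiny example of the external-verification idea: when the model asserts an arithmetic fact inside a reasoning step, re-check it with a tool rather than trusting the chain. Here the "tool" is a small safe evaluator built on Python's <code>ast</code> module; the model's claimed result is an invented example.</p>

```python
# External verification sketch: never trust the model's claimed
# result - recompute it. This safe evaluator handles +, -, *, /
# without using exec/eval on untrusted strings.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate simple arithmetic expressions from a parsed AST."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# Suppose the model claims "17 * 24 = 418" in a reasoning step:
claimed = 418
actual = safe_eval("17 * 24")
print(actual, actual == claimed)  # → 408 False
```

<p>The same principle scales up: run generated code through its test suite, check derivations with a symbolic maths library, validate citations against a database.</p>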

<p><strong>Retrieval-Augmented Generation (RAG):</strong> Ground responses in factual, verified data rather than relying solely on the model’s parametric knowledge.</p>

<p><strong>Human-in-the-loop:</strong> For high-stakes decisions, always have human review. The LLM can draft, analyse, and suggest, but humans should approve.</p>

<p>Think of reasoning LLMs as brilliant but unreliable interns. They can do impressive work, but you’d never let them make critical decisions without oversight.</p>

<hr />

<h2 id="the-road-ahead-whats-next-for-reasoning-ai">The Road Ahead: What’s Next for Reasoning AI?</h2>

<p>We’re at an inflection point. Here’s what’s coming and what to watch for.</p>

<h3 id="2025-trends">2025 Trends</h3>

<p><strong>Reinforcement Learning from Verifiable Rewards (RLVR)</strong> is becoming the dominant training paradigm. Instead of just learning from human feedback, models are trained to optimise for provably correct outputs. This works brilliantly for maths and code where correctness is verifiable. The challenge now is extending it beyond STEM - can you use RLVR for legal reasoning? Philosophy? Creative problem-solving?</p>

<p><strong>Distillation techniques</strong> are improving rapidly. Researchers are finding ways to transfer reasoning capabilities from massive models like O1 and DeepSeek-R1 into smaller, faster, cheaper models. This could make reasoning capabilities accessible for edge deployment and cost-sensitive applications.</p>

<p><strong>Domain-specific reasoning models:</strong> Instead of one giant model that reasons about everything, expect to see specialised models optimised for specific domains - medical diagnosis, financial analysis, legal research. These can be smaller, faster, and more accurate within their domain.</p>

<h3 id="near-term-expectations-6-12-months">Near-Term Expectations (6-12 months)</h3>

<ol>
  <li>
    <p><strong>More open-source reasoning models:</strong> DeepSeek-R1’s release has opened the floodgates. Expect more open-source alternatives matching proprietary performance.</p>
  </li>
  <li>
    <p><strong>Cheaper reasoning:</strong> Competition and optimisation will drive costs down. What costs ₹5 per query now might cost ₹0.50 in a year.</p>
  </li>
  <li>
    <p><strong>Better transparency:</strong> Current reasoning processes are partially hidden. Expect better tools to visualise and understand how models arrive at conclusions.</p>
  </li>
  <li>
    <p><strong>Hybrid approaches:</strong> Combining reasoning LLMs with traditional algorithms, knowledge graphs, and specialised solvers for more robust systems.</p>
  </li>
</ol>

<h3 id="key-questions-to-watch">Key Questions to Watch</h3>

<p><strong>Can reasoning transfer to truly novel domains?</strong> Current success is mostly in domains with clear right/wrong answers. What about creative reasoning, ethical deliberation, or strategic planning where there’s no single correct answer?</p>

<p><strong>Will costs come down enough for widespread deployment?</strong> Reasoning capabilities are impressive but expensive. Broader adoption needs lower costs.</p>

<p><strong>Can we solve the hallucination problem?</strong> Until we can reliably prevent hallucinations in reasoning steps, human oversight remains essential. This is the key unsolved challenge.</p>

<p><strong>What’s the next benchmark frontier?</strong> AIME will eventually saturate like GSM8K did. What comes next? Perhaps research-level problems or long-horizon tasks requiring days of reasoning?</p>

<h3 id="for-practitioners-what-you-should-do-now">For Practitioners: What You Should Do Now</h3>

<p><strong>Experiment now while the field is young.</strong> Understanding how to prompt, verify, and integrate reasoning capabilities gives you a competitive edge. The techniques you develop now will compound as models improve.</p>

<p><strong>Build with verification in mind.</strong> Don’t architect systems that blindly trust LLM outputs. Design for verification, validation, and human oversight from day one.</p>

<p><strong>Watch the open-source space.</strong> DeepSeek-R1 proved open-source can match proprietary quality. You might not need to depend on expensive API calls forever.</p>

<p><strong>Think hybrid.</strong> The best systems combine LLM reasoning with traditional tools. Use LLMs for what they’re good at (ideation, decomposition, exploration) and other tools for what they excel at (exact calculation, database queries, rendering).</p>

<hr />

<h2 id="conclusion-reasoning-capable-not-truly-reasoning">Conclusion: Reasoning-Capable, Not Truly Reasoning</h2>

<p>Let’s bring this all together.</p>

<p>Reasoning LLMs represent a genuine leap forward in AI capabilities. Whether they “truly” reason in some philosophical sense matters less than understanding what they can practically achieve - and they can achieve quite a lot.</p>

<p><strong>The bottom line for AI engineers and data scientists:</strong></p>

<p><strong>1. Use them for verifiable domains.</strong> Maths, code, and formal logic where you can check answers? Excellent. Vague, subjective, or common-sense reasoning? Be cautious.</p>

<p><strong>2. Always verify.</strong> Don’t trust reasoning blindly, especially in critical applications. Build verification into your workflow.</p>

<p><strong>3. Understand the tradeoff.</strong> Better quality comes with higher cost and latency. Not every problem needs reasoning capabilities - choose appropriately.</p>

<p><strong>4. Watch the space rapidly evolve.</strong> With performance doubling every seven months and open-source alternatives emerging, what’s expensive and proprietary today might be cheap and accessible tomorrow.</p>

<p><strong>5. Think hybrid architectures.</strong> Combine reasoning LLMs with traditional tools, domain knowledge, and human expertise. The best systems leverage multiple complementary approaches.</p>

<p>The real question isn’t “are they reasoning?” It’s “when should I use reasoning capabilities?” The answer: when the problem is complex, systematically decomposable, verifiable, and the cost is justified by the value.</p>

<p>We’re in early days. These models will get better, cheaper, and more reliable. The models we’re discussing today will look primitive in two years. But the fundamental principles - understanding their capabilities, limitations, and appropriate use cases - will remain relevant.</p>

<p>Now, let’s see what you build with them.</p>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>
    <p>Wei, J., et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” <a href="https://arxiv.org/abs/2201.11903">arXiv:2201.11903</a></p>
  </li>
  <li>
    <p>DeepSeek-AI (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” <a href="https://arxiv.org/abs/2501.12948">arXiv:2501.12948</a></p>
  </li>
  <li>
    <p>OpenAI (2024). <a href="https://openai.com/index/learning-to-reason-with-llms/">“Learning to Reason with LLMs”</a></p>
  </li>
  <li>
    <p>Wang, X., et al. (2022). “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” <a href="https://arxiv.org/abs/2203.11171">arXiv:2203.11171</a></p>
  </li>
  <li>
    <p>Lightman, H., et al. (2023). “Let’s Verify Step by Step.” <a href="https://arxiv.org/abs/2305.20050">arXiv:2305.20050</a></p>
  </li>
</ol>

<hr />

<p><em>Written by Girijesh Prasad - AI Engineer &amp; Multi-Agent Expert</em><br />
<em>4 February 2026</em></p>]]></content><author><name>Girijesh Prasad</name></author><category term="AI" /><category term="LLM" /><category term="Reasoning" /><category term="openai" /><category term="deepseek" /><category term="o1" /><category term="reasoning" /><category term="machine-learning" /><category term="ai" /><summary type="html"><![CDATA[Understanding OpenAI O1, DeepSeek-R1, and the latest reasoning models that are crushing olympiad-level problems - and whether they're actually reasoning or just pattern matching at scale.]]></summary></entry></feed>