Journey to LLMs

The Math That Started It All

Before we jump into LLMs, let’s rewind to where it all began: Machine Learning.

At its core, ML is surprisingly simple—you give a model data, the model does a lot of math (mostly matrix multiplication), and it outputs predictions.

But the real magic isn’t intelligence. It’s this:

A model is just a function. Training is just finding the best function parameters.
Don’t worry about the math—I’ll explain it using simple examples.

Traditional Machine Learning: Predict Weight from Height

Let’s say you want a simple system that estimates a person’s expected weight from their height. Why would anyone do this? Plenty of practical reasons: spotting data entry mistakes, flagging suspicious records, or building a quick baseline for health/fitness analytics.

So we start the ML way: collect data—lots of pairs like:

  • Height (cm) → Weight (kg)

If you plot these points on a graph, each dot is one real person. Now, to make predictions, we can start with the simplest possible model: a straight line.

Task Breakdown

That line (something like weight ≈ w × height + b) gives you a prediction for any new height you plug in. But a model isn’t useful just because it produces a number; we need to know how good it is.

So we measure its performance by looking at the gap between:

  • Actual weight (y)
  • Predicted weight (ŷ)

That gap is the error, and the goal of training is simple:


Adjust the line so the errors become as small as possible across all data points.
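
Here’s a minimal sketch of that idea in Python. The numbers are made up, and np.polyfit stands in for the “adjust the line until the error is small” step; a real project would use an actual dataset and a proper training setup:

```python
import numpy as np

# Made-up (height_cm, weight_kg) pairs -- purely illustrative data.
heights = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)
weights = np.array([50,  56,  61,  66,  70,  76,  82,  88], dtype=float)

# "Training": find the slope w and intercept b that minimize the
# squared error between predicted and actual weights.
w, b = np.polyfit(heights, weights, deg=1)

def predict_weight(height_cm):
    """The model is just a function: weight ≈ w * height + b."""
    return w * height_cm + b

# The "error" the text talks about: the gap between actual and predicted.
predictions = predict_weight(heights)
mse = np.mean((weights - predictions) ** 2)

print(f"learned line: weight ≈ {w:.2f} * height + {b:.2f}")
print(f"mean squared error on our data: {mse:.2f}")
print(f"predicted weight for 172 cm: {predict_weight(172):.1f} kg")
```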

As we collect more examples (more “experience”), the model usually becomes more reliable. And we can also make it smarter by adding more relevant inputs — for instance, including gender, age, or body frame — so the model isn’t forced to explain everyone using one single line.

If we can keep adding features and improving the line… why do we need deep learning?

Why Do We Need Deep Learning at All?

Deep Learning was originally inspired by a simple idea from the brain.


Our brain seems to learn by combining many tiny “signals” into bigger patterns, step by step.

Here’s the analogy in a clean, simple way:

  • A biological neuron receives signals from many other neurons. If the combined signal is strong enough, it fires and passes a signal forward.
  • An artificial neuron does the same conceptually: It takes many inputs, gives each input an importance (weight), adds them up, and then applies a small rule (an activation) to decide how much signal to pass forward.

So deep learning models are built as layers of these artificial neurons, where:

  • early layers learn simple patterns
  • later layers combine them into bigger patterns
  • and the final layers use those to make a decision or prediction

That “layer-by-layer pattern building” is the key brain-inspired idea behind deep learning.
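
To make the analogy concrete, here’s a toy numpy sketch of one artificial neuron and two stacked layers. The weights are random, purely for illustration; in a real network they would be learned from data:

```python
import numpy as np

def relu(x):
    # The "small rule" (activation): pass positive signal, block negative.
    return np.maximum(0, x)

def neuron(inputs, weights, bias):
    # Weighted sum of inputs, then an activation decides how much to fire.
    return relu(np.dot(inputs, weights) + bias)

def layer(inputs, weight_matrix, biases):
    # A "layer" is just many neurons looking at the same inputs.
    return relu(weight_matrix @ inputs + biases)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.3])          # three input signals

print("single neuron:", neuron(x, np.array([0.2, -0.1, 0.4]), 0.1))

# Layer 1: 4 neurons, each with 3 weights -> learns "simple patterns"
h1 = layer(x, rng.normal(size=(4, 3)), rng.normal(size=4))
# Layer 2: 2 neurons combine layer-1 outputs into "bigger patterns"
h2 = layer(h1, rng.normal(size=(2, 4)), rng.normal(size=2))

print("layer 1 output:", h1)
print("layer 2 output:", h2)
```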

So why do we need all of this machinery? Because many real-world problems aren’t just “a better line” away. Sometimes:

  1. The relationship is too complex to hand-design

    Even if you add more inputs (age, gender, body frame), the true mapping can still be messy and non-linear. You’re no longer fitting a simple curve — you’re trying to learn a function with tons of interactions (feature A matters only when feature B is present, etc.).

  2. The inputs themselves are not “nice features”

    In many problems, you don’t start with clean columns like height/age. You start with raw, high-dimensional data like:

    • images (millions of pixel values)
    • audio waveforms
    • text (sequences of tokens)

Traditional ML usually needs you to manually convert raw data into meaningful features (“edges”, “whiskers”, “tone”, “keywords”…). That’s hard, slow, and often incomplete.

  3. Feature engineering doesn’t scale

    For simple datasets, humans can craft good features. But for things like “detect cancer in scans” or “understand language,” it’s unrealistic to hand-code every useful pattern.

That’s where deep learning comes in:


Instead of us designing features and then learning a model, deep learning learns the features and the model together, directly from raw data.

So the need for DL isn’t just “complex functions.” It’s that in many modern tasks, the function is complex AND the inputs are raw, and manually designing the right feature pipeline becomes the real bottleneck.

Neural Networks and Universal Approximation

Can a Neural Network Learn Any Function?

When people say “a neural network can learn anything”, they’re pointing to an idea called the Universal Approximation Theorem.

In plain words, it says:

A neural network with at least one hidden layer can approximate any reasonable function (think: smooth/continuous relationships), as closely as you want, if you give it enough neurons.

That sounds wild—so let’s make it feel obvious.

A Neural Network Is a “Curve Builder”

Start with 1D: one input x, one output y.

Imagine the true relationship between height and weight is not a perfect straight line. It’s some curve.

A neural network can approximate that curve like this:

  • A linear model draws one straight line
  • A neural network draws a curve by stitching together many small line segments (or many “small bends”)

So the idea is:

If I can create enough tiny bends, I can trace almost any curve.
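
Here’s a toy numpy sketch of that intuition. It is not a trained network: the bend positions are fixed in advance and only the output weights are solved by least squares, just to show that the gap to a wiggly curve shrinks as you add bends:

```python
import numpy as np

# Target: a clearly non-linear curve standing in for the "true" relationship.
x = np.linspace(-3, 3, 200)
y_true = np.sin(x) + 0.3 * x**2

def relu(z):
    return np.maximum(0, z)

def fit_with_bends(n_bends):
    # One "bend" per hidden neuron: relu(x - c) kinks the line at position c.
    centers = np.linspace(-3, 3, n_bends)
    features = np.column_stack([x, np.ones_like(x)] +
                               [relu(x - c) for c in centers])
    # Least squares finds how much each bend contributes to the final curve.
    coeffs, *_ = np.linalg.lstsq(features, y_true, rcond=None)
    y_hat = features @ coeffs
    return np.max(np.abs(y_hat - y_true))   # worst-case gap to the true curve

for n in (2, 5, 20, 50):
    print(f"{n:3d} bends -> max error {fit_with_bends(n):.4f}")
```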

So far, our examples were “one-shot” problems: you give the model a fixed set of inputs (like height, age, etc.) and it predicts one output.

But language doesn’t work like that.

A sentence isn’t just a bag of words — order creates meaning. “dog bites man” is not the same as “man bites dog.”

Now the input isn’t one number or one vector. It’s a stream: word 1, word 2, word 3…
And to understand word 10, the model often needs context from word 1.

That’s the moment deep learning needed a new idea: models with memory.
And that’s exactly what Recurrent Neural Networks (RNNs) were designed for.

RNNs (Recurrent Neural Networks): Teaching Models to Remember

Imagine you’re reading this sentence:

“I went to the bank to deposit cash, and then I sat on the bank to watch the sunset.”

Same word. Two meanings.
You only know which “bank” is which because you remember what you just read.

That’s the entire reason RNNs were invented.

The Problem RNNs Tried to Solve

Traditional neural networks are great when inputs are independent—like classifying one image at a time. But language isn’t like that. In text, meaning depends on order and context. The word you’re reading right now depends on the words that came before it.

So RNNs introduced a simple idea:

As you read a sentence word by word, keep a small “notebook” of what you’ve seen so far.

That notebook is the hidden state. It’s the RNN’s memory.
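
Here’s a minimal numpy sketch of that recurrence. The weights are random and the “words” are fake vectors, purely to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 8, 4

# Toy parameters (normally learned during training).
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h  = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # New notebook entry = mix of the current word and everything read so far.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Pretend each word is already a small vector (real models learn embeddings).
sentence = [rng.normal(size=embed_size) for _ in range(6)]

h = np.zeros(hidden_size)          # empty notebook before reading anything
for x_t in sentence:
    h = rnn_step(x_t, h)           # the same weights are reused at every step

print("final hidden state (the 'notebook'):", np.round(h, 3))
```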

A More Realistic Example: “This Is Sick”

Let’s say someone texts:

“This is sick.”

Is that bad? Or is that slang for “awesome”?

You can’t know without context. Now add one more line:

  • “This is sick. I’ve been vomiting since morning.” → negative
  • “This is sick! That guitar solo was insane.” → positive

An RNN processes the sentence in order and tries to keep enough context in its memory to make the right call later.

Key Disadvantages of RNNs

  • Long-range dependency problem (they forget earlier context in long sequences)
  • Training becomes difficult as sequences grow (gradients tend to vanish or explode over long sequences)
  • Inherently slower because they process sequentially (not easily parallelizable like modern architectures)

In particular, they remained brittle—good at narrow tasks, but poor at understanding context, ambiguity, and long-range dependencies.

LSTMs: Fixing the RNN Forgetting Problem

LSTM Architecture Diagram

RNNs were a smart idea: read a sentence word-by-word and keep a memory of what you’ve read so far.

But they had a very human problem.

They could remember… until the sentence got long. Then they started forgetting the beginning — like someone who’s halfway through a movie and can’t recall the first scene.

That’s why LSTMs were invented.

The Simple Reason LSTMs Exist

RNNs store memory in one running “state.” And every new word slightly overwrites that state.

So after 40–50 words, the memory becomes a blurry summary.

LSTMs fix that by changing how memory works.

Instead of having just one fragile memory, LSTMs say:

“Let’s create a memory lane that can carry information forward for a long time — and let’s control it with smart ‘decisions’.”

That memory lane is called the cell state.

Think of it as a clean highway where important info can travel far without getting corrupted.

Two Types of Memory Inside an LSTM

An LSTM basically has two “brains” running together:

  1. Cell State (Long-term memory)

    The long highway that can preserve important signals for many steps.

  2. Hidden State (Short-term working memory)

    What the model is actively thinking about “right now” at the current word.

So: long-term memory carries the plot, short-term memory handles the current scene.

That separation alone makes LSTMs much more stable than plain RNNs.

The “Gates” Are Just Memory Decisions

The word “gates” sounds fancy, but it’s really just three learned questions the model keeps asking at every word:

1) Forget Gate — “Should I drop this old info?”

Sometimes earlier info becomes irrelevant.

Example:

“This is sick…”

At first, “sick” could mean cool.

But then the sentence continues:

“I’ve been vomiting since morning.”

Now the model should forget the “cool slang” interpretation and keep the “illness” interpretation.

That’s what the forget gate helps with: clearing noise, keeping memory clean.

2) Input Gate — “Should I store this new info?”

Not every word deserves permanent memory.

Words like “the”, “and”, “of” usually aren’t worth storing.

But words that change the meaning are important:

  • vomiting
  • deposit cash
  • sunset
  • not
  • never
  • however

The input gate learns what signals are worth saving into long-term memory.

3) Output Gate — “Which part of my memory should I use right now?”

Even if you remember a lot, you don’t use all of it every moment.

Like humans: you can recall many things, but you only pull out the relevant memory when needed.

Example:

“I went to the bank to deposit cash, and then I sat on the bank…”

When the second “bank” appears, the model needs to decide:

  • Do I use the finance context?
  • Or the outdoor / sitting context?

The output gate makes that call: it decides which part of the stored memory gets surfaced at the current step.
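
Putting the three gates together, here’s a compact numpy sketch of a single LSTM step. Sizes are toy and the weights are random; real models learn them during training:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden, embed = 8, 4

# One weight matrix per gate, acting on [previous hidden state, current word].
def make_params():
    return rng.normal(scale=0.1, size=(hidden, hidden + embed)), np.zeros(hidden)

(W_f, b_f), (W_i, b_i), (W_o, b_o), (W_c, b_c) = (make_params() for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: which old memory to drop
    i = sigmoid(W_i @ z + b_i)        # input gate: which new info to store
    o = sigmoid(W_o @ z + b_o)        # output gate: which memory to use now
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate new memory content
    c = f * c_prev + i * c_tilde      # update the long-term "memory lane"
    h = o * np.tanh(c)                # short-term working memory for this step
    return h, c

h = np.zeros(hidden)   # short-term memory (hidden state)
c = np.zeros(hidden)   # long-term memory (cell state)
for x_t in [rng.normal(size=embed) for _ in range(6)]:
    h, c = lstm_step(x_t, h, c)

print("hidden state:", np.round(h, 3))
print("cell state:  ", np.round(c, 3))
```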

Why LSTMs Were a Big Step Forward

Once LSTMs showed up, they became the “default” for sequence problems for years.

They performed strongly in:

  • speech recognition (audio is a sequence)
  • time-series forecasting (sales, sensors, health signals)
  • early machine translation
  • sentiment classification
  • text generation (before Transformers)

And the reason is simple: they could carry meaning farther without forgetting as easily.

So Why Didn’t LSTMs Become LLMs?

Even though LSTMs were smarter than RNNs, they still had a fundamental limitation:

They process words one at a time.

That means:

  • training is slow (hard to parallelize)
  • long documents take forever
  • memory still has a bottleneck — because even with gates, you’re compressing the whole past into a limited internal state

In other words:

LSTMs are better at remembering, but they still “summarize the past into one memory.”

And that’s the key transition to Transformers.

The Idea That Replaced LSTMs

Instead of forcing the model to squeeze everything into a single memory…

Transformers asked:

“Why remember everything, when you can just look back at what you need?”

That “look back” mechanism is Attention.

And attention is what unlocked:

  • parallel training
  • long context handling
  • richer relationships between words
  • and, eventually, LLMs
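
Here’s a small numpy sketch of that “look back” step, i.e. scaled dot-product attention over a handful of fake token vectors (toy sizes, random projections, for illustration only):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each token "looks back" at every other token and decides how much of
    # its value to pull in, instead of relying on one squeezed-down memory.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of every token to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1: a soft "where to look"
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # 5 tokens, 8-dim vectors (toy sizes)
X = rng.normal(size=(seq_len, d_model))  # pretend these are token embeddings

# In a real Transformer, Q, K, V come from learned projections of X.
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))
out, attn = attention(X @ W_q, X @ W_k, X @ W_v)

print("attention weights for token 0:", np.round(attn[0], 2))
print("output shape:", out.shape)        # every token is processed in parallel
```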

From Machine Learning to Large Language Models: The Evolution of Intelligence

The breakthrough that enabled Large Language Models (LLMs) was the realization that language understanding could be treated as a general prediction problem at scale. Instead of training separate models for translation, summarization, or question answering, researchers trained a single model to predict the next token across massive corpora of text.

This shift from task-specific learning to foundation models changed everything.

At the core of LLMs is the Transformer architecture, which replaced recurrence with attention. Self-attention allows the model to evaluate relationships between all tokens in a sequence simultaneously, enabling parallelism, long-context reasoning, and richer representations of meaning.

With enough data, parameters, and compute, transformers began to exhibit emergent capabilities—reasoning, abstraction, few-shot learning, and transfer across tasks they were never explicitly trained for.

What truly differentiates LLMs from earlier ML and DL systems is not just size, but how they learn and adapt.

  • Pretraining on diverse, unlabeled data enables broad world knowledge.
  • Fine-tuning and alignment (e.g., instruction tuning, RLHF) adapt models to human intent.
  • In-context learning allows models to perform new tasks using examples provided at inference time—without retraining.
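
As a rough illustration of in-context learning, the “program” is just text: a few worked examples placed in the prompt, followed by the case you want solved. Any instruction-following or completion model could consume a prompt like this (the reviews below are invented):

```python
# Illustrative only: the task is defined entirely by the prompt text,
# not by retraining the model.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery died after two days."
Sentiment: Negative

Review: "Setup took thirty seconds and it just works."
Sentiment: Positive

Review: "This is sick! That guitar solo was insane."
Sentiment:"""

print(few_shot_prompt)
```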

In effect, LLMs blur the boundary between model and system. They are no longer just predictors; they function as reasoning engines, planners, and interfaces that can orchestrate tools, query structured data, retrieve knowledge, and collaborate with humans in natural language.

Understanding this progression is essential before diving into LLM architectures, components, and model families—because LLMs are not a replacement for ML or DL, but rather their natural culmination at scale, augmented by architecture, data, and alignment.

LLMs Architecture and Its Components

LLMs are built on the Transformer architecture. The original Transformer is an encoder-decoder model, and the LLM families described below keep one or both halves of it.

Transformer Architecture Diagram

The main components of a transformer architecture are the encoder and the decoder. Each of these is composed of multiple layers containing a multi-head self-attention mechanism and a position-wise feed-forward network. Other essential components include positional encoding, which provides positional information, and layer normalization to help stabilize training.

Encoder: Processes the input sequence, converting it into a context-rich representation. It consists of a stack of identical layers, each containing a multi-head self-attention sub-layer and a feed-forward network.

Decoder: Generates the output sequence, step-by-step. It includes both self-attention and feed-forward layers, and crucially, an additional multi-head attention layer that attends to the output of the encoder.

Key Components Within Each Layer

Multi-Head Attention: Allows the model to jointly attend to information from different representation subspaces at different positions. The encoder uses self-attention, while the decoder uses both masked self-attention and encoder-decoder attention.

Position-wise Feed-Forward Network (FFN): A simple, fully connected feed-forward network applied independently to each position.

Positional Encoding: Since transformers lack recurrence or convolution, this component is added to the input embeddings to inject information about the position of each token in the sequence (sketched in code after this list).

Layer Normalization: Applied after each sub-layer to help stabilize training.

Residual Connections: Also known as skip connections, these are used around each sub-layer, allowing gradients to flow more easily during training.
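
As one concrete example of these components, here’s a short numpy sketch of the sinusoidal positional encoding used in the original Transformer paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal encodings from the original Transformer paper:
    even dimensions use sin, odd dimensions use cos, at different frequencies."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to token embeddings so the model knows "which position is which".
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)            # (10, 16)
print(np.round(pe[0], 2))  # position 0: sin(0)=0 and cos(0)=1 alternating
```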

Different Types of LLMs

Each LLM architecture is optimized for specific types of tasks: encoder-only models for understanding, decoder-only models for generation, and encoder-decoder models for tasks requiring a transformation from one sequence to another.

Encoder-Only Models

These models are designed for understanding and analyzing input text to produce rich, bidirectional representations. They excel at tasks that require deep comprehension of the entire input sequence.

Key Characteristics

  • Use bidirectional self-attention, meaning each token can attend to all other tokens in the input sentence.
  • Focus on generating an embedding (numerical representation) of the input text for various predictive tasks like classification.

Example LLMs

  • BERT (Bidirectional Encoder Representations from Transformers): One of the most prominent examples, widely used for tasks like named entity recognition, sentiment analysis, and question answering.
  • RoBERTa
  • DistilBERT

Use Cases

  • Text Classification: Determining the sentiment (positive/negative), topic, or intent of a piece of text.
  • Named Entity Recognition (NER): Identifying and classifying key information in text, such as names of people, organizations, or locations.
  • Question Answering (Extractive): Finding the exact span of text within a given document that answers a specific question.
  • Search and Information Retrieval: Matching user queries with relevant documents or passages by embedding both into a shared space.
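
Here’s a rough sketch of the encoder-only workflow using the Hugging Face transformers library (assumed installed, along with PyTorch): embed two sentences with BERT and compare them. Mean pooling over the token vectors is just one simple pooling choice, not the only one:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["This is sick. I've been vomiting since morning.",
             "This is sick! That guitar solo was insane."]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per sentence: average the token representations (simple mean pooling).
embeddings = outputs.last_hidden_state.mean(dim=1)

similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print("embedding shape:", embeddings.shape)   # (2, 768) for bert-base
print("cosine similarity:", similarity.item())
```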

Decoder-Only Models

These models are built for generating text sequence by sequence, relying on a causal (unidirectional) attention mechanism, where each token can only look at previous tokens in the sequence.

Key Characteristics

  • Primarily used for generative tasks, such as writing stories, answering open-ended queries, and conversational AI.
  • “Understanding” emerges from the scale of the model and the pre-training on next-token prediction.
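
The “causal” part is easy to picture as a mask over the attention scores. Here’s a tiny numpy sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: each token may only look at itself and
    # the tokens before it (the "unidirectional" part of decoder-only models).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(5).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# Masked positions get their attention scores set to -inf before the softmax,
# so a token never "peeks" at tokens that come after it.
```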

Example LLMs

  • GPT (Generative Pre-trained Transformer) series: The models behind ChatGPT (e.g., GPT-3, GPT-4, GPT-4o) are all decoder-only.
  • Llama (Meta AI’s models): Llama 2, Llama 3
  • Mistral/Mixtral models
  • Gemma (Google’s models)
  • Claude (Anthropic’s models)

Use Cases

  • Chatbots and Conversational AI: Engaging in natural, multi-turn dialogues and providing human-like responses.
  • Content Creation: Generating creative text formats like stories, articles, marketing copy, and poems.
  • Code Generation: Writing code snippets or translating natural language instructions into programming languages.
  • Abstractive Summarization: Generating summaries that paraphrase the original text (rather than just extracting sentences), creating new, concise content.
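
A minimal generation sketch with the Hugging Face transformers library (assumed installed); gpt2 is used here simply as a small, freely available decoder-only model, and decoding settings are illustrative:

```python
from transformers import pipeline

# Decoder-only generation is just repeated next-token prediction from the prompt onward.
generator = pipeline("text-generation", model="gpt2")
result = generator("The key idea behind attention is",
                   max_new_tokens=30, do_sample=True, temperature=0.8)
print(result[0]["generated_text"])
```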

Encoder-Decoder Based Models

These models combine the strengths of both architectures: an encoder for understanding the input and a decoder for generating the output. They are ideal for tasks that transform one sequence into another.

Key Characteristics

  • The encoder creates a rich representation of the source text, and the decoder then uses this representation via cross-attention to generate the target text.
  • Excellent for translation and summarization tasks.

Example LLMs

  • T5 (Text-to-Text Transfer Transformer): A highly influential model that treats every NLP task as a text-to-text problem.
  • BART
  • M2M100

Use Cases

  • Machine Translation: Converting text reliably from one language to another (the task the original Transformer architecture was introduced for).
  • Text Summarization (Abstractive and Extractive): Condensing long documents into concise summaries.
  • Image Captioning: Describing the content of an image in natural language (the encoder processes the image, the decoder generates the text).
  • Grammar Correction: Identifying and fixing grammatical errors in a sentence, transforming an incorrect input sequence into a correct output sequence.
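
Here’s a small text-to-text sketch with the Hugging Face transformers library (assumed installed, plus sentencepiece for the T5 tokenizer), using t5-small’s built-in translation prefix:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 treats every task as text-to-text, so the task is named in the input itself.
text = "translate English to German: The weather is nice today."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```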