What’s Actually Inside a Generative AI? — From Language Models to Diffusion Models [Intro to ML #8]

In my previous post (Intro to ML #7), I covered the difference between traditional data analysis and AI-powered prediction — and the bias risks that come with using machine learning to make decisions about people.

This post shifts to a different branch of AI: generative AI. An entire session of my MBA data science course was dedicated to this topic — specifically, understanding how tools like ChatGPT actually work under the hood.

Understanding the mechanism changes how you use the tool. That was the core argument of the session, and by the end I agreed completely.

Where Generative AI Stands Today

The session opened with a quote from OpenAI’s Sam Altman:

“GPT-3 felt like talking to a high schooler. GPT-4 felt like a college student. GPT-5 feels like talking to a PhD-level expert. Honestly, I don’t want to use GPT-4 anymore.”

This isn’t hyperbole. On the GPQA Diamond benchmark — graduate-level scientific questions that can’t be answered by Googling — the latest models are scoring above human expert baselines. In one case study cited in class, GPT-5 Pro was given a mathematics paper, asked whether its proof could be improved, and produced a valid improvement that hadn’t been published before.

Bond Capital’s 2025 AI report painted a picture of structural tension: the cost for users is dropping dramatically while infrastructure investment (data centers, compute) is surging. The technology is accelerating. The business model is still catching up.

Predictive AI vs. Generative AI: What’s the Difference?

Posts #1 through #7 in this series were all about predictive AI — models that take input data and output a classification or a number. Will this loan default? Will this employee quit? The output is a category or a value.

Generative AI takes input and creates something new — text, images, audio, video. The output is content.

But here’s the key insight: generative AI still uses prediction internally. When an LLM generates text, what it’s actually doing is predicting the next word — over and over. “Generation” and “prediction” aren’t opposites. Generation is what emerges when prediction is applied at scale and in sequence.

The Unstructured Data Problem

Everything in posts #1–#7 dealt with structured data — spreadsheet-style tables with rows and columns, where each column is a feature that can be fed directly into a model.

But most of the world’s information is unstructured: images, audio, natural language. Computers only understand numbers — so none of this can be used as-is.

To make unstructured data usable, it needs to be converted into numerical vectors — multi-dimensional arrays of numbers. The development of efficient techniques for doing this is what enabled the generative AI explosion.

One fascinating consequence of vectorizing language: semantic arithmetic becomes possible. A well-known example: “niece − female + male ≈ nephew.” Words with similar meanings end up in similar positions in vector space, and those positions can be added and subtracted. This is why LLMs appear to “understand” meaning — they’re performing geometry in a very high-dimensional space.
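To make the semantic arithmetic concrete, here is a minimal sketch in plain Python. The tiny 3-dimensional vectors are invented purely for illustration (real embeddings are learned from data and have hundreds of dimensions), but the mechanics — subtract, add, then find the nearest word by cosine similarity — are the same:

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
vectors = {
    "niece":  [0.9, 0.1, 0.8],
    "nephew": [0.9, 0.9, 0.8],
    "female": [0.0, 0.1, 0.0],
    "male":   [0.0, 0.9, 0.0],
    "apple":  [0.1, 0.5, 0.1],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Similarity of direction: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "niece" - "female" + "male": the result lands closest to "nephew".
query = add(sub(vectors["niece"], vectors["female"]), vectors["male"])
best = max((w for w in vectors if w != "niece"),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # nephew
```

In real systems this is exactly how analogy queries over word embeddings work, just in far higher dimensions and with vocabulary sizes in the tens of thousands.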

How Language Models Work: Next-Word Prediction

The core mechanism of an LLM is, at its heart, simple: given the words so far, estimate a probability distribution over what the next word could be, then pick one.

Given the prompt “I picked up my phone to ___”, the model calculates a probability distribution over every word it knows: “check” (42%), “read” (28%), “find” (9%), “eat” (0.001%)… It samples from that distribution, picks a word, appends it, and repeats. That’s text generation.

The autocomplete on your phone uses the same mechanism — just much smaller. An LLM does the same thing, trained on trillions of words, with hundreds of billions of parameters.
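The predict–sample–append loop can be sketched with a toy hand-written probability table standing in for the model. Every word and probability below is invented for illustration; a real LLM computes these distributions from its parameters rather than looking them up:

```python
import random

# A toy next-word model: for each current word, a probability distribution
# over possible next words (all numbers invented for illustration).
NEXT_WORD = {
    "<start>": {"I": 1.0},
    "I":       {"picked": 0.6, "grabbed": 0.4},
    "picked":  {"up": 1.0},
    "up":      {"my": 1.0},
    "grabbed": {"my": 1.0},
    "my":      {"phone": 0.7, "keys": 0.3},
    "phone":   {"<end>": 1.0},
    "keys":    {"<end>": 1.0},
}

def generate(seed=0):
    """Sample the next word, append it, repeat — that's text generation."""
    rng = random.Random(seed)
    words, current = [], "<start>"
    while True:
        dist = NEXT_WORD[current]
        current = rng.choices(list(dist), weights=list(dist.values()))[0]
        if current == "<end>":
            return " ".join(words)
        words.append(current)

print(generate())
```

Different seeds produce different sentences (“I picked up my phone”, “I grabbed my keys”, …), which is also why the same prompt to an LLM can yield different answers on different runs.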

Training a raw LLM on this task produces a model that’s knowledgeable but not reliably useful — it doesn’t naturally respond the way humans expect in a conversation. That’s where RLHF (Reinforcement Learning from Human Feedback) comes in.

The process that made ChatGPT work:

  • Step 1: Fine-tune a pretrained model on human-written examples of good responses
  • Step 2: Have humans rank multiple model outputs from best to worst, and train a reward model on those rankings
  • Step 3: Use reinforcement learning to push the model toward outputs that score higher on the reward model
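Step 2 can be sketched in miniature. The snippet below trains a one-parameter “reward model” on invented preference pairs using the pairwise ranking loss commonly used for reward models, -log σ(r_chosen − r_rejected). The single feature, the data, and the learning rate are all assumptions chosen for illustration; real reward models are full neural networks scoring entire responses:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Invented preference data: (chosen, rejected) feature values, where humans
# preferred the response whose feature value is listed first.
pairs = [(0.9, 0.2), (0.8, 0.4), (0.7, 0.1)]

w = 0.0   # the reward model's only parameter; reward(f) = w * f
lr = 1.0
for _ in range(200):
    for f_chosen, f_rejected in pairs:
        # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).
        p = sigmoid(w * f_chosen - w * f_rejected)
        # Gradient of that loss with respect to w:
        grad = -(1 - p) * (f_chosen - f_rejected)
        w -= lr * grad

# The trained reward model now scores human-preferred responses higher,
# which is what Step 3's reinforcement learning then optimizes against.
print(w > 0)  # True
```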

The course analogy: pretraining is like an undergraduate education (broad general knowledge). RLHF is like graduate school — learning to behave like a professional in a specific domain, with norms and expectations built in.

From LLMs to Reasoning Models: “From Three Steps to a Hundred”

Beyond LLMs lies the next evolution: LRMs (Large Reasoning Models). By combining reinforcement learning with high-performance computing, these models are trained not just on correct answers but on the process of reasoning — step-by-step thought chains that lead to correct answers.

Masayoshi Son’s phrase — “from three-step logic to hundred-step logic” — captures what changed: the depth of reasoning chains has grown by more than an order of magnitude. OpenAI’s o1 and o3 series are examples of this approach.

Meanwhile, the competitive landscape has become more crowded. Open-source LLMs (Meta’s Llama and others) are closing the gap with closed models, and the performance difference between top-tier models has narrowed significantly.

How Image Generation Works: Diffusion Models

The dominant approach to AI image generation is the diffusion model. The intuition: destroy an image with noise, then learn to reverse the destruction.

  • Forward process: Take a real image and progressively add random noise, step by step, until nothing remains but static. This destruction is a fixed procedure; what the neural network learns is to predict the noise that was added at each step.
  • Reverse process: Run the process backwards — start from pure noise and repeatedly subtract the predicted noise until an image emerges.
  • Conditioning on prompts: At generation time, the prompt text is vectorized and used to steer the denoising — pushing the output toward images consistent with the description.
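The forward process can be sketched numerically. Treat an “image” as a short list of pixel values and repeatedly mix in Gaussian noise according to a noise schedule; the image, schedule, and step count below are all invented for illustration:

```python
import math
import random

def forward_diffuse(image, betas, seed=0):
    """Progressively noise an 'image' (a list of pixel values).

    Each step applies x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise,
    so the original signal shrinks while the noise accumulates.
    """
    rng = random.Random(seed)
    x = list(image)
    trajectory = [x]
    for beta in betas:
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * rng.gauss(0, 1)
             for v in x]
        trajectory.append(x)
    return trajectory

image = [1.0, 0.5, -0.5, -1.0]   # a tiny 4-"pixel" image
betas = [0.1] * 50               # 50 small noising steps
steps = forward_diffuse(image, betas)

# The original image's contribution shrinks by sqrt(1 - beta) each step,
# so after 50 steps only ~7% of the signal survives — essentially static.
signal_fraction = math.sqrt(1 - 0.1) ** 50
print(round(signal_fraction, 4))  # 0.0718
```

Generation runs this in reverse: start from pure noise and, at each step, subtract the noise a trained network predicts, with the prompt vector steering which image the denoising converges toward.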

Stable Diffusion and DALL-E use this mechanism. More recently, the same ideas are being applied to language models (Google’s Gemini Diffusion being one example).

Combined with video generation (OpenAI Sora, Google Veo 3) and audio synthesis (ElevenLabs), it’s now possible to produce a full video — with synchronized audio — from a single text prompt. In class, a case study was cited: Ito En used AI to generate, evaluate, and narrow down packaging designs for its Oi Ocha tea brand.

Why Understanding the Mechanism Matters

Once you understand that an LLM generates text by predicting the next word given a probability distribution, certain things click into place:

Vague prompts produce vague outputs — because vague inputs spread probability mass across many possible next words, making the model’s choices less targeted. Concrete, specific prompts narrow that distribution and improve quality. For image generation, knowing that the model “denoises toward the prompt vector” explains why more specific descriptors produce better results.
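The “spread probability mass” point can be made quantitative with Shannon entropy. The two next-word distributions below are invented for illustration: one mimics a vague prompt, the other a specific one like “I picked up my phone to ___”:

```python
import math

# Invented next-word distributions for two hypothetical prompts.
vague    = {"thing": 0.15, "stuff": 0.15, "item": 0.14, "idea": 0.14,
            "plan": 0.14, "note": 0.14, "list": 0.14}
specific = {"check": 0.70, "read": 0.20, "open": 0.08, "find": 0.02}

def entropy(dist):
    """Shannon entropy in bits: higher means the model is less certain."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# The vague prompt leaves the model far more uncertain about the next word.
print(entropy(vague) > entropy(specific))  # True
```

A concrete prompt pushes the model toward the low-entropy case: fewer plausible continuations, each more likely to be the one you actually wanted.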

Using AI as a black box and using it with mechanical understanding are different skill levels. The next post is about what to do with that understanding — how to actually get more out of generative AI through prompting, RAG, and agents.

→ [Intro to ML #9 — coming soon]

Books to Go Deeper

① For Understanding How Generative AI Works

② For Applying LLMs in Business Contexts