Why a 99% Accurate Model Can Be Completely Useless [Intro to ML #3]

In my previous post (Intro to ML #2), I wrote about the unglamorous reality of data prep — missing values, data leakage, and feature engineering.

This time: what happens after you build a model. What makes a “good” model, and how do you actually evaluate one? This turned out to be where machine learning gets most counterintuitive.

The Overfitting Trap: A Model That Memorizes Is Useless

One of the first major concepts in machine learning is overfitting.

Here’s the analogy that made it click for me. Imagine studying for an exam by memorizing past papers word-for-word. You can answer every past question perfectly. But on the actual exam, you fall apart — because the questions are slightly different.

What you should have learned was the underlying patterns and principles — not the specific answers. You memorized noise instead of signal.

The same thing happens in machine learning. A model that fits the training data too perfectly loses the ability to generalize — to perform well on data it hasn’t seen before.

This ability to perform on new, unseen data is called generalization. A model’s goal isn’t to memorize training data — it’s to generalize well.

The Holdout Method: Test Your Model on Data It’s Never Seen

The basic technique for catching overfitting is the holdout method.

The idea is simple: split your data into two parts from the start.

  • Training data (~70%): Used to build the model
  • Test data (~30%): Used to evaluate the model — never touched during training

The only way to know if a model truly generalizes is to test it on data it didn’t train on. High accuracy on training data but low accuracy on test data is the telltale sign of overfitting.
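As a concrete sketch of the holdout method — using scikit-learn, which is my library choice here, not something the post prescribes — the split and the train-vs-test comparison look like this:

```python
# Hedged sketch: a 70/30 holdout split with scikit-learn.
# The dataset is synthetic, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 30% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A large gap between these two numbers is the telltale sign of overfitting.
print(f"train accuracy: {model.score(X_train, y_train):.3f}")
print(f"test accuracy:  {model.score(X_test, y_test):.3f}")
```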

A more robust evaluation technique is k-fold cross-validation: split the data into k parts (5 is a common choice), rotate which part serves as the validation set, and average the results across all k runs. This gives a more reliable estimate than a single train/test split.
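The rotate-and-average procedure can be sketched with scikit-learn's `cross_val_score` (again, the library choice is mine):

```python
# Hedged sketch: 5-fold cross-validation on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds takes one turn as the validation set.
scores = cross_val_score(model, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}")
print(f"mean accuracy:   {scores.mean():.3f}")
```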

When “99% Accuracy” Means Nothing

When evaluating a model, the first metric most people reach for is accuracy — the percentage of predictions that were correct. Simple and intuitive.

But it has a serious blind spot.

Imagine a test for a rare infectious disease, where only 0.01% of the population is infected. If you build a model that simply predicts “everyone is negative” — no testing, no analysis — you get 99.99% accuracy. But the model is completely useless. It hasn’t identified a single infected person.

This is the accuracy trap. When the data is imbalanced — far more negatives than positives — accuracy alone tells you almost nothing meaningful.
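The trap is easy to reproduce. A minimal sketch (the 10,000-person population is my illustrative number, scaled to the post's 0.01% prevalence):

```python
# Hedged sketch: the "predict everyone negative" baseline on imbalanced data.
import numpy as np

n = 10_000
y_true = np.zeros(n, dtype=int)
y_true[:1] = 1                     # 0.01% positive: 1 infected person in 10,000

y_pred = np.zeros(n, dtype=int)    # the "model": everyone is negative

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # fraction of actual positives caught

print(f"accuracy: {accuracy:.2%}")   # 99.99% — looks great
print(f"recall:   {recall:.0%}")     # 0% — not a single infected person found
```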

The Confusion Matrix: Looking Inside Your Predictions

That’s where the confusion matrix comes in. It breaks down predictions into four categories:

                          Predicted: Negative (−)   Predicted: Positive (+)
  Actual: Negative (−)    TN (True Negative)        FP (False Positive)
  Actual: Positive (+)    FN (False Negative)       TP (True Positive)

From these four numbers, you can calculate metrics that actually match your business goal:

  • Recall (Sensitivity) = TP ÷ (TP + FN): Of all actual positives, how many did you catch? Use when missing a case is costly (e.g., disease screening, fraud detection)
  • Precision = TP ÷ (TP + FP): Of everything you flagged as positive, how many were actually positive? Use when false alarms are costly (e.g., spam filters)
  • F1 Score: The harmonic mean of recall and precision. Use when you want to balance both
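The four cells and all three metrics fall out of a few lines of code. A sketch on toy labels, using scikit-learn's metric functions (my choice of tooling):

```python
# Hedged sketch: confusion matrix, recall, precision, and F1 on toy labels.
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")                   # TN=3 FP=1 FN=1 TP=3

print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # TP/(TP+FN) = 3/4
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # TP/(TP+FP) = 3/4
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # harmonic mean of both
```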

Which metric matters most depends entirely on the business context. For loan default prediction: do you care more about catching every potential defaulter (recall), or about making sure your approved loans are solid (precision)? The answer changes what you optimize for.

AUC: Evaluating a Model Without Picking a Threshold

The confusion matrix requires you to set a threshold — the probability cutoff above which you classify something as positive. Change the threshold, and all your metrics change too.

AUC (Area Under the ROC Curve) evaluates a model’s overall performance without needing to fix a threshold.

The intuitive interpretation: if you randomly pick one actual positive and one actual negative, AUC is the probability that the model assigns a higher score to the positive one.
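That interpretation can be verified numerically: compare every (positive, negative) pair directly and check the result against scikit-learn's `roc_auc_score`. The scores below are made-up model outputs for illustration:

```python
# Hedged sketch: AUC as the probability that a random positive
# outranks a random negative, checked against roc_auc_score.
from itertools import product

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.6, 0.35, 0.8, 0.5]  # illustrative model scores

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly (ties count half).
pairs = list(product(pos, neg))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(pairwise)                       # 5 of 6 pairs ranked correctly ≈ 0.833
print(roc_auc_score(y_true, scores))  # same number
```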

  • AUC = 0.5: No better than random guessing
  • AUC = 1.0: Perfect classification
  • In practice, 0.7–0.8 is often considered good enough

AUC is also useful for comparing different models against each other — a clean, threshold-independent benchmark.

What I Learned: “Good Model” Depends on the Goal

Overfitting, holdout method, confusion matrix, AUC — working through all of this brought one thing into focus: there’s no universal definition of a “good” model.

What are you trying to minimize? What can you not afford to miss? The right metric depends on the business problem. The goal isn’t to maximize accuracy — it’s to solve the problem at hand.

With 30 years in IT, I’d never thought about “evaluation design” this way. In system development, something either works or it doesn’t. In machine learning, you have to define what “working well enough” means — relative to a specific business objective. That shift in thinking is, I think, what makes data science genuinely interesting.

→ [Intro to ML #4 — coming soon]

Books to Go Deeper

① For Understanding Model Evaluation and Overfitting

The Hundred-Page Machine Learning Book — Andriy Burkov

Remarkably concise and clear. Covers bias-variance tradeoff, overfitting, evaluation metrics, and more — all in under 150 pages. A great reference to have on hand when concepts like AUC or regularization come up and you want a clean explanation fast.

② For Practical Guidance on Building and Evaluating Models

Machine Learning Yearning — Andrew Ng (free PDF)

Written by one of the most respected names in AI, this free guide focuses on the practical decisions that matter: how to choose evaluation metrics, diagnose overfitting vs. underfitting, and prioritize what to improve. Written in plain English — approachable even for non-engineers.
