In my previous post (Intro to ML #3), I covered confusion matrices, AUC, and why accuracy alone is a misleading metric.
This time: what happens when you take a model and actually try to use it. This is where machine learning stopped feeling like a technical exercise and started feeling like a business problem.
A Classification Model Outputs Probabilities, Not Answers
To calculate a confusion matrix, you first need to set a threshold.
Here’s something that surprised me: a classification model doesn’t directly output “positive” or “negative.” It outputs a probability — a number between 0 and 1. Something like 0.65 or 0.12. Then you apply a threshold to convert that probability into a category.
Most tools default to 0.5 — above that, it’s classified as positive. Below that, negative. But that threshold is a choice, not a given. And where you set it changes everything:
- Higher threshold (e.g., 0.8): Only flag something as positive when very confident → Precision goes up, but you miss more cases
- Lower threshold (e.g., 0.2): Flag almost anything as positive → Recall goes up, but false alarms increase
Neither is universally better. The right threshold depends entirely on the business context and what kind of error you can afford.
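A minimal sketch of what moving the threshold does, using toy probabilities and labels I made up (not the lending data discussed later in this post):

```python
# The same model scores yield different precision/recall at each threshold.

def classify(probs, threshold):
    """Convert probabilities into 0/1 predictions at a given threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def precision_recall(preds, labels):
    """Compute precision and recall from binary predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

probs  = [0.95, 0.85, 0.70, 0.55, 0.40, 0.30, 0.15, 0.05]  # model outputs
labels = [1,    1,    0,    1,    1,    0,    0,    0]      # ground truth

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall(classify(probs, t), labels)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

With these toy numbers, the 0.8 threshold gives perfect precision but misses half the positives, while 0.2 catches every positive at the cost of more false alarms. The model's scores never changed; only the human-chosen threshold did.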
Recall vs. Precision: Three Business Examples
Working through concrete scenarios made the recall-precision tradeoff much more intuitive.
① Smartphone iris authentication
The catastrophic error here is authenticating a stranger: letting the wrong person unlock your phone. Everything the model flags as "authenticated" needs to actually be the right person. → Prioritize precision.
② Home security sensor
The catastrophic error is missing an actual intruder. A false alarm is annoying, but an undetected break-in is far worse. → Prioritize recall.
③ Criminal investigation
This one has no clean answer. If the goal is solving cases and catching perpetrators → recall. If the goal is preventing wrongful conviction → precision. The “right” metric depends on which failure you consider more serious — and that’s a values question, not a technical one.
What stuck with me: the number doesn’t tell you which error matters more. That judgment belongs to people, not models.
Putting It Into Practice: What the Model Actually Did
With this in mind, I ran the same exercise from the previous session using peer-to-peer lending data.
Without any filtering: roughly 15% of loans default, resulting in a net loss of $220,000.
Using the model to select only the top 25% of loans with the lowest predicted default probability, the default rate dropped to 6.74%.
The math:
$40M × (1 − 0.0674) × 1.17 ≈ $43.65M → A swing from −$220K to +$3.65M per year.
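The arithmetic checks out in a few lines. The interpretation of the factors is my own reading of the post's numbers, not something stated explicitly: $40M in loans issued, defaulted loans treated as a total loss, and 1.17 as the gross return multiple on repaid loans.

```python
# Reproducing the lending calculation from the post.
loans_issued = 40_000_000   # assumed: total principal issued
default_rate = 0.0674       # after filtering to the top 25% of loans
return_multiple = 1.17      # assumed: gross return on loans that repay

# Money returned: only non-defaulting loans repay, at 1.17x.
repaid = loans_issued * (1 - default_rate) * return_multiple
profit = repaid - loans_issued
print(f"returned: ${repaid:,.0f}, profit: ${profit:,.0f}")
```

That lands on roughly $43.65M returned and about $3.65M in profit, matching the swing described above.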
But here’s the important caveat: the model didn’t make that decision. A human decided to filter to the top 25%, set the threshold, and choose to accept the tradeoff of issuing fewer loans. The model provided the inputs; the judgment call was human.
A Model Isn’t “Done” When It’s Built
One more thing that stuck: a model degrades over time.
Once you deploy a model and start using it in the real world, its accuracy will eventually decline. Reasons include:
- Data formats change
- Business conditions or market behavior shift
- The historical data it was trained on becomes less representative
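What monitoring can look like in its simplest form. This is a hypothetical sketch with made-up numbers and a made-up tolerance; real pipelines track more than one metric, but the shape of the check is the same: compare recent performance against the baseline measured at deployment, and flag the model when the gap grows too large.

```python
def needs_retraining(baseline_acc, recent_acc, tolerance=0.05):
    """Flag the model when accuracy has degraded beyond the tolerance."""
    return (baseline_acc - recent_acc) > tolerance

print(needs_retraining(0.91, 0.89))  # small dip: keep monitoring
print(needs_retraining(0.91, 0.82))  # large drop: time to retrain
```

Even the retraining decision involves judgment: someone has to decide what counts as an acceptable drop and which metric matters for this particular business.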
Coming from an IT infrastructure background, this was a genuine mindset shift. Systems you build are expected to keep working. Machine learning models require ongoing monitoring and periodic retraining. It’s less like “deploying a server” and more like “maintaining a garden.”
What I Learned: ML Raises the Quality of Human Judgment
Thresholds, recall, precision, model maintenance — working through all of this brought a single idea into focus: machine learning doesn’t automate decisions. It improves the quality of the inputs that humans use to make decisions.
Where you set the threshold. Which errors you’re willing to accept. Whether to prioritize revenue or risk reduction. These are judgment calls that require knowing the business context — and no model can make them for you.
AI isn’t building a world where we think less. It’s creating a world where we need to think more carefully about the right questions to ask.
→ [Intro to ML #5 — coming soon]
Books to Go Deeper
① For Understanding Business Decisions with Data
Data Science for Business — Foster Provost & Tom Fawcett (O’Reilly)
Written specifically for business professionals who want to understand how data science models work and how to use them in decision-making. Covers classification, regression, evaluation metrics, and the business framing of ML problems. The chapter on expected value and decision-making with classifiers is directly relevant to everything covered in this post.
② For the Business Strategy Side of AI Deployment
Competing on Analytics — Thomas H. Davenport & Jeanne G. Harris (Harvard Business Review Press)
A classic on how organizations actually build competitive advantage through data and analytics. Useful context for anyone thinking about the “so what” after you have a working model — how do you embed it into a business process, and what does it take to act on the output systematically?