In my previous post (Intro to ML #1), I wrote about the difference between AI, machine learning, and deep learning — and why “the essence of machine learning is simply prediction.”
This time: what happened when I actually started working with real data. The first wall I hit wasn’t building a model — it was getting the data ready.
What Does It Mean to Frame a Business Problem for AI?
The exercise used loan data from a peer-to-peer lending service. The challenge was straightforward: predict which borrowers are likely to default, and improve return on investment.
Without any filtering, roughly 15% of loans default — resulting in a net loss. If machine learning can identify high-risk borrowers upfront, profitability improves.
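To make that concrete, here's a back-of-the-envelope version of the portfolio math. The interest rate, loss-given-default, and the "model screens out half the defaulters" figure are all made-up numbers for illustration, not from the exercise:

```python
# Hypothetical numbers: 15% default rate, full principal lost on default,
# 10% interest earned on loans that are repaid.
default_rate = 0.15
interest = 0.10

# Expected return per $1 lent with no filtering
no_filter = (1 - default_rate) * interest - default_rate * 1.0
print(f"{no_filter:+.3f}")  # negative, about -0.065

# If a model could screen out half the defaulters (rate drops to 7.5%)
filtered_rate = 0.075
with_model = (1 - filtered_rate) * interest - filtered_rate * 1.0
print(f"{with_model:+.3f}")  # positive
```

Even a modest reduction in the default rate flips the expected return from negative to positive, which is why the prediction problem is worth solving at all.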
This felt like the clearest example I’d seen of what it means to use machine learning in business.
- Define a clear business KPI (in this case: default rate)
- Translate it into a prediction problem (will this borrower default? — binary classification)
- Identify and collect the data needed to make that prediction
“All roads probably lead to prediction” — working backwards from the business goal to define the right question is the key first step.
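The three steps above can be sketched in code. This is a minimal toy version with invented column names (`loan_amount`, `income`, `dti`, `defaulted`) and synthetic data; the real lending dataset's schema will differ:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "loan_amount": rng.uniform(1_000, 35_000, n),
    "income": rng.uniform(20_000, 150_000, n),
    "dti": rng.uniform(0, 40, n),  # debt-to-income ratio
})
# Synthetic target: higher debt-to-income ratio means more likely to default
df["defaulted"] = (df["dti"] + rng.normal(0, 5, n) > 30).astype(int)

X, y = df.drop(columns="defaulted"), df["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Predicted default probabilities are what you'd use to screen borrowers
probs = model.predict_proba(X_test)[:, 1]
```

The business KPI (default rate) becomes the target column, the yes/no question becomes binary classification, and everything else is data collection and preparation.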
Fighting Missing Values: Real Data Is Messy
When I opened the dataset, missing values (NAs) were everywhere. I knew in theory that real-world data is messy — but seeing it firsthand made clear just how much work it involves.
There are three main approaches to handling missing values:
- Leave them as-is (keep NA): Tree-based algorithms like decision trees, random forests, and XGBoost can handle NAs natively
- Fill them in: Replace with a representative value — mean, median, or mode
- Treat missingness as a feature: The fact that a value is missing can itself be predictive information
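In pandas, the last two options are one-liners each. A small sketch with a made-up `income` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 64_000]})

# Option 1 would be: leave the NaNs alone and feed the column to a
# tree-based model (e.g. XGBoost) that handles them natively.

# Option 2: fill with a representative value (median here)
df["income_filled"] = df["income"].fillna(df["income"].median())

# Option 3: keep the fact of missingness as its own feature
df["income_was_missing"] = df["income"].isna().astype(int)
```

Nothing stops you from combining options 2 and 3: fill the value so every model can use the column, and keep the missingness flag in case it carries signal.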
There’s no single right answer. The guiding principle: handle missing values in whatever way best serves the prediction. Each variable needs its own judgment call.
What surprised me most was realizing that the reason a value is missing often matters. It’s not just a gap to fill — sometimes it’s a signal worth paying attention to.
Data Leakage: Should You Even Be Using That Variable?
Even more eye-opening than missing values was the concept of data leakage.
Data leakage happens when information that wouldn’t be available at prediction time somehow ends up in your model.
Here’s a simple example: say you want to predict whether a free user will convert to a paid subscription. Your dataset includes whether the user has saved their credit card. If you use that as a feature, you’ll build a model with near-perfect accuracy — but it’s completely useless.
Why? Because users who saved their credit card already converted — that’s why the data exists. You’re training on the outcome itself. The model has learned nothing useful for predicting future behavior.
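You can reproduce the effect with a few lines of synthetic data. Here the leaky `has_saved_card` column is, by construction, a copy of the target, standing in for any feature recorded only after the outcome:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
converted = (rng.random(n) < 0.3).astype(int)
noise_feature = rng.normal(0, 1, n)          # carries no real signal
has_saved_card = converted.copy()            # recorded only AFTER conversion

X_leaky = pd.DataFrame({"noise": noise_feature, "has_saved_card": has_saved_card})
X_clean = X_leaky[["noise"]]

leaky_acc = cross_val_score(LogisticRegression(), X_leaky, converted).mean()
clean_acc = cross_val_score(LogisticRegression(), X_clean, converted).mean()
# leaky_acc is near-perfect; clean_acc just tracks the ~70% base rate
```

The leaky model looks brilliant in cross-validation and would be worthless in production, exactly the "suspiciously high accuracy" warning sign.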
This is data leakage. If your model’s accuracy looks suspiciously high, that’s actually a reason to be suspicious — not celebratory.
Feature Engineering: Creating New Variables from Existing Data
Beyond cleaning data, a big part of the work involves creating new, more useful variables from what you already have. This is called feature engineering.
For example: the Titanic dataset has separate columns for “number of siblings/spouses” and “number of parents/children.” Combining them into a single “family size” variable can improve prediction accuracy.
Or: instead of using a raw date, you transform it into “day of week” or “weekday vs. weekend.” That’s feature engineering too.
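Both examples are a line or two of pandas. `SibSp` and `Parch` are the actual Titanic column names; the `date` column is a made-up one for the second example:

```python
import pandas as pd

df = pd.DataFrame({
    "SibSp": [1, 0, 3],  # siblings/spouses aboard
    "Parch": [0, 2, 1],  # parents/children aboard
    "date": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-01-10"]),
})

# Combine two columns into one "family size" feature (+1 for the passenger)
df["family_size"] = df["SibSp"] + df["Parch"] + 1

# Derive day-of-week and a weekend flag from a raw date
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
```

The mechanics are trivial; the hard part is knowing that "family size" or "weekday vs. weekend" is the variable worth creating in the first place.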
What makes this interesting — and hard — is that it requires domain knowledge. Knowing which variables are likely to matter in a given industry or context is something algorithms can’t just figure out on their own. Deep learning can learn features automatically in some domains, but with structured tabular data, human expertise still plays a major role.
What I Learned: The Unglamorous 90% of Data Science
AI and data science conversations tend to focus on cutting-edge algorithms and impressive accuracy scores. But in practice, most of the time and thinking goes into understanding and preparing the data.
How do you handle missing values? Which variables are safe to use? Can you engineer something new? The quality of this “unglamorous 90%” determines model quality.
Coming from an infrastructure background, this was my biggest revelation. Building systems and working with data require fundamentally different ways of thinking. Data science feels less like “engineering” and more like asking good questions of your data.
Next time: what happens after you build a model. Overfitting, AUC, confusion matrices — how do you actually evaluate whether a model is any good?
→ [Intro to ML #3 — coming soon]
Books to Go Deeper
If you want to dig into data preprocessing and feature engineering further, here are two books worth reading.
① For a Practical Introduction to Feature Engineering
Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari (O’Reilly)
A focused, practical guide to transforming raw data into features that work well for machine learning models. Covers numeric, categorical, and text data. Clear explanations with real examples — a solid read before you start building models seriously.
② For End-to-End Data Science Practice
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — Aurélien Géron (O’Reilly)
One of the most widely recommended machine learning books for practitioners. Covers the full pipeline from data preparation to model evaluation. The first half alone — on classic machine learning — is worth the price of the book.

