In my previous post (Intro to ML #1), I wrote about the difference between AI, machine learning, and deep learning — and why “the essence of machine learning is simply prediction.”
This time: what happened when I actually started working with real data. The first wall I hit wasn’t building a model — it was getting the data ready.
What Does It Mean to Frame a Business Problem for AI?
The exercise used loan data from a peer-to-peer lending service. The challenge was straightforward: predict which borrowers are likely to default, and improve return on investment.
Without any filtering, roughly 15% of loans default — resulting in a net loss. If machine learning can identify high-risk borrowers upfront, profitability improves.
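To make that concrete, here's a back-of-the-envelope version of the portfolio math. The interest rate, loss-given-default, and the "model screens out half the defaulters" figure are all made-up numbers for illustration, not from the exercise:

```python
# Hypothetical numbers: 15% default rate, full principal lost on default,
# 10% interest earned on loans that are repaid.
default_rate = 0.15
interest = 0.10

# Expected return per $1 lent with no filtering
no_filter = (1 - default_rate) * interest - default_rate * 1.0
print(f"{no_filter:+.3f}")  # negative, about -0.065

# If a model could screen out half the defaulters (rate drops to 7.5%)
filtered_rate = 0.075
with_model = (1 - filtered_rate) * interest - filtered_rate * 1.0
print(f"{with_model:+.3f}")  # positive
```

Even a modest reduction in the default rate flips the expected return from negative to positive, which is why the prediction problem is worth solving at all.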
This felt like the clearest example I’d seen of what it means to use machine learning in business.
- Define a clear business KPI (in this case: default rate)
- Translate it into a prediction problem (will this borrower default? — binary classification)
- Identify and collect the data needed to make that prediction
“All roads probably lead to prediction” — working backwards from the business goal to define the right question is the key first step.
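The three steps above can be sketched in code. This is a minimal toy version with invented column names (`loan_amount`, `income`, `dti`, `defaulted`) and synthetic data; the real lending dataset's schema will differ:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "loan_amount": rng.uniform(1_000, 35_000, n),
    "income": rng.uniform(20_000, 150_000, n),
    "dti": rng.uniform(0, 40, n),  # debt-to-income ratio
})
# Synthetic target: higher debt-to-income ratio means more likely to default
df["defaulted"] = (df["dti"] + rng.normal(0, 5, n) > 30).astype(int)

X, y = df.drop(columns="defaulted"), df["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Predicted default probabilities are what you'd use to screen borrowers
probs = model.predict_proba(X_test)[:, 1]
```

The business KPI (default rate) becomes the target column, the yes/no question becomes binary classification, and everything else is data collection and preparation.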
Fighting Missing Values: Real Data Is Messy
When I opened the dataset, missing values (NAs) were everywhere. I knew in theory that real-world data is messy — but seeing it firsthand made clear just how much work it involves.
There are three main approaches to handling missing values:
- Leave them as-is (keep NA): Tree-based algorithms like decision trees, random forests, and XGBoost can handle NAs natively
- Fill them in: Replace with a representative value — mean, median, or mode
- Treat missingness as a feature: The fact that a value is missing can itself be predictive information
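In pandas, the last two options are one-liners each. A small sketch with a made-up `income` column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 72_000, np.nan, 64_000]})

# Option 1 would be: leave the NaNs alone and feed the column to a
# tree-based model (e.g. XGBoost) that handles them natively.

# Option 2: fill with a representative value (median here)
df["income_filled"] = df["income"].fillna(df["income"].median())

# Option 3: keep the fact of missingness as its own feature
df["income_was_missing"] = df["income"].isna().astype(int)
```

Nothing stops you from combining options 2 and 3: fill the value so every model can use the column, and keep the missingness flag in case it carries signal.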
There’s no single right answer. The guiding principle: handle missing values in whatever way best serves the prediction. Each variable needs its own judgment call.
What surprised me most was realizing that the reason a value is missing often matters. It’s not just a gap to fill — sometimes it’s a signal worth paying attention to.
Data Leakage: Should You Even Be Using That Variable?
Even more eye-opening than missing values was the concept of data leakage.
Data leakage happens when information that wouldn’t be available at prediction time somehow ends up in your model.
Here’s a simple example: say you want to predict whether a free user will convert to a paid subscription. Your dataset includes whether the user has saved their credit card. If you use that as a feature, you’ll build a model with near-perfect accuracy — but it’s completely useless.
Why? Because users who saved their credit card already converted — that’s why the data exists. You’re training on the outcome itself. The model has learned nothing useful for predicting future behavior.
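You can reproduce the effect with a few lines of synthetic data. Here the leaky `has_saved_card` column is, by construction, a copy of the target, standing in for any feature recorded only after the outcome:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
converted = (rng.random(n) < 0.3).astype(int)
noise_feature = rng.normal(0, 1, n)          # carries no real signal
has_saved_card = converted.copy()            # recorded only AFTER conversion

X_leaky = pd.DataFrame({"noise": noise_feature, "has_saved_card": has_saved_card})
X_clean = X_leaky[["noise"]]

leaky_acc = cross_val_score(LogisticRegression(), X_leaky, converted).mean()
clean_acc = cross_val_score(LogisticRegression(), X_clean, converted).mean()
# leaky_acc is near-perfect; clean_acc just tracks the ~70% base rate
```

The leaky model looks brilliant in cross-validation and would be worthless in production, exactly the "suspiciously high accuracy" warning sign.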
This is data leakage. If your model’s accuracy looks suspiciously high, that’s actually a reason to be suspicious — not celebratory.
Feature Engineering: Creating New Variables from Existing Data
Beyond cleaning data, a big part of the work involves creating new, more useful variables from what you already have. This is called feature engineering.
For example: the Titanic dataset has separate columns for “number of siblings/spouses” and “number of parents/children.” Combining them into a single “family size” variable can improve prediction accuracy.
Or: instead of using a raw date, you transform it into “day of week” or “weekday vs. weekend.” That’s feature engineering too.
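Both examples are a line or two of pandas. `SibSp` and `Parch` are the actual Titanic column names; the `date` column is a made-up one for the second example:

```python
import pandas as pd

df = pd.DataFrame({
    "SibSp": [1, 0, 3],  # siblings/spouses aboard
    "Parch": [0, 2, 1],  # parents/children aboard
    "date": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-01-10"]),
})

# Combine two columns into one "family size" feature (+1 for the passenger)
df["family_size"] = df["SibSp"] + df["Parch"] + 1

# Derive day-of-week and a weekend flag from a raw date
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
```

The mechanics are trivial; the hard part is knowing that "family size" or "weekday vs. weekend" is the variable worth creating in the first place.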
What makes this interesting — and hard — is that it requires domain knowledge. Knowing which variables are likely to matter in a given industry or context is something algorithms can’t just figure out on their own. Deep learning can learn features automatically in some domains, but with structured tabular data, human expertise still plays a major role.
What I Learned: The Unglamorous 90% of Data Science
AI and data science conversations tend to focus on cutting-edge algorithms and impressive accuracy scores. But in practice, most of the time and thinking goes into understanding and preparing the data.
How do you handle missing values? Which variables are safe to use? Can you engineer something new? The quality of this “unglamorous 90%” determines model quality.
Coming from an infrastructure background, this was my biggest revelation. Building systems and working with data require fundamentally different ways of thinking. Data science feels less like “engineering” and more like asking good questions of your data.
Next time: what happens after you build a model. Overfitting, AUC, confusion matrices — how do you actually evaluate whether a model is any good?
→ [Intro to ML #3 — coming soon]
Books to Go Deeper
If you want to dig into data preprocessing and feature engineering further, here are two books worth reading.
① For a Practical Introduction to Feature Engineering
Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari (O’Reilly)
A focused, practical guide to transforming raw data into features that work well for machine learning models. Covers numeric, categorical, and text data. Clear explanations with real examples — a solid read before you start building models seriously.
② For End-to-End Data Science Practice
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — Aurélien Géron (O’Reilly)
One of the most widely recommended machine learning books for practitioners. Covers the full pipeline from data preparation to model evaluation. The first half alone — on classic machine learning — is worth the price of the book.

