Asking AI “Why Are You Quitting?” — My First Solo Machine Learning Challenge [Intro to ML #6]

In my previous post (Intro to ML #5), I covered regression — using machine learning to predict continuous values like dollar amounts.

This post and the next one are about a case I worked through in my MBA data science course: an employee attrition problem at a pharmaceutical company. What made this one different from everything before it was the setup: no hints, no guided steps — just a business problem and a dataset.

“Understanding something” and “actually being able to do it” turned out to be very different things. This case made that clear.

The Case: A Pharma Company With a Retention Problem

The main character is a newly appointed head of corporate planning at the Japanese subsidiary of a global pharmaceutical company. The subsidiary was originally an independent domestic sales company, acquired five years ago. About 1,000 employees.

At an annual regional leadership meeting, the Japanese subsidiary gets called out: its employee attrition rate is notably higher than peers across Asia and Europe. The question becomes: can we use data to understand and address this problem?

Before touching any data, the manager interviews two former employees, an HR manager, and a data administrator. What comes out is a picture that feels very real: better offers from competitors, difficulty balancing work with childcare, long working hours, and an organizational culture still carrying habits from the pre-acquisition days.

The interviews weren’t just backstory — they were the first input into forming hypotheses before looking at the data.

Step 1: Exploratory Data Analysis

The first task was to explore the basic dataset — eight variables: attrition, age, department, job role, job level, gender, performance rating, and tenure.

Simple aggregations already revealed clear patterns:

  • Attrition is heavily concentrated among younger, shorter-tenure employees (tenure under 5 years, age under 30)
  • Sales representatives, lab technicians, and HR staff show the highest attrition by job role
  • Job Level 1 (entry level) and Level 3 (middle management) have notably higher attrition
  • Gender and performance rating (which only had values of 3 or 4) showed little meaningful difference

Cross-tabulations made the picture sharper. Sales staff at Job Level 1 had a 42% attrition rate — by far the highest. And Job Level 1 accounted for nearly 60% of all 237 departures in the dataset.
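A cross-tabulation like this is a one-liner in pandas. A minimal sketch on a toy dataset (the values below are illustrative, not the case data):

```python
import pandas as pd

# Toy stand-in for the case's basic dataset (made-up rows)
df = pd.DataFrame({
    "JobLevel":   [1, 1, 1, 1, 2, 2, 3, 1, 2, 3],
    "Department": ["Sales", "Sales", "Sales", "Lab", "Sales",
                   "HR", "Sales", "Lab", "HR", "Lab"],
    "Attrition":  [1, 1, 0, 0, 0, 0, 1, 1, 0, 0],  # 1 = left the company
})

# Attrition rate per (Department, JobLevel) cell: mean of the 0/1 flag
rate = pd.crosstab(df["Department"], df["JobLevel"],
                   values=df["Attrition"], aggfunc="mean")
print(rate.round(2))
```

Cells with no employees come out as NaN; everything else is the attrition rate for that segment, which is exactly the kind of table that surfaced the Sales × Level 1 hotspot.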

Layering the interview findings on top of the numbers helped with interpretation. The high turnover among junior sales staff seemed to be a combination of: a solo-performer culture with little team collaboration, limited promotion paths, and high exposure to outside job offers.

Step 2: Messy Data and Feature Engineering

The second task brought in a more detailed dataset — and that’s where things got messy.

The overtime field contained strings like OverTimePay: 32 instead of a clean number. Date fields were stored as text. The data administrator had warned about exactly this in the interview, and the warning turned out to be entirely accurate.

This kind of data quality issue felt familiar from IT infrastructure work — real systems rarely produce clean outputs. After preprocessing, I engineered a new variable: “true working hours”, calculated as:

True working hours = monthly hours + overtime hours − hours equivalent of long-leave taken
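A minimal sketch of that cleanup and feature construction. The column names and the 8-hours-per-leave-day conversion are my assumptions, not the course dataset:

```python
import pandas as pd

# Hypothetical raw rows: overtime stored as text like "OverTimePay: 32" (column names are assumptions)
raw = pd.DataFrame({
    "monthly_hours":   [160, 160, 160],
    "overtime_raw":    ["OverTimePay: 32", "OverTimePay: 60", "OverTimePay: 8"],
    "long_leave_days": [0, 2, 1],
})

# Pull the numeric part out of the string field
raw["overtime_hours"] = raw["overtime_raw"].str.extract(r"(\d+)", expand=False).astype(int)

# True working hours = monthly hours + overtime - hours equivalent of long leave
HOURS_PER_LEAVE_DAY = 8  # assumption: one leave day offsets eight working hours
raw["true_working_hours"] = (raw["monthly_hours"]
                             + raw["overtime_hours"]
                             - raw["long_leave_days"] * HOURS_PER_LEAVE_DAY)
print(raw["true_working_hours"].tolist())  # [192, 204, 160]
```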

Plotting this against attrition rate showed a clear pattern: employees working over 200 hours per month had significantly higher attrition.

When this engineered feature was added to the model, it jumped to the top of the feature importance ranking. That was a direct demonstration of why feature engineering matters — not just as a data cleaning step, but as a way of encoding domain knowledge into the model.

How Much Did Accuracy Improve?

Here’s how the AUC (a ranking metric: 0.5 means random guessing, 1.0 a perfect model) changed with each step:

  • Detailed dataset, all features as-is: AUC 0.77
  • After feature engineering (true working hours added): AUC 0.79
  • After addressing multicollinearity (logistic regression): AUC 0.86
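A holdout AUC evaluation of this kind can be sketched with scikit-learn. Everything below is synthetic and illustrative (made-up variables, not the case data), but the shape of the workflow is the same:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: attrition driven partly by working hours
rng = np.random.default_rng(42)
n = 500
hours = rng.normal(180, 25, n)
age = rng.normal(35, 8, n)
p = 1 / (1 + np.exp(-(hours - 200) / 10))   # longer hours -> higher attrition odds
y = (rng.random(n) < p).astype(int)
X = np.column_stack([hours, age])

# Holdout split: fit on 70%, score the unseen 30%
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 2))
```

The point of the holdout is that the AUC numbers above are measured on employees the model never saw during training.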

One unexpected finding: logistic regression outperformed XGBoost on this dataset. More complex models don’t always win. What matters is the match between model complexity and data characteristics — and in this case, a simpler model generalized better.

I also saw firsthand how multicollinearity (highly correlated features included together) distorts feature importance scores. Removing correlated variables made the remaining features easier to interpret and improved accuracy.
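One common way to address this is to drop one feature from each highly correlated pair. A sketch with made-up features (the 0.9 cutoff is a judgment call, not a rule from the course):

```python
import numpy as np
import pandas as pd

# Made-up features: overtime_hours nearly duplicates monthly_hours
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = pd.DataFrame({
    "monthly_hours":  base,
    "overtime_hours": base * 0.9 + rng.normal(scale=0.1, size=n),
    "tenure":         rng.normal(size=n),
})

# Keep only the upper triangle of |corr| so each pair is checked once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature correlated above the threshold with an earlier one
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)  # ['overtime_hours']
```

With the near-duplicate removed, the coefficient (or importance) on the surviving feature absorbs the shared signal instead of splitting it arbitrarily between the two.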

Explaining the Model to Non-Technical Leadership

Once the model was built, the next challenge was communicating it to someone who doesn’t know what AUC means.

The course highlighted a useful framework: what to leave out when presenting to senior leadership.

  • Algorithm names (XGBoost, logistic regression, etc.)
  • Details of the holdout methodology
  • Abstract metrics like AUC
  • “Feature importance” scores — these can lead to blind trust in a black box

What to include instead: concrete, decision-relevant numbers. “Of the 237 employees who left, 186 (78%) were flagged as high-risk in advance.” “Among current employees, 76 fall into the at-risk categories.” That’s the kind of information that enables action.
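Turning model output into numbers like those is just counting over scored employees. A toy sketch (the scores and the 0.5 threshold are made up):

```python
# Hypothetical scored employees: (left_company, predicted_risk) pairs
scored = [(1, 0.91), (1, 0.74), (0, 0.40), (1, 0.15), (0, 0.62), (0, 0.08)]

THRESHOLD = 0.5  # assumption: risk cutoff agreed with HR

leavers = [s for s in scored if s[0] == 1]
flagged_leavers = [s for s in leavers if s[1] >= THRESHOLD]

print(f"Of the {len(leavers)} employees who left, "
      f"{len(flagged_leavers)} were flagged as high-risk in advance.")
```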

A quote came up in class that stuck with me, one often attributed to Einstein: “If you can’t explain it simply, you don’t understand it well enough.” Building a model and being able to explain it are genuinely different skills.

Key Takeaway: Garbage In, Garbage Out — But Also Gold In, Gold Out

The biggest lesson from this case: data preparation determines prediction quality.

The best algorithm in the world won’t save you from a messy input. But the flip side is also true: thoughtful feature engineering — knowing what variable to create and why — can take a model from “decent” to genuinely useful. The jump from AUC 0.77 to 0.86 came not from switching algorithms, but from understanding the data well enough to engineer a meaningful feature.

Real business data is messy by default. Accepting that and investing in preprocessing and feature engineering isn’t a workaround — it’s the core of the work.

→ [Intro to ML #7 — coming soon]

Books to Go Deeper

① For Understanding Data Preprocessing and Feature Engineering

② For the HR and People Analytics Angle