Data Science Foundations Chapter 8: Cleaning and Preparing Your Data

You know that feeling when you buy fresh ingredients for dinner, and then spend 80% of your time washing, cutting, and peeling? The actual cooking takes 20 minutes. Data science is exactly like that. The cooking is the model. The prep work is this chapter.

Mariadas and Huke use the chef comparison too. Data preparation goes by many names: transformation, wrangling, conditioning. “Wrangling” is my favorite because it captures how messy this work can be.

Here is the thing. If you skip this step, your model will suffer. Garbage in, garbage out. Bad data gives bad results. And sometimes you will not even know the results are bad.

Form: The Shape of Your Data Matters

The chapter starts with “form.” This is about the shape of your data, not the content. Think granularity, structure, types, and scale.

Granularity is the level of detail. Say you work at a call center and someone asks how many employees you need. For shift planning, you need hourly numbers. For hiring, monthly is fine. You can always roll hourly data into months. But you cannot split monthly data into hours. More granular is usually better. You can zoom out, but you cannot zoom in on data that does not exist.

The book gives a phone faults example. If your data just says “faulty phone,” you cannot figure out whether screens break more than processors. Sometimes the data is there but stored wrong. One column with all fault types versus separate columns for each. The second option is what models need. This is called one-hot encoding: splitting categories into yes/no columns with 1s and 0s.
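In pandas, for instance, one-hot encoding is a one-liner. A quick sketch with a made-up fault column (the column name and categories are my invention, not the book's):

```python
import pandas as pd

# Hypothetical fault log: one column holding every fault type.
faults = pd.DataFrame({"fault": ["Screen", "Processor", "Screen", "Battery"]})

# One-hot encode: one yes/no column per category, filled with 1s and 0s.
encoded = pd.get_dummies(faults["fault"], dtype=int)
print(encoded)
```

Each row now has a 1 in exactly one column, so a model can count screen faults separately from processor faults with no fake ordering between them.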

Structure is how data is organized. Lists, arrays, trees, tables. Tables are the most common. Often your data comes from different systems in different tables. You join them using a common key, like a date. Simple in theory. Gets complicated with many tables and complex relationships.
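Here is a toy version of that join in pandas, with two invented call-center tables keyed on date (the table names and values are made up for illustration):

```python
import pandas as pd

# Two hypothetical tables from different systems, sharing a date key.
calls = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "calls": [120, 95]})
staff = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "agents": [10, 8]})

# Join the tables on the common key.
combined = calls.merge(staff, on="date", how="inner")
print(combined)
```

With two tables this is trivial; the complexity the chapter warns about shows up when keys do not match exactly, or when one-to-many relationships silently multiply rows.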

Data types trip people up constantly. Integers, floats, booleans, strings, date/time. Your date field might look fine, but if it got stored as text instead of a proper date, your model will choke. I have seen this in production more times than I can count. Something that looks like a number is actually a string with a space at the end.
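Both of those failure modes, a date stored as text and a number with stray whitespace, have a standard pandas fix. A minimal sketch with invented values:

```python
import pandas as pd

# Hypothetical raw extract: dates and amounts both arrived as strings.
raw = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-02"],
    "amount": ["42 ", " 7"],  # numbers with stray whitespace
})

# Convert the text column into a real datetime type.
raw["date"] = pd.to_datetime(raw["date"])

# Strip the whitespace, then convert to a numeric type.
raw["amount"] = pd.to_numeric(raw["amount"].str.strip())
```

After the conversion, date arithmetic and numeric aggregation work as expected instead of failing or sorting lexicographically.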

Scale is a big one. Age ranges from 0 to 120. Daily phone usage in minutes ranges from 0 to 1440. Without adjusting, a model treats phone usage as more important just because the numbers are bigger. Feature scaling normalizes everything to the same range, usually 0 to 1. Now both features have equal weight.
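Min-max scaling, the 0-to-1 normalization described above, is a couple of lines in pandas. A sketch with invented age and usage columns:

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({"age": [20, 45, 120], "minutes": [30, 400, 1440]})

# Min-max scaling: map each column onto the 0-1 range.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```

After scaling, the smallest value in each column is 0 and the largest is 1, so neither feature dominates just because its raw numbers are bigger.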

Properties and Patterns

Be careful with categories. If you assign Screen = 1, Operating System = 2, Processor = 3, you created a ranking that does not exist. The model might think processor faults are “bigger” than screen faults. That is why one-hot encoding uses separate columns with binary values. No fake ranking.

Data sets have their own properties too. Trends and cycles can mess with models that expect stationary data. Retail sales spike before Christmas every year. Some models need you to remove these patterns first. Skewed data, where values bunch up on one side, can also throw off results and might need normalizing.

Quality Risks

The authors list the usual suspects. Missing values from data never collected or lost during transfer. Duplication from combining tables incorrectly. Small sample size, where the data is fine but there is just not enough. Outliers, like that weird day 50 trainees showed up at the call center. Bias, the classic being a hiring model trained on past hires picked by a biased recruiter. And wrong data set, where resumes of hired candidates tell you about hiring patterns but nothing about who performs well on the job.

Quality Checks

Eyeball your data. Visual inspection catches things algorithms miss. Look for blanks, weird values, duplicates. Plot it. Box plots show outliers. Histograms show skew. A quick chart saves hours of debugging.
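The outlier rule a box plot draws can also be run numerically. A sketch using the common 1.5×IQR cutoff on an invented daily-attendance series (echoing the book's 50-trainees example):

```python
import pandas as pd

# Hypothetical daily trainee counts, with one suspiciously large day.
attendance = pd.Series([5, 6, 4, 7, 5, 6, 50])

# Same logic a box plot whisker uses: flag points far outside the middle 50%.
q1, q3 = attendance.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = attendance[(attendance < q1 - 1.5 * iqr) | (attendance > q3 + 1.5 * iqr)]
print(outliers)
```

The 50 gets flagged; whether you then correct, remove, or keep it is a judgment call, which is exactly why eyeballing the plot first helps.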

Cross-check transformed data against the source. Row counts should match. Column sums should add up. And if your data updates regularly, set up a quality dashboard so you do not repeat manual checks every time.
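Those cross-checks translate directly into assertions you can keep in your pipeline. A sketch with a made-up aggregation step:

```python
import pandas as pd

# Hypothetical source table.
source = pd.DataFrame({"region": ["N", "S", "N"], "sales": [100, 200, 150]})

# A transformation: aggregate sales by region.
transformed = source.groupby("region", as_index=False)["sales"].sum()

# Cross-check 1: the column total must survive the transformation.
assert transformed["sales"].sum() == source["sales"].sum()

# Cross-check 2: row count should be one row per distinct region.
assert len(transformed) == source["region"].nunique()
```

If a join or filter upstream silently dropped or duplicated rows, one of these asserts fails immediately instead of the error surfacing weeks later in a model.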

Mitigation Strategies

When you find a problem, you have six options. Accept it and move on. Correct it if you can. Remove the bad rows or columns. Impute a value based on other data. Enrich by adding data from other sources. Or discard the whole data set if it is beyond saving.

Each option has trade-offs. Removing data can introduce bias. Imputation is an educated guess. The right choice depends on your situation.
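Imputation, for example, is often just a median fill. A minimal pandas sketch with an invented column containing one missing value:

```python
import pandas as pd

# Hypothetical usage column with a missing value.
df = pd.DataFrame({"minutes": [30.0, None, 400.0, 50.0]})

# Impute: fill the gap with the median of the observed values.
median = df["minutes"].median()  # median of 30, 400, 50
df["minutes"] = df["minutes"].fillna(median)
```

The filled-in 50 is an educated guess, not a measurement, which is the trade-off the chapter warns about: the column is now complete, but you have quietly pulled it toward the middle.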

My Take

This chapter is not glamorous. Nobody gets excited about checking data types and counting null values. But the authors are right to spend time on it. I have seen expensive projects fail because someone skipped prep work and built a beautiful model on top of messy data.

The practical tips at the end: allow plenty of time for preparation, check whether your transformations create new problems, and always ask if your data set is suitable for your question.

That last one is the most important.

