Cleaning and Transforming Data - Study Notes from Data Engineering with Python Ch 5

You can build the best pipeline in the world. You can read files, write to databases, schedule everything with Airflow. But if the data going through that pipeline is messy, none of it matters.

Chapter 5 of Data Engineering with Python by Paul Crickard is about that messy middle step. The “T” in ETL. You extract data, you load it somewhere, but in between you need to clean it, reshape it, and sometimes enrich it with outside sources. This chapter walks through all of that using pandas and a real dataset.

The Dataset: Albuquerque E-Scooter Trips

Crickard uses a real dataset from the City of Albuquerque. It contains e-scooter trip records from May to July 2019. About 34,000 rows with columns like trip ID, vehicle ID, start and end locations, duration, timestamps, and user IDs.

It is messy in all the normal ways. Some columns have inconsistent naming (one column is ALL CAPS while the rest are lowercase). Date fields are stored as plain text. There are missing values. The typical stuff you run into with real data.

Step 1: Exploratory Data Analysis (EDA)

Before you fix anything, you need to understand what you are working with. Crickard calls this exploratory data analysis, or EDA.

Here is how it works. You load your CSV into a pandas DataFrame, then start poking around.

Check the basics first:

  • df.columns shows you what columns exist
  • df.dtypes tells you what type each column is (integer, object, datetime, etc.)
  • df.head() and df.tail() show the first and last few rows
  • df.sample(5) grabs random rows so you see variety in the data
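As a minimal sketch, here is what that first pass looks like on toy data standing in for the scooter dataset (the column names are assumed from the chapter's description):

```python
import pandas as pd

# Toy stand-in for the scooter data (column names assumed for illustration)
df = pd.DataFrame({
    'trip_id': [1, 2, 3, 4],
    'DURATION': ['0:07:03', None, '0:12:45', '0:03:10'],
    'started_at': ['5/21/2019 18:33', '5/21/2019 19:07',
                   '6/03/2019 10:15', '7/01/2019 08:02'],
})

print(df.columns)    # what columns exist
print(df.dtypes)     # DURATION and started_at come in as 'object' (plain text)
print(df.head(2))    # first rows
print(df.sample(2))  # random rows, to see variety
```

Notice that the date column already shows up as `object` here, which foreshadows the type problems later in the chapter.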

Look at specific columns:

You can pull a single column like df['DURATION'] or multiple columns by passing a list. This is useful when you want to focus on just the fields that matter.

Filter rows with conditions:

Pandas lets you filter using conditions. You can use df.where() or the bracket notation df[(df['user_id']==8417864)]. The bracket notation is cleaner because it does not fill non-matching rows with NaN values.

You can combine conditions too. Store each condition in a variable, then combine them with & for AND logic.
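A quick sketch of the difference, again on toy data (the user ID is the one the chapter uses; everything else is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [8417864, 8417864, 1234567],
    'duration': [5, 12, 7],
})

# where() keeps the original shape, filling non-matching rows with NaN
with_nans = df.where(df['user_id'] == 8417864)

# Bracket notation drops non-matching rows entirely
one_user = df[df['user_id'] == 8417864]

# Combining conditions: store each one, then join with & for AND logic
is_user = df['user_id'] == 8417864
is_long = df['duration'] > 6
both = df[is_user & is_long]

print(len(with_nans), len(one_user), len(both))  # 3 2 1
```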

Get statistical summaries:

  • df.describe() gives you count, mean, standard deviation, min, max, and quartiles for numeric columns. For text columns, it shows count, unique values, and the most frequent value.
  • df['column'].value_counts() shows every unique value and how often it appears. Pass normalize=True to see percentages instead of raw counts.
  • df.isnull().sum() is the quickest way to find missing values across all columns.

In the scooter data, this revealed that the DURATION column had 2,308 null values and the end_location_name column was missing 2,070 entries. Those are the kinds of problems you need to know about before you start cleaning.
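The summary calls above can be sketched on a small frame with deliberate nulls (toy data, not the real counts from the book):

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of nulls, standing in for the scooter data
df = pd.DataFrame({
    'duration': [7.0, np.nan, 12.0, np.nan, 3.0],
    'vehicle_id': ['a1', 'a1', 'b2', 'a1', 'c3'],
})

print(df.describe())                              # count/mean/std/quartiles (numeric)
print(df['vehicle_id'].value_counts())            # a1: 3, b2: 1, c3: 1
print(df['vehicle_id'].value_counts(normalize=True))  # fractions instead of counts
print(df.isnull().sum())                          # duration: 2, vehicle_id: 0
```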

Step 2: Dropping Rows and Columns

Once you know what is wrong, you start removing what you do not need.

Dropping columns you do not need:

The dataset had a region_id column that was the same value (202) for every single row. It adds nothing. You drop it with df.drop(columns=['region_id'], inplace=True).

The inplace=True part is important. Without it, pandas returns a new DataFrame and leaves the original unchanged. With it, the original gets modified directly.
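A two-line sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({
    'trip_id': [1, 2],
    'region_id': [202, 202],  # same value everywhere, carries no information
})

# Without inplace=True, drop() returns a new frame and leaves df untouched
dropped = df.drop(columns=['region_id'])
assert 'region_id' in df.columns

# With inplace=True, the original frame is modified directly
df.drop(columns=['region_id'], inplace=True)
assert 'region_id' not in df.columns
```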

Dropping rows with missing data:

The dropna() method removes rows with null values. But here is the thing: if you just call dropna() with no parameters, it will remove any row that has any null value in any column. That might wipe out thousands of rows you actually need.

The smarter approach is to use the subset parameter. For example, df.dropna(subset=['start_location_name'], inplace=True) only drops rows where that specific column is null. The scooter data had 6 rows with no start location. Those are useless, so they get removed.

You can also set a thresh parameter, which is the minimum number of non-null values a row or column needs in order to be kept. Say you only want to drop a column if more than 25% of its values are null: you calculate the threshold (75% of the row count) and pass it in along with axis=1.
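Here are all three variants side by side on toy data (the column names follow the scooter dataset; the row counts are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'trip_id': [1, 2, 3, 4],
    'start_location_name': ['A St', np.nan, 'C Ave', 'D Blvd'],
    'end_location_name': [np.nan, np.nan, 'E St', np.nan],
})

# dropna() with no parameters drops any row with any null -- too aggressive
assert len(df.dropna()) == 1

# subset limits the null check to the column you actually care about
cleaned = df.dropna(subset=['start_location_name'])
assert len(cleaned) == 3

# thresh keeps only columns with at least that many non-null values
# (here: drop any column that is more than 50% null)
wide = df.dropna(axis=1, thresh=int(len(df) * 0.5) + 1)
assert 'end_location_name' not in wide.columns
```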

Filling nulls instead of dropping:

Sometimes you do not want to drop nulls. You want to replace them with a default value. That is what fillna() does. You can fill all nulls with one value, or you can pass a dictionary to fill different columns with different values. For example, filling missing start locations with one address and missing end locations with another.
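The dictionary form can be sketched like this (the replacement addresses are placeholders, not the ones from the book):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start_location_name': ['A St', np.nan],
    'end_location_name': [np.nan, 'B Ave'],
})

# A dict fills different columns with different default values
df.fillna({
    'start_location_name': 'Unknown start',
    'end_location_name': 'Unknown end',
}, inplace=True)

print(df.isnull().sum().sum())  # 0
```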

Dropping rows based on conditions:

What if you want to remove all rows from a specific month? Filter the DataFrame to get those rows, then pass their index to drop(). Crickard shows this by filtering out all May data and dropping it from the original frame.

Step 3: Creating and Modifying Columns

Fix column name casing:

The scooter data had one column named DURATION while everything else was lowercase. Inconsistent naming causes bugs. A data scientist querying your data warehouse should not have to remember that one random column is all caps.

You can fix all column names at once with a list comprehension: df.columns = [x.lower() for x in df.columns]. Or rename specific columns using df.rename(columns={'DURATION': 'duration'}, inplace=True).

The same works for values inside columns. Use .str.upper(), .str.lower(), or .str.capitalize() on a column to change its values.
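Both renaming approaches, plus the value-casing trick, in one sketch:

```python
import pandas as pd

df = pd.DataFrame({'trip_id': [1], 'DURATION': ['0:07:03'], 'region': ['abq']})

# Lowercase every column name in one pass
df.columns = [x.lower() for x in df.columns]
assert 'duration' in df.columns

# Or rename just the offending column
df2 = pd.DataFrame({'DURATION': ['0:07:03']})
df2.rename(columns={'DURATION': 'duration'}, inplace=True)

# The same idea applies to values inside a column
df['region'] = df['region'].str.upper()
print(df['region'].tolist())  # ['ABQ']
```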

Split columns into multiple columns:

The started_at field contained both date and time in one string, like “5/21/2019 18:33”. Using str.split(expand=True), you can break that into separate date and time columns. The expand=True parameter creates new columns instead of putting a list into a single column.
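A minimal sketch of the split (with no separator argument, str.split breaks on whitespace, which is exactly what the date/time string needs):

```python
import pandas as pd

df = pd.DataFrame({'started_at': ['5/21/2019 18:33', '5/23/2019 09:10']})

# expand=True spreads the pieces into separate columns
# instead of putting a list into a single column
new = df['started_at'].str.split(expand=True)
df['date'] = new[0]
df['time'] = new[1]

print(df[['date', 'time']].iloc[0].tolist())  # ['5/21/2019', '18:33']
```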

Fix data types:

Here is the problem with the scooter data. The started_at column looked like a datetime, but pandas read it as a plain text object. That means date comparisons do not work properly. Filtering for dates after “2019-05-23” returned all 34,226 rows because it was doing string comparison, not date comparison.

The fix is pd.to_datetime(). You pass the column and the format string, and pandas converts it to a proper datetime type. After that, date filtering works as expected.
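Here is the bug and the fix side by side, on two toy rows (one before May 23, one after):

```python
import pandas as pd

df = pd.DataFrame({'started_at': ['5/21/2019 18:33', '6/29/2019 10:15']})

# As strings, this is a character comparison: '5' sorts after '2',
# so every row "passes" the filter, even the May 21 trip
assert (df['started_at'] > '2019-05-23').all()

# Convert with an explicit format, and the comparison becomes a real date check
df['started_at'] = pd.to_datetime(df['started_at'], format='%m/%d/%Y %H:%M')
after = df[df['started_at'] > '2019-05-23']
print(len(after))  # 1 -- only the June trip survives
```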

Add columns based on conditions:

You can iterate through a DataFrame with iterrows() and set values based on conditions. But that is slow. The better way is to use df.loc[] with a condition: df.loc[df['trip_id']==1613335, 'new_column'] = 'some_value'. Much faster, same result.
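The .loc pattern in one line (the trip ID is the one the chapter uses; the new column name is made up):

```python
import pandas as pd

df = pd.DataFrame({'trip_id': [1613335, 1613336], 'duration': [5, 9]})

# Vectorized conditional assignment -- no iterrows() loop needed
df.loc[df['trip_id'] == 1613335, 'flagged'] = 'check'

print(df['flagged'].tolist())  # ['check', nan]
```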

Step 4: Enriching Data

Sometimes your data is clean but incomplete. The scooter data had street addresses but no GPS coordinates. If you want to put this on a map or do spatial analysis, you need latitude and longitude.

Crickard shows how to enrich data by combining it with external sources. The approach:

  1. Take the unique starting addresses (no need to geocode duplicates)
  2. Clean the addresses (split off city/state/zip, replace “@” with “and” for intersections)
  3. Geocode them using an external service (in this case, Albuquerque’s public geocoder)
  4. Join the coordinates back to the original data
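Steps 1 and 2 can be sketched without a network connection (the addresses are invented; the geocoding call itself needs a live service, so it is left out here):

```python
import pandas as pd

df = pd.DataFrame({'start_location_name': [
    'Central @ Broadway, Albuquerque, NM 87102',
    'Central @ Broadway, Albuquerque, NM 87102',  # duplicate -- geocode once
    '100 Gold Ave SW, Albuquerque, NM 87102',
]})

# 1. Unique addresses only -- no need to geocode duplicates
unique = pd.DataFrame(df['start_location_name'].unique(),
                      columns=['start_location_name'])

# 2. Split off city/state/zip, and rewrite '@' as 'and' for intersections
unique['street'] = (unique['start_location_name']
                    .str.split(',', expand=True)[0]
                    .str.replace('@', 'and'))

print(unique['street'].tolist())
# ['Central and Broadway', '100 Gold Ave SW']
```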

Joining DataFrames:

Pandas gives you two main ways to combine DataFrames: join() and merge().

join() works on the index and can produce duplicate columns. You end up with columns like street_new and street_geo, which is messy.

merge() is cleaner. You specify the column to join on, and it only keeps one copy of the shared column. pd.merge(new, geo, on='street') gives you the original data plus the new x and y coordinate columns without duplicates.
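A sketch of the merge, with made-up coordinates standing in for the geocoder output:

```python
import pandas as pd

trips = pd.DataFrame({'trip_id': [1, 2],
                      'street': ['Central and Broadway', 'Gold Ave']})
geo = pd.DataFrame({'street': ['Central and Broadway', 'Gold Ave'],
                    'x': [-106.65, -106.64],   # longitude
                    'y': [35.08, 35.09]})      # latitude

# merge keeps a single copy of the shared 'street' column
enriched = pd.merge(trips, geo, on='street')
print(list(enriched.columns))  # ['trip_id', 'street', 'x', 'y']
```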

Step 5: Putting It All Together in Airflow

The chapter finishes by wrapping everything into an Airflow DAG. The idea is simple: take the cleaning steps you just learned and turn them into Python functions, then chain those functions as tasks in a pipeline.

The DAG has three tasks:

  1. Clean task: Read the raw CSV, drop unnecessary columns, fix column names, convert date types, write out a cleaned CSV
  2. Filter task: Read the cleaned CSV, filter rows between two dates, write out the filtered CSV
  3. Copy task: Use a BashOperator to copy the filtered file to another location

The tasks run in sequence: clean, then filter, then copy. Crickard uses PythonOperator for the first two tasks and BashOperator for the file copy.
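The full DAG needs a running Airflow install, so here is just a sketch of the two Python task bodies as standalone functions (the function names and file paths are hypothetical); in the DAG, each would be wrapped in a PythonOperator, with a BashOperator running something like `cp src dst` for the final step:

```python
import os
import tempfile
import pandas as pd

def clean_scooter(in_path, out_path):
    """Clean task: drop a useless column, fix names, convert dates."""
    df = pd.read_csv(in_path)
    df.drop(columns=['region_id'], inplace=True)
    df.columns = [x.lower() for x in df.columns]
    df['started_at'] = pd.to_datetime(df['started_at'],
                                      format='%m/%d/%Y %H:%M')
    df.to_csv(out_path, index=False)

def filter_dates(in_path, out_path, start, end):
    """Filter task: keep only rows between two dates."""
    df = pd.read_csv(in_path, parse_dates=['started_at'])
    mask = (df['started_at'] > start) & (df['started_at'] < end)
    df[mask].to_csv(out_path, index=False)

# Tiny end-to-end run on toy data
tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, 'scooter.csv')
pd.DataFrame({
    'region_id': [202, 202],
    'trip_id': [1, 2],
    'DURATION': ['0:07:03', '0:09:00'],
    'started_at': ['5/21/2019 18:33', '6/29/2019 10:15'],
}).to_csv(raw, index=False)

clean_scooter(raw, os.path.join(tmp, 'clean.csv'))
filter_dates(os.path.join(tmp, 'clean.csv'), os.path.join(tmp, 'may.csv'),
             '2019-05-01', '2019-06-01')
print(len(pd.read_csv(os.path.join(tmp, 'may.csv'))))  # 1
```

Because the functions read and write files rather than passing DataFrames in memory, each task can fail and be retried independently, which is the point of splitting the pipeline this way.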

One useful lesson from this section: Crickard originally used “move” instead of “copy” in the BashOperator. That caused the DAG to fail on subsequent runs because the file was no longer there. When you are working with files in pipelines, be careful about whether you move or copy. Moving can break things on re-runs.

Key Takeaways

  • Always explore before you clean. Use describe(), value_counts(), isnull().sum(), and dtypes to understand your data before you start changing it.
  • Drop what you do not need. Useless columns and rows with critical missing values should go. Use subset in dropna() to be precise about which nulls matter.
  • Be consistent with naming. Column name casing should be uniform. Pick lowercase, uppercase, or whatever your team uses, and stick with it.
  • Fix your data types early. String dates will bite you. Convert them to datetime as one of your first cleaning steps.
  • Use merge over join when possible. It is cleaner and avoids duplicate columns.
  • Enrich when needed. Your data does not have to stay limited to what was collected. Geocoding, lookups, and joins with external data can add significant value.
  • Wrap it in Airflow. Once your cleaning logic works in Python, packaging it into an Airflow DAG makes it repeatable and schedulable.

My Take

This is a solid practical chapter. The scooter dataset is a good teaching tool because it has real problems: missing values, inconsistent naming, wrong data types, and addresses that need enrichment. These are the exact issues you will hit in production.

If I had to add one thing, it would be more emphasis on validation after cleaning. Crickard shows you how to fix problems, but in a production pipeline you also want to verify that your cleaning actually worked. Row counts before and after, null checks on critical columns, that sort of thing. But that may come in later chapters.

The Airflow section is brief but makes the right point: cleaning code belongs in a pipeline, not in a one-off script. If you have to clean data once, you will have to clean it again.

