Reading and Writing Files in Python - Study Notes from Data Engineering with Python Ch 3

Chapter 3 is where Crickard moves from setup to actual work. You installed all those tools in Chapter 2. Now you use them. The chapter covers one of the most fundamental tasks in data engineering: getting data out of text files and into something useful.

The three tools in play here are Python (with its standard libraries), Apache Airflow, and Apache NiFi. Each one handles file processing differently, and Crickard walks through all three.

Generating Fake Data with Faker

Before you can read files, you need files to read. Crickard introduces the Faker library, which generates realistic fake data. Names, addresses, ages, zip codes, coordinates. You install it with pip and then call simple methods like fake.name() or fake.street_address().

This is a practical choice. Instead of hunting for sample datasets, you just generate exactly what you need. Throughout the chapter, Crickard uses Faker to create 1,000 records for testing.

CSV Files: The Bread and Butter

CSV is the most common file format you will encounter in data engineering. Comma-separated values: a simple concept, but commas inside text fields can break naive parsing. That is why quoting (and escape characters) exists.

Writing CSVs with Python’s Built-in Library

Python has a csv module in its standard library. Here is how it works at a high level:

  1. Open a file in write mode ('w')
  2. Create a writer object with csv.writer()
  3. Write a header row with writerow()
  4. Write data rows the same way
  5. Close the file

The writer handles row endings automatically (it terminates each row with \r\n by default); opening the file with newline='' prevents Python's own newline translation from adding blank rows on Windows. You can also configure the delimiter, quoting behavior, and dialect if you need something other than the defaults.

For bulk data, you loop through and write rows one at a time. Crickard generates 1,000 records with Faker inside a for loop, writing each row with fields like name, age, street, city, state, zip, longitude, and latitude.

Reading CSVs

Reading follows a similar pattern but with a useful twist. Instead of the basic csv.reader(), Crickard recommends csv.DictReader(). The dictionary reader lets you access fields by name instead of position. So instead of row[0], you write row['name']. Much more readable, much less error-prone.

You can also use with open(...) to open the file, which handles closing automatically. No need to remember file.close().
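Both ideas together look roughly like this (the sample file and names are made up so the snippet stands alone):

```python
import csv

# Write a tiny sample file so the example is self-contained
with open("people.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "age"])
    writer.writerow(["Alice", "34"])
    writer.writerow(["Bob", "47"])

# DictReader keys each row by the header row, so fields are
# accessed by name instead of position
with open("people.csv") as f:
    rows = list(csv.DictReader(f))

print(rows[0]["name"])  # row["name"], not row[0]
```

Note that DictReader gives you every field back as a string; converting ages to integers is up to you.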

CSVs with pandas

Then there is pandas. The pd.read_csv() function loads a CSV straight into a DataFrame. Think of a DataFrame as a spreadsheet in memory: rows, columns, and an index.

pandas is heavier than the built-in csv module, but it gives you a lot more power. You can peek at data with df.head(10), query it, transform it, and export it back out with df.to_csv(). One useful tip from the chapter: pass index=False when exporting to CSV, otherwise pandas writes the row numbers as an extra column with a blank header. That catches a lot of beginners off guard.

You can also build DataFrames from scratch using Python dictionaries and then export them. The keys become column names, and the values (as lists) become the column data.
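A small round trip showing all three ideas, with hand-written stand-in data instead of Faker output:

```python
import pandas as pd

# Dictionary keys become column names, list values become column data
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 47],
})

print(df.head(10))  # peek at the first rows (all two of them here)

# index=False keeps pandas from writing the row numbers as an
# extra column with a blank header
df.to_csv("people.csv", index=False)

# And straight back into a DataFrame
df2 = pd.read_csv("people.csv")
```

Try removing index=False and re-reading the file: the stray "Unnamed: 0" column that appears is exactly the beginner trap the chapter warns about.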

JSON: The Other Common Format

JSON (JavaScript Object Notation) is the second format Crickard covers. You see JSON everywhere, especially in API responses. Python has a built-in json module for handling it.

Writing JSON

The approach is a bit different from CSV. Instead of writing row by row, you build up a Python dictionary with all your data, then dump the whole thing to a file at once using json.dump().

Crickard creates a dictionary with a 'records' key that holds a list. Each record is its own dictionary with name, age, street, city, state, zip, longitude, and latitude. After the loop finishes, one call to json.dump() writes everything.
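In outline, with two hand-written records standing in for the Faker loop (the file name is an assumption):

```python
import json

# Build the whole structure in memory first...
data = {"records": []}
for name, age in [("Alice", 34), ("Bob", 47)]:  # stand-ins for Faker output
    data["records"].append({"name": name, "age": age})

# ...then a single dump() call writes everything at once
with open("data.json", "w") as f:
    json.dump(data, f)
```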

Reading JSON

Reading is the reverse: json.load() reads the file and gives you back a Python dictionary. From there you access data with standard dictionary notation like data['records'][0]['name'].

One important gotcha Crickard highlights: load and dump are for files. loads and dumps (with the 's') are for strings. They do different things. Mixing them up is a common mistake.
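The distinction fits in a few lines (the file name is arbitrary):

```python
import json

data = {"records": [{"name": "Alice", "age": 34}]}

# dump and load work with file objects
with open("data.json", "w") as f:
    json.dump(data, f)
with open("data.json") as f:
    from_file = json.load(f)

# dumps and loads (with the 's') work with strings
as_text = json.dumps(data)
from_text = json.loads(as_text)

assert from_file == from_text == data  # same dictionary either way
```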

JSON with pandas

You can read JSON into a DataFrame with pd.read_json(), but it is not always straightforward. If your JSON has nested structures (like records inside a 'records' key), you need to normalize it first.

Crickard shows how to use json_normalize() with a record_path parameter to flatten nested JSON into a table structure. This is a common real-world scenario since most JSON from APIs is not flat.

When writing JSON from a DataFrame, the orient parameter controls the output format. The default ('columns') groups data by column. Setting orient='records' gives you each row as a separate JSON object, which Crickard says is much easier to work with in tools like Airflow.
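Both steps together, using a small nested structure shaped like the chapter's (the data itself is made up):

```python
import pandas as pd

# Nested JSON like the chapter's: records live under a 'records' key
raw = {"records": [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 47},
]}

# record_path tells pandas which nested list to flatten into rows
df = pd.json_normalize(raw, record_path="records")

# orient='records' writes one JSON object per row
as_records = df.to_json(orient="records")
print(as_records)
```

Compare that output with df.to_json() using the default orient to see why the records form is easier to hand to downstream tools.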

Building a Data Pipeline in Airflow

This is where things get interesting. Crickard takes the Python file-handling skills from the previous sections and wraps them into an Airflow pipeline.

What is a DAG?

Airflow organizes work into DAGs (Directed Acyclic Graphs). A DAG is a collection of tasks that run in a defined order, with work flowing in one direction from task to task as each completes. No loops, no going backwards.

You define tasks using operators. Airflow has prebuilt operators for common actions:

  • BashOperator runs shell commands
  • PythonOperator runs Python functions
  • PostgresOperator runs database queries (covered in the next chapter)

Building a CSV-to-JSON Pipeline

Crickard builds a simple two-task pipeline:

  1. A Bash task that prints a status message
  2. A Python task that reads a CSV file and converts it to JSON using pandas

The DAG configuration includes default arguments like owner, start date, retry count, and retry delay. Then you set a schedule interval for how often the pipeline runs. Airflow supports cron expressions (0 * * * * for hourly) or preset shortcuts like @daily, @weekly, @monthly.

One useful warning from the book: the DAG does not run at the start date. It runs at start date plus the schedule interval. So if you set a daily schedule with today’s start date, it will not run until tomorrow. That trips people up.

Connecting Tasks

You connect tasks using either set_upstream()/set_downstream() methods or the bit shift operators (>> and <<). Both do the same thing. Crickard uses the bit shift approach throughout the book:

task_one >> task_two

This means task_one runs first, then task_two.
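A sketch of what such a DAG file could look like. The import paths follow Airflow 2.x (the book's Airflow 1.x code imports from airflow.operators.bash_operator and python_operator instead), and the DAG id, dates, and file paths are assumptions, not the book's verbatim values:

```python
from datetime import datetime, timedelta

import pandas as pd
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def csv_to_json():
    # Read the CSV and re-emit it as one JSON object per row
    df = pd.read_csv("/home/airflow/data.csv")   # assumed path
    df.to_json("/home/airflow/data.json", orient="records")


default_args = {
    "owner": "airflow",
    "start_date": datetime(2024, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG("csv_to_json_pipeline",
         default_args=default_args,
         schedule_interval="@daily") as dag:
    print_starting = BashOperator(
        task_id="print_starting",
        bash_command='echo "starting the pipeline"',
    )
    convert = PythonOperator(
        task_id="csv_to_json",
        python_callable=csv_to_json,
    )

    # Bash task first, then the conversion
    print_starting >> convert
```

Remember the scheduling caveat above: with @daily and a past start date, Airflow will also backfill the missed intervals unless you tell it not to.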

Running the Pipeline

You copy the DAG file to Airflow’s dags folder, start the webserver and scheduler, and open the GUI at localhost:8080. From there you can monitor runs, check task status, and view logs to see what each task produced.

File Processing with Apache NiFi

NiFi takes a completely different approach. Instead of writing code, you drag and drop processors onto a canvas and connect them visually. It takes more steps than Python, but the pipeline is visual and easier to understand at a glance.

The CSV Pipeline in NiFi

Crickard builds a pipeline that reads a CSV, filters for people over 40, and writes each matching record to its own file. The processor chain looks like this:

GetFile reads the CSV from disk. You configure the input directory, file name filter, and set “Keep Source File” to true (otherwise NiFi deletes the original).

SplitRecord breaks the file into individual rows. It needs a CSVReader and CSVRecordSetWriter configured as controller services. Make sure to set “Treat First Line as Header” to true on the reader.

QueryRecord is where it gets powerful. You can write SQL queries against the data flowing through the pipeline. Crickard creates a query called over.40 with the SQL SELECT * FROM FlowFile WHERE age > 40. Only matching records pass through.

ExtractText pulls values out of the flowfile using regex. Crickard extracts the person’s name so it can be used as a filename.

UpdateAttribute changes the filename attribute of the flowfile from the default (data.CSV) to the extracted name. Without this, every output file would have the same name and overwrite each other.

PutFile writes the final result to disk.

The pipeline connections use “relationships” like success, splits, matched, and the custom over.40. Each relationship determines which records flow to which processor.

The JSON Pipeline in NiFi

The JSON pipeline follows a similar pattern but uses JSON-specific processors:

  • SplitJson instead of SplitRecord, using a JsonPath expression ($.records) to find the array of records
  • EvaluateJsonPath to extract values from JSON fields using $.key notation
  • JsonTreeReader and JsonRecordSetWriter for the QueryRecord processor
  • AttributesToJSON to rebuild the flowfile content from extracted attributes
  • JoltTransformJSON to modify the JSON structure (Crickard shows removing a field)

Jolt is a JSON transformation library. The example is simple (removing the zip field), but it supports much more complex transformations.
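A Jolt specification for that kind of removal is itself a small JSON document; a sketch in the spirit of the chapter's example (field name assumed) looks like:

```json
[
  {
    "operation": "remove",
    "spec": {
      "zip": ""
    }
  }
]
```

You paste a spec like this into the JoltTransformJSON processor's configuration, and every flowfile passing through loses its zip field.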

Key Takeaways

Python’s standard library is enough for basic file handling. The csv and json modules handle most common cases without any extra dependencies.

pandas is worth the overhead when you need to transform data. DataFrames make it easy to read, query, reshape, and export data in different formats.

Airflow is for scheduling and orchestrating. Write your logic in Python functions, wrap them in operators, connect them into a DAG, and let Airflow handle the when and how often.

NiFi is visual and code-free. More processors to configure, but the pipeline is transparent. Anyone can look at it and understand what is happening. Changes do not require code rewrites, just reconfiguring or reconnecting processors.

Know the difference between load/dump and loads/dumps in Python's json module. The version without 's' works with files. The version with 's' works with strings. This is one of those small things that will save you debugging time.

The chapter does a good job of showing the same task (read file, process data, write output) done three different ways. Each tool has its strengths. Python gives you full control. Airflow adds scheduling and monitoring. NiFi adds visual pipeline design and no-code processing.

Next chapter covers databases, which is where things start to get really practical.

