Building Your Data Engineering Setup - Study Notes from Data Engineering with Python Ch 2

Chapter 1 was all theory. Now it’s time to actually install stuff. Chapter 2 of Data Engineering with Python by Paul Crickard is a setup chapter. You install the tools, configure them, and make sure everything talks to each other.

Here’s the thing about setup chapters: they can feel boring. But if you skip this, nothing else in the book works. So let’s walk through what gets installed and why it matters.

What You’re Building

By the end of this chapter, you’ll have six tools running on your machine:

  • Apache NiFi for building data pipelines visually
  • Apache Airflow for building data pipelines with Python code
  • Elasticsearch as your NoSQL database
  • Kibana as a GUI for Elasticsearch
  • PostgreSQL as your relational database
  • pgAdmin 4 as a GUI for PostgreSQL

Two pipeline tools and two databases. Each database gets its own admin interface. That’s the full local stack for data engineering work.

Apache NiFi: The Visual Pipeline Builder

NiFi is the main tool for this book. It lets you build data pipelines using drag-and-drop processors. No code required. You configure processors, connect them together, and NiFi handles the data flow.

The install is straightforward: download the tarball, extract it, run the start script. One thing that trips people up is Java. NiFi needs Java installed and the JAVA_HOME variable set. If you run the status command and don’t see a path for JAVA_HOME, you need to install a JDK and export the variable in your shell profile.

NiFi runs on port 8080 by default. Crickard changes it to port 9300 right away because Airflow also wants port 8080. Smart move. You change this in the nifi.properties config file under the web properties section.
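The change is a one-line edit. A sketch of the relevant line in conf/nifi.properties (the file ships with 8080; the exact path inside your NiFi directory may vary by version):

```properties
# conf/nifi.properties — web properties section
nifi.web.http.port=9300
```

Restart NiFi after editing, then load the UI at the new port.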

How NiFi Works (Quick Version)

The NiFi GUI has a canvas where you build data flows. The key concepts:

  • Processors do the actual work (read files, write files, query databases, transform data)
  • Connections link processors together
  • Relationships define what happens on success or failure
  • FlowFiles are the data objects moving through your pipeline

NiFi ships with over 100 built-in processors. You drag them onto the canvas, configure their properties, and connect them. The book walks through a simple example: a GenerateFlowFile processor creates text data and passes it to a PutFile processor that writes it to disk.

It’s a trivial pipeline. But it shows you the pattern. Every NiFi pipeline follows this same flow: processor creates or reads data, connections move data between processors, and relationships handle success and failure paths.

One useful feature: you can right-click the queue between processors and inspect the FlowFiles sitting there. You can see their contents, metadata, and attributes. Very handy for debugging.

The PostgreSQL JDBC Driver

There’s a small but important step here. If you want NiFi to talk to PostgreSQL (and you will), you need to download the PostgreSQL JDBC driver and drop it into a drivers folder inside your NiFi directory. This gets used later when you set up database connection pools in NiFi.
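When you later create that connection pool (a DBCPConnectionPool controller service in NiFi), its settings point at this driver. A hedged sketch of typical values; the folder path and jar filename here are assumptions, and dataengineering is the database created later in this chapter:

```properties
# NiFi DBCPConnectionPool controller service (typical values)
Database Connection URL: jdbc:postgresql://localhost:5432/dataengineering
Database Driver Class Name: org.postgresql.Driver
Database Driver Location(s): /path/to/nifi/drivers/postgresql-42.x.x.jar
```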

Apache Airflow: The Code-First Pipeline Tool

Airflow does the same job as NiFi but takes a completely different approach. Instead of a visual interface, you write your pipelines in Python. If you’re a strong Python developer, this will feel more natural.

Install it with pip. You can include sub-packages for specific integrations like PostgreSQL, Slack, and Celery. The book installs all three:

pip install 'apache-airflow[postgres,slack,celery]'

After installing, you need to initialize the database (airflow initdb), start the web server, and start the scheduler. These run as separate processes, so you’ll need multiple terminal windows.

DAGs, Not Pipelines

Airflow calls its pipelines DAGs (Directed Acyclic Graphs). That’s a fancy way of saying: tasks with dependencies, and no circular loops.
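The "acyclic" part is easy to see in plain Python. A DAG is just tasks plus dependencies, and a valid one always admits an execution order. A minimal sketch using only the standard library (the task names are made up, not from the book):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Each task maps to the set of tasks it depends on.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid DAG yields an execution order; a cycle would raise CycleError.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'load']
```

Airflow does essentially this on a larger scale: it resolves the dependency graph and schedules each task only after everything upstream of it has finished.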

The web UI at localhost:8080 shows you all your DAGs. Airflow installs a bunch of example DAGs by default. The book suggests turning those off by setting load_examples = False in the airflow.cfg file and resetting the database. Good advice. Those examples clutter things up fast.
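The setting lives in airflow.cfg; in the Airflow 1.x versions the book uses, it sits in the [core] section:

```ini
# airflow.cfg
[core]
load_examples = False
```

After editing, reset the database (the book's "resetting the database" step) so the example DAGs actually disappear from the UI.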

The graph view for a DAG shows you the task dependencies and execution order. You can trigger DAGs manually, watch them run, and see which tasks succeeded or failed. All from the browser.

One note: the default database for Airflow is SQLite. That’s fine for learning on your laptop. But in production, you’d switch to PostgreSQL or MySQL. SQLite can’t handle concurrent writers, so Airflow pairs it with the SequentialExecutor, which runs one task at a time. Running tasks one at a time defeats the purpose of a pipeline scheduler.

Elasticsearch: Your NoSQL Database

Elasticsearch is technically a search engine, but the book uses it as a NoSQL database. You’ll store data in it and query it. Download the tarball, extract, optionally name your cluster and node in the config file, and start it up.

It runs on port 9200. Once it’s up, you can hit http://localhost:9200 in your browser and see a JSON response with cluster info. That tells you it’s working.
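That JSON is easy to sanity-check from Python too. A minimal sketch using a canned response in place of a live call (the field names match what Elasticsearch returns; the values here are made up, and a real check would fetch the URL with urllib.request.urlopen):

```python
import json

# A trimmed stand-in for the JSON Elasticsearch serves at http://localhost:9200.
sample = """{
    "name": "my-node",
    "cluster_name": "my-cluster",
    "version": {"number": "7.6.0"},
    "tagline": "You Know, for Search"
}"""

info = json.loads(sample)
print(info["cluster_name"], info["version"]["number"])
```

If the parse succeeds and cluster_name and version come back, your node is up and answering.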

Kibana: The Elasticsearch GUI

Elasticsearch doesn’t have a built-in visual interface. It’s all API calls. Kibana fixes that. It connects to Elasticsearch on port 9200 and gives you a web UI on port 5601.

Here’s what Kibana offers:

  • Discover tab for browsing your data records
  • Visualizations for charts, maps, and dashboards
  • Developer Tools for testing Elasticsearch queries directly

The Developer Tools tab is especially useful for data engineers. You can write and test queries before putting them into your pipeline code. The book loads some sample e-commerce data to show off dashboards with bar charts, maps, and real-time filters.
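In the Developer Tools console, a query is just an HTTP verb, a path, and a JSON body. A hedged sketch against Kibana's sample e-commerce index (kibana_sample_data_ecommerce and the customer_first_name field come from that sample data set, not from the book; adjust both to match your own data):

```
GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "match": { "customer_first_name": "Mary" }
  }
}
```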

PostgreSQL: Your Relational Database

PostgreSQL is the relational database for this book. Open source, mature, and comparable to Oracle or SQL Server. It also has PostGIS for spatial data, which is a nice bonus.

Install via your package manager, start the cluster, then set a password for the default postgres user. The book creates a database called dataengineering that gets used throughout later chapters.

Command line works fine for PostgreSQL. But for people who prefer a GUI, there’s pgAdmin 4.

pgAdmin 4: The PostgreSQL GUI

pgAdmin 4 is a web-based admin tool for PostgreSQL. After installing and logging in, you can:

  • Browse your databases and tables visually
  • Create tables with a point-and-click interface
  • Run SQL queries
  • Manage users and permissions

The book walks through connecting pgAdmin to your local PostgreSQL instance, navigating to the dataengineering database, and creating a simple table. In the next chapter, Python’s faker library populates this table with test data.
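Whether you click through pgAdmin's interface or type it yourself, the result is ordinary DDL. A sketch of what a simple table in the dataengineering database might look like (the table and column names here are illustrative, not the book's exact schema):

```sql
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    street TEXT,
    city   TEXT,
    zip    TEXT
);
```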

The Big Picture

Here’s why these tools belong together. In real data engineering work, you need:

  1. A way to move data (NiFi or Airflow)
  2. Places to store data (PostgreSQL for structured, Elasticsearch for unstructured)
  3. A way to see what’s happening (pgAdmin, Kibana, web UIs)

In production, you wouldn’t run all six tools on one machine. But for learning, having everything local means you can experiment without worrying about cloud costs or network issues.

My Take

This is a necessary chapter. Not exciting, but essential. A few things stood out:

Port conflicts matter. NiFi and Airflow both default to port 8080. Crickard catches this early and changes NiFi’s port. When you’re running multiple services locally, always check for port collisions first.
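A quick way to check a port before starting a service, sketched in stdlib Python (8080 is just the example from this chapter):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when the TCP connection succeeds.
        return s.connect_ex((host, port)) == 0

# If this prints True, pick another port before starting NiFi or Airflow.
print(port_in_use(8080))
```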

Covering two pipeline tools is a good idea. NiFi for visual, Airflow for code. Different teams and different problems call for different tools. Knowing both gives you flexibility.

The admin GUIs save time. You can do everything from the command line, sure. But when you’re learning or debugging, seeing your data in pgAdmin or Kibana makes everything faster.

SQLite as the default Airflow backend is a trap. It works for the book exercises, but the moment you want parallel tasks, you’ll need to switch. Worth knowing upfront.

The setup in this chapter is the foundation for everything else in the book. Take the time to get it working. Once these six tools are running, the real work begins.

