What is Data Engineering? Study Notes from Data Engineering with Python Ch 1

Chapter 1 of Data Engineering with Python by Paul Crickard starts with the basics. What is data engineering? What do data engineers actually do? And how is it different from data science?

Here are my notes.

Data Engineers Move Data From A to B

At the simplest level, data engineering is about moving data from one place to another. You pull data out of a source (extract), you clean it up or change its format (transform), and you put it somewhere useful (load).

That’s ETL: Extract, Transform, Load. You’ll see this abbreviation everywhere in data engineering.

Crickard uses a good example to explain why this matters. Imagine an online store that sells widgets. They have a database that tracks every sale. If you want to know how many blue widgets sold last quarter, you run a SQL query. Simple.
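
To make that concrete, here is a toy version of the single-database case using Python's built-in `sqlite3` module. The table schema and column names are invented for illustration, not taken from the book:

```python
# Toy version of the single-database case: one sales table, one SQL query.
# Schema and column names are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, color TEXT, sold_on TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("widget", "blue", "2023-02-14"),
        ("widget", "red", "2023-03-01"),
        ("widget", "blue", "2023-03-20"),
    ],
)

# "How many blue widgets sold last quarter?" is a single query
count = conn.execute(
    """
    SELECT COUNT(*) FROM sales
    WHERE item = 'widget' AND color = 'blue'
      AND sold_on BETWEEN '2023-01-01' AND '2023-03-31'
    """
).fetchone()[0]
print(count)  # 2
```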

But what happens when the company grows? Now there are databases in North America, Europe, Asia, and Africa. Each region has its own system. You can’t just query one database anymore.

That’s where a data engineer steps in. They connect to all those databases, pull the data together, and load it into one place, a data warehouse, where people can actually get answers.

It Gets More Complex Fast

The real questions companies want answered are things like:

  • Which locations sell the most widgets?
  • When are peak selling times?
  • How many people add items to their cart but never buy?
  • What products get bought together?

To answer these, you need to do more than just move data. You need to transform it along the way.

For example, if your stores are in different time zones, you need to convert every timestamp to one standard format and time zone (ISO 8601 timestamps in UTC are the go-to). You might also need to add a location tag to each transaction so you can compare regions.

So the pipeline looks like this: extract from each regional database, add location fields, convert dates to a standard format, then load everything into the warehouse.
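
Here is a minimal sketch of that pipeline, using in-memory SQLite databases to stand in for the regional systems and the warehouse. All table names, column names, and offsets are invented for illustration:

```python
# Sketch of the regional pipeline: extract from each region, add a location
# tag, convert timestamps to ISO 8601 in UTC, and load into one warehouse.
import sqlite3
from datetime import datetime, timezone, timedelta

def make_regional_db(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (item TEXT, sold_at TEXT, utc_offset_hours INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return db

regions = {
    "europe": make_regional_db([("widget", "2023-03-01 15:30:00", 1)]),
    "north_america": make_regional_db([("widget", "2023-03-01 09:30:00", -5)]),
}

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, item TEXT, sold_at_utc TEXT)")

for region, db in regions.items():
    for item, sold_at, offset in db.execute("SELECT * FROM sales"):
        # Transform: local time -> UTC, formatted as ISO 8601
        local = datetime.strptime(sold_at, "%Y-%m-%d %H:%M:%S").replace(
            tzinfo=timezone(timedelta(hours=offset))
        )
        iso_utc = local.astimezone(timezone.utc).isoformat()
        # Load: tag each row with its region so results are comparable
        warehouse.execute("INSERT INTO sales VALUES (?, ?, ?)", (region, item, iso_utc))
```

Note that both sample sales land on the same UTC instant once normalized, which is exactly the kind of comparison the raw regional timestamps would have hidden.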

The Skills You Need

Crickard makes it clear that data engineers need to know a lot of different things. Here’s the rough breakdown:

At the start of the pipeline you need to know how to pull data from different sources. Different file formats, different database types. SQL and Python are must-haves.

During transformation you need to understand data modeling and business logic. What does the company actually want to learn from this data? That shapes how you structure things.

At the loading end you need to know data warehouse design. What kind of database to use, how to set up the schema, how to make it queryable.

And underneath all of that you might be managing the infrastructure too. Linux servers, cloud platforms (AWS, GCP, Azure), and tools like Apache Airflow or NiFi.

It’s a broad role. That’s part of what makes it interesting.

Data Engineering vs Data Science

Here’s how Crickard draws the line. Data engineers build the pipelines and infrastructure. Data scientists use that infrastructure to do analysis, build models, and find patterns.

They use similar tools. Both write Python. Both work with databases. But data engineers focus on getting data where it needs to go in a clean, reliable way. Data scientists focus on what the data means.

In organizations that haven’t matured yet, data scientists end up doing the engineering work themselves. That’s not a great use of their time. Having dedicated data engineers frees up data scientists to do what they’re actually good at.

The two roles should work closely together. When data engineers understand what the data scientists need, they build better pipelines.

The Three Vs of Big Data

Crickard brings up the classic framework for thinking about big data challenges:

  • Volume: Moving a thousand rows is different from moving millions. The tools and techniques change with scale.
  • Variety: Data comes from databases, APIs, files, and all sorts of formats. You need tools that handle all of them.
  • Velocity: Data is always flowing faster. Millions of users on a social network, purchases happening around the world. Sometimes you need near real-time processing.

The Tools

Programming Languages

SQL is the main language of data engineering. Almost everything involves SQL at some point. Even non-SQL databases often provide SQL-like query interfaces.

Java and Scala power a lot of the open source Apache ecosystem (Spark, NiFi, Kafka). Both run on the JVM, and Scala has been gaining ground as a more modern alternative to Java.

Python is the focus of this book, and for good reason. It has a huge ecosystem of libraries for data work: pandas, NumPy, Matplotlib, scikit-learn, TensorFlow, and more. It’s well documented and has a massive community.

Databases

On the source side, you’ll usually be pulling from relational databases like PostgreSQL, MySQL, Oracle, or SQL Server. These store data in rows and are great for recording transactions.

On the warehouse side, you’ll see columnar databases like Amazon Redshift, Google BigQuery, and Apache Cassandra. Columnar storage is better for fast queries on large datasets because it reads only the columns you need instead of entire rows.
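
A toy contrast makes the row-versus-column trade-off visible. This is only an illustration of the access pattern, not how any real database stores data internally:

```python
# Row-oriented: each record stored together (good for writing/reading whole rows)
rows = [
    {"item": "widget", "color": "blue", "price": 10},
    {"item": "widget", "color": "red", "price": 12},
    {"item": "gadget", "color": "blue", "price": 30},
]

# Column-oriented: each column stored together (good for scanning one column)
columns = {
    "item": ["widget", "widget", "gadget"],
    "color": ["blue", "red", "blue"],
    "price": [10, 12, 30],
}

# "Total revenue" touches every field of every record in the row layout...
total_from_rows = sum(r["price"] for r in rows)
# ...but only one contiguous list in the columnar layout.
total_from_columns = sum(columns["price"])
print(total_from_rows, total_from_columns)  # 52 52
```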

Crickard also mentions Elasticsearch, which is a search engine built on Apache Lucene. It stores data as documents and uses its own JSON-based query language. It’s useful but different from the relational or columnar approach.

Data Processing Engines

When you need to transform data at scale, you use a processing engine. The most popular one is Apache Spark. It lets you write transformations in Python, Java, or Scala. It works with DataFrames (which Python developers will feel right at home with) and RDDs (Resilient Distributed Datasets) for distributed processing.
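
The core idea behind RDD-style processing, splitting data into partitions, mapping over each independently, then reducing the partial results, can be imitated on a single machine. This is not Spark's API, just a stdlib sketch of the map/reduce pattern it builds on:

```python
# Toy imitation of the RDD idea: partition the data, run a "map" stage over
# each partition independently, then "reduce" the partial results.
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into roughly n equal chunks, like RDD partitions."""
    k = (len(data) + n - 1) // n
    return [data[i:i + k] for i in range(0, len(data), k)]

def count_blue(chunk):
    # Map stage: each partition is processed on its own
    return sum(1 for sale in chunk if sale["color"] == "blue")

sales = [{"color": c} for c in ["blue", "red", "blue", "blue", "green", "red"]]

with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_blue, partition(sales, 3)))

# Reduce stage: combine the per-partition results
total_blue = sum(partial_counts)
print(total_blue)  # 3
```

In real Spark the partitions live on different machines and the engine handles shuffling and fault tolerance, but the shape of the computation is the same.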

Other engines mentioned:

  • Apache Storm: Uses “spouts” to read data and “bolts” to transform it. You connect them to build a pipeline.
  • Apache Flink and Apache Samza: More modern options for stream and batch processing. Good for unbounded streams, like data from a temperature sensor that never stops sending readings.

Data Pipelines and Scheduling

When you combine a data source, a programming language, a processing engine, and a data warehouse, you get a pipeline.

But a pipeline that you have to run manually isn’t very useful. You need a scheduler.

The simplest option is crontab. Schedule your Python script to run every few hours. Done.
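
For example, a crontab entry like this (the script path is hypothetical) runs a pipeline every three hours:

```shell
# Edit with `crontab -e`. Minute 0 of every 3rd hour; log output to a file.
0 */3 * * * /usr/bin/python3 /opt/pipelines/etl_widgets.py >> /var/log/etl_widgets.log 2>&1
```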

But here’s the problem. Once you have more than a handful of pipelines, crontab falls apart. How do you track which ones succeeded and which failed? How do you handle backpressure when one step runs faster than the next? You need something better.

Apache Airflow

Airflow is the most popular Python framework for data pipelines, originally built at Airbnb. It has a web server, a scheduler, a metastore, a queue, and executors. You can run it on a single machine or scale it out to a cluster.

Airflow uses DAGs (Directed Acyclic Graphs). A DAG is just Python code that defines a series of tasks with dependencies. Each task runs after the one it depends on. So “extract” runs first, then “transform,” then “load.” Each step flows in one direction, no loops.

Apache NiFi

NiFi is another pipeline framework, and Crickard says the book will use it more than Airflow. It was originally built by the NSA and is used at several US federal agencies.

NiFi has a polished GUI, built-in scheduling, backpressure handling, and monitoring. You can do a lot just by configuring existing processors without writing much code. It also supports clustering, version control through NiFi Registry, and edge data collection with MiNiFi.

Crickard also gives a quick mention to Luigi, built by Spotify. Another Python-based pipeline tool with a graph structure and GUI, similar in spirit to Airflow.

My Takeaway

Chapter 1 is a solid overview. Crickard covers what data engineering is, where it fits in the data ecosystem, and what tools you’ll be working with. Nothing too deep yet, but it sets the stage well.

The key thing I took away: data engineering is broad. It touches programming, databases, infrastructure, cloud platforms, and business logic. That’s a lot of ground to cover, which is why having a structured book to work through is helpful.

On to Chapter 2, where we set up the actual development environment.


These are my study notes from Data Engineering with Python by Paul Crickard (Packt, 2020). This is a retelling in my own words, not a substitute for the book. If you want the full details and code examples, grab a copy.

