Introduction to Data Engineering - The Oil Refinery, the Lifecycle, and the People
Chapter 1 was about understanding data itself. Chapter 2 answers the bigger question: what do data engineers actually do with it?
Nwokwu opens with a problem statement that’s pretty universal. Organizations have tons of data. They need to collect it, store it, clean it, and make it useful. But how? That’s what data engineering solves.
The Oil Refinery Analogy
The best part of this chapter is the oil refinery analogy. It makes the whole field click.
Imagine you work at an oil refinery. Crude oil comes in. It’s messy, full of impurities, and completely unusable in its raw form. The refinery cleans it, separates it into different products (gasoline, diesel, jet fuel), checks quality standards, and then sends the finished products out to gas stations and vendors.
Now replace crude oil with raw data. A data engineer takes messy, unstructured data from different sources, cleans it up, removes errors and noise, transforms it into useful formats, and delivers it to the people who need it. Maybe it goes to an analyst’s dashboard. Maybe it feeds into an API. Maybe it becomes a dataset for a machine learning model.
The crude oil is your raw data. The refinery machines are your processing tools. The different fuel types are your different data outputs for different users.
I’ve been in IT for over 20 years and I’ve seen dozens of analogies for data work. This one is simple and it actually holds up. If you can explain your job using an oil refinery, people get it.
The Data Engineering Lifecycle
Here’s the framework. The data engineering lifecycle has five stages:
1. Source Systems - This is where data comes from. Databases, APIs, IoT devices, cloud storage, streaming platforms. A social media platform generates unstructured data (text, images, videos). A smart thermostat generates semi-structured JSON data every minute. An e-commerce platform generates a mix of structured order records and semi-structured customer reviews.
As a data engineer, your first job is to know your sources. What format is the data in? How often does it arrive? How much of it is there?
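To make the "know your sources" point concrete, here's a minimal sketch of inspecting one semi-structured record like the thermostat example above. The payload and field names are invented for illustration, not from the book:

```python
import json

# A hypothetical reading from a smart thermostat, arriving as
# semi-structured JSON. The field names are made up for illustration.
raw_payload = '{"device_id": "thermo-42", "temp_c": 21.5, "ts": "2024-01-15T08:00:00Z", "battery": null}'

reading = json.loads(raw_payload)

# Part of knowing your source is checking format and completeness up front.
print(reading["device_id"])    # which device sent this?
print(reading.get("battery"))  # None -- missing values are common in the wild
```

Even a check this small answers two of the questions above: what format the data is in (JSON with these fields) and whether it arrives complete (here, `battery` is null).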
2. Storage - Before you do anything with data, you need somewhere to put it. The book lays out three main options:
- Database - best for structured, transactional data where you need quick reads and writes
- Data lake - good for raw, unstructured, or semi-structured data without a fixed schema
- Data warehouse - designed for structured data from multiple sources, formatted for analysis and reporting
When choosing a storage system, you think about scalability (will it grow with your data?), performance (how fast are reads and writes?), storage suitability (does the tech match the use case?), and access tiers.
Access tiers are interesting. Hot storage is for data you access all the time. It’s fast but expensive. Cold storage is for data you rarely touch. It’s cheaper but slower. Archive storage is for stuff you basically never look at but need to keep for compliance. Each tier has different cost and speed tradeoffs.
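The tier decision above boils down to a simple rule of thumb: the less often you read the data, the colder (and cheaper) the tier. Here's a toy sketch of that logic; the thresholds are invented for illustration, since real cloud providers set their own cutoffs and prices:

```python
# Toy tier-selection rule. The thresholds are made up for illustration --
# actual cloud pricing and tier boundaries vary by provider.
def pick_access_tier(reads_per_month: int) -> str:
    if reads_per_month >= 100:
        return "hot"      # fast but expensive; data you access all the time
    if reads_per_month >= 1:
        return "cold"     # slower but cheaper; data you rarely touch
    return "archive"      # cheapest; kept mainly for compliance

print(pick_access_tier(5000))  # hot
print(pick_access_tier(3))     # cold
print(pick_access_tier(0))     # archive
```

In practice this decision is usually a lifecycle policy configured on the storage service rather than code you write, but the tradeoff it encodes is the same.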
3. Ingestion - This is moving data from your sources into your storage. Two approaches:
- Batch ingestion - collect data in big chunks at scheduled intervals. Like running a job every night at 2 AM.
- Streaming ingestion - process data in real time as it arrives. Like monitoring sensor data second by second.
Before you start ingesting, you need to answer: where is this data going? How often does it arrive? What format is it in? Do I need to transform it on the way in?
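The batch-versus-streaming distinction is easier to see side by side. A hedged sketch, using an in-memory list as a stand-in for real storage:

```python
# Batch ingestion: collect records into a big chunk, load them all at once.
def batch_ingest(records, store):
    """Runs on a schedule (e.g. nightly at 2 AM) and loads everything collected."""
    store.extend(records)
    return len(records)

# Streaming ingestion: handle each record the moment it arrives.
def stream_ingest(record, store):
    """Called once per event, as soon as the event shows up."""
    store.append(record)

store = []
batch_ingest([{"id": 1}, {"id": 2}], store)  # one scheduled load of many rows
stream_ingest({"id": 3}, store)              # one row, right now
print(len(store))  # 3
```

The data ends up in the same place either way; what differs is latency (how stale the data is allowed to be) and how the work is triggered (a scheduler versus the event itself).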
4. Transformation - Now the data is stored, but it’s still messy. Transformation is where you clean it up. Remove duplicates. Fix inconsistencies. Handle missing values. Apply business logic. Aggregate things into useful summaries.
Here’s the thing: you can’t just transform blindly. You need to understand the business use case first. What are stakeholders going to do with this data? What questions are they trying to answer? The business logic drives what transformations you apply.
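Here's a minimal sketch of those transformation steps over some fake order records. The field names and the "treat missing amounts as zero" rule are invented business logic for illustration:

```python
# Toy transformation pass: dedupe, handle missing values, aggregate.
# The records and the business rules are made up for illustration.
orders = [
    {"order_id": 1, "customer": "ada",   "amount": 30.0},
    {"order_id": 1, "customer": "ada",   "amount": 30.0},   # duplicate row
    {"order_id": 2, "customer": "grace", "amount": None},   # missing amount
    {"order_id": 3, "customer": "ada",   "amount": 45.0},
]

# Remove duplicates by order_id, keeping the first occurrence.
seen, deduped = set(), []
for o in orders:
    if o["order_id"] not in seen:
        seen.add(o["order_id"])
        deduped.append(o)

# Handle missing values -- here the business logic says treat them as 0.
for o in deduped:
    if o["amount"] is None:
        o["amount"] = 0.0

# Aggregate into a useful summary: total spend per customer.
totals = {}
for o in deduped:
    totals[o["customer"]] = totals.get(o["customer"], 0.0) + o["amount"]

print(totals)  # {'ada': 75.0, 'grace': 0.0}
```

Notice that two of the three steps (what counts as a duplicate, what a missing amount means) are business decisions, not technical ones. That's the point of the paragraph above.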
5. Serving - The last stage. This is where clean data reaches the people who need it. The book highlights two main use cases:
For analytics, data goes to dashboards, reports, and ad hoc queries. Business analysts use it to track KPIs, build monthly reports, or answer one-off questions from leadership.
For machine learning, data engineers work with ML engineers to prepare clean datasets, create features (like “average purchase frequency per customer” or “total spend in the last 90 days”), and store them in feature stores for model training.
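The features named above are straightforward to compute once the data is clean. A sketch with invented purchase records, mirroring the "total spend in the last 90 days" example:

```python
from datetime import date, timedelta

# Toy feature-engineering step. The purchase data and the fixed "today"
# are made up to mirror the features mentioned above.
today = date(2024, 6, 1)
purchases = [
    {"customer": "ada", "amount": 20.0, "when": date(2024, 5, 20)},
    {"customer": "ada", "amount": 55.0, "when": date(2024, 4, 2)},
    {"customer": "ada", "amount": 10.0, "when": date(2023, 11, 1)},  # outside window
]

window_start = today - timedelta(days=90)
recent = [p for p in purchases if p["when"] >= window_start]

features = {
    "total_spend_90d": sum(p["amount"] for p in recent),
    "purchases_per_month_90d": round(len(recent) / 3, 2),
}
print(features)  # {'total_spend_90d': 75.0, 'purchases_per_month_90d': 0.67}
```

In a real pipeline a row like this would land in a feature store keyed by customer, so the ML team can pull the same features for training and for serving predictions.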
Working with Stakeholders
This section is something a lot of technical books skip, and I appreciate that Nwokwu includes it. Data engineering is not just about pipelines. It’s about people.
There are two types of stakeholders:
Downstream stakeholders use the data you produce. Data analysts, data scientists, ML engineers, executives. They need you to understand what data points they require, how often they need refreshes, and what latency is acceptable.
Upstream stakeholders provide the data. Software engineers building source systems, third-party data providers. You need to understand how their data is generated, what volume to expect, what format it comes in, and whether schema changes are coming.
The practical advice here is solid. Talk to your stakeholders early. Understand their pain points. Ask what data they wish they had. And when requirements change (and they will), keep communicating. Regular check-ins, documented requirements, small iterations.
Delivering Business Value
The book makes a point that I think a lot of engineers need to hear: don’t just build things because they’re technically interesting. Ask “why” before you write any code. What decision will this data influence? If you can’t answer that, step back and figure it out first.
And when your work does make an impact, say so. If your pipeline saves someone three hours a day, tell people. Communicate in business language: time saved, cost reduced, revenue gained. That’s how you stay relevant.
Where the Field Is Now
Nwokwu gives a quick overview of how data engineering has evolved. In the early days, it was ad hoc scripts and manual database management. Now it’s automated pipelines, real-time streaming, lakehouse architectures (combining data lakes and warehouses), data mesh (decentralized data ownership), and a growing focus on data observability and quality.
Technologies like Delta Lake and Apache Iceberg are letting organizations store structured and unstructured data in the same environment. Cloud-native platforms are giving companies more flexibility. Data engineering is no longer about moving data from point A to point B. It’s about building intelligent, automated systems.
Why It Matters
Here’s my take on why this chapter is important. If Chapter 1 was “what is data,” Chapter 2 is “what do you do with it.” The lifecycle gives you a mental framework for the entire field. Every chapter that follows in this book maps to one or more stages of that lifecycle.
And the stakeholder section grounds everything in reality. You’re not building pipelines in a vacuum. You’re building them for people who have deadlines, questions, and budgets.
Good chapter. It sets up the rest of the book nicely.
This is part 3 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.