Understanding Data - Types, History, and Why It Matters

The book opens with a simple claim: data is the new oil. You’ve probably heard that phrase a hundred times. But Nwokwu doesn’t just drop the cliché and move on. She actually walks you through why the comparison holds up, starting thousands of years ago.

A Quick History of Data

Here’s the thing most people don’t realize: data collection didn’t start with computers. It started with bones.

Around 20,000 BCE, someone in the Congo region carved notches into a baboon bone. That’s the Ishango bone, and scholars think it was used for counting or tracking things. No spreadsheets. No databases. Just a bone with marks on it. The abacus showed up later, around 2400 BCE in Babylon, and people used it for arithmetic for centuries.

Fast forward to the 1640s. A haberdasher named John Graunt in London started collecting death records: number of deaths, mortality rates by age group, causes of death. He ran what might be the first statistical data analysis ever. He predicted life expectancies and even built an early warning system for the bubonic plague. They called him the Father of Statistics. A haberdasher.

In the 1800s, the U.S. Census had a problem. The population was growing so fast that processing the census manually was taking years. Herman Hollerith built a machine that used punch cards to process the data. The 1890 census was tabulated in roughly two years, versus the eight the previous census had taken by hand. That machine eventually led to the founding of what became IBM.

The 1900s brought magnetic tape (1928, invented by a German engineer named Fritz Pfleumer) and the early ideas of networked computers. In the 1960s, J.C.R. Licklider imagined computers connected together and sharing resources. That idea became cloud computing decades later. And in the 1970s, Edgar F. Codd developed the relational data model, which is still how most databases work today.

Then the 1990s brought the internet. Tim Berners-Lee introduced the World Wide Web, and suddenly data could be collected, shared, and analyzed by anyone, anywhere.

That’s a long history. But here’s what I found interesting: at every stage, the pattern is the same. People collect data, run into limits, and then invent something new to handle it. That hasn’t changed.

Three Types of Data

Nwokwu breaks data into three categories based on structure. This is fundamental stuff, but she explains it cleanly.

Structured data is the neat and tidy kind. Rows and columns. Every column has a specific type (numbers, dates, strings), every row is a record. Think Excel spreadsheets, banking transactions, CRM systems. You can query it with SQL. It’s easy to work with.

But here’s the problem: it’s rigid. If you need to add a new column or change the structure, it can be a headache. And it can’t handle things like images, audio, or free-form text.
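To make the "rows and columns, queryable with SQL" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and sample records are my own invented illustration, not from the book:

```python
import sqlite3

# In-memory database with a fixed schema: every column has a declared type.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id      INTEGER PRIMARY KEY,
        account TEXT NOT NULL,
        amount  REAL NOT NULL,
        date    TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO transactions (account, amount, date) VALUES (?, ?, ?)",
    [("ada", 120.50, "2024-01-03"),
     ("ada", -45.00, "2024-01-10"),
     ("tunde", 300.00, "2024-01-05")],
)

# Because the structure is known up front, querying is one line of SQL.
total = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE account = ?", ("ada",)
).fetchone()[0]
print(total)  # 75.5
```

The rigidity shows up the same way: adding a column means an ALTER TABLE and, in a real system, migrating every existing row.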

Unstructured data is the opposite. No predefined format. Emails, social media posts, videos, PDFs. This is actually the majority of data generated today. It can hold deep insights, like sentiment in text or patterns in images. But it’s hard to analyze. You need machine learning, NLP, or computer vision to make sense of it. It’s also huge in terms of storage.

Semi-structured data sits in the middle. It has some organizational properties, like tags or metadata, but no fixed schema. JSON and XML are the classic examples:

JSON:

{
  "name": "Ada",
  "age": 30,
  "isStudent": false
}

XML:

<person>
  <name>Ada</name>
  <age>30</age>
</person>

Semi-structured data is flexible and handles nested, complex formats well. NoSQL databases and cloud storage systems are built for it. But querying it isn’t as straightforward as SQL on a relational table.
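A quick sketch of what "schema-on-read" feels like in practice, parsing the JSON record from above (the extra nested "courses" field is my own addition for illustration):

```python
import json

# No schema was declared anywhere; the structure is discovered as we read it.
raw = """
{
  "name": "Ada",
  "age": 30,
  "isStudent": false,
  "courses": [{"title": "Databases", "year": 2024}]
}
"""
person = json.loads(raw)

print(person["name"])                 # Ada
print(person["courses"][0]["title"])  # Databases

# Fields can simply be absent in some records; .get() handles that gracefully.
print(person.get("email", "unknown"))  # unknown
```

This flexibility is exactly why it is harder to query than a relational table: every consumer has to know (or probe) the shape of each document.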

Here’s a quick comparison:

| Feature  | Structured                   | Unstructured                    | Semi-structured             |
| Format   | Tables with rows and columns | Free-form (text, images, audio) | Hierarchical (JSON, XML)    |
| Schema   | Rigid, predefined            | None                            | Flexible, self-describing   |
| Storage  | Relational databases         | Filesystems, object storage     | NoSQL databases             |
| Analysis | SQL, BI tools                | ML, NLP, computer vision        | Parsing, schema-on-read     |

If you’re new to this, the big takeaway is: different data needs different tools. You can’t treat a video file the same way you treat a bank transaction record.

Why Data Matters

The chapter walks through several industries where data is making a real difference.

In healthcare, data is being used for predictive analytics. Doctors can analyze patterns in a patient’s blood sugar, lifestyle, and family history to catch problems early. The NHS in the UK uses data engineering to pull messy records from hospitals, clinics, and labs into one place so doctors can access accurate, up-to-date patient information.

In supply chain, companies are using IoT sensors to track products in real time. During COVID, the companies that used data to predict shipping delays were the ones that stayed ahead.

In transportation, think about how Uber works. Location data, traffic patterns, rider behavior, weather conditions, all processed in real time to get you a ride. And self-driving cars generate terabytes of data per mile, all feeding back into machine learning models.

In AI, data is what makes large language models possible. Early AI models were limited because they didn’t have enough data to train on. Now, models learn from billions of words across countless contexts. Data didn’t just improve AI. It made modern AI possible.

Data vs Information

One last thing from this chapter that I want to highlight. Nwokwu makes a clear distinction between data and information.

Data is raw facts. Numbers, text, readings. On its own, it doesn’t mean much.

Information is what you get after you process, organize, and structure that data. A monthly bank statement is information. A sales dashboard is information. A doctor’s patient summary is information.

The gap between data and information is where data engineering lives. Data engineers are the ones who take messy, scattered, raw data and turn it into something useful. That’s the core of this whole field, and the rest of the book builds on that idea.
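That raw-data-to-information step can be sketched in a few lines. The sample transactions below are my own invented records, and the grouping logic is a toy stand-in for what a real pipeline would do:

```python
from collections import defaultdict

# Data: raw facts. Individual transactions, meaningless in isolation.
raw_data = [
    {"date": "2024-01-03", "amount": -45.0},
    {"date": "2024-01-10", "amount": 1200.0},
    {"date": "2024-02-02", "amount": -80.0},
]

# Information: the same facts, organized into a monthly statement.
monthly = defaultdict(float)
for row in raw_data:
    month = row["date"][:7]  # e.g. "2024-01"
    monthly[month] += row["amount"]

print(dict(monthly))  # {'2024-01': 1155.0, '2024-02': -80.0}
```

Trivial at this scale, but scale it to millions of messy, inconsistent records from dozens of sources and you have the day job of a data engineer.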

Good opening chapter. Nothing too heavy, but it sets the stage well.


This is part 2 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.

