Data Science Foundations Chapter 7: Where to Find and How to Source Your Data
You have a great hypothesis. Your stakeholders are on board. But none of it matters without the right data.
Chapter 7 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is about sourcing. Where do you get data? How do you collect it? How do you know if it is any good?
I spent years working with data in IT and finance. Finding the right data set is often harder than building the model.
The Big Data Problem
Data is everywhere. The authors use a coffee shop visit to show this. You walk in, the camera records you. The payment system logs the transaction. The coffee machine sends status updates. You post a review. One moment, five systems creating data.
This explosion led to the concept of Big Data, described through three Vs. Volume is the sheer amount. Velocity is how fast it gets created and processed. Your card payment clears in seconds. Variety covers all the different formats. Video is unstructured. Payment records are structured. Social media is somewhere in between.
Some people add two more. Veracity asks if the data is trustworthy. Variability points out that “employee” in one database might include contractors, and in another it might not. Same word, different meaning.
How Data Gets Collected
The book lists several methods. Cameras capture images and video. Sensors record measurements automatically. Transferred data is created when a barcode is scanned or a card is read. Human entry covers keyboards and touch screens. Derived data combines existing records into something new, like building customer profiles from purchase history. And synthetic data is artificially generated when real data is limited or privacy is a concern.
Here’s the thing. These methods often chain together. A sensor reads a temperature. That gets transferred to headquarters. An algorithm processes it. A maintenance request gets created automatically. That is the Internet of Things in action.
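The sensor-to-maintenance chain can be sketched in a few lines. This is a toy illustration, not the book's example: the sensor name, threshold, and record shapes are all invented.

```python
# Hypothetical sketch of the sensor -> transfer -> algorithm -> request chain.
# Threshold and reading values are invented for illustration.

def process_reading(sensor_id: str, temperature_c: float,
                    threshold_c: float = 90.0) -> dict:
    """Decide whether a transferred reading should trigger maintenance."""
    if temperature_c > threshold_c:
        # In a real IoT pipeline this step would call a ticketing system.
        return {"sensor": sensor_id, "action": "maintenance_request",
                "reading": temperature_c}
    return {"sensor": sensor_id, "action": "none", "reading": temperature_c}

print(process_reading("boiler-3", 95.2))
```

The point is not the code itself but the hand-offs: each stage consumes data another stage created.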
When data comes from people, expect errors. The authors recommend drop-downs, checkboxes, and format validation. Even surveys can go wrong. The UK 2021 census had to downgrade its gender identity statistics because the question confused respondents.
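Format validation of the kind the authors recommend is cheap to add. A minimal sketch, with deliberately simplified patterns (real email validation is far messier):

```python
import re

# Minimal format validation for human-entered fields.
# Patterns are simplified for illustration, not production-grade.

def valid_email(value: str) -> bool:
    """Very rough shape check: something@something.something."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

def valid_iso_date(value: str) -> bool:
    """Check the YYYY-MM-DD shape (not calendar validity)."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

print(valid_email("ana@example.com"))  # True
print(valid_iso_date("2021-03-21"))    # True
print(valid_email("not-an-email"))     # False
```

Catching a malformed entry at the keyboard is far cheaper than cleaning it out of a warehouse later.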
Types of Data
The book sorts data by availability and creation.
Public data is anything you can access openly. But accessible does not mean free to use. Open data is usually released by governments with clear licenses. Census data is a good example. Proprietary data belongs to an organization and stays private.
On the creation side, operational data comes from running a business. Transaction records, production logs. Research data is collected specifically for a study or experiment.
Where to Find Data
Start inside your organization. Companies sit on data they barely use. Outside, government portals and open data sites can be rich sources. You can buy data from third parties. You can pull from social media through APIs or screen scraping. And sometimes you collect it yourself through surveys or field studies. Do not underestimate the work involved.
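Pulling from an external API usually means paginating through results. A hedged sketch of just the request-building side; the base URL and parameter names here are entirely hypothetical, not any real portal's API:

```python
from urllib.parse import urlencode

# Hypothetical open-data endpoint; URL and parameter names are invented.
BASE_URL = "https://data.example.gov/api/records"

def page_urls(query: str, pages: int, page_size: int = 100):
    """Yield one request URL per page of results for a keyword query."""
    for page in range(1, pages + 1):
        params = urlencode({"q": query, "page": page, "per_page": page_size})
        yield f"{BASE_URL}?{params}"

for url in page_urls("coffee", pages=2):
    print(url)
```

Even this simplified version hints at the work involved: rate limits, authentication, and changing schemas all sit on top of it.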
Storing Data: Lakes, Warehouses, and Marts
Organizations set up separate stores so analysts can work without slowing production systems.
A data lake holds raw data in its original format. Everything gets dumped in. A data warehouse is more organized. Data gets cleaned before loading. It brings together information from multiple systems for reporting. A data mart is a slice of the warehouse focused on one area, like sales or marketing.
ETL vs ELT
Getting data into these stores involves extract, transform, and load. The order of the last two matters.
ETL cleans data before loading it. Warehouses use this. ELT dumps raw data first, transforms later. Lakes favor this. Many data scientists prefer ELT because raw data gives more options for analysis.
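The difference is purely one of ordering, which a toy example makes concrete. Here the "warehouse" and "lake" are just lists, and the cleaning step is invented:

```python
# Toy contrast of ETL vs ELT; stores are plain lists, records invented.

raw = [{"amount": " 4.50 ", "item": "latte"},
       {"amount": "3.00", "item": "espresso"}]

def transform(record: dict) -> dict:
    """Clean one record: strip whitespace, cast amount to float."""
    return {"item": record["item"], "amount": float(record["amount"].strip())}

# ETL: transform first, then load into the "warehouse".
warehouse = [transform(r) for r in raw]

# ELT: load raw data into the "lake" first; transform later, on demand.
lake = list(raw)
on_demand = [transform(r) for r in lake]

print(warehouse[0])  # {'item': 'latte', 'amount': 4.5}
```

Both paths end at the same cleaned records; ELT simply keeps the raw originals around, which is why analysts who want options tend to prefer it.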
Data Governance and Lifecycle
Data governance covers the rules for managing data. Who can access it. How it gets stored. When it gets deleted.
The book introduces CRUD as the four stages of data life: create, record, update, delete. You build a data set, store it, modify it during analysis, and delete it when done.
Good governance brings safety (proper use, restricted access) and efficiency (reuse data, delete what you no longer need).
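The lifecycle the book labels CRUD can be sketched with an in-memory store. The function names and the dict-based store are invented for illustration:

```python
# Toy sketch of the book's CRUD stages: create, record (store),
# update, delete. The storage dict and names are invented.

storage: dict[str, list] = {}

def create_dataset() -> list:
    """Create: build a data set."""
    return [100, 200]

def record_dataset(name: str, data: list) -> None:
    """Record: persist the data set to the store."""
    storage[name] = data

def update_dataset(name: str, extra: int) -> None:
    """Update: modify the stored data during analysis."""
    storage[name] = storage[name] + [extra]

def delete_dataset(name: str) -> None:
    """Delete: remove the data when the work is done."""
    storage.pop(name, None)

record_dataset("q1_sales", create_dataset())
update_dataset("q1_sales", 300)
print(storage["q1_sales"])        # [100, 200, 300]
delete_dataset("q1_sales")
print("q1_sales" in storage)      # False
```

In a governed environment, each of these steps would also be subject to access controls and retention rules.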
Data Quality: Three Lenses
Not all data is good data. The authors evaluate quality through three lenses.
Accuracy asks if data is complete and correctly recorded. Latency asks how fresh it is. Census data from years ago might be fine. Financial trading data from five minutes ago might be too old. Lineage asks where the data came from and what happened to it since. Modified or removed records can wreck your analysis.
And even perfect data is worthless if it does not relate to your problem.
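The latency lens in particular lends itself to an automated check. A minimal sketch, with the five-minute window echoing the trading example; the function and its parameters are invented:

```python
from datetime import datetime, timedelta, timezone

# Sketch of a latency check: flag records older than a freshness window.
# The timestamps and the five-minute window are invented for illustration.

def is_fresh(recorded_at: datetime, max_age: timedelta,
             now: datetime) -> bool:
    """Return True if the record is within the allowed age."""
    return now - recorded_at <= max_age

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = now - timedelta(minutes=6)
print(is_fresh(stale, timedelta(minutes=5), now))  # False
```

What counts as stale is entirely problem-dependent, which is why `max_age` is a parameter rather than a constant.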
Metadata
Metadata is data about data. When was it created? How big is it? What does it contain? Is it confidential? Good metadata saves enormous time when searching for the right data sets.
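Metadata can be as simple as a catalog of records that answer those questions, searchable by keyword. A minimal sketch with invented field names and data sets:

```python
# Sketch: minimal metadata records and a keyword search over them.
# Field names and data set entries are invented for illustration.

catalog = [
    {"name": "sales_2023", "created": "2023-01-05", "rows": 120_000,
     "confidential": False, "description": "daily till transactions"},
    {"name": "hr_roster", "created": "2022-11-01", "rows": 850,
     "confidential": True, "description": "employee and contractor records"},
]

def find(keyword: str) -> list[str]:
    """Return names of data sets whose description mentions the keyword."""
    return [m["name"] for m in catalog if keyword in m["description"]]

print(find("transactions"))  # ['sales_2023']
```

Even a flat catalog like this beats asking around by email; real metadata stores add ownership, lineage, and access flags on top.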
Key Takeaways
- Garbage in, garbage out. Data quality determines project quality
- Consider multiple sources and compare their strengths
- Check if the data has the right properties for your hypothesis
- Understand how data will be extracted, transformed, and loaded
- Follow your organization’s data governance policies
- Verify it is ethical and legal to use the data
Sourcing data is not glamorous. But skip it or do it poorly, and everything built on top falls apart.