Data Quality: What Bad Data Looks Like and How to Catch It
Chapter 8 of Data Engineering for Beginners opens with a statement that should be obvious but apparently is not: even the best pipelines and storage systems are meaningless if the data they deliver is garbage.
And here is what I found interesting. Nwokwu says a lot of organizations do not actually care about data quality. They know it matters. They say it matters. But when it comes time to invest in it, they build workarounds and makeshift solutions instead. If you have worked in IT long enough, you know exactly what this looks like.
What Is Bad Data?
Bad data is any data that does not meet the standards needed for meaningful analysis. That sounds vague, so let me make it concrete.
Bad data is incomplete - missing values that leave gaps. It is poorly formatted - dates that are all over the place, text in numerical columns. It is outdated - reflecting conditions that no longer exist. And sometimes it is just plain wrong.
Picture a sales department generating a monthly revenue report. Some entries are missing. Some sales are recorded in the wrong month. Some are duplicated. The report is useless. Not because the reporting tool is bad, but because the data feeding it is bad.
Here is the thing: bad data is both a cultural problem and a technical problem. Culturally, if employees are not trained to enter consistent data, or if nobody is accountable for data quality, errors pile up. Technically, legacy systems and poorly integrated databases introduce their own mess.
One common source of bad data: multiple systems storing data separately. Sales in one system, inventory in another. When you try to combine them, you get duplicates, missing fields, and conflicting labels. “Customer ID” in one system is “Client Number” in another. Same concept, different name. Good luck joining those tables without a headache.
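The usual fix is a mapping layer that renames each system's fields to one canonical schema before you attempt the join. A toy sketch in Python (the two field labels are the ones from the example above; the record contents and the `FIELD_MAP` name are mine):

```python
# Each source system uses its own label for the same concept
sales_record     = {"Customer ID": 42, "amount": 99.95}
inventory_record = {"Client Number": 42, "sku": "A-17"}

# Canonical schema: map every source-specific label to one agreed name
FIELD_MAP = {"Customer ID": "customer_id", "Client Number": "customer_id"}

def normalize(record):
    """Rename known source-specific fields to their canonical names."""
    return {FIELD_MAP.get(k, k): v for k, v in record.items()}

joined = {**normalize(sales_record), **normalize(inventory_record)}
print(joined)  # {'customer_id': 42, 'amount': 99.95, 'sku': 'A-17'}
```

In a real pipeline this mapping lives in the transformation layer, but the idea is the same: agree on one name per concept, and translate at the boundary.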
The Eight Dimensions of Data Quality
The book introduces eight lenses through which you can evaluate data quality. Think of them as a checklist. When you find bad data, these dimensions help you figure out exactly what is wrong and what to fix.
1. Accuracy
Does the data match reality? If a customer’s email address is stored with a typo, even a single swapped character, that is an accuracy problem. Your marketing campaign goes to the wrong inbox. If a delivery address is wrong, packages go to the wrong location.
An accuracy SLA might say: error rate must be below 0.1% per reporting period.
2. Completeness
Is all the required data actually there? In a banking application, if the First Name, Last Name, or Social Security Number fields are NULL, the registration process might halt entirely.
Completeness also means record counts match between systems. If the source has 10,000 records but only 9,950 made it to the destination after migration, you have a gap.
You can measure this with a completeness score:
Completeness Score = (Total Records - Missing Records) / Total Records * 100
Here is a quick example from the book:
| Field | Total Records | Missing | Score |
|---|---|---|---|
| First Name | 1,000 | 50 | 95% |
| Last Name | 1,000 | 30 | 97% |
| SSN | 1,000 | 100 | 90% |
Overall completeness: (95 + 97 + 90) / 3 = 94%. Not terrible, but that SSN field needs attention.
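The same calculation is easy to script. A minimal sketch in Python, using the field names and counts from the table above:

```python
def completeness_score(total_records, missing_records):
    """Percentage of records that have a value for a field."""
    return (total_records - missing_records) / total_records * 100

# Per-field missing counts from the example table (1,000 records each)
missing_by_field = {"First Name": 50, "Last Name": 30, "SSN": 100}
total = 1000

scores = {field: completeness_score(total, missing)
          for field, missing in missing_by_field.items()}
overall = sum(scores.values()) / len(scores)

print(scores)   # {'First Name': 95.0, 'Last Name': 97.0, 'SSN': 90.0}
print(overall)  # 94.0
```

In practice you would pull the missing counts from a `COUNT(*) ... WHERE field IS NULL` query rather than hard-coding them.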
To check for missing values in SQL:
```sql
-- Find all rows where a specific column is NULL
SELECT *
FROM table_name
WHERE column_name IS NULL;
```
You can also set up data volume checks. If your batch pipeline normally outputs 200 rows and suddenly spits out 1,500, something is wrong with the query.
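A volume check like that is mostly a tolerance band around an expected count. A sketch, using the 200-row figure from the example (the ±50% band is an arbitrary threshold you would tune for your own pipeline):

```python
def check_volume(actual_rows, expected_rows, tolerance=0.5):
    """Return True if the batch's row count is within the
    allowed fractional deviation from the expected count."""
    low = expected_rows * (1 - tolerance)
    high = expected_rows * (1 + tolerance)
    return low <= actual_rows <= high

print(check_volume(210, expected_rows=200))   # True  -> within range
print(check_volume(1500, expected_rows=200))  # False -> investigate the query
```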
3. Consistency
Is the same data represented the same way everywhere? “John Doe” in one system and “J. Doe” in another is an inconsistency. A customer’s email should match across all tables in your system.
Consistency also covers referential integrity. Every foreign key in a child table must point to a valid primary key in the parent table. If an order references CustomerID 9999 but that customer does not exist, you have a problem.
Here is how to check:
```sql
-- Find orders that reference non-existent customers
SELECT orders.customer_id
FROM orders
LEFT JOIN customers ON orders.customer_id = customers.customer_id
WHERE customers.customer_id IS NULL;
```
Any results from this query are orphaned records. Orders without customers. Bad news for reporting.
4. Validity
Does the data conform to expected formats and ranges? This is about whether data makes logical sense.
The book gives a good example with a student table:
| Student ID | Name | Age | Birth Date | Grade |
|---|---|---|---|---|
| 101 | Alice Green | 18 | 12-05-2006 | A |
| 102 | Bob White | -3 | 01-15-2029 | B |
| 103 | Jane Doe | 20 | 09-10-2004 | X |
Three problems jump out. Bob’s age is negative. Bob’s birth date is in the future. Jane’s grade is “X” which is not a valid grade. You can catch these at the database level with CHECK constraints:
```sql
-- Age must be between 1 and 120
CREATE TABLE AgeValidation (
    student_id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    age INT NOT NULL,
    CHECK (age BETWEEN 1 AND 120)
);

-- Birthdate cannot be in the future
CREATE TABLE Student (
    student_id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    birthdate DATE NOT NULL,
    CHECK (birthdate <= CURRENT_DATE)
);

-- Grade must be one of these values
CREATE TABLE StudentGrades (
    student_id INT PRIMARY KEY,
    subject VARCHAR(50) NOT NULL,
    grade CHAR(1) NOT NULL,
    CHECK (grade IN ('A', 'B', 'C', 'D', 'F'))
);
```
These constraints prevent bad data from entering the database in the first place. Much better than fixing it after the fact.
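You can watch a constraint do its job with nothing more than SQLite, which enforces CHECK at insert time. A sketch mirroring the AgeValidation table above (the row values are Alice's and Bob's from the student table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AgeValidation (
        student_id INT PRIMARY KEY,
        name       VARCHAR(50) NOT NULL,
        age        INT NOT NULL,
        CHECK (age BETWEEN 1 AND 120)
    )
""")

# Valid row: accepted
conn.execute("INSERT INTO AgeValidation VALUES (101, 'Alice Green', 18)")

# Invalid row: rejected before it ever reaches the table
try:
    conn.execute("INSERT INTO AgeValidation VALUES (102, 'Bob White', -3)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only Alice's row lands in the table; Bob's negative age never gets past the constraint.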
5. Uniqueness
Are records duplicated? If the same customer appears twice in your banking database, you might create multiple accounts for one person. That causes transaction errors, reporting mistakes, and even fraud detection problems.
```sql
-- Find customer IDs that appear on more than one row
SELECT customer_id, COUNT(*) AS record_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```
You can also enforce uniqueness at the schema level with a UNIQUE constraint.
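Here is that constraint in action, again using SQLite as a stand-in (the table layout and the choice of email as the unique column are my assumptions, not the book's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE  -- one account per email address
    )
""")

conn.execute("INSERT INTO customers (email) VALUES ('jane@example.com')")

# A second row with the same email violates the UNIQUE constraint
try:
    conn.execute("INSERT INTO customers (email) VALUES ('jane@example.com')")
except sqlite3.IntegrityError:
    print("duplicate rejected")
```

The database refuses the second insert, so the duplicate-account problem never arises.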
6. Timeliness
Is data available when it is needed? If your SLA says customer data must be loaded by 9:00 AM and it arrives at 9:30 AM, that is a timeliness failure. Reports are late. Decisions are delayed.
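Checking an SLA like that is mostly a timestamp comparison. A sketch using the 9:00 AM deadline from the example (time zones, dates, and alerting are left out):

```python
from datetime import time

SLA_DEADLINE = time(9, 0)  # customer data must be loaded by 9:00 AM

def met_sla(loaded_at: time) -> bool:
    """True if the load finished at or before the deadline."""
    return loaded_at <= SLA_DEADLINE

print(met_sla(time(8, 45)))  # True
print(met_sla(time(9, 30)))  # False -> timeliness failure
```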
7. Accessibility
Can the right people actually get to the data? If sales and inventory teams need real-time stock data but it is locked behind an admin-only legacy system, the data is useless no matter how accurate it is.
8. Relevance
Does the data actually serve the current need? If the marketing team is preparing a Valentine’s Day campaign and you hand them customer data from two years ago, that is irrelevant. They need recent purchase behavior.
The Data Quality Hierarchy
This is a framework from the book that shows how different people in an organization care about data quality at different levels. Think of it as a pyramid:
Bottom: Accessibility - Can I find and access the data?
Next: Timeliness - Is it available when I need it?
Next: Relevance - Does it support my current business question?
Top: Accuracy - Is it a true representation of reality?
Each layer builds on the one below. There is no point worrying about accuracy if you cannot even access the data. And there is no point having timely data if it is not relevant to what you are trying to do.
Data engineers tend to focus on accuracy and completeness. Stakeholders start at the bottom of the pyramid and work up. Both perspectives matter.
Best Practices
The chapter closes with practical advice for maintaining data quality:
Build a data quality culture. Leadership needs to prioritize it. Every department should own the data they produce. Data stewards take responsibility for specific datasets.
Create a framework. Define what “good data” means using six of the eight dimensions (accuracy, completeness, consistency, timeliness, uniqueness, validity). Set thresholds. Tools like Monte Carlo can monitor data pipelines for quality issues in real time.
Profile your data. Look at it carefully before you start building on top of it. Find missing values, duplicates, and inconsistent formatting. Great Expectations is an open-source tool that lets you write validation rules for your data pipelines.
Monitor continuously. Use dashboards (Looker, Tableau, PowerBI) to track completeness, consistency, and error rates. Catch problems before they reach your stakeholders. Being proactive beats firefighting every time.
Data quality is not glamorous work. Nobody gets excited about NULL checks and completeness scores. But without it, everything downstream falls apart. Your ML models train on garbage. Your reports mislead. Your executives make bad decisions. The chapter makes this case well, and the practical SQL examples give you something to start with right away.
This is part 12 of 18 in my retelling of “Data Engineering for Beginners” by Chisom Nwokwu. See all posts in this series.
| < Previous: Pipeline Orchestration and Transformations | Next: Data Security > |