Data Science Foundations Chapter 4: Ethics, Laws, and Doing the Right Thing With Data

Imagine your company asks you to build a model that predicts health outcomes for people. Sounds great, right? Better treatments, healthier population, maybe even lower costs. But what if your health data gets shared? What if your insurance premiums go up because of something the model found? What if you get denied a service?

That is exactly how Chapter 4 opens. And it is the right question to start with. Just because you can do something with data does not mean you should.

Ethics First, Then Laws

The authors split this chapter into two big ideas: ethics and lawfulness. They are related but not the same thing.

Something can be legal but still unethical. And something can be ethical but technically against some regulation. You need to think about both. Every time you start a data project, collect data, or build a model.

Here’s the thing. There is no universal rulebook for ethics. People have been arguing about what is ethical since ancient Greece. The authors keep it practical with two questions:

  1. Will this harm anyone?
  2. Are people happy for their data to be used this way?

Simple questions. But they cut through a lot of noise.

When Does Ethics Matter in a Project?

Short answer: always. The chapter walks through each stage of a data project and asks ethical questions at every step:

  • Discovery - check whether a data impact assessment happened
  • Sourcing - ask about data quality and lineage; biased data leads to biased results
  • Analysis - ask whether you are using the data as intended
  • Communication - check that results are presented honestly
  • Production - ask whether anything changed from the original intent

The recruitment example hits close to home. You build a model to predict good job candidates. Your training data comes from past hiring decisions. But those decisions had bias baked in. Now your model carries that bias forward. The people who gave their data to apply for a job never imagined it would train an algorithm that screens future applicants. Even if they signed a consent form, is that really informed consent?

The A-Level Grading Disaster

The book brings up the 2020 UK exam situation. Covid cancelled exams. Someone had the bright idea to use data science: look at how schools performed historically, then assign grades based on those patterns. Rigorous and scientific compared to trusting teachers, right?

But students at smaller schools got excluded from the model. Students who worked hard got downgraded because their school had a weak track record. The backlash was massive. They scrapped the model and went with teacher assessments instead.

Was the model technically correct? Probably. Was it fair? Not for the individuals who got hurt by it.

GDPR and Data Protection

The legal side is more concrete. The authors focus on the UK’s Data Protection Act 2018, which implements GDPR. Instead of rigid rules, GDPR works on principles. Seven of them:

  • Lawful, fair, and transparent - you need a legal basis, must use data fairly, and people need to know what you are doing with it
  • Limited purpose - collect for a specific reason, do not repurpose without justification
  • Data minimisation - only use what you need
  • Accuracy - do not work with bad data
  • Storage limitation - delete it when you no longer need it
  • Integrity and confidentiality - keep it secure
  • Accountability - prove you follow all of the above

Some data gets extra protection: race, political opinions, genetics, biometrics, health data, sexual orientation. Stricter conditions apply.

The ICO (Information Commissioner’s Office) enforces all of this. And it is not just big corporations that get caught. The book mentions a school that got in trouble for using facial recognition in their cafeteria. A school.

Contracts and Intellectual Property

Data restrictions go beyond privacy laws. Contracts decide who can use data generated by outsourced operations and how. Third-party data licenses often have time limits and usage restrictions.

And do not forget intellectual property. Text, music, and images can all be represented as data, and copyright still applies.

How to Protect Yourself and Others

The chapter closes with practical risk mitigation strategies:

  • Follow your organization’s policies - know where they are and what they say
  • Data classification - label data (open, internal, confidential, secret) and handle it accordingly
  • Obfuscation and redaction - hide or blank sensitive fields when sharing reports
  • Aggregation - group data so individuals cannot be identified in small datasets
  • Anonymization - remove personal data, but be careful. Combining anonymized data with other sources can still reveal identities
  • Access controls - prevent people from accessing data they should not see
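To make a few of these strategies concrete, here is a minimal Python sketch of redaction and aggregation with small-group suppression. Everything in it is illustrative: the field names, the email format, and the threshold of 5 are assumptions, not anything prescribed by the book or by GDPR.

```python
# Illustrative sketch of two risk-mitigation strategies:
# redaction (blanking sensitive fields) and aggregation with
# small-group suppression (so individuals cannot be singled out).
# All names and thresholds here are made up for the example.

SUPPRESSION_THRESHOLD = 5  # suppress groups smaller than this

def redact_email(email: str) -> str:
    """Blank the local part of an email address before sharing a report."""
    local, _, domain = email.partition("@")
    return "***@" + domain if domain else "***"

def aggregate_by_city(records: list[dict]) -> dict:
    """Count records per city, replacing small counts with a band
    so low-population groups cannot identify individuals."""
    counts: dict[str, int] = {}
    for r in records:
        counts[r["city"]] = counts.get(r["city"], 0) + 1
    return {city: (n if n >= SUPPRESSION_THRESHOLD else "<5")
            for city, n in counts.items()}

records = [
    {"email": "alice@example.com", "city": "Leeds"},
    {"email": "bob@example.com", "city": "York"},
] + [{"email": f"user{i}@example.com", "city": "Leeds"} for i in range(5)]

print(redact_email("alice@example.com"))  # ***@example.com
print(aggregate_by_city(records))         # {'Leeds': 6, 'York': '<5'}
```

Note the caveat from the anonymization bullet still applies: even aggregated or redacted data can sometimes be re-identified when combined with other sources, so these techniques reduce risk rather than eliminate it.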

My Take

I have worked in IT long enough to see what happens when people skip the ethics conversation. They build something clever, ship it, then spend months cleaning up the mess.

This chapter does not give you a checklist that solves everything. It gives you the habit of asking uncomfortable questions early. That is more valuable than any compliance framework. The two questions the authors propose (will this harm anyone, and are people okay with this) should be taped to every data scientist’s monitor.

The A-level grading story says it all. Smart people, good math, terrible outcome for real humans. Data science is not just statistics. It is decisions that affect people’s lives.


This is part of my retelling of “Data Science Foundations” by Stephen Mariadas and Ian Huke. See all posts in this series.

Previous: Chapter 3: Project Delivery Next: Chapter 5: Discovery
