Data Science Foundations Chapter 5: The Discovery Phase and Asking the Right Questions

You have a data science project. Great. But before you touch any data, before you write a single line of code, you need to stop and think. That is what Chapter 5 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is about: the discovery phase. It is the part most people want to skip, and it is the part that saves you from wasting months on something that never had a chance.

Someone Has to Own This Thing

Every project needs a sponsor. The person accountable for the whole thing. They set direction, define success, and keep things on track.

Here’s the thing. Discovery is basically a reality check. Can we actually do this? Do we have the people, the data, the tools, the money? Is it even worth doing? Sometimes the answer is no. Better to find out now than six months in.

Build the Right Team

In a small company, one person wears many hats. In a big organization, each role can be a whole department. Either way, think about who covers each area: sponsor (direction), subject matter expert (domain knowledge), researcher (data gathering), data engineer (cleaning and transforming data), data architect (infrastructure and security), compliance officer (regulations), project manager (coordination), and data scientist (models and analysis).

One important point. You need someone who will challenge the models. A second pair of eyes reduces bad assumptions. This is not optional.

Know Your Domain

Data does not exist in a vacuum. You need to understand the real world context around your problem. The authors call this domain context.

Good example from the book. Imagine you analyze phone faults. At a regular office-based company, an unreliable phone is annoying but not critical. At an emergency response organization, phones are used in extreme conditions, and reliability is a matter of life or death. Same data, completely different stakes.

Your audience matters too. Presenting to the managing director is different from presenting to a department manager. Different knowledge, different priorities.

Resources: What Do You Actually Have?

Beyond people, you need tools, infrastructure, and data. Most of the time, you work with what your organization already has. No dream setup. So during discovery, check what is available. Can your tools handle the analysis? Is there enough storage and computing power?

And where is your data coming from? Is it available? Good quality? Do you have the right to use it? Can you combine data from different sources? These questions save you from nasty surprises later.

Define the Problem Clearly

This is where many projects go wrong. Different stakeholders see different problems.

The book uses a music streaming example. The company has a subscription problem. But the sales director thinks it is about attracting new subscribers. The finance director thinks existing subscribers leave because they do not use the service enough. Two very different problems needing two very different approaches.

So you frame the problem. Say the finance director wins. The idea: if subscribers listen more, they are less likely to cancel. So recommend music they enjoy.

Now you have specific questions. Does streaming more affect cancellation? Do recommendations make people listen more? If someone likes one artist, which other artists will they enjoy?

Also think about when the project ends. If you cannot define a clear endpoint, set a time box. Fixed time, then evaluate. This stops projects from dragging on forever.

Build Your Hypotheses

A hypothesis is a testable statement. It is not vague. It is specific.

Bad hypothesis: “People who stream more stay longer.” Too general. Stream how much? Over what period?

Better hypothesis: “Subscribers who stream more than eight hours of music in a month are less likely to cancel their subscription the following month.” Now you can actually test that.
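To make "testable" concrete, here is a minimal Python sketch of a first check on the eight-hour hypothesis. It is not from the book. The `SubscriberMonth` record, the `cancellation_rates` function, and the sample numbers are all made up for illustration, and a real analysis would use far more data plus a proper significance test.

```python
from dataclasses import dataclass


@dataclass
class SubscriberMonth:
    """One subscriber in one month (hypothetical record layout)."""
    hours_streamed: float        # hours streamed during the month
    cancelled_next_month: bool   # whether they cancelled the following month


def cancellation_rates(records, threshold_hours=8.0):
    """Return (rate_above, rate_at_or_below) for the given threshold.

    rate_above is the share of subscribers streaming MORE than the
    threshold who cancelled the following month; rate_at_or_below is
    the same share for everyone else.
    """
    above = [r for r in records if r.hours_streamed > threshold_hours]
    below = [r for r in records if r.hours_streamed <= threshold_hours]
    if not above or not below:
        raise ValueError("both groups need at least one subscriber")
    rate_above = sum(r.cancelled_next_month for r in above) / len(above)
    rate_below = sum(r.cancelled_next_month for r in below) / len(below)
    return rate_above, rate_below


# Made-up sample data, only to show the shape of the check.
sample = [
    SubscriberMonth(12.0, False),
    SubscriberMonth(9.5, False),
    SubscriberMonth(20.0, True),
    SubscriberMonth(15.0, False),
    SubscriberMonth(2.0, True),
    SubscriberMonth(7.5, True),
    SubscriberMonth(0.5, False),
    SubscriberMonth(8.0, True),
]

rate_above, rate_below = cancellation_rates(sample)
print(f"Cancelled after streaming more than 8 hours: {rate_above:.0%}")
print(f"Cancelled after streaming 8 hours or less:   {rate_below:.0%}")
```

The specific wording of the hypothesis is what makes this possible: the threshold, the time period, and the outcome each map to something you can compute. A gap between the two rates would be a reason to look closer, not proof that streaming more causes people to stay.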

And hypotheses chain together. If the first one holds, you test the next: “More music recommendations in a month lead to longer listening time.” Then: “Recommending artists a subscriber has already listened to increases the chance they play the recommendation.”

Each hypothesis builds on the last. Each one gets you closer to understanding the real problem.

Do Your Homework First

The authors recommend desk research before talking to stakeholders. Read industry news. Check internal documents. Find previous similar projects. When you sit down with people, you ask smarter questions and waste less of everyone’s time.

Use interviews for deep one-on-one conversations. Use workshops when you need group consensus. Prepare for both. Have questions ready.

Bottom Line

Discovery is not glamorous. Nobody tweets about their great feasibility study. But it is the foundation everything else sits on. Get it wrong, and you build models that answer the wrong questions.

Slow down at the start to go faster later. Understand your domain, build the right team, frame the problem clearly, and develop testable hypotheses. Then start working with the data.

