Data Science Foundations Chapter 10 Part 1: Picking the Right Model for Your Data

You have data. You have a question. But which model do you actually use?

Chapter 10 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is the biggest chapter in the book. So big I split this retelling into two parts. This is Part 1. It covers the types of analytics, understanding your data and hypothesis, and how to pick the right model.

Three Flavors of Analytics

Before picking a model, you need to know what kind of answer you want. The authors break analytics into three types. Think of them as three levels of ambition.

Descriptive analytics looks backward. What happened? You take historical data, summarize it, and spot patterns. A retail store checking which products sold best last year. A website counting visitors from last month. Dashboards with bar charts and averages.

It gives you a clear picture of the past. But it does not explain why things happened.

Predictive analytics looks forward. What will probably happen next? You take historical patterns and use statistical models or machine learning to forecast outcomes. An online store predicting who will buy next week. A weather app estimating tomorrow’s temperature. Regression, decision trees, neural networks.

But here’s the thing. Predictive analytics tells you what might happen. It does not tell you what to do about it.

Prescriptive analytics answers that last question. What should we do? It combines predictions and optimization to recommend actions. A delivery company finding the fastest routes. A hospital staffing up for flu season. A marketing team picking which emails to send to which customers.

Most advanced type. And the most useful for decision-makers.

The Coffee Shop Example

The authors use a nice example to show how all three work together. Imagine you run a coffee shop.

Descriptive: you look at last month’s sales and see lattes sell best on weekdays. Predictive: you forecast latte sales will go up in winter because people want warm drinks. Prescriptive: you stock up on latte ingredients, add seasonal flavors, and adjust your staff schedule.

Simple. But powerful at scale.

Know Your Data First

Before you pick any model, you need to understand your data. The authors call this exploratory data analysis, or EDA. It is the “look before you jump” step.

Visualize your data. Plot it. Calculate basic statistics. Two measures matter a lot here: variance and standard deviation.

Variance tells you how spread out your data points are from the average. High variance means data is all over the place. Low variance means tight clustering around the mean. Standard deviation is the square root of variance. Easier to interpret because it uses the same units as your data.

Why does this matter? Say you are analyzing house prices. If the standard deviation is huge, prices vary wildly. That affects which model you pick and whether you need to handle outliers. The book gives a finance example: an investment firm uses the standard deviation of portfolio returns to measure volatility. A low value means consistent returns. A high value means some clients win big while others lose.
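Both measures are a few lines in Python's standard library. A minimal sketch of the house-price idea; the prices here are invented for illustration:

```python
import statistics

# Hypothetical house prices in thousands of dollars (invented numbers).
# One expensive outlier is enough to blow up the spread.
prices = [250, 300, 275, 900, 310]

mean = statistics.mean(prices)           # the average price
variance = statistics.pvariance(prices)  # average squared distance from the mean
std_dev = statistics.pstdev(prices)      # square root of variance, in the same
                                         # units as the data (thousands of dollars)

print(f"mean={mean}, variance={variance}, std dev={std_dev:.1f}")
```

The standard deviation here is larger than some of the prices themselves, which is exactly the "prices vary wildly" signal that should make you think about outlier handling before modeling.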

Setting Up Your Hypothesis

Data science is still a science. You need a hypothesis before you test anything.

A hypothesis is a formal statement about a relationship between variables. The null hypothesis (H0) says there is no effect. The alternative hypothesis (HA) says there is one. You test to see which the data supports.

The process: set a significance level (usually 5%), calculate a test statistic, compare it to a critical value. If the statistic exceeds the threshold, you reject the null hypothesis. The book uses a drug trial example. A company tests if a new drug lowers blood pressure better than the old one. The hypothesis test tells them if the difference is real or just noise.
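The drug-trial logic can be sketched with a two-sample t-statistic computed by hand. The blood-pressure numbers below are invented, and the critical value is a rough stand-in; a real analysis would look up the t-distribution with the proper degrees of freedom:

```python
import math
import statistics

# Hypothetical blood-pressure reductions (mmHg) for two drugs. Invented data.
old_drug = [8, 10, 9, 7, 11, 9, 8, 10]
new_drug = [12, 14, 11, 13, 15, 12, 13, 14]

n1, n2 = len(old_drug), len(new_drug)
m1, m2 = statistics.mean(old_drug), statistics.mean(new_drug)
v1, v2 = statistics.variance(old_drug), statistics.variance(new_drug)

# Welch's t-statistic: difference in means scaled by its standard error
t_stat = (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)

# Rough two-sided critical value at the 5% significance level for this
# sample size. H0: no difference between the drugs.
critical = 2.1
if abs(t_stat) > critical:
    print(f"t = {t_stat:.2f}: reject H0 -- the difference looks real")
else:
    print(f"t = {t_stat:.2f}: fail to reject H0 -- could be noise")
```

Same process the book describes: pick a significance level, compute a test statistic, compare it to a threshold, and only then decide whether the effect is real.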

Picking the Right Model

So you know your data. You have a hypothesis. Now what?

The authors lay out six main categories of models. Each one answers a different type of question.

Correlation measures whether two variables are related. Do they move together? In opposite directions? Not at all? The output is a number between -1 and +1. But correlation does not mean causation. TV streaming is going up. So is Earth’s temperature. They “correlate.” Netflix is not causing climate change.
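The -1 to +1 number is the Pearson correlation coefficient, and it is short enough to compute from the definition. The hours-studied/exam-score pairs below are invented:

```python
import math
import statistics

# Hypothetical pairs: hours studied vs. exam score (invented numbers)
x = [1, 2, 3, 4, 5]
y = [52, 58, 63, 71, 76]

mx, my = statistics.mean(x), statistics.mean(y)

# Pearson r: covariance divided by the product of the spreads
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))

print(f"r = {r:.3f}")  # close to +1: the variables move together
```

An r near +1 or -1 says the variables move together, nothing more. The streaming-versus-temperature pair would score high on this number too.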

Regression predicts a continuous number. How will battery capacity change as a phone ages? You draw a best-fit line through your data points. The book shows a phone battery dropping from 100% to 65% over seven years. The regression equation lets you predict what happens at year 10.

Time series handles data collected over time. Stock prices, monthly sales, hourly temperature. It finds trends and seasonal patterns to forecast future values.

Classification sorts data into predefined categories. Is this email spam or not? Will this customer leave or stay? It requires labeled training data.

Clustering finds natural groups in your data without predefined labels. Customer segmentation is a classic use case. You do not know the groups in advance. The algorithm discovers them.

Association discovers patterns between items. People who buy bread often buy butter too. Retailers use this for recommendations and store layouts.
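Association rules boil down to counting. The two standard numbers are support (how often the items appear together) and confidence (given one item, how often the other follows). A toy sketch with invented shopping baskets:

```python
# Hypothetical shopping baskets (invented) -- each set is one transaction
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(baskets)
bread = sum("bread" in b for b in baskets)          # baskets containing bread
both = sum({"bread", "butter"} <= b for b in baskets)  # baskets with both

support = both / n         # how often bread and butter appear together overall
confidence = both / bread  # given bread, how often butter comes along

print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

A retailer scanning real transaction logs does the same counting at scale, then keeps only the rules whose support and confidence clear a chosen threshold.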

A Warning About Human Bias

One thing the authors highlight that I appreciated: watch out for model selection bias. Data scientists tend to reuse models they already know. But “comfortable and familiar” does not equal “best for this problem.” A good data scientist questions the model choice every time.

Key Takeaways

  • Descriptive, predictive, and prescriptive analytics answer different questions: what happened, what will happen, what should we do
  • Always explore your data first with EDA, variance, and standard deviation
  • Set up a proper hypothesis before testing
  • Match your model to your question: correlation for relationships, regression for numbers, time series for trends, classification for categories, clustering for unknown groups, association for item patterns
  • Fight the urge to reuse old models just because they are convenient

This chapter is dense. And we are only halfway through. Part 2 covers time series, classification, and clustering with practical examples and pitfalls.

