Data Science Foundations Chapter 9: Averages, Probability, and the Math You Actually Need
Math scares people. I get it. You hear “standard deviation” and suddenly you are back in high school staring at the board. But here’s the thing. Chapter 9 of Data Science Foundations by Stephen Mariadas and Ian Huke covers the math basics you actually need for data science. And none of it is that hard.
Three Ways to Find the “Middle”
You probably know about averages. But there are three kinds, and each tells a different story.
Mean is what most people think of. Add all numbers, divide by how many you have. If your friend group earns 30k, 35k, 40k, 45k, and 500k, the mean salary is 130k. Does that represent the group? Not really. That one rich friend pulled the number way up. Outliers mess with the mean.
Median is the middle value when you sort your numbers. Same group sorted: the median is 40k. Much closer to reality. The authors point out that median works better for skewed data. And most real-world data is skewed.
Mode is the value that shows up most often. It is the only one that works with categories, not just numbers. If five people picked “blue” as favorite color and three picked “red,” the mode is “blue.” But it can be unstable. Small data changes can shift it completely.
The authors’ advice? Use all three together. I agree. In my years working with data, relying on just one measure is a fast way to get fooled.
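Python's standard library covers all three measures, so you can check the chapter's salary example yourself. A quick sketch (salary figures are the friend-group example above, in thousands; the color list is made up to show mode on categories):

```python
from statistics import mean, median, mode

# The friend-group salaries from above, in thousands
salaries = [30, 35, 40, 45, 500]

print(mean(salaries))    # 130 -- dragged way up by the one rich friend
print(median(salaries))  # 40  -- much closer to a "typical" salary

# Mode is the only one of the three that works on categories
colors = ["blue", "blue", "blue", "blue", "blue", "red", "red", "red"]
print(mode(colors))      # blue
```

Comparing the mean and median side by side like this is the fastest way to spot a skewed distribution: when they disagree badly, an outlier is usually the reason.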
How Spread Out Is Your Data?
Knowing the center is not enough. You also need to know how spread out the data is.
Range is the simplest. Biggest minus smallest. If top exam score is 95 and bottom is 65, range is 30. Easy but misleading. One extreme score changes everything.
Variance is better. You take each data point, subtract the mean, square the result, then average those squared differences. High variance means data is all over the place. Low variance means everything clusters near the center.
Standard deviation is the square root of variance. It gives you a number in the same units as your data. If exam scores have a standard deviation of 10, most students (roughly two-thirds, when the scores are bell-shaped) landed within 10 points of the average. Practical and easy to explain.
Interquartile range (IQR) focuses on the middle 50% of your data. Cut off the bottom 25% and top 25%, look at what is left. Outliers barely affect it. The IQR tells you how the bulk of your data behaves.
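All four spread measures are a few lines of standard-library Python. A sketch using a made-up set of exam scores (chosen so the numbers come out clean; `statistics.quantiles` needs Python 3.8+):

```python
import statistics

# Hypothetical exam scores, sorted for readability
scores = [65, 70, 75, 80, 85, 90, 95]

data_range = max(scores) - min(scores)   # 95 - 65 = 30: simplest, most fragile
variance = statistics.pvariance(scores)  # average squared deviation from the mean (80)
std_dev = statistics.pstdev(scores)      # square root of variance, back in "points"

# IQR: cut the data into quarters, measure the middle half
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(data_range, variance, std_dev, iqr)
```

Note that `pvariance`/`pstdev` treat the list as the whole population; if your scores were a sample from a larger group, you would use `variance`/`stdev` instead, which divide by n − 1.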
Probability: The Language of “Maybe”
A probability is a number between 0 and 1. Zero means it will not happen. One means it will definitely happen. The chance of something NOT happening is just 1 minus the chance it will. That is a complementary event.
Independent events do not affect each other. Flipping a coin and rolling a die are independent. The coin does not care what the die does. Dependent events are connected. Pull a card from a deck without putting it back, and the next pull is affected because the deck changed.
Conditional probability asks “what is the chance of A, given that B already happened?” Written as P(A|B). What is the probability a customer buys something, given they already added it to cart? That is conditional probability.
The book covers probability distributions too. Discrete ones handle things you count, like heads in 10 coin flips. Continuous ones handle things you measure, like temperature or time. The normal distribution (the bell curve) is the big one in statistics.
Then there is the law of large numbers. Flip a coin 10 times, you might get 7 heads. Flip it 10,000 times, you will be very close to 50-50. More data means results get closer to the true probability. This is why big datasets matter.
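You can watch the law of large numbers happen with a simulated coin. A minimal sketch (the seed is arbitrary, just there to make the run repeatable):

```python
import random

random.seed(42)  # fix the randomness so reruns give the same output

def heads_fraction(flips):
    """Flip a fair simulated coin `flips` times; return the fraction of heads."""
    heads = sum(random.random() < 0.5 for _ in range(flips))
    return heads / flips

for n in (10, 100, 10_000):
    print(n, heads_fraction(n))
```

With only 10 flips the fraction bounces around; by 10,000 flips it sits very close to 0.5. Same coin, same probability, but more data pulls the observed frequency toward the true value.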
The Cartesian Plane and Distance
This sounds like pure geometry, but it matters. Many machine learning models rely on distance between data points.
The Cartesian plane is a grid with x and y axes crossing at (0, 0). Every point has two coordinates. The point (3, 4) means go 3 right and 4 up. Simple.
Distance between two points uses the Pythagorean theorem. Points A(1, 2) and B(4, 6). Horizontal difference is 3, vertical is 4. Square both, add them, take the square root. You get 5. Classic 3-4-5 triangle.
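The worked example above translates directly into code. A sketch of the Pythagorean calculation, plus the standard-library shortcut `math.dist` (Python 3.8+) that does the same thing:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two 2-D points (x, y)."""
    dx = b[0] - a[0]  # horizontal difference
    dy = b[1] - a[1]  # vertical difference
    return math.sqrt(dx**2 + dy**2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0 -- the classic 3-4-5 triangle
print(math.dist((1, 2), (4, 6)))           # 5.0 -- same result, one call
```

This is exactly the measurement k-nearest neighbors repeats thousands of times: compute the distance from a new point to every known point, then look at the closest ones.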
Why care? Because algorithms like k-nearest neighbors literally measure distances between data points to make predictions. If you do not get how distance works, those algorithms feel like black boxes.
My Take
This chapter is a refresher for anyone who studied math before, and a solid primer for those who did not. No heavy notation. No proofs. Just concepts you will actually use.
The probability section is the most valuable part. Understanding uncertainty is what separates reading a chart from making decisions based on data. Do not skip this chapter. Everything that comes after, model selection and evaluation, builds on these ideas.
Previous: Chapter 8: Data Preparation Next: Chapter 10 Part 1: Model Selection