Data Science Foundations Chapter 12: How to Know If Your Model Actually Works
You built a model. It runs. It gives you numbers. But does it actually work? That is what Chapter 12 of “Data Science Foundations” by Stephen Mariadas and Ian Huke is about. Building a model is one thing. Trusting it is something else.
Probability: The Foundation
Before you evaluate anything, you need probability. It is a number between 0 and 1. Zero means something will never happen. One means it will definitely happen. Coin flip? 0.5 chance for heads. Roll a die? About 0.167 chance for any one number. Simple. But everything in model evaluation comes back to this: how likely is it that what we see is real?
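Those numbers are just counts of equally likely outcomes: favorable over total. A quick sketch using exact fractions, standard library only:

```python
from fractions import Fraction

# Probability of an event with equally likely outcomes:
# favorable outcomes divided by total outcomes.
heads = Fraction(1, 2)        # one favorable face out of two
one_die_face = Fraction(1, 6) # one favorable face out of six

print(float(heads))                   # 0.5
print(round(float(one_die_face), 3))  # 0.167
```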
P-Values: The Surprise Meter
A p-value tells you how surprised you should be by your results. If nothing special was going on, how likely would it be to see data like this?
Small p-value (0.05 or less) means your results would be surprising if nothing were going on, so something probably is. Large p-value means your results look like ordinary chance. Nothing interesting.
Say you flip a coin 20 times and get 17 heads. The p-value for that is about 0.003. Only about a 0.3% chance of a result that extreme with a fair coin. So the coin is probably biased. The common threshold is 0.05. Below that, you call it statistically significant. But here is the thing. You pick that threshold before you run the test. Not after. Choosing after is moving the goalposts.
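That 0.003 figure can be reproduced exactly. A minimal sketch, standard library only, summing the probability of every outcome at least as unlikely as 17 heads (the function name is mine):

```python
from math import comb

def two_sided_binom_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: the total probability of all
    outcomes at least as unlikely as observing k heads in n flips."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return sum(q for q in probs if q <= observed)

print(round(two_sided_binom_p(17, 20), 4))  # 0.0026
```

With a fair coin, 17 or more heads (or 3 or fewer, the mirror case) happens roughly 3 times in 1,000 runs. Hence "probably biased."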
Hypothesis Testing in Five Steps
The authors give a clean process. First, state two hypotheses. The null says nothing special is happening. The alternative says something is. Second, pick your significance level (usually 0.05). Third, collect data and calculate a test statistic. Fourth, find your p-value. Fifth, decide. If the p-value is below your threshold, reject the null. If not, you fail to reject it.
Notice the language. You “fail to reject” the null. You do not “accept” it. Absence of evidence is not evidence of absence.
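The five steps can be sketched end to end. This version estimates a one-sided p-value by simulation rather than an exact formula, and the function name is mine:

```python
import random

def five_step_test(observed_heads, n_flips=20, alpha=0.05, sims=100_000, seed=0):
    # Step 1: null = the coin is fair; alternative = it is biased toward heads.
    # Step 2: significance level alpha, chosen before looking at the data.
    # Step 3: the test statistic is the number of heads observed.
    # Step 4: estimate the p-value by simulating many fair-coin experiments
    #         and counting how often they look at least as extreme.
    rng = random.Random(seed)
    as_extreme = sum(
        sum(rng.random() < 0.5 for _ in range(n_flips)) >= observed_heads
        for _ in range(sims)
    )
    p_value = as_extreme / sims
    # Step 5: decide, using the language carefully.
    decision = "reject the null" if p_value < alpha else "fail to reject the null"
    return p_value, decision

p, decision = five_step_test(17)
```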
Train vs Test: The Overfitting Problem
Your model scores great. But on what data? If it only works well on the data it trained on, you have a problem.
Overfitting means your model memorized the training data instead of learning real patterns. High scores on training data, terrible on new data. Like a student who memorizes answers but cannot solve fresh problems. Underfitting is the opposite. The model is too simple and fails on both sets. It missed the pattern entirely.
Split your data. Train on one part. Test on the other. If scores are similar, your model generalizes. Big gap? Something is wrong.
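The split itself is simple. Libraries such as scikit-learn provide this, but a hand-rolled sketch (the name mirrors the common library function) shows the mechanics:

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the data, then hold out a fraction for testing.
    The model must never see the test rows during training."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))
train, test = train_test_split(data)
# 75 rows to learn from, 25 held back to check generalization.
```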
Not All Errors Are Equal
False positive: the model predicted something that did not happen. False negative: it missed something that did happen. Which one is worse? Depends on context.
Medical testing? A false negative (telling a sick person they are fine) can kill someone. Spam filter? A false positive (flagging a real email as spam) is annoying but not dangerous.
The confusion matrix shows all this in a table. From it you get precision (of everything you flagged, how many were right) and recall (of everything that existed, how many did you catch). The F1 score combines both into one number.
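All three metrics come straight from the confusion-matrix counts. A minimal sketch, with hypothetical spam-filter numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """Metrics from confusion-matrix counts:
    tp = correctly flagged, fp = wrongly flagged, fn = missed."""
    precision = tp / (tp + fp)  # of everything you flagged, how many were right
    recall = tp / (tp + fn)     # of everything that existed, how many you caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both
    return precision, recall, f1

# Hypothetical spam filter: 90 spam caught, 10 real emails flagged, 30 spam missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
# precision 0.90, recall 0.75
```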
Reading the Model
R-squared shows how well data fits a regression model. Higher is generally better, but a high R-squared on a bad model is still a bad model. Feature importance tells you which variables matter most. If “number of bedrooms” ranks high for house prices but “front door color” ranks low, that makes sense. Residual analysis looks at gaps between predictions and reality. If those gaps show a pattern, your model is missing something.
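R-squared and residuals are both cheap to compute by hand. A sketch with made-up numbers:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted))
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.3, 8.8]

# Residuals are the gaps between reality and prediction;
# plot these and look for patterns.
residuals = [y - yhat for y, yhat in zip(actual, predicted)]
```

Here the predictions track the data closely, so R-squared comes out near 1; the point of residual analysis is to check that what is left over looks like noise, not structure.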
For recommendation systems, three metrics matter. Support is how often an item appears in your data. Confidence is how often a rule holds true. If 80% of bread buyers also buy butter, confidence for that rule is 0.80. Lift compares the rule to random chance. Above 1 means the items appear together more often than chance alone would produce.
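All three fall out of simple counting over transactions. A sketch with an invented basket dataset:

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk", "jam"},
]

def support(item):
    """Fraction of transactions containing the item."""
    return sum(item in t for t in transactions) / len(transactions)

def confidence(a, b):
    """Of the transactions containing a, the fraction that also contain b."""
    has_a = [t for t in transactions if a in t]
    return sum(b in t for t in has_a) / len(has_a)

def lift(a, b):
    """Confidence relative to what random chance would give."""
    return confidence(a, b) / support(b)

# support("bread") = 0.8, confidence("bread" -> "butter") = 0.75,
# lift = 0.75 / 0.6 = 1.25, so the association beats chance.
```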
Bottom Line
Model evaluation is not optional. Test your hypotheses properly. Check for overfitting. Understand what errors your model makes and whether those errors matter in your context. A model that looks great on paper can still be useless in practice.