Real-World Bilingual Data Science: A Python and R Case Study

The whole book has been building to this. Six chapters of philosophy, syntax comparisons, and interoperability tricks. Now Chapter 7 drops a real project on the table. Build it with both languages. Together. Start to finish.

And it works.

The Dataset: 1.88 Million Wildfires

The authors pick the US Wildfires dataset from the USDA. 24 years of geo-referenced wildfire records. 1.88 million fires. 140 million acres of forest burned. Stored in an SQLite database with 39 features.

But here’s the thing. They don’t use all of it. They filter down to 2015 only, drop Alaska and Hawaii, and keep six columns: cause of fire, land owner code, day of year discovered, fire size, latitude, and longitude. That gives them about 73,000 observations. Manageable on a normal laptop.

The goal is a classification model: given five features, predict the cause of a fire. But the authors are upfront. This is not an ML tutorial. They skip cross-validation and hyperparameter tuning on purpose. The point is the bilingual workflow, not the model.

Planning Before Coding

Before writing a single line of code, the authors stop and ask five questions. What is the end product? Who uses it? Can we break it into pieces? Which language handles each piece? How do the pieces connect?

In my 20 years in IT, the projects that fail hardest are the ones that skip this step. The authors lay out a simple table:

  1. Data importing in R (using RSQLite and DBI)
  2. EDA and visualization in R (ggplot2, GGally, leaflet)
  3. Feature engineering in Python (scikit-learn)
  4. Machine learning in Python (scikit-learn)
  5. Interactive mapping in R (leaflet)
  6. Web interface in R (Shiny runtime in R Markdown)

R handles the data import, exploration, and final user-facing product. Python handles the machine learning. Each language does what it does best. No ego, no turf wars. Just good planning.

Exploration in R

R connects to the SQLite file with RSQLite and DBI, runs a SQL query to grab only the columns and year needed, and immediately closes the connection. Clean.
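The book does this step in R with RSQLite and DBI, but the pattern translates directly. Here is a minimal sketch in Python's built-in sqlite3, using an in-memory stand-in database; the table and column names (Fires, FIRE_YEAR, STATE, and so on) are assumptions based on the dataset's published schema, not code from the book.

```python
import sqlite3

# Tiny in-memory stand-in for the wildfires database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Fires (
        STAT_CAUSE_DESCR TEXT, OWNER_CODE REAL, DISCOVERY_DOY INTEGER,
        FIRE_SIZE REAL, LATITUDE REAL, LONGITUDE REAL,
        STATE TEXT, FIRE_YEAR INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Fires VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("Lightning", 5.0, 200, 0.1, 40.0, -120.0, "CA", 2015),
        ("Debris Burning", 8.0, 95, 2.5, 33.9, -83.3, "GA", 2015),
        ("Lightning", 5.0, 180, 1.0, 61.0, -150.0, "AK", 2015),  # dropped: Alaska
        ("Arson", 8.0, 50, 0.3, 34.0, -84.0, "GA", 2014),        # dropped: wrong year
    ],
)

# The same filter the chapter applies: 2015 only, continental US,
# six columns -- and close the connection as soon as the rows are out.
query = """
    SELECT STAT_CAUSE_DESCR, OWNER_CODE, DISCOVERY_DOY,
           FIRE_SIZE, LATITUDE, LONGITUDE
    FROM Fires
    WHERE FIRE_YEAR = 2015 AND STATE NOT IN ('AK', 'HI')
"""
rows = conn.execute(query).fetchall()
conn.close()

print(len(rows))  # only the two qualifying rows survive the filter
```

Filtering in the SQL query itself, rather than loading all 1.88 million rows and filtering in memory, is what keeps the workflow laptop-friendly.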

Then ggplot2 goes to work. Latitude and longitude become x and y coordinates, fire size maps to point size, owner code maps to color. You can literally see the continental US emerging from the data points. Faceting by fire cause reveals that lightning and debris burning are common while others are rare. The data is visibly imbalanced.

A pairs plot from GGally shows correlations between features are low (good for ML), but fire size has an extreme right skew. Most fires are tiny, a few are enormous. A log-transformed density plot confirms this.

Then they add a Leaflet interactive map with marker clustering. Thousands of fire locations, zoomable, without melting the browser.


Machine Learning in Python

Here is where the language switch happens. The authors use reticulate to set up a Python virtual environment from inside RStudio. The R data frame passes straight into Python.

Feature engineering is straightforward: select the five predictor columns, encode the categorical target with LabelEncoder, and split into training and test sets with stratified sampling to handle the class imbalance they spotted during EDA.
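Those two scikit-learn steps look roughly like this. A minimal sketch with made-up toy data (the feature values and class mix are illustrative, not the book's):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the fires table: an imbalanced string target,
# as in the chapter's EDA, plus two numeric features.
causes = ["Lightning"] * 60 + ["Debris Burning"] * 30 + ["Arson"] * 10
features = [[i % 15, (i * 3.7) % 360] for i in range(len(causes))]

# Encode the string target to integers, as scikit-learn expects.
le = LabelEncoder()
y = le.fit_transform(causes)

# stratify=y keeps the 60/30/10 class mix in both splits, so the
# rare classes aren't starved out of either set.
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, stratify=y, random_state=1
)

print(len(X_train), len(X_test))  # 80/20 split: 80 train, 20 test
```

Without `stratify`, a random 20% split could easily under-sample the rarest causes; with it, the test set mirrors the population proportions exactly.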

The model is a random forest classifier. Well-established, easy to understand, doesn’t need feature scaling. They fit it, predict on the test set, and get 58% accuracy. Not amazing, but way better than random chance on 13 classes.
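The modeling step, sketched on synthetic data. The generator parameters below (2,000 samples, 5 features, 4 classes) are placeholders for illustration, not the chapter's actual data or results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the five fire features.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A plain random forest: no feature scaling, no tuning, mirroring
# the chapter's deliberately simple setup.
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
conmat = confusion_matrix(y_test, pred)  # the object R later reads as py$conmat
print(f"accuracy: {acc:.2f}")
```

Note that the confusion matrix is computed on the Python side but kept as a named object; that is what makes the later handoff to R possible.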

Then comes the handoff back. They pull the confusion matrix into R by accessing py$conmat and plot it with ggplot2. Python trains the model, R visualizes the results. Two languages, one pipeline.

The Interactive Product

The final deliverable is an R Markdown document with Shiny runtime. Five slider inputs let a user pick values for owner code, day of year, fire size, latitude, and longitude. The document feeds those values into the Python model using clf$predict(r_to_py(input_df)) and returns a prediction in real time.

It is simple. It is not production-grade. But it demonstrates the entire pipeline from raw SQLite data to interactive web app, using both languages where they shine.

What Actually Matters Here

The model accuracy is not the point. The Shiny app is not the point. The point is the workflow.

R imported data because its database packages were already loaded. R explored and visualized because ggplot2 and Leaflet are excellent for that. Python built the model because scikit-learn’s API is clean and consistent. R built the front end because Shiny integrates naturally with R Markdown.

No language did everything. Both contributed what they do best. Reticulate made the handoff seamless. The whole thing lived inside a single RStudio project. That is bilingual data science in practice.

My Take

This is a satisfying ending to the book. After chapters of “here’s how R works” and “here’s how Python works” and “here’s how they talk to each other,” you finally see it all assembled. The case study is not flashy. The authors are honest about its limitations. But it proves the thesis from the preface: picking the right tool for each part of the job produces better work than forcing one tool to do everything.

If you’ve been following this book series, this chapter is the payoff. If you skipped straight here, go back. The case study makes more sense when you understand the design decisions behind it.

Previous: Chapter 6 - Using Both Languages Together | Next: The Appendix - Python and R Cheat Sheet
