R for Python Developers - Lists, Factors, and Data Wrangling

In Part 1 we covered R basics: setting up your environment, installing packages, working with tibbles, and understanding R’s type system. Now we get to the good stuff: lists, factors, finding things in your data, and the iteration patterns that make R feel so different from Python.

R Lists Are Not Python Lists

When R says “list,” it does not mean what Python means. An R list is a one-dimensional container where every element can be a different type: vectors, other lists, data frames, matrices, whatever. You can stuff anything in there, and elements can be named.
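
To make that concrete, here is a minimal sketch of a mixed-type list. The object name record and all of its element names are made up for illustration; PlantGrowth is a data set that ships with R.

```r
# A hypothetical record mixing several types in one list
record <- list(
  name   = "sample",
  scores = c(90, 85, 77),
  nested = list(flag = TRUE),
  table  = head(PlantGrowth, 3)
)

length(record)           # 4 top-level elements
record$scores            # access a named element with $
record[["nested"]]$flag  # [[ ]] extracts a single element
class(record["scores"])  # "list": single brackets keep the list wrapper
```

The difference between [ ] and [[ ]] trips up most newcomers: single brackets return a smaller list, double brackets return the element itself.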

And here’s a fun fact: a data.frame is actually a list. Specifically, it’s a list where every element is a vector of the same length. That’s it. That’s the whole secret.

typeof(my_tibble)
# [1] "list"
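
You can check the list-of-columns idea on any data frame. This sketch uses the built-in PlantGrowth data instead of the tibble from Part 1, and the name df is just a placeholder.

```r
df <- PlantGrowth     # built-in data frame: 30 rows, 2 columns

is.list(df)           # TRUE
length(df)            # 2: the length of the list is the number of columns
names(df)             # "weight" "group"
sapply(df, length)    # every column has the same length (30)
df[["weight"]][1:3]   # list-style extraction returns the column as a vector
```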

You’ll run into lists a lot when working with statistical models. If you run lm() to fit a linear model, the result is a list. It holds coefficients, residuals, fitted values, and a dozen other things. You access named elements with the $ operator, just like you’d access columns in a data frame.

pg_lm <- lm(weight ~ group, data = PlantGrowth)
pg_lm$coefficients
# (Intercept)   grouptrt1   grouptrt2
#       5.032      -0.371       0.494

The ~ in that formula reads as “described by.” So weight ~ group means “weight described by group.” If you have multiple predictors, you can write . on the right-hand side to mean “every other column in the data.” It’s tidy and expressive, even if it looks strange at first.
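
Here is a small sketch of the dot shorthand. It uses the built-in mtcars data set, which is my choice for illustration and not something the chapter summary relies on. The names fit_all and fit_two are placeholders.

```r
# mpg described by every other column in mtcars
fit_all <- lm(mpg ~ ., data = mtcars)

# mpg described by two explicitly named predictors
fit_two <- lm(mpg ~ wt + hp, data = mtcars)

names(coef(fit_all))  # intercept plus one coefficient per remaining column
names(coef(fit_two))  # "(Intercept)" "wt" "hp"
```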

Factors: R’s Take on Categorical Data

Factors are like the pandas category dtype. They store categorical variables as integers with labels attached. The class is factor, but the underlying type is integer. It’s the same pattern as data frames being lists.

typeof(PlantGrowth$group)
# [1] "integer"
class(PlantGrowth$group)
# [1] "factor"

The labels are called “levels.” When you print a factor, you see the labels, not the integers. This is a legacy thing from when memory was expensive and storing integers was cheaper than storing long strings.
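
A quick sketch of the labels-over-integers idea, using a made-up vector called sizes:

```r
sizes <- factor(c("small", "large", "medium", "small"))

levels(sizes)      # "large" "medium" "small": sorted alphabetically by default
as.integer(sizes)  # 3 1 2 3: the integer codes stored underneath
typeof(sizes)      # "integer"
table(sizes)       # counts per level
```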

Factors can also be ordered. Diamond color grades, for example, show up as D < E < F < G < H < I < J. That ordering matters for statistics and plotting. Most of the time factors just work. But when they don’t, they’ll make your life miserable. Worth knowing they exist.
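
The sketch below builds an ordered factor by hand and then shows one well-known way factors bite. The vectors grades and f are invented for the example.

```r
grades <- factor(
  c("low", "high", "medium", "low"),
  levels  = c("low", "medium", "high"),  # explicit order, not alphabetical
  ordered = TRUE
)

grades < "high"  # TRUE FALSE TRUE TRUE: comparisons follow the level order
max(grades)      # high

# A classic gotcha with number-like labels
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1: the integer codes, not the values
as.numeric(as.character(f))  # 10 20 10: convert to character first
```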

Finding Stuff: Indexing and Logical Expressions

R indexing starts at 1. You already knew that. But here’s the problem: negative indices don’t count backward like in Python. In R, letters[-4] means “everything except the fourth element.” Not the fourth from the end.

The : operator generates sequences. 23:26 gives you the integers 23, 24, 25, 26. But unlike Python slicing, you always need both sides. letters[23:] is a syntax error.
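
These rules are easy to try on the built-in letters vector. The last three lines are base R workarounds I picked for reaching the end of a vector; there are other ways to do it.

```r
letters[4]                   # "d": indexing starts at 1
letters[-4][1:5]             # "a" "b" "c" "e" "f": the fourth element is dropped
letters[23:26]               # "w" "x" "y" "z"
letters[23:length(letters)]  # spell out the upper bound instead of letters[23:]
tail(letters, 4)             # roughly what letters[-4:] would do in Python
rev(letters)[4]              # "w": fourth from the end
```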

The real power comes from logical indexing. You ask a question, get a TRUE/FALSE vector back, and use it to filter.

sum(diamonds$price > 18000 & diamonds$carat < 1.5)
# [1] 9

Nine diamonds are both expensive and small. R treats TRUE as 1 and FALSE as 0, so you can do math on logical vectors. It’s the same idea as NumPy boolean masking, just with different syntax.
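
If you don’t have ggplot2 (and its diamonds data) loaded, the same pattern works on any data frame. This sketch swaps in the built-in mtcars data. The thresholds and the name mask are arbitrary choices for illustration.

```r
# Element-wise AND of two conditions gives one TRUE/FALSE per row
mask <- mtcars$wt > 3 & mtcars$mpg > 20

sum(mask)                     # how many rows satisfy both conditions
mean(mtcars$mpg > 20)         # proportion of rows where the condition is TRUE
mtcars[mask, c("mpg", "wt")]  # use the mask to keep only matching rows
```

Note the single &. It works element by element, like NumPy’s &, while && is meant for single TRUE/FALSE values in if statements.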

The Tidyverse Way: dplyr Verbs

Writing diamonds$price > 18000 repeatedly gets old fast. The Tidyverse gives you a cleaner way with dplyr and the pipe operator %>%. You read it as “and then.”

diamonds %>%
  filter(price > 18000, carat < 1.5) %>%
  select(color)

Take diamonds, and then filter rows, and then select columns. Five core verbs handle most data wrangling:

  • filter() keeps rows matching a condition
  • arrange() reorders rows
  • select() picks columns
  • summarise() aggregates (mean, sd, count)
  • mutate() transforms (adds new columns)
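
As a rough sketch of the verbs working together, here are mutate(), arrange(), and select() chained on PlantGrowth. It assumes dplyr is installed. The column name rel_weight and the object name ranked are invented for the example.

```r
library(dplyr)

ranked <- PlantGrowth %>%
  mutate(rel_weight = weight / mean(weight)) %>%  # add a column: weight relative to the overall mean
  arrange(desc(weight)) %>%                       # heaviest plants first
  select(group, weight, rel_weight)               # keep and reorder columns

head(ranked, 3)
```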

Pair summarise() with group_by() and you get grouped aggregation. It’s basically .groupby() from pandas.

PlantGrowth %>%
  group_by(group) %>%
  summarise(avg = mean(weight), stdev = sd(weight))

Clean, readable, and no $ signs everywhere. This is why people love the Tidyverse.

Iteration: The Anti-Loop Culture

R programmers avoid for loops like it’s a sport. The authors joke that some R users probably have a wall sign: “Days since last for loop.”

The old way is the apply family: tapply(), lapply(), sapply(), and friends. They work, but they’re clunky. Then came plyr (pronounced “plier”), then dplyr (d-plier), each one cleaner than the last.
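
For reference, here is a small sketch of what the apply family looks like in practice. Everything here is base R, and the variable names are placeholders.

```r
# tapply: apply a function to a vector, split by a grouping factor
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means  # ctrl 5.032, trt1 4.661, trt2 5.526

# lapply always returns a list; sapply tries to simplify the result
squares_list <- lapply(1:3, function(x) x^2)  # list of 1, 4, 9
squares_vec  <- sapply(1:3, function(x) x^2)  # numeric vector 1 4 9
```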

The pattern behind all of this is split-apply-combine. Split data into groups, apply a function to each group, combine the results. This idea eventually led to the concept of “tidy data” itself.
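
Here is one way to spell out the three steps by hand in base R. The names pieces, results, and combined are mine, and this is a sketch of the pattern, not how plyr or dplyr implement it internally.

```r
pieces   <- split(PlantGrowth$weight, PlantGrowth$group)  # split: a list with one vector per group
results  <- lapply(pieces, mean)                          # apply: one mean per group
combined <- unlist(results)                               # combine: a named numeric vector

combined
```

The group_by() plus summarise() pipeline does the same job in one readable chain.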

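For iterating over lists and vectors, the purrr package offers map(). It plays the same role as Python’s map(), applying a function to each element, but it returns a list right away instead of a lazy iterator, and it fits neatly into Tidyverse pipelines.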
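
A minimal sketch of map() and its typed variant map_dbl(), assuming purrr is installed. The variable names are placeholders.

```r
library(purrr)

doubled     <- map(1:3, function(x) x * 2)  # always returns a list
doubled_vec <- map_dbl(1:3, ~ .x * 2)       # returns a numeric vector: 2 4 6

# One mean per group, returned as a named numeric vector
group_means <- map_dbl(split(PlantGrowth$weight, PlantGrowth$group), mean)
group_means
```

The ~ .x * 2 form is purrr’s shorthand for an anonymous function, loosely comparable to a Python lambda.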

Final Thoughts from Chapter 2

Unlike Python, R has no single “R way” to do things. People mix base R, Tidyverse, and old-school apply calls in the same script. The language is over 20 years old and still figuring out its identity.

The authors describe R as going through a “teenage growth spurt.” It grew fast, it can be awkward, but it can also be really cool. Learning to blend the different dialects is the key to getting comfortable.

For a Pythonista, the biggest adjustment is not the syntax. It’s the philosophy. R was built by statisticians for statistics. Everything from formula notation to vectorized operations to factors reflects that heritage. Once you accept that R thinks differently, it starts to click.
