Why Sponsor Oils? | blog | oilshell.org
A data frame is a table-like data structure available in languages like R and Python. Statisticians, scientists, and programmers use them in data analysis code.
Once you've tried data frames, you'll reach for them during every data analysis project.
This post introduces the concepts behind them, and then shows how they work by solving the same problem in multiple ways:
To run the code, clone the oilshell/blog-code repo, then follow the instructions in data-frames/README.md..
This blog is about a new Unix shell, so the connection to data frames may not be obvious. I write about them for a few reasons:
In the last post, I analyzed Oil's code with data frames.
When writing the praise in that section, I looked for a friendly intro to the idea, but the best hit was from an inactive Clojure project. Surprisingly, Wikipedia doesn't have a page on data frames! This post aims to fill that gap.
I've talked to more than one person who has done data analysis without data frames, e.g. at Recurse Center.
Data frames may someday be a feature of the Oil language.
Why? Because the output of both ls
and ps
is a table. (Caveat: this
work is far in the future.)
Before I show source code, here's some more color on how data frames are used. You may:
Logically, data frames look like SQL tables, but:
Here's a measurement of hits over time to different pages on a website:
date | URL | hits |
2018-11-30 | /site.html | 300 |
2018-11-30 | /blog/ | 1,000 |
2018-12-01 | /site.html | 320 |
2018-12-01 | /blog/ | 1,300 |
2018-12-02 | /site.html | 310 |
2018-12-02 | /blog/ | 1,800 |
2018-12-02 | /data-frames.html | 7,500 |
Let's analyze this data set in two ways.
date | hits |
2018-11-30 | 1,300 |
2018-12-01 | 1,620 |
2018-12-02 | 9,610 |
URL | percentage |
/data-frames.html | 59.9% |
/blog/ | 32.7% |
/site.html | 7.4% |
Notice that the first query aggregates over the date variable, while the second aggregates over URL, then sorts by percentage.
How would we write code to do this analysis?
We could write imperative code with explicit loops, building data structures
cell-by-cell. I do this in without_data_frames.py, where the main()
function is a dense 30 lines of code.
Excerpt:
# ...
by_url = collections.defaultdict(int)
by_date = collections.defaultdict(int)
total_hits = 0
for date, url, num_hits in reader:
num_hits = int(num_hits)
by_date[date] += num_hits
by_url[url] += num_hits
total_hits += num_hits
# ...
This works for small problems, but becomes tedious as the analysis gets more complex.
The next two solutions, using data frames, are shorter and use a functional style. They're more like expressions on values rather than algorithms to update mutable data structures.
Here's the core of the first query, excerpted from with_pandas.py:
# Daily Traffic
daily = traffic.groupby('date').sum()
Pandas uses the method chaining idiom on data frame objects.
It takes two steps to transform traffic
into daily
:
date
column, e.g. 2018-11-30
.num_hits
column within each group.The second query takes a few more steps:
# Popular Pages
popular = (
traffic.groupby('url')
.apply(lambda x: sum(x.num_hits) / total_hits * 100.0)
.reset_index(name='percentage')
.sort_values(by='percentage', ascending=False)
)
sum()
, we apply a formula to each group, creating a new
percentage
column.R implementations of these queries use the same ideas. They may look less familiar, due to a few features of R:
%>%
operator that
performs left-to-right function application. It's not part of R itself.lambda
you saw above
in Pandas.42 -> x # same as x = 42 or x <- 42
Here's core of the first query, excerpted from with_dplyr.R:
# Daily Traffic
traffic %>%
group_by(date) %>%
summarize(num_hits = sum(num_hits)) ->
daily # assign to variable 'daily'
The second query is similar:
# Popular pages
traffic %>%
group_by(url) %>%
summarize(percentage = sum(num_hits) / total_hits * 100.0) %>%
arrange(desc(percentage)) ->
popular # assign to variable 'popular'
Again:
summarize
step uses a formula to compute a new percentage
column.arrange
step sorts the resulting data frame by decreasing percentage.I like the %>%
operator because it reads left-to-right like a Unix
pipeline. However, there are significant differences. The data passed
between stages is:
The main point here is that data frames are a stable concept across multiple languages. The syntax is different, but the idea is largely the same.
Appendix B solves the problem with "base R", rather than with dplyr.
Raw SQL easily handles the Daily Traffic query:
SELECT date, SUM(num_hits) FROM traffic GROUP BY date;
It starts to look awkward for the Popular Pages query:
SELECT url,
SUM(num_hits) * 100.0 / (SELECT SUM(num_hits) FROM traffic)
AS percentage
FROM traffic
GROUP BY url
ORDER BY percentage DESC;
The inner SELECT
clause is begging to be extracted into a total_hits
variable.
In run.sh, I've used a Common Table
Expression for total_hits
, which will
improve readability in complex cases, but makes the code longer in this case.
It would be nice if SQL had Python-like scalar variables (i.e. variables that
aren't tables).
It could also be done with two separate queries and a Python variable, but this would be incorrect in the face of concurrent updates. If someone adds a row between the two queries, then the total percentage will be more than 100%.
As your analysis becomes more complex, the advantage of data frames over SQL becomes clearer. In many contexts, it's useful to view SQL as a data extraction language, while R or Python is the data analysis language.
The rest of this post is biased toward R, because it's what I use for data analysis. Feel free to leave a comment with analogous resources for Python or other languages.
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure.
Each variable is a column, each observation is a row, and each type of observational unit is a table.
Once the concepts are clear, using data frames is a matter of internalizing the API through practice. I refer to this document most often:
Disclosure: This section contains affiliate links. I earn a commission if you buy a book.
I mentioned my desire to write book reviews a few times, and R and data frames are good topics to start with.
Why? I've purchased at least six books on R, which tells me that it's the hardest programming language I've learned. In comparison, I felt productive in Python before buying any books at all.
I'm narrowing my recommendations to these two books:
R for Data Science: Import, Tidy, Transform, Visualize, and Model Data |
|
This book, by Wickham and Grolemund, explains data analysis following the philosophy of the tidyverse. After using several data frame libraries, I'm now fully "on board" with the tidyverse. It's well-documented and a pleasure to use. |
|
The Art of R Programming: A Tour of Statistical Software Design |
|
This is the best book for programmers who want to learn R. If you're already fluent in another language, this book explains R in familiar terms. Other R books are aimed at stats students with no programming background, and weren't as useful to me as this one. |
I only recommend these books after you've written a few lines of code with data frames, using online resources like the ones above. The runnable code from this post may also be a useful starting point.
I could write more about other books I own, but the short takeaway is that the two above are the best for programmers that want to learn to use data frames.
My copies:
R for Data Science was written by the creator of the tidyverse. The book Python for Data Analysis is analogous since it was written by the creator of Pandas:
However, I haven't read it, so I can't comment on its quality.
This post explained the concept of data frames, and showed concrete examples of code in R, Python, and SQL.
I linked to my two favorite online resources about data frames, and two books out of six that I found most useful.
If you're interested concrete examples of how I use R, see the last post.
Leave a comment if you have questions. There are further thoughts and recommendations below.
Thanks to Eric Higgins and Yuri Vishnevsky for valuable feedback on a draft of this post.
Mentioned in this post:
subset()
,
aggregate()
, etc. are hard to remember and inconsistent with each other.Other:
The wiki page I mentioned above has links to more implementations (some of them defunct).
Please leave a comment if you have experience with another data frame library.
If you're heavily invested in the Python ecosystem and don't know R, you may want to try Pandas first.
Otherwise, I recommend learning R with dplyr and the rest of the tidyverse.
A language that doesn't affect the way you think about programming, is not worth knowing. — Alan Perlis
R is inconsistent and difficult to learn, but the tidyverse largely fixes this:
For what it's worth, I wrote code in five ways for this blog post, and I found the dplyr version the most elegant. The syntax is a little unusual, but the ideas fit together well.
(Regarding R vs. Python debates: I think the answer for many data science problems is to use both. But I use R when the code requires data frames.)
For comparison, I've also written with_base_R.R, which uses functions built into R, rather than the newer dplyr library.
It's not important to understand this code. I only include it to show that dplyr is nicer.
The tidyverse, which includes dplyr, is essentially a "reform" of R, enabled by Lisp-like metaprogramming. This is inspirational to me because I hope to "reform" shell with the Oil language.
(February 2020 Update)
Is a Data Frame Just a Table? (2019, PDF) is a great description of data frames from a computer science perspective. This paper addresses two different conceptions of tables and their usage in computer programs:
It discusses these differences between the models: