6.1: Aggregating the Data
In this section, we will explore the data we created in the first five chapters, relying primarily on aggregation to gain insights. We will use the pandas Python library to manage our work here.
Before we get started, we need to import pandas.
import pandas as pd
Wordcount
We first need to read in the data we want to explore. Let’s start with the wordcount data.
words = pd.read_csv('../data/wordcount.csv')
words.head()
|   | section | docno | words |
|---|---|---|---|
| 0 | business-and-industry | 0 | 1319 |
| 1 | business-and-industry | 1 | 7601 |
| 2 | business-and-industry | 2 | 7043 |
| 3 | business-and-industry | 3 | 2764 |
| 4 | business-and-industry | 4 | 4961 |
We can see that the data has three columns. The first two, “section” and “docno,” together identify each document, while the third is that document’s word count. Since the documents are organized by topic (or “section” here), we can group by the “section” column and compute the mean word count per topic.
words.groupby('section').words.mean()
section
business-and-industry 25187.60
environment 32520.61
health-and-public-welfare 35433.65
money 16553.86
science-and-technology 31133.47
world 21294.12
Name: words, dtype: float64
Let’s also get the median word count, to see if there are any major differences.
words.groupby('section').words.median()
section
business-and-industry 11825.0
environment 16355.0
health-and-public-welfare 11394.5
money 10134.0
science-and-technology 14501.5
world 6560.5
Name: words, dtype: float64
In every topic the mean word count is well above the median, which suggests right-skewed distributions: a handful of very long documents pull the means upward. Relative to its median, the “world” topic shows the largest disparity, with a mean more than three times its median.
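We could also compute both statistics in a single pass with .agg, which puts them side by side for easier comparison. A minimal sketch, assuming the words DataFrame loaded above:

# Mean and median word counts in one table, one row per section.
summary = words.groupby('section')['words'].agg(['mean', 'median'])
# A mean well above the median indicates a right-skewed distribution:
# a few very long documents pull the mean upward.
print(summary)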
Now let’s do the same aggregation for the conditionals data.
conditionals = pd.read_csv('../data/conditionals.csv')
conditionals.groupby('section').conditionals.mean()
section
business-and-industry 129.60
environment 156.27
health-and-public-welfare 157.47
money 86.88
science-and-technology 140.57
world 115.27
Name: conditionals, dtype: float64
conditionals.groupby('section').conditionals.median()
section
business-and-industry 58.0
environment 72.0
health-and-public-welfare 56.0
money 47.5
science-and-technology 69.5
world 34.0
Name: conditionals, dtype: float64
Unsurprisingly, we see a similar phenomenon with the conditionals: the mean for the “world” topic is again more than three times its median.
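Because the wordcount and conditionals files share the same “section”/“docno” index, we could also control for document length before aggregating. A sketch, assuming the words and conditionals DataFrames loaded above; the rate column is our own illustrative addition:

# Merge the two datasets on their shared index columns, then
# compute conditionals per 1,000 words to normalize for length.
merged = words.merge(conditionals, on=['section', 'docno'])
merged['rate'] = merged['conditionals'] / merged['words'] * 1000
print(merged.groupby('section')['rate'].mean())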
Estimator
Next we will explore the data that was produced using the “is_world” estimator we created. Let’s first read in that data.
world = pd.read_csv('../data/is_world.csv')
world.head()
|   | section | docno | is_world |
|---|---|---|---|
| 0 | business-and-industry | 0 | False |
| 1 | business-and-industry | 1 | False |
| 2 | business-and-industry | 2 | False |
| 3 | business-and-industry | 3 | False |
| 4 | business-and-industry | 4 | False |
We can see that the data has the same index as the wordcount and conditionals data above, but that the “is_world” column holds True and False values. Because pandas treats True and False as 1 and 0 in numeric aggregations, taking the mean of this column gives the proportion of documents classified as “world” in each group.
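First, a minimal standalone demonstration of that boolean-to-number behavior (a sketch, not part of our dataset):

# True and False act as 1 and 0 in numeric aggregations, so the
# mean of a boolean Series is the fraction of True values.
flags = pd.Series([True, False, True, True])
print(flags.mean())  # 0.75

With that in mind, we can compute the proportion per topic using the same group-by method as above.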
world.groupby('section').is_world.mean()
section
business-and-industry 0.02
environment 0.00
health-and-public-welfare 0.01
money 0.08
science-and-technology 0.02
world 0.70
Name: is_world, dtype: float64
We correctly classified 70 percent of the “world” documents as “is_world,” and every other topic had a false positive rate below 10 percent. Not bad!
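For a fuller breakdown than the per-topic proportion, a cross-tabulation shows the share of each classification within every section. A sketch, assuming the same world DataFrame:

# Each row sums to 1: the False/True split within a section.
print(pd.crosstab(world['section'], world['is_world'], normalize='index'))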
Probability dataset
Now let’s look at the “is_world” data with probabilities.
world_prob = pd.read_csv('../data/is_world_prob.csv')
world_prob.head()
|   | section | docno | is_world_prob |
|---|---|---|---|
| 0 | business-and-industry | 0 | 0.0059 |
| 1 | business-and-industry | 1 | 0.0079 |
| 2 | business-and-industry | 2 | 0.0148 |
| 3 | business-and-industry | 3 | 0.0018 |
| 4 | business-and-industry | 4 | 0.0117 |
Instead of just a binary True/False value, this dataset contains the estimated probability that each document is a “world” document. Let’s look at the mean probability by topic.
world_prob.groupby('section').is_world_prob.mean()
section
business-and-industry 0.050015
environment 0.028742
health-and-public-welfare 0.048991
money 0.096624
science-and-technology 0.047916
world 0.728035
Name: is_world_prob, dtype: float64
These numbers look quite similar to the aggregated binary data.
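To make that comparison concrete, we could line the two aggregates up side by side. A sketch, assuming the world and world_prob DataFrames loaded above:

# Align the binary proportions and the mean probabilities by section.
comparison = pd.concat(
    [
        world.groupby('section')['is_world'].mean(),
        world_prob.groupby('section')['is_world_prob'].mean(),
    ],
    axis=1,
)
print(comparison)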