6.1: Aggregating the Data#

If you would like to download this Jupyter notebook and follow along using your own system, you can click the download button above.

In this section, we will explore the data we created in the first five chapters, primarily using aggregation to gain insights from it. We will use the pandas Python library to manage our work here.

Before we get started, we need to import pandas.

import pandas as pd

Wordcount#

We first need to read in the data we want to explore. Let’s start with the wordcount data.

words = pd.read_csv('../data/wordcount.csv')
words.head()
section docno words
0 business-and-industry 0 1319
1 business-and-industry 1 7601
2 business-and-industry 2 7043
3 business-and-industry 3 2764
4 business-and-industry 4 4961

We can see that the data has three columns. The first two columns together identify each document, while the third holds that document's word count. Since the documents are organized by topic (called "section" here), we can group by the "section" column and compute the mean word count per topic.

words.groupby('section').words.mean()
section
business-and-industry        25187.60
environment                  32520.61
health-and-public-welfare    35433.65
money                        16553.86
science-and-technology       31133.47
world                        21294.12
Name: words, dtype: float64

Let’s also get the median word count, to see if there are any major differences.

words.groupby('section').words.median()
section
business-and-industry        11825.0
environment                  16355.0
health-and-public-welfare    11394.5
money                        10134.0
science-and-technology       14501.5
world                         6560.5
Name: words, dtype: float64

We can see that the mean word count of the "world" topic is disproportionately larger than its median word count: more than three times as large. A gap like this suggests a right-skewed distribution, where a small number of very long documents pulls the mean well above the typical document length.
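If we want to quantify this gap for every topic, we can compute both statistics in one call with agg and take their ratio. Here is a minimal sketch using the same columns as above:

stats = words.groupby('section').words.agg(['mean', 'median'])
# Ratios well above 1 indicate right skew: a few very long
# documents pull the mean far above the median.
stats['mean'] / stats['median']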

Now let’s do the same aggregation for the conditionals data.

conditionals = pd.read_csv('../data/conditionals.csv')
conditionals.groupby('section').conditionals.mean()
section
business-and-industry        129.60
environment                  156.27
health-and-public-welfare    157.47
money                         86.88
science-and-technology       140.57
world                        115.27
Name: conditionals, dtype: float64
conditionals.groupby('section').conditionals.median()
section
business-and-industry        58.0
environment                  72.0
health-and-public-welfare    56.0
money                        47.5
science-and-technology       69.5
world                        34.0
Name: conditionals, dtype: float64

Unsurprisingly, we see a similar phenomenon with the conditionals: the mean for the "world" topic is again more than three times its median.
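For a fuller picture of the spread within each topic, describe summarizes several statistics at once. A minimal sketch:

# Count, mean, std, min, quartiles, and max per section.
conditionals.groupby('section').conditionals.describe()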

Estimator#

Next we will explore the data that was produced using the “is_world” estimator we created. Let’s first read in that data.

world = pd.read_csv('../data/is_world.csv')
world.head()
section docno is_world
0 business-and-industry 0 False
1 business-and-industry 1 False
2 business-and-industry 2 False
3 business-and-industry 3 False
4 business-and-industry 4 False

We can see that the data has the same index as the wordcount and conditionals data above, but the "is_world" column holds True and False values. Because pandas treats True as 1 and False as 0 in numeric aggregations, the mean of a boolean column is simply the proportion of True values. We can use this, with the same group-by method as above, to find the proportion of documents classified as "world" within each topic.
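As a quick illustration of this boolean-to-numeric behavior, consider a toy Series (the values here are made up for demonstration):

# The mean of a boolean Series is the fraction of True values.
pd.Series([True, False, True, True]).mean()  # 0.75

Now let's apply the same idea to our data.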

world.groupby('section').is_world.mean()
section
business-and-industry        0.02
environment                  0.00
health-and-public-welfare    0.01
money                        0.08
science-and-technology       0.02
world                        0.70
Name: is_world, dtype: float64

We correctly classified 70 percent of the "world" documents, and every other topic had a false-positive rate below 10 percent. Not bad!
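If we want the full True/False breakdown per topic rather than just the proportion, a cross-tabulation works well. This is an extra check beyond the original analysis, sketched with the same column names:

# Proportion of False/True classifications within each section.
pd.crosstab(world['section'], world['is_world'], normalize='index')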

Probability dataset#

Now let’s look at the “is_world” data with probabilities.

world_prob = pd.read_csv('../data/is_world_prob.csv')
world_prob.head()
section docno is_world_prob
0 business-and-industry 0 0.0059
1 business-and-industry 1 0.0079
2 business-and-industry 2 0.0148
3 business-and-industry 3 0.0018
4 business-and-industry 4 0.0117

Instead of just a binary True/False, this dataset gives the probability that each document is a "world" document. Let's look at the mean probability by topic.

world_prob.groupby('section').is_world_prob.mean()
section
business-and-industry        0.050015
environment                  0.028742
health-and-public-welfare    0.048991
money                        0.096624
science-and-technology       0.047916
world                        0.728035
Name: is_world_prob, dtype: float64

These mean probabilities look quite similar to the proportions we computed from the aggregated binary data.
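To make the comparison explicit, we can place the two aggregates side by side. A minimal sketch, assuming both CSVs are loaded as above:

# Proportion classified as "world" (binary) next to the mean
# predicted probability, per section.
pd.concat([
    world.groupby('section').is_world.mean(),
    world_prob.groupby('section').is_world_prob.mean(),
], axis=1)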