6.2: Visualizing the Data#

If you would like to download this Jupyter notebook and follow along using your own system, you can click the download button above.

In this section, we will create data visualizations from the data we created in the first 5 chapters. We will use the pandas, numpy, and matplotlib Python libraries to create these data visualizations.

Before we get started, we need to import pandas and numpy, and allow matplotlib to plot our visualizations inline.

%matplotlib inline
import pandas as pd
import numpy as np

Visualizing wordcount data#

Let’s start with the wordcount data, just as we did in the last section. Instead of aggregating with pandas, however, we will create a histogram using matplotlib. Thankfully, pandas has a matplotlib integration that makes plotting as easy as calling .plot() on a DataFrame or Series.

words = pd.read_csv('../data/wordcount.csv')
words.words.plot(kind='hist', bins=20, title='Total words')
<Axes: title={'center': 'Total words'}, ylabel='Frequency'>

The data is very right-skewed: most of the documents have fewer than 100,000 words, but we can see that a couple of documents have almost 500,000 words. If we were going to use this wordcount data for some kind of modelling, we would probably want to transform it in some way. Let’s look at what the log of the data looks like.

np.log(words.words).plot(kind='hist', bins=20, title='Total words')
<Axes: title={'center': 'Total words'}, ylabel='Frequency'>

Wow! The log of the data is very close to normal, despite the heavy skewness of the raw data.
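We can check this behaviour numerically as well as visually: pandas Series have a .skew() method. Here is a minimal sketch using synthetic lognormal data as a stand-in for the real word counts (the distribution parameters are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the word counts: a lognormal draw mimics
# the right skew we saw in the histogram above.
rng = np.random.default_rng(42)
counts = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

raw_skew = counts.skew()          # large and positive for right-skewed data
log_skew = np.log(counts).skew()  # near zero once log-transformed

print(f"raw skewness: {raw_skew:.2f}, log skewness: {log_skew:.2f}")
```

A skewness near zero after the transform is what makes the log of the data look so close to a normal distribution.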

Now let’s visualize the aggregations from section 6-1.

words.groupby('section').words.mean().plot(
    kind='barh', title='Mean word count by category')
<Axes: title={'center': 'Mean word count by category'}, ylabel='section'>
words.groupby('section').words.median().plot(
    kind='barh', title='Median word count by category')
<Axes: title={'center': 'Median word count by category'}, ylabel='section'>

Here we can visually see the difference between the mean and median word count for the world category.
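The gap between the two charts comes straight from the skew: a single very long document drags the mean up while barely moving the median. A toy example with hypothetical word counts:

```python
import pandas as pd

# A tiny hypothetical group of word counts: one very long document
# pulls the mean far above the median.
docs = pd.Series([800, 950, 1_100, 1_200, 480_000])

print(docs.mean())    # dominated by the 480,000-word outlier
print(docs.median())  # robust to it
```

This is why the median is often the better summary statistic for heavily skewed data like word counts.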

Visualizing estimator data#

Now let’s create some data visualizations based on the “is_world” dataset.

world = pd.read_csv('../data/is_world.csv')
world.groupby('section').is_world.mean().plot(
    kind='barh', title='Mean "Is World" classification rate by category')
<Axes: title={'center': 'Mean "Is World" classification rate by category'}, ylabel='section'>

Using the same aggregation from section 6-1, we can now see visually that about 70 percent of the “world” topic documents were accurately classified. The next visualization shows the same chart for the probability data.

world_prob = pd.read_csv('../data/is_world_prob.csv')
world_prob.groupby('section').is_world_prob.mean().plot(
    kind='barh', title='Mean "Is World" probability by category')
<Axes: title={'center': 'Mean "Is World" probability by category'}, ylabel='section'>
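One detail worth making explicit: the mean of a 0/1 indicator column is simply the proportion of ones, which is why these bar charts can be read as classification rates. A minimal sketch with a hypothetical indicator Series:

```python
import pandas as pd

# Hypothetical 0/1 indicator like is_world: 7 of these 10 documents
# were classified as "world" articles.
is_world = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 0])

print(is_world.mean())  # proportion classified as world
```

The same logic applies per group under groupby, which is exactly what the bar charts above are showing for each section.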