5.3: Labels#

To train a supervised model, we need to create a set of labels that tell the model what we’re looking for. Another way to think about this is to ask: what will the model be doing? Will it be distinguishing between two types of documents? Will it be sorting documents into multiple categories? Or will it simply be determining whether a document is or is not like a specific example document? The answer to this question determines the labels that need to be generated. Creating labels takes place via the script scripts/create_labels.py in the skeleton estimator. Labels should be stored in a quantgov.ml.Labels object, which takes three arguments:

  • index: a sequence holding the index values for each document being labeled.

  • label_names: a sequence holding one name for each label. Even when there is only one label, this parameter should be a sequence such as ['label'] or ('label',).

  • labels: an array-like of label values with shape [n_samples x n_labels].

The Labels object has a save method which can be used to save the object to a file.

For classification problems, the nature of the Labels object determines the type of classification model to be trained. If the values are True and False, the problem is assumed to be binary classification. If other values are given, the problem is assumed to be multiclass classification. If a numpy array or pandas DataFrame of zeroes and ones is given, the problem is assumed to be multilabel classification, where each row is assumed to represent a document and each column is assumed to represent a label.
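To make these three cases concrete, here is a minimal sketch in plain Python of what the labels argument might look like for each problem type. The document IDs and label names are made up for illustration, and no quantgov calls are made:

```python
# Binary classification: one True/False value per document
index = ('doc-1', 'doc-2', 'doc-3', 'doc-4')
binary_labels = (True, False, False, True)

# Multiclass classification: one value per document, drawn from several classes
multiclass_labels = ('world', 'money', 'environment', 'world')

# Multilabel classification: a row of zeroes and ones per document,
# one column per label (two hypothetical labels here)
multilabel_label_names = ('is_world', 'mentions_treaty')
multilabel_labels = [
    [1, 0],
    [0, 1],
    [1, 1],
    [1, 0],
]

# Each of these would be passed as the `labels` argument to
# quantgov.ml.Labels, along with `index` and a matching `label_names`.
assert len(binary_labels) == len(index)
```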

Labels - Practice Estimator#

In our case, our practice question will be based around determining where in the Federal Register bodies of text come from. Specifically, we want to train the model to identify whether a document was published in the “World” section of the Federal Register or in any other section. Since there are only two possible outcomes, our problem is a case of binary classification, and the labels we need to create should have either True or False values.

As mentioned above, label creation is handled in the create_labels.py script. By default, this script randomly generates True and False labels, which is, by design, not particularly useful. The important part of this script is the create_label function, which begins on line 14 and looks like this:

def create_label(streamer):
    """
    Assign a label to each document in a corpus

    Arguments:
    * streamer: a quantgov.corpus.CorpusStreamer object

    Returns: a quantgov.ml.Labels object
    """
    label_names = ('randomly_true',)
    labels = tuple(random.choice([True, False]) for doc in streamer)

    return quantgov.ml.Labels(
        index=tuple(streamer.index),
        label_names=label_names,
        labels=labels,
    )

This function takes a CorpusStreamer object, which iterates over a corpus and saves the index. It returns a Labels object, which is made up of three parts: the index to the labels, a sequence of names for each kind of label generated, and the labels themselves.

Since we are only producing one kind of label (the binary label for whether or not a document is part of the “World” section), we only need one label name, which we can call is_world. Next, we need to tell the script how to find the documents that it will be using to train the model, and how to label those documents. We want to label all “World” section documents as True and all other documents as False. In our corpus, the documents are split into folders that tell us where the text is from; unsurprisingly, there is a “world” folder. Looking back at driver.py, we remember that we established an index that references the folder level that the “world” folder is found in: we labeled this index level section, and it is the first level of the index. In Python, sequence positions start at 0, not 1, so by checking whether element 0 of each document’s index is equal to the string “world”, we can generate our labels.
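To see this index check in isolation, here is a minimal sketch using a hypothetical stand-in for the documents a CorpusStreamer yields. The Document namedtuple and the sample index values are assumptions for illustration only:

```python
from collections import namedtuple

# Hypothetical stand-in for the documents a CorpusStreamer yields: each
# has an `index` tuple whose first (position 0) element is the section
# folder name, per the index levels established in driver.py.
Document = namedtuple('Document', ['index', 'text'])

docs = [
    Document(index=('world', 'some-document'), text='...'),
    Document(index=('money', 'another-document'), text='...'),
]

# doc.index[0] is element 0 of the index: the section name
labels = tuple(doc.index[0] == 'world' for doc in docs)
print(labels)  # (True, False)
```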

Advanced Tip

There are many ways to generate your labels. You might, for example, look for the occurrence of a specific set of words in the text, or examine part of the metadata generated by the corpus. For custom label generation and for non-binary classification problems, this is where you would define that logic.
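As one example of the keyword-based approach mentioned in the tip, here is a hedged sketch of a labeling function that checks for the occurrence of a set of words. The keyword set and sample texts are hypothetical; in a real estimator the texts would come from the documents the CorpusStreamer yields:

```python
# Hypothetical keyword set for illustration
KEYWORDS = {'embassy', 'treaty', 'diplomatic'}

def keyword_label(text):
    """Return True if any keyword appears in the (lowercased) text."""
    words = set(text.lower().split())
    return bool(KEYWORDS & words)
```

A function like this could replace the doc.index[0] comparison inside create_label, producing labels from document content rather than folder location.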

After modification, our create_label function should look like this:

def create_label(streamer):
    """
    Assign a label to each document in a corpus

    Arguments:
    * streamer: a quantgov.corpus.CorpusStreamer object

    Returns: a quantgov.ml.Labels object
    """
    label_names = ('is_world',)
    labels = tuple(doc.index[0] == 'world' for doc in streamer)

    return quantgov.ml.Labels(
        index=tuple(streamer.index),
        label_names=label_names,
        labels=labels,
    )

Overall, the only changes to create_labels.py are on lines 24 and 25. The conditional expression doc.index[0] == 'world' checks the first element of each document’s index attribute and evaluates to True if that element is equal to the string “world”; otherwise, it evaluates to False. This gives us the labels we need to train our algorithm.

Now let’s make it happen! Run the following in the command prompt:

python scripts/create_labels.py -o data/labels ../federal_register

The create_labels.py script takes one positional argument, the path to the trainer corpus, and an -o option specifying the output path for the labels object. Running the above command should produce a labels file in the estimator’s data folder.