5.2: Vectorization#

In order to use mathematical statistical techniques on text documents, those documents need to be converted to sets of numbers, or vectors. This process is called vectorization, and the Python objects used to carry this out are called vectorizers. As shown in this quick explainer, the words within the documents are assigned numbers in a vector-like form. In addition to converting the text into a form that sklearn can process, vectorization also makes training an algorithm far more efficient.

QuantGov estimators currently expect a joblib-pickled scikit-learn vectorizer, such as a CountVectorizer. Users may want to customize the CountVectorizer defined in the scripts/create_vectorizer.py script in the estimator skeleton. Recommendations for this customization can be found in the CountVectorizer() documentation.
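As a minimal sketch of what such a script might produce (the filename here is illustrative, not the path QuantGov requires), a vectorizer can be pickled and reloaded with joblib:

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer with default settings and pickle it with joblib.
# The filename "vectorizer.joblib" is illustrative.
vectorizer = CountVectorizer()
joblib.dump(vectorizer, "vectorizer.joblib")

# The saved object can be reloaded later and handed to the estimator.
reloaded = joblib.load("vectorizer.joblib")
```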

A CountVectorizer does exactly what its name implies: it counts individual words, mapping each word it encounters to a specific position in the document vector. Each individual document is mapped to a vector with a count of how many times each word in the entire corpus appears in that document. Many modifications can be made to customize vectorization. The CountVectorizer has arguments to specify the word pattern, count multi-word phrases, eliminate stop words, and more. There are also other, more complex methods of vectorization, such as Google’s word2vec algorithm. The default setup, however, is a sufficient starting place for this tutorial.
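To make the counting concrete, here is a small sketch on a toy two-document corpus. The argument choices (bigrams, English stop words) are illustrative, not the QuantGov defaults:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string stands in for one document.
docs = [
    "the rule requires annual reporting",
    "the agency may waive the reporting requirement",
]

# Count single words and two-word phrases, dropping English stop words.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(docs)

# Each row is one document; each column is one word or phrase
# drawn from the vocabulary learned over the whole corpus.
print(counts.shape)
```

Words such as "reporting" that appear in both documents occupy the same column in both row vectors, which is what lets the counts be compared across documents.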

Vectorization can be accomplished with the command quantgov ml vectorize, which takes two positional arguments:

  • vectorizer: the path to the saved vectorizer object

  • corpus: the path to the target corpus

The following argument must also be specified:

  • --outfile or -o: the path to which to save the vectorized trainers

Vectorization - Practice Estimator#

Note

This next step in the tutorial will not work unless you have completed Chapter 3 and downloaded the Federal Register corpus.

Navigate into the estimator folder and run the following to create the trainers and vectorizer:

python scripts/vectorize_trainers.py -o data/trainers --vectorizer_outfile data/vectorizer ../federal_register

Once this code finishes running, a trainers file and a vectorizer file should appear in the estimator’s data folder.

Advanced Tip

If a user is interested in implementing a custom tokenization method, this is the step and the file in which the customization would be done. See the build_tokenizer() documentation.
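One way to customize tokenization is to pass a callable as the tokenizer argument of CountVectorizer, which replaces its default token pattern. The tokenizer below is a hypothetical example, not part of QuantGov:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical custom tokenizer: lower-case the text and keep
# only alphabetic tokens of three or more characters.
def custom_tokenizer(text):
    return re.findall(r"[a-z]{3,}", text.lower())

# Passing a callable as `tokenizer` overrides the default token pattern.
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
matrix = vectorizer.fit_transform(["An Act to amend 42 U.S.C. 1983"])
```

With this tokenizer, short function words and numbers never reach the vocabulary, so the resulting vectors contain only the longer alphabetic terms.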

Advanced Tip

Advanced users may implement their own vectorizers by subclassing the sklearn.base.BaseEstimator and sklearn.base.TransformerMixin classes; all vectorizers should accept an iterable of text objects in their fit and fit_transform methods.
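The interface can be sketched with a deliberately simple custom vectorizer that represents each document by a single feature, its length in characters. A real vectorizer would build a richer representation; only the subclassing pattern and the iterable-of-texts contract shown here matter:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DocLengthVectorizer(BaseEstimator, TransformerMixin):
    """Toy vectorizer: one feature per document, its character count."""

    def fit(self, docs, y=None):
        # Nothing to learn for this toy feature, but fit must return self.
        return self

    def transform(self, docs):
        # Accept any iterable of text objects and return a 2-D array
        # with one row per document.
        return np.array([[len(doc)] for doc in docs])

vec = DocLengthVectorizer()
features = vec.fit_transform(["short", "a longer document"])
```

Because TransformerMixin supplies fit_transform from fit and transform, only those two methods need to be written.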

Advanced Tip

In evaluating models, the vectorization step takes place before segmentation for cross-validation; this means that users should take care not to let information correlated with the variable of interest be used here. Instead, include a transformer step in the candidate model pipeline, which is used after the test-train split.
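A sketch of that pattern, using sklearn's Pipeline: the tf-idf weighting sits inside the pipeline, so cross-validation re-fits it on each training fold and no information from the held-out fold leaks in. The toy documents and labels below are invented for illustration, and this self-contained example folds the counting step into the pipeline as well:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy labeled trainers; real data would come from the vectorized corpus.
docs = ["annual report required", "no filing needed", "report due annually",
        "exempt from filing", "submit annual report", "filing not required"]
labels = [1, 0, 1, 0, 1, 0]

# Putting the tf-idf weighting inside the pipeline means it is re-fit
# on each training fold, after the test-train split, not before it.
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

scores = cross_val_score(model, docs, labels, cv=3)
```

Fitting the weighting step on the full dataset before splitting would let statistics of the test documents influence the training features, which is exactly the leakage this tip warns against.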