5.2: Vectorization#
In order to apply mathematical and statistical techniques to text documents, those documents must first be converted to sets of numbers, or vectors. This process is called vectorization, and the Python objects that carry it out are called vectorizers. Each word within the documents is assigned a position in a numeric vector. In addition to converting the text into a form that sklearn can process, vectorization makes training an algorithm far more efficient.
QuantGov estimators currently expect a joblib-pickled scikit-learn vectorizer, such as a CountVectorizer. Users may want to customize the CountVectorizer defined in the scripts/create_vectorizer.py script in the estimator skeleton. Recommendations for this customization can be found in the CountVectorizer() documentation.
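A minimal sketch of what such a customization might look like, assuming the script configures a CountVectorizer and pickles it with joblib; the argument values and output filename below are illustrative, not QuantGov defaults:

```python
# Sketch: configure and pickle a customized CountVectorizer.
# Argument values and the output filename are illustrative only.
import joblib
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # count single words and two-word phrases
    stop_words='english',  # drop common English stop words
    min_df=2,              # ignore words appearing in fewer than two documents
)

# Serialize the (unfitted) vectorizer so an estimator can load it later
joblib.dump(vectorizer, 'vectorizer.joblib')
```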
A CountVectorizer does exactly what its name implies: it counts individual words, mapping each word it encounters to a specific position in the document vector. Each individual document is mapped to a vector counting how many times each word in the entire corpus vocabulary appears in that document. Many modifications can be made to customize vectorization. The CountVectorizer has arguments to specify the word pattern, count multi-word phrases, eliminate stop words, and more. There are also other, more complex methods of vectorization, such as Google's word2vec algorithm. The default setup, however, is a sufficient starting place for this tutorial.
Vectorization can be accomplished with the command quantgov ml vectorize, which takes two positional arguments:
vectorizer: the path to the saved vectorizer object
corpus: the path to the target corpus
The following argument must also be specified:
--outfile or -o: the path to which to save the vectorized trainers
Vectorization - Practice Estimator#
Note
This next step in the tutorial will not work if you have not completed Chapter 3 and downloaded the Federal Register corpus.
Navigate into the estimator folder and run the following to create the trainers and the vectorizer:
python scripts/vectorize_trainers.py -o data/trainers --vectorizer_outfile data/vectorizer ../federal_register
Once this command finishes running, a trainers file and a vectorizer file should appear in the estimator's data folder.
Advanced Tip
If a user is interested in implementing a custom tokenization method, this is the step and the file in which the customization should be done. See the build_tokenizer() documentation.
Advanced Tip
Advanced users may implement their own vectorizers by subclassing the sklearn.base.BaseEstimator and sklearn.base.TransformerMixin classes; all vectorizers should accept an iterable of text objects in their fit and fit_transform methods.
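A minimal sketch of such a custom vectorizer; the two-feature design here is purely illustrative:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LengthVectorizer(BaseEstimator, TransformerMixin):
    """Toy vectorizer mapping each document to (character count, word count)."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the documents

    def transform(self, X):
        return np.array([[len(doc), len(doc.split())] for doc in X])

# fit_transform is inherited from TransformerMixin
vectorizer = LengthVectorizer()
features = vectorizer.fit_transform(["a short rule", "a much longer regulatory text"])
```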
Advanced Tip
In evaluating models, the vectorization step takes place before the data are segmented for cross-validation; users should therefore take care that information correlated with the variable of interest does not leak in at this stage. Instead, include a transformer step in the candidate model pipeline, which runs after the test-train split.
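For example, a data-dependent weighting step such as TF-IDF can live inside the candidate pipeline rather than in the vectorizer, so that cross-validation refits it on each training fold only; the classifier choice here is illustrative:

```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Any transformer whose parameters depend on the data belongs inside the
# pipeline, so cross-validation refits it on each training split only.
model = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('classifier', LogisticRegression()),
])
```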