5.5: Training the Model
Once a model has been selected and the configuration file edited, the final model can be trained on the full trainer set (instead of the 5-fold set) with the quantgov ml train command, which takes the following positional arguments:
modeldefs: the path to the module defining the candidate models
configfile: the path to the configuration file defining the model
trainers: the path to the saved Trainers object
labels: the path to the saved Labels object
The following argument must also be specified:
outfile or -o: the path to which to save the estimator. By convention, QuantGov estimators have the extension .qge.
The resulting Estimator is a self-contained file that can be distributed. However, users may need to ensure that they are using compatible versions of the underlying libraries, such as scikit-learn and pandas. These should be documented by the author of the estimator.
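One way an estimator author might document those versions is to write them to a small text file distributed alongside the .qge file. The following is a minimal sketch and is not part of QuantGov: the helper name record_versions and the output file name are hypothetical, and the package list is an assumption based on the libraries named above.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def record_versions(packages, outfile):
    """Write one 'name==version' line per package to outfile.

    Hypothetical helper for illustration; not a QuantGov function.
    """
    lines = []
    for name in packages:
        try:
            # Look up the installed distribution's version string
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Record missing packages explicitly instead of failing
            lines.append(f"{name} (not installed)")
    Path(outfile).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, calling record_versions(["quantgov", "scikit-learn", "pandas"], "versions.txt") in the environment used for training would produce a file that users can compare against their own installation.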
Warning
Any preprocessing applied to the training documents must also be applied to every document that the resulting estimator classifies. Otherwise the estimator receives inputs unlike those it was trained on, which may produce faulty results.
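This point can be sketched in Python as a single preprocessing function shared by the training and classification steps. The sketch is illustrative only: preprocess and its specific steps (lowercasing and whitespace normalization) are hypothetical stand-ins for whatever preprocessing was actually applied to the trainer documents, the sample strings are made up, and nothing here is part of the QuantGov API.

```python
import re


def preprocess(text):
    """Hypothetical preprocessing: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


# Made-up sample documents standing in for the trainer documents
training_docs = ["Trade  Agreement\nWith Canada", "Local Road   Repairs"]
training_ready = [preprocess(doc) for doc in training_docs]

# The same function is applied to every document classified later
new_docs = ["  Treaty On\tFisheries "]
new_ready = [preprocess(doc) for doc in new_docs]

print(training_ready)
print(new_ready)
```

Keeping the preprocessing in one function that both steps call makes it harder for the two code paths to drift apart.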
Training - Practice Estimator
All that remains is to train the model configured in the model.cfg file using all the trainer documents. This can be accomplished by running the following command in the estimator folder:
quantgov ml train scripts/candidate_models.py data/model.cfg data/vectorizer data/trainers data/labels -o data/is_world_classifier.qge
The resulting model is packaged in the is_world_classifier.qge file in the data folder and can now be used to predict whether any document belongs in the “world” section of the Federal Register.