5.5: Training the Model

Once a model has been selected and the configuration file edited, train the final model on the full set of trainers (rather than the 5-fold cross-validation splits) with the quantgov ml train command. The command takes the following positional arguments:

  • modeldefs: the path to the module defining the candidate models

  • configfile: the path to the configuration file defining the model

  • vectorizer: the path to the saved vectorizer object

  • trainers: the path to the saved Trainers object

  • labels: the path to the saved Labels object

The following argument must also be specified:

  • outfile or -o: the path to which to save the estimator. By convention, QuantGov estimators have the extension .qge.

The resulting estimator is a self-contained file that can be distributed. Users still need to run it with compatible versions of the underlying libraries, such as scikit-learn and pandas, so the author of the estimator should document which versions were used.
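One way to document the library versions is to record them at training time. The following is a minimal sketch that uses only the Python standard library; the function name library_versions is hypothetical and is not part of QuantGov.

```python
from importlib import metadata


def library_versions(packages=("scikit-learn", "pandas")):
    """Return a {package: version} mapping for the given packages.

    Packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package, suitable for pasting into a README.
    for name, version in library_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

The printed lines can be saved next to the .qge file so that users know which environment the estimator was built in.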

Warning

Any preprocessing applied to the training documents must also be applied to every document that the resulting estimator classifies. Skipping this step means the estimator sees inputs unlike those it was trained on, which may produce faulty results.
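A simple way to keep preprocessing consistent is to put it in one function and call that function on both the training documents and the documents to be classified. The sketch below illustrates the idea; the preprocess function and the sample documents are hypothetical and stand in for whatever preprocessing a given estimator actually uses.

```python
import re


def preprocess(text):
    """Hypothetical preprocessing: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


# Illustrative documents only; real documents would come from a corpus.
training_docs = ["  Trade   Agreement\nSigned ", "Local ZONING notice"]
new_docs = ["TRADE agreement   signed"]

# The same function is applied to both sets, so the estimator sees
# new documents in the same form as the documents it was trained on.
prepared_training = [preprocess(doc) for doc in training_docs]
prepared_new = [preprocess(doc) for doc in new_docs]
```

Because both lists pass through the same function, a change to the preprocessing only has to be made in one place.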

Training - Practice Estimator

All that remains is to train the model configured in the model.cfg file on all of the trainer documents. Run the following command from the estimator folder:

quantgov ml train scripts/candidate_models.py data/model.cfg data/vectorizer data/trainers data/labels -o data/is_world_classifier.qge
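For users who prefer to launch training from a Python script, the command above can be assembled and run with the standard library. This is only a sketch: the helper names build_train_command and missing_inputs are hypothetical, and the paths are copied from the command above and assume the script is run from the estimator folder.

```python
import subprocess
from pathlib import Path


def build_train_command(data_dir="data", scripts_dir="scripts"):
    """Assemble the training command shown above as an argument list."""
    data = Path(data_dir)
    scripts = Path(scripts_dir)
    return [
        "quantgov", "ml", "train",
        (scripts / "candidate_models.py").as_posix(),
        (data / "model.cfg").as_posix(),
        (data / "vectorizer").as_posix(),
        (data / "trainers").as_posix(),
        (data / "labels").as_posix(),
        "-o", (data / "is_world_classifier.qge").as_posix(),
    ]


def missing_inputs(command):
    """Return the input paths in the command that do not exist on disk."""
    inputs = command[3:command.index("-o")]
    return [path for path in inputs if not Path(path).exists()]


if __name__ == "__main__":
    cmd = build_train_command()
    missing = missing_inputs(cmd)
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        # check=True raises CalledProcessError if training fails.
        subprocess.run(cmd, check=True)
```

Checking for missing inputs first gives a clearer message than a failure partway through the command.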

The resulting model is packaged in the is_world_classifier.qge file in the data folder. It can now be used to predict whether a document would be found in the “world” section of the Federal Register.