5.4: Candidate Models#


QuantGov can test multiple candidate models, each with a variety of hyperparameters. Candidate models must be specified in a Python module as a module-level variable named models, which is a sequence of quantgov.ml.CandidateModel objects. A CandidateModel is initialized with three arguments (a minimal example follows the list below):

  • name: the user-facing name of the model

  • model: a scikit-learn estimator or pipeline, or a class that implements the scikit-learn interface

  • parameters: a dictionary of hyperparameters to tune, where the keys are the parameter names and the values are sequences of possible values. See the scikit-learn documentation for param_grid, in both the narrative user guide and the API reference.
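
A minimal candidate_models.py might therefore look like the sketch below. The pipeline steps, step names, and parameter values are illustrative choices, not defaults required by QuantGov.

    # candidate_models.py -- a minimal sketch of a candidate-model module.
    # QuantGov looks for the module-level name `models`.
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    import quantgov.ml

    models = [
        quantgov.ml.CandidateModel(
            name="Logistic Regression",
            model=Pipeline([
                ("tfidf", TfidfTransformer()),    # normalize word counts
                ("logit", LogisticRegression()),  # ridge (L2) penalty by default
            ]),
            # Keys follow scikit-learn's <step>__<parameter> grid convention;
            # these particular values are illustrative.
            parameters={"logit__C": (0.01, 0.1, 1, 10, 100)},
        ),
    ]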

To ease initial forays into machine learning, the QuantGov library includes starter sets of candidate models for different types of problems in the quantgov.ml.candidate_sets module. These can be imported directly (as in the estimator skeleton) or copied and customized as needed. The default QuantGov estimator uses the candidate set for binary or multiclass classification, which is appropriate for the problem being solved in this tutorial. This candidate set consists of two pipelines. The first uses a Term Frequency-Inverse Document Frequency (TF-IDF) transformation to normalize word counts and a Logistic Regression classifier with a ridge (L2) penalty for classification. The second uses the same TF-IDF normalization and a Random Forest classifier. The Logistic Regression, or Logit, model is tested with regularization coefficients ranging from 0.01 to 100; the Random Forest model is tested with numbers of trees ranging from 5 to 100.
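If you only need the starter models, they can be imported directly, as the estimator skeleton does. Note that the classification attribute name below is an assumption; check the quantgov.ml.candidate_sets module for the names it actually exports.

    # Reuse the built-in starter models instead of defining your own.
    # NOTE: `classification` is an assumed attribute name; verify it against
    # the quantgov.ml.candidate_sets module in your installed version.
    from quantgov.ml import candidate_sets

    models = candidate_sets.classification
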

Advanced Tip

Any scikit-learn model can be used with the QuantGov library. Users who are interested in trying a wide variety of models should research which models may be best for their specific question and insert these in the quantgov.ml.candidate_sets module.
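
As an illustration, a candidate built around a different scikit-learn model, such as a linear support vector machine, could be added to the models sequence. The model choice, step names, and parameter values below are illustrative assumptions.

    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    import quantgov.ml

    # An additional candidate; works whether `models` is a list or a tuple.
    extra = quantgov.ml.CandidateModel(
        name="Linear SVM",
        model=Pipeline([
            ("tfidf", TfidfTransformer()),
            ("svm", LinearSVC()),
        ]),
        parameters={"svm__C": (0.01, 0.1, 1, 10, 100)},
    )
    models = list(models) + [extra]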

Advanced Tip

Hyperparameters that will be used in the grid search can also be easily adjusted in the quantgov.ml.candidate_sets module.

The candidate models can be trained using the quantgov ml evaluate command, which takes the following positional arguments:

  • modeldefs: the path to the module defining the candidate models

  • trainers: the path to the saved Trainers object

  • labels: the path to the saved Labels object

  • output_results: the path to a CSV file that will list the results of every model evaluated, with every combination of hyperparameters.

  • output_suggestion: the path to a file that will hold the configuration of the best-performing model.

Advanced Tip

The following optional arguments are available for advanced users (an example invocation follows the list below):

  • folds (defaults to 5): the number of folds to use in cross-validation.

  • score (defaults to ‘f1’): the scoring metric for comparing models. See the scikit-learn documentation for a list of valid options.
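
For example, an advanced user might run something like the following. The flag spellings here assume the command-line options mirror the argument names; run quantgov ml evaluate --help to confirm the exact names and defaults.

quantgov ml evaluate scripts/candidate_models.py data/trainers data/labels data/evaluation.csv data/model.cfg --folds 10 --score f1_macro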

The evaluation command automatically selects the model and parameters with the highest score and writes them to the file specified by the output_suggestion argument. Users should inspect the full scoring results, however, and select a model that balances simplicity and performance.

Candidate Models - Practice Estimator#

Our next step before we can evaluate our estimator is to specify a set of candidate models with ranges of parameters. We do this in the candidate_models.py script by defining a sequence of CandidateModel objects, each of which specifies a model (often several steps collected together in a scikit-learn Pipeline) and a dictionary of parameters to test, where the keys are the parameter names and the values are sequences of the parameter values to test.

In the command prompt run:

quantgov ml evaluate scripts/candidate_models.py data/trainers data/labels data/evaluation.csv data/model.cfg

This command uses an exhaustive search approach to model evaluation, trying every candidate model with every combination of parameters. It evaluates each combination using cross-validation: one fold (from the folds specified in the configuration file) is held out, the model is trained on the remaining folds, and the trained model is scored on its predictions for the held-out fold. This process is repeated until each fold has been held out once, producing a mean and standard deviation of the evaluation metric across folds. The command prompt will print information about the different model and parameter combinations as they are evaluated. Typically this takes a few minutes to run, but it can take multiple days depending on the number of documents in the corpus, the number of hyperparameter combinations, and the number of candidate models being evaluated.
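
Conceptually, this is the same procedure as scikit-learn's grid search with cross-validation. The sketch below illustrates the idea for a single candidate; it is not QuantGov's internal code, and candidate, X_train, and y_train are placeholder names.

    # Conceptual illustration only -- not QuantGov's internal code.
    from sklearn.model_selection import GridSearchCV

    search = GridSearchCV(
        estimator=candidate.model,        # one CandidateModel's pipeline
        param_grid=candidate.parameters,  # its hyperparameter grid
        cv=5,                             # the `folds` setting
        scoring="f1",                     # the `score` setting
    )
    search.fit(X_train, y_train)          # features and labels from the Trainers and Labels objects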

When the command finishes running, there will be two new files in the data subdirectory of the estimator: evaluation.csv and model.cfg. Open the evaluation.csv file in your favorite spreadsheet editor. Each row of the CSV represents a single candidate model and set of parameters. The most directly relevant columns are “mean_test_score” and “std_test_score”, which contain the mean and standard deviation of the evaluation metric (in our case, the F1 score) for that model with that set of parameters. A good rule of thumb is to use the simplest model whose score is within one standard deviation of the best-performing one.
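
If you prefer to inspect the results programmatically rather than in a spreadsheet, a short pandas snippet can do the same sorting (the column names are those described above):

    import pandas as pd

    results = pd.read_csv("data/evaluation.csv")
    # Show the best-scoring candidates first
    top = results.sort_values("mean_test_score", ascending=False)
    print(top[["mean_test_score", "std_test_score"]].head())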

In our case, the best-performing model is the Logit model with a regularization coefficient of 100. If we open the model.cfg file in a text editor, we see that this is the configuration QuantGov has selected for us.