Chapter 5: Machine Learning on a Corpus


The QuantGov library provides a framework for training machine learning estimators on a corpus, packaging those estimators, and using them to make predictions about other corpora. It is important to note from the outset that no cookie-cutter framework can cover machine learning as a whole. Machine learning is an extremely broad field of study, with thousands of algorithms across dozens of algorithm types, each developed to solve specific kinds of problems.

Therefore, the first goal of the machine learning portion of the QuantGov library is to give experienced data scientists the flexibility to customize each step of building a model. The second goal is to give new data scientists a more direct path that requires minimal customization and Python code while still producing useful models. The following sections of the tutorial are directed more toward the second audience, but they also point out where customization is possible for more experienced data scientists.

Note

QuantGov is built on the scikit-learn library, one of the best open-source machine learning libraries available. Users training new models should familiarize themselves with the basic models in this library. More experienced users can be assured that anything available in the scikit-learn library can be implemented in the QuantGov framework through moderate amounts of customization.
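For readers who have not used scikit-learn before, the following is a minimal sketch of the kind of basic text model the note refers to. It uses plain scikit-learn only and does not show any QuantGov API. The example documents, the labels, and the step names `tfidf` and `clf` are hypothetical and chosen purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy documents and labels, invented for illustration only.
texts = [
    "The operator shall submit a report annually.",
    "Licensees must maintain records for five years.",
    "This part describes the history of the agency.",
    "The following definitions apply to this chapter.",
]
labels = [1, 1, 0, 0]  # 1 = states a requirement, 0 = does not

# A pipeline chains a text vectorizer with a classifier so that both
# are fit together and applied together at prediction time.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

# Predict labels for two new, unseen documents.
new_texts = [
    "Applicants shall file a form.",
    "This section gives an overview.",
]
predictions = pipeline.predict(new_texts)
print(list(predictions))
```

With only four training documents the predictions are not meaningful; the point is the shape of the workflow: vectorize text, fit an estimator, predict on new documents.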

Important

While the practice corpus collected in section 3.4 does not need to be cleaned, any cleaning or preprocessing of corpus files should happen before any of the steps discussed in chapter 5. For text data, this could include removing unnecessary text or removing extra whitespace.
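As one possible illustration of the whitespace cleanup mentioned above, the sketch below uses only the Python standard library. The function name `normalize_whitespace` and the sample text are hypothetical; this is not part of QuantGov, and the right cleaning rules depend on the corpus.

```python
import re


def normalize_whitespace(text):
    """Collapse extra spaces, trim each line, and limit blank lines."""
    # Replace runs of spaces and tabs with a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove leading and trailing whitespace from every line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Reduce three or more consecutive newlines to one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim whitespace at the start and end of the document.
    return text.strip()


# Hypothetical sample text, invented for illustration only.
raw = "  Section 1.   Scope \n\n\n\n The  rule\tapplies. \n"
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

A function like this would typically be applied to each corpus file and the result written back out before any training step is run.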