3:2: Types of Drivers

The QuantGov library supports several types of drivers. Each driver locates and serves the files of a corpus in a different way, so users can quantify documents however they happen to be stored. Below, we will walk through each type of driver that QuantGov supports.

Recursive Driver

The RecursiveDirectoryCorpusDriver serves files stored in a directory (also called a folder) on the local computer. The driver is called “recursive” because it descends into subdirectories recursively to serve all files found. The index labels, which are specified in the object constructor, should match the levels of the directory tree in which the corpus files are stored, with the file name itself supplying the last level.

Drawing on the example from Section 3-1, the CFR corpus has three index levels: year, title, and part. To use the RecursiveDirectoryCorpusDriver, we would organize our files so that there is one directory for each year. Within each year directory, we would have one folder for each title that appears in that year. Then, within the title directories, we would have one text file for each part. This means that the 1997 edition of Title 26 Part 1 would be stored at data/clean/1997/26/1.txt. The corpus driver for this layout would be as simple as:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

driver = qg.corpus.RecursiveDirectoryCorpusDriver(
    directory=BASE_DIR.joinpath('data', 'clean'),
    index_labels=('year', 'title', 'part')
)
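To make the relationship between directory depth and index labels concrete, here is a minimal sketch that uses only the standard library. It is not QuantGov's implementation; the helper names iter_index_and_paths and build_demo_corpus are hypothetical, and the temporary directory stands in for data/clean.

```python
import tempfile
from pathlib import Path


def iter_index_and_paths(directory, index_labels):
    """Yield (index, path) for files exactly len(index_labels) levels deep.

    Hypothetical helper for illustration only, not part of QuantGov.
    """
    directory = Path(directory)
    depth = len(index_labels)
    for path in sorted(directory.rglob('*')):
        if not path.is_file():
            continue
        # Drop the extension so the file name supplies the last index level.
        parts = path.relative_to(directory).with_suffix('').parts
        if len(parts) == depth:
            yield parts, path


def build_demo_corpus(root):
    """Create a tiny stand-in for data/clean with two parts of Title 26."""
    for year, title, part in [('1997', '26', '1'), ('1997', '26', '2')]:
        target = Path(root, year, title, part + '.txt')
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text('placeholder text', encoding='utf-8')


with tempfile.TemporaryDirectory() as tmp:
    build_demo_corpus(tmp)
    labels = ('year', 'title', 'part')
    for index, path in iter_index_and_paths(tmp, labels):
        # Pair each label with the matching path component.
        print(dict(zip(labels, index)), path.name)
```

Each file's position in the tree becomes its index, which is why the number of labels has to agree with the number of levels.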

Name Pattern Driver

The NamePatternCorpusDriver uses a regular expression to specify the name pattern of files in a given folder. All files are expected to be in the specified folder, and the file names are expected to contain all the elements of the index.

Again using the CFR as an example, in this case we might have the 1997 edition of Title 26 Part 1 in the file data/clean/1997-26-1.txt. We would then specify each part of the index in named pattern groups. The resulting driver would look like this:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

driver = qg.corpus.NamePatternCorpusDriver(
    directory=BASE_DIR.joinpath('data', 'clean'),
    pattern=r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)'
)
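To see what the named groups in such a pattern capture, the pattern can be tried directly with Python's re module. This is a small sketch, independent of QuantGov; the helper index_from_stem is a hypothetical name used only for illustration.

```python
import re

# The same pattern as in the driver: one named group per index level.
PATTERN = re.compile(r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)')


def index_from_stem(stem):
    """Return the named groups found in a file name stem, or None if no match."""
    match = PATTERN.search(stem)
    return match.groupdict() if match else None


# Group names ordered by their position in the pattern.
labels = tuple(sorted(PATTERN.groupindex, key=PATTERN.groupindex.get))
print(labels)                        # ('year', 'title', 'part')
print(index_from_stem('1997-26-1'))  # {'year': '1997', 'title': '26', 'part': '1'}
print(index_from_stem('notes'))      # None
```

A file name that does not match the pattern yields no index at all, so it is worth checking a few real file names against the pattern before building a corpus around it.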

While regular expressions may seem confusing at the outset, they are extremely useful and quite logical. For more information on regular expressions, see these links:

Index Driver

The IndexDriver reads a CSV file in which each row represents a document. The last column is assumed to be the path to the file containing that document, and all preceding columns are assumed to be the index. The first row is assumed to be a header, and the headers of the index columns are used as the index level names.

For the CFR example, the index CSV file might look like this:

year,title,part,path
1997,26,1,data/clean/97-26-01.txt
1997,26,2,data/clean/97-26-02.txt
1997,26,3,data/other/Y1997T26P03.txt

Assuming the file above is saved in data/index.csv, the driver.py file for this corpus would be as follows:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent
driver = qg.corpus.IndexDriver(index=BASE_DIR.joinpath('data', 'index.csv'))
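The way such an index file splits into index columns and a path column can be sketched with the standard csv module. This is an illustration of the layout described above, not QuantGov's own code; the read_index helper is a hypothetical name, and the CSV is held in a string so the example runs on its own.

```python
import csv
import io

# In-memory copy of the first rows of the example index file.
INDEX_CSV = """year,title,part,path
1997,26,1,data/clean/97-26-01.txt
1997,26,2,data/clean/97-26-02.txt
"""


def read_index(handle):
    """Return (index_labels, rows), where each row is (index_tuple, path)."""
    reader = csv.reader(handle)
    header = next(reader)
    # Every column except the last names an index level.
    index_labels = tuple(header[:-1])
    # Skip blank lines; the last column of each row is the document path.
    rows = [(tuple(row[:-1]), row[-1]) for row in reader if row]
    return index_labels, rows


index_labels, rows = read_index(io.StringIO(INDEX_CSV))
print(index_labels)
for index, path in rows:
    print(index, path)
```

Because the path is read from the file rather than derived from the index, the documents can live anywhere and follow any naming scheme.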

Custom Driver

Users may also define their own driver classes. Custom drivers should subclass quantgov.corpus.structures.CorpusDriver and override the stream method. The stream method should generate quantgov.corpus.structures.Document objects, which hold both the index and the text of the document.
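The shape of a custom driver can be sketched without QuantGov installed. In the sketch below, Document is a stand-in for quantgov.corpus.structures.Document, and InMemoryDriver is a hypothetical class that does not subclass anything; a real custom driver would subclass quantgov.corpus.structures.CorpusDriver and yield QuantGov's own Document objects. The constructor arguments shown here are assumptions made for the example.

```python
from collections import namedtuple

# Stand-in for quantgov.corpus.structures.Document: holds an index and the text.
Document = namedtuple('Document', ['index', 'text'])


class InMemoryDriver:
    """Hypothetical driver that serves documents held in a dictionary.

    A real driver would subclass quantgov.corpus.structures.CorpusDriver;
    this class only illustrates what a stream method is expected to do.
    """

    def __init__(self, index_labels, documents):
        self.index_labels = tuple(index_labels)
        self.documents = documents  # maps index tuple -> document text

    def stream(self):
        """Generate one Document per entry, in sorted index order."""
        for index, text in sorted(self.documents.items()):
            yield Document(index=index, text=text)


demo_driver = InMemoryDriver(
    index_labels=('year', 'title', 'part'),
    documents={
        ('1997', '26', '2'): 'Text of part 2.',
        ('1997', '26', '1'): 'Text of part 1.',
    },
)
for document in demo_driver.stream():
    print(document.index, document.text)
```

The same pattern applies to any source, such as a database or a web API: whatever stream does internally, it should yield one document at a time with its index and text.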