3.1: Driver Basics

Each corpus should contain a Python script named driver.py. This driver serves two important functions. First, it specifies how the corpus should be indexed. An index is one or more values that, taken together, uniquely identify each document in the corpus. An index can be as simple as an id number, or it can be more descriptive. For example, in the Code of Federal Regulations (CFR), each document represents a single subdivision called a part and is identified by three pieces of metadata: the year in which the part was printed, the title to which the part belongs, and the part number. Taken together, these three values uniquely identify a specific document.
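As a rough illustration of a multi-part index, the sketch below keys a few documents by (year, title, part). The dictionary, the describe helper, and the sample values are hypothetical placeholders chosen for this example; they are not part of the driver itself.

```python
# Hypothetical sample: documents keyed by a three-part index.
# The numbers and text are placeholders, not real corpus contents.
index_labels = ('year', 'title', 'part')

documents = {
    (2019, 12, 225): 'text of one part as printed in 2019',
    (2020, 12, 225): 'text of the same part as printed in 2020',
    (2020, 40, 60): 'text of a different part printed in 2020',
}


def describe(index):
    """Pair each index value with its label."""
    return dict(zip(index_labels, index))


# The same title and part appear twice, but the year keeps the keys unique.
labeled = describe((2020, 12, 225))
```

No single value is enough here: the title and part repeat across years, so only the full combination identifies one document.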

Opening up the example driver in the downloaded corpus with Visual Studio Code or another text editor will display a few lines of code.

Let’s start with line 6:

directory=Path(__file__).parent.joinpath('data', 'clean')

This line directs the code to where the documents are located within the corpus folder. In the test example here, the documents will eventually be stored within the data/clean directory path. This path should be edited if the documents will be stored in a different location or if folder names are changed.
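To see what this expression resolves to, here is a minimal sketch. The function name clean_directory and the /corpus path are assumptions made for illustration; in the real driver the expression uses __file__ directly.

```python
from pathlib import Path


def clean_directory(driver_file):
    """Return the data/clean folder that sits beside the given driver file."""
    # .parent is the folder containing the driver; joinpath appends
    # the two subfolder names in order.
    return Path(driver_file).parent.joinpath('data', 'clean')


# Hypothetical location of a driver.py inside a corpus folder.
example = clean_directory('/corpus/driver.py')
```

Because the path is built relative to the driver file, moving the whole corpus folder does not require editing this line; only renaming data or clean does.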

Line 7 is also important:

index_labels='filename'

This line tells the driver which values make up the index. In this case, each document's filename is the only value in the index. As another example, if a set of documents were stored in folders by year, and then topic, and then filename, line 7 would look like this:

index_labels=('year', 'topic', 'filename')

Note that unlike line 6, which must contain the actual folder names and directly reference a directory path, line 7 contains implied directory references. The above example implies that there are three more levels of folder/file structure, and we are naming their values in the index as year, topic, and filename. The labels themselves are arbitrary: they could be replaced with bananas, apples, and blueberries if you prefer, and the script would not break.
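The relationship between folder levels and index labels can be sketched as follows. This is not the library's implementation: index_documents is a hypothetical helper, and it assumes that one label corresponds to one path level and that the file extension is dropped from the last value.

```python
import tempfile
from pathlib import Path


def index_documents(directory, index_labels):
    """Map each file under directory to a dict of index label -> path part."""
    if isinstance(index_labels, str):
        index_labels = (index_labels,)
    directory = Path(directory)
    # One wildcard per label: three labels means looking three levels deep.
    pattern = '/'.join(['*'] * len(index_labels))
    records = []
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            # Assumption: the extension is not part of the index value.
            parts = path.relative_to(directory).with_suffix('').parts
            records.append(dict(zip(index_labels, parts)))
    return records


# Build a small throwaway tree: year / topic / filename.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ('2019/energy/report_a.txt', '2020/health/report_b.txt'):
        target = root.joinpath(rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text('placeholder text')
    by_folder = index_documents(root, ('year', 'topic', 'filename'))
    renamed = index_documents(root, ('bananas', 'apples', 'blueberries'))
```

Here by_folder and renamed hold the same values under different keys, which is the sense in which the labels are only names. What does matter is the number of labels, since it has to match the depth of the folder structure.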