3.4: Build a Practice Corpus
Now that we have covered the basics of a QuantGov corpus, we will walk through an example from start to finish. In this example, we will process rules published in the Federal Register in 2016. You can download these documents here. The Federal Register is divided into six sections by topic, and the dataset contains the first 100 rules published in each section that were not corrections or withdrawals. Inside the zip archive, the documents are organized into folders corresponding to the sections. Each file name gives the order in which the rule was obtained for its section, and each file contains the text of the rule proposal. For example, the file located at money/0032.txt is the 32nd proposed rule that appeared in the Federal Register’s “Money” section in 2016.
To start creating a corpus, open a command prompt and navigate to the directory where you would like to build the corpus. Run the command quantgov start corpus federal_register. This will automatically download all of the supplemental items discussed in section 3.3 that help analyze the documents downloaded from the Federal Register.
Inserting the Files
To turn these documents and the additional files into a QuantGov corpus, we need to do two things: add the documents to be analyzed to our corpus folder and adjust the driver.py file to reflect the file structure. Begin by unzipping the downloaded Federal Register files, then copy the topic folders into the corpus folder. As mentioned in section 3.3, these folders belong in the data folder. For descriptive purposes, let’s put them in an additional folder named fr_docs, which describes the documents we are working with. After doing this, the aforementioned file should be located at /data/fr_docs/money/0032.txt.
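The unzip-and-copy step can also be scripted. The following is a minimal sketch, assuming the downloaded archive has the section folders at its top level. The helper name `install_documents` and the archive name shown in the comment are hypothetical and are not part of QuantGov.

```python
import zipfile
from pathlib import Path


def install_documents(archive, corpus_dir):
    """Extract every entry in `archive` into <corpus_dir>/data/fr_docs.

    Assumes the section folders (e.g. money/) sit at the top level of the
    archive, so that money/0032.txt ends up at data/fr_docs/money/0032.txt.
    """
    target = Path(corpus_dir) / "data" / "fr_docs"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


# Example call (hypothetical archive name; substitute the file you downloaded):
# install_documents("federal_register_2016.zip", "federal_register")
```

If the archive wraps the section folders in an extra top-level folder, the files will land one level deeper than expected and should be moved up so that the section folders sit directly inside fr_docs.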
Adjusting the Driver
The corpus index is used to identify each document uniquely within the corpus. It can be as simple as the name of the file that the text is stored in but can also provide useful information. For example, the index used for the Code of Federal Regulations corpus, used to create Federal U.S. RegData, has three parts: the year in which the edition of the CFR was published, the CFR title, and the individual part number. It is generally best to choose an index that corresponds to the natural structure of the documents you’re using.
For our 2016 Federal Register corpus, the most natural index has two parts: the section or topic of the Federal Register in which the document was published, and its number within that section. Therefore, the index_labels line in the driver.py file should be adjusted to read: index_labels=('section', 'docno').
We also need to adjust the directory path for the corpus. The directory path tells the driver where to find the indexed files. Adjust that line in driver.py to read: directory=Path(__file__).parent.joinpath('data', 'fr_docs'),
Note
Sometimes it may be difficult to determine what should be part of the directory path and what should be part of the index. A good rule of thumb is to ask what information would be useful as columns in a dataset. In this case, it is clearly useful to know the topic and document number for each row of data. But since every file is inside fr_docs, a column holding that value would be the same in every row and would not be useful.
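The rule of thumb above can be illustrated with plain pathlib. This is a sketch of the idea only and is not QuantGov's internal implementation; the helper `index_for` is hypothetical. Everything in `directory` is dropped, and each remaining path component becomes one index value.

```python
from pathlib import Path


def index_for(path, directory):
    """Return the path components below `directory`, without the file extension."""
    relative = Path(path).relative_to(directory)
    return relative.with_suffix("").parts


directory = Path("data", "fr_docs")
index_labels = ("section", "docno")

# Pair each label with the matching path component for one document.
values = index_for(directory / "money" / "0032.txt", directory)
row = dict(zip(index_labels, values))
print(row)  # {'section': 'money', 'docno': '0032'}
```

Because fr_docs is part of `directory`, it never appears in the row, while the section and document number each get their own column.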
As mentioned in section 3.2, QuantGov supports many different types of drivers. In this case, we are fine using the default, which is the RecursiveDirectoryCorpusDriver.
The final driver.py file should look like this:
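import quantgov as qg
from pathlib import Path

# Folder that contains this driver.py file (the corpus folder)
BASE_DIR = Path(__file__).resolve().parent

# Index each document by its section folder and its file name
driver = qg.corpus.RecursiveDirectoryCorpusDriver(
    directory=BASE_DIR.joinpath('data', 'fr_docs'),
    index_labels=('section', 'docno')
)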
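Before using the driver, it can help to confirm that the files on disk match the two-part index. The following is a minimal sketch using only the standard library. The helper `summarize_layout` is hypothetical and not part of QuantGov, and it assumes the documents are .txt files.

```python
from collections import Counter
from pathlib import Path


def summarize_layout(directory, index_labels):
    """Count documents per top-level folder and list files at an unexpected depth.

    A file matches the index when the number of path components below
    `directory` equals the number of index labels.
    """
    directory = Path(directory)
    counts = Counter()
    misplaced = []
    for path in sorted(directory.rglob("*.txt")):
        parts = path.relative_to(directory).parts
        if len(parts) == len(index_labels):
            counts[parts[0]] += 1
        else:
            misplaced.append(path)
    return counts, misplaced


# Example call from the corpus folder:
# counts, misplaced = summarize_layout(Path("data", "fr_docs"), ("section", "docno"))
```

Given the dataset described at the start of this section, one would expect six sections with 100 documents each and an empty `misplaced` list.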