Chapter 4: Running NLP on a Corpus


Now that you have a practice corpus up and running, you can use the QuantGov library to analyze it. While it is easy to write custom Python scripts that analyze a corpus through the driver interface, we have integrated a number of the most common NLP tasks into the QuantGov library itself. To see all available commands, run the following at the command line:

quantgov nlp -h

As mentioned previously, NLP is the process of teaching a computer to understand language and its context. The computer may use this information to analyze language (in its many forms), manipulate language, or even produce language. The results of these actions have broad applications in our everyday lives. In fact, you most likely interact with NLP on a daily basis without even knowing it. Here are a few common applications of NLP:

  • Spell-check and other grammar checks run by word processors are backed by NLP. A computer would be unable to recommend “than” instead of “then” if it did not understand the overarching context of a sentence.

  • While they may not be loved, the automated interactive voice response systems that almost every large company uses to route phone inquiries take in, analyze, and act on human language. All such systems fall under the NLP umbrella.

  • Better-loved variants of voice response systems, such as Google, Alexa, and Siri, also use NLP to process requests.

  • Often used and underappreciated, spam folders are an extremely common application of NLP. Spam messages do not come with built-in flags that tell email applications to deposit them in a spam folder. Instead, these decisions are made by trained algorithms that recognize email content that you have considered spam in the past or most likely will in the future.

  • Chatbots are a more direct example: they take in language in text form, process it, and respond appropriately.

While all of the above are common forms of NLP, the QuantGov library focuses primarily on processing the content of language in the form of machine-readable text. We will work through some of these examples in the next two sections.
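To make the idea of processing machine-readable text more concrete, the spam-folder example from the list above can be sketched in a few lines of Python. This is only an illustration of one classic technique (a naive Bayes classifier with add-one smoothing), not part of the QuantGov library and not how any particular email provider works. The function names and the tiny training set are made up for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_messages):
    """Count words per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the label with the higher naive Bayes log-score."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(message_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: how common this label was in the training data.
        score = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training data for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the attached report", "ham"),
]

word_counts, message_counts = train(TRAINING)
print(classify("claim your free prize", word_counts, message_counts))
print(classify("agenda for the monday meeting", word_counts, message_counts))
```

A real spam filter is trained on far more data and usually draws on many more signals than word counts, but the core idea is the same: the text is turned into countable features, and a model learned from past examples decides where a new message belongs.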