4.1: Built-in NLP Analyses

The QuantGov library includes a number of built-in utilities to perform NLP on a corpus. These can be run using the quantgov nlp set of commands. All nlp analyses are run at the document level and output the resulting data in a .csv file.

  • count_words: By default, this analysis counts the number of words in each document in a corpus. The QuantGov library's default definition of a word is the regular expression \b\w+\b, which matches any run of one or more word characters (\w) bounded on each side by a word boundary (\b). Word characters are letters, digits, and the underscore; a word boundary is the position between a word character and a non-word character such as whitespace or punctuation. The default pattern can be overridden with the --word_pattern argument.

  • count_occurrences: This analysis counts the non-overlapping occurrences of a list of words, phrases, or regular expressions. When items in the list overlap, the longest items take precedence. A total column may be specified with the --total_label argument. A common use in analyzing regulatory text is counting restrictive terms such as “shall” and “must”.

  • shannon_entropy: Shannon entropy measures, in broad terms, how frequently new ideas are introduced in a document; simpler and more focused documents have a lower entropy score. Shakespeare's works, which typically have an entropy score between 9.0 and 9.7, are a good baseline for this metric.

  • conditional_counter: This analysis counts the occurrences of words and phrases that are defined as conditionals. These words typically bridge to new or different thoughts. These words and phrases are “if”, “but”, “except”, “provided”, “when”, “where”, “whenever”, “unless”, “notwithstanding”, “in the event”, and “in no event”.

  • sentence_length: This analysis simply calculates the average sentence length of each document in the corpus.

  • sentiment_analysis: This analysis quantifies the polarity and subjectivity of each document. It returns two metrics per document: sentiment_polarity and sentiment_subjectivity. Sentiment polarity is measured on a -1.0 to 1.0 scale, where -1.0 is extremely negative and 1.0 is extremely positive. Sentiment subjectivity is measured on a 0.0 to 1.0 scale, where 0.0 is very objective and 1.0 is very subjective.
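
To make the word and term counting described above concrete, the following is a minimal Python sketch, not QuantGov's own implementation. The function names count_words and count_occurrences are borrowed from the analysis names for readability only; the signatures, the case-insensitive matching, and the treatment of every term as a literal string (rather than a regular expression) are assumptions made for illustration.

```python
import re
from collections import Counter

# Default word definition quoted in the count_words description.
WORD_PATTERN = re.compile(r"\b\w+\b")


def count_words(text, pattern=WORD_PATTERN):
    """Count the non-overlapping matches of the word pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


def count_occurrences(text, terms):
    """Count non-overlapping matches of each literal term (illustrative helper).

    Assumption: matching is case-insensitive and terms are plain strings.
    Terms are tried longest first, so "shall not" wins over "shall" when
    both could match at the same position.
    """
    terms = [term.lower() for term in terms]
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter({term: 0 for term in terms})
    for match in pattern.finditer(text):
        counts[match.group(0).lower()] += 1
    return dict(counts)


if __name__ == "__main__":
    text = (
        "The operator shall file a report. "
        "The report must not be late, and it shall be signed."
    )
    print(count_words(text))            # 17
    # The apostrophe is a non-word character, so "don't" counts as two words.
    print(count_words("don't stop"))    # 3
    counts = count_occurrences(text, ["shall", "must"])
    print(counts, sum(counts.values()))
```

Summing the per-term counts, as in the last line, corresponds to the kind of total column that --total_label adds.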

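The entropy score can be illustrated with a short sketch. It assumes the score is the Shannon entropy, in bits, of the distribution of word frequencies in a document; QuantGov's own implementation may preprocess the text differently (for example by removing stop words), so the function below is a hypothetical illustration rather than the library's code.

```python
import math
import re
from collections import Counter


def shannon_entropy(text):
    """Entropy, in bits, of the word-frequency distribution of text.

    Assumption: words are lowercased matches of \\b\\w+\\b with no further
    preprocessing. An empty document is given an entropy of 0.0.
    """
    words = re.findall(r"\b\w+\b", text.lower())
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct word.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # One repeated word carries no new information.
    print(shannon_entropy("shall shall shall shall"))
    # Four distinct, equally frequent words give log2(4) = 2 bits.
    print(shannon_entropy("if but except unless"))
```

Under this definition a document that repeats a small vocabulary scores low, while one that spreads its words evenly over a large vocabulary scores high, which matches the intuition that simpler and more focused documents have lower entropy.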
A quick example of sentiment analysis

The sentence “The Pirates are the worst team in the MLB.” receives a score of (-1.0, 1.0), as it is highly negative and highly subjective. “The Pirates were founded in 1887.” receives a score of (0.0, 0.0), as it is neither negative nor positive and is not subjective in the slightest.