Chapter 1: The Basics

What is Python?

Let’s start at a high level and briefly discuss Python and why we choose to use it for text extraction and analysis.

To summarize, Python is an extremely popular interpreted programming language with a simple, easy-to-use syntax. Because Python is used by a large number of people across many different professions, there is a deep well of help guides, user guides, and websites that will support any programmer as they learn the ropes. One of these resources is a website called Stack Overflow. It is completely community driven and provides a platform for new users to ask questions and for more experienced users to answer them. Due to Python’s open-source roots, most Python developers enjoy giving back to the community, which keeps resources like Stack Overflow extremely active and up to date.

Beyond its popularity and support base, Python also benefits from its flexibility. It can be used for all sorts of tasks, such as web development, data visualization, and machine learning. While Python may not be the fastest or the most efficient language for any one of these tasks, it can handle each of them, which is not the case for many other popular programming languages.

What is NLP?

Natural Language Processing, or NLP, is the process of teaching a computer how to understand language and context.

While NLP has broad applications, there are a few popular functions that QuantGov supports with built-in functionality. These functions allow users to complete tasks like counting the words in a document, counting the occurrences of a specific word in a document, and counting the occurrences of conditionals in a document. In addition, other built-in tasks quantify the linguistic complexity of documents using measures such as sentence length and Shannon entropy.
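To make these measures concrete, here is a minimal sketch in plain Python. It is not QuantGov's implementation and does not use the QuantGov API. The function names, the simple tokenizer, and the list of conditional terms are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical list of conditional terms, chosen for illustration only.
CONDITIONALS = ("if", "but", "except", "provided", "when", "where", "unless")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text):
    """Total number of word tokens in the text."""
    return len(tokenize(text))


def count_occurrences(text, word):
    """Number of times a specific word appears, ignoring case."""
    return tokenize(text).count(word.lower())


def count_conditionals(text):
    """Number of tokens that appear in the CONDITIONALS list."""
    return sum(1 for token in tokenize(text) if token in CONDITIONALS)


def mean_sentence_length(text):
    """Average number of words per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def shannon_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


sample = "You may enter if you hold a permit. Visitors must register unless exempt."
print(count_words(sample))
print(count_occurrences(sample, "you"))
print(count_conditionals(sample))
print(mean_sentence_length(sample))
print(shannon_entropy(sample))
```

In this sketch, a document that repeats the same few words has low entropy, while a document with a varied vocabulary has high entropy. A real tool would likely use a more careful tokenizer and sentence splitter than the regular expressions assumed here.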

What is Machine Learning?

Machine learning is a buzzword that has garnered a lot of attention in recent years. If you were to poll random individuals on what they believe machine learning is, you would probably get a variety of responses. Some may point to the famous 1999 film The Matrix, while others might suggest that machine learning is similar to training a dog.

While both of these would be interesting (and probably common) responses, neither is accurate. Machine learning can be summarized by its core method, its core resource, and its core advantage. Let’s start with the method. Machine learning is the application of statistical algorithms. These algorithms take various forms and range in complexity, but they all aim to accomplish the same thing: pattern identification. This is where the core resource, data, comes into play. Machine learning algorithms strive to identify patterns within large amounts of data, and almost every set of data has patterns that can be identified. These algorithms can be used to identify individuals within a facial recognition system, generate a custom playlist out of millions of songs, or even predict the direction and intensity of natural disasters. The range of applications is enormous. This brings us to the core advantage: machine learning uses the processing power of computers to run these algorithms over and over again in an attempt to identify patterns in large amounts of data. Such a task may be possible for humans, but it would take far longer and be much more error prone.
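As a small illustration of pattern identification, the sketch below fits a straight line to a handful of data points using ordinary least squares, one of the simplest statistical algorithms. The data are invented for this example, and the function name fit_line is hypothetical. Real machine learning work typically involves far more data and more sophisticated algorithms.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: hours studied and the resulting test score.
hours = [1, 2, 3, 4, 5]
scores = [51, 55, 56, 57, 61]

slope, intercept = fit_line(hours, scores)

# Use the identified pattern to predict the score for an unseen input.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

The algorithm never sees a rule such as "more study hours lead to higher scores". It recovers that pattern from the data alone, and the same idea scales up to the larger data sets described above.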

What is Computational Social Science?

Computational social science is an increasingly popular field of study. More precisely, it is an intersection of computational science (specifically statistics and computer science) and social science (such as political science, economics, psychology, and many others). This intersection has arisen mostly because of the growing amount of data available for analysis in all fields of social science, paired with advances in computational science that have made large-scale statistical analysis more feasible for the average researcher.

QuantGov is a great tool for those looking to start or advance their computational social science skills. While the platform has been mostly used in the fields of economics and political science, QuantGov does not target a specific social science field. The platform can be used to analyze text from any source and from any field of study.

Now that we have covered the basics, the remainder of this chapter will walk through how to install Python and QuantGov on both Windows and Mac computers so that you can start using NLP and machine learning to solve all sorts of unique problems.