# Predicting Sentiment from Text

After scraping and analyzing some professor review data, I wanted to know if I could predict sentiment from the text. Reading the reviews as a human, it certainly seems like you can tell a good review from a bad review without looking at the overall score, but could I do this with a machine learning algorithm? First, I used the overall score to distinguish good reviews from bad (see the spread here). Since there are so many “5” scores, those became the good reviews. Then I counted “1” and “2” scores as bad reviews and threw out the reviews with any other score because they were ambiguous.
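As a rough sketch of that labeling rule (not the code from the notebook), the mapping from score to label could look like the following. The function name `label_review` and the sample reviews are made up for illustration, and I am assuming that any score other than exactly 1, 2, or 5 (including half-point scores) counts as ambiguous.

```python
def label_review(score):
    """Map an overall score to a sentiment label, or None if ambiguous."""
    if score == 5:
        return "good"
    if score in (1, 2):
        return "bad"
    return None  # assumption: everything else is dropped as ambiguous


# Hypothetical (score, text) pairs, invented for this example.
reviews = [
    (5, "Great lecturer, very clear."),
    (3, "Okay class, a lot of reading."),
    (1, "Worst class I have taken."),
    (2, "Boring and disorganized."),
    (4.5, "Really enjoyed it."),
]

# Keep only the reviews that received a label.
labeled = [
    (label_review(score), text)
    for score, text in reviews
    if label_review(score) is not None
]
print(labeled)
```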

To use the text to predict the sentiment, I decided to use a “bag of words,” in which I would disregard grammar and word order and just count how often a word appeared in a review. I also threw out “stopwords,” which are common words like “these” and “am.” This throws away a lot of information but makes the analysis much more computationally tractable. Each of the remaining words then becomes a feature that can help us predict the sentiment.
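Here is a minimal sketch of the bag-of-words idea using only the standard library. The stopword list is a tiny made-up one for illustration (a real list would be much longer), and the tokenizing regex is my own simplification.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not the list I actually used.
STOPWORDS = {"these", "am", "the", "a", "is", "i", "and", "was"}


def bag_of_words(text):
    """Count how often each non-stopword appears, ignoring order and grammar."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)


bag = bag_of_words("The worst class. The lectures were the worst and I am bored.")
print(bag)
```

Each key in the resulting counter is one feature, and its count is the feature's value for that review.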

One way to predict the sentiment from these features is to form a decision tree. For example, I could predict sentiment with the tree below. This predicts sentiment correctly about half the time. To do a better job than this, we can sample the reviews (and the features) and use each sample to make a different decision tree. We train a decision tree by splitting the space of features up in the best way possible (so that the bad and good reviews are separated, as unmixed as possible). We do this by looking at all of the values of the features as candidate places to put a split, evaluating how good each candidate split is, and then selecting the best one. For example, for a particular sample, it could end up that the best first split is whether the review contains the word “worst.” Then we proceed recursively, looking at the two buckets of reviews that we have and deciding how to best split those, and so on. This trains a single tree (for instance, like the tree pictured).
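To make “as unmixed as possible” concrete, here is a small sketch of choosing the first split. I am assuming Gini impurity as the measure of mixing (it is the default criterion in sklearn's trees, but other measures exist), and I simplify each feature to “does the review contain this word.” The function names and the four toy reviews are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when a bucket is pure, larger when it is mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_word_split(reviews, labels):
    """Try every word as a yes/no split and return the least mixed one."""
    vocabulary = set().union(*reviews)
    best_word, best_score = None, float("inf")
    n = len(reviews)
    for word in sorted(vocabulary):
        has = [lab for words, lab in zip(reviews, labels) if word in words]
        lacks = [lab for words, lab in zip(reviews, labels) if word not in words]
        # Weighted average impurity of the two buckets this split creates.
        score = (len(has) * gini(has) + len(lacks) * gini(lacks)) / n
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Toy reviews as sets of words, invented for illustration.
reviews = [
    {"worst", "class"},
    {"worst", "boring"},
    {"great", "class"},
    {"helpful", "clear"},
]
labels = ["bad", "bad", "good", "good"]

word, score = best_word_split(reviews, labels)
print(word, score)
```

A full tree would repeat the same search inside each of the two resulting buckets until the buckets are pure or too small to split.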

But one tree isn’t good enough, so we select another sample of reviews and a sample of features and do this recursive “best” splitting again and again, with each sample making a new tree. We end up with a whole “forest” of trees, and we use this forest by running a new review through each tree, determining whether each tree says the review is bad or good, and then going with the sentiment of the majority of trees to predict the sentiment of this review.
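The voting step can be sketched in a few lines. The three “trees” here are hand-written one-split stand-ins, not trees learned from data, and all of the names are hypothetical. I use an odd number of trees so the vote cannot tie.

```python
from collections import Counter


# Stand-ins for trained trees: each takes the set of words in a review
# and returns "good" or "bad".
def tree_worst(words):
    return "bad" if "worst" in words else "good"


def tree_great(words):
    return "good" if "great" in words else "bad"


def tree_boring(words):
    return "bad" if "boring" in words else "good"


def forest_predict(trees, words):
    """Run the review through every tree and return the majority label."""
    votes = Counter(tree(words) for tree in trees)
    return votes.most_common(1)[0][0]


forest = [tree_worst, tree_great, tree_boring]
print(forest_predict(forest, {"great", "lectures"}))
print(forest_predict(forest, {"worst", "boring"}))
```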

I implemented this with the RandomForestClassifier from sklearn and it is pretty accurate, around 94% on the data I set aside for testing the model. You can find the code in my GitHub (look at sentimentFromText.ipynb).
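For a sense of what the pieces look like together, here is a minimal sketch using sklearn. It is not the notebook's code: the eight reviews are made up, the parameter choices (`n_estimators=100`, `test_size=0.25`, the built-in English stopword list) are assumptions, and on a toy set this small the accuracy number means nothing.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up reviews for illustration only; 1 = good, 0 = bad.
texts = [
    "great professor, very helpful and clear",
    "amazing lectures, learned so much",
    "helpful feedback and fair exams",
    "clear explanations, great class",
    "worst class ever, boring lectures",
    "unfair exams and rude professor",
    "boring, confusing, avoid this class",
    "worst professor, never helpful",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Set aside part of the data for testing the model.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Bag of words with stopwords removed; fit the vocabulary on training data only.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
accuracy = forest.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```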

# Professor Reviews and Learning Python

Last fall, I decided to learn Python, with a desire to analyze text and implement some machine learning. So, I decided to start by learning BeautifulSoup and using the tools there to scrape a professor rating site. The project went well, and I was able to write some code (that you can find on my GitHub). I got the hang of scraping and wrote code to collect numeric and text information from reviews of professors by school or by state.

Next, I began to analyze those reviews. I started the project intending to look at gender differences in ratings, following other reports of differences, such as Sidanius & Crane, 1989, and Anderson & Miller, 1997. So, I had to have the gender of the professors, something that was not available in the dataset that I had scraped. I decided to use pronouns to assess gender, and in case there were no pronouns in the text or the pronoun use was unclear, I assigned gender based on name.
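A rough sketch of that pronoun-then-name rule is below. It is an illustration, not my scraping code: the pronoun sets, the tiny `NAME_GENDER` lookup, and the decision to treat a tie in pronoun counts as “unclear” are all assumptions made for the example.

```python
import re

FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}

# Hypothetical name lookup; a real one would need a much larger list.
NAME_GENDER = {"mary": "female", "john": "male"}


def assign_gender(review_texts, first_name):
    """Guess gender from pronouns in the reviews, falling back to the name."""
    female = male = 0
    for text in review_texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in FEMALE_PRONOUNS:
                female += 1
            elif token in MALE_PRONOUNS:
                male += 1
    if female > male:
        return "female"
    if male > female:
        return "male"
    # No pronouns, or a tie (treated here as unclear): fall back to the name.
    return NAME_GENDER.get(first_name.lower())


print(assign_gender(["She explains well.", "Her exams are fair."], "Pat"))
print(assign_gender(["Great class."], "John"))
```

Professors for whom this returns `None` correspond to the reviews with no gender assigned.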

I compiled reviews for schools in Rhode Island, New Hampshire, and Maine, discarded reviews with no gender assigned, and began by looking at differences in the numeric ratings, which include an overall score and a difficulty score. Male and female professors have nearly identical mean scores: women’s mean overall score is 3.71 and men’s is 3.75, and women’s mean difficulty is 2.91 and men’s is 2.90. The overall and difficulty means for each professor are correlated, as you can see here:

Interestingly, women seem to have fewer reviews than men. On average, female professors have about 9 reviews and male professors have about 12. This difference seems to be stable when looking at each individual year, and could be due to male professors teaching larger classes (but I have no data on that). The end result is that there are far fewer reviews of female professors. In my dataset, there are 107,930 reviews of male professors and just 64,799 reviews of female professors.

The dataset also has a self-report of grades from most reviewers. The data show that overall scores go down when reviewers report a bad grade, but women seem to be hit harder by this than men.

Note also that far more reviewers report receiving high grades. In fact, over 160,000 of the 173,000 reviews in my dataset report getting A’s.

The overall scores show a bimodal distribution, as you can see in the histogram of overall scores (reviewers can report scores from 0 to 5 with half-points possible). The next thing I decided to do was to categorize these reviews as positive or negative, getting rid of reviews in the middle, and then to do some analysis of the text in those reviews. I’ll report on that next.

# Design for Learning Stats

From my course blog for Math | Art | Design.

A student at Brown, Daniel Kunin, has created a terrific visual resource for explaining statistics. It is called Seeing Theory, and it is hosted at Brown. Under the hood, it features Mike Bostock’s JavaScript library for creating visualizations, D3. For anyone interested in visualizing quantitative information, it’s a delight!
