After having scraped and analyzed some professor review data, I wanted to know if I could predict sentiment from the text. Reading the reviews as a human, it certainly seems like you can tell a good review from a bad review without looking at the overall score, but could I do this through a machine learning algorithm? First, I used overall score to distinguish good reviews from bad (see the spread here). Since there are so many “5” scores, those became the good reviews. Then I counted “1” and “2” scores as bad reviews and threw out the rest of the scores because they were ambiguous.
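The labeling rule above can be sketched in a few lines. This is only an illustration: the `label_review` helper and the sample reviews are made up here, and it assumes the overall scores are whole numbers from 1 to 5.

```python
def label_review(score):
    """Map an overall score to a sentiment label, or None if ambiguous."""
    if score == 5:
        return "good"
    if score in (1, 2):
        return "bad"
    return None  # scores of 3 and 4 are treated as ambiguous


# Hypothetical (score, text) pairs, invented for illustration only.
reviews = [
    (5, "Best professor I have ever had"),
    (3, "Lectures were okay"),
    (1, "Worst class ever"),
    (4, "Pretty good overall"),
    (2, "Boring and unfair exams"),
]

# Keep only the reviews that received an unambiguous label.
labeled = [
    (text, label_review(score))
    for score, text in reviews
    if label_review(score) is not None
]
```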
To use the text to predict the sentiment, I decided to use a “bag of words,” in which I disregard grammar and word order and just count how often each word appears in a review. I also threw out “stopwords,” which are common words like “these” and “am.” This discards a lot of information but makes the analysis much more computationally tractable. Each remaining word then becomes a feature that can help us predict the sentiment.
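As a rough sketch of what a bag of words looks like, here is a version using only the Python standard library. The `bag_of_words` function, the tiny `STOPWORDS` set, and the tokenizing regular expression are all assumptions made for this example, not the code from the notebook.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"these", "am", "the", "a", "i", "is", "was", "and", "of", "in"}


def bag_of_words(text):
    """Count word occurrences, ignoring order, case, and stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)


counts = bag_of_words("The worst class, and the worst exams I was ever in")
```

Here `counts` maps each remaining word to how often it appeared, so “worst” gets a count of 2 and the stopwords are gone entirely.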
One way to predict the sentiment from these features is to form a decision tree. For example, I could predict sentiment with the tree below. This tree predicts sentiment correctly about half the time. To do a better job, we can sample the reviews (and the features) and use each sample to make a different decision tree. We train each tree by splitting the space of features in the best way possible, so that the bad and good reviews are separated and each side is as unmixed as possible. To find a split, we look at all of the values of the features, try placing the split at each one, and evaluate how good each candidate split is; then we select the best one. For example, for a particular sample, it could end up that the best first split is whether the review contains the word “worst.” Then we proceed recursively, looking at the two buckets of reviews that we now have and deciding how best to split each of those, and so on. This trains a single tree (for instance, like the tree pictured).
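To make “as unmixed as possible” concrete, here is a small sketch of choosing the first split. It assumes Gini impurity as the measure of how mixed a bucket is (the post does not say which measure was used), it splits only on whether a word is present, and the helper names and toy reviews are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a bucket: 0.0 when every label is the same."""
    if not labels:
        return 0.0
    p_good = labels.count("good") / len(labels)
    return 2 * p_good * (1 - p_good)


def split_score(reviews, word):
    """Size-weighted impurity of the two buckets made by splitting on `word`."""
    with_word = [label for words, label in reviews if word in words]
    without_word = [label for words, label in reviews if word not in words]
    total = len(reviews)
    return (
        len(with_word) * gini(with_word)
        + len(without_word) * gini(without_word)
    ) / total


def best_split(reviews, candidate_words):
    """Pick the word whose presence/absence split leaves the purest buckets."""
    return min(candidate_words, key=lambda word: split_score(reviews, word))


# Hypothetical reviews, each reduced to its set of words plus a label.
reviews = [
    ({"worst", "class"}, "bad"),
    ({"worst", "boring"}, "bad"),
    ({"great", "class"}, "good"),
    ({"great", "helpful"}, "good"),
    ({"boring", "helpful"}, "good"),
]
vocabulary = sorted({word for words, _ in reviews for word in words})
first_split = best_split(reviews, vocabulary)
```

On this toy data, splitting on “worst” separates the bad reviews from the good ones completely, so it is chosen as the first split. Building a full tree would repeat `best_split` on each of the two resulting buckets.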
But one tree isn’t good enough, so we select another sample of reviews and another sample of features and do this recursive “best” splitting again and again, with each sample making a new tree. We end up with a whole “forest” of trees. To use the forest, we run a new review through each tree, determine whether each tree says the review is bad or good, and then go with the sentiment of the majority of trees as the prediction for that review.
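The sampling and voting steps can be sketched as follows. Everything here is illustrative: the trees are stand-in one-split “stumps” rather than trained trees, and drawing reviews with replacement is an assumption about how the samples are taken.

```python
import random
from collections import Counter


def draw_sample(reviews, vocabulary, n_features, rng):
    """Draw one sample of reviews (assumed: with replacement) and of features."""
    sampled_reviews = [rng.choice(reviews) for _ in reviews]
    sampled_features = rng.sample(vocabulary, n_features)
    return sampled_reviews, sampled_features


def make_stump(word, label_if_present, label_if_absent):
    """A hypothetical one-split tree that only checks for a single word."""
    def stump(review_words):
        return label_if_present if word in review_words else label_if_absent
    return stump


def forest_predict(trees, review_words):
    """Run the review through every tree and return the majority label."""
    votes = Counter(tree(review_words) for tree in trees)
    return votes.most_common(1)[0][0]


# Three stand-in trees; a real forest would train one tree per sample.
trees = [
    make_stump("worst", "bad", "good"),
    make_stump("boring", "bad", "good"),
    make_stump("great", "good", "bad"),
]
prediction = forest_predict(trees, {"boring", "worst", "great"})

rng = random.Random(0)
sample_reviews, sample_features = draw_sample(
    ["review a", "review b", "review c"], ["worst", "boring", "great"], 2, rng
)
```

In this example two of the three stand-in trees vote “bad” and one votes “good,” so the forest predicts “bad.”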
I implemented this with the RandomForestClassifier from sklearn and it is pretty accurate, around 94% on the data I set aside for testing the model. You can find the code in my GitHub (look at sentimentFromText.ipynb).
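For readers who want a starting point, here is a minimal sketch of how such a pipeline might look in sklearn. It is not the code from the notebook: the toy reviews, the use of `CountVectorizer` with its built-in English stopword list, the 75/25 split, and the parameter values are all assumptions, and a dataset this small says nothing about real accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical reviews, invented for illustration; 1 = good, 0 = bad.
texts = [
    "best professor ever, really helpful",
    "great lectures and fair exams",
    "amazing class, learned so much",
    "helpful, clear, and kind",
    "worst class I have taken",
    "boring lectures and unfair grading",
    "terrible, avoid this professor",
    "confusing and rude, the worst",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Set aside a quarter of the reviews for testing the model.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Bag of words: learn the vocabulary on the training reviews only.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train the forest and score it on the held-out reviews.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, train_labels)
predictions = forest.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
```

Fitting the vectorizer on the training reviews alone keeps words that only appear in the test reviews from leaking into the model.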