[Image: troll on a computer]

Finding Trolls on Twitter

A couple of months ago, I left my job as a math professor and joined Insight Data Science in order to start a new career in data science. Why on earth would I do that, you ask? I enjoyed my job as a professor, and I was even (arguably) good at it. I got to do fun things with students. The truth is simply that I wanted to work on real problems with good people, and Insight has been so much fun that I am sad it is now coming to a close (but I am excited to see where I will land next).

Our first task at Insight was to select a project. There were some consulting projects available, but I ended up deciding to strike out on my own, trying to better understand trolls on Twitter. I've been a Twitter user for several years, and I am well aware of the problems that trolls cause for users of Twitter and other social media.

I decided I wanted to take a machine learning approach to finding trolls. My first stab at the problem was to use shared block lists. Twitter allows users to block other users, and in the past couple of years, Twitter has started allowing users to share these block lists. One service that helps with this sharing is @blocktogether. So I collected multiple blocklists (as a result, I currently block 250K users), and my hope was that I could use these to curate a list of trolls. I was able to identify a number of interesting things about blocked users vs non-blocked users. For instance, blocked users tweeted less per day on average but had more followers, so apparently we are not following the rule of "don't feed the trolls."

A few things got in the way. I found that @blocktogether capped the number of users it would allow me to download from each list. But even worse, I began to suspect that even without the download problems, I would still struggle to find a sufficient number of users who appeared on a large number of blocklists (Twitter is a very big place, after all). I would also have no way of knowing whether a user appeared on multiple blocklists simply because the blocklists were being shared, rather than because users independently decided to block them.

So, I decided I had to come up with my own definition of trolling and to use a rules-based approach to identify trolls. My first stab at that was to say that trolling was repeatedly mentioning a user in tweets with negative sentiment. I was able to identify these trolling tweets, and I made a machine learning model to predict which users were trolls. The model did fine, but as I looked through the users identified as trolls, I realized that I was really doing a good job of finding arguments on Twitter, but a poorer job of finding users I would truly consider trolls.

As I thought about the problem, I landed on an older example of trolling that helped me change my perspective. This was the case of Robin Williams's daughter, who posted a tribute to her father and was set upon by a pair of trolls telling her things like "I hope you choke and have to scream out for help but no one helps you and they just watch you choke." I looked back on other examples I had of real trolling and realized that most of them involved saying "you": these were personal attacks that needed the second-person pronoun. At that point, I changed my criteria for trolling to be that there were at least two mentions of the same person, that the tweets had negative sentiment, and that they used the word "you" (or similar words like "your").


With these criteria, I was able to find a number of users that I consider trolls. These were users on my blocklist who also engaged in trolling behavior in their last 200 tweets. That is, they mentioned a specific user at least twice, used "you" language in those tweets, and those tweets had negative sentiment (I used the VADER Python package to get the sentiment). Out of about 10K users on blocklists, I found that 44% had trolled by this definition. I also needed "human" tweets, so I grabbed a random collection of people who were on positive lists (so that they were about as engaged as the trolls) using the Twitter API. It turns out that about 7% of those had engaged in trolling behavior, so I threw those out.
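
To make the criteria concrete, here is a minimal sketch of that kind of check using the vaderSentiment package. The tweet format, the negative-sentiment threshold, and the "you"-word list are simplified assumptions, not my exact code:

```python
# Simplified sketch of the trolling check described above.
# Assumes each tweet is a dict with "text" and "mentions" keys, and that
# a VADER compound score below -0.05 counts as negative sentiment.
import re
from collections import Counter

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

YOU_WORDS = re.compile(r"\b(you|your|yours|you're)\b", re.IGNORECASE)
analyzer = SentimentIntensityAnalyzer()

def is_troll(tweets, min_attacks=2, neg_threshold=-0.05):
    """Flag a user who directs at least `min_attacks` negative,
    'you'-language tweets at the same mentioned account."""
    attacks = Counter()
    for tweet in tweets:
        text = tweet["text"]
        if not YOU_WORDS.search(text):
            continue
        if analyzer.polarity_scores(text)["compound"] >= neg_threshold:
            continue
        for mention in tweet["mentions"]:
            attacks[mention.lower()] += 1
    return any(count >= min_attacks for count in attacks.values())
```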

From these categorized users, I collected tweets without mentions (since I had used the mentions to determine whether they were trolls or not). I split those tweets into a training and a test set so that I could see whether I could predict the trolls from their no-mention tweets. I turned their tweets into a bag of words and bigrams, vectorized them using TF-IDF, and fed the vectors into a logistic regression model. My model was a pretty good predictor, and it gave me the collection of words that were most predictive, which was illuminating. Basically, trolls are talking about Gamergate (still!) and politics.
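
The modeling step is standard scikit-learn fare. Here is a rough sketch of what I mean; the preprocessing details and hyperparameters shown are placeholder assumptions, not my exact settings:

```python
# Rough sketch of the troll classifier: unigrams and bigrams, TF-IDF
# weighting, and a logistic regression on top. The contents of `texts`
# (one string of no-mention tweets per user) and `labels` (1 = troll,
# 0 = non-troll) are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# The most troll-predictive n-grams are the largest positive coefficients.
vectorizer = model.named_steps["tfidfvectorizer"]
clf = model.named_steps["logisticregression"]
top_ngrams = sorted(
    zip(clf.coef_[0], vectorizer.get_feature_names_out()), reverse=True)[:20]
print(top_ngrams)
```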

Now, here are my words of caution. I need to update this model regularly, because the topics trolls talk about are going to change, so my model needs to change to catch today's trolls. I also still need to refine the model a bit. It looks for negative sentiment, but saying "I'm really angry at you" isn't trolling, it's expressing a negative feeling. I need to search for hate speech, and I have not yet implemented this in the model. I would also love to run some tests to see how little text I need to identify a troll. What if I just grabbed a couple of tweets plus the user's description? Is that enough? And finally, if I wanted to really deploy this model to catch trolls, I would need to be sure that it errs on the side of avoiding false positives, so that ordinary users are not mislabeled as trolls and the free speech of Twitter users is protected. I have not explored different thresholds for this model (although note that it has a nice ROC curve).
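
For what it's worth, exploring thresholds would look something like the sketch below, continuing from the model above; the 1% false-positive cap is an arbitrary placeholder, not a tuned value:

```python
# Sketch of choosing a decision threshold that keeps the false-positive
# rate low, so non-trolls are rarely mislabeled. Continues from the
# `model`, `X_test`, and `y_test` in the previous sketch.
import numpy as np
from sklearn.metrics import roc_curve

probs = model.predict_proba(X_test)[:, 1]    # probability of "troll"
fpr, tpr, thresholds = roc_curve(y_test, probs)

acceptable = fpr <= 0.01                      # at most 1% false positives
best = np.argmax(tpr[acceptable])             # catch the most trolls within that cap
print("threshold:", thresholds[acceptable][best],
      "true-positive rate:", tpr[acceptable][best])
```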

If you are interested in using the app I created, find it at http://trolltrackr.com. You can also follow me on Twitter (I'm not a troll, I promise).

Predicting Sentiment from Text

After having scraped and analyzed some professor review data, I wanted to know if I could predict sentiment from the text. Reading the reviews as a human, it certainly seems like you can tell a good review from a bad review without looking at the overall score, but could I do this through a machine learning algorithm? First, I used the overall score to distinguish good reviews from bad (see the spread here). Since there are so many "5" scores, those became the good reviews. Then I counted "1" and "2" scores as bad reviews and threw out the rest of the scores because they were ambiguous.

To use the text to predict the sentiment, I decided to use a "bag of words," in which I would disregard grammar and word order and just count how often a word appeared in a review. I also threw out "stopwords," which are common words like "these" and "am." This loses a lot of information, but it makes the analysis much more computationally tractable. Each of these words then becomes a feature that can help us predict the sentiment.
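
As a rough illustration, the labeling and bag-of-words step might look like the following sketch; the file name and column names are assumptions about my scraped data, not its literal schema:

```python
# Sketch of turning scraped reviews into labels and word-count features.
# The file name and the "overall"/"text" column names are illustrative
# assumptions, not the actual schema of my scraped data.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

reviews = pd.read_csv("reviews.csv")   # hypothetical file of scraped reviews

# Keep unambiguous reviews: 5 = good, 1 or 2 = bad, drop the rest.
good = reviews[reviews["overall"] == 5].assign(label=1)
bad = reviews[reviews["overall"].isin([1, 2])].assign(label=0)
labeled = pd.concat([good, bad])

# Bag of words: ignore grammar and order, drop common English stopwords.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(labeled["text"])
y = labeled["label"].values
```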

One way to predict the sentiment from these features is to form a decision tree. For example, I could predict sentiment with the sample decision tree below, which predicts sentiment correctly about half the time. To do a better job than this, we can sample the reviews (and the features) and use each sample to make a different decision tree. We train a decision tree by splitting the space of features up in the best way possible (so that the bad and good reviews are separated, as unmixed as possible). We do this splitting by looking at the values of the features and deciding where to place a split, then we evaluate how good that split is and move on to the next candidate. Eventually we choose amongst the candidate splits, selecting the best one. For example, for a particular sample, it could end up that the best first split is whether the review contains the word "worst." Then we proceed iteratively, looking at the two buckets of reviews that we have and deciding how to best split those, and so on. This trains a single tree (like the sample tree pictured).

But one tree isn't good enough, so we select another sample of reviews and a sample of features and do this recursive "best" splitting again and again, with each sample making a new tree. We end up with a whole "forest" of trees, and we use this forest by running a new review through each tree, determining whether each tree says the review is bad or good, and then going with the sentiment of the majority of trees to predict the sentiment of that review.

I implemented this with the RandomForestClassifier from sklearn and it is pretty accurate, around 94% on the data I set aside for testing the model. You can find the code in my GitHub (look at sentimentFromText.ipynb).
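
The core of that notebook boils down to something like this sketch, continuing from the X and y built above; the forest size and split fraction are illustrative defaults, not the exact settings in sentimentFromText.ipynb:

```python
# Sketch of the random-forest step, continuing from the X and y built
# in the previous sketch. Forest size and split fraction are
# illustrative defaults.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Each tree votes on each held-out review; the forest reports the majority.
print("test accuracy:", forest.score(X_test, y_test))
```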

Professor Reviews and Learning Python

Last fall, I decided to learn Python, with a desire to analyze text and implement some machine learning. So I started by learning BeautifulSoup and using it to scrape a professor rating site. The project went well, and I was able to write some code (that you can find on my GitHub). I got the hang of scraping and wrote code to collect numeric and text information from reviews of professors by school or by state.
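
The scraping follows the usual requests-plus-BeautifulSoup pattern. The sketch below is only illustrative: the URL and the CSS classes are hypothetical stand-ins, not the real site's markup:

```python
# Illustrative scraping sketch. The URL and class names are hypothetical
# placeholders, not the actual markup of the site I scraped.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com/professor/12345")
soup = BeautifulSoup(page.text, "html.parser")

reviews = []
for block in soup.find_all("div", class_="review"):
    reviews.append({
        "overall": float(block.find("span", class_="overall").get_text()),
        "difficulty": float(block.find("span", class_="difficulty").get_text()),
        "text": block.find("p", class_="comments").get_text(strip=True),
    })
```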

Next, I began to analyze those reviews. I started the project intending to look at gender differences in ratings, following other reports of differences, such as Sidanius & Crane, 1989, and Anderson & Miller, 1997. So, I had to have the gender of the professors, something that was not available in the dataset that I had scraped. I decided to use pronouns to assess gender, and in cases where there were no pronouns in the text or the pronoun use was unclear, I assigned gender based on name.
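
A simplified version of that assignment looks something like the sketch below; the pronoun rule and the name table are placeholders for the rules I actually used:

```python
# Sketch of assigning gender from pronouns in a professor's reviews,
# with a name-based fallback. The pronoun rule and the name table are
# simplified placeholders, not my exact logic.
import re

HE = re.compile(r"\b(he|him|his)\b", re.IGNORECASE)
SHE = re.compile(r"\b(she|her|hers)\b", re.IGNORECASE)

# Hypothetical lookup built from a public name/gender list.
NAME_GENDER = {"james": "male", "mary": "female"}

def assign_gender(review_texts, first_name):
    he_count = sum(len(HE.findall(t)) for t in review_texts)
    she_count = sum(len(SHE.findall(t)) for t in review_texts)
    if he_count > she_count:
        return "male"
    if she_count > he_count:
        return "female"
    # No pronouns, or a tie: fall back to the professor's first name.
    return NAME_GENDER.get(first_name.lower(), "unknown")
```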

I compiled reviews for schools in Rhode Island, New Hampshire, and Maine, discarded reviews with no gender assigned, and began by looking at differences in numeric rankings, which include an overall score and a difficulty score. Male and female professors have nearly identical mean scores: women's mean overall score is 3.71 versus 3.75 for men, and women's mean difficulty score is 2.91 versus 2.90 for men. The overall and difficulty means for each professor are correlated, as you can see here:

[Figure: histograms, scatter plot with linear fit, and residuals for overall vs. difficulty scores]

Interestingly, women seem to have fewer reviews than men. On average, female professors have about 9 reviews and male professors have about 12. This difference seems to be stable when looking at each individual year, and could be due to male professors teaching larger classes (but I have no data on that). The end result is that there are far fewer reviews of female professors. In my dataset, there are 107,930 reviews of male professors and just 64,799 reviews of female professors.

The dataset also includes a self-reported grade from most reviewers. You can see in the data that overall scores go down when reviewers report a bad grade, but women seem to be hit harder by this than men.

[Figure: bar graph of mean overall score by grade and gender]

Note also that far more reviewers report receiving high grades. In fact, over 160,000 of the 173,000 reviews in my dataset report getting A’s.

The overall scores show a bimodal distribution, as you can see in the histogram of overall scores (reviewers can report scores from 0 to 5, with half-points possible). The next thing I decided to do was to categorize these reviews as positive or negative, getting rid of reviews in the middle, and then to do some analysis of the text in those reviews. I'll report on that next.

[Figure: histogram of overall scores, showing bimodal distributions for both men and women]


Design for Learning Stats

From my course blog for Math | Art | Design.


A student at Brown, Daniel Kunin, has created a terrific visual resource for explaining statistics. It is called Seeing Theory, and it is hosted at Brown. Under the hood, it features Mike Bostock’s JavaScript library for creating visualizations, D3. For anyone interested in visualizing quantitative information, it’s a delight!


Reading Mathematics: Click/Clunk

Years ago, I did some work to help students read mathematics textbooks. I gave a presentation on the material at an NCTM conference and wrote a piece for students that is still in use by the Harvard Bureau of Study Council. I was recently reminded of the work by an email I received about it, so I'm going to look at making use of it again, perhaps writing up something additional and getting the work out more broadly. It is based on the "click/clunk" method, which is used in some approaches to reading instruction.

Perceptron

I'm learning about various machine learning algorithms, so I want to record what I have learned and where I learned it, in part so that I can relearn it after I inevitably forget it!

The perceptron is "baby's first neural network." It can be used successfully to learn a binary classification of data that is linearly separable. The basic idea is that you have some training data that comes to you as vectors. You can start by guessing a weighting for those vectors, which is basically a guess at the hyperplane that separates your data (the weights give the normal vector for that hyperplane), or you can just initialize the weights to 0. Then you look at a random data point. First, you have to see how your current perceptron categorizes the data point, which you can do by taking the dot product of the weight vector with the data point vector and looking at its sign.

If the point is incorrectly classified, you need to adjust your weights, which you do by adding or subtracting (a scaling of) the current data point to or from your weight vector, depending on the point's true label. This gives the normal vector a bit of a bump so that you are closer to correctly classifying the current data point. Then you pick another point and do the whole thing again. You are continuously adjusting your weights, so presumably your perceptron is getting better all the time. It is also useful to note that you need some kind of activation function to distinguish between correctly and incorrectly classified data points, and it seems pretty typical to use a threshold step function.
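
Since the update rule is so short, here is a minimal sketch of the whole training loop in NumPy; the learning rate, the iteration cap, and the data format are my own placeholder choices:

```python
# Minimal perceptron sketch. X is an (n_samples, n_features) array and
# y holds labels in {-1, +1}. The learning rate and epoch cap are
# arbitrary placeholder choices.
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])          # start with all-zero weights
    b = 0.0                           # bias term
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # Threshold step: the sign of the dot product is the prediction.
            prediction = 1 if np.dot(w, xi) + b > 0 else -1
            if prediction != yi:
                # Misclassified: nudge the normal vector toward this point.
                w += learning_rate * yi * xi
                b += learning_rate * yi
                errors += 1
        if errors == 0:               # every training point is classified correctly
            break
    return w, b
```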

Does this process ever end? Yes, it will, provided that your data is linearly separable. You can even find the proof here. How long does it really take? I don't know. Presumably it's not the worst thing to do, since lots of people reference it. What if your data isn't really linearly separable? Well, it will go on forever, so you had better pick a maximum number of iterations. Will it give you something reasonable after a reasonable number of iterations if your data is linearly separable-ish? No idea, but it seems like it might.

I read several useful pieces to figure out what I do know. I found this material from a presentation in a graduate course on machine learning (there's a lot of other interesting stuff on the webpage for the 2007 course, http://aass.oru.se/~lilien/ml/). I also relied heavily on this material on the perceptron from a CMU course in computer vision. Both of these sources have useful illustrations that I decided not to replicate here, so you should go look at them. The Wikipedia page on the perceptron had some good material, and I got a little curious about the history, so I read http://web.csulb.edu/~cwallis/artificialn/History.htm.