Guest post: Understanding users through Twitter data and machine learning

Tuesday, 24 February 2015

Twitter is a rich source of a user’s interests: the public bio, observations, people followed, Retweets and favorites. What if we could process all this information in real time to build awesome web apps that personalize content based on the Twitter profile?

MonkeyLearn (@monkeylearn) is a technology platform that enables this type of deep app/site customization. Here we will show you how to process a Twitter user’s public information to empower customization, as well as other kinds of intelligent applications.

As a prerequisite, you need Twitter API credentials from a registered Twitter app, as well as a MonkeyLearn account and its API token.

The tutorial covers three steps:


  1. Gather data about a Twitter user, including:
    • Profile information
    • Tweets 
    • Favorites
  2. Analyze the text to filter on language and assign topic categories
  3. Create visualizations
    • A pie chart of the most common topic categories
    • A word cloud of the most important keywords in a category 

You can get the full source code here.

Gather user data

First, we create a tweepy API object with our Twitter API key credentials:

# tweepy is used to call the Twitter API from Python
import tweepy
import re

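# Authenticate to the Twitter API. The four credential constants are
# placeholders for the values of your registered Twitter app.
auth = tweepy.OAuthHandler(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN_KEY, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)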

Once we have a Twitter client, we retrieve the user's Tweets and favorites, keep only the text-heavy Tweets and calculate a quality score for each one:

def get_tweets(api, twitter_user, tweet_type='timeline', max_tweets=200, min_words=5):
    tweets = []
    full_tweets = []
    step = 200  # Maximum value is 200.
    for start in xrange(0, max_tweets, step):
        end = start + step
        # Maximum of `step` tweets, or the remaining to reach max_tweets.
        count = min(step, max_tweets - start)

        kwargs = {'count': count}
        if full_tweets:
            last_id = full_tweets[-1].id
            kwargs['max_id'] = last_id - 1

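        if tweet_type == 'timeline':
            current = api.user_timeline(twitter_user, **kwargs)
        else:
            current = api.favorites(twitter_user, **kwargs)

        # Stop paging when the API returns no more tweets.
        if not current:
            break
        full_tweets.extend(current)
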
    for tweet in full_tweets:
        text = re.sub(r'(https?://\S+)', '', tweet.text)

        # Calculate a "score" of tweet relevance/information quality.
        score = tweet.favorite_count + tweet.retweet_count
        if tweet.in_reply_to_status_id_str:
            score -= 15

        # Only keep tweets with at least min_words words.
        if len(re.split(r'[^0-9A-Za-z]+', text)) > min_words:
            tweets.append((text, score))
    return tweets
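
To see the filtering and scoring rules in isolation, here is a minimal sketch that applies them to plain values instead of tweepy status objects. The helper name clean_and_score and its parameters are hypothetical and not part of the original source; the logic is meant to mirror the loop above.

```python
import re

def clean_and_score(text, favorite_count, retweet_count, is_reply, min_words=5):
    """Return (clean_text, score), or None when the tweet is too short.

    Hypothetical helper for illustration; mirrors the loop in get_tweets().
    """
    # Remove links, which carry no topical information.
    clean = re.sub(r'(https?://\S+)', '', text)

    # Popular tweets score higher; replies are penalized by 15 points.
    score = favorite_count + retweet_count
    if is_reply:
        score -= 15

    # Same token test as get_tweets(): split on non-alphanumeric runs.
    if len(re.split(r'[^0-9A-Za-z]+', clean)) > min_words:
        return clean, score
    return None

# Sample values written by hand, not real Twitter data.
kept = clean_and_score(
    'Learning how to classify text with machine learning http://example.com',
    favorite_count=3, retweet_count=2, is_reply=False)
dropped = clean_and_score('Thanks a lot!', 0, 0, is_reply=True)
```

The first call returns the cleaned text with its score, while the short reply is discarded and yields None.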

In the provided source code, you’ll also see us go one step further and include the descriptions of the user’s friends in our content corpus.

Filter on language

The next step is to filter the Tweets and content to English. We can do this easily using MonkeyLearn’s API, classifying text in batch mode:

import requests
import json

# This is a handy function to classify a list of texts in batch mode (much faster)
def classify_batch(text_list, classifier_id):
    """
    Batch classify texts
    text_list -- list of texts to be classified
    classifier_id -- id of the MonkeyLearn classifier to be applied to the texts
    """
    results = []
    step = 250
    for start in xrange(0, len(text_list), step):
        end = start + step
        data = {'text_list': text_list[start:end]}
        response = requests.post(
            MONKEYLEARN_CLASSIFIER_BASE_URL + classifier_id + '/classify_batch_text/',
            data=json.dumps(data),
            headers={
                'Authorization': 'Token {}'.format(MONKEYLEARN_TOKEN),
                'Content-Type': 'application/json'
            })
        try:
            # Assumes the classifications arrive under the 'result' key.
            results.extend(response.json()['result'])
        except Exception:
            # Show the raw response to help debugging, then re-raise.
            print response.text
            raise
    return results

If you need additional language support, MonkeyLearn has a number of language classifiers, including Spanish, French and many others. See the filter_language() method in our source code to learn how to swap in your desired language.
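
As a rough sketch of how such a language filter could work, the snippet below keeps only the texts whose top prediction matches a chosen language. The helper name keep_language is hypothetical, and the shape of the predictions (one list of dicts per text, with the best guess first under a 'label' key) is an assumption; the sample predictions are written by hand rather than returned by the API.

```python
def keep_language(texts, classifications, language='English'):
    """Keep the texts whose top predicted language matches `language`.

    Hypothetical helper. Assumes `classifications` is aligned with `texts`
    and each entry is a list of dicts whose first item holds the 'label'.
    """
    return [
        text
        for text, prediction in zip(texts, classifications)
        if prediction and prediction[0]['label'] == language
    ]

# Hand-written stand-ins for what classify_batch() might return.
sample_texts = ['Machine learning is fun', 'Me gusta programar']
sample_predictions = [
    [{'label': 'English', 'probability': 0.98}],
    [{'label': 'Spanish', 'probability': 0.95}],
]
english_only = keep_language(sample_texts, sample_predictions)
spanish_only = keep_language(sample_texts, sample_predictions, language='Spanish')
```

Passing the language as a parameter keeps the filter reusable: switching to another language only changes the label being compared.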

Detect categories

Now that we have a list of Tweets and descriptions in English, we can use a MonkeyLearn topic classifier to categorize the text and create a histogram of the most popular categories for the user:

from collections import Counter

def category_histogram(texts, short_texts):
    # Classify the bios and tweets with MonkeyLearn's topic classifier.
    topics = classify_batch(texts, MONKEYLEARN_TOPIC_CLASSIFIER_ID)
    # The histogram will keep the counters of how many texts fall in
    # a given category.
    histogram = Counter()
    samples = {}

    for classification, text, short_text in zip(topics, texts, short_texts):
        # Join the parent and child category names in one string.
        category = classification[0]['label'] + '/' + classification[1]['label']
        probability = (classification[0]['probability'] *
                       classification[1]['probability'])
        MIN_PROB = 0.3
        # Discard texts with a predicted topic with probability lower than a threshold
        if probability < MIN_PROB:
            continue
        # Increment the category counter.
        histogram[category] += 1
        # Store the texts by category
        samples.setdefault(category, []).append((short_text, text))
    return histogram, samples

# Classify the expanded tweets using MonkeyLearn, return the histogram
tweets_histogram, tweets_categorized = category_histogram(expanded_tweets, tweets_english)

# Classify the expanded bios of the followed users using MonkeyLearn, return the histogram
descriptions_histogram, descriptions_categorized = category_histogram(expanded_descriptions, descriptions_english)
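
To make the probability threshold concrete, here is a small worked example on hand-written classifier output. The labels and probabilities are invented for illustration, and the parent/child pair per text is the same shape the function above assumes.

```python
from collections import Counter

MIN_PROB = 0.3  # same threshold as category_histogram()

# Hand-written stand-ins for classifier output: [parent, child] per text.
sample_topics = [
    [{'label': 'Computers & Internet', 'probability': 0.8},
     {'label': 'Programming', 'probability': 0.5}],
    [{'label': 'Business & Finance', 'probability': 0.5},
     {'label': 'Small businesses', 'probability': 0.5}],
]

histogram = Counter()
for parent, child in sample_topics:
    # Joint confidence of the parent/child path.
    probability = parent['probability'] * child['probability']
    if probability < MIN_PROB:
        continue  # e.g. 0.5 * 0.5 = 0.25 falls below the threshold
    histogram[parent['label'] + '/' + child['label']] += 1
```

The first text is counted because 0.8 * 0.5 = 0.4 clears the threshold; the second is discarded. Multiplying the two probabilities means a text is only counted when the classifier is reasonably sure about both levels of the category.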

Display the most popular categories

The above histogram counts how much Tweet activity a user has in each category. Using matplotlib, we create a pie chart that shows the distribution:
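
A minimal sketch of that plotting step might look as follows. It is not the code from the original source: the helper names pie_slices and plot_pie, the max_slices parameter, the 'Other' bucket and the output filename are all assumptions, and the sample histogram is invented.

```python
from collections import Counter

def pie_slices(histogram, max_slices=5):
    """Return (labels, sizes) for the largest categories plus 'Other'.

    Hypothetical helper; `max_slices` and the 'Other' bucket are assumptions.
    """
    top = histogram.most_common(max_slices)
    labels = [category for category, _ in top]
    sizes = [count for _, count in top]
    # Group everything outside the top categories into one slice.
    rest = sum(histogram.values()) - sum(sizes)
    if rest > 0:
        labels.append('Other')
        sizes.append(rest)
    return labels, sizes

def plot_pie(histogram, filename='categories.png'):
    # Imported here so pie_slices() works without matplotlib installed.
    import matplotlib
    matplotlib.use('Agg')  # render to a file, no display needed
    import matplotlib.pyplot as plt

    labels, sizes = pie_slices(histogram)
    plt.figure(figsize=(6, 6))
    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    plt.axis('equal')  # draw the pie as a circle
    plt.savefig(filename)
    plt.close()

# Invented counts, standing in for tweets_histogram.
sample_histogram = Counter({
    'Computers & Internet/Programming': 12,
    'Business & Finance/Small businesses': 5,
    'Entertainment/Music': 2,
})
labels, sizes = pie_slices(sample_histogram, max_slices=2)
```

Calling plot_pie(tweets_histogram) would then write the chart to an image file. Limiting the number of slices keeps the chart readable when a user has many small categories.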

[Image: pie chart of the user’s most common topic categories]

The previous pie chart represents my own interests, and it is a pretty accurate breakdown given my Twitter activity. I’m a software engineer and geek, so I’m very interested in Computers & Internet/Programming. I’m also an entrepreneur, so I’m interested in Business & Finance/Small businesses.

Extract keywords from a given category

The pie chart offers a high-level summary of a user’s interests. We can dig deeper and find the specific interests within each category. To do that, we’ll use MonkeyLearn’s keyword extractor to highlight the most important terms in each category.

First, for each category, we’ll join all the content:

joined_texts = {}
for category in tweets_categorized:
    if category not in top_categories:
        continue
    expanded = 0
    joined_texts[category] = u' '.join(map(lambda t: t[expanded], tweets_categorized[category]))

We then use MonkeyLearn to extract keywords for each category, keeping only the top 20 by relevance:

keywords = dict(zip(joined_texts.keys(), extract_keywords(joined_texts.values(), 20)))
for cat, kw in keywords.iteritems():
    top_relevant = map(
        lambda x: x.get('keyword'),
        sorted(kw, key=lambda x: float(x.get('relevance')), reverse=True)
    )
    print u"{}: {}".format(cat, u", ".join(top_relevant))
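
The ranking step can be tried without calling the API. The sketch below assumes each extracted keyword is a dict with 'keyword' and 'relevance' entries, with the relevance possibly encoded as a string; the helper name top_keywords and the sample values are hypothetical.

```python
def top_keywords(keyword_dicts, limit=20):
    """Return keyword strings ordered from most to least relevant.

    Hypothetical helper; assumes each dict has 'keyword' and 'relevance'
    entries, with the relevance possibly encoded as a string.
    """
    ranked = sorted(keyword_dicts,
                    key=lambda kw: float(kw.get('relevance', 0)),
                    reverse=True)
    return [kw.get('keyword') for kw in ranked[:limit]]

# Invented keyword data, standing in for one category's extractor output.
sample = [
    {'keyword': 'python', 'relevance': '0.71'},
    {'keyword': 'machine learning', 'relevance': '0.95'},
    {'keyword': 'startup', 'relevance': '0.40'},
]
best_two = top_keywords(sample, limit=2)
```

Converting the relevance to a float before sorting matters: compared as strings, values such as '0.9' and '0.10' would not sort numerically.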

The following word clouds show the keywords that represent the Computers & Internet and the Business & Finance categories, respectively:

[Images: word clouds of the top keywords for the Computers & Internet and the Business & Finance categories]

As another data point, you can see the pie chart and word cloud for Katy Perry, in which we identify events, Special Occasions and Entertainment as key categories, as you would expect given her career and busy event schedule.


Using the Twitter API and MonkeyLearn, it’s simple to classify and extract relevant information from Tweets and user descriptions. Together they offer useful insights into an individual’s Twitter usage, which can be used for a variety of applications:

  • For news or content sites: allow users to log in via Twitter so you can quickly understand their interests and tailor your content accordingly.
  • For e-commerce sites: recommend products based on a user’s previous Tweets, favorites and follow graph.

We encourage you to use the Twitter API and sign up for MonkeyLearn to discover new applications with the programming language you love.


A huge thanks to Agustin Azzinari and Rodrigo Stecanella for their contributions to the source code and Federico Pascual and Martin Alcala Rubi for their writing and editing.