Search Relevance Infrastructure at Twitter

Monday, 15 August 2016

Millions of people all over the world search on Twitter every day to see what’s happening. During major events such as the recent Euro 2016 final, we observe record traffic spikes as people turn to Twitter to find timely information and perspectives, and overall traffic volume has been steadily increasing over time. The Search Quality team at Twitter works on returning the best quality results for our users.

Compared to traditional information retrieval applications, the Twitter search challenge is unique, for a few reasons:

  • Real-time Intent: A large proportion of our searches have a strong intent to find topical, real-time information. The state of the world moves rapidly and in some cases, results that are even a few minutes old can feel outdated and irrelevant. Query suggestions, i.e. typeahead, spelling and related searches, also need to be fresh and real-time.
  • Corpus Size: The corpus is huge, with hundreds of millions of new Tweets created daily in many languages.
  • Document Format: The documents have unique properties: 140 characters of unstructured text, but with rich entity types including hashtags, @ mentions, images, videos, and external links. Unlike web pages, Tweets don’t have hyperlinks between themselves, so link-based algorithms like PageRank cannot be used directly for ranking.
  • Multiple Result Types: The search results page is a blend of multiple types of results including Tweets, other accounts, images, videos, news articles, related searches and spelling suggestions. The different result types need to be ranked against each other in order to compose a page that best satisfies the searcher’s intent.
  • Personalization: Each searcher has their own social graph, interests, location and language preferences, so results need to be personalized in order to be relevant.

To return relevant, high quality search results at this scale with low latency, we need to solve interesting and novel technical challenges in a variety of areas, including information retrieval, natural language processing, machine learning, distributed systems and data science.

Over the last few months, we’ve made significant investments in our search relevance infrastructure with the goal of improving ranking capabilities and experimentation efficiency. This post highlights some of this work. Note that this infrastructure is distinct from the core indexing and retrieval platform components that we query in production to retrieve unranked Tweets.

Real-Time Signal Ingester

The variety and timeliness of signals used by our ranking models have a huge impact on the ultimate quality of search results. Additionally, many of the signals change rapidly after the Tweets have been indexed, so we need to keep them up to date. We wrote a new Heron-based signal ingester to process streams of raw signals and produce features for our ranking components to use in production. We added flexible schemas for encoding and decoding new feature updates dynamically with minimal code changes and operational overhead. As the Twitter app evolves, we can quickly add and test new ranking signals that become available and appear promising in offline experiments.
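To make the idea of schema-driven feature updates concrete, here is a minimal sketch in Python. It is not the Heron-based ingester itself: the wire format, the class names (FeatureSpec, FeatureSchema) and the feature names are all hypothetical, chosen only to show how registering a new signal in a schema can be enough to encode and decode it without touching the serialization code.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical schema entry: a numeric id plus a struct format code."""
    feature_id: int
    fmt: str  # e.g. "I" for uint32, "f" for float32


class FeatureSchema:
    """Illustrative registry; assumed layout, not Twitter's actual format.

    Payload layout (little-endian, no padding):
      tweet_id: uint64, feature_count: uint16,
      then feature_count entries of (feature_id: uint16, value).
    """

    HEADER = "<QH"

    def __init__(self):
        self._by_name = {}
        self._by_id = {}

    def register(self, name, feature_id, fmt):
        # Adding a new signal only requires a new schema entry.
        if name in self._by_name or feature_id in self._by_id:
            raise ValueError(f"duplicate feature: {name!r} / {feature_id}")
        spec = FeatureSpec(feature_id, fmt)
        self._by_name[name] = spec
        self._by_id[feature_id] = (name, spec)

    def encode(self, tweet_id, features):
        parts = [struct.pack(self.HEADER, tweet_id, len(features))]
        for name, value in features.items():
            spec = self._by_name[name]
            parts.append(struct.pack("<H" + spec.fmt, spec.feature_id, value))
        return b"".join(parts)

    def decode(self, payload):
        tweet_id, count = struct.unpack_from(self.HEADER, payload, 0)
        offset = struct.calcsize(self.HEADER)
        features = {}
        for _ in range(count):
            (feature_id,) = struct.unpack_from("<H", payload, offset)
            offset += struct.calcsize("<H")
            name, spec = self._by_id[feature_id]
            (value,) = struct.unpack_from("<" + spec.fmt, payload, offset)
            offset += struct.calcsize("<" + spec.fmt)
            features[name] = value
        return tweet_id, features


schema = FeatureSchema()
schema.register("retweet_count", 1, "I")
schema.register("favorite_count", 2, "I")
payload = schema.encode(42, {"retweet_count": 7, "favorite_count": 3})
decoded = schema.decode(payload)
```

In this sketch, a new engagement signal becomes available to both the writer and the reader as soon as both sides share the updated schema, which is one plausible way to keep code changes small.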

Fast, Lightweight Experimentation

The faster and cheaper we can make the ideate->test->iterate loop, the more ideas we can test and the more we can innovate. We make heavy use of traditional A/B testing, but we’ve also built a complementary offline experimentation system to test changes more efficiently. Twitter search results and queries churn rapidly, so to separate signal from noise we built a sandbox environment that freezes the state of the world at a given point in time, letting us generate stable, reproducible results for any change we want to test. To gain better insight, we’ve added tooling to analyze and display differences between results, and to easily obtain judgment labels from in-house human raters based on our Search Quality Judgment Guidelines. One particularly nice benefit is that this allows us to validate expensive index changes, e.g. adding new index fields for retrieval or updating tokenization, and refine them before deploying to production.
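As an illustration of what "analyze and display differences between results" might involve, the following Python sketch compares a baseline ranking with an experimental ranking for one query against the same frozen snapshot. The function name diff_rankings, the choice of top-k Jaccard overlap and the output shape are assumptions made for this example, not a description of our internal tooling.

```python
def diff_rankings(baseline, experiment, k=10):
    """Summarize how the top-k results changed between two rankings.

    `baseline` and `experiment` are lists of document ids, best first.
    Hypothetical helper for illustration only.
    """
    base_top, exp_top = baseline[:k], experiment[:k]
    base_pos = {doc: i for i, doc in enumerate(base_top)}
    exp_pos = {doc: i for i, doc in enumerate(exp_top)}

    # Documents that entered or left the top k.
    added = [doc for doc in exp_top if doc not in base_pos]
    removed = [doc for doc in base_top if doc not in exp_pos]

    # Documents present in both lists whose position changed: (old, new).
    moved = {
        doc: (base_pos[doc], exp_pos[doc])
        for doc in exp_top
        if doc in base_pos and base_pos[doc] != exp_pos[doc]
    }

    # Jaccard overlap of the two top-k sets; 1.0 means identical sets.
    union = set(base_top) | set(exp_top)
    overlap = len(set(base_top) & set(exp_top)) / len(union) if union else 1.0

    return {"added": added, "removed": removed, "moved": moved, "overlap": overlap}


report = diff_rankings(["a", "b", "c"], ["b", "a", "d"], k=3)
```

Because both rankings come from the same frozen state, any difference in such a report can be attributed to the change under test rather than to new Tweets arriving. Queries with low overlap are natural candidates to send to human raters first.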

Training and Deploying Machine Learned Models

Machine learned models are commonly used for search ranking because they provide a principled, automatic way to optimize feature weights and integrate new ranking features. To make them work well, it’s important to identify objective functions that correlate well with ultimate customer satisfaction. We established a pipeline to seamlessly collect training data sets for model training and validation, and to deploy trained models to production servers. Scale brings additional challenges. For example, the first stage of search ranking happens on index shards within a very tight loop, where a large number of matching documents for a query are scored under strict CPU, memory and latency constraints. We worked with the Twitter Cortex team to create a lightweight runtime that can run models under these constraints, and we deployed ranking models trained using our internal ML platform tools, e.g. Whetlab.
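The shape of first-stage scoring can be sketched as follows. This example assumes a plain linear model and a bounded top-k selection; the post does not specify the model family or the runtime's design, so the functions score and top_k and the feature vectors are purely illustrative.

```python
import heapq


def score(weights, bias, features):
    """Linear model: bias plus the dot product of weights and features."""
    return bias + sum(w * x for w, x in zip(weights, features))


def top_k(candidates, weights, bias, k):
    """Score each (doc_id, feature_vector) pair and keep the k best.

    heapq.nlargest keeps at most k items at a time, so memory stays
    bounded even when many documents match the query.
    Returns (score, doc_id) tuples, highest score first.
    """
    scored = ((score(weights, bias, feats), doc_id) for doc_id, feats in candidates)
    return heapq.nlargest(k, scored)


# Hypothetical candidates: two features per document.
candidates = [
    (1, [1.0, 0.0]),
    (2, [0.0, 1.0]),
    (3, [1.0, 1.0]),
]
best = top_k(candidates, weights=[2.0, 1.0], bias=0.0, k=2)
```

A production runtime would avoid per-document allocation and interpreter overhead entirely; the point of the sketch is only that the per-document work is a small, fixed amount of arithmetic and that the candidate set is never fully materialized or sorted.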

These are critical building blocks that have allowed us to test and ship many relevance gains making search better for our users. In future posts, we’ll dive deeper into specific aspects of search quality and projects we’re currently working on. Stay tuned!


The Search Quality team is Tian Wang, Juan Caicedo, Zhezhe Chen, Jinliang Fan, Lisa Huang, Gianna Badiali, Yan Xia and Yatharth Saraf. We would also like to thank the Search Infrastructure, Heron and Cortex teams for invaluable assistance at various stages.
