Related Queries and Spelling Corrections in Search

Friday, 11 May 2012

As you may have noticed, searches on twitter.com, Twitter for iOS, and Twitter for Android now have spelling corrections and related queries next to the search results.

At the core of our related queries and spelling correction service is a simple mechanism: if we see query A in some context, and then see query B in the same context, we think they’re related. If A and B are similar, B may be a spell-corrected version of A; if they’re not, B may be interesting to searchers who find A interesting. We use both query sessions and tweets for context. If we observe a user typing [justin beiber] and then, within the same session, typing [justin bieber], we’ll consider the second query as a possible spelling correction of the first; if the same session also contains [selena gomez], we may consider it a related query to the previous ones. The data we process is anonymized: we don’t track which queries are issued by a given user, only that the same (unknown) user has issued several queries in a row, or continuously tweeted.
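To make the session-based mechanism concrete, here is a minimal Python sketch of counting which queries follow which within a session. The function name `count_cooccurrences`, the in-memory session lists, and the sample data are assumptions made for illustration; they are not a description of the production service.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sessions):
    """Count ordered (earlier, later) pairs of distinct queries per session."""
    pair_counts = Counter()
    for queries in sessions:
        # Drop repeated queries while keeping first-seen order.
        ordered = list(dict.fromkeys(queries))
        # combinations() preserves input order, so each pair is (earlier, later).
        for earlier, later in combinations(ordered, 2):
            pair_counts[(earlier, later)] += 1
    return pair_counts


# Hypothetical anonymized sessions: each list is one unknown user's queries.
sessions = [
    ["justin beiber", "justin bieber", "selena gomez"],
    ["justin beiber", "justin bieber"],
    ["selena gomez", "justin bieber"],
]
pair_counts = count_cooccurrences(sessions)
```

Pairs that are textually similar, such as [justin beiber] followed by [justin bieber], become spelling correction candidates, while dissimilar pairs become related query candidates.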

To measure the similarity between queries, we use a variant of Edit Distance tailored to Twitter queries; for example, in our variant we treat the beginning and end characters of a query differently from the inner characters, as spelling mistakes tend to be concentrated in those. Our variant also treats special Twitter characters (such as @ and #) differently from other characters, and has other differences from the vanilla Edit Distance. To measure the quality of the suggestions, we use a variety of signals including query frequencies (of the original query and the suggestion), statistical correlation measures such as log-likelihood, the quality of the search results for the suggestion, and others.
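As an illustration of what a position-aware edit distance can look like, the sketch below extends the standard Levenshtein dynamic program with per-character costs. The specific weights (`edge_cost`, `special_cost`), the direction of the penalties, and the function names are all assumptions chosen for the example; the actual variant and its weights are not described here.

```python
def char_cost(s, i, edge_cost=1.5, special_cost=2.0):
    """Cost of editing character s[i]; all weights are illustrative guesses."""
    if s[i] in "@#":
        return special_cost  # assumed: editing @ or # is penalized more
    if i == 0 or i == len(s) - 1:
        return edge_cost     # assumed: first and last characters cost more
    return 1.0               # inner characters use the vanilla unit cost


def weighted_edit_distance(a, b):
    """Levenshtein distance with position- and character-dependent costs."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [0.0]
    for j in range(len(b)):
        prev.append(prev[-1] + char_cost(b, j))  # insert b[j]
    for i in range(len(a)):
        curr = [prev[0] + char_cost(a, i)]       # delete a[i]
        for j in range(len(b)):
            if a[i] == b[j]:
                substitute = prev[j]
            else:
                substitute = prev[j] + max(char_cost(a, i), char_cost(b, j))
            delete = prev[j + 1] + char_cost(a, i)
            insert = curr[j] + char_cost(b, j)
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

With these example weights, an error in the first character costs more than one in the middle of the query, and dropping a leading `#` costs more than dropping an ordinary letter.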

Twitter’s spelling correction has a number of unique challenges: searchers frequently type in usernames or hashtags that are not well-formed English words; there is a constant, real-time stream of new lingo and terms supplied by our own users; and we want to help people find those in order to join in the conversation. To address all of these issues, on top of our context-based mechanism, we also index dictionaries of trending queries and popular users that are likely to be misspelled, and use Lucene’s built-in spelling correction library (tweaked to better serve our needs) to identify misspellings and retrieve corrections for queries.
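The dictionary lookup idea can be sketched in a few lines of Python. Note that this example uses the standard library's `difflib` as a stand-in and does not use or mirror Lucene's API; the dictionary contents, the `suggest` function, and the `cutoff` value are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical dictionary of trending queries and popular usernames.
DICTIONARY = ["#olympics", "#oscars", "@twitter", "justin bieber"]


def suggest(query, dictionary=DICTIONARY, max_suggestions=3, cutoff=0.8):
    """Return dictionary entries that closely resemble a likely misspelling."""
    if query in dictionary:
        return []  # the query is already a known term, nothing to correct
    # get_close_matches ranks candidates by similarity ratio, best first.
    return get_close_matches(query, dictionary, n=max_suggestions, cutoff=cutoff)
```

Because the dictionary holds hashtags and usernames as-is, a query such as `#olymipcs` can be matched against `#olympics` even though neither is an English dictionary word.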

Initially, we started computing related queries and spelling corrections in a batch service, periodically updating our user-facing service with the latest data. But we noticed that the lag this process introduced resulted in a less-than-optimal experience: it would take several hours for the models to adapt to new search trends. We then rewrote the entire service, this time as an online, real-time one. Queries and tweets are tracked as they come, and our models are continuously updated, just like the search results themselves. To account for the longer tail of queries that have less context from recent hours, we combine the real-time, up-to-date model with a background model computed in the same manner, but over several months of data (and updated daily).
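One simple way to combine a real-time model with a background model is linear interpolation with a fallback for queries that have no recent evidence. The sketch below assumes both models are stored as pair counts; the interpolation scheme, the `realtime_weight` value, the function names, and the sample counts are all assumptions, since the actual combination method is not specified here.

```python
def conditional_prob(pair_counts, a, b):
    """Estimate P(b follows a) from (earlier, later) pair counts."""
    total = sum(c for (first, _), c in pair_counts.items() if first == a)
    return pair_counts.get((a, b), 0) / total if total else 0.0


def blended_score(a, b, realtime, background, realtime_weight=0.7):
    """Interpolate the two models; use the background alone for tail queries."""
    has_recent_context = any(first == a for first, _ in realtime)
    if not has_recent_context:
        # Long-tail query: nothing seen in recent hours, rely on months of data.
        return conditional_prob(background, a, b)
    return (realtime_weight * conditional_prob(realtime, a, b)
            + (1 - realtime_weight) * conditional_prob(background, a, b))


# Hypothetical counts: recent hours versus several months of data.
realtime = {("new trend", "new trend video"): 8, ("new trend", "old topic"): 2}
background = {
    ("new trend", "new trend video"): 5,
    ("new trend", "old topic"): 5,
    ("rare query", "rare topic"): 3,
}
```

In this example the trending suggestion is boosted by recent activity, while the rare query, which has no recent context, still gets a suggestion from the background model.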

Within the first two weeks of launching our related queries and spelling corrections in late April, we corrected 5 million queries and provided suggestions for 100 million more. We’re very encouraged by the high engagement rates we’re seeing so far on both features.

We’re working on more ways to help you find and discover the most relevant and engaging content in real time, so stay tuned. There are other big improvements we’ll be rolling out to Twitter search over the coming weeks and months.

Acknowledgments
The system was built by Gilad Mishne (@gilad), Zhenghua Li (@zhenghuali) and Tian Wang (@wangtian) with help from the entire Twitter Search team. Thanks also to Jeff Dalton (@jeffd) for initial explorations and to Aneesh Sharma (@aneeshs) for help with the design.