Over time Tweets have acquired a language all their own. Some of these have been around a long time (like @username at the beginning of a Tweet) and some of these are relatively recent (such as lists) but all of them make the language of Tweets unique. Extracting these Tweet-specific components from a Tweet is relatively simple for the majority of Tweets, but like most text parsing issues the devil is in the details.
We’ve extracted the code we use to handle Tweet-specific elements and released it as an open source library. This first version is available in Ruby and Java but in the Twitter spirit of openness we’ve also released a conformance test suite so any other implementations can verify they meet the same standards.
It all started with the @reply … and then it got complicated. Twitter users started the use of @username at the beginning of a Tweet to indicate a reply, but you’re not here to read about history. In order to talk about the new Twitter Text libraries one needs to understand the Tweet-specific elements we’re interested in. Much of this will be a review of what you already know but a shared vocabulary will help later on. While the Tweet-specific language is always expanding the current elements consist of:
For this first version of the Twitter Text libraries we’ve released both Ruby and Java versions. We certainly expect more languages in the future and we’re looking forward to the patches and feedback we’ll get on these first versions.
For each library we’ve provided functions for extracting the various Tweet-specific elements. Displaying Tweets in HTML is a very common use case so we’ve also included HTML auto-linking functions. The individual language interfaces differ so they can feel as natural as possible for each individual language.
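To make the difference between extraction and auto-linking concrete, here is a minimal, self-contained sketch in Ruby. It is not the library's implementation: the `SimpleTweetText` module, its method names, the simplified `MENTION` pattern, and the link URL format are all assumptions made for illustration.

```ruby
require "cgi"

# Hypothetical helper module, for illustration only. The real libraries use
# far more careful patterns than this one.
module SimpleTweetText
  # Simplifying assumption: "@" followed by 1 to 20 ASCII word characters,
  # not preceded by an ASCII word character (so "user@example" is skipped).
  MENTION = /(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,20})/

  # Extraction: return the mentioned screen names without the "@".
  def self.extract_mentions(text)
    text.scan(MENTION).flatten
  end

  # Auto-linking: HTML-escape the text, then wrap each mention in an anchor.
  # The URL format below is an assumption for this sketch.
  def self.auto_link_mentions(text)
    CGI.escapeHTML(text).gsub(MENTION) do
      name = Regexp.last_match(1)
      %(@<a href="https://twitter.com/#{name}">#{name}</a>)
    end
  end
end

puts SimpleTweetText.extract_mentions("Mentioning @twitter and @jack").inspect
# prints ["twitter", "jack"]
puts SimpleTweetText.auto_link_mentions("Hi @jack <3")
# prints Hi @<a href="https://twitter.com/jack">jack</a> &lt;3
```

Escaping before linking keeps any markup in the Tweet text from being rendered as HTML, while the anchors added afterwards stay intact.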
The Ruby library is available as a gem via gemcutter or the source code can be found on github. You can also peruse the rdoc hosted on github. The Ruby library is provided as a set of Ruby modules so they can be included in your own classes and modules. The rdoc is a more complete reference but for a quick taste check out this short example:
class MyClass
  include Twitter::Extractor

  # Included module methods are instance methods, so call them from one.
  def usernames
    extract_mentioned_screen_names("Mentioning @twitter and @jack")
    # => ["twitter", "jack"]
  end
end
The interface makes this all seem quite simple but there are some very complicated edge cases. I’ll talk more about that in the next section, Conformance Testing.
The source code for the Java library can be found on github. The library provides an ant file for building the twitter-text.jar file. You can also peruse the javadocs hosted on github. The Java library provides Extractor and Autolink classes that provide object-oriented methods for extraction and auto-linking. The javadoc is a more complete reference but for a quick taste check out this short example:
import java.util.List;
import com.twitter.Extractor;

public class Check {
  public static void main(String[] args) {
    // The element type is needed so the for-each loop below compiles.
    List<String> names;
    Extractor extractor = new Extractor();
    names = extractor.extractMentionedScreennames("Mentioning @twitter and @jack");
    for (String name : names) {
      System.out.println("Mentioned @" + name);
    }
  }
}
The library makes this all seem quite simple but there are some very complicated edge cases.
While working on the Ruby and Java versions of the Twitter Text libraries it became pretty clear that porting tests to each language individually wasn’t going to be sustainable. To help keep things in sync we created the Twitter Text Conformance project. This project provides some simple YAML files that define the expected before and after states for testing. The per-language implementation of these tests can vary along with the per-language interface, making it intuitive for programmers in any language.
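As a rough sketch of how a port might consume such files, the Ruby below loads a small YAML document and checks an extractor against it. The YAML layout (the `tests`, `mentions`, `description`, `text`, and `expected` keys) is an assumption for illustration and is not copied from the conformance project; `extract_mentions` is a simplified stand-in for a real implementation.

```ruby
require "yaml"

# Hypothetical conformance data; the real project's file layout may differ.
conformance_yaml = <<~'YAML'
  tests:
    mentions:
      - description: "Extract two mentions"
        text: "Mentioning @twitter and @jack"
        expected: ["twitter", "jack"]
      - description: "Ignore email addresses"
        text: "Contact user@example.com"
        expected: []
YAML

# Stand-in extractor; a real port would call its own implementation here.
def extract_mentions(text)
  text.scan(/(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,20})/).flatten
end

# Run every case and return the descriptions of the ones that fail.
def run_conformance(yaml_source)
  cases = YAML.safe_load(yaml_source).fetch("tests").fetch("mentions")
  cases.reject { |c| extract_mentions(c["text"]) == c["expected"] }
       .map { |c| c["description"] }
end

failures = run_conformance(conformance_yaml)
puts failures.empty? ? "all cases pass" : "failed: #{failures.join(', ')}"
```

Because the cases live in data rather than code, each language only needs a small runner like this one, and new edge cases can be added once for every implementation.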
The basic extraction and auto-link test cases are easy to understand but the edge cases abound. Many of the largest complications come from handling Tweets written in Japanese and other languages that don’t use spaces. We also try to be lenient with the allowed URL characters, which creates some more headaches.