It's Not Rocket Science, But It's Our Work

Sunday, 1 June 2008

The general public is fascinated with every bug that pops up on board the Mars Phoenix Lander because, no matter how small, each one seems mission critical. It’s exciting stuff and we hang on every bit of news. Will NASA scientists find a workaround for the short-circuited mass spectrometer? Will the important science get done? Will we find water on Mars!?

To most people, Twitter’s system architecture shortcomings are nowhere near as interesting, and we’re certainly not as exciting, as rocket science or exploring Mars. However, folks who use Twitter get frustrated when the service is slow or down, which is why we have lately been trying to be more communicative about engineering and operations details.

The TechCrunch blog is particularly interested in behind-the-scenes details about Twitter engineering and systems because the folks who work there use Twitter and also have a very large audience of technology fans. Earlier this evening, TechCrunch posted some specific technology questions for us on their blog so we thought it appropriate to answer them here on our blog.

Before we share our answers, it’s important to note one very big piece of information: We are currently taking a new approach to the way Twitter functions technically, with the help of amazing systems engineers, formerly of Google, IBM, and other high-profile technology companies, who recently joined our core team. Our answers below refer to how Twitter has worked historically; we know that approach is not correct and we’re changing it.

Q: Is it true that you only have a single master MySQL server running replication to two slaves, and the architecture doesn’t auto-switch to a hot backup when the master goes down?
A: We currently use one database for writes with multiple slaves for read queries. As many know, replication of MySQL is no easy task, so we’ve brought in MySQL experts to help us with that immediately. We’ve also ordered new machines and failover infrastructure to handle emergencies.
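
For readers curious what "one database for writes with multiple slaves for read queries" looks like in practice, here is a minimal Python sketch. It is an illustration only, not our production code: the `ReplicatedStore` class and its method names are hypothetical, plain dictionaries stand in for MySQL servers, and replication is an explicit step here even though real MySQL replication runs asynchronously in the background.

```python
import itertools


class ReplicatedStore:
    """Toy stand-in for one write master plus read-only replicas."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Every write goes to the single master.
        self.master[key] = value

    def replicate(self):
        # Copy the master's state to each replica (explicit in this sketch).
        for replica in self.replicas:
            replica.update(self.master)

    def read(self, key):
        # Reads are spread across the replicas in round-robin order.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


store = ReplicatedStore(replica_count=2)
store.write("status:1", "hello")
stale = store.read("status:1")   # replicas have not caught up yet
store.replicate()
fresh = store.read("status:1")   # now visible on the replicas
```

The sketch also shows why this setup needs care: a read that arrives before replication catches up sees old data, and everything still depends on the one master being available.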

Q: Do you really have a grand total of three physical database machines that are POWERING ALL OF TWITTER?
A: We’ve mitigated much of this issue by using memcached, as many sites do, to minimize our reliance on a database. Our new architecture will move our reliance to a simple, elegant filesystem-based approach, rather than a collection of databases. Until then, we are adding replication to handle the current growth and stresses, but we don’t plan on ever relying on a massive number of databases in the future.
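
The idea of using a cache to reduce database load can be sketched roughly as follows. This is a generic "cache-aside" illustration in Python, not a description of how Twitter is actually wired: the `CacheAside` class is hypothetical, a dictionary stands in for a memcached client, and `fake_db` stands in for the database.

```python
class CacheAside:
    """Check the cache first; fall back to the database only on a miss."""

    def __init__(self, load_from_db):
        self._cache = {}              # stand-in for a memcached client
        self._load_from_db = load_from_db
        self.db_reads = 0             # counts how often the database is hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        # Drop a cached entry, for example after the underlying row changes.
        self._cache.pop(key, None)


fake_db = {"user:1": "jack", "user:2": "biz"}
users = CacheAside(fake_db.get)
first = users.get("user:1")    # miss: goes to the database
second = users.get("user:1")   # hit: served from the cache
```

In this sketch the second lookup never touches the database, which is the whole point: the more requests the cache absorbs, the fewer database machines the read path depends on.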

Q: Is it true that the only way you can keep Twitter alive is to have somebody sit there and watch it constantly, and then manually switch databases over and re-build when one of the slaves fail?
A: There’s a lot of necessary handholding and tweaking of our current system. Nevertheless, we’re growing our operations team to meet ongoing challenges.

Q: Is that why most of your major outages can be traced to periods of time when [a system administrator] was there to sit and monitor the system?
A: There are a number of reasons for our past outages: everything from faulty process, environment, and configuration to just plain load. Our system must be designed for peaks; currently we’re tightly coupled, which means that massive traffic on one part affects all. We’re addressing this by breaking the stack into small lightweight pieces which are designed for failure.
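
One common way to loosen that kind of coupling is to put a bounded queue between two parts of a system, so that a traffic spike or a failure on one side is contained rather than passed along. The Python sketch below is only an assumption about what such a piece could look like; the `DeliveryQueue` class and its behavior (shedding load when full, skipping messages whose handler fails) are hypothetical and not taken from our new architecture.

```python
import queue


class DeliveryQueue:
    """Bounded buffer between a producer and a consumer."""

    def __init__(self, capacity):
        self._pending = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def submit(self, message):
        # Never block the producer: when the buffer is full, shed the message.
        try:
            self._pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, handler):
        # Process everything currently queued and return the success count.
        handled = 0
        while True:
            try:
                message = self._pending.get_nowait()
            except queue.Empty:
                return handled
            try:
                handler(message)
                handled += 1
            except Exception:
                # One failing message is counted and skipped, not fatal.
                self.dropped += 1


deliveries = DeliveryQueue(capacity=2)
accepted = [deliveries.submit(m) for m in ["a", "b", "c"]]
delivered = []
handled_count = deliveries.drain(delivered.append)
```

Here the third message is rejected because the buffer is full, while the producer keeps running and the two accepted messages are still delivered. Whether to drop, retry, or persist overflow is a design choice that depends on the workload.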

Q: Given the record-beating outages Twitter saw [recently], is anyone there capable of keeping Twitter live?
A: Of course, this is our work. Our growing team is collectively rolling up our sleeves to build a utility-class system. We’re all focused on designing something that persists and becomes the background.

Q: How long will it be until you are able to undo the damage [you] caused to Twitter and the community?
A: We’re working extremely hard to keep the service stable and performing, as well as architecting a system that stands the test of time. We’d love to be able to tell you exactly how long this will take, but it’s no easy task. It will take time, time well spent.

The folks at TechCrunch singled out a former employee of Twitter by name in their questions but Twitter is a team—we share responsibility for our victories as well as our mistakes. At the scale we’re working, the tiniest detail matters. If the Mars Lander is off by a fraction, it burns up. A minor localized change on Twitter can have a systemic impact—good or bad. We’re working on a better architecture. In the meantime, we’re looking for ways we can optimize and extend our current architecture’s runway. Thank you for being patient while we do our work and thanks for using Twitter.

—Jack Dorsey and Biz Stone