Infrastructure

Partly Cloudy: The start of a journey into the cloud

Monday, 8 April 2019

Recently, Twitter Engineering embarked on an effort to migrate elements of the Twitter stack to the cloud. This article is the first in a series describing that journey. In this article, we introduce the project and describe motivations and early discoveries. Each article in the series will dive deeper into a particular aspect of the migration.

The journey begins

Three years ago, a team of engineers at Twitter set out on a mission—to find out whether it was possible to migrate all of Twitter’s infrastructure to the cloud.

Putting our infrastructure in the cloud sounds simple: just lift and shift, right? However, our data infrastructure alone ingests over a trillion events, processes hundreds of petabytes, and runs many tens of thousands of jobs on over a dozen clusters each day—it would be anything but simple.

After considerable investigation, several use-cases looked promising, but we determined that an all-in migration was more than we were able to take on at the time. The migration itself would be risky, costly, and disruptive for the business. It would have required mostly freezing the Twitter infrastructure for more than a year and preventing the product teams from making progress. So we made the decision to continue to operate our own data centers.

With the big-picture question answered, most of the teams went back to their day jobs, keeping our infrastructure running in top shape.

Clear skies ahead?

While we had proven that we could be cost-effective at scale using on-premise infrastructure, the desire and motivation to extend our capabilities to the cloud remained.

The advantages, as we saw them, were the ability to leverage new cloud offerings and capabilities as they became available, elasticity and scalability, a broader geographical footprint for locality and business continuity, reducing our footprint, and more.

As it turned out, that initial foray was just the start of the cloud story at Twitter. We realized that rather than an all-or-nothing approach, the best strategy would be a selective one: we would move just the parts of our infrastructure most immediately suited for the cloud. What are those parts? First, a little background...

Know your use case: Production vs Ad Hoc

At Twitter, we operate separate Hadoop clusters for different use-cases:

Real-time clusters are where we land the incoming data generated by end users in the course of using Twitter.
Processing clusters are where we run regularly scheduled production jobs with dedicated capacity.
“Ad-hoc” clusters are used for one-off queries or for occasional analysis that is less predictable than the production jobs.
Dedicated dense storage clusters allow us to manage less frequently accessed data.

When migrating to the cloud, the same basic setup and up-front work is required, no matter the mission-critical nature of the cluster/workload. Therefore it makes sense to select those use cases that can benefit most from the cloud capabilities with the least amount of risk. For Twitter it is the ad-hoc and cold storage use cases that are going to cloud, while our production clusters remain on-premises.

Hence, the name

Like all projects, our project needed a catchy, internal designation. At one of our early brainstorming sessions, many names came up, until one of our engineers proposed “Partly Cloudy”. The initial logo was a picture of Captain Underpants-inspired character flying fist-first through a cloud towards the sun. The logo didn’t stick, but the codename did.

While the project branding exercise was in good fun, mostly, it also served a purpose. Moving to the cloud involves a cultural change in Twitter engineering and a different way of thinking. A memorable codename can only help.

Partnerships

Partly Cloudy turns out to be not just complex, but a challenging and exciting project to work on as well. It requires patience, focus, tenacity and contributions from across the Twitter Engineering organization. Such an undertaking isn’t something that the Hadoop team could have done alone.

We have collaborated closely with many infrastructure teams across Twitter. Over the past decade, Twitter engineers have built up quite a suite of tools to do everything from provisioning hosts, networking, getting software deployed, monitoring, alerting, security, logging and all the other aspects involved in operating large distributed services at scale. When building services in our own data centers, engineers can take these tools for granted. Now we are working with our partner teams to extend this infrastructure into the cloud.

Partnerships extend outside of Twitter as well. While testing simulated workloads at scale, we found that Google Cloud was best suited to us in terms of technical capabilities, engagement from technical experts, networking, and choice of machine shapes and sizes. These turned out to be critical differentiators.

With the groundwork laid, other teams at Twitter have started to leverage cloud capabilities as well. See, for example, the use of Cloud Bigtable to power our advertiser analytics system.

Ultimately, our work culminated in a partnership with Google, announced ahead of Google Cloud Next 2018.

Forecast: Partly Cloudy

This post is the first in a series that describes our work moving to the cloud, the challenges we encountered, and how we tackled them. It continues with:

If you are interested to hear about any specific aspects, Tweet us at @TwitterEng and let us know!

If you are interested in helping in our journey to the cloud, we are hiring!

This post is unavailable

This post is unavailable.