Productionizing ML with workflows at Twitter

By and
Friday, 22 June 2018

Cortex provides machine learning platform technologies, modeling expertise, and education to teams at Twitter. Its purpose is to improve Twitter by enabling advanced and ethical AI. With first-hand experience running machine learning models in production, Cortex seeks to streamline difficult ML processes, freeing engineers to focus on modeling, experimentation, and user experience.

Once upon a time…

Machine learning pipelines were run at Twitter through a series of scripts or ad-hoc commands. Training and serving a model meant manually triggering and waiting for a series of jobs to complete or writing a collection of custom scripts to do so. Example jobs include extracting, transforming, and loading data, as well as training, validating, and serving models.

The process of managing and triggering machine learning pipelines presents problems that can impact the quality of models. Firstly, it reduces the frequency in which models can be retrained by creating unnecessary overhead and causing frequent errors. Each full run of the training pipeline requires active monitoring from engineers in order to trigger next steps and remedy issues with the data or scripts. Secondly, it reduces the rate of experimentation and tuning that can be done to improve the model. For example, in order to tune the hyperparameters of a model, engineers would have to manually trigger, manage and record results from multiple runs of a machine learning pipeline. This tedious and repetitive task reduces engineering productivity and slows iteration time.

In an effort to reduce the cost of maintenance of production machine learning pipelines, improve engineering productivity, and increase the rate of experimentation, Cortex developed ML Workflows: a tool designed to automate, schedule, and share machine learning pipelines at Twitter.

Airflow at Twitter

When we started building ML Workflows, our philosophy was to create a simple solution that would solve most ML needs while reusing existing components and open source technologies. Rather than reinvent the wheel, Cortex evaluated technical solutions based on a simple Python API to describe workflow DAGs paired with a backend job execution system.

After careful consideration of open-source projects, such as Luigi or Azkaban, as well as internal solutions, we picked Aurora Workflows and Apache Airflow as the frontrunners.

Why Airflow?

Aurora Workflows is an orchestration engine part of our company’s continuous deployment efforts. It supports full automation and is used to coordinate test and release stages across Aurora and CI (Jenkins) jobs.

Apache Airflow is an open source technology for creating, running, and managing data pipelines. It has a powerful UI to manage DAGs and an easy to use API for defining and extending operators.

Despite Aurora Workflows being integrated, we chose to base our product on Airflow as it:

  • Offers a fully-fledged Web UI with fine-grained view and control over workflows.
  • Provides dynamic parameter passing between operators through the use of XComs
  • Offers an API that is well designed for tasks of all sizes and creates a consistent  development experience for our users.
  • Supports execution of arbitrary Python code and is extendable thanks to its support of custom plugins.
  • Has an active community in addition to being used by large tech companies in production.

Integration at Twitter

Integrating Airflow as a productionized system within Twitter required modifications to Airflow and its deployment procedures. Here are some of the most noteworthy contributions we made.

Authentication and authorization

At Twitter, engineers generally authenticate with internal web services via Kerberos. To support Kerberos in Airflow, we took advantage of the existing framework provided by Flask-Login and created a child class of LoginManager. At initialization time, this class verifies that the principal name is known to Kerberos. It then sets up before and after-request filters that drive the GSSAPI process. On success, LDAP groups for the user are queried and and carried along with the user object. This allows us to offer group-based DAG-level control of actions or even visibility. Taken together, this approach allows Airflow to neatly drop into our authentication and authorization infrastructure with no fuss.

Self-service model

We are working towards a self-service model for teams to stand up their own instances. Which provides user groups complete control of their instance and DAGs, and it is much simpler to manage permissions and quotas of a single service account. This approach also paves the path  to a potential multi-tenant Airflow server at Twitter.

In our self-service model, each team’s deployment differs from another in only Aurora configuration and DAG population. So rather than copy and paste of a whole lot of config, we use a simple, short json file to specify the per-instance variables such as owning service

account, allowed groups, backing database, Kerberos principal credentials, and DAG folder locations. At start up time a simple command invoked through a custom CLI generates the appropriate Aurora configurations to start the scheduler, web server, and worker instances.

Stats integration

Airflow, like most python-based open sourced systems, uses the StatsClient from statsd. Twitter uses StatsReceiver, a part of our open-source Finagle stack in github/twitter/util. However, the two models are different: in util/stats, each metric is registered, which places it on a list for future collection; in statsd, metrics are simply emitted and the right thing happens on the back end. To solve this problem we built a bridge that  recognizes new metrics and register them as they appear. It is API compatible with StatsClient which allows us to inject an instance of our bridge object into Airflow as it is starting. With metrics now collected by our visualization system, we are able to provide a templated dashboard to simplify creation of a monitoring dashboard for our self-service clients.

Distributed workers with Celery

Airflow makes use of Celery to orchestrate a scaled multi-worker node configuration. We configured Celery to work with Twitter cloud containers and by default use a SQLAlchemy broker to exploit Airflow’s MySQL database as a message queue for Celery. This allows our users to setup their scalable Airflow workers without having to maintain an extra Redis or RabbitMQ service. However this does not prevent them from switching over to another message broker if they decide to support one in the future.

ML on Airflow

In order to make machine learning pipelines easy for our customers to build on Airflow there were a series of extensions we needed to add or improve upon.

Custom ML operators

Firstly, to help Twitter engineers run ML tasks, we developed reusable operators for them, including:

  • Aurora operators: At Twitter we use Aurora to schedule and run services and jobs. We implemented a set of operators that allow users to run code in Aurora. These operators are the foundation of many of our other operators.
  • DeepBird operators: Being our core ML tool, we provide operators that run training process, launch prediction service with trained models, and run load test on prediction service. Together they allow our engineers to train and validate their models with DeepBird from end to end.
  • Hyper parameter tuning operators: We created a set of operators to support hyper parameter tuning. Paired with our DAG constructor classes they support tuning hyper parameters of arbitrary DAG.
  • Utility operators: We also have operators for common tasks such as launching CI jobs, managing files on HDFS, creating, updating and monitoring JIRA tickets, sending information to our model metadata store, etc. When we identify a widely required routine among our users, we consider adding it as an operator. Contributing operators are not solely restricted to our team, but have been made by our users as well.

Type checking in operators

We did not want our users to incur the cost of encountering errors while DAGs were running due to compatibility in arguments being passed from one operator to another. Thus we developed a type checking system for operators, which verifies input and output data types of all operators in DAGs (args and XCOMs).

For example, suppose that we have two connected operators Foo and Bar in a DAG. Foo outputs an integer XCOM value, and Bar takes that XCOM value but is expecting a string. Our type checking system would raise an error to let the DAG developer know the types do not match.

All our operators are built upon this type checking system and have their input and output types declared through python decorators.

DAG constructors

Our users often had DAGs which processed the same operations but with different parameter values. To prevent having to create multiple DAGs they created a constructor class around a  DAG, and then exposed a dependent set of parameters within it’s constructor.

We decided to support this pattern by making DAG constructors a first class citizen in ML Workflows. We developed a DAG constructor base class which allows users to declare a list of parameters of their DAG. For example:


This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

We also developed a UI for our users to ad-hocly create instances of their DAGs with different parameters:

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

A python file is generated when a user creates a new DAG and is placed in Airflow’s DAG_FOLDER which makes use of Airflow’s ability to automatically load new DAGs.

To make these DAG instances persistent on our stateless cloud containers, we record information of them in the user’s Airflow database. When Airflow schedulers and workers restart, the DAGs are automatically re-created and loaded by Airflow.

Airflow administrator UI plugin

While creating many additions to Airflow to better support our ML use cases on the backend we also wanted to provide a nice UI layer to interact with certain features on the frontend. In order to achieve this we developed a UI plugin that adds a Flask view to the Airflow webserver. The view loads our JavaScript bundle, which is a small single page web app built with React. The app contains user interfaces of our features, such as hyper parameter tuning and DAG constructors, and calls Flask endpoints declared in our plugin to fetch data and invoke corresponding functionalities.

Use case deep dive

One of the most powerful use cases of ML Workflows is hyperparameter tuning. Leveraging Airflow’s DAGs, we enabled this feature in ML Workflows by implementing DagConstructor, which parameterizes a DAG, along with HyperParameterTuner, which builds a DAG that conducts the experiments and reports results to the user.

After a user fully automates a workflow using ML Workflows, hyperparameter tuning can be applied to the DAG by:

  1. Wrapping the DAG in a DagConstructor, exposing the hyperparameters to be tuned.
  2. Adding an additional task to the DAG that pushes via xcom workflow_result: a dictionary representing key value pairs of the metrics resulting from the workflow.
  3. Defining the hyperparameter tuning method, the TuningParameters and other tuning configurations with the HyperparameterTuningSpec object.
  4. Creating a HyperparameterTuner by passing it the DAG constructor and the tuning spec.

The HyperparameterTuner will generate a DAG with several subdags, each conducting an experiment using hyperparameters as specified by the HyperparameterTuningSpec. The DAG also includes tasks for recording results and reporting them to the user.

To make this more tangible, suppose a user has created a simple DeepBird model defined as a Python file which can be trained given some parameters. To apply hyperparameter tuning, the model needs to be wrapped in a SubDAG that exposes the parameters to the tuner.

This pattern is used so often it is included in ML Workflows as a general DAG constructor for the purpose of hyperparameter tuning, aptly named DeepBirdWorkflowDag.

HyperparameterTuner is then given this DAG, along with a HyperpameterTuningSpec to build the hyperparameter tuning workflow

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

The HyperparameterTuningSpec specifies the hyperparameters, tuning method, optimizing metric, and the number of experiments. This resulting DAG is generated:


This post is unavailable
This post is unavailable.

Zooming into batch_workflow_training_subdag, the individual experiments can be observed:

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

This post is unavailable
This post is unavailable.

Here, each batch_workflow_subdag_op represents an experiment and is a DAG constructed from DeepBirdWorkflowDag.

After running this workflow, the results are recorded and sent to the user.


The ML Workflows product has been adopted by several internal teams so far and has delivered immediate impact. One example is in the Timelines Quality team which adopted ML Workflows and as a result reduced the interval for retraining and deploying their models to production from four weeks to one week. The team also ran an online experiment to examine the difference of having the models retrained more often. The result of the experiment was positive, indicating that shorter retraining intervals provide better timeline quality and ranking for our users.

Several teams have also begun to experiment with hyperparameter tuning on top of ML Workflows and are seeing early results. As an example, the Abuse and Safety team applied our hyperparameter tuning tools to one of their tweet-based models, which allowed them to automatically run a number of experiments and return an improved model based on offline metrics.

As teams have adopted ML Workflows, the common Python operators and utility functions have grown and teams are benefiting from reusing common components to construct their workflows.  Teams have adopted the DAG constructor pattern making it easy to run workflows with different parameters. As more teams at Twitter continue to adopt ML Workflows, we expect to see a large-scale, positive impact on engineering productivity, iteration speed, and model quality.

Future work

While the initial results of ML Workflows are exciting, there is a lot of work for the team ahead:

  • Broad adoption - Adoption by teams is expected to continue over the course of the year with a goal of broad adoption by the end of the year. The team actively supports onboarding customers, provides daily support on Slack, and offers office hours to help teams.  
  • ML Dashboard integration - Cortex is working to develop a unified UI layer for all of its products including integrations with ML Workflows. In the meantime, the team has already added UI functionality directly to Airflow as shown by our custom plugin.
  • Hyperparameter tuning - So far we have deployed random search and grid search techniques. Our goal is to create a better integration to leverage more advanced parameter space search techniques such as Bayesian optimization.  
  • Production hardening - Our goal is to deliver a reliable production system. That means investing in metrics, robustness, and production deployment strategies.  
  • Managed Airflow - Currently teams need to deploy their own ML Workflows instances under their respective service accounts. In the future we would like to offer a more centralized solution in order to further eliminate maintenance costs.


We would like to thank Paweł Zaborski and Stephan Ohlsson for spearheading the integration of Airflow at Twitter. Major thanks go to the Cortex Platform Tools team for initiating the analysis of alternatives, developing the design document, and integrating Airflow into the Twitter stack: Devin Goodsell, Jeshua Bratman, Bill Darrow, Newton Le, Xiao Zhu, Yu Zhou Lee, Samuel Ngahane, Matthew Bleifer, Andrew Wilcox, Daniel Knightly, and Masoud Valafar. Honorable mention to management for supporting us in this endeavor: Nicolas Koumchatzky and Sandeep Pandey. And finally, a special thanks for all the Tweeps that contributed feedback during our beta offering.


This post is unavailable
This post is unavailable.