The ability to share photos directly on Twitter has existed since 2011 and is now an integral part of the Twitter experience. Millions of images are uploaded to Twitter every day, and they come in all sorts of shapes and sizes, which makes rendering a consistent UI experience a challenge. The photos in your timeline are cropped to improve consistency and to allow you to see more Tweets at a glance. How do we decide what to crop, that is, which part of the image should we show you?
Previously, we used face detection to focus the view on the most prominent face we could find. While this is not an unreasonable heuristic, the approach has obvious limitations since not all images contain faces. Additionally, our face detector often missed faces and sometimes mistakenly detected faces when there were none. If no faces were found, we would focus the view on the center of the image. This could lead to awkwardly cropped preview images.
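For illustration, the old logic amounted to something like the following Python sketch; the face detector interface and bounding-box format here are hypothetical stand-ins, not our actual pipeline:

```python
def crop_focus(image_width, image_height, faces):
    """Pick the point the crop should be centered on.

    `faces` is a list of (x, y, w, h) bounding boxes from a face
    detector (a hypothetical interface). Falls back to the image
    center when no faces are found.
    """
    if faces:
        # Focus on the most prominent (here: largest) detected face.
        x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
        return (x + w / 2, y + h / 2)
    # No faces detected: fall back to the center of the image,
    # which is what produced the awkward crops described above.
    return (image_width / 2, image_height / 2)
```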
A better way to crop is to focus on “salient” image regions. A region has high saliency if a person is likely to look at it when freely viewing the image. Academics have studied and measured saliency using eye trackers, which record which pixels people fixate on with their eyes. In general, people tend to pay most attention to faces, text, and animals, but also to other objects and regions of high contrast. This data can be used to train neural networks and other algorithms to predict what people might want to look at.
The basic idea is to use these predictions to center a crop around the most interesting region [1].
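As a minimal sketch of that idea, assuming the saliency map is a 2D array of per-pixel scores, the crop window can be centered on the map’s maximum and clamped to the image bounds (the function name and fixed crop size are illustrative assumptions, not the production logic):

```python
import numpy as np

def salient_crop(saliency, crop_h, crop_w):
    """Return the (top, left) corner of a crop_h x crop_w window
    centered on the most salient pixel, clamped so the window
    stays inside the image. Assumes the crop fits in the image."""
    h, w = saliency.shape
    # Location of the most salient pixel in the predicted map.
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    top = min(max(y - crop_h // 2, 0), h - crop_h)
    left = min(max(x - crop_w // 2, 0), w - crop_w)
    return top, left
```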
Thanks to recent advances in machine learning, saliency prediction has gotten a lot better [2]. Unfortunately, the neural networks used to predict saliency are too slow to run in production, since we need to process every image uploaded to Twitter and enable cropping without impacting the ability to share in real-time. On the other hand, we don’t need fine-grained, pixel-level predictions, since we are only interested in roughly knowing where the most salient regions are. In addition to optimizing the neural network’s implementation, we used two techniques to reduce its size and computational requirements.
First, we used a technique called knowledge distillation to train a smaller network to imitate a slower but more powerful one [3]. In this approach, an ensemble of large networks is used to generate predictions on a set of images. These predictions, together with some third-party saliency data, are then used to train a smaller, faster network.
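A training objective for this kind of distillation might look roughly like the PyTorch sketch below, where the student is trained to match the teachers’ averaged saliency distribution over pixels; the function and the exact loss form are assumptions for illustration, not our production code:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logit_list):
    """KL divergence between the teacher ensemble's average saliency
    distribution and the student's predicted distribution.

    Inputs are unnormalized log-saliency maps of shape (batch, H, W);
    each map is flattened and treated as a distribution over pixels.
    """
    b = student_logits.shape[0]
    student_log_p = F.log_softmax(student_logits.view(b, -1), dim=1)
    # Average the teachers' probability maps to form the soft target.
    teacher_p = torch.stack(
        [F.softmax(t.view(b, -1), dim=1) for t in teacher_logit_list]
    ).mean(dim=0)
    return F.kl_div(student_log_p, teacher_p, reduction="batchmean")
```

In practice this soft-target loss would be combined with a loss on the third-party saliency data mentioned above.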
Second, we developed a pruning technique to iteratively remove feature maps of the neural network which were costly to compute but did not contribute much to performance. To decide which feature maps to prune, we computed the number of floating point operations required for each feature map and combined it with an estimate of the performance loss that would be suffered by removing it. More details on our pruning approach can be found in our paper, which is available on arXiv [4].
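The greedy selection step of such a pruning loop could be sketched as follows; the importance estimates would come from the Fisher-based method described in [4], and `beta`, which trades accuracy against compute, is an assumed hyperparameter:

```python
def next_map_to_prune(importance, flops, beta=1e-9):
    """Greedily pick the feature map whose removal costs the least.

    `importance[i]` estimates the increase in loss if feature map i
    is removed (e.g. a Fisher-information-based estimate), and
    `flops[i]` is the number of floating point operations saved by
    removing it. Removing map i changes the overall objective by
    roughly importance[i] - beta * flops[i], so we prune the map
    minimizing that score."""
    scores = [imp - beta * fl for imp, fl in zip(importance, flops)]
    return min(range(len(scores)), key=lambda i: scores[i])
```

Pruning the selected map, briefly retraining, and repeating yields a progressively cheaper network.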
Together, these two methods allowed us to crop media 10x faster than a vanilla implementation of the model, even before any implementation-level optimizations. This lets us perform saliency detection on all images as soon as they are uploaded and crop them in real-time.
These updates are currently in the process of being rolled out to everyone on twitter.com, iOS and Android. Below are some more examples of how this new algorithm affects image cropping on Twitter.
Before and after: example image crops.
We’d like to thank everyone involved at Twitter who worked with us on this new update. In particular, the Video Org leadership, Media Platform, Magic Pony, Comms and Legal teams, with special thanks to:
[1] E. Ardizzone, A. Bruno, G. Mazzola
Saliency Based Image Cropping
ICIAP, 2013
[2] M. Kümmerer, L. Theis, M. Bethge
Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet
ICLR Workshop, 2015
[3] G. Hinton, O. Vinyals, J. Dean
Distilling the Knowledge in a Neural Network
NIPS Workshop, 2014
[4] L. Theis, I. Korshunova, A. Tejani, F. Huszár
Faster gaze prediction with dense networks and Fisher pruning
arXiv:1801.05787, 2018