Netty 4 at Twitter: Reduced GC Overhead

Tuesday, 15 October 2013

At Twitter, Netty (@netty_project) is used in core places requiring networking functionality.

For example:

Finagle is our protocol-agnostic RPC system whose transport layer is built on top of Netty; it is used to implement most of our internal services, such as Search
TFE (Twitter Front End) is our proprietary spoon-feeding reverse proxy, which serves most of our public-facing HTTP and SPDY traffic using Netty
Cloudhopper sends billions of SMS messages every month to hundreds of mobile carriers all around the world using Netty

For those who aren’t aware, Netty is an open source Java NIO framework that makes it easier to create high-performing protocol servers. The older version, Netty 3, used Java objects to represent I/O events. This was simple, but could generate a lot of garbage, especially at our scale. In the new Netty 4 release, I/O events are handled by methods on long-lived channel objects instead of by short-lived event objects. There is also a specialized buffer allocator that uses pools.

We take the performance, usability, and sustainability of the Netty project seriously, and we have been working closely with the Netty community to improve it in all aspects. In this post, we discuss our usage of Netty 3 and aim to show why migrating to Netty 4 has made us more efficient.

Reducing GC pressure and memory bandwidth consumption

One problem was Netty 3’s reliance on the JVM’s memory management for buffer allocations. Netty 3 creates a new heap buffer whenever a new message is received or a user sends a message to a remote peer. This means a ‘new byte[capacity]’ for each new buffer. These buffers caused GC pressure and consumed memory bandwidth: allocating a new byte array consumes memory bandwidth to fill the array with zeros for safety. However, the zero-filled byte array is very likely to be overwritten with the actual data right away, consuming the same amount of memory bandwidth a second time. We could have halved the consumption of memory bandwidth if the Java Virtual Machine (JVM) provided a way to create a new byte array that is not necessarily filled with zeros, but there’s no such way at this moment.
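The 50% figure follows from simple accounting: with a fresh array per message, every byte of the buffer is written twice (once with a zero by the JVM, once with payload data), while a buffer that is reused without clearing is written only once. The sketch below is an illustrative model of that accounting, not a measurement. The class and method names are hypothetical, and a real JVM may behave differently from this model.

```java
import java.util.Arrays;

// Hypothetical accounting model: counts how many bytes are written to buffer
// memory under each strategy. It models the argument; it does not measure
// real memory traffic.
class ZeroFillAccounting {

    // Netty 3 style: a new byte[] per message. Java guarantees the array starts
    // zero-filled (write #1); copying the payload in overwrites it (write #2).
    static long freshArrayPerMessage(byte[] payload, int messages) {
        long bytesWritten = 0;
        for (int i = 0; i < messages; i++) {
            byte[] buf = new byte[payload.length];
            bytesWritten += buf.length;                  // zero fill
            System.arraycopy(payload, 0, buf, 0, payload.length);
            bytesWritten += payload.length;              // actual data
        }
        return bytesWritten;
    }

    // Pooled style: one buffer is reused and never re-zeroed between messages.
    static long reusedBuffer(byte[] payload, int messages) {
        byte[] buf = new byte[payload.length];           // one-time cost, not counted
        long bytesWritten = 0;
        for (int i = 0; i < messages; i++) {
            System.arraycopy(payload, 0, buf, 0, payload.length);
            bytesWritten += payload.length;              // actual data only
        }
        return bytesWritten;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[256];
        Arrays.fill(payload, (byte) 1);
        long fresh = freshArrayPerMessage(payload, 1000);
        long reused = reusedBuffer(payload, 1000);
        // prints "512000 vs 256000 bytes written"
        System.out.println(fresh + " vs " + reused + " bytes written");
    }
}
```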

To address this issue, we made the following changes for Netty 4.

Removal of event objects

Instead of creating event objects, Netty 4 defines different methods for different event types. In Netty 3, the ChannelHandler has a single method that handles all event objects:

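import org.jboss.netty.channel.*;

class Before implements ChannelUpstreamHandler {
  // One method receives every event; the handler must downcast to find out what happened.
  @Override
  public void handleUpstream(ChannelHandlerContext ctx, ChannelEvent e) throws Exception {
    if (e instanceof MessageEvent) { ... }
    else if (e instanceof ChannelStateEvent) { ... }
    ...
  }
}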

Netty 4 has one handler method per event type:

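import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandler;

class After implements ChannelInboundHandler {
  // Each event type arrives as a direct method call rather than as an event object.
  public void channelActive(ChannelHandlerContext ctx) throws Exception { ... }
  public void channelInactive(ChannelHandlerContext ctx) throws Exception { ... }
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception { ... }
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception { ... }
  ...
}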

Note that a handler now has a method called ‘userEventTriggered’ so that it does not lose the ability to define and handle custom event objects.
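To make the contrast concrete, here is a toy model of both dispatch styles in plain Java. The types below (ToyEvent, ToyUpstreamHandler, ToyInboundHandler, IdleTimeout) are hypothetical stand-ins that only borrow Netty's method names; they are not Netty's API. The sketch shows that the event-object style allocates a wrapper per event, that the per-method style passes only the payload, and that a custom event can still travel through userEventTriggered.

```java
import java.util.ArrayList;
import java.util.List;

class DispatchSketch {
    // --- Netty 3 style (toy): every event is wrapped in a short-lived object ---
    interface ToyEvent {}
    static final class ToyMessageEvent implements ToyEvent {
        final Object msg;
        ToyMessageEvent(Object msg) { this.msg = msg; }
    }
    static final class ToyStateEvent implements ToyEvent {
        final boolean active;
        ToyStateEvent(boolean active) { this.active = active; }
    }
    interface ToyUpstreamHandler { void handleUpstream(ToyEvent e); }

    // --- Netty 4 style (toy): one method per event type, no wrapper object ---
    interface ToyInboundHandler {
        void channelActive();
        void channelRead(Object msg);
        void userEventTriggered(Object evt);
    }

    // A hypothetical application-defined event.
    static final class IdleTimeout {}

    static List<String> runBefore() {
        List<String> log = new ArrayList<>();
        ToyUpstreamHandler h = e -> {
            if (e instanceof ToyMessageEvent) {
                log.add("read:" + ((ToyMessageEvent) e).msg);
            } else if (e instanceof ToyStateEvent) {
                log.add(((ToyStateEvent) e).active ? "active" : "inactive");
            }
        };
        h.handleUpstream(new ToyStateEvent(true));      // one allocation per event
        h.handleUpstream(new ToyMessageEvent("ping"));  // and another
        return log;
    }

    static List<String> runAfter() {
        List<String> log = new ArrayList<>();
        ToyInboundHandler h = new ToyInboundHandler() {
            public void channelActive() { log.add("active"); }
            public void channelRead(Object msg) { log.add("read:" + msg); }
            public void userEventTriggered(Object evt) {
                if (evt instanceof IdleTimeout) { log.add("idle"); }
            }
        };
        h.channelActive();                       // plain method calls, no event wrappers
        h.channelRead("ping");
        h.userEventTriggered(new IdleTimeout()); // custom events are still possible
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runBefore());  // prints "[active, read:ping]"
        System.out.println(runAfter());   // prints "[active, read:ping, idle]"
    }
}
```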

Buffer pooling

Netty 4 also introduced a new interface, ‘ByteBufAllocator’. Netty now provides a buffer pool implementation via that interface; the pool is a pure Java variant of jemalloc, which implements buddy memory allocation and slab allocation.
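As a rough illustration of why pooling helps, the sketch below keeps freed arrays on per-size-class free lists and hands them back out on the next request. It is a deliberately tiny stand-in, not Netty's allocator: there is no buddy splitting, no arenas, and no thread safety, and the names ToyPool, sizeClass, allocate, and free are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

class ToyPool {
    // Free lists keyed by size class (capacities rounded up to a power of two).
    private final Map<Integer, ArrayDeque<byte[]>> freeLists = new HashMap<>();
    int freshAllocations = 0;

    // Rounds a requested size (1 to 2^30) up to the next power of two.
    static int sizeClass(int size) {
        if (size < 1 || size > (1 << 30)) {
            throw new IllegalArgumentException("size: " + size);
        }
        int highest = Integer.highestOneBit(size);
        return highest == size ? size : highest << 1;
    }

    byte[] allocate(int size) {
        ArrayDeque<byte[]> list = freeLists.get(sizeClass(size));
        if (list != null && !list.isEmpty()) {
            return list.pop();                // pool hit: no new array, no zero fill
        }
        freshAllocations++;
        return new byte[sizeClass(size)];     // pool miss: fall back to the JVM
    }

    // Puts a buffer obtained from allocate() back on its size class's free list.
    void free(byte[] buf) {
        freeLists.computeIfAbsent(buf.length, k -> new ArrayDeque<>()).push(buf);
    }

    public static void main(String[] args) {
        ToyPool pool = new ToyPool();
        byte[] a = pool.allocate(100);        // rounded up to 128
        pool.free(a);
        byte[] b = pool.allocate(120);        // same size class, so 'a' is reused
        System.out.println(a == b);                  // prints "true"
        System.out.println(pool.freshAllocations);   // prints "1"
    }
}
```

A reused buffer still holds whatever its previous user wrote, so the caller has to track which bytes are valid instead of relying on zeros.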

Now that Netty has its own memory allocator for buffers, it doesn’t waste memory bandwidth by filling buffers with zeros. However, this approach opens another can of worms: reference counting. Because we cannot rely on GC to put the unused buffers back into the pool, we have to be very careful about leaks. Even a single handler that forgets to release a buffer can make our server’s memory usage grow without bound.
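The leak hazard can be sketched with a toy reference-counted buffer. The retain/release naming mirrors the convention Netty 4 uses for its reference-counted objects, but ToyBuf and its outstanding counter are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ToyBuf {
    // Number of buffers handed out but not yet returned to the (imaginary) pool.
    static final AtomicInteger outstanding = new AtomicInteger();

    // A new buffer starts with one reference, owned by whoever allocated it.
    private final AtomicInteger refCnt = new AtomicInteger(1);

    static ToyBuf allocate() {
        outstanding.incrementAndGet();
        return new ToyBuf();
    }

    // Called by a component that needs the buffer to outlive the current owner.
    ToyBuf retain() {
        int cnt;
        do {
            cnt = refCnt.get();
            if (cnt == 0) throw new IllegalStateException("already released");
        } while (!refCnt.compareAndSet(cnt, cnt + 1));
        return this;
    }

    // Returns true when the last reference is dropped and the buffer
    // goes back to the pool.
    boolean release() {
        int cnt;
        do {
            cnt = refCnt.get();
            if (cnt == 0) throw new IllegalStateException("already released");
        } while (!refCnt.compareAndSet(cnt, cnt - 1));
        if (cnt == 1) {
            outstanding.decrementAndGet();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyBuf ok = ToyBuf.allocate();
        ok.release();                            // well-behaved handler
        ToyBuf leaked = ToyBuf.allocate();       // this handler forgets to release
        leaked = null;                           // GC can reclaim the wrapper object,
                                                 // but the pool never gets its memory back
        System.out.println(outstanding.get());   // prints "1"
    }
}
```

Every code path that takes ownership of a buffer, including error paths, needs a matching release; otherwise the outstanding count only ever grows.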

Was it worthwhile to make such big changes?

Because of the changes mentioned above, Netty 4 is not backward compatible with Netty 3. This means that our projects built on top of Netty 3, as well as other community projects, have to spend a non-trivial amount of time on migration. Is it worth doing that?

We compared two echo protocol servers built on top of Netty 3 and 4 respectively. (Echo is simple enough that any garbage created is Netty’s fault, not the protocol’s.) We let them serve the same distributed echo protocol clients with 16,384 concurrent connections repeatedly sending 256-byte random payloads, nearly saturating gigabit ethernet.

According to our test results, Netty 4 had:

5 times less frequent GC pauses: 45.5 (Netty 3) vs. 9.2 (Netty 4) times/min
5 times less garbage production: 207.11 (Netty 3) vs. 41.81 (Netty 4) MiB/s

We also wanted to make sure our buffer pool is fast enough. Here’s a graph where the X and Y axes denote the size of each allocation and the time taken to allocate a single buffer respectively:

As you can see, the buffer pool becomes much faster than the JVM’s own allocation as the size of the buffer increases. The difference is even more noticeable for direct buffers. However, the pool could not beat the JVM for small heap buffers, so we have something to work on here.

Moving forward

Although some parts of our services have already migrated from Netty 3 to 4 successfully, we are performing the migration gradually. We discovered some barriers that slow our adoption, which we hope to address in the near future:

Buffer leaks: Netty has a simple leak reporting facility but it does not provide information detailed enough to fix the leak easily.
Simpler core: Netty is a community-driven project with many stakeholders, and it could benefit from a simpler core set of code. Non-core features that live in the core increase its instability because they tend to lead to collateral changes in the core. We want to make sure only the real core features remain in the core and other features stay out of it.

We are also thinking of adding more cool features such as:

HTTP/2 implementation
HTTP and SOCKS proxy support for client side
Asynchronous DNS resolution (see pull request)
Native extensions for Linux that work directly with epoll via JNI
Prioritization of the connections with strict response time constraints

Getting Involved

What’s interesting about Netty is that it is used by many different people and companies worldwide, mostly not from Twitter. It is an independent and very healthy open source project with many contributors. If you are interested in building ‘the future of network programming’, why don’t you visit the project web site, follow @netty_project, jump right into the source code at GitHub or even consider joining the flock to help us improve Netty?

Acknowledgements

The Netty project was founded by Trustin Lee (@trustin), who joined the flock in 2011 to help build Netty 4. We would also like to thank Jeff Pinner (@jpinner) from the TFE team, who contributed many of the great ideas mentioned in this article and became a guinea pig for Netty 4 without hesitation. Furthermore, Norman Maurer (@normanmaurer), one of the core Netty committers, made an enormous effort to help us turn those ideas into an actually shippable piece of code as part of the Netty project. There are also countless individuals who gladly tried a lot of unstable releases, keeping up with all the breaking changes we had to make. In particular, we would like to thank: Berk Demir (@bd), Charles Yang (@cmyang), Evan Meagher (@evanm), Larry Hosken (@lahosken), Sonja Keserovic (@thesonjake), and Stu Hood (@stuhood).