Helping your customers learn more about their audience in a privacy-forward manner

Wednesday, 2 December 2015

Brands are always on the lookout for ways to better understand the people that matter most to them: their customers and prospects. Businesses that truly know the makeup of these audiences and their interests are able to make smarter and more strategic marketing decisions. Gnip’s primary customers — the software and analytics providers that serve these brands — are always searching for new ways to help their clients better understand their unique audiences.

In response to this need, we recently announced the beta availability of our new Gnip Audience API, which delivers aggregate information about custom-defined groups of Twitter users. As a member of the engineering team who helped create it, I’m especially excited about how it helps our customers’ clients explore audiences of Twitter users across many demographic models, all in a privacy-forward manner.

I had the opportunity to speak at Flight 2015 about the design and technical considerations we weighed in order to bring this product to life. The top three of these considerations were:

It had to protect user privacy
It needed to be fast and support sub-second queries for audience demographics
It needed to scale to huge audiences without affecting query return times

Designing such an API was challenging, to say the least. The baseline for comparison was a “brute-force” approach where, for a given audience and set of demographic models, the API would count users matching the demographics. While this provided a good data point for comparing our work with alternative approaches, it just was not a viable option for our team.

Exact demographic counts are expensive to compute and raise potential privacy issues, so we turned our attention to approximating counts using statistical sampling. If an audience could be randomly sampled, yet remain representative, then a query against the much smaller set of Twitter users would require fewer system resources and return results much faster. Additionally, statistical theory provided formalized error bounds, allowing us to control the accuracy of results. Have you heard about HyperLogLog? If not, be sure to watch the video of my presentation to learn more about it.
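To make the trade-off concrete, here is a minimal sketch in Python that compares a brute-force count with a sampled estimate. It is an illustration only, not the Audience API's actual implementation: the function names, the synthetic audience, the `likes_sports` attribute, and the choice of a normal-approximation confidence interval are all assumptions made for this example.

```python
import math
import random


def exact_share(audience, predicate):
    """Brute-force baseline: inspect every user in the audience."""
    return sum(1 for user in audience if predicate(user)) / len(audience)


def sampled_share(audience, predicate, sample_size, z=1.96, seed=None):
    """Estimate the same share from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of a
    normal-approximation confidence interval (z=1.96 is roughly 95%).
    """
    rng = random.Random(seed)
    n = min(sample_size, len(audience))
    sample = rng.sample(audience, n)  # sampling without replacement
    p = sum(1 for user in sample if predicate(user)) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


def likes_sports(user):
    return user["likes_sports"]


# Hypothetical audience: about 30% of users carry a made-up attribute.
data_rng = random.Random(0)
audience = [
    {"id": i, "likes_sports": data_rng.random() < 0.3}
    for i in range(200_000)
]

exact = exact_share(audience, likes_sports)
estimate, margin = sampled_share(
    audience, likes_sports, sample_size=10_000, seed=1
)
print(f"exact share:   {exact:.4f}")
print(f"sampled share: {estimate:.4f} +/- {margin:.4f}")
```

In this sketch the margin depends on the sample size rather than on the size of the audience, which is why a sampled query can stay fast as audiences grow: 10,000 sampled users give a margin of under one percentage point whether the audience holds two hundred thousand users or many millions.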

After a lot of exploration and testing, we chose statistical sampling as the basis for the Audience API’s architecture as it balanced simplicity, accuracy, query latency, and scalability with maintaining user privacy.

Our close attention to user privacy also required that we ensure that the aggregate demographic and interest information returned about brand audiences couldn’t be reverse-engineered to tease out user-specific data. When building the Audience API, we considered several malicious attack vectors, including set-balancing, side-channel and homogeneous group search methods. My presentation explains what these issues are and how to design a product against them.

We see the architectural benefits of our design decisions every day, and customers of the Audience API are creating solutions that allow their clients to explore their audiences and retrieve valuable insights like never before. To learn more about the Audience API and how your business can build new brand-focused solutions on top of it, I again invite you to watch the above video from my Data Track Session from Twitter Flight 2015. If you’d also like to learn how you can join the Audience API beta program, contact your account manager or reach out to us at [email protected].