This is the third post in our series on Real-time Data Visualization.
The much-talked-about Twitter IPO is behind us, but all eyes are still on the company to see what it has in the works after going public. In this post, we’ll look at Twitter’s contribution to real-time analytics, one of the hottest topics in big data today and an area in which Twitter has excelled.
When it initially launched, Twitter set the standard for real-time data with its core feature of ‘Trending topics.’ Back then, we were used to seeing trending blog posts surface on Technorati after a few hours. Twitter stepped in and made it possible to view thousands of tweets from just seconds ago. And not only with trending topics: even with its search feature, Twitter touted the real-time mantra, ‘See what’s happening right now.’ We’re aware of the many revolutions, natural disasters, and news stories that broke on Twitter at real-time speed. Not content to rest on that reputation, Twitter continued to innovate in real-time analytics and emerged as a leader in the field.
The Perfect Storm:
In 2010, Twitter was running vertically scaled relational databases at around 90% utilization, and they could no longer manage the load of all those tweets. They were partitioning their databases to split the load across multiple machines, but they’d soon outgrow that too. They had plans to move to the open-source Cassandra database, and needed to scale quickly. Then, in 2011, Twitter acquired BackType and with it a new real-time computation system called Storm. This would become the biggest game changer in Twitter’s story of real-time analytics.
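The database partitioning described above is, in essence, hash-based sharding: each record is routed to one of several machines by hashing its key. Here’s a minimal, illustrative Python sketch of the routing logic (the key format and shard count are hypothetical, not Twitter’s actual scheme):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key, spreading load evenly
    across machines. Note the catch: growing the cluster changes
    num_shards, which remaps most keys -- one reason teams
    eventually outgrow simple partitioning schemes like this."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Records with the same key always land on the same shard.
print(shard_for("user:12345", 4) == shard_for("user:12345", 4))  # True
```

The same-key check above is the property the whole scheme rests on: reads and writes for a given user deterministically hit one machine.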
Nathan Marz, creator of Storm, positioned it as the Hadoop of real-time processing. Here’s an interview in which he describes Hadoop as processing data in large batches all at once, while Storm processes data as it arrives:
In a pre-release teaser, Marz gave use cases for Storm such as streaming the top 50 most influential Twitter users, updated every second. That takes serious computing power, and it ushered in a new paradigm in real-time analytics. What’s more, the sort of continuous computation that Storm enabled required little or no maintenance: jobs, or topologies as they’re called in Storm, could run for months uninterrupted, Marz says in one of his older talks. Twitter open-sourced Storm soon after acquiring BackType, and it has since become an Apache Incubator project.
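That “top 50 every second” use case boils down to continuous, incremental computation: each incoming event updates running state, instead of re-scanning the whole dataset the way a batch job would. A framework-free Python sketch of the idea (the class and names are illustrative, not Storm’s actual API):

```python
import heapq
from collections import Counter

class RollingTopN:
    """Continuously maintain the top-N keys seen in a stream.

    A toy stand-in for what a Storm topology does at scale:
    every event updates state incrementally, so the current
    leaders can be emitted at any moment -- e.g. once a second.
    """

    def __init__(self, n):
        self.n = n
        self.counts = Counter()

    def update(self, key):
        # One event arrives; update state in O(1).
        self.counts[key] += 1

    def top(self):
        # Emit the current leaders on demand.
        return heapq.nlargest(self.n, self.counts.items(),
                              key=lambda kv: kv[1])

# Simulated stream of user mentions
stream = ["alice", "bob", "alice", "carol", "alice", "bob"]
tracker = RollingTopN(2)
for user in stream:
    tracker.update(user)
print(tracker.top())  # [('alice', 3), ('bob', 2)]
```

The contrast with Hadoop is in the loop: a batch job would recompute the counts from scratch over the full history, while here each event costs a constant amount of work.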
How Twitter Uses Storm:
In an interesting post earlier this year, Twitter outlined how they predict the context of trending terms: a real-time computation powered by Storm, combined with human input from Amazon’s Mechanical Turk. Such out-of-the-box thinking is often required when working with real-time data, and Twitter shows us how it’s done.
Storm at Spider.io:
Storm is used not just at Twitter but at many other companies, and its use cases have been documented in detail. One example is Spider.io, an anti-malware company that fights ad spam, particularly in display advertising. They’re in an arms race with large botnets, which requires them to process “billions and billions and billions” of ad impressions every month. They’ve written in detail about their experience with Storm, and even posted an in-depth deck on SlideShare:
They emphasize that implementing something like Storm requires a culture and mindset that isn’t attached to the status quo, but is willing to try an open-source option if it’s the better solution. Another reason they considered Storm was that it fit naturally into their existing architecture. However, as they found out, working with Storm requires an ecosystem of supporting tools like Cascading, Mahout, and Trident. They advise against adopting Storm just because it’s cool.
Storm at Yahoo:
Yahoo is one of the companies doing interesting work with Storm. They use it to stream data like user activity, ad beacons, content feeds, and social feeds. Their Storm engine handles a whopping 130k user events per second. They use this data to personalize users’ homepages with breaking news stories and interest-based content, and even to handle critical tasks like ensuring their ads system doesn’t exceed customers’ budget limits. Given their extensive use of Storm and their deep investment in Hadoop, they’ve created and open-sourced Storm-YARN, which lets Storm clusters play well within the Hadoop environment they’re part of. With it, developers can launch a Storm cluster with a single command. Andrew Feng, an architect on Yahoo’s Hadoop team, discussed the work they do with Storm at this year’s Hadoop Summit.
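The budget-capping task Yahoo describes is a classic streaming problem: aggregate ad spend per advertiser as events arrive, and stop serving the instant a cap is hit rather than discovering the overrun in a nightly batch job. A framework-free Python sketch of that logic (the names and numbers are illustrative, not Yahoo’s implementation):

```python
from collections import defaultdict

class BudgetEnforcer:
    """Track per-advertiser spend over a stream of impression
    events and report whether each advertiser may keep serving.
    Amounts are in integer cents to avoid float rounding."""

    def __init__(self, budgets):
        self.budgets = budgets           # advertiser -> budget cap
        self.spent = defaultdict(int)    # advertiser -> running spend

    def record(self, advertiser, cost):
        # One impression event arrives; update running spend.
        self.spent[advertiser] += cost

    def can_serve(self, advertiser):
        # Checked in real time before serving the next ad.
        return self.spent[advertiser] < self.budgets[advertiser]

enforcer = BudgetEnforcer({"acme": 100})  # $1.00 cap
enforcer.record("acme", 40)
print(enforcer.can_serve("acme"))  # True: 40 of 100 cents spent
enforcer.record("acme", 60)
print(enforcer.can_serve("acme"))  # False: cap reached
```

In a real deployment this state would be partitioned across a cluster and updated by the stream processor; the point here is only that the decision happens per event, not per batch.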
Hortonworks’ Commitment to Storm:
Picking up on Yahoo’s lead, Hortonworks is committing even more resources to developing Storm and making it part of the Hortonworks Data Platform starting in 2014. In a post last month, they mention that “one of the most common use cases that we see emerging from our customers is the antithesis of batch: stream processing in Hadoop.” Here’s their 2014 roadmap for Storm:
With Hortonworks being one of the trendsetters in the Hadoop ecosystem, this move should drive much wider adoption of Storm. While Storm is set to take off in the open-source world, Twitter is on a different journey.
What’s Ahead for Twitter:
With a rich history in real-time analytics, Twitter will need to build on that legacy and continue innovating. We’ve seen promising signs of that with Twitter’s acquisitions in 2013:
MoPub – Helps publishers manage their ad inventory. The MoPub Marketplace is a real-time bidding exchange for advertisers.
These acquisitions are strategically aimed at bolstering Twitter’s fledgling ads platform with real-time capabilities. The future is bright indeed for Twitter, and deservedly so for a champion of real-time analytics.
Stay tuned for the upcoming posts.
Tip: Get our whitepaper ‘The Ultimate Guide to Real-time Data Visualization’ to read more on the topic.