This is the fifth post in our Data Visualization Spotlight series, where we showcase how different organizations are using data visualization and analytics to solve their day-to-day problems.
Known as “the SMS of the Internet”, this 140-character online social networking and microblogging service revolutionized the way we connect with people online. As of September 2013, the company’s data showed that 200 million users send over 400 million tweets daily.
Twitter’s engineers deal with massive data sets daily. To analyze them, they build complex workflows using a variety of tools and languages, including Pig and Scalding. One difficulty many of them face with these tools is visibility: when a Pig script is executed, multiple MapReduce jobs might be launched, either in parallel or serially if one job depends on the output of another. As these jobs run, the status of each individual job can be monitored in the Hadoop Job Tracker UI, but the overall progress of the script is difficult to follow.
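To make the parallel-versus-serial point concrete, here is a minimal Python sketch that groups the jobs of a workflow into "waves": every job in a wave has all of its inputs ready, so the jobs in that wave could run in parallel, while the waves themselves run one after another. The job names and the `execution_waves` helper are hypothetical illustrations, not part of Pig, Hadoop or Ambrose.

```python
def execution_waves(dependencies):
    """Group jobs into waves of jobs whose dependencies are all satisfied.

    `dependencies` maps each job name to the jobs whose output it needs.
    This is an illustrative helper, not how Pig actually plans its jobs.
    """
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    waves = []
    while remaining:
        # Jobs with no unmet dependencies can start now, in parallel.
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency in workflow")
        waves.append(ready)
        for job in ready:
            del remaining[job]
        # Finished jobs no longer block the jobs that depend on them.
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical workflow: two independent loads feed a join, then an aggregate.
deps = {
    "load_a": [],
    "load_b": [],
    "join": ["load_a", "load_b"],
    "aggregate": ["join"],
}
waves = execution_waves(deps)
print(waves)
```

In this toy workflow the two load jobs share a wave, while the join and the aggregate each wait for the wave before them. Each job can be inspected on its own, but the shape of the whole workflow only shows up when the dependencies are viewed together.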
Ambrose, visualizing and monitoring large scale data workflows
Ambrose was born at one of Twitter’s quarterly Hack Weeks. Its creators, Bill Graham and Andy Schlaikjer, wanted a platform that would allow visualization and real-time monitoring of large-scale data workflows.
Ambrose presents a global view of all the MapReduce jobs derived from workflows after planning and optimization. As jobs are submitted for execution on the Hadoop cluster, Ambrose updates its visualization to reflect the latest job status.
Ambrose provides the following in a web UI:
- A workflow progress bar depicting percent completion of the entire workflow
- A table view of all workflow jobs, along with their current state
- A graph diagram which depicts job dependencies and metrics
- Visual weighting of jobs based on resource consumption
- Visual weighting of job dependencies based on data volume
- Script view with line highlighting
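As a rough illustration of the first and fourth items in the list above, the following Python sketch derives a workflow-level completion percentage from per-job progress and scales a visual size for each job by its resource consumption. The `Job` fields, the state names, the `slot_seconds` metric and the averaging rule are all assumptions made for this example; Ambrose's own calculations may differ.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    state: str           # hypothetical states: "pending", "running", "complete", "failed"
    progress: float      # fraction of this job that has finished, 0.0 to 1.0
    slot_seconds: float  # hypothetical resource-consumption metric


def workflow_percent_complete(jobs):
    """Average per-job progress, expressed as a percentage (an assumed rule)."""
    if not jobs:
        return 0.0
    return 100.0 * sum(job.progress for job in jobs) / len(jobs)


def visual_weights(jobs, min_size=4.0, max_size=20.0):
    """Map each job to a display size proportional to its resource use."""
    peak = max((job.slot_seconds for job in jobs), default=0.0)
    weights = {}
    for job in jobs:
        share = job.slot_seconds / peak if peak > 0 else 0.0
        weights[job.job_id] = min_size + share * (max_size - min_size)
    return weights


# Hypothetical snapshot of a three-job workflow.
jobs = [
    Job("job_1", "complete", 1.0, 120.0),
    Job("job_2", "running", 0.5, 60.0),
    Job("job_3", "pending", 0.0, 0.0),
]
percent = workflow_percent_complete(jobs)
sizes = visual_weights(jobs)
print(percent, sizes)
```

Scaling against the heaviest job keeps the sizes within a fixed range, which is one simple way to make expensive jobs stand out without letting them swamp the diagram.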
Fig: In this screenshot, we see the Ambrose UI for a workflow compiled from a single Pig script. The circular chord diagram in the upper left highlights dependencies between jobs. As a job’s status changes, the color of its arc in the diagram changes. Statistics for the job most recently started are displayed to the right of the chord diagram. Summary information and the status of all jobs are displayed in the table beneath these two views. Image Source: blog.twitter.com
Fig: With Ambrose, the real-time status of a complex series of MapReduce jobs can be visualized succinctly, so that users can quickly understand how far computation has progressed and diagnose failures in context. Image Source: github.com/twitter/ambrose
The interface presents multiple responsive “views” of a single workflow. Just beneath the toolbar at the top of the window is a workflow progress bar that tracks overall completion of the workflow. Below the progress bar is a graph diagram that depicts the workflow’s jobs and their dependencies. Below the graph diagram is a table of workflow jobs.
All views react to mouseover and click events on a job, regardless of the view in which the event is triggered. Moving your mouse over the first row of the table will highlight that job’s table row along with the job’s node in the graph diagram. Clicking on a job in any view will select it, updating the highlighting of that job in all views. Clicking again on the same job will deselect it.
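One common way to get this kind of linked behavior is to keep the hovered and selected job in a single shared model that every view subscribes to. The Python sketch below illustrates that pattern only; the `SelectionModel` class and its method names are hypothetical and are not taken from the Ambrose code base.

```python
class SelectionModel:
    """Shared hover/selection state that notifies every subscribed view."""

    def __init__(self):
        self.selected = None
        self.hovered = None
        self._listeners = []

    def subscribe(self, listener):
        # Each view registers a callback taking (selected, hovered).
        self._listeners.append(listener)

    def _notify(self):
        for listener in self._listeners:
            listener(self.selected, self.hovered)

    def hover(self, job_id):
        self.hovered = job_id
        self._notify()

    def click(self, job_id):
        # Clicking the already-selected job deselects it.
        self.selected = None if self.selected == job_id else job_id
        self._notify()


# Two stand-in "views" that simply record the updates they receive.
model = SelectionModel()
table_view = []
graph_view = []
model.subscribe(lambda sel, hov: table_view.append((sel, hov)))
model.subscribe(lambda sel, hov: graph_view.append((sel, hov)))

model.hover("job_1")  # mouse moves over the job in any view
model.click("job_1")  # first click selects
model.click("job_1")  # second click deselects
print(table_view)
```

Because both views read from the same model, it does not matter which one the event came from: each receives the same sequence of updates.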
Because sharing is caring—Going Open Source
Image Source: blog.twitter.com
At the Apache Pig Hackathon held in May 2012, Twitter open-sourced Ambrose. Initially it worked only with Pig, but with contributions from the open source community the framework gained support for other runtimes such as Hive, Cascading and Scalding.
Fig: The open sourced version also included a graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand the Pig scripts. Image Source: Hortonworks
Comprehensive visibility is the first step to managing complex workflows, and Twitter’s data visualization tool Ambrose provides that visibility into jobs. By supplying the right context, it makes it easier for you to plan your jobs properly, monitor progress and diagnose failures in good time.
In the next post of the Data Visualization Spotlight series, read how Airbnb used conditional probability models and data visualization to make its search algorithm more location relevant.