This is the second post in our series on real-time data visualization.
Last week, we looked at how we got from relational databases to big data and real-time analytics. This week, we’re taking a deep dive into how a real-time business intelligence system works. If you’ve used a real-time dashboard before, or are planning to build one in the future, this post can serve as a primer to help you understand what happens behind the scenes, and how the real-time data reaches your dashboard.
Despite the extremely short end-to-end duration, there are four broad steps to how data is visualized in real time. Here’s an illustration of each of these steps, which we’ll discuss further below:
1. Streaming data is captured
Live streaming data is captured using scrapers, collectors, agents, and listeners, and is stored in a database. This database is usually a NoSQL database like Cassandra or MongoDB, or sometimes even Hadoop’s Hive. Relational databases are not suited for this sort of high-performance analytics, and the rise of NoSQL databases is key to enabling real-time analytics today.
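To make the capture step concrete, here’s a minimal sketch of a listener that buffers incoming events and flushes them in batches. The `InMemoryStore` is a hypothetical stand-in for a NoSQL write (e.g. MongoDB’s or Cassandra’s batch insert); the event shape and class names are illustrative, not from any particular library.

```python
import json
import time
from collections import deque

class StreamListener:
    """Buffers raw events and flushes them in batches, standing in
    for a batched write to a NoSQL store such as MongoDB or Cassandra."""

    def __init__(self, store, batch_size=100):
        self.store = store          # any object with an insert_many(docs) method
        self.batch_size = batch_size
        self.buffer = deque()

    def on_event(self, raw_event):
        # Tag each event with an ingestion timestamp before persisting it.
        event = {"received_at": time.time(), "payload": json.loads(raw_event)}
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.store.insert_many(list(self.buffer))
            self.buffer.clear()

# A trivial in-memory stand-in for the NoSQL database.
class InMemoryStore:
    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)

store = InMemoryStore()
listener = StreamListener(store, batch_size=2)
listener.on_event('{"symbol": "ACME", "price": 101.5}')
listener.on_event('{"symbol": "ACME", "price": 101.7}')  # fills the batch, triggers a flush
```

Batching the writes like this is what lets the capture layer keep up with high event rates: the database sees a few large inserts instead of thousands of tiny ones.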
2. The data is stream processed
The streaming data is processed in various ways: splitting it, merging it, performing calculations on it, and joining it with outside data sources. This is done by a fault-tolerant, distributed stream-processing system like Apache Storm. Hadoop, the most common big data processing framework, is not ideal for real-time analytics due to its dependency on MapReduce’s batch-oriented processing. However, Hadoop 2.0 allows other processing engines to run in place of MapReduce, which opens up the possibility of Hadoop being used in real-time systems going forward. After processing, the data is ready to be read by the visualization component.
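As one small example of the “calculations” part of stream processing, here’s a rolling average computed over a stream of readings. A real deployment would run logic like this inside a Storm bolt or similar; this is just an illustrative, self-contained sketch.

```python
from collections import deque

def rolling_average(events, window=5):
    """Yields a moving average over a stream of numeric readings,
    emitting one smoothed value per incoming event."""
    buf = deque(maxlen=window)  # only the last `window` readings are kept
    for value in events:
        buf.append(value)
        yield sum(buf) / len(buf)

# Smooth a short stream of prices with a 2-event window.
prices = [100.0, 102.0, 104.0, 103.0]
averages = list(rolling_average(prices, window=2))
# averages[1] is the mean of 100.0 and 102.0, i.e. 101.0
```

Because it’s a generator, the function processes events one at a time as they arrive, which is the essential property of stream processing: results are emitted continuously rather than after a batch completes.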
3. The processed data is read by the visualization component
The processed data is stored in a structured format, like JSON or XML, in the NoSQL database. From there, it’s read by the visualization component. In most cases, this is a charting library embedded in an internal BI system, or part of a broader visualization platform like Tableau. The frequency at which processed data is refreshed in the JSON or XML file is termed the update interval.
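A sketch of this read step, assuming a hypothetical JSON document shape (`points` with `t`/`v` fields): the visualization component parses the stored document into the label and value series a charting library typically expects.

```python
import json

def read_processed_data(document):
    """Parses a processed JSON document (as stored in the NoSQL layer)
    into parallel label/value lists for a charting library."""
    record = json.loads(document)
    labels = [point["t"] for point in record["points"]]
    values = [point["v"] for point in record["points"]]
    return labels, values

# Hypothetical processed document; the update interval here would be 5 seconds.
doc = ('{"update_interval": 5, "points": ['
       '{"t": "09:30", "v": 101.0}, {"t": "09:31", "v": 101.4}]}')
labels, values = read_processed_data(doc)
```

The exact field names vary by system, but the shape of the work is the same: turn the stored structured format into whatever series the chart’s API consumes.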
4. The visualization component updates the real-time dashboard
The visualization component then reads the data from the structured data file (JSON/XML) and draws a chart, gauge, or other visualization in the reporting interface. The frequency at which processed data is drawn on the client side is called the refresh interval. In some applications, such as stock trading platforms, pre-set rules are triggered by the streaming data alongside the chart rendering.
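Those pre-set rules can be sketched as simple threshold checks evaluated against each newly drawn data point. The rule format and function name below are illustrative assumptions, not from any particular trading system.

```python
def check_rules(latest_value, rules):
    """Evaluates pre-set threshold rules against the newest data point,
    the way a trading dashboard might fire alerts while redrawing a chart."""
    fired = []
    for rule in rules:
        if rule["condition"] == "above" and latest_value > rule["threshold"]:
            fired.append(rule["name"])
        elif rule["condition"] == "below" and latest_value < rule["threshold"]:
            fired.append(rule["name"])
    return fired

# Two hypothetical rules on a stock price stream.
rules = [
    {"name": "sell-signal", "condition": "above", "threshold": 105.0},
    {"name": "buy-signal", "condition": "below", "threshold": 95.0},
]
alerts = check_rules(106.2, rules)  # evaluate against the newest streamed price
```

In a real dashboard this check would run once per refresh interval, so alerts fire within the same seconds-or-milliseconds window as the chart update itself.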
While this all may sound complex, what’s amazing is that the entire process takes place in seconds, or even milliseconds. This is possible because of advances in database technology, particularly NoSQL databases. It’s further helped by capable stream-processing tools like Storm, which are built specifically for real-time workloads. Additionally, visualization tools have matured to support these demanding scenarios, bringing together a whole ecosystem that enables real-time analytics in today’s big data applications.
P.S. – If you found this interesting, I recommend you get the white paper on which this series is based. It’ll allow you to read through the entire topic at once rather than in parts.