Scaling Up Event Ingestion

I was very excited the day I joined DataTorrent. Quite impressed with the real time big data streaming technology the company had been building natively on Hadoop, I was looking forward to play with the platform. My first big assignment was to assess the platform’s scalability using an existing Machine-generated Data Application.

This Application addresses the issue of ingesting a large amount of data in real-time from a variety of sources into Hadoop. Many businesses today are powered by a complex grid of integrated systems, sensors, actuators, monitors, apps, and more – all working 24/7/365 -generating magnanimous amount of data. These devices are deployed by many customers in geographically distributed data centers and offices. The case of machine generated data is a classic one where high velocity meets high variety of data sources, resulting in a high volume of data that makes it very difficult for real-time processing.

For this exercise, I used a grid with 37 nodes each having 256 GB of RAM and 12 hyper-threaded CPU cores. The data set was modeled to resemble real data. In order to test the limits of the platform, I increased the input feeds linearly. The Application scaled pretty well till 150M events per second, beyond which the application was not able to scale. While trying to figure out the reasons for the application’s inability to scale beyond 150M (million) events, I found out that a single operator was processing quite a few output streams which represented output of the upstream operators. As a result, this “Unifier” operator was trying to do a lot in a single thread and thus became bottleneck due to CPU scarcity. Current DAG looked as shown below. Total data emitted by the multiple instances of upstream operator (O1) is much then that can be handled by a single downstream operator (O2) making it bottleneck.


To overcome such scenario, DataTorrent has a feature called “Cascading Unifier” based on the Divide and Conquer policy. As part of this feature, multiple processing operators are deployed, each processing small amount of data. Updated DAG using Cascading Unifier feature looks like following:


Each operator O2 in Layer 1 now operate on the data from 2 upstream operators, thus reducing the load on each and removing the bottleneck. The Operator O2 in Layer 2 operates on the processed data from operators in Layer 1.

By successfully configuring this feature without changing functional code, I was able to scale the Application linearly till 1.6B (billion) events per second as shown in the Scalability Test Chart. At this point, the application used all the available CPU resources in the cluster as expected.