A search on Google with “define real-time processing” yields “Batch processing requires separate programs for input, process and output. An example is payroll and billing systems. In contrast, real time data processing involves a continual input, process and output of data. Data must be processed in a small time period (or near real time)“. Emphasis is by Google Search itself. It’s interesting that Google decides to contrast the real-time with batch processing but (even though tangent to today’s topic) I think Google got it wrong when it cited Data Science Central’s article to define the properties of batch processing. Batch processing simply means that you periodically process the data because it brings in the optimization of the resources through amortization. How you do it is completely orthogonal to what it is. Contrary to Data Science Central’s definition of batch processing, I can use the same program to collect, process, and store (sounds very similar to ETL?), yet it can be batch processing.
Ok, so back on topic. The definition of stream processing is exactly opposite of my definition of batch processing. In stream processing, you do not collect your data to reach certain quorum or timeout before you trigger your process. As soon as the data event is received, the program processes it, and creates the output. It’s event processing. So “real-time” word is somewhat redundant. Yet, a lot of systems do use “real-time” to describe them as low latency systems. Of course nobody can guarantee that the actual processing will be low latency. It’s function of what the process is trying to do (application logic). e.g. an application logic could be a delay loop where each event that it received is output exactly after 1 minute. No platform in the world can guarantee that 1 minute to be smaller than 60 seconds. The low latency refers to the overhead that the system adds to the application outside of the processing the application does.
There are various reasons for the latencies. Some could be coming from underlying hardware (e.g. slower CPU, RAM, NIC, or HDDs), some could be coming from system mandated middleware (Databases, messaging buses, networking protocols), and yet some could be coming from the paradigm the very systems profess.
In this article we look at 2 different paradigms omnipresent for Big Data Real-Time stream processing.
- Tuple at a time ‘ As the name suggests the paradigm is to process the event as it comes. This kind of processing is true to the definition of stream processing. Though it’s very expensive to keep track of state of individual tuple. Typically to ensure the guaranteed processing of individual event, an acknowledgement event (ack) needs to be sent by the consumer and that ack needs to be received by the producer in timely fashion.
- Micro batching ‘ The idea here is to process in the same fashion as the batch processing, but keeping the batch sizes very small. Usually all the events that can be collected within a few milliseconds. So instead of ack for individual event, you send the ack for a batch of events. Although it reduces the expenses of per event ack, you automatically add processing latency which is not added by your application logic.
Amongst the well-known real-time stream frameworks ‘ Storm implements the first paradigm whereas Spark Streaming decided to follow the second one to capitalize on existing in memory map-reduce framework the elder brother Spark has.
Windowed Stream Real-Time Event Processing
DataTorrent’s RTS, though, takes a radically different approach named Windowed Stream Real-Time Event Processing. You may understand it as the best of the 2 paradigms or as a compromise between the 2 paradigms to achieve low latency. Like Storm, DataTorrent defers the batching decisions (which you cannot really avoid sometimes e.g. when you are doing aggregation operation) to the application logic by enabling a tuple at a time processing, but like Spark streaming does the ack once for a bunch of tuples. It’s not really an ack in case of RTS. DataTorrent uses a concept of beacons. These beacons are ultralow latency pseudo-events which are inserted into the stream periodically but are invisible to the application logic that processes the stream. They serve dual purpose to the system. They are primarily used by the system to track the processed events so in case of failure, the recovery starts at the state the application achieves immediately after the most recent beacon before failing event. The beacons also help to notify the aggregating application logic the segments of the stream on which aggregation can be done. Each of these segments is called a “window” and is uniquely identifiable in the stream using a pair of beacons. Beacons are also very useful in many different bookkeeping tasks which aid in recovery, snapshots of the application state, or orderly auto scaling of operators etc.
Let’s use a visual to differentiate the 3 different types of paradigm in case of in order processing of incoming events.
In the above image, the orange dots represent the individual events in the stream. The blue boxes represent the processing.
In case of tuple at a time, you process individual events as and when they come through and on completion of the processing send an ack out. Here the latency for processing each event is the time to send the ack. In another words
Where L(tuplei) and L(ack) are latency to process the ith tuple and latency of an ack respectively. The latency being directly proportional to the number of events severely limits the scalability of this paradigm. That’s why the acks are handled asynchronously. None-the-less however negligible, the per ack latency is unavoidable and creates operational headache elsewhere in the system.
With Micro-Batch paragirm, you collect a few events, process them, and on completion of the processing send an ack out. Here the latency of processing each event in the batch is the sum of the time to send the ack, the time to collect the subsequent events in the batch, and the time the application logic takes to process “all” but the very event in the batch. Astute readers should have noticed that this is very unfortunate. Not only the latency is compounded with the number of events in the batch but also both system and the application logic are contributing to it.
Where Pi is incremental processing time of application logic for processing ith event in the batch, Ci is incremental latency for collecting the ith event, and N is the number of events in the micro-batch.
With Windowed Stream Real-Time Event Processing, the latency to process almost any event is precisely zero. The only event which has non-zero latency is the event that immediately follows the beacon. This is the reason the beacon is optimized aggressively to have very little latency.
Where L(beacon) is latency of the beacon, and i is the sequence number of event after the most recent beacon.
Now that the paradigms are clear, one can easily deduce that when you have to act on the event the very moment it’s received, there is no match to the paradigm DataTorrent RTS implements. Interestingly Windowed Stream Real-Time Event Processing also helps a great deal with scaling. As the number (N) of events in a window increases the average latency (Lavg) for any events in that window goes down proportionately.
This concept is very fundamental to the core of DataTorrent RTS and allows us to write applications which process billions of events every second.
In this small article, we learned how real-time stream processing is different from batch processing, what the different paradigms are for doing the real-time processing and how they influence the latency as well as throughput of the applications. When you are set out to evaluate different technologies, hopefully processing paradigms discussed here will help you differentiate amongst them.
If you are using a paradigm which is different from the ones listed here, or you have feedback on this article ‘ please share it below.