The Hadoop data lake is only as good as the data in the lake and the real value lies in what you do with the data after it gets into the lake.  Obviously then the products & services in the Hadoop ecosystem are tuned to make money AFTER the data lands into the Hadoop cluster. The pricing models of these products are usually tied to amount of storage or more commonly to the number of nodes in the Hadoop cluster. This leaves very little incentive to invest in improving the key barrier to entry and adoption to these solutions ‘ Ingesting data into the Hadoop data lake and distributing results from the data lake to enable business actions

To fill this gap, I have seen customers struggle with data ingestion & distribution solutions built using a hodgepodge of connectors &  open source tools (eg. Sqoop, Flume etc).

Common problems that customers continually struggle with include ‘

1. No centralized visibility  – Given the piecemeal approach there are multiple different connectors & technologies involved in the ingestion process making it extremely opaque. When there are failures, it becomes very hard to pinpoint root cause. Connecting multiple disparate technologies make it very to have an enterprise quality operable solution.

2. Lack of speed ‘ The ingestion solutions are not able to adapt to changes in the input data volumes. This often leads to bottlenecks and failures delaying the ingestion process.

3. Frequent ‘re-runs’ ‘ When the ingestion jobs fail, there is no visibility into exactly where the failure occurred and the solutions also do not save any intermediate results. There is no way to separate failed records, fix them and re-run only the failed records. As a result the entire failed batch has to be ‘re-run’ ‘ often multiple times! This often leads to duplicate data, which then causes additional headaches.

4. Inability to capture change data – Since most of the ingestion solutions were built to run in batch mode, change data has to be ‘accumulated’ over a period of time and then loaded into the data lake as a bulk batch. This leads to lost opportunities to react to the changes at the right time.

Being a Hadoop native streaming analytics company, our business is built around being able to ingest & analyze data, distribute results & insights at scale.  Below are some of the  capabilities that are fundamental requirements of a data ingestion and distribution platform. These are available out-of-the-box in DataTorrent RTS and are working at scale in customer environments

1. Ability to handle variety of input sources & output destinations

This is  a very important pre-requisite for being able to integrate with various components of the data management & collection platforms to ingest data. Also the connectors need to be fault tolerant and support exactly once processing semantics. Out of the 400+ operators in DataTorrent RTS Malhar open source library, 75 are dedicated for input/output

2. Handle extreme fluctuations in input data

There are typically a lot of variations in the size, format and frequency at which data is ingested in Hadoop. Handling these without impacting the throughput & latency SLAs requires the ingestion and analytics platform to auto scale to avoid congestion.  This requires smart auto-scaling capability that can automatically provision new resources (YARN containers) as needed. The ingested data stream needs to be auto partitioned while pinning the multiple operator instances involved in processing each record to each other to form parallel execution pipelines within the same topology.

3. High throughput & low latency handling

Many streaming platforms force users to choose between high throughput (micro-batch) or low latency (record-at-a-time). In reality, this is usually not acceptable as even small latencies on a per record basis add up across hundreds of thousands of records in each step of the streaming ingest application. The DataTorrent RTS platform accomplishes this by an innovative way of processing each event called ‘Streaming Window Event Processing’

4. Enrichment & tagging on the fly

While the data is being ingested, it needs to be enriched & tagged with reference data from various relational or non-relational data sources. Variety of analyses needs to be performed and notifications triggering business action need to be sent. All of these should be made easy by providing out of the box operators. Developers should be able to re-use any of their existing business logic and not have to ‘re-implement’ anything

5. Preserve order

Depending on various criteria, input records might need to be ‘re-ordered or processed in order of arrival over specific time windows. This requires the platform to not just be able to do the reordering but also support multiple application defined time windows of potentially varying length. The ordering can be simply based on the arrival sequence OR often based on the timestamp or some other sequence identified that is embedded within the record.

6. Stateful fault tolerance

In the event of failure of ingestion process or node the streaming analytics platform needs to not just recover the app on another node but also make sure that the application state is restored to the exact point before the failure. This requires the ingestion platform to keep track of the application state.  With or without failure the ingestion platform needs to ensure that there are no duplicates from upstream source. This again needs stateful application management.

7. Non-Stop modifications

Any number of factors can change and impact how a streaming ingest and analytics app runs. The app should not need downtime to change any app properties and thresholds or even if new operators need to be added to a already running app.

8. Simple to build and operate

There should be end-to-end visibility around the number of records/files being ingested through every stage of the ingestion pipeline (including the error stage). Integrability with internal systems using REST APIs & a visual management and monitoring console should be supported. The solution should feel like a single application, if not actually be one.

As an example, here is a snapshot of a real ingestion & distribution application based on DataTorrent RTS



I welcome your feedback and comments on data ingestion into Hadoop.  How are you avoiding the data drought?