The Hadoop data lake is only as good as the data in it. Given the variety of data sources that need to feed Hadoop, customers often end up building one-off ingestion jobs. These jobs copy files over FTP or NFS mounts, or rely on standalone tools like DistCp, to move data in and out of Hadoop. Because they stitch together multiple tools, they run into problems with manageability, failure recovery and the ability to scale under data skew. DataTorrent dtIngest solves this problem: it is the industry's first unified stream and batch data ingestion application for Hadoop.

What sets dtIngest apart

DataTorrent dtIngest is built for data stewards in the enterprise. It aims to make configuring and running Hadoop data ingestion and data distribution pipelines a point-and-click process, and it includes several enterprise-grade features not available in the market today:

Based on the Apache 2.0 open-source Project Apex: Built on Project Apex, dtIngest is a native YARN application. It is fully fault tolerant, so unlike tools such as DistCp it can resume a file ingest after a failure. It is horizontally scalable and supports very high throughput, low latency data ingest.

Simple to use and manage: A point-and-click user interface makes it easy to configure, save and launch multiple data ingestion and distribution pipelines. Integration with dtManage provides centralized management, visibility, monitoring and summary logs.

Batch as well as stream data: dtIngest supports moving data between NFS, (s)FTP, HDFS, AWS S3n, Kafka and JMS, so you can use one platform to exchange data across multiple endpoints.

HDFS small-file ingest using compaction: Small files can be automatically compacted into large files during ingest into HDFS (configurable), which helps prevent exhausting the HDFS NameNode namespace.

Secure and efficient data movement: Supports compression and encryption during ingest, and works with Kerberos-enabled secure Hadoop clusters.

Runs on any Hadoop 2.0 cluster: Certified to run on all major Hadoop distributions in physical, virtual or cloud deployments.

Some sample applications of dtIngest

  • Bulk or incremental data loading of large as well as small files into Hadoop
  • Distributing cleansed/normalized data from Hadoop
  • Ingesting change data from Kafka/JMS into Hadoop
  • Selectively replicating data from one Hadoop cluster to another
  • Ingesting streaming event data into Hadoop
  • Replaying log data stored in HDFS as Kafka/JMS streams

Using dtIngest

  1. dtIngest is free to use with Project Apex and all DataTorrent editions. It is available as the application package named ‘dtIngest’.
  2. dtIngest is designed to move data between any of the supported sources and destinations; just pick and configure the ones you need. When moving data from file-based sources, you can also choose whether the pipeline runs one time or continuously polls the input directories for files that match the filtering criteria (a simple polling loop is sketched below for illustration).

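To make the continuous-polling behavior concrete, here is a minimal sketch of what such a file source does conceptually, written against the plain Hadoop FileSystem API: list the input directory on a schedule, pick up files that match a filter and have not been seen before, and hand them to the pipeline. The class name, filter pattern and `ingest` method are made up for illustration; they are not dtIngest APIs.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: the general shape of a "continuous" file source. */
public class DirectoryPoller {
  private final Set<String> seen = new HashSet<>();           // files already picked up
  private final Pattern filter = Pattern.compile(".*\\.csv"); // hypothetical filter criteria

  public void poll(FileSystem fs, Path inputDir) throws Exception {
    while (true) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        String name = status.getPath().getName();
        if (filter.matcher(name).matches() && seen.add(name)) {
          ingest(status.getPath());                           // hand the new file to the pipeline
        }
      }
      Thread.sleep(10_000);                                   // scan again every 10 seconds
    }
  }

  private void ingest(Path file) {
    System.out.println("picked up " + file);                  // placeholder for real ingestion
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                 // picks up HDFS settings from the classpath
    FileSystem fs = FileSystem.get(conf);
    new DirectoryPoller().poll(fs, new Path(args[0]));
  }
}
```

In ‘one time’ mode the same scan would simply run once instead of looping.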

  3. When copying data to file-based destinations, several useful options are available: keeping the source directory structure, overwriting files at the destination, and automatically creating an hourly directory structure that tracks when data was written at the output.

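As a rough illustration of what an hourly output layout looks like (the exact directory naming dtIngest uses is configured in the UI and is not shown here), the destination path can simply encode the hour in which the data was written:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

/** Illustrative only: one output sub-directory per hour in which data arrives. */
public class HourlyDirSketch {
  public static void main(String[] args) {
    SimpleDateFormat hourly = new SimpleDateFormat("yyyy/MM/dd/HH");
    // e.g. /data/output/2015/07/29/22 for data written during the 22:00 hour
    System.out.println("/data/output/" + hourly.format(new Date()));
  }
}
```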

  4. Before saving the data into the destination location, it can be compressed as well as encrypted.

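Conceptually, compression and encryption are stream wrappers applied before the bytes reach the destination. The sketch below shows the idea with standard JDK classes (GZIP plus AES); dtIngest's actual codecs, cipher modes and key handling are configured through the UI, so the algorithm choice and the hard-coded key here are placeholders only.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: compress first, then encrypt, on the way to the destination. */
public class CompressEncryptSketch {
  public static void main(String[] args) throws Exception {
    byte[] key = "0123456789abcdef".getBytes(); // placeholder 128-bit key; use real key management in practice
    Cipher cipher = Cipher.getInstance("AES");  // default mode; pick an authenticated mode in production
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));

    try (OutputStream out =
             new GZIPOutputStream(              // compresses the record bytes...
                 new CipherOutputStream(        // ...then encrypts the compressed stream
                     new FileOutputStream("part-00000.gz.aes"), cipher))) {
      out.write("example record\n".getBytes());
    }
  }
}
```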

  5. When ingesting data into HDFS, small files can be combined into large files to help work around the NameNode namespace restrictions.

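To see why compaction helps, recall that the NameNode keeps an in-memory entry for every file, so millions of tiny files exhaust its namespace long before the disks fill up. The sketch below shows the core idea, concatenating a directory of small files into one large HDFS file with the plain FileSystem API; the paths and the `compact` method are invented for the example, and this is not dtIngest's actual implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Illustrative only: many small files become one large file (and one NameNode entry). */
public class CompactionSketch {
  public static void compact(FileSystem fs, Path smallFilesDir, Path bigFile) throws Exception {
    try (FSDataOutputStream out = fs.create(bigFile)) {
      for (FileStatus status : fs.listStatus(smallFilesDir)) {
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.copyBytes(in, out, 4096, false);   // append this small file's bytes
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    compact(fs, new Path("/data/incoming"), new Path("/data/compacted/part-00000"));
  }
}
```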

  6. After configuring the ingestion application, hitting the ‘Launch’ button automatically provisions the requisite connectors (Kafka, JMS, HDFS, etc.), instantiates the right operators for compression, encryption and so on, and connects everything into the application DAG (Directed Acyclic Graph).

The screenshot below shows a sample DAG generated to read files from FTP and ingest them into HDFS.

[Screenshot: the generated DAG for an FTP-to-HDFS ingestion pipeline]
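For readers who prefer code to screenshots, the fragment below is a bare-bones Project Apex application showing how such a DAG is expressed: operators are added, then wired together with streams. The two operators are trivial stand-ins (a record generator and a console writer) rather than the actual FTP reader and HDFS writer that dtIngest instantiates, so treat this as a sketch of the DAG mechanics only.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;

/** Illustrative only: how an Apex DAG wires a source operator to a sink operator. */
@ApplicationAnnotation(name = "IngestSketch")
public class IngestSketch implements StreamingApplication {

  /** Stand-in for a file/FTP reader: emits one dummy record per call. */
  public static class Reader extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

    @Override
    public void emitTuples() {
      output.emit("example record");
    }
  }

  /** Stand-in for an HDFS writer: just prints whatever it receives. */
  public static class Writer extends BaseOperator {
    public final transient DefaultInputPort<String> input = new DefaultInputPort<String>() {
      @Override
      public void process(String tuple) {
        System.out.println(tuple);
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    Reader reader = dag.addOperator("Reader", new Reader());
    Writer writer = dag.addOperator("Writer", new Writer());
    dag.addStream("records", reader.output, writer.input);   // the edge in the DAG
  }
}
```

dtIngest assembles the equivalent DAG for you from the saved configuration when you hit ‘Launch’.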

  7. Once the application is launched, all the metrics around throughput, latency and so on that are available for any DataTorrent application are also available for dtIngest, since it is managed with dtManage (the management platform for Project Apex and DataTorrent RTS).
  8. Summary logs from the application are available under the ‘summary’ folder in the HDFS directory dedicated to the application:

/user/&lt;username&gt;/datatorrents/apps/APP_ID/summary

On the sandbox, you can substitute `dtadmin` for the &lt;username&gt;.
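If you want to pull the summaries programmatically instead of browsing HDFS by hand, a plain FileSystem listing of that directory is enough. The snippet below assumes the sandbox's `dtadmin` user and takes the application id as an argument; adjust the path for your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: list the summary logs written for a given application id. */
public class ListSummaries {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path summaryDir = new Path("/user/dtadmin/datatorrents/apps/" + args[0] + "/summary");
    for (FileStatus status : fs.listStatus(summaryDir)) {
      System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
    }
  }
}
```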

To understand how dtIngest accomplishes all the capabilities mentioned above, visit this blog post.

dtIngest will form the foundation of all streaming and batch processing applications based on Project Apex and DataTorrent RTS. We have exciting plans to evolve dtIngest with even more data sources, destinations and computational modules as we build more applications on top of it!