The Hadoop data lake is only as good as the data in it. Given the variety of data sources that need to feed Hadoop, customers often end up setting up one-off data ingestion jobs. These jobs copy files over FTP and NFS mounts or rely on standalone tools such as DistCp to move data in and out of Hadoop. Because they stitch together multiple tools, they run into problems with manageability, failure recovery, and the ability to scale to handle data skew. DataTorrent's data ingestion, enrichment, and transformation data services solve this problem: they are the industry's first unified stream and batch data ingestion services for Hadoop and the cloud.

What sets ingestion data services apart

DataTorrent ingestion data services are built for data stewards in the enterprise. They aim to make the job of configuring and running Hadoop data ingestion and data distribution pipelines easy by lowering total cost of ownership and time to value. Our customers have been able to go to production in under a month. The services include several enterprise-grade features not available in the market today:

Based on Apache 2.0 open-source Project Apex – Built on Project Apex, the ingestion data service is a native YARN application. It is completely fault tolerant, so unlike tools such as DistCp it can resume file ingest after a failure. It is horizontally scalable and supports extremely high-throughput, low-latency data ingest.
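
For readers who have not seen Project Apex code, here is a minimal sketch of a native-YARN Apex application that reads lines from files in a directory, using operators from the open-source Apex Malhar library. The operator names, directory path, and console sink are illustrative; this is not the actual DataTorrent ingestion application, but it shows the shape of an Apex DAG whose operator state (including file offsets) the platform checkpoints so processing can resume after a failure.

    // Minimal sketch of an Apex DAG: read files line by line, send lines to a sink.
    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;
    import com.datatorrent.lib.io.ConsoleOutputOperator;
    import com.datatorrent.lib.io.fs.LineByLineFileInputOperator;

    @ApplicationAnnotation(name = "SimpleFileIngest")
    public class SimpleFileIngestApp implements StreamingApplication
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        // Reads files from an input directory; operator state is checkpointed
        // by the platform, so ingest resumes after a failure. Path is illustrative.
        LineByLineFileInputOperator reader =
            dag.addOperator("FileReader", new LineByLineFileInputOperator());
        reader.setDirectory("/user/ingest/input");

        // Stand-in sink; a real pipeline would write to HDFS, S3, Kafka, etc.
        ConsoleOutputOperator writer = dag.addOperator("Writer", new ConsoleOutputOperator());

        dag.addStream("lines", reader.output, writer.input);
      }
    }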

Simple to use & manage – The ingestion and copy data services available in the DataTorrent AppFactory are easy to configure. You can also launch multiple data ingestion and distribution pipelines very quickly with minimal changes, as in the sample configuration below. Centralized management, visibility, and monitoring help reduce support costs.
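
To give a flavor of what "minimal changes" means, the snippet below shows how a launch-time properties file could parameterize one pipeline's endpoints following the standard Apex dt.operator.<name>.prop.<property> convention. The operator names and values are hypothetical, not the product's exact configuration.

    # Hypothetical launch-time properties for one ingestion pipeline.
    dt.operator.FileReader.prop.directory=/data/incoming/orders
    dt.operator.FileWriter.prop.filePath=hdfs:///warehouse/raw/orders

    # A second pipeline could be launched from the same package by
    # overriding just the endpoints:
    # dt.operator.FileReader.prop.directory=/data/incoming/clicks
    # dt.operator.FileWriter.prop.filePath=hdfs:///warehouse/raw/clicks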

Batch as well as stream data – The ingestion data services support moving data between NFS, (s)FTP, HDFS, AWS S3n, Kafka, and JMS, so you can use one platform to exchange data across multiple endpoints instead of hand-rolled copy jobs like the one sketched below.
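
For contrast, here is roughly what a hand-rolled, one-off copy between two of those endpoints looks like with the plain Hadoop FileSystem API. The bucket and path names are made up, and nothing in it resumes, parallelizes, or monitors the transfer.

    // One-off S3-to-HDFS copy using the plain Hadoop FileSystem API.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class S3ToHdfsCopy
    {
      public static void main(String[] args) throws Exception
      {
        Configuration conf = new Configuration();
        // Credentials would normally come from core-site.xml, not be hard-coded.
        FileSystem s3 = FileSystem.get(URI.create("s3n://example-bucket/"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);

        // Copies a single object with no resume, no parallelism, and no monitoring.
        FileUtil.copy(s3, new Path("s3n://example-bucket/logs/2016-01-01.gz"),
                      hdfs, new Path("/data/raw/logs/2016-01-01.gz"),
                      false /* deleteSource */, conf);
      }
    }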

HDFS small file ingest using compaction – Small files are automatically compacted into large files during ingest into HDFS, with configurable settings. Because the namenode keeps metadata for every file in memory, millions of small files can exhaust its namespace; compaction prevents this by storing many small files as a single large one.
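
Conceptually, compaction looks like the sketch below, which streams many small HDFS files into one large output file. This is an illustration of the idea rather than DataTorrent's actual implementation, and all paths are made up.

    // Conceptual small-file compaction: many small files -> one large file.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallFileCompactor
    {
      public static void main(String[] args) throws IOException
      {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (FSDataOutputStream out = fs.create(new Path("/data/compacted/part-000"))) {
          for (FileStatus status : fs.listStatus(new Path("/data/small-files"))) {
            if (!status.isFile()) {
              continue;
            }
            try (FSDataInputStream in = fs.open(status.getPath())) {
              // Append each small file's bytes to the single large output file,
              // so only one namenode entry is consumed for the whole batch.
              IOUtils.copyBytes(in, out, conf, false);
            }
          }
        }
      }
    }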

Secure and efficient data movement – Data can be compressed and encrypted in flight during ingest, and the services work with Kerberos-enabled secure Hadoop clusters.
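
As a rough illustration of in-flight compression and encryption (not the product's implementation), the sketch below gzips and AES-encrypts a stream before it lands in HDFS. The algorithm choice, key handling, and paths are simplified for brevity; a real deployment would use a key management service and an authenticated cipher mode.

    // Compress, then encrypt, a stream on its way into HDFS.
    import java.io.InputStream;
    import java.util.zip.GZIPOutputStream;

    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SecureIngest
    {
      public static void writeSecure(InputStream source) throws Exception
      {
        // Simplified key handling: generate a throwaway AES key for the example.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();

        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Bytes are gzip-compressed first, then encrypted, then written to HDFS.
        try (GZIPOutputStream out = new GZIPOutputStream(
            new CipherOutputStream(fs.create(new Path("/data/secure/part-000.gz.aes")), cipher))) {
          IOUtils.copyBytes(source, out, conf, false);
        }
      }
    }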

Runs in any Hadoop 2.0 cluster – Certified to run on all major Hadoop distributions in physical, virtual, or cloud deployments.

Some sample applications of ingestion data services

  • Bulk or incremental data loading of large as well as small files into Hadoop
  • Distributing cleansed/normalized data from Hadoop
  • Ingesting change data from Kafka/JMS into Hadoop
  • Selectively replicating data from one Hadoop cluster to another
  • Ingesting streaming event data into Hadoop
  • Replaying log data stored in HDFS as Kafka/JMS streams (see the sketch after this list)
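
To make the last use case concrete, here is a bare-bones sketch of replaying an HDFS log file as a Kafka stream using the standard Kafka producer client. The broker address, topic, and file path are made up, and a production pipeline would add batching, partitioning, and fault tolerance on top.

    // Replay lines from an HDFS file as events on a Kafka topic.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class HdfsLogReplayer
    {
      public static void main(String[] args) throws Exception
      {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        FileSystem fs = FileSystem.get(new Configuration());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(fs.open(new Path("/data/logs/app.log"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            // Each archived log line becomes one event on the replay topic.
            producer.send(new ProducerRecord<>("log-replay", line));
          }
        }
      }
    }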

To understand how DataTorrent data services accomplish all the capabilities mentioned above, visit this blog post. These data services in the DataTorrent AppFactory will form the foundation of all streaming and batch processing applications based on DataTorrent RTS. We have exciting plans to add more sources and sinks to our data ingestion and copy data services, and we will be adding more data-in-motion analytics and cloud capabilities to them as well.