
Today DataTorrent is excited to announce a new project that marries the fault-tolerant, high-performance, scalable Kafka messaging system with the power and flexibility of Hadoop 2.x, specifically YARN. The project is called KOYA – Kafka On YARN (see the Kafka JIRA ticket).

The Hadoop stack has become an integral part of data flow within enterprises, and is the future reference architecture for big data. At a high level, this can be viewed as data being collected from sources, ingested into Hadoop, processed within Hadoop, and then loaded back to external systems. YARN (Hadoop 2.x) has freed Hadoop from MapReduce and opened clusters to more data processing paradigms, emerging as the de facto distributed operating system for big data. The pillars on which Hadoop delivers big data remain the same: linear scalability, partitionable paradigms, massively distributed resource utilization on commodity hardware, high performance, fault tolerance, security, and so on.

At DataTorrent we envision a YARN eco-system consisting of distributed big data applications. To connect to external systems, this eco-system needs a message bus built on the same pillars as Hadoop. An additional requirement is that it should enable data replay. Apache Kafka has all of these pillars and is very complementary to Hadoop. It has strong adoption within the big data community, which we believe will enable Kafka to be readily accepted by the YARN community. Figure 1 shows a common high-level Kafka architecture diagram.

Figure 1: High-level Kafka architecture

For adoption into the YARN eco-system, Kafka must function as a native YARN application. It needs to leverage all features of YARN as a distributed operating system. This includes operations, cluster management, resource prioritization, and security. With the aim of making Kafka a native YARN application, we have a working prototype that we are reviewing with our customers. We intend to experiment with and solidify this code to bring it to alpha quality, with an expected release to the open source community sometime in Q2 2015.

KOYA is a YARN application that launches Kafka within YARN. It then manages resource negotiation with the Resource Manager and ensures that Kafka operates in a YARN-native way. To an external publisher or subscriber, KOYA would look no different from Kafka, since the same Kafka code is being run as a YARN application. The architecture is shown in the figure below.

KOYA - Apache Kafka on YARN

DataTorrent has made an all-in bet on YARN from an architectural perspective. When we started 2.5 years ago, we foresaw Hadoop 2.x YARN as the de facto distributed operating system for big data, with a slew of YARN-native applications. This eco-system would provide tremendous value to enterprises because, among other things, it does away with the need to run a cluster per application. Today we know our vision has been validated.

Over the past 2.5 years we have developed expertise in YARN that we want to leverage to help the big data community. DataTorrent RTS, our real-time streaming application platform for ingestion and streaming analytics, has already been well received by the YARN community. Additionally, by open sourcing the more than 450 Java-based operators and UI widgets in our Malhar library, we have enabled quick development of applications. As part of our continuing efforts to enrich the YARN eco-system, we are announcing KOYA as a full open-source project.

Our initial goals for KOYA are as follows:

  1. Run Kafka natively in YARN
  2. Leverage YARN for Kafka broker management
  3. Automate management tasks as an alternative to Kafka command line utilities
  4. Automate broker recovery
  5. Make it very easy to operate Kafka clusters within YARN
  6. Ensure that Kafka code runs as is, i.e. with minimal or no changes to the internals of Kafka

We plan to work on the following features:

  1. KOYA application master with full HA support
  2. Policies to deal with repeat broker failure
  3. Sticky allocation of containers and full automatic recovery from outages
  4. Ability to create on-demand Kafka clusters
  5. Admin web service for Kafka metrics and details
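
To make items 2 and 3 more concrete, here is a minimal sketch of what a repeat-failure policy combined with sticky allocation could look like. It is purely illustrative and is not KOYA code: the class `RepeatFailurePolicy`, its `Action` values, and the threshold and window parameters are all hypothetical names chosen for this example, and it deliberately uses no YARN or Kafka APIs. The idea is that a broker that fails occasionally is restarted on the same host (keeping its on-disk logs nearby), while a broker that fails too often within a time window is relocated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a repeat-failure policy. Not part of KOYA,
 * Kafka, or YARN; all names here are assumptions for illustration.
 */
final class RepeatFailurePolicy {

    /** What the application master could do after a broker failure. */
    enum Action { RESTART_ON_SAME_HOST, RELOCATE }

    private final int maxFailures;     // failures tolerated within the window
    private final long windowMillis;   // length of the sliding window
    private final Map<Integer, Deque<Long>> failuresByBroker = new HashMap<>();

    RepeatFailurePolicy(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /**
     * Records a failure of the given broker at the given time and
     * returns the suggested action.
     */
    Action onBrokerFailure(int brokerId, long nowMillis) {
        Deque<Long> failures =
            failuresByBroker.computeIfAbsent(brokerId, id -> new ArrayDeque<>());

        // Forget failures that fall outside the sliding window.
        while (!failures.isEmpty()
                && nowMillis - failures.peekFirst() > windowMillis) {
            failures.pollFirst();
        }
        failures.addLast(nowMillis);

        // Sticky by default; give up on the host only after repeated failures.
        return failures.size() > maxFailures
            ? Action.RELOCATE
            : Action.RESTART_ON_SAME_HOST;
    }
}
```

With `new RepeatFailurePolicy(2, 1000L)`, the first two failures of a broker within one second would suggest a restart on the same host and the third would suggest relocation. A real implementation would also need to translate these suggestions into container requests to the Resource Manager, which this sketch leaves out.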

We invite members of the Kafka and YARN communities to work with us to make KOYA a success (Apache JIRA KAFKA-1754). We anticipate that as more people join this effort we will broaden our goals and, we hope, achieve a much deeper integration of Kafka with YARN. In fact, when we discussed this with Jay Kreps, a committer and PMC member of Apache Kafka, he had this to say:

“DataTorrent’s Kafka on YARN efforts makes for a great out-of-the box experience for Kafka users in the Hadoop ecosystem. I’m really happy to see DataTorrent betting on Kafka and contributing this to the community”

We believe KOYA will significantly enrich the YARN-based eco-system, and we hope that KOYA is adopted as a way for applications to communicate within the YARN eco-system and with systems external to YARN.

DataTorrent helps our customers get into production quickly, using open source technologies to handle real-time, data-in-motion, fast, big data for critical business outcomes. Using best-of-breed open source solutions, we harden the open source applications and make them act and look as one through our unique IP, which we call Apoxi™. Apoxi is a framework that gives you the power to manage those open source applications as reliably as you manage your proprietary enterprise software applications, shortening the time to value (TTV) and lowering the total cost of ownership (TCO). To get started, you can download DataTorrent RTS or micro data services from the DataTorrent AppFactory.