We live in an era where computational resources are rapidly being commoditized. The cloud is pervasive and is democratizing IT. Big Data, led by Apache Hadoop, is the leading edge of this revolution from an IT perspective. Just as automobiles changed travel in the last century, mobile is changing communication and the web has changed information access over the past decade; Big Data is on track to change the way enterprise IT is done. Big Data is the compute and storage layer underneath mobile, social, the internet of things, and other emerging ecosystems. They all have one thing in common: they produce ever-increasing amounts of data that require ever-increasing amounts of resources, be it compute or storage. Big Data, led by Hadoop and its scale-out architecture built on commodity hardware, is enabling this disruption. Today, a team of a few people in any remote corner of the world can viably disrupt an existing business. Exciting times indeed, and I feel very fortunate to be part of it.

More than ten years after its inception, Hadoop remains largely under-productized. The genesis was the use of MapReduce at the core of Hadoop 1.0. The first generation of Big Data technology had attempted to do search indexing better. A noble endeavour, but in hindsight an underachievement, or at least not visionary enough. The real and pervasive disruption required a simpler question to be asked: “What can we do with massively distributed resources?” YARN (Hadoop 2.0) heralded the advent of the next generation of Hadoop, post-MapReduce. YARN was still in alpha in 2012, but we knew that this was the real disruption. We asked ourselves, “What would it take to productize Big Data? What would it take to commoditize the expertise needed to successfully launch Big Data projects?” Big Data applications need to be as mass market as mobile applications are today. They had to be easy to develop, easy to operate, and easy to integrate into the current technology stack. Additionally, they had to meet business SLAs with low total cost of ownership and low time to market.

When we started in 2012, we asked ourselves what it would take to make this Big Data revolution more powerful, more pervasive, and wider in terms of use cases. We zeroed in on a massively distributed, native Hadoop (YARN) data-in-motion architecture. It helped greatly that my team ran the Yahoo! Finance backend. We had aggregated trading market data from around the world on a worldwide, multi-colocation, massively distributed data-in-motion architecture that handled data at millisecond latency and very high throughput. Our new native Hadoop data-in-motion platform had to meet or beat this Yahoo! Finance backend. A trading data platform is meant to never go down; there is no batch system to back it up. We intended to bring this kind of high-SLA platform to the mass market as part of the Big Data Hadoop ecosystem. Such six-nines-SLA software need not be solely the purview of traders in New York. Additionally, we focused on operability, ease of use, ease of deployment and integration, low cost of ownership, and enterprise-grade quality. The only way to achieve this was to make operability a founding principle from day one. With each release we measured the cost of development, launch, and operations, as well as time to market, and relentlessly focused on improving these metrics. The platform had to match the quality of the current scale-up platforms that the Hadoop ecosystem intends to replace. We squarely aimed to make Hadoop not just big, but wider too in terms of use cases. This goal is part and parcel of the charter of Apache Apex.

With this charter in mind, we proposed Apache Apex for incubation, and our proposal was accepted in August 2015. As part of incubation, we were happy to see Capital One, DirecTV (now AT&T), General Electric, and Silver Spring Networks among the enterprises that joined our open source community. Apache Apex was blessed with great mentors, namely Alan Gates, Chris Nauroth, Hitesh Shah, Justin McClean, Taylor Goetz, and Ted Dunning. The Apache Software Foundation provided a framework to develop a fabulous community, and the ASF welcomed Apache Apex with open arms as we learned the Apache Way.

Today, Apache Apex is in production with customers, enabling use cases including log processing, billing, big data ingestion and movement, fast real-time streaming analytics, ETL, fast batch, database off-load, alerts and monitoring, machine-learning model scoring, and real-time dashboarding. Apache Apex is used for both streaming and batch use cases. Verticals include ad tech, the internet of things, financial services, and telecommunications.

This journey would not have been possible without high-calibre founding engineers and co-founders Chetan Narsude and Thomas Weise. We also had a great initial team in Pramod Immaneni, David Yan, and Gaurav Gupta. Many contributors and community participants helped along the way; thanks to all of you for making Apache Apex happen. This list would not be complete without Phu Hoang, co-founder and CEO, who helped navigate the business aspects for this relatively small team.

With the skyrocketing growth of Apex usage, meetups, and community around the world, I am excited to see what the future holds. Again, congratulations to Apache Apex on its journey to becoming a top-level project.