How Apache Apex saves your big data dream from the grave

In September, I published a blog post about Apache Apex (incubating) and how it helps enterprises leverage big data in many more ways than before. In that post, I mentioned that it is now time to productize big data. In this post, I will discuss why a high percentage of Apache Hadoop based projects are not yet in production, why big data projects have a very high failure rate, why a lot of enterprises are still kicking the tyres with Hadoop, and how all of this relates to the operational aspects of the Hadoop ecosystem.

Apache Hadoop has been around for a while now. Yet its promise to transform businesses into data-driven powerhouses is still to be realized. Hadoop has the potential to bring to reality not just the revenue-generating capability of big data but, more importantly, its market-altering ability. Many projects, however, do not see the light of day. The primary reason is operability, a high-level term for whether an enterprise is able to operate a product successfully. Operability makes its presence felt in the cost of support, uptime, SLAs, business impact, and downtime, and poor operability effectively results in lost market opportunity. Had it not been for operability issues, far fewer big data projects would have fallen apart at the seams. The issue creeps in as big data projects attempt to go live. Ideally, a big data platform must isolate operability from the functional business logic. Enterprises that set out on the big data journey without ensuring this isolation put the productization of their big data applications at risk. The cost of a lost opportunity is dear, especially if competitors manage to productize their applications.

Big data applications and operational factors

A big data application must adhere to different operational factors in order to navigate to success. Given the challenge of maintaining operability, only a very small percentage of applications are in production. For the ones that never made it, operational aspects were the single biggest cause of failure. What is required to productize a big data application and support it in a production environment? Here are a few factors:

  • SLA requirements

In a nutshell, a service level agreement (SLA) is a collection of requirements that an application, in this case a big data application, must adhere to in order to satisfy business needs. SLAs often include commitments on a slew of issues, of which the major ones are latency, resource needs, and uptime. Latency answers a critical business question: can we get results within the specified business time? If an application misses the latency requirement, the value of the results diminishes rapidly. For example, an application should detect a security breach and take action before the business incurs losses due to the breach. SLAs also cover resource needs, which determine whether the application is viable in terms of cost. The Hadoop ecosystem is built around open source components, which means that resource costs include hardware as well as a lot of human resources. Hardware requirements show up as the throughput an application must sustain, which in turn calls for scalability and high performance, while human resources are the expertise that is part of a buy or build decision. Many big data platforms force developers to choose between latency and throughput. An ideal platform should not force this choice, but rather cater to both. Similarly, a big data application must work with expertise that is easy to find internally (build) or externally (buy). Uptime is the percentage of time for which the application is available to provide results. For example, an uptime requirement as high as 99.99% over a year is not uncommon for a big data application. In fact, such high uptime numbers are a necessity for enterprise-grade applications.
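To make the uptime figure concrete, here is a small arithmetic sketch that converts an uptime percentage into the downtime it permits. The function name and the 365-day year are assumptions chosen for illustration; they do not come from any particular SLA.

```python
def allowed_downtime_minutes(uptime_percent: float, days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an uptime target.

    Assumes the SLA is measured over `days` calendar days (365 by default).
    """
    total_minutes = days * 24 * 60
    # The downtime budget is the share of the period NOT covered by the target.
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime leaves about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% over a year leaves a budget of under an hour of downtime, which is why recovery that waits on human intervention is hard to reconcile with such a target.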

  • Fault-Tolerance and High-Availability

The ability of an application to recover and self-heal from a hardware outage is a critical part of fault tolerance. Ideally, a big data application should have this capability built in. Self-healing recovery means recovery with little or no human intervention. A recovery must involve no data loss, no state loss, and no impact on the SLA. This means that self-healing recovery should not cause any SLA miss, either in the accuracy of the results or in their availability within business time. High availability is derived from fault tolerance, but the two are not the same. High availability most often boils down to the entire stack having no single point of failure. For a big data stack, achieving high availability and fault tolerance does not come easy; it requires very rare expertise. A platform that takes care of high availability and fault tolerance drastically improves the success rate of an application launch.

  • Security and certifications

To become operational, a big data application must promise the highest standards of security and data protection. Big data applications that fail on this critical factor most often fail outright. The Hadoop ecosystem comes with built-in support for Kerberos authentication. To ensure security of data, a big data application should not only be Hadoop native (that is, work with Kerberos) but must also provide native support for custom code, because enterprises often have custom code for authorization, authentication, encryption, and anonymization. To ensure high security standards, big data applications should build on the security investments already made, i.e. leverage existing custom code.

  • Scalability and Performance

A critical aspect of operations is handling peak demand gracefully. Additionally, growing data and compute needs imply that big data applications must be future-proof from a scalability and performance perspective. The Hadoop ecosystem, in particular, has a scale-out architecture that relies on commodity hardware. A big data application must achieve linear scalability to be future-proof, while being highly performant to meet the hardware resource requirements of the SLA. In a scale-out architecture, it is critical that every part of the application be distributable and horizontally scalable, with no bottlenecks.

  • Ease of integration with Operations team (DevOps)

For an application to launch successfully, it must align smoothly with DevOps processes. Big data application projects routinely fail if this alignment is not achieved. The primary reason is the strong focus on functional code that developers working in the open-source realm bring in. These developers often mix the functional and the operational aspects of the specifications, making the big data application tough to align with DevOps. Any change requires DevOps to circle back to the application developers, and time to market takes a hit. Additionally, open source platforms are weak in enterprise-grade operability; we often hear “not mature yet”. Operability affects all applications and products, including those that are not part of big data. But with the open source roots (aka “not mature yet”) of big data, operability poses a much larger and more decisive risk. Ideally, the underlying platform should provide a complete separation of functional and operational specifications, so that the application lifecycle aligns smoothly with DevOps. Operational aspects must be natively supported and must be something that the platform aspires to. Application development projects must also integrate easily with DevOps monitoring and support tools. While web services are an obvious choice, the ideal is to implement them in a Hadoop native way, so that the tools and web services currently used to support Hadoop work as is.

  • Operational expertise

A big data platform stands true to its promise only when it drastically reduces the operational expertise required for the upkeep of the applications written on it. This is because businesses groom developers to bring in functional expertise, and not necessarily operational expertise. Business logic (aka functional logic) is the core competency of enterprises, and must be developed by enterprise developers. Arranging for operational expertise is very hard because it is rare and, to add insult to injury, it is most often not the core expertise of enterprises. That is why big data platforms and applications must be able to handle operational concerns, leaving developers free to work on the functional code.

  • Ease of upgrading and backward compatibility

In the hurry to launch big data applications, teams often fail to ensure smooth upgrades and backward compatibility. Because big data applications involve large datasets, businesses are often unwilling to upgrade frequently. This implies that multiple versions of a big data application must be supported, with backward compatibility between them. Ideally, a big data application should be Hadoop native to ensure these conditions, and recovery from Hadoop restarts is an added bonus.

The above figure illustrates what happens to a lot of big data projects. In the rest of this post, I will walk through how Apache Apex helps with such projects. Do note that since we are discussing open source projects from the Apache Software Foundation, I have not listed vendor lock-in as an operational issue. Lock-in at times causes cost escalation that makes an application unviable.


Apache Apex is built to address all operational factors.

Operability must be a first-class citizen of a big data application. It must never be “slapped on”. Apache Apex is the industry’s only open source platform that guarantees a clear distinction between operability and business logic while building support for operability into the platform. That is why projects using Apex take off smoothly, with businesses focussing only on business logic and leaving Apex to iron out issues of operability.

Apex allows developers to write their business logic while it takes care of all the operational nitty-gritty. The Apex API does not impose restrictions on business logic; developers only need to implement the compute function that is invoked as data arrives.
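The idea of "implement only the compute function" can be pictured with a toy sketch. This is not the Apex API (Apex applications are written in Java against its own operator interfaces); `run_pipeline` and `flag_large_payment` are hypothetical names, and the "platform" here is a plain loop standing in for everything a real platform would manage.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_pipeline(source: Iterable[T], compute: Callable[[T], R]) -> List[R]:
    """Hypothetical stand-in for the platform.

    It owns delivery of events; the developer supplies only `compute`.
    A real platform would also own scaling, recovery, and monitoring.
    """
    results = []
    for event in source:
        results.append(compute(event))
    return results


def flag_large_payment(event: dict) -> dict:
    """Business logic only: mark payments above an illustrative threshold."""
    return {**event, "flagged": event["amount"] > 1000}


if __name__ == "__main__":
    payments = [{"amount": 50}, {"amount": 5000}]
    for result in run_pipeline(payments, flag_large_payment):
        print(result)
```

The point of the split is that `flag_large_payment` contains nothing about how or where it runs, so operational behaviour can change without touching it.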

  • SLA requirements

Apex adheres to the primary SLA factors, namely latency, resource needs, and uptime. Apex can achieve sub-millisecond latencies while sustaining high throughput. While scaling linearly to billions of events per second, Apex guarantees high performance. Its hands-off fault tolerance mechanism, coupled with the ability to self-heal, ensures high uptime. To the question “Will we get results within the specified business time?”, Apex comes back with a resounding Yes!

  • Fault-Tolerance and High-Availability

Apex is designed to support fault tolerance natively, with no code required from the application developer. Applications self-heal from hardware failures, thus ensuring no impact on the business results. Apex guarantees no data loss and no state loss, all with no human or external intervention. Additionally, Apex leverages the Hadoop ecosystem and benefits from its out-of-the-box (OOTB) capabilities as the ecosystem matures.
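One common way to get "no data loss, no state loss" is to checkpoint operator state periodically and, after a failure, restore the last checkpoint and replay the events that followed it. The sketch below illustrates that general technique only; it is not a description of Apex internals, and `RunningTotal`, `run_with_recovery`, `checkpoint_every`, and `fail_at` are hypothetical names.

```python
import copy


class RunningTotal:
    """Business logic only: keeps a running sum per key."""

    def __init__(self):
        self.totals = {}

    def process(self, event):
        key, value = event
        self.totals[key] = self.totals.get(key, 0) + value


def run_with_recovery(events, checkpoint_every, fail_at=None):
    """Toy platform loop: checkpoints state and replays after a simulated crash."""
    operator = RunningTotal()
    # A checkpoint is the offset of the next event plus a copy of the state.
    checkpoint = (0, copy.deepcopy(operator.totals))
    offset = 0
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Simulated crash: in-memory state is lost, so restore the last
            # checkpoint and resume from the offset recorded with it.
            operator = RunningTotal()
            offset, saved = checkpoint
            operator.totals = copy.deepcopy(saved)
            continue
        operator.process(events[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(operator.totals))
    return operator.totals


if __name__ == "__main__":
    events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
    print(run_with_recovery(events, checkpoint_every=2))
    print(run_with_recovery(events, checkpoint_every=2, fail_at=3))
```

Both runs end with the same totals, and `RunningTotal` contains no recovery code at all. That is the property the paragraph above describes.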

  • Security and certifications

Apex is designed to meet the highest standards of security and data protection. Because Apex derives advantages from the Hadoop native capabilities, data security is guaranteed in part by Kerberos. Apex also enables easy integration with the custom code that enterprises have for authorization and even authentication. This enables the full Apex stack to be secure and certified, using both native and custom security support. Apex handles considerations such as encryption and anonymization in the same way, by supporting custom logic natively. When customers ask, “I have code written 10 years ago, can I use it?”, Apex once again comes back with a resounding Yes!

  • Scalability and Performance

Apex applications deftly handle big data needs while being Hadoop native. Their linear scalability and very high performance make them naturally future-proof. With its event handling capacity running into billions of events per second, coupled with millisecond-level latency as dictated by SLAs, Apex is geared towards ensuring high performance even as data size grows. The built-in support for compute locality in Apex helps optimize resource requirements, and often drastically lowers network bandwidth and cluster size requirements. The ability to handle parallelized pipelines along with unifiers aids a scale-out architecture. All Apex components, including the master, buffer management, message queues, and operators, can be distributed. The skew handling support and the built-in publisher-subscriber mechanism for data transfer enable scale, as two different consumers of data can continue with negligible interference. Further, the Apex master is non-blocking and does not impede data flow. Avoiding bottlenecks is a guiding principle in the design of the Apex infrastructure.
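The role of parallel pipelines and unifiers can be sketched in a few lines. The sketch assumes a simple round-robin split and a word count as the business logic; `count_words`, `unify`, and `run_partitioned` are hypothetical names, and the partitions run sequentially here rather than on separate machines.

```python
from collections import Counter


def count_words(words):
    """Business logic applied independently inside each partition."""
    return Counter(words)


def unify(partials):
    """Unifier: merges the partial results produced by the partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_partitioned(words, partitions):
    """Split the input round-robin, process each shard, then unify."""
    shards = [words[i::partitions] for i in range(partitions)]
    return unify(count_words(shard) for shard in shards)


if __name__ == "__main__":
    words = "a b a c b a".split()
    # The partition count is an operational choice; the result is unchanged.
    print(run_partitioned(words, 1) == run_partitioned(words, 3))
```

Because the partition count lives outside `count_words`, it can be tuned to meet an SLA without recertifying the functional code.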

  • Ease of integration with Operations team

Operations teams ensure a smooth “go live” for projects, and big data projects are no exception. DevOps functions as the arbitrator that enables an application and its platform to become operational. Apex is DevOps friendly; it separates functional logic from operational logic. This means that a functionally certified application can be managed by DevOps by simply changing operational parameters without the need for a new functional certification. Apex provides deep web services that assist in the integration of the application with the DevOps monitoring tools. Apex leverages DevOps tools and Hadoop processes as is, while requiring no additional support for the Hadoop cluster.

  • Operational expertise

Apex completely separates business logic from operational logic, thus enabling enterprises to focus on their core business while the platform takes care of the operational activities. Apex is built to look after operability issues such as fault tolerance, integration with DevOps, security, high performance and scalability, dynamic adjustment of topology to meet SLAs, and so on. Developers need no deep operational expertise; they can integrate existing business logic, or build new logic on Apex.

  • Ease of upgrading and backward compatibility

Apex applications are among the easiest to upgrade in the Hadoop ecosystem. Apex is built such that different versions of applications can run concurrently in a cluster. Additionally, there is strong isolation among YARN applications. Apex web services are versioned, which helps with diagnosing version mismatches. Consumers of the application web services do not need to be upgraded along with the applications. Apex applications are fully stateful, and so can survive a Hadoop restart.

Apex is designed to fulfill the disruptive promise of big data by enabling enterprises to extract value from their big data investments. Apex effectively ensures that big data applications never set out on the way to the grave.

Watch this space for deep-dive details on the operability aspects mentioned in this post. In the interim, join the team of Apex contributors. Download Apex, explore it, and subscribe to the Apex forum.