Part 1: Failure of open source technologies to deliver successful business outcomes
Part 2: High-level guidelines for achieving successful business outcomes with big data
Part 3: Development Pattern: Application stitched with loosely coupled big data services
Part 4: DataTorrent Apoxi Framework

In the previous parts of this blog series, we saw how and why using open source technologies without a strategic plan has led to failures in big data. The current big data environment therefore needs a new approach to succeed with big data product launches. Previously, I presented the notion of a fast big data stack, where the intent is to take care of 90% of operational issues and render operability a configuration task. I also mentioned that we have seen our customers reduce their stack to a few components: KASHD (Apache Kafka, Apache Apex, Apache Spark, HDFS, and Druid for real-time OLAP).

Customers are, in effect, reducing the number of open source technologies to five and leveraging DataTorrent’s certified, hardened operators to reduce time to market. Reducing the technology stack greatly helps, but it still does not address the crux of the issue. We need to break this problem down further, and the solution is to move away from a thought process that is stuck in open source technologies. We need to think in terms of hardened big data services: services that deliver the outcomes we want. We can then create a big data product by stitching these big data services together.

For operational excellence, it is best to have data services interact with each other via loose coupling. Such a setup drastically reduces time to market and, in turn, improves the success rate. It also greatly reduces operational cost and complexity and makes it much easier to launch big data products. Big data products conceived this way are a collection of reusable big data services loosely connected via a message bus. We have done this in software development before: we reused libraries, consumed web services, federated our queries, and scaled better as a result. This is a cookie-cutter pattern, a process that can be rinsed and repeated to make big data accessible to the mass market.
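
To make the loose coupling concrete, here is a minimal sketch in Java using the Apache Kafka client, assuming a hypothetical topic name ("enriched-events"), broker address, and class name. One data service publishes records to the topic; another consumes them on its own schedule. The topic name and record format are the only contract between the two services.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LooselyCoupledServices {

    // Upstream data service: publish an enriched event without knowing who will consume it.
    static void publishEnrichedEvent(String key, String jsonPayload) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("enriched-events", key, jsonPayload));
        }
    }

    // Downstream data service: subscribe to the same topic and process at its own pace.
    static void consumeEnrichedEvents() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-service"); // each service keeps its own consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("enriched-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // hand off to this service's own business logic
                System.out.printf("analyzing %s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```

Because neither side references the other's code, either service can be redeployed, scaled, or replaced without touching its counterpart; the message bus absorbs the difference in their lifecycles.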

One of the byproducts of data lakes was the monolithic big data application. These applications have had a very high rate of failure and are becoming extinct in the data swamp of first-generation big data ecosystems. I will cover the fallacy of data lakes in a future blog.

Let’s look at a big data product development strategy that revolves around loosely coupled big data services from the perspective of a viable time to market. As enterprises develop these consumable, reusable data services, building a future product mainly consists of putting those services together and developing minimal business logic. This approach saves both development and operational cost. In my previous blog, I listed how operational issues have become a graveyard for big data products. Let’s evaluate how a loosely coupled big data product, constructed from certified data services, compares with a monolithic big data product on the operational and data integration issues listed in that blog.

  • Fault Tolerance and High Availability: Data services certified for fault tolerance and high availability greatly reduce the work for the application developer. For new product launches, only the new code needs to be tested. The message bus provides a strong buffer that keeps outages from impacting upstream or downstream data services. High availability can even be enhanced on a per-data-service basis by running certain big data services in a hot-hot mode as needed. Such knobs are not available in a monolithic big data application.
  • SLA Requirements: An SLA has various facets: latency, hardware cost, development cost, time to market, and so on. From a latency and hardware cost perspective, loosely coupled and monolithic applications are comparable; unless latency must be sub-millisecond, a monolithic application provides no inherent advantage. On development cost and time to market, a loosely coupled application wins decisively because far less needs to be built. More often than not, a loosely coupled big data product simply leverages big data services already running in production, and a new product launch translates to launching one new data service and connecting it to the data services already in production.
  • Security and Certifications: Once again, using pre-certified big data services saves time. It is much easier to secure a data service than a monolithic application. Once SecOps has certified a big data service, reusing it does not trigger additional review processes. The development team thus leverages the security expertise of the other data service creators. A monolithic application carries a high cost to build security in, because little of that work can be reused.
  • Scalability and Performance: On scalability and performance, there is not much difference between the two. A big data service scales just as well as a monolithic big data application. It is easier to dynamically add more services as needed, but this is not a decisive reason to prefer a loosely coupled application.
  • Ease of Integration for Operations Team (DevOps): A loosely coupled application is designed to integrate by construction. Loose coupling is all about how various big data services integrate, so the integration points are well defined. Additionally, pre-certified data services mean that DevOps already knows how to manage them, and any additional feature introduced by the new application is incremental in nature. A monolithic application, however, poses a huge risk on the DevOps side: commonly, the team needs to train from scratch, ensure all areas are covered, and often ends up rediscovering mistakes, most likely made by a new engineer.
  • Operational Expertise and Cost: Loosely coupled applications score much higher on basic operational aspects. Once a data service is operationally certified, its reuse requires no new technology ramp-up. The ability to operate each big data service independently, without impacting other big data services, drastically reduces operational cost. Reuse of an existing data service leverages current DevOps expertise. Monolithic applications carry a very high operational cost; the operational issues associated with them are one of the main reasons for the high failure rate in the big data ecosystem.
  • Ease of Upgrading, Cloud, and Backward Compatibility: Loosely coupled applications have an easier upgrade path, as software can be rolled out one data service at a time. The delineation of a feature set into a data service is done based on how easy it will be to operate, upgrade, and so on. DevOps can even follow the hot-hot plan mentioned earlier: test one replica of a data service on old data, add capacity by adding another data service, stage a data service, and monitor it before launching it live. The loosely coupled architecture keeps backward compatibility issues contained within the data service. The fundamental tenet of loose coupling is that the coupling is based on a contractual API. Two versions of the same data service can run in parallel even if they are not compatible with each other, and consumers upgrade as and when they are ready. This lets each consumer decide when and how to move to the next version (see the sketch after this list). Equally importantly, the dev-QA-ops team of each big data service can now work on its own independent timeline. Schema evolution is much easier because it can be done per big data service; the application as a whole, or the big data service(s) that consume from the changed data service, can pick and choose when to upgrade. It is also much easier to distribute big data services across data centers, or even to the cloud. A migration to the cloud can be done one big data service at a time, and the application may well run partly in the cloud and partly on-prem. An enterprise thus has much higher flexibility in staying cloud-agnostic and selectively picking the cloud where required. Monolithic applications give no such leeway: it is all or nothing. The first generation of big data applications was rendered non-agile by the lack of ease in upgrading, the difficulty of moving to the cloud, and backward compatibility issues.
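
As one illustration of the contractual-API point above, here is a minimal sketch, assuming hypothetical topic and class names, of how the contract version can be made explicit so that two versions of a data service run in parallel while consumers move over on their own timeline.

```java
// Hypothetical versioned topic contract between data services. The producing
// service writes to a version-suffixed topic; consumers keep reading v1 until
// they are ready to switch to v2, so both versions can run side by side.
public final class EnrichedEventTopics {

    private EnrichedEventTopics() {}

    public static final String ENRICHED_EVENTS_V1 = "enriched-events.v1";
    public static final String ENRICHED_EVENTS_V2 = "enriched-events.v2";

    // A consumer declares which contract version it understands; upgrading is a
    // one-line change made on the consumer's own schedule.
    public static String topicForVersion(int version) {
        switch (version) {
            case 1: return ENRICHED_EVENTS_V1;
            case 2: return ENRICHED_EVENTS_V2;
            default: throw new IllegalArgumentException("Unsupported contract version: " + version);
        }
    }
}
```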

There are other areas where loose coupling greatly helps. Today, enterprises face a variety of data sources for the same product. For example, it is very common to sell a product in a local store, on the web, and via a mobile app. These multiple channels, also known as omni-channel, exist in almost every vertical. The vast majority of the time, the processing needed across these channels is similar except for the source, i.e., except for the ingestion part of the pipeline. A loosely coupled application with an ingest data service for each channel works very well for omni-channel processing: the transformations, analytics, store/load, decision making, and so on can be reused across channels, as sketched below. Such big data products are able to provide a unified view of the business in a timely manner and at much lower cost. The loosely coupled framework also makes it easy to break silos within the enterprise, as the consuming big data service can belong to a different group; each silo can make its data available on a message/service bus. A loosely coupled framework places fewer restrictions on the various teams, whereas a monolithic application inherently forces enterprises to break their silos the hard way. Given that silos are more often an organizational issue, enterprises need to adopt technology frameworks that suit their structure, as opposed to those that dictate a big reorganization.
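
The omni-channel case can be sketched the same way. Assuming a hypothetical topic naming convention in which each channel's ingest data service publishes normalized events to its own topic (for example "orders.store", "orders.web", "orders.mobile"), a single downstream analytics data service can be reused across all channels simply by subscribing to all of those topics.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.Properties;
import java.util.regex.Pattern;

public class OmniChannelAnalytics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "omni-channel-analytics");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // One reusable analytics service consumes every channel's ingest topic
        // (assumed naming convention: orders.store, orders.web, orders.mobile).
        consumer.subscribe(Pattern.compile("orders\\..*"));
        // ... poll loop and shared transformation/analytics logic, identical for all channels ...
    }
}
```

Adding a new channel then means launching one more ingest data service that writes to a matching topic; the shared downstream services need no change.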

In the above figure, the first big data service ingests and enriches incoming data. The second big data service performs analytics, based on OLAP, rules, or machine scoring, and takes any required action. The third big data service stores the data in fast-access storage as well as archiving it for future reference. The fourth big data service uses the data held in fast-access storage to train models. Each of these big data services could be split further; for example, the "analyze and act" data service could be separated into an analytics data service and an action (decision making) data service, as sketched below.
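
A minimal sketch of that wiring, again with hypothetical topic names, shows why such a split is cheap: each hop between data services is just a named topic, so separating "analyze and act" into two services only adds one new topic between them, while the ingest and store services remain untouched.

```java
// Hypothetical topic names for the four-service pipeline described above.
public final class PipelineTopics {

    private PipelineTopics() {}

    public static final String RAW_EVENTS       = "raw-events";        // sources -> ingest & enrich
    public static final String ENRICHED_EVENTS  = "enriched-events";   // ingest & enrich -> analyze & act
    public static final String SCORED_EVENTS    = "scored-events";     // analyze & act -> store / archive
    public static final String TRAINING_BATCHES = "training-batches";  // fast-access store -> model training

    // If "analyze & act" is split into separate analytics and decision (action)
    // services, only this one new hop is introduced between them:
    public static final String ANALYTICS_RESULTS = "analytics-results";
}
```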

In the fourth part of this blog series, I will walk through DataTorrent’s Apoxi framework. It is a framework that enables big data services to be loosely coupled and is geared towards the successful launch of big data products in a timely manner.