The transformation of enterprises into data-driven organizations is in full force, and we are at the early-to-middle stage of this journey. The next-generation open source IT stack is emerging to help enterprises achieve successful business outcomes.

There are three aspects that enterprises need to manage on this journey: data analytics, on both data-in-motion and data-at-rest; ensuring that the open source stack is operable and productized, i.e. viable; and, lastly, the successful delivery of data-driven business intelligence.

Over the past few decades, data analytics has gone through three distinct phases. The first phase was marked by the basic analytics done before the advent of big data. It was driven by relational databases and was mainly SQL-based. These databases required enterprises to convert unstructured data into structured data that was then indexed in a database. A lot of present-day analytics continues to be done this way.
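As a rough illustration of this first-phase pattern (hypothetical data and field names, using Python's built-in SQLite module), semi-structured records are parsed into rows up front so that the analytics themselves can be expressed as plain SQL:

```python
import sqlite3

# Hypothetical raw, semi-structured input: one "channel,amount" record per sale.
raw_lines = ["pos,19.99", "web,5.00", "pos,7.50", "mobile,12.25"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (channel TEXT, amount REAL)")

# First-phase pattern: transform the incoming records into structured rows up front...
rows = [(channel, float(amount))
        for channel, amount in (line.split(",") for line in raw_lines)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# ...so that the analytics can be expressed as SQL over the structured, indexed data.
for channel, total in conn.execute(
    "SELECT channel, SUM(amount) FROM sales GROUP BY channel ORDER BY channel"
):
    print(channel, total)
```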

Big data marked the arrival of the second phase of data analytics, which scaled out on commodity hardware and was led by open source. The second phase started with storing unstructured data in data lakes. This phase was initially storage-driven and data lake-centric. The initial usage attempts carried over SQL habits: enterprises tried to transform incoming unstructured data into structured data inside the data lake.

Data lake-centric processing has now started to unravel, as it did not reliably deliver business outcomes. Scale-out models required a rethinking of the basic setup, which did not have to mimic databases. As data lakes fell short, a lot of new products and use cases emerged. Processing unstructured data with minimal transformation took hold as enterprises realized the value of analytics done in real time.

The emerging trends include data-in-motion analytics, edge/fog computing, heterogeneous environments led by the cloud, NoSQL databases, and so on. This phase marked the transition from “database-centric” analytics to mass-market, commoditized analytics driven by both the cloud and innovation in open source. These analytics were consumable by machines, not restricted to humans, thus delivering huge value and opening up scale-out analytics to a whole new set of use cases. In the overall end game, this second phase is preparing enterprises for the next big move: the arrival of artificial intelligence, machine learning, and natural language processing.
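A minimal sketch of what data-in-motion analytics can look like, assuming a Kafka broker at localhost:9092, a hypothetical "clickstream" topic of JSON events, and the kafka-python client: the aggregate is updated as each event arrives, rather than after the data has landed in storage.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic name and broker address; adjust to your environment.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Data-in-motion analytics: maintain a running aggregate as events stream in,
# instead of storing everything first and querying it later.
counts = {}
for event in consumer:
    page = event.value.get("page", "unknown")
    counts[page] = counts.get(page, 0) + 1
    print(page, counts[page])  # output consumable by a machine or a dashboard
```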

In this third phase, the importance of structured data (i.e. the RDBMS) will diminish; analytics will be machine-consumable (humans will not be in the loop) and will be done on unstructured or loosely structured data (including NoSQL databases). Heterogeneous environments will replace data lake hubs as the central architecture. Data-in-motion will be integrated into the IT stack and will be equally, if not more, important than data-at-rest. This transformation will not be easy, but it is urgent. Unstructured data is growing faster than Moore’s law, pushing enterprises to move faster on this journey. Machine-consumable data needs a high level of operability, and time to market is the most critical part of this transformation: those not making the transition in time will lose out. The successful migration from the first phase to the third phase of data-driven transformation requires the IT stack to be operable and productized.

In a previous blog, I discussed how applications put together by stitching loosely coupled data services help achieve successful business outcomes. This pattern is needed in both the second and third phases of the journey for enterprises to become truly data-driven. Enterprises need to master operating and productizing the second phase before they take on AI/ML/NLP. Additionally, the database and organizational silos of the first phase inhibited omni-channel processing. Customer interaction now happens on multiple channels, be it POS, web, or mobile. Enterprise products need to handle omni-channel data streams to fully know their customers. The ability to launch omni-channel products will provide a significant advantage to enterprises.

The new, emerging open source IT stack requires three legs to get to a successful outcome in a heterogeneous environment. These are cost structure (aka open source), feature parity with proprietary stacks, and operational viability. Open source has partially resolved the cost structure, as its licenses are free. A free license, along with the commoditization of hardware resources by the cloud, has drastically altered the cost structure in favor of open source over proprietary-only stacks. Additionally, the open source community is very innovative and creative.

Over the past decade, the open source community has significantly narrowed the feature gap with proprietary software. Open source innovation is also better positioned to address the second and third phases of the aforementioned journey. The innovation is being done in the open and is fast-changing. The downside is that there are frequent changes, too many new features, and too many new technologies.

By itself, open source is not designed to be easily operable or productized. The much-needed innovative and creative behavior itself inhibits enterprises from achieving outcomes in a timely manner. This “Wild West” requires some governing software to be viable and to deliver value from these innovations. This is where the third leg comes in. Viability has always been described in terms of time to market and total cost of ownership. There are too many technologies in open source, and the best-of-breed need to be consolidated into a viable stack. This part of the solution is proprietary; it is geared toward lowering time to value and enables enterprises to operate a product in a cost-effective way.

DataTorrent’s Apoxi™ is an example of such a product. Our Apoxi framework enables rapid application development by stitching together data services. DataTorrent’s fast big data stack reduces the number of technologies by picking the best-of-breed open source technologies. We leverage Apache Kafka for the data transport/message bus, Apache Apex for data-in-motion processing, Druid for OLAP, Drools for the rules engine, Apache Spark for machine learning, and HDFS/S3/Azure for storage. A set of pre-constructed applications and reusable data services is provided in DataTorrent’s AppFactory. These applications and data services use DataTorrent’s fast big data stack and leverage Apoxi to bind the applications together.
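Apoxi's actual API is not shown here; the following is a purely hypothetical Python sketch of the general pattern this paragraph describes: loosely coupled data services, each of which would wrap one best-of-breed technology, composed ("stitched") into an application through a common interface.

```python
from typing import Callable, Iterable, List

# Hypothetical "data service" contract: a service consumes a stream of events and
# yields a transformed stream. Real services would wrap Kafka, a rules engine, an
# OLAP store, and so on; this is an illustration, not DataTorrent's Apoxi API.
DataService = Callable[[Iterable[dict]], Iterable[dict]]

def dedupe_service(events: Iterable[dict]) -> Iterable[dict]:
    seen = set()
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event

def enrich_service(events: Iterable[dict]) -> Iterable[dict]:
    for event in events:
        yield {**event, "channel": event.get("channel", "unknown")}

def compose(services: List[DataService]) -> DataService:
    # "Stitching": the application is just the composition of its data services.
    def application(events: Iterable[dict]) -> Iterable[dict]:
        for service in services:
            events = service(events)
        return events
    return application

app = compose([dedupe_service, enrich_service])
for out in app([{"id": 1}, {"id": 1}, {"id": 2, "channel": "web"}]):
    print(out)
```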

Apoxi provides the tight integration that makes these data services operable. Customers can create their own applications by stitching together these data services using Apoxi. When needed, customers can easily add custom code to the data services or create their own data services in an Apoxi-compliant way by using the components.

Apoxi also fits very well with the organizational structures of current enterprises, as it does not force every team to dump data into a single data lake. Organizations can contribute their data services into a single framework that works across heterogeneous environments, including the hybrid cloud.

It is DataTorrent’s belief that Apoxi is the only way to deliver the critical business outcomes that enterprise customers are seeking when leveraging open source in a heterogeneous environment.