In a previous blog post Why Can’t You Get To Business Outcomes Just Using Open Source?, Amol Kekre provided some interesting revelations about open source technology. While open source was a key contributor towards making big data technologies popular and adoptable, it failed miserably when it came to enterprise readiness. This is because the goal of open source was to make newer ideas feasible but not necessarily solving a problem statement which enterprises face today. While adopting open source is a way to go, it needs a framework to make it more stable for providing viable ROI. This blog will talk about open source offerings for Online Analytical Processing (OLAP) and what we did to make it enterprise-ready in order to deliver positive ROI for our customers.
Where do we come from?
While traditional OLTP systems served the purpose of providing the right KPIs, it was hard to manage them with ever increasing data sizes. This gave birth to OLAP where fundamentally, the aggregate computation is/can be split between ingest and query time, while in OLTP all the computations happens at the query time. OLTP uses SQL for firing queries on the data and computation is done based on SQL Execution Plan. OLAP creates cubes of multi-dimensional data with precomputed aggregates at ingestion time for querying later. Some OLAP systems also allow you to postpone part of the cube generation at query time effectively distributing computation work between ingest and query time. OLAP served the purpose of business KPIs (Key Performance Indicators) along with dealing with large amount of data. Fast forward to the present when providing KPIs on data at rest was not enough as businesses are increasingly concerned with data-in-motion. Providing the ability to calculate KPIs on large volumes of data, both at rest and in-motion was not an easy task thereby significantly increasing the challenge for enterprise as well as open source developers. This is where projects like Druid and other closed source projects came into existence. There is a huge potential in open source technologies, but they needed to be made enterprise-ready. Lower Time to Value (TTV – Time to get expected value from the product) and Lower Total Cost of Ownership (TCO – Cost of getting that value, right from requirements gathering to production maintenance to exit strategy) are DataTorrent’s high-level guidelines as this is what is required to be enterprise-ready.
To solve the problem of real-time analytics in enterprise-ready fashion, we evaluated a few projects like Druid, ElasticSearch, and Kylin based on TCO and TTV requirements and found Druid to be the clear winner.
Druid is built internally for OLAP and was the most hardened and ready for OLAP needs. Druid is capable of not just providing analytics on large amounts of data but also the same in real-time or to be precise, near real time. But with open source Druid, the cluster itself was hard to manage increasing TCO. Also, with proper testing of complex setup, the TTV was going higher. Open source Druid also poses issues related to making production environments operable and reliable. The overall effect was that it was hard to get positive ROI out of a Druid open source setup. Other than analytics and visualisation of data, one of the future goals is to make aggregates available for leaf level cubes as a feed to another system. This would be useful in real-time alerting system. Because Druid can serve as both OLAP for data at rest as well as OLAP for data-in-motion (near real-time), we wanted to improve it further to do change data capture on the continuous data feed to build a true real-time alerting system. So for the purposes of ensuring deployments are successful for enterprise customers and the need for an enhanced feature-set, we took another approach.
Delving into what makes Druid Druid
Even with the production specific problem seen in the open source version of Druid, it was clear that Druid is a great and hardened technology in terms of the basic OLAP feature-set. It had the maturity to be made enterprise-ready. Hence we dug into the technical components of Druid and discovered a few key components. We found that IncrementalIndex and QueryableIndex are the ingest and storage data structures respectively. IncrementalIndex is a write efficient data structure while QueryableIndex is a read efficient data structure. IncrementalIndex is used for the purpose of ingestion and conversion to QueryableIndex does the ingest time computations. The QueryableIndex, on the other hand, could be loaded in memory (both on-heap and off-heap) and can be used for computations on the fly. Off-heap loading of data was a big plus because that helped take away any Java GC issues. At the same time, a wide number of QueryRunners provided by the Druid open source project are also used as a compute engine for computation at query time.
The big (data) question
The next question, and a daunting one, was how to use these data structures and processing engine for the data which is not just big but also in-motion. DataTorrent’s RTS platform has a stream-processing engine based on Apache Apex, that is meant to solve the exact problem of managing large amounts of data with continuous in-flow. So, we took above data structures and the compute engine from druid-api & druid-processing library and created DataTorrent RTS Operators to build a new and enterprise-ready OLAP data service – Online Analytics Service or OAS for short. Because the core components of Druid were plugged into a system meant to handle streams of big data, the results were very promising. Since the RTS platform acts as an OLAP system manager for computations and storage, all key features of RTS platform were readily available in Online Analytics Service.
DataTorrent RTS has built-in support for fault-tolerance, that automatically made Online Analytics Service fault-tolerant. This means that OAS is operationally ready and one need not worry about handling failures and data integrity in the system. OAS is highly-available, and it self heals. Thus applications built using OAS will have low TTV and TCO. OAS uses RTS platform’s distributed and scalable architecture. Because of this, OAS can scale up or scale down depending on the ingestion and query needs in a distributed manner. Something that is really important to enterprise companies is Data Security. The DataTorrent RTS platform provides deployment in a secure Hadoop cluster, so OAS deployed in a Hadoop secure environment can use the kerberos authentication for maintaining the data security. RTS platform also seamlessly integrates with RBAC systems and thereby OAS also integrates with various RBAC systems. With all the above key and enterprise required feature, OAS works seamlessly with high ingestion rates and very low latency for both ingestion and query. All of this put together made OAS an enterprise-ready OLAP system. Customers can get OLAP analytics on real-time (data-in-motion) and historical (data at rest) combined. This is a very powerful combo where a single system can tell you about the current trend as well as a historical trend.
Interfacing with OAS
OAS has two types of integrations, one for ingestion and one for query.
As OAS is a Druid-based system, it provides a similar interface to that of Druid for querying systems. This helps in integrating OAS with BI tools like Apache Superset (open source), Tableau (enterprise) etc.. DataTorrent’s upcoming service-based architecture helps a great deal in achieving these integrations. Both OAS and Superset (docker) act as interdependent services providing a complete end-to-end Data Analytic Solution for actionable insights into data.
On the ingestion side, OAS is enabled to stream-in the data from Apache Kafka. Kafka’s fault-tolerant and distributed streams of data help in achieving exactly-once semantics. Kafka is a popular choice of data streams in the big data world, the integration is even easier in available enterprise setups.
Figure 1: High-level Overview of OAS Components
Here is is a high-level overview of how the whole system looks:
- Data Source can be anything which publishes relevant data for OLAP analysis into Kafka.
- Online Analytics Service, part of the DataTorrent RTS platform, takes this real-time feed of data from Kafka and does the OLAP computation and makes it ready for querying.
- The BI Tool (composed of Superset) which is another service in the DataTorrent RTS platform provides a way through DT Gateway to fire the real-time queries to OAS and populate the charts based on user needs. This BI Tool is completely customizable to user visualization needs.
OAS and BI Tools used as services in the DataTorrent RTS platform provide a complete analytics solution for end users with very low TTV and TCO. While development was in progress, we found out this type of system can serve another very important enterprise need i.e. alerting on KPIs and change data capture. Hence as designers of OAS, we built it in such a way that it can be easily extended in the future to create a true OLAP based real-time alerting system on data-in-motion. For e.g. enterprises will be able to track a KPI using OAS and then introspect using the real-time data when the KPI exceeded some kind of limit. So this will give ability to monitor a KPI, react and take action while you still have time to do something about it. In many cases the system can take humans out of the loop and make substantial productivity gains. This means OAS can be applied to track, visualize, and (eventually) predict and alert on trends. The next effort for us would be to solve the exact same problem by harnessing OAS system to build a strong Alerting Service that is operationally easily for enterprise customers. In an upcoming blog we’ll cover details on the DataTorrent Real-time Alerting System.