In my previous blog, I discussed the need to operationalize and productize big data. Today I will walk through another aspect of the big data ecosystem, one that was ingrained into the big data stack at its inception.
The first generation of the big data ecosystem chose MapReduce as its only engine. The original target was search indexing, which is how and why Apache Hadoop came about. MapReduce was a storage-heavy engine that read and wrote files to storage, and so was suited only to batch processing. With a 128MB block size, big files had to be constructed and copied into HDFS before MapReduce could kick in. This caused the early big data ecosystem to focus on batch processing and data lakes. With enormous success in search indexing, and with the perceived low cost of big data, the big data community asked, “What is the next use case to tackle with MapReduce?” Almost immediately, open source developers tried to replace or offload costly, proprietary data warehouses using MapReduce as the engine.
Free open source software running on commodity hardware seemed like a good combination to take on data warehouses. This meant that the majority, if not all, of the first-generation big data ecosystems were batch-oriented, and data lakes became the center of the stack. The data flow of first-generation big data went through storage; analytics was done on data at rest. To be precise, there was no data flow at all, as data simply landed in a lake to be looked at later. The first-generation big data ecosystems missed the primary question, which should have been: “Given a commoditized scale-out of resources, what can we do with it?” With this unasked question, big data, and the cloud as a backdrop, I will discuss another aspect of big data in this blog: a holistic view of what big data should have been and is now increasingly becoming via the second-generation wave. This second wave is centered around real-time, and it heralds the arrival of fast big data analytics.
A couple of decades ago, the enterprise IT stack moved from costly, non-agile mainframes to personal computers, then to web servers built on the LAMP (Linux, Apache, MySQL, PHP) architecture. This architecture was not scalable but sufficed in the early days of the web. Enterprises soon realized that the web is better served by leveraging collective data, not just user profiles. They then started using big data to process ETL pipelines that fed back into caches in the front-end servers. But this loop was not real-time, and often took days to update. Data continued to grow faster than Moore’s law, and big data became a critical component of the web.
With the advent of the Internet of Things, data growth is set to accelerate. Data sources are shifting from humans to machines as the dominant platform moves from web to mobile to connected devices. This has created a dire need to scale out data pipelines in a cost-effective way. The big data and cloud ecosystems realized that they cannot be just about search indexing or data warehousing; they need to service all enterprise data flow, whether human-generated (web, mobile) or machine-generated. Enterprise data flow is an end-to-end pipeline in which unstructured or semi-structured data generated at various source systems flows through intermediate systems, where it is transformed into structured data and analyzed for consumption. The consumer is either a human (a data analyst) or a machine (full automation, where results are sent back to front-end caches). Over time, the consumer of these analytics will increasingly be a machine, and efficiencies will be gained as big data analytics moves from human consumption to machine consumption. Just as the LAMP architecture helped standardize the web, fast big data analytics will help standardize, and bring efficiencies to, the entire enterprise IT stack.
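To make the end-to-end pipeline concrete, here is a minimal sketch of the flow described above: semi-structured events from various sources are transformed into structured records and then aggregated for a consumer. The field names and event shapes are hypothetical, chosen purely for illustration.

```python
import json
from collections import Counter

def parse_event(raw):
    """Transform one semi-structured JSON event into a structured record.
    Unknown fields are dropped; missing fields get defaults."""
    data = json.loads(raw)
    return {
        "source": data.get("source", "unknown"),
        "user": data.get("user", "anonymous"),
        "action": data.get("action", "none"),
    }

def analyze(records):
    """Aggregate structured records for downstream consumers: a dashboard
    for a human analyst, or a cache feeding an automated system."""
    return Counter(r["action"] for r in records)

# Hypothetical events from web, mobile, and machine sources.
raw_events = [
    '{"source": "web", "user": "alice", "action": "click"}',
    '{"source": "mobile", "user": "bob", "action": "click"}',
    '{"source": "sensor-7", "action": "temp_alert"}',
]
summary = analyze(parse_event(e) for e in raw_events)
print(summary["click"])       # 2
print(summary["temp_alert"])  # 1
```

In a real deployment each stage would be a distributed service rather than a function call, but the shape of the flow, sources in, structure imposed mid-stream, analytics out, stays the same.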
The LAMP architecture relies on a per-user database lookup in real-time. That worked well as a real-time, agile, operable, and cost-effective setup, but it does not scale for big data analytics. First-generation big data used as a backend never matched the positive aspects of LAMP, and therefore could not replace the scale-up databases that powered the web, notably MySQL. The first generation of big data was neither real-time nor operable; it was not even as agile as LAMP. It was cost-effective, as it used commodity hardware to scale out, but the lack of broad success in productizing big data left much of that cost-effectiveness unrealized. To win back the business advantages enterprises had with LAMP, the big data stack needs to be like it. A website that reflected a user’s profile change a day after the edit would lose to one that did so in real-time; the change had to show up on the next page.
The same applies as we use big data for backend processing. The need for big data is driven by the fact that algorithms using collective knowledge, i.e. data from all users, provide a much better user experience. We are seeing the uptick of this simple principle in backend processing: machine learning, AI, OLAP analytics, behavioral targeting, and predictive analytics for machine maintenance. In all these use cases, the ability to respond in real-time provides a dramatic competitive advantage. An enterprise that can do predictive analytics in real-time will gain an edge over one that cannot. Machine learning is useful only if enterprises can score incoming events in real-time, be it financial transactions, web pages, mobile, or IoT. Action needs to be taken before cost escalates, because the value of data drops significantly with time.
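As a sketch of what real-time scoring means in practice, the snippet below applies a simple logistic model to each incoming event and acts inline, before the event completes. The feature names, weights, and threshold are entirely hypothetical; in a real system the weights would come from a model trained offline on historical data.

```python
import math

# Hypothetical pre-trained weights for a fraud-style score; a real model
# would be trained offline on historical data in the lake.
WEIGHTS = {"amount": 0.004, "foreign": 1.5, "night": 0.8}
BIAS = -3.0

def score(event):
    """Score a single incoming event with a logistic model."""
    z = BIAS + sum(WEIGHTS[k] * event.get(k, 0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def handle(event, threshold=0.5):
    """Act before cost escalates: decide inline, per event."""
    return "block" if score(event) >= threshold else "allow"

print(handle({"amount": 50, "foreign": 0, "night": 0}))   # allow
print(handle({"amount": 900, "foreign": 1, "night": 1}))  # block
```

The point is the latency budget, not the model: the decision happens while the transaction is still in flight, rather than in a batch job hours later.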
Second-generation big data needs to be real-time, agile, and operable. Moreover, to get real-time behavior, agility, and to some extent operability back, data lakes cannot sit in the middle of this data flow. That is sub-optimal even when data lakes have not turned into data swamps: the best-performing data lake will still impede data flow, because parking data in storage breaks the flow, and a flow that has stopped is no longer real-time. Data lakes have a place, but it is on the other side of this real-time data flow. They are useful for archival, for training models on historical data, and for extracting value from a historical perspective. More often, this analytical data is used to enrich a real-time data flow.
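The enrichment pattern at the end of the paragraph can be sketched as follows: batch jobs over the lake precompute per-key aggregates, which are loaded into a fast lookup structure so live events can be enriched without themselves passing through the lake. The table contents and field names here are made up for illustration.

```python
# Hypothetical historical aggregates, precomputed in batch over the data
# lake and loaded into memory (or a key-value store) for fast lookup.
HISTORICAL_AVG_SPEND = {"alice": 40.0, "bob": 210.0}

def enrich(event):
    """Join a live event with lake-derived context; the event never
    touches the lake, only the precomputed lookup table."""
    avg = HISTORICAL_AVG_SPEND.get(event["user"], 0.0)
    enriched = dict(event)
    enriched["avg_spend"] = avg
    enriched["spend_ratio"] = event["amount"] / avg if avg else None
    return enriched

e = enrich({"user": "alice", "amount": 120.0})
print(e["spend_ratio"])  # 3.0
```

This keeps the lake on the batch side of the architecture while still letting its historical knowledge inform every real-time decision.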
Another push for real-time big data processing is the need to take humans out of the loop, i.e. automation. As enterprise IT stacks automate their ETL pipelines to gain more efficiency, the need to analyze big data in real-time continues to grow. In the next phase of big data and cloud, we will increasingly analyze and predict with machines, without humans in the loop. The enterprises that do so will gain huge efficiencies and a tremendous market advantage. This need has led to the advent of the fast big data application stack. Fast big data analytics is the future of big data: over the next decade, all ETL pipelines will move to real-time, while batch applications will still analyze historical data and supplement the core real-time data flow.
As we decide on and implement a fast big data stack, we should aim for a product stack that is as easy to set up as LAMP. To get started, let’s list the principles around which this stack can be developed, and walk through each of our requirements for adopting a fast big data architecture/stack.
The first is operability, and it depends on a lot of factors. I have listed the abilities we have to focus on to harden a fast big data stack in my previous blog. I will now discuss, in no particular order, some important strategic decisions that impact the operability of real-time big data applications:
- Number of components in the stack: Operability disproportionately decreases as the number of components in a big data stack goes up. Each component needs to be hardened, and devOps needs to build expertise in each and integrate them into the IT stack. Thus, we have to make a strong effort to narrow down the stack. We have seen many of our customers narrow their stack down to KASH: Apache Kafka for the message bus, Apache Apex for real-time processing, Spark for machine learning, and HDFS for storage. In my previous blog, I listed some other components that are useful around a KASH setup. Reducing the number of components improves both total cost of ownership (TCO) and time to value/market (TTV).
- Loosely coupled data services: Constructing the application from loosely coupled data services around a service bus helps operability, and developing reusable data services greatly lowers TTV. Just as libraries reduce development time in software code, the same principle applies to big data: enterprises make dramatic gains in operational efficiency by reusing data services. Instead of developing a monolithic application, we should de-construct it into data services loosely coupled through a service bus. Over time, a slew of such data services will not only drastically reduce TTV but also lower TCO. DataTorrent provides much of this glue logic via our proprietary Apoxi framework, which enables the easy assembly of real-time applications. I will discuss DataTorrent’s Apoxi framework in a later blog.
- Metrics and Integration: A major part of devOps operational cost revolves around knowing what is going on. The more detail devOps has about a big data application, the higher its operability, and the availability of system and application metrics greatly helps in this regard. Furthermore, this data must integrate with current monitoring tools to ensure minimal change in devOps tooling. DataTorrent’s Apoxi framework includes a metrics platform and various ways to integrate with monitoring systems, including ELK and/or Splunk for log aggregation. Metrics and integration are especially critical for real-time applications, where problems have an immediate impact on product performance.
- User Interface: A fast big data stack must position its UI as a first-class citizen. Visual feedback is critical for business users and reduces time to interpret data as compared to raw numbers. A previous blog discusses fast big data visualization product features provided by DataTorrent.
- Configure: A fast big data stack must take care of all operability issues and leave only business logic to the user. For big data application developers, the motto must be “configure” operability and develop “business logic”. A lot of open source components force users to develop operability themselves. This is often the primary cause of failures of big data products since operability is not the core competence of most enterprise developers.
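To illustrate the loose-coupling point from the list above, here is a minimal in-memory sketch of data services wired together only through topics on a service bus. A real stack would use something like Kafka topics; the bus class, topic names, and services here are purely illustrative.

```python
from collections import defaultdict

class ServiceBus:
    """Minimal in-memory stand-in for a service bus; in production this
    role would be played by a message bus such as Apache Kafka."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.handlers[topic]:
            handler(message)

bus = ServiceBus()
results = []  # final sink, standing in for a downstream consumer
seen = set()

# Two independent data services, coupled only by topic names, so either
# one can be swapped out or upgraded without touching the other.
def cleanse_service(msg):
    bus.publish("clean", msg.strip().lower())

def dedup_service(msg):
    if msg not in seen:
        seen.add(msg)
        results.append(msg)

bus.subscribe("raw", cleanse_service)
bus.subscribe("clean", dedup_service)

for raw in ["  Alice ", "alice", "Bob"]:
    bus.publish("raw", raw)

print(results)  # ['alice', 'bob']
```

Because each service knows only its input and output topics, staging a new version is a matter of re-pointing a subscription rather than rebuilding a monolith.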
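The metrics point in the list above can also be sketched simply: each application keeps named counters and exposes timestamped snapshots that devOps can scrape and forward to existing tools such as ELK or Splunk. The metric names and the `Metrics` class here are hypothetical, not a DataTorrent API.

```python
import time
from collections import defaultdict

class Metrics:
    """Minimal sketch of per-application metrics; a real deployment would
    export these to an existing monitoring pipeline rather than print them."""
    def __init__(self):
        self.counters = defaultdict(int)

    def incr(self, name, by=1):
        self.counters[name] += by

    def snapshot(self):
        # Timestamped copy, suitable for shipping to a log aggregator.
        return {"ts": time.time(), **self.counters}

m = Metrics()
for _ in range(3):
    m.incr("events.processed")
m.incr("events.failed")

snap = m.snapshot()
print(snap["events.processed"])  # 3
```

The key property is that the application emits metrics in a form the existing monitoring stack already understands, so adopting the big data application requires minimal change to devOps tooling.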
Agility is a very important facet of big data application development, and it covers all aspects that involve time. For big data development and operations to be agile, each task in the life cycle must be made agile, i.e. the time to perform it must shrink. Big data applications developed from scratch take far longer to build and are prone to failure; building apps from pre-baked components (data services) saves time. Applications constructed by loosely coupling data services also allow a data service to be upgraded quickly: staging a new version is easier because it can be swapped in with minimal impact on the running application. DataTorrent’s Apoxi framework includes a services component that greatly improves agility.
Real-time is the most important pillar of a fast big data stack. Current big data stacks lack this path, especially for ETL pipelines. The following figure shows the core data flow on a product whose backend is a big data stack.
In such a setup, the database within LAMP architecture gets replaced by a big data technology stack. In a future blog, I will cover details on how to achieve a real-time response from a big data stack for serving live events. This is the genesis of the future of big data, and is the core principle of the emerging fast big data stack.
Other requirements include what big data stacks have partially achieved today: cost effectiveness and scalability. Cost effectiveness does not just mean reducing the hardware required; it includes aspects like the cost of development resources, being cloud-agnostic, and gluing together pre-baked reusable data services. Development cost drops when a project requires expertise that is commonly available and uses reusable software components/services. Reducing the number of components in the stack also reduces cost, as the effort spent hardening them pays off across multiple products. DataTorrent’s AppFactory consists of hardened data services constructed with hardened operators. Scalability has to be designed into a data service from day one. This means choosing components that are proven to scale out linearly by adding resources; replacing a component within the stack later is an extremely costly endeavor. Example use cases of existing big data services/templates and applications include fraud prevention, IoT smart meters for energy and utilities, adtech, IoT sensors, retail personalization, health care, telco network optimization, schema repository, rules engine, data-in-motion OLAP, GPS tracking, etc. We also have data services/templates for common continuous big data patterns like archival, on-prem to/from cloud sync, data preparation, data lake sync-up, etc. These data services demonstrate the reuse of common functionality, which helps reduce TCO and TTV.
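A common mechanism behind linear scale-out is key-based partitioning: events are routed by a stable hash of their key, so the same key always lands on the same worker and adding workers spreads the load. The sketch below illustrates the idea with a hypothetical four-worker setup; it assumes keys hash roughly uniformly, which is what makes the scale-out linear in practice.

```python
import zlib
from collections import defaultdict

def partition(key, n_workers):
    """Stable key-hash routing: the same key always maps to the same
    worker, so per-key state stays local while load spreads out."""
    return zlib.crc32(key.encode()) % n_workers

buckets = defaultdict(list)
events = [{"user": u} for u in ["alice", "bob", "carol", "alice"]]
for e in events:
    buckets[partition(e["user"], 4)].append(e)

# Both "alice" events land on the same worker, preserving per-user state.
print(len(buckets[partition("alice", 4)]) >= 2)
```

Choosing components that partition this way from day one is what makes "add resources to add throughput" hold, and avoids the costly component swap described above.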
DataTorrent’s Apoxi framework takes all of these requirements into account and glues together best-of-breed components into a production-viable fast big data stack. This architecture will help enterprises quickly standardize, lower TCO, and reduce TTV. We are seeing our customers use the Apoxi framework to leverage big data to transform IT processing just the way LAMP did for web servers. Additionally, macro data services from DataTorrent’s AppFactory will help reduce TCO and TTV even further. With a focus on TCO and TTV, this next-generation big data stack is crystallizing, and it will power the next decade and help enterprises get off MapReduce and similar engines.
To get started, you can download DataTorrent RTS or take a look at our pre-baked use case patterns in DataTorrent’s AppFactory. In my next blog, I will discuss the principles involved in setting up such an architecture using DataTorrent’s Apoxi framework.