Part 1: Failure of open source technologies to deliver successful business outcomes
Part 2: High-level guidelines for achieving successful business outcomes with big data
Part 3: Development Pattern: Application stitched with loosely coupled big data services
Part 4: DataTorrent Apoxi Framework

In part one of this four-part blog series, I discussed how the approach of blindly adopting open source technologies has led to a high failure rate in companies attempting to become data-driven. In this section, I will walk through high-level guidelines required for successful business outcomes using big data.

In my previous blog, I discussed why adding big data to the IT stack should not depress the success rates we have historically enjoyed with architectures like LAMP. All that has changed with big data is that we need to scale in order to handle the exponential growth in data. LAMP could not handle that scale, but in every other respect it worked well. While scale is a hard problem, it is not insurmountable. For big data products, we can have both scale and a successful launch.

Let’s look at the four pillars of LAMP I listed in my earlier blog. It is operable; it can serve a web page in under one second, a.k.a. it is real-time; it is commoditized, a.k.a. it is cost-effective; and we are able to launch web pages very quickly, a.k.a. it is agile. These four pillars are a must if we are to get to LAMP-like solutions for big data. What LAMP lacks is scale. LAMP works only as long as the job is to fetch a user profile. The moment we want analytics on “that type of user,” it is big data and LAMP no longer scales. In short, as we add the fifth pillar, a.k.a. scale, we need to retain all four of the previous pillars of LAMP. IT departments should be as comfortable with the big data stack as they are with LAMP. They must be able to launch big data products quickly and continue to support them within a viable cost structure. Current big data technologies can scale, but they are deficient in all of the other four pillars required for IT departments to be successful.

Today an enterprise starts big data and cloud projects by using an open source stack provided by Hadoop distributions. Let’s walk through this process one step at a time. The major distributions, notably Hortonworks, Cloudera, and even MapR, bundle north of 30 technologies. With each release, they upgrade this bundle to include a preferred version of each technology. Let’s look at what a customer will go through as they put together a big data product.

In general, that product may include any or all of these technologies. Usually, a product has anywhere from four to ten of them in its pipeline.

A typical pipeline has N technologies which, for an ETL pipeline, are arranged one after the other. Data flows from technology 1 to technology N. At a minimum, the user needs the integration between tech 1 and tech 2 to be certified and hardened, the integration between tech 2 and tech 3 to be certified and hardened, and so forth. This is needed for the user to be able to launch the product on time and not get caught up in operational issues.

Given a stack with 30+ technologies, we are looking at close to 1,000 such pairwise combinations (roughly 30 × 29 ordered pairs). Additionally, each technology has multiple versions; it is common to have three versions in use across the ecosystem. That means nine version pairings for each integration, and roughly 9,000 combinations to be tested over the product life cycle. We do need to take this number with a grain of salt, as not all technologies will interact with each other. Nevertheless, the fallacy of a big data distribution being a viable starting point to take a product to market is clear. It will not work. The distributions simply transfer the cost of productization to the user, and that leads to a high failure rate.
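To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The constants are assumptions drawn from the rough figures above (about 30 technologies, three live versions of each); the exact counts matter less than the order of magnitude.

```python
# Back-of-the-envelope integration count; constants are rough assumptions.
TECH_COUNT = 30        # technologies bundled in a typical distribution
VERSIONS_PER_TECH = 3  # versions commonly live in the ecosystem at once

# Ordered pairwise integrations (tech A feeding tech B is a different
# integration from tech B feeding tech A).
pairwise_integrations = TECH_COUNT * (TECH_COUNT - 1)

# Each integration point must be certified across every version pairing.
versioned_integrations = pairwise_integrations * VERSIONS_PER_TECH ** 2

print(f"Pairwise integrations: {pairwise_integrations:,}")   # 870, close to 1,000
print(f"With version pairings: {versioned_integrations:,}")  # 7,830, roughly 9,000
```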

I will go a step further and state general laws of operability in a pipeline with N technologies. I will walk through the impact that the number of technologies in your stack has on various aspects of operability.

Uptime = U1 * U2 * … * Un

For the pipeline to be up 24×7, all components need to remain up at the same time; pipeline uptime is therefore the product of the individual component uptimes.

Cost = C1 + C2 + … + Cn

The total cost of ownership is equal to the sum of the cost of every component.

No single point of failure = F1 and F2 and … and Fn

For the pipeline to have no single point of failure, none of the technology components can have a single point of failure. Therefore, it is an AND function.

Ease of integration into current stack = I1 and I2 and … and In

The ease of integration of the pipeline into a current IT stack depends on the ease of integration of each technology component. A single, badly-behaved technology will cause integration to go bad. Therefore, it is an AND function.

Security = S1 and S2 and … and Sn

For a pipeline to be secure, each of the technology components in the pipeline has to be secure. There can be no security holes in any of the components. Therefore, it is an AND function.

Highly Available = H1 and H2 and … and Hn

The entire pipeline is not highly available if a single technology component is not highly available. Therefore, it is an AND function.

All of the above operability issues show that operability degrades multiplicatively with the number of technologies in the pipeline; the more components, the lower the operability and the lower the success rate. This directly impacts time to market and the ability to successfully extract value from big data. At a certain point, for a lot of enterprises, the success rate is effectively zero percent: the product does not launch, and we have a failure. This is a direct outcome of putting together a stack of half-baked software from the open source community. Herein lies the root cause of why a majority of big data products based on open source fail. The expertise required to stitch together various open source technologies simply does not exist within enterprises outside of Silicon Valley, and even at the epicenter of big data this expertise is very rare. Given the constant changes in open source, this is close to an impossible task.
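To see how these laws compound, here is a minimal Python sketch. The per-component uptimes, costs, and security flags are illustrative assumptions, not measurements; the point is the shape of the math, not the specific numbers.

```python
# Illustrative model of how the operability laws compound with pipeline depth.

def pipeline_uptime(component_uptimes):
    """The pipeline is up only when every component is up: multiply uptimes."""
    total = 1.0
    for u in component_uptimes:
        total *= u
    return total

def pipeline_cost(component_costs):
    """Total cost of ownership: component costs simply add up."""
    return sum(component_costs)

def pipeline_secure(component_secure):
    """The pipeline is secure only if every component is secure: AND."""
    return all(component_secure)

n = 10  # a ten-technology pipeline, each component individually 99% available
print(pipeline_uptime([0.99] * n))            # ~0.904 -> roughly 35 days of downtime a year
print(pipeline_cost([1.0] * n))               # 10x the cost of a single component
print(pipeline_secure([True] * 9 + [False]))  # False: one weak component breaks it
```

Uptime of 99% per component sounds respectable, yet ten of them chained together drop below 91 percent; this is the sense in which operability decays with every technology added.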

Let’s see if Hadoop distros can step in and help. Suppose our pipeline is, say, ten technologies deep. Do note that as we put together a pipeline, it is not just about integration between neighboring technologies; outages, data blockage, or latency within one technology can easily impact the entire technology stack upstream. With 30 technologies to choose from in the distros, and assuming no technology repeats itself, we are looking at 30*29*28*27*26*25*24*23*22*21 possible orderings, which comes to over 100 trillion combinations.
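That product is easy to verify; here is a one-line check in Python (math.perm requires Python 3.8 or later), using the same assumed pool of 30 technologies and depth of ten as in the text.

```python
from math import perm

# Ordered pipelines of depth 10 drawn from 30 technologies, no repeats:
# 30 * 29 * 28 * ... * 21
print(f"{perm(30, 10):,}")  # 109,027,350,432,000 -> over 100 trillion
```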

Additionally, we are not yet counting the fact that each of these technologies has multiple versions and can be repeated. The reality of the situation is not quite that bad, but the number is still far too large to be viable. Do take these numbers with a bagful of salt, but the point remains that this is a fruitless exercise. It is very clear that relying on Hadoop distributions to harden and test pipelines across the various permutations and combinations is foolhardy. They simply cannot cover all the combinations, let alone pre-test them for you. All of the operational details are therefore left to the customer. What the distributions effectively give you is a collection of technologies that pass basic unit tests. Do note this assumes that each of these open source technologies is hardened and ready for production; given habits within open source developer ecosystems, that too is in question. We can see how the current big data open source ecosystem, even with distributions, does not make for an operable product development environment for enterprises.

The first generation of big data was not focused on real time; it was focused on storing data in a data lake. Putting storage in between ingestion and analysis did not help either. Fast big data has become the next generation of big data. For real-time results, we need fast to be added to big data, with storage moved to the side rather than sitting in the data path.
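As a rough illustration of that shift, here is a minimal Python sketch using plain generators rather than a real streaming engine; the function names are hypothetical stand-ins, not any particular product’s API.

```python
# Hypothetical sketch: analysis happens in-stream, storage is a side output
# rather than a stage sitting between ingestion and analysis.

def ingest():
    """Stand-in for an ingestion source, e.g. a message-queue consumer."""
    for i in range(5):
        yield {"user": i, "value": i * 10}

def analyze(event):
    """Stand-in for in-stream analytics on a single event."""
    return {"user": event["user"], "score": event["value"] / 100}

def archive(event):
    """Stand-in for writing to the data lake, off the critical path."""
    pass  # e.g. batched writes to object storage

# First generation: ingest -> store in the lake -> analyze later in batch.
# Results only appear after the batch job runs, so nothing is real time.

# Fast big data: analyze each event as it arrives; archive on the side.
for event in ingest():
    print(analyze(event))  # result is available immediately
    archive(event)         # storage is a side effect, not a pipeline stage
```

The design point is the separation: the lake can remain the system of record without being placed on the latency-critical path between ingestion and analysis.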

An open source ecosystem running on commodity hardware appears to meet our next pillar, namely cost-effectiveness, but it does so only partly. Free open source software and commodity hardware, along with the cloud, helped, but the need for deep big data expertise hurt. An ecosystem that relied on professional services from open source developers hurt further, as their incentive to productize big data was low.

Our fourth pillar, namely agility, goes hand-in-hand with operability. Enterprises routinely take 18 months to launch a big data product, sometimes up to two years, and most fail along the way. Being unable to launch products on a monthly or quarterly cadence is not viable in current markets. LAMP, too, would have failed if it had required 11 to 18 months.

Trying to put together a combination of open source technologies in a DIY manner is thus fraught with danger. We need to look at the problem from another perspective and find a new approach that addresses all five pillars needed for success. In the third part of this blog series, I will discuss how a big data application developed by stitching together loosely coupled big data services is able to address all five pillars and achieve successful business outcomes. It heralds the arrival of the next generation of big data.