I recall reading an article claiming that 70% of Hadoop deployments would fail to deliver in 2017. That seemed pretty high, but the fact is that Hadoop, Big Data, real-time analytics, AI, ML, and the rest are tricky to code and even harder to stand up 7/24, “lights out” style.
At DataTorrent, we pride ourselves on creating Apache Apex, a stream processing engine proven to have these very capabilities. But people don’t buy a processing engine. They buy an outcome: a pipeline of data services that gathers disparate data into a form that can be understood, enriched, sliced, diced, and analyzed for both insight and action.
In the short six months I’ve been on board with DataTorrent, it’s become clear that as a vendor we can’t be myopically focused on the engine. We need to cover more of the pipeline, or the technology in this market will either fail or remain exclusive to a few very large companies with rooms full of PhD Big Data types.
By applying our understanding of the complexity of Big Data analytics to the actual challenges our customers are trying to solve, we had an epiphany: we need to deliver applications for our customers, not just a component.
We therefore looked at our existing client base and where they were spending the most time and money trying to smash the proverbial square peg into a round hole. One area that quickly rose to the top as a customer pain point was payment fraud. Digging in, we found a few key issues:
- “I can’t ingest all my data making me inaccurate”
- “I’m not agile enough to counter new attack variables fast enough”
- “I detect but I am not fast enough to prevent”
Sidebar: We had a beer with an org that told us the average company doesn’t act on a fraudulent activity inside of 100 days. Why, you ask? Well, the market had a 0% unemployment rate, and last I counted there were over 700 jobs posted. So stick a human in the mix with all the false positives that inaccuracy creates, and it’s like a drunk painting the Golden Gate Bridge with an earbud and a spray can.
So, starting from a clean slate, we figured we’d look at what clients wanted, assemble current code components, and be off to the races.
We learned that customers would ideally like to handle “Account Takeover,” “Card Payment,” and “Chargeback” fraud, with an integrated enrichment loop between all three and bi-directional feeds from third-party institutions.
Sidebar: If you do this on events in real time, you would literally be faster than the fraudsters!
In addition to handling the three items above, we believe you’d want this application to be:
- In the act – Event based stream processing
- At scale – Scale out architecture, Scale out and up data sources, accommodate Omni-channel
- Available – Self healing, 7/24, human-free
- Flexible and Agile – dynamic rule changes, the opportunity to inject ML, and a choice of deployment models: on-premises, hybrid, and native cloud
We started from a core of Hadoop, YARN, and Apex, adding Drools for anomaly detection. Beyond the usual glue and integration, we stumbled on large, chunky science projects that needed attention. A few examples of these long poles:
- There’s no sliding-window or state concept in Apex, since it processes event by event, so that had to be engineered
- There’s no scale-out for Drools, so that had to be engineered
- There’s no satisfactory availability story for Drools, so that had to be engineered
- Visualization tools are all geared toward batch, i.e. showing you “what happened,” so we needed to reinvent the dashboard
- Visualization widgets had no flexibility and no satisfactory availability, so that had to be engineered
- Rules couldn’t be changed on the fly, so that had to be engineered
- … and then there’s the “Frankenstein” realization for test and application availability 🙂
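To make the first of those long poles concrete, here is a toy sketch of what sliding-window state over an event-by-event stream means: counting events per key inside a trailing time window. This is purely illustrative and assumes nothing about Apex’s actual operator model; the class name and threshold are made up for the example.

```python
from collections import defaultdict, deque

# Illustrative toy, not DataTorrent's implementation: sliding-window
# state maintained while processing one event at a time.
class SlidingWindowCounter:
    """Counts events per key within a trailing time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps, oldest first

    def observe(self, key, timestamp):
        """Record one event and return the count inside the window."""
        q = self.events[key]
        q.append(timestamp)
        # Evict timestamps that have slid out of the window.
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        return len(q)

# Usage: watch one card's transactions over a 60-second window.
counter = SlidingWindowCounter(window_seconds=60)
counts = [counter.observe("card-42", t) for t in (0, 10, 20, 30, 90)]
# By t=90, the events at t=0..30 have aged out, so only t=90 remains.
```

The hard part in a real engine is doing this fault-tolerantly at scale: the per-key state above has to survive operator failures and be partitioned across machines, which is exactly the engineering the post is describing.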
Net net, I now totally understand and empathize with clients trying to put Big Data analytics pipelines together.
NONE of this is easy even if you have a room full of engineers with PhDs. None of the work is pretty or fun but without it, you have just individual bits and that’s most certainly not a winning formula.
The good news is that we GA a payment fraud prevention application in August. It’s the first of many applications we will offer our clients on a subscription basis.
We’ve shifted our focus from just the engine and toolkit to building enterprise-grade, Big Data real-time analytics applications using the best technology possible and making it available today!
As Jeremy Clarkson would say, “How hard can it be?” Turns out the answer was bloody hard!