As enterprises leverage big data and the cloud, they continue to face problems with operability and productization, which slow their progress toward becoming data-driven businesses. To increase the success rate of big data projects, they need a lower total cost of ownership (TCO) and a faster time to value (TTV). Additionally, most applications in production today extract value from data only after it has been loaded, whether into data lakes or proprietary data warehouses. The problem with this approach is that the value of data depreciates rapidly with time. DataTorrent is focused not only on operationalizing and productizing big data, but on enabling users to analyze and act on data before it is loaded into databases or data lakes. This is where “fast” comes in, and big data becomes fast big data. We therefore measure ourselves by how well we optimize our customers’ TCO and TTV so that they can take action in real time.

Today’s big data ecosystem is grounded in analytics on the data lake, after the data has been loaded. As part of optimizing TCO and TTV, we intend to solve a major problem that exists in the big data ecosystem as well as the cloud, especially in those parts powered by open source. This problem causes a large percentage of applications to fail to reach production, thereby inhibiting a business from becoming data driven. The race to become a committer on an open source project has resulted in non-operable code. A reward structure that favors consultancy (professional services) in open source has created a bias against productization. This stampede has produced far too many open source projects, most with little regard for operability or productization. A bane of big data and cloud applications is the emergence of too many technologies and options at every stage of the pipeline, with callous disregard for TCO, TTV, and the success of launching the application. For a business, free open source is both a boon and a curse.

In this blog, I am focusing on how we at DataTorrent went about constructing a viable fast big data stack. Viability requires carefully selecting best-of-breed big data technologies, then hardening and integrating them into one robust stack from both a functional and an operational point of view. This stack serves as a reference architecture for fast big data applications that enterprises can adopt.

Big data applications do not exist as silos; they are always part of an end-to-end data flow. We believe that a big data reference architecture (i.e. scale out) works well as a collection of loosely coupled applications joined via a message bus. This architecture is easy to operate: each application runs with its own lifecycle and communicates with other applications via versioned schemas built over the message bus. In today’s IT a message bus is ubiquitous, and we have selected Apache Kafka as the message bus in our reference architecture. For the compute engine we have selected Apache Apex, given its focus on operability and productization. Within the compute engine we intend to provide a robust set of computations. The Malhar library within Apex already provides such computations, including connectors for ingest, compute, and load, thereby reducing the time required to develop code for common tasks. In addition to Malhar, we now have a number of application templates available via DataTorrent’s AppFactory, which I will discuss later in this blog. We have also decided to provide a scale-out rules engine, and for that we have selected the Drools engine. For the distributed file system we work with both HDFS and S3, and we will soon add support for Azure File Storage. We will continue to invest in operationalizing and productizing this stack.
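To make this concrete, below is a minimal sketch of an Apex application that wires a Malhar Kafka connector to a placeholder sink. It follows the Apache Apex and Malhar APIs, but operator, port, and property names can differ across versions, and the broker address and topic name are hypothetical; treat it as an illustration rather than the exact reference implementation.

```java
// Minimal Apex application sketch: Kafka ingest via a Malhar connector.
import org.apache.hadoop.conf.Configuration;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;

@ApplicationAnnotation(name = "KafkaIngestSketch")
public class KafkaIngestSketch implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Malhar connector: consumes messages from a Kafka topic.
    KafkaSinglePortInputOperator input =
        dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
    input.setClusters("kafka-broker:9092");  // broker list (hypothetical host)
    input.setTopics("transactions");         // topic name (hypothetical)

    // Placeholder sink; in a real pipeline this would be a compute/enrich
    // operator followed by a loader (HDFS, S3, Cassandra, etc.).
    ConsoleOutputOperator sink =
        dag.addOperator("console", new ConsoleOutputOperator());

    // Streams wire operator ports together; the platform handles partitioning,
    // checkpointing, and recovery of these operators at runtime.
    dag.addStream("events", input.outputPort, sink.input);
  }
}
```

The point of the sketch is that only the wiring and business logic live in application-level Java code; the connectors and the operational machinery around them come from the stack.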

In addition to hardening and productizing this stack, we are also cognizant of our customers’ need for other features, including machine learning, NoSQL, log analysis, and OLAP. With that in mind, we will integrate with and leverage other technologies. For machine learning, we have selected Apache Spark. Given the diversity in log aggregation, we have integrated with Elasticsearch, Solr, and Splunk for real-time log aggregation. For NoSQL we have selected Apache Cassandra, and for OLAP we have selected Druid. The integration effort includes certification and feature development to make each integration robust, operable, and productized. As we construct a fast big data stack, we retain our relentless focus on TCO and TTV.

All the technologies we select for our stack are individually hardened, tested, benchmarked, and certified for functionality as well as operability. We also certify them working together and ensure that developers need to do only the minimum required for a big data application. Our aim is to take care of 90% of the effort for a big data application.

The abilities we focus on and ensure as we harden this stack include:

  • Ability to meet tight SLA requirements: For a fast big data stack it is critical that applications with tight SLAs be viable. The over-reliance of Hadoop 1.0 on the file system caused a big data ecosystem to develop that could not handle tight SLAs. We test our stack for latency with optimal resource utilization.
  • Fault tolerance & high availability: Among the biggest operability issues with big data are lack of fault tolerance, data loss, and lack of high availability. We have built full fault tolerance with no data loss into Apache Apex, and we continue to leverage that expertise as we harden new technologies into our fast big data stack.
  • Linear scalability: Linear scalability is needed for the application stack to sustain future needs. Inability to scale a few years down the road will force a change in reference architecture, which is a very costly and risky undertaking.
  • Performance: One of the components of total cost of ownership is hardware, and support costs also scale with the number of servers. A highly performant stack is critical to managing this cost. Performance goes hand in hand with scalability, as it reduces the hardware footprint.
  • Security Compliance: Big data applications handle private data and are governed by tight security regulations. Within DataTorrent, we strive to meet security regulations and we are bringing our security efforts to bear on our stack.
  • Ease of integration: Big data applications do not exist in silos; they are part of an end-to-end data flow across various parts of the enterprise IT stack. This requires the fast big data stack to integrate easily with the rest of the IT stack.
  • DevOps friendly: Running an application 24×7 requires strong monitoring and operational support. DevOps teams usually already have a set of tools they use to manage IT, and the fast big data stack must integrate seamlessly into this setup. Ideally this includes RESTful web services and metrics at both the system and the application level.
  • Leverage commonly available expertise: One of the biggest and riskiest costs in big data applications is the expertise needed to develop and support them. Today the vast majority of big data applications fail for lack of this expertise, so the new stack needs to lower this requirement. Apache Apex based applications do so because business logic is expressed in application-level Java code.
  • Operational management of applications: Big data applications are notoriously costly to operate, and the operational tasks are very detail oriented. These tasks include, among others, upgrades, what-if testing, backward compatibility, certification, versioning, and configuration management.
  • Leverage existing code: Most if not all big data applications replace applications built on an old legacy stack that cannot scale. This means a lot of business logic already exists. A big data stack based on Hadoop 1.0 or Spark requires most if not all of that code to be rewritten from scratch, which gives rise to bugs, creates delays, and often leads to failure of the big data application. A stack that enables developers to leverage existing code significantly improves the success rate of a new big data application.
  • End-to-end exactly once: A lot of big data applications require pipelines to provide exactly-once processing on an end-to-end basis. The stack has to support this natively, as it is very hard to achieve in user code.
  • Enterprise grade UI: One of the main components of being able to launch any application is the user interface. Lack of a good UI significantly reduces the value proposition of any stack.
  • Application & system metrics: As part of productizing our fast big data applications, we ensure that robust application and system level metrics are available, both in real time and on a historical basis. The data is stored as time series that can then be used for any purpose. Big data applications frequently need analysis for performance or for root causes of issues, so the relevant data must be created and available. Additionally, this data can be fed into ops monitoring systems as well as BI tools for further value add. A sketch of how application-level metrics can be published follows this list.
  • Cloud first & cloud vendor agnostic: Usage of, or the need to use, the cloud has become ubiquitous. All parts of the fast big data stack are certified, benchmarked, and fully supported for the Amazon big data ecosystem, and we are now in the process of covering the Azure ecosystem too. The technologies within this stack are not hard-bound to a particular cloud vendor, keeping the stack cloud vendor agnostic so that enterprises can migrate into and out of a cloud vendor without changing functional code.
  • Dynamic updates: A fast big data stack runs 24×7; it is live and real-time. This means that any updates or changes should not require an application restart.
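As an illustration of the application and system metrics point above, here is a minimal sketch of an Apex operator that publishes per-window counters through the @AutoMetric annotation, which the platform collects and exposes through its monitoring layer. The Transaction type, the threshold, and the operator name are hypothetical placeholders.

```java
// Sketch of application-level metrics in an Apex operator via @AutoMetric.
import com.datatorrent.api.AutoMetric;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class SuspiciousAmountFilter extends BaseOperator
{
  // Hypothetical fact class carrying a transaction amount.
  public static class Transaction
  {
    public double amount;
  }

  @AutoMetric
  private long processedCount;  // tuples seen in the current window

  @AutoMetric
  private long flaggedCount;    // tuples above the threshold in the current window

  private double threshold = 10_000.0;  // illustrative threshold

  public final transient DefaultOutputPort<Transaction> flagged = new DefaultOutputPort<>();

  public final transient DefaultInputPort<Transaction> input = new DefaultInputPort<Transaction>()
  {
    @Override
    public void process(Transaction t)
    {
      processedCount++;
      if (t.amount > threshold) {
        flaggedCount++;
        flagged.emit(t);
      }
    }
  };

  @Override
  public void beginWindow(long windowId)
  {
    // Metrics are reported per streaming window; reset counters at window start.
    processedCount = 0;
    flaggedCount = 0;
  }
}
```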

We have provided tools to big data developers with a focus on reducing the time to develop and launch, and the cost of ongoing support. We realized that we can further optimize big data applications by moving up the application ladder. We made a small stride in this direction late last year by providing basic fast big data application templates, designed as kick-starters to get fast big data developers off and running. Our aim was to ensure that our software takes care of all aspects of a big data application and leaves only the business logic to the developers. These templates ensure that fast big data applications are developed with a lot of the low-hanging fruit already taken care of. They focused on connections to various technologies, be it for ingesting data or for loading data. Later versions also included basic transformations such as parsers, enrichers, filters, loaders, and dimensional computations.

We are now moving further up the application ladder and growing this application factory with applications that focus on vertical use cases. These applications leverage DataTorrent’s fast big data stack. Our mission to operationalize and productize big data and cloud is now being expressed via our AppFactory.

The AppFactory will make available the building blocks used to build big data applications, as well as hardened and operable fast big data applications. Developers may need to add some code to tune these applications to their precise functional specifications. The AppFactory will significantly de-risk the push to become data driven.

We are targeting fraud detection and prevention as our first application suite in the AppFactory, bringing big data and the cloud to bear on fraud prevention. The first application in this suite targets omni-channel fraud prevention for credit cards. A fraud prevention application needs to be flexible: it must be able to add and modify rules and models, subscribe to a variety of data sources, integrate easily with other IT systems, scale linearly on demand, and run both on-premise and in the cloud. Fraud prevention has to happen in real time and thus needs a fast big data stack; it cannot be done on data already sitting in a data lake. The cost structure should scale and leverage a common stack, since within an enterprise different teams work on different aspects of fraud prevention. A message bus based setup that leverages commodity hardware and scales linearly is a critical aspect of these applications. Lastly, and most critically, the application has to be operable and productized. The next set of fraud applications includes account takeover, chargeback, and others.
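To illustrate how the Drools engine mentioned earlier can be driven from Java for rule-based fraud checks, below is a minimal sketch of a rule evaluator. It assumes the fraud rules themselves live in DRL files packaged on the classpath (via kmodule.xml); the Transaction fact, its fields, and the evaluator class are hypothetical.

```java
// Sketch: evaluating fraud rules with Drools from Java.
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class FraudRuleEvaluator
{
  // Hypothetical fact describing a card transaction.
  public static class Transaction
  {
    public String cardId;
    public double amount;
    public String channel;   // e.g. "online", "pos", "atm"
    public boolean flagged;  // set to true by a matching rule
  }

  private final KieContainer container;

  public FraudRuleEvaluator()
  {
    // Loads the rule packages found on the classpath.
    KieServices services = KieServices.Factory.get();
    this.container = services.getKieClasspathContainer();
  }

  /** Runs all fraud rules against a single transaction and reports whether it was flagged. */
  public boolean evaluate(Transaction txn)
  {
    KieSession session = container.newKieSession();
    try {
      session.insert(txn);
      session.fireAllRules();
      return txn.flagged;
    } finally {
      session.dispose();
    }
  }
}
```

In a streaming deployment, an evaluator like this would typically sit inside an Apex operator, so rule evaluation scales out with the operator’s partitions rather than being bound to a single process.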

We will continue to focus on enabling enterprises to successfully launch big data products in a short time, on-premise or in the cloud, and to operate these products at low cost, all in real time. Expect more strides and updates in this endeavor in the near future.