DataTorrent Delivers Enterprise‐Grade Batch and Stream Processing
Date: July 2015
Author: Nik Rouda, Senior Analyst
Abstract: DataTorrent offers an open-source, enterprise-grade, unified batch and stream analytics platform that addresses a critical enterprise need: reaching insights faster and acting on them rapidly, in a simple and cost-effective manner. Businesses are finding many uses for the platform, some of them transformative for their operations, because they can now react with immediate situational awareness. Continued development of “ease of use” features for self-service querying and application development will further improve the value of the platform.
DataTorrent is an innovative startup focused on solving the specific challenges related to gaining real‐time insights and taking action in a big data environment. The company’s unified batch and stream analytics platform is oriented around three major functional priorities:
• Ingestion of data from a wide variety of sources, at any volume and at any speed required.
• Analytics of these combined data sets for fast response—not in hours or days, but in minutes or seconds.
• Actionable connections to business applications and processes, triggering not just insights but also responses.
Though DataTorrent is not yet a brand name, the company and its offerings show great promise to deliver against the hype of real‐time big data. Early customers are able to articulate significant advantages from their adoption of DataTorrent solutions and are confident that they will be well-served in the future.
The Requirements for Streaming Analytics in the Enterprise
ESG research found that 58% of surveyed companies stated that 10-40% of their staff currently leverages applications built on big data analytics.1 No longer the domain of a few business analysts, data scientists, and privileged executives, big data analytics is now a “must have” capability underpinning widely used applications and weaving into daily business activities and decision making. Coinciding with this trend toward big data is the need for speed. Even a year ago, a full 23% of survey respondents indicated that they wanted to use Hadoop for real-time and/or streaming analytics.2
While Hadoop is well designed to offer the economical storage of massive amounts of unstructured data, it doesn’t always meet the needs of enterprises looking for a mature and robust streaming analytics platform because it was originally built for batch analytics via MapReduce. Although users were satisfied in the past by dashboards and reports that gave them daily, monthly, or quarterly summaries, they now expect raw information and detailed analysis to be immediate. Figure 1 further reflects this shift in timeliness, with most wanting their data to be no more than seconds old for near real-time use cases.3 Certainly traditional extract, transform, load (ETL) processes and batch analytics are not going to be adequate for this challenge! A new approach is needed. The popularity of subsequent Apache projects and related vendor-proprietary development around Spark, Kafka, Flume, and other tools demonstrates the significant interest in solving this problem.
Yet the relative recency and immaturity of various data pipelines and streaming tools for Hadoop can mean that application architects and developers don’t have as rich or as stable a platform as they might like, particularly for a mission-critical enterprise application. Production environments demand not just high performance and analytics functionality; they need to meet operational standards and service level agreements to be acceptable for the business. This area is where DataTorrent excels: the production of an open-source, feature-rich, enterprise-grade unified batch and stream processing analytics platform.
To support enterprise needs, the core platform must offer:
• Linear scalability, without diminishing returns, as a scale-out, distributed platform.
• In-memory analytics for the highest performance, only going to disk as output.
• High availability and fault tolerance to continue without interruption or data loss.
• Full compliance with the broader Hadoop ecosystem to support popular distributions.
• Full integration with a rich variety of data sources and their particular foibles.
• A robust development framework for applications.
DataTorrent provides demonstrable features and functionality to address each of these points. Technology decision makers should evaluate prospective streaming solutions against this list to ensure that any proposed offering can meet the needs of the business. While some open-source and even proprietary approaches may satisfy some of these requirements, few can effectively satisfy all of them. Buyer beware.
A Real-world Case Study in Consumer Technology Advertising
To better understand the implications of DataTorrent in a large-scale production environment, ESG interviewed the director of engineering for a brand-name global consumer technology company. Responsible for the development of a streaming analytics platform to underlie the company’s extensive advertising network, he detailed how DataTorrent was selected and implemented for the company’s environment.
Two primary concerns were outlined as core to the evaluation process:
1. Finding a strong technology partner to provide accessible engineering talent and support expertise.
2. Vetting the features and platform architectural design to ensure they could meet strict requirements today and in the future.
The advertising network context is significant because it implies the need for integration with a complex set of business processes and activities, including bidding, buying, and arbitraging spots; routing customer traffic; tracking the pace of delivery, inventory, and spending; and mediating ads to tailor placement toward the optimal forums and audiences. One notable observation was that “the key is the feedback loop”: success increases in proportion to how well the team understands the instantaneous state of each activity. Rapidly changing variables in ad costs, competition, audience engagement, and other areas need to be optimized in real time to achieve the best results for the business. This core competency isn’t merely theoretical; it has a direct impact on the company’s sales revenue and profitability, materially affecting results that even the boardroom and Wall Street can see.
The company has done this by building DataTorrent into its existing Cloudera Hadoop environment with Flume connectors. Apache Spark and Storm were considered but rejected as not quite stable or scalable enough for the business’s needs.
With the company’s previous batch analytics approach to managing advertising, the IT department’s SLA called for updates every 45 minutes. Today, the SLA is less than one minute of latency to report on live activity, a distinct improvement! In addition, all events are registered with zero data loss and each action is reflected exactly once, which is critical to ensuring that ad budgets are not overspent or misapplied. Now, service outages won’t stop data capture, and any accumulated backlog can be “drained” in a matter of hours at most.
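To make the “exactly once” requirement concrete, the following is a minimal sketch of one common way to keep replayed events from being counted twice: deduplicating on a unique event ID before updating spend totals. It is an illustration only and does not describe DataTorrent’s internal mechanism; the names AdEvent, SpendLedger, and drain, and the assumption that every event carries a unique ID from its source, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AdEvent:
    event_id: str    # hypothetical unique ID assigned at the source
    campaign: str
    cost_cents: int


@dataclass
class SpendLedger:
    """Tracks per-campaign spend, applying each event at most once."""
    seen_ids: set = field(default_factory=set)
    spend_cents: dict = field(default_factory=dict)

    def apply(self, event: AdEvent) -> bool:
        """Apply the event; return False if it was already recorded."""
        if event.event_id in self.seen_ids:
            return False  # duplicate delivery: do not charge the budget again
        self.seen_ids.add(event.event_id)
        self.spend_cents[event.campaign] = (
            self.spend_cents.get(event.campaign, 0) + event.cost_cents
        )
        return True


def drain(ledger: SpendLedger, backlog) -> int:
    """Replay a buffered backlog; return how many events were newly applied."""
    return sum(1 for event in backlog if ledger.apply(event))
```

In this sketch, if an outage causes an upstream source to resend events that were already recorded, draining the backlog applies only the events not seen before, so campaign spend is neither lost nor double-counted. A production system would also need to persist the set of seen IDs (or use checkpoints) so that the guarantee survives a restart.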
The scalability of the DataTorrent solution has already been proven out, handling 1-2 billion events daily, yet capable of growing linearly with the addition of more server resources to effectively cover 4 billion, 8 billion, or more in the future. Though this scale couldn’t be pretested in an artificial lab environment, the architecture of the platform was carefully examined to understand how the distributed analytics-processing engine would react under extreme loads. The non-blocking event window processing was seen as a key feature here, as was fault tolerance. This was also where confidence in DataTorrent’s staff mattered, so that any unexpected issues could be resolved collaboratively. Similarly, minor kinks were encountered with Flume integration and at the application development level, but were quickly addressed.
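The scale figures above can be put in perspective with some simple arithmetic. The sketch below converts a daily event volume into an average per-second rate and estimates server counts under an idealized linear-scaling assumption. The baseline cluster size used in the usage note is purely hypothetical, since the case study does not state how many servers were deployed, and real traffic will peak well above the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day: int) -> float:
    """Average sustained rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


def servers_needed(target_events_per_day: int,
                   baseline_events_per_day: int,
                   baseline_servers: int) -> int:
    """Servers required if capacity grows in exact proportion to server count.

    Assumes ideal linear scaling; rounds up to a whole server using
    integer ceiling division.
    """
    return -(-target_events_per_day * baseline_servers // baseline_events_per_day)
```

For example, 2 billion events per day works out to an average of roughly 23,000 events per second. If a hypothetical 10-server cluster handled that volume, ideal linear scaling would call for 20 servers at 4 billion events per day and 40 servers at 8 billion.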
The streaming analytics use cases haven’t been limited to the initial design: Since deployment, the company has expanded the initiative to explore session‐aware web analytics and also real‐time updates of the analytics models themselves. Each of these new capabilities is being developed interactively by IT engaging with the business side of the company, not just as “interesting feats of technology.” Meanwhile, the company still uses batch processing on an aggregated operational data store written by DataTorrent to a scalable database, alongside Hadoop and DataTorrent, for later investigation via interfaces with common business intelligence tools and custom-built internal portals, which all complement the DataTorrent solution.
As for future product enhancements, the company would like to see DataTorrent develop a broader application and service creation interface for its “self-service” power users, and would appreciate more ad hoc querying functionality and a way to tap into the data stream and modify the application on the fly. All of these would be appreciable advantages, and the company expressed no deep concern about DataTorrent’s ability to execute against this roadmap. The deployment so far has certainly satisfied the engineering team, and has been all but invisible to the business users (except for being far faster than before!).
The Bigger Truth
Big data is still an emerging school of thought impacting a wide range of analytics, with significant diversity being both an advantage and a concern for many practitioners. Some have clearly defined needs and problems to solve at once, while others are only now exploring the possibilities. Within the field of big data, streaming analytics has rightfully gained a lot of attention as a value creation technique. Faster understanding of, and faster response to, complex real-time information inputs may seem hard to accomplish, but also hold a tantalizing promise of far greater business agility.
Developing a platform for real‐time streaming has been fraught with difficulties for most enterprises so far. They are trying to follow the rapid changes in Apache projects and vendor implementations, but very few have the in‐house expertise to safely avoid all the risks of wasting time, money, and political capital on unproven, cobbled‐together solutions. Even global enterprises with large IT teams and budgets would prefer a more reliable path to success for streaming analytics initiatives.
DataTorrent has so far shown itself to be a capable player in this area. The combination of talent and technology the company is bringing to bear on the problems appears to be paying off for both DataTorrent and its customers. Anyone looking for help implementing a real‐time analytics application would be well advised to check out DataTorrent’s platform as an alternative to “build your own from scratch” approaches.
1 Source: ESG Brief, 2015 ‘Big Data’ Spending Trends, April 2015.
2 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, January 2015.
3 Source: Ibid.