DataTorrent Celebrates Uptick in Open Source In-Memory Data Processing
JUNE 27 2016
BY JASON STAMPER
DataTorrent was founded in 2012 by ex-Yahoo veterans, who recognized the value in being able to use Hadoop not just as a low-cost storage pool, but also as a data processing and analytics platform. Its core technology, RTS, is able to offer both batch processing and streaming analytics on top of Hadoop, and since it was open-sourced in summer 2015, contributors have grown from 40 to over 10,000.
The 451 Take
Probably DataTorrent’s biggest claim to fame is the ability to offer both batch processing and data streaming in a unified platform that runs on top of Hadoop. Its open source underpinnings – namely Apache Apex, which recently graduated to a Top-Level Project – should also give potential prospects the reassurance that they are joining a vibrant ecosystem built around in-memory data processing. Crucially, DataTorrent has also worked hard on ease-of-use; for example, offering self-service visualization as well as GUI-based application development for less-technical users.
DataTorrent was founded in May 2012 by Yahoo! veterans Phu Hoang and Amol Kekre. Phu – now CEO – was a founding member of the engineering team at Yahoo!, while Amol – now CTO – was the architect and senior engineering manager for Yahoo! Finance. He was also the director of engineering for Yahoo! Hadoop, and was one of the primary contributors to the Apache YARN project.
They started with $750,000 in seed funding from Morado Venture Partners and Yahoo! cofounder Jerry Yang’s AME Cloud Ventures, and raised a further $8m in a series A round in June 2013.
In April 2015, the company had a $15.1m series B round, taking total funding to $23.85m, with the lead investor being SingTel Innov8 Ventures. In a recent briefing with the firm, it said it used that round primarily to expand its sales and marketing functions. Its staff numbers have risen from 20 when we were briefed in October 2014, to 47 today. It says it has no imminent plans for another funding round.
Despite the increasing popularity of the open source Hadoop data storage and processing platform, DataTorrent’s founders recognized that because it runs in batch mode, its usefulness for rapid decision-making was limited. In response, the company developed a stream processing engine that sits on top of a standard Hadoop cluster. It offers unified batch and stream processing, with fault tolerance across the cluster too.
However, while DataTorrent RTS offers fault tolerance across a cluster, it isn’t fully ACID compliant. ACID compliance is desirable in some transactional use cases – it stands for atomicity, consistency, isolation and durability.
However, DataTorrent says RTS is used predominantly for analyzing or processing very large data sets stored in Hadoop in batch mode or in real time, rather than OLTP (Online Transaction Processing).
Stream processing has traditionally been reserved for use cases with very low latency requirements, such as high-frequency algorithmic trading, risk and market surveillance. But today there are far more instances where there are latency-sensitive applications, in areas such as e-commerce, online advertising and gaming, as well as the emerging Internet of Things (IoT).
DataTorrent claims to be able to handle over 1.6 billion events per second – and that’s on a 35-node Hadoop cluster sitting on standard Intel-based servers (12 cores, 256GB of memory per node). The company is able to achieve this throughput thanks to its in-memory architecture.
DataTorrent didn’t make its core RTS open source at its outset – it did that only in August 2015, announcing the incubation of Apache Apex at the Apache Software Foundation (ASF). The company told us it wanted the core technology to be reasonably mature before it made it open source.
DataTorrent told us at that time (August 2015) there were 40 members, who were all DataTorrent staff. In three months, that membership reached 400, and three months after that 4,000. Today the company says there are over 10,000 members of Apache Apex.
In late April this year, Apache announced that Apex had graduated from the Apache Incubator to become a Top-Level Project (TLP), which it says signifies that the project’s community and products have been well governed according to the ASF’s process and principles. Today then, the DataTorrent RTS Core engine is based on Apache Apex, and so benefits from not just DataTorrent but many other contributors to the code. Of course, DataTorrent needs to have some proprietary technology to sell to have a business beyond offering commercial support to Apex users, and indeed it does.
There is dtManage, a management console aimed predominantly at DevOps who need to monitor, manage, update and troubleshoot RTS. There is security infrastructure not present in Apex such as Kerberos, Role Based Access Control (RBAC), LDAP/AD integration, data encryption and secure data streams. RTS also offers rolling system and application upgrades, customized application metrics, web service API versioning and compatibility and cross-datacenter fault tolerance and data ingestion.
RTS also has additional connectors for technologies such as message buses, SQL and NoSQL databases, flat files, Kafka, Scoop, Flume, Twitter and more. There’s pre-built Java business logic for the likes of data de-duplication, real-time OLAP cubes or event-driven rules engines.
DtAssemble offers GUI-based development for less-technical users, while dtDashboard adds self-service real-time and historical data visualization. There’s also DataTorrent Data Ingestion for Hadoop, a stand-alone application that is also integrated with RTS. It gives the ability to select from batch and streaming data sources, compress, compact and encrypt data and store it into any data source in a single or multiple Hadoop cluster/location. That product is currently in beta.
RTS Core – as the main engine is now known – is on version 3.4 today. We understand the next version, 3.5, should reach general availability in the third quarter, with additional improvements to its tooling.
The company wouldn’t tell us how many paying customers it has, but it said key vertical sectors are financial services, adtech, telecom and IoT. It currently has four reference customer case studies. GE is using RTS for data ingestion and time series data for its GE Predix Cloud – a PaaS for industrial data analytics.
PubMatic, which offers publishers, design-side providers and agencies analytics about competitive benchmarking, uses RTS to offer those customers insights within a few minutes of event generation through real-time reporting, monitoring and alerting.
Capital One, one of the 10 largest banks by deposits, chose Apache Apex. It built an analytics application for threat detection and monitoring on Apex, and achieved latency of 2 milliseconds. Its requirement for 99.5% uptime was easily met, with 99.9995%, and 1,000 events burst requirement became 2,000 events burst at a net rate of 70,000 events/second.
Finally Silver Spring Network uses DataTorrent real-time analytics in its Silverlink Sensor Network, a network-based service for analyzing real-time smart grid big data.
Apex is not the only open source technology that promises rapid analytics on top of Hadoop. There is also Apache project Storm, which companies could try to use with Hadoop and various other open source tools. DataTorrent says its own technology is over 1,000 times more performant than Storm, not to mention offering far more in the way of management and monitoring (and fault tolerance), and it says its customers’ own benchmarks have backed this up when it has been compared in real-world situations.
Hadoop distributors MapR and Hortonworks support Storm however, while Cloudera has added support for the Apache Spark in-memory data-processing engine and its related Spark Streaming project.
For streaming (although not built from the ground up with Hadoop in mind), companies might look to IBM with its InfoSphere Streams, TIBCO BusinessEvents or Informatica with its Vibe Data Stream for Machine Data. There’s also SAS’s Event Stream Processing (ESP) and SQLstream’s Blaze.
Also, technologies such as BusinessEvents and complex-event-processing engines such as TIBCO’s StreamBase and Apama (acquired by Software AG in June 2013) are aimed primarily at handling events triggered by specially developed business applications, rather than events from multiple apps being consumed by Hadoop.
Potential customers might also be looking at AWS’s Kinesis stream-processing service, although DataTorrent considers that potentially complementary, maintaining that Kinesis is more likely to be used to stream data into the AWS cloud, rather than for computational processing of data.
If companies are using Hadoop to store and subsequently try to analyze machine-generated data such as log files, then rivals would include companies such as Splunk (particularly its Hunk exploratory analytics platform for Hadoop), Logentries, Loggly and X15 Software.
DataTorrent’s cofounders have a strong real-time engineering pedigree at Yahoo!, which is partly what makes its scalability and fault tolerance claims so credible. Combining batch and stream processing in a unified architecture on top of Hadoop is garnering plenty of attention.
The company is still small and only has a handful of published customer case studies, and it’s staying mum on how many paying customers it has altogether for the time being.
There’s growing interest in enabling Hadoop to run in real time rather than batch mode, as companies expand their use of Hadoop and the role it plays in a broader analytics strategy such as IoT. The GUI-based development environment and self-service analytics will be welcome in companies with limited technical staff.
The company faces at least some competition from free open source projects such as Storm and Spark, which are also now supported by Hadoop distributors such as MapR, Hortonworks and Cloudera. There are also far larger competitors offering analytics on Hadoop, albeit with different architectural approaches.