A client conversation and the subsequent LinkedIn post got me tapping at my iPad again…
What triggered this exchange was me asking, “How much time is there between data created and insight garnered?”
His response, “approximately 60ms,” totally confused me!
I quizzed him on the data pipeline. He said the data flows from Kafka to Apex to Cassandra, and then they drill into the datastore with their query engine and UI.
So they use a traditional Extract, Transform, Load (ETL) model but augment it with streaming data. In other words, they ingest real-time streams but immediately park them in a data lake. Aha, so the “data created” wasn’t the moment the data was actually created but rather the search query timestamp. The 60ms was the round trip to visualize the answer.
That’s all fine but exactly how stale is the data? There isn’t a “use by” or “sell by” date so how do you know what you’re really seeing?
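To make the distinction concrete, here is a minimal sketch of the two numbers being confused. It assumes each record carries the timestamp of when it was created at the source; the names `Record`, `event_time`, `data_age` and `describe` are mine, purely for illustration, and have nothing to do with the client's actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Record:
    value: str
    event_time: datetime  # assumed: when the data was created at the source


def data_age(record: Record, query_time: datetime) -> timedelta:
    """How stale the data is at the moment the query is answered."""
    return query_time - record.event_time


def describe(record: Record, query_time: datetime, query_latency: timedelta) -> str:
    """Report query latency and data age side by side, like a 'use by' label."""
    age = data_age(record, query_time)
    latency_ms = query_latency.total_seconds() * 1000
    return f"answered in {latency_ms:.0f} ms, data is {age.total_seconds():.0f} s old"


# Hypothetical numbers: a 60 ms query over a record created three hours earlier.
query_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
record = Record("sensor reading", query_time - timedelta(hours=3))
summary = describe(record, query_time, timedelta(milliseconds=60))
print(summary)
```

The query latency tells you how fast the answer came back. Only the data age tells you how fresh the answer is, and it is the number that usually goes unreported.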
Legitimately, I could ask the following questions:
“Do you ingest real-time streaming data?”
“Do you provide real-time data analytics?”
It feels like real-time, but it isn’t thinking, analyzing and acting in the NOW.
So what’s wrong with this? Nothing really, but let’s just make sure we know what we are looking at and what we truly have here.
We ingest real-time data and use it to enrich stale data. We’ve left the architecture as-is, so we don’t have to change too much. We answer most of the current use cases and everyone’s happy, right?
Well, no, not really.
Have you ever gone to Google Earth, zoomed into your house only to be frustrated that it’s not even close to being a current image? You hoped the data would be a little more up-to-date and you only know that it’s not because you see a different car, a tree gone, or missing solar panels, etc.
In the same way, if you are presented with insight from your company’s analytics, and it’s purportedly ‘real-time’ information then you would assume it’s fresh and to the NOW. How old is it: minutes, hours, days, weeks or even months? You just have no idea!
Or here’s another example. Wifey and I just had a shocking experience when house hunting.
We viewed a cute cottage. We asked for the disclosures; they showed nothing alarming except for the usual roof issues, broken sockets, leaking downspout, etc. For other reasons, we didn’t pursue the cottage, but we did engage another realtor, and on mentioning this property she said, “Oh, the murder house.”
The point of this example is that people make assumptions. We didn’t know the fine print of what legally has to be disclosed, and we didn’t know we were looking at a filtered dataset.
So we continue to perpetuate the existing ETL model, and in doing so we make static “old data” the King and “live NOW data” a subordinate actor, missing the benefit of its freshness.
We all make immediate, cognitive and insightful decisions every day. We get into a situation, we enrich the current data with past data and we make a call. In other words, our organic behavior is to make “NOW data” King, with “old data” serving as enrichment, and not vice versa.
We are where we are because the technology did not previously offer us that choice. Until recently, we could not process streaming data. We couldn’t ingest it or gain insight from it, and we certainly couldn’t enrich it in real time. However, times have changed.
Data analytics is no longer a technology issue. I posit that if you have the ability to drive insight and actions on data in motion, garner the value while it’s fresh AND THEN drop the data into a lake for after-the-fact analytics, trending, and enrichment, then why the heck not?
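The ordering I’m arguing for can be sketched in a few lines. This is a toy, not a real pipeline: a plain Python list stands in for the stream and another for the data lake, and `process_stream`, `is_actionable` and the failed-login rule are all hypothetical names and thresholds I made up for the example. In practice the events would come from a stream consumer and the lake would be durable storage.

```python
from typing import Callable, Iterable, List


def process_stream(
    events: Iterable[dict],
    is_actionable: Callable[[dict], bool],
    act: Callable[[dict], None],
    lake: List[dict],
) -> None:
    """Act on each event while it is fresh, THEN park it in the lake."""
    for event in events:
        if is_actionable(event):
            act(event)      # insight and action happen on data in motion
        lake.append(event)  # every event is still kept for later analytics


# Hypothetical security use case: flag a burst of failed logins as it happens.
alerts: List[dict] = []
lake: List[dict] = []
events = [
    {"user": "a", "failed_logins": 1},
    {"user": "b", "failed_logins": 7},
]
process_stream(events, lambda e: e["failed_logins"] > 5, alerts.append, lake)
print(f"{len(alerts)} alert(s) raised, {len(lake)} event(s) in the lake")
```

Nothing is lost by doing it this way round. The lake ends up with exactly the same data as before; the only difference is that the alert fired before the data went stale.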
Ever heard someone complain that their analytics came too quickly? Or do you more often hear someone say, “Shit, if I had known that sooner, I would have…”
It just takes an inquisitive mind to shift left and move some of your analytics to the stream side of your data lake. The results will be profound, for both competitive advantage and security!