Friday Musings: Why streaming data? Why now?
My first real, grown-up job was supporting a trading and risk-management system for an FX options desk. I'd get in early in the morning, talk to Europe on the way in, and then walk the trading floor, making sure every trader could start their X-11 windows.
Hold up. X-11 windows? For traders? Why were we doing that?
Desktops in 1999 were too underpowered to run the kind of analytics, or handle the volume of data, we were processing. So we had a globally replicated network of Sybase databases, compute services running on other Sun Solaris machines, and front-ends that ran on those same compute hosts. We treated the desktops as dumb terminals because that's all they were good for.
I am pretty sure I was hired because I knew Unix systems and internet protocols. These are, in turn, core skills I look for today when I hire. I'd stumbled onto Unix (DEC, BSD, Solaris and NeXT, for those counting) while working at my college's computing facilities, then on the open-source Lynx web browser project, and later at a tiny startup.
As I look back, every aspect of my 25+ years working in technology has been an attempt to solve one problem: how can we ingest and process data faster?
Along the way, I've worked on just about every possible approach to scaling and accelerating data and analytics: bigger iron, columnar databases, clustered databases (before RAC there was Digital's RDB), grid computing, messaging systems, pub-sub systems, memory caches, RPC, EJB, and on and on. Can we decompose the system to get more compute? Can we co-host components to reduce network latency? Is a shared backplane enough? How many shards are optimal? Can this data model be optimized? Should we use protobuf to reduce network traffic, or Avro for ease of use and portability? Is the cost-benefit trade-off of shared cloud resources worth it?
This problem, the need to reliably process data at scale with sub-second latency, has brought me to two open-source technologies: Kafka and Flink. Together, they can perform both simple and complex computations on large streams of data: millions of events a second, hundreds of different operations on the data, constrained largely by the amount of hardware you choose to deploy. That, in turn, creates a competitive advantage: your digital operations are faster, more resilient and more scalable than your competitors'. There are other ways to solve the problem, of course, and not everyone has the appetite to refactor their entire data pipeline right now. At times, even when it makes complete sense technically, business conditions don't make it economical. But if you're musing about this on a Friday afternoon, you owe it to yourself to take a look at this combination of technologies.
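To make that concrete, here is a minimal sketch of what the combination looks like in practice: a Flink job that reads events from a Kafka topic and keeps a running count per key. The broker address, topic, consumer group and class name are hypothetical placeholders, and a real pipeline would do far more than count, but the shape is the point: Kafka as the durable, replayable log of events, Flink as the continuous computation over it.

```java
// Minimal sketch only: broker, topic, group id and names below are hypothetical.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka side: a durable, replayable stream of events.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")              // hypothetical broker
                .setTopics("events")                                // hypothetical topic
                .setGroupId("friday-musings-demo")                  // hypothetical consumer group
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Flink side: continuous computation over that stream.
        // Here, a running count of events per key, printed as it updates.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
           .map(value -> Tuple2.of(value, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("event-count");
    }
}
```

Swap the count for enrichment, joins, windowed aggregations or fraud scoring and the structure stays the same; scaling is mostly a matter of adding parallelism and hardware rather than rewriting the job.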