The enterprise guide to experimenting with data streaming

Streaming data into your enterprise analytics systems in real time rather than loading it in batches can increase your ability to make time-sensitive decisions. Over the last few years, more and more enterprises and industries have started experimenting with data streaming, including the healthcare and financial services sectors. The global streaming analytics market size is expected to grow from $12.5 billion in 2020 to $38.6 billion by 2025, thanks to adoption in industries like manufacturing, government, energy and utilities, media and more.

A company that is looking to explore data streaming capabilities does not need to go "all-in." In fact, it's best not to. What's becoming clear is that you can reap the benefits of data streaming without building out a fully mature solution. Limited projects and proof-of-concept work with data streaming can prove incredibly valuable for your organization. Data streaming concepts are highly transferable: learning one platform makes it easier to adopt other tools and capabilities. So the key is to start dabbling with data streaming early and often so that your engineering teams can start developing the skills needed for resilient, distributed system design and development.

Getting started

Adopting a data streaming architecture will help you handle the growing volume and scale of information that digitization makes available to organizations. Getting started requires a shift in data strategy and implementation.

Data strategy for many businesses, such as brick-and-mortar stores, manufacturers, and logistics firms, is grounded in core processes oriented to weekly or monthly batch calculations. Often, supporting applications built on modern, cloud-based technology stacks are still tailored to process data in a monthly ETL load -- an inherent limitation on real-time enterprise insights.

When you begin prototyping for data streaming, you will quickly uncover technical limitations and hidden requirements that will impact your ability to scale your model. So it's important to make a deliberate investment in this kind of prototyping so that you can assess any roadblocks to a long-term strategy while creating tangible short-term opportunities to pilot streaming tactics and technologies.

Embracing the incremental failures of prototyping is an effective path to a scalable data streaming architecture. Your best prototypes can scale into industry-leading competitive advantages. Failed prototypes, on the other hand, can be shut down after minimal investment and maximum learning.

For example, my team built one proof of concept for a client to collect and correlate WiFi, authentication gateway, and endpoint protection platform (EPP) logs. We shut it down because no data science models were available to correlate events across these sources, but we learned that Syslog, Kafka, Confluent Kafka Connect, and Flink can solve similar integration challenges in the future.

Building a POC (proof of concept) or MVP (minimum viable product) doubles as a risk management strategy: it establishes technical feasibility and product viability with minimal investment.

Let's explore ways a data streaming prototype can add value.

Validate the streaming model

Start with a small team and a targeted goal of creating a POC solution to solve a particular business and technical problem. Then, evaluate the results to decide how best to scale the POC.

Teams should approach prototyping with an exploratory mindset rather than executing a preconceived outcome on a small scale. Embrace failure and learnings when validating your streaming model with prototypes.

If the concept is successful, enhance and scale up.
If the concept is a failure, start over using lessons learned to inform the next prototype.
If the concept is not a complete success or failure, keep iterating.

POC, MVP, pilot -- whatever name it goes by, prototyping will stop teams from creating products that don't (or can't) meet the business's needs. You will learn a lot and mitigate a lot of risk by taking this “build, measure, learn” approach to validating your data streaming model before you try to scale it.

Start by choosing a data streaming platform

Apache Kafka is a great place to start, as it is the most widely adopted platform. Its cloud counterparts, Microsoft Azure Event Hubs and AWS Kinesis, either offer Kafka compatibility at the protocol level or operate using very similar concepts. All three are products focused on data ingestion. Google Dataflow and IBM Streaming Analytics are also popular options that act as a superset -- bigger platforms with more capabilities. Since the POC has few risks related to scalability and data retention, you can even deploy a small Kafka cluster on premises. Several Kafka stack distributions, such as Confluent, Bitnami, and Cloudera, provide an easy way to launch Kafka and its dependencies on container systems, virtual machines, or even spare desktop PCs.

The team will want to tap into relational data and push relational data records to a low-latency data stream on Kafka. They will explore Change Data Capture (CDC) and find that it works similarly for both a Microsoft SQL Server-based warehouse and inventory system and a PostgreSQL-based e-commerce site. Both of these data sources are easily streamed into a Kafka feed category (or "topic") as events. A modern single-page application (SPA) where customers manage their personal profile and preferences can also be extended to emit events to another topic whenever relevant customer information is updated.
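
To make the idea of streaming relational changes as events more concrete, here is a minimal sketch of what a change event might look like on its way to a topic. It uses only the Python standard library and an in-memory dictionary in place of a Kafka client. The envelope fields (op, before, after), the topic naming scheme, and the sample tables are assumptions made for illustration, not the format of any particular CDC tool.

```python
import json
from datetime import datetime, timezone


def topic_for(source: str, table: str) -> str:
    """Hypothetical naming scheme: one topic per source table."""
    return f"cdc.{source}.{table}"


def change_event(source, table, op, before, after, ts=None):
    """Wrap a row-level change in a small envelope (illustrative format only)."""
    if op not in {"insert", "update", "delete"}:
        raise ValueError(f"unknown operation: {op}")
    return {
        "source": source,
        "table": table,
        "op": op,
        "before": before,  # row state before the change (None for inserts)
        "after": after,    # row state after the change (None for deletes)
        "ts": (ts or datetime.now(timezone.utc)).isoformat(),
    }


def publish(stream, event):
    """Stand-in for a producer: append the serialized event to an in-memory topic."""
    topic = topic_for(event["source"], event["table"])
    stream.setdefault(topic, []).append(json.dumps(event, sort_keys=True))
    return topic


stream = {}
# A stock level change in the SQL Server-based inventory system.
publish(stream, change_event("inventory", "stock", "update",
                             {"sku": "A-100", "qty": 12},
                             {"sku": "A-100", "qty": 9}))
# A new order in the PostgreSQL-based e-commerce site.
publish(stream, change_event("ecommerce", "orders", "insert",
                             None, {"order_id": 501, "total": 129.90}))
```

Both sources end up as the same kind of event on differently named topics, which is what lets downstream consumers treat them uniformly.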

After this analysis, the team will explore how they can aggregate and analyze streaming data. The data streaming and processing landscape (and big data in general) may seem daunting at first. There are many well-known players in the space, such as Flink and Spark for stream processing, MapReduce for batch processing, and Cassandra, HBase, and MariaDB for storing large volumes of data in a read-optimized columnar format. All of these technologies work best when crunching specialized, massive data loads, and the POC does not operate at such a scale. Therefore, your prototype team will want to choose a data ingestion and aggregation platform with a user-friendly interface and SQL-like data retrieval support; it will likely be Confluent Kafka Connect, Lenses.io, Striim, or a similar commercial platform.

All of these data sources, when combined, can provide timely insights via custom reports and real-time alerts. For example, if a B2B account has updated its credit limit in a self-service single-page app, this event, pushed to a data stream, is available to an e-commerce site right away. Analytics on the products in highest demand, the busiest shopping hours, and even alerts on potentially fraudulent activity (unusually high order amounts) can be produced by aggregating and processing windowed data streams from inventory and e-commerce.
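
The windowed aggregation described here can be sketched in a few lines of plain Python. This is not how a stream processor implements it; it only illustrates the idea of grouping events into fixed (tumbling) time windows and flagging unusually large orders. The one-hour window, the "five times the running average" alert rule, and the sample orders are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600    # assumed: one-hour tumbling windows
ALERT_MULTIPLIER = 5.0   # assumed: alert when an order exceeds 5x the running average


def window_start(ts: int) -> int:
    """Align an epoch timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECONDS)


def process(orders):
    """Aggregate orders per window and flag unusually large ones.

    Each order is a tuple (epoch_seconds, sku, amount).
    Returns (per-window statistics, list of alerts).
    """
    windows = defaultdict(
        lambda: {"orders": 0, "revenue": 0.0, "by_sku": defaultdict(int)}
    )
    alerts = []
    seen, total = 0, 0.0
    for ts, sku, amount in orders:
        # Compare against the average of earlier orders, before counting this one.
        if seen and amount > ALERT_MULTIPLIER * (total / seen):
            alerts.append((ts, sku, amount))
        seen += 1
        total += amount
        w = windows[window_start(ts)]
        w["orders"] += 1
        w["revenue"] += amount
        w["by_sku"][sku] += 1
    return windows, alerts


orders = [
    (0, "A-100", 40.0),
    (600, "B-200", 60.0),
    (1800, "A-100", 50.0),
    (3700, "A-100", 900.0),  # far above the running average of 50.0
    (4000, "C-300", 45.0),
]
windows, alerts = process(orders)
busiest = max(windows, key=lambda start: windows[start]["orders"])
top_sku = max(windows[busiest]["by_sku"], key=windows[busiest]["by_sku"].get)
```

Here busiest is the start of the window with the most orders and top_sku is the best seller within it; a real deployment would express the same logic in the SQL-like layer of whichever platform the team chose.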

Learnings

Even though the POC does not introduce complex, scalable data processing platforms such as Spark or Hadoop, you will be getting new reports and alerts in near real time, meaning that the time to insight is reduced from weeks to minutes or even seconds. The POC will allow you to consider what other processes would benefit from real-time reporting and analytics.

Meanwhile, the POC engineering team will learn important lessons about data model design. Poor design will lead to data duplication, which could become expensive and challenging when a POC is scaled to production levels, so it's important to use these learnings when moving on to future iterations.

IT and operations will also have learned that traditional concepts such as "database rollback" are not present in the streaming world. Monitoring is a must for a data streaming platform, as are support personnel with the appropriate skills. You can reduce the cost and complexity of operational support if you choose AWS Kinesis or Azure Event Hubs instead of running Apache Kafka yourself, since managed cloud services take on much of the maintenance.
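
One common way to cope with the absence of rollback is to append a compensating event instead of removing the faulty one. The sketch below illustrates that general pattern with an in-memory, append-only log. The class and field names are hypothetical, and nothing here is specific to Kafka, Kinesis, or Event Hubs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    offset: int
    account: str
    delta: float   # change applied to the account's credit limit
    reason: str


class Log:
    """Append-only in-memory log; deliberately offers no update or delete."""

    def __init__(self):
        self._events = []

    def append(self, account, delta, reason):
        event = Event(len(self._events), account, delta, reason)
        self._events.append(event)
        return event

    def compensate(self, offset):
        """'Undo' an earlier event by appending its inverse."""
        original = self._events[offset]
        return self.append(original.account, -original.delta,
                           f"compensates offset {offset}")

    def balance(self, account):
        """Current state is derived by replaying every event for the account."""
        return sum(e.delta for e in self._events if e.account == account)

    def __len__(self):
        return len(self._events)


log = Log()
log.append("acme", 10_000.0, "initial credit limit")
mistake = log.append("acme", 50_000.0, "limit raised in error")
log.compensate(mistake.offset)
```

The erroneous event stays in the log, so anything that already consumed it can also consume the correction; the history grows to three events while the derived credit limit returns to its original value.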

Benefits

Data streaming provides a natural design for decoupling integrated systems. As data flows, it becomes available to all of its stakeholders independently, enabling services written for isolated use cases like data persistence, aggregate functions, anomaly detection, and many others. All of these are independent in terms of development and deployment. The benefit of decoupling integrated systems is that each of these pieces can be introduced incrementally. This also allows you to scope your POC and focus independently on the pieces that matter most to your organization.
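
The decoupling described above comes from each consumer reading the same stream at its own position. The following sketch shows that idea with an in-memory list standing in for a topic. The Topic and Consumer classes, the handlers, and the order threshold are hypothetical names and values used only for illustration.

```python
class Topic:
    """In-memory stand-in for a topic: an ordered, append-only list of events."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)


class Consumer:
    """Reads a topic at its own pace; each consumer tracks its own offset."""

    def __init__(self, topic, handler):
        self.topic = topic
        self.handler = handler
        self.offset = 0

    def poll(self, max_events=None):
        end = len(self.topic.events)
        if max_events is not None:
            end = min(end, self.offset + max_events)
        for event in self.topic.events[self.offset:end]:
            self.handler(event)
        self.offset = end


order_topic = Topic()
stored, totals, anomalies = [], {"revenue": 0.0}, []


def persist(event):
    stored.append(event)


def aggregate(event):
    totals["revenue"] += event["total"]


def detect(event):
    if event["total"] > 1000:  # assumed anomaly threshold
        anomalies.append(event["order_id"])


persistence = Consumer(order_topic, persist)
aggregation = Consumer(order_topic, aggregate)

for order_id, total in [(1, 120.0), (2, 2500.0), (3, 80.0)]:
    order_topic.publish({"order_id": order_id, "total": total})

persistence.poll()
aggregation.poll(max_events=2)  # a slower consumer does not hold the others back

# A consumer introduced later can still read the topic from the beginning.
anomaly_detection = Consumer(order_topic, detect)
anomaly_detection.poll()
```

Because the anomaly detector was added after the events were published and still saw all of them, the sketch mirrors the point that each piece can be introduced incrementally without touching the others.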

Once you execute on a POC, there is a decision point: continue iterating, shut it down, or restart. Questions related to data modeling, integrations between systems, and potential AI/ML opportunities should surface at this point, giving your organization better insight into how to staff your development and operations teams for the future of streaming.

Lastly, increased awareness of distributed systems will enable your technical teams to improve current back-office systems and map a modernization path for your organization.

Bottom line: Your organization has a lot to gain and little to lose by piloting data streaming.

Aurimas Adomavicius is President of DevBridge, a tech consultancy specializing in designing and implementing custom software products for companies across many industries.