Don’t take data for granted

We’ve long had a running joke that the world was running out of data. It’s certainly the type of statement that gets a rise. But it could be argued that, after the emergence of big data over a decade ago, data eventually subsided in the headlines in favor of artificial intelligence (AI), cloud and microservices. With the cloud making it almost trivial to pile up terabytes of object storage and turn compute cores on and off on short notice, it’s tempting to wonder if we’re starting to take data for granted.

Data matters more than ever. It’s taken for granted that the so-called Three V’s of big data are no longer exceptional. Big data is so 2014 -- in the 2020s, we just term it “data.” And data is coming from more sources and places. That's led to a chicken-and-egg scenario as distributed databases grow more commonplace. The cloud enables it, and the use cases for global deployment demand it. And, by the way, did we mention the edge? In many cases, that data is not going anywhere, and processing must come to it.

There is no silver bullet for extending data processing to the edge. Moving to the edge means pushing down a lot of intelligence because there won't be enough bandwidth to bring in the torrents of data, much of it low density (e.g., instrument readings) where the value only comes from aggregation. And at the back end, or shall we say the hub (in a distributed environment, multiple hubs), it will prompt the need to converge real-time data (e.g., streaming, data in motion) with historical data (e.g., data at rest).
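To make the point about low-density data concrete, here is a minimal sketch of the kind of aggregation an edge node might perform before sending anything upstream. It is only an illustration under our own assumptions: the function name `summarize_readings`, the 60-second window and the sample readings are all hypothetical, not drawn from any particular product.

```python
from collections import defaultdict
from statistics import mean


def summarize_readings(readings, window_seconds=60):
    """Collapse raw (timestamp, value) readings into one summary per time window."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Align each reading to the start of the window it falls in.
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)

    # Only these compact summaries would need to travel to the hub.
    return [
        {
            "window_start": start,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Hypothetical instrument readings: (seconds since start, temperature).
raw = [(0, 20.0), (15, 21.0), (45, 22.0), (60, 30.0), (90, 32.0)]
summaries = summarize_readings(raw)
```

In this toy case five raw readings shrink to two summary records, which is the bandwidth trade the edge is making: the individual readings stay put, and only the aggregate moves.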

Eliminating data complexity

That’s been a dream since the early days of what we used to call big data, when the only practical solution was the Lambda architecture – separating the real-time and batch tiers. As a result, streaming typically required a separate platform, whose results would then be ingested into the database or data lake. That was a complex architecture requiring multiple tools, lots of data movement and then additional steps to merge results.
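For readers who never had to live with it, the shape of the Lambda pattern can be sketched in a few lines. This is a deliberately simplified illustration, assuming a per-key event count as the metric; `build_batch_view`, `SpeedLayer` and `query` are hypothetical names of our own, and real deployments used separate batch and streaming platforms rather than in-memory counters.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch tier: periodically recompute counts over all data at rest."""
    return Counter(event["key"] for event in historical_events)


class SpeedLayer:
    """Real-time tier: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def ingest(self, event):
        self.view[event["key"]] += 1


def query(batch_view, speed_view, key):
    """Serving step: merge the two tiers to answer a single question."""
    return batch_view[key] + speed_view[key]


# Hypothetical events: two already at rest, two still in motion.
batch_view = build_batch_view([{"key": "sensor-a"}, {"key": "sensor-b"}])
speed = SpeedLayer()
speed.ingest({"key": "sensor-a"})
speed.ingest({"key": "sensor-a"})

total_a = query(batch_view, speed.view, "sensor-a")
total_b = query(batch_view, speed.view, "sensor-b")
```

Even in this toy form, the same counting logic is written twice and every answer needs a merge step, which is where much of the complexity came from.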

Thanks to the emergence of cloud-native architecture, where we containerize, deploy microservices, and separate the data and compute tiers, we can now bring all that together and lose the complexity. Dedicate some nodes as Kafka sinks, generate change data capture feeds on others, and persist data on still others, and it’s all under a single umbrella on the same physical or virtual cluster.

And so as data goes global, we have to worry about governing it. Increasingly, there are mandates for keeping data inside the country of origin, and depending on the jurisdiction, varying rights of privacy and requirements for data retention.

Indirectly, restrictions on data movement across national boundaries are prompting the question of hybrid cloud. There are other rationales for data gravity, especially with established back office systems managing financial and customer records, where the interdependencies between legacy applications may render it impractical to move data into a public cloud. Those well-entrenched ERP systems and the like represent the final frontier for cloud adoption.

Data is living on the edge

So, on-premises data centers aren’t going away any time soon, but increasingly, as is HPE’s motto, the cloud may come to you. The draw is the operational simplicity and flexibility of having a common control plane and on-demand pricing model associated with public clouds. That’s why, ushering in the new decade, we forecast that the 2020s would become the era of the Hybrid Default. It’s why HPE has seen its on-demand hybrid/private cloud business more than double year over year.

Demand for the cloud is not a zero-sum game; growing demand for hybrid cloud or private cloud is not coming at the expense of public cloud. And that’s where things get crazy, as cloud providers have built an increasingly bewildering array of choices.

When we last counted, AWS had well over 250 services, and looking at the data and analytics lane, there are 16 databases and 30 machine learning (ML) services. The burden is on the customer to put the pieces together, such as when they use a service like Redshift or BigQuery and want to run data pipelines for ingesting and transforming data in motion, visualization for providing ad hoc analytics, and of course, advanced machine learning.

Help is on the way. For instance, you can now in some cases run ML models inside Redshift or BigQuery, and you can reach out to other AWS or Google databases for federated query. Azure, for its part, has strived for more of an end-to-end service with Synapse, where the pieces are built in or activated with a single click. But these are just opening shots – cloud providers, hopefully with an ecosystem of partners, need to put more of the pieces together.

The magic of data meshes

In all this, we’ve so far skipped over what's been one of the liveliest topics over the past year: the discussion around data meshes. They arose as a response to the shortcomings of data lakes – namely that it's all too easy for data to get lost or buried, and that the teams who consume the data should take active ownership over it. Against that are concerns that such practices may not scale, or may erect yet more data silos.

And so with all this as backdrop, we’re jazzed to start setting up residency here at VentureBeat in the Data Pipeline, along with fellow partners in crime Andrew Brust and Hyoun Park. Hang on, we’re in for some ride.
