As business data is increasingly produced and consumed outside of traditional cloud and data center boundaries, organizations need to rethink how their data is handled across a distributed footprint that includes multiple hybrid and multicloud environments and edge locations.

Business is increasingly becoming decentralized. Data is now produced, processed, and consumed around the world — from remote point-of-sale systems and smartphones to connected vehicles and factory floors. This trend, along with the rise of the Internet of Things (IoT), a steady increase in the computing power of edge devices and better network connectivity, is driving the rise of the edge computing paradigm.

IDC predicts that by 2023 more than 50% of new IT infrastructure will be deployed at the edge. And Gartner has projected that by 2025, 75% of enterprise data will be processed outside of a traditional data center or cloud.

Processing data closer to where it is produced and possibly consumed offers obvious benefits, like saving network costs and reducing latency to deliver a seamless experience. But, if not effectively deployed, edge computing can also create trouble spots, such as unforeseen downtime, an inability to scale quickly enough to meet demand and vulnerabilities that cyberattacks exploit.

Stateful edge applications that capture, store and use data require a new data architecture that accounts for the availability, scalability, latency and security needs of the applications. Organizations operating a geographically distributed infrastructure footprint at the core and the edge need to be aware of several important data design principles, as well as how they can address the issues that are likely to arise.

Map out the data lifecycle

Data-driven organizations need to start by understanding the story of their data: where it’s produced, what needs to be done with it and where it’s eventually consumed. Is the data produced at the edge or in an application running in the cloud? Does the data need to be stored for the long term, or stored and forwarded quickly? Do you need to run heavyweight analytics on the data to train machine learning (ML) models, or run quick real-time processing on it?

Think about data flows and data stores first. Edge locations have smaller computing power than the cloud, and so may not be ideally suited for long-running analytics and AI/ML. At the same time, moving data from multiple edge locations to the cloud for processing results in higher latency and network costs.
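To make this concrete, here is a minimal sketch (in Python, with hypothetical event data and function names) of the kind of pre-aggregation an edge site might do before forwarding results upstream: raw point-of-sale events are rolled up locally, and only compact per-minute summaries are shipped to the cloud, reducing both network cost and the volume of data central analytics has to process.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical raw events produced at an edge location (e.g., a store's
# point-of-sale terminals). In practice these would stream in continuously.
events = [
    {"sku": "A-100", "amount": 12.50, "ts": "2023-05-01T10:00:03+00:00"},
    {"sku": "A-100", "amount": 12.50, "ts": "2023-05-01T10:00:41+00:00"},
    {"sku": "B-200", "amount": 4.99,  "ts": "2023-05-01T10:01:12+00:00"},
]

def summarize_by_minute(raw_events):
    """Roll raw events up into per-SKU, per-minute totals at the edge."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for e in raw_events:
        minute = (datetime.fromisoformat(e["ts"])
                  .astimezone(timezone.utc)
                  .strftime("%Y-%m-%dT%H:%M"))
        key = (e["sku"], minute)
        summary[key]["count"] += 1
        summary[key]["total"] += e["amount"]
    return summary

def forward_to_cloud(summary):
    """Stand-in for shipping compact summaries (not raw events) upstream."""
    for (sku, minute), agg in sorted(summary.items()):
        print(f"{minute} {sku}: {agg['count']} sales, ${agg['total']:.2f}")

forward_to_cloud(summarize_by_minute(events))
```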

Very often, data is replicated between the cloud and edge locations, or between different edge locations, across a handful of common deployment topologies.

Knowing beforehand what needs to be done with collected data allows organizations to deploy optimal data infrastructure as a foundation for stateful applications. It’s also important to choose a database that offers flexible built-in data replication capabilities that facilitate these topologies.
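As one illustration of built-in replication, the sketch below uses PostgreSQL's native logical replication, driven from Python with psycopg2, to publish a table at an edge site and subscribe to it from a cloud database. The hostnames, table and credentials are hypothetical, and other databases (including YugabyteDB) expose their own replication mechanisms; this only shows how little application code such a topology can require.

```python
import psycopg2

# Hypothetical connection strings for an edge database and a cloud database.
EDGE_DSN = "host=edge-db.local dbname=retail user=admin password=secret"
CLOUD_DSN = "host=cloud-db.example.com dbname=retail user=admin password=secret"

def replicate_orders_edge_to_cloud():
    # On the edge database: publish changes to the orders table.
    with psycopg2.connect(EDGE_DSN) as edge:
        edge.autocommit = True
        edge.cursor().execute("CREATE PUBLICATION edge_orders FOR TABLE orders")

    # On the cloud database: subscribe to the edge publication.
    # CREATE SUBSCRIPTION cannot run inside a transaction block, hence autocommit.
    with psycopg2.connect(CLOUD_DSN) as cloud:
        cloud.autocommit = True
        cloud.cursor().execute(
            "CREATE SUBSCRIPTION store_042 "
            "CONNECTION 'host=edge-db.local dbname=retail user=replicator password=secret' "
            "PUBLICATION edge_orders"
        )

if __name__ == "__main__":
    replicate_orders_edge_to_cloud()
```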

Identify application workloads

Hand in hand with the data lifecycle, it is important to look at the landscape of application workloads that produce, process, or consume data. Workloads presented by stateful applications vary in terms of their throughput, responsiveness, scale and data aggregation requirements. For example, a service that analyzes transaction data from all of a retailer’s store locations will require that data be aggregated from the individual stores to the cloud.

These workloads can be classified into seven types.

Account for latency and throughput needs

Low latency and high throughput data handling are often high priorities for applications at the edge. An organization’s data architecture at the edge needs to take into account factors such as how much data needs to be processed, whether it arrives as distinct data points or in bursts of activity and how quickly the data needs to be available to users and applications.

For example, telemetry from connected vehicles, credit card fraud detection, and other real-time applications shouldn’t suffer the latency of being sent back to a cloud for analysis. They require real-time analytics to be applied right at the edge. Databases deployed at the edge need to be able to deliver low latency and/or high data throughput.
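As a minimal sketch of what real-time processing at the edge can look like, the snippet below (Python; the window, threshold and card IDs are made up for illustration) flags a payment card that is swiped suspiciously often within a short window, entirely in local memory and with no round trip to the cloud.

```python
import time
from collections import defaultdict, deque

# Flag a card used more than MAX_SWIPES times within WINDOW_SECONDS,
# evaluated locally at the edge with no cloud round trip.
WINDOW_SECONDS = 60
MAX_SWIPES = 3
recent = defaultdict(deque)  # card_id -> timestamps of recent swipes

def check_swipe(card_id, now=None):
    """Return True if this swipe looks suspicious and should be held for review."""
    now = time.time() if now is None else now
    window = recent[card_id]
    # Drop swipes that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) > MAX_SWIPES

# Example: the fourth swipe inside a minute is flagged locally.
t = 1_700_000_000.0
print([check_swipe("card-42", t + i) for i in range(5)])
# -> [False, False, False, True, True]
```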

Prepare for network partitions

The likelihood of infrastructure outages and network partitions goes up as you go from the cloud to the edge. So when designing an edge architecture, you should consider how ready your applications and databases are to handle network partitions. A network partition is a situation where your infrastructure footprint splits into two or more islands that cannot talk to each other. Partitions can occur in three basic operating modes between the cloud and the edge.

Mostly connected environments allow applications to connect to remote locations to perform an API call most — though not all — of the time. Partitions in this scenario can last from a few seconds to several hours.

When networks are semi-connected, extended partitions can last for hours, requiring applications to be able to identify changes that occur during the partition and synchronize their state with the remote applications once the partition heals.

In a disconnected environment, which is the most common operating mode at the edge, applications run independently. On rare occasions, they may connect to a server, but the vast majority of the time they don’t rely on an external site.

As a rule, applications and databases at the far edge should be ready to operate in disconnected or semi-connected modes. Near-edge applications should be designed for semi-connected or mostly connected operations. The cloud itself operates in mostly connected mode, which is necessary for cloud operations, but is also why a public cloud outage can have such a far-reaching and long-lasting impact.
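A common pattern for the semi-connected and disconnected modes is store-and-forward: write locally first, then drain a local outbox once the partition heals. Below is a minimal sketch of that idea using Python's built-in sqlite3 as the local buffer; send_upstream() is a hypothetical stand-in for whatever cloud API or replication trigger a real deployment would use.

```python
import json
import sqlite3

# Local outbox: writes are buffered on the edge device and drained
# upstream only when connectivity returns.
db = sqlite3.connect("edge_outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def record_event(event):
    """Always write locally first, so the application keeps working offline."""
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    db.commit()

def send_upstream(payload):
    """Hypothetical uplink; returns False while the network is partitioned."""
    return False  # replace with a real API call or replication trigger

def drain_outbox():
    """Try to forward buffered events; delete only what was acknowledged."""
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        if not send_upstream(payload):
            break  # still partitioned; retry on the next attempt
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()

record_event({"sensor": "dock-7", "temp_c": 4.2})
drain_outbox()  # no-op until connectivity returns
```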

Ensure software stack agility

Businesses run on suites of applications, so they should emphasize agility and design for rapid iteration of those applications. Frameworks that enhance developer productivity, such as Spring and GraphQL, support agile design, as do open-source databases like PostgreSQL and YugabyteDB.

Prioritize security

Computing at the edge will inherently expand the attack surface, just as moving operations into the cloud does.

It’s essential that organizations adopt security strategies based on identities rather than old-school perimeter protections. Implementing least-privilege policies, a zero-trust architecture and zero-touch provisioning is critical for an organization’s services and network components.

You also need to give serious consideration to encryption, both in transit and at rest, as well as multi-tenancy support at the database layer and per-tenant encryption. Adding regional locality of data can ensure compliance and allow any required geographic access controls to be applied easily.
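As one illustration of these last points, the sketch below combines encryption in transit (a PostgreSQL connection with sslmode=verify-full) with tenant isolation via row-level security, driven from Python with psycopg2. The host, table, column and policy names are hypothetical, and per-tenant encryption keys and regional data placement would sit on top of this, but it shows how a shared table can be scoped so that one tenant never sees another’s rows.

```python
import psycopg2

# Hypothetical connection string: sslmode=verify-full enforces TLS and checks
# the server certificate against the given CA bundle (encryption in transit).
DSN = ("host=db.example.com dbname=app user=app_user password=secret "
       "sslmode=verify-full sslrootcert=/etc/ssl/certs/ca.pem")

# One-time DDL (run as an admin; table and policy names are illustrative).
# Note that row-level security is not applied to the table owner by default,
# so application queries should run as a separate, non-owner role.
TENANT_POLICY = """
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.tenant_id'));
"""

def orders_for_tenant(tenant_id):
    """Scope the session to one tenant; RLS hides every other tenant's rows."""
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant_id,))
            cur.execute("SELECT id, total FROM orders")
            return cur.fetchall()
```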

The edge is increasingly where computing and transactions happen. Designing data applications that optimize speed, functionality, scalability and security will allow organizations to get the most from that computing environment.

Karthik Ranganathan is founder and CTO of Yugabyte.


