Why unstructured data is the future of data management

Enterprises are increasingly relying on unstructured data for regulatory, analytic, and decision-making purposes. Unstructured data will power analytics, machine learning, and business intelligence.

According to the latest figures from research firm IDC, the volume of unstructured data is set to grow from 33 zettabytes in 2018 to 175 zettabytes, or 175 billion terabytes, by 2025. Organizations need some form of data management to make sure the right data is available at the right time. Krishna Subramanian, president and COO of Komprise, a data management software provider, sat down with VentureBeat to discuss the business benefits and challenges associated with unstructured data.

VentureBeat: Does the average enterprise IT organization know how much unstructured data it has and how fast it is growing?

Krishna Subramanian: Intuitively they know a lot is unstructured and it is growing in double digits, but they don’t know exactly how much they have and how fast it’s growing. We know that 80-90% of the world’s data is unstructured.

VentureBeat: What’s the problem with this data growth? There is now endless cloud storage after all, right?

Subramanian: The big issue is the cost: over two-thirds of the cost of data is not in the storage, but in its active management. For every piece of data, companies typically keep a few backup copies and a replication copy for disaster recovery. If you think your data is growing at 30%, it’s more like 90-100% when you factor in all the copies of the data. It’s also wise to consider that cloud storage is not necessarily cheaper. For instance, AWS itself today offers over 16 tiers of unstructured file and object storage. If you don’t put your data in the right place and control egress costs, you may end up paying more than if you were storing it on premises because every time you even read the data you’ll be charged.

The key here is that over 80% of data is not actively accessed and is cold. This cold data can be stored on cheaper storage and does not require the same level of backup and replication. Therefore, you need to manage hot data that is actively used differently from cold data that is rarely used. As just one example, Pfizer researchers generate between 8TB and 10TB a day, and they were running out of datacenter space. They were able to use a data management product to identify the cold data and eliminate it from their expensive storage, backups, and replication by moving it to lower-cost, resilient storage in the cloud and taking it out of active management. The company wound up cutting 75% of its data storage and backup costs, all without users noticing any change.

What’s hard about data growth is that a lot of organizations don’t like to delete data. You never know when you might need it. And when you do, you want to be able to find it easily. And users and applications should not have to change their behavior when you move data around. In the past, with archiving to tape, that wasn’t possible, but now it is with cloud storage and with data management software.
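To make the hot/cold distinction concrete, here is a minimal Python sketch of the idea, not how Komprise or any other product works. It assumes a hypothetical `FileRecord` type, a one-year cutoff for "cold," and a per-tier copy count (for example, one primary plus three backups and one replica for hot data, and a single copy for cold data); all of these are illustrative choices, not figures from the interview.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical metadata record for one file or object."""
    path: str
    size_bytes: int
    last_accessed: datetime


def split_hot_cold(files, now, cold_after=timedelta(days=365)):
    """Return (hot, cold) lists based on time since last access.

    A file counts as cold when it has not been accessed for at least
    `cold_after` (an assumed threshold, one year by default).
    """
    hot, cold = [], []
    for f in files:
        if now - f.last_accessed >= cold_after:
            cold.append(f)
        else:
            hot.append(f)
    return hot, cold


def managed_bytes(files, copies):
    """Total bytes kept when every file is stored `copies` times."""
    return sum(f.size_bytes for f in files) * copies
```

Comparing `managed_bytes(all_files, 5)` with `managed_bytes(hot, 5) + managed_bytes(cold, 1)` gives a rough sense of how much capacity leaves active management once cold data no longer gets the full set of backup and replication copies.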

VentureBeat: Why is it important to be strategic about how you manage and store it? Isn’t it just about making sure you can find it for the BI team?

Subramanian: Today, data is a valuable corporate asset. You’ve got to be strategic with it because it’s not just for your BI teams, but for the R&D and customer success teams. They need historical data to build new products or to improve the ones they already have. This is super relevant in manufacturing, such as in the semiconductor chip industry, but also in other industries that are so important to our economy, such as pharmaceuticals. COVID researchers depended upon access to SARS data when developing vaccines and treatments. Data often becomes valuable again later, and what if you don’t know what you have or you can’t find it? We’ve had customers in the media and entertainment business, and in the past when they wanted to find an old show, they’d need access to a tape archive. Then, they needed an asset tag to locate the tape. That can be very difficult, and it's why archiving is not popular. Live archive solutions that are available today make archived data instantly accessible and transparently tier data so users can easily locate files and access them anytime.

VentureBeat: How will tools and practices evolve to help IT departments better leverage this unstructured data for the organization and its business users? What’s needed, and where are the gaps?

Subramanian: You need a storage-independent way to look at data across all of your storage technologies, whether in your datacenter or in the cloud, to not only move data to the right place, but also to help businesses extract value from the data. Gartner calls this category “data management software,” and it includes companies like Cirrus Data for block data and Komprise for file and object data. The ultimate goal is to help business users leverage historical data, and this requires data search, data analytics, and data intelligence. These are hot areas where a lot of innovation is happening. The cloud providers offer several data warehousing and data analytics solutions that can be leveraged in conjunction with data management software, such as Amazon Redshift and Amazon QuickSight. For instance, we use distributed Elasticsearch in our software to rapidly search billions of files and find just the data relevant to a user, such as all the data for a particular project, and export this data to Redshift for further analysis. Why have all this data if you can’t detect significant trends or problems, such as anomalies or ransomware? I believe we need more predictive analytics around data.

VentureBeat: Will the data management challenge spur a whole new sector of startups in the coming year or two?

Subramanian: Definitely. Analysts are beginning to recognize data management software as a new category. Beyond the use cases above, consider all the new types of data analytics companies getting funded, such as Snowflake and Databricks, the company founded by the creators of Apache Spark. So many companies are coming to light right now to solve data management and data analytics issues at scale.

VentureBeat: How are the big cloud providers responding to problems and opportunities with unstructured data growth?

Subramanian: They are all offering more services to store data at different performance and price points. Amazon Elastic File System (Amazon EFS) and Azure Files were born to address the need for file storage in the cloud. The major CSPs are investing in partners across many areas of unstructured data management, including migration and analytics.
