OK, so you’ve launched a Hadoop cluster to store and process lots of different kinds of data. Good luck cleaning up your messiest unstructured data before you can dig up all of those amazing business-changing insights you’ve heard about.
Typically, data cleaning, or data transformation, takes up entirely too much time. Data analysts can sink much of their time into just getting data ready for analysis in, say, business-intelligence software. Small wonder data scientists give high marks to emerging data-cleaning tools from startups like Paxata and Trifacta.
Some early Trifacta customers have saved substantial amounts of time by using the software; tasks that once took six weeks can wrap up in just a day, said Joe Hellerstein, the company’s chief executive and a co-founder, in an interview with VentureBeat.
“That’s a 30-to-1 time savings, right?” Hellerstein said. “You say, OK, well, now things should be 30 times less costly for that fraction of work. People say data transformation is 80 percent of the job. You just saved yourself a whole lot of data people.”
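The arithmetic behind that claim is easy to check. The short Python sketch below is only a back-of-the-envelope illustration, not anything Trifacta publishes: it assumes a five-day working week (so six weeks is 30 working days), takes the 80 percent figure at face value, and uses hypothetical function names. It also shows that when only the data-preparation share of a job gets 30 times faster, the job as a whole speeds up by a smaller factor.

```python
# Assumption: a five-day working week, so six weeks equals 30 working days.
WORKING_DAYS_PER_WEEK = 5


def speedup(before_days: float, after_days: float) -> float:
    """Ratio of the old duration to the new duration."""
    return before_days / after_days


def overall_speedup(fraction: float, factor: float) -> float:
    """Speedup of a whole job when only `fraction` of it gets `factor` times faster."""
    remaining_time = (1.0 - fraction) + fraction / factor
    return 1.0 / remaining_time


# Six weeks of data preparation reduced to a single day.
prep_speedup = speedup(6 * WORKING_DAYS_PER_WEEK, 1)

# Assumption: data transformation really is 80 percent of the job.
total_speedup = overall_speedup(0.80, prep_speedup)

print(f"Data preparation speedup: {prep_speedup:.0f}x")
print(f"Whole-job speedup: {total_speedup:.2f}x")
```

Under those assumptions the preparation step is 30 times faster, while the job overall is a little more than four times faster, because the remaining 20 percent of the work takes as long as before.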
But time savings isn’t the coolest upside. Instead, it’s a matter of how much more analytics a company can accomplish when analysts aren’t toiling away, making sure everything looks just right. More analytics can translate into more frequent insights that lead to business and product tweaks. Company strategy can then keep evolving in response to what the data shows.
And that’s why products like Trifacta belong to a special class of tools that can help companies grow their revenue. We’ll be talking about such tools at our DataBeat 2014 conference in San Francisco in two weeks. And Hellerstein will speak with Metamarkets chief executive Mike Driscoll, Datameer chief executive Stefan Groschupf, and other luminaries about technologies worth looking out for.
And Trifacta is certainly worth a look.
The software takes a little bite of a large file in Hadoop and displays the sample in familiar spreadsheet format. From there, a person can click around with a mouse and highlight the important bits of text within a cell and then direct the software to pull that type of data into a new column right on screen. Each column gets a basic visualization of the data, like a histogram, to put data in some context. Users can then trim down the data set to fit a more specific focus.
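To make that workflow concrete, here is a minimal Python sketch of the same four steps: sample, extract into a new column, summarize, and trim. It is not Trifacta's code or API. The function names, the regular expression, and the log lines are all made up for illustration, and a plain list stands in for a large file in Hadoop.

```python
import re
from collections import Counter
from itertools import islice


def take_sample(lines, n):
    """Return the first n lines without reading the rest of the input."""
    return list(islice(lines, n))


def extract_column(rows, pattern):
    """Pull the first regex group from each row into a new column (None if no match)."""
    compiled = re.compile(pattern)
    column = []
    for row in rows:
        match = compiled.search(row)
        column.append(match.group(1) if match else None)
    return column


def histogram(values):
    """Count how often each non-missing value appears."""
    return Counter(v for v in values if v is not None)


# Made-up log lines standing in for a large unstructured file.
raw_log = [
    "2014-05-01 12:00:01 status=200 path=/home",
    "2014-05-01 12:00:02 status=404 path=/missing",
    "2014-05-01 12:00:03 status=200 path=/about",
    "2014-05-01 12:00:04 malformed line",
    "2014-05-01 12:00:05 status=500 path=/home",
]

sample = take_sample(iter(raw_log), 4)            # a small bite of the file
status = extract_column(sample, r"status=(\d+)")  # highlighted text becomes a column
counts = histogram(status)                        # per-column summary
errors_only = [row for row, s in zip(sample, status)
               if s is not None and s != "200"]   # trim to a narrower focus

print(counts)
print(errors_only)
```

In the product, the highlighting and column creation happen with mouse clicks rather than a hand-written pattern; the sketch only shows the kind of transformation those clicks stand for.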
The software is “helping to greatly streamline the process of preparing data,” Ravi Hubbly, the senior principal architect at Lockheed Martin, has previously said. “… My team has been able to both shorten the data lifecycle and get a better view of the data.”
The benefits don’t end there. If more people can clean up data with a few clicks on their own in Hadoop, more people can eventually analyze data without pestering IT. Suddenly, more people can make decisions based on data instead of relying on their instincts.
“That’s the goal in some customer conversations,” Hellerstein said.
And really, that’s what data-driven businesses are supposed to be all about. If Trifacta can make that happen for companies of all sizes, the technology could become a must-have for any organization adopting Hadoop.