LinkedIn is back at it, sharing with the world its tools for working with many different types of data. Today it’s providing insight into new data ingestion software it has put together to combine its own data with data from sources outside the social-networking company in a single pipeline of sorts.
The software’s name, Gobblin, hints at its ability to gobble up bits, which data scientists can then use as they design new products and analyze website usage.
LinkedIn won’t be keeping Gobblin private. Following in the footsteps of LinkedIn-initiated projects like Azkaban, Kafka, and Voldemort, Gobblin will become free for all to use under an open-source license sometime in the next few weeks, engineering manager Lin Qiao wrote in an engineering blog post that’s scheduled to go online today.
Meanwhile, LinkedIn is also shifting more of its data sets to Gobblin. In reality, even with LinkedIn’s previous work to simplify data engineering, managing all of its data connections was still hard. As at any enterprise, the architecture is a mess.
“At one point, we were running more than 15 types of data ingestion pipelines and we were struggling to keep them all functioning at the same level of data quality, features, and operability,” as Qiao put it in the blog post.
Gobblin simplifies things. It sits inside LinkedIn’s substantial Hadoop cluster, where data transformation can be more economical than in, say, LinkedIn’s expensive Teradata data warehouse. And Gobblin can handle a great deal of data, too, while integrating with widely accepted protocols and database types.
That way, data from a long list of sources can flow through one common pipeline. Think data from Salesforce.com, Twitter, and Facebook, alongside clickstream data, profile views, and social sharing of pages on LinkedIn. And don’t forget about the data sources LinkedIn picks up in the course of making acquisitions. Because other companies also deal with many data sources, Gobblin could well be adopted elsewhere once it’s out in the open.
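To make the idea of one common pipeline concrete, here is a minimal Python sketch. It is not Gobblin's code or API, which LinkedIn has not yet released; the names `Source`, `Pipeline`, `ingest`, `crm_accounts`, and `clickstream`, along with the `user_id` quality check, are all hypothetical. It only illustrates how several unlike sources might pass through the same tagging, quality-check, and publish steps instead of each having its own pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class Source:
    """A hypothetical data source: a name plus a function that yields raw records."""
    name: str
    extract: Callable[[], Iterable[Record]]


@dataclass
class Pipeline:
    """One shared ingestion path that every source goes through."""
    sink: List[Record] = field(default_factory=list)      # accepted records
    rejected: List[Record] = field(default_factory=list)  # records failing the check

    def ingest(self, source: Source) -> int:
        """Run one source through the common steps; return how many records were accepted."""
        accepted = 0
        for raw in source.extract():
            # Step 1: tag every record with where it came from.
            record = {"source": source.name, **raw}
            # Step 2: apply the same (assumed) quality rule to every source.
            if record.get("user_id") is None:
                self.rejected.append(record)
                continue
            # Step 3: publish to the shared sink.
            self.sink.append(record)
            accepted += 1
        return accepted


# Two made-up sources standing in for an external CRM feed and site clickstream.
def crm_accounts() -> List[Record]:
    return [{"user_id": 1, "event": "account_created"}]


def clickstream() -> List[Record]:
    return [
        {"user_id": 2, "event": "profile_view"},
        {"user_id": None, "event": "page_share"},  # fails the quality check
    ]


pipeline = Pipeline()
sources = [Source("crm", crm_accounts), Source("clickstream", clickstream)]
counts = {s.name: pipeline.ingest(s) for s in sources}
print(counts, len(pipeline.sink), len(pipeline.rejected))
```

In this sketch, adding a new source means writing one more extract function. The tagging, quality rule, and publishing step stay in a single place, which is the kind of consistency Qiao says was hard to maintain across more than 15 separate pipelines.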
At least at LinkedIn, the tool has already come in handy.
“We’ve been getting better at gobbling large amounts of different kinds of datasets to feed our data hungry analysts,” Qiao wrote.