Handling big data shouldn’t be difficult, according to Alpine Data Labs. In fact, the company wants to make dealing with big data drag-and-drop easy.
The startup relies on the language of the mouse and wants to help everyone inside an enterprise harness the power of advanced analytics and data science.
We recently caught up with Joel Horwitz, director of products and marketing at Alpine, to learn more about the company, its product, and the latest developments in the data science industry. Here’s the edited transcript of our discussion:
VentureBeat: What is Alpine Data Labs, and why does it matter?
Joel Horwitz: Alpine is the world’s first advanced analytics platform for [data storage and processing software] Hadoop. It matters because we were built by data scientists for data scientists originally, but now we’re starting to expand into the enterprise.
It’s pretty exciting, because when you look at how data science and analytics have been done historically, it was a very siloed exercise. By making the user interface very visual and simple to use, we lower the barrier for the common folk to get in and do some fairly sophisticated analysis.
Don’t get me wrong, it’s not the most complete solution on the market — certainly there are others that have been around for years. But our offering meets the needs of most of the business analysts we talk to today.
VentureBeat: What are some of those needs? What can you do with the platform?
Horwitz: It’s a completely horizontal platform. We have customers in media. Havas Media, they have a library of algorithms that they’re managing for their clients. They’re able to do that with [fewer] folks. … They’re able to go in and make changes on the fly because they have this visual workflow. They’re not diving into the code and searching for missing semicolons. Instead, they’re focusing on, “How do I optimize this thing to figure out how to get the best value creation for my clients?”
That’s one such customer; we have a number of others. In telecom, we have Ericsson and BlackBerry; in the financial industry, we have Morgan Stanley and Barclays; in health care, we have Kaiser Permanente. I could go on and on, but it’s a fairly horizontal play for us. We see it covering every vertical.
VentureBeat: So you were just showing me the interface, which looks pretty simple, but there’s some really complex engineering going on behind the scenes. How did you build this thing?
Horwitz: We’ve got some of the brightest minds in the industry. I think [product VP] Steven Hillion has done an amazing job of assembling some of the smartest machine learning engineers, data scientists, and business people. Also Joe Otto, our CEO, his DNA is in Greenplum, [a big data analytics company that became part of Pivotal], so he’s been doing this for a long time. These guys have the technical chops, for sure, but it’s also this ability to fully understand what the business needs are. I think that’s what you see in the product. It looks super simple from the outside. It looks like everything you’ve come to look for in an analytics product, but it was designed that way, so that it looks familiar.
But under the hood, we’re doing a lot of crazy stuff. We’re basically making things super simple on Hadoop. Most other platforms require you to move data, so if you want to use data in Hadoop, there’s a whole [extract, transform, load] process that needs to happen to migrate your data in there. For us, we don’t really care where your data is; we’ll connect to it and then we’ll let you do the analysis where it lives. That’s really powerful when you start talking about truly large data sets, because the cost of transferring that data goes up exponentially.
In addition to that, I’m excited because we just announced our adoption of Spark. Spark is really a great technology for data science, because it speeds up the iterative process that usually takes a long time on Hadoop, because Hadoop is natively a batch-based process. So Spark has basically allowed us to speed up our algorithms by a factor of 100 in some cases. Today we were demonstrating 50 million rows of retail data in 50 seconds. I was never able to demonstrate that size data set without Spark.
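As a rough illustration of why iterative algorithms benefit from keeping data in memory, here is a minimal sketch in plain Python. It uses neither Hadoop nor Spark and is not Alpine's implementation. The names `load_dataset` and `fit_slope` are hypothetical, and the "expensive read" is simulated with a counter. A batch-style loop reloads the data on every pass, while a cached loop loads it once and reuses it.

```python
# Toy illustration only: plain Python, no Hadoop or Spark involved.

def load_dataset(counter):
    """Stand-in for an expensive read from disk; counts how often it is called."""
    counter["loads"] += 1
    return [(x, 3.0 * x) for x in range(1, 6)]  # points on the line y = 3x


def fit_slope(iterations, cache):
    """Fit y = w * x by gradient descent; return (w, number of dataset loads)."""
    counter = {"loads": 0}
    cached = load_dataset(counter) if cache else None
    w = 0.0
    for _ in range(iterations):
        # Batch style reloads the data every pass; cached style reuses it.
        data = cached if cache else load_dataset(counter)
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.01 * grad
    return w, counter["loads"]


batch_w, batch_loads = fit_slope(50, cache=False)
cached_w, cached_loads = fit_slope(50, cache=True)
print(batch_loads, cached_loads)
```

Both runs reach the same fitted slope. The only difference is how many times the data is read, which is the cost that grows with every extra iteration.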
VentureBeat: Why is that important? Why is that speed so essential?
Horwitz: As you watch the trend of data and how it’s growing, most of the data is actually growing outside of the enterprise. It’s coming from the new mobile applications, it’s coming from Web 2.0 and the cloud applications. As people are moving their products to the cloud, you have all of this data exhaust coming from these applications. They’re all basically semi-structured or unstructured data streams, which is what Hadoop is great for.
I read a report recently that said only something like 12 percent of the data in Hadoop is actually being used and leveraged. So what is that other 88 percent of data doing? It’s just lying on the floor because it’s really hard to process. Spark really lowers the barrier to leveraging the rest of the data, and [does] it quickly. Every single operation that you do on data is compounding. So if I have a linear regression or logistic regression that takes, say, 10 minutes to run, that may not seem like a lot. But when you combine that with an aggregation, a filter, a join, a sort, and all the other analytics operations, it really adds up.
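To make the compounding point concrete, here is a minimal sketch with hypothetical step timings. Only the 10-minute regression figure comes from the interview; the other durations and the function names are made-up placeholders. The 100x speedup is the "in some cases" upper figure quoted above, not a typical result.

```python
# Hypothetical per-step runtimes in minutes. Only the 10-minute regression
# figure comes from the interview; the rest are illustrative placeholders.
STEP_MINUTES = {
    "filter": 2,
    "join": 6,
    "aggregation": 4,
    "sort": 3,
    "logistic regression": 10,
}


def pipeline_minutes(steps, speedup=1.0):
    """Total time for one end-to-end run of the workflow."""
    return sum(steps.values()) / speedup


def runs_per_day(steps, speedup=1.0, workday_minutes=480):
    """How many full iterations of the workflow fit in one working day."""
    return int(workday_minutes // pipeline_minutes(steps, speedup))


print(pipeline_minutes(STEP_MINUTES))            # minutes for one full run
print(runs_per_day(STEP_MINUTES))                # runs per day at baseline
print(runs_per_day(STEP_MINUTES, speedup=100))   # runs per day at 100x
```

Under these assumed numbers, one run takes 25 minutes, so an analyst gets 19 complete iterations in an eight-hour day. The regression alone is less than half of that total.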
VentureBeat: So you’ve talked a bit about your traction so far. I’d love to hear more about what’s coming up.
Horwitz: Where I see us going next is really expanding our ecosystem. Hadoop is an ecosystem, and there are so many great technologies. We support Chorus, which is a completely open source platform, so we’re looking to find partners that are going to work with us to expand Chorus’ reach and build an ecosystem around our platform.
VentureBeat: Now that we’ve heard a lot about Alpine, let’s talk about data science on a broader level. Where is the industry going, and what future opportunities will that enable?
Horwitz: I think that, in the early days, Hadoop came on the scene because it was a very low-cost place to throw all of this data. I think we’ve reached the stage now where people have a ton of data lying in Hadoop, and some of the niche players got in early and are doing data science on Hadoop. They’re using things like Mahout, which is an open source Apache project, or using Python, or basically trying to hack their way to something.
Early on, you had people with a problem, and then they found a solution. But what we’re finding is [data science] is creating net new opportunities.
You’re getting all these purpose-built applications that are coming out now. A simple example that people may know is this application called Decide. Decide basically scoured the web and pulled all of the web log data for pricing … and used data science to basically figure out whether the price would go up or down. And that’s a single application.
VentureBeat: So you think we’ll see a lot more of these data-based apps?
Horwitz: Yes, I can foresee a ton of other applications coming out that are very purpose-built: This is the data set; this is the algorithm; and this is how we’re going to sell it. [Acclaimed data scientist] DJ Patil actually coined the term “data products.” So I see more data products coming onto the market.
But they’ve been with us [for a while]; Google is a data product. When you search on Google, it’s running machine learning to provide you with the answers. On the other hand, you have platforms like Alpine that are actually allowing you to create a lot of different data products. But it’s not just the creation; it’s the actual fine-tuning.
So I think the reason why those became a whole company is because you’re now just managing this algorithm all the time. Historically, it’s been very challenging to make updates. When you heard about Google’s new search engine updates, like Panda, they were historically coming out around once a year. Now they’re coming out every few months.
Google, I think, has been on the leading edge of big data for a very long time. You saw it with HBase [the open source database modeled on Google’s Bigtable] and their Bigtable and [with] Dremel. You can see that Google is just continuing to lead the edge. Basically, they are showing that these static machine [learning] algorithms are not enough anymore. They really need to be dynamic, and it really needs to be something anyone in the enterprise can go in and actually tweak.