How Stephen Wolfram plans to reinvent data science & make wearables useful (interview)

The Wolfram Language will become publicly available in the next few weeks, Wolfram Research founder Stephen Wolfram promised in a recent keynote at SXSW.

In that keynote, he demonstrated a few of the Wolfram Language's amazing powers: In response to nearly natural-language queries, it's able to parse phrases and generate remarkably sophisticated responses. A few of the demonstration's more outrageous examples, all accomplished with just one line of code (often via iterative, interactive modifications of that single line):

generating a list of former Soviet republics, displaying their flags, and computing which one was visually the most similar to the French flag.
attempting to recognize handwritten digits and translate them into computable numbers
reformatting images so they look as they would appear to a dog
deploying Wolfram code to an interactive website with a public URL
creating and deploying geometric objects into a world created by the Unity gaming engine.

A host of other Wolfram code examples are on the company's web site.

At first, the language looked like an advanced query syntax based on the knowledge behind Wolfram Alpha, the company's "computational knowledge engine," or what most of us would call a search engine. But the Wolfram Language turns out to be a powerful tool not just for querying but for creating interactive applications and integrating them into other software and websites.

We sat down with Wolfram shortly after his talk for an hour-long interview. In the first part, below, we discuss the Wolfram language, its application to wearables and corporate data science needs, and Wolfram's own life-logging efforts.

Read part 2 of this interview: In Stephen Wolfram’s future, ‘a box of a trillion souls’ will create any universe we want

VentureBeat: Your talk cleared up some questions that I had about the Wolfram language. It doesn't sound like a language like Python or Java, where you can create anything you want.

Stephen Wolfram: You can create anything you want. You're just starting from a much higher threshold, a higher level of abstraction.

Let's start with a language we both know, which is human natural language. That's a very rich thing. It has a certain amount of knowledge built into it.

Computer languages, when they originated in the 1950s and so on, they had to be a lot simpler. The original idea of the early computer languages was to put a wrapper around the intrinsic operations of the machine. Basically, what [computer] languages are specialized in is doing better and better what one might unfairly call bureaucracy management. In other words, you're building bigger and bigger programs. The programs are all pretty low level, but you're managing. You get into millions of lines of code and things like that.

Even back in the 1950s and so on, people had ideas about making higher level languages that were more abstract than what the machines were doing. Lisp was an example. APL was another partial example. It just didn't work very well at that time.

The thing that happens now is, not only do we get to have a big layer of abstraction between the machine and what we're writing code with, but we get to do another really interesting thing. We get to put all these algorithms and all this knowledge into the language, a little bit like what happened with human natural language.

We also get to leverage what comes from human natural language interfacing with our language.

What is a language? A language is a means for expressing things. We have an extremely big check mark on that score.

If you say, "Can you write a 'for' loop in Wolfram language?" Sure, you can write a 'for' loop. It's a stupid thing to do, but you can do it. People who already know C, Java, or something like that, there's a certain tendency to write the 'for' loops in the [Wolfram] language. You can do that, but it's the wrong thing to do. There's a certain amount of unlearning that ends up being necessary.

In that sense, it will be a disappointing moment in technology history if our definition of a language was a thing that had 'for' loops and had to involve this very low-level, procedural way of approaching creating functionality.

Ultimately, what we want to do with language is we want to express ourselves and get the things we want to do done. My theory of that is the more we can have that done automatically by a system, the better for us.

VentureBeat: But the Wolfram language has an API so you can deploy procedures to the cloud. You can address those procedures and interact with them. You can embed them in Java programs and Python and so forth.

Wolfram: One of the objectives is we're dealing with curating the world. Curating the world involves knowing all the chemicals that exist and things like that. It also involves knowing all the programming languages that exist and being able to interface with those things, and knowing all the connected devices that exist and being able to interface with those kinds of things as well. It's really using the same meta methodology but applied to all these different areas.

One of the consequences of that, one of the practical consequences, is yes, we are an extremely embeddable technology, so to speak.

Wearable devices and data

VentureBeat: You have this curated list of all the wearable devices that are collecting data. Presumably, this will let you interface with those devices at some point, correct?

Wolfram: Yeah. It's a messy layer. There are a zillion companies that are in this.

[Their problem now is,] "We're going to go from a device to 'how do you make contact with the cloud?'" I hope they all win, because then we won't have to build that layer. [laughter]

We've built a certain part of that layer, but, honestly, it's not the thing we're most interested in. What we're most interested in is, after the things make contact with the cloud, what happens then? The device makes web contact. Then what happens to the data?

A typical thing that happens is, once it can make Web contact, [the device] is sending something to the Web. It'll be going to some URL. In the query string of the URL, it can stick the data that it's generated.

Once it can do that, once it can talk to the web in some way, then our goal is to be able to take over from there and do interesting things with it. Get that data easily into our cloud.

We've got the cloud infrastructure. We've got this instant API mechanism, so it's easy to create APIs that these devices can talk to. More than that, we have this WDF system, this Wolfram Data Framework system, which is a way of [canonizing] all these different kinds of data that all these devices are producing.

If you're wearing six different kinds of health metrics and personal analytics tracking thingies, and they're measuring all kind of different things, we have a canonical format that all these things can map to that we can then do computations from. The question is, what do you do with it?

Our data science platform, the concept of that is to be able to take large lumps of data and do things with it. Its basic workflow is, data comes in, analysis happens, report goes out. It's a little bit generalized relative to that in the following ways.

When data comes in, we have some really good technology for finding what's interesting in the data. You can generate endless charts and graphs and tables, and things about the data. We have good ways of figuring out what is likely to be the thing where you say, "Oh, that's an interesting feature of my data," both because you know a lot about the world and because we have good algorithms for just dealing with the actual raw data. First step is automatic data analysis.

The second step is you take the output from that and you say, "Okay, if I know how to write code, I want to customize this in all kinds of ways; and I'm going to write my own special visualization, or whatever else, from the system."

Then from there you go, and you say, "Okay, and the output I want is a report," because that's the typical output that you end up wanting. Then we have this way of making this symbolic document template for the report that's really a very nice mechanism, that allows you to just basically flow this data into an interactive document that is the report, and then you can generate that.

You can just say, "Make [the report] every Monday morning, and send it to me in email." You can also trivially connect it to one of our instant APIs and say, "Go to this API, and then you'll generate a report." That API could be keyed to some custom ID, or something like that, and that could be the report for that particular customer or something.

Data science through natural language

VentureBeat: Many of the examples that you showed in your keynote were based on public data, capitals of former Soviet republics or something. What's your model for handling private data, like my own glucose reading?

Wolfram: There are several different models. For personal data, obviously you have an account, it's in our cloud, and you can upload the information, and it'll be done in the same way every other cloud company deals with: in Wolfram's own cloud.

Now, there will be consumer companies that use our cloud infrastructure and want their own private instance. It's basically just a private instance of our whole cloud infrastructure. It's our technology, their cloud, so to speak.

In the more corporate case, with our data science platform, we're able to run that in a private cloud connected to local databases that somebody already has. We think we have a really nice way to map lots of kinds of database content into a symbolic representation that's really easy to manipulate in our language.

VentureBeat: I could mount my MySQL database or my Hadoop data store to the formats that Wolfram can then parse and use?

Wolfram: Yeah, so there's four levels of hierarchy, and we're trying to deal with all of them.

Level one is in-memory, which means that thing is small enough [that] it will be manipulated in memory. That's good up to maybe a gigabyte or something of data -- which, by the way, is a lot of what people actually have.

Level two is that, but it's in files on the disk, and operations are being done on it in a chunked sort of approach.

Level three is it's in a database, a single database somewhere, and then we are effectively turning our symbolic language into a bunch of SQL queries, getting results back, and bringing them into symbolic layers [in the Wolfram language].

[The fourth level] is distributed stuff, Hadoop for instance. We've had a Hadoop link for Mathematica for a while, which we use for our own web analytics stuff internally. I think it's on GitHub if I remember correctly.

There are certainly people who have really big data, petabyte-sized data sets, but there's an awful lot more people who have gigabyte-sized data sets. We can do some really interesting things with the gigabyte-sized data sets, and we're gradually working towards the petabyte-sized data sets. Our goal is to make those four levels of hierarchy be essentially seamless so that you're interacting with the system, and it will decide whether it's storing it in memory or putting it out to files, and so on.

The other thing that's really interesting about our way of dealing with data is not only do we have our symbolic language for representing queries and things; we also get to use natural language to represent queries, which is something that takes us far off the normal reservation of what people would imagine was possible.

Where that gets really interesting is when you have a corporate database, and it has some schema, and it's got names of things in it, and we have to basically parse what those names are to be able to make natural language queries. So there'll be some field that says job_title, or something. That might be a fairly easy field to understand, but then somebody makes a natural language query that says, "People who are in finance at whatever company," you've got to figure out what that means. It's the job_title field in this database.

The neat thing is that we have the technology stack to do this. We're going to see a whole raft of natural language querying capabilities for these large corporate database kinds of things.

VentureBeat: The corporate data science person could use Wolfram to build a simple natural interface to accompany data, which then the CEO or the marketing guy could query and say, "Show me customers that pay more than $100,000 a year that we haven't talked to in the last 90 days."

Wolfram: That's right. That'll actually work. [laughs]

As you can see from that demo [at SXSW], it's pretty easy to deploy it as an app and just have a field there and type it in.

The data science platform that we're building, its one core customer base is people who call themselves "data scientists." This will give them a vastly more streamlined way to do their work, a way to deploy what they're doing across the organization in a more effective way.

Another direction is the device companies. "We've got a device. We're going to have a bunch of consumers. How are we going to communicate the data from that device to the consumers?" It's a slightly different use case.

One case is a fairly small number of data scientists dealing with a comparatively small number of executives at a company, whereas in the other case, it's probably less data, but it's dealing with many more people, and so on. Both of those things are pretty easy for us to handle in the system that we're building.

By the way, in terms of opportunities that I hope we're creating, I actually think the biggest early opportunities are around our programming cloud. My theory is there are a large number of pent-up algorithmic startups in the world. In other words, there are people who have an idea. I know a bunch of such people, who are a random professor somewhere, who's been studying some thing or another for years and years. They know how to do whatever it is. They know the algorithm. Maybe they've even written Mathematica code, in the past, to run that algorithm. The question is "How do they deploy it?"

Right now, getting a production website up and being able to do all of the plumbing for that and so on is fairly difficult, and it's a different type of skill than the skill of somebody who knows about some algorithm.

What's about to happen, and it's going to be really interesting to see in the incubators of the next couple years, is that a lot of people who have an algorithmic idea can get to actually deploy it easily. I'll be interested to see what people come up with.

Life logging insights

VentureBeat: The thing that you're wearing on your shirt, what is that?

Wolfram: That's a life logging thing [a Narrative Clip, formerly known as Memoto], which probably has run out of battery by now.

VentureBeat: How do you use that?

Wolfram: I don't use it very often. I had the very first version of this, at the previous SXSW, and last night I thought I would review what I'd done at my talk last year, and I found very first time when I had a Wolfram language wolf icon up on the screen.

It's been interesting for me because I collect lots of data. Things like trade shows are a really good place to wear this thing, because then it's trivial to just review what [I've seen].

But I've found in terms of my everyday life and times... First of all, I'm a remote CEO, who's mostly on the phone and looking at a computer screen all day.

VentureBeat: That's a pretty boring life log.

Wolfram: Actually, when you do all this personal analytics stuff, it's a little bit scary how much like physical systems the operation of people is. We did a lot of this personal analytics from Facebook data, you just get these curves.

For myself, I have all this data on myself, and while there's lots and lots of stuff going on, the overall picture, there's very smooth curves: "This is when I go to sleep every day, at that time, plus or minus 15 minutes," and it's very [ritualistic].

VentureBeat: I have yet to learn anything from my own life logging. I'm collecting far less data that you are, but so far the only thing I've learned is that there's a slight negative correlation between the hours that I sleep, and the amount of caffeine I consume, which is obvious. In other words, you could collect a lot of data, and it might not lead to anything more than very matter-of-fact conclusions.

Wolfram: I agree with you.

I collected data for 20 years without doing any analysis of it. I think I have the largest collection of personal analytics data that anybody has, which surprises me. I expected I would hear from people who said, "Oh, I've got so much more than you have," but I didn't.

I would say that if you ask me again in a year what I really learned from it, when we've built some more tools for easily being able to [query it], I think I will have better things to say.

Like here's an example. I wondered whether different computer keyboards I was using allowed A, different typing speeds, or B, different error rates. A simple question. I clearly have the data to answer that question. It's a little bit of a pain for me to answer it. I did go through eventually and answer it, but at the point where it is totally trivial for me to answer that question, I can get an answer in 30 seconds, I'm going to ask that question, and then I'll learn something. It's not profoundly important, but I might as well type on a keyboard that's five percent faster than the other one.

I think as we lower the barrier to getting these questions answered, more things will turn out to be useful.

You could say the same thing about something like web search. I used to use all these online database systems back in the late 1970s and so on, which were a pain to use. I went to the trouble of using them. Almost nobody else that I knew, except librarians, ever used these things, and you might have said, "Well, there's nothing useful you can get from all this stuff. It's only good for a few librarians," and so on, but that wouldn't have been true.

What matters is making it really easy to do it, and hopefully we'll have some good tools for that.

Tomorrow: Wolfram's predictions for the future of computing, robotics, and the fate of humanity.