Society's next big challenge: infinite data

This is a guest post by Christian Fritz, a researcher at PARC.

Over the past 10 years or so, many organizations have recognized the conceptual value of data and have started recording and retaining more and more of it. But after doing this for a while, they’re asking, "What the heck do we do with it?"

Google, Netflix, and other big companies have taught us that data is valuable for the insights that can be obtained from it, so others have started exploring their own data and want to do more with it. Some companies do this without setting any expectation as to what they will learn.

Rather than starting with a question and looking for an answer, people started finding the questions data was already answering. The idea has been to find the hidden potential of data, and we’ve already seen benefits to doing this type of analysis.

Fixed data or fixed problem?

What if you are really committed to solving a specific prediction problem or have a specific question in mind? Project managers know that they need to decide which of their scope, budget, and schedule are fixed and which are more flexible, and that this decision can fundamentally change the nature of a project.

We need to make that same kind of decision in data analytics projects. The common opportunistic nature of "big data" implies that the question is more flexible than the data that can be used, which is fixed. If you reverse this, fixing the question and accepting flexibility in the data, you get what this post calls "infinite data."

Predicting suicide

A recent example of an infinite-data problem is the Defense Advanced Research Projects Agency's (DARPA’s) interest in predicting suicide.

Consistent with DARPA's mission statement, this is an extremely ambitious endeavor, a "DARPA-hard-problem," as DARPA would say. Also consistent with the way DARPA operates, there is not one prescribed way of approaching the problem or a defined set of data sources to be used. Hence, the playing field is wide open for approaches that rely solely on analyzing existing data on suicide in more depth and using more sophisticated machine-learning algorithms following the big-data mindset.

However, we can also consider the reverse approach and focus on incrementally identifying and collecting the right data (e.g., by better understanding the specific problems of at-risk personnel and how these problems manifest in measurable data). Once the object of interest is being studied, the possibilities are endless regarding the kinds of data to be collected and analyzed, as long as privacy concerns are met. In the context of suicide, data can come, for instance, from clinic notes, interviews, social media behavior observed by friends, shopping behavior, and location data.

But it does not have to end there -- there is always more data. Your smartphone already has access to tons of interesting data, and companies are already using it to target ads and provide buying recommendations. It seems more than appropriate to use this data to extract some value for users as well. After all, you are the ultimate owner of your behavioral data.

Infinite data: Why now?

The time is ripe for infinite data for two reasons:

Big data has laid the computational and economic foundation for dealing with vast amounts of a wide variety of data.
There are many possibilities for instrumenting environments where predictions can be made, even if the environment is physical. The possibilities exist in the form of very cheap sensors, which are already all around us.

Smartphones already capture such a slew of data about us that it now seems reasonable to try to use this data to make difficult predictions about medical conditions, such as the progression of Parkinson's disease.

New challenges

Given how infinite-data projects differ from big-data projects, how do we go about executing one, and what is required for it to be successful? There are three big challenges with infinite data:

The streams of data never end.
There are infinite ways of pre-processing data to carve out relevant features, including simple combinations of individual data points or more complicated ones such as change detection in the frequency domain of temporally recurring events.
There are always more kinds of data that can be obtained and additional models that can be applied to infer additional data.
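
To make the second challenge concrete, here is a minimal sketch of one of the "more complicated" features mentioned above: detecting a change in the frequency domain of a recurring signal. It is an illustration only, not a method from any particular project. The function names `dominant_period` and `period_changed`, the 25% tolerance, and the synthetic sine-wave data are all assumptions made for this example.

```python
import cmath
import math


def dominant_period(signal):
    """Return the period (in samples) of the strongest non-constant
    component of the signal, using a naive discrete Fourier transform."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant offset
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k


def period_changed(before, after, tolerance=0.25):
    """Compare the dominant period of two windows of the same signal.
    The tolerance is a hypothetical threshold chosen for illustration."""
    p0 = dominant_period(before)
    p1 = dominant_period(after)
    return abs(p1 - p0) / p0 > tolerance, p0, p1


# Synthetic example: an activity level that first repeats every 7 samples
# (say, weekly) and later every 14 samples.
before = [math.sin(2 * math.pi * t / 7) for t in range(56)]
after = [math.sin(2 * math.pi * t / 14) for t in range(56)]
changed, p0, p1 = period_changed(before, after)
print(changed, p0, p1)
```

The single boolean produced here is just one feature out of infinitely many that could be carved out of the same stream. Real event data is usually spikier than a sine wave and has strong harmonics, so a production feature would need something more robust than picking the single largest component.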

In practice, even though data may be infinite, our available computational power and our budget for acquiring new data sources are not. So we need to identify the most relevant and significant features obtainable.
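
One simple way to picture this trade-off is a greedy selection of candidate features by relevance per unit cost. The sketch below is an assumption-laden illustration, not a recommended methodology: relevance is approximated by the absolute Pearson correlation with the target on a small sample, and the costs, the `min_relevance` cutoff, and the names `pearson` and `select_features` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_features(candidates, target, budget, min_relevance=0.1):
    """Greedily pick features with the best relevance-per-cost ratio.

    candidates maps a feature name to (acquisition_cost, sample_values).
    Features below min_relevance are ignored, however cheap they are.
    """
    scored = []
    for name, (cost, values) in candidates.items():
        relevance = abs(pearson(values, target))
        if relevance >= min_relevance:
            scored.append((relevance / cost, name, cost))
    scored.sort(reverse=True)  # best ratio first

    chosen, spent = [], 0.0
    for _, name, cost in scored:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent


# Hypothetical sample: outcomes for six individuals and three candidate features.
target = [0, 1, 0, 1, 1, 0]
candidates = {
    "cheap_and_relevant": (1.0, [0, 1, 0, 1, 1, 0]),
    "relevant_but_costly": (5.0, [0, 1, 0, 1, 1, 0]),
    "cheap_but_useless": (1.0, [1, 1, 1, 1, 1, 1]),
}
print(select_features(candidates, target, budget=2.0))
```

A greedy ratio rule like this ignores redundancy between features and treats correlation as a stand-in for predictive value, so it is best read as a picture of the budget constraint rather than a solution to it.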

Succeeding with infinite data

The team that will succeed with infinite-data projects in practice needs to be multidisciplinary. It needs to include subject matter experts as well as experts in a broad range of computer science disciplines, including feature extraction, signal processing, computer vision, natural language processing, design of experiments, automated diagnosis, spatial and temporal analysis, modeling, and statistical machine learning.

Such a team will develop new methodologies for going about the data identification exercise in a principled fashion that makes the process repeatable and generalizable. Such a methodology might use inferences from data that is already available to guide further data acquisition decisions. Or, the team might deploy "cheap" experiments on sub-populations or at low resolution to estimate the value of the information that could be obtained from a higher-resolution, larger data collection effort.
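
As a rough illustration of the "cheap experiment" idea, the sketch below compares predictions on a pilot sub-population with and without a candidate data source, then extrapolates the measured gain to decide whether a larger collection effort looks worthwhile. Everything here is an assumption for the sake of the example: the function names, the pilot results, the population size, the value per correct prediction, and the collection cost.

```python
def pilot_gain(baseline_correct, enriched_correct):
    """Improvement in the fraction of correct predictions on the same pilot
    group, with (enriched) versus without (baseline) the new data source.
    Each list holds 1 for a correct prediction and 0 for an incorrect one."""
    n = len(baseline_correct)
    if n == 0 or n != len(enriched_correct):
        raise ValueError("pilot results must be non-empty and the same length")
    return sum(enriched_correct) / n - sum(baseline_correct) / n


def net_value_of_scaling_up(gain, population, value_per_correct, collection_cost):
    """Expected benefit of the full collection effort minus its cost,
    assuming the pilot gain carries over to the whole population."""
    return gain * population * value_per_correct - collection_cost


# Hypothetical pilot on ten individuals.
baseline = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
enriched = [1, 1, 1, 0, 1, 1, 1, 0, 1, 0]
gain = pilot_gain(baseline, enriched)
net = net_value_of_scaling_up(
    gain, population=10_000, value_per_correct=5.0, collection_cost=8_000.0
)
print(gain, net, net > 0)
```

This point estimate ignores sampling uncertainty, and a pilot this small would not support a real decision. A fuller treatment would put a confidence interval around the gain before committing to the larger effort.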

Whatever the optimal methodology ends up being, two things are clear: challenging prediction problems will get solved, and tons of new data will be collected. That data will set off a new wave of opportunities of its own, closing the loop with big data as we know it.

Christian Fritz, Ph.D., is a PARC researcher working on real-world applications of artificial intelligence with particular interest in the combination of symbolic knowledge representation and machine learning; behavior recognition; planning; execution monitoring; modeling; and diagnosis.

Big data image via Toria/Shutterstock