Inside the mind of a crowd scientist, an emerging flavor of data scientist

As with any niche role -- especially in the tech realm -- what data scientists do all day is a bit of a mystery. Our time is spent knee-deep in numbers, algorithms, and problem solving. We tend to think in equations, and because of that, what we do and what we need don’t always translate across business units.

At Mindjet, an innovation and productivity software company, my projects revolve around finding ways to quantify efficiency and mechanize collaborative brainstorming. And, at this intersection of math, science, and ideation, I’m on the cusp of an even more obscure -- but critical -- specialization that’s currently emerging: crowd science.

What it is

Crowd science is a fledgling segment of data science that combines the fields of statistics, computer science, and the psychology of crowdsourcing in order to better understand patterns of innovation and find ways to make it a repeatable process. A crowd scientist like me uses mathematical techniques to glean information from groups of people, so that we can make better decisions and understand paradigm shifts in crowd behaviors and outcomes.

My three-person team employs these crowd science techniques to find trends and signals in crowdsourced data, and to create models and algorithms for repeatable innovation within our SpigitEngage platform. Wikipedia is another great example of crowd science at work -- it has always relied on the crowd to develop what is essentially a master record of all human knowledge. Even the popular but irreverent Urban Dictionary has taken advantage of the powerful, ever-expanding resource that is the global crowd. All of them create collective, cooperative environments that aim to uncover underlying truths.

Data science vs. crowd science

You might be wondering how crowd science is any different from traditional data science. Data science also deals with finding signals and patterns in large amounts of potentially noisy data, but crowd science explores data that has a subjective element to it: the psychology, variable behaviors, and opinions of the crowd. This begets a different kind of noise from what the typical data scientist must filter out. Members of a crowd can have vastly differing opinions about a topic, might accidentally or intentionally enter incorrect data, or might try to outsmart the system. As a result, crowd scientists must eliminate outlying data points and introduce techniques that ensure honesty. Wikipedia even has a checks-and-balances system (or algorithm): if someone updates an article with faulty information, the article can be flagged and revised by other members of the crowd.
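To make "eliminating outlying data points" a little more concrete, here is a minimal sketch in Python. It assumes the crowd is submitting numeric estimates and uses the modified z-score (based on the median absolute deviation) with the commonly cited 3.5 cutoff. The function name and the sample numbers are hypothetical, and this is one standard technique chosen for illustration, not a description of what any particular platform does.

```python
from statistics import median


def drop_outliers(values, threshold=3.5):
    """Return the values whose modified z-score is within the threshold.

    Assumption: `values` are numeric estimates submitted by the crowd.
    The median and the median absolute deviation (MAD) are used because,
    unlike the mean and standard deviation, they are barely moved by the
    very outliers we are trying to detect.
    """
    values = list(values)
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the crowd gave the same answer, so there is no
        # usable spread estimate; leave the data untouched in this sketch.
        return values
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= threshold]


# Hypothetical example: seven estimates, one of them wildly off.
estimates = [10, 11, 9, 10, 12, 10, 95]
print(drop_outliers(estimates))
```

A filter like this is only a first pass. In subjective crowd data an extreme answer is sometimes the insight rather than the noise, so flagging it for review may be a better choice than deleting it.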

Honesty can also be encouraged with gamification techniques, such as prediction markets in which users must place a “bet” on their submission -- I use quotes because the bet isn’t always monetary; it can instead be made with redeemable points or other forms of virtual currency. Users who put their money where their mouths are tend to be more forthright and tailor their behaviors accordingly. Crowd scientists may also use statistical techniques to flag people trying to game the system, by modeling the distribution of each user’s answers and comparing that to the distribution of the crowd. If the user deviates from the crowd frequently, the probability of dishonesty is much higher, and that user’s submissions can be weighted accordingly.
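The distribution comparison described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions: answers are categorical, the distance measure is total variation distance (my choice for simplicity, many others would work), and the weight is simply one minus that distance. The names `answer_distribution`, `total_variation`, and `user_weights`, along with the sample data, are hypothetical.

```python
from collections import Counter


def answer_distribution(answers, categories):
    """Fraction of answers falling into each category."""
    if not answers:
        raise ValueError("need at least one answer to build a distribution")
    counts = Counter(answers)
    return {c: counts[c] / len(answers) for c in categories}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


def user_weights(answers_by_user, categories):
    """Weight each user by how closely their answers track the crowd's.

    Assumption: weight = 1 - distance, so a user who mirrors the crowd
    gets a weight near 1 and a user who never agrees gets a weight near 0.
    The crowd distribution here includes the user's own answers.
    """
    pooled = [a for answers in answers_by_user.values() for a in answers]
    crowd = answer_distribution(pooled, categories)
    return {
        user: 1.0 - total_variation(answer_distribution(answers, categories), crowd)
        for user, answers in answers_by_user.items()
    }


# Hypothetical data: two users roughly agree, a third always picks "c".
votes = {
    "ann": ["a", "a", "b", "a"],
    "bob": ["a", "b", "a", "a"],
    "eve": ["c", "c", "c", "c"],
}
weights = user_weights(votes, categories=["a", "b", "c"])
for user, weight in sorted(weights.items()):
    print(user, round(weight, 3))
```

In this toy data "eve" ends up with the lowest weight. Deviation alone does not prove dishonesty, since a lone expert can also disagree with everyone, which is why the sketch down-weights submissions rather than discarding them.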

The path to coding, testing, and validating

Often the problems I work on are proposed by a member of the product team or requested by a customer as something they would like to see in the platform. We start with a brainstorming session that includes the data science and product teams. Then I go back to my desk and meditate on potential solutions. I do a lot of thinking, scribbling down thoughts and equations on paper, reading the latest in academic journals on the topic, and researching what other companies are up to. After coming up with possible solutions, the other data scientists and I will lay out the theory behind the proposed solution on the whiteboard, before going back to our computers to code it up, test it, and validate it on simulated or previously collected data. When it passes all tests, we write it up in a white paper, which we hand over to the product team. Eventually, we work with them to implement our discovery as a new feature.

More often than not, the solutions are iterative. In subsequent discussions, we think of new aspects of the problem that we hadn’t considered, and try to work them in. Ultimately we need to deliver a product, so we’ll launch what we consider to be a viable solution, knowing that it can be modified and updated as we unearth additional caveats to the problem.

I also spend a lot of time building models to help customers uncover patterns in activities and behavior in their crowds or networks. We work out algorithms for different product features, like reputation scoring and ranking. Perhaps most importantly, though, my team focuses on developing innovation pipelines and finding methods for quantifying the value of different ideas. This includes analyzing reports, slicing and dicing data, and surfacing patterns in crowd data so that our customers can gain deeper insight into their communities. We study how information spreads across networks, build product features rooted in science, and develop tracking systems for optimization of various activities.

This is where products like Looker are truly beneficial. Looker allows me to create views that my coworkers can quickly access, and it gives them the flexibility to dig around and do their own data discovery. At VentureBeat's DataBeat conference in San Francisco next week, I'll be discussing our use of Looker and, hopefully, providing greater insight into the world of crowd science and innovation.

Anna Gordon is a data scientist at Mindjet, provider of the world’s leading enterprise innovation management platform.