Marketers, take note: Some data scientists want 'big compute'

Many companies have adopted scalable, flexible databases and open-source storage systems such as Hadoop to store and process vast and varied amounts of data that can guide their decisions. But storing and accessing data is only part of a data scientist's job. Data scientists also need to run calculations on that data at scale, and the available technology might not be sufficient.

Hence the rise of a concept called "big compute," which allows for faster and more widely distributed crunching of data.

In a blog post published over the weekend, Michael Malak, an engineer working with data at Time Warner Cable and a board member of the Data Science Association, identified a few specific hardware components that could help data scientists do better big compute: graphics processing units (GPUs) and random-access memory (RAM).

Such elements could accelerate complex queries and processing jobs on top of huge stockpiles of data. The improvements could yield results for both internal analytics purposes and for consumer-facing applications.

GPUs have become popular ingredients for supercomputing as well as games and other visually demanding applications. In big data, they're not talked about quite as often.

A few public clouds offer GPU-based computing options. Amazon Web Services launched G2 server instances relying on NVIDIA GRID GPUs. Peer 1, Penguin Computing, and IBM's SoftLayer also have GPU-based tiers of service.

But Malak was not referring to clouds in his post, he wrote in an email to VentureBeat. He would like to see servers containing GPUs become more widely available for use in on-premises data centers. Companies like his might want to forgo the cloud and instead use their in-house equipment to keep latency low or to maintain security.

In any case, Malak isn't only interested in seeing vendors make more GPU servers for big compute. He also wants hardware makers to squeeze more RAM into servers.

He likes systems such as Apache Spark, which let users read data from memory instead of from considerably slower disk drives. Trouble is, the combined RAM of multiple servers might not be enough to hold the large data sets that more and more big companies keep on hand.
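The RAM constraint comes down to simple arithmetic. Here is a rough back-of-the-envelope sketch in Python. The function names, the example figures and the assumption that only about 60 percent of each server's RAM is free for data are illustrative and do not come from Malak's post.

```python
import math


def fits_in_cluster_ram(dataset_gb, servers, ram_per_server_gb, usable_fraction=0.6):
    """Return True if the data set fits in the pooled usable RAM.

    usable_fraction is an assumed share of each server's RAM left over
    after the operating system and the processing framework take theirs.
    """
    usable_gb = servers * ram_per_server_gb * usable_fraction
    return dataset_gb <= usable_gb


def servers_needed(dataset_gb, ram_per_server_gb, usable_fraction=0.6):
    """Return the smallest number of servers whose pooled usable RAM holds the data."""
    return math.ceil(dataset_gb / (ram_per_server_gb * usable_fraction))


if __name__ == "__main__":
    # Hypothetical figures: a 10 TB data set on 20 servers with 256 GB each.
    print(fits_in_cluster_ram(10_000, 20, 256))
    print(servers_needed(10_000, 256))
```

Under these assumed numbers, 20 servers fall well short, which is the kind of gap that either more machines or more RAM per machine would have to close.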

Then again, maybe a company has enough RAM on hand, but it can only do basic math on the data. "A lot of today's data science either is simple statistics over large data sets, or it is advanced machine learning over small data sets," Malak wrote in his blog post. He's looking for hardware that doesn't force a trade-off between the complexity of the analysis and the size of the data.

"In this new era of distributed RAM processing systems ... vendors need to catch up, so we data scientists can catch up," he wrote.

While he was at it, Malak also called for vendors to increase the number of cores on chips.

Big compute, Malak wrote, isn't "mainstream yet. It needs to be for Data Science to progress."

Such language could give legacy server makers some ideas about how to stand out. Dell and its competitors have been looking less exciting as Taiwanese manufacturers such as Quanta construct loads of custom servers, and a solid "big compute" server line might inject a bit more life into them. If Malak has no such luck, perhaps a trip to Taiwan could be in order.
