Ponder addresses Pandas scalability challenges with new tools

No one likes having to redo their work. It’s not only time-consuming, but it saps energy, creativity and productivity.

Yet that can often be the case for data scientists, particularly those working with large data sets and using the popular pandas library for the Python programming language. Often, when using pandas to prepare, transform and analyze data in machine learning (ML) workflows, data scientists are forced to sacrifice either convenience or scale. And more often than not, they have to recreate their work from scratch because pandas simply can’t scale up.

“Pandas is the de-facto Swiss-army knife of data science, leveraged across industries for data exploration and machine learning,” said Gaurav Gupta, a partner with Lightspeed Venture Partners, a growing investor in the enterprise technology space. “Unfortunately, it presents users with roadblocks when working with even moderately large datasets.”

Data scientists continue to face significant efficiency gaps in their day-to-day work. According to a 2020 “State of Data Science” survey from Anaconda, data scientists spend much of their time on preprocessing tasks before they even get to training models. Survey respondents said they spend 45% of their time on data loading and cleansing, or “data wrangling,” as it’s also known. They also reported that another 21% of their time was spent on data visualization.

The problem with Pandas

One of the most popular tools in the discipline, pandas is used by millions of data scientists. Still, the free software can become unusable when it comes to the large datasets that are now the norm. Although they may use pandas extensively, data scientists can run into performance problems at scale. As a result, they must rewrite pandas workloads into big data frameworks. In turn, they are producing fewer models and gleaning fewer insights in the production process, said Doris Lee, CEO of startup Ponder. Data pipelines from that point on can also be difficult to maintain and debug.

Ponder, which was founded by researchers from UC Berkeley, has dedicated itself to bridging this gap. The company is using a new round of seed funding to commercialize its open-source tools Modin and Lux. Both tools address pandas usability challenges at scale without requiring data teams to change the way they already work with data or rewrite their existing code, Lee explained. Modin is a “drop-in replacement” for pandas that enables scaling up to large datasets. Lux is a visualization tool that identifies insights in large and complex datasets.
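To illustrate what “drop-in replacement” means in practice, here is a minimal sketch that assumes Modin is installed along with one of its execution engines. The only line that differs from ordinary pandas code is the import; the sample data and the summarize_orders helper are hypothetical and exist only for this illustration. If Modin is not available, the sketch falls back to plain pandas so the rest of the code runs unchanged.

```python
# Minimal sketch of the "drop-in replacement" idea.
# Assumption: Modin and an execution engine are installed; otherwise
# the import falls back to plain pandas and nothing else changes.
try:
    import modin.pandas as pd  # same API as pandas, parallel execution
except ImportError:
    import pandas as pd


def summarize_orders(frame):
    """Hypothetical helper: total order amount per region."""
    return frame.groupby("region")["amount"].sum()


# Hypothetical sample data, small enough to read at a glance.
orders = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "amount": [10, 20, 30, 40],
    }
)

totals = summarize_orders(orders)
print(totals)
```

The point of the sketch is the design choice rather than the numbers: because the swap happens at the import, the data-preparation code itself stays exactly as a pandas user would already have written it.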

“We’re making it easier for data teams to work on larger amounts of data,” said Lee. “Data scientists no longer have to choose between convenience and scale. They can have both.”

The data science startup launched in summer 2021 and was born out of UC Berkeley’s RISELab. Co-founders Lee and Devin Petersohn focused their PhDs on the development of the technology. They were supported by Aditya Parameswaran, associate professor in the School of Information and the Department of Electrical Engineering and Computer Sciences.

As Lee pointed out, data scientists entering the field are often trained on Python and pandas. And although it is a “flexible and powerful tool,” which explains its popularity, the pandas library hasn’t kept pace with the growth of data sets. But without being able to prepare data, it’s impossible to do downstream work. So, to deal with scalability limitations, data scientists often resort to sampling, building toy data sets, or establishing “clunky workarounds,” Lee said.

This is costly, saps engineering resources, and breeds frustration. “It’s a huge productivity sink,” she said.

The road to Modin and Lux

Modin and Lux have been a long time in the making. As Parameswaran explained, the tools are based on years of research and work to bridge usability and scalability in data science tooling. “It’s extremely technically challenging,” said Parameswaran, who serves as Ponder president. “It’s not easy to do, that’s why it’s never been done before.”

Ultimately, he called the impact of the technology “enormous.”

“We are making scalable data science accessible to millions of data practitioners who live and breathe pandas,” he said.

He underscored the importance of listening to data scientists’ concerns. Developers of tools such as Ponder’s and the data scientists who use them both benefit when they meet in the middle: What are data scientists’ preferred platforms, and how can those be made better?

“The temptation in data science is to build tools and chuck them over the wall to the user,” said Parameswaran. “We’re always thinking about the user, keeping the user at the forefront of what we’re building.”

And there has clearly been demand: Ponder’s open-source tools have been downloaded more than 2.5 million times, and are used by several Fortune 100 companies, including Bristol Myers Squibb, GSK, Intel, VMware, Ford and Tesla. Modin is also part of the Microsoft Azure toolkit.

Ponder technology has been used across industry, Lee said, and will continue to be bolstered with the massive growth in AI and ML. For instance, at one ecommerce company, Modin was used to scale up data pre-processing pipelines to use 1,000 times more data. This resulted in orders-of-magnitude improvements in performance. Lux has been used by mobile companies to provide insights when detecting anomalies in their networks, and by pharmaceutical companies when diving into experimental data around drug discovery.

Ponder is supported by a $7 million seed funding round led by Lightspeed Venture Partners, with participation from Intel Capital, 8VC, and The House Fund. The company plans to use this to scale out its current team of 10 and continue growing and supporting its open-source community.

Gupta lauded Ponder’s approach to a unique, as-yet “unsolved problem” faced by data scientists. This has been instrumental to the company’s rapid growth. “It is helping to democratize tooling for the rest of the world,” he said.
