Finding a great data scientist can feel like searching for Princess Peach. She’s always in another castle.
There are plenty of programmers who can match a startup’s pace. There are plenty of PhDs with solid research backgrounds. But there’s a serious dearth of job applicants equipped with both skill sets.
Foursquare veteran Michael Li is working on a solution: a hacker bootcamp for data scientists. It’s called The Data Incubator.
The New York startup intends to take the brightest science and engineering PhDs and propel them into data science careers. The inaugural program is scheduled to begin this June.
It’s not designed for folks starting at square one. Applicants should already have some programming experience as well as strong quantitative and communication skills. Li plans to further familiarize his fellows with the tools and technology stack employers actually care about.
That’s an attractive proposal for PhDs uncertain about their next move. But it gets better.
Beyond room and board expenses in New York City, the six-week program won’t cost participants a penny. Instead of student tuition, The Data Incubator will charge its employer partners if they decide to hire Data Incubator alumni as data scientists or quantitative analysts (quants).
It’s the same model employed by Hacker School, another New York program for coders. More often, however, programming bootcamps will charge for their courses upfront (General Assembly) or take a slice of their alumni’s first-year salaries (App Academy).
Li doesn’t think it’s right to make PhDs pay for technical training after all the other expenses they’ve incurred at university.
“We’re really trying to turn the educational model on its head,” said Li. “The idea that you should pay for your own training when you’re so close to being employable — I just don’t think it’s right.”
An autobiographical startup
Li, who is The Data Incubator’s executive director, has worked as a plasma researcher at NASA, as a quant at major financial institutions, and most recently as a data scientist at Foursquare. Simultaneously, he worked toward a computational and applied mathematics PhD from Princeton, which he received last year.
But he was frustrated that the skill set he gleaned from his academic studies often didn’t align with professional expectations. In his private sector work, standards rose, and timelines shrank.
“You go from spending five years of your life working on one really hard, deep problem to having five days to get your project done, if you’re lucky,” he said.
While academics get their jobs by delving deep into the details, most companies have little tolerance for minutiae. They’re looking for the quick and dirty solution, not flawless code with every little detail ironed out.
As the scale of data accessible to us grows, that skill gap is becoming even more acute, Yann LeCun, Facebook’s director of artificial intelligence research, told VentureBeat. LeCun, who is also the founding director of New York University’s Center for Data Science, says the educational system has yet to catch up.
Although Li was clearly a stellar student (a former math professor attested to his academic aptitude), he felt plenty of pain points both in getting hired and in his first few months on the job.
But now, as Li marches toward The Data Incubator’s June launch, those troubles have become one of his greatest assets: He knows what knowledge gaps others need to fill. And he knows how to fill them.
In designing The Data Incubator’s curriculum, Li took a cue from art schools.
Throughout the six-week program, each fellow will craft a portfolio project. It should showcase their ability to gather and clean data, apply some meaningful machine learning methods and statistical analyses, and present or visualize the results in an engaging, digestible manner. By the end of the program, they’ll have a substantial piece of code for potential employers to review.
“I have some modules written for various things that they should know, but it’s mostly me guiding them through building a portfolio project,” said Li.
Software engineering and numerical computation. Numerical techniques for optimization and vectorized linear algebra. Programming tools including Python, NumPy, SciPy, scikit-learn, matplotlib. Data visualization including d3, ggplot.
Natural language processing. Handling unstructured data, stemming, bag of words, TF-IDF, topic modeling.
Statistics. Hypothesis testing, regression and classification, ensemble methods, cross-validation, bias-variance decomposition, data normalization.
Databases and parallelization. SQL, Hadoop, MapReduce, Hive.
Along the way, fellows will familiarize themselves with key technical skills, ranging from software engineering and numerical computation to databases and parallelization. They won’t all come away with the same skill set, but there will be some common ground, like a focus on Python.
“If you want to be a data scientist, these are the things you need to know,” said Li.
We lucky few
Li is currently sorting through more than 1,000 applications.
Together, Data Incubator applicants represent over 80 different universities. They’re mainly PhDs and post-docs, but some junior faculty and assistant professors have also applied to Li’s bootcamp.
Li has yet to decide how many students will make it into the first batch, but he assured us it’ll be a small number.
“We cannot accept 5.8 percent,” he said, referencing Harvard’s acceptance rate. “It’s just not possible.”
That selectivity will enable Li to pick the folks with the intellectual firepower to excel in the world of data science. They’ll come pre-equipped with 90 percent of the difficult-to-learn skills: the math and stats expertise. Li’s program is all about reinforcing that last 10 percent: the technical training, plus some communication and networking skills.
“It’s not like any of [that material] is conceptually complicated: Bright people can pick it up pretty quickly,” said Facebook’s LeCun. “The stuff you can’t pick up on the fly is the math, and these people [already] have it.”
With some initial financing from a Cornell Tech accelerator program on hand, Li’s primary focus now is courting the employer partners that will fund his venture in the long term.
So far, around 20 companies have agreed to participate as employers of the bootcamp’s alumni. Many are in the “mainstream” tech sector, including Foursquare and Etsy. Others represent the health care and financial services industries, like oncology data firm Flatiron Health and algorithmic execution broker Quantitative Brokers. Mashable is holding down the fort as The Data Incubator’s first media employer.
Companies don’t have an obligation to hire Data Incubator alumni, but if they do make a successful hire, they’ll pay the equivalent of a recruiter fee.
Normally, when companies post a data science job listing — say, on a university job board — hundreds of resumes come flooding in. It’s difficult for the firm to separate the qualified applicants from the rest of the crowd.
“In general, when you look at a resume and interview a candidate, this ends up being a relatively shallow assessment of their ability to study and build new things,” said Robyn Peterson, Mashable’s chief technology officer.
“Li’s had to face this problem directly, and any way we can help him and some aspiring data scientist do both, we will.”
Effectively, The Data Incubator will pre-screen candidates. Plus, the alumni will all have portfolio projects, so employers can dig into some tangible, recent material before running their own tests.
With recent university grads often lacking the right skills for the job, companies are showing a preference for more seasoned candidates. But there are a few advantages to hiring straight from a university: Foreign students can get a visa more easily, and folks emerging from degree programs might start with more subject-specific knowledge.
And since there’s no obligation to hire, companies face very little downside in participating in Li’s program.
“We’re open to all kinds of ways to get good people,” said Quantitative Brokers cofounder Robert Almgren. “We only really need one guy who is fired up about financial markets and has the skills to add something.”
Big data, fewer scientists
San Francisco-based Zipfian Academy, another data science bootcamp, opened its doors last September. General Assembly offers an introductory data science course in New York. Insight Data Science runs data science fellowships for post-docs in both Silicon Valley and New York.
So The Data Incubator is not a wholly original idea. But that doesn’t matter.
The data science education market is far from overcrowded: Demand for data scientists continues to outstrip supply. The McKinsey Global Institute estimates that by 2018, the U.S. will face a shortage of 140,000 to 190,000 people equipped with the deep analytical skills necessary to make sense of big data.
That’s why these bootcamps — and other data science education initiatives, from online courses to university programs — are so crucial. If The Data Incubator encourages just one physicist, engineer, or statistician to pursue data science over day trading, that’s a small but not insignificant step toward closing the talent gap in this young, amorphous industry.
Li has big ambitions for his little program. He knows a brick-and-mortar bootcamp can only reach a limited audience. In the long run, he hopes to take The Data Incubator online so it can reach thousands.
For now, though, he’s focused squarely on opening the doors this June.
“I really want us to be the company that sets the standard for what it means to be a data scientist,” said Li.