Some of the world’s biggest tech companies, from Google to Facebook, are data-driven, but few startup founders have any idea what a data scientist does, never mind whether they should hire one. Here is VentureBeat’s guide to data science for startups.
What does a data scientist do?
DJ Patil led LinkedIn’s data science team and is now the data scientist in residence at Greylock Partners. His free ebook “Building Data Science Teams” provides an excellent introduction to the basic areas of data science and how to build a team.
For startups, the most relevant applications of data science are probably decision science and product and marketing analytics. Decision science, as the name implies, allows you to identify and monitor key metrics for your business and answer strategic questions like “Which country should we expand into next?” or “What is the impact on the business if we lose this client?” Google’s data science team even drives its HR policies.
Product analytics covers anything from how users are reacting to new features to developing standalone data products. LinkedIn’s “People you may know” feature and Amazon’s recommendation system are data-driven features that attempt to keep users on the site longer or drive more sales.
Using data to showcase or market a product is the domain of marketing analytics. One of the best known examples is OkCupid’s OkTrends blog, which features posts like “The case for an older woman” or “The 4 big myths of profile photos”. The blog drives massive traffic to the site and is regularly covered in the media.
Who are the data scientists?
Since data science is a new area, practitioners often migrate from other fields. Their resumes may list maths, statistics, machine learning or computer science, or a data-intensive field like meteorology. Data scientists want to be of central importance to a business, especially when it’s a startup. The best data scientists are both intensely curious and great communicators. They answer important questions and tell good stories using data.
What is data infrastructure?
Data scientists need specialized tools to manage and process large amounts of data. The minimum you need to get started is simple data access, usually via a database. Larger-scale or less uniform data may require a tool like Hadoop, an open source platform for distributed processing of large data sets across clusters of computers, as well as someone with the technical expertise to use it. Data stores like Cassandra are designed to perform well on very large datasets. These are some of the most commonly used tools, but there are many others for tasks such as streaming data collection, querying non-relational databases and job scheduling.
When do you need to hire a data scientist?
VentureBeat talked to data scientist Cathy O’Neil, who herself works for a startup (Intent Media), about when you need to hire a data scientist. If your data volume is growing, if you can’t tell whether you are seeing noise or information in your data, or, more generally, if you are not running your business quantitatively enough, then you may need to consider hiring. The following is our brief Q&A with O’Neil:
VentureBeat: How much data is enough for a startup to justify hiring a data scientist?
Cathy O’Neil: Too much to fit on an Excel spreadsheet. And it’s not just how much, it’s really about how high quality the data is; the best is for it to be clean and for it to not be public, or at least not generally used for the purpose that your business uses it for.
VB: Can a data scientist help you when your startup is still trying to find a product-market fit?
CO: Yes, in various ways. If your business model is itself quantitative, say you are trying to make money on some inefficiency of some market, a data scientist can help quantify the potential for the business. This is traditionally the job of a business person but can get pretty quantitative so is best done in cooperation with an analyst.
More critically, a data scientist can estimate how much and what kind of data will be needed for a given business model. Depending on the business, getting access to high quality data can be a big expense so this is important.
VB: When you won’t have “expected numbers” based on past performance and maybe your business model isn’t fixed yet, how can you make forecasts?
CO: This is tricky and a data scientist can do things like using proxy data, combined with domain expertise, to try to estimate things. In other words, if you have information about how other companies in related fields are doing, you can adjust their numbers to fit to yours, with quantitative estimates of how correlated the fields are.
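As a rough illustration of the proxy-data idea O’Neil describes, here is a minimal Python sketch. It is not her method; it uses the standard linear estimate for one quantity given a correlated one, and it assumes you can put rough numbers on typical values and spreads in both fields. The names `FieldStats`, `proxy_estimate` and `residual_std`, and every figure in the usage lines, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FieldStats:
    """Assumed typical value (mean) and spread (std) of a metric in one field."""
    mean: float
    std: float


def proxy_estimate(proxy_value: float, proxy: FieldStats,
                   own: FieldStats, correlation: float) -> float:
    """Adjust a number observed in a related field to fit your own field.

    Uses the linear estimate own.mean + rho * (own.std / proxy.std) * deviation,
    where deviation is how far the proxy value sits from its field's mean.
    """
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must be between -1 and 1")
    if proxy.std <= 0:
        raise ValueError("proxy.std must be positive")
    slope = correlation * own.std / proxy.std
    return own.mean + slope * (proxy_value - proxy.mean)


def residual_std(own: FieldStats, correlation: float) -> float:
    """Spread left unexplained by the proxy; shrinks as correlation grows."""
    return own.std * math.sqrt(1.0 - correlation ** 2)


# Hypothetical figures: monthly growth rates in a related field and in yours.
related_field = FieldStats(mean=0.20, std=0.10)
your_field = FieldStats(mean=0.10, std=0.05)

estimate = proxy_estimate(0.30, related_field, your_field, correlation=0.5)
uncertainty = residual_std(your_field, correlation=0.5)
print(f"estimated growth: {estimate:.3f} (+/- {uncertainty:.3f})")
```

The weaker the assumed correlation, the more the estimate falls back to your own field’s typical value and the wider the leftover uncertainty, which is why domain expertise matters when choosing those inputs.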
VB: Data scientists sometimes use expensive proprietary software but startups don’t tend to have a lot of cash. Can they also use open source or low-cost tools?
CO: Yes, but you either have to hire a software engineer who has some data science skills, or you may need to have engineers spend quite a bit of time setting things up for the data scientists. In general data scientists don’t have the kind of technical skills that engineers do, and the engineers may need to develop some tools in-house to make things easy for the data scientist to do their thing.
VB: Is it ever useful to hire a data scientist part-time or for a particular project?
CO: Sure, you can advertise on LinkedIn for a data science consultant. This may be a good idea at the beginning of your business to be sure you’re collecting the right kind and enough good data to follow your business plan.
VB: Startup founders are often software developers, but few have expertise in data science. Any advice for hiring?
CO: My advice is, don’t be dismissive when data scientists are bad at programming, because that’s not what they’ve been focusing on. Make sure they know what they need to know about analytics, and figure out how to give them enough support to do those things.
VB: Any other advice on handling data scientists?
CO: Make sure you know what’s sexy about your data.