What Facebook knows about data science may surprise you

At Facebook, you don't have to be a "data scientist" to tackle tough data problems.

That's what we heard from Justin Moore, whose data science career spans a pair of financial firms as well as Foursquare and Facebook. And he should know: As an engineering manager in Facebook's New York office, he hires and manages the folks who end up working with Facebook's massive stores of user data.

I caught up with Moore recently to learn about data science at Facebook. Here's the edited transcript of my conversation with Moore:

Eric Blattberg: What attracted you to Facebook, and how did you end up on the Places team?

Justin Moore: Facebook has an open culture: When you join, you pick the team you want to work on. You choose the thing you feel you can have the most impact on. There’s some guidance as to what the company thinks is important, but it’s a really engineer-driven type of process.

When I joined, I had this vertical-specific knowledge -- more of a math, machine learning, data science type of horizontal knowledge -- so I felt I could be most impactful in helping improve our Places experience.

Eric Blattberg: How big is Facebook’s data science team in New York?

Justin Moore: Well, to me, data science is very amorphous. Everybody you ask is going to define it differently. I think someone who has the ability to tackle and work with large data sets often requires a lot of engineering chops to do it. That’s one component. But another piece is translating what the product needs are for the organization into the types of algorithms that they’re going to be writing. So not just tackling giant data sets but figuring out what you should be doing with that data to make a substantive impact on our products.

Some people use [data scientist] as a job title, but [in New York] we have people that range from PhDs in machine learning and natural language processing to web product engineers, and all of them are applying the technique of what I call data science to improving our data set. In the end, what it’s doing is utilizing machine learning and crowdsourcing to build a better, data-driven experience for our users.

Eric Blattberg: How does Facebook -- and more specifically, Places -- leverage the massive amount of user data on Facebook?

Justin Moore: The first thing you want to do is figure out where the product problems are. With Places, for example, there are a number of issues that could crop up: Maybe we’re not utilizing the GPS on your phone accurately; or maybe our ranking model is not as good as it could be, so we think that more popular things are more important than things that are really close to you. That means looking through anonymized sessions to find out, what are the lists of places that people see when they use this product? And what are the problems associated with those search results?

For each one of those, we think about all the different ways we can fix that problem. Often, it’s crowdsourcing: There are a billion people we can ask questions about Places, and they will give us answers. Is this place a duplicate of that place? What’s the address for this place? Machine learning [also comes into play]. Can we infer that this place is a duplicate of that place based on all of the features associated with it? Sometimes it’s a hybrid of the two. You pick things you’re fairly confident about with machine learning and you ask the crowd to confirm.
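
As a rough illustration of the hybrid approach Moore describes, the sketch below scores a pair of places and flags only the fairly confident pairs for crowd confirmation. Everything in it is an assumption made for illustration: the `Place` fields, the `duplicate_score` heuristic (a stand-in for a trained model), and the 0.6 threshold are hypothetical and are not Facebook's actual features, model, or settings.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Place:
    # Hypothetical minimal record; a real system would carry many more features.
    name: str
    lat: float
    lon: float


def haversine_km(a: Place, b: Place) -> float:
    """Great-circle distance between two places, in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def duplicate_score(a: Place, b: Place, max_km: float = 0.2) -> float:
    """Stand-in for a trained classifier: name similarity scaled by proximity.

    Returns a value in [0, 1]; places more than max_km apart score 0.
    """
    name_similarity = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    proximity = max(0.0, 1.0 - haversine_km(a, b) / max_km)
    return name_similarity * proximity


def needs_crowd_confirmation(a: Place, b: Place, threshold: float = 0.6) -> bool:
    """Flag a pair for the crowd only when the score is fairly confident."""
    return duplicate_score(a, b) >= threshold


if __name__ == "__main__":
    joes = Place("Joe's Pizza", 40.7306, -73.9866)
    joes_typo = Place("Joes Pizza", 40.7307, -73.9866)
    zoo = Place("Central Park Zoo", 40.7678, -73.9718)

    for other in (joes_typo, zoo):
        verdict = "ask the crowd" if needs_crowd_confirmation(joes, other) else "skip"
        print(f"{joes.name} vs {other.name}: {verdict}")
```

The point of the threshold is to spend crowd questions only on pairs the model already considers likely duplicates, which matches the "pick things you're fairly confident about, then ask the crowd to confirm" flow in the answer above.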

Eric Blattberg: I’d love to get a sense of your day-to-day experience as a data science manager at Facebook.

Justin Moore: I can talk a little bit more about the experience of the average engineer on my team, just because as a manager I do more boring stuff.

Eric Blattberg: Well, I am interested in that managerial perspective, too. We had an interesting guest post on VentureBeat a few months back arguing that good engineering managers don’t really exist.

Justin Moore: As a manager at Facebook, I wear a lot of different hats. I work very heavily with recruiting, making sure that we’re trying to grow the team and the office the right way, which means growing really fast but also keeping the quality bar really high.

Good engineering managers provide engineers with everything they need to be successful: All the contacts, information, and help they need so they can focus on solving really hard problems on a daily basis. That means coordinating with other teams, making sure that people are talking to each other, letting people know what the high-level goals are for the company, trying to take everyone’s ideas and put them together when they need to be combined, and so on. A lot of what I’m doing is just helping provide that context.

Eric Blattberg: What’s the hardest part of your job?

Justin Moore: I think trying to prioritize, to be as adaptive and flexible as possible, is really tough. There are a thousand things you could potentially work on. What’s the one, as a team, we should focus on -- at a top level and as individuals -- that will make the most impact? Facebook has a lot of engineers, but we also have a lot of problems to solve. If I’m a machine learning engineer, maybe I should be working on some sort of UI change, because that would be more impactful than fixing a broken classifier.

Eric Blattberg: What skills do you need to be a data scientist at Facebook?

Justin Moore: You need to have really strong math skills, the ability to pick up statistics, and whatever else you need to be a strong software engineer. It’s the same interview process: You’re basically a software engineer, which we have a very high bar for here. You also need to have a product sense: You need to be someone who can not only just write algorithms, you need to know why, to figure out when somebody says that something is a problem, to say, ‘This is what I think we should do from an algorithmic perspective to solve that problem.’

Eric Blattberg: Given all of that, how do you learn those skills without drowning in debt?

Justin Moore: I don’t think it’s necessarily a school thing. I was doing a lot of discrete math in undergrad, but I wasn’t a statistician. The product sense part is just an aspect of an individual; that part seems a little bit more innate. But the computer science and the math aspects, you can pick those two up [on your own] -- though it’s better to get at least one in an academic setting, because you want to have some strong base to work from.

I think folks want different types of data scientist: Some want more base strength in statistics, others want more base strength in computer science. We lean toward the latter here [at Facebook], but we also have folks in the business analytics division who lean more towards the former. You don’t have to go get a PhD, necessarily. It’s more about passion. When you talk about deduplicating places in the databases, their eyes light up. Other people think that’s really boring. The first folks are just data people.

Eric Blattberg: So, given that you’ve been working as a data scientist for nearly a decade, how has the field evolved over the years? What’s exciting to you about what’s coming up, both at Facebook and, more generally, in the broader tech world?

Justin Moore: I think some of the tools we’re building here and open sourcing, and some of the tools other companies are building and open sourcing -- Cloudera and a lot of other companies are also doing a really good job with this -- are making it so that anyone can be a data scientist or at least tackle these types of problems. I think that’s the really exciting thing: I love to see people doing really complicated things in an easier way.

There are lots of components to that. One is being able to process large data sets, so [software like] MapReduce, Hadoop, Hive, and Presto has been game-changing. There are lots of new languages like Julia, MATLAB, and R that are allowing you to sort of prototype this stuff. You can now run experiments at scale very easily and know how many people to put into control and experimental groups to figure out if there are significant differences. Once all of those tools get in place -- and I think the gaps are slowly being filled -- then anyone can come at a problem and say, ‘Hey, I want to try this, I want to change this, I want to build a classifier,’ and they don’t need to know deep machine learning or deep statistics, or even how to write code. They can just attack the problem. Anyone can do it.
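
To make the point about sizing control and experimental groups concrete, here is a minimal sketch of the standard normal-approximation formula for comparing two proportions. It is not a description of Facebook's experimentation tooling; the function name, the default 5% significance level, and the default 80% power are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(
    p_control: float,
    p_treatment: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed in EACH group for a two-sided two-proportion test.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2.
    """
    if p_control == p_treatment:
        raise ValueError("The two rates must differ to size an experiment.")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment

    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical rates: detect a lift from 10% to 12%.
    print(sample_size_per_group(0.10, 0.12))
```

The smaller the difference you want to detect, the more people each group needs, which is why this kind of calculation is worth running before an experiment starts.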
