Ask Google Assistant or Cortana something like “What’s 4 + 4?” today and you’re likely to hear “8.” Ask a more difficult question, like “What did the ancient Greeks eat?” and chances are that, instead of answering directly, the assistant will point you toward a website to sift through for the answer.

Microsoft Machine Reading Comprehension (MS MARCO), a dataset of 100,000 questions and answers made available to researchers for the first time today, was made to change that.

By open-sourcing a dataset with answers written by humans, Microsoft hopes MS MARCO can spur breakthroughs in artificial intelligence research and begin to help AI read and understand language the way humans do.

That way, instead of having to read through a website to find the answer to your question, you can ask a search engine or virtual assistant, which will skim documents and websites as a human would and then provide a complex or nuanced answer.

The 100,000 questions and answers are based on questions that real people asked the Bing search engine or the Cortana virtual assistant. The answers in MS MARCO were drawn from more than 200,000 documents or websites and summarized by a human.

“The team chose the anonymized questions based on the queries they thought would be more interesting to researchers. In addition, the answers were written by humans, based on real web pages, and verified for accuracy,” said a Microsoft blog post announcing the release of MS MARCO.

Many datasets used to train natural language processing systems today have notable shortcomings, the eight-person team that compiled MS MARCO argued in a paper published last month on the open research publication site arXiv.org.

Most datasets used to train natural language processing (NLP) systems today tend to draw on resources like Wikipedia rather than on the less polished but more realistic questions posed by real people.

MS MARCO is available to businesses and researchers, but the free download is for non-commercial use only.

