Ask Google Assistant or Cortana something like “What’s 4 + 4?” today and you’re likely to hear “8.” Ask a more difficult question, like “What did the ancient Greeks eat?” and chances are that, instead of answering directly, the assistant will point you toward a website to sift through for the answer.
Microsoft Machine Reading Comprehension (MS MARCO), a dataset of 100,000 questions and answers made available to researchers for the first time today, was created to change that.
By open-sourcing a dataset with answers written by humans, Microsoft hopes MS MARCO can spur breakthroughs in artificial intelligence research and begin to help AI read and understand language the way humans do.
That way, instead of having to read through a website to find the answer to your question, you can ask a search engine or virtual assistant, which will skim documents and websites as a human would, then provide a complex or nuanced answer.
The 100,000 questions are based on queries real people posed to the Bing search engine or the Cortana virtual assistant. The answers in MS MARCO were drawn from more than 200,000 documents or websites and summarized by a human.
“The team chose the anonymized questions based on the queries they thought would be more interesting to researchers. In addition, the answers were written by humans, based on real web pages, and verified for accuracy,” said a Microsoft blog post announcing the release of MS MARCO.
Many datasets used to train natural language processing systems today have notable shortcomings, the eight-person team that compiled MS MARCO argued in a paper published last month on the open research repository arXiv.org.
Most datasets used to train natural language processing (NLP) systems today do not use questions posed by real people. They tend to draw on resources like Wikipedia rather than on the less polished but more realistic queries that people actually submit.
MS MARCO is available to businesses and researchers, but the datasets offered as free downloads are for non-commercial use only.