
Facebook today announced that it has released the data it used to train its artificial intelligence software to understand children’s stories and predict the word missing from a given sentence in a story.

The data set (.tgz) is more than 1.6GB and accompanies a recently published academic paper called “The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations.” Facebook chief executive Mark Zuckerberg gave an overview of the research today in a Facebook post:

Language is one of the most complex things for computers to understand. Guessing how to complete a sentence is pretty easy for people but much more difficult for machines. Historically, computers have been able to predict simple words like “on” or “at” and verbs like “run” or “eat”, but they don’t do as well at predicting nouns like “ball”, “table” or people’s names.

For this research, our team taught the computer to look at the context of a sentence and much more accurately predict those more difficult words — nouns and names — which are often the most important parts of sentences. The computer’s predictions were most accurate when it looked at just the right amount of context around relevant words — not too much and not too little. We call this “The Goldilocks Principle”.
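
As a rough illustration of the task Zuckerberg describes, the following Python sketch fills a blank in a query sentence by comparing the words around the blank with a window of words around each candidate's occurrences in the preceding story text. It is a toy word-overlap heuristic, not the model from the paper; the `xxxxx` placeholder, the function names, the `half_width` parameter, and the example story are all assumptions made for illustration.

```python
# Toy sketch only: a word-overlap heuristic, NOT the method from the paper.
# The placeholder token, names, and example text are illustrative assumptions.

PLACEHOLDER = "xxxxx"  # assumed marker for the missing word (lowercased)


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def window(tokens, center, half_width):
    """Return up to half_width tokens on each side of tokens[center]."""
    lo = max(0, center - half_width)
    hi = min(len(tokens), center + half_width + 1)
    return tokens[lo:center] + tokens[center + 1:hi]


def predict_missing_word(context, query, candidates, half_width=4):
    """Pick the candidate whose surroundings in the context best match the query."""
    ctx = tokenize(context)
    qry = tokenize(query)
    blank = qry.index(PLACEHOLDER)
    # Every query word except the blank itself.
    query_words = set(window(qry, blank, len(qry)))

    scores = {}
    for cand in (c.lower() for c in candidates):
        best = 0
        for i, tok in enumerate(ctx):
            if tok == cand:
                # Count distinct words shared by this window and the query.
                overlap = len(query_words & set(window(ctx, i, half_width)))
                best = max(best, overlap)
        scores[cand] = best

    # Ties are broken in favor of the earliest candidate in the list.
    return max(scores, key=scores.get), scores


story = ("the dog chased the red ball across the garden . "
         "the cat slept on the warm table .")
question = "the dog chased the XXXXX again"
options = ["table", "cat", "ball"]

guess, scores = predict_missing_word(story, question, options, half_width=4)
print(guess, scores)
```

In this toy example a half-width of 2 leaves all three candidates tied on overlap, while a half-width of 4 singles out "ball". That only hints at why the amount of context matters and says nothing about how the published model behaves.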

The data set, which draws on books available from the volunteer-led Project Gutenberg, is now accessible to academic researchers and to researchers at other companies who want to improve language understanding systems for their applications.

Facebook has previously open-sourced some of its artificial intelligence source code, as have other major web companies, and has even shared designs for its artificial intelligence servers. Data releases are another way for Facebook to share its resources to advance research.

Yahoo, another company that engages in artificial intelligence research, recently released a 13TB data set for machine learning research, but it’s available only to people affiliated with academic institutions.

More information on Facebook Artificial Intelligence Research’s “Children’s Book Test” is here.

