Natural language processing (NLP), the AI subfield concerned with how machines read and understand text, is far from solved, in part because word order and syntax can enormously change the meaning of a sentence or phrase. Consider the paraphrase pair “Flights from New York to Florida” and “Flights to Florida from New York,” which convey the same thing. Even state-of-the-art algorithms fail to distinguish them from a snippet like “Flights from Florida to New York,” which has a different meaning.
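
To see why such pairs are hard, here is a minimal Python sketch, not drawn from Google's work, of a word-overlap score that ignores word order. The helper names (`bag_of_words`, `word_overlap`) are illustrative assumptions. All three phrases above contain exactly the same words, so any model that leans on overlap alone cannot tell the paraphrase from the non-paraphrase.

```python
from collections import Counter


def bag_of_words(text):
    """Return a multiset of lowercase tokens, discarding word order."""
    return Counter(text.lower().split())


def word_overlap(a, b):
    """Jaccard overlap between the token multisets of two phrases."""
    bag_a, bag_b = bag_of_words(a), bag_of_words(b)
    shared = sum((bag_a & bag_b).values())  # tokens present in both
    total = sum((bag_a | bag_b).values())   # tokens present in either
    return shared / total if total else 1.0


query = "Flights from New York to Florida"
paraphrase = "Flights to Florida from New York"
non_paraphrase = "Flights from Florida to New York"

# Both scores come out identical, although only one pair is a paraphrase.
overlap_paraphrase = word_overlap(query, paraphrase)
overlap_non_paraphrase = word_overlap(query, non_paraphrase)
print(overlap_paraphrase, overlap_non_paraphrase)
```

Because the two scores are the same, telling the pairs apart requires a model that is sensitive to structure, which is the gap the new corpora target.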

Google thinks a greater diversity of data is one of the keys to solving hard NLP problems, and to this end, it’s today releasing two new corpora. The first is Paraphrase Adversaries from Word Scrambling (PAWS), in English. The second, PAWS-X, extends PAWS to six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. Both datasets contain “well-formed” pairs of paraphrases and non-paraphrases, which Google says can improve algorithmic accuracy in capturing word order and structure from below 50% to between 85% and 89%.

The PAWS dataset contains 108,463 human-labeled pairs in English sourced from Quora Question Pairs (QQP) and Wikipedia pages. As for PAWS-X, it comprises 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs.

“[It] is hard for machine learning models to learn [certain patterns] even if they have the capability to understand complex contextual phrasings,” wrote Google research scientist Yuan Zhang and software engineer Yinfei Yang in a blog post. “The new datasets provide an effective instrument for measuring the sensitivity of models to word order and structure.”

PAWS introduces a workflow for producing sentence pairs that share most of their words. To create new examples, phrases pass through a model that creates variants that might or might not be paraphrases of the original. Human raters then judge each variant for grammaticality, after which a team determines whether the two sentences in a pair are paraphrases of each other. Because this process tends to yield pairs that aren’t paraphrases, further examples are added based on back-translation (a translation of a translated text back into the language of the original text), which helps to preserve meaning while introducing variability.
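
The candidate-generation step can be illustrated with a deliberately naive Python sketch. It is an assumption-laden stand-in, not Google's method: the article describes a model that proposes variants, whereas the hypothetical `swap_candidates` function below simply swaps every pair of words. Each candidate keeps exactly the same words as the source, and most come out ungrammatical, which hints at why human review is part of the workflow.

```python
from itertools import combinations


def swap_candidates(sentence):
    """Yield variants of `sentence` with one pair of distinct words swapped.

    A naive stand-in for the variant-generating model described in the
    article; it applies no grammaticality or meaning checks.
    """
    tokens = sentence.split()
    seen = set()
    for i, j in combinations(range(len(tokens)), 2):
        if tokens[i] == tokens[j]:
            continue  # swapping identical words changes nothing
        swapped = tokens.copy()
        swapped[i], swapped[j] = swapped[j], swapped[i]
        candidate = " ".join(swapped)
        if candidate not in seen:
            seen.add(candidate)
            yield candidate


source = "Flights from New York to Florida"
candidates = list(swap_candidates(source))

for candidate in candidates:
    print(candidate)
```

Among the output is “Flights to New York from Florida,” a fluent sentence that reverses the direction of travel. Deciding which candidates are well-formed, and which of those are true paraphrases, is left to people in the workflow described above.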

PAWS-X required hiring human translators to translate the development and test datasets. A machine learning model translated the training set, and humans performed tens of thousands of translations on random sample pairs for each of the aforementioned languages. A subset was validated by a second worker, resulting in a final corpus with a word-level error rate of less than 5%.

To evaluate the corpora’s impact on NLP accuracy, the researchers trained multiple models on them and measured the classification accuracy. Two models — BERT and DIIN — show “remarkable” improvement compared with a baseline, with BERT’s accuracy improving from 33.5% to 83.1%.
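
For reference, classification accuracy is the fraction of pairs a model labels correctly. The Python sketch below uses made-up labels purely for illustration (1 for paraphrase, 0 for non-paraphrase); they are not taken from PAWS, and the `accuracy` helper is a hypothetical name rather than the researchers' evaluation code.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Illustrative gold labels for five high-overlap pairs (invented data).
gold = [1, 0, 0, 1, 0]

# A model that calls every high-overlap pair a paraphrase.
always_paraphrase = [1] * len(gold)

acc = accuracy(always_paraphrase, gold)
print(f"accuracy: {acc:.1%}")
```

A model that treats high word overlap as proof of paraphrase is right only on the true paraphrases, so its accuracy falls as the share of non-paraphrases in the test set rises.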

“It is our hope that these datasets will be useful to the research community to drive further progress on multilingual models that better exploit structure, context, and pairwise comparisons,” wrote Zhang and Yang.