Google today open-sourced a machine learning model that can point to answers to natural language questions (for example, “Which wrestler had the most number of reigns?”) in spreadsheets and databases. The model’s creators claim it’s even capable of finding answers spread across cells or that might require aggregating multiple cells.
Much of the world’s information is stored in the form of tables, Google Research’s Thomas Müller points out in a blog post, like global financial statistics and sports results. But these tables often lack an intuitive way to sift through them — a problem Google’s AI model aims to fix.
To answer questions like “Average time as champion for top 2 wrestlers?” the model jointly encodes the question, as well as the table content row by row. It leverages a Transformer-based BERT architecture — one that’s both bidirectional (allowing it to access content from past and future directions) and unsupervised (meaning it can ingest data that’s neither classified nor labeled) — extended along with numerical representations called embeddings to encode the table structure.
A key addition was the embeddings used to encode the structured input, according to Müller. Learned embeddings for the column index, the row index, and one special rank index indicate to the model the order of elements in numerical columns.
For each table cell, the model generates a score indicating the probability that the cell will be part of the answer. In addition, it outputs an operation (e.g., “AVERAGE,” “SUM,” or “COUNT”) indicating which operation (if any) must be applied to produce the final answer.
To pretrain the model, the researchers extracted 6.2 million table-text pairs from English Wikipedia, which served as a training data set. During pretraining, the model learned — with relatively high accuracy — to restore words in both tables and text that had been removed. In fact, 71.4% of items were restored correctly for tables unseen during training.
After pretraining, Müller and his team fine-tuned the model via weak supervision, using limited sources to provide signals for labeling the training data. They report that the best model outperformed the state-of-the-art for the Sequential Answering Dataset, a Microsoft-created benchmark for exploring the task of answering questions on tables, by 12 points. It also bested the previous top model on Stanford’s WikiTableQuestions, which contains questions and tables sourced from Wikipedia.
“The weak supervision scenario is beneficial because it allows for non-experts to provide the data needed to train the model and takes less time than strong supervision,” said Müller.