New research from Salesforce shows that there’s hope on the horizon for people who want to understand information stored in databases without knowing the programming language typically used to query those systems. In a research paper released today, the company’s research arm laid out a new system called Seq2SQL that is designed to translate natural language questions into database queries written in Structured Query Language, better known as SQL.
Using Seq2SQL, it’s possible for someone to ask “how many users do I have in Oregon?” and have the system return results from an analytics database with the answer. While that may seem like an easy question, it can be deceptively difficult to train a machine learning system to perform such a task accurately.
That’s because more than one SQL query can be a “correct” answer to the same question, and a machine learning system can unnecessarily penalize correct queries that don’t match the ground truth data it has been fed as part of the training process.
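To make that concrete, here is a minimal sketch in Python using the standard sqlite3 module. The `users` table, its columns, and the added "active" condition are hypothetical, invented for illustration and not taken from the Salesforce paper. It shows two queries that differ as text but return the same answer, which is why grading a system on an exact text match against a single reference query can be too strict.

```python
import sqlite3

# Hypothetical table and rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO users (state, active) VALUES (?, ?)",
    [("Oregon", 1), ("Oregon", 0), ("Oregon", 1), ("Idaho", 1)],
)

# Two ways to write "how many active users do I have in Oregon?"
# The only difference is the order of the WHERE conditions.
query_a = "SELECT COUNT(*) FROM users WHERE state = 'Oregon' AND active = 1"
query_b = "SELECT COUNT(*) FROM users WHERE active = 1 AND state = 'Oregon'"

result_a = conn.execute(query_a).fetchall()
result_b = conn.execute(query_b).fetchall()

# Comparing the query text treats query_b as wrong if query_a is the
# reference; comparing the executed results treats both as correct.
strings_match = query_a == query_b
results_match = result_a == result_b

print("same text:   ", strings_match)
print("same results:", results_match)
```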
Systems like Seq2SQL could have a major impact on companies like Salesforce and other software providers. In the past, nontechnical employees would ask SQL experts to run queries for them, especially if they had complicated questions about company data. At least in theory, we should enter a future where it’s easier for people who aren’t SQL experts to learn what they need to know without asking someone else.
Seq2SQL is special because the Salesforce team found a way to apply reinforcement learning to the problem. That technique, which is used in other machine learning applications, evaluates whether a system’s output is correct, and uses that signal to help it do better in the future.
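The idea of scoring a system’s output by whether it is correct can be sketched as a reward function that runs a generated query and compares its result with the expected answer. This is an illustrative sketch only, not Salesforce’s implementation: the function name `execution_reward`, the `users` table, and the specific reward values are assumptions made for the example.

```python
import sqlite3


def execution_reward(conn, predicted_sql, expected_rows):
    """Score a generated query by executing it (illustrative values only).

    Returns -2.0 if the query cannot be executed, -1.0 if it runs but
    gives the wrong result, and 1.0 if the result matches the expected rows.
    """
    try:
        rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return -2.0
    return 1.0 if rows == expected_rows else -1.0


# Hypothetical table and rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO users (state) VALUES (?)",
    [("Oregon",), ("Oregon",), ("Idaho",)],
)

# Expected answer to "how many users do I have in Oregon?"
expected = [(2,)]

reward_correct = execution_reward(
    conn, "SELECT COUNT(*) FROM users WHERE state = 'Oregon'", expected
)
reward_wrong = execution_reward(
    conn, "SELECT COUNT(*) FROM users WHERE state = 'Idaho'", expected
)
reward_invalid = execution_reward(
    conn, "SELECT COUNT( FROM users", expected
)

print(reward_correct, reward_wrong, reward_invalid)
```

In a reinforcement learning setup, a score like this would be fed back to the model during training so that queries producing the right answer are encouraged, regardless of how they are phrased.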
In addition to the system that Salesforce laid out, the company released a new data set called WikiSQL to the public, giving data scientists material for training systems to connect natural language queries with the contents of databases.
As the name implies, the data set was built by extracting tables from Wikipedia and bringing them into a database engine. Salesforce then worked with people through Amazon Mechanical Turk to label the natural language queries for training a machine learning system.
That data set should help other machine learning teams create systems similar to Seq2SQL. Other large public data sets, like ImageNet, which is used to develop image recognition algorithms, have been associated with major AI advances.