IBM tests ways to improve natural language processing

It's a miracle most enterprises manage to find anything in their labyrinthine storage setups. Companies increasingly tap a mix of public and private clouds that don't always play nicely together, and the jargony nature of their employees' search queries makes those queries tougher to parse than, say, web searches.

Fortunately, IBM's recent work in natural language processing promises to address those and other search challenges in the corporate domain. In four papers scheduled to be presented at the 2019 Annual Meeting of the Association for Computational Linguistics in Florence, teams of researchers propose novel semantic parsing techniques and a method to integrate incomplete knowledge bases with text corpora, in addition to a tool that recruits subject matter experts to fine-tune interpretable rules-based systems.

The first study investigated abstract meaning representation, or AMR, a data structure intended to let sentences with similar meanings share the same representation. Thanks in part to reinforcement learning, an AI training technique that employs rewards to drive software policies toward goals, the paper's coauthors managed to raise the semantic accuracy of the target graph to 75.5%, up from the previous state of the art of 74.4%.
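To make the idea of a shared representation concrete, here is a minimal Python sketch that is not taken from the paper. It stores a meaning graph as a set of (source, relation, target) triples, so two differently worded sentences that are assumed to mean the same thing compare as equal. The graphs are written by hand rather than produced by a parser, the names `amr_graph` and `triple_f1` are hypothetical, and the overlap score assumes node variables are already aligned, which is simpler than the alignment-searching metrics typically used to evaluate AMR parsers.

```python
from typing import FrozenSet, Tuple

# One edge of a meaning graph: (source variable, relation, target).
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def amr_graph(*triples: Triple) -> Graph:
    """Build an order-independent graph from hand-written triples."""
    return frozenset(triples)


def triple_f1(predicted: Graph, gold: Graph) -> float:
    """Simplified overlap score: F1 over exactly matching triples.

    Assumes both graphs already use the same variable names.
    """
    matched = len(predicted & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hand-written graph for "The boy wants to go."
wants_to_go = amr_graph(
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "go-01"),
    ("w", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("g", "ARG0", "b"),
)

# Hand-written graph for the paraphrase "What the boy wants is to go."
# (assumed, for illustration, to map to the same triples).
paraphrase = amr_graph(
    ("g", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("w", "ARG0", "b"),
    ("g", "instance", "go-01"),
    ("b", "instance", "boy"),
    ("w", "instance", "want-01"),
)

# Hand-written graph for "The boy wants to sleep." (one concept differs).
wants_to_sleep = amr_graph(
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "sleep-01"),
    ("w", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("g", "ARG0", "b"),
)

same_meaning = wants_to_go == paraphrase
score = triple_f1(wants_to_sleep, wants_to_go)
print(same_meaning, round(score, 3))
```

Because the graph is a set, word order in the original sentence plays no role in the comparison; only the concepts and the relations between them do.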

Another team proposed a querying approach that unified semantic parsing across multiple knowledge bases and exploited the structural similarity of query programs to search those knowledge bases. Their work dovetailed with that of IBM scientists studying incomplete knowledge bases and how they might be fused with a corpus of text, an approach they asserted could better surface answers to questions not fully addressed in either the knowledge bases or individual documents.

In the last of the papers, researchers described HEIDL (short for Human-in-the-loop linguistic Expressions wIth Deep Learning), a tool that ranks machine-generated expressions by precision and recall. In one experiment, IBM attorneys annotated phrases related to key clauses like termination, communication, and payment in 20,000 sentences across nearly 150 contracts, and HEIDL analyzed those annotations to provide high-level insights. A team of data scientists used it to identify an average of seven rules that automatically labeled the contracts in around half an hour, a process that the coauthors claim would have taken a week or more if conducted manually.
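Ranking candidate rules by precision and recall can be illustrated with a short Python sketch. This is not HEIDL's code: the keyword rules, the five toy sentences, and the names `score_rule` and `rank_rules` are all hypothetical, and sorting by precision first with recall as a tiebreaker is an assumption made for the example rather than the ordering the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (sentence, True if an annotator marked it as a termination clause)
Example = Tuple[str, bool]
Rule = Callable[[str], bool]


@dataclass
class RuleScore:
    name: str
    precision: float
    recall: float


def score_rule(name: str, rule: Rule, examples: List[Example]) -> RuleScore:
    """Count hits and misses for one rule against annotated sentences."""
    tp = fp = fn = 0
    for sentence, is_positive in examples:
        fired = rule(sentence)
        if fired and is_positive:
            tp += 1
        elif fired:
            fp += 1
        elif is_positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return RuleScore(name, precision, recall)


def rank_rules(rules: Dict[str, Rule], examples: List[Example]) -> List[RuleScore]:
    """Order rules by precision, then recall (an assumed ordering)."""
    scores = [score_rule(name, rule, examples) for name, rule in rules.items()]
    return sorted(scores, key=lambda s: (s.precision, s.recall), reverse=True)


# Toy annotated data, invented for illustration.
examples: List[Example] = [
    ("Either party may terminate this agreement with notice.", True),
    ("This agreement terminates upon breach.", True),
    ("The term of this agreement is two years.", False),
    ("Payment is due within thirty days.", False),
    ("Notice of termination must be sent in writing.", True),
]

# Hypothetical candidate rules, standing in for machine-generated expressions.
rules: Dict[str, Rule] = {
    "contains_term": lambda s: "term" in s.lower(),
    "contains_notice": lambda s: "notice" in s.lower(),
    "contains_terminat": lambda s: "terminat" in s.lower(),
}

ranked = rank_rules(rules, examples)
for entry in ranked:
    print(f"{entry.name}: precision={entry.precision:.2f} recall={entry.recall:.2f}")
```

A ranked list like this is what lets a human expert review a handful of high-scoring, readable rules instead of inspecting every candidate.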

"Enterprise [natural language processing] systems are often challenged by a number of factors, which include making sense of heterogeneous silos of information, dealing with incomplete data, training accurate models from small amounts of data and navigating a changing environment in which new content, products, terms and other information is continuously being added," wrote IBM Research senior manager Salim Roukos in a blog post. "[We're] exploring along ... different themes to tackle these challenges and improve [natural language processing] for enterprise domains."
