Researchers develop sentence rewriting technique to fool text classifiers

A recent paper coauthored by MIT researchers highlights the problem of sentence-level attacks against text classifiers, in which an attacker alters a sentence to trigger misclassification while keeping the sentence's literal meaning unchanged.

Text classifiers are used in a range of applications, particularly document processing. Such systems allow companies to structure, normalize, and standardize business information like email, legal documents, webpages, and chat conversations. Attacks on these classifiers could be disastrous in industries like home lending, which increasingly relies on AI to process the hundreds of pages associated with mortgages.

The researchers' framework, conditional BERT sampling (CBS), feeds sentences from an AI language model to RewritingSampler, an instance of CBS that rewrites the sentences specifically to attack classifiers. The researchers claim that, in experiments, CBS and RewritingSampler achieve a better attack success rate than existing word-level methods.

The researchers' CBS framework and RewritingSampler start with a seed sentence and iteratively sample and replace words in the sentence a given number of times. They use the sum of word embeddings (a type of word representation in which words with similar meanings have similar representations) to minimize the semantic differences between the original and rewritten sentences. OpenAI's GPT-2 language model checks the grammatical quality of the rewritten sentences, which allows for controlled and flexible rewriting.
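
To make the sample-and-replace loop concrete, here is a minimal Python sketch of the general idea. It is not the researchers' code. The three-dimensional word vectors, the `rewrite` function, and the similarity threshold are all invented for illustration, and the sketch proposes replacement words uniformly at random rather than sampling them from BERT. It also omits the GPT-2 grammar check entirely. It shows only how a summed-embedding similarity can be used to reject rewrites that drift too far from the seed sentence.

```python
import math
import random

# Toy 3-dimensional word vectors, invented for this sketch.
# A real system would load pretrained embeddings instead.
TOY_VECTORS = {
    "a": (0.0, 0.0, 0.1),
    "good": (0.9, 0.1, 0.0),
    "great": (0.8, 0.2, 0.0),
    "fine": (0.7, 0.3, 0.1),
    "bad": (-0.9, 0.1, 0.0),
    "movie": (0.0, 0.9, 0.1),
    "film": (0.1, 0.8, 0.2),
}


def sentence_vector(words):
    """Sum the word vectors of a sentence; unknown words contribute zero."""
    total = [0.0, 0.0, 0.0]
    for word in words:
        vec = TOY_VECTORS.get(word, (0.0, 0.0, 0.0))
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rewrite(seed, steps, min_similarity, rng):
    """Hypothetical helper: replace one random word per step, keeping a
    change only if the rewritten sentence stays close to the seed."""
    seed_vec = sentence_vector(seed)
    current = list(seed)
    vocab = sorted(TOY_VECTORS)
    for _ in range(steps):
        position = rng.randrange(len(current))
        candidate = list(current)
        # Assumption: uniform random proposal stands in for BERT sampling.
        candidate[position] = rng.choice(vocab)
        if cosine(seed_vec, sentence_vector(candidate)) >= min_similarity:
            current = candidate
    return current


seed_sentence = ["a", "good", "movie"]
rewritten = rewrite(seed_sentence, steps=20, min_similarity=0.9,
                    rng=random.Random(0))
print(" ".join(rewritten))
```

With these toy vectors, "a great film" scores close to the seed "a good movie," while "a bad movie" scores far below the 0.9 threshold and would be rejected. An actual attack would add a further condition: the rewrite is kept only if it also changes the target classifier's prediction.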

In experiments involving text classification datasets of news, movie reviews, Yelp reviews, and IMDB movie reviews, along with two natural language inference datasets, the researchers found that their approach "significantly" outperformed a baseline. For example, given the sentence "Turkey is put on track for EU membership," which the target classifier would classify as "World," the rewritten sentence "EU puts Turkey on track for full membership" yields the classification "Business." Theoretically, if the method were used against a real-world classification system, a document labeled "New York loan applications for October" could be mislabeled "Not urgent" as opposed to "Timely," delaying processing.

"Most adversarial attack methods that are designed to deceive a text classifier change the text classifier's prediction by modifying a few words or characters. Few try to attack classifiers by rewriting a whole sentence, due to the difficulties inherent in sentence-level rephrasing as well as the problem of setting the criteria for legitimate rewriting," the researchers wrote. "We solve the problems [with our framework]."

The work builds on TextFooler, a framework for synthesizing adversarial text examples designed by researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), the University of Hong Kong, and Singapore's Agency for Science, Technology, and Research. Like the coauthors of this latest work, TextFooler's creators note that while the system could be misused for attacks, it can also be used to test the robustness of models and improve their generalization via adversarial training.

"If [language models] are vulnerable to purposeful adversarial attacking, then the consequences may be disastrous," Di Jin, MIT Ph.D. student and lead author on the TextFooler research paper, said in a previous statement. "These tools need to have effective defense approaches to protect themselves, and in order to make such a safe defense system, we need to first examine the adversarial methods."
