Data scientists typically rely on automatic metrics to evaluate machine learning models for machine translation, text summarization, and image captioning. The problem is that those metrics don’t always align with human assessments. That’s why researchers at Stanford and Facebook AI Research propose VizSeq, which they describe as a visual analysis toolkit for instance-level and corpus-level evaluation across a range of text generation tasks.

The toolkit is available as open source on GitHub.

“Automatic evaluation [of machine translation is usually] limited in illustrating system error patterns … This suggests the necessity of inspecting detailed evaluation examples to get a full picture of system behaviors, as well as seek improvement directions,” the researchers wrote in a preprint paper describing VizSeq. “We want to provide a unified and scalable solution that gets rid of all those constraints and is enhanced with a user-friendly interface, as well as the latest [natural language processing] technologies.”

To this end, VizSeq can ingest multiple data sources, including text, images, audio, and video, while providing visualizations for exploration in Jupyter notebooks and a web app interface. On the metrics side, its suite includes BLEU, NIST, METEOR, TER, RIBES, chrF, and GLEU for evaluating machine translation; ROUGE for summarization and video description; CIDEr for image captioning; and word error rate for speech recognition tasks. Additionally, VizSeq implements embedding-based metrics such as BERTScore and LASER on top of Facebook’s PyTorch, which are designed to capture semantic similarity between model outputs and reference texts.
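As a rough illustration of how such scores might be computed, here is a minimal notebook-style sketch modeled on the usage shown in the project’s README; the file paths are placeholders, and the function names, metric identifiers, and arguments should be treated as assumptions that may vary between versions.

```python
# Minimal sketch of corpus- and instance-level scoring in a Jupyter
# notebook, modeled on the project's README; function names, metric
# identifiers, and arguments are assumptions and may vary by version.
import vizseq

# Parallel plain-text files, one sentence per line (placeholder paths).
src = ['data/src_0.txt']         # source sentences
ref = ['data/ref_0.txt']         # reference translations
hypo = ['data/pred_model.txt']   # system outputs to evaluate

# Corpus-level data set statistics and metric scores.
vizseq.view_stats(src, ref)
vizseq.view_scores(ref, hypo, ['bleu', 'meteor'])

# Instance-level view: per-sentence scores rendered in the notebook.
vizseq.view_examples(src, ref, hypo, ['bleu', 'meteor'], page_sz=10, page_no=1)
```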

VizSeq, which can be deployed locally or on public servers for centralized data hosting and benchmarking, organizes data with a special folder structure. When new samples arrive, it automatically precomputes scores and caches them to storage. Meanwhile, a file monitoring and versioning system detects changes and triggers the necessary updates, which supports evaluation during AI model training.
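For illustration, a data root organized along those lines might look something like the sketch below; the task, data set, and file names here are hypothetical, and the actual naming conventions are the ones defined in the project’s documentation.

```
<data_root>/
  translation_task/
    wmt14_en_de_test/
      src_en.txt        # source sentences
      ref_human.txt     # reference translations
      pred_model_a.txt  # outputs of one system
      pred_model_b.txt  # outputs of another system
      tag_length.txt    # optional sentence tags for grouping
```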

VizSeq’s web app interface features a data uploading module and a task and data set browsing module, while the Jupyter notebook interface takes data directly from Python variables. The analysis module supports grouping examples with sentence tags (e.g., labels for identified languages or long sentences), which can be either user-defined or machine-generated.
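As a rough sketch of that notebook path, the example below passes Python lists directly and attaches user-defined sentence tags for group-wise score breakdowns; the accepted data formats and the tags argument are assumptions rather than a confirmed API.

```python
# Sketch of the Jupyter interface fed directly from Python variables,
# with user-defined sentence tags for group-wise analysis. The data
# formats and the tags argument are assumptions, not a confirmed API.
import vizseq

src = ['Bonjour le monde .', 'Merci beaucoup .']      # source sentences
ref = ['Hello world .', 'Thank you very much .']      # reference translations
hypo = ['Hello world .', 'Thanks a lot .']            # model outputs

# One tag per sentence, e.g. machine-generated length buckets.
tags = ['short', 'short']

# Scores broken down by tag group alongside the overall corpus score.
vizseq.view_scores(ref, hypo, ['bleu'], tags=tags)
```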

The built-in viewer presents examples with sentence-level scores, which VizSeq can sort by metric scores, source sentence lengths, and other criteria. Charts can be exported with a single click as PNG or SVG images, and tables as comma-separated value files.

VizSeq is fairly full-featured for a first release, but the researchers say work is active and ongoing. They leave to future work enabling image-to-text and video-to-text alignments, adding human assessment modules, and integrating VizSeq with popular text generation frameworks, including fairseq, OpenNMT, and tensor2tensor.

The release of VizSeq follows Facebook’s open-sourcing of image processing library Spectrum in January, natural language processing modeling framework PyText late last year, and AI reinforcement learning platform Horizon in November. More recently, the company made available Pythia, a modular plug-and-play framework that enables data scientists to quickly build, reproduce, and benchmark AI models, and a machine learning experimentation tool dubbed Ax.