Behavioral Testing of NLP Models
Beyond Accuracy: Behavioral Testing of NLP models with CheckList.
testing unit-tests checklist natural-language-processing library paper research code

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
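To make the idea concrete, here is a minimal sketch of the two pieces the abstract describes: template-based generation of many test cases, and a behavioral test that checks a model's predictions against an expected label. The helper names (`expand_template`, `mft`, `toy_predict`) are hypothetical illustrations, not the actual CheckList library API.

```python
# Sketch of CheckList-style behavioral testing. All names here are
# illustrative; the real `checklist` package has its own Editor/test classes.
from itertools import product

def expand_template(template, **lexicons):
    """Fill every {slot} in the template with each combination of lexicon entries."""
    keys = list(lexicons)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(lexicons[k] for k in keys))
    ]

def mft(cases, predict, expected):
    """Minimum Functionality Test: every case must receive the expected label."""
    failures = [c for c in cases if predict(c) != expected]
    return {"total": len(cases), "failed": len(failures), "failures": failures}

# Toy sentiment "model" that mishandles negation, as many real systems do.
def toy_predict(text):
    return "neg" if any(w in text for w in ("bad", "terrible")) else "pos"

cases = expand_template(
    "The {thing} was not {neg_adj}.",
    thing=["food", "service"],
    neg_adj=["bad", "terrible"],
)
report = mft(cases, toy_predict, expected="pos")
print(f"{report['failed']}/{report['total']} cases failed")
```

Even this toy test surfaces a systematic failure (negation handling) that a single held-out accuracy number would hide, which is the core argument of the paper.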


Similar projects
Machine Learning: Tests and Production
Best practices for testing ML-based systems.
Structuring Unit Tests in Python
Where to put tests, how to write fixtures and the awesomeness of test parametrization.
How to Test Machine Learning Code and Systems
🚦 Minimal examples of testing machine learning for correct implementation, expected learned behaviour, and model performance.
How to Set Up a Python Project For Automation and Collaboration
How to set up a Python repo with unit tests, code coverage, lint checking, type checking, Makefile wrapper, and automated build with GitHub Actions.