textflint/textflint
Text Robustness Evaluation Platform
repo name | textflint/textflint |
repo link | https://github.com/textflint/textflint |
homepage | |
language | Python |
size (curr.) | 6702 kB |
stars (curr.) | 343 |
created | 2021-03-06 |
license | GNU General Public License v3.0 |
About
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-population, and their combinations to provide a comprehensive robustness analysis.
Features:
There are lots of reasons to use TextFlint:
- Full coverage of transformation types, including 20 general transformations, 8 subpopulations and 60 task-specific transformations, as well as thousands of their combinations, covering a broad range of text transformations to comprehensively evaluate the robustness of your model. TextFlint also supports adversarial attacks to generate model-specific transformed data.
- Generate targeted augmented data, which you can use to train or fine-tune your model to improve its robustness.
- Provide a complete analytical report automatically that pinpoints where your model falls short, such as problems with lexical rules or syntactic rules.
Setup
Installation
Requires Python version >= 3.7. Installing with pip is recommended:
pip install textflint
Usage
Workflow
The general workflow of TextFlint is displayed above. Evaluation of target models can be divided into three steps:
- For input preparation, the original dataset for testing, which is to be loaded by Dataset, should first be formatted as a series of JSON objects. The textflint configuration is specified by Config. The target model is also loaded as FlintModel.
- In adversarial sample generation, multi-perspective transformations (i.e., Transformation, Subpopulation and AttackRecipe) are performed on Dataset to generate transformed samples. In addition, to ensure the semantic and grammatical correctness of transformed samples, Validator calculates the confidence of each sample to filter out unacceptable samples.
- Lastly, Analyzer collects evaluation results and ReportGenerator automatically generates a comprehensive report of model robustness.
Quick Start
The following code snippet shows how to generate transformed data on the Sentiment Analysis task.
from textflint import Engine
# load the data samples
sample1 = {'x': 'Titanic is my favorite movie.', 'y': 'pos'}
sample2 = {'x': 'I don\'t like the actor Tim Hill', 'y': 'neg'}
data_samples = [sample1, sample2]
# define the output directory
out_dir_path = './test_result/'
# run transformation/subpopulation/attack and save the transformed data to out_dir_path in json format
engine = Engine('SA')
engine.run(data_samples, out_dir_path)
You can also feed data to Engine in other ways (e.g., a json or csv file) where each line represents one sample. We have defined some transformations and subpopulations in SA.json, and you can also pass your own configuration file as needed.
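As a minimal sketch of the one-sample-per-line layout, the snippet below writes the two Quick Start samples to a file with one JSON object per line, using only the standard library. The helper names `write_json_lines` and `read_json_lines` and the file name `sa_samples.json` are assumptions made for illustration; they are not part of textflint.

```python
import json
import os
import tempfile


def write_json_lines(samples, path):
    """Write one JSON object per line (hypothetical helper, not a textflint API)."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


samples = [
    {"x": "Titanic is my favorite movie.", "y": "pos"},
    {"x": "I don't like the actor Tim Hill", "y": "neg"},
]

# Illustrative location; any writable path works.
data_path = os.path.join(tempfile.mkdtemp(), "sa_samples.json")
write_json_lines(samples, data_path)
restored = read_json_lines(data_path)
```

Reading the file back is only a sanity check that each line is a complete sample with the same `x` and `y` fields as the in-memory version.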
Transformed Datasets
After transformation, the contents of ./test_result/ are:
ori_Keyboard_2.json
ori_SwapNamedEnt_1.json
trans_Keyboard_2.json
trans_SwapNamedEnt_1.json
...
where trans_Keyboard_2.json contains the 2 samples successfully transformed by the Keyboard transformation and ori_Keyboard_2.json contains the corresponding original samples. The content of ori_Keyboard_2.json:
{"x": "Titanic is my favorite movie.", "y": "pos", "sample_id": 0}
{"x": "I don't like the actor Tim Hill", "y": "neg", "sample_id": 1}
The content of trans_Keyboard_2.json:
{"x": "Titanic is my favorite m0vie.", "y": "pos", "sample_id": 0}
{"x": "I don't likR the actor Tim Hill", "y": "neg", "sample_id": 1}
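Because the original and transformed records share a sample_id, the two files can be joined to see exactly what a transformation changed. The sketch below does this for the records shown above, with the file contents inlined as strings so that it runs on its own. The helpers `index_by_id` and `changed_tokens` are hypothetical names, and the token-by-token comparison assumes the transformation keeps the number of whitespace-separated tokens, as it does in this Keyboard example.

```python
import json

ORI = """\
{"x": "Titanic is my favorite movie.", "y": "pos", "sample_id": 0}
{"x": "I don't like the actor Tim Hill", "y": "neg", "sample_id": 1}
"""

TRANS = """\
{"x": "Titanic is my favorite m0vie.", "y": "pos", "sample_id": 0}
{"x": "I don't likR the actor Tim Hill", "y": "neg", "sample_id": 1}
"""


def index_by_id(text):
    """Parse one JSON object per line and key the records by sample_id."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {rec["sample_id"]: rec for rec in records}


def changed_tokens(original, transformed):
    """Return (before, after) pairs for whitespace tokens that differ."""
    pairs = zip(original.split(), transformed.split())
    return [(a, b) for a, b in pairs if a != b]


ori = index_by_id(ORI)
trans = index_by_id(TRANS)
diffs = {sid: changed_tokens(ori[sid]["x"], trans[sid]["x"]) for sid in trans}
```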
Design
Architecture
Input layer: receives textual datasets and models as input, represented as Dataset and FlintModel respectively.
- Dataset: a container for Sample that provides efficient and convenient operation interfaces for Sample. Dataset supports loading, verification, and saving data in JSON or CSV format for various NLP tasks.
- FlintModel: a target model used in an adversarial attack.
Generation layer: there are four main parts in the generation layer:
- Subpopulation: generates a subset of a Dataset.
- Transformation: transforms each sample of a Dataset if it can be transformed.
- AttackRecipe: attacks the FlintModel and generates a Dataset of adversarial examples.
- Validator: verifies the quality of samples generated by Transformation and AttackRecipe.
Report layer: analyzes model testing results and provides a robustness report for users.
Transformation
To verify robustness comprehensively, TextFlint offers 20 universal transformations and 60 task-specific transformations, covering 12 NLP tasks. The following table summarizes the Transformation types currently supported; examples for each transformation can be found on our website.
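To make the idea of a transformation concrete, here is a toy keyboard-typo perturbation written with the standard library only. It is a sketch in the spirit of the Keyboard output shown in Quick Start, not textflint's implementation: the neighbor map `NEIGHBORS` and the function `keyboard_typo` are assumptions made for illustration.

```python
import random

# Tiny, hand-picked subset of QWERTY neighbors (illustrative, far from complete).
NEIGHBORS = {
    "a": "qs", "e": "wr", "i": "uo", "o": "ip0", "s": "ad", "t": "ry",
}


def keyboard_typo(text, rng):
    """Replace one character that has a known neighbor; return None if impossible.

    Returning None mirrors the idea that a sample is only transformed
    if it can be transformed.
    """
    positions = [i for i, ch in enumerate(text) if ch.lower() in NEIGHBORS]
    if not positions:
        return None
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBORS[text[i].lower()])
    return text[:i] + replacement + text[i + 1:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = {"x": "Titanic is my favorite movie.", "y": "pos"}
new_text = keyboard_typo(sample["x"], rng)
# The label is copied unchanged: a typo should not alter the sentiment.
transformed = dict(sample, x=new_text) if new_text is not None else None
```

The label is carried over untouched, which is the assumption behind evaluating a model on such data: a small surface change should not change the correct answer.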
Subpopulation
Subpopulation identifies the specific part of a dataset on which the target model performs poorly. To retrieve a subset that meets the configuration, Subpopulation divides the dataset by sorting samples on certain attributes. We also support the following Subpopulation:
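The sort-then-slice idea described in this section can be sketched in a few lines. This is an illustration rather than one of the supported Subpopulation types: the function `subpopulation`, its `start` and `end` fraction parameters, and the `text_length` attribute are all hypothetical names chosen for this example.

```python
def subpopulation(samples, key, start=0.0, end=0.5):
    """Sort samples by `key` and keep the slice between two fractions.

    `start` and `end` are fractions of the sorted dataset (hypothetical
    parameters for this sketch, not textflint configuration fields).
    """
    if not 0.0 <= start <= end <= 1.0:
        raise ValueError("expected 0 <= start <= end <= 1")
    ranked = sorted(samples, key=key)
    lo = int(len(ranked) * start)
    hi = int(len(ranked) * end)
    return ranked[lo:hi]


def text_length(sample):
    """Attribute used for sorting: number of whitespace-separated tokens."""
    return len(sample["x"].split())


dataset = [
    {"x": "Great.", "y": "pos"},
    {"x": "I don't like the actor Tim Hill", "y": "neg"},
    {"x": "Titanic is my favorite movie.", "y": "pos"},
    {"x": "Not bad at all", "y": "pos"},
]

# Keep the longer half of the dataset.
longest_half = subpopulation(dataset, text_length, start=0.5, end=1.0)
shortest_half = subpopulation(dataset, text_length)
```

Evaluating the model separately on `longest_half` and `shortest_half` would show whether its accuracy depends on input length.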
AttackRecipe
AttackRecipe aims to find a perturbation of an input text that satisfies the attack’s goal of fooling the given FlintModel. In contrast to Transformation, AttackRecipe requires the prediction scores of the target model. textflint provides an interface to integrate the easy-to-use adversarial attack recipes implemented in textattack. Users can refer to textattack for more information about the supported AttackRecipe.
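The reliance on prediction scores is what separates an attack from a plain transformation: the scores tell the attack which perturbation hurts the model most. The following toy greedy attack illustrates that loop. Everything in it is an assumption for illustration: the lexicon-based `predict_scores` stands in for a real target model, and `swap_first_two` and `greedy_attack` are hypothetical names; none of this is a textattack recipe or the FlintModel interface.

```python
POSITIVE = {"favorite", "great", "love"}
NEGATIVE = {"hate", "boring", "awful"}


def predict_scores(text):
    """Toy stand-in for a target model: returns {'pos': p, 'neg': 1 - p}."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    p = (pos + 1) / (pos + neg + 2)  # smoothed share of positive words
    return {"pos": p, "neg": 1 - p}


def predict_label(text):
    return "pos" if predict_scores(text)["pos"] >= 0.5 else "neg"


def swap_first_two(word):
    """Character-level perturbation: swap the first two characters."""
    return word[1] + word[0] + word[2:] if len(word) >= 2 else word


def greedy_attack(text, true_label, max_steps=3):
    """Perturb one word per step, always picking the candidate that lowers
    the true label's score the most; return None if the attack fails."""
    current = text
    for _ in range(max_steps):
        if predict_label(current) != true_label:
            return current
        words = current.split()
        candidates = [
            " ".join(words[:i] + [swap_first_two(w)] + words[i + 1:])
            for i, w in enumerate(words)
        ]
        best = min(candidates, key=lambda c: predict_scores(c)[true_label])
        if predict_scores(best)[true_label] >= predict_scores(current)[true_label]:
            return None  # no candidate lowers the score; give up
        current = best
    return current if predict_label(current) != true_label else None


adversarial = greedy_attack("I love this great movie but the ending is boring.", "pos")
failed = greedy_attack("Titanic is my favorite movie.", "pos")
```

In the first call the attack needs two perturbations before the toy model's label flips; in the second, no perturbation pushes the positive score below 0.5, so the attack reports failure instead of returning an unsuccessful sample.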
Validator
It is crucial to verify the quality of samples generated by Transformation and AttackRecipe. TextFlint provides several metrics to calculate confidence:
Report
In the Generation Layer, TextFlint can generate three types of adversarial samples and verify the robustness of the target model. Based on the results from the Generation Layer, the Report Layer aims to provide users with a standard analysis report at the lexical, syntactic, and semantic levels. For example, on the Sentiment Analysis (SA) task, this is a statistical chart of the performance of XLNET with different types of Transformation/Subpopulation/AttackRecipe on the IMDB dataset. The model's performance on every transformed dataset is lower than on the original data.
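The comparison behind such a chart boils down to computing accuracy on the original and the transformed samples and reporting the gap. The sketch below shows that calculation under stated assumptions: the record layout (`y` for the gold label, `pred` for the model prediction), the function names, and the predictions themselves are invented for illustration and are not measured results or TextFlint's report format.

```python
def accuracy(records):
    """Fraction of records whose prediction equals the gold label."""
    if not records:
        raise ValueError("cannot compute accuracy of an empty result set")
    return sum(r["pred"] == r["y"] for r in records) / len(records)


def robustness_report(results):
    """Map each transformation name to original/transformed accuracy and the drop.

    `results` maps a name to {"ori": [...], "trans": [...]}, where each record
    holds a gold label "y" and a model prediction "pred" (made-up layout).
    """
    report = {}
    for name, group in results.items():
        ori_acc = accuracy(group["ori"])
        trans_acc = accuracy(group["trans"])
        report[name] = {"ori": ori_acc, "trans": trans_acc, "drop": ori_acc - trans_acc}
    return report


# Invented predictions for illustration only; not measured results.
results = {
    "Keyboard": {
        "ori": [{"y": "pos", "pred": "pos"}, {"y": "neg", "pred": "neg"}],
        "trans": [{"y": "pos", "pred": "pos"}, {"y": "neg", "pred": "pos"}],
    },
}
report = robustness_report(results)
```

A positive `drop` means the model did worse on the transformed samples than on their originals, which is the pattern described for every transformed dataset above.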
Citation
If you are using TextFlint for your work, please cite:
@article{gui2021textflint,
title={TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing},
author={Gui, Tao and Wang, Xiao and Zhang, Qi and Liu, Qin and Zou, Yicheng and Zhou, Xin and Zheng, Rui and Zhang, Chong and Wu, Qinzhuo and Ye, Jiacheng and others},
journal={arXiv preprint arXiv:2103.11441},
year={2021}
}