textflint/textflint

Text Robustness Evaluation Platform


repo name	textflint/textflint
repo link	https://github.com/textflint/textflint
homepage
language	Python
size (curr.)	6702 kB
stars (curr.)	343
created	2021-03-06
license	GNU General Public License v3.0

About

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-population, and their combinations to provide a comprehensive robustness analysis.

Features:

There are lots of reasons to use TextFlint:

Full coverage of transformation types, including 20 general transformations, 8 subpopulations and 60 task-specific transformations, as well as thousands of their combinations, which cover almost all aspects of text transformation so that you can comprehensively evaluate the robustness of your model. TextFlint also supports adversarial attacks to generate model-specific transformed data.
Generation of targeted augmented data, which you can use to train or fine-tune your model to improve its robustness.
Automatic generation of a complete analytical report that pinpoints where your model falls short, such as problems with lexical rules or syntactic rules.

Setup

Installation

Requires Python version >= 3.7. Installation with pip is recommended:

pip install textflint

Usage

Workflow

The general workflow of TextFlint is displayed above. Evaluation of target models can be divided into three steps:

For input preparation, the original dataset for testing, which is loaded by Dataset, should first be formatted as a series of JSON objects. The TextFlint configuration is specified by Config, and the target model is loaded as FlintModel.
In adversarial sample generation, multi-perspective transformations (i.e., Transformation, Subpopulation and AttackRecipe) are performed on Dataset to generate transformed samples. In addition, to ensure the semantic and grammatical correctness of the transformed samples, Validator calculates a confidence score for each sample and filters out unacceptable ones.
Lastly, Analyzer collects evaluation results and ReportGenerator automatically generates a comprehensive report of model robustness.

Quick Start

The following code snippet shows how to generate transformed data on the Sentiment Analysis task.

from textflint import Engine

# load the data samples
sample1 = {'x': 'Titanic is my favorite movie.', 'y': 'pos'}
sample2 = {'x': 'I don\'t like the actor Tim Hill', 'y': 'neg'}
data_samples = [sample1, sample2]

# define the output directory
out_dir_path = './test_result/'

# run transformation/subpopulation/attack and save the transformed data to out_dir_path in json format
engine = Engine('SA')
engine.run(data_samples, out_dir_path)

You can also feed data to Engine in other ways (e.g., a JSON or CSV file) where each line represents one sample. We have defined some transformations and subpopulations in SA.json, and you can also pass your own configuration file as needed.
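As a minimal sketch of the one-sample-per-line JSON layout, the snippet below writes the Quick Start samples to a file using only the standard library. The helper name write_json_lines and the file name sa_samples.json are hypothetical, and the commented-out call at the end assumes that Engine accepts a file path in place of the in-memory list.

```python
import json
import os
import tempfile

# Same two samples as in the Quick Start snippet.
data_samples = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
]


def write_json_lines(samples, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + '\n')


# Hypothetical location; any writable path works.
data_path = os.path.join(tempfile.mkdtemp(), 'sa_samples.json')
write_json_lines(data_samples, data_path)

# Assumption: the path can then replace the in-memory list, e.g.
# engine.run(data_path, out_dir_path)
```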

Transformed Datasets

After transformation, here are the contents in ./test_result/:

ori_Keyboard_2.json
ori_SwapNamedEnt_1.json
trans_Keyboard_2.json
trans_SwapNamedEnt_1.json
...

where trans_Keyboard_2.json contains the 2 samples successfully transformed by the Keyboard transformation and ori_Keyboard_2.json contains the corresponding original samples. The content of ori_Keyboard_2.json:

{"x": "Titanic is my favorite movie.", "y": "pos", "sample_id": 0}
{"x": "I don't like the actor Tim Hill", "y": "neg", "sample_id": 1}

The content in trans_Keyboard_2.json:

{"x": "Titanic is my favorite m0vie.", "y": "pos", "sample_id": 0}
{"x": "I don't likR the actor Tim Hill", "y": "neg", "sample_id": 1}

Design

Architecture

Input layer: receives textual datasets and models as input, represented as Dataset and FlintModel respectively.

Dataset: a container for Sample that provides efficient and convenient operation interfaces for Sample. Dataset supports loading, verifying, and saving data in JSON or CSV format for various NLP tasks.
FlintModel: a target model used in an adversarial attack.

Generation layer: the generation layer consists of four main parts:

Subpopulation: generates a subset of a Dataset.
Transformation: transforms each sample of Dataset if it can be transformed.
AttackRecipe: attacks the FlintModel and generates a Dataset of adversarial examples.
Validator: verifies the quality of samples generated by Transformation and AttackRecipe.

Report layer: analyzes model testing results and provides a robustness report for users.

Transformation

To verify robustness comprehensively, TextFlint offers 20 universal transformations and 60 task-specific transformations, covering 12 NLP tasks. The following table summarizes the Transformations currently supported; examples for each transformation can be found on our website.
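To give a feel for what a single transformation does, here is a toy keyboard-typo transformation in plain Python. It is an illustrative sketch only, not TextFlint's Keyboard implementation: the neighbor map is a small hand-written subset of a QWERTY layout, and the function name keyboard_typo is hypothetical. Returning None mirrors the idea that a sample is skipped when it cannot be transformed.

```python
import random

# Partial QWERTY neighbor map, for illustration only.
NEIGHBORS = {
    'a': 'qwsz', 'e': 'wrsd', 'i': 'ujko', 'o': 'iklp0',
    'n': 'bhjm', 't': 'rfgy', 's': 'awedxz', 'r': 'edft',
}


def keyboard_typo(text, rng):
    """Replace one character with a keyboard neighbor.

    Returns None when no character in the text has known neighbors,
    i.e. the sample cannot be transformed.
    """
    positions = [i for i, ch in enumerate(text) if ch.lower() in NEIGHBORS]
    if not positions:
        return None
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBORS[text[i].lower()])
    return text[:i] + replacement + text[i + 1:]


rng = random.Random(0)  # fixed seed so repeated runs give the same typo
original = 'Titanic is my favorite movie.'
transformed = keyboard_typo(original, rng)
print(transformed)
```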

Subpopulation

Subpopulation identifies the specific part of a dataset on which the target model performs poorly. To retrieve a subset that meets the configuration, Subpopulation divides the dataset by sorting samples according to certain attributes. We also support the following Subpopulation:
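The sort-then-slice idea can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than one of TextFlint's Subpopulation classes: the function subpopulation, its lower and upper rank fractions, and the text_length attribute are all hypothetical names chosen for the example.

```python
def subpopulation(samples, score_fn, lower=0.0, upper=0.5):
    """Sort samples by score_fn and keep the slice between two rank fractions."""
    ranked = sorted(samples, key=score_fn)
    start = int(len(ranked) * lower)
    end = int(len(ranked) * upper)
    return ranked[start:end]


def text_length(sample):
    """Attribute used for sorting: number of whitespace-separated tokens."""
    return len(sample['x'].split())


samples = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
    {'x': 'Great.', 'y': 'pos'},
    {'x': 'The plot was slow and the ending made no sense at all.', 'y': 'neg'},
]

shortest_half = subpopulation(samples, text_length, 0.0, 0.5)
longest_half = subpopulation(samples, text_length, 0.5, 1.0)
print([s['x'] for s in shortest_half])
```

Evaluating the model separately on shortest_half and longest_half would then show whether performance depends on input length.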

AttackRecipe

AttackRecipe aims to find a perturbation of an input text that satisfies the attack’s goal of fooling the given FlintModel. In contrast to Transformation, AttackRecipe requires the prediction scores of the target model. TextFlint provides an interface for integrating the easy-to-use adversarial attack recipes implemented in textattack. Users can refer to textattack for more information about the supported AttackRecipe.
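The reason an attack needs prediction scores can be seen in a toy greedy word-substitution attack. Everything below is a stand-in written for illustration: toy_score plays the role of the target model, SUBSTITUTES is a hand-written candidate list, and greedy_attack is not a textattack or TextFlint recipe. For simplicity, trailing punctuation on a replaced word is dropped.

```python
# Toy lexicon-based "model": a positive score means the label is 'pos'.
POSITIVE = {'favorite': 2.0, 'great': 1.5, 'good': 1.0}
NEGATIVE = {'boring': 1.5, 'bad': 1.0}
# Hypothetical candidate substitutions for each word.
SUBSTITUTES = {'favorite': ['preferred', 'fav'], 'great': ['grand']}


def toy_score(text):
    """Stand-in for the target model's positive-class score."""
    score = 0.0
    for word in text.lower().split():
        word = word.strip('.,!?')
        score += POSITIVE.get(word, 0.0) - NEGATIVE.get(word, 0.0)
    return score


def greedy_attack(text, score_fn, max_steps=5):
    """Repeatedly apply the substitution that lowers the score the most.

    Stops when the predicted label flips (score <= 0) or no candidate helps.
    """
    words = text.split()
    for _ in range(max_steps):
        current = score_fn(' '.join(words))
        if current <= 0:
            break  # label flipped: the attack succeeded
        best = None
        for i, word in enumerate(words):
            core = word.strip('.,!?').lower()
            for candidate in SUBSTITUTES.get(core, []):
                trial = words[:i] + [candidate] + words[i + 1:]
                trial_score = score_fn(' '.join(trial))
                if trial_score < current and (best is None or trial_score < best[0]):
                    best = (trial_score, trial)
        if best is None:
            break  # no substitution lowers the score
        words = best[1]
    final = ' '.join(words)
    return final, score_fn(final)


original = 'Titanic is my favorite movie.'
adversarial, adv_score = greedy_attack(original, toy_score)
print(adversarial, adv_score)
```

Each step queries the model's score to decide which substitution to keep, which is exactly the information a model-agnostic Transformation does not use.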

Validator

It is crucial to verify the quality of samples generated by Transformation and AttackRecipe. TextFlint provides several metrics to calculate confidence:
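As a rough illustration of confidence-based filtering, the sketch below scores each transformed sample by its character-level similarity to the original and keeps only those above a threshold. The metric (difflib.SequenceMatcher from the standard library), the threshold of 0.8, and the function names are assumptions chosen for this example; they are not TextFlint's own metrics.

```python
from difflib import SequenceMatcher


def confidence(original, transformed):
    """Character-level similarity in [0, 1]; a stand-in metric."""
    return SequenceMatcher(None, original, transformed).ratio()


def filter_samples(pairs, threshold=0.8):
    """Keep (original, transformed) pairs whose confidence meets the threshold."""
    return [(o, t) for o, t in pairs if confidence(o, t) >= threshold]


pairs = [
    # A one-character typo: very close to the original.
    ('Titanic is my favorite movie.', 'Titanic is my favorite m0vie.'),
    # An unrelated sentence: should be rejected.
    ('Titanic is my favorite movie.', 'Completely unrelated text here'),
]

kept = filter_samples(pairs)
print(kept)
```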

Report

In the Generation Layer, TextFlint can generate three types of adversarial samples and verify the robustness of the target model. Based on the results from the Generation Layer, the Report Layer aims to provide users with a standard analysis report at the lexical, syntactic, and semantic levels. For example, on the Sentiment Analysis (SA) task, this is a statistical chart of the performance of XLNet with different types of Transformation/Subpopulation/AttackRecipe on the IMDB dataset. We can see that the model's performance on every transformed dataset is lower than its original result.
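The core comparison behind such a report, accuracy on the original samples versus accuracy on their transformed counterparts, can be sketched as follows. The labels and predictions are placeholder values invented for the example, not measurements of XLNet or any other model, and the structure of runs is a hypothetical layout rather than TextFlint's report format.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Placeholder labels and predictions, NOT real evaluation results.
runs = {
    'Keyboard': {
        'gold': ['pos', 'neg', 'pos', 'neg'],
        'ori_pred': ['pos', 'neg', 'pos', 'pos'],
        'trans_pred': ['pos', 'pos', 'pos', 'pos'],
    },
    'SwapNamedEnt': {
        'gold': ['pos', 'neg'],
        'ori_pred': ['pos', 'neg'],
        'trans_pred': ['pos', 'pos'],
    },
}

report = {}
for name, run in runs.items():
    ori_acc = accuracy(run['gold'], run['ori_pred'])
    trans_acc = accuracy(run['gold'], run['trans_pred'])
    report[name] = {'ori': ori_acc, 'trans': trans_acc,
                    'drop': ori_acc - trans_acc}
    print(f"{name:<14} ori={ori_acc:.2f} trans={trans_acc:.2f} "
          f"drop={ori_acc - trans_acc:.2f}")
```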

Citation

If you are using TextFlint for your work, please cite:

@article{gui2021textflint,
  title={TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing},
  author={Gui, Tao and Wang, Xiao and Zhang, Qi and Liu, Qin and Zou, Yicheng and Zhou, Xin and Zheng, Rui and Zhang, Chong and Wu, Qinzhuo and Ye, Jiacheng and others},
  journal={arXiv preprint arXiv:2103.11441},
  year={2021}
}