makcedward/nlpaug
Data augmentation for NLP
repo name | makcedward/nlpaug |
repo link | https://github.com/makcedward/nlpaug |
homepage | https://makcedward.github.io/ |
language | Jupyter Notebook |
size (curr.) | 2914 kB |
stars (curr.) | 1320 |
created | 2019-03-21 |
license | MIT License |
nlpaug
This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to understand Data Augmentation in NLP. Augmenter is the basic element of augmentation, while Flow is a pipeline that orchestrates multiple augmenters together.
Features
- Generate synthetic data for improving model performance without manual effort
- Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
- Plug and play with any machine learning/neural network framework (e.g. scikit-learn, PyTorch, TensorFlow)
- Support textual and audio input
Section | Description |
---|---|
Quick Demo | How to use this library |
Augmenter | Introduce all available augmentation methods |
Installation | How to install this library |
Recent Changes | Latest enhancement |
Extension Reading | More real-life examples and research |
Reference | References to external resources such as data or models |
Quick Demo
- Quick Example
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Multilingual Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- Example of Showing Augmentation History
- How to train TF-IDF model
- How to create custom augmentation
- API Documentation
Augmenter
Type | Target | Augmenter | Action | Description |
---|---|---|---|---|
Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
Textual | Character | OcrAug | substitute | Simulate OCR engine error |
Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
Textual | Word | AntonymAug | substitute | Substitute a word with its opposite according to WordNet antonyms |
Textual | Word | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
Textual | Word | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
Textual | Word | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
Textual | Word | SplitAug | split | Split one word into two words randomly |
Textual | Word | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
Textual | Word | TfIdfAug | insert, substitute | Use TF-IDF to decide how a word should be augmented |
Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
Textual | Word | BackTranslationAug | substitute | Leverage two translation models for augmentation |
Textual | Word | ReservedAug | substitute | Replace reserved words |
Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
Textual | Sentence | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
Signal | Audio | CropAug | delete | Delete audio’s segment |
Signal | Audio | LoudnessAug | substitute | Adjust audio’s volume |
Signal | Audio | MaskAug | substitute | Mask audio’s segment |
Signal | Audio | NoiseAug | substitute | Inject noise |
Signal | Audio | PitchAug | substitute | Adjust audio’s pitch |
Signal | Audio | ShiftAug | substitute | Shift time dimension forward/backward |
Signal | Audio | SpeedAug | substitute | Adjust audio’s speed |
Signal | Audio | VtlpAug | substitute | Change vocal tract |
Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
Signal | Spectrogram | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
Signal | Spectrogram | LoudnessAug | substitute | Adjust volume |
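To make the "substitute" action more concrete, here is a minimal standard-library sketch of the idea behind a keyboard-distance augmenter: some characters are replaced by a key that sits next to them on the keyboard. This is not nlpaug's implementation. The function name `keyboard_substitute`, its `aug_p` and `seed` parameters, and the truncated neighbour map are all assumptions made for illustration.

```python
import random

# Hypothetical, heavily truncated QWERTY neighbour map (illustration only).
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
    "s": "awedxz",
    "t": "rfgy",
}


def keyboard_substitute(text, aug_p=0.3, seed=None):
    """Replace some characters with a neighbouring key.

    Each character that has an entry in KEYBOARD_NEIGHBOURS is replaced
    with probability aug_p; every other character is copied unchanged.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < aug_p:
            new = rng.choice(neighbours)
            # Preserve the case of the original character.
            out.append(new.upper() if ch.isupper() else new)
        else:
            out.append(ch)
    return "".join(out)


print(keyboard_substitute("The quick brown fox", aug_p=0.5, seed=0))
```

The output has the same length as the input, since this sketch only substitutes characters. A real augmenter would use a full keyboard layout and typically caps how many characters per word may change.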
Flow
Type | Flow | Description |
---|---|---|
Pipeline | Sequential | Apply list of augmentation functions sequentially |
Pipeline | Sometimes | Apply some augmentation functions randomly |
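The two flows can be pictured with a small standard-library sketch. It is only an illustration of the concept, not the library's code: the classes below mirror the names in the table, but their constructors, the per-augmenter probability `p`, and the toy augmenters `drop_last_word` and `shout` are assumptions.

```python
import random


class Sequential:
    """Apply every augmenter in order, feeding each output to the next."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class Sometimes:
    """Apply each augmenter, in order, independently with probability p."""

    def __init__(self, augmenters, p=0.5, seed=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = random.Random(seed)

    def augment(self, data):
        for aug in self.augmenters:
            if self.rng.random() < self.p:
                data = aug(data)
        return data


# Two toy augmenters (hypothetical, for illustration only).
def drop_last_word(text):
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


def shout(text):
    return text.upper()


flow = Sequential([drop_last_word, shout])
print(flow.augment("the quick brown fox"))  # THE QUICK BROWN
```

With `Sometimes`, the same list of augmenters would produce a different subset of transformations on each call, which is one way to add variety to the generated data.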
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
pip install numpy requests nlpaug
or install the latest version (including beta features) directly from GitHub:
pip install numpy git+https://github.com/makcedward/nlpaug.git
or install with conda:
conda install -c makcedward nlpaug
If you use ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well (the quotes stop the shell from treating `>` as a redirect):
pip install "torch>=1.6.0" "transformers>=3.0.2"
If you use BackTranslationAug, you must use Python 3.7 or 3.8. Also install the following dependencies:
pip install "torch>=1.6.0" "fairseq>=0.9.0" "sacremoses>=0.0.43" "fastBPE>=0.1.0"
If you use AntonymAug or SynonymAug, install the following dependency as well:
pip install "nltk>=3.4.5"
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model
If you use SynonymAug (PPDB), download the file from the following URI. You may not be able to run the augmenter if you get the PPDB file from another website:
http://paraphrase.org/#/download
If you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well:
pip install "librosa>=0.7.1" matplotlib
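Because these dependencies are optional, it can help to check which ones are present before choosing an augmenter. The sketch below uses only the standard library. The grouping follows the instructions above, while the dictionary name `OPTIONAL_DEPS`, the helper `missing_modules`, and the assumption that each pip package is importable under the same name are illustrative. It does not check version numbers.

```python
from importlib.util import find_spec

# Optional dependencies per augmenter group, as listed in the installation
# notes. Assumption: each pip package is importable under the same name.
OPTIONAL_DEPS = {
    "ContextualWordEmbsAug / ContextualWordEmbsForSentenceAug / AbstSummAug": [
        "torch",
        "transformers",
    ],
    "BackTranslationAug": ["torch", "fairseq", "sacremoses", "fastBPE"],
    "AntonymAug / SynonymAug": ["nltk"],
    "PitchAug / SpeedAug / VtlpAug": ["librosa", "matplotlib"],
}


def missing_modules(modules):
    """Return the names in `modules` that cannot be found on this system."""
    return [name for name in modules if find_spec(name) is None]


for group, modules in OPTIONAL_DEPS.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{group}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy packages such as torch are installed.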
Recent Changes
1.1.0dev
See changelog for more details.
Extension Reading
- Data Augmentation library for Text
- Does your NLP model able to prevent adversarial attack?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- A Visual Survey of Data Augmentation in NLP
Reference
This library uses data (e.g. captured from the internet), research (e.g. ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.
Citing
@misc{ma2019nlpaug,
title={NLP Augmentation},
author={Edward Ma},
howpublished={https://github.com/makcedward/nlpaug},
year={2019}
}
Books citing nlpaug
- S. Vajjala, B. Majumder, A. Gupta and H. Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. 2020
Research papers citing nlpaug
- M. Raghu and E. Schmidt. A Survey of Deep Learning for Scientific Discovery. 2020
- H. Guan, J. Li, H. Xu and M. Devarakonda. Robustly Pre-trained Neural Model for Direct Temporal Relation Extraction. 2020
- X. He, K. Zhao and X. Chu. AutoML: A Survey of the State-of-the-Art. 2020
- S. Illium, R. Muller, A. Sedlmeier and C. Linnhoff-Popien. Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms. 2020
- D. Niederhut. A Python package for text data enrichment. 2020
- P. Ryan, S. Takafuji, C. Yang, N. Wilson and C. McBride. Using Self-Supervised Learning of Birdsong for Downstream Industrial Audio Classification. 2020
- Z. Shao, J. Yang and S. Ren. Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets. 2020
- S. Qiu, B. Xu, J. Zhang, Y. Wang, X. Shen, G. D. Melo, C. Long and X. Li. EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks. 2020
- D. Nguyen, Q. H. Nguyen, M. Dao, D. Dang-Nguyen, C. Gurrin and B. T. Nguyen. Duplicate Identification Algorithms in SaaS Platforms. 2020
- A. Ollagnier and H. Williams. Text Augmentation Techniques for Clinical Case Classification. 2020
- V. Atliha and D. Šešok. Text Augmentation Using BERT for Image Captioning. 2020
- Y. Ma, X. Xu and Y. Li. LungRN+NL: An Improved Adventitious Lung Sound Classification Using non-local block ResNet Neural Network with Mixup Data Augmentation. 2020
- S. N. Zisad, M. Shahadat and K. Andersson. Speech emotion recognition in neurological disorders using Convolutional Neural Network. 2020
- M. Bhange and N. Kasliwal. HinglishNLP: Fine-tuned Language Models for Hinglish Sentiment Detection. 2020
- T. Deruyttere, S. Vandenhende, D. Grujicic, Y. Liu, L. V. Gool, M. Blaschko, T. Tuytelaars and M. Moens. Commands 4 Autonomous Vehicles (C4AV) Workshop Summary. 2020
- A. Tamkin, M. Wu and N. Goodman. Viewmaker Networks: Learning Views for Unsupervised Representation Learning. 2020
Projects citing nlpaug
- D. Garcia-Olano and A. Jain. Generating Counterfactual Explanations using Reinforcement Learning Methods for Tabular and Text data. 2019
- L. Yi. Avengers: Achieving Superhuman Performance for Question Answering on SQuAD 2.0 Using Multiple Data Augmentations, Randomized Mini-Batch Training and Architecture Ensembling. 2020
Contributions (Supporting Other Languages)
- sakares: Add Thai support to KeyboardAug