makcedward/nlp
:memo: This repository records my NLP journey.
repo name | makcedward/nlp |
repo link | https://github.com/makcedward/nlp |
homepage | https://makcedward.github.io/ |
language | Python |
size (curr.) | 2320 kB |
stars (curr.) | 620 |
created | 2018-05-18 |
license | |
NLP - Tutorial
A repository showing how NLP can tackle real problems, including source code, datasets, and state-of-the-art NLP methods.
Data Augmentation
- Data Augmentation in NLP
- Data Augmentation library for Text
- Is your NLP model able to prevent adversarial attacks?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- Adversarial Attacks in Textual Deep Neural Networks
Text Preprocessing
Section | Sub-Section | Description | Story |
---|---|---|---|
Tokenization | Subword Tokenization | | Medium |
Tokenization | Word Tokenization | | Medium Github |
Tokenization | Sentence Tokenization | | Medium Github |
Part of Speech | | | Medium Github |
Lemmatization | | | Medium Github |
Stemming | | | Medium Github |
Stop Words | | | Medium Github |
Phrase Word Recognition | | | |
Spell Checking | Lexicon-based | Peter Norvig algorithm | Medium Github |
Spell Checking | Lexicon-based | Symspell | Medium Github |
Machine Translation | Statistical Machine Translation | | Medium |
Machine Translation | Attention | | Medium |
String Matching | Fuzzywuzzy | | Medium Github |
Text Representation
Section | Sub-Section | Research Lab | Story | Source |
---|---|---|---|---|
Traditional Method | Bag-of-words (BoW) | | Medium Github | |
Traditional Method | Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) | | Medium Github | |
Character Level | Character Embedding | NYU | Medium Github | Paper |
Word Level | Negative Sampling and Hierarchical Softmax | | Medium | |
Word Level | Word2Vec, GloVe, fastText | | Medium Github | |
Word Level | Contextualized Word Vectors (CoVe) | Salesforce | Medium Github | Paper Code |
Word Level | Misspelling Oblivious (word) Embeddings | | Medium | Paper |
Word Level | Embeddings from Language Models (ELMo) | AI2 | Medium Github | Paper Code |
Word Level | Contextual String Embeddings | Zalando Research | Medium | Paper Code |
Sentence Level | Skip-thoughts | | Medium Github | Paper Code |
Sentence Level | InferSent | | Medium Github | Paper Code |
Sentence Level | Quick-Thoughts | | Medium | Paper Code |
Sentence Level | General Purpose Sentence (GenSen) | | Medium | Paper Code |
Sentence Level | Bidirectional Encoder Representations from Transformers (BERT) | | Medium | Paper(2019) Code |
Sentence Level | Generative Pre-Training (GPT) | OpenAI | Medium | Paper(2019) Code |
Sentence Level | Self-Governing Neural Networks (SGNN) | | Medium | Paper |
Sentence Level | Multi-Task Deep Neural Networks (MT-DNN) | Microsoft | Medium | Paper(2019) |
Sentence Level | Generative Pre-Training-2 (GPT-2) | OpenAI | Medium | Paper(2019) Code |
Sentence Level | Universal Language Model Fine-tuning (ULMFiT) | fast.ai | Medium | Paper Code |
Sentence Level | BERT in Science Domain | | Medium | Paper(2019) Paper(2019) |
Sentence Level | BERT in Clinical Domain | NYU/PU | Medium | Paper(2019) Paper(2019) |
Sentence Level | RoBERTa | UW/Facebook | Medium | Paper(2019) Paper |
Sentence Level | Unified Language Model for NLP and NLU (UNILM) | Microsoft | Medium | Paper(2019) |
Sentence Level | Cross-lingual Language Model (XLMs) | | Medium | Paper(2019) |
Sentence Level | Transformer-XL | CMU/Google | Medium | Paper(2019) |
Sentence Level | XLNet | CMU/Google | Medium | Paper(2019) |
Sentence Level | CTRL | Salesforce | Medium | Paper(2019) |
Sentence Level | ALBERT | Google/Toyota | Medium | Paper(2019) |
Sentence Level | T5 | Google | Medium | Paper(2019) |
Document Level | lda2vec | | Medium | Paper |
Document Level | doc2vec | | Medium Github | Paper |
NLP Problem
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
Named Entity Recognition (NER) | Pattern-based Recognition | | | Medium | |
Named Entity Recognition (NER) | Lexicon-based Recognition | | | Medium | |
Named Entity Recognition (NER) | spaCy Pre-trained NER | | | Medium Github | |
Optical Character Recognition (OCR) | Printed Text | Google Cloud Vision API | | Medium | Paper |
Optical Character Recognition (OCR) | Handwriting | LSTM | | Medium | Paper |
Text Summarization | Extractive Approach | | | Medium Github | |
Text Summarization | Abstractive Approach | | | Medium | |
Emotion Recognition | Audio, Text, Visual | 3 Multimodals for Emotion Recognition | | Medium | |
Acoustic Problem
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
Feature Representation | Unsupervised Learning | Introduction to Audio Feature Learning | | Medium | Paper 1 Paper 2 Paper 3 |
Feature Representation | Unsupervised Learning | Speech2Vec and Sentence Level Embeddings | | Medium | Paper 1 Paper 2 |
Feature Representation | Unsupervised Learning | Wav2vec | | Medium | Paper |
Speech-to-text | | Introduction to Speech-to-text | | Medium | |
Text Distance Measurement
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
Euclidean Distance, Cosine Similarity and Jaccard Similarity | | | | Medium Github | |
Edit Distance | Levenshtein Distance | | | Medium Github | |
Word Mover's Distance (WMD) | | | | Medium Github | |
Supervised Word Mover's Distance (S-WMD) | | | | Medium | |
Manhattan LSTM | | | | Medium | Paper |
Model Interpretation
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
ELI5, LIME and Skater | | | | Medium Github | |
SHapley Additive exPlanations (SHAP) | | | | Medium Github | |
Anchors | | | | Medium Github | |
Graph
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
Embeddings | TransE, RESCAL, DistMult, ComplEx, PyTorch BigGraph | | | Medium | RESCAL(2011) TransE(2013) DistMult(2015) ComplEx(2016) PyTorch BigGraph(2019) |
Embeddings | DeepWalk, node2vec, LINE, GraphSAGE | | | Medium | DeepWalk(2014) node2vec(2015) LINE(2015) GraphSAGE(2018) |
Embeddings | WLG, GCN, GAT, GIN | | | Medium | WLG(2011) GCN(2017) GAT(2017) GraphSAGE(2018) |
Image
Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
---|---|---|---|---|---|
Object Detection | R-CNN | | | Medium | Paper(2013) |
Object Detection | Fast R-CNN | | | Medium | Paper(2015) |
Object Detection | Faster R-CNN | | | Medium | Paper(2015) |
Object Detection | ResNet | | Microsoft | Medium | Paper(2015) |
Object Detection | VGGNet | | | Medium | Paper(2014) |
Source Code
Section | Sub-Section | Description | Link |
---|---|---|---|
Spellcheck | | | Github |
InferSent | | | Github |