zhedongzheng/tensorflow-nlp
Code for Natural Language Processing and Text Generation in TensorFlow 2.x / 1.x
| repo name | zhedongzheng/tensorflow-nlp |
|---|---|
| repo link | https://github.com/zhedongzheng/tensorflow-nlp |
| homepage | |
| language | Jupyter Notebook |
| size (curr.) | 3656 kB |
| stars (curr.) | 1404 |
| created | 2017-03-12 |
| license | MIT License |
- These scripts have been run on Google Colab, which provides free GPUs
Contents

- Natural Language Processing (自然语言处理)
  - IMDB
    - TF-IDF + Logistic Regression
    - FastText
    - Attention
    - Sliced LSTM
  - SNLI
    - DAM
    - MatchPyramid
    - ESIM
    - RE2
- Chatbot (对话机器人)
  - Single-turn (单轮对话)
    - Spoken Language Understanding (对话理解)
      - ATIS
        - RNN Seq2Seq + Attention
        - Transformer
  - Multi-turn (多轮对话)
    - Multi-turn Dialogue Rewriting (多轮对话改写)
      - RNN Seq2Seq + Attention + Dynamic Memory
    - Semantic Parsing for Task Oriented Dialog
      - RNN Seq2Seq + Attention
      - Transformer
    - bAbI
      - Dynamic Memory Network
- Word Extraction
- Text Vectorization
- Word Segmentation
- Knowledge Graph (知识图谱)
  - Knowledge Graph Inference (知识图谱推理)
    - WN18
      - DistMult
      - TuckER
      - ComplEx
- Movielens 1M
  - Fusion
    - Classification
    - Regression
Text Classification
└── finch/tensorflow2/text_classification/imdb
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. make data and vocab: train.txt, test.txt, word.txt
│ └── train.txt # incomplete sample, format <label, text> separated by \t
│ └── test.txt # incomplete sample, format <label, text> separated by \t
│ └── train_bt_part1.txt # (back-translated) incomplete sample, format <label, text> separated by \t
│
├── vocab
│ └── word.txt # incomplete sample, list of words in vocabulary
│
└── main
└── attention_linear.ipynb # step 2: train and evaluate model
└── attention_conv.ipynb # step 2: train and evaluate model
└── fasttext_unigram.ipynb # step 2: train and evaluate model
└── fasttext_bigram.ipynb # step 2: train and evaluate model
└── sliced_rnn.ipynb # step 2: train and evaluate model
└── sliced_rnn_bt.ipynb # step 2: train and evaluate model
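
For reference, a minimal sketch (not one of the notebooks above) of the TF-IDF + Logistic Regression baseline listed for this task, reading the tab-separated `<label, text>` files that `make_data.ipynb` produces; the relative paths are assumptions about your local layout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def load(path):
    """Read a tab-separated <label, text> file into parallel lists."""
    labels, texts = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            label, text = line.rstrip('\n').split('\t', 1)
            labels.append(label)
            texts.append(text)
    return labels, texts

y_train, x_train = load('../data/train.txt')   # assumed relative paths
y_test, x_test = load('../data/test.txt')

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(x_train_vec, y_train)
print('test accuracy:', accuracy_score(y_test, clf.predict(x_test_vec)))
```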
- Task: IMDB
  - Model: TF-IDF + Logistic Regression
  - Model: FastText
  - Model: Feedforward Attention
  - Model: Sliced RNN
    - TensorFlow 2
      - <Notebook> Sliced LSTM + Back-Translation -> 91.7% Testing Accuracy
      - <Notebook> Sliced LSTM + Back-Translation + Char Embedding -> 92.3% Testing Accuracy
      - <Notebook> Sliced LSTM + Back-Translation + Char Embedding + Label Smoothing -> 92.5% Testing Accuracy (label smoothing is sketched after this list)
        - This result (without transfer learning) is higher than CoVe (with transfer learning)
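
Label smoothing, used in the best-scoring notebook above, can be expressed with the stock Keras loss. A minimal sketch; the smoothing factor 0.1 and the toy batch are assumptions, not the repo's exact settings.

```python
import tensorflow as tf

# label smoothing softens the one-hot targets before computing cross-entropy
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# toy 2-class (pos/neg) IMDB-style batch: one-hot targets vs predicted probabilities
y_true = tf.constant([[0., 1.], [1., 0.]])
y_pred = tf.constant([[0.2, 0.8], [0.6, 0.4]])
print(loss_fn(y_true, y_pred).numpy())
```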
Text Matching
└── finch/tensorflow2/text_matching/snli
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── download_data.ipynb # step 1. run this to download snli dataset
│ └── make_data.ipynb # step 2. run this to generate train.txt, test.txt, word.txt
│ └── train.txt # incomplete sample, format <label, text1, text2> separated by \t
│ └── test.txt # incomplete sample, format <label, text1, text2> separated by \t
│
├── vocab
│ └── word.txt # incomplete sample, list of words in vocabulary
│
└── main
└── dam.ipynb # step 3. train and evaluate model
└── esim.ipynb # step 3. train and evaluate model
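
Before the DAM / ESIM notebooks, the tab-separated `<label, text1, text2>` files can be streamed with `tf.data`. A rough sketch under an assumed relative path; tokenisation is only whitespace splitting and vocabulary lookup is omitted.

```python
import tensorflow as tf

def parse(line):
    # each line is "<label>\t<text1>\t<text2>"
    fields = tf.strings.split(line, '\t')
    label, text1, text2 = fields[0], fields[1], fields[2]
    return (tf.strings.split(text1), tf.strings.split(text2)), label

dataset = (tf.data.TextLineDataset('../data/train.txt')   # assumed relative path
             .map(parse)
             .shuffle(10_000))

for (premise, hypothesis), label in dataset.take(1):
    print(premise.numpy(), hypothesis.numpy(), label.numpy())
```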
- Task: SNLI
  - Model: DAM
    - TensorFlow 2
      - <Notebook> DAM -> 85.3% Testing Accuracy
        - The accuracy of this implementation is higher than that of the UCL MR Group (84.6%)
  - Model: Match Pyramid
    - TensorFlow 2
      - <Notebook> Match Pyramid + Multiway Attention -> 87.1% Testing Accuracy
        - The accuracy of this model is 0.3% below ESIM, but it runs roughly twice as fast as ESIM
  - Model: ESIM
    - TensorFlow 2
      - <Notebook> ESIM -> 87.4% Testing Accuracy
        - The accuracy of this implementation is slightly higher than that of the UCL MR Group (87.2%)
  - Model: RE2
Topic Modelling
- Data: Some Book Titles
  - Model: TF-IDF + LDA
    - PySpark
    - Sklearn + pyLDAvis (see the sketch after this list)
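
A minimal sketch in the spirit of the "TF-IDF + LDA" Sklearn notebook; the book titles below are placeholder data, and pyLDAvis rendering is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

titles = ["machine learning in action", "deep learning with python",
          "the art of cooking", "french cooking at home"]

# TF-IDF features feeding LDA, then print the top terms per topic
vec = TfidfVectorizer()
X = vec.fit_transform(titles)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]
    print('topic', k, [terms[i] for i in top])
```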
Spoken Language Understanding
└── finch/tensorflow2/spoken_language_understanding/atis
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
│ └── atis.train.w-intent.iob # incomplete sample, format <text, slot, intent>
│ └── atis.test.w-intent.iob # incomplete sample, format <text, slot, intent>
│
├── vocab
│ └── word.txt # list of words in vocabulary
│ └── intent.txt # list of intents in vocabulary
│ └── slot.txt # list of slots in vocabulary
│
└── main
└── bigru.ipynb # step 2. train and evaluate model
└── bigru_self_attn.ipynb # step 2. train and evaluate model
└── transformer.ipynb # step 2. train and evaluate model
└── transformer_elu.ipynb # step 2. train and evaluate model
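
A rough sketch (layer sizes and vocabulary counts are placeholders, not the real ATIS numbers) of the joint idea behind `bigru.ipynb`: a bi-directional GRU encoder whose per-token states feed a slot tagger and whose pooled state feeds an intent classifier.

```python
import tensorflow as tf

VOCAB_SIZE, N_INTENTS, N_SLOTS = 10_000, 25, 130   # placeholder sizes

words = tf.keras.Input(shape=(None,), dtype='int32')                       # token ids
x = tf.keras.layers.Embedding(VOCAB_SIZE, 300)(words)                      # GloVe-sized vectors
h = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True))(x)                # contextual states

slot_logits = tf.keras.layers.Dense(N_SLOTS, name='slots')(h)              # one tag per token
pooled = tf.keras.layers.GlobalMaxPooling1D()(h)                           # utterance summary
intent_logits = tf.keras.layers.Dense(N_INTENTS, name='intent')(pooled)    # one intent per utterance

model = tf.keras.Model(words, [slot_logits, intent_logits])
model.compile(
    optimizer='adam',
    loss={'slots': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          'intent': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)})
model.summary()
```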
- Task: ATIS (the micro-F1 metric is sketched after this list)
  - Model: Bi-directional RNN
    - TensorFlow 2
      - 97.8% Intent Micro-F1, 95.5% Slot Micro-F1 on Testing Data
    - TensorFlow 1
      - 97.2% Intent Micro-F1, 95.7% Slot Micro-F1 on Testing Data
  - Model: Transformer
    - TensorFlow 2
      - 97.5% Intent Micro-F1, 94.9% Slot Micro-F1 on Testing Data
      - <Notebook> Transformer + ELU activation -> 97.2% Intent Micro-F1, 95.5% Slot Micro-F1 on Testing Data
      - <Notebook> Bi-GRU + Transformer -> 97.7% Intent Micro-F1, 95.8% Slot Micro-F1 on Testing Data
  - Model: ELMO Embedding
    - TensorFlow 1
      - <Notebook> ELMO (the first LSTM hidden state) + Bi-GRU -> 97.6% Intent Micro-F1, 96.2% Slot Micro-F1 on Testing Data
      - <Notebook> ELMO (weighted sum of 3 layers) + Bi-GRU -> 97.6% Intent Micro-F1, 96.1% Slot Micro-F1 on Testing Data
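
The Intent / Slot Micro-F1 numbers above are the kind of score scikit-learn's `f1_score` computes with `average='micro'`; the arrays below are placeholders, not the actual ATIS predictions.

```python
from sklearn.metrics import f1_score

# one intent label per utterance
intent_true = ['flight', 'airfare', 'flight']
intent_pred = ['flight', 'flight', 'flight']
print('intent micro-F1:', f1_score(intent_true, intent_pred, average='micro'))

# slot tags are flattened across all tokens before scoring
slot_true = ['O', 'B-fromloc', 'O', 'B-toloc']
slot_pred = ['O', 'B-fromloc', 'O', 'O']
print('slot micro-F1:', f1_score(slot_true, slot_pred, average='micro'))
```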
Generative Dialog
└── finch/tensorflow1/free_chat/chinese_gaoq1
│
├── data
│ └── make_data.ipynb # step 1. run this to generate vocab {char.txt} and data {reduce.txt & core.txt}
│
├── vocab
│ └── char.txt # list of chars in vocabulary for chinese
│ └── cc.zh.300.vec # fastText pretrained embedding downloaded from external
│ └── char.npy # chinese characters and their embedding values (300 dim)
│
└── main
└── lstm_seq2seq_train.ipynb # step 2. train and evaluate model
└── lstm_seq2seq_export.ipynb # step 3. export trained tf model
└── lstm_seq2seq_predict.ipynb # step 4. end-to-end inference
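
One plausible way (an assumption, not necessarily what `make_data.ipynb` does) to produce `char.npy` from `char.txt` and `cc.zh.300.vec`, relying on the standard fastText `.vec` text format.

```python
import numpy as np

chars = [line.rstrip('\n') for line in open('../vocab/char.txt', encoding='utf-8')]

# fastText .vec files start with a "<count> <dim>" header, then "<token> <300 floats>" per line
vectors = {}
with open('../vocab/cc.zh.300.vec', encoding='utf-8') as f:
    next(f)
    for line in f:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

# characters missing from fastText fall back to small random vectors
embedding = np.stack([
    vectors.get(c, np.random.normal(scale=0.1, size=300).astype('float32'))
    for c in chars])
np.save('../vocab/char.npy', embedding)
print(embedding.shape)   # (len(chars), 300)
```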
- Task: Chinese Free Chat (the two metrics are sketched after this list)
  - Data
  - Model: RNN Seq2Seq + Attention
    - TensorFlow 1
      - LSTM + Attention + Beam Search -> 28.6 Perplexity & 10.5 BLEU-2
  - Model: Transformer
    - TensorFlow 1
      - Transformer (6 Layers, 8 Heads) -> 29.4 Perplexity & 12.1 BLEU-2
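
For context on the metrics above: perplexity is the exponential of the mean per-token cross-entropy, and BLEU-2 can be computed with NLTK at the character level for Chinese. The numbers and sentences below are placeholders.

```python
import math
from nltk.translate.bleu_score import sentence_bleu

mean_nll = 3.35                      # placeholder: average negative log-likelihood per token
print('perplexity:', math.exp(mean_nll))

# character-level BLEU-2 (weights over 1-grams and 2-grams only)
reference = [list('今天天气不错')]
hypothesis = list('今天天气很好')
print('BLEU-2:', sentence_bleu(reference, hypothesis, weights=(0.5, 0.5)))
```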
Semantic Parsing
└── finch/tensorflow1/semantic_parsing/tree_slu
│
├── data
│ └── glove.840B.300d.txt # pretrained embedding, download and put here
│ └── make_data.ipynb # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt
│ └── train.tsv # incomplete sample, format <text, tokenized_text, tree>
│ └── test.tsv # incomplete sample, format <text, tokenized_text, tree>
│
├── vocab
│ └── source.txt # list of words in vocabulary for source (of seq2seq)
│ └── target.txt # list of words in vocabulary for target (of seq2seq)
│
└── main
└── lstm_transformer.ipynb # step 2. train and evaluate model
└── lstm_seq2seq_multi_attn.ipynb # step 2. train and evaluate model
- Task: Semantic Parsing for Task Oriented Dialog (exact match is sketched after this list)
  - Model: RNN Seq2Seq + Attention
    - TensorFlow 2
      - <Notebook> LSTM + Attention + Beam Search -> 72.4% Exact Match Accuracy on Testing Data
    - TensorFlow 1
      - <Notebook> ELMO + LSTM + Attention + Beam Search + Label Smoothing -> 74.8% Exact Match Accuracy on Testing Data
  - Model: Transformer
    - TensorFlow 1 + Texar
      - <Notebook> ELMO + Transformer + Beam Search + Label Smoothing -> 73.3% Exact Match Accuracy on Testing Data
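
A small sketch of the exact-match metric reported above: a prediction counts only if the generated tree string equals the reference exactly. The bracketed trees are illustrative strings, not samples from the dataset.

```python
def exact_match(preds, refs):
    """Fraction of predictions that match the reference string exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

preds = ['[IN:GET_WEATHER [SL:LOCATION boston ] ]',
         '[IN:GET_EVENT concerts ]']
refs = ['[IN:GET_WEATHER [SL:LOCATION boston ] ]',
        '[IN:GET_EVENT [SL:CATEGORY_EVENT concerts ] ]']
print(exact_match(preds, refs))   # 0.5
```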
Knowledge Graph Inference
└── finch/tensorflow2/knowledge_graph_completion/wn18
│
├── data
│ └── download_data.ipynb # step 1. run this to download wn18 dataset
│ └── make_data.ipynb # step 2. run this to generate vocabulary: entity.txt, relation.txt
│ └── wn18 # wn18 folder (will be auto created by download_data.ipynb)
│ └── train.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│ └── valid.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│ └── test.txt # incomplete sample, format <entity1, relation, entity2> separated by \t
│
├── vocab
│ └── entity.txt # incomplete sample, list of entities in vocabulary
│ └── relation.txt # incomplete sample, list of relations in vocabulary
│
└── main
└── distmult_1-N.ipynb # step 3. train and evaluate model
- Task: WN18
  - We use 1-N fast evaluation to greatly speed up the evaluation process (sketched after this list)
  - MRR: Mean Reciprocal Rank
  - Model: DistMult
    - TensorFlow 2
    - TensorFlow 1
  - Model: TuckER
    - TensorFlow 2
  - Model: ComplEx
    - TensorFlow 2
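
A minimal NumPy sketch of DistMult scoring with the 1-N trick and the MRR metric mentioned above; the embeddings and ids are random placeholders, not trained values from the notebooks.

```python
import numpy as np

n_entities, n_relations, dim = 40943, 18, 200      # WN18-sized counts, dim is a placeholder
rng = np.random.default_rng(0)
E = rng.normal(size=(n_entities, dim)).astype('float32')    # entity embeddings
R = rng.normal(size=(n_relations, dim)).astype('float32')   # relation embeddings

def score_all_objects(s, r):
    # DistMult scores <e_s, w_r, e_o>; the 1-N form scores every candidate
    # object with a single matrix product instead of looping over entities.
    return (E[s] * R[r]) @ E.T              # shape: (n_entities,)

def mrr(ranks):
    # Mean Reciprocal Rank over the ranks assigned to the true entities
    return float(np.mean(1.0 / np.asarray(ranks, dtype='float64')))

scores = score_all_objects(s=0, r=3)
true_o = 42                                 # placeholder gold object id
rank = 1 + int((scores > scores[true_o]).sum())
print(rank, mrr([rank]))
```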
Knowledge Graph Construction
- Data Scraping
- SPARQL
- Neo4j + Cypher
Question Answering
└── finch/tensorflow1/question_answering/babi
│
├── data
│ └── make_data.ipynb # step 1. run this to generate vocabulary: word.txt
│ └── qa5_three-arg-relations_train.txt # one complete example of babi dataset
│ └── qa5_three-arg-relations_test.txt # one complete example of babi dataset
│
├── vocab
│ └── word.txt # complete list of words in vocabulary
│
└── main
└── dmn_train.ipynb
└── dmn_serve.ipynb
└── attn_gru_cell.py
- Task: bAbI
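
A sketch (not the repo's `make_data.ipynb`) of parsing the bAbI text format used by the two data files above: each line is `<id> <sentence>`, stories restart at id 1, and question lines append `\t<answer>\t<supporting fact ids>`.

```python
def parse_babi(path):
    """Return (story sentences, question, answer, supporting ids) tuples."""
    samples, story = [], []
    with open(path) as f:
        for line in f:
            idx, text = line.rstrip('\n').split(' ', 1)
            if int(idx) == 1:
                story = []                         # a new story begins
            if '\t' in text:                       # question line
                question, answer, supports = text.split('\t')
                samples.append((list(story), question.strip(), answer, supports))
            else:
                story.append(text)
    return samples

samples = parse_babi('../data/qa5_three-arg-relations_train.txt')  # assumed relative path
print(samples[0])
```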
Text Transformation
- Word Extraction
  - Chinese
- Text Vectorization
  - Chinese
- Word Segmentation
  - Chinese
  - Custom TensorFlow Op added by applenob
Recommender System
└── finch/tensorflow1/recommender/movielens
│
├── data
│ └── make_data.ipynb # run this to generate vocabulary
│
├── vocab
│ └── user_job.txt
│ └── user_id.txt
│ └── user_gender.txt
│ └── user_age.txt
│ └── movie_types.txt
│ └── movie_title.txt
│ └── movie_id.txt
│
└── main
└── dnn_softmax.ipynb
└── dnn_mse.ipynb
- Task: Movielens 1M
  - Model: Fusion (sketched after this list)
    - TensorFlow 1
      - MAE: Mean Absolute Error
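
A rough sketch of the fusion idea behind `dnn_mse.ipynb`: embed each categorical id, concatenate user and movie features, and regress the rating with MSE (swap the head for a softmax over rating classes to get the `dnn_softmax.ipynb` variant). The vocabulary sizes are placeholders, and the multi-valued `movie_types` / `movie_title` features are omitted for brevity.

```python
import tensorflow as tf

# placeholder vocabulary sizes for the id files listed above
FEATURES = {'user_id': 6041, 'user_gender': 2, 'user_age': 7,
            'user_job': 21, 'movie_id': 3953}

inputs, embedded = [], []
for name, vocab_size in FEATURES.items():
    inp = tf.keras.Input(shape=(), dtype='int32', name=name)
    emb = tf.keras.layers.Embedding(vocab_size, 32)(inp)   # (batch, 32)
    inputs.append(inp)
    embedded.append(emb)

x = tf.keras.layers.Concatenate()(embedded)                # fuse user + movie features
x = tf.keras.layers.Dense(256, activation='relu')(x)
rating = tf.keras.layers.Dense(1)(x)                       # regression head

model = tf.keras.Model(inputs, rating)
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
```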
Multi-turn Dialogue Rewriting
└── finch/tensorflow1/multi_turn_rewrite/chinese/
│
├── data
│ └── make_data.ipynb # run this to generate vocab, split train & test data, make pretrained embedding
│
├── vocab
│ └── cc.zh.300.vec # fastText pretrained embedding downloaded from external
│ └── char.npy # chinese characters and their embedding values (300 dim)
│ └── char.txt # list of chinese characters used in this project
│
└── main
└── baseline_lstm_train.ipynb
└── baseline_lstm_export.ipynb
└── baseline_lstm_predict.ipynb
- Task: Chinese Multi-turn Dialogue Rewriting
  - Model: RNN Seq2Seq + Attention + Dynamic Memory
    - TensorFlow 1
      - <Notebook> LSTM + Attention + Memory + Beam Search -> BLEU-1: 95.0, BLEU-2: 89.4, BLEU-4: 79.0, EM: 56.7%