louisowen6/NLP_bahasa_resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
repo name | louisowen6/NLP_bahasa_resources |
repo link | https://github.com/louisowen6/NLP_bahasa_resources |
homepage | |
language | |
size (curr.) | 157 kB |
stars (curr.) | 4 |
created | 2020-03-31 |
license | |
NLP Bahasa Indonesia Resources
This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.
Dictionary
Sentiment Words
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negatif_ta2.txt
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_add.txt
- (Negative) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/negative_keyword.txt
- (Negative) https://github.com/masdevid/ID-OpinionWords/blob/master/negative.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positif_ta2.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_add.txt
- (Positive) https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/positive_keyword.txt
- (Positive) https://github.com/masdevid/ID-OpinionWords/blob/master/positive.txt
- (Score) https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/sentimentword.txt
- (InSet Lexicon) https://github.com/fajri91/InSet [Paper]
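As a rough illustration of how positive and negative word lists like these are commonly used, here is a minimal sketch of count-based sentiment scoring. The function name `sentiment_score` and the tiny inline lexicons are hypothetical stand-ins; in practice you would load the words from the files linked above, whose exact formats may differ from file to file.

```python
import re

def sentiment_score(text, positive_words, negative_words):
    """Return (number of positive tokens) - (number of negative tokens)."""
    # Lowercase and keep alphabetic runs only; a deliberately simple tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    return pos - neg

# Tiny illustrative lexicons (assumption: not taken from the linked files).
POSITIVE = {"bagus", "senang", "puas"}
NEGATIVE = {"buruk", "kecewa", "lambat"}

print(sentiment_score("Pelayanan bagus, saya senang", POSITIVE, NEGATIVE))
print(sentiment_score("Pengiriman lambat dan saya kecewa", POSITIVE, NEGATIVE))
```

A score above zero suggests positive sentiment and below zero negative. This ignores negation and intensity, which is where the scored lexicons can help.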
Position / Degree Words
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/psuf.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/lldr.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/opos.txt
- https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/ptit.txt
Root Words
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/rootword.txt
- https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.original.txt
- https://github.com/sastrawi/sastrawi/blob/master/data/kata-dasar.txt
- https://github.com/prasastoadi/serangkai/blob/master/serangkai/kamus/data/kamus-kata-dasar.csv
I have combined the root word lists from all of the above repositories into a single list.
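If you want to build a combined list yourself, the sketch below shows one possible way to merge several word lists. It assumes each source yields one word per line (the CSV source would need its own parsing first), and `combine_word_lists` is a hypothetical helper, not part of any of the linked repositories.

```python
def combine_word_lists(*sources):
    """Merge several iterables of words into one sorted, de-duplicated list."""
    combined = set()
    for source in sources:
        for line in source:
            word = line.strip().lower()  # normalize whitespace and case
            if word:                     # skip blank lines
                combined.add(word)
    return sorted(combined)

# In-memory stand-ins for downloaded files (open file objects work the same way).
list_a = ["makan\n", "minum\n", "Tidur\n"]
list_b = ["tidur", "jalan", ""]

print(combine_word_lists(list_a, list_b))
```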
Slang Words
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/kbba.txt
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/slangword.txt
- https://github.com/panggi/pujangga/blob/master/resource/formalization/formalizationDict.txt
I have combined the slang word dictionaries from all of the above repositories into a single dictionary.
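A slang dictionary is typically applied as a token-by-token lookup before further processing. The sketch below assumes each line holds a slang form and its formal form separated by a single delimiter (a tab here). The actual delimiter varies between the files above, so check each one. The helper names are hypothetical.

```python
def parse_slang_dict(lines, sep="\t"):
    """Build a {slang: formal} mapping from 'slang<sep>formal' lines."""
    mapping = {}
    for line in lines:
        slang, found, formal = line.strip().partition(sep)
        if found and slang and formal:
            # Keep the first mapping seen when sources disagree.
            mapping.setdefault(slang.strip().lower(), formal.strip().lower())
    return mapping

def normalize(text, mapping):
    """Replace every whitespace-separated token found in the mapping."""
    return " ".join(mapping.get(tok, tok) for tok in text.lower().split())

# Illustrative entries only (assumption: not copied from the linked files).
sample = ["gak\ttidak", "bgt\tbanget", "yg\tyang"]
slang = parse_slang_dict(sample)

print(normalize("Filmnya gak bagus bgt", slang))
```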
Stop Words
- https://github.com/yasirutomo/python-sentianalysis-id/blob/master/data/feature_list/stopwordsID.txt
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/stopword.txt
- https://github.com/abhimantramb/elang/tree/master/word2vec/utils/stopwords-list
I have combined the stop word lists from all of the above repositories into a single list.
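For reference, stop word removal usually amounts to filtering tokens against a set. The following is a minimal sketch with a small hand-picked set of common Indonesian stop words. `remove_stop_words` is a hypothetical helper, and a real pipeline would load the full list from the files above.

```python
def remove_stop_words(text, stop_words):
    """Return the lowercased tokens of `text` that are not stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

# Small illustrative set (assumption: a real list is much longer).
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "itu"}

print(remove_stop_words("Buku yang dibeli di toko itu", STOP_WORDS))
```

Using a `set` keeps each membership check constant-time, which matters once the stop word list grows to hundreds of entries.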
Emoticon
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/emoticon.txt
- https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-id.txt
- https://github.com/agusmakmun/SentiStrengthID/blob/master/id_dict/emoticon.txt
Acronym
- https://github.com/ramaprakoso/analisis-sentimen/blob/master/kamus/acronym.txt
- https://github.com/panggi/pujangga/blob/master/resource/sentencedetector/acronym.txt
- https://id.wiktionary.org/wiki/Lampiran:Daftar_singkatan_dan_akronim_dalam_bahasa_Indonesia#A
Indonesia Region
- https://github.com/abhimantramb/elang/blob/master/word2vec/utils/indonesian-region.txt
- https://github.com/edwardsamuel/Wilayah-Administratif-Indonesia/tree/master/csv
- https://github.com/pentagonal/Indonesia-Postal-Code/tree/master/Csv
Swear Words
Composite Words
Country
Region Words
Title of Name Words
Gender by Name
Organization Words
Number Words
Pre-trained word embedding
- https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset [Paper]
- https://github.com/Kyubyong/wordvectors
- https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
- https://github.com/deryrahman/word2vec-bahasa-indonesia
- https://sites.google.com/site/rmyeid/projects/polyglot
Train Word Embeddings Yourself
- (FastText). https://structilmy.com/2019/08/membuat-model-word-embedding-fasttext-bahasa-indonesia/
- (Word2Vec). https://yudiwbs.wordpress.com/2018/03/31/word2vec-wikipedia-bahasa-indonesia-dengan-python-gensim/
Usable Library
- Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
- Sastrawi Stemmer Bahasa Indonesia. https://github.com/har07/PySastrawi
- MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
- INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
- Typo Checker. https://github.com/mamat-rahmat/checker_id
- https://bagas.me/spacy-bahasa-indonesia.html
- https://github.com/yohanesgultom/nlp-experiments
- https://github.com/yasirutomo/python-sentianalysis-id
- https://github.com/riochr17/Analisis-Sentimen-ID
- https://github.com/yusufsyaifudin/indonesia-ner
Topic Analysis
- (Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
- (Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
- (Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- (Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
- (LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- (Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
- (CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
- (Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
- (Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
- (Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- (TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
- (TOT Library). https://github.com/ahmaurya/topics_over_time
- (Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering
Translation
Sometimes our text contains English words that need to be translated. We can detect them with the English word dictionary provided here and translate them with the Google Translate API for Python.
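The detect-then-translate idea can be sketched as follows. This is only an outline under stated assumptions: `ENGLISH_WORDS` stands in for the English word dictionary, and `stub_translate` is a hypothetical placeholder for a real translation call (for example, a wrapper around a Google Translate client), so no actual API is invoked here.

```python
def translate_english_words(text, english_words, translate):
    """Translate only the tokens that appear in the English word set."""
    out = []
    for token in text.split():
        if token.lower() in english_words:
            out.append(translate(token.lower()))
        else:
            out.append(token)  # leave non-English tokens untouched
    return " ".join(out)

# Illustrative data (assumption: stand-ins for a real dictionary and translator).
ENGLISH_WORDS = {"meeting", "deadline", "update"}
STUB = {"meeting": "rapat", "deadline": "tenggat", "update": "pembaruan"}

def stub_translate(word):
    """Placeholder translator; swap in a real API call here."""
    return STUB.get(word, word)

print(translate_english_words("Besok ada meeting soal deadline",
                              ENGLISH_WORDS, stub_translate))
```

Note that some valid Indonesian words are also English words (for example "air", which means water), so it may be safer to exclude tokens that appear in the root word list before translating.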