Supervised Indonesian lexical database development and spelling correction for social media data mining
Lexical databases are commonly used in Natural Language Processing (NLP), particularly in similarity-measuring algorithms. There are two approaches to measuring similarity: corpus-based measures, which use the Term Frequency (TF) algorithm, and knowledge-based measures, which use lexical databases. The most complete and widely used lexical database is WordNet, which is language-dependent and currently not available for Indonesian. This research attempts to create an Indonesian lexical database by developing an automatic lexical database generator framework. The corpus will contain Indonesian lexicons, each with the following attributes: part-of-speech tag, synsets, word derivative sets, and antonyms of the synsets. To create the lexical database, the framework combines web crawler technology, a spelling correction algorithm, and a word clustering algorithm. The finalized corpus will also contain slang terms, which require human supervision to revise them and define their attributes. Because it includes slang and takes less time to build than an Indonesian WordNet would, this lexical database can serve as an alternative for social media data mining.
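The lexicon attributes and the supervised handling of slang described above can be sketched roughly as follows. This is only an illustrative sketch, not the framework's actual implementation: the names `LexiconEntry`, `correct_spelling`, and `add_token` are hypothetical, the `"UNK"` tag and the 0.8 similarity cutoff are arbitrary assumptions, and Python's `difflib` stands in for the spelling correction algorithm, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from difflib import get_close_matches
from typing import Dict, Optional, Set


@dataclass
class LexiconEntry:
    """One lexicon with the attributes listed in the abstract (hypothetical layout)."""
    lemma: str
    pos_tag: str                                        # part-of-speech tag
    synset: Set[str] = field(default_factory=set)       # synonyms
    derivatives: Set[str] = field(default_factory=set)  # derived word forms
    antonyms: Set[str] = field(default_factory=set)     # antonyms of the synset
    is_slang: bool = False
    needs_review: bool = False                          # flag for human supervision


def correct_spelling(token: str, lexicon: Dict[str, LexiconEntry],
                     cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known lemma, or None if nothing is similar enough.

    difflib is a stand-in here, not the algorithm used in the research.
    """
    matches = get_close_matches(token, list(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else None


def add_token(token: str, lexicon: Dict[str, LexiconEntry]) -> LexiconEntry:
    """Map a crawled token to an entry, creating a review-flagged one if unknown."""
    token = token.lower()
    if token in lexicon:
        return lexicon[token]
    corrected = correct_spelling(token, lexicon)
    if corrected is not None:
        return lexicon[corrected]
    # Unknown and not a near-miss of a known lemma: treat as candidate slang
    # and leave its attributes for a human reviewer to define.
    entry = LexiconEntry(lemma=token, pos_tag="UNK", is_slang=True, needs_review=True)
    lexicon[token] = entry
    return entry
```

In this sketch a misspelled token such as "makn" resolves to an existing entry for "makan", while an unfamiliar token such as "gue" is stored as candidate slang and flagged for manual revision.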