Published 2003-12-31
Keywords
- lexical acquisition
- corpora
- machine learning
Abstract
Acquiring semantic knowledge to support natural language processing is non-trivial, and more so when undertaken manually. This paper presents an automatic lexical acquisition method that learns semantic properties of Kiswahili words directly from data. The method exploits Kiswahili's system of nominal and concordial agreement, which is inherently rich in semantic information, to capture the morphological and syntactic contexts of words. Nouns and verbs are then classified into clusters of semantically similar words based on this contextual encoding. The training data come from the Helsinki Corpus of Kiswahili, and the machine-learning component is implemented using the Self-Organizing Map algorithm. The proposed method offers an efficient and consistent way of augmenting lexicons with semantic information wherever electronic corpora of the language in question are available. It also provides researchers with an investigative tool for identifying dependencies within linguistic data and representing them in an understandable form for further analysis.
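To illustrate the general idea of clustering words by agreement context with a Self-Organizing Map, the following is a minimal sketch in plain Python. It is not the paper's implementation: the feature layout, the five toy nouns, the grid size, and the training schedule are all assumptions made for illustration, and the vectors are hand-built rather than extracted from the Helsinki Corpus of Kiswahili.

```python
import math
import random

# Hypothetical binary context features (an assumption, not the paper's encoding):
# [noun prefix m-, noun prefix ki-, subject concord a-, concord ki-, concord u-]
TOY_CONTEXTS = {
    "mtu":    [1, 0, 1, 0, 0],  # 'person'
    "mtoto":  [1, 0, 1, 0, 0],  # 'child'
    "kitabu": [0, 1, 0, 1, 0],  # 'book'
    "kiti":   [0, 1, 0, 1, 0],  # 'chair'
    "mti":    [1, 0, 0, 0, 1],  # 'tree'
}


def best_matching_unit(weights, vector):
    """Return the grid coordinate whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda unit: sum((w - x) ** 2 for w, x in zip(weights[unit], vector)),
    )


def train_som(vectors, rows=3, cols=3, epochs=200, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood function."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.1     # neighbourhood radius shrinks
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for unit, w in weights.items():
                grid_d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull each unit toward the input, scaled by grid proximity to the BMU.
                for i in range(dim):
                    w[i] += lr * influence * (vector[i] - w[i])
    return weights


som = train_som(list(TOY_CONTEXTS.values()))
# Map every word to the grid unit that best matches its context vector.
clusters = {word: best_matching_unit(som, vec) for word, vec in TOY_CONTEXTS.items()}
```

Words that land on the same or neighbouring grid units can be read as candidates for the same semantic cluster. In this toy setup, words with identical context vectors necessarily share a unit. With real corpus data the vectors would be far higher-dimensional and the map considerably larger.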