SPELL CHECKER

BACKGROUND

Chichewa is an important African language. According to Wikipedia, it has over 12 million native speakers spread over several countries in Eastern and Southern Africa. Despite this, Chichewa is considered an under-resourced language because of the low availability of literary texts, datasets and computational tools. Recent work has advanced the availability of datasets for text classification, machine translation, named entity recognition and automatic part-of-speech tagging (1,2). The language has several dialects, spoken in various parts of Malawi and in Zambia, where it is called Nyanja, and it is undergoing continuous change. Research indicates that school pupils’ ability to read and write in Chichewa is linked to their ability to read and write in English, which calls for more effort in supporting students’ use of Chichewa (3).

Spellcheckers are important tools that help writers produce texts free from misspellings. Spelling errors also have wider implications for producing high-quality text datasets. To date, there are no state-of-the-art spellcheckers or grammar checkers to support writers of Chichewa.

AIMS

  1. Develop a dictionary-based spellchecker for Chichewa using the Damerau-Levenshtein distance.
  2. Evaluate how well the spellchecker detects spelling mistakes.

METHODOLOGY

An existing corpus of spelling mistakes will be used (4). We will develop a spellchecker based on the Damerau-Levenshtein distance and test it using a subset of the newspaper dataset. This distance measures the difference between two strings as the minimum number of single-character edits needed to turn one into the other, where an edit is an insertion, a deletion, a substitution or a transposition of two adjacent characters. These edit operations were shown to explain over 80% of all spelling errors in English (5). The performance of this approach on Chichewa text will be investigated as part of this research.
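
To make the approach concrete, the following is a minimal sketch of how a dictionary-based check could work. It is an illustration under stated assumptions, not the project's implementation. It uses the restricted variant of the Damerau-Levenshtein distance (also known as optimal string alignment), which does not edit a substring more than once. The function names, the distance threshold and the five-word lexicon are hypothetical and chosen only for this example; a real lexicon would be built from a Chichewa corpus.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(word: str, lexicon: set[str], max_distance: int = 1) -> list[str]:
    """Return lexicon entries within max_distance of a word not in the lexicon.

    An empty list means the word is either in the lexicon (treated as
    correctly spelled) or has no close candidate.
    """
    word = word.lower()
    if word in lexicon:
        return []
    scored = [(damerau_levenshtein(word, entry), entry) for entry in lexicon]
    # Sort by distance, then alphabetically, so the output is deterministic.
    return [entry for dist, entry in sorted(scored) if dist <= max_distance]


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"moni", "zikomo", "madzi", "nyumba", "mwana"}

print(suggest("mdazi", LEXICON))  # "d" and "a" transposed
print(suggest("madzi", LEXICON))  # already in the lexicon
```

Comparing each word against every lexicon entry is adequate for a small word list. For a full-size lexicon, an index over candidate words would probably be needed to keep lookups fast.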

IMPACT

This project will contribute to computational tools for Chichewa, an under-resourced African language.

REFERENCES

  1. Adelani DI, Neubig G, Ruder S, Rijhwani S, Beukman M, Palen-Michel C, et al. MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022. 2022.
  2. Bamba Dione CM, Adelani DI, Nabende P, Alabi JO, Sindane T, Buzaaba H, et al. MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2023.
  3. Shin J, Sailors M, McClung N, Pearson PD, Hoffman JV, Chilimanjira M. The Case of Chichewa and English in Malawi: The Impact of First Language Reading and Writing on Learning English as a Second Language. Biling Res J. 2015;38(3).
  4. Taylor A. SpokenChichewaCorpus [Internet]. Zenodo; 2020. Available from: https://doi.org/10.5281/zenodo.3731994
  5. Setiadi I. Damerau-Levenshtein Algorithm and Bayes Theorem for Spell Checker Optimization. 2013 Feb.