Authors: Mohammed M. Ali; Mohammed M. Abu Shquier; Afag Slah Eldeen; Mohamed E. Zidan; Ra'ed M. Al-Khatib
Addresses: Department of Information Technology, Faculty of Computers and Information Technology, University of Tabuk, Tabuk, Kingdom of Saudi Arabia; Faculty of Computer Science and Information Technology, Jerash University, Jerash, Jordan; Sudan University of Science and Technology, Khartoum, Sudan; Faculty of Science, University of Tabuk, Tabuk, Kingdom of Saudi Arabia; Department of Computer Sciences, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan
Abstract: Mixing languages in writing and in speech is a major feature of non-English languages in developing countries. This mixed grammar is also emerging in SMS, Facebook communication and web searching, and any future attempts may further increase the footprint of such a mixed-language knowledge base. Traditional information retrieval (IR) and cross-language information retrieval (CLIR) systems do not exploit this natural human tendency, as their underlying assumption is that the user query is always monolingual. Accordingly, the majority of text collections are either monolingual or multilingual. This paper explores the trends of mixed-language querying and writing. It also shows how the corpus is validated statistically and how an Arabic lexicon can be extracted using co-occurrence statistics. Results showed that the distribution of word frequencies in the corpus is highly skewed and that the vocabulary growth is a good fit. The results on how to handle mixed queries are also summarised.
Keywords: multilingual; monolingual; multilingualism characteristic; retrieval of documents.
International Journal of Computing Science and Mathematics, 2020 Vol.11 No.3, pp.291 - 304
Accepted: 23 Oct 2017
Published online: 06 Apr 2020