Frequency and text coverage in Standard Arabic (SA) based on the Arabic Internet Corpus


Ahmed Ech-Charfi, Mohamed V University in Rabat



This study reports a revision of Sawalha's list of word lemmas extracted from the Arabic Internet Corpus, which was compiled and made available on the website of the University of Leeds. The list was revised to conform to an explicit and consistent definition of what a word lemma in Arabic is. Colloquial word types were also dropped to limit the study to Standard Arabic, in accordance with native speakers' expectations. The revised list consists of around 22,000 lemmas, whose frequencies exhibit the usual characteristics of a Zipfian distribution. Two estimates of lexical coverage are presented: one based on the whole corpus and the other on the running words represented by the revised list. The two estimates diverge significantly, with the first yielding lower figures than the second. A study of three texts representing different genres indicates that the actual coverage may lie somewhere between the two estimates.
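
The difference between the two coverage estimates can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function name `cumulative_coverage`, the toy frequency counts, and the toy corpus size are all hypothetical, chosen only to show why dividing by the size of the whole corpus gives lower figures than dividing by the running words that the list itself represents.

```python
from itertools import accumulate


def cumulative_coverage(lemma_freqs, denominator=None):
    """Return the share of running words covered by the top-k lemmas, for each k.

    lemma_freqs: token counts, one per lemma (hypothetical input format).
    denominator: total running words to divide by; if omitted, the sum of
                 the listed frequencies is used (list-based estimate).
    """
    ranked = sorted(lemma_freqs, reverse=True)  # most frequent lemma first
    total = denominator if denominator is not None else sum(ranked)
    return [running / total for running in accumulate(ranked)]


# Hypothetical toy counts, not figures from the Arabic Internet Corpus.
toy_freqs = [500, 250, 125, 75, 50]  # tokens represented by the list: 1000
toy_corpus_size = 1250               # whole corpus, including unlisted tokens

list_based = cumulative_coverage(toy_freqs)
corpus_based = cumulative_coverage(toy_freqs, toy_corpus_size)

for rank, (lb, cb) in enumerate(zip(list_based, corpus_based), start=1):
    print(f"top {rank}: list-based {lb:.1%}, corpus-based {cb:.1%}")
```

In this toy case the full list covers 100% of the running words it represents but only 80% of the whole corpus, which mirrors the direction of the divergence reported above.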