A New Practical Approach to Automatically Generate the Trending Topics in Moroccan Society using the Social Network Twitter

Abdeljalil EL ABDOULI, Larbi Hassouni, Houda Anoun

Abstract


Social networks play an increasingly important role in communication within our society. The most widely used include Twitter, Facebook, Instagram, Tumblr, Dribbble, LinkedIn, and Google+. Twitter is a popular social network on which connected users publish short messages, called “tweets”, limited to 140 characters, in which they can share thoughts and post links or images. Twitter has gained wide popularity in the Arab world, and especially in Morocco, owing to its ease of use and the services its platform offers. This information revolution in our society leads to the accumulation of a vast quantity of data that may contain a great deal of valuable information. Analyzing the tweets of Moroccan users is challenging because these users write in a variety of languages and dialects, such as Standard Arabic, Moroccan Arabic (“Darija”), the Moroccan Amazigh dialect (“Tamazight”), French, English, and more. In addition, their tweets contain many abbreviations, #hashtags, URLs, spelling mistakes, and reduced syntactic structures. In this paper, we propose a new approach to determine, from the data posted on Twitter, the subjects that interest Moroccan society, and then to locate on the map of Morocco the areas from which the tweets related to these topics originate. The proposed approach is based on a distributed system with four main components: the Hadoop framework, natural language processing, the k-means clustering algorithm, and a tool for plotting tweets on the map of Morocco. The system first extracts the tweets automatically. Next, it stores them in a distributed file system, the HDFS (Hadoop Distributed File System) of the Apache Hadoop framework. We then process and analyze this raw data with a distributed program built on Hadoop MapReduce, the Python language, and Natural Language Processing (NLP) techniques. Afterward, we use a text mining technique called TF-IDF (Term Frequency-Inverse Document Frequency) to convert the corpus generated by the previous step into a vector representation, in which each dimension of the vector corresponds to a word, and we then apply the k-means algorithm to cluster all words into topics. Finally, we plot the topics on the map of Morocco using the coordinates extracted from the tweets, in order to reveal the relation between the discovered topics and the Moroccan areas in which they were located.
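The TF-IDF and k-means steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it runs on a single machine rather than on Hadoop, uses only the Python standard library, assumes that each tweet is treated as one document to be clustered, and uses a smoothed IDF and a deterministic farthest-point initialisation chosen here for simplicity. The function names and the four sample tweets are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(tweet):
    """Lowercase a tweet; drop URLs and @mentions; keep hashtag words without '#'."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"\w+", tweet)

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised documents."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    # Smoothed IDF (an assumption; the paper does not state its exact formula).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[w] * idf[w] for w in vocab]  # one dimension per word
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_terms(vocab, centroid, n=3):
    """Words with the largest weights in a centroid, used as a topic label."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [word for _, word in ranked[:n]]

# Hypothetical sample tweets, invented for illustration only.
tweets = [
    "Raja wins the derby match tonight #football https://t.co/x",
    "Great derby match, what a football night @fan",
    "Train delays in Casablanca station again #transport",
    "Casablanca train station closed, more delays",
]
vocab, vectors = tfidf_vectors([tokenize(t) for t in tweets])
labels, centroids = kmeans(vectors, k=2)
for j, centroid in enumerate(centroids):
    print(j, top_terms(vocab, centroid))
```

In this sketch the heaviest terms of each centroid serve as a rough topic label. In a distributed setting, the tokenisation and term counting would be natural candidates for the map and reduce phases, although the abstract does not detail how the authors divide that work.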

Keywords


Hadoop framework; HDFS; Distributed program; MapReduce; Python Language; Natural Language Processing; TF-IDF; K-means
