Classify Webpages using Arab Chat Alphabet

Lamiaa Mostafa
Business Information System Department
Arab Academy for Science and Technology and Maritime Transport
Alexandria, Egypt
[email protected]

Abstract
Arabic is a very complex language, and there is still not enough computational research on Arabic language processing, even as the amount of information on the internet keeps growing. Many automated classification methods have been developed for English text; this calls for research in the field of Arabic text classification. It is important to develop models for Arabic text semantic processing even if such models seem incomplete. This paper proposes a model for Arabic text processing using a specific romanization of Arabic called the Arabic Chat Alphabet (ACA), also known as Arabizi, Arabish, Araby, or Franco-Arab. The proposed model classifies Arabic webpages using one of the data mining techniques, namely classification. The model has two main phases. In the first phase the system is trained on a set of pre-labeled webpages, from which it builds a Semantic Expansion Network (SEN). In the second phase the system classifies a new webpage by checking the similarity between a representation of the webpage and the SEN constructed in the training phase. Validated with the F1-measure, the proposed method has shown promising results.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]

General Terms
Language parsing, Text Analysis

On the internet, Arabic was not used as regularly as English because of the difficulties of using the Arabic script and because Arabic makes morphological analysis a very complex task [7]. Many Arabic users instead write their emails or chat with friends in the Roman alphabet, also known as the Arabic chat alphabet (ACA). ACA uses English letters to write Arabic words. An example of ACA is "modares" (مدرس), which means a teacher.
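To make the ACA idea concrete, the following is a minimal sketch of letter-level Arabic-to-ACA transliteration. The mapping table and function are illustrative assumptions, not the tool used in this paper; note that a pure letter map cannot restore the short vowels (the "o", "a", "e" in "modares") that ACA writers add by ear.

```python
# Minimal Arabic -> ACA transliteration sketch (hypothetical letter map).
# Digits stand in for Arabic sounds with no matching Latin letter,
# following common Arabizi practice (3 = ain, 7 = hha, 2 = hamza).
ACA_MAP = {
    "م": "m", "د": "d", "ر": "r", "س": "s",
    "ع": "3", "ح": "7", "ء": "2",
}

def to_aca(word: str) -> str:
    # Map each Arabic character; leave unknown characters unchanged.
    return "".join(ACA_MAP.get(ch, ch) for ch in word)

print(to_aca("مدرس"))  # "mdrs" -- an ACA writer would add vowels: "modares"
```

A real conversion tool, like the ones discussed later in the paper, would also handle vowel insertion and dialect-specific spellings, which a character map alone cannot.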
Keywords
Arabic; Semantic Analysis; Arabic chat alphabet; Franco-Arab.

1. INTRODUCTION
The number of pages available on the web keeps increasing. These pages vary greatly in content and purpose, which calls for directing effort toward webpage classification. Classification is "the process of assigning a web page to one or more predefined category labels", and is also known as webpage categorization [3]. However, many efforts to classify web contents into categories rely on human labor, which is impractical in many cases given the size and growth of web contents. Research on webpage classification has focused mainly on the English language; expanding it to other languages raises new research issues.

English usage on the internet is enormous, judging by the huge number of sites written in English. English is also the language used for shopping, games, news, and education; this is due to the tendency of sites to be in English rather than to any linguistic preference.

Processing any text involves some specific steps: parsing, stopword removal, and stemming [2]. Arabic has many stopwords, which can be divided into words that appear in sentences without adding meaning or indicating content, such as "so" (لذلك); words that order a sequence of sentences, such as "firstly" (اولا) and "secondly" (ثانيا); and pronouns such as "he" (هو) [3].

The rest of this paper is organized as follows. Section 2 reviews related work on web classification. Section 3 details the approach, and Section 4 discusses the experiment results. The paper concludes with a discussion and summary of the work.

2. RELATED WORK
The related work is divided into three sections. The first section is webpage classification; this section will discuss the concept of classification.
The second section is the semantic expansion network and spreading algorithm, for which different researchers' findings are listed; the third section is the Arab chat alphabet, its definition and applications.

2.1. Webpage Classification
Text classification is the process of grouping texts by similarity. Classification can be statistical, knowledge-based, or hybrid [5]. The knowledge-based approach relies on semantic resources such as machine-readable dictionaries; it is the approach used in this paper's proposed model. Most researchers nowadays focus on Arabic morphological analysis and its uses in Artificial Intelligence (AI) applications such as information retrieval and data mining. So far, no formal theory of semantics has provided a complete or consistent account of the Arabic language [7]. A model was designed in [7] that represents the Arabic language using a derivational Arabic ontology. Ontologies are a knowledge representation technique; an ontology is defined as "a set of knowledge terms, including the vocabulary, the semantic interconnections and some simple rules of inference and logic for some particular topic" [7]. The designed ontology can benefit researchers in many fields, such as automatic ontology construction, Arabic morphology, Arabic language understanding, and Arabic language development. Researchers in [4] built a prototype for English to Arabic Machine Translation (EAMT) using knowledge representation. The prototype is divided into five modules: semantic classification of Arabic words; a knowledge base of Arabic concepts; a language-independent semantic representation, using object-oriented representation (OOR); an interpreter, which carries out English morphological analysis to obtain the corresponding representation; and finally a generator, which generates the Arabic text from the semantic representation. Three techniques for Arabic text classification were tested in [3].
The techniques are Support Vector Machine (SVM) with Sequential Minimal Optimization (SMO), Naïve Bayes (NB), and J48. Classifier accuracy was measured with the percentage-split (holdout) method, and the results showed that the SMO classifier had the highest accuracy and the shortest time. An algorithm that classifies Arabic documents is created in [5]. Its idea centers on feature selection: the selected features must represent the main ideas of a document while ignoring the stopwords. The algorithms are tested on 242 Arabic abstracts from the Saudi Arabian National Computer Conference. Stemming is the process of reducing a word to its stem or root form; stems then represent the key terms of a query or document instead of the original words, e.g. "computation" might be stemmed to "compute". In [10] a stemming algorithm is designed to pre-process Arabic words and extract keywords.

2.2. Semantic Expansion Network and Spreading Algorithm
A Semantic Network (SN) is a directed graph consisting of nodes and edges. Nodes represent physical or conceptual objects, and edges represent relations between these objects. An SN is usually used as a graphical notation for representing knowledge in a problem domain [12]. A Semantic Expansion Network (SEN) is an advanced form of SN: a SEN differs from an SN in that weights are added to the edges, and the weight on an edge defines the association strength between the two concepts it connects. A SEN is also known as a Spreading Activation Network (SAN) [6].

Information priming is the general motivation for the Spreading Activation (SA) model. Information priming refers to the possibility of retrieving relevant information by retrieving information that is "associated" with it [15]. In the Spreading Activation model, concepts are expanded based on the semantic relationships between each other [6]. The SA model has been used in classification [6, 8, 14, 15, 16]. SA can be used to categorize user profiles and to suggest products in e-commerce; semantic expansion is used to customize the contents suggested to a user based on the user's profile [13]. In [8] the Spreading Activation model is used to create personalized suggestions based on the user's current personal preferences.

An SA model includes two major components: a spreading activation network and an activation spreading mechanism. The spreading mechanism includes several major steps: adjusting inputs, concept spreading, calculating outputs, and spreading termination. Two actions control the spreading: pulse spreading and termination checking. A pulse continuously spreads data to surrounding nodes until a termination check is met [13]. The pure spreading activation model has several drawbacks, which are discussed in [14]. One of them is that activation must be carefully controlled: any fault in the spreading process spreads over the whole network and is hard to recover from. To overcome these drawbacks, the Constrained Spreading Activation (CSA) model was introduced [14] as an extension of the traditional SA model [15]. CSA provides four types of constraints to limit spreading: the fan-out constraint (spreading diminishes at nodes with a high degree of connectivity to other nodes), the distance constraint (the activation effect is limited to a specific distance), the activation constraint (activation of a node depends on a threshold value), and the path constraint (spreading can only follow certain paths). [14] proposed a system called WebSCSA (Web Search by Constrained Spreading Activation); this system applies the CSA technique to retrieve information from the web using an ostensive querying approach similar to query-by-example.

2.3. Arab Chat Alphabet
"Arabic Chat Alphabet (ACA) (also known as: Arabizi, Arabish, Franco-Arab, or Franco) is a writing language for Arabic in which English letters are written instead of Arabic ones."
Basically, "it is an encoding system that represents every Arabic phoneme with the English letter that matches the same pronunciation" [1]. "ACA is a natural language that includes short vowels that are missing in traditional Arabic orthography". In a comparison between an ACA-based approach and Modern Standard Arabic, 86% of Arabic computer users confirmed that they type faster using ACA, and ACA was more accurate than the Arabic baseline [1].

The formal Arabic language is Modern Standard Arabic (MSA). MSA is used in formal settings such as news broadcasts, formal speeches, and books. However, MSA is not the Arabic spoken in everyday life; dialectal (or colloquial) Arabic is the natural spoken form for Arabic speakers. Significant phonological, morphological, syntactic, and lexical differences exist between the dialects and MSA [1]. Figure 1 shows the differences between Arabic letters and the corresponding ACA letters.

3.1. Proposed approach phases
The steps of the proposed method pass through three main modules: webpage preprocessing, the conversion tool from Arabic to ACA, and category SEN construction. Webpage preprocessing consists of a parser module. The parser is a program that breaks large units of data into smaller pieces called tokens. After the webpage is fed to the system, the parser divides it into tokens; the output of the parser phase is the list of Arabic words in the webpage. Because of the complexity of the Arabic language, there is no available tool for removing the stopwords, so the proposed model includes a manual stopword-removal step.

After the first module comes the second module, the conversion tool from Arabic to ACA. There are many tools that can convert from Arabic to ACA and vice versa, and recently some tools have appeared that help Arab users type in Arabic or ACA [11], [9].
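The parsing, stopword-removal, and term-counting steps of the preprocessing module can be sketched as follows. This is a hypothetical illustration: the stopword list and the ACA-form tokens are invented for the example, and the paper's actual stopword removal is a manual step.

```python
from collections import Counter

# Hypothetical ACA-form stopword list; the paper removes Arabic
# stopwords manually since no automatic tool is available.
STOPWORDS = {"howa", "awalan", "lezalek"}

def preprocess(page_text: str) -> Counter:
    # Parser step: break the page text into tokens.
    tokens = page_text.lower().split()
    # Stopword-removal step, then term-frequency counting.
    return Counter(t for t in tokens if t not in STOPWORDS)

freqs = preprocess("modares ketab modares howa")
print(freqs)  # Counter({'modares': 2, 'ketab': 1})
```

The resulting term-frequency table is what the later modules enrich with lexical and domain relationships.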
To prove this assumption, a tool was created to help Arabic computer users type directly in Arabic orthography, as in [9], in which automatic ACA to Arabic letter conversion was proposed [1].

Figure 1. Sample of Arabic letters and the corresponding ACA transcriptions [1].

3. PROPOSED APPROACH
The proposed approach is divided into two phases: a training phase and a classification phase. In the training phase there are three main modules, which are webpage preprocessing, the conversion tool from Arabic to ACA, and category SEN construction, as shown in "Figure 2". The input of the proposed approach is the webpage and the output is the category SEN.

Figure 2. Proposed Approach's training phase.

The terms and their frequencies are passed to the category SEN construction module. This module's inputs are a lexical database and a domain-specific knowledge base. The lexical database is used to enrich the terms with their synonym relationships (the same meaning with a different stem). The domain-specific knowledge base adds the relationships between the terms, which are divided into association, composition, and inheritance. The output of the module is the SEN of the topic or domain.

After the training phase and the construction of the category SEN, the classification phase starts. The classification phase has four modules: the webpage preprocessing module, the conversion tool from Arabic to ACA, construction of the webpage SEN, and finally the comparison between the webpage SEN and the category SEN already created in the training phase.

Figure 3. Proposed Approach's classification phase.

3.2. Approach Tools
ACA is used instead of Arabic orthography since it is easier to process and can provide information about the vowels that are missing in normal Arabic orthography [1]. The proposed model uses [9], a web-based conversion tool: the user enters the Arabic text to be converted and the webpage proposes an ACA conversion of the entered Arabic text. The output of the conversion tool module is the set of ACA terms with their frequencies.

3.3. Construction of Lexical and Semantic Relationships
For the terms extracted and preprocessed, the system examines lexical relationships among the terms by consulting a lexical database (e.g. WordNet, http://wordnet.princeton.edu). The lexical relationship used is synonymy (terms with the same meaning but a different stem). Terms with similar meanings represent a concept, as shown in "Figure 4". The system extracts the lexical relationships from the lexical database, and for each synonym word, that word's frequency is increased by one. Figure 4 shows an example of words extracted from a webpage with their frequencies.

After defining the lexical relationships, domain knowledge, which contains the relationships between the different terms in a specific domain, is used. The knowledge in the proposed approach is based on the shopping domain. The knowledge base is built with the help of a domain expert who defines the different relationships between the terms of the domain. "Figure 6" shows the terms and their frequencies after adding the domain relationships.

Figure 6. Semantic Expansion Network including lexical and semantic relationships.

4. EVALUATION AND TEST RESULTS
4.1. Experiment Design
The goal of the experiment is to validate the result of the proposed approach using the F-measure. The experiment design was divided into two parts: the first part examines the precision of the proposed approach and the second its recall.

Figure 4.
Terms extracted from Webpages and their frequencies.

After consulting the lexical database, synonymous terms are collected and represented as one term, called a Synset. A Synset is a set of terms that are similar in meaning but have different stems. Figure 5 illustrates the idea of a Synset.

Figure 5. Representing a lexical concept, where teacher (مدرس) is chosen as the representative of the tutor (معلم) Synset as it has the highest frequency.

The F-measure combines precision (the percentage of positive predictions that are correct) and recall (the percentage of positively labeled instances that were predicted as positive). Computing the F-measure requires the precision and recall values, which are obtained with the following equations (see http://wiki.uni.lu/secan-lab/docs/EugenStaabE_and_F_measure.pdf, retrieved Mar. 8, 2011):

Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
F-measure = (2 × Precision × Recall) / (Precision + Recall)    (3)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.

To carry out the experiment design, a dataset of webpages is needed; the DMOZ dataset (the Open Directory, http://www.dmoz.org/, retrieved Mar. 8, 2011) is used for this purpose. DMOZ, or the Open Directory Project, is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a global community of volunteer editors; the Open Directory was founded in the spirit of the Open Source movement and is the only major directory that is 100% free. The dataset used for the proposed model consists of 150 webpages of the shopping domain extracted from the DMOZ dataset.

4.2. Experiment Results
The F-measure is used to validate the proposed approach. Based on the 150 webpages in the training dataset and the 75 shopping webpages in the classification phase, the value of the F-measure is 0.8278, as shown in "Table 1". The model is new and cannot be compared with Arabic translation tools.

Table 1. F-Measure values

                     Precision   Recall   F-Measure
Proposed Solution    0.7533      0.9187   0.8278

5. Conclusion and future work
The paper provides a model for Arabic webpage classification using the concept of keywords and the relationships between them. The model also covers the conversion from Arabic to ACA and vice versa. This research is one of the few works based on the Arabic language, since Arabic has so far received little computational analysis; there is a need for research that focuses on this complex language, and the proposed model may encourage work in this area. So far, there is no formal theory of semantic analysis for Arabic knowledge; the model combines the conversion of Arabic to ACA with enriching the terms not only with their frequencies but also with lexical and domain knowledge. Until now, no research has provided a complete and consistent dictionary of Arabic language phenomena; however, it is important to develop models for semantic processing of Arabic. Future work could start with creating knowledge bases for different domains: the model should be tested in more than one domain, a knowledge base containing all available ACA terms for Arabic webpages should be created, and finally the proposed model's experiment should be compared with other Arabic classification models.

6. References
[1] Elmahdy, M., Gruhn, R., Abdennadher, S., and Minker, W. 2011. Rapid phonetic transcription using everyday life natural chat alphabet orthography for dialectal Arabic speech recognition. ICASSP 2011, IEEE.
[2] Mostafa, L. 2011. Webpage keyword extraction using term frequency. 3rd IEEE International Conference on Information Management and Engineering (IEEE ICIME 2011).
[3] Al-Shargabi, B., Al-Romimah, W., and Olayah, F. 2011. A comparative study for Arabic text classification algorithms based on stop words elimination. ISWSA'11, April 18–20, 2011, Amman, Jordan, ACM.
[4] Aref, M., Al-Mulhem, M., and Al-Muhtaseb, H. 1992. English to Arabic machine translation: a critical review and suggestions for development. King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.
[5] Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., and Ababneh, A. 2009. An enhanced text-classification-based Arabic information retrieval system. IGI Global.
[6] Sharifian, F., and Samani, R. 1997. Hierarchical spreading of activation. In F. Sharifian, ed., Proc. of the Conference on Language, Cognition, and Interpretation, IAU Press.
[7] Hoseini, M. 2011. Modeling the Arabic language through verb based ontology. International Journal of Academic Research, Vol. 3, No. 3, May 2011, II Part.
[8] Nilas, N., Nilas, P., and Masakul, K. 2007. A spreading activation approach for e-commerce site selection system. Proceedings of the International Conference on e-Business, 2007.
[9] Cairo Microsoft Innovation Lab. 2009. Microsoft Maren. http://www.microsoft.com/middleeast/egypt/cmic/maren
[10] Omer, M., and Long, M. 2009. Stemming algorithm to classify Arabic documents. Symposium on Progress in Information & Communication Technology 2009.
[11] Yamli. http://www.yamli.com/ar/ [Last viewed: 12-2-2012].
[12] Shetty, R., Riccio, P., and Quinqueton, J. 2009. Extended semantic network for knowledge sharing. Emerging Trends in Computer Science and Engineering, ETICSE06, 24 Mar. 2009.
[13] Liang, T., Yang, Y., Chen, D., and Ku, Y. 2009. A semantic-expansion approach to personalized knowledge recommendation. Decision Support Systems 45(3), 401–412.
[14] Crestani, F., and Lee, P. 2000. Searching the web by constrained spreading activation. Information Processing and Management, 36(4), 585–605.
[15] Crestani, F. 1997. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review 11(6), 453–482.
[16] Suchal, J. 2007. Caching spreading activation search. In Bieliková, M., ed., IIT.SRC 2007: Student Research Conference, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, 151–155.
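As a closing note on the evaluation in Section 4, the reported F-measure can be reproduced from the precision and recall in Table 1 using the standard harmonic-mean formula. This is a sanity-check sketch, not part of the original experiment code.

```python
# Reproduce the F-measure in Table 1 from the reported precision and recall.
precision, recall = 0.7533, 0.9187

# Equation (3): harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 4))  # 0.8278
```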