
Classify Webpages using Arab Chat Alphabet
Lamiaa Mostafa
Business Information System Department
Arab Academy for Science and Technology and Maritime Transport
Alexandria, Egypt
[email protected]
Abstract
The Arabic language is highly complex, and there is still not enough computational research on Arabic language processing, even as the amount of information on the internet keeps growing. Many automated classification methods have been applied to English text, which calls for research in the field of Arabic text classification.
It is important to develop models for Arabic text semantic processing even if such models seem incomplete. This paper proposes a model for Arabic text processing using a specific English-letter writing of Arabic called the Arabic Chat Alphabet (ACA), also known as Arabizi, Arabish, Araby, or Franco-Arab. The proposed model classifies Arabic webpages using classification, one of the data mining techniques. The model has two main phases. In the first phase the system is trained on a set of pre-labeled web pages, from which it builds a Semantic Expansion Network (SEN). In the second phase the system classifies a new web page by checking the similarity between a representation of the web page and the SEN constructed in the training phase. Validated with the F1-measure, the proposed method shows promising results.
Categories and Subject Descriptors
I.2.7 [Natural Language Processing]
General Terms
Language parsing, Text Analysis
Keywords
Arabic; Semantic Analysis; Arabic chat alphabet; Franco-Arab.
1. INTRODUCTION
The number of pages available on the web keeps increasing. These pages vary greatly in content and purpose, which calls for directing effort toward web page classification. Classification is "the process of assigning a web page to one or more predefined category labels", also known as web page categorization [3]. However, many efforts to classify web content into categories rely on human labor, which is impractical in many cases given the size and growth of web content. Research on webpage classification has focused mainly on the English language; expanding to other languages raises new issues for research.
On the internet, Arabic was not used as regularly as English because of the difficulties of using the Arabic script, and because Arabic makes morphological analysis a very complex task [7]. Arabic users write their emails or chat with friends in the Roman alphabet, also known as the Arabic chat alphabet (ACA). ACA uses English letters to write Arabic words; for example, the ACA word "modares" (مدرس) means a teacher.
English usage on the internet is huge, as reflected in the large number of sites written in English. English is also the language used for shopping, games, news, and educational content; this is due to the tendency of sites to be in English, rather than to any linguistic preference.
Processing any text involves some specific steps: parsing, stopword removal, and stemming [2]. Arabic has many stopwords, which are either words that appear in sentences without carrying any meaning or indication about the content, such as (so لذلك, with بالإشارة), sequence markers such as (firstly اوال, secondly ثانيا), or pronouns such as (he هو) [3].
The rest of this paper is organized as follows. Section 2 reviews related work on web classification. Section 3 details the approach, and Section 4 discusses the experiment results. The paper concludes with a discussion and summary of the work.
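The preprocessing steps described above (parsing into tokens, then stopword removal) can be sketched as follows; the input text and the small ACA stopword set are purely illustrative assumptions, not the paper's actual data:

```python
# Minimal sketch of text preprocessing: tokenize, then drop stopwords.
# The stopword set is a tiny hypothetical sample, not a full Arabic list.

def tokenize(text):
    """Parse a text into lowercase word tokens."""
    return [tok for tok in text.lower().split() if tok]

def remove_stopwords(tokens, stopwords):
    """Filter out tokens that carry no content meaning."""
    return [tok for tok in tokens if tok not in stopwords]

STOPWORDS = {"fel", "men", "howa", "heya"}  # hypothetical ACA stopwords

tokens = tokenize("howa modares fel madrasa")
content = remove_stopwords(tokens, STOPWORDS)
print(content)  # -> ['modares', 'madrasa']
```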
2. RELATED WORK
The related work is divided into three sections. The first covers webpage classification and discusses the concept of classification. The second covers the semantic expansion network and the spreading algorithm, listing different researchers' findings. The third covers the Arab chat alphabet, its definition, and its applications.
2.1. Webpage Classification
Text classification is the process of finding groups of similar texts. Classification can be statistical, knowledge-based, or hybrid [5]. The knowledge-based approach relies on semantic resources such as machine-readable dictionaries, and it is the approach used in the model proposed in this paper.
Most researchers nowadays focus on Arabic morphological analysis and its uses in different Artificial Intelligence (AI) applications such as information retrieval and data mining. So far, no formal theory of semantics has provided a complete or consistent account of the Arabic language [7].
A model was designed in [7] that represents the Arabic language using a derivational Arabic ontology. An ontology is a knowledge representation technique, defined as "a set of knowledge terms, including the vocabulary, the semantic interconnections and some simple rules of inference and logic for some particular topic" [7]. The designed ontology can benefit researchers in fields such as automatic ontology construction, Arabic morphology, Arabic language understanding, and Arabic language development.
Researchers in [4] built a prototype for English to Arabic Machine Translation (EAMT) using knowledge representation. The prototype is divided into five modules: semantic classification of Arabic words; a knowledge base of Arabic concepts; a language-independent semantic representation using object-oriented representation (OOR); an interpreter, which carries the English morphology over to the corresponding language; and finally a generator, which produces the Arabic text from the semantic representation.
Three techniques for Arabic text classification were tested in [3]: Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO), Naïve Bayes (NB), and J48. Classifier accuracy was measured with the percentage-split (holdout) method, and the results showed that the SMO classifier had the highest accuracy and the shortest time.
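The percentage-split (holdout) evaluation used in [3] works with any classifier; in this sketch a toy keyword-matching classifier stands in for SMO/NB/J48, and the tiny labeled dataset is purely illustrative:

```python
# Sketch of percentage-split (holdout) evaluation: train on one part of the
# labeled data, measure accuracy on the held-out remainder. The classifier
# here is a toy keyword matcher standing in for SMO/NB/J48.

def train(docs):
    """Collect, per category, the set of words seen in its training docs."""
    model = {}
    for words, label in docs:
        model.setdefault(label, set()).update(words)
    return model

def classify(model, words):
    """Assign the category whose word set overlaps the document most."""
    return max(model, key=lambda label: len(model[label] & set(words)))

def holdout_accuracy(data, split=0.66):
    """Percentage split: first `split` fraction trains, the rest tests."""
    cut = int(len(data) * split)
    model = train(data[:cut])
    test = data[cut:]
    correct = sum(1 for words, label in test if classify(model, words) == label)
    return correct / len(test)

data = [  # hypothetical pre-labeled documents
    ({"price", "buy", "offer"}, "shopping"),
    ({"match", "goal", "team"}, "sports"),
    ({"buy", "discount"}, "shopping"),
    ({"team", "league"}, "sports"),
    ({"offer", "discount", "buy"}, "shopping"),
    ({"goal", "league", "match"}, "sports"),
]
print(holdout_accuracy(data))  # -> 1.0 on this toy data
```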
An algorithm that classifies Arabic documents was created in [5]. Its idea centers on feature selection: the selected features must represent the main ideas of a document while ignoring the stopwords. The algorithm was tested on 242 Arabic abstracts from the Saudi Arabian National Computer Conference.
Stemming is the process of reducing a word to its stem or root form, so that stems rather than the original words represent the key terms of a query or document; e.g., "computation" might be stemmed to "compute". In [10] a stemming algorithm is designed to pre-process Arabic words and extract keywords.
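A minimal suffix-stripping stemmer in the spirit described above might look like the following; the suffix list is an illustrative assumption and far simpler than a real Arabic (or English) stemmer such as the one in [10], and the resulting stems need not be dictionary words:

```python
# Toy suffix-stripping stemmer: remove one known suffix if the remaining
# stem stays long enough. The suffix list is illustrative only; real
# stemmers (especially for Arabic morphology) need far richer rules.

SUFFIXES = ["ation", "ing", "ed", "s"]  # checked longest-first

def stem(word):
    """Strip one matching suffix if the remaining stem has >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("computation"))  # -> "comput"
print(stem("teaching"))     # -> "teach"
print(stem("books"))        # -> "book"
```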
The spreading activation (SA) model has been used in classification [6, 8, 14, 15, 16]. SA can be used for categorizing user profiles and for product suggestion in e-commerce, where semantic expansion customizes the content suggested to a user based on the user's profile [13].
The Constrained Spreading Activation (CSA) model was proposed by [15] as an extension of the traditional Spreading Activation (SA) model. [14] proposed a system called WebSCSA (Web Search by Constrained Spreading Activation), which applies the CSA technique to retrieve information from the web using an ostensive querying approach similar to query by example. In [8] the spreading activation model was used to create personalized suggestions based on the user's current personal preferences.
The SA model includes two major components: a spreading activation network and an activation spreading mechanism. The spreading mechanism includes several major steps: adjusting inputs, concept spreading, calculating outputs, and spreading termination. Two actions control the spreading: pulse spreading and a termination check. A pulse continuously spreads activation to surrounding nodes until the termination check is met [13].
The pure spreading activation model has several drawbacks, discussed in [14]. One is that activation must be carefully controlled: any fault in the spreading process spreads over the whole network and is hard to recover from. To overcome these drawbacks, the Constrained Spreading Activation (CSA) model was introduced [14]. CSA provides four types of constraints to limit spreading: a fan-out constraint (spreading diminishes at nodes with a high degree of connectivity to other nodes), a distance constraint (the activation effect is limited to a specific distance), an activation constraint (activation of a node depends on a threshold value), and finally a path constraint (spreading can only follow certain paths).
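The pulse mechanism and two of the CSA constraints above can be sketched in a few lines; the network, edge weights, threshold, and pulse count are all illustrative assumptions rather than the papers' actual parameters:

```python
# Sketch of constrained spreading activation: each pulse spreads activation
# from active nodes along weighted edges. An activation threshold and a
# fan-out penalty (dividing by out-degree) constrain the spread.

def spread(network, activation, threshold=0.1, pulses=2):
    """network: {node: {neighbor: weight}}; activation: {node: value}."""
    for _ in range(pulses):
        new = dict(activation)
        for node, value in activation.items():
            if value < threshold or node not in network:
                continue  # activation constraint: weak nodes do not spread
            fan_out = len(network[node])
            for neighbor, weight in network[node].items():
                # fan-out constraint: highly connected nodes spread less
                new[neighbor] = new.get(neighbor, 0.0) + value * weight / fan_out
        activation = new
    return activation

network = {  # hypothetical SEN fragment for a shopping category
    "buy": {"price": 0.8, "offer": 0.6},
    "price": {"discount": 0.9},
}
result = spread(network, {"buy": 1.0})
# activation reaches "discount" only indirectly, through "price"
```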
2.2. Semantic Expansion Network and Spreading Algorithm
A Semantic Network (SN) is a directed graph consisting of nodes and edges. Nodes represent physical or conceptual objects, and edges represent relations between these objects. An SN is usually used as a graphical notation for representing knowledge in a problem domain [12].
A Semantic Expansion Network (SEN) is an advanced form of SN. A SEN differs from an SN in that weights are added to the edges; the weight on an edge defines the association strength between the two concepts it connects. A SEN is also known as a Spreading Activation Network (SAN) [6]. The general motivation for introducing the Spreading Activation (SA) model is information priming: the possibility of retrieving relevant information by retrieving information that is "associated" with it [15]. In the spreading activation model, concepts are expanded based on the semantic relationships between them [6].
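A SEN as described above is simply a weighted directed graph; a minimal representation (with hypothetical shopping-domain concepts and weights) could be:

```python
# A Semantic Expansion Network as an adjacency map: each edge carries a
# weight giving the association strength between two concepts.
# The concepts and weights below are hypothetical examples.

class SEN:
    def __init__(self):
        self.edges = {}  # {concept: {related_concept: weight}}

    def add_relation(self, a, b, weight):
        """Add a weighted edge a -> b (association strength in [0, 1])."""
        self.edges.setdefault(a, {})[b] = weight

    def strength(self, a, b):
        """Association strength from a to b (0 if unrelated)."""
        return self.edges.get(a, {}).get(b, 0.0)

sen = SEN()
sen.add_relation("shop", "buy", 0.9)
sen.add_relation("buy", "price", 0.7)
print(sen.strength("shop", "buy"))   # -> 0.9
print(sen.strength("shop", "price")) # -> 0.0
```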
2.3. Arab Chat Alphabet
The Arabic Chat Alphabet (ACA), also known as Arabizi, Arabish, Franco-Arab, or Franco, is a way of writing Arabic in which English letters are used instead of Arabic ones. Basically, "it is an encoding system that represents every Arabic phoneme with the English letter that matches the same pronunciation" [1].
ACA is a natural writing system that includes the short vowels missing from traditional Arabic orthography. In a comparison between an ACA-based approach and Modern Standard Arabic, 86% of Arabic computer users confirmed that they type faster using ACA, and the ACA-based approach was more accurate than the Arabic baseline [1].
The formal Arabic language is Modern Standard Arabic (MSA). MSA is used in formal settings such as news broadcasts, formal speeches, and books. However, MSA is not the Arabic spoken in everyday life; dialectal (or colloquial) Arabic is the natural spoken form for Arabic speakers. Significant phonological, morphological, syntactic, and lexical differences exist between the dialects and MSA [1]. Figure 1 describes the differences between the Arabic letters and the ACA letters.
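As a sketch of the idea behind such an encoding, a character-level Arabic-to-ACA transliterator might look like the following; the mapping table is a small illustrative subset (numerals such as 3 for ع and 7 for ح are common ACA conventions), not the full table of Figure 1:

```python
# Toy Arabic-to-ACA transliterator: map each Arabic character to a Latin
# letter or numeral. Only a small illustrative subset of the mapping is
# shown; unknown characters pass through unchanged.

ARABIC_TO_ACA = {
    "م": "m", "د": "d", "ر": "r", "س": "s", "ب": "b", "ا": "a",
    "ع": "3",  # common ACA convention for 'ain
    "ح": "7",  # common ACA convention for Haa
    "ء": "2",  # common ACA convention for hamza
}

def to_aca(arabic_text):
    """Transliterate Arabic text to ACA, character by character."""
    return "".join(ARABIC_TO_ACA.get(ch, ch) for ch in arabic_text)

print(to_aca("مدرس"))  # -> "mdrs" (a fuller tool would also insert vowels,
                       #    yielding e.g. "modares")
```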
Figure 1. Sample of Arabic letters and the corresponding ACA transcriptions [1].
3. PROPOSED APPROACH
The proposed approach is divided into two phases: a training phase and a classification phase. The training phase has three main modules: webpage preprocessing, a conversion tool from Arabic to ACA, and construction of the category SEN, as shown in "Figure 2". The input of the proposed approach is a webpage and the output is the category SEN. Figure 2 shows the proposed approach in the training phase.
Figure 2. Proposed approach's training phase.
3.1. Proposed Approach Phases
The proposed method passes through the three main modules: webpage preprocessing, the Arabic-to-ACA conversion tool, and construction of the category SEN. Webpage preprocessing consists of a parser module. The parser is a program that breaks large units of data into smaller pieces called tokens. After a webpage is fed to the system, the parser divides it into tokens; the output of the parser phase is the list of Arabic words in the webpage. Because of the complexity of the Arabic language there is no available tool for removing the stopwords, so the proposed model includes a manual step for removing them.
After the first module, the second module is the conversion tool from Arabic to ACA. There are many tools that can convert from Arabic to ACA and vice versa, and recently some tools have appeared that help Arab users type Arabic or ACA [11], [9]. To prove the assumption, a tool was created to help Arabic computer users type directly in Arabic orthography, as in [9], in which automatic ACA-to-Arabic letter conversion was proposed [1].
The terms and their frequencies are then passed to the construct-category-SEN module. This module's inputs are a lexical database and a domain-specific knowledge base. The lexical database is used to enrich the terms with their synonym relationships (same meaning, different stem). The domain-specific knowledge base adds the relationships between the terms, which are divided into association, composition, and inheritance. The output of the module is the SEN of the topic or domain.
After the training phase and the construction of the category SEN, the classification phase starts. The classification phase has four modules: webpage preprocessing, the Arabic-to-ACA conversion tool, construction of the webpage SEN, and finally a comparison between the webpage SEN and the category SEN already created in the training phase. Figure 3 shows the classification phase of the proposed model.
Figure 3. Proposed approach's classification phase.
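The final comparison step is not pinned to a specific measure in the paper; one plausible sketch assumes cosine similarity between term-weight vectors derived from the webpage SEN and each category SEN (all terms and weights below are hypothetical):

```python
# Sketch of the classification step: compare the webpage's term-weight
# vector against each category SEN's vector and pick the most similar
# category. Cosine similarity is an assumed choice; the paper only says
# the similarity is "checked".

import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(webpage_vec, category_sens):
    """Return the category whose SEN vector is most similar to the page."""
    return max(category_sens, key=lambda c: cosine(webpage_vec, category_sens[c]))

category_sens = {  # hypothetical category SEN term weights
    "shopping": {"buy": 0.9, "price": 0.8, "offer": 0.6},
    "sports": {"match": 0.9, "goal": 0.8, "team": 0.7},
}
page = {"buy": 0.7, "price": 0.5, "team": 0.1}
print(classify(page, category_sens))  # -> "shopping"
```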
3.2. Approach Tools
ACA is used instead of Arabic orthography because it is easier to process and it can provide information about the vowels that are missing from normal Arabic orthography [1]. The proposed model uses [9], a web-based tool that converts from Arabic to ACA: the user enters the Arabic text to be converted, and the webpage proposes the ACA conversion of the entered text. The output of the conversion-tool module is the set of ACA terms with their frequencies.
3.3. Construction of Lexical and Semantic Relationships
For the extracted and preprocessed terms, the system examines the lexical relationships among the terms by consulting a lexical database (e.g. WordNet, http://wordnet.princeton.edu). The lexical relationship used is synonymy (terms with the same meaning but different stems). Terms with similar meanings represent one concept, as shown in "Figure 4". The system extracts the lexical relationships from the lexical database, and for each synonym word the word's frequency is increased by one. Figure 4 shows an example of words extracted from a webpage with their frequencies.
Figure 4. Terms extracted from webpages and their frequencies.
After consulting the lexical database, the synonymous terms are collected and represented as one term, called a Synset. A Synset is a set of terms that are similar in meaning but have different stems. Figure 5 illustrates the idea of a Synset.
Figure 5. Representing a lexical concept, where teacher (مدرس) is chosen as the representative of the tutor (معلم) Synset because it has the highest frequency.
After the lexical relationships are defined, domain knowledge, which contains the relationships between the different terms in a specific domain, is used. The knowledge in the proposed approach is based on the shopping domain. The knowledge base is built with the help of a domain expert who can define the different relationships between the terms of the domain. "Figure 6" shows the terms and their frequencies after adding the domain relationships.
Figure 6. Semantic Expansion Network including lexical and semantic relationships.
4. EVALUATION AND TEST RESULTS
4.1. Experiment Design
The goal of the experiment is to validate the results of the proposed approach using the F-measure. The experiment design is divided into two parts: the first part examines the precision of the proposed approach, and the second its recall.
The F-measure combines precision (the percentage of positive predictions that are correct) and recall (the percentage of positively labeled instances that were predicted as positive). To calculate the F-measure, the precision and recall values must be computed first. The following equations are used for this purpose (see http://wiki.uni.lu/secan-lab/docs/EugenStaabE_and_F_measure.pdf, retrieved Mar. 8, 2011):

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
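Precision, recall, and the F-measure of equations (1)-(3) can be computed as follows; the confusion counts used in the example are illustrative, not the paper's results:

```python
# F-measure from precision and recall, following equations (1)-(3).
# The tp/fp/fn counts below are illustrative only.

def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of positively labeled instances predicted as positive."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

p = precision(tp=75, fp=25)  # 0.75
r = recall(tp=75, fn=7)      # ~0.9146
print(round(f_measure(p, r), 4))
```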
F-measure = 2 × Precision × Recall / (Precision + Recall)    (3)
The experiment requires a dataset of webpages. The DMOZ (Open Directory, http://www.dmoz.org, retrieved Mar. 8, 2011) dataset is used for this purpose. DMOZ, or the Open Directory Project, is the largest and most comprehensive human-edited directory of the web. It is constructed and maintained by a global community of volunteer editors. The Open Directory was founded in the spirit of the Open Source movement and is the only major directory that is 100% free. The dataset used for the proposed model consists of 150 webpages from the shopping domain extracted from the DMOZ dataset.
4.2. Experiment Results
The F-measure is used to validate the proposed approach. Based on the 150 webpages in the training dataset and the 75 shopping webpages in the classification phase, the F-measure value is 0.8278, as shown in "Table 1". The model is new, so it cannot be compared with the Arabic translation tools.
Table 1. F-Measure values
Proposed Solution: Precision = 0.7533, Recall = 0.9187, F-Measure = 0.8278
5. Conclusion and future work
This paper provides a model for classifying Arabic web pages using the concept of keywords and the relationships between them. The model also covers the conversion from the Arabic language to ACA and vice versa. This research is one of the few based on the Arabic language, which has so far received little computational analysis; more research needs to focus on this complex language, and the proposed model may encourage work in this area.
So far there is no formal theory of semantic analysis for Arabic knowledge. The model combines the conversion of Arabic to ACA with enriching the terms not only with their frequencies but also by consulting lexical and domain knowledge. Until now, no research has provided a complete and consistent dictionary of the Arabic language's phenomena; nevertheless, it is important to develop models for the semantic processing of Arabic.
Future work could start with creating knowledge bases in different domains, since the model should be tested in more than one domain, and with building a knowledge base that contains all available ACA terms for Arabic webpages; finally, the proposed model's experiment should be compared with other Arabic classification models.
6. References
[1] Elmahdy, M., Gruhn, R., Abdennadher, S., and Minker, W. 2011. Rapid phonetic transcription using everyday life natural chat alphabet orthography for dialectal Arabic speech recognition. ICASSP 2011, IEEE 978-1-4577-0539-7/11.
[2] Mostafa, L. 2011. Webpage keyword extraction using term frequency. 3rd IEEE International Conference on Information Management and Engineering (IEEE ICIME 2011).
[3] Al-Shargabi, B., AL-Romimah, W., and Olayah, F. 2011. A comparative study for Arabic text classification algorithms based on stop words elimination. ISWSA'11, April 18-20, 2011, Amman, Jordan, ACM.
[4] Aref, M., Al-Mulhem, M., and Al-Muhtaseb, H. 1992. English to Arabic machine translation: a critical review and suggestions for development. King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.
[5] Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., and Ababneh, A. 2009. An enhanced text-classification-based Arabic information retrieval system. IGI Global.
[6] Sharifian, F., and Samani, R. 1997. Hierarchical spreading of activation. In Farzad Sharifian, ed., Proc. of the Conference on Language, Cognition, and Interpretation, IAU Press, pp. 110, 1997.
[7] Hoseini, M. 2011. Modeling the Arabic language through verb based ontology. International Journal of Academic Research, Vol. 3, No. 3, May 2011, Part II.
[8] Nilas, N., Nilas, P., and Masakul, K. 2007. A spreading activation approach for e-commerce site selection system. Proceedings of the International Conference on e-Business, 2007.
[9] Cairo Microsoft Innovation Lab. 2009. Microsoft Maren. http://www.microsoft.com/middleeast/egypt/cmic/maren
[10] Omer, M., and Long, M. 2009. Stemming algorithm to classify Arabic documents. Symposium on Progress in Information & Communication Technology 2009.
[11] http://www.yamli.com/ar/ [Last viewed: 12-2-2012].
[12] Shetty, R., Riccio, P., and Quinqueton, J. 2009. Extended semantic network for knowledge sharing. Emerging Trends in Computer Science and Engineering, ETICSE06, Version 1, 24 Mar. 2009.
[13] Liang, T., Yang, Y., Chen, D., and Ku, Y. 2009. A semantic-expansion approach to personalized knowledge recommendation. Decision Support Systems 45(3), 401-412.
[14] Crestani, F., and Lee, P. 2000. Searching the web by constrained spreading activation. Information Processing and Management, 36(4):585-605.
[15] Crestani, F. 1997. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review 11(6) (December 1997), 453-482.
[16] Suchal, J. 2007. Caching spreading activation search. In Bieliková, M., ed.: IIT.SRC 2007: Student Research Conference, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava (April 2007), 151-155.