Multilingual Information Retrieval in World Wide Web

Xiaoda Zhang and James N. K. Liu
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Eric Atwell
Department of Computer Studies, University of Leeds, UK
Abstract
This article addresses two topics. (1) The design of an information retrieval (IR) tool, the Multilingual Information Retrieval Tool Hierarchy (MIRTH), which works with virtual corpora on the World Wide Web (also known as the Web or WWW). It is motivated by the desire to create a search engine that retrieves information by accessing a virtual corpus. (2) The implementation of a general model of multilingual retrieval for Web searching, which copes with both English and Chinese information retrieval techniques. The paper first addresses some problems of the World Wide Web relating to information retrieval and then introduces some existing information retrieval tools on the Web. The need to create a multilingual search engine is discussed. Next, the general hierarchy of the MIRTH search engine is illustrated, and techniques for setting up a MIRTH search engine are explored. These include building data files, the structure of the search engine [Gilster, 1996], and constraints on query syntax. In addition, the means of creating the MIRTH multilingual search engine for Chinese (and English) information retrieval are described, and some examples of using the MIRTH search engine are given.

1. Introduction
In a large distributed hypertext system like the World Wide Web, users find information by following hypertext links. As the size of the system increases, users must traverse more and more links to find what they are looking for, which is a very demanding task. One comprehensive way to cope with this is to develop a computer program that helps people explore the Web: a search engine. All search engines provide a query frame in which the user can key in a search requirement in the form of keywords, phrases, or a regular expression. The engine then goes through Web pages, locates documents within the entire Web, and returns the selected hits in the format of WWW documents. Examples of this kind of search engine are Infoseek, Yahoo, WebCrawler, Excite, ALIWEB, CUSI, and W3 Catalogue [Graham, 1995].

1.1 Advantages of Search Engines
The basic function of a search engine is to help users retrieve information from the WWW. The two most impressive features of search engines are the time they save when searching the Web and their simplicity of use. For example, in Netscape Navigator, once a user has keyed a keyword, phrase, or regular expression into the frame of a search engine, he or she only needs to press 'Enter', and the desired information is located in a very short time in the same Netscape browser.

1.2 Disadvantages of Existing Search Engines
Each search engine has its weaknesses. The shortcomings common to all of them can be summarized as follows:
• Most of them can only search in English. This would prevent, for instance, a linguist from retrieving materials about the Chinese language, or theory written in Chinese. Hence there is a need to establish a multilingual search engine.
• Most of them are general search tools focusing on general information retrieval, which might not be efficient for searching within a research area. Although some can search a very wide database, they cannot satisfy users who need very detailed information on professional topics, for example Chinese grammar.

1.3 Objectives of Research
The present paper was therefore motivated by the desire to create a search engine that retrieves information from the World Wide Web (WWW) using a linguistics virtual corpus. Specific attention has been paid to multilingual search facilities, and the study includes a discussion of the use of English and Chinese language tools. The intention is to create a multilingual information retrieval tool that supports searching in specified areas, such as the fields of linguistics and literature. As a search engine, it retrieves information in both English and Chinese. This research comprises two major tasks:
• To organize a "virtual corpus" of computer-based text-training materials for linguistics and literature available on the World Wide Web. The corpus contains multilingual information about human language learning and training, grammar, and language modeling research. Users can find extensive materials within the corpus, such as novels, poetry, prose, and various on-line electronic magazines.
• To set up a multilingual search tool that can handle queries in both English and Chinese. This tool can be used to scan the contents of the above corpus on line.

2. Methodologies of Creating an Information Retrieval Tool
Creating an information retrieval tool involves several methodologies, for example building up special corpora, creating a unique data file, and defining special query syntax rules. These are described briefly in the following sections.

2.1 Build up Special Corpora
The WWW is a huge storeroom in which a variety of information materials have been placed. As most WWW documents are written in HTML, in plain text, the WWW can be seen as a corpus [Atwell, 1993], [Liu and Lee, 1997]. There are
already billions of documents on the WWW, and the number is growing rapidly. As all of these hypertext documents are dispersed over the Web, finding materials in a particular research area is a time-consuming task. It is therefore wise to set up an access point for relevant materials, such as a Computing corpus. (Most documents on the Web are written in the Hypertext Markup Language, HTML for short.)

2.2 Set up Web Data File
For Web searching, the common search object is a database [Ford, 1995]. However, this requires a huge amount of storage space, and the search process takes a long time to complete. A special approach based on MIRTH is therefore devised for organizing the database. Within MIRTH there are two kinds of search objects: virtual corpora [Butler, 1992] and data files. A virtual corpus is dynamic. It differs from a traditional corpus in that its contents change from time to time [Butler, 1992]. A virtual corpus might not be stored on the user's computer and, furthermore, is potentially subject to unforeseen changes as remote sources are modified without central control. Hypertext resources on the Web are seen as collections of virtual corpora. A Web data file, also called a data set, is a collection of the Uniform Resource Locators (URLs) of Web pages. It supports the whole process of information retrieval from the Web. To set up Web data files, three issues are discussed:
• Setting up a data file to save space
• Specifying a structure of a data file
• Defining applications for a data file

A Principle of Setting up the Data File
The purpose of creating a data file, instead of using the contents of the hypertext pages, is to save storage space. The resources widely available on the Web are clearly too large to be saved on one machine. One way to serve this purpose is to store only keywords in the data file, together with a few lines explaining the content of each document. An example of the data file is given in Fig. 1 as follows:
<LI><a href= "http://www.scsn.net/~ics/"> Intelligent Computer program
Solutions WWW Site </a> <h4> Introduction of Computer programs</h4>
<LI><a href= "http://www.education.siggraph.org/theses/theindex.htm">
M.S. and Ph.D Computer Graphics Theses</a> <h4><I>This directory contains
the ASCII text files for all of the Computer Graphics Thesis and
Dissertation Abstracts Compendiums published in Computer</I></h4>
<LI><a href="http://www.copfer.com/search.htm"> Computer & Associates </a>
<h4><I>Computer Based Training Internet Services Web Design Touch Screen
Kiosks Electronic Catalogs Java Shock wave Multimedia </I></h4>
Figure 1: A data file example with key words and explanations
When MIRTH searches the data file for the keyword 'computer', the underlined part of the text (keywords and main explanation) is shown on the screen as the result of the search, as can be seen in Fig. 2.
Figure 2: Search Results on the Data File
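A minimal Python sketch may make this keyword search over the data file concrete. It assumes the data file holds one entry per <LI> item in the layout of Fig. 1. The names parse_data_file and search are illustrative and are not part of MIRTH, and ranking hits by the number of keyword occurrences is our assumption, since the ranking method is not specified here.

```python
import re

# Entries copied from Fig. 1: a link, its title, and a short explanation.
DATA_FILE = """\
<LI><a href= "http://www.scsn.net/~ics/"> Intelligent Computer program
Solutions WWW Site </a> <h4> Introduction of Computer programs</h4>
<LI><a href= "http://www.education.siggraph.org/theses/theindex.htm">
M.S. and Ph.D Computer Graphics Theses</a> <h4><I>This directory contains
the ASCII text files for all of the Computer Graphics Thesis and
Dissertation Abstracts Compendiums published in Computer</I></h4>
<LI><a href="http://www.copfer.com/search.htm"> Computer & Associates </a>
<h4><I>Computer Based Training Internet Services Web Design Touch Screen
Kiosks Electronic Catalogs Java Shock wave Multimedia </I></h4>
"""

# One entry: <LI><a href="URL"> title </a> <h4> explanation </h4>
ENTRY = re.compile(
    r'<LI>\s*<a\s+href=\s*"([^"]+)"\s*>(.*?)</a>\s*<h4>(.*?)</h4>',
    re.IGNORECASE | re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def clean(fragment):
    """Drop inner tags such as <I> and collapse runs of whitespace."""
    return " ".join(TAG.sub(" ", fragment).split())


def parse_data_file(text):
    """Return the entries of a data file as a list of dictionaries."""
    return [
        {"url": url, "title": clean(title), "note": clean(note)}
        for url, title, note in ENTRY.findall(text)
    ]


def search(entries, keyword):
    """Return matching entries, most keyword occurrences first (assumed ranking)."""
    key = keyword.lower()
    scored = []
    for entry in entries:
        count = f"{entry['title']} {entry['note']}".lower().count(key)
        if count:
            scored.append((count, entry))
    # The sort is stable, so equally scored hits keep their data-file order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


if __name__ == "__main__":
    for hit in search(parse_data_file(DATA_FILE), "computer"):
        print(hit["url"], "|", hit["title"])
```

Because only the link, title, and explanation are stored, the search never touches the pages themselves, which is the space saving that motivates the data file.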
Application of a Data File
Search results are controlled by data files. In the search procedure, the first step is to key a search item into the query box; the engine then runs. In fact, the search query is passed as a string to the search program by the external program. When the search program runs, it starts pattern matching in the data file. If anything from one to ten or more "hits" are found in the data file, the matched items are ranked automatically by the program before being sent back to the user. If the search item is not found in the data file, nothing has matched, and the user gets no results from the search. This shows the importance of the data file: it restricts the application of a search. A comprehensive, high-quality data file is essential to efficient searching. Its structure is influenced by the kind of searches users wish to make. This was a major issue considered during the development of MIRTH.

Data files are application-oriented; that is, their contents depend on what users require of a search engine in terms of information retrieval, so they are not always the same. Each resulting data file is inevitably unique, as it reflects its creators' own ideas as well as the needs of its users. Generally speaking, however, the design of data files has some common features. For example, the design is influenced by its objectives. The purpose of the MIRTH search engine is to help computer specialists, linguists, and people studying literature. The author therefore approached the design of the data files by collecting all resources related to the aims of this group of special users, building up as large a computing, linguistics, and literature file as was possible in the time available. In other words, the contents of the MIRTH data file introduce general computing knowledge, linguistics theory, and literature materials such as novels, poetry, and literature journals (in both Chinese and English). It can also be extended to other topics, for example civil engineering and chemistry, if necessary, but this would involve further manual creation of entries in the data file and would dilute the subject specificity, thus risking more erroneous 'hits'.

Construct a Data file
The rule of thumb in constructing a data file is to save computer memory. The best way to do this is to choose keywords from an HTML document and convert them into a data file (see Fig. 1). Many automatic search engines rely on a data file to deal with queries. It is important to choose comprehensive keywords to improve the chance of retrieving relevant documents. For the data files, keywords can be seen as the words or phrases that reflect the subject of the corresponding home page. The theory behind a data file is simple: users are interested in a particular document if it contains keywords related to their interests. The keywords placed in the data file were selected manually from introductory textbooks in linguistics, computing, and literature, guided by the authors' experience in Chinese linguistics and literature.

2.3 Define Query Syntax Rules
Although MIRTH searches pre-computed data files instead of the Web pages themselves, any relevant home pages will eventually be downloaded by users if necessary, so those home pages are also of interest. Intelligent query support is included in the system to help users search for information more accurately and efficiently [Ford, 1995]. All search engines developed in the literature have their own syntax rules for making queries. Two problems exist with them: (1) in practice, only a few search tools explain their rules explicitly; (2) most common query syntax rules depend on concepts of natural language, such as words and expressions (several words). MIRTH differs from them in that it focuses on linguistics and literature research, so some syntax rules are needed to support specific searching functions. The MIRTH syntax rule definitions focus on special usages, such as affix search and root search. Details are given below.

Motivation of Affix and Root Search
As a general definition, syntax addresses the structure of sentences, but technically it has further meanings. In computing, the term syntax is widely used. Any computer language requires certain syntax rules for its commands and code, ranging from the simple structure of words to that of an entire program. Moreover, syntax has different definitions and content in different situations. In computer programming, there are many syntax rules that programmers have to follow. For example, in hypertext writing, HTML (HyperText Mark-up Language) requires all commands to be enclosed in angle brackets, <>. These usually appear in pairs: <> is used at the beginning and </> at the end of the same sentence.
As mentioned, the main function of a MIRTH search is linguistic search, and analyzing the special structure of words (phrases) is a very important issue for linguists and language learners [Graham, 1995]. In English, for example, most words have a root, and a root can form many derivations: adding a suffix or a prefix to a root creates a new word. Consider the verb 'think', which we can regard as a root. When a prefix and a suffix are added to it, it becomes a new word, such as 'unthinkable'. For language searching, these special functions of prefix, suffix, and root search have been considered in the query syntax.

Example of Affix and Root Search
Prefix + (*)
If a user wants to find words that start with the same prefix, he or she should follow this syntax rule when entering the item in the search box: prefix* (without a space before '*'). An example follows.
For a prefix-matching search, the search engine scans all the words in its database and provides the information you are looking for. Suppose you want to search for words beginning with 'dis'. Add an asterisk (*) after the search item, without a space, and the engine searches for words with the prefix "dis". You would then get search results such as: dislike, display, discrete, dismember, discomfort, discredit, discover, discolor, disclose, and disloyal. Fig. 3 illustrates this idea.
Figure 3: Search Prefix "Com"
Fig. 3 shows a search for the prefix "Com". In total, over thirty hits were matched. A corresponding Chinese search is shown in Fig. 4.
Figure 4: Chinese Engine Search
In this Chinese prefix search, the search item is "中" (central, or middle) + "*". When run, the search engine picked out over fifteen hits containing the prefix "中", such as 中文 (Chinese), 中国 and 中华 (China), 中国文学 (Chinese literature), 中东 (Middle East), 中西 (Chinese and Western), and so on.

(*) + Suffix
This syntax rule defines a search that matches all words in the data file having an identical suffix. The required input is *suffix, with no space between the asterisk and the suffix. The user only needs to enter the search query in the item box with an asterisk (*) in front of the suffix, for example "*ing". The search engine then seeks out the words that include the suffix and automatically picks them up (see Fig. 5, displayed by the Netscape Web browser).
Figure 5: MIRTH Search for "*ing"

(*) + Root + (*)
This query syntax means that the root part of a word is placed between two asterisks, with no spaces between them. The search engine then matches all words in the index file that have an identical root.
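The three affix rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names parse_query and search_words are ours, the word list stands in for the data file, matching is case-insensitive, and a root search is treated as a simple "contains" test, none of which is specified by MIRTH.

```python
def parse_query(query):
    """Classify a query as 'prefix', 'suffix', 'root', or plain 'keyword'."""
    if "*" in query and any(ch.isspace() for ch in query):
        raise ValueError("no space is allowed between '*' and the search item")
    starts, ends = query.startswith("*"), query.endswith("*")
    term = query.strip("*")
    if not term or "*" in term:
        raise ValueError("unsupported query: %r" % query)
    if starts and ends:
        return "root", term        # *root*
    if ends:
        return "prefix", term      # prefix*
    if starts:
        return "suffix", term      # *suffix
    return "keyword", term


def search_words(words, query):
    """Return the words that satisfy the query, in their original order."""
    mode, term = parse_query(query)
    term = term.lower()
    tests = {
        "prefix": lambda w: w.startswith(term),
        "suffix": lambda w: w.endswith(term),
        "root": lambda w: term in w,       # assumption: root = substring
        "keyword": lambda w: w == term,
    }
    return [w for w in words if tests[mode](w.lower())]


# Hypothetical word list standing in for the keywords of a data file.
WORDS = ["dislike", "display", "thinking", "unthinkable", "reading",
         "中文", "中国文学", "文学"]

if __name__ == "__main__":
    for q in ("dis*", "*ing", "*think*", "中*"):
        print(q, "->", search_words(WORDS, q))
```

Because Python strings are sequences of characters rather than bytes, the same functions handle the English and the Chinese queries without special cases.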
3. Multilingual Issues in MIRTH

3.1 Chinese Computing Environment
For multilingual information retrieval, most systems, including UNIX and PC systems, work in the standard English environment, and WWW documents work in this format [Christian, 1988]. For example, Netscape's Web browser (named Navigator) displays standard HTML in plain English. When a Chinese document is on the Web, Netscape Navigator shows the Chinese characters as strange symbols. Without software that supports displaying Chinese, the strange symbols displayed cannot be understood by any user (including Chinese people).
Reading Chinese with a Web browser, and setting up a search system for Chinese information retrieval, are still topics of debate on the Web [Zhou and Liu, 1997]. In this paper we present one solution, covering how to set up a Chinese environment, how to deal with Chinese characters, and how to retrieve information from a Chinese virtual corpus.
Chinese GB and BIG5 codes are displayed in the default Netscape font. Some Chinese software copes with this problem by converting the English computing environment into an environment that supports both English and Chinese. Once this software is installed, the Chinese code is converted into readable Chinese characters, as shown in Figs 6 and 7. Both figures show the same document file: Fig. 6 is displayed with the Chinese environment (supported by UnionWay), and Fig. 7 is displayed without it.
Figure 6: Display with Chinese environment
Figure 7: Display without Chinese environment
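The "strange symbols" can be reproduced with a short Python sketch. It assumes that an English-only environment shows each byte of the GB-coded text as one Latin-1 symbol, which is a simplification of what a browser font actually does. The function names are illustrative.

```python
def encode_gb(text):
    """Encode text as GB (GB2312) bytes, two bytes per Chinese character."""
    return text.encode("gb2312")


def view_without_chinese_support(raw):
    # Assumption: an English-only environment maps every byte to one
    # Latin-1 symbol, so each Chinese character becomes two odd symbols.
    return raw.decode("latin-1")


def view_with_chinese_support(raw):
    """Interpret the same bytes as GB code, giving readable Chinese."""
    return raw.decode("gb2312")


if __name__ == "__main__":
    raw = encode_gb("中文")
    print(raw.hex())                          # the stored GB bytes
    print(view_without_chinese_support(raw))  # strange symbols
    print(view_with_chinese_support(raw))     # readable Chinese
```

The document bytes are identical in both views. Only the interpretation changes, which is why installing a Chinese environment is enough to make the same file readable.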
3.2 Dealing with Symbolic Chinese Characters
MIRTH provides access to Chinese virtual corpora as well. To understand this procedure, we need to know how the computer deals with symbolic characters. We now discuss how to convert a symbolic character into digital information, how to store a set of digital characters (as bitmaps), and how to represent particular Chinese characters in the GB protocol.

Chinese Code Protocols
We discuss the GB, HZ, and BIG5 Chinese code protocols used by MIRTH. When we search a Chinese data file in MIRTH, the file is normally written in a kind of code; in other words, the Chinese characters are written as specific codes instead of graphic characters.

Line-Column Code
One simple protocol is called Line-Column code, which uses a character's line number and column number as its code [Huabei, 1981]. For example, to search for the character for 'big' (大), we key in 2083, where 20 is the line number of 'big' in the Chinese character library and 83 is its column number. This method is not widely used because keying in a single Chinese character requires four digits instead of the two letters used by GB code.

GB Code (Guo Biao)
GB means national standard and stands for the Chinese Standard for Information Interchange (read Guo Biao in Chinese). It is defined by the People's Republic of China and is widely used in Chinese societies around the world. GB code is obtained by adding 32 to both the line and column numbers of the Line-Column code. Taking the character for 'big' as an example, adding 32 to 20 (line number) and to 83 (column number) gives 52 and 115. In the ASCII code table, 52 represents '4' and 115 represents 's', so the GB code for 'big' is '4s'. As the minimum line and column number is 1, the minimum GB code value is 32+1=33, and the maximum is 32+94=126. GB code values therefore fall within the range of ASCII codes that represent the 94 printable symbols. This means GB code can be used as a standard information interchange code set, like ASCII.

3.3 Chinese Information Retrieval System
The structure of the Chinese character set is different from that of English. It has its own special characteristics, and it can be displayed by MIRTH on the Web. The whole procedure is described below.

Chinese WWW Servers
How can Chinese documents on the WWW be accessed? The first step is to approach a Chinese Web server. Recently, dozens of Chinese servers have appeared on different platforms on the Web. These include:
• Chinese WEB server (URL: http://darwin.technet.sg/cweb/cstart.html),
• Wen Zhai (also known as Chinese News Digest; URL: http://www.cnd.org),
• The Xian Chinese World Wide Web (http://www.ncb.gov.sg/chinese-web/),
• Chinese Gopher Menu (gopher://sunrise.cc.mcgill.ca/),
• The Chinese Web page (http://agora.leeds.ac.uk/xiaoda/Dcorpus.htm),
and so on.
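Since the Chinese data files are stored in GB-based codes, the Line-Column to GB arithmetic of Section 3.2 can be checked with a small Python sketch. The function names are ours. The final lookup relies on Python's built-in gb2312 codec, which expects the 8-bit form of the code (each byte with its high bit set), and that step is an addition for illustration rather than something described in the paper.

```python
OFFSET = 32                      # added to both line and column numbers
MIN_INDEX, MAX_INDEX = 1, 94     # valid line and column numbers


def split_keyed_code(digits):
    """Split a keyed-in Line-Column code such as '2083' into (20, 83)."""
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError("a Line-Column code has exactly four digits")
    return int(digits[:2]), int(digits[2:])


def line_column_to_gb(line, column):
    """Return the two-letter GB code for a line and column number."""
    for value in (line, column):
        if not MIN_INDEX <= value <= MAX_INDEX:
            raise ValueError("line and column must lie between 1 and 94")
    return chr(line + OFFSET) + chr(column + OFFSET)


def gb_to_line_column(code):
    """Invert line_column_to_gb: '4s' gives (20, 83)."""
    if len(code) != 2:
        raise ValueError("a GB code has exactly two letters")
    return ord(code[0]) - OFFSET, ord(code[1]) - OFFSET


def gb_to_character(code):
    """Look up the Chinese character for a two-letter GB code.

    Python's gb2312 codec uses the 8-bit form of the code, so the high
    bit of each byte is set before decoding.
    """
    return bytes(ord(ch) | 0x80 for ch in code).decode("gb2312")


if __name__ == "__main__":
    line, column = split_keyed_code("2083")
    code = line_column_to_gb(line, column)
    print(line, column, code, gb_to_character(code))
```

The range check mirrors the bounds given above: line and column numbers from 1 to 94 map to code values from 33 to 126.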
Retrieving Information from Chinese Data Files
To give users of MIRTH a wider choice of virtual Chinese corpora, we need to create Chinese data files for linguistics research. The Chinese corpus is collected via a Chinese data file. Chinese search in MIRTH is similar to English search. The difference is that Chinese codes use two bytes, so when setting up the Chinese data file we have to consider this special property in choosing an appropriate way to retrieve information.
MIRTH allows the user to input a query in Chinese, as keywords or as a subject in the linguistics area, and the search engine returns a list of documents ranked in order of relevance. Users can read the documents first, then find the ones that interest them and refine the search by marking the documents that have been highlighted. When users find what they want, they can click on it to search further or to capture the information they are looking for at once. To achieve this aim, we have built our Chinese data files in the GB and HZ Chinese codes. People who have the Chinese environment can access them easily.

Special Syntax Definition for Chinese Search
Chinese has its own phrasal structure, which is much more complicated than that of English. Above all, the structure of a Chinese word is very different from that of an English word. An English word can be subdivided into parts (prefix, root, and suffix) within one unit. For example, 'display' is one word, but it comprises two parts, the prefix (dis) and the root (play), so its structure is prefix + root. When this word is translated into Chinese, it is represented by two independent units, and its structure is shown in Figure 8 below:
Figure 8: English "Display" is shown in Chinese
There is another way to see how English and Chinese differ. In English, adding "ing" to a verb changes the nature of the word and transforms the verb into a noun. For example, 'take' is a verb, while "taking" is a verbal noun that can be the object of a sentence. Chinese does not have this kind of rule; there is no comparable way of changing a verb into a noun. From the analysis above, it is easy to see that Chinese grammar is different from English grammar. For this reason, a Chinese query syntax is provided in MIRTH. Here are a few examples of the query syntax for a Chinese search.

The syntax rules
The first syntax rule is that if the query item is a keyword in Chinese, its general structure should be Root + Root or Root + suffix, because that is a major structure of Chinese words and phrases. The search item should be a phrase or a few words within this structure:
Root + Root, or Root + suffix.
This rule is quite useful for linguistics, particularly for learning Chinese grammar, such as the structure of Chinese words and phrases. For instance, suppose the user's search item is 中 (in English, middle or central). As a root, it can be used in the derivation of new phrases, as can be seen in Fig. 9.
Figure 9: Chinese Phrasal Words
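The root search above, combined with the two-byte property of Chinese codes noted earlier, can be sketched in Python as follows. The sketch assumes the data file is stored as 8-bit GB (GB2312) bytes, in which any byte of 0x80 or above starts a two-byte character. The function names and the sample phrases are illustrative only.

```python
def find_on_boundaries(data, pattern):
    """Return offsets where pattern occurs in GB bytes on a character boundary."""
    offsets = []
    i = 0
    while i < len(data):
        if data.startswith(pattern, i):
            offsets.append(i)
        # A byte >= 0x80 starts a two-byte Chinese character; an ASCII
        # byte stands alone. Stepping this way never lands mid-character.
        i += 2 if data[i] >= 0x80 else 1
    return offsets


def root_search(phrases, root):
    """Return the phrases that contain the root as whole characters."""
    pattern = root.encode("gb2312")
    return [p for p in phrases
            if find_on_boundaries(p.encode("gb2312"), pattern)]


# Hypothetical phrases standing in for entries of a Chinese data file.
PHRASES = ["中文", "中国文学", "文学", "Chinese 中文"]

# Two GB characters whose inner bytes happen to equal the code of 中:
# a plain byte search reports a match that is not a real character.
TRICKY = b"\xb0\xd6\xd0\xa1"

if __name__ == "__main__":
    print(root_search(PHRASES, "中"))
    print(TRICKY.find("中".encode("gb2312")))                 # false match offset
    print(find_on_boundaries(TRICKY, "中".encode("gb2312")))  # no real match
```

Decoding the data file to text before searching would avoid the problem altogether. The byte-level version is shown only to make the two-byte consideration visible.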
The second syntax rule is that "the keywords must include an object, which is a noun (or nouns), in the search item". For instance, a search for 'Chinese' is a standard search pattern: when the search engine gets the query, it searches around the object 'Chinese' and matches results around 'Chinese'.
The Chinese data file is searched for a query using this special pattern-search rule. Suppose you define your query as a set of keywords and other qualifiers, such as 'Human Language' or 'Chinese Grammar'. A Boolean-type search would then match the subject 'Language' and the object 'Grammar', and find a number of items of interest in the linguistics data file [Atwell, 1993].

4. Conclusion
In this paper, we have discussed the main hierarchy of MIRTH, a multilingual information retrieval tool (also called a search engine). For this purpose, three pieces of work were done. Firstly, particular sources were "linked" together as its "Virtual Corpora", containing separate topics such as computing, linguistics and language, and Chinese literature. Secondly, an example of a multilingual environment was created with the help of UnionWay. Some of these corpora are multilingual sources, which mainly present materials in English and Chinese. Thirdly, an automatic tool to retrieve information from the multilingual corpora was set up. Improvements to MIRTH in further work should include:
(1) As already stated, this research has set up its own data files, but they are not big enough to hold the complete information resources of a particular research area. Installing more data materials is therefore an essential task before MIRTH can be improved in a real situation.
(2) To be a complete linguistics tool, more techniques should be added, such as tagging, parsing, and analyzing the structure of a sentence [Zhou et al., 1998], [Zhou and Liu, 1997].
(3) Maintenance is an issue for the system [Liu, 1996]. A data file management system needs to be created to manipulate data files more efficiently and accurately. This management system will perform tasks such as adding, inserting, deleting, updating, replacing, and sorting links with their keywords.

References
Atwell, Eric 1993. Knowledge at Work in Universities, Leeds University Press.
Butler, Christopher S. 1992. Computers and Written Texts, Biddles Ltd, Guildford, Surrey.
Christian, Kaare 1988. The Unix Operating System, John Wiley & Sons, Inc.
Ford, Andrew 1995. Spinning the Web, International Thomson Publishing.
Graham, Ian 1995. The HTML Sourcebook, John Wiley & Sons.
Gilster, Paul 1996. Finding it on the Internet, John Wiley & Sons, Inc.
Huabei 1981. Huabei Computing Institute, Chinese National Standard: A Collection of Chinese Character Codes for Information Exchanging, China Standard Press House.
Liu, N.K. 1996. Formal verification of some potential contradictions in knowledge base using a High Level Net approach, Applied Intelligence, 6(4):325-344.
Liu, J. and Lee, Y.K. 1997. Development of a Chinese Extraction System. In Proceedings of International Conference on Computer Processing of Oriental Languages, April 2-4, 1997, Hong Kong.
Zhou, L. and Liu, J. 1997. An efficient algorithm for bilingual word translation acquisition. In the 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97) of the International Joint Conference on Artificial Intelligence (IJCAI-97), August 23-29, 1997, Nagoya, Japan.
Zhou, L., Liu, J. and Yu, S.E. 1998. Study and implementation of combined techniques for automatic extraction of word translation pairs: An analysis of the contributions of word heuristics to a statistical method, to appear in International Journal on Computer Processing of Oriental Languages.
