Multilingual Information Retrieval in World Wide Web

Xiaoda Zhang and James N. K. Liu
Department of Computing, Hong Kong Polytechnic University, Hong Kong
[email protected]

Eric Atwell
Department of Computer Studies, University of Leeds, UK

Abstract

The article addresses: (1) The design of an information retrieval (IR) tool, the Multilingual Information Retrieval Tool Hierarchy (MIRTH), which works with virtual corpora on the World Wide Web, also known as the Web or WWW. It is motivated by the desire to create a search engine that retrieves information by accessing a virtual corpus. (2) The implementation of a general model of multilingual retrieval for Web searching, which copes with both English and Chinese information retrieval techniques. The paper first addresses some problems of the World Wide Web relating to information retrieval. It then introduces some existing information retrieval tools on the Web and discusses the need for a multilingual search engine. Next, a general hierarchy of the MIRTH search engine is illustrated. Techniques for setting up a MIRTH search engine are then explored. These include building data files, a structure for the search engine [Gilster, 1996], and constraints on query syntax. Finally, the means of creating the MIRTH multilingual search engine for Chinese (and English) information retrieval are dealt with, and some examples of using the MIRTH search engine are given.

1. Introduction

In a large distributed hypertext system like the World Wide Web, users find resources by following hypertext links. As the size of the system increases, users must traverse increasingly more links to find what they are looking for. This task is very demanding. One comprehensive way to cope with it is to develop a computer program that helps people explore the Web: a search engine. All search engines provide a query frame in which the user can key in a search requirement in the form of keywords, phrases, or a regular expression. The engine then goes through Web pages, locates documents within the entire Web, and returns the selected hits in the format of WWW documents. Examples of this kind of search engine are Infoseek, Yahoo, WebCrawler, Excite, ALIWEB, CUSI, and W3 Catalogue [Graham, 1995].

1.1 Advantages of Search Engines

The basic function of a search engine is to help users retrieve information from the WWW. The two most impressive features of search engines are the time they save in searching the Web and their simplicity of use. For example, in Netscape Navigator, once a user keys a keyword, phrase, or regular expression into the frame of a search engine, he or she just needs to press 'Enter', and the desired information is located in a very short time in the same Netscape browser.

1.2 Disadvantages of Existing Search Engines

Each search engine has its weaknesses. The shortcomings common to all of them can be summarized as follows:

• Most of them can only search in English. This would prevent, for instance, a linguist from retrieving materials about the Chinese language, or theory written in Chinese. Hence the need exists to establish a multilingual search engine.
• Most of them are general search tools focusing on general information retrieval, which might not be efficient for searching within a particular research area.
• Although some can search a very wide database, they cannot satisfy users who need very detailed information on professional topics, for example Chinese grammar.

1.3 Objectives of Research

This paper was therefore motivated by the desire to create a search engine that retrieves information from the World Wide Web (WWW) using a linguistics virtual corpus. Specific attention has been paid to multilingual search facilities, and the study includes a discussion of the use of English and Chinese language tools. The intention is to create a multilingual information retrieval tool that supports searching in specified areas, such as the fields of linguistics and literature. As a search engine, it retrieves information in both English and Chinese. The research comprises two major tasks:

• To organize a "virtual corpus" of computer-based text-training materials for linguistics and literature available on the World Wide Web. The corpus contains multilingual information about human language learning and training, grammar, and language modeling research. Users can find extensive materials within the corpus, such as novels, poetry, prose and various on-line electronic magazines.
• To set up a multilingual search tool that can handle queries in both English and Chinese. This tool can be used to scan the contents of the above corpus on line.

2. Methodologies of Creating an Information Retrieval Tool

Creating an information retrieval tool involves several methodologies, for example building special corpora, creating a dedicated data file, and defining special query syntax rules. These are described briefly in the following sections.

2.1 Build up Special Corpora

The WWW is a huge storeroom in which a variety of information materials have been placed. As most WWW documents are written in plain text using the Hypertext Markup Language (HTML), the WWW can be seen as a corpus [Atwell, 1993], [Liu and Lee, 1997]. There are already billions of documents on the WWW, and the number grows rapidly. As all of these hypertext documents are dispersed over the Web, finding materials in a particular research area is a time-consuming task. It is therefore wise to set up an access point for relevant materials, such as a Computing corpus.

2.2 Set up Web Data File

For Web searching, the common search object is a database [Ford, 1995]. However, a database requires a huge amount of storage space, and the search process takes a long time to complete. A special approach is therefore devised in MIRTH for organizing the database. Within MIRTH there are two kinds of search objects: virtual corpora [Butler, 1992] and data files. A virtual corpus is dynamic. It differs from a traditional corpus in that its contents change from time to time [Butler, 1992]. A virtual corpus might not be stored on the user's computer and, furthermore, is potentially subject to unforeseen changes as remote sources are modified without central control. Hypertext resources on the Web are seen as collections of virtual corpora. A Web data file, also called a data set, is a collection of the Uniform Resource Locators (URLs) of Web pages. It supports the whole process of information retrieval from the Web.

To set up Web data files, three issues are discussed:

• Setting up a data file to save space
• Specifying the structure of a data file
• Defining applications for a data file

A Principle of Setting up the Data File

The purpose of creating a data file, instead of using the contents of the hypertext pages themselves, is to save storage space. The resources available on the Web are clearly too large to be saved on one machine. One way to serve this purpose is to store only keywords in the data file, together with a few lines explaining the content of each document. An example of the data file is given in Fig. 1:

<LI><a href= "http://www.scsn.net/~ics/"> Intelligent Computer program Solutions WWW Site </a>
<h4> Introduction of Computer programs</h4>
<LI><a href= "http://www.education.siggraph.org/theses/theindex.htm"> M.S.
and Ph.D. Computer Graphics Theses</a>
<h4><I>This directory contains the ASCII text files for all of the Computer Graphics Thesis and Dissertation Abstracts Compendiums published in Computer</I></h4>
<LI><a href="http://www.copfer.com/search.htm"> Computer & Associates </a>
<h4><I>Computer Based Training Internet Services Web Design Touch Screen Kiosks Electronic Catalogs Java Shock wave Multimedia </I></h4>

Figure 1: A data file example with key words and explanations

When MIRTH searches the data file for the key word 'computer', the underlined part of the text (keywords and main explanation) is shown on the screen as the result of the search, as can be seen in Fig. 2.

Figure 2: Search Results on the Data File

Data files are application-oriented: their contents depend on what users require of a search engine in terms of information retrieval, so they are not always the same. The resulting data files are inevitably unique, as they reflect their creators' personal ideas as well as those of their users. Generally speaking, however, there are some common features in designing a data file. For example, the design is influenced by its objectives. The purpose of the MIRTH search engine is to help computer specialists, linguists and people studying literature. The authors therefore approached the design of the data files by collecting all resources related to the aims of this group of special users, building up as large a computing, linguistics and literature file as was possible in the time available. In other words, the contents of the MIRTH data file introduce general computing knowledge, linguistics theory and major literature materials such as novels, poetry and literature journals (both Chinese and English). The file can also be extended to other topics, for example civil engineering and chemistry, if necessary, but this would involve further manual creation of entries in the data file and would dilute the subject specificity, thus risking more erroneous 'hits'.

Construct a Data File

The rule of thumb in constructing a data file is to save computer memory. The best way to do this is to choose keywords from an HTML document and convert them into a data file entry (see the example above). Many automatic search engines rely on a data file to deal with queries. It is important to choose comprehensive keywords to improve the chance of retrieving relevant documents. For the data files, keywords are the words or phrases that reflect the subject of the corresponding home page. The theory behind a data file is simple: users are interested in a particular document if it contains keywords related to their interests. The keywords placed in the data file were selected manually from introductory textbooks in linguistics, computing and literature, guided by the authors' experience in Chinese linguistics and literature.

Application of a Data File

Search results are controlled by the data files. In the search procedure, the first thing to be done is to key a search item into the query box. Then the engine runs: the search query is passed as a string to the search program by the external program. When the program runs, it starts pattern matching in the data file. If one or more "hits" are found in the data file, the matched items are ranked automatically by the program before being sent back to the user. If the search item is not found in the data file, nothing is matched and the user gets no results from the search. The importance of the data file is now clear: it restricts what a search can find. A comprehensive, high-quality data file is essential to efficient searching, and its structure is influenced by the kinds of searches users wish to make. This was a major issue considered during the development of MIRTH.

2.3 Define Query Syntax Rules

Although MIRTH searches pre-computed data files instead of the Web pages themselves, any relevant home pages will eventually be downloaded by users if necessary, so our interest extends to those home pages as well. Intelligent query support is included in the system to help users search for information more accurately and efficiently [Ford, 1995]. All search engines developed in the literature have their own syntax rules for making queries. Two problems exist with them: (1) in practice, only a few search tools explain their rules explicitly; (2) the most common query syntax rules depend on concepts of natural language, such as words and expressions (several words). MIRTH differs from them in that it focuses on linguistics and literature research, so it needs syntax rules that support specific searching functions. The MIRTH syntax rule definitions focus on some special usages, such as affix search and root search. Details are given below.

Motivation of Affix and Root Search

As a general definition, syntax addresses the structure of sentences, but technically it has further meanings. In computing, the term syntax is used widely. Any computer language requires certain syntax rules for its commands and code, ranging from the simple structure of words to an entire program. Moreover, in different situations syntax has a different definition and different content. In computer programming there are many syntax rules that programmers have to follow. For example, in hypertext writing, HTML (HyperText Mark-up Language) requires that all commands be enclosed in angle brackets: <>.
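The data file format of Fig. 1 and the search procedure described for it (pattern matching against keywords and explanations, then ranking the hits) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not MIRTH's actual implementation: the names `Entry`, `parse_data_file` and `search` are hypothetical, and ranking by the number of keyword occurrences is an assumed scheme, since the paper does not say how hits are ranked.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    url: str
    title: str
    description: str


# One entry as in Fig. 1: <LI><a href="URL"> title </a> <h4> explanation </h4>
ENTRY_RE = re.compile(
    r'<LI>\s*<a\s+href=\s*"([^"]+)"\s*>(.*?)</a>\s*<h4>(.*?)</h4>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def clean(text: str) -> str:
    """Drop nested tags such as <I> and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def parse_data_file(text: str) -> List[Entry]:
    """Turn the raw data file into (url, title, description) entries."""
    return [Entry(url, clean(title), clean(desc))
            for url, title, desc in ENTRY_RE.findall(text)]


def search(entries: List[Entry], keyword: str) -> List[Entry]:
    """Return entries containing the keyword, most occurrences first.

    Ranking by occurrence count is an assumption made for this sketch.
    """
    needle = keyword.strip().lower()
    if not needle:
        return []
    scored = []
    for entry in entries:
        count = f"{entry.title} {entry.description}".lower().count(needle)
        if count:
            scored.append((count, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [entry for _, entry in scored]


# Two entries taken from Fig. 1 (the second explanation is shortened).
DATA_FILE = """
<LI><a href= "http://www.scsn.net/~ics/"> Intelligent Computer program Solutions WWW Site </a>
<h4> Introduction of Computer programs</h4>
<LI><a href="http://www.copfer.com/search.htm"> Computer & Associates </a>
<h4><I>Computer Based Training Internet Services Web Design</I></h4>
"""

ENTRIES = parse_data_file(DATA_FILE)
for hit in search(ENTRIES, "computer"):
    print(hit.url, "|", hit.title)
```

A query that matches nothing, such as `search(ENTRIES, "chemistry")`, returns an empty list, which corresponds to the "no results" case described for the data file.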
The <> symbols usually appear in pairs: the opening form <> is used at the beginning and the closing form </> at the end of the same piece of text.

As mentioned, the main function of a MIRTH search is linguistic search, and analyzing the special structure of words and phrases is a very important issue for linguists and language learners [Graham, 1995]. In English, for example, most words have a root, and a root can form many derivations: adding a suffix or a prefix to a root creates a new word. Consider the verb 'think', which we can regard as a root. When a prefix and a suffix are added to it, it becomes a new word such as 'unthinkable'. For language searching, these special functions of prefix, suffix and root search have been included in the query syntax.

Example of Affix and Root Search

Pre(*)

If a user wants to find words that start with the same prefix, he or she should enter the search item in the search box using the syntax rule prefix* (without a space before the '*'). For a prefix-matching search, the search engine scans all the words in its database and returns the information being looked for. Suppose you are searching for words that begin with 'dis': add an asterisk (*) directly after the search item, without a space, and search for the prefix "dis". You would then get search results such as dislike, display, discrete, dismember, discomfort, discredit, discover, discolor, disclose, and disloyal. Fig. 3 illustrates the idea with a search for the prefix "Com", for which over thirty hits were matched in total.

Figure 3: Search Prefix "Com"

The same can be seen for a Chinese search in Fig. 4.

Figure 4: Chinese Engine Search

In this Chinese prefix search, the search item is "中" (central, or middle) + "*". When run, the search engine picked out over fifteen hits containing the prefix "中", such as 中文 (Chinese), 中国 and 中华 (China), 中国文学 (Chinese literature), 中东 (Middle East), 中西 (Chinese and Western) and so on.

(*) + Suffix

This syntax rule defines a search that matches all words in the data file having an identical suffix. The required input is *suffix, with no space between the asterisk and the suffix. The user simply enters the search query in the item box with an asterisk (*) in front of the suffix, for example "*ing". The search engine then seeks out the words that end with that suffix and automatically picks them out (see Fig. 5, as displayed by the Netscape Web browser).

Figure 5: MIRTH Search for "*ing"

(*) + Root + (*)

In this query syntax the root part of a word is placed between two asterisks, with no spaces between them. The search engine then matches all words installed in the index file that contain the identical root.

3. Multilingual Issues in MIRTH

3.1 Chinese Computing Environment

For multilingual information retrieval, most systems, including UNIX and PC systems, run in a standard English environment, and WWW documents work in this format [Christian, 1988]. For example, Netscape's Web browser (named Navigator) displays standard HTML in plain English. When a Chinese document is on the Web, Netscape Navigator shows the Chinese characters as strange symbols. Without software that supports displaying and reading Chinese, these symbols cannot be understood by any user (including Chinese people).

Reading Chinese with a Web browser, and setting up a search system for Chinese information retrieval, are still topics of debate on the Web [Zhou and Liu, 1997]. This paper presents one solution: how to set up a Chinese environment, how to deal with Chinese characters, and how to retrieve information from a Chinese virtual corpus.

By default, Chinese GB and BIG5 codes are displayed in the default Netscape font. Some Chinese software copes with this problem by converting the English computing environment into one that supports both English and Chinese. Once this software is installed, the Chinese codes are converted into readable Chinese characters, as shown in Figs. 6 and 7. Both figures show the same document file: Fig. 6 is displayed with the Chinese environment (supported by UnionWay), and Fig. 7 without it.

Figure 6: Display with Chinese environment

Figure 7: Display without Chinese environment

3.2 Dealing with Symbolic Chinese Characters

MIRTH provides access to Chinese virtual corpora as well. To understand this procedure, we need to know how the computer deals with symbolic characters. We now discuss how a symbolic character is converted into digital information, how a set of digital characters is stored (as bitmaps), and how particular Chinese characters are represented in the GB protocol.

Chinese Code Protocols

We discuss the GB, HZ and BIG5 Chinese code protocols used by MIRTH. A Chinese data file searched by MIRTH is normally written in one of these codes. In other words, the Chinese characters are stored as specific codes instead of as graphic characters.

Line-Column Code

One simple protocol is the Line-Column code, which uses a character's line number and column number as its code [Huabei, 1981]. For example, to search for the word 'big' we key in 2083, where 20 is the line number of 'big' in the Chinese character library and 83 is its column number. This method is not widely used, because keying in a single Chinese character requires four digits instead of the two letters used by the GB code.

GB Code (Guo Biao)

GB means national standard (read Guo Biao in Chinese) and stands for the Chinese Standard for Information Interchange, which is defined by the People's Republic of China and is widely used in Chinese societies around the world. The code is defined by adding 32 to both the line number and the column number of the Line-Column code. Taking the word 'big' as an example, adding 32 to 20 (the line number) and to 83 (the column number) gives 52 and 115. In the ASCII code table, 52 represents '4' and 115 represents 's', so the GB code for 'big' is '4s'. As the minimum line and column number is 1, the minimum GB code value is 32 + 1 = 33, and the maximum is 32 + 94 = 126. The GB code therefore lies within the range of the ASCII codes that represent the 94 printable symbols. This means the GB code can be used as a standard information interchange code set, like ASCII.

3.3 Chinese Information Retrieval System

The structure of the Chinese character set is different from that of English. It has its own special characteristics, and it can be displayed by MIRTH on the Web. The whole procedure is described below.

Chinese WWW Servers

How can Chinese documents on the WWW be accessed? The first step is to approach a Chinese Web server. Recently, dozens of Chinese servers have appeared on different platforms on the Web. These include:

• Chinese WEB server (URL: http://darwin.technet.sg/cweb/cstart.html)
• Wen Zhai, also known as Chinese News Digest (URL: http://www.cnd.org)
• The server at http://www.ncb.gov.sg/chinese-web/
• Xian Chinese World Wide Web
• Chinese Gopher Menu (gopher://sunrise.cc.mcgill.ca/)
• The Chinese Web page (http://agora.leeds.ac.uk/xiaoda/Dcorpus.htm)

Retrieving Information from Chinese Data Files

To give users of MIRTH a wider choice of virtual Chinese corpora, we need to create Chinese data files for linguistics research. The Chinese corpus is collected via a Chinese data file. Chinese search in MIRTH is similar to English search. The difference is that Chinese codes use two bytes per character, so this property has to be considered when setting up the Chinese data file and choosing an appropriate way to retrieve information.

MIRTH allows the user to input a query in Chinese, as keywords or as a subject in the linguistics area, and the search engine returns a ranked list of documents in order of relevance. Users can read the documents first, then identify what interests them and refine the search by marking the documents that have been highlighted. When users find what they want, they can click on it to search further or to capture the information they are looking for at once. To achieve this aim, we have built our Chinese data files in the GB and HZ Chinese codes. People who have the Chinese environment can access them easily.

Special Syntax Definition for Chinese Search

Chinese has its own properties of phrasal structure, which are much more complicated than those of English. Primarily, the structure of a Chinese word is very different from that of an English word. An English word can be subdivided, within one unit, into prefix, root and suffix. For example, 'display' is one word, but it comprises two parts, the prefix (dis) and the root (play), so its structure is prefix + root. When this word is translated into Chinese, it is written as two independent units, and its structure is shown in Figure 8:

Figure 8: English "Display" is shown in Chinese

There is another way to see how English differs from Chinese. In English, adding "ing" to a verb changes the nature of the word and turns the verb into a noun. For example, 'take' is a verb, while 'taking' is a verbal noun and can be the object of a sentence. Chinese has no such suffixation rule for turning a verb into a noun. From this analysis it is easy to see that Chinese grammar is different from English grammar, so MIRTH defines a separate query syntax for Chinese. The syntax rules therefore differ between English and Chinese. A few examples of the query syntax for a Chinese search follow.

The first syntax rule is that, if the query item is a Chinese keyword, its general structure should be Root + Root or Root + suffix, because that is the major structure of Chinese words and phrases. The search item should be a phrase or a few words with this structure:

Root + Root, or Root + suffix.

This rule is quite useful for linguistics, particularly for learning Chinese grammar, such as the structure of Chinese words and phrases. For instance, if the user's search item is 中 (middle or central in English), it can serve as a root in the derivation of new phrases, as can be seen in Fig. 9.

Figure 9: Chinese Phrasal Words

The second syntax rule is that "the keywords must include an object, which is a noun (or nouns), in the search item". For instance, a search for 'Chinese' is a standard search pattern: when the search engine gets the query, it searches around the object 'Chinese' and matches results around 'Chinese'.

The Chinese data file is searched using this special rule of pattern search. You define your query as a set of keywords and other qualifiers, such as 'Human Language' or 'Chinese Grammar'. A Boolean-type search then matches the subject 'Language' and the object 'Grammar', and finds a number of items of interest in the linguistics data file [Atwell, 1993].

4. Conclusion

In this paper we have discussed the main hierarchy of MIRTH, a multilingual information retrieval tool (also called a search engine). Three pieces of work were done for this purpose. Firstly, particular sources were "linked" together as its "virtual corpora", covering separate topics such as computing, linguistics and language, and Chinese literature. Some of these corpora are multilingual sources, which mainly present materials in English and Chinese. Secondly, an example of a multilingual environment was created with the help of UnionWay. Thirdly, an automatic tool for retrieving information from the multilingual corpora was set up. Further work to improve MIRTH should include the following:

(1) As already stated, this research has set up its own data files, but they are not big enough to hold complete information resources for a particular research area. Installing more data materials is therefore an essential task before MIRTH can be used in a real situation.

(2) To make MIRTH a complete linguistics tool, more techniques should be added, such as tagging, parsing, and analyzing the structure of a sentence [Zhou et al., 1998], [Zhou and Liu, 1997].

(3) Maintenance is an issue for the system [Liu, 1996]. A data file management system needs to be created to manipulate data files more efficiently and accurately. This management system will perform tasks such as adding, inserting, deleting, updating, replacing and sorting links with their key words.

References

Atwell, Eric 1993. Knowledge at Work in Universities, Leeds University Press.

Butler, Christopher S. 1992. Computers and Written Texts, Biddles Ltd, Guildford, Surrey.

Christian, Kaare 1988. The Unix Operating System, John Wiley & Sons, Inc.

Ford, Andrew 1995. Spinning the Web, International Thomson Publishing.

Graham, Ian 1995. The HTML Sourcebook, John Wiley & Sons.

Gilster, Paul 1996. Finding it on the Internet, John Wiley & Sons, Inc.

Huabei 1981. Huabei Computing Institute, Chinese National Standard: A Collection of Chinese Character Codes for Information Exchanging, China Standard Press House.

Liu, N.K. 1996. Formal verification of some potential contradictions in knowledge base using a High Level Net approach, Applied Intelligence, 6(4):325-344.

Liu, J. and Lee, YK. 1997. Development of a Chinese Extraction System. In Proceedings of International Conference on Computer Processing of Oriental Languages, April 2-4, 1997, Hong Kong.

Zhou, L. and Liu, J. 1997. An efficient algorithm for bilingual word translation acquisition. In the 2nd Workshop on Multilinguality in Software Industry: The AI Contribution (MULSAIC'97) of the International Joint Conference on Artificial Intelligence (IJCAI-97), August 23-29, 1997, Nagoya, Japan.

Zhou, L., Liu, J. and Yu, S.E. 1998. Study and implementation of combined techniques for automatic extraction of word translation pairs: An analysis of the contributions of word heuristics to a statistical method, to appear in International Journal on Computer Processing of Oriental Languages.
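The three query forms described in this paper (prefix*, *suffix and *root*) can be sketched as a small matcher over a word list. This is a hedged illustration rather than MIRTH's code: the function name `affix_search` is hypothetical, and two behaviours are assumptions of the sketch because the paper does not specify them, namely case-insensitive matching and treating a query without any asterisk as an exact match.

```python
from typing import List


def affix_search(query: str, words: List[str]) -> List[str]:
    """Match words against a MIRTH-style affix query.

    'dis*'    -> words starting with 'dis'   (prefix search)
    '*ing'    -> words ending with 'ing'     (suffix search)
    '*think*' -> words containing 'think'    (root search)
    'think'   -> exact match (assumption: not stated in the paper)
    """
    q = query.strip()
    leading = q.startswith("*")
    trailing = q.endswith("*")
    core = q.strip("*").casefold()
    if not core:
        return []  # a query made only of asterisks matches nothing here

    def matches(word: str) -> bool:
        w = word.casefold()  # case-insensitive matching is an assumption
        if leading and trailing:
            return core in w
        if trailing:
            return w.startswith(core)
        if leading:
            return w.endswith(core)
        return w == core

    return [word for word in words if matches(word)]


WORDS = ["dislike", "display", "think", "unthinkable", "thinking",
         "taking", "中文", "中国", "文学"]

print(affix_search("dis*", WORDS))
print(affix_search("*ing", WORDS))
print(affix_search("*think*", WORDS))
print(affix_search("中*", WORDS))
```

Because the matcher works on whole characters rather than bytes, the same rules apply unchanged to Chinese items such as "中*".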
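The "strange symbols" that a browser shows without a Chinese environment can be reproduced directly: they are the bytes of GB-coded text read one byte at a time as Western (Latin-1) characters. The sketch below assumes the text is stored as GB2312 with the high bit set on each byte (the usual 8-bit form) and uses Python's standard `gb2312` and `latin-1` codecs. The function names are hypothetical.

```python
def seen_without_chinese_support(text: str) -> str:
    """Encode Chinese text as GB2312, then misread each byte as Latin-1."""
    return text.encode("gb2312").decode("latin-1")


def seen_with_chinese_support(garbled: str) -> str:
    """Undo the misreading: recover the bytes and decode them as GB2312."""
    return garbled.encode("latin-1").decode("gb2312")


garbled = seen_without_chinese_support("中文")
print(garbled)                             # two Chinese characters become four symbols
print(seen_with_chinese_support(garbled))  # readable Chinese again
```

Each Chinese character turns into two unrelated Western symbols, which is why the same document looks so different with and without Chinese support installed.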
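The relation between the Line-Column code and the GB code described in this paper (add 32 to the line number and to the column number) can be checked with a few lines of Python. The function names are hypothetical. The step from the two printable GB letters to a stored character assumes the common 8-bit form of GB2312, in which the high bit of each byte is set, and relies on Python's standard `gb2312` and `hz` codecs.

```python
def line_column_to_gb(code: str) -> str:
    """Convert a four-digit Line-Column code such as '2083' to its GB letters."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected four digits, e.g. '2083'")
    line, column = int(code[:2]), int(code[2:])
    if not (1 <= line <= 94 and 1 <= column <= 94):
        raise ValueError("line and column must each be in 1..94")
    # Adding 32 moves both numbers into the printable ASCII range 33..126.
    return chr(line + 32) + chr(column + 32)


def gb_to_character(gb: str) -> str:
    """Look up the character for a two-letter GB code.

    Assumption: the 8-bit storage form sets the high bit of each byte.
    """
    raw = bytes(ord(ch) | 0x80 for ch in gb)
    return raw.decode("gb2312")


gb = line_column_to_gb("2083")   # line 20, column 83: the word 'big'
print(gb)                        # 4s
print(gb_to_character(gb))       # 大
print("大".encode("hz"))         # HZ wraps the same two letters in ~{ and ~}
```

The example reproduces the paper's own arithmetic: 20 + 32 = 52 is '4' and 83 + 32 = 115 is 's', so the GB code for 'big' is '4s'.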