Building a multilingual company thesaurus
Distributed development based on IC INDEX
Project reported: Deutsche Post World Net

Dipl.-Inf.wiss. Manfred Hauer M.A.
AGI – Information Management Consultants
Mandelring 238 b
67433 Neustadt / Weinstrasse
Germany
Tel: 06321 / 96 35 - 10
Fax: 06321 / 96 35 – 29
Manfred. [email protected]
http://www.agi-imc.de

Summary:

The conversion of Deutsche Post World Net (DPWN) from a public supplier into one of the largest logistics enterprises forces its employees to speak and think in new terms. The need to learn new terms is typical of deregulated and merged enterprises. A shared terminology, taxonomy or ontology is vital for exchanging knowledge between individuals. The information department was asked to build an enterprise thesaurus covering all important subjects of all DPWN companies, branches and departments. AGI – Information Management Consultants was selected as external consultant and software supplier. The thesaurus team works at 4 locations in 3 countries. The thesaurus development software IC INDEX is based on Lotus Notes & Domino, so the content can easily be synchronized several times a day. Bringing together an external banking thesaurus, internal lists, glossaries, printed dictionaries and the knowledge of individuals is a huge task. In total, 17,000 German terms had to be analysed and linked within a semantic net. All terms then have to be translated into English and French and linked. A built-in automatic translation engine, which translates the terms from German into English and French, speeds up the translation substantially. To improve retrieval, the thesaurus is copied to the Excalibur search engine, which is installed in the corporate Intranet and on the Internet. To improve intellectual indexing, the thesaurus will be integrated into several content and document management systems. An automatic indexing engine is scheduled for 2002.
This engine is to find suitable classes and descriptors and suggest them to the human indexers.

Table of contents

1. Deutsche Post World Net
2. AGI – Information Management Consultants
3. The need for a global thesaurus at DPWN
4. The team: 14 members
5. The workflow
6. IC INDEX: The software behind
6.1 Why IC INDEX – the point of view of DPWN
6.2 Lotus Notes & Domino as platform
6.3 Terminology concepts in IC INDEX
6.4 The way of working
7. Interface to Excalibur and Content-Management-Tools
8. The vision of future use at DPWN

1. Deutsche Post World Net

Deregulation has been a business strategy of public administrations in many countries for more than 15 years. In Germany, telecommunication and postal services were integrated under one ministry for more than 40 years. In the nineties this conglomerate was split into three companies:

1. Deutsche Telekom AG, focused on all kinds of telecommunications
2. Deutsche Post AG, focused on all kinds of transportation from letter to parcel
3. Deutsche Postbank AG, focused on banking and finance

A major change in this concept was the re-merger of Deutsche Post and Deutsche Postbank. It reflects the new requirement of Supply Chain Management, in which all services related to the transportation of goods are integrated: every physical transfer of goods is accompanied by a transfer of money. Currently Deutsche Post AG still holds the monopoly on the transportation of letters, which generates the major part of its revenue. In a few years, however, this monopoly, which is valid for Germany only, will be lost, and Deutsche Post AG wants to become a major player in the international logistics arena, in every kind of non-electronic transportation. The merger with DANZAS, a leading logistics company based in Switzerland, is part of the preparation for the future.
The new multinational organisation is already number one in logistics worldwide, employing about 350,000 people, which makes it much larger than Siemens, Daimler Chrysler or Volkswagen.

2. AGI – Information Management Consultants

In sharp contrast to DPWN, the project partner AGI – Information Management Consultants is a small company. It consists of a headquarters in Germany, which handles consulting, project management and terminology development, and a software development team based in India. AGI has focused on information management since 1983 and has delivered dedicated thesaurus development software and all related services since 1987. Its major focus is retrieval technologies, databases and highly integrated workflows for information professionals and their clients. Lotus Notes & Domino is the favoured software platform. “Information Center” is the brand name of the software architecture. Information Center has become a well-established solution for some important international companies, all of them global players with headquarters in Switzerland and Germany.

3. The need for a global thesaurus at DPWN

The move from a regulated public organisation to a modern enterprise is also a change of mindset and of the language spoken inside and published to the outside world. The officials are about to lose their state-oriented official language and must learn to think, write and speak in a more modern international business jargon. A new unified language is therefore vital for all kinds of communication, whether face-to-face or supported by technical means. Without a shared terminology it is difficult to take decisions, to bring issues to a conclusion, and to produce and find documents. The new multinational approach requires English as well as other languages to be spoken and written.
The increasing number of documents available on computer disks and the normal daily chaos in the file directories have given rise to new technologies such as Document Management (RDBMS-based systems) and Knowledge Management (Information Retrieval-based systems), pushed by a large number of IT companies. Combined with the opportunity of establishing an Intranet, the need for indexing the content of documents was recognized, and a small task force was set up to build a multilingual terminology solution comprising all subjects inside this very large organisation.

4. The team: 14 members

DPWN asked its information center in Bonn to set up and manage this task force. An external partner, AGI, was selected to provide software able to support this kind of project and to customize it according to future needs arising from the Intranet, the search engine and the Content Management System. Due to the large scope of the project, additional staff for terminology development was needed. AGI provided 4 people: one external consultant specialized in logistics (based in Vienna), who was hired for the project, and 3 internal employees focused on subjects not specific to “Post”, such as banking, accounting, economics, marketing, IT, security and geography. DPWN concentrated on the large subjects of law, human resources and all kinds of postal and transportation services.

Company   Roles   Task
DPWN      1       Project management
DPWN      2       Pre-scanning sources for terms
DPWN      4       Developing classification and thesaurus, translation
DPWN      3       IT tasks and IT project management
DPWN      2       Translation department
AGI       1       Project management
AGI       4       Classification and thesaurus development, translation
AGI       2       Software development

In total 14 people have been working or will work for the thesaurus project. The team has been working simultaneously in Southern India (software development) and at three European locations (Bonn, Vienna, Neustadt/Weinstrasse). The geographical distance has never caused any problems.
One of the difficulties that occur in terminology teams is that members understand terms differently. Each team member was therefore responsible for one or several subjects. All work could be done at the desk, and only a few meetings were necessary. The number of meetings decreased in the course of the project because the team members got used to the number of terms and to the software tool, and learned to recognize the importance of a term from the point of view of DPWN. Using a specific field, terms can be set to “Should be deleted”, “Should be discussed” and “Term is necessary – after a discussion”. The built-in discussion technique, a typical advantage of Lotus Notes & Domino based solutions, was not used by the core team, but it was used by other interested people via the web interface. So the rather typical conflicts of thesaurus projects did not occur. Even though each of us was surprised many times by colleagues and their view of terms, it was fun to work together.

5. The workflow

A period of three years is needed from the start until the completion of the multilingual version. Half of this time was needed for internal term extraction, internal politics and the actual start of the project with software and an external partner. The timetable shows the steps:

Year   Kind of work
1999   Extraction of terms: intellectual scanning of 200 printed and electronic sources; about 20,000 terms retrieved and collected in MS Excel files
2000   Installing software IC INDEX 5.0 at 3 locations (at end of 2000)
       Developing and testing export format for Excalibur retrieval system
       Building classification: 18 main topics, 150 subclasses at 3 levels (10 days, 4 people)
       Import of 5,000 terms of a banking thesaurus
       Import of 12,000 terms in MS Excel format
2001   Building German thesaurus: 17,000 terms have to be checked and linked to the classes and to each other (several tens of thousands of links, currently 5,000 non-descriptors, 3,000 abbreviations)
       Check of critical terms by other internal experts
       Import in Excalibur and start of use
       Integration in Content Management Systems
2002   Translation from German to English and French, about 13,000 terms: terms, definitions, scope notes and all links have to be translated, built and checked
       Support via translation engines and dedicated workflows
       Reload in Excalibur and in CMS
       Integration of computer aided indexing software, automatic indexing for supporting human indexing
2003   Maintenance & ???

6. IC INDEX 5.0: The software behind

IC INDEX 5.0 was rebuilt in 1999 under Lotus Notes & Domino 5.0. It is based on 16 years of experience with the successful INDEX, the leading thesaurus development product in the German-speaking market. The prior version was already used by large publishers, broadcasting companies, banks, the chemical and pharmaceutical industries, consultancies and educational institutions. Although Lotus Notes & Domino is not the fastest database system for very complex data structures with large numbers of entries, it is one of the best platforms for collaborative work. Whereas in the past most thesauri were monolingual and built by teams at one location, most thesauri are now multilingual and have to be set up by distributed teams. The new need for multilingual thesauri is a result of the global economy: English has become a business language even where it is not the mother tongue of a company, and technologies, companies and functions are merging. All business workflows are increasingly IT-driven or IT-supported.
Terminology software based on Lotus Notes & Domino also fits into the information architecture of “Information Center” (IC), the product family of AGI – Information Management Consultants. It was one of the last missing modules in this platform.

6.1 Why IC INDEX – the point of view of DPWN

The major reasons for selecting IC INDEX from the point of view of DPWN were:

1. Proven tool for thesaurus development
2. Based on an advanced software platform running on all important operating systems
3. Integration of classification and thesaurus concepts, giving top-down and bottom-up access to the data
4. More than 25 relationships available for linking terms
5. Support of distributed, collaborative work within a multilingual team
6. Open for new import and export formats and new workflows of the project team
7. External provider delivers a full service: consultancy, software, customizing, import, export, new development & support, terminology building and project management
8. Web interface for publishing and discussion
9. Powerful support of translation work

Before the partnership with AGI – Information Management Consultants was confirmed, a long and very detailed specification of required skills and options had been drawn up at DPWN. It concerned both the software product and the software partner. It was used to check a small number of relevant players in the market and then for decision-making.

6.2 Lotus Notes & Domino as platform

About 90 million people have a Lotus Notes client on their desktop. In addition, a large number of users access Lotus Notes & Domino servers via Web browsers in Intranets, Extranets or the Internet. It is one of the best tools for Knowledge Management solutions in small teams operating within large global enterprises, provided it is used as a software development platform and not just as an e-mail system. Four programming languages, including Java, are built into a powerful programming environment called Domino Designer.
Behind this programming environment stands a strong team of 3,000 IBM software developers, who maintain the data structures of a multimedia database engine, the retrieval system GTR, connectors to several other platforms such as SAP, very powerful mail communication, the best replication of databases available and the basic user front end. The main competitors are Microsoft Exchange as a mail system, and Oracle and DB2 as relational database management systems. The major focus of Lotus Notes & Domino is documents and all related workflows and sharing.

IC INDEX requires several forms and 70 indexed views to meet the requirements of a terminology development platform. Databases are larger than in the old RDBMS-based version, but much more information is displayed and usability has improved considerably. More comfort costs more disk space. Currently the DPWN database requires 300 MB including the full-text retrieval index (16,000 terms plus all links and some definitions or scope notes).

The network that was set up has one central Lotus Domino server. Clients in Bonn and Vienna are linked via fast ISDN connections for replication and mail transfer. This is a very secure environment with no exposure to attacks from the Internet. Each Lotus Notes client workstation has a replica of the thesaurus database. A replica is a copy that is synchronized several times a day. So every user works offline on the same database and updates his own and the other replicas via exchange with the central server. Replication takes only a few minutes. In addition to the central server, a publishing database (just another replica) was later installed on AGI's Domino server on the web to give restricted access to interested people within DPWN.

Software development was done on design templates, so the designers never accessed the production database. The production environment was updated after approval.
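The replica exchange described above can be illustrated with a small sketch. It is not how Lotus Domino implements replication internally; it is a simplified model, assuming each replica is a dictionary of documents with a modification counter, and the names `replicate` and `last_sync` are hypothetical. It shows why documents edited on both sides since the last exchange become replication conflicts, and why frequent replication keeps such conflicts rare.

```python
def replicate(local, server, last_sync):
    """Two-way exchange between a local replica and the server replica.

    Both replicas are dicts of {doc_id: (modified_at, content)}.
    Documents changed on one side only are copied to the other side.
    Documents changed on both sides since last_sync are left untouched
    and reported as conflicts. Deletions are not handled in this sketch.
    """
    conflicts = []
    for doc_id in set(local) | set(server):
        mine = local.get(doc_id)
        theirs = server.get(doc_id)
        if mine == theirs:
            continue  # identical on both sides, nothing to do
        mine_changed = mine is not None and mine[0] > last_sync
        theirs_changed = theirs is not None and theirs[0] > last_sync
        if mine_changed and theirs_changed:
            conflicts.append(doc_id)  # edited on both sides: needs review
        elif mine_changed:
            server[doc_id] = mine     # push the local edit
        elif theirs_changed:
            local[doc_id] = theirs    # pull the server edit
    return sorted(conflicts)


if __name__ == "__main__":
    local = {"t1": (5, "Paket"), "t2": (12, "Brief (edited offline)")}
    server = {"t1": (5, "Paket"), "t2": (3, "Brief"), "t3": (14, "Porto")}
    print(replicate(local, server, last_sync=10))
    print(local)
```

The shorter the interval between two exchanges, the smaller the chance that the same document is edited in two replicas in between.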
6.3 Terminology concepts in IC INDEX

IC INDEX can be used for building a terminology, taxonomy, ontology, topic map, encyclopaedia or translation dictionary. We call it a thesaurus development tool; it supports thesaurus (net topology), classification (tree topology) and keyword chains (lists). Basically any kind of semantic relationship between terms can be modelled. The number of relationships is not restricted; 26 is the default. Each term has several attributes specifying the language, the type of term, the subject, definitions, scope notes, URLs and all kinds of multimedia content, as well as administrative fields, which can be defined by the project administrator. All lists and the rules behind them can be edited by administrators.

Left frame: an outline that can be collapsed and expanded, opening different kinds of views. Some views are personal or manually selected in shared views (Gemeinsame Listen). Administration is visible to the Admin role only. Right frame: all views can be sorted with the small twisties at the top of each column. Most action buttons work on groups of selected terms. In the search line above the views, left-hand, middle and right-hand truncation is allowed for finding phonemes, terms or phrases.

Left frame: a part of Administration is expanded, and the form for setting up user-defined keyword lists including variable labels is open (window in the foreground).

1. Window in the background: here the terms are typed in.
2. Window in the middle: used for linking term 1 with term 2. All terms already linked are displayed below the term. Several action buttons support the related workflows.
3. Window at the bottom right: used for selecting the kind of relation.

6.4 The way of working

With IC INDEX, decentralized teamwork was easy. Everybody had his own subjects and worked decentrally, at any time (the working times can be seen in the document history).
Three notebook users worked at different locations (while travelling, at airports, in trains, during holidays at places from the Alps to the Indian Ocean). The only requirement was to replicate with the server several times a day in order to avoid replication conflicts. Where conflicts did occur, the integrated data checking utilities of the software made managing them easy and fast. No data was ever lost.

Each expert used printed or electronic sources such as dictionaries, vocabularies, glossaries, periodicals, brochures, books, internal papers, manuals and Intranet content, in total about 300 sources. Each term has to be linked one by one to other relevant terms. First, every term received a subject entry such as IT, human resources, building or transportation techniques. Based on these large groups, the terms were linked to the classes. Some classes received too many related keywords, some very few. The built-in word count in the IC INDEX views reveals unbalanced classes and thus shows which classes have to be treated and improved.

After the terms have received their links to classes, the typical thesaurus relationships between terms are established. The proximity or distance of one term to another is defined in a quite sophisticated manner: Broader/Narrower terms, Related terms, Synonyms and many others. In total 25 different relationships were used. Each kind of relationship is replaced by a statistical weight when the thesaurus is imported into Excalibur. This semantic linking indeed takes up the major part of the time. Where available or easy to write, scope notes were added to the terms. About 2,500 abbreviations, often with 3 or more meanings, had to be reduced to one remaining meaning each; otherwise they could not easily be used in Excalibur. The terms for the other meanings received a note about the abbreviation in their Scope Note field.
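The two steps described above, counting terms per class to spot unbalanced classes and replacing each kind of relationship by a weight on export, can be sketched as follows. This is a minimal illustration and not the IC INDEX data model or the Excalibur import format, neither of which is specified in this paper. The class `Term`, the relation names and all weight values are assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical weights: the actual values used for Excalibur are not given here.
RELATION_WEIGHTS = {
    "synonym": 1.0,
    "narrower": 0.8,
    "broader": 0.6,
    "related": 0.4,
}
DEFAULT_WEIGHT = 0.2  # assumed fallback for any other relation type


@dataclass
class Term:
    label: str
    language: str = "de"
    classes: list = field(default_factory=list)  # class labels the term is linked to
    links: list = field(default_factory=list)    # (relation type, target label) pairs


def terms_per_class(terms):
    """Count how many terms are linked to each class, to reveal unbalanced classes."""
    counts = Counter()
    for term in terms:
        counts.update(term.classes)
    return counts


def export_weighted_links(terms):
    """Return (source, target, weight) rows, replacing each relation type by its weight."""
    rows = []
    for term in terms:
        for relation, target in term.links:
            weight = RELATION_WEIGHTS.get(relation, DEFAULT_WEIGHT)
            rows.append((term.label, target, weight))
    return rows


if __name__ == "__main__":
    terms = [
        Term("Paket", classes=["Transport"],
             links=[("synonym", "Päckchen"), ("broader", "Sendung")]),
        Term("Brief", classes=["Transport"], links=[("related", "Porto")]),
        Term("Girokonto", classes=["Banking"]),
    ]
    print(terms_per_class(terms).most_common())
    for row in export_weighted_links(terms):
        print(row)
```

Keeping the weights in one table means the human-readable relation types stay in the thesaurus, and only the export step decides how strongly each type should influence retrieval.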
The large number of English terms in IT, marketing and other modern areas, such as “Computer, Marketing, Supply Chain Management …”, were not translated into German, because often there is no proper or common German term. These terms were linked to the related German terms and classes as well as to the English terms and classes. With a simple field called “Has to be treated later” and a small keyword list, terms were marked for team discussion, checking at DPWN or forwarding to internal experts. Inverted terms and abbreviations received special marks required for the Excalibur export.

Translation

With IC INDEX it was always possible to build a translation link. For translating the large number of terms, however, an additional workflow was designed. Behind this workflow the translation engine eTranslation Server from Linguatec (Germany) was installed, giving access to the dictionary and the automatic translation. It was not necessary to search the dictionary or to type in all these words. The quality of the automatic translation was often good enough, but checking was a major task. Here the translation department of DPWN had to be involved, basically for checking rather than editing.

Translation window: each field of the term document is prompted, backward and forward browsing is supported, and with “Open Source” only dictionaries are listed, which can be opened with one click in the web browser. When a term is started, the translation, if available, is already displayed and has to be accepted by pressing the OK button.

7. Interface to Excalibur and Content-Management-Tools

The interfaces to Excalibur and the CMS are under construction at the time of writing this conference paper. IC INDEX has received a new output format for Excalibur.

8. The vision of future use at DPWN

Even for trained information specialists the use of a large and complex thesaurus like the DPWN thesaurus is not easy.
The thesaurus was always developed under the assumption that an automatic indexing system would analyse the content of the documents and suggest suitable classes and descriptors to end users who are not familiar with language engineering and documentation. Retrieval, too, should be supported by this kind of automatic indexing technology. The thesaurus support in Excalibur is not bad at all, but the power of this system cannot be used or displayed by 95 % of the users, and it was not built into the DPWN user interface we have checked so far. Only a small link is missing. We and other companies provide this kind of technique, and it may be adopted in 2002.
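To make the idea of suggesting descriptors concrete, here is a minimal sketch based on plain dictionary matching. It is not the automatic indexing engine planned for 2002, whose design is not described in this paper. It assumes a hypothetical lookup table `use_for` that maps every known term in lower case (descriptors, non-descriptors and abbreviations) to its preferred descriptor, and it simply counts how often each descriptor is hit in a text.

```python
import re
from collections import Counter


def suggest_descriptors(text, use_for, top_n=5):
    """Suggest descriptors for a text by longest-phrase dictionary matching.

    use_for maps each known term (lower case) to its preferred descriptor.
    Returns up to top_n (descriptor, hit count) pairs, most frequent first.
    """
    words = re.findall(r"\w+", text.lower())
    max_len = max((len(key.split()) for key in use_for), default=1)
    counts = Counter()
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in use_for:
                counts[use_for[phrase]] += 1
                i += n
                break
        else:
            i += 1  # no known term starts here
    return counts.most_common(top_n)


if __name__ == "__main__":
    use_for = {
        "supply chain management": "Supply Chain Management",
        "scm": "Supply Chain Management",  # abbreviation reduced to one meaning
        "parcel": "Parcel",
        "letter": "Letter",
    }
    text = "SCM links parcel flows. Supply chain management covers parcel and letter transport."
    for descriptor, hits in suggest_descriptors(text, use_for):
        print(descriptor, hits)
```

A real indexing engine would add linguistic normalisation and use the weighted relationships of the thesaurus; the sketch only shows how non-descriptors and abbreviations lead the end user to the preferred descriptor without requiring any knowledge of the thesaurus.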