Building a multilingual company thesaurus. Distributed development
based on IC INDEX. Project reported: Deutsche Post World Net.
Dipl.-Inf.wiss. Manfred Hauer M.A.
AGI – Information Management Consultants
Mandelring 238 b
67433 Neustadt / Weinstrasse
Germany
Tel: 06321 / 96 35 - 10
Fax: 06321 / 96 35 – 29
Manfred. [email protected]
http://www.agi-imc.de
Summary:
The transformation of Deutsche Post World Net (DPWN) from a public service provider into one of the
largest logistics enterprises forces its employees to speak and think in new terms. The need to learn
new terms is typical of deregulated and merged enterprises. Sharing the same terminology, taxonomy or
ontology is vital for exchanging knowledge between individuals. The information department was asked to
build an enterprise thesaurus covering all important subjects of all DPWN companies, branches and
departments.
AGI – Information Management Consultants was selected as external consultant and software supplier.
The thesaurus team works at 4 locations in 3 countries. The thesaurus development software IC INDEX is
based on Lotus Notes & Domino, so the content can easily be synchronized several times a day. Bringing
together an external banking thesaurus, internal lists, glossaries, printed dictionaries and the
knowledge of individuals is a huge task. In total 17,000 German terms had to be analysed and linked
together within a semantic net. All terms then have to be translated into English and French and linked.
A built-in automatic translation engine, which translates the terms from German into English and French,
speeds up the translation substantially.
To improve retrieval, the thesaurus is copied to the Excalibur search engine, which is installed in the
corporate Intranet and on the Internet. To improve intellectual indexing, the thesaurus will be
integrated into several content and document management systems. An automatic indexing engine is
scheduled for 2002. This engine is to find suitable classes and descriptors and suggest them to the
human indexers.
Table of contents
1. Deutsche Post World Net
2. AGI – Information Management Consultants
3. The need for a global thesaurus at DPWN
4. The team: 14 members
5. The workflow
6. IC INDEX: The software behind
6.1 Why IC INDEX – the point of view of DPWN
6.2 Lotus Notes & Domino as platform
6.3 Terminology concepts in IC INDEX
6.4 The way of working
7. Interface to Excalibur and Content-Management-Tools
8. The vision of future use at DPWN
1. Deutsche Post World Net
Deregulation has been a strategy of public administrations in many countries for more than 15
years. In Germany, telecommunications and postal services were united under one ministry for more
than 40 years. In the nineties this conglomerate was split into three companies:
1. Deutsche Telekom AG - focused on all kinds of telecommunications
2. Deutsche Post AG - focused on all kinds of transportation from letter to parcel
3. Deutsche Postbank AG - focused on banking and finance
A major change to this concept was the re-merger of Deutsche Post and Deutsche Postbank, reflecting
the new requirement of Supply Chain Management, in which all services related to the transportation of
goods are integrated: every physical transaction of goods is accompanied by a transaction of money.
Deutsche Post AG currently still holds the monopoly on the transportation of letters, which generates
the major part of its revenue. In a few years, however, this monopoly, which applies to Germany only,
will be lost, and Deutsche Post AG wants to become a major player in the international logistics arena,
in every kind of non-electronic transportation. The merger with DANZAS, a leading logistics company
based in Switzerland, is part of this preparation for the future. The new multinational organisation is
already number one in logistics worldwide, employing about 350,000 people – much larger than
Siemens, DaimlerChrysler or Volkswagen.
2. AGI – Information Management Consultants
In sharp contrast to DPWN, the project partner AGI – Information Management Consultants is a small
company. It consists of a headquarters in Germany, which does consulting, project management and
terminology development, and a software development team based in India. AGI has focused on
information management since 1983 and has delivered dedicated thesaurus development software and all
related services since 1987. Its major focus is on retrieval technologies, databases and highly
integrated workflows for information professionals and their clients. Lotus Notes & Domino is the
preferred software platform. “Information Center” is the brand name of the software architecture.
Information Center has become a well-established solution for some important international companies,
all of them global players with headquarters in Switzerland and Germany.
3. The need for a global thesaurus at DPWN
The move from a regulated public organisation to a modern enterprise is also a change of mindset and of
the language spoken inside and published to the outside world. The officials are about to lose their
state-oriented official language and must learn to think, write and speak in a more modern
international business jargon.
A new unified language is therefore vital for all kinds of communication, whether face-to-face or
supported by technical means. Without a shared terminology it is difficult to take decisions, to bring
issues to a result, and to produce and find documents. The new multinational approach requires English
as well as other languages to be spoken and written.
The increasing number of documents available on computer disks and the normal daily chaos in the file
directories have given rise to new technologies like Document Management (RDBMS based systems) or
Knowledge Management (Information Retrieval based systems), pushed by a large number of IT
companies. Combined with the opportunity of establishing an Intranet, this led to the recognition that
the content of documents needs to be indexed, and a small task force was set up to build a multilingual
terminology solution comprising all subjects inside this very large organisation.
4. The team: 14 members
DPWN asked its information center in Bonn to set up and manage this task force. An external partner,
AGI, was selected to provide software able to support this kind of project and to customize it
according to future needs arising from the Intranet, the search engine and the Content Management
System. Because of the large scope of the project, additional staff for terminology development was
needed. AGI provided 4 people: one external consultant specialized in logistics (based in Vienna) was
hired, and 3 internal employees focused on subjects not specific to “Post”, like banking, accounting,
economics, marketing, IT, security or geography. DPWN concentrated on the large subjects law, human
resources and all kinds of postal and transportation services.
Company   Roles   Task
DPWN      1       Project management
DPWN      2       Pre-Scanning sources for terms
DPWN      4       Developing classification and thesaurus, translation
DPWN      3       IT tasks and IT project management
DPWN      2       Translation department
AGI       1       Project management
AGI       4       Classification and thesaurus development, translation
AGI       2       Software development
In total 14 people have been working or will work on the thesaurus project. The team has been working
simultaneously in Southern India (software development) and at three European locations (Bonn, Vienna,
Neustadt/Weinstrasse). The geographical distance has never caused any problems.
One of the difficulties that occur in terminology teams is a differing understanding of terms. Each
team member was therefore responsible for one or several subjects.
All work could be done at the desk and only a few meetings were necessary. The number of meetings
decreased in the course of the project because the team members got used to the number of terms and to
the software tool, and learned to recognize the importance of a term from the point of view of DPWN.
A specific field allows terms to be marked “Should be deleted”, “Should be discussed” or “Term is
necessary – after a discussion”. The built-in discussion feature, a typical advantage of Lotus Notes &
Domino based solutions, was not used by the core team, but it was used by other interested people via
the web interface.
So the rather typical conflicts of thesaurus projects did not occur. Even though each of us was
surprised many times by colleagues and their view of terms, it was fun to work together.
5. The workflow
A period of three years is needed from the start to the completion of the multilingual version. Half of
this time was needed for internal term extraction, internal politics and the real start of the project
with software and an external partner. The timetable displays the steps:
Year  Kind of work
1999  Extraction of terms:
      Intellectual scanning of 200 printed and electronic sources;
      about 20,000 terms retrieved and collected in MS Excel files
2000  Installing software IC INDEX 5.0 at 3 locations (at end of 2000)
      Developing and testing export format for Excalibur retrieval system
      Building Classification:
      18 main topics, 150 subclasses at 3 levels
      (10 days, 4 people)
      Import of 5000 terms of a banking thesaurus
      Import of 12,000 terms in MS Excel format
2001  Building German Thesaurus:
      17,000 terms have to be checked and linked to the classes and to each other
      (several tens of thousands of links, currently 5000 non-descriptors, 3000 abbreviations)
      Check of critical terms by other internal experts
      Import in Excalibur and start of use
      Integration in Content Management Systems
2002  Translation from German to English and French:
      about 13,000 terms; terms, definitions, scope notes and all links have to be
      translated, built and checked. Support via translation engines and dedicated
      workflows.
      Reload in Excalibur and in CMS
      Integration of computer aided indexing software, automatic indexing for supporting
      human indexing
2003  Maintenance & ???
6. IC INDEX 5.0: The software behind
IC INDEX 5.0 was rebuilt in 1999 on Lotus Notes & Domino 5.0. It is based on 16 years of experience
with the successful INDEX, the leading thesaurus development software in the German-speaking
market. The prior version was already used by large publishers, broadcasting companies, banks, the
chemical and pharmaceutical industries, consultancies and educational institutions.
Although Lotus Notes & Domino is not the fastest database system for very complex data structures
with large numbers of entries, it is one of the best platforms for collaborative work. Whereas in the
past most thesauri were monolingual and built by teams at one location, most thesauri are now
multilingual and have to be set up by distributed teams. The new need for multilingual thesauri is a
result of the global economy. English has become a business language even where it is not the mother
tongue of a company; technologies are merging, companies are merging, functions are merging. All
business workflows are increasingly IT-driven or IT-supported.
A Lotus Notes & Domino based terminology software also fits the information architecture of
“Information Center” (IC), the product family of AGI – Information Management Consultants. It was
one of the last missing modules in this platform.
6.1 Why IC INDEX – the point of view of DPWN
The major reasons for selecting IC INDEX from the point of view of DPWN were:
1. Proven tool for thesaurus development
2. Based on an advanced software platform running on all important operating systems
3. Integration of classification and thesaurus concepts, giving top-down and bottom-up access
to the data
4. More than 25 relationships available for linking terms
5. Support of distributed, collaborative work within a multilingual team
6. Open to new import and export formats and to new workflows of the project team
7. External provider delivers a full service: consultancy, software, customizing, import, export,
new development & support, terminology building and project management
8. Web interface for publishing and discussion
9. Powerful support of translation work
Before the partnership with AGI – Information Management Consultants was confirmed, a long and
very detailed catalogue of skills and options had been specified at DPWN. It concerned both the
software product and the software partner. It was used to evaluate a small number of relevant players
in the market and then to make the decision.
6.2 Lotus Notes & Domino as platform
About 90 million people have a Lotus Notes client on their desktop; in addition, a large number of
users access Lotus Notes & Domino servers via web browsers in Intranets, Extranets or the Internet.
It is one of the best tools for Knowledge Management solutions in small teams operating within large
global enterprises, provided it is used as a software development platform and not just as an e-mail
system. Four programming languages, including Java, are built into a powerful programming
environment called Domino Designer. Behind this programming environment stands a strong team of
3000 IBM software developers who maintain the data structures of a multimedia database engine, the
retrieval system GTR, connectors to several other platforms like SAP, very powerful mail
communication, the best database replication available and the basic user front end. The main
competitors are Microsoft Exchange as a mail system, and Oracle and DB2 as relational database
management systems. The major focus of Lotus Notes & Domino is on documents and all related
workflows and sharing.
IC INDEX requires several forms and 70 indexed views to meet the requirements of a terminology
development platform. Databases are larger than in the old RDBMS based version, but much more
information is displayed and usability has improved considerably. More comfort costs more disk
space. Currently the DPWN database requires 300 MB including the full-text retrieval index
(16,000 terms plus all links and some definitions or scope notes).
The network that was set up has one central Lotus Domino server. Clients in Bonn and Vienna are
linked via a fast ISDN connection for replication and mail transfer. This is a very secure
environment: there is no risk of Internet attacks. Each Lotus Notes client workstation has a replica of
the thesaurus database. A replica is a copy which is synchronized several times a day. So every user
works offline on the same database and updates his own and the other replicas by exchanging changes
with the central server. Replication takes only a few minutes. Later, in addition to the central
server, a publishing database (just another replica) was installed on AGI's Domino server on the web
to give restricted access to interested people within DPWN.
The software development was done on design templates, so the designer never accessed the
production database. The production environment was updated after approval.
6.3 Terminology concepts in IC INDEX
IC INDEX can be used for building a terminology, taxonomy, ontology, topic map, encyclopaedia or
translation dictionary. We call it a thesaurus development tool, supporting thesaurus (net topology),
classification (tree topology) and keyword chains (lists). Basically any kind of semantic relationship
between terms can be modelled; the number of relationship types is not restricted, and 26 is the
default. Each term has several attributes specifying the language, the type of term, the subject,
definitions, scope notes, URLs and all kinds of multimedia content, as well as administrative fields,
which can be defined by the project administrator. All lists and the rules behind them can be edited
by administrators.
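The term attributes and typed links described above can be illustrated with a minimal Python sketch.
All class, field and relationship names used here (`Term`, `Thesaurus`, `BT`, `NT`, `RT`) are
assumptions made for illustration only; they are not taken from IC INDEX, which stores its data in
Lotus Notes documents.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with a few of the attributes named in the text."""
    label: str
    language: str                    # e.g. "de", "en", "fr"
    term_type: str = "descriptor"    # hypothetical values: descriptor, non-descriptor, abbreviation
    subject: str = ""
    scope_note: str = ""


@dataclass
class Thesaurus:
    """Terms plus typed links; the set of relationship types is open."""
    # Hypothetical inverse pairs: broader/narrower, and a symmetric "related".
    inverses: dict = field(default_factory=lambda: {"BT": "NT", "NT": "BT", "RT": "RT"})
    terms: dict = field(default_factory=dict)   # label -> Term
    links: set = field(default_factory=set)     # (source, relation, target)

    def add_term(self, term: Term) -> None:
        self.terms[term.label] = term

    def link(self, source: str, relation: str, target: str) -> None:
        """Store a link and, if an inverse is known, the reciprocal link too."""
        if source not in self.terms or target not in self.terms:
            raise KeyError("both terms must exist before they can be linked")
        self.links.add((source, relation, target))
        inverse = self.inverses.get(relation)
        if inverse is not None:
            self.links.add((target, inverse, source))

    def related(self, label: str, relation: str) -> list:
        """All targets reachable from `label` via `relation`, sorted by label."""
        return sorted(t for s, r, t in self.links if s == label and r == relation)


thesaurus = Thesaurus()
thesaurus.add_term(Term("Logistik", "de", subject="transportation"))
thesaurus.add_term(Term("Paketdienst", "de", subject="transportation"))
thesaurus.link("Paketdienst", "BT", "Logistik")   # Logistik is the broader term
```

Storing the reciprocal link automatically keeps the net consistent in both directions, so a term can
be reached top-down as well as bottom-up.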
Left frame: an outline which can be collapsed and expanded, opening different kinds of views. Some
views are personal or manually selected in shared views (Gemeinsame Listen). Administration is
visible for the Admin role only. Right frame: all views can be sorted with the small twisties at the
top of each column. Most action buttons work on groups of selected terms. In the search line above the
views, left, middle and right hand truncation is allowed for finding phonemes, terms or phrases.
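Left, middle and right hand truncation can be sketched in a few lines of Python. The wildcard
character `*` and the function name `search` are assumptions for this illustration; the actual query
syntax of IC INDEX is not described here.

```python
from fnmatch import fnmatchcase


def search(terms, pattern):
    """Return the terms matching a pattern in which '*' stands for any characters.

    '*post'  -> left hand truncation  (terms ending in 'post')
    'post*'  -> right hand truncation (terms starting with 'post')
    'p*amt'  -> middle truncation
    Matching ignores upper and lower case.
    """
    wanted = pattern.lower()
    return [t for t in terms if fnmatchcase(t.lower(), wanted)]


terms = ["Postamt", "Postbank", "Luftpost", "Paketpost", "Logistik"]
right = search(terms, "post*")
left = search(terms, "*post")
middle = search(terms, "p*amt")
```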
Left frame: a part of Administration is expanded and the form for setting up user-defined keyword
lists including variable labels is opened (window in the foreground).
1. Window in the background: here the terms are typed in.
2. Window in the middle: used for linking term 1 with term 2. All terms already linked are displayed
below the term. Several action buttons support the related workflows.
3. Window at the bottom right: used for selecting the kind of relation.
6.4 The way of working
With IC INDEX decentralized teamwork was easy. Everybody had his own subjects and worked at his own
location, at any time (the working times can be seen in the document history). Three notebook users
worked at changing locations (while travelling, at airports, in trains, during holidays at places
from the Alps to the Indian Ocean).
The only requirement was to replicate with the server several times a day in order to avoid
replication conflicts. Where conflicts did occur, the data checking utilities integrated in the
software made managing them easy and fast. No data was ever lost.
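Why frequent replication keeps conflicts rare can be shown with a small Python sketch. It is a generic
three-way comparison against the last common state, not the mechanism Lotus Notes & Domino actually
uses, and it assumes that documents are only created or edited, never deleted. The function name
`replicate` and the sample data are invented for the example.

```python
def replicate(server, replica, base):
    """Two-way sync of {doc_id: text} dicts against the last common state `base`.

    Returns the ids of documents that were changed differently on both sides
    since the last replication; those are left untouched for manual review.
    Assumption of this sketch: documents are created or edited, never deleted.
    """
    conflicts = []
    for doc_id in sorted(set(server) | set(replica)):
        old = base.get(doc_id)
        s, r = server.get(doc_id), replica.get(doc_id)
        if s == r:
            pass                          # both sides already agree
        elif r == old:
            replica[doc_id] = s           # only the server changed
        elif s == old:
            server[doc_id] = r            # only the replica changed
        else:
            conflicts.append(doc_id)      # both changed: replication conflict
            continue
        base[doc_id] = server[doc_id]     # remember the new common state
    return conflicts


base = {"T1": "Brief", "T2": "Paket"}
server = {"T1": "Brief (letter)", "T2": "Paket (server edit)"}
replica = {"T1": "Brief", "T2": "Paket (notebook edit)", "T3": "Fracht"}
conflicts = replicate(server, replica, base)
```

A conflict arises only when the same document is edited on two replicas between two replications, so
assigning subjects to individual team members and replicating often both shrink that window.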
Each expert used printed or electronic sources like dictionaries, vocabularies, glossaries,
periodicals, brochures, books, internal papers, manuals and Intranet content, in total about 300
sources.
Each term has to be linked one by one to other relevant terms. At first every term got a subject
entry such as IT, human resources, building or transportation techniques. Based on these large groups
the terms got links to the classes. Some classes got too many related keywords, some got very few;
the built-in term counting in the IC INDEX views displays such unbalanced classes, showing which
classes have to be revised and improved.
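The idea of spotting unbalanced classes by counting can be sketched as follows. The function name,
the thresholds and the sample class sizes are assumptions chosen for illustration; IC INDEX shows
such counts in its views, and the evaluation there is done by the human editor.

```python
from collections import Counter
from statistics import median


def unbalanced_classes(term_classes, low=0.25, high=4.0):
    """Count terms per class and flag classes far from the median count.

    `term_classes` maps each term to the class it is linked to. The
    thresholds `low` and `high` (multiples of the median) are arbitrary
    illustration values.
    """
    counts = Counter(term_classes.values())
    if not counts:
        return counts, {}
    mid = median(counts.values())
    flagged = {}
    for cls, n in counts.items():
        if n > high * mid:
            flagged[cls] = "too many"
        elif n < low * mid:
            flagged[cls] = "too few"
    return counts, flagged


# Invented class sizes, expanded into one entry per term.
sizes = {"IT": 50, "Law": 8, "Banking": 10, "Geography": 1, "Marketing": 12}
term_classes = {f"{cls}-term-{i}": cls for cls, n in sizes.items() for i in range(n)}
counts, flagged = unbalanced_classes(term_classes)
```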
After the terms have received their links to classes, the typical thesaurus relationships between
terms are established. The proximity or distance of one term to another is defined in a quite
sophisticated manner: Broader/Narrower terms, Related terms, Synonyms and many others; 25 different
relationships have been used. Each kind of relationship is replaced by a statistical weight when the
thesaurus is imported into Excalibur. This kind of semantic linking indeed takes the major part of
the time.
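Replacing relationship types by weights can be pictured with the following Python sketch. The
relationship names, the weight values and both function names are hypothetical; neither the weights
actually used in the project nor the Excalibur import format are given in this paper.

```python
# Hypothetical weights: closer semantic relationships get higher values.
RELATION_WEIGHTS = {
    "synonym": 1.0,
    "narrower": 0.8,
    "broader": 0.6,
    "related": 0.4,
}


def to_weighted_links(links, weights=RELATION_WEIGHTS, default=0.2):
    """Replace each (source, relation, target) link by (source, target, weight)."""
    return [(s, t, weights.get(rel, default)) for s, rel, t in links]


def expand_query(term, weighted_links, threshold=0.5):
    """Terms to add to a query for `term`, strongest links first."""
    hits = [(w, t) for s, t, w in weighted_links if s == term and w >= threshold]
    return [t for w, t in sorted(hits, key=lambda hit: (-hit[0], hit[1]))]


links = [
    ("Brief", "synonym", "Briefsendung"),
    ("Brief", "broader", "Sendung"),
    ("Brief", "related", "Zustellung"),
    ("Brief", "see also", "Porto"),       # unknown type falls back to the default weight
]
weighted = to_weighted_links(links)
expansion = expand_query("Brief", weighted)
```

With weights in place, a search engine can decide by a simple threshold how far a query is widened
along the semantic net.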
Where available or easy to write, scope notes have been added to the terms.
About 2500 abbreviations, often with 3 or more meanings, had to be reduced to one remaining
meaning each; otherwise they could not easily be used in Excalibur. The terms for the other meanings
got a message about the abbreviation in the Scope Note field.
The many English terms in IT, marketing and other modern areas, like “Computer, Marketing, Supply
Chain Management …”, have not been translated into German, because often there is no proper or
common German term. These terms got links to the related German terms and classes as well as to the
English terms and classes.
With a simple field called “Has to be treated later” and a small keyword list, terms were marked
for team discussion, checking at DPWN or forwarding to internal experts. Inverted terms and
abbreviations got special marks required for the Excalibur export.
Translation
With IC INDEX it was always possible to create a translation link. But for translating the large
number of terms an additional workflow was designed. Behind this workflow the translation engine
eTranslation Server of Linguatec (Germany) was installed, giving access to the dictionary and to
automatic translation. It was not necessary to search the dictionary or to type in all these words.
The quality of the automatic translation was often good enough, but checking it was a major task.
Here the translation department of DPWN had to be involved, basically for control and not for
editing.
Translation window: each field of the term document is prompted, backward and forward browsing is
supported, and with “Open Source” only dictionaries are listed, which can be opened with one click in
the web browser. When a term is opened, the translation, if available, is already displayed and has
to be accepted by pressing the OK button.
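The workflow of machine pre-translation followed by human control can be sketched in Python. The two
callables stand in for the translation engine and for the checking step; their names, the tiny
dictionary and the corrections are invented for this example and do not reflect the interface of the
Linguatec product.

```python
def translate_terms(terms, machine_translate, review):
    """Pre-fill each translation by machine, then let a human accept or correct it.

    `machine_translate(text)` returns a suggestion or None;
    `review(source, suggestion)` returns the final translation.
    Both callables are placeholders for the engine and the checking step.
    """
    result = {}
    for source in terms:
        suggestion = machine_translate(source)
        final = review(source, suggestion)
        result[source] = {
            "translation": final,
            "accepted_as_is": suggestion is not None and final == suggestion,
        }
    return result


# Stand-ins for illustration: a tiny dictionary and a reviewer with two corrections.
dictionary = {"Brief": "letter", "Paket": "parcel", "Filiale": "branch"}
corrections = {"Filiale": "retail outlet", "Postfach": "P.O. box"}

translated = translate_terms(
    ["Brief", "Paket", "Filiale", "Postfach"],
    machine_translate=dictionary.get,
    review=lambda source, suggestion: corrections.get(source, suggestion),
)
```

Recording whether a suggestion was accepted unchanged makes it easy to see how much of the work the
engine saved and where the checkers had to intervene.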
7. Interface to Excalibur and Content-Management-Tools
The interfaces to Excalibur and the CMS are under construction at the time of writing this conference
paper. IC INDEX has been given a new export format for Excalibur.
8. The vision of future use at DPWN
Even for trained information specialists the use of a thesaurus as large and complex as the DPWN
thesaurus is not easy. The thesaurus was always developed under the assumption that an
automatic indexing system analyses the content of the documents and suggests proper classes
and descriptors to end users not familiar with language engineering and documentation.
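What such a suggestion step could look like is shown in the following Python sketch. It is a
deliberately simple word-matching approach under stated assumptions (single-word terms, English
sample data, invented function and variable names) and not the indexing engine planned for DPWN.

```python
import re
from collections import Counter


def suggest_descriptors(text, descriptors, non_descriptors, top=3):
    """Suggest descriptors by counting thesaurus terms that occur in `text`.

    `descriptors` is a set of preferred terms; `non_descriptors` maps
    synonyms or abbreviations to their preferred term. Only single-word
    terms are matched in this sketch.
    """
    lookup = {d.lower(): d for d in descriptors}
    lookup.update({n.lower(): d for n, d in non_descriptors.items()})
    counts = Counter()
    for word in re.findall(r"\w+", text.lower()):
        if word in lookup:
            counts[lookup[word]] += 1
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [descriptor for descriptor, _ in ranked[:top]]


descriptors = {"Parcel", "Letter", "Logistics"}
non_descriptors = {"package": "Parcel", "packages": "Parcel", "mail": "Letter"}
text = "Each package is scanned; packages and one parcel leave the logistics hub with the mail."
suggestions = suggest_descriptors(text, descriptors, non_descriptors)
```

The non-descriptors do the real work here: the indexer is offered the preferred term even when the
document only uses a synonym.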
Retrieval should be supported by the same kind of automatic indexing technology. The thesaurus
support in Excalibur is not bad at all, but the power of this system cannot be used or displayed by
95 % of the users and was not built into the DPWN user interfaces we have checked so far.
Only a small link is missing. We and other companies provide this kind of technology; maybe it will
be adopted in 2002.