
A Field Survey for Establishing Priorities in the
Development of HLT Resources for Dutch
D. Binnenpoorte, F. de Vriend, J. Sturm, W. Daelemans, H. Strik, C. Cucchiarini
Introduction
A field survey of Dutch
language resources has
been carried out within the
framework of a project
launched by the Dutch
Language Union
(Nederlandse Taalunie) with
the aim of strengthening the
position of Dutch in Human
Language Technologies
(HLT).
This field survey was done in three stages.
1: BLARK. The Basic Language Resource Kit (BLARK) is a wish list for an ideal HLT field.
2: Inventory of available HLT resources.
3: Priority list. The priority list indicates which materials need to be developed to complete the BLARK. It was drawn up by comparing the inventory with the definition of the BLARK.
1: BLARK
In defining the BLARK a distinction was made between:
• Applications
• Modules
• Data
A matrix (fragment in Figure 1) was drawn up describing
• which modules are required for which applications;
• which data are required for which modules;
• what the relative importance is of the modules and data.
Figure 1 (“+” = important, “++” = very important)
Based on the full matrix a BLARK was defined.
BLARK for language technology:
Modules:
• Robust modular text pre-processing
• Morphological analysis and morpho-syntactic disambiguation
• Syntactic analysis
• Semantic analysis
Data:
• Monolingual lexicon
• Annotated corpus of text (a treebank)
• Benchmarks for evaluation
BLARK for speech technology:
Modules:
• Automatic speech recognition
• Speech synthesis
• Tools for calculating confidence measures
• Tools for identification
• Tools for (semi-) automatic annotation of speech corpora
Data:
• Speech corpora for specific applications
• Multi-modal speech corpora
• Multi-media speech corpora
• Multi-lingual speech corpora
• Benchmarks for evaluation
2: Inventory of available resources
An inventory was made to establish which of the components (modules and data) that make up the BLARK are:
• available, i.e. can be bought or are freely obtainable, e.g. as open source;
• (re-)usable.
Inventory based on:
• expert knowledge;
• information found on the internet and in the literature;
• personal communication with actors in the field.
Limited to a descriptive level: modules and data were checked against a list of evaluation criteria.
Components can only be considered usable if they are of sufficient quality, which calls for quality evaluation.
A second matrix (fragment in Figure 2) describes the availability of the components in the BLARK.
Figure 2 (1 = ‘module or data set is unavailable’ to 10 = ‘module or data set is easily obtainable’).
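The two matrices described above can be pictured as simple lookup tables. The sketch below is illustrative only: the module and application names, the marks, and the availability scores are invented placeholders, not values from the survey. Only the “+”/“++” marks (Figure 1) and the 1 to 10 availability scale (Figure 2) come from the poster; mapping the marks to the numeric weights 1 and 2 is an assumption made here for convenience.

```python
# Figure 1 marks mapped to numeric weights (the weights are an assumption).
WEIGHTS = {"+": 1, "++": 2}

# Hypothetical fragment of the importance matrix: module -> {application: mark}.
importance = {
    "module_X": {"application_1": "++", "application_2": "+"},
    "module_Y": {"application_2": "++"},
}

# Hypothetical availability scores on the Figure 2 scale
# (1 = unavailable, 10 = easily obtainable).
availability = {"module_X": 3, "module_Y": 8}


def validate_availability(scores):
    """Raise ValueError if a score falls outside the 1..10 scale."""
    for name, score in scores.items():
        if not (isinstance(score, int) and 1 <= score <= 10):
            raise ValueError(f"{name}: expected an integer from 1 to 10, got {score!r}")


def importance_total(module):
    """Sum the weighted importance of a module over all applications."""
    return sum(WEIGHTS[mark] for mark in importance[module].values())


validate_availability(availability)

# One row per module: (total importance, availability score).
summary = {m: (importance_total(m), availability[m]) for m in importance}
print(summary)
```

Keeping importance and availability in separate tables mirrors the poster's separation between the definition of the BLARK and the inventory.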
3: Priority list
Requirements for prioritization:
• the components should be relevant for a large number of applications;
• the components should currently be either unavailable, inaccessible, or have insufficient quality;
• developing the components should be feasible in the short term.
Language technology:
1. Annotated corpus of written Dutch
2. Syntactic analysis: robust recognition of sentence structure
3. Robust text pre-processing: tokenization and named entity recognition
4. Semantic annotations for the treebank mentioned above
5. Translation equivalents
6. Benchmarks for evaluation
Speech technology:
1. Automatic speech recognition (including non-native ASR, robust ASR, adaptation, and prosody recognition)
2. Speech corpora for specific applications (e.g. directory assistance, CALL)
3. Multi-media speech corpora (speech corpora that also contain information from other media such as newspapers, WWW, etc.)
4. Tools for (semi-) automatic transcription of speech data
5. Speech synthesis (including tools for unit selection)
6. Benchmarks for evaluation
The method described can be adopted for languages other than Dutch.
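The three prioritization requirements can be read as a filter followed by a ranking. The sketch below is a hypothetical illustration, not the procedure the authors used: the `Component` fields, the thresholds `min_applications` and `max_availability`, the sort order, and all example values are assumptions introduced here. Availability reuses the 1 to 10 scale of Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    n_applications: int        # number of applications that need the component
    availability: int          # 1 = unavailable .. 10 = easily obtainable
    feasible_short_term: bool  # can it be developed in the short term?


def prioritize(components, min_applications=2, max_availability=5):
    """Keep components meeting all three requirements, most widely needed first.

    Both thresholds are assumed values chosen for illustration.
    """
    candidates = [
        c for c in components
        if c.n_applications >= min_applications   # relevant for many applications
        and c.availability <= max_availability    # currently missing or hard to obtain
        and c.feasible_short_term                 # feasible in the short term
    ]
    # Rank by breadth of use, then by how scarce the component is, then by name.
    return sorted(candidates, key=lambda c: (-c.n_applications, c.availability, c.name))


# Invented example data.
example = [
    Component("component_A", 5, 2, True),
    Component("component_B", 5, 9, True),    # already easily obtainable
    Component("component_C", 1, 1, True),    # needed by too few applications
    Component("component_D", 3, 3, False),   # not feasible in the short term
    Component("component_E", 3, 4, True),
]

ranked = [c.name for c in prioritize(example)]
print(ranked)
```

Because the inputs are language-independent (applications served, availability, feasibility), the same filter could be fed an inventory for another language.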
Feedback of the HLT field
(academia and industry) was
collected at a workshop with
about 100 participants.
Conclusions
• The current HLT infrastructure is
scattered, incomplete, and not
sufficiently accessible.
• The available modules and
applications are often poorly
documented.
• There is a great need for
objective and methodologically
sound comparisons and
benchmarking of the materials.
• The components that constitute
the BLARK should be available
at low cost or free of charge.
Recommendations
• Establish an HLT agency to
collect, document and maintain
existing parts of the BLARK.
• Complete the BLARK by
encouraging funding bodies to
finance the development of the
prioritized resources.
• Make the BLARK available to
academia and HLT industry
under the conditions of open
source development.
• Develop benchmarks, test
corpora, and a methodology for
objective comparison, evaluation
and validation of parts of the
BLARK.
• Promote HLT education.
• Ensure that enough funding is
assigned to fundamental
research.