
THE IMPLEMENTATION OF CUVOICEBROWSER, A VOICE WEB
NAVIGATION TOOL FOR THE DISABLED THAIS
Proadpran Punyabukkana, Jirasak Chirathivat, Chanin Chanma, Juthasit Maekwongtrakarn, Atiwong Suchato
Spoken Language Systems Group
Department of Computer Engineering, Faculty of Engineering
Chulalongkorn University
Thailand
[email protected], [email protected], [email protected], [email protected], [email protected]
ABSTRACT
Surfing the web is, today, a daily activity for most people. While
we utilize the web as a means of publishing information
to the world and accessing information from anywhere,
less fortunate people, including the blind and the
motor-handicapped, do not have that privilege. This paper
explains how we have designed and built the
CUVoiceBrowser, a web browser that can be controlled
by voice in the Thai language to serve the Thai disabled
community. While many have worked on text-to-speech
capabilities for reading web pages to the blind, we focus
on taking commands from the blind and the motor-handicapped,
in Thai, to navigate the web and to search
for information on it. Our prototype shows a high
accuracy rate of over 80% when users speak their web
navigation commands, and over 70% when Thai
characters are input by voice.
KEY WORDS
Web browser, speech recognition, text-to-speech, and
assistive technology
1. Introduction
Web accessibility centered around voice input will one
day allow anyone to use the web without training [1].
More importantly, this technology will allow the blind to
participate in the growing web community. A summary
of the progress of web accessibility is given by Asakawa
[2]. Many researchers have built prototypes to
demonstrate the workability of Voice-web concepts.
Hemphill et al. have developed voice-controlled
navigation using speaker-independent speech recognition
that supports a speakable hotlist, speakable links, and
smart pages in speech user agents [3]. Parente [4] has
developed a prototype of audio-enriched links that
provides a summary of web pages accessed by the blind.
Most of the work done in this area is based on the English
language. However, there have been experimental
projects on non-English systems. Brondsted et al. have
built a prototype of a Danish voice-controlled utility for
internet browsing targeting motor-handicapped users who
have difficulty using a standard keyboard and/or standard
mouse [5]. Lopez et al. [6] have offered a tool to help the
visually impaired surf the web in Spanish, but it does not
have voice-recognition capability. To date,
IBM ViaVoice and Homepage Reader are the only
commercial tools that support the Thai language. While
ViaVoice does recognize Thai language, it is not a tool to
navigate the web. Although IBM Homepage Reader does
read homepages or the web aloud in Thai, it does not have
a feature to recognize commands from users.
The blind and the motor-handicapped share similar
difficulties when it comes to surfing the web. They
cannot use a keyboard or a mouse to input commands.
This fact instantly limits their use of the web. In fact, the
blind might be in a better position if they could use a
Braille keyboard together with text-to-speech software.
Unfortunately, those whose hands are not capable of
typing will find the keyboard to be of no use. In addition,
the Thai disabled community is not necessarily proficient
in English. This further limits them when seeking
knowledge through the web. Hence, this project was
initiated to build a web browser, the “CUVoiceBrowser”,
that understands simple web navigation commands in
Thai. Although both the blind and the motor-handicapped
are considered users of this tool, its design is more
focused on the blind, who face greater limitations in
viewing the web.
2. Design and Implementation Criteria
The design of the CUVoiceBrowser web navigation tool
takes into consideration that its users are Thai
handicapped people, both in general and in particular. We
allow simple voice commands that are appropriate in the
Thai language and that may not be exact translations from
English. The blind in particular, being unable to see the
screen, cannot be expected to grasp visual concepts that
sighted users take for granted, such as frames, colors,
blinking text, and pictures.
This simple framework neither reduces the capabilities of
the tool nor makes it harder for the motor-handicapped to
use. Our design criteria mainly focus on ease of use and
ease of understanding. Taking these criteria into account,
the resulting design is as follows:
2.1 Minimal Training
To minimize the training needed, the tool only requires
the users to listen to the guidelines once, when the user
opens the program. These tell the user to locate 'Ctrl' and
the 'Spacebar', which they will use to activate each
command, and also list the available commands that users
may use. The guidelines are kept as a help menu that
the user can review at any time. As a result, users need
no formal training in the use of this tool. The structure of
the Help and guideline is shown in Figure 1. Whenever
the users want to activate the help function, they can say
"Help" in Thai at any time.
Help.html
Introduction (Welcome)
Control:
How to record (Ctrl + Spacebar)
How to pause
Essential commands (Read, Read Next, Read
previous)
Link command (Goto link)
Guideline (one.html) (How to use the tool step by step)
How to record sound
How to input addresses and goto addresses
What happens when finishing loading
Other commands
Techniques (two.html)
All commands (three.html)
Program limitation (four.html)
Figure 1 Structure of Help and Guideline for the users
2.2 Web Navigation Commands

In developing the Thai language commands, we have
attempted to use the smallest set of commands possible.
We have also tried to select commands that do not sound
similar to one another, so that accuracy will be higher.
One more consideration worth noting is that the
commands in Thai are not direct translations from
English. This reduces the frustration users might
experience if they had to learn new technical jargon. As a
result, the following groups of functions have been
defined and implemented. The commands enclosed in
" " are in Thai; they are presented here in English for
publication purposes.

a) Control functions:
i. "Open a new page" - Open a new tab
ii. "Close page" - Close the current tab
iii. "Next page" - Change to the next tab
iv. "Exit" - Exit the browser

b) Page access functions:
i. "Previous" - Back to the previous page
ii. "Next" - Forward to the next page
iii. "Reload" - Refresh the current page
iv. "Stop loading" - Stop loading the new page
v. "Input address" - Enter input mode; the characters and
special characters forming a URL are input by voice
vi. "Go" - Go to the specified URL
vii. "Search" - Enter search mode; the search item is input
by voice and a web search engine is then activated

c) Web reading functions:
i. "Read" - Read the current paragraph
ii. "Read next" - Read the next paragraph
iii. "Read previous" - Read the previous paragraph
iv. "Read again" - Read from the beginning of the current
page
v. "Stop reading" - Stop reading the text
vi. "Next link" - Read and select the next link
vii. "Previous link" - Read and select the previous link
viii. "Open link" - Go to the selected link

d) Other commands:
i. "Help" - Go to the help menu
ii. "Bookmark" - Go to the bookmark page to store the
desired page

There are altogether 21 commands to navigate the web.
The users do not need to memorize all of them, as the tool
asks the users what they would like to do at each stage.
The options are given at each stage, or the expected
response is intuitive enough without instruction. Please
note again that all the commands are in Thai. Among the
21 commands, the most used are "Read", "Read next",
and "Search".

2.3 Language Input

The highlight and the challenge of this tool is the ability
for users to input the desired information in Thai, in
addition to English, using their voice. This is particularly
useful when users want to search for something. They
may use Thai, English, numbers, or special characters as
the input, saying one character at a time. The tool
recognizes each character and echoes the result back to
the user; if it is incorrect, the user can repeat that letter.
This paper will not go into details on the English
character set. However, it is worth explaining how the
Thai language is characterized and how we have designed
our system to manage this task.

2.3.1 Thai language

In the Thai language, there are 46 consonants, 21 vowels,
and 4 tone indicators to indicate 5 tones. Out of the 46
consonants, 44 are in use. Our tool allows the use of all
44 consonants, 21 vowels, and 4 tone indicators. A
complication arises because some of the characters sound
exactly the same although the orthography of the letters is
different. Therefore, to capture the right one, the user is
asked to say that letter along with the sample word
traditionally attached to it. Though it may sound
complicated, it is trivial for Thais, since this is how the
language is taught in school when Thais first learn it. The
format of this input is similar to "A-Apple, B-Boy" in
English. However, when the user wants to input a vowel,
they say the word "vowel" in Thai, followed by the
desired vowel.

2.3.2 English language

When users want to input an English character, they only
have to speak that letter, without saying a sample word as
they do when inputting Thai. There are two reasons for
this approach. The first is that there is no duplicate sound
among the 26 letters. The second is that it lets the tool
determine automatically whether the user is inputting an
English or a Thai character.

2.3.3 Numbers and special characters

Numbers 0 to 9 can be input by saying them in Thai.
Again, since there is no duplicate sound, the tool
understands that the user wants a number once it is said.
Most special characters are allowed, and they are listed in
the guidelines for users to see what can be used.
Examples are -, _, @, #, ?, ~, , (comma), :, +, and " ".
Users say these in Thai as well.

2.4 Utilities

We have designed our speaking Help function not only to
minimize the training process, but also to serve as a
utility for users. In addition to the Help menu, it is
important to select utilities that are useful to the users.
To save users time, they can also bookmark their favorite
pages. Favorites are kept for them provided that the user
sets a keyword, which may be anything, such as the name
of a newspaper or a bank.
Once a bookmark is set, the user is able to go directly to it
in one step by saying the keyword of the bookmarked
page. If the user cannot remember the keyword, another
way to reach the kept favorite pages is to say "Bookmark".
The tool will then read the bookmark list aloud one entry
at a time. Once the desired page is read, the user can say
"Open link", and the tool will automatically navigate to
that page.
3. Architecture of the Tool
There are three main components in building the
CUVoiceBrowser: the browser, the recognizer, and the
text-to-speech engine. We used Microsoft Visual Studio
to build the browser and its functions. The recognizer
was built using the HMM technique, with HTK as a tool.
For text-to-speech, we simply call a library from IBM
ViaVoice. In this paper, we give details only on the
browser and the recognizer that we built. The overall
structure of the architecture is shown in Figure 2.

Figure 2 Architecture of the Tool
Figure 3 Sample screenshot of CUVoiceBrowser

Browser

Microsoft Foundation Class (MFC) has been used to build
our browser. As shown in Figure 2, WebBrowserView
handles the display of the webpage. Navigation tasks are
done by calling functions in WebBrowserView, and the
call is sent to WebBrowserCore, which executes and
controls each navigation task.
When the user presses Ctrl+Spacebar, the
WebBrowserView will accept the event from the keyboard
and understand that the user now wants to input a voice
command. Then, the MainFrame function will call
Recorder which will record the voice command from the
user before sending it to the Recognizer. The Recognizer
in turn calls HTK to execute the built-in module for Thai
and English commands which will perform the
recognition task. The recognizer here does the entire job
of recognizing what the users say, including web
navigation commands and character input: Thai, English,
special characters, and numbers. The output from the
Recognizer is a text string, one of the commands, which
is sent to MainFrame to accomplish what the user asked
the tool to do.
At this stage, control returns to the web browser. The
WebBrowserCore function sends the HTML string to
WebBrowserView in order for it to display the screen.
After that, the reading function is performed when
HTMLTranslator translates the source code it has
obtained from WebBrowserView. From the associated
tags, it determines which parts of the text are to be read
and which parts are links.
Within HTMLTranslator, there is a Function List, which
keeps the text to be read. When the user asks the tool
to read, the Function List sends a text string to
ViaVoice to read to the user. Links are also read, and
the tool asks the user if they want to go to the next
link. A sample of the CUVoiceBrowser screen is shown
in Figure 3.
Recognizer

As stated briefly earlier, we have used the Hidden Markov
Model (HMM) technique and the Hidden Markov Toolkit
(HTK) to build the recognizer. Seventy-five Thai
phonemes, including syllable-initial phonemes, vowel
phonemes, and coda phonemes, were used as the basic
sound units for the vocabulary in the recognizer's
dictionary. Each acoustic model corresponds to a
context-dependent triphone, i.e., one of the seventy-five
Thai phonemes with specific preceding and following
phonemes.
We collected the needed utterances from a total of 30
native Thai speakers, equally distributed between male
and female, to train our acoustic models corresponding to
the 75 phonemes. Each speaker was asked to say all of the
commands used in the system as well as every Thai and
English letter and number.
Their voices were recorded using a computer headset and
digitized at 16 kHz with 16-bit resolution. The topology of
each HMM is a five-state left-to-right model with three
emitting states. Thirty-nine-dimensional feature vectors,
consisting of Mel-frequency cepstral coefficients
(MFCCs) together with their deltas and accelerations,
were used to represent observations from speech frames.
Gaussian mixtures with two components were used to
govern the emission probabilities of the emitting states.
The Baum-Welch re-estimation algorithm with a flat-start
strategy was used to estimate the required parameters for
every HMM and Gaussian mixture. The Token Passing
algorithm was used in the decoding phase to find the most
likely hypothesis from the HMM-based triphone network
generated by the associated task grammar, which
describes either a navigation command or character
filling.
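The task grammar mentioned above constrains the decoder to hypotheses that are valid for the current mode: a single navigation command, or a character sequence in input mode. In HTK's grammar notation, a command-mode grammar of this shape might look like the fragment below; the symbol names are invented for illustration and are not the project's actual grammar.

```
$command = READ | READNEXT | READPREVIOUS | READAGAIN | STOPREADING |
           NEXTLINK | PREVLINK | OPENLINK | HELP | BOOKMARK ;
( SENT-START $command SENT-END )
```

HTK's HParse tool expands such a grammar into the word network that the Token Passing decoder searches.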
4. Discussion and Conclusion
In this paper, we have presented the design and
implementation of a prototype voice-controlled tool using
the Thai language, called the CUVoiceBrowser, that
targets novice Thai web users who are motor-handicapped
and/or blind. This prototype is the first example of web
accessibility technology implemented in Thailand. In
testing the effectiveness of the prototype, we found the
accuracy to be satisfactory. The greatest accuracy is found
when users use web navigation commands, where the
accuracy is in the high 80% range. With Thai character
input, accuracy is approximately 70%. The weakness is
English character input, where accuracy is lower than
60%. This might be attributed to deficiencies in the
English pronunciation of the native Thai speakers in our
test group. Our future research will concentrate on the
systematic processing of unstructured commands and
combinations of commands. We envision implementing a
system with feedback from the web interface so as to
guide the users more effectively and also help the user to
resolve conflicting accessibility paths. The feedback will
also increase the speed with which the handicapped user
will use the web pages. Furthermore, we have worked to
improve on the portability of the tool and the cost of using
it so that all visually-impaired and motor-handicapped
Thais will have access to the tool. We also wish to
employ information retrieval techniques to summarize the
Thai webpage for the users.
Acknowledgements
The authors thank IBM (Thailand) who supplied the
ViaVoice library module to help this project accomplish
text-to-speech capability.
References
[1] Frost, R., "Call for a Public Domain SpeechWeb,"
Communications of the ACM, Vol. 48, No. 11, November
2005, pp. 45-49.
[2] Asakawa, C., "What's the Web Like If You Can't See It?"
W4A at WWW2005, 10 May 2005, Chiba, Japan.
[3] Hemphill, C.T. and Thrift, P.R., "Surfing the Web by
Voice," ACM Multimedia 95 - Electronic Proceedings,
November 5-9, 1995, San Francisco, CA, USA.
[4] Parente, P., "Audio Enriched Links: Web Page Previews
for Blind Users," ASSETS'04, October 18-20, 2004,
Atlanta, Georgia.
[5] Brondsted, T. and Aaskoven, E., "Voice-Controlled
Internet Browsing for Motor-handicapped Users, Design
and Implementation Issues," Interspeech 2005.
[6] Lopez, R.A. and Krischning, A.I., "Finder and Reader of
Web Pages in Spanish for People with Visual
Disadvantages," Proceedings of the 16th IEEE Conference
on Electronics, Communications and Computers, 2006.