Learning Through Multimedia: Speech Recognition Enhancing Accessibility
and Interaction
Mike Wald. Journal of Educational Multimedia and Hypermedia, 2008, 17(2), 215-233.
Abstract
Lectures can present barriers to learning for many students and although
online multimedia materials have become technically easier to create and
offer many benefits for learning and teaching, they can be difficult to
access, manage, and exploit. This article considers how research on
interacting with multimedia can inform developments in using
automatically synchronised speech recognition (SR) captioning to:
1. facilitate the manipulation of digital multimedia resources;
2. support preferred learning and teaching styles and assist those who, for
cognitive physical or sensory reasons, find note-taking difficult; and
3. caption speech for deaf learners or for any learner when speech is not
available or suitable.
Speech recognition (SR) has the potential to benefit all learners through
automatically and cost-effectively providing synchronised captions and
transcripts of live or recorded speech (Bain, Basson, Faisman, & Kanevsky,
2005). SR verbatim transcriptions have been shown to particularly benefit
students who find it difficult or impossible to take notes at the same time
as listening, watching, and thinking as well as those who are unable to
attend the lecture (e.g., for mental or physical health reasons): "It's like
going back in time to the class and doing it all over again...and really
listening and understanding the notes and every thing... and learning all
over again for the second time" (Leitch & MacMillan, 2003, p. 16). This
article summarises current developments and reviews relevant research in
detail in order to identify how SR may best facilitate learning from and
interacting with multimedia.
Many systems have been developed to digitally record and replay multimedia content from face-to-face lectures, either to provide revision materials for students who attended the class or to provide a substitute learning experience for students unable to attend. A growing number of universities are also supporting the downloading of recorded lectures onto students' iPods or MP3 players (Tyre, 2005). Effective interaction with recorded
multimedia materials requires efficient methods to search for specific
parts of the recordings and early research in using SR transcriptions for
this purpose proved less successful than the use of student or teacher
annotations.
Speech, text, and images have communication qualities and strengths that
may be suitable for different content, tasks, learning styles, and
preferences and this article reviews the extensive research literature on
the components of multimedia and the value of learning style instruments.
UK Disability Discrimination Legislation requires reasonable adjustments
to be made to ensure that disabled students are not disadvantaged
(Special Educational Needs and Disability Act [SENDA], 2001) and SR offers
the potential to make multimedia materials accessible. This article also
discusses the related concepts of Universal access, Universal Design, and
Universal Usability. Synchronising speech with text can assist blind,
visually impaired, or dyslexic learners to read and search text-based
learning material more readily by augmenting unnatural synthetic speech
with natural recorded real speech. Although speech synthesis can provide
access to text-based materials for blind, visually impaired, or dyslexic
people, it can be difficult and unpleasant to listen to for long periods and
cannot match synchronised real recorded speech in conveying
"pedagogical presence," attitudes, interest, emotion, and tone and
communicating words in a foreign language and descriptions of pictures,
mathematical equations, tables, diagrams, and so forth.
Real time captioning (i.e., creating a live verbatim transcript of what is
being spoken) using phonetic keyboards can cope with people talking at up
to 240 words per minute but has not been available in UK universities
because of the cost and unavailability of stenographers. As video and
speech become more common components of online learning materials,
the need for captioned multimedia with synchronised speech and text, as
recommended by the Web Accessibility Guidelines (Web Accessibility
Initiative [WAI], 1999), can be expected to increase and so finding an
affordable method of captioning will become even more important. Current
non-SR captioning approaches are therefore reviewed and the feasibility of
using SR for captioning discussed.
SR captioning involves people reading text from a screen and so research
into displays, readability, accuracy, punctuation, and formatting is also
discussed. Since for the foreseeable future SR will produce some
recognition errors, preliminary findings are presented on methods to make
corrections in real-time.
UNIVERSAL ACCESS, DESIGN, AND USABILITY
The Code of Ethics of the Association for Computing Machinery (ACM, 1992,
1.4) states that: "In a fair society, all individuals would have equal
opportunity to participate in, or benefit from, the use of computer
resources regardless of race, sex, religion, age, disability, national origin or
other such similar factors." The concepts of Universal Access, Design, and
Usability (Shneiderman, 2000) mean that technology should benefit the
greatest number of people and situations. For example, the "curb-cut"
pavement ramps created to allow wheelchairs to cross the street also
benefited anyone pushing or riding anything with wheels. It is cheaper to design accessibility in from the start than to modify systems later, and since there is no "average" user it is important for technology to suit different abilities, preferences, situations, and
environments. While it is clearly not ethical to exclude people because of a
disability, legislation also makes it illegal to discriminate against a
disabled person. There are also strong self-interest reasons as to why
accessibility issues should be addressed. Even if not born with a disability,
an individual or someone they care about will have a high probability of
becoming disabled in some way at some point in their life (e.g., through
accident, disease, age, loud music/ noise, incorrect repetitive use of
keyboards, etc.). It also does not make good business sense to exclude disabled people from being customers, and a standard search engine can be thought of as acting like a very rich blind customer unable to find information presented in non-text formats. The Liberated Learning (LL) Consortium (Liberated Learning, 2006), working with IBM and international partners, aims to turn the vision of universal access to learning into a reality through the use of SR.
CAPTIONING METHODS
Non-SR Captioning Tools
Tools are available to synchronise pre-prepared text and corresponding audio files, either for the production of electronic books (Dolphin, 2005) based on the Digital Accessible Information SYstem (DAISY) specifications (DAISY, 2005) or for the captioning of multimedia (Media Accessible Generator [MAGpie], 2005) using, for example, the Synchronized Multimedia Integration Language (SMIL, 2005). These tools are not, however, normally
suitable or cost-effective for use by teachers for the "everyday" production
of learning materials. This is because they depend on either a teacher
reading a prepared script aloud, which can make a presentation less
natural sounding and therefore less effective, or on obtaining a written
transcript of the lecture, which is expensive and time consuming to
produce. Carrol and McLaughlin (2005) describe how HiCaption was used for captioning after they had problems using MAGpie and found the University of Wisconsin eTeach (eTeach, 2005) manual creation of transcripts, captioning tags, and timestamps too labour intensive.
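To illustrate why this manual preparation is so labour intensive, the sketch below (written in Python purely for illustration, and not the actual MAGpie or eTeach workflow) shows the kind of data a captioner must assemble by hand: every transcript segment paired with start and end times before any tool can render it as synchronised captions. The segment texts, timings, and output layout are invented for the example.

    # Illustrative sketch only: pairing a prepared transcript with hand-measured
    # timestamps, the labour-intensive step described above. All values invented.
    segments = [
        (0.0, 4.2, "Welcome to today's lecture on accessibility."),
        (4.2, 9.8, "We will look at how captions are synchronised with audio."),
    ]

    def to_timed_text(segments):
        """Emit a minimal caption track: one cue per (start, end, text) tuple."""
        lines = []
        for i, (start, end, text) in enumerate(segments, 1):
            lines.append(f"{i}")
            lines.append(f"{start:06.2f} --> {end:06.2f}")
            lines.append(text)
            lines.append("")
        return "\n".join(lines)

    print(to_timed_text(segments))

Producing such timings by hand for an hour-long lecture is exactly the effort that SR-based approaches aim to remove.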
Captioning Using SR
Informal feasibility trials were carried out in 1998 by Wald in the UK and by Saint Mary's University in Canada using existing commercially available SR software to provide a real-time verbatim displayed transcript in lectures for deaf students. These trials identified that this software (e.g., Dragon and ViaVoice; Nuance, 2005) was unsuitable as it required the
dictation of punctuation, which does not occur naturally in spontaneous
speech in lectures. The software also stored the speech synchronised with
text in proprietary nonstandard formats that lost speech and
synchronisation after editing. Without the dictation of punctuation the SR
software produced a continuous unbroken stream of text that was very
difficult to read and comprehend. The trials however showed that
reasonable accuracy could be achieved by interested and committed
lecturers who spoke very clearly and carefully after extensively training
the system to their voice by reading the training scripts and teaching the
system any new vocabulary that was not already in the dictionary. Based
on these feasibility trials the international LL Collaboration was
established by Saint Mary's University, Nova Scotia, Canada in 1999 and
since then the author has continued to work with LL and IBM to investigate
how SR can make speech more accessible.
READABILITY OF TEXT CAPTIONS
The readability of text captions will be affected by transcription errors and
how the text is displayed and formatted.
Readability Measures
Readability formulas provide a means for predicting the difficulty a reader
may have reading and understanding, usually based on the number of
syllables (or letters) in a word, and the number of words in a sentence
(Bailey, 2002). Since most readability formulas consider only these two factors, they cannot, however, explain why written material is difficult to read and comprehend. Jones et al. (2003) found no
previous work that investigated the readability of SR generated speech
transcripts and their experiments found a subjective preference for texts
with punctuation and capitals over texts automatically segmented by the
system. No objective differences were found although they were concerned
there might have been a ceiling effect and stated that future work would
include investigating whether including periods between sentences
improved readability.
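As an illustration of how little information these two factors capture, the sketch below computes the widely used Flesch Reading Ease score (206.835 - 1.015 × words per sentence - 84.6 × syllables per word) using a crude vowel-group syllable counter. The sample text and the syllable heuristic are assumptions, and, as noted above, such a formula presupposes punctuated, error-free sentences and so cannot be applied directly to unpunctuated SR transcripts.

    import re

    def count_syllables(word):
        """Very rough heuristic: count groups of consecutive vowels."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text):
        """Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

    sample = "Captions help students review lectures. Short sentences read more easily."
    print(round(flesch_reading_ease(sample), 1))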
Reading From a Screen
Mills and Weldon (1987) found it was best to present linguistically
appropriate segments by idea or phrase with longer words needing smaller
segments compared to shorter words. Smaller characters were better for
reading continuous text, larger characters better for search tasks. Muter
and Maurutto (1991) reported that reading well designed text presented
from a high quality computer screen can be as efficient as reading from a
book. Holleran and Bauersfeld (1993) showed that people preferred more
space between lines than the normal default while Muter (1996) reported
that increasing line spacing and decreasing letter spacing improved
readability although increasing text density reduced reading speed.
Highlighting only target information was found to be helpful and paging was
superior to scrolling. Piolat, Roussey, and Thunin (1997) also found that
information was better located and ideas remembered when presented in a
page by page format compared to scrolling. Boyarski, Neuwirth, Forlizzi,
and Regli (1998) reported that font had little effect on reading speeds
whether bit mapped or anti-aliased, sans serif or serif, designed for the
computer screen or not. O'Hara and Sellen (1997) found that paper was
better than a computer screen for annotating and note-taking, as it enabled
the user to find text again quickly by being able to see all or the parts of
text through laying the pages out on a desk in different ways. Hornbaek
and Frøkjaer (2003) learned that interfaces that provide an overview for
easy navigation as well as detail for reading were better than the common
linear interface for writing essays and answering questions about scientific
documents. Bailey (2000) has reported that proofreading from paper occurs
at about 200 words per minute and is 10% slower on a monitor. The
average adult reading speed for English is 250 to 300 words per minute,
which can be increased to 600 to 800 words per minute or faster with
training and practice. Rahman and Muter (1999) reported that a sentence-by-sentence format for presenting text in a small display window is as
efficient and as preferred as the normal page format. Laarni (2002) found
that dynamic presentation methods were faster to read in small-screen
interfaces than a static page display. Comprehension became poorer as the
reading rate increased but was not affected by screen size. On the 3 × 11 cm, 60-character-wide display, users preferred the presentation that added one character at a time. Vertically scrolling text was generally the fastest method, except on the mobile-phone-sized screen, where presenting one word at a time in the middle of the screen was faster. ICUE (Keegan, 2005)
uses this method on a mobile phone to enable users to read books twice as
fast as normal.
While projecting the SR text onto a large screen in the classroom has been
used successfully in LL classrooms it is clear that in many situations an
individual personalised and customisable display would be preferable or
essential. A client server personal display system has been developed
(Wald, 2005a) to provide users with their own personal display on their own
wireless handheld computer systems customised to their preferences.
Punctuation and Formatting
Verbatim transcriptions of spontaneous speech do not have the same sentence structure as written text, and so the insertion points for commas and periods/full stops are not obvious. Automatically punctuating transcribed
spontaneous speech is also very difficult as SR systems can only recognise
words not the concepts being conveyed. LL informal investigations and
trials demonstrated it was possible to develop SR applications that
provided a readable real time display by automatically breaking up the
continuous stream of text based on the length of pauses/silences. An
effective and readable approach was achieved by providing a visual
indication of pauses to indicate how the speaker grouped words together,
for example, one new line for a short pause and two for a long pause. SR
can therefore provide automatically segmented verbatim captioning for
both live and recorded speech (Wald, 2005b) and ViaScribe (IBM, 2005)
would appear to be the only SR tool that can currently create automatically
formatted synchronised captioned multimedia viewable on internet
browsers or media players.
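The following sketch illustrates the pause-based formatting idea in Python, assuming the recogniser supplies per-word start and end times; the word list and the pause thresholds are invented for the example and are not taken from ViaScribe or the LL trials. A short pause starts a new line and a long pause also leaves a blank line.

    # Sketch of pause-based segmentation, assuming (word, start_time, end_time)
    # tuples from the recogniser. Thresholds are illustrative assumptions.
    SHORT_PAUSE = 0.35   # seconds: start a new line
    LONG_PAUSE = 1.0     # seconds: also leave a blank line

    words = [
        ("speech", 0.00, 0.45), ("recognition", 0.50, 1.10),
        ("can", 1.60, 1.75), ("caption", 1.80, 2.30),
        ("lectures", 3.60, 4.20), ("automatically", 4.25, 5.00),
    ]

    def format_captions(words):
        lines, current = [], []
        prev_end = None
        for word, start, end in words:
            gap = 0.0 if prev_end is None else start - prev_end
            if current and gap >= LONG_PAUSE:
                lines.append(" ".join(current)); lines.append(""); current = []
            elif current and gap >= SHORT_PAUSE:
                lines.append(" ".join(current)); current = []
            current.append(word)
            prev_end = end
        if current:
            lines.append(" ".join(current))
        return "\n".join(lines)

    print(format_captions(words))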
SR Accuracy
Detailed feedback from students with a wide range of physical, sensory
and cognitive disabilities and from interviews with lecturers (Leitch &
MacMillan, 2003) showed that both students and teachers generally liked
the Liberated Learning concept of providing real time SR lecture
transcription and archived notes and felt it improved teaching and learning
as long as the text was reasonably accurate (e.g., >85%). While it has
proved difficult to obtain this level of accuracy in all higher education
classroom environments directly from the speech of all teachers, many
students developed strategies to cope with errors in the text and the
majority of students used the text as an additional resource to verify and
clarify what they heard. "You also don't notice the mistakes as much
anymore either. I mean you sort of get used to mistakes being there like it's
just part and parcel" (Leitch and MacMillan, p. 14). Although it can be
expected that developments in SR will continue to improve accuracy rates
(Olavsrud, 2002; IBM, 2003; Howard-Spink, 2005) the use of a human
intermediary to correct mistakes in real-time as they are made by the SR
software could, where necessary, help compensate for some of SR's
current limitations. Since not all errors are equally important the editor can
use their knowledge and experience to prioritise those that most affect
readability and understanding. To investigate this hypothesis, a prototype system was developed to explore the most efficient approach to real-time editing (Wald, 2006). Five test subjects edited SR captions for
speaking rates varying from 105 words per minute to 176 words per minute
and error rates varying from 13% to 29%. In addition to quantitative data
recorded by data logging, subjects were interviewed. All subjects believed the task to be feasible, with navigation using the mouse being preferred and producing the highest correction rate of 11 errors per minute. Further
work is required to investigate whether correction rates can be enhanced
by interface improvements and training. The use of automatic error correction through phonetic searching (Clements, Robertson, & Miller, 2002) and confidence scores (which reflect the probability that the recognised word is correct) also offers some promise.
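The accuracy figures quoted above are commonly expressed in terms of word error rate: the proportion of substituted, inserted, and deleted words relative to the reference transcript, so that an accuracy of 85% corresponds roughly to a word error rate of 15%. The sketch below computes it with a standard word-level edit distance; the example sentences are invented.

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + insertions + deletions) / reference length,
        computed by Levenshtein distance over words."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between first i reference words and first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution or match
        return dp[len(ref)][len(hyp)] / len(ref)

    ref = "speech recognition can caption live lectures"
    hyp = "speech recondition can caption lectures"
    print(f"WER: {word_error_rate(ref, hyp):.0%}")   # 2 errors over 6 words, about 33%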
ACCESSING AND INTERACTING WITH MULTIMEDIA
An extensive body of research has shown the value of synchronised
annotations or transcripts in assisting students to access and interact with
multimedia recordings.
Searching Synchronised Multimedia Using Student Notes
Although the Guinness Book of World Records (McWhirter, 1985) recorded the world's fastest typing speed as 212 words per minute, most people
handwrite or type on computers at between 20 and 40 words per minute
(Bailey, 2000) while lecturers typically speak at 150 words per minute or
above. Piolat, Olive, and Kellogg (2004) demonstrated that note-taking is not only a transcription of information that is heard or read but involves concurrent management, comprehension, selection, and production processes. Since speaking is much faster than writing, note takers must summarise and abbreviate words or concepts, and this requires mental effort that is affected by both attention and prior knowledge. Taking notes
from a lecture places more demands on working memory resources than
note-taking from a web site which is more demanding than note-taking
from a book. Barbier and Piolat (2005) found that French university
students who could write as well in English as in French could not take
notes as well in English as in French. They suggested that this
demonstrated the high cognitive demands of comprehension, selection and
reformulation of information when note-taking.
Whittaker, Hyland, and Wiley (1994) described the development and testing
of Filochat, to allow users to take notes in their normal manner and be able
to find the appropriate section of audio by synchronizing their pen strokes
with the audio recording. To enable access to speech when few notes had
been taken they envisaged that segment cues and some keywords might
be derived automatically from acoustic analysis of the speech as this
appeared to be more achievable than full SR. Wilcox, Schilit, and Sawhney
(1997) developed and tested Dynomite, a portable electronic notebook for
the capture and retrieval of handwritten notes and audio that, to cope with
the limited memory, only permanently stored the sections of audio
highlighted by the user through digital pen strokes, and displayed a
horizontal timeline to access the audio when no notes had been taken.
Their experiments corroborated the Filochat findings that people appeared
to be learning to take fewer handwritten notes and relying more on the
audio and that people wanted to improve their handwritten notes
afterwards by playing back portions of the audio. Chiu and Wilcox (1998)
described a technique for dynamically grouping digital ink pen down and
pen up events and phrase segments of audio delineated by silences to
support user interaction in freeform note-taking systems by reducing the
need to use the tape-deck controls. Chiu, Kapuskar, Reitmeier, and Wilcox (1999) described NoteLook, a client-server integrated conference room system running on wireless pen-based notebook computers designed and
built to support multimedia notetaking in meetings with digital video and
ink. Images could be incorporated into the note pages and users could
select images and write freeform ink notes. NoteLook generated web
pages with links from the images and ink strokes correlated to the video.
Slides and thumbnail images from the room cameras could be
automatically incorporated into notes. Two distinct note-taking styles were observed: one produced a set of note pages resembling a full set of the speaker's slides with annotations, while the other produced fewer pages of notes, consisting mostly of handwritten ink supplemented by some images. Stifelman, Arons, and Schmandt
(2001) described how the Audio Notebook enabled navigation through the
user's handwritten or drawn notes synchronised with the digital audio recording, assisted by a phrase detection algorithm that enabled the start of an utterance to be detected accurately. Tegrity (2005) sold the
Tegrity Pen to write and/or draw on paper while also automatically storing
the writing and drawings digitally. Software can also store handwritten
notes on a Tablet PC or typed notes on a notebook computer. Clicking on
any note replays the automatically captured and time stamped projected
slides, computer screen and audio and video of the instructor.
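A common thread running through these systems is a simple underlying structure: each pen stroke or note carries the time at which it was made, so selecting a note seeks the recording to (or slightly before) that time, and a point in the recording can be mapped back to the notes written around it. The sketch below is a generic illustration of that idea only, not the Filochat, Dynomite, NoteLook, or Tegrity implementation; the note texts, timestamps, and rewind offset are invented.

    import bisect

    # Invented (elapsed_seconds, note_text) pairs recorded while audio was captured.
    notes = [
        (12.0, "definition of universal design"),
        (95.5, "curb-cut example"),
        (240.0, "legislation: SENDA 2001"),
    ]

    REWIND = 5.0  # replay from a few seconds before the note was written

    def seek_time_for_note(notes, selected_index):
        """Return the audio offset to replay when the user selects a note."""
        timestamp, _ = notes[selected_index]
        return max(0.0, timestamp - REWIND)

    def notes_near(notes, audio_time, window=30.0):
        """Reverse lookup: which notes were written around a given point in the audio."""
        times = [t for t, _ in notes]
        lo = bisect.bisect_left(times, audio_time - window)
        hi = bisect.bisect_right(times, audio_time + window)
        return [text for _, text in notes[lo:hi]]

    print(seek_time_for_note(notes, 1))   # 90.5
    print(notes_near(notes, 100.0))       # ['curb-cut example']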
Searching Synchronised Multimedia Through Speaker Transcripts and
Annotations
Hindus and Schmandt (1992) described applications for retrieval, capturing,
and structuring spoken content from office discussions and telephone calls
without human transcription by obtaining high-quality recordings and
segmenting each participant's utterances, as SR of fluent, unconstrained
natural language was not achievable at that time. Playback up to three
times normal speed allowed faster scanning through familiar material,
although intelligibility was reduced beyond twice normal speed.
Abowd (1999) described a system where time-stamp information was
recorded for every pen stroke and pixel drawn on an electronic whiteboard
by the teacher. Students could click on the lecturer's handwriting and get
to the lecture at the time it was written, or use a time line with indications
when a new slide was visited or a new URL was browsed to load the slide
or page. Commercial SR software was not yet able to produce readable
timestamped transcripts of lectures and so additional manual effort was
used to produce more accurate time-stamped transcripts for keyword
searches to deliver pointers to portions of lectures in which the keywords
were spoken. Initial trials of students using cumbersome, slow, small tablet
computers as a personal notetaker to take their own time stamped notes
were suspended as the notes were often very similar to the lecturer's
notes available through the Web. Students who used it for every lecture
gave the least positive reaction to the overall system, whereas the most
positive reactions came from those students who never used the personal
note-taker. Both speech transcription and student note-taking with newer
networked tablet computers were incorporated into a live environment in
1999 and Abowd expected to be able to report on their effectiveness within
the year. However, no reference to this was made in the evaluation by Brotherton and Abowd (2004), which presented the findings of a longitudinal study over a three-year period of use of Classroom 2000 (renamed eClass).
This automatically creates a series of web pages at the end of a class,
integrating the audio, video, visited web pages, and the annotated slides.
The student can search for keywords or click on any of the teacher's
captured handwritten annotations on the whiteboard to launch the video of
the lecture at the time that the annotations were written. Brotherton and
Abowd concluded that eClass did not have a negative impact on
attendance or a measurable impact on performance grades but seemed to
encourage the helpful activities of reviewing lectures shortly after they
occurred and also later for exam cramming. Suggestions for future
improvements included: (a) easy to use high fidelity capture and dynamic
replay of any lecture presentation materials at user adjustable rates; (b)
supporting collaborative student note-taking and enabling students and
instructors to edit the materials; (c) automated summaries of a lecture; (d)
linking the notes to the start of a topic; (e) making the capture system
more visible to the users so that they know exactly what is being recorded
and what is not.
Klemmer, Graham, Wolff, and Landay (2003) found that the concept of using bar-codes on paper transcripts as an interface enabling fast, random access to original digital video recordings on a PDA seemed
perfectly "natural" and provided emotion and nonverbal cues in the video
not available in the printed transcript. Younger users appeared more
comfortable with the multisensory approach of simultaneously listening to
one section while reading another, even though people could read three
times faster than they listened. There was no facility to go from video to text, and occasionally participants got lost; it was thought that subtitles on the video indicating page and paragraph number might remedy this problem.
Bailey (2000) has reported that people can comfortably listen to speech at 150 to 160 words per minute, the recommended rate for audio narration; however, there is no loss in comprehension when speech is replayed at 210 words per minute. Arons (1991) investigated a hyperspeech application
using SR to explore a database of recorded speech through synthetic
speech feedback and showed that listening to speech was "easier" than
reading text as it was not necessary to look at a screen during an
interaction. Arons (1997) showed how user controlled time-compressed
speech, pause shortening within clauses, automatic emphasis detection,
and nonspeech audio feedback made it easier to browse recorded speech.
Baecker, Wolf, and Rankin (2004) mentioned research being undertaken on
further automating the production of structured searchable webcast
archives by automatically recognizing key words in the audio track. Rankin,
Baecker, and Wolf (2004) indicated that planned research included the
automatic recognition of speech, especially keywords on the audio track to
assist users searching archived ePresence webcasts in addition to the
existing methods. These included chapter titles created during the talk or
afterwards and slide titles or keywords generated from PowerPoint slides.
Dufour, Toms, Bartlett, Ferenbok, and Baecker (2004) noted that
participants performing tasks using ePresence suggested that adding a
searchable textual transcript of the lectures would be helpful. Phonetic
searching (Clements et al., 2002) is faster than searching the original
speech and can also help overcome SR "out of vocabulary" errors that
occur when words spoken are not known to the SR system, as it searches
for words based on their phonetic sounds not their spelling.
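In essence, the searchable transcripts described above reduce to an inverted index from spoken words to the times at which they were spoken, so that a query returns playback offsets rather than documents. The sketch below illustrates only this basic idea; the timestamped words are invented, and no phonetic matching of the kind described by Clements et al. (2002) is attempted.

    from collections import defaultdict

    # Invented timestamped transcript: (start_seconds, word) pairs as an SR
    # system might emit them for a recorded lecture.
    transcript = [
        (3.2, "speech"), (3.7, "recognition"), (4.3, "captions"),
        (61.0, "universal"), (61.6, "design"), (120.4, "captions"),
    ]

    def build_index(transcript):
        """Map each lower-cased word to every time it was spoken."""
        index = defaultdict(list)
        for time, word in transcript:
            index[word.lower()].append(time)
        return index

    def search(index, keyword):
        """Return playback offsets where the keyword occurs."""
        return index.get(keyword.lower(), [])

    index = build_index(transcript)
    print(search(index, "captions"))   # [4.3, 120.4]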
COMPONENTS OF MULTIMEDIA
Speech can express feelings that are difficult to convey through text (e.g.,
presence, attitudes, interest, emotion, and tone) and that cannot be
reproduced through speech synthesis. Images can communicate
information permanently and holistically and simplify complex information
and portray moods and relationships. When a student becomes distracted
or loses focus it is easy to miss or forget what has been said whereas text
reduces the memory demands of spoken language by providing a lasting
written record that can be reread. Synchronising multimedia involves using
timing information to link together text, speech, and images so all
communication qualities and strengths can be available as appropriate for
different content, tasks, learning styles, and preferences.
Faraday and Sutcliffe (1997) conducted a series of studies that tracked
eye-movement patterns during multimedia presentations and suggested
that learning could be improved by using speech to reinforce an image,
using concurrent captions or labels to reinforce speech and cueing
animations with speech. Lee and Bowers (1997) investigated how the text,
graphic/animation, and audio components of multimedia affect learning.
They found that audio and animation played simultaneously were better than when played sequentially, supporting previous research findings, and that performance improved with the number of multimedia components. Mayer
and Moreno (2002) and Moreno and Mayer (2002) developed a cognitive
theory of multimedia learning drawing on dual coding theory to present
principles of how to use multimedia to help students learn. Students
learned better in multimedia environments (i.e., two modes of
representation rather than one) and also when verbal input was presented
as speech rather than as text. This supported the theory that using
auditory working memory to hold representations of words prevents visual
working memory from being overloaded through having to split attention
between multiple visual sources of information. They speculated that more
information is likely to be held in both auditory and visual working memory
rather than in just either alone and that the combination of auditory verbal
and visual non-verbal materials may create deeper understanding than the
combination of visual verbal and visual non-verbal materials. Studies also
provided evidence that students better understood an explanation when
short captions or text summaries were presented with illustrations and
when corresponding words and pictures were presented at the same time
rather than when they were separated in time. Narayanan and Hegarty
(2002) agreed with previous research findings that multimedia design
principles that helped learning included: (a) synchronizing commentaries
with animations; (b) highlighting visuals synchronised with commentaries;
(c) using words and pictures rather than just words; (d) presenting words
and pictures together rather than separately; (e) presenting words through
the auditory channel when pictures were engaging the visual channel.
Najjar (1998) discussed principles to maximise the learning effectiveness
of multimedia: (a) text was better than speech for information
communicated only in words but speech was better than text if pictorial
information was presented as well; (b) materials that forced learners to
actively process the meaning (e.g., figure out confusing information and/or
integrate the information) could improve learning; (c) an interactive user
interface had a significant positive effect on learning from multimedia
compared to one that did not allow interaction; (d) multimedia that encouraged information to be processed through both verbal and pictorial channels appeared to help people learn better than through either channel alone, and synchronised multimedia was better than sequential. Najjar
concluded that interactive multimedia can improve a person's ability to
learn and remember if it is used to focus motivated learners' attention,
uses the media that best communicates the information and encourages
the user to actively process the information.
Carver, Howard, and Lane (1999) attempted to enhance student learning by
identifying the type of hypermedia appropriate for different learning styles.
Although a formal assessment of the results of the work was not conducted, informal assessments showed that students appeared to be learning more material at a deeper level. Although there was no significant change in the cumulative GPA, the performance of the best students substantially increased while the performance of the weakest students decreased. Hede and Hede (2002) noted that research had produced
inconsistent results regarding the effects of multimedia on learning, most
likely due to multiple factors operating, and reviewed the major design
implications of a proposed integrated model of multimedia effects. This
involved learner control in multimedia being designed to accommodate the
different abilities and styles of learners, allowing learners to focus
attention on one single media resource at a time if preferred and providing
tools for annotation and collation of notes to stimulate learner
engagement. Coffield, Moseley, Hall, and Ecclestone (2004) critically reviewed the literature on learning styles and concluded that most instruments had such low reliability and poor validity that their use should be discontinued.
IMPLICATIONS FOR USING SR IN MULTIMEDIA E-LEARNING
The implications for using SR in Multimedia E-Learning can be summarised
as follows.
1. The low reliability and poor validity of learning style instruments suggest
that students should be given the choice of media rather than attempting
to predict their preferred media and so synchronised text captions should
always be available with audio and video.
2. Captioning by hand is very time consuming and expensive and SR offers
the opportunity for cost effective captioning if accuracy and error
correction can be improved.
3. Although reading from a screen can be as efficient as reading from
paper, especially if the user can manipulate display parameters, a linear
screen display is not as easy to interact with as paper. SR can be used to
enhance learning from onscreen information by providing searchable
synchronised text captions to assist navigation.
4. The facility to adjust speech replay speed while maintaining
synchronisation with captions can reduce the time taken to listen to
recordings and so help make learning more efficient.
5. The optimum display methods for hand held personal display systems
may depend on the size of the screen and so user options should be
available to cope with a variety of devices displaying captions.
6. Note-taking in lectures is a very difficult skill, particularly for non-native speakers, and so using SR to assist students to take notes could be very
helpful, especially if students are also able to highlight and annotate the
transcribed text.
7. Improving the accuracy of the SR captions and developing faster editing
methods is important because editing SR captions is currently difficult and
slow.
8. Since existing readability measures involve measuring the length of error-free sentences with punctuation, they cannot easily be used to
measure the readability of unpunctuated SR captions and transcripts
containing errors.
9. Further research is required to investigate in detail the effects of punctuation, segmentation, and errors on readability.
CONCLUSION
Research suggests that an ideal system to digitally record and replay
multimedia content should automatically create an error free transcript of
spoken language synchronised with audio, video, and any onscreen
graphics and enable this to be displayed in the most appropriate way on
different devices and with adjustable replay speed. Annotation would be
available through pen or keyboard and mouse and be synchronised with the
multimedia content. Continued research is needed to improve the accuracy
and readability of SR captions before this vision of universal access to
learning can become an everyday reality.
References
Abowd, G. (1999). Classroom 2000: An experiment with the instrumentation
of a living educational environment. IBM Systems Journal, 38(4), 508-530.
Arons, B. (1991, December). Hyperspeech: Navigating in speech-only
hypermedia. Proceedings of the 3rd Annual ACM Conference on Hypertext
(pp. 133-146), San Antonio, TX. New York: Author.
Arons, B. (1997). SpeechSkimmer: A system for interactively skimming
recorded speech. ACM Transactions on Computer-Human Interaction
(TOCHI), 4(1), 3-38.
Association for Computing Machinery ([ACM], 1992). Retrieved March 10,
2006, from http://www.acm.org/constitution/code.html
Baecker, R. M., Wolf, P., & Rankin, K. (2004, October). The ePresence
interactive webcasting system: Technology overview and current research
issues. Proceedings of the World Conference on E-Learning in Government,
Healthcare, & Higher Education 2004 (pp. 2396-3069), Washington, DC.
Chesapeake, VA: Association for the Advancement of Computing in
Education.
Bailey, B. (2000). Human interaction speeds. Retrieved December 8, 2005,
from http://webusability.com/article_human_interaction_speeds_9_2000.htm
Bailey, B. (2002). Readability formulas and writing for the web. Retrieved
December 8, 2005, from http://webusability.com/article_readability_formula_7_2002.htm
Bain, K., Basson, S. A., Faisman, A., & Kanevsky, D. (2005). Accessibility,
transcription, and access everywhere. IBM Systems Journal, 44(3), 589-603. Retrieved December 12, 2005, from http://www.research.ibm.com/journal/sj/443/bain.pdf
Barbier, M. L., & Piolat, A. (2005, September). L1 and L2 cognitive effort of
notetaking and writing. Paper presented at the SIG Writing Conference
2004, Geneva, Switzerland.
Boyarski, D., Neuwirth, C., Forlizzi, J., & Regli, S. H. (1998, April). A study of
fonts designed for screen display. Proceedings of CHI'98 (pp. 87-94), Los
Angeles, CA.
Brotherton, J. A., & Abowd, G. D. (2004). Lessons learned from eClass: Assessing automated capture and access in the classroom. ACM
Transactions on Computer-Human Interaction, 11(2), 121-155.
Carrol, J., & McLaughlin, K. (2005). Closed captioning in distance
education. Journal of Computing Sciences in Colleges, 20(4), 183-189.
Carver, C. A., Howard, R. A., & Lane, W. D. (1999). Enhancing student
learning through hypermedia courseware and incorporation of student
learning styles. IEEE Transactions on Education, 42(1), 33-38.
Chiu, P., Kapuskar, A., Reitmeier, S., & Wilcox, L. (1999, October/November).
NoteLook: Taking notes in meetings with digital video and ink. Proceedings
of the Seventh ACM International Conference on Multimedia (Part 1, pp. 149
- 158), Orlando, FL.
Chiu, P., & Wilcox, L. (1998, November). A dynamic grouping technique for
ink and audio notes. Proceedings of the 11th Annual ACM Symposium on
User Interface Software and Technology (pp. 195-202), San Francisco, CA.
Clements, M., Robertson, S., & Miller, M. S. (2002). Phonetic searching
applied to on-line distance learning modules. Retrieved December 8, 2005,
from http://www.imtc.gatech.edu/news/multimedia/spe2002_paper.pdf
Coffield, F., Moseley, D., Hall, E., & Ecclestone, K. (2004). Learning styles
and pedagogy in post-16 learning: A systematic and critical review. London:
Learning and Skills Research Centre.
Digital Accessible Information SYstem ([DAISY], 2005). Retrieved December
8, 2005, from http://www.daisy.org
Dolphin (2005). Dolphin Publisher. [Computer software]. Retrieved
December 8, 2005, from http://www.dolphinaudiopublishing.com/
Dufour, C., Toms, E. G., Bartlett, J., Ferenbok, J., & Baecker, R. M. (2004,
May). Exploring user interaction with digital videos. Paper presented at the
Graphics Interface Conference, London, ON, Canada. Retrieved January 29,
2008, from http://kmdi.utoronto.ca/rmb/papers/E13.pdf
eTeach (2005). Retrieved December 8, 2005, from
http://eteach.engr.wisc.edu/newEteach/home.html
Faraday, P., & Sutcliffe, A. (1997, March). Designing effective multimedia
presentations. Proceedings of CHI '97 (pp. 272-278), Atlanta, GA.
Hede, T., & Hede, A. (2002, July). Multimedia effects on learning: Design
implications of an integrated model. In S. McNamara & E. Stacey (Eds.),
Untangling the web: Establishing learning links: Proceedings of ASET
Conference 2002, Melbourne, VIC, Australia. Retrieved December 8, 2005,
from http://www.ascilite.org.au/aset-archives/confs/2002/hede-t.html
Hindus, D., & Schmandt, C. (1992, November). Ubiquitous audio: Capturing
spontaneous collaboration. Proceedings of the ACM Conference on
Computer Supported Co-Operative Work (pp. 210-217), Toronto, ON,
Canada.
Holleran, P., & Bauersfeld, K. (1993, April). Vertical spacing of computer-presented text. CHI '93 Conference on Human Factors in Computing
Systems (pp. 179-180), Amsterdam, The Netherlands.
Hornbaek, K., & Frøkjaer, E. (2003). Reading patterns and usability in
visualizations of electronic documents. ACM Transactions on Computer-Human Interaction (TOCHI), 10(2), 119-149.
Howard-Spink, S. (2005). You just don't understand! Retrieved January 30,
2008, from http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/20020918_speech.html
IBM (2003). The superhuman speech recognition project. Retrieved
December 12, 2005, from
http://www.research.ibm.com/superhuman/superhuman.htm
IBM (2005). ViaScribe. Retrieved December 8, 2005, from http://www-306.ibm.com/able/solution_offerings/ViaScribe.html
Jones, D., Wolf, F., Gibson, E., Williams, E., Fedorenko, E., Reynolds, D. A., et al. (2003, September). Measuring the readability of automatic speech-to-text transcripts. Proceedings of Eurospeech (pp. 1585-1588), Geneva,
Switzerland.
Keegan, V. (2005). Read your mobile like an open book. The Guardian.
Retrieved December 8, 2005, from
http://technology.guardian.co.uk/weekly/story/0,16376,1660765,00.html
Klemmer, S. R., Graham, J., Wolff, G. J., & Landay, J. A. (2003, April). Books
with voices: Paper transcripts as a physical interface to oral histories.
Proceedings of the SIGCHI Conference on Human Factors in Computing
Systems (pp. 89-96), Fort Lauderdale, FL.
Laarni, J. (2002, October). Searching for optimal methods of presenting dynamic text on different types of screens. Proceedings of the Second Nordic Conference on Human-Computer Interaction NordiCHI '02 (pp. 219-222), Aarhus, Denmark.
Lee, A. Y., & Bowers, A. N. (1997, September). The effect of multimedia
components on learning, Proceedings of the Human Factors and
Ergonomics Society (pp. 340-344), Albuquerque, NM.
Leitch, D., & MacMillan, T. (2003). Liberated learning initiative innovative
technology and inclusion: Current issues and future directions for liberated
learning research. Halifax, Nova Scotia, Canada: Saint Mary's University.
Retrieved March 10, 2006, from http://www.liberatedlearning.com/
Liberated Learning Consortium (2006). Retrieved March 10, 2006, from
www.liberatedlearning.com
Mayer, R. E. & Moreno, R. A. (2002). Cognitive theory of multimedia
learning: Implications for design principles. Retrieved December 8, 2005,
from http://www.unm.edu/~moreno/PDFS/chi.pdf
McWhirter, N. (Ed.). (1985). The Guinness book of world records (23rd US
ed.). New York: Sterling Publishing.
Media Accessible Generator ([MAGpie], 2005). Retrieved December 8, 2005,
from http://ncam.wgbh.org/webaccess/magpie/
Mills, C., & Weldon, L. (1987). Reading text from computer screens. ACM
Computing Surveys, 19(4), 329-357.
Moreno, R., & Mayer, R. E. (2002). Visual presentations in multimedia learning: Conditions that overload visual working memory. Retrieved
December 8, 2005, from http://www.unm.edu/~moreno/PDFS/Visual.pdf
Muter, P. (1996). Interface design and optimization of reading of continuous
text. In H. van Oostendorp & S. de Mul (Eds.), Cognitive aspects of
electronic text processing, (pp. 161-180). Norwood, NJ: Ablex.
Muter, P., & Maurutto, P. (1991). Reading and skimming from computer
screens and books: The paperless office revisited? Behaviour &
Information Technology, 10(4), 257-266.
Najjar, L. J. (1998). Principles of educational multimedia user interface
design. Human Factors, 40(2), 311-323.
Narayanan, N. H., & Hegarty, M. (2002). Multimedia design for
communication of dynamic information. International Journal of Human-Computer Studies, 57(4), 279-315.
Nuance (2005). Dragon audiomining. Retrieved December 8, 2005, from
http://www.nuance.com/audiomining
O'Hara, K., & Sellen, A. (1997, March). A comparison of reading paper and
online documents. Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, (pp. 335-342), Atlanta, GA.
Olavsrud, T. (2002). IBM wants you to talk to your devices. Retrieved
December 12, 2005, from http://www.internetnews.com/ent-news/article.php/1004901
Piolat, A., Olive, T., & Kellogg, R. T. (2004). Cognitive effort during note taking. Applied Cognitive Psychology, 19(3), 291-312.
Piolat, A., Roussey, J.Y., & Thunin, O. (1997). Effects of screen presentation
on text reading and revising. International Journal of Human-Computer
Studies, 47(4), 565-589.
Rahman, T., & Muter, P. (1999). Designing an interface to optimize reading
with small display windows. Human Factors, 41(1), 106-117.
Rankin, K., Baecker, R. M., & Wolf, P. (2004, November). ePresence: An open source interactive webcasting and archiving system for e-learning.
Proceedings of E-Learn 2004 (pp. 2872-3069), Washington, DC. Chesapeake,
VA: Association for the Advancement of Computing in Education.
Shneiderman, B. (2000). Universal usability. Communications of the ACM,
43(5), 85-91. Retrieved March 10, 2006, from
http://www.cs.umd.edu/~ben/p84-shneiderman-May2000CACMf.pdf
Special Educational Needs and Disability Act ([SENDA], 2001). Retrieved
December 12, 2005, from
http://www.opsi.gov.uk/acts/acts2001/20010010.htm
Stifelman, L., Arons, B., & Schmandt, C. (2001, March/April). The audio
notebook: Paper and pen interaction with structured speech. Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems (pp.
182-189), Seattle, WA.
Synchronized Multimedia Integration Language ([SMIL], 2005). Retrieved
December 8, 2005, from http://www.w3.org/AudioVideo/
Tegrity (2005). Tegrity pen. Retrieved December 8, 2005, from
http://www.tegrity.com/
Tyre, P. (2005, November 28). Professor in your pocket. Newsweek.
Wald, M. (2005a, June). Personalised displays. Proceedings of Speech
Technologies: Captioning, Transcription and Beyond Conference, New York, IBM T. J. Watson Research Center. Retrieved December 27, 2005, from http://www.liberatedlearning.com/resources/pdf/2005_avios.pdf
Wald, M. (2005b, June). SpeechText: Enhancing learning and teaching by
using automatic speech recognition to create accessible synchronised
multimedia. Proceedings of ED-MEDIA 2005 World Conference on
Educational Multimedia, Hypermedia, & Telecommunications (pp. 4765-4769), Montreal, QC, Canada. Chesapeake, VA: Association for the
Advancement of Computing in Education.
Wald, M. (2006). Creating accessible educational multimedia through
editing automatic speech recognition captioning in real time. International
Journal of Interactive Technology and Smart Education: Smarter Use of
Technology in Education, 3(2), 131-142.
Web Accessibility Initiative ([WAI], 1999). Retrieved December 12, 2005,
from http://www.w3.org/WAI
Whittaker, S., Hyland, P., & Wiley, M. (1994, April). Filochat handwritten
notes provide access to recorded conversations. Proceedings of CHI '94
(pp. 271-277), Boston, MA.
Wilcox, L., Schilit, B., & Sawhney, N. (1997, March). Dynomite: A
dynamically organized ink and audio notebook. Proceedings of CHI '97 (pp.
186-193), Atlanta, GA.
MIKE WALD
University of Southampton
UK
[email protected]