Learning Through Multimedia: Speech Recognition Enhancing Accessibility and Interaction

Mike Wald. Journal of Educational Multimedia and Hypermedia. Norfolk: 2008. Vol. 17, Iss. 2; pg. 215, 19 pgs

Copyright Association for the Advancement of Computing in Education 2008

Abstract

Lectures can present barriers to learning for many students, and although online multimedia materials have become technically easier to create and offer many benefits for learning and teaching, they can be difficult to access, manage, and exploit. This article considers how research on interacting with multimedia can inform developments in using automatically synchronised speech recognition (SR) captioning to: 1. facilitate the manipulation of digital multimedia resources; 2. support preferred learning and teaching styles and assist those who, for cognitive, physical, or sensory reasons, find note-taking difficult; and 3. caption speech for deaf learners or for any learner when speech is not available or suitable.

Speech recognition (SR) has the potential to benefit all learners through automatically and cost-effectively providing synchronised captions and transcripts of live or recorded speech (Bain, Basson, Faisman, & Kanevsky, 2005). SR verbatim transcriptions have been shown to particularly benefit students who find it difficult or impossible to take notes at the same time as listening, watching, and thinking, as well as those who are unable to attend the lecture (e.g., for mental or physical health reasons): "It's like going back in time to the class and doing it all over again...and really listening and understanding the notes and everything...
and learning all over again for the second time" (Leitch & MacMillan, 2003, p. 16). This article summarises current developments and reviews relevant research in detail in order to identify how SR may best facilitate learning from and interacting with multimedia.

Many systems have been developed to digitally record and replay multimedia face-to-face lecture content, either to provide revision materials for students who attended the class or to provide a substitute learning experience for students unable to attend. A growing number of universities are also supporting the downloading of recorded lectures onto students' iPods or MP3 players (Tyre, 2005). Effective interaction with recorded multimedia materials requires efficient methods to search for specific parts of the recordings, and early research in using SR transcriptions for this purpose proved less successful than the use of student or teacher annotations.

Speech, text, and images have communication qualities and strengths that may suit different content, tasks, learning styles, and preferences, and this article reviews the extensive research literature on the components of multimedia and the value of learning style instruments.

UK Disability Discrimination Legislation requires reasonable adjustments to be made to ensure that disabled students are not disadvantaged (Special Educational Needs and Disability Act [SENDA], 2001), and SR offers the potential to make multimedia materials accessible. This article also discusses the related concepts of Universal Access, Universal Design, and Universal Usability.

Synchronising speech with text can assist blind, visually impaired, or dyslexic learners to read and search text-based learning material more readily by augmenting unnatural synthetic speech with natural recorded real speech.
Although speech synthesis can provide access to text-based materials for blind, visually impaired, or dyslexic people, it can be difficult and unpleasant to listen to for long periods and cannot match synchronised real recorded speech in conveying "pedagogical presence," attitudes, interest, emotion, and tone, or in communicating words in a foreign language and descriptions of pictures, mathematical equations, tables, diagrams, and so forth.

Real-time captioning (i.e., creating a live verbatim transcript of what is being spoken) using phonetic keyboards can cope with people talking at up to 240 words per minute, but has not been available in UK universities because of the cost and unavailability of stenographers. As video and speech become more common components of online learning materials, the need for captioned multimedia with synchronised speech and text, as recommended by the Web Accessibility Guidelines (Web Accessibility Initiative [WAI], 1999), can be expected to increase, and so finding an affordable method of captioning will become even more important. Current non-SR captioning approaches are therefore reviewed and the feasibility of using SR for captioning discussed. SR captioning involves people reading text from a screen, and so research into displays, readability, accuracy, punctuation, and formatting is also discussed. Since for the foreseeable future SR will produce some recognition errors, preliminary findings are presented on methods to make corrections in real time.

UNIVERSAL ACCESS, DESIGN, AND USABILITY

The Code of Ethics of the Association for Computing Machinery (ACM, 1992, 1.4) states that: "In a fair society, all individuals would have equal opportunity to participate in, or benefit from, the use of computer resources regardless of race, sex, religion, age, disability, national origin or other such similar factors."
The concepts of Universal Access, Design, and Usability (Shneiderman, 2000) mean that technology should benefit the greatest number of people and situations. For example, the "curb-cut" pavement ramps created to allow wheelchairs to cross the street also benefit anyone pushing or riding anything with wheels. It is cheaper to design accessibility in from the start than to retrofit it later, and since there is no "average" user, it is important for technology to suit a range of abilities, preferences, situations, and environments. While it is clearly unethical to exclude people because of a disability, legislation also makes it illegal to discriminate against a disabled person. There are also strong self-interest reasons for addressing accessibility issues: even an individual not born with a disability, or someone they care about, has a high probability of becoming disabled in some way at some point in life (e.g., through accident, disease, age, loud music/noise, incorrect repetitive use of keyboards, etc.). It also does not make good business sense to exclude disabled people from being customers, and standard search engines can be thought of as acting like a very rich blind customer unable to find information presented in nontext formats.

The Liberated Learning (LL) Consortium (Liberated Learning, 2006), working with IBM and international partners, aims to turn the vision of universal access to learning into a reality through the use of SR.

CAPTIONING METHODS

Non-SR Captioning Tools

Tools are available to synchronise preprepared text and corresponding audio files, either for the production of electronic books (Dolphin, 2005) based on the Digital Accessible Information SYstem (DAISY) specifications (DAISY, 2005) or for the captioning of multimedia (Media Access Generator [MAGpie], 2005) using, for example, the Synchronized Multimedia Integration Language (SMIL, 2005).
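To make the synchronisation concrete, the sketch below generates the kind of minimal SMIL document that captioning tools such as MAGpie target: an audio track and a timed text stream played in parallel. The file names, region layout, and helper function are illustrative assumptions, not the output of any particular tool.

```python
# Sketch: building a minimal SMIL presentation that pairs an audio file
# with a timed text (caption) stream. Layout values are assumptions.

def make_smil(audio_src: str, text_src: str) -> str:
    """Return a minimal SMIL document playing audio and captions in parallel."""
    return (
        '<smil>\n'
        '  <head>\n'
        '    <layout>\n'
        '      <region id="captions" top="80%" height="20%"/>\n'
        '    </layout>\n'
        '  </head>\n'
        '  <body>\n'
        '    <par>\n'  # <par> plays its child elements in parallel
        f'      <audio src="{audio_src}"/>\n'
        f'      <textstream src="{text_src}" region="captions"/>\n'
        '    </par>\n'
        '  </body>\n'
        '</smil>\n'
    )

print(make_smil("lecture.mp3", "lecture_captions.rt"))
```

The `<par>` container is what carries the timing relationship: the player renders the caption stream against the audio's clock, which is exactly the synchronisation these tools automate.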
These tools are not, however, normally suitable or cost-effective for use by teachers for the "everyday" production of learning materials. This is because they depend either on a teacher reading a prepared script aloud, which can make a presentation sound less natural and therefore less effective, or on obtaining a written transcript of the lecture, which is expensive and time-consuming to produce. Carroll and McLaughlin (2005) describe how HiCaption was used for captioning after problems using MAGpie, and after finding the University of Wisconsin eTeach (eTeach, 2005) manual creation of transcripts, captioning tags, and timestamps too labour intensive.

Captioning Using SR

Informal feasibility trials were carried out in 1998 by Wald in the UK and by Saint Mary's University in Canada, using existing commercially available SR software to provide a real-time verbatim displayed transcript in lectures for deaf students. These trials identified that this software (for example, Dragon and ViaVoice (Nuance, 2005)) was unsuitable, as it required the dictation of punctuation, which does not occur naturally in the spontaneous speech of lectures. The software also stored the speech synchronised with text in proprietary nonstandard formats that lost speech and synchronisation after editing. Without the dictation of punctuation, the SR software produced a continuous unbroken stream of text that was very difficult to read and comprehend. The trials did, however, show that reasonable accuracy could be achieved by interested and committed lecturers who spoke very clearly and carefully, after extensively training the system to their voice by reading the training scripts and teaching the system any new vocabulary not already in its dictionary.
Based on these feasibility trials, the international LL Collaboration was established by Saint Mary's University, Nova Scotia, Canada in 1999, and since then the author has continued to work with LL and IBM to investigate how SR can make speech more accessible.

READABILITY OF TEXT CAPTIONS

The readability of text captions will be affected by transcription errors and by how the text is displayed and formatted.

Readability Measures

Readability formulas provide a means of predicting the difficulty a reader may have in reading and understanding text, usually based on the number of syllables (or letters) in a word and the number of words in a sentence (Bailey, 2002). Since most readability formulas consider only these two factors, they do not explain why written material is difficult to read and comprehend. Jones et al. (2003) found no previous work investigating the readability of SR-generated speech transcripts; their experiments found a subjective preference for texts with punctuation and capitals over texts automatically segmented by the system. No objective differences were found, although they were concerned there might have been a ceiling effect, and they stated that future work would include investigating whether including periods between sentences improved readability.

Reading From a Screen

Mills and Weldon (1987) found it was best to present linguistically appropriate segments by idea or phrase, with longer words needing smaller segments compared to shorter words. Smaller characters were better for reading continuous text, larger characters better for search tasks. Muter and Maurutto (1991) reported that reading well designed text presented on a high quality computer screen can be as efficient as reading from a book.
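Returning briefly to the readability formulas described under Readability Measures above: a typical example combining words-per-sentence and syllables-per-word is the well-known Flesch Reading Ease score. The sketch below is a minimal implementation; the vowel-group syllable counter is a crude heuristic of my own, not part of any cited work.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of adjacent vowels."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```

Note how the formula presupposes sentence boundaries: applied to an unpunctuated SR transcript it would see one enormous "sentence," which is precisely why such measures cannot easily assess raw SR captions.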
Holleran and Bauersfeld (1993) showed that people preferred more space between lines than the normal default, while Muter (1996) reported that increasing line spacing and decreasing letter spacing improved readability, although increasing text density reduced reading speed. Highlighting only target information was found to be helpful, and paging was superior to scrolling. Piolat, Roussey, and Thunin (1997) also found that information was better located and ideas better remembered when presented in a page-by-page format compared to scrolling. Boyarski, Neuwirth, Forlizzi, and Regli (1998) reported that font had little effect on reading speeds, whether bit mapped or anti-aliased, sans serif or serif, designed for the computer screen or not. O'Hara and Sellen (1997) found that paper was better than a computer screen for annotating and note-taking, as it enabled the user to find text again quickly by being able to see all or parts of the text through laying the pages out on a desk in different ways. Hornbaek and Frøkjaer (2003) learned that interfaces providing an overview for easy navigation as well as detail for reading were better than the common linear interface for writing essays and answering questions about scientific documents. Bailey (2000) has reported that proofreading from paper occurs at about 200 words per minute and is 10% slower on a monitor. The average adult reading speed for English is 250 to 300 words per minute, which can be increased to 600 to 800 words per minute or faster with training and practice. Rahman and Muter (1999) reported that a sentence-by-sentence format for presenting text in a small display window is as efficient as, and as preferred as, the normal page format. Laarni (2002) found that dynamic presentation methods were faster to read on small-screen interfaces than a static page display. Comprehension became poorer as the reading rate increased but was not affected by screen size. On the 3 × 11 cm wide 60-character display, users preferred the one-character-at-a-time presentation. Vertically scrolling text was generally the fastest method, except on the mobile phone size screen, where presenting one word at a time in the middle of the screen was faster. ICUE (Keegan, 2005) uses this method on a mobile phone to enable users to read books twice as fast as normal. While projecting the SR text onto a large screen in the classroom has been used successfully in LL classrooms, it is clear that in many situations an individual personalised and customisable display would be preferable or essential. A client-server personal display system has been developed (Wald, 2005a) to provide users with their own personal display on their own wireless handheld computers, customised to their preferences.

Punctuation and Formatting

Spontaneous speech verbatim transcriptions do not have the same sentence structure as written text, and so the insertion points for commas and periods/full stops are not obvious. Automatically punctuating transcribed spontaneous speech is also very difficult, as SR systems can only recognise words, not the concepts being conveyed. LL informal investigations and trials demonstrated that it was possible to develop SR applications that provided a readable real-time display by automatically breaking up the continuous stream of text based on the length of pauses/silences. An effective and readable approach was achieved by providing a visual indication of pauses to indicate how the speaker grouped words together, for example, one new line for a short pause and two for a long pause. SR can therefore provide automatically segmented verbatim captioning for both live and recorded speech (Wald, 2005b), and ViaScribe (IBM, 2005) would appear to be the only SR tool that can currently create automatically formatted synchronised captioned multimedia viewable in internet browsers or media players.
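The pause-based formatting rule just described can be sketched directly: given word timings from an SR engine, insert one line break for a short pause and a blank line for a long pause. The thresholds and the input format are assumptions for illustration, not LL's actual values.

```python
# Sketch of the Liberated Learning formatting rule: break the word stream
# on pauses, one newline for a short pause, a blank line for a long pause.
# Pause thresholds (in seconds) are assumed values, not from the trials.

SHORT_PAUSE = 0.35
LONG_PAUSE = 1.0

def format_by_pauses(words):
    """words: list of (text, start_time, end_time) tuples from an SR engine."""
    if not words:
        return ""
    out = [words[0][0]]
    for prev, curr in zip(words, words[1:]):
        gap = curr[1] - prev[2]          # silence between consecutive words
        if gap >= LONG_PAUSE:
            out.append('\n\n' + curr[0])  # long pause: blank line
        elif gap >= SHORT_PAUSE:
            out.append('\n' + curr[0])    # short pause: new line
        else:
            out.append(' ' + curr[0])
    return ''.join(out)

words = [("so", 0.0, 0.2), ("today", 0.25, 0.6), ("we", 1.1, 1.2),
         ("begin", 1.25, 1.6), ("chapter", 3.0, 3.4), ("two", 3.45, 3.7)]
print(format_by_pauses(words))
```

The output groups words the way the speaker grouped them, which is what made the unpunctuated stream readable in the LL trials.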
SR Accuracy

Detailed feedback from students with a wide range of physical, sensory, and cognitive disabilities and from interviews with lecturers (Leitch & MacMillan, 2003) showed that both students and teachers generally liked the Liberated Learning concept of providing real-time SR lecture transcription and archived notes, and felt it improved teaching and learning as long as the text was reasonably accurate (e.g., >85%). While it has proved difficult to obtain this level of accuracy in all higher education classroom environments directly from the speech of all teachers, many students developed strategies to cope with errors in the text, and the majority of students used the text as an additional resource to verify and clarify what they heard: "You also don't notice the mistakes as much anymore either. I mean you sort of get used to mistakes being there like it's just part and parcel" (Leitch & MacMillan, 2003, p. 14). Although developments in SR can be expected to continue improving accuracy rates (Olavsrud, 2002; IBM, 2003; Howard-Spink, 2005), the use of a human intermediary to correct mistakes in real time as they are made by the SR software could, where necessary, help compensate for some of SR's current limitations. Since not all errors are equally important, the editor can use their knowledge and experience to prioritise those that most affect readability and understanding. To investigate this hypothesis, a prototype system was developed to investigate the most efficient approach to real-time editing (Wald, 2006). Five test subjects edited SR captions for speaking rates varying from 105 to 176 words per minute and error rates varying from 13% to 29%. In addition to quantitative data recorded by data logging, subjects were interviewed. All subjects believed the task to be feasible, with navigation using the mouse being preferred and producing the highest correction rate of 11 errors per minute.
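A rough calculation with the figures above shows why prioritising errors matters: at mid-range rates, errors arrive roughly twice as fast as the best observed correction rate, so an editor can only fix about half of them. The specific rates below are illustrative values drawn from the ranges reported, not a result from the study.

```python
# Rough feasibility arithmetic for real-time editing, using illustrative
# mid-range figures from the trials reported above.
speaking_rate = 150      # words per minute (trials: 105-176)
error_rate = 0.15        # fraction of words misrecognised (trials: 13%-29%)
correction_rate = 11     # errors corrected per minute (best observed)

errors_per_minute = speaking_rate * error_rate
fraction_correctable = correction_rate / errors_per_minute
print(f"{errors_per_minute:.1f} errors/min; "
      f"editor can fix about {fraction_correctable:.0%}")
```

This is why the editor's ability to triage (fixing the errors that most damage readability first) is central to the approach.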
Further work is required to investigate whether correction rates can be enhanced by interface improvements and training. The use of automatic error correction through phonetic searching (Clements, Robertson, & Miller, 2002) and confidence scores (which reflect the probability that a recognised word is correct) also offers some promise.

ACCESSING AND INTERACTING WITH MULTIMEDIA

An extensive body of research has shown the value of synchronised annotations or transcripts in assisting students to access and interact with multimedia recordings.

Searching Synchronised Multimedia Using Student Notes

Although the Guinness Book of World Records (McWhirter, 1985) recorded the world's fastest typing speed at 212 words per minute, most people handwrite or type on computers at between 20 and 40 words per minute (Bailey, 2000), while lecturers typically speak at 150 words per minute or above. Piolat, Olive, and Kellogg (2004) demonstrated that note-taking is not only a transcription of information that is heard or read but involves concurrent management, comprehension, selection, and production processes. Since speaking is much faster than writing, note takers must summarise and abbreviate words or concepts, and this requires mental effort that is affected by both attention and prior knowledge. Taking notes from a lecture places more demands on working memory resources than note-taking from a web site, which in turn is more demanding than note-taking from a book. Barbier and Piolat (2005) found that French university students who could write as well in English as in French could not take notes as well in English as in French. They suggested that this demonstrated the high cognitive demands of comprehension, selection, and reformulation of information when note-taking.
Whittaker, Hyland, and Wiley (1994) described the development and testing of Filochat, which allowed users to take notes in their normal manner and find the appropriate section of audio through the synchronisation of their pen strokes with the audio recording. To enable access to speech when few notes had been taken, they envisaged that segment cues and some keywords might be derived automatically from acoustic analysis of the speech, as this appeared more achievable than full SR. Wilcox, Schilit, and Sawhney (1997) developed and tested Dynomite, a portable electronic notebook for the capture and retrieval of handwritten notes and audio that, to cope with limited memory, permanently stored only the sections of audio highlighted by the user through digital pen strokes, and displayed a horizontal timeline to access the audio when no notes had been taken. Their experiments corroborated the Filochat findings that people appeared to be learning to take fewer handwritten notes and to rely more on the audio, and that people wanted to improve their handwritten notes afterwards by playing back portions of the audio. Chiu and Wilcox (1998) described a technique for dynamically grouping digital ink pen-down and pen-up events and phrase segments of audio delineated by silences, to support user interaction in freeform note-taking systems by reducing the need to use tape-deck controls. Chiu, Kapuskar, Reitmeier, and Wilcox (1999) described NoteLook, a client-server integrated conference room system running on wireless pen-based notebook computers, designed and built to support multimedia note-taking in meetings with digital video and ink. Images could be incorporated into the note pages, and users could select images and write freeform ink notes. NoteLook generated web pages with links from the images and ink strokes correlated to the video. Slides and thumbnail images from the room cameras could be automatically incorporated into notes.
Two distinct note-taking styles were observed, producing either a set of note pages that resembled a full set of the speaker's slides with annotations, or more handwritten ink supplemented by some images and fewer pages of notes. Stifelman, Arons, and Schmandt (2001) described how the Audio Notebook enabled navigation through the user's handwritten/drawn notes synchronised with the digital audio recording, assisted by a phrase detection algorithm enabling the start of an utterance to be detected accurately. Tegrity (2005) sold the Tegrity Pen for writing and/or drawing on paper while also automatically storing the writing and drawings digitally. Software can also store handwritten notes on a Tablet PC or typed notes on a notebook computer. Clicking on any note replays the automatically captured and time-stamped projected slides, computer screen, and audio and video of the instructor.

Searching Synchronised Multimedia Through Speaker Transcripts and Annotations

Hindus and Schmandt (1992) described applications for retrieving, capturing, and structuring spoken content from office discussions and telephone calls without human transcription, by obtaining high-quality recordings and segmenting each participant's utterances, as SR of fluent, unconstrained natural language was not achievable at that time. Playback at up to three times normal speed allowed faster scanning through familiar material, although intelligibility was reduced beyond twice normal speed. Abowd (1999) described a system in which time-stamp information was recorded for every pen stroke and pixel drawn on an electronic whiteboard by the teacher. Students could click on the lecturer's handwriting to get to the lecture at the time it was written, or use a timeline with indications of when a new slide was visited or a new URL was browsed to load the slide or page.
Commercial SR software was not yet able to produce readable time-stamped transcripts of lectures, and so additional manual effort was used to produce more accurate time-stamped transcripts for keyword searches to deliver pointers to the portions of lectures in which the keywords were spoken. Initial trials of students using cumbersome, slow, small tablet computers as personal note-takers to take their own time-stamped notes were suspended, as the notes were often very similar to the lecturer's notes available through the Web. Students who used the personal note-taker for every lecture gave the least positive reaction to the overall system, whereas the most positive reactions came from students who never used it. Both speech transcription and student note-taking with newer networked tablet computers were incorporated into a live environment in 1999, and Abowd expected to be able to report on their effectiveness within the year. However, no reference to this was made in the evaluation by Brotherton and Abowd (2004), which presented the findings of a longitudinal study of three years of use of Classroom 2000 (renamed eClass). eClass automatically creates a series of web pages at the end of a class, integrating the audio, video, visited web pages, and the annotated slides. The student can search for keywords or click on any of the teacher's captured handwritten annotations on the whiteboard to launch the video of the lecture at the time the annotations were written. Brotherton and Abowd concluded that eClass did not have a negative impact on attendance or a measurable impact on performance grades, but seemed to encourage the helpful activities of reviewing lectures shortly after they occurred and again later for exam cramming.
Suggestions for future improvements included: (a) easy-to-use, high-fidelity capture and dynamic replay of any lecture presentation materials at user-adjustable rates; (b) supporting collaborative student note-taking and enabling students and instructors to edit the materials; (c) automated summaries of a lecture; (d) linking the notes to the start of a topic; and (e) making the capture system more visible to the users so that they know exactly what is and is not being recorded. Klemmer, Graham, Wolff, and Landay (2003) found that the concept of using barcodes on paper transcripts as an interface enabling fast, random access to the original digital video recordings on a PDA seemed perfectly "natural" and provided emotion and nonverbal cues in the video not available in the printed transcript. Younger users appeared more comfortable with the multisensory approach of simultaneously listening to one section while reading another, even though people could read three times faster than they listened. There was no facility to go from video to text, and occasionally participants got lost; it was thought that subtitles on the video indicating page and paragraph number might remedy this problem. Bailey (2000) has reported that people can comfortably listen to speech at 150 to 160 words per minute, the recommended rate for audio narration; however, there is no loss in comprehension when speech is replayed at 210 words per minute. Arons (1991) investigated a hyperspeech application using SR to explore a database of recorded speech through synthetic speech feedback and showed that listening to speech was "easier" than reading text, as it was not necessary to look at a screen during an interaction. Arons (1997) showed how user-controlled time-compressed speech, pause shortening within clauses, automatic emphasis detection, and nonspeech audio feedback made it easier to browse recorded speech.
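Time-compressed replay of the kind Arons describes interacts simply with synchronised captions: replaying 150 wpm speech at 210 wpm is a speed-up factor of 1.4, and keeping captions aligned only requires dividing each cue's timestamps by that factor. The sketch below assumes a simple (start, end, text) cue representation; it is an illustration of the timing arithmetic, not any cited system's implementation.

```python
# Sketch: keeping caption cues synchronised with time-compressed playback.
# A cue recorded at t seconds appears at t / speed seconds when the audio
# is replayed 'speed' times faster than real time.

def rescale_cues(cues, speed):
    """cues: list of (start, end, text); speed: e.g. 1.4 for 210/150 wpm."""
    return [(start / speed, end / speed, text) for (start, end, text) in cues]

cues = [(0.0, 2.8, "Welcome back."), (2.8, 7.0, "Last week we covered...")]
for start, end, text in rescale_cues(cues, 210 / 150):
    print(f"{start:.2f}-{end:.2f}: {text}")
```

This is the mechanism behind implication 4 in the summary below: adjustable replay speed need not break caption synchronisation.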
Baecker, Wolf, and Rankin (2004) mentioned research being undertaken on further automating the production of structured searchable webcast archives by automatically recognising keywords in the audio track. Rankin, Baecker, and Wolff (2004) indicated that planned research included the automatic recognition of speech, especially keywords in the audio track, to assist users searching archived ePresence webcasts in addition to the existing methods. These included chapter titles created during the talk or afterwards, and slide titles or keywords generated from PowerPoint slides. Dufour, Toms, Bartlett, Ferencok, and Baecker (2004) noted that participants performing tasks using ePresence suggested that adding a searchable textual transcript of the lectures would be helpful. Phonetic searching (Clements et al., 2002) is faster than searching the original speech and can also help overcome SR "out of vocabulary" errors, which occur when spoken words are not known to the SR system, as it searches for words based on their phonetic sounds rather than their spelling.

COMPONENTS OF MULTIMEDIA

Speech can express feelings that are difficult to convey through text (e.g., presence, attitudes, interest, emotion, and tone) and that cannot be reproduced through speech synthesis. Images can communicate information permanently and holistically, simplify complex information, and portray moods and relationships. When a student becomes distracted or loses focus, it is easy to miss or forget what has been said, whereas text reduces the memory demands of spoken language by providing a lasting written record that can be reread. Synchronising multimedia involves using timing information to link together text, speech, and images so that all their communication qualities and strengths can be available as appropriate for different content, tasks, learning styles, and preferences.
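The sound-not-spelling idea behind the phonetic searching mentioned above can be illustrated with the classic Soundex code, which maps similar-sounding words to the same short key. Soundex is a deliberately simple stand-in here; the phonetic search engines cited use much richer phoneme-lattice matching.

```python
# Illustration of matching words by sound rather than spelling, using the
# classic Soundex algorithm (a simple stand-in for real phonetic search).

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word: str) -> str:
    """Return the 4-character Soundex code, e.g. 'R163' for 'Robert'."""
    word = word.lower()
    coded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:   # skip vowels and repeated codes
            coded += code
        if ch not in "hw":          # h/w do not reset the previous code
            prev = code
    return (coded + "000")[:4]

# Words that sound alike get the same code despite different spellings:
print(soundex("Robert"), soundex("Rupert"))   # both R163
```

A search index built on such codes would retrieve "Rupert" for a query misrecognised as "Robert," which is how phonetic matching sidesteps out-of-vocabulary recognition errors.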
Faraday and Sutcliffe (1997) conducted a series of studies that tracked eye-movement patterns during multimedia presentations and suggested that learning could be improved by using speech to reinforce an image, using concurrent captions or labels to reinforce speech, and cueing animations with speech. Lee and Bowers (1997) investigated how the text, graphic/animation, and audio components of multimedia affect learning. They found that audio and animation played simultaneously were better than played sequentially, supporting previous research findings, and that performance improved with the number of multimedia components. Mayer and Moreno (2002) and Moreno and Mayer (2002) developed a cognitive theory of multimedia learning, drawing on dual coding theory, to present principles of how to use multimedia to help students learn. Students learned better in multimedia environments (i.e., two modes of representation rather than one) and also when verbal input was presented as speech rather than as text. This supported the theory that using auditory working memory to hold representations of words prevents visual working memory from being overloaded by having to split attention between multiple visual sources of information. They speculated that more information is likely to be held in auditory and visual working memory together than in either alone, and that the combination of auditory verbal and visual nonverbal materials may create deeper understanding than the combination of visual verbal and visual nonverbal materials. Studies also provided evidence that students better understood an explanation when short captions or text summaries were presented with illustrations, and when corresponding words and pictures were presented at the same time rather than separated in time.
Narayanan and Hegarty (2002) agreed with previous research findings that multimedia design principles that helped learning included: (a) synchronising commentaries with animations; (b) highlighting visuals in synchrony with commentaries; (c) using words and pictures rather than just words; (d) presenting words and pictures together rather than separately; and (e) presenting words through the auditory channel when pictures were engaging the visual channel. Najjar (1998) discussed principles for maximising the learning effectiveness of multimedia: (a) text was better than speech for information communicated only in words, but speech was better than text if pictorial information was presented as well; (b) materials that forced learners to actively process the meaning (e.g., figure out confusing information and/or integrate the information) could improve learning; (c) an interactive user interface had a significant positive effect on learning from multimedia compared to one that did not allow interaction; and (d) multimedia that encouraged information to be processed through both verbal and pictorial channels appeared to help people learn better than through either channel alone, and synchronised multimedia was better than sequential. Najjar concluded that interactive multimedia can improve a person's ability to learn and remember if it is used to focus motivated learners' attention, uses the media that best communicate the information, and encourages the user to actively process the information. Carver, Howard, and Lane (1999) attempted to enhance student learning by identifying the types of hypermedia appropriate for different learning styles.
Although a formal assessment of the results of the work was not conducted, informal assessments suggested that students appeared to be learning more material at a deeper level, and although there was no significant change in cumulative GPA, the performance of the best students substantially increased while the performance of the weakest students decreased. Hede and Hede (2002) noted that research had produced inconsistent results regarding the effects of multimedia on learning, most likely due to multiple factors operating, and reviewed the major design implications of a proposed integrated model of multimedia effects. This involved designing learner control in multimedia to accommodate the different abilities and styles of learners, allowing learners to focus attention on a single media resource at a time if preferred, and providing tools for annotation and collation of notes to stimulate learner engagement. Coffield, Moseley, and Hall (2004) critically reviewed the literature on learning styles and concluded that most instruments had such low reliability and poor validity that their use should be discontinued.

IMPLICATIONS FOR USING SR IN MULTIMEDIA E-LEARNING

The implications for using SR in multimedia e-learning can be summarised as follows.

1. The low reliability and poor validity of learning style instruments suggest that students should be given the choice of media rather than attempting to predict their preferred media, and so synchronised text captions should always be available with audio and video.

2. Captioning by hand is very time consuming and expensive, and SR offers the opportunity for cost-effective captioning if accuracy and error correction can be improved.

3. Although reading from a screen can be as efficient as reading from paper, especially if the user can manipulate display parameters, a linear screen display is not as easy to interact with as paper.
SR can be used to enhance learning from onscreen information by providing searchable synchronised text captions to assist navigation. 4. The facility to adjust speech replay speed while maintaining synchronisation with captions can reduce the time taken to listen to recordings and so help make learning more efficient. 5. The optimum display methods for hand held personal display systems may depend on the size of the screen and so user options should be available to cope with a variety of devices displaying captions. 6. Note-taking in lectures is a very difficult skill, particularly for normative speakers, and so using SR to assist students to take notes could be very helpful, especially if students are also able to highlight and annotate the transcribed text. 7. Improving the accuracy of the SR captions and developing faster editing methods is important because editing SR captions is currently difficult and slow. 8. Since existing readability measures involve measuring the length of error free sentences with punctuation, they cannot easily be used to measure the readability of unpunctuated SR captions and transcripts containing errors. 9. Further research is required to investigate in detail the importance of punctuation, segmentation, and errors on readability. CONCLUSION Research suggests that an ideal system to digitally record and replay multimedia content should automatically create an error free transcript of spoken language synchronised with audio, video, and any onscreen graphics and enable this to be displayed in the most appropriate way on different devices and with adjustable replay speed. Annotation would be available through pen or keyboard and mouse and be synchronised with the multimedia content. Continued research is needed to improve the accuracy and readability of SR captions before this vision of universal access to learning can become an everyday reality. [Reference] » View reference page with links References Abowd, G. (1999). 
Classroom 2000: An experiment with the instrumentation of a living educational environment. IBM Systems Journal, 38(4), 508-530.
Arons, B. (1991, December). Hyperspeech: Navigating in speech-only hypermedia. Proceedings of the 3rd Annual ACM Conference on Hypertext (pp. 133-146), San Antonio, TX. New York: Author.
Arons, B. (1997). SpeechSkimmer: A system for interactively skimming recorded speech. ACM Transactions on Computer-Human Interaction (TOCHI), 4(1), 3-38.
Association of Computing Machinery ([ACM], 1992). Retrieved March 10, 2006, from http://www.acm.org/constitution/code.html
Baecker, R. M., Wolf, P., & Rankin, K. (2004, October). The ePresence interactive webcasting system: Technology overview and current research issues. Proceedings of the World Conference on E-Learning in Government, Healthcare, & Higher Education 2004 (pp. 2396-3069), Washington, DC. Chesapeake, VA: Association for the Advancement of Computing in Education.
Bailey, B. (2000). Human interaction speeds. Retrieved December 8, 2005, from http://webusability.com/article_human_interaction_speeds_9_2000.htm
Bailey, B. (2002). Readability formulas and writing for the web. Retrieved December 8, 2005, from http://webusability.com/article_readability_formula_7_2002.htm
Bain, K., Basson, S. A., Faisman, A., & Kanevsky, D. (2005). Accessibility, transcription, and access everywhere. IBM Systems Journal, 44(3), 589-603. Retrieved December 12, 2005, from http://www.research.ibm.com/joumaysj/443/bain.pdf
Barbier, M. L., & Piolat, A. (2005, September). L1 and L2 cognitive effort of notetaking and writing. Paper presented at the SIG Writing Conference 2004, Geneva, Switzerland.
Boyarski, D., Neuwirth, C., Forlizzi, J., & Regli, S. H. (1998, April). A study of fonts designed for screen display. Proceedings of CHI '98 (pp. 87-94), Los Angeles, CA.
Brotherton, J. A., & Abowd, G. D. (2004). Lessons learned from eClass: Assessing automated capture and access in the classroom.
ACM Transactions on Computer-Human Interaction, 11(2), 121-155.
Carrol, J., & McLaughlin, K. (2005). Closed captioning in distance education. Journal of Computing Sciences in Colleges, 20(4), 183-189.
Carver, C. A., Howard, R. A., & Lane, W. D. (1999). Enhancing student learning through hypermedia courseware and incorporation of student learning styles. IEEE Transactions on Education, 42(1), 33-38.
Chiu, P., Kapuskar, A., Reitmeier, S., & Wilcox, L. (1999, October/November). NoteLook: Taking notes in meetings with digital video and ink. Proceedings of the Seventh ACM International Conference on Multimedia (Part 1, pp. 149-158), Orlando, FL.
Chiu, P., & Wilcox, L. (1998, November). A dynamic grouping technique for ink and audio notes. Proceedings of the 11th Annual ACM Symposium on User Interface Software and Technology (pp. 195-202), San Francisco, CA.
Clements, M., Robertson, S., & Miller, M. S. (2002). Phonetic searching applied to on-line distance learning modules. Retrieved December 8, 2005, from http://www.imtc.gatech.edu/news/multimedia/spe2002_paper.pdf
Coffield, F., Moseley, D., Hall, E., & Ecclestone, K. (2004). Learning styles and pedagogy in post-16 learning: A systematic and critical review. London: Learning and Skills Research Centre.
Digital Accessible Information SYstem ([DAISY], 2005). Retrieved December 8, 2005, from http://www.daisy.org
Dolphin (2005). Dolphin Publisher [Computer software]. Retrieved December 8, 2005, from http://www.dolphinaudiopublishing.com/
Dufour, C., Toms, E. G., Bartlett, J., Ferenbok, J., & Baecker, R. M. (2004, May). Exploring user interaction with digital videos. Paper presented at the Graphics Interface Conference, London, ON, Canada. Retrieved January 29, 2008, from http://kmdi.utoronto.ca/rmb/papers/E13.pdf
eTeach (2005). Retrieved December 8, 2005, from http://eteach.engr.wisc.edu/newEteach/home.html
Faraday, P., & Sutcliffe, A. (1997, March). Designing effective multimedia presentations.
Proceedings of CHI '97 (pp. 272-278), Atlanta, GA.
Hede, T., & Hede, A. (2002, July). Multimedia effects on learning: Design implications of an integrated model. In S. McNamara & E. Stacey (Eds.), Untangling the web: Establishing learning links: Proceedings of ASET Conference 2002, Melbourne, VIC, Australia. Retrieved December 8, 2005, from http://www.ascilite.org.au/aset-archives/confs/2002/hede-t.html
Hindus, D., & Schmandt, C. (1992, November). Ubiquitous audio: Capturing spontaneous collaboration. Proceedings of the ACM Conference on Computer Supported Co-Operative Work (pp. 210-217), Toronto, ON, Canada.
Holleran, P., & Bauersfeld, K. (1993, April). Vertical spacing of computer-presented text. CHI '93 Conference on Human Factors in Computing Systems (pp. 179-180), Amsterdam, The Netherlands.
Hornbaek, K., & Frøkjaer, E. (2003). Reading patterns and usability in visualizations of electronic documents. ACM Transactions on Computer-Human Interaction (TOCHI), 10(2), 119-149.
Howard-Spink, S. (2005). You just don't understand! Retrieved January 30, 2008, from http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/20020918_speech.html
IBM (2003). The superhuman speech recognition project. Retrieved December 12, 2005, from http://www.research.ibm.com/superhuman/superhuman.htm
IBM (2005). ViaScribe. Retrieved December 8, 2005, from http://www306.ibm.com/able/solution_offerings/ViaScribe.html
Jones, D., Wolf, R., Gibson, E., Williams, E., Fedorenko, F., Reynolds, D. A., et al. (2003, September). Measuring the readability of automatic speech-to-text transcripts. Proceedings of Eurospeech (pp. 1585-1588), Geneva, Switzerland.
Keegan, V. (2005). Read your mobile like an open book. The Guardian. Retrieved December 8, 2005, from http://technology.guardian.co.uk/weekly/story/0,16376,1660765,00.html
Klemmer, S. R., Graham, J., Wolff, G. J., & Landay, J. A. (2003, April). Books with voices: Paper transcripts as a physical interface to oral histories.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 89-96), Fort Lauderdale, FL.
Laarni, J. (2002, October). Searching for optimal methods of presenting dynamic text on different types of screens. Proceedings of the Second Nordic Conference on Human-Computer Interaction NordiCHI '02 (pp. 219-222), Aarhus, Denmark.
Lee, A. Y., & Bowers, A. N. (1997, September). The effect of multimedia components on learning. Proceedings of the Human Factors and Ergonomics Society (pp. 340-344), Albuquerque, NM.
Leitch, D., & MacMillan, T. (2003). Liberated Learning Initiative innovative technology and inclusion: Current issues and future directions for Liberated Learning research. Halifax, Nova Scotia, Canada: Saint Mary's University. Retrieved March 10, 2006, from http://www.liberatedlearning.com/
Liberated Learning Consortium (2006). Retrieved March 10, 2006, from www.liberatedlearning.com
Mayer, R. E., & Moreno, R. A. (2002). Cognitive theory of multimedia learning: Implications for design principles. Retrieved December 8, 2005, from http://www.unm.edu/~moreno/PDFS/chi.pdf
McWhirter, N. (Ed.). (1985). The Guinness book of world records (23rd US ed.). New York: Sterling Publishing.
Media Access Generator ([MAGpie], 2005). Retrieved December 8, 2005, from http://ncam.wgbh.org/webaccess/magpie/
Mills, C., & Weldon, L. (1987). Reading text from computer screens. ACM Computing Surveys, 19(4), 329-357.
Moreno, R., & Mayer, R. E. (2002). Visual presentations in multimedia learning: Conditions that overload visual working memory. Retrieved December 8, 2005, from http://www.unm.edu/~moreno/PDFS/Visual.pdf
Muter, P. (1996). Interface design and optimization of reading of continuous text. In H. van Oostendorp & S. de Mul (Eds.), Cognitive aspects of electronic text processing (pp. 161-180). Norwood, NJ: Ablex.
Muter, P., & Maurutto, P. (1991). Reading and skimming from computer screens and books: The paperless office revisited?
Behaviour & Information Technology, 10(4), 257-266.
Najjar, L. J. (1998). Principles of educational multimedia user interface design. Human Factors, 40(2), 311-323.
Narayanan, N. H., & Hegarty, M. (2002). Multimedia design for communication of dynamic information. International Journal of Human-Computer Studies, 57(4), 279-315.
Nuance (2005). Dragon AudioMining. Retrieved December 8, 2005, from http://www.nuance.com/audiomining
O'Hara, K., & Sellen, A. (1997, March). A comparison of reading paper and online documents. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 335-342), Atlanta, GA.
Olavsrud, T. (2002). IBM wants you to talk to your devices. Retrieved December 12, 2005, from http://www.internetnews.com/entnews/article.php/1004901
Piolat, A., Olive, T., & Kellogg, R. T. (2004). Cognitive effort during note taking. Applied Cognitive Psychology, 19(3), 291-312.
Piolat, A., Roussey, J. Y., & Thunin, O. (1997). Effects of screen presentation on text reading and revising. International Journal of Human-Computer Studies, 47(4), 565-589.
Rahman, T., & Muter, P. (1999). Designing an interface to optimize reading with small display windows. Human Factors, 41(1), 106-117.
Rankin, K., Baecker, R. M., & Wolf, P. (2004, November). ePresence: An open source interactive webcasting and archiving system for e-learning. Proceedings of E-Learn 2004 (pp. 2872-3069), Washington, DC. Chesapeake, VA: Association for the Advancement of Computing in Education.
Shneiderman, B. (2000). Universal usability. Communications of the ACM, 43(5), 85-91. Retrieved March 10, 2006, from http://www.cs.umd.edu/~ben/p84-shneiderman-May2000CACMf.pdf
Special Educational Needs and Disability Act ([SENDA], 2001). Retrieved December 12, 2005, from http://www.opsi.gov.uk/acts/acts2001/20010010.htm
Stifelman, L., Arons, B., & Schmandt, C. (2001, March/April). The audio notebook: Paper and pen interaction with structured speech.
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 182-189), Seattle, WA.
Synchronized Multimedia Integration Language ([SMIL], 2005). Retrieved December 8, 2005, from http://www.w3.org/AudioVideo/
Tegrity (2005). Tegrity pen. Retrieved December 8, 2005, from http://www.tegrity.com/
Tyre, P. (2005, November 28). Professor in your pocket. Newsweek.
Wald, M. (2005a, June). Personalised displays. Proceedings of Speech Technologies: Captioning, Transcription and Beyond Conference, New York: IBM T. J. Watson Research Center. Retrieved December 27, 2005, from http://www.liberatedleaming.com/resources/pdf/2005_avios.pdf
Wald, M. (2005b, June). SpeechText: Enhancing learning and teaching by using automatic speech recognition to create accessible synchronised multimedia. Proceedings of ED-MEDIA 2005 World Conference on Educational Multimedia, Hypermedia, & Telecommunications (pp. 4765-4769), Montreal, QC, Canada. Chesapeake, VA: Association for the Advancement of Computing in Education.
Wald, M. (2006). Creating accessible educational multimedia through editing automatic speech recognition captioning in real time. International Journal of Interactive Technology and Smart Education: Smarter Use of Technology in Education, 3(2), 131-142.
Web Accessibility Initiative ([WAI], 1999). Retrieved December 12, 2005, from http://www.w3.org/WAI
Whittaker, S., Hyland, P., & Wiley, M. (1994, April). Filochat: Handwritten notes provide access to recorded conversations. Proceedings of CHI '94 (pp. 271-277), Boston, MA.
Wilcox, L., Schilit, B., & Sawhney, N. (1997, March). Dynomite: A dynamically organized ink and audio notebook. Proceedings of CHI '97 (pp. 186-193), Atlanta, GA.

[Author Affiliation]
MIKE WALD
University of Southampton, UK
[email protected]