Interactive Visualisation Techniques for Dynamic Speech Transcription, Correction and Training

Saturnino Luz, Dept. of Computer Science, Trinity College Dublin, Ireland, [email protected]
Masood Masoodian, Dept. of Computer Science, The University of Waikato, New Zealand, [email protected]
Bill Rogers, Dept. of Computer Science, The University of Waikato, New Zealand, [email protected]

ABSTRACT

As performance gains in automatic speech recognition systems plateau, improvements to existing applications of speech recognition technology seem more likely to come from better user interface design than from further progress in core recognition components. Among all applications of speech recognition, the usability of systems for transcription of spontaneous speech is particularly sensitive to high word error rates. This paper presents a series of approaches to improving the usability of such applications. We propose new mechanisms for error correction, use of contextual information, and use of 3D visualisation techniques to improve user interaction with a recogniser and maximise the impact of user feedback. These proposals are illustrated through several prototypes which target tasks such as off-line transcript editing, dynamic transcript editing, and real-time visualisation of recognition paths. An evaluation of our dynamic transcript editing system demonstrates the gains that can be made by adding the corrected words to the recogniser's dictionary and then propagating the user's corrections.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems; H.5.2 [User Interfaces]: Natural Language

General Terms

Human Factors

Keywords

Automatic Speech Transcription, Error-correction, Speech Recogniser Training, Semi-automatic Speech Transcription

1. INTRODUCTION

Transcription is one of the major applications of automatic speech recognition, as evidenced by the existence of workshops and evaluation tasks specialised in this topic [21, 16]. Transcription systems are wide and varied in their uses, which include dictation [23] as well as audio [8, 7], multimedia [6, 25, 18] and meeting indexing applications [24, 22]. However, since even the most sophisticated automatic speech recognition systems are far from perfect, the transcriptions they produce often contain errors. The accuracy of automatically generated transcripts depends on a number of factors, such as the quality of the speech audio (recordings or live speech), the level of prior training of the recogniser for a particular speaker, etc. Although under ideal circumstances the accuracy can be in the 85% to 95% range (or even 99% for a single speaker who speaks clearly, uses only words that are in the system's vocabulary, and on whose voice the system has been individually trained), in other cases error rates can be as high as 60%.

Regardless of the initial accuracy levels achieved through automatic speech recognition, when 100% accuracy is required, manual correction of transcripts by human operators becomes necessary [10, 2]. Depending on the quality of the automatically generated transcripts, the task of making corrections by a human transcriber can be rather time consuming [19, 11].
For instance, when the original transcripts contain something like 50% errors, the manual correction task is almost as time consuming as transcribing the entire speech from scratch, requiring several iterations of listening to the speech audio and making corrections. The task of generating transcripts from live speech in real-time is even more complicated. In such cases automatic speech recognisers are even less accurate than when working on recorded speech, where transcription speed is not an issue. Real-time speech transcription is therefore generally carried out by highly trained human transcribers, which is costly and therefore limited in application.

These examples of the problems associated with automatic as well as manual transcription of speech identify a clear need for systems that combine the best aspects of automatic speech transcription (i.e. speed) with those of manual transcription (i.e. accuracy) [15]. Such semi-automatic systems, however, need to minimise the human intervention while maximising the speed and accuracy of the generated transcripts.

In this paper we review a range of solutions to the problem of balancing the need for speed and accuracy in semi-automatic speech transcription, through the development of interactive visual techniques which aim to assist human transcribers in intervening in the process of speech recognition and transcription. Several interactive prototype systems are described which demonstrate the use of visual techniques in a range of tasks, including correcting transcripts of recorded speech, dynamic correction of speech transcripts, and dynamic training of speech recognition systems.

2. HUMAN INTERVENTION IN SPEECH RECOGNITION

There are many cases, such as in dictation, where it is crucial to generate completely accurate transcripts of speech. In such cases transcripts of the recordings can be produced manually by human operators. However, as mentioned above, this process can be slow and expensive. Similarly, automatic speech transcription systems, with their inherent inaccuracies, are less than ideal for this type of task. Clearly a better solution is to use a semi-automatic approach in which an automatic speech recognition system is used to generate transcripts (with possible errors), which are then manually corrected by a human operator. The success of this mode of operation depends largely on providing the human operator with visual tools that allow easy identification of the errors made by the automatic speech recogniser, and then allow error correction with minimal effort on the part of the operator.

Error correction mechanisms have been studied in the human-factors literature mainly in connection with dictation and dialogue systems [20, 11, 1, 9]. In general, these error correction facilities can be divided into two groups:

1. interactive visual facilities for locating and correcting errors, without involving the speech recogniser;

2. facilities for automatic propagation of manual corrections made by the operator, through iterative involvement of the speech recogniser.
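To make the distinction concrete, the following minimal sketch (hypothetical Python; all names are illustrative and not taken from any of the prototypes described below) contrasts the two groups: a group 1 editor only rewrites the transcript text, while a group 2 editor also hands every correction back to the recogniser through a caller-supplied feedback hook so that it can influence the rest of the transcription.

```python
# Hypothetical sketch of the two groups of correction facilities.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Word:
    text: str       # recognised (possibly wrong) word
    start_ms: int   # start of the aligned audio segment
    end_ms: int     # end of the aligned audio segment


class Type1Editor:
    """Group 1: locate and correct errors without involving the recogniser again."""

    def __init__(self, words: List[Word]):
        self.words = words

    def correct(self, index: int, new_text: str) -> None:
        self.words[index].text = new_text


class Type2Editor(Type1Editor):
    """Group 2: additionally propagate each correction via a feedback hook
    (e.g. one that adds the word to the vocabulary and re-runs recognition)."""

    def __init__(self, words: List[Word],
                 feedback: Callable[[Word, str], None]):
        super().__init__(words)
        self.feedback = feedback  # hypothetical hook into the recogniser

    def correct(self, index: int, new_text: str) -> None:
        self.feedback(self.words[index], new_text)  # let the recogniser learn from it
        super().correct(index, new_text)
```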
The type of facilities provided by semi-automatic systems also depends on the granularity, or the level, at which human interventions (i.e. corrections) are made possible. This in turn depends on the application area of transcription, which can either be real-time transcription (e.g. word by word or sentence by sentence) or time-delayed transcription (e.g. paragraph by paragraph, or an entire recorded session). Interactive visual facilities of type 1, which only allow locating and correcting errors on automatically generated transcripts, without further involvement of the speech recogniser, only need to work at the level of an entire recorded session. The scenario of work in this case is to use the recogniser to transcribe the entire session and then allow the user to make corrections in any order, although a sequential mode of making corrections from beginning to end is more likely to be adopted. Error correction facilities of type 2, on the other hand, which allow propagation of manual corrections through further iterative involvement of the speech recogniser, need to work at the word, sentence, or paragraph level. The scenario of work in these situations is to use the recogniser to start transcribing speech input (either live or recorded), and then, depending on the level of granularity, provide appropriate means of intervention (e.g. making manual corrections or selecting from a list of suggestions). Then, once some user interventions have been made, the system can use those single or multiple interventions to guide the recogniser through the remainder of the transcription process.

Our research has focused on designing interactive visualisation techniques which provide these types of facilities to the users. We have developed a range of prototype systems incorporating these techniques, with the aim of making the task of semi-automatic transcription as efficient as possible in different modes of operation.

3. LOCATING AND CORRECTING ERRORS

The first prototype system we developed was specifically designed to allow users to make corrections to transcripts generated using Microsoft's Speech Application Programming Interface (SAPI, http://research.microsoft.com/research/srg/sapi.aspx). The user interface of this system, called TRAED (TRAnscript EDitor), is shown in Figure 1.

Figure 1: TRAED speech transcript editor.

This system, which is more fully described elsewhere [14], displays the audio speech signal in waveform along with its transcription, line by line. Boxes around each recognised word serve to delimit the associated speech signal. The waveform signal for each word is displayed at a fixed timescale, and the associated words are centred under the waveforms.
The chosen scale is such that the waveform is usually wider than the word; in cases where the text is wider, the left and right borders of the waveform display area are widened, leaving the wave itself correctly scaled. This simple technique for associating the transcribed text with its time-aligned speech signal allows users to easily identify mis-recognised words and, if in doubt, play back the associated speech audio, without having to search through the audio recording using the more common, and time consuming, fast-forward and rewind method. Another obvious advantage of the time-alignment of transcribed text and audio signal is that areas of silence, gaps between words, etc., which have no transcription, can easily be seen, as unrecognised audio segments are represented by single space characters (see the spaces between some of the recognised words in Figure 2).

Figure 2: Audio segments without transcription.

TRAED also provides a simple mechanism for dealing with one of the most common types of errors made by speech recognisers: failing to identify word boundaries correctly. This problem results in a long word being broken down into several smaller words, or alternatively (though less commonly) in smaller words being joined up into a long mis-recognised word. Sometimes speech recognisers also mis-align the boundaries of correctly recognised words, in which case being able to align the boundaries manually becomes necessary. This is especially true of tasks such as the creation of speech corpora [2], in which getting the alignment between transcript and audio waveform (or spectrogram) right is essential. In TRAED, adjusting word boundaries can be carried out using simple mouse operations (e.g. clicking and dragging).

Although TRAED allows easy correction of transcript errors, it does not provide any clues as to why the speech recogniser used to generate the transcripts has made those errors. Modern automatic speech recognition systems generally use Hidden Markov Models [17] to compare statistical probabilities associated with different words in their generated "word lattice" and select a word from a set of likely possibilities. However, after a word has been selected (which may actually be incorrect), the other possibilities (which may include the correct word) are simply ignored.

TRAED also only displays the final transcript (correct or incorrect) returned by its recogniser. The human transcriber using TRAED does not get to see the set of possible words from the word lattice, and can only manually correct the mistakes made by the recogniser by typing in the corrections and, when necessary, changing the word boundaries. Since Microsoft's SAPI does not provide access to its generated word lattice, we developed another prototype system, called TRAENX, to provide this word selection mechanism to users. TRAENX uses the open-source Sphinx-4 speech recognition engine (http://cmusphinx.sf.net/), which allows access to intermediate word lattices as well as greater control over the recognition process.

Figure 3 shows the user interface of TRAENX. Although TRAENX is visually similar to TRAED and provides similar functionality, there is one crucial difference between the two prototypes. Unlike TRAED, TRAENX clearly shows which words in the transcription were selected from a set of possible words by the speech recogniser. These words are shown as buttons in the transcript, and the user can click on them to see what the other possibilities were. If a different option is the correct word, the user can select it instead of the incorrect (automatically chosen) alternative. This is illustrated in Figure 4.

Figure 3: TRAENX speech transcript editor.

Figure 4: Word selections in TRAENX.
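As an illustration of the kind of structure such word buttons could be driven by, the sketch below (hypothetical names; this is neither the Sphinx-4 API nor TRAENX's actual code) keeps the competing lattice hypotheses alongside the chosen word, so that the user's selection simply swaps the displayed word with one of its alternatives.

```python
# Hypothetical sketch of a transcript word with selectable lattice alternatives.
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordHypothesis:
    text: str
    score: float  # recogniser confidence / lattice score


@dataclass
class TranscribedWord:
    chosen: WordHypothesis
    alternatives: List[WordHypothesis] = field(default_factory=list)

    def options(self) -> List[WordHypothesis]:
        """What the button's pop-up list would show, best-scoring first."""
        return sorted([self.chosen, *self.alternatives],
                      key=lambda h: h.score, reverse=True)

    def select(self, text: str) -> None:
        """User picked a different alternative; make it the displayed word."""
        for hyp in self.alternatives:
            if hyp.text == text:
                self.alternatives.remove(hyp)
                self.alternatives.append(self.chosen)
                self.chosen = hyp
                return
        # typed-in correction that was not in the lattice at all
        self.alternatives.append(self.chosen)
        self.chosen = WordHypothesis(text, score=1.0)


# Example: the user clicks the button and picks the second-best hypothesis.
word = TranscribedWord(WordHypothesis("recognise", 0.61),
                       [WordHypothesis("recognised", 0.58),
                        WordHypothesis("recondite", 0.12)])
word.select("recognised")
```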
4. PROPAGATING USER CORRECTIONS

Despite the fact that both TRAED and TRAENX provide useful transcript error correction facilities to human operators, their functionality is limited to the initial transcription results returned by their backend speech recogniser, without providing any functionality for dynamic interaction between the operator and the speech recogniser. Although the "post-transcription correction" mode of operation implemented by TRAED and TRAENX might be sufficient in some use cases, further assistance can be provided to users through dynamic interaction with the speech recogniser.

In many speech transcription cases (e.g. transcription of broadcast news) there are often phrases or nouns that are repeated throughout a particular segment of speech being transcribed. If, for example, these repeated nouns are missing from the recogniser's vocabulary (as is often the case with proper nouns), or if the system has not been trained on that particular speaker (also very common in cases such as broadcast news), then it is likely that the speech recogniser will transcribe those nouns or phrases incorrectly repeatedly throughout the transcription. In such cases, TRAED or TRAENX will assist the user with the task of finding and correcting those transcription errors, but neither of these systems has any mechanism for learning from the initial corrections that the user makes so that other repeated cases can be fixed automatically. Furthermore, since it is likely that the speech recogniser would have mis-recognised the same noun differently (albeit still incorrectly) at different points within the segment, an automatic find-and-replace functionality will have very little value in most cases. Therefore, more advanced techniques are needed to allow the system to learn from the corrections that the user makes, and to propagate the results of the user's interventions as much as possible by guiding the backend speech recogniser dynamically throughout the entire transcription process in an iterative manner. These advanced dynamic correction techniques can be divided into three groups:

1. making word additions to the speech recogniser's vocabulary,

2. training the speech recogniser,

3. guiding the selection of alternatives from the word lattice.

We have developed two prototype systems to demonstrate how each of these techniques could be used to provide for dynamic user intervention in semi-automatic transcription tasks. The user interface of the first of these two prototypes, called TRAEDUP, is shown in Figure 5. TRAEDUP is based on TRAED and has been more fully described elsewhere [13].

Figure 5: TRAEDUP speech transcript editor.

Although TRAEDUP is visually similar to TRAED, their modes of operation are rather different. TRAED utilises its backend speech recogniser to produce a complete transcription of its input speech data, which is then shown to the user so that manual corrections can be made. TRAEDUP also utilises Microsoft SAPI as its backend speech recogniser.
However, unlike TRAED, its mode of operation is iterative and can be summarised as follows:

1. the recogniser is provided with the speech input to be transcribed;

2. the entire input is transcribed (probably with errors) and presented to the user;

3. the user starts making manual corrections from the beginning, producing transcripts that will be (substantially) correct;

4. corrections to words that are missing from the system's vocabulary are dynamically added to the vocabulary;

5. corrections of mistakes other than out-of-vocabulary words are used for training the speech recogniser's speaker model;

6. the system re-transcribes the speech related to the parts that have not yet been corrected by the user;

7. steps 3-6 are repeated until no more corrections are needed.

The advanced transcription correction facilities of TRAEDUP come under the first two categories mentioned above (i.e. additions to the recogniser's vocabulary and training). However, once again, since TRAEDUP is based on TRAED and utilises a proprietary API, we were unable to access the recogniser's internal working components during transcription, including the word lattice it generates. Therefore, to demonstrate the third category of more advanced dynamic correction techniques, we developed another prototype using the previously mentioned Sphinx-4 open-source speech recognition engine.

Figure 6 shows the interface of this prototype, called DYTRAED (DYnamic TRAnscript EDitor). DYTRAED is radically different from all the other systems described here, and is the first system to actually use the word lattice produced by its backend speech recogniser in real-time in order to propagate the results of user interventions (e.g. selections and corrections) through the transcript.

Figure 6: DYTRAED speech transcript editor.

Also, unlike the previous systems, which present a linear textual representation of transcripts along with their associated speech signal in waveform, DYTRAED employs a directed graph-based 3D visualisation technique to depict the recognition alternatives as generated by the speech recogniser in real-time. As the technical details of DYTRAED and its implementation are described elsewhere [12], it is sufficient here to mention how it is operated by a human transcriber.

When DYTRAED receives its speech audio signal, either from a live speaker or a recorded file, it starts transcribing its input using the speech recogniser engine, which in turn generates the word lattice as part of its recognition process. DYTRAED receives these recognition alternatives from its recogniser and displays them to the user in the form of a directed graph. The partial recognition alternatives with the greatest probability scores are highlighted and connected with red lines, the lower-score alternatives are dimmed, and the alternatives undergoing active search are highlighted and connected with yellow lines. The user is able to pause, speed up, or slow down the animation; click on a word to select and accept the entire sentence being transcribed; or simply ignore the display and continue to watch the recogniser's preferred transcription being generated.
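As a rough indication of what this colour coding involves, the following sketch (hypothetical structures, not DYTRAED's implementation) classifies lattice edges into the three display styles the renderer would then draw.

```python
# Hypothetical sketch: map word-lattice edges to the display styles described above.
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class LatticeEdge:
    from_node: int
    to_node: int
    word: str
    score: float


def classify_edges(edges: List[LatticeEdge],
                   best_path: Set[Tuple[int, int]],
                   active: Set[Tuple[int, int]]) -> Dict[LatticeEdge, str]:
    """Return a display style for each edge of the recognition graph."""
    styles: Dict[LatticeEdge, str] = {}
    for e in edges:
        key = (e.from_node, e.to_node)
        if key in best_path:
            styles[e] = "red"      # highest-scoring partial hypothesis
        elif key in active:
            styles[e] = "yellow"   # alternatives still under active search
        else:
            styles[e] = "dimmed"   # lower-scoring alternatives
    return styles
```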
When the transcription of a given sentence is completed (either by the operator or the system) it moves to the background, and is then gradually pushed towards the horizon by sentences completed afterwards, until it disappears completely from the screen.

5. EVALUATION

Although the tools and techniques described in this paper seem to be valuable in providing mechanisms for assisting users with the tasks of semi-automatic speech transcription, it is clearly important to evaluate them to make sure that they are indeed as efficient and usable as possible. We have started the process of conducting formal evaluations of these tools, in terms of their usability as well as their efficiency, accuracy, etc. A brief test of the TRAEDUP system [13] showed promising results, and allowed us to design specific tests to evaluate the effects of dynamic word addition and recogniser training.

So far we have carried out a further test to see whether dynamic word additions make a sufficient difference to the accuracy of TRAEDUP's back-end speech recogniser. For this test we focused on the fact that one of the main difficulties speech recognisers often face is recognising proper nouns, which tend to occur regularly in cases such as broadcast news. The problem with proper nouns is that they are either missing from the recognisers' vocabulary, or are pronounced differently by different people.

Six audio files were used in this experiment. They were from the CMU ARCTIC databases (http://festvox.org/cmu_arctic/) and downloaded from the VoxForge corpus collection (http://www.voxforge.org/), which includes the speech audio recordings and their correct transcriptions. The speech files we chose were by two US female speakers, two US male speakers, a Canadian male speaker, and a Scottish male speaker. Each speech file contained 1132 sentences, consisting of around 10000 words. We chose five proper nouns (Philip, Saxon, MacDougall, Gregson, Jeanne) which together occurred 76 times in each of the recordings (0.8% of the content). We used one instance of each of these nouns from one of the US female speakers' recordings to train the system. In the case of Gregson and Jeanne, this also added them to the recogniser's vocabulary.

Table 1 shows the results of the word addition and training test. The results clearly indicate the benefits of adding words and training on even a single instance of each word. In fact the results are quite impressive considering that addition and training of the recogniser on 5 words improves their recognition across the different speakers on average from 37% to 89%.

Table 1: The results of the word addition and training test

Speakers        Recognition   Philip  Saxon  MacDougall  Gregson  Jeanne  Total
Occurrences                       30      5           5       18      18     76
US Female 1     before            30      2           3        0       0     35
                after             30      4           3       15      14     66
US Female 2     before            28      1           2        0       0     31
                after             30      5           5       18      14     72
US Male 1       before            22      1           4        0       0     27
                after             27      4           5       15      17     68
US Male 2       before            19      1           2        0       0     22
                after             23      5           3       18      16     65
Canadian Male   before            19      1           3        0       0     23
                after             22      5           5       18      16     66
Scottish Male   before            28      1           3        0       0     32
                after             30      5           4       18      14     71

Average recognition: before 28 out of 76; after 68 out of 76.
Recognition rate: before 37%; after 89%.
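As a quick sanity check, the reported rates can be recomputed from the per-speaker totals in Table 1; the snippet below simply transcribes those totals and is not part of the evaluation code itself.

```python
# Back-of-the-envelope check of the before/after recognition rates in Table 1.
occurrences = {"Philip": 30, "Saxon": 5, "MacDougall": 5, "Gregson": 18, "Jeanne": 18}

# per-speaker correct recognitions (before, after), summed over the five nouns
totals = {
    "US Female 1":   (35, 66),
    "US Female 2":   (31, 72),
    "US Male 1":     (27, 68),
    "US Male 2":     (22, 65),
    "Canadian Male": (23, 66),
    "Scottish Male": (32, 71),
}

n = sum(occurrences.values())                                # 76 occurrences per recording
before = sum(b for b, _ in totals.values()) / len(totals)    # ~28.3 correct on average
after = sum(a for _, a in totals.values()) / len(totals)     # 68.0 correct on average
print(f"before: {before / n:.0%}, after: {after / n:.0%}")   # -> before: 37%, after: 89%
```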
Indeed, if we exclude the word "Philip" in this sample case, the average recognition accuracy for the other four words increases from 9% to nearly 89%. Further investigation of the errors remaining after word addition and training in our test shows that we could achieve 100% accuracy by adding another instance of each noun which includes the possessive "'s", for example "Gregson's", "Jeanne's", etc.

6. OTHER TECHNIQUES

The techniques described in this paper have focused on error correction through manual intervention. However, in certain scenarios it is possible to provide more assistance to operators of semi-automatic speech transcription tasks through techniques that automatically intervene in the speech recognition process by analysing the context of speech through other modalities. Such techniques can sometimes be used to improve the accuracy of speech transcription to a level that is suitable for applications where 100% accuracy is not critically required, and therefore remove the need for human intervention altogether. That is typically the case for applications that use transcripts to aid user navigation and search of time-based data such as multimedia recordings of meetings. "Meeting browsers" [4], for instance, can benefit from transcribed speech even if the transcription is not perfect. On the other hand, records of textual data produced at meetings can be used to guide speech recognition results. The MeetingMiner [5, 3], for instance, is a system that uses time-stamped textual records to improve speech recognition results. It does so by constraining recognition alternatives, re-scoring word lattices by boosting the scores of words that occur in text segments in the "temporal neighbourhood" of the sentence being transcribed, and adding out-of-vocabulary items to the recogniser's vocabulary through text analysis. A detailed evaluation of an information retrieval task supported by speech recognition alone versus the same task supported by combined speech recognition, temporal information and text analysis has shown that increases of up to 43% in precision can be obtained even with baseline (speech-recognition only) transcripts that have a mean error rate as high as 60.2% [3].

7. CONCLUSIONS

In this paper we have described a range of interactive visualisation techniques developed to assist human operators with the process of correcting transcription errors in semi-automatic transcription applications. These techniques have been implemented in several prototype systems which demonstrate their effectiveness in assisting users. Due to the limitations of the proprietary automatic speech recogniser we chose initially, we were unable to implement all the techniques described in this paper in a single prototype system. We have now moved to using the open-source Sphinx speech recognition engine, which has given us more capabilities. Our aim is to port our initial prototypes and combine them all into a single system, giving users the option of using the techniques that best suit their needs. Indeed, in some cases users may utilise a combination of facilities sequentially to minimise their effort in creating correct transcriptions. For instance, a user may wish to rely on the techniques provided by DYTRAED to transcribe live speech, which may result in missing some transcription errors due to the real-time limitations of the task.
Once the live speech session is over, the user may then utilise the facilities provided by TRAEDUP and TRAENX to correct any errors missed during the live transcription session. Of course, while the user is carrying out these correction activities, the system would be utilising the user's input to train its speaker models as well as to update its vocabulary.

8. ACKNOWLEDGEMENTS

The work of Saturnino Luz is partially supported by Science Foundation Ireland through a CSET (CNGL) grant.

9. REFERENCES

[1] W. A. Ainsworth and S. R. Pratt. Feedback strategies for error correction in speech recognition systems. International Journal of Man-Machine Studies, 36(6):833–842, June 1992.
[2] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman. Transcriber: development and use of a tool for assisting speech corpora production. Speech Communication, 33(1-2):5–22, 2001.
[3] M.-M. Bouamrane. Interaction-based Information Retrieval in Multimodal, Online, Artefact-Focused Meeting Recordings. PhD thesis, Trinity College, Dept of Computer Science, 2007.
[4] M.-M. Bouamrane and S. Luz. Meeting browsing. Multimedia Systems, 12(4–5):439–457, 2007.
[5] M.-M. Bouamrane, S. Luz, and M. Masoodian. History based visual mining of semi-structured audio and text. In Proceedings of the 12th International Multi-media Modelling Conference, MMM06, pages 360–363, Beijing, China, Jan. 2006. IEEE Press.
[6] L. Deng, Y. Wang, K. Wang, A. Acero, H. Hon, J. Droppo, C. Boulis, M. Mahajan, and X. D. Huang. Speech and language processing for multimodal human-computer interaction. Journal of VLSI Signal Processing Systems, 36(2/3):161–187, 2004.
[7] J. Foote. An overview of audio information retrieval. Multimedia Systems, 7(1):2–10, 1999.
[8] J. Goldman, S. Renals, S. Bird, F. de Jong, M. Federico, C. Fleischhauer, M. Kornbluh, L. Lamel, D. Oard, C. Stewart, and R. Wright. Accessing the spoken word. International Journal of Digital Libraries, 5(4):287–298, 2005.
[9] C. A. Halverson, D. B. Horn, C.-M. Karat, and J. Karat. The beauty of errors: Patterns of error correction in desktop speech systems. In Proceedings of INTERACT'99: Human-Computer Interaction, pages 133–140, 1999.
[10] T. Hazen. Automatic alignment and error correction of human generated transcripts for long speech recordings. In Proceedings of Interspeech'06, pages 1606–1609, Pittsburgh, Pennsylvania, 2006.
[11] C.-M. Karat, C. Halverson, D. Horn, and J. Karat. Patterns of entry and correction in large vocabulary continuous speech recognition systems. In CHI '99: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 568–575. ACM Press, 1999.
[12] S. Luz, M. Masoodian, B. Rogers, and B. Zhang. A system for dynamic 3D visualisation of speech recognition paths. In Proceedings of Advanced Visual Interfaces, AVI'08, pages 482–483. ACM Press, 2008.
[13] M. Masoodian, B. Rogers, and S. Luz. Improving automatic speech transcription for multimedia content. In P. Isaías and M. B. Nunes, editors, Proceedings of WWW/Internet '07, pages 145–152, Vila Real, 2007.
[14] M. Masoodian, B. Rogers, D. Ware, and S. McKoy. TRAED: Speech audio editing using imperfect transcripts. In Proceedings of the 12th International Conference on Multi-Media Modeling (MMM 2006), pages 454–259, Beijing, China, 2006. IEEE Computer Society.
[15] H. Nanjo and T. Kawahara. Towards an efficient archive of spontaneous speech: Design of computer-assisted speech transcription system. The Journal of the Acoustical Society of America, 120:3042, 2006.
[16] NIST Automatic Meeting Transcription, Data Collection and Annotation Workshop, 2001.
[17] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[18] J. Robertson, W. Y. Wong, C. Chung, and D. K. Kim. Automatic speech recognition for generalised time based media retrieval and indexing. In Proceedings of the Sixth ACM International Conference on Multimedia, MULTIMEDIA '98, pages 241–246, New York, NY, US, 1998. ACM Press.
[19] A. Sears, C. Karat, K. Oseitutu, A. Karimullah, and J. Feng. Productivity, satisfaction, and interaction strategies of individuals with spinal cord injuries and traditional users interacting with speech recognition software. Universal Access in the Information Society, 1(1):4–15, 2001.
[20] B. Suhm, B. Myers, and A. Waibel. Multimodal error correction for speech user interfaces. ACM Transactions on Computer-Human Interaction, 8(1):60–98, 2001.
[21] University of Maryland. Proceedings of the 2000 Speech Transcription Workshop. NIST, 2000.
[22] A. Waibel, M. Brett, F. Metze, K. Ries, T. Schaaf, T. Schultz, H. Soltau, H. Yu, and K. Zechner. Advances in automatic meeting record creation and access. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 597–600. IEEE Press, 2001.
[23] A. Waibel and K. Lee. Readings in Speech Recognition. Morgan Kaufmann, 1990.
[24] P. Wellner, M. Flynn, and M. Guillemot. Browsing recorded meetings with Ferret. In S. Bengio and H. Bourlard, editors, Proceedings of Machine Learning for Multimodal Interaction: First International Workshop, MLMI 2004, volume 3361, pages 12–21, Martigny, Switzerland, June 2004. Springer-Verlag GmbH.
[25] S. Whittaker, J. Hirschberg, B. Amento, L. Stark, M. Bacchiani, P. Isenhour, L. Stead, G. Zamchick, and A. Rosenberg. Scanmail: a voicemail interface that makes speech browsable, readable and searchable. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '02, pages 275–282, New York, NY, US, 2002. ACM Press.