White Paper
Phonetic-Based Dialogue Search:
The Key to Unlocking an Archive’s Potential
A Whitepaper by Jacob Garland, Colin Blake, Mark Finlay and Drew Lanham
Nexidia, Inc., Atlanta, GA
People who create, own, distribute, or consume media need simple, accurate, cost-effective
tools for discovering it, especially if they must sift through years’ worth of media files or need
immediate access to their media for production. After all, if they can’t find it, they can’t
manage or monetize it.
Production and media asset management systems, generically referred to as “MAMs” in this paper, hold file-based metadata such as the date the footage was shot, but often there is not much descriptive metadata about the content, making it difficult to find an asset quickly. Manual logging and transcription are not only time-consuming and prohibitively expensive, but they also yield limited detail and can introduce potentially costly delays while the media is not yet available to be searched. Image recognition reveals who’s pictured, but not what they’re saying. Speech-to-text has had insufficient performance and accuracy to be useful even on the clearest speech.
Dialogue, on the other hand, is present in almost all program types and often provides a more detailed, precise description of content than any other metadata. The automated, phonetic-based search of dialogue is accurate, extremely fast, and affordable, and it can integrate with existing MAM systems and editing applications. It also applies broadly to any industry that creates, owns, or distributes content.
This paper will discuss the technology behind the phonetic-based search of dialogue used in products like the patented Nexidia
Dialogue Search, Avid PhraseFind, and Boris Soundbite, and how it can change the way content owners, creators, and aggregators
discover, use, and monetize their media assets.
Introduction
For media creators and owners, the amount of digital media in their libraries never stops growing as they continue to produce
new digital content every day. Those that have also progressed to the point of digitizing their legacy audio and video assets —
sometimes decades’ worth — can easily be dealing with hundreds of thousands of hours of content. Content owners who don’t
organize and manage their data are not getting the maximum value out of their investment in their archived media, despite
spending significant sums of money to manage and store it.
Whether the aim is to reach new audiences, serve second (or third) screens, create completely new programming, prove compliance,
license their content, or some combination of those, the ultimate goal for many content owners is to monetize the assets that
would otherwise languish in their media libraries. To do so, many content owners have chosen to use a media asset management
(MAM) system. There are many MAM systems to choose from, spanning a range of price points and feature sets, and the process of choosing the right one can require a lot of research and consideration. Once the MAM application is in place, it’s tempting to think that the process is over and that discovery problems are solved, but for most content owners, the MAM is just the beginning.
A good MAM system is a critical part of any file-based media operation, to be sure, but its search capability is only as good as
the metadata that goes into it. Without rich, descriptive metadata, a MAM system may not meet the expectations of the content
owner or justify the expense of the MAM, storage, and the other systems required to utilize it.
Different Approaches to Search
Assets usually get ingested into a MAM system along with basic information such as filename, file type, date, timecode, and
duration information. Metadata might also include show/project name, summary, and relevant keywords. Unfortunately for
many content owners, that’s where the metadata stops and the search problems begin.
When it comes to metadata, the more you have, usually the better your chances are of finding exactly what you’re looking for —
but it’s hard to find what you’re looking for based on simple file attributes alone. It takes additional metadata that describes the
content within a given media file, which usually must be entered manually using some type of logging application. It’s a very
laborious process that requires watching the video and making notes about it in the logging application. The logging process is
often two to four times slower than real time, so most media operations simply don’t have the resources required to do it regularly
and thoroughly. Simply logging the content (live or during ingest) might cost upwards of $80 per hour of video, while more
involved transcription costs can be as much as $160 per hour of material if timing information is included. Further, it can take
days before that material is then available for searching. Also, when no timing information is available, there is no synchronized
link between the search results in the text document and the media. The result: the files in the MAM system often don’t contain
enough descriptive metadata for the MAM to be useful. And so valuable media assets sit unused and, potentially, forgotten.
Another search method is caption-based search, but this is challenging for a number of reasons. First, very few MAMs are able
to use captions to inform search. Also, typically only content that has already been aired actually contains captions and, for
those, the captions are rarely verbatim and frequently have misspellings that limit effective search. In addition, many broadcast
processes such as encoding and decoding can “break” the captions, so that the caption data is actually lost when new versions
of the media are created. Finally, if the captions are embedded, the time required to extract the captions can be prohibitive.
An additional category is speech-to-text-based applications. To address these properly, it is necessary to look a little deeper into how they work.
Speech-to-Text
Retrieval of information from audio and speech has been a goal of many researchers over the past 20 years. The simplest approach
is to apply Large Vocabulary Continuous Speech Recognition (LVCSR), perform time alignment, and produce an index of text
content along with time stamps. Much of the improved performance demonstrated in current LVCSR systems comes from better
linguistic modeling [1] to eliminate sequences of words that are not allowed within the language.
In the LVCSR approach, the recognizer tries to transcribe all input speech as a chain of words in its vocabulary. Keyword spotting is another technique for searching audio for specific words and phrases, in which the recognizer is concerned only with occurrences of one keyword or phrase. Since only the score of the single word must be computed (instead of scores for the entire vocabulary), much less computation is required [2, 3].
Another advantage of keyword spotting is the potential for an open, constantly changing vocabulary at search time, making
this technique useful in archive retrieval but not so ideal for real-time execution. When searching through tens or hundreds of
thousands of hours of archived audio data, scanning must be executed many thousands of times faster than real time in order
to be practical.
A new class of keyword spotters has been developed that performs separate indexing and searching stages. In doing so, search
speeds that are several thousand times faster than real time have been successfully achieved. However, many of the same
limitations regarding vocabulary still apply for this approach.
Introducing Phonetic Search
Another approach is phonetic searching, illustrated in Figure 1. This high-speed algorithm comprises two phases: indexing and searching. The first phase indexes the input speech to produce a phonetic search track and is performed only once. The second phase, performed whenever a search for a word or phrase is initiated, consists of searching the phonetic search track. Once
the indexing is completed, this search stage can be repeated for any number of queries. Since the search is phonetic, search
queries do not need to be in any pre-defined dictionary — thus allowing searches for proper names, new words, misspelled
words, and jargon. Note that once indexing has been completed, the original media files are not involved at all during searching.
This means the search track can be generated from the highest-quality media available for the highest accuracy, while the audio can then be replaced by a compressed representation for storage and subsequent playback.
Figure 1: Index and Search Architecture
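The two-phase architecture can be pictured as an index-once, search-many workflow. The sketch below is a hypothetical Python illustration; the class names, method signatures, and file extension are assumptions and do not correspond to any actual Nexidia API.

class PhoneticIndexer:
    """Indexing phase: run once per media file."""

    def __init__(self, language_pack):
        self.language_pack = language_pack  # e.g., a North American English broadcast pack

    def index(self, media_path):
        # Decode to PCM, score phoneme likelihoods with the acoustic model,
        # and write a phonetic search track (PAT) alongside the media.
        pat_path = media_path + ".pat"
        return pat_path


class PhoneticSearcher:
    """Searching phase: run as often as needed against existing PAT files."""

    def search(self, pat_paths, query):
        # Probabilistic phoneme-sequence matching over each PAT file;
        # the original media is never touched at search time.
        results = []  # would hold (pat_file, start, end, confidence) tuples
        return results


indexer = PhoneticIndexer(language_pack="north_american_english_broadcast")
pat = indexer.index("interview_2013.mxf")                 # indexing: performed only once
hits = PhoneticSearcher().search([pat], "Supreme Court")  # searching: repeatable for any query
print(hits)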
Indexing
The indexing phase begins with decoding the input media into a standard audio representation (PCM) for subsequent handling. Then, using an acoustic model, the indexing engine scans the input speech and produces the corresponding phonetic search track. An acoustic model jointly represents characteristics of both an acoustic channel (the environment in which the speech was uttered and the transducer through which it was recorded) and a natural language (in which human beings express the input speech). Audio channel characteristics include frequency response, background noise, and reverberation. Characteristics related to the language and its speakers include gender, dialect, and accent.
The end result of phonetic indexing of an audio file is the creation of a Phonetic Audio Track (PAT) file — a highly compressed
representation of the phonetic content of the input speech. Unlike LVCSR, whose essential purpose is to make irreversible and
possibly incorrect bindings between speech sounds and specific words, phonetic indexing merely infers the likelihood of potential phonetic content as a reduced lattice, deferring decisions about word bindings to the subsequent searching phase.
In order to support search in any given language, a database of that language must be built. This requires roughly 100 hours of
diverse content containing dialogue from a wide variety of speakers (gender, age, accent/inflection) and genres along with the
complete transcripts for that content, which are compiled and processed to create a language “pack.” The ability to have broad
language support is a key advantage over speech-to-text applications.
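To make the idea of a “reduced lattice” concrete, the toy sketch below represents a search track as a matrix of per-frame phoneme likelihoods rather than as a sequence of words. The frame duration, phoneme labels, and layout are assumptions chosen purely for illustration and do not reflect the actual PAT format.

import numpy as np

# A real English phoneme set has roughly 40 entries; a handful suffices here.
PHONEMES = ["_B", "_IY", "_T", "_UW", "_AH", "_N"]
FRAME_SECONDS = 0.01  # assume 10 ms analysis frames

def make_search_track(num_frames):
    # One row per time frame, one column per phoneme; random numbers stand in
    # for acoustic-model likelihood scores. No word decisions are made here.
    return np.random.randn(num_frames, len(PHONEMES))

track = make_search_track(num_frames=6000)  # about 60 seconds of audio at 10 ms per frame
print(track.shape)                          # (6000, 6)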
Searching
The system begins the searching phase by parsing the query string, which is specified as text containing one or more of the following:
• Words or phrases (e.g., “President” or “Supreme Court Justice”)
• Phonetic strings (e.g., “_B _IY _T _UW _B _IY,” six phonemes representing the acronym “B2B”)
• Temporal operators (e.g., “Obama &15 bailout,” representing two words or phrases spoken within 15 seconds of each other)
After the system parses the words, phrases, phonetic strings, and temporal operators within the query term, actual searching begins. Multiple PAT files can be scanned at high speed during a single search for likely phonetic sequences (possibly separated by offsets specified by temporal operators) that closely match the corresponding strings of phonemes in the query term. Since PAT files encode potential sets of phonemes rather than irreversible bindings between sounds and words, the matching algorithm is probabilistic and returns multiple results, each as a 4-tuple:
• PAT File (to identify the media segment associated with the hit)
• Start Time Offset (beginning of the query term within the media segment, accurate to one hundredth of a second)
• End Time Offset (approximate time offset for the end of the query term)
• Confidence Level (that the query term occurs as indicated, between 0.0 and 1.0)
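For illustration, the 4-tuple above could be represented in client code roughly as follows; the field names and file names are hypothetical, not the product’s actual result schema.

from dataclasses import dataclass

@dataclass
class PhoneticHit:
    pat_file: str        # identifies the media segment associated with the hit
    start_offset: float  # seconds from the start of the segment, ~0.01 s resolution
    end_offset: float    # approximate end of the query term
    confidence: float    # likelihood the query term occurs here, 0.0 to 1.0

hits = [
    PhoneticHit("evening_news_2013_06_01.pat", 912.37, 913.05, 0.94),
    PhoneticHit("evening_news_2013_06_01.pat", 2204.10, 2204.88, 0.61),
]

# Present the most likely hits first, as a search interface typically would.
for hit in sorted(hits, key=lambda h: h.confidence, reverse=True):
    print(f"{hit.pat_file} @ {hit.start_offset:.2f}s  conf={hit.confidence:.2f}")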
Key Benefits
• Speed, accuracy, scalability. The indexing phase devotes its limited time allotment only to categorizing input speech sounds into potential sets of phonemes, rather than making irreversible decisions about words. This approach preserves the possibility for high accuracy so that the searching phase can make better decisions when presented with specific query terms. Also, the architecture separates indexing and searching so that the indexing needs to be performed only once, typically during media ingest, and the relatively fast operation (searching) can be performed as often as needed.
• Open vocabulary. LVCSR systems can only recognize words found in their lexicons. Many common query terms, such as specialized terminology and the names of people, places, and organizations (collectively referred to as “entities”), are typically omitted from LVCSR lexicons, partly to keep the lexicons small enough that LVCSR can be executed cost-effectively in real time, and also because these kinds of query terms are notably unstable, as new terminology and names are constantly evolving. By enabling the search of entities, the search can be more specific and allow better discrimination of search results. Phonetic indexing is unconcerned with such linguistic issues, maintaining a completely open vocabulary (or, perhaps more accurately, no vocabulary at all).
• Low penalty for new words. LVCSR lexicons can be updated with new terminology, names, and other words. However, this exacts a serious penalty in the cost of ownership because the entire media archive must then be reprocessed through LVCSR to recognize the new words (an operation that typically executes only slightly faster than real time at best). Also, probabilities need to be assigned to the new words, either by guessing their frequency or context or by retraining a language model that includes the new words. The dictionary within the phonetic searching architecture, on the other hand, is consulted only during the searching phase, which is relatively fast compared to indexing. Adding new words incurs only another search, and it is often unnecessary to add words, since the spelling-to-sound engine can handle most cases automatically, or users can simply enter sound-it-out versions of words.
• Phonetic, inexact spelling, and multiple pronunciations. Proper names are particularly useful query terms, but they are also particularly difficult for LVCSR, not only because they may not occur in the lexicon as described above, but also because they often have multiple spellings (and any variant may be specified at search time). With phonetic searching, exact spelling is not required. This advantage becomes clear with a name that can be spelled “Qaddafi,” “Khaddafi,” “Quadafy,” “Kaddafi,” or “Kadoffee,” any of which could be located by phonetic searching.
• User-determined depth of search. If a particular word or phrase is not spoken clearly, or if background noise interferes at that moment, then LVCSR will likely not recognize the sounds correctly. Once that decision is made, the correct interpretation is hopelessly lost to subsequent searches. Phonetic searching, however, returns multiple results, sorted by confidence level. The sound at issue may not be the first result, or even in the top 10 or 100, but it is very likely in the results list somewhere, particularly if some portion of the word or phrase is relatively unimpeded by channel artifacts. If enough time is available, and if the retrieval is sufficiently important, then a motivated user aided by an efficient human interface can drill as deeply as necessary.
• Amenable to parallel execution. The phonetic searching architecture can take full advantage of any parallel processing accommodations; for example, a server with 32 cores can index nearly 32 times as fast as a single core (a sketch of this pattern follows this list). Additionally, PAT files can be processed in parallel by banks of computers to search more media per unit time, or search tracks can be replicated in the same implementation to handle more queries over the same media. This is a significant improvement over LVCSR, which does not scale linearly because it must devote more processing cores to maintain higher accuracy rates.
• Broad language support. The ability to recognize different languages is unique and unmatched. New languages are being added on a regular basis.
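As noted in the parallel-execution point above, indexing jobs are independent per file, which is what lets throughput scale nearly linearly with core count. The sketch below shows that general pattern with a hypothetical index_file() step standing in for the real indexing engine; it is an illustration of the scaling argument, not Nexidia’s implementation.

from multiprocessing import Pool, cpu_count

def index_file(media_path):
    # Placeholder for the real work: decode to PCM, score phonemes,
    # and write a .pat search track next to the media file.
    return media_path + ".pat"

if __name__ == "__main__":
    media_files = [f"archive/clip_{i:05d}.mxf" for i in range(1000)]
    # One worker per core, e.g., 32 workers on a 32-core server.
    with Pool(processes=cpu_count()) as pool:
        pat_files = pool.map(index_file, media_files)  # files are independent, so this scales ~linearly
    print(f"Indexed {len(pat_files)} files")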
Accuracy
Phonetic-based search results are returned as a list of potential hit locations, in descending order of likelihood (see Figure 2). As a user progresses further down this list, they will find more and more instances of their query. However, they will also eventually encounter an increasing number of false alarms (results that do not correspond to the desired search term).
Figure 2: Search Results
When using a typical North American English broadcast language pack and a query length of 12-15 phonemes, you can expect, on average, to find 85% of the true occurrences, with less than one false hit per two hours of media searched. The user can choose to ensure a high probability of detection by accepting results with moderate confidence scores, or to reduce false alarms by raising the score threshold and accepting only results with high confidence scores. These settings can also be calculated automatically to balance accuracy and recall according to the phonetic characteristics of the body of media about to be searched.
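That trade-off amounts to cutting the same ranked result list at different confidence thresholds. The sketch below is illustrative only; the result fields and threshold values are assumptions, not product defaults.

def filter_hits(hits, min_confidence):
    # Keep only results whose confidence meets the chosen threshold.
    return [h for h in hits if h["confidence"] >= min_confidence]

results = [
    {"file": "a.pat", "start": 12.3, "confidence": 0.92},
    {"file": "a.pat", "start": 84.0, "confidence": 0.55},
    {"file": "b.pat", "start": 7.1,  "confidence": 0.31},
]

high_recall    = filter_hits(results, min_confidence=0.30)  # favor detections, tolerate false alarms
high_precision = filter_hits(results, min_confidence=0.80)  # favor fewer false alarms, risk misses
print(len(high_recall), len(high_precision))                # 3 1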
In a word-spotting system, more phonemes in the query mean more discriminative information is available at search time. Fortunately, rather than short, single-word queries such as “no” or “the,” most real-world searches are for proper names, phrases, or other interesting speech that represents longer phoneme sequences. Even when the desired phrase is short, it can almost always be interpreted as an OR of several common carrier phrases. For example, “tennis court” OR “tennis elbow” OR “tennis match” OR “tennis shoes” would be four carrier phrases that would capture most instances of the word “tennis.”
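As a rough illustration of the carrier-phrase approach, the snippet below expands a short term into an OR of longer phrases. The query syntax is invented for illustration and is not necessarily the product’s structured-query syntax.

def carrier_phrase_query(word, carriers):
    # Each carrier phrase adds phonemes, and therefore discriminative power.
    phrases = [f'"{word} {carrier}"' for carrier in carriers]
    return " OR ".join(phrases)

query = carrier_phrase_query("tennis", ["court", "elbow", "match", "shoes"])
print(query)  # "tennis court" OR "tennis elbow" OR "tennis match" OR "tennis shoes"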
Speed
Indexing speed is another important metric and is defined as the speed at which media can be made searchable. Indexing requires a relatively constant amount of computation per media hour. Thus, on a single server, the indexing time for 3,200 hours of PCM content is less than an hour of real time. Put another way, a single server at full capacity can index over 76,800 hours’ worth of media per day. When more cores are added, speed increases.
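As a quick check of the arithmetic behind those figures, assuming the stated rate of roughly 3,200 hours of media indexed per wall-clock hour on a single server:

HOURS_INDEXED_PER_WALL_CLOCK_HOUR = 3200  # stated single-server rate for PCM content

hours_per_day = HOURS_INDEXED_PER_WALL_CLOCK_HOUR * 24
print(hours_per_day)  # 76800 hours of media made searchable per server per day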
A final performance measure is the speed at which media can be searched once it has been indexed. Two main factors influence the speed of searching: whether the PAT files are currently in memory, and the read access time of the storage. The search engine loads a PAT file into memory either when an application explicitly requests it (because it expects the file to be needed soon) or upon the first search of that track. Recent advancements have also enabled the creation of a higher-level index that can achieve search performance millions of times faster than real time.
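A minimal sketch of that loading behavior, with hypothetical class and method names; it assumes only that a search track is read from storage once (on explicit request or on first search) and then served from memory.

class PatCache:
    def __init__(self):
        self._in_memory = {}

    def prefetch(self, pat_path):
        # Explicitly load a track the application expects to search soon.
        self._load(pat_path)

    def get(self, pat_path):
        # Return the in-memory track, loading it lazily on the first search.
        if pat_path not in self._in_memory:
            self._load(pat_path)  # the storage read cost is paid only once
        return self._in_memory[pat_path]

    def _load(self, pat_path):
        with open(pat_path, "rb") as f:
            self._in_memory[pat_path] = f.read()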
Conclusions
This paper has given an overview of the issues surrounding the need to find and access clips in any body of media, including news, sports, education, government, and production content, so that they can be used to create new content and be monetized quickly and easily. A discussion of different types of metadata creation and their various shortcomings brought us to the relatively new method of phonetic search. The method breaks searching into two stages: indexing and searching. Search queries can be words, phrases, or even structured queries that allow operators such as AND, OR, and time constraints on groups of words. Search results are returned as lists: file names or unique identifiers for the media, the name of the search query or queries, and the corresponding time codes, along with an accompanying score giving the likelihood that a match to the query occurred at that time.
Phonetic searching has several advantages over previous methods of searching media. By not constraining the pronunciation of
searches, the method can find any proper name, slang, or even words that have been incorrectly spelled — completely avoiding
the out-of-vocabulary problems of speech recognition systems. Phonetic search is also fast. For deployments such as major
broadcasting networks with tens of thousands of hours of media, users don’t have to choose a subset to analyze since all media
can be indexed for search, even with modest resources. Once media is ingested, it can be indexed as part of the workflow and become immediately searchable. Unlike other approaches, phonetic search technologies are very scalable, allowing
for fast and efficient searching and analysis of extremely large archives.
References
[1] D. Jurafsky and J. Martin, Speech and Language Processing, Prentice-Hall, 2000.
[2] J. Wilpon, L. Rabiner, L. Lee, and E. Goldman, “Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 11, pp. 1870-1878, November 1990.
[3] R. Wohlford, A. Smith, and M. Sambur, “The Enhancement of Wordspotting Techniques,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Denver, CO, vol. 1, pp. 209-212, 1980.
[4] R. R. Sarukkai and D. H. Ballard, “Phonetic Set Indexing for Fast Lexical Access,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 78-82, January 1998.
[5] D. A. James and S. J. Young, “A Fast Lattice-Based Approach to Vocabulary Independent Wordspotting,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, vol. 1, pp. 377-380, 1994.
[6] P. Yu, K. Chen, C. Ma, and F. Seide, “Vocabulary-Independent Indexing of Spontaneous Speech,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, September 2005.
Applicable Patents
Phonetic Searching
Patent 7,263,484; issued August 28, 2007
Creation and search of phonetic index for audio/video files
Phonetic Searching
Patent 7,313,521; issued December 25, 2007
Assessment of search term quality
Phonetic Searching
Patent 7,324,939; issued January 29, 2008
Indexing and search covering both forward and backward directions in time
Phonetic Searching
Patent 7,406,415; issued July 29, 2008
Structured queries: combination of search terms via Boolean and time-based operators
Phonetic Searching
Patent 7,475,065; issued January 6, 2009
Search via linguistic search term plus phonetic search term or voice command
Wordspotting System Normalization
Patent 7,650,282; issued January 19, 2010
Structured query normalization: statistical modeling of score distributions of potential
hits to characterize and reduce false alarms to improve accuracy; auto thresholding
Phonetic Searching
Patent 7,769,587; issued August 3, 2010
Phonetic indexing and search of text
Comparing Events in Word Spotting
Patent 8,170,873; issued May 1, 2012
Application of subword unit models to classify audio
Spoken Word Spotting Queries
Patent 7,904,296; issued March 8, 2011
Searching audio by selecting audio clips as the search query
Multiresolution Searching
Patent 7,949,527; issued May 24, 2011
Faster search of spoken content via multiresolution phonetic indexing and novel compression techniques
Keyword Spotting Using a Phoneme-Sequence Index
Patent 8,311,828; issued November 13, 2012
Application of phonetic search to very large sets of data
Copyright Notice
Copyright © 2004-2014, Nexidia Inc. All rights reserved.
This document and any software described herein, in whole or in part may not be reproduced, translated or modified in any manner,
without the prior written approval of Nexidia Inc. This document is the copyrighted work of Nexidia Inc. or its licensors and is owned
by Nexidia Inc. or its licensors. This document contains information that may be protected by one or more U.S. patents, foreign
patents or pending applications.
Trademarks
Nexidia, Nexidia Dialogue Search, the Nexidia logo, and combinations thereof are trademarks of Nexidia Inc. in the United States
and other countries. Other product names and brands mentioned in this manual may be the trademarks or registered trademarks
of their respective companies and are hereby acknowledged.
Disclaimer
This paper was first presented at the 2014 NAB Broadcast Engineering Conference on Wednesday, April 9, 2014 in Las Vegas, Nevada.
You can find additional papers from the 2014 NAB Broadcast Engineering Conference by purchasing a copy of the 2014 BEC
Proceedings at www.nabshow.com.
Nexidia Inc. – Headquarters
3565 Piedmont Road NE Building Two, Suite 400
Atlanta, GA 30305
USA
[email protected]
© 2014 Nexidia Inc. All rights reserved. All trademarks are the property of their respective owners. Nexidia products are protected by copyrights
and one or more of the following United States patents: 7,231,351; 7,263,484; 7,313,521; 7,324,939; 7,406,415, 7,475,065, 7,769,586,
7,231,351, 7,487,086, 7,640,161, 7,650,282, 7,904,296, 7,949,527, 8,051,086, 8,170,873, 8,311,828 and other patents pending.
nexidia.tv