Gist: Achieving Human Parity in Conversational Speech Recognition

Suggested Major Project Topic Areas
• Image recognition
  • Category of image
  • Handwriting recognition
  • Object identification
  • Image to text
• Reinforcement learning
• Unsupervised learning
  • Generative adversarial networks
  • Autoencoders
  • Semi-supervised learning
  • Transfer learning
  • Other
• Speech
  • Speech recognition
  • Speech synthesis
  • Other synthesis (music, painting, text)
• Natural language
  • Machine translation
  • Word embeddings
  • Summarization
  • Text understanding
  • Information retrieval
• Other LSTM and gated network applications
  • Health and Medicine
  • Preference prediction
  • Sequence prediction
  • Time series
  • Stocks
• Very deep networks
  • Highway networks
  • Residual learning
  • Other
• Interpretability and Human-supplied knowledge and control

Each team should post their tentative topic choice and begin posting gists. If you are undecided, post gists on multiple topics.
Each person should post at least 4 gists. You can post as many as you want. There will be a running competition for the best gists. You may rate and comment on any posted gist.
Gist Format
• Quick gist of the paper:
  • What is the significant result?
  • How major?
  • What is the premise?
  • What is the main prior work?
  • What are the new methodologies?
  • What techniques are assumed known?
[Callouts on the paper's abstract:]
• They are proud of this work. Most of the abstract is spent bragging about the result.
• Repeats the claim in the title.
• Announces that what follows is important.
Things to look for in the paper:
• Convolutional and LSTM networks (existing techniques)
• Novel spatial smoothing (what's new! the novel technique)
• Lattice-free MMI acoustic training (something else that may be nonstandard)
• Why "systematic" use? (why did they emphasize that the use was systematic?)
• The results (compare to human; of course, also look for the results!)
Modest about Their Method
Best practices: You should do this, too!
This explains the emphasis on "systematic" use. While they are proud of their results, they are being modest about how they achieved them. They attribute the results mainly to careful engineering rather than to the novel techniques that they have added to the old stand-bys.
CNNs
What's new? Yes, every course in deep learning should cover CNNs and RNNs. (FYI: the paper gives a more complete history, which is not necessary for the gist.) Notice that they only reference, but do not describe, the prior work. Can you tell which of these references have Microsoft authors? You will need to read these references to understand the techniques used in this paper, and this is just to understand the CNNs. This gist should list the full title of each cited reference as required prior-work reading.
Final CNN Variant: LACE
More prior-work references. Also, ResNet is an implicit prior-work reference. At least the LACE architecture itself is shown in detail.
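Since ResNet is implicit prior work here, a minimal sketch of the core ResNet idea may help: a convolutional layer with a jump (residual) connection. This is illustrative PyTorch only, not the paper's actual LACE block, and all names and sizes are made up.

```python
# A convolutional layer with a jump (residual) connection: the ResNet idea
# that the text calls an implicit prior-work reference. NOT the paper's
# LACE block; channel count and input sizes are illustrative.
import torch
import torch.nn as nn

class JumpConvBlock(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The jump connection adds the input back in, so the layer only has
        # to learn a correction to its input rather than a full mapping.
        return self.act(self.conv(x) + x)

block = JumpConvBlock(32)
y = block(torch.randn(4, 32, 40, 40))   # batch of 4, 32 channels, 40x40
print(y.shape)                          # torch.Size([4, 32, 40, 40])
```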
LSTMs
Although LSTMs are only “a close second”, they are used
in combination with the convolutional networks, so you
need to know how to implement both.
More prior-work references.
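Since you need to be able to implement both model families, here is a minimal PyTorch sketch that assumes, for illustration only, a tiny CNN and a tiny LSTM acoustic model combined by averaging frame-level log-posteriors. The paper combines full systems more elaborately; the senone count and feature dimension below are made up.

```python
# Two toy acoustic models over the same features, combined by averaging
# their frame-level log-posteriors (a crude stand-in for system combination).
import torch
import torch.nn as nn

NUM_SENONES = 100   # illustrative; real systems use thousands
FEAT_DIM = 40       # e.g. 40 log-mel filterbank coefficients per frame

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(FEAT_DIM, 64, kernel_size=5, padding=2)
        self.out = nn.Linear(64, NUM_SENONES)

    def forward(self, feats):                             # (batch, time, feat)
        h = torch.relu(self.conv(feats.transpose(1, 2)))  # (batch, 64, time)
        return self.out(h.transpose(1, 2))                # (batch, time, senones)

class TinyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, NUM_SENONES)

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.out(h)

feats = torch.randn(2, 200, FEAT_DIM)    # 2 utterances, 200 frames each
log_post = lambda m: torch.log_softmax(m(feats), dim=-1)
combined = 0.5 * log_post(TinyCNN()) + 0.5 * log_post(TinyLSTM())
print(combined.shape)                    # torch.Size([2, 200, 100])
```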
Spatial Smoothing
The part that is new. There are no prior-work references,
but there is quite a bit of jargon.
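Because the section is jargon-heavy and this gist does not unpack it, the following is a loose illustration only, an assumption rather than the paper's formulation: one generic way to smooth "spatially" is to arrange a layer's hidden units on a 2D grid and penalize differences between adjacent activations.

```python
# Generic spatial-smoothness penalty over activations on a 2D grid; the
# penalty would be added to the training loss. Illustrative only.
import torch

def smoothness_penalty(acts: torch.Tensor) -> torch.Tensor:
    """acts: (batch, height, width) activations arranged on a 2D grid."""
    dh = acts[:, 1:, :] - acts[:, :-1, :]   # vertical neighbor differences
    dw = acts[:, :, 1:] - acts[:, :, :-1]   # horizontal neighbor differences
    return (dh ** 2).mean() + (dw ** 2).mean()

acts = torch.randn(8, 16, 16, requires_grad=True)
penalty = 0.1 * smoothness_penalty(acts)   # the 0.1 weight is arbitrary
penalty.backward()                         # gradients flow to the activations
print(penalty.item())
```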
Speaker Adaptive Modeling
The abstract didn’t even mention speaker-adaptive
modeling, but you’ll need to know how to implement
it. More prior-work references. Do you understand
why CNN models are treated differently from the
LSTM models with regard to the appended i-vector?
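As a minimal sketch of the i-vector appending idea for the LSTM input: broadcast one fixed per-speaker vector across all frames and concatenate it onto the per-frame features. Dimensions and values are illustrative; the CNNs consume the i-vector differently, which is what the question above is about.

```python
# Append a per-speaker i-vector to every frame's feature vector.
import torch

FEAT_DIM, IVEC_DIM = 40, 100
frames = torch.randn(1, 200, FEAT_DIM)   # one utterance, 200 frames
ivector = torch.randn(IVEC_DIM)          # one fixed i-vector per speaker

adapted = torch.cat(
    [frames, ivector.expand(1, frames.size(1), IVEC_DIM)], dim=-1)
print(adapted.shape)   # torch.Size([1, 200, 140]); LSTM input size becomes 140
```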
Lattice-Free Sequence Training
This is only the initial paragraph. There are several more paragraphs describing "our implementation". However, this clip shows the prior-work references. This section is a mix of a new implementation and prior work. They did not call it "novel" as they did the spatial smoothing.
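For reference, the standard MMI sequence-training objective (the notation is the conventional one, not taken from the paper); "lattice-free" means the denominator sum is computed over a phone-level LM graph rather than per-utterance decoded lattices.

```latex
% Standard MMI criterion over utterances u with acoustics X_u and reference
% transcripts W_u; kappa is the acoustic scale.
\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log
  \frac{p(X_u \mid W_u)^{\kappa}\, P(W_u)}
       {\sum_{W'} p(X_u \mid W')^{\kappa}\, P(W')}
```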
LM Rescoring and System Combination
This is how they combine the acoustic analysis with the language models. In addition to the explicit references, you will need to look at prior references for RNN LMs and LSTM LMs, but we will not look at the details in the next two sections.
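A minimal sketch of the rescoring idea: re-rank N-best hypotheses by a weighted sum of the acoustic score and an interpolated LM score. The log-linear combination, the weights, and the scores below are illustrative assumptions, not the paper's recipe; real systems tune the weights on held-out data.

```python
# N-best rescoring with an interpolated LM score (toy numbers throughout).

def rescore(nbest, lm_ngram, lm_rnn, lam=0.5, lm_weight=0.8):
    """nbest: list of (hypothesis, acoustic log-score) pairs.
    lm_ngram / lm_rnn: dicts mapping hypothesis -> LM log-score."""
    def total(hyp, am_score):
        lm_score = lam * lm_ngram[hyp] + (1 - lam) * lm_rnn[hyp]
        return am_score + lm_weight * lm_score
    return max(nbest, key=lambda pair: total(*pair))

nbest = [("i scream", -12.0), ("ice cream", -12.5)]
lm_ngram = {"i scream": -9.0, "ice cream": -6.0}
lm_rnn = {"i scream": -8.5, "ice cream": -5.5}
print(rescore(nbest, lm_ngram, lm_rnn))   # ('ice cream', -12.5)
```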
Results
There are many tables of results, showing tests of individual
components and various combinations. There are also
comparisons with prior work. One table is sufficiently representative for the gist; it shows results for various
versions of the Microsoft system and compares the final
system with human performance.
Summary of Gist
• A “break-through” paper announcing the first result
exceeding human performance on a well-known, heavily
researched benchmark.
• The result was mainly achieved by “careful engineering
and optimization” based on accumulated prior art.
• If you have already implemented all of the prior art, there is relatively little new that you need to implement. However, even then there will be a lot of tuning and optimizing.
• If you are starting from scratch, there is a very large
amount of prior work that you will need to implement
as a prerequisite.
• This is an important paper on a major piece of work.
• If you want to understand state-of-the-art speech
recognition very thoroughly, reading this paper and its
references would be a good start.
• There are many sections in the paper that were skipped in this summary, such as descriptions of the testing of human performance and the implementation on Microsoft's CNTK framework.
• Conclusion: This paper is worth reading if you want to be up to date in speech recognition. It would be extremely ambitious as a student team project.