
NTU Speech Processing Lab
Summer Undergraduate Research Program
ASR System & LIBDNN
Yen-Chen Wu
[email protected]
Outline
 DNN in Speech Recognition
 DNN
 TIMIT Introduction
 How to use libdnn
DNN IN SPEECH RECOGNITION
Speech Recognition
 In speech processing…
 each word consists of syllables
 each syllable consists of phonemes
“青色” → 青 (ㄑㄧㄥ) + 色 (ㄙㄜˋ) (syllables)
青: TSI - I - N (phonemes)
色: S - @ (phonemes)
 In each time frame, an observation (feature vector) is mapped to a phoneme.
[Figure: an observation sequence sampled at 16000 Hz is cut into frames of features by a 25 ms sliding window advanced every 10 ms (Frame 1, Frame 2, Frame 3, ...). Source: Digital Speech Processing, Lect. 2.0]
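 A quick sanity check on these numbers: at a 16 kHz sampling rate, a 25 ms window covers 16000 × 0.025 = 400 samples, and the window advances 10 ms = 160 samples per frame, so consecutive frames overlap by 240 samples.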
DNN in Speech Recognition
 Goal: predict the phoneme of each time frame, given its acoustic features.
 Frame-wise prediction
 Input: acoustic features
 MFCC, FBANK or...
 Output: pronunciation units
 Phonemes or...
 To learn more about Automatic Speech Recognition (ASR), please refer to
http://speech.ee.ntu.edu.tw/DSP2015Spring/
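 Concretely, judging from the nn-init example later in these slides (--input-dim 69, --output-dim 39), each frame in this exercise is represented by a 69-dimensional feature vector and is classified into one of 39 phoneme classes.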
Training Deep Neural Network
Main Problems
 Model initialize
 Feedforward
 Backpropagate
 Update
 Predict
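 Roughly speaking, in the libdnn tools introduced later, the initialization step is handled once by nn-init, the feedforward/backpropagate/update loop is run over the training data by nn-train, and prediction on new data is done by nn-predict.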
Model Initialize
 DNN training can get stuck in poor local optima, so initialization matters.
 In practice, there are unsupervised pre-training techniques for initialization.
 However, for this homework we recommend random initialization, for simplicity and efficiency.
Feedforward
Backpropagate
Update
Evaluation
 Framewise phoneme prediction
 Frame Accuracy
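 Frame accuracy here means the fraction of frames whose predicted phoneme matches the reference label.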
WHY DNN?
 Basic Model in Deep Learning
 Feature Extraction (Representation)
 Variety of Structures (CNN, RNN, LSTM, NTM, etc.)
 Network Structure
 How many layers?
 Number of neurons in each layer
 Training Parameters
 Learning Rate
 Batch Size
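 In libdnn, the network structure is fixed when the model is created (the --struct and --output-dim options of nn-init shown later), while training parameters such as the learning rate and batch size are presumably passed through the [options] accepted by nn-train; check each program's usage for the exact option names.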
Dataset and Format
Dataset
 TIMIT (Texas Instruments and Massachusetts Institute of Technology)
 Well-transcribed speech of American English
speakers of different sexes and dialects.
 Designed for the development and evaluation of
ASR systems.
HOW TO USE LIBDNN
LIBDNN
 libdnn is a lightweight, readable, and user-friendly deep learning library. It is written in C++ and CUDA, with the goal of letting developers, researchers, or anyone interested easily experience and harness the power of deep learning.
 Ref:
 以深層與卷積類神經網路建構聲學模型之大字彙連續語音辨識 (Deep and Convolutional Neural Networks for Acoustic Modeling in Large Vocabulary Continuous Speech Recognition)
 Already installed on the undergraduate research workstation
Data Format (1)
 Sparse matrix (LibSVM)
 Most entries of such a vector are 0; only a few dimensions have the value 1
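 For illustration, in the standard LibSVM format each line lists the label followed by index:value pairs for the non-zero dimensions only; a frame with label 5 whose only non-zero dimensions are 3 and 17 would be written as "5 3:1 17:1" (the label and indices here are made up for the example).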
Data Format (2)
 Densely packed format (dense)
 This is the format provided for this exercise
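 In the dense format every dimension is written out explicitly, so each line carries a frame's label together with all of its feature values (69 per frame in this exercise); the exact column order is not spelled out here, so check the files under /home/wyc2010/DNN_practice for the precise layout.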
How to Use
 There are three main programs:
 nn-init
 nn-init [train_set_file] <-o> <--input-dim> <--struct> [options]
 EX: nn-init -o init.model --input-dim 69 --struct 1024 --output-dim 39
 nn-train
 nn-train <training_set_file> <model_in> [valid_set_file] [model_out]
<--input-dim> [options]
 EX: nn-train train.dat init.model --input-dim 69
 nn-predict
 nn-predict <testing_set_file> <model_file> [output_file] <--input-dim> [options]
 EX: nn-predict test.dat train.model --input-dim 69
 The commands are collected into a shell script; just modify the parameters directly (a sketch is shown below)
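 A rough, non-authoritative sketch of what run.sh might contain, assembled purely from the EX commands above (file names such as train.dat, test.dat, init.model, and train.model come from those examples; how the optional [model_out] and [output_file] arguments are actually used must be checked against the real run.sh):
#!/bin/sh
# Sketch of a possible run.sh, based on the example commands above

# 1. Randomly initialize a 69-1024-39 network and save it as init.model
nn-init -o init.model --input-dim 69 --struct 1024 --output-dim 39

# 2. Train the network on the dense-format training data
#    (the trained model is assumed here to be saved as train.model)
nn-train train.dat init.model --input-dim 69

# 3. Predict phoneme labels for the test data with the trained model
nn-predict test.dat train.model --input-dim 69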
WORKSTATION
 For a workstation account for undergraduate research students, please contact
 the lab network administrator: 廖宜修
 [email protected]
 ssh -p 2822 [email protected]
 After logging into the workstation, first confirm the data location
 /home/wyc2010/DNN_practice
 Copy run.sh back to your home directory
 cp /home/wyc2010/DNN_practice/run.sh ~
 Start experimenting!
 sh run.sh