Introduction to Information and Systems Engineering
Modeling of Image and Video Recognition
Department of Mechano-Informatics (Mechanical B)
Tatsuya Harada
Difficulties of General Visual Recognition
• Ambiguity of human recognition: weak labeling
  – Example tag sets: "jet, plane, sky" / "cat, tiger, forest, tree" / "beach, people, water, oahu"
• Consideration of context
  – Example: plate, pot, kettle, knife / cup, teapot
• Mismatch between data-driven features and semantics: the semantic gap
• Scalability to large amounts of training data
• Adaptation to diverse environments: fast and stable incremental learning
Learning Concepts and Image Recognition
[Figure: training images with concept labels (e.g. "white, bear, river", "snow, fox, white", "bird, sky, flight", "bear, brown, grass", "fox, grass, brown", "black, bear, river") mapped from the image feature space to the concept space; predicted concepts for a query image: Fox 0.90, White 0.83, River 0.54, Bear 0.54, Snow 0.51]
The Data Processing Theorem
Markov chain: the state of the world w → the gathered data d → the processed data r
Average information: I(W; D) ≥ I(W; R)
The data processing theorem states that data processing can only destroy information.
David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
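To make the inequality concrete, here is a minimal numerical sketch (my illustration, not from the slides): a binary world state W is observed through a noisy channel as D, and a further lossy processing step yields R, forming the Markov chain W → D → R. Computing the mutual information of both joints confirms I(W; D) ≥ I(W; R).

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits from a joint distribution table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

# World state W: uniform binary. Observation D: W flipped with probability 0.1.
p_w = np.array([0.5, 0.5])
p_d_given_w = np.array([[0.9, 0.1], [0.1, 0.9]])
p_wd = p_w[:, None] * p_d_given_w              # joint P(W, D)

# Processing R: D flipped again with probability 0.2 (lossy processing).
p_r_given_d = np.array([[0.8, 0.2], [0.2, 0.8]])
p_wr = p_wd @ p_r_given_d                      # joint P(W, R) via the Markov chain

print(mutual_information(p_wd))  # I(W;D) ~ 0.53 bits
print(mutual_information(p_wr))  # I(W;R) ~ 0.17 bits <= I(W;D)
```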
Pipeline of Image Recognition
Training: training data → feature extraction → model (classifier)
Testing: test data → feature extraction → model (classifier)
(Feature extraction is the upstream stage; the classifier is the downstream stage.)
• The upstream process is more important than the downstream process.
• The data and the feature extraction are more important than the classifier.
• Knowledge about the objects is very important.
• It is becoming popular to use a large amount of images collected from the internet; a minimal sketch of the pipeline follows below.
  – There is a huge number of object categories, and their appearance is diverse.
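The following is a minimal, self-contained sketch of this train/test pipeline (my illustration; the histogram extractor and the random data are stand-ins, not the methods used in the lecture):

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_feature(image):
    """Stand-in feature extractor: a normalized gray-level histogram."""
    hist, _ = np.histogram(image, bins=32, range=(0, 255))
    return hist / max(hist.sum(), 1)

# Training: features from the training data -> classifier (downstream model)
rng = np.random.default_rng(0)
train_images = [rng.integers(0, 256, (64, 64)) for _ in range(20)]  # dummy images
train_labels = np.array([0, 1] * 10)                                # dummy labels
X_train = np.array([extract_feature(im) for im in train_images])
clf = LinearSVC().fit(X_train, train_labels)

# Testing: the same feature extraction, then the learned model
test_image = rng.integers(0, 256, (64, 64))
print(clf.predict(extract_feature(test_image)[None]))
```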
Large Scale Object Recognition
• ILSVRC (ImageNet Large Scale Visual Recognition Challenge)
  – Image recognition competition using large-scale image data
  – http://www.image-net.org/challenges/LSVRC/2012/index
• Task 1: What's this image? (e.g. "sports car")
  – Learning from 1.2 million images
  – Classifying 1,000 object classes
• Task 2: Where's this object?
  – Detecting 1,000 object classes in images
• Task 3: What kind of dog is this? (e.g. Shih-Tzu, Pomeranian, toy poodle)
  – Fine-grained classification over 120 dog sub-classes
  – More difficult than Task 1
Results (2012). Task 1 was won by a deep CNN.
Task 1 (flat error):
1) SuperVision (Univ. of Toronto): 0.153
2) ISI (ours, Univ. of Tokyo): 0.262
3) OXFORD_VGG (Univ. of Oxford): 0.270
Task 3 (mAP):
1) ISI (ours, Univ. of Tokyo): 0.323
2) XRCE/INRIA (Xerox Research Centre Europe / INRIA): 0.310
3) Uni Jena (Univ. Jena): 0.246
http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc2012/index.html
Example top-5 predictions on test images:
1. brown bear
2. Tibetan mastiff
3. sloth bear
4. American black bear
5. bison
1. baseball player
2. unicycle
3. racket
4. rugby ball
5. basketball
1. digital watch
2. Band Aid
3. syringe
4. slide rule
5. rubber eraser
1. shower cap
2. bonnet
3. bath towel
4. bathing cap
5. ping-pong ball
1. diaper
2. swimming trunks
3. bikini
4. miniskirt
5. cello
1. Siamese cat
2. Egyptian cat
3. Ibizan hound
4. balance beam
5. basenji
1. king penguin
2. sea lion
3. drake
4. magpie
5. oystercatcher
1. oboe
2. flute
3. ice lolly
4. bassoon
5. cello
1. beer bottle
2. pop bottle
3. wine bottle
4. Polaroid camera
5. microwave
1. butcher shop
2. swimming trunks
3. miniskirt
4. barbell
5. feather boa
Fine-grained object recognition results (2012)
English setter
Siberian husky
Welsh springer spaniel
otterhound
Staffordshire bullterrier
Australian terrier
whippet
English springer
Scottish deerhound
bloodhound
Mexican hairless
Airedale
Weimaraner
malamute
soft-coated wheaten terrier
Dandie Dinmont
giant schnauzer
Bouvier des Flandres
Great Dane
black-and-tan coonhound
miniature poodle
Cardigan
Walker hound
Old English sheepdog
papillon
malinois
What Is an Image Feature? A Generative Approach
General pipeline:
1) Input image
2) Detection of interesting points (e.g. corners, edges)
3) Description: a local descriptor (e.g. 8-dimensional) is computed around each point, giving d_1, d_2, ..., d_N
4) The local descriptors are viewed as points in feature space
5) PDF estimation: a distribution p(d; θ) is fitted to the descriptors
6) Feature vector: x = f(θ)
An image is represented by a probability distribution.
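A minimal sketch of steps 5–6 (my illustration, assuming the fitted distribution is a Gaussian; the detector and descriptors of steps 1–4 are replaced by random stand-ins):

```python
import numpy as np

def image_feature(descriptors):
    """Steps 5-6: fit a Gaussian p(d; theta) to the local descriptors and
    map its parameters theta = (mean, covariance) to a feature vector x = f(theta)."""
    mu = descriptors.mean(axis=0)
    cov = np.cov(descriptors, rowvar=False)
    # Concatenate the mean with the upper triangle of the covariance.
    return np.concatenate([mu, cov[np.triu_indices_from(cov)]])

# Stand-in for steps 1-4: N detected interest points, each described by
# an 8-dimensional local descriptor.
descriptors = np.random.randn(200, 8)
x = image_feature(descriptors)
print(x.shape)  # 8 + 8*9/2 = 44 dimensions
```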
Distance between Images
• The image features (distribution parameters) live on a manifold; distances are defined through a Riemannian metric, tangent spaces, and a connection.
• The metric on the tangent space is the Fisher information matrix, which gives the similarity measure.
• For the exponential family, if we take the natural or expectation parameters, we can obtain a flat space, yielding a better feature.
H. Nakayama, T. Harada, Y. Kuniyoshi. CVPR, 2010.
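As a rough illustration only (my stand-in, not the method of Nakayama et al., which uses the Fisher information metric), two images modeled as 1-D Gaussians can be compared with a symmetrized KL divergence between their distributions:

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu0,var0) || N(mu1,var1)) between 1-D Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(p, q):
    """A simple (non-Riemannian) dissimilarity between two fitted models."""
    return kl_gauss(*p, *q) + kl_gauss(*q, *p)

# Two images represented by fitted Gaussian parameters theta = (mean, variance)
img_a, img_b = (0.0, 1.0), (0.5, 2.0)
print(symmetric_kl(img_a, img_b))
```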
Visual Cortex and Neurons
http://www.nature.com/scientificamerican/journal/v16/n3/box/scientificamerican0906-4sp_BX1.html
Mathematical model of a neuron:
z = f( Σ_{i=1}^{n} a_i x_i + h ),  f(x) = 1 / (1 + exp(−x))  (logistic sigmoid, saturating at 0 and 1)
Shun-ichi Amari. Neural Network Models and Connectionism. University of Tokyo Press, 1989.
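A direct transcription of this neuron model (a minimal sketch; the weights, bias, and input are made up):

```python
import numpy as np

def neuron(x, a, h):
    """z = f(sum_i a_i x_i + h) with the logistic sigmoid f."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, x) + h)))

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), -0.2))
```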
Multilayer Perceptron
Hidden layer: y_j = f( Σ_{i=1}^{I} a_{ij} x_i + a_{0j} )
Output layer: z_k = f( Σ_{j=1}^{J} b_{jk} y_j + b_{0k} )
with the logistic sigmoid f(x) = 1 / (1 + exp(−x)).
(Network: inputs x_1, ..., x_I → hidden units y_1, ..., y_J via weights a_{ij} → outputs z_1, ..., z_K via weights b_{jk}.)
• A multilayer extension of the logistic regression model.
• With a sufficient number of hidden units, any continuous function can be approximated to arbitrary accuracy.
• Can be trained efficiently by error backpropagation.
A minimal forward-pass sketch follows below.
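The forward pass of the two-layer network above, as a minimal NumPy sketch (random weights for illustration; training by backpropagation is omitted):

```python
import numpy as np

def f(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, A, a0, B, b0):
    """Forward pass matching the equations above:
    y_j = f(sum_i a_ij x_i + a_0j), z_k = f(sum_j b_jk y_j + b_0k)."""
    y = f(A.T @ x + a0)   # hidden layer
    z = f(B.T @ y + b0)   # output layer
    return z

rng = np.random.default_rng(0)
I, J, K = 4, 8, 3                    # input, hidden, output sizes
x = rng.standard_normal(I)
z = mlp_forward(x, rng.standard_normal((I, J)), np.zeros(J),
                rng.standard_normal((J, K)), np.zeros(K))
print(z)
```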
Neocognitron
Kunihiko Fukushima. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics 36, 193–202 (1980).
Convolutional Neural Networks
(a) LeNet-5, 1989: Input → Convolution → Average pooling → Convolution → Average pooling → Fully connected → Fully connected → Softmax activation
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. NIPS 1989.
Modules of Convolutional Neural Networks
• Convolution layer
• Fully-connected layer
• Pooling layer
Convolutional and Fully-Connected Layers
• Fully-connected layer: every unit in the (L−1)-th layer is connected to every unit in the L-th layer with its own weight.
• Convolutional layer: each unit of a feature map in the L-th layer is connected only to a local receptive field in the (L−1)-th layer, and the same weights w_1, w_2, w_3, ... are shared across positions.
Feature map computation:
x_j^l = f( Σ_{i∈M_j} x_i^{l−1} ∗ k_{ij}^l + b_j )
where M_j is the set of input maps, k_{ij}^l are the convolution kernels, and b_j is the bias.
Typical nonlinearities:
• Logistic sigmoid function: f(x) = (1 + e^{−βx})^{−1}
• Hyperbolic tangent function: f(x) = a tanh(bx)
Pooling Layer
Convolution over a local receptive field is followed by sub-sampling:
x_j^l = f( β_j^l down(x_i^{l−1}) + b_j^l )
where down(·) is the sub-sampling function, typically average pooling or max pooling.
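A minimal sketch of one convolution-plus-pooling stage (my illustration; like most CNN libraries, it implements "convolution" as cross-correlation, and all sizes are arbitrary):

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D cross-correlation of one feature map x with kernel k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv_layer(maps, kernels, b, f=np.tanh):
    """x_j^l = f(sum_i x_i^{l-1} * k_ij^l + b_j): sum over input maps, then f."""
    return f(sum(conv2d_valid(x, k) for x, k in zip(maps, kernels)) + b)

def max_pool(x, s=2):
    """Max pooling with an s x s window (one common choice of down(.))."""
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
maps = [rng.standard_normal((8, 8)) for _ in range(2)]      # two input feature maps
kernels = [rng.standard_normal((3, 3)) for _ in range(2)]   # one kernel per input map
print(max_pool(conv_layer(maps, kernels, b=0.1)).shape)     # (3, 3)
```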
LeNet and AlexNet
(a) LeNet-5, 1989: Input → Convolution → Average pooling → Convolution → Average pooling → Fully connected → Fully connected → Softmax activation
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. NIPS 1989.
(b) AlexNet, 2012: Input → Convolution → Response normalization → Max pooling → Convolution → Response normalization → Max pooling → Convolution → Convolution → Convolution → Max pooling → Fully connected → Fully connected → Fully connected → Softmax activation
Features of AlexNet:
• 5 convolutional layers + 3 fully-connected layers
• ReLU nonlinearity
• Multiple GPGPUs
• Local response normalization
• Overlapping pooling
• Dropout
• Data augmentation
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
VGG Net
(b) AlexNet, 2012 (as above), versus:
(c) VGG Net, 2014: Input → [Convolution ×2 → Max pooling] ×2 → [Convolution ×4 → Max pooling] ×3 → Fully connected ×3 → Softmax activation
Features of VGG Net:
• 16 convolutional layers + 3 fully-connected layers
• Very small receptive fields
• No local response normalization
• Data parallelism
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
Network in Network
In a conventional convolutional layer, each feature map is computed as x_j^l = f( Σ_{i∈M_j} x_i^{l−1} ∗ k_{ij}^l + b_j ); Network in Network replaces this linear filter over each local receptive field with a small multilayer network.
Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. arXiv:1312.4400, 2014.
Spatial Pyramid Pooling Layer
Pools the convolutional feature maps over local receptive fields arranged in a spatial pyramid, producing a fixed-length representation independent of the input image size.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729, 2014.
Parametric ReLUs
Choices of the nonlinearity f in x_j^l = f( Σ_{i∈M_j} x_i^{l−1} ∗ k_{ij}^l + b_j ):
(a) Hyperbolic tangent function: f(x) = a tanh(bx)
(b) Rectified Linear Units (ReLUs): f(x) = max(0, x)
(c) Parametric ReLUs (PReLUs): f(x) = x if x > 0, and f(x) = ax otherwise
A sketch of the three activations follows below.
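The three activations side by side (a minimal sketch; the tanh constants and the PReLU slope are common defaults, not values from the slides):

```python
import numpy as np

def tanh_act(x, a=1.7159, b=2.0 / 3.0):   # (a) f(x) = a*tanh(b*x)
    return a * np.tanh(b * x)

def relu(x):                               # (b) f(x) = max(0, x)
    return np.maximum(0.0, x)

def prelu(x, a=0.25):                      # (c) f(x) = x if x > 0 else a*x; a is learned
    return np.where(x > 0, x, a * x)

x = np.linspace(-3, 3, 7)
print(tanh_act(x), relu(x), prelu(x))
```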
Invariance and Data Augmentation
• A CNN has no built-in structure guaranteeing geometric invariance beyond small translations.
• Data augmentation
  – Rather than building geometric invariance into the architecture, the training data are artificially multiplied so that the invariance is learned.
  – For example, AlexNet (2012) multiplied the training data by a factor of 2,048; a sketch follows below.
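A minimal sketch of the crop-and-flip augmentation behind that factor (my illustration: roughly 32 × 32 crop positions × 2 flips ≈ 2,048 variants per 256 × 256 image):

```python
import numpy as np

def augment(image, crop=224):
    """One random 224x224 crop plus an optional horizontal flip."""
    rng = np.random.default_rng()
    top = rng.integers(0, image.shape[0] - crop + 1)
    left = rng.integers(0, image.shape[1] - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    return patch[:, ::-1] if rng.random() < 0.5 else patch

print(augment(np.zeros((256, 256))).shape)  # (224, 224)
```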
GoogLeNet
Overall stack (bottom to top): Input → Convolution → Max pooling → Response normalization → Convolution → Convolution → Response normalization → Max pooling → a stack of Inception modules with max pooling in between → Average pooling → Fully connected → Softmax activation. Auxiliary classifier branches (Average pooling → Convolution → Fully connected → Fully connected → Softmax activation) are attached to intermediate Inception modules.
Inception module: the previous layer feeds four parallel branches: a 1x1 convolution; a 1x1 convolution followed by a 3x3 convolution; a 1x1 convolution followed by a 5x5 convolution; and a 3x3 max pooling followed by a 1x1 convolution. The branch outputs are merged by filter concatenation.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842, 2014.
State-of-the-Art Image Classification Performance
Top-5 error [%] on ImageNet:
Post-competition (multi-model results):
• Google (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift): 4.9
• MSRA, PReLU-nets (Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification): 4.94
• Human: 5.1
• Baidu: 5.98
• VGG (arXiv v5): 6.8
In competition (ILSVRC 14):
• GoogLeNet: 6.66
• VGG (Oxford): 7.32
• MSRA, SPP-nets: 8.06
Difficulties in Large-Scale Recognition Systems (incl. Deep NNs)
• Computer resources
  – How much computation can you throw at the problem?
• Structure
  – How many diverse structures can you try?
  – Normalization
    • Batch normalization (a sketch follows after this list): Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.
  – Automated structure estimation
    • Bayesian optimization, Gaussian processes: Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat and Ryan Adams. Scalable Bayesian Optimization Using Deep Neural Networks. ICML, 2015.
• Optimization
  – What is a fast and stable optimization method?
    • Natural gradient: Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu. Natural Neural Networks. arXiv:1507.00210v1, 2015.
• Dataset
  – How large and how high-quality is the data you hold?
  – How can you secure data that no one else has?
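A minimal sketch of the batch-normalization transform referenced above (training-time statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the mini-batch, then apply the learned
    scale/shift (Ioffe & Szegedy, 2015)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).standard_normal((32, 4)) * 10 + 5  # a mini-batch
y = batch_norm(x)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std
```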
Is It Enough to Label Images with Words?
(Image: http://www.totallifecounseling.com/servicestherapy-orlando-florida-clermont-east-southwesttherapy-therapists-counselors/ptsd/)
The label set {dog, person, bite} is the same for "A dog bites a person." and for "A person bites a dog.", so word labels alone cannot distinguish the two scenes.
Sentence Generation Results (captions as generated by the system)
• "A giraffe standing next to a tree in a fence."
• "With a cat laying on top of a laptop computer."
• "A yellow train on the tracks near a train station."
• "A dog laying on the side of a zoo enclosure."
• "A man in the beach with a surfboard."
• "Black and white dog on the grass in a frisbee."
Show and Tell: A Neural Image Caption Generator
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (Google), CVPR 2015.
• The image feature extracted by a CNN is used as the first input of an LSTM, which generates the sentence.
  – Neural networks are used throughout, so the pipeline is comparatively simple: Image → Deep CNN → image feature → RNN language generation → caption.
• Differences from the 2014 version:
  1. A CNN with higher classification performance
  2. Better optimization (details not disclosed)
  3. A smaller beam size in beam search
  4. Backpropagation down into the CNN
  5. Model ensembling (probably of the LSTM only)
Structure of Show and Tell (Long Short-Term Memory)
• caption = {S_0, S_1, S_2, ..., S_N}
• Training
  – One LSTM copy is prepared per word plus one; all copies share their parameters.
  – The output of the LSTM at time t−1 is fed to the LSTM at time t.
  – The first input is the Deep CNN feature; from the second LSTM step on, the words composing the caption are input.
• Caption generation
  – A feature is extracted from the input image with the Deep CNN and fed to the LSTM.
  – A near-optimal caption is generated by beam search over the predicted word probabilities (see the sketch below).
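A minimal beam-search decoder in the spirit of the generation step above (my sketch; `step` is a stand-in for the LSTM's next-word log-probabilities conditioned on the CNN feature, and the vocabulary and tokens are made up):

```python
import numpy as np

def beam_search(step, start, beam_size=3, max_len=10, end_token=0):
    """Keep the beam_size most probable partial sentences; step(seq) returns
    log-probabilities over the next word given the sequence so far."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:          # finished sentences are kept as-is
                candidates.append((score, seq))
                continue
            logp = step(seq)
            for w in np.argsort(logp)[-beam_size:]:
                candidates.append((score + logp[w], seq + [int(w)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

# Dummy word model over a 5-word vocabulary, standing in for the LSTM.
vocab = 5
dummy_step = lambda seq: np.log(np.full(vocab, 1.0 / vocab))
print(beam_search(dummy_step, start=1))
```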
Video Recognition
• THUMOS Challenge
  – June 11, 2015, at CVPR 2015
• Overview
  – Large-scale video recognition competition
  – The targets are videos as found on YouTube, not clips trimmed to individual actions
  – Showcases the state of the art in video recognition
  – http://www.thumos.info/
• Tasks
  – Task 1: Classification
    • 101 classes (e.g. ApplyEyeMakeup, BabyCrawling, BasketBallDunk)
    • Total dataset size: 412 hrs, 370 GB
  – Task 2: Spatio-temporal localization
General Pipeline of Video Recognition
• Hand-crafted stream: local descriptors (iDT+) are extracted along the video, coded into a video feature (e.g. a Fisher vector), and fed to a classifier.
• CNN stream: per-frame CNN features from a pre-trained deep CNN are coded into a video feature (e.g. VLAD) and fed to a classifier.
• The two classifier outputs are combined by late fusion; a sketch follows below.
• In video recognition, deep learning has not yet been the breakthrough.
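A minimal sketch of the late-fusion step (my illustration; the weighting and the dummy scores are assumptions):

```python
import numpy as np

def late_fusion(scores_idt, scores_cnn, w=0.5):
    """Late fusion: a weighted average of the per-class scores of the two
    classifiers (one on iDT+/Fisher-vector features, one on CNN features)."""
    return w * scores_idt + (1.0 - w) * scores_cnn

# Dummy per-class scores from the two streams for one video (101 action classes)
rng = np.random.default_rng(0)
s_idt, s_cnn = rng.random(101), rng.random(101)
print(int(np.argmax(late_fusion(s_idt, s_cnn))))  # predicted action class
```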
Visualizing the Internals
The networks (e.g. AlexNet, 2012 and VGG Net, 2014, whose layer stacks were shown earlier) are so complex that it is hard to tell what processing is going on inside.
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.
Deconvnet
Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.
Image Feature and Recognition
How can a new image be generated from semantic information?
Pipeline: 1) input image → 2) detection → 3) description → 4) local descriptors d_1, ..., d_N in local feature space → 5) PDF estimation p(d; θ) → 6) feature vector x = f(θ) → 7) image feature space, where a linear classifier y = w^T x separates concepts such as "camera" and "cat".
The image feature space is most closely related to the semantic information.
Results
H. Kato, T. Harada. Image Reconstruction from Bag-of-Visual-Words. CVPR, 2014.
[Figure: good results. Original images compared with reconstructions obtained from BoVW by our method, by HOGgles*1 (a method originally designed for HOG), and by image retrieval from 1M images.]
*1 Vondrick et al., ICCV, 2013.
Inceptionism: Going Deeper into Neural Networks
• Posted Wednesday, June 17, 2015, by Alexander Mordvintsev, Software Engineer; Christopher Olah, Software Engineering Intern; and Mike Tyka, Software Engineer
• http://googleresearch.blogspot.co.uk/2015/06/inceptionism-going-deeper-into-neural.html
Framework of the Current Recognition System
Physical world → human filter → large-scale image dataset → learning → recognition system → category (the latter stages belong to the cyber-world).
Example captions in the dataset: "The red train stopped at the station." / "Our baby is growing fast." / "We started the weight training."
Physical-World