Introduction to Information and Systems Engineering: Modeling of Image and Video Recognition
Department of Mechano-Informatics (Mechanical B)
Tatsuya Harada

Difficulties of General Visual Recognition
• Ambiguity of human recognition: weak labeling
  [Figure: weakly labeled images, e.g. "jet, plane, sky", "cat, tiger, forest, tree", "beach, people, water, oahu"]
• Consideration of context
  [Figure: contextually grouped objects, e.g. "plate, pot, kettle, knife" vs. "cup, teapot"]
• Mismatch between data-driven features and meaning: the semantic gap
• Scalability to large amounts of training data
• Adaptation to diverse environments: fast and stable incremental learning

Concept Learning and Image Recognition
[Figure: weakly labeled images ("white bear river", "snow fox white", "bird sky flight", "bear brown grass", "fox grass brown", "black bear river") mapped from image space into a concept feature space; a test image is scored Fox: 0.90, White: 0.83, River: 0.54, Bear: 0.54, Snow: 0.51]

The Data Processing Theorem
• The state of the world $W$, the gathered data $D$, and the processed data $R$ form a Markov chain $W \to D \to R$.
• The mutual information satisfies $I(W;D) \geq I(W;R)$.
• The data processing theorem states that data processing can only destroy information.
David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

Pipeline of Image Recognition
• Training: training data → feature extraction → model (classifier)
• Testing: test data → feature extraction → model (classifier)
• The upstream process is more important than the downstream process: the data and the feature extraction matter more than the classifier.
• Knowledge about the objects is very important.
• It is becoming popular to use a large amount of images collected from the internet, because there is a huge number of object categories and their appearance is diverse.

Large Scale Object Recognition
• ILSVRC (ImageNet Large Scale Visual Recognition Challenge)
  – Image recognition competition using large-scale image data
  – http://www.image-net.org/challenges/LSVRC/2012/index
• Task 1: What's in this image?
  – Training on 1.2 million images
  – Classifying 1000 object classes (e.g. "sports car")
• Task 2: Where's this object?
  – Detecting 1000 object classes in images
• Task 3: What kind of dog is this?
  – Fine-grained classification over 120 dog sub-classes (e.g. Shih-Tzu, Pomeranian, toy poodle)
  – Harder than the classification of Task 1
• Deep CNN!

Results (2012)
Task 1 (flat error, lower is better):
  1) SuperVision (Univ. of Toronto)   0.153
  2) ISI (ours, Univ. of Tokyo)       0.262
  3) OXFORD_VGG (Univ. of Oxford)     0.270
Task 3 (mAP, higher is better):
  1) ISI (ours, Univ. of Tokyo)                        0.323
  2) XRCE/INRIA (Xerox Research Centre Europe/INRIA)   0.310
  3) Uni Jena (Univ. Jena)                             0.246
http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc2012/index.html

[Figure: example test images with top-5 predictions, e.g. "1. brown bear 2. Tibetan mastiff 3. sloth bear 4. American black bear 5. bison", "1. king penguin 2. sea lion 3. drake 4. magpie 5. oystercatcher", and eight further examples]

Fine-grained object recognition results (2012)
[Figure: example images from the 120 dog sub-classes, e.g. English setter, Siberian husky, Welsh springer spaniel, otterhound, Weimaraner, malamute, Great Dane, papillon, and others]
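As a concrete note on the evaluation metric: the flat error in the Task 1 table above is the fraction of test images whose ground-truth label does not appear among the model's top-5 predictions. A minimal sketch follows; the scores and labels are made up for illustration, not from the challenge data.

```python
import numpy as np

def flat_top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the top-k scores.

    scores: (n_samples, n_classes) array of classifier scores.
    labels: (n_samples,) array of ground-truth class indices.
    """
    # Indices of the k highest-scoring classes for each sample.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# Toy example: 4 samples, 10 classes, random scores (illustrative only).
rng = np.random.default_rng(0)
scores = rng.random((4, 10))
labels = np.array([3, 1, 7, 7])
print(flat_top_k_error(scores, labels, k=5))
```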
What's an Image Feature? A Generative Approach
General pipeline:
1) Input image
2) Detection of interest points (e.g. corners, edges)
3) Description: each interest point yields a local descriptor (e.g. an 8-dimensional descriptor) $d_1, d_2, \ldots, d_N$
4) Local descriptors plotted in feature space
5) PDF estimation: fit $p(d; \theta)$ to the descriptors
6) Feature vector: $x = f(\theta)$
An image is represented by a probability distribution over its local descriptors.

Distance between Images
H. Nakayama, T. Harada, Y. Kuniyoshi. CVPR, 2010.
• Each image corresponds to a point $\theta$ on a statistical manifold; the similarity measure is a metric on the tangent space, given by the Fisher information matrix (a Riemannian metric with a connection).
• For the exponential family, if we take the natural or expectation parameters, we obtain a flat space.

Visual Cortex and Neurons
[Figure: the visual cortex and a neuron]
http://www.nature.com/scientificamerican/journal/v16/n3/box/scientificamerican0906-4sp_BX1.html

Mathematical Model of a Neuron
$z = f\left(\sum_{i=1}^{n} a_i x_i + h\right)$, with the logistic sigmoid $f(x) = \frac{1}{1 + \exp(-x)}$
Shun-ichi Amari. Neural Network Models and Connectionism (in Japanese). University of Tokyo Press, 1989.

Multilayer Perceptron
Hidden layer: $y_j = f\left(\sum_{i=1}^{I} a_{ij} x_i + a_{0j}\right)$
Output layer: $z_k = f\left(\sum_{j=1}^{J} b_{jk} y_j + b_{0k}\right)$, with $f(x) = \frac{1}{1 + \exp(-x)}$
• A multilayered stack of logistic regression models
• With a sufficient number of hidden units, it can approximate any continuous function to arbitrary accuracy
• Can be trained efficiently by error backpropagation

Neocognitron
Kunihiko Fukushima. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36, 193-202 (1980).

Convolutional Neural Networks
(a) LeNet-5, 1989: Input → Convolution → Average pooling → Convolution → Average pooling → Fully connected → Fully connected → Softmax activation
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. NIPS, 1989.

Modules of Convolutional Neural Networks
• Convolution layer
• Fully-connected layer
• Pooling layer

Convolutional and Fully-Connected Layers
• In a fully-connected layer, every unit of layer $L$ is connected to every unit of layer $L-1$.
• In a convolutional layer, each unit sees only a local receptive field of the previous layer's feature maps, and the weights $(w_1, w_2, w_3, \ldots)$ are shared across spatial positions.
Convolution: $x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$
where $k_{ij}^l$ is a convolution kernel, $M_j$ is the set of input feature maps, and $f$ is an activation such as the logistic sigmoid $f(x) = (1 + e^{-\beta x})^{-1}$ or the hyperbolic tangent $f(x) = a\tanh(bx)$.

Pooling Layer
Sub-sampling over a local receptive field:
$x_j^l = f\left(\beta_j^l \,\mathrm{down}(x_i^{l-1}) + b_j^l\right)$
where $\mathrm{down}(\cdot)$ is a sub-sampling function such as average pooling or max pooling.

LeNet and AlexNet
(a) LeNet-5, 1989: Input → Convolution → Average pooling → Convolution → Average pooling → Fully connected → Fully connected → Softmax activation
(b) AlexNet, 2012: Input → Convolution → Response normalization → Max pooling → Convolution → Response normalization → Max pooling → Convolution → Convolution → Convolution → Max pooling → Fully connected → Fully connected → Fully connected → Softmax activation
Features of AlexNet:
• 5 convolutional layers + 3 fully-connected layers
• ReLU nonlinearity
• Multiple GPGPUs
• Local response normalization
• Overlapping pooling
• Dropout
• Data augmentation
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. NIPS, 1989.
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
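To make the convolution and pooling equations above concrete, here is a minimal numpy sketch of one convolutional layer, $x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right)$, followed by plain max pooling (the trainable $\beta_j^l$ and $b_j^l$ of the pooling equation are dropped for brevity). The kernel sizes, map counts, and random inputs are arbitrary illustration choices; the inner product is a cross-correlation, as is conventional in CNNs.

```python
import numpy as np

def sigmoid(x, beta=1.0):
    # Logistic sigmoid f(x) = (1 + exp(-beta * x))**-1
    return 1.0 / (1.0 + np.exp(-beta * x))

def conv_layer(x_prev, kernels, biases):
    """x_prev: (C_in, H, W); kernels: (C_out, C_in, kh, kw); biases: (C_out,).
    Valid cross-correlation, summed over all input maps (M_j = all i)."""
    c_out, c_in, kh, kw = kernels.shape
    _, h, w = x_prev.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):
        for i in range(c_in):               # sum over input feature maps
            for u in range(h - kh + 1):
                for v in range(w - kw + 1):
                    patch = x_prev[i, u:u + kh, v:v + kw]
                    out[j, u, v] += np.sum(patch * kernels[j, i])
        out[j] += biases[j]
    return sigmoid(out)

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling on each feature map."""
    c, h, w = x.shape
    x = x[:, :h - h % s, :w - w % s]        # trim edges not covered by a window
    return x.reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

rng = np.random.default_rng(0)
img = rng.random((1, 9, 9))                 # one 9x9 input map
k = rng.standard_normal((4, 1, 3, 3)) * 0.1 # four 3x3 kernels
b = np.zeros(4)
feat = max_pool(conv_layer(img, k, b))
print(feat.shape)                           # (4, 3, 3)
```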
VGG Net
(b) AlexNet, 2012: as above
(c) VGG Net, 2014: Input → Convolution ×2 → Max pooling → Convolution ×2 → Max pooling → Convolution ×4 → Max pooling → Convolution ×4 → Max pooling → Convolution ×4 → Max pooling → Fully connected → Fully connected → Fully connected → Softmax activation
Features of VGG Net:
• 16 convolutional layers + 3 fully-connected layers
• Very small receptive fields
• No local response normalization
• Data parallelism
Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.

Network in Network
• Replaces the linear convolution filter $k_{ij}^l$ applied to each local receptive field with a small multilayer perceptron.
Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. arXiv:1312.4400, 2014.

Spatial Pyramid Pooling Layer
• Pools convolutional features over a spatial pyramid of local receptive fields, yielding a fixed-length representation regardless of input image size.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729, 2014.

Parametric ReLUs
(a) Hyperbolic tangent: $f(x) = a\tanh(bx)$
(b) Rectified Linear Unit (ReLU): $f(x) = \max(0, x)$
(c) Parametric ReLU (PReLU): $f(x) = x$ if $x > 0$, $f(x) = ax$ otherwise, where the slope $a$ is learned.

Invariance and Data Augmentation
• A CNN has no structure that guarantees geometric invariance beyond small translations.
• Data augmentation
  – Instead of building geometric invariance into the structure, artificially inflate the data so that the network learns geometric invariance.
  – For example, AlexNet (2012) inflated its training data by a factor of 2048.

GoogLeNet
• Stack: Input → Convolution → Max pooling → Response normalization → Convolution → Convolution → Response normalization → Max pooling → a stack of Inception modules interleaved with max pooling → Average pooling → Fully connected → Softmax activation; auxiliary classifiers (Average pooling → 1x1 Convolution → Fully connected → Fully connected → Softmax activation) branch from intermediate Inception modules.
• Inception module: the previous layer feeds four parallel branches — a 1x1 convolution; a 1x1 convolution followed by a 3x3 convolution; a 1x1 convolution followed by a 5x5 convolution; and a 3x3 max pooling followed by a 1x1 convolution — whose outputs are joined by filter concatenation.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842, 2014.

Latest Image Classification Performance (top-5 error)
Post-competition, multi-model results:
  Google (Batch Normalization)   4.9 %
  MSRA (PReLU-nets)              4.94 %
  Human                          5.1 %
  Baidu                          5.98 %
  VGG (arXiv v5)                 6.8 %
In competition (ILSVRC 2014):
  GoogLeNet                      6.66 %
  VGG (Oxford)                   7.32 %
  MSRA (SPP-nets)                8.06 %
References: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"; "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification".

Difficulties in Large-Scale Recognition Systems (incl. Deep NNs)
• Computer resources
  – How much computation can you throw at the problem?
• Structure
  – How many diverse structures can you try?
  – Normalization
    • Batch normalization (see the sketch after this list)
    • Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167, 2015.
  – Automating structure estimation
    • Bayesian optimization, Gaussian processes
    • Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian Optimization Using Deep Neural Networks. ICML, 2015.
• Optimization
  – What is a fast and stable optimization method?
  – Natural gradient
    • Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu. Natural Neural Networks. arXiv:1507.00210, 2015.
• Dataset
  – How much large, high-quality data do you have?
  – How can you lock in data that nobody else has?
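As a small illustration of the batch normalization mentioned above: each activation is normalized by its mini-batch mean and variance, then rescaled by learned parameters $\gamma$ and $\beta$. This is a minimal training-time sketch; the toy mini-batch is made up, and the running statistics used at test time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch.

    x: (batch, features) pre-activations.
    gamma, beta: (features,) learned scale and shift.
    """
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3)) * 5 + 2      # toy mini-batch with skewed stats
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))         # approx. 0 and 1 per feature
```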
Is It Enough to Label Images with Words?
[Figure: two images, both labeled "dog, person, bite" — yet one shows "A dog bites a person." and the other "A person bites a dog."]
Image source: http://www.totallifecounseling.com/servicestherapy-orlando-florida-clermont-east-southwesttherapy-therapists-counselors/ptsd/

Sentence Generation Results
[Figure: images with generated captions, e.g. "A giraffe standing next to a tree in a fence.", "With a cat laying on top of a laptop computer.", "A yellow train on the tracks near a train station.", "A dog laying on the side of a zoo enclosure.", "A man in the beach with a surfboard.", "Black and white dog on the grass in a frisbee."]

Show and Tell: A Neural Image Caption Generator
• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (Google), CVPR 2015.
• The image feature extracted by a CNN is used as the first input to an LSTM, which generates the sentence.
  – Neural networks are used end to end, so the pipeline is comparatively simple.
• Differences from the 2014 version:
  1. A CNN with higher classification performance
  2. Better optimization (details unknown)
  3. A smaller beam-search size
  4. Backpropagation all the way into the CNN
  5. Model ensembling (probably of the LSTM only)
• Pipeline: Image → Deep CNN → image feature → RNN language generation → caption

Structure of Show and Tell
• Long Short-Term Memory (LSTM)
• caption = $\{S_0, S_1, S_2, \ldots, S_N\}$
• Training:
  – Unroll the LSTM for (number of words + 1) steps; all steps share the same parameters.
  – The output of the LSTM at time $t-1$ is given to the LSTM at time $t$.
  – The first input is the Deep CNN feature; from the second step onward, the words composing the caption are fed in.
• Caption generation:
  – A feature is extracted from the input image with the Deep CNN and fed into the LSTM.
  – A near-optimal caption is generated by beam search over the predicted word probabilities.

Video Recognition
• THUMOS Challenge (June 11, 2015, at CVPR 2015)
• Overview
  – A large-scale video recognition competition
  – Targets videos like those on YouTube, not clips finely trimmed around each action
  – Showcases the state of the art in video recognition
  – http://www.thumos.info/
• Tasks
  – Task 1: Classification (101 classes, e.g. ApplyEyeMakeup, BabyCrawling, BasketBallDunk; total dataset size 412 hours, 370 GB)
  – Task 2: Spatio-temporal localization

General Pipeline of Video Recognition
• Hand-crafted stream: iDT+ local descriptors → coding (e.g. Fisher Vector) → video feature → classifier
• CNN stream: frame-wise features from a pre-trained deep CNN → coding (e.g. VLAD) → video feature → classifier
• The two classifiers are combined by late fusion.
• For video, deep learning has not yet produced the breakthrough it has for still images.

Visualizing the Internals
• The networks are so complex that it is hard to tell what processing is going on inside.
[Figure: the layer stacks of AlexNet (2012) and VGG Net (2014), as shown earlier]
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.

Deconvnet
• Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

Image Feature and Recognition
• How can we generate a new image from the semantic information?
• Pipeline: 1) input image → 2) detection → 3) description ($d_1, d_2, \ldots, d_N$) → 4) local descriptors in local feature space → 5) PDF estimation $p(d; \theta)$ → 6) feature vector $x = f(\theta)$ → 7) image feature space, on which a classifier such as $y = w^T x$ operates (e.g. camera vs. cat).
• The image feature space is most closely related to the semantic information.

Results
H. Kato, T. Harada. Image Reconstruction from Bag-of-Visual-Words. CVPR, 2014.
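To connect steps 1)–6) to the bag-of-visual-words (BoVW) features used in the reconstruction work above: a BoVW feature quantizes each local descriptor to its nearest codeword and counts the assignments. Below is a minimal sketch; the descriptors and the codebook are random stand-ins (a real codebook would come from k-means over training descriptors).

```python
import numpy as np

def bovw_feature(descriptors, codebook):
    """Bag-of-visual-words histogram.

    descriptors: (N, d) local descriptors from one image.
    codebook: (K, d) visual words (e.g. k-means centroids).
    Returns an L1-normalized K-bin histogram.
    """
    # Squared distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)       # nearest codeword per descriptor
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
descs = rng.random((200, 8))   # e.g. 200 8-dimensional local descriptors
words = rng.random((16, 8))    # a 16-word codebook (stand-in)
x = bovw_feature(descs, words)
print(x.shape, x.sum())        # (16,) 1.0
```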
Good Results
[Figure: original images alongside images reconstructed from BoVW by our method and by HOGgles*1, which was originally designed for HOG features; also shown, image retrieval from 1M images]
*1 Vondrick et al., ICCV, 2013.

Inceptionism: Going Deeper into Neural Networks
• Posted: Wednesday, June 17, 2015
• Posted by Alexander Mordvintsev, Software Engineer; Christopher Olah, Software Engineering Intern; and Mike Tyka, Software Engineer
• http://googleresearch.blogspot.co.uk/2015/06/inceptionism-going-deeper-into-neural.html

Framework of the Current Recognition System
[Figure: in the cyber-world, a recognition system learns categories from a large-scale image dataset; captions such as "The red train stopped at the station.", "Our baby is growing fast.", "We started the weight training." accompany the images; the dataset is gathered from the physical world through a human filter.]
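Returning to the Inceptionism post above: the underlying idea is to run the network "in reverse" — fix the weights and apply gradient ascent to the input image so as to increase a chosen class score. A toy sketch follows, with a linear scoring model standing in for the real CNN; the regularizer weight, step size, and dimensions are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                                   # flattened "image" size (toy)
w = rng.standard_normal(D)               # stand-in for d(score)/d(image)

def class_score(x, lam=0.1):
    # Toy differentiable score: linear class model minus an L2 image prior.
    return w @ x - lam * (x @ x)

x = rng.standard_normal(D) * 0.01        # start from a near-noise image
lr, lam = 0.1, 0.1
for step in range(100):
    grad = w - 2 * lam * x               # analytic gradient of class_score
    x += lr * grad                       # gradient ascent on the INPUT image
print(class_score(x))                    # climbs toward the maximum (w.w)/(4*lam)
```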