
Deep Learning in Udacity
Assignment 1
2016. 3. 9
A.I. Lab.
전명중
notMNIST
• notMNIST_large: for training data (about 500,000 images)
• notMNIST_small: for test data (about 19,000 images)
Fig 1. Examples of the letter “A”
• There are 10 classes, labeled A through J, rendered in different fonts.
• URL: http://yaroslavvb.com/upload/notMNIST/
Download to the Local Machine
If the file is not present locally, download it from the URL.
Verify that the downloaded file is the expected one by checking its size.
• Download the dataset to the local machine (sketched below).
• The data consists of characters rendered in a variety of fonts as 28 x 28 images.
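A minimal sketch of this step, assuming urllib's urlretrieve; the helper name maybe_download follows the course starter code, and the expected byte counts would be supplied per file:

import os
from urllib.request import urlretrieve

url = 'http://yaroslavvb.com/upload/notMNIST/'

def maybe_download(filename, expected_bytes):
    # Download only if the file is not already present locally.
    if not os.path.exists(filename):
        urlretrieve(url + filename, filename)
    # Verify by comparing the file size with the expected size.
    statinfo = os.stat(filename)
    if statinfo.st_size != expected_bytes:
        raise Exception('Failed to verify ' + filename)
    return filename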
Uncompress the .tar.gz; the extracted folders are labelled A through J (see the tarfile sketch below).
'notMNIST_large.tar.gz' → 'notMNIST_large.tar' → 'notMNIST_large'
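The decompression can be done with the standard tarfile module; a sketch, assuming the archive sits in the working directory:

import tarfile

# 'notMNIST_large.tar.gz' -> extracts the folder 'notMNIST_large'
with tarfile.open('notMNIST_large.tar.gz') as tar:
    tar.extractall()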
• os.path.splitext: splits a path into (root, extension)
• os.path.isdir: checks whether a path is a directory
• os.path.join: joins path components using the OS-appropriate separator
• os.listdir(root): returns a list of the files and directories under root
Code like the following:

data_folders = []
for d in sorted(os.listdir('notMNIST_large')):
    if os.path.isdir(os.path.join('notMNIST_large', d)):
        data_folders.append(os.path.join('notMNIST_large', d))

result:
[
  notMNIST_large/A,
  notMNIST_large/B,
  notMNIST_large/C,
  ...
  notMNIST_large/J
]
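Equivalently, the same folder list can be built in one expression (a sketch; root stands in for 'notMNIST_large'):

import os

root = 'notMNIST_large'
# The per-letter directories A..J, in sorted order.
data_folders = [os.path.join(root, d)
                for d in sorted(os.listdir(root))
                if os.path.isdir(os.path.join(root, d))]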
Changing the data for manageable format
• Normalize the images in each folder (feature scaling) and arrange them into a 3D array.
• e.g., for the notMNIST_large/A folder: 52,909 images of 28 x 28 pixels become one (52909, 28, 28) array.
• When the array is complete, dump it to disk:
[
  notMNIST_large/A.pickle,
  ...
]
• train_folders:
[
  notMNIST_large/A,
  notMNIST_large/B,
  notMNIST_large/C,
  ...
  notMNIST_large/J
]
• test_folders:
[
  notMNIST_small/A,
  notMNIST_small/B,
  notMNIST_small/C,
  ...
  notMNIST_small/J
]
(Figure: 28 x 28 images from notMNIST_large/A, normalized and stacked into a 3D array; case of A images.)
Changing the data for manageable format (maybe_pickle)
• maybe_pickle function
• Step 1: append .pickle to each folder name and add it to the dataset_names list.
  e.g.) notMNIST_large/A.pickle
• Create the output file and, when the 3D-array result is returned, save it.
(folder = 'notMNIST_large/A', min_num_images_per_class = 45000)
Go to the load_letter function!
Input: [ 'notMNIST_large/A', 'notMNIST_large/B', 'notMNIST_large/C', ...
Changing the data for manageable format (load_letter)
folder = 'notMNIST_large/A', min_num... = 45000
Step 1: list the image files in the folder and allocate an empty 3D array for the 52,909 images of 28 x 28 pixels.
image_files = ['image1.png', 'image2.png', 'image3.png', ... ]
3D-Array (52909, 28, 28)
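Step 1 as code, a sketch assuming NumPy and the folder layout above:

import os
import numpy as np

image_size = 28
folder = 'notMNIST_large/A'

# One file name per image; 52,909 usable images for 'A'.
image_files = os.listdir(folder)
# Pre-allocate one 28 x 28 slice per image.
dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                     dtype=np.float32)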
Changing the data for manageable format (load_letter)
Step 2: read 'notMNIST_large/A/image1.png' and normalize the raw pixel values (feature scaling).
Raw 28 x 28 pixel values, in the range 0-255, e.g.:
[[ 0.  0.  0. ...  82. 116.  61. ...  0.]
 [ 0.  0.  9. ...  12.  36.  15. ...  0.]
 ... ]
After normalization the values lie in [-0.5, 0.5], e.g.:
[[-0.5 -0.5 -0.5 ... -0.46470588 -0.48039216 ... -0.5]
 [-0.5 -0.5 -0.5 ... -0.04509804 -0.26078431 -0.15098039 ... -0.5]
 ... ]
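The feature scaling itself is one line; a sketch assuming pixel_depth = 255, which reproduces the slide's values exactly:

import numpy as np

pixel_depth = 255.0  # raw grayscale values lie in [0, 255]

def normalize(raw):
    # Zero-center and scale: 0 -> -0.5, 255 -> 0.5
    return (raw.astype(np.float32) - pixel_depth / 2) / pixel_depth

# normalize(np.array([0., 82., 116.])) -> [-0.5, -0.17843137, -0.04509804]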
Changing the data for manageable format (load_letter)
Step 3: insert the image data into the prepared dataset matrix, one slice per image.
dataset = 3D-Array (52909, 28, 28)
image_data = the normalized 28 x 28 array from Step 2
For idx = 0, 1, 2, ...: dataset[idx, :, :] = image_data
e.g., the first image (img1.png) fills the first slice: dataset[0, :, :]
...until the end of the for loop!
Changing the data for manageable format (load_letter)
folder = 'notMNIST_large/A', min_num... = 45000
Step 4: 52,909 images were loaded; if fewer images than the required number were loaded, raise an exception.
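Putting steps 1-4 together, a sketch of load_letter; reading PNGs with imageio here is an assumption (the 2016 course code used scipy.ndimage.imread, since removed from SciPy):

import os
import numpy as np
import imageio.v2 as imageio

image_size = 28
pixel_depth = 255.0

def load_letter(folder, min_num_images):
    # Step 1: list files and pre-allocate the 3D array.
    image_files = os.listdir(folder)
    dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                         dtype=np.float32)
    num_images = 0
    for image_file in image_files:
        try:
            # Steps 2-3: read, normalize to [-0.5, 0.5], insert one slice.
            image_data = (imageio.imread(os.path.join(folder, image_file))
                          .astype(float) - pixel_depth / 2) / pixel_depth
            if image_data.shape != (image_size, image_size):
                raise IOError('unexpected shape: %s' % str(image_data.shape))
            dataset[num_images, :, :] = image_data
            num_images += 1
        except (IOError, ValueError):
            print('Could not read:', image_file, '- skipping.')
    dataset = dataset[0:num_images, :, :]
    # Step 4: raise if fewer images than required were loaded.
    if num_images < min_num_images:
        raise Exception('Fewer images than expected: %d < %d'
                        % (num_images, min_num_images))
    return dataset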
Changing the data for manageable format (maybe_pickle)
• maybe_pickle function
• Step 2: when the 3D-array result (52914, 28, 28) comes back from load_letter, save it. Come back!
# The pickle module works on any object (class, function, method, ...) and stores it in serialized form, so it is widely used not only for file handling but also for network communication.
# HIGHEST_PROTOCOL: the newest pickle protocol (a binary format; older protocols include a readable text format).
* 'wb': opened for writing in binary mode.
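A sketch of maybe_pickle under the same assumptions, saving each returned 3D array with the options described above:

import os
import pickle

def maybe_pickle(data_folders, min_num_images_per_class, force=False):
    dataset_names = []
    for folder in data_folders:
        set_filename = folder + '.pickle'        # e.g. notMNIST_large/A.pickle
        dataset_names.append(set_filename)
        if force or not os.path.exists(set_filename):
            dataset = load_letter(folder, min_num_images_per_class)
            with open(set_filename, 'wb') as f:  # 'wb': write in binary mode
                pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
    return dataset_names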
Making the Training, Validate, Test Sets
< For Validation, Training Set >
From notMNIST_large/A.pickle (28 x 28 'A' images):
• Training set: 200,000 images (10 classes x 20,000)
• Validation set: 10,000 images (10 classes x 1,000)
< For Test Set >
From notMNIST_small/A.pickle (28 x 28 'A' images):
• Test set: 10,000 images (10 classes x 1,000)
• The remaining pickle files (B, C, ..., J) are split the same way into the sets.
• The y values (labels) are generated along with the data:
• A = label(0), B = label(1), C = label(2), ..., J = label(9)
Making the Training, Validate, Test Sets
From the train_datasets list (A.pickle, B.pickle, ...), build arrays of train_size and valid_size:
• train_dataset / train_labels: 200,000 (10 classes x 20,000)
• valid_dataset / valid_labels: 10,000 (10 classes x 1,000)
• test_dataset / test_labels: 10,000 (10 classes x 1,000)
Each labels array holds values from 0 up to 9.
< printed output >
Making the Training, Validate, Test Sets
pickle_files = ['notMNIST_large/A.pickle', 'notMNIST_large/B.pickle', ... ]
train_size = 200,000
valid_size = 10,000
num_classes = 10
valid_dataset = 3D-array (10000, 28, 28)
valid_labels = (10000,)
train_dataset = 3D-array (200000, 28, 28)
train_labels = (200000,)
# how many examples to take from each image class
vsize_per_class = 10000 // 10 = 1000
tsize_per_class = 200000 // 10 = 20000
(Figure: after class A is processed, the first 1,000 entries of valid_labels and the first 20,000 entries of train_labels hold 0.)
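The empty arrays above can be allocated by a small helper; the name make_arrays follows the course starter code:

import numpy as np

def make_arrays(nb_rows, img_size):
    # nb_rows images of img_size x img_size, plus one int label per image.
    if nb_rows:
        dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
        labels = np.ndarray(nb_rows, dtype=np.int32)
    else:
        dataset, labels = None, None
    return dataset, labels

valid_dataset, valid_labels = make_arrays(10000, 28)   # (10000, 28, 28), (10000,)
train_dataset, train_labels = make_arrays(200000, 28)  # (200000, 28, 28), (200000,)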
Making the Training, Validate, Test Sets
valid_dataset = 3D-array (10000, 28, 28), valid_labels = (10000,)
train_dataset = 3D-array (200000, 28, 28), train_labels = (200000,)
vsize_per_class = 1000, tsize_per_class = 20000
Initial cursors: end_v = 1000, end_t = 20000, end_l = 21000
label = 0, pickle_file = 'notMNIST_large/A.pickle'
# enumerate attaches an auto-incrementing index, starting at 0, to each item.
# with ... as makes the interpreter close the file automatically.
letter_set = (52909, 28, 28)
• Extract only the first 1,000 images from letter_set.
• Insert the extracted data into the prepared valid_dataset.
• Set valid_labels[0:1000] to label(0).
• Advance the validation cursors: start_v = 1000, end_v = 2000.
* 'rb': opened for reading in binary mode.
(Figure: 1,000 'A' images from A.pickle fill the first 1,000 slots of the validation set.)
Making the Training, Validate, Test Sets
valid_dataset = 3D-array (10000, 28, 28), valid_labels = (10000,)
train_dataset = 3D-array (200000, 28, 28), train_labels = (200000,)
vsize_per_class = 1000, tsize_per_class = 20000
label = 0, pickle_file = 'notMNIST_large/A.pickle'
start_t = 0, end_t = 20000, end_l = 21,000
letter_set = (52909, 28, 28)
Since items 0-999 of letter_set were already used for valid_dataset, the 20,000 items at indices 1,000-20,999 go into train_dataset.
Advance the training cursors: start_t = 20000, end_t = 40000.
(Figure: 20,000 'A' images from A.pickle fill the first 20,000 slots of train_dataset; train_labels[0:20000] set to 0.)
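The slides above walk through one pass of the per-class split loop; a consolidated sketch (merge_datasets follows the starter code, and make_arrays is the helper sketched earlier):

import pickle

def merge_datasets(pickle_files, train_size, valid_size=0):
    num_classes = len(pickle_files)                 # 10
    valid_dataset, valid_labels = make_arrays(valid_size, 28)
    train_dataset, train_labels = make_arrays(train_size, 28)
    vsize_per_class = valid_size // num_classes     # 1000
    tsize_per_class = train_size // num_classes     # 20000

    start_v, start_t = 0, 0
    end_v, end_t = vsize_per_class, tsize_per_class
    end_l = vsize_per_class + tsize_per_class       # 21000
    for label, pickle_file in enumerate(pickle_files):   # label: 0 (A) .. 9 (J)
        with open(pickle_file, 'rb') as f:          # 'rb': read in binary mode
            letter_set = pickle.load(f)             # e.g. (52909, 28, 28)
            if valid_dataset is not None:
                # Items 0..999 -> validation set, labelled with this class.
                valid_dataset[start_v:end_v, :, :] = letter_set[:vsize_per_class, :, :]
                valid_labels[start_v:end_v] = label
                start_v += vsize_per_class
                end_v += vsize_per_class
            # Items 1,000..20,999 -> training set.
            train_dataset[start_t:end_t, :, :] = letter_set[vsize_per_class:end_l, :, :]
            train_labels[start_t:end_t] = label
            start_t += tsize_per_class
            end_t += tsize_per_class
    return valid_dataset, valid_labels, train_dataset, train_labels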
Data shuffle
• Shuffle the data so that the labels are well mixed for the training and test sets.
• (1000,) → 1000: the first element of the shape gives the number of labels.
e.g.
>>  # shape = (10,)
>>  # the result represents the indices of the variable a
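A minimal sketch of the shuffle, assuming the usual np.random.permutation pattern (the helper name randomize is taken from the starter code):

import numpy as np

def randomize(dataset, labels):
    # A random permutation of the indices 0..N-1; indexing both arrays with
    # it keeps each image paired with its label.
    permutation = np.random.permutation(labels.shape[0])  # (1000,) -> 1000
    return dataset[permutation, :, :], labels[permutation]

train_dataset, train_labels = randomize(train_dataset, train_labels)
test_dataset, test_labels = randomize(test_dataset, test_labels)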
Save Data
• Save everything as just one file!
• Stored in dictionary form.
# about 690.8 MB
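A hedged sketch of the final dump; the file name notMNIST.pickle and the dictionary keys are taken from the course starter code:

import pickle

save = {
    'train_dataset': train_dataset,
    'train_labels': train_labels,
    'valid_dataset': valid_dataset,
    'valid_labels': valid_labels,
    'test_dataset': test_dataset,
    'test_labels': test_labels,
}
with open('notMNIST.pickle', 'wb') as f:       # about 690.8 MB on disk
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)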