Intel Itanium 2 プロセッサのプログラミング手法

Intel
Developer
Forum
インテル® Itanium® 2プロセッサ
のプログラミング手法
池井満
HPC シニアアーキテクト
インテル㈱
Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Copyright © 2003 Intel Corporation.
1
Intel
Labs
Intel
Developer
目次
•
Forum
ストリーム・ベンチのコンパイル(Fortran)
– 最適化レポートの出力方法
– Itanium® 2 プロセッサの演算リソース
– ソフトウェア・パイプライニング
•
行列乗算のコンパイル（C）
– 最適化オプション
– 最適化ディレクティブ
•
ＳＰＥＣintのコンパイル
– プロシジャ間最適化（ＩＰＯ）
– プロファイル・ガイド最適化（ＰＧＯ）
Intel
Copyright © 2003 Intel Corporation.
2
Labs
1
Intel
Developer
stream_d.fの実行ループ
171
172
173
...
176
177
178
…
186
187
188
...
196
197
198
...
206
207
208
...
212
Forum
DO 70 k = 1,ntimes
t = mysecond()
30
DO 30 j = 1,n
c(j) = a(j)
CONTINUE
40
DO 40 j = 1,n
b(j) = scalar*c(j)
CONTINUE
50
DO 50 j = 1,n
c(j) = a(j) + b(j)
CONTINUE
60
DO 60 j = 1,n
a(j) = b(j) + scalar*c(j)
CONTINUE
70 CONTINUE
Intel
Copyright © 2003 Intel Corporation.
3
Labs
Intel
Developer
単純コンパイルと実行結果
Forum
Intel
Copyright © 2003 Intel Corporation.
4
Labs
2
Intel
コンパイルとリンク・オプションの確認
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
5
Labs
Intel
最適化(-O2)の内容確認
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
6
Labs
3
Intel
Developer
最適化のレポート・オプション
Forum
• 関数名の指定
–opt_report_routine<name>
• レポート・ファイル名の指定
–opt_report_file<name>
• 最適化フェーズの指定
-opt_report_phase<name>
• 最適化レポートのレベル
-opt_report_level[min|med|max]
Intel
Copyright © 2003 Intel Corporation.
7
Labs
Intel
Developer
最適化のレポート
Forum
-opt_report_phase<phase>
最適化論理名
最適化の内容
関連する最適化オプション等
ipo
Interprocedural Optimizer
-ipo, -ip
hlo
High-level Language
Optimizer
-O3 ( Loop unrolling )
ilo
Intermediate Language
Scalar Optimizer
ecg
Itanium Compiler Code
Generator
( Software Pipelining )
pgo
Profile Guided Optimizer
-prof_gen, -prof_use
all
All optimizers
Intel
Copyright © 2003 Intel Corporation.
8
Labs
4
Intel
Developer
stream_d.f メインプログラム
Forum
Intel
Copyright © 2003 Intel Corporation.
9
Intel
命令バンドル
127
命令 2
87 86
命令 1
Labs
Developer
Forum
46 45
命令 0
5 4
0
テンプレート
バンドルフォーマット
• バンドル
– 3つの命令スロット (各41-ビット)とテンプレートフィールド(5ビット)
– 命令グループは幾つかのバンドルにまたがることができる
– 例:
命令グループ A
命令グループB
バンドル 1
命令グループ C
バンドル 2
バンドル 3
バンドル 4
柔軟性のある発行と並列実行
柔軟性のある発行と並列実行
Intel
Copyright © 2003 Intel Corporation.
10
Labs
5
Intel
Developer
命令グループ
Forum
• 同時に実行可能な1つ以上の命令の固まり（命令数の制限
無し）
– お互いに依存性のない隣接する命令は並列して実行することがで
きる
• ブレーク・コード（;;）をアセンブリ・コード中に記述することに
よって、命令グループの境界を指示する、もしくは実行時に
分岐によってターミネートする
• 例：
Instr Group A
add
r31 = r5, r6 ;;
add
mov
mov
add
add
mov
mov
mov
mov
mov
mov
mov
mov
add
add
mov
mov
r31 = r5, r6 ;;
r4
r4==r31
r31
r2
r2==r8,
r8,r9
r9;;;;
r7
r7==r2
r2
r15
r15==r27
r27
r16
r16==r28
r28
r17
r17==r29
r29
r3
r3==r11,
r11,r20
r20;;;;
r10
=
r3
r10 = r3
Instr Group B
Instr Group C
Instr Group D
Intel
Copyright © 2003 Intel Corporation.
11
Labs
Intel
Developer
stream_d.f メインプログラム
Forum
Intel
Copyright © 2003 Intel Corporation.
12
Labs
6
Intel
Developer
ソフトウエア・パイプライン
z 考察
擬似コード:
4サイクル
loop:
C コード:
for (i = 0; i < n; i++)
y[i] = a * x[i];
Forum
load xi
fmul yi = a, xi
store yi
branch loop
2サイクル
z 仮定
¾ 命令のレーテンシー:
load
fmul
store
branch
4 サイクル*
2 サイクル*
1 サイクル*
1 サイクル*
* ここでのサイクルカウントは実際の
動作とは異なります。
¾ Load, fmul, store そして branch は同じ命令グループで発行可
能
Intel
Copyright © 2003 Intel Corporation.
13
Intel
Developer
ソフトウエア・パイプライン
Cycle 1:
Cycle 2:
Cycle 3:
Cycle 4:
Cycle 5:
Cycle 6:
Cycle 7:
Cycle 8:
Cycle 9:
Cycle 10:
Cycle 11:
Cycle 12:
Cycle 13:
Cycle 14:
load x1
load x2
load x3
load x4
load x5
load x6
load x7
load x8
Labs
Forum
For n = 8
fmul y1=a,x1
fmul y2=a,x2
fmul y3=a,x3
fmul y4=a,x4
fmul y5=a,x5
fmul y6=a,x6
fmul y7=a,x7
fmul y8=a,x8
プロロー
グ
store y1
store y2
store y3
store y4
store y5
store y6
store y7
store y8
カーネル
エピロー
グ
* ここでのサイクルカウントは実際の動作とは異なります
この例では1回あたりのループに7サイクル必要
この例では1回あたりのループに7サイクル必要
Intel
Copyright © 2003 Intel Corporation.
14
Labs
7
Intel
Developer
Forum
stream_d.fの207行ループ
nop.i
0 ;;
// Block 42: lentry lexit ltail collapsed pipelined
42 43
-S
// Freq 9.0e+06
}
.b1_42:
{
.mfi
(p16) ldfd
f32=[r18],8
(p27) fma.d
f56=f6,f43,f55
nop.i
0
}
{
.mmb
(p16) ldfd
f44=[r17],8
(p31) stfd
[r16]=f60,8
// Branch taken probability 0.99
br.ctop.sptk
.b1_42 ;;
// Block 43: epilog Pred: 42
Succ: 44 -O
// Freq 9.0e+00
}
Pred: 41 42
Succ:
//0:207 395
//11:207 397
//0:207 396
//15:207 398
//0:206
402
Intel
Copyright © 2003 Intel Corporation.
15
Labs
Intel
Developer
stream_d.fの207行ループ
Forum
===========================================================================
SWP REPORT LOG OPENED ON Sun Nov 30 16:39:51 2003
===========================================================================
...
---------------------------------------------------------------------------------------------------------------------------------------------------Swp report for loop at line 207 in MAIN__ in file stream_d.f
Resource II
Recurrence II
Minimum II
Scheduled II
=
=
=
=
1
0
1
1
Percent of Resource II needed by arithmetic ops
= 100%
Percent of Resource II needed by memory ops
= 100%
Percent of Resource II needed by floating point ops = 100%
Number of stages in the software pipeline =
16
---------------------------------------------------------------------------
Intel
Copyright © 2003 Intel Corporation.
16
Labs
8
Intel
Developer
Forum
Initiation Interval (II)
• ループの繰り返しの開始間隔サイクル
– ループ内の処理を実行するために必要な最小サイクル数
• リソースＩＩ
– プロセッサの演算リソースの制限によるＩＩ
• 再帰（リカーランス）ＩＩ
– ループ内のデータ依存性により必要なＩＩ
• 最小ＩＩ
– MAX（リソースＩＩ，再帰ＩＩ）：必要最小ＩＩ
• スケジュールＩＩ
– 実際にコンパイラが適用したＩＩ
Intel
Copyright © 2003 Intel Corporation.
17
Labs
Intel
Intel® Itanium® 2プロセッサのブロック・ダイアグラム
分岐予測
命令キュー
8 バンドル
M M M M I
分岐とプレディケート
レジスタ
I
F F
128 整数レジスタ
整数
と
MM ユニット
分岐
ユニット
ECC
ECC
IAIA-32
デコード
と
制御
レジスタ・スタック・エンジン / マッピング
スコアボード、プレディケート
NaTs,, 例外
,NaTs
L3 キャッシュ
キャッシュ– 4 Port
L2 キャッシュ
11 Issue Ports B B B
ECC
Forum
L1 命令キャッシュと
ITLB
フェッチ/
フェッチ/プリフェッチ・エンジン
ECC
ECC
バス
ECC
4ポート
L1
データ
キャッシュ
と
DTLB
128 FP レジスタ
ALAT
ECC
Developer
浮動小数点
ユニット
コントローラ
Intel
Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Copyright © 2003 Intel Corporation.
18
Labs
9
Intel
Developer
-O3コンパイルと実行結果
Forum
Intel
Copyright © 2003 Intel Corporation.
19
Labs
Intel
Developer
プリフェッチとロードペア
Forum
===========================================================================
High Level Optimizer Report for: MAIN__
...
Estimate of max_trip_count of loop at line 206=125001
Total #of lines prefetched in MAIN__ for loop in line 206=3
# of spatial prefetches in MAIN__ for loop in line 206=3, dist=100
#
...
Load-pair formed at line 197 , 197
[Method = Aligned]
Load-pair formed at line 197 , 197
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 207 , 207
[Method = Aligned]
Load-pair formed at line 215 , 215
[Method = Aligned]
Intel
Copyright © 2003 Intel Corporation.
20
Labs
10
Intel
Developer
プリフェッチとロードペア
Forum
Block, Unroll, Jam Report:
(loop line numbers, unroll factors and type of transformation)
Loop at line 151 unrolled without remainder by 4
Loop at line 158 unrolled without remainder by 8
Loop at line 176 unrolled without remainder by 4
Loop at line 186 unrolled without remainder by 4
Loop at line 196 unrolled without remainder by 4
Loop at line 206 unrolled without remainder by 8
Loop at line 216 completely unrolled by 4
...
--------------------------------------------------------------------------Swp report for loop at line 207 in MAIN__ in file stream_d.f
Resource II
Recurrence II
Minimum II
Scheduled II
=
=
=
=
6
0
6
7
Percent of Resource II needed by arithmetic ops
=
Percent of Resource II needed by memory ops
=
Percent of Resource II needed by floating point ops =
83%
83%
67%
Number of stages in the software pipeline =
3
---------------------------------------------------------------------------
Intel
Copyright © 2003 Intel Corporation.
21
Labs
Intel
Developer
ループ・アンローリング
...
206
207
208
...
...
206
207
207
207
207
207
207
207
207
207
207
207
207
208
...
Copyright © 2003 Intel Corporation.
60
DO 60 j = 1,n
a(j) = b(j) + scalar*c(j)
CONTINUE
60
DO jj = 1,n/8
j=(jj-1)*8
a(j) = b(j) + scalar*c(j)
a(j+1) = b(j+1) + scalar*c(j+1)
a(j+2) = b(j+2) + scalar*c(j+2)
a(j+3) = b(j+3) + scalar*c(j+3)
a(j+4) = b(j+4) + scalar*c(j+4)
a(j+5) = b(j+5) + scalar*c(j+5)
a(j+6) = b(j+6) + scalar*c(j+6)
a(j+7) = b(j+7) + scalar*c(j+7)
END DO
DO 60 j = (n/8)*8, n+mod(n,8)
a(j) = b(j) + scalar*c(j)
CONTINUE
Forum
Intel
22
Labs
11
Intel
Developer
プリフェッチとロードペア
// Block 44: lentry lexit ltail collapsed pipelined
45
-S
// Freq 2.7e+07
}
.b1_44:
{
.mfi
(p16) ldfpd
f41,f32=[r39]
(p18) fma.d
f55=f6,f72,f45
nop.i
0
}
{
.mfi
(p16) ldfpd
f53,f50=[r38]
(p18) fma.d
f56=f6,f69,f36
nop.i
0 ;;
}
{
.mmi
(p16) ldfpd
f45,f36=[r42]
(p16) ldfpd
f60,f57=[r44]
nop.i
0
}
{
.mmi
(p18) stfd
[r29]=f39,64
(p18) stfd
[r28]=f48,64
nop.i
0 ;;
}
{
.mmi
(p16) ldfpd
f48,f39=[r46]
(p16) ldfpd
f65,f62=[r37]
nop.i
0
}
{
.mmi
(p18) stfd
[r27]=f34,64
(p18) stfd
[r26]=f43,64
nop.i
0 ;;
}
{
.mmi
(p16) ldfpd
f70,f67=[r41]
(p16) ldfpd
f43,f34=[r36]
(p16) add
r35=64,r36
}
{
.mmi
(p18) stfd
[r25]=f47,64
(p18) stfd
[r24]=f38,64
(p16) add
r40=64,r41 ;;
}
Pred: 44 43
Succ: 44
{
}
{
//0:207 574
//14:207 594
}
{
//0:207 575
//14:207 596
}
{
//1:207
//1:207
580
581
}
{
//15:207
//15:207
577
579
Forum
.mmf
(p16) add
r36=64,r37
(p16) lfetch.nt1
[r34]
(p17) fma.d
f47=f6,f51,f33
//4:206 617
//4:206 598
//11:207 578
.mmf
(p18) stfd
(p18) stfd
(p17) fma.d
[r23]=f56,64
[r22]=f55,64
f38=f6,f54,f42 ;;
//18:207
//18:207
//11:207
.mmf
(p16) add
(p16) add
(p17) fma.d
r37=64,r38
r41=64,r42
f33=f6,f61,f46
//5:206 613
//5:206 614
//12:207 582
.mmf
(p16) add
(p16) add
(p17) fma.d
r43=64,r44
r45=64,r46
f42=f6,f58,f37 ;;
//5:206 615
//5:206 616
//12:207 584
.mfi
(p16) lfetch.nt1
[r21],64
(p17) fma.d
f37=f6,f66,f49
(p16) add
r32=128,r34
597
595
576
//6:206 602
//13:207 588
//6:206 599
}
{
//2:207
//2:207
//16:207
//16:207
//3:207
//3:207
//3:206
586
587
.mfb
(p16) add
r38=64,r39
(p17) fma.d
f46=f6,f63,f40
// Branch taken probability 1.00
br.ctop.sptk
.b1_44 ;;
// Block 45: epilog Pred: 44
Succ: 46
//6:206 612
//13:207 590
//6:206
620
-O
583
585
593
592
618
//17:207 591
//17:207 589
//3:206 619
Copyright © 2003 Intel Corporation.
Intel
23
Itanium®
2 プロセッサメモリ構造
Labs
Intel
Developer
Forum
Itanium® 2 プロセッサ（1.50 ＧＨｚ）
外部
メモリ
6.4 GB/s
L3
6MB
12-way
128B lines
12-15 CLKS
L1I
16KB
48 GB/s
L2
48 GB/s 64B lines
1 CLK
256KB
8-way
L1D
128B lines
16KB
5-7 CLKS
48 GB/s 64B lines
Banked
1 CLK
128 FP レジスタ
128 汎用レジスタ
コア・パイプライン
(演算処理ユニット)
Intel
Itanium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Copyright © 2003 Intel Corporation.
24
Labs
12
Intel
-openmp 並列化コンパイルと実行結果
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
25
Labs
Intel
Developer
目次
Forum
• ストリーム・ベンチのコンパイル(Fortran)
– 最適化レポートの出力方法
– Itanium® 2 プロセッサの演算リソース
– ソフトウェア・パイプライニング
• 行列乗算のコンパイル（C）
– 最適化オプション
– 最適化ディレクティブ
• ＳＰＥＣintのコンパイル
– プロシジャ間最適化（ＩＰＯ）
– プロファイル・ガイド最適化（ＰＧＯ）
Intel
Copyright © 2003 Intel Corporation.
26
Labs
13
Intel
multiply_d.c プログラム
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Developer
Forum
#include "multiply_d.h"
// matrix multiply routine
void multiply_d(double a[][DIM], double b[][DIM], double c[][DIM])
{
int i,j,k;
double temp;
for(i=0;i<NUM;i++) {
for(k=0;k<NUM;k++) {
for(j=0;j<NUM;j++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
}
}
}
Intel
Copyright © 2003 Intel Corporation.
27
Labs
Intel
Developer
単純コンパイルと実行結果
Forum
Intel
Copyright © 2003 Intel Corporation.
28
Labs
14
Intel
Developer
Forum
multiply_d.cの12行ループ
--------------------------------------------------------------------------Swp report for loop at line 12 in multiply_d in file multiply_d.c
Resource II
Recurrence II
Minimum II
Scheduled II
=
=
=
=
4
15
15
18
Percent of Resource II needed by arithmetic ops
= 100%
Percent of Resource II needed by memory ops
= 100%
Percent of Resource II needed by floating point ops = 25%
Number of stages in the software pipeline =
2
Following are the loop-carried memory dependence edges:
Store at line
12 --> Load
at line
12
Store at line
12 --> Load
at line
12
Store at line
12 --> Load
at line
12
Store at line
12 --> Load
at line
12
Load
at line
12 --> Store at line
12
Load
at line
12 --> Store at line
12
Store at line
12 --> Load
at line
12
Store at line
12 --> Load
at line
12
---------------------------------------------------------------------------
Intel
Copyright © 2003 Intel Corporation.
29
Labs
Intel
Developer
-fno_alias による依存性解消
Forum
Intel
Copyright © 2003 Intel Corporation.
30
Labs
15
Intel
Developer
Forum
multiply_d.cの12行ループ
•
•
--------------------------------------------------------------------------Swp report for loop at line 12 in multiply_d in file multiply_d.c
•
•
•
•
Resource II =
Recurrence II =
Minimum II =
Scheduled II =
•
•
•
Percent of Resource II needed by arithmetic ops = 100%
Percent of Resource II needed by memory ops
= 100%
Percent of Resource II needed by floating point ops = 33%
•
•
3
0
3
3
Number of stages in the software pipeline = 6
---------------------------------------------------------------------------
Intel
Copyright © 2003 Intel Corporation.
31
Labs
Intel
#pragma ivdep の挿入
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Developer
Forum
#include "multiply_d.h"
// matrix multiply routine
void multiply_d(double a[][DIM], double b[][DIM], double c[][DIM])
{
int i,j,k;
double temp;
for(i=0;i<NUM;i++) {
for(k=0;k<NUM;k++) {
#pragma ivdep
for(j=0;j<NUM;j++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
}
}
}
Intel
Copyright © 2003 Intel Corporation.
32
Labs
16
Intel
ivdep による依存性解消
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
33
Labs
Intel
Developer
最適化のディレクティブ
C言語シンタックス
Ｆｏｒｔｒａｎシンタックス
#pragma [no]swp
!DEC$ [NO]SWP
#pragma loop count (10000)
!DEC$ LOOP COUNT (n)
#pragma distribute point
!DEC$ DISTRIBUTE POINT
#pragma [no]unroll(n)
!DEC$ [NO]UNROLL(n)
#pragma [no]prefetch a,b
!DEC$ [NO]PREFETCH
#pragma ivdep
!DEC$ IVDEP
Forum
Intel
Copyright © 2003 Intel Corporation.
34
Labs
17
Intel
Developer
ディレクティブの使用例
#pragma swp
for (i=0; i<m ; i++)
{
if (a[i]==0)
{
b[i]=a[i]+1;
}
else
{
b[i]=a[i]*2;
}
}
Forum
!DEC$ IVDEP
do j=1,n
a(b(j)) = a(b(j))+1
enddo
Intel
Copyright © 2003 Intel Corporation.
35
Labs
Intel
プロシジャ間（IPO）最適化
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
36
Labs
18
Intel
Developer
目次
Forum
• ストリーム・ベンチのコンパイル(Fortran)
– 最適化レポートの出力方法
– Itanium® 2 プロセッサの演算リソース
– ソフトウェア・パイプライニング
• 行列乗算のコンパイル（C）
– 最適化オプション
– 最適化ディレクティブ
• ＳＰＥＣintのコンパイル
– プロシジャ間最適化（ＩＰＯ）
– プロファイル・ガイド最適化（ＰＧＯ）
Intel
Copyright © 2003 Intel Corporation.
37
Labs
Intel
Spec*2000 から gzipの性能
•
•
•
•
•
Developer
Forum
gcc のコンパイル実行結果 185.151s
ecc
45.146s (4.10x)
-O3 オプション
44.726s (4.14x)
-O3 –ipo
43.341s (4.27x)
-O3 –prof_gen ….
• -O3 –ipo –prof_use 34.305 (5.40x)
Linux kernel 2.4.19-smp,Intel®C/C++ compiler 7.1 #20030703,gcc 2.96
Tiger4 system Intel® Itanium 2 プロセッサ１.５ＧＨｚ６MB Cache 2-way
Intel
Copyright © 2003 Intel Corporation.
38
Labs
19
gcc によるコンパイル実行
Intel
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
39
Labs
Intel
eccによるコンパイル実行
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
40
Labs
20
Intel
-O3によるコンパイル実行
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
41
Labs
Intel
-O3によるコンパイル実行( ls )
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
42
Labs
21
Intel
-ipoによるコンパイル実行
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
43
Labs
Intel
-prof_genによるコンパイル実行
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
44
Labs
22
Intel
-prof_useによるコンパイル実行
Developer
Forum
Intel
Copyright © 2003 Intel Corporation.
45
最適化コンパイラ
Labs
Intel
Performance Switches
Developer
Forum
プロセッサ固有や並列化
/G2 –g2
/Oa, -fno_alias
プロシジャ間
/Qip,-ip
/Qipo, -ipo
/Qipo_wp,
-wp_ipo
-O1, -O2, -O3
一般的な最適化
プ
ロフ
ァ
イル
Copyright © 2003 Intel Corporation.
・ガ
イデ
ッド
/Qprof_gen, -prof_gen
/Qprof_use,-prof_use
Intel
46
Labs
23
IPO
Intel
Developer
使用方法
Compile program
icc -c -ipo foo.c
Forum
foo.o
(fake object file)
foo.il
(un-optimized
intermediate
language files)
Link program
icc -o foo -ipo foo.o
2a. Compiler performs whole-program
optimizations
2b. Compiler invokes linker to produce
executable
foo
(optimized
executable)
Intel
Copyright © 2003 Intel Corporation.
47
PGO
Labs
Intel
Developer
使用方法
Compile+link to add instrumentation
icc –o foo -prof_gen foo.c
Forum
foo
(instrumented
executable)
12345678.dyn
(dynamic profile)
Execute instrumented program
./foo
Compile+link using feedback
icc –o foo -prof_use foo.c
pgopti.dpi
(merged .dyn files)
foo
(optimized
executable)
Intel
Copyright © 2003 Intel Corporation.
48
Labs
24
Intel
まとめ
Developer
Forum
• コンパイラの最適化レポートを利用して重要ループのソ
フトウェア・パイプライン化を確認しましょう
– プロセッサのリソースは有効に活用されていますか？
– 偽の依存性はないでしょうか？
• ディレクティブやコンパイル時のオプションを用いてコン
パイラの最適化をサポートしましょう
– 依存性を無視したら最適化されますか？
– コンパイラを助ける適切なデイレクティブはどれでしょうか？
– 適用可能な場合はインテルのパフォーマンス・ライブラリを使用
しましょう
• プロシジャ間最適化（IPO）とプロファイル・ガイド最適化
（PGO）は必ず試みましょう
– Itanium® アーキテクチャでは特に有効です
Intel
Copyright © 2003 Intel Corporation.
49
御参考
Labs
Intel
Developer
• 開発を用意にするインテルのパフォーマンス・ツールを利用しましょう
Forum
– インテル® パフォーマンスコンパイラ
– インテル® VTune™ パフォーマンスアナライザ
– インテル® パフォーマンスライブラリ
– http://www.intel.co.jp/jp/developer/software/products/compilers
• 弊社のパートナXLsoft（株）により日本語のサポートが得られます
– http://www.xlsoft.com/jp/products/vtune/perftool.htm
– https://premier.intel.com （英語によるサポート）
• インテルSWカレッジによるトレーニング
– http://www.intel.com/software/products/college （日本でのトレーニングについては
要相談）
• パフォーマンスのチューニングやカウンタの説明には
– IA-64 プロセッサ基本講座池井満著 2000年8月オーム社
– インテル® Itanium® 2 プロセッサリファレンス・マニュアルソフトウェアの開発と最適化
http://developer.intel.com/design/itanium2/manuals/
Intel
Copyright © 2003 Intel Corporation.
50
Labs
25

Download Report