
GUSTO:
General architecture design Utility and Synthesis
Tool for Optimization
Qualifying Exam
for
Ali Irturk
University of California, San Diego
Thesis Objective
• Design of a novel tool, GUSTO, for the automatic generation and optimization of application specific matrix computation architectures from a given MATLAB algorithm.
• Demonstration of the tool's effectiveness through rapid architectural generation of various signal processing, computer vision, and financial computation algorithms.
Motivation
• Matrix computations lie at the heart of most scientific computational tasks:
  • Wireless communication,
  • Financial computation,
  • Computer vision.
• Matrix inversion (A^-1, e.g., via QR decomposition) is required in:
  • Equalization algorithms, to remove the effect of the channel on the signal,
  • The mean variance framework, to solve a constrained maximization problem,
  • Optical flow computation algorithms, for motion estimation.
Motivation
• There are a number of tools that translate MATLAB algorithms to a hardware description language;
• However, we believe that the majority of these tools take the wrong approach;
• We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.
Computing Platforms
• ASICs: exceptional performance, but long time to market and substantial costs.
• DSPs: ease of development and fast time to market, but low performance.
• GPU / CELL BE: ease of development and fast time to market.
• FPGAs: ASIC-like performance with fast time to market.
Field Programmable Gate Arrays
• FPGAs are ideal platforms:
  • High processing power,
  • Flexibility,
  • Low non-recurring engineering (NRE) cost.
• If used properly, these features enhance performance and throughput significantly.
• BUT! Few tools exist that can aid the designer with the many system, architectural, and logic design choices.
GUSTO
General architecture design Utility and Synthesis Tool for Optimization
Inputs: the algorithm, matrix dimensions, resource allocation, bit width, and mode.
Output: the required HDL files.

An easy-to-use tool for more efficient design space exploration and development.

GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review, Transactions on Embedded Computing Systems.
Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Applications
  • Matrix Decomposition Methods
  • Matrix Inversion Methods
  • Mean Variance Framework for Optimal Asset Allocation
• Future Work
• Publications
GUSTO
Design Flow

[Design flow diagram. Inputs: the algorithm, matrix dimensions, and the type and # of arithmetic resources. Stages: Algorithm Analysis, Instruction Generation, Resource Allocation (from a design library of +, -, *, / units), Error Analysis (data representation), and Architecture Generation. Mode 1 emits a general purpose architecture (dynamic); Mode 2 adds Resource Trimming and Scheduling to emit an application specific architecture (static). Both paths are taken through Xilinx and Mentor Graphics tools to obtain area, latency, throughput, and simulation results.]
GUSTO
Modes

• Mode 1 of GUSTO generates a general purpose architecture and its datapath.
  • Can be used to explore other algorithms.
  • Does not lead to high-performance results.
• Mode 2 of GUSTO creates a scheduled, static, application specific architecture.
• GUSTO simulates the Mode 1 architecture to
  • Collect scheduling information,
  • Define the usage of resources.

[Architecture diagram: an instruction controller and a memory controller drive a set of arithmetic units (adders, multipliers).]
Matrix Multiplication Core Design

[The GUSTO design flow diagram is revisited on the following slides, highlighting one stage at a time: algorithm analysis, instruction generation, resource allocation, error analysis, and architecture generation.]
Matrix Multiplication Core Design
Algorithm Analysis

Built-in function: C = A * B expands to

C = zeros(n);                 % C must be initialized before accumulating
for i = 1:n
    for j = 1:n
        for k = 1:n
            Temp = A(i,k) * B(k,j);
            C(i,j) = C(i,j) + Temp;
        end
    end
end
Matrix Multiplication Core Design
Instruction Generation

A = [A(1,1) A(1,2); A(2,1) A(2,2)],  B = [B(1,1) B(1,2); B(2,1) B(2,2)],  C = [C(1,1) C(1,2); C(2,1) C(2,2)]

Operation                    Instruction [op, destination, operand 1, operand 2]
C(1,1) = A(1,1) * B(1,1)     [mul, C(1,1), A(1,1), B(1,1)]
Temp   = A(1,2) * B(2,1)     [mul, temp, A(1,2), B(2,1)]
C(1,1) = C(1,1) + Temp       [add, C(1,1), C(1,1), temp]
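As a sketch of how this unrolling could be automated, the MATLAB fragment below generates the instruction list for the 2 × 2 case in the [op, destination, operand 1, operand 2] format above (the staging of every product through temp is an illustrative assumption, not GUSTO's exact internals; the slide writes the first product of each C(i,j) directly):

n = 2;
instr = {};                            % each row: {op, dest, src1, src2}
for i = 1:n
    for j = 1:n
        for k = 1:n
            d = sprintf('C(%d,%d)', i, j);
            % product into temp, then accumulate into C(i,j)
            instr(end+1,:) = {'mul', 'temp', sprintf('A(%d,%d)', i, k), sprintf('B(%d,%d)', k, j)};
            instr(end+1,:) = {'add', d, d, 'temp'};
        end
    end
end
disp(instr(1:4,:))                     % the first instructions mirror the slide's example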
Matrix Multiplication Core Design
Instructions

[The generated instructions drive the instruction controller, which coordinates the memory controller and the arithmetic units (adders, multipliers).]
Matrix Multiplication Core Design
Number of Arithmetic Units

[The user specifies the type and number of arithmetic units (adders, multipliers) instantiated alongside the instruction and memory controllers.]
Matrix Multiplication Core Design
Error Analysis

User-defined input data is run through both GUSTO's fixed point arithmetic results (using variable bit width) and MATLAB's floating point arithmetic results (single/double precision), and the two are compared.

Error analysis metrics:
1) Mean error
2) Peak error
3) Standard deviation of error
4) Mean percentage error
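A minimal MATLAB sketch of these four metrics, assuming a simple truncating fixed point model with fb fractional bits (GUSTO's actual bit-width exploration is more involved):

n   = 4;  A = rand(n);  B = rand(n);
ref = A * B;                           % floating point (double precision) reference
fb  = 10;                              % assumed number of fractional bits
q   = @(x) floor(x * 2^fb) / 2^fb;     % fixed point quantizer
Af  = q(A);  Bf = q(B);  C = zeros(n);
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i,j) = q(C(i,j) + q(Af(i,k) * Bf(k,j)));   % quantize every operation
        end
    end
end
err      = C - ref;
mean_err = mean(abs(err(:)));                        % 1) mean error
peak_err = max(abs(err(:)));                         % 2) peak error
std_err  = std(err(:));                              % 3) standard deviation of error
mean_pct = 100 * mean(abs(err(:)) ./ abs(ref(:)));   % 4) mean percentage error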
Matrix Multiplication Core Design
Architecture Generation

General Purpose Architecture (Mode 1):
• Dynamic scheduling,
• Full connectivity,
• Dynamic memory assignments.

[Architecture diagram: instruction controller, memory controller, and arithmetic units (adders, multipliers) with all connections present.]
Matrix Multiplication Core Design
Architecture Generation

Application Specific Architecture (Mode 2):
• Static scheduling,
• Static memory assignments,
• Only the required connectivity.

[Architecture diagram: instruction controller, memory controller, and arithmetic units with unused connections trimmed away.]
GUSTO
Trimming Feature

[Trimming example: each functional-unit input (In_A1, In_A2, In_B1, In_B2, In_mem1) is fed by a multiplexer over the possible sources (Out_A, Out_B, Out_mem1, Out_mem2). During simulation runs GUSTO records, per input, which sources are ever selected (a 0/1 usage entry per source). Sources that are never selected, such as those feeding the inputs of unit B in the second example, are trimmed away together with their multiplexer connections.]
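A hypothetical MATLAB sketch of that bookkeeping (names and counts are illustrative, not GUSTO's internals): a usage matrix is filled in during simulation, and any source a multiplexer input never selects is trimmed.

sources = {'Out_A', 'Out_B', 'Out_mem1', 'Out_mem2'};
inputs  = {'In_A1', 'In_A2', 'In_B1', 'In_B2', 'In_mem1'};
usage   = zeros(numel(inputs), numel(sources));   % incremented during simulation runs
usage(1,3) = 5;  usage(2,4) = 7;                  % assumed example counts
for in = 1:numel(inputs)
    kept = sources(usage(in,:) > 0);              % connections to keep
    if isempty(kept)
        fprintf('%s: all connections trimmed\n', inputs{in});
    else
        fprintf('%s keeps: %s\n', inputs{in}, strjoin(kept, ', '));
    end
end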
Matrix Multiplication Core Results

[Designs 1-3 pair an instruction controller and a memory controller with different mixes of adders (A) and multipliers (M). Chart: Designs 1 and 2 both reach a throughput of 0.86 with 775 slices; Design 3 reaches 0.69 with 643 slices.]

Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
Hierarchical Datapaths
• Unfortunately, this flat organization of the architecture does not provide a complete design space for exploring better design alternatives.
• It also does not scale well with the complexity of the algorithms:
  • Number of instructions,
  • Number of functional units,
  • Optimization performance,
  • Internal storage and communication.
• To overcome these issues, we incorporate hierarchical datapaths and heterogeneous architecture generation options into GUSTO.
Matrix Multiplication Core Results

[Hierarchical designs for the 4 × 4 product C = A × B, where each core has its own instruction controller, memory controller, and arithmetic units (adders and multipliers): Design 4 uses 16 parallel A_1 cores, Design 5 a single A_1 core; Design 6 uses 8 parallel A_2 cores, Design 7 a single A_2 core; Design 8 uses 4 parallel A_4 cores, Design 9 a single A_4 core.]

Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
Matrix Multiplication Core Results

[Area (slices) and throughput chart for Designs 1-9 (throughput / slices): Design 1: 0.86 / 775, Design 2: 0.86 / 775, Design 3: 0.69 / 643, Design 4: 8.58 / 9552, Design 5: 0.54 / 597, Design 6: 5.97 / 5024, Design 7: 0.75 / 628, Design 8: 3.71 / 2660, Design 9: 0.93 / 665.]

Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
Matrix Multiplication Core Results

[Designs 10-12 mix heterogeneous cores (A_1, A_2, A_4) within one hierarchical datapath. The chart adds (throughput / slices): Design 10: 1.24 / 1293, Design 11: 1.72 / 1822, Design 12: 1.85 / 1859.]

Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Applications
  • Matrix Decomposition Methods
  • Matrix Inversion Methods
  • Mean Variance Framework for Optimal Asset Allocation
• Future Work
• Publications
MATRIX DECOMPOSITIONS
QR, LU AND CHOLESKY

QR: a given matrix A is factored as A = Q × R, where Q is an orthogonal matrix (Q^T × Q = Q × Q^T = I, so Q^-1 = Q^T) and R is an upper triangular matrix:

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [Q11 Q12 Q13; Q21 Q22 Q23; Q31 Q32 Q33] × [R11 R12 R13; 0 R22 R23; 0 0 R33]

LU: A = L × U, where L is a unit lower triangular matrix and U is an upper triangular matrix:

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [1 0 0; L21 1 0; L31 L32 1] × [U11 U12 U13; 0 U22 U23; 0 0 U33]

Cholesky: A = G × G^T, where G is the unique lower triangular matrix (the Cholesky triangle):

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [G11 0 0; G21 G22 0; G31 G32 G33] × [G11 G21 G31; 0 G22 G32; 0 0 G33]
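A quick MATLAB sanity check of the three factorizations using built-ins, on an assumed symmetric positive definite example so that all three apply (GUSTO's generated cores implement these computations in hardware):

A = rand(3);  A = A*A' + 3*eye(3);     % symmetric positive definite example
[Q, R] = qr(A);   norm(A - Q*R)        % QR: Q orthogonal, R upper triangular
[L, U] = lu(A);   norm(A - L*U)        % LU (MATLAB may fold row pivoting into L)
G = chol(A, 'lower');  norm(A - G*G')  % Cholesky triangle G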
MATRIX INVERSION

A given matrix and its inverse satisfy A × A^-1 = I:

[A11 A12 A13; A21 A22 A23; A31 A32 A33] × [x11 x12 x13; x21 x22 x23; x31 x32 x33] = [1 0 0; 0 1 0; 0 0 1]

The first row of A alone gives three equations:
A11·x11 + A12·x21 + A13·x31 = 1
A11·x12 + A12·x22 + A13·x32 = 0
A11·x13 + A12·x23 + A13·x33 = 0

Full matrix inversion is costly! Decomposing first reduces it to inverting structured factors:
A^-1 = R^-1 × Q^T (QR),  A^-1 = U^-1 × L^-1 (LU),  A^-1 = (G^-1)^T × G^-1 (Cholesky).
Results
Inflection Point Analysis

[Chart: clock cycles versus matrix size for the decomposition, A = Q × R, and for the inversion, A^-1 = R^-1 × Q^T.]

Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2009.
Results
Inflection Point Analysis

Implementations:
• Serial
• Parallel
Bit widths:
• 16 bits
• 32 bits
• 64 bits
Matrix sizes:
• 2 × 2
• 3 × 3
• ……..
• 8 × 8
Results
Inflection Point Analysis: Decomposition Methods

[Chart: number of clock cycles (sequential) versus matrix size (2 × 2 to 8 × 8) for QR, LU, and Cholesky decompositions at 16, 32, and 64 bits.]
Results
Inflection Point Analysis: Matrix Inversion

[Chart: number of clock cycles (parallel) versus matrix size (2 × 2 to 8 × 8) for inversion via QR, LU, and Cholesky decompositions at 16, 32, and 64 bits.]

An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
Results
Finding the Optimal Hardware: Decomposition Methods

[Chart: number of slices for the general purpose architecture (Mode 1) versus the application specific architecture (Mode 2): the decrease in area is 83% for QR, 94% for LU, and 86% for Cholesky.]

Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
Results
Finding the Optimal Hardware: Decomposition Methods

[Chart: throughput for Mode 1 versus Mode 2: the increase in throughput is 68% for QR, 14% for LU, and 16% for Cholesky.]
Results
Finding the Optimal Hardware: Matrix Inversion (using QR)

[Chart: slices and throughput for Mode 1 versus Mode 2 across resource allocations of (adders, subtractors, multipliers, dividers) = (2,2,2,2), (2,2,4,4), (3,4,4,4), and (4,4,4,4):]
• an average 59% decrease in area,
• a 3× increase in throughput.
Results
Architectural Design Alternatives: Matrix Inversion

[Chart: slices and throughput of QR, LU, and Cholesky based inversion at 19, 26, and 32 bit widths.]
Results
Architectural Design Alternatives: Matrix Inversion

[Chart: slices and throughput of QR, LU, and Cholesky based inversion for 4 × 4, 6 × 6, and 8 × 8 matrices.]
Results
Comparison with Previously Published Work: Matrix Inversion

                 Eilert    Eilert    Our       Our       Our       Edman     Karkooti  Our
                 et al.    et al.    ImplA     ImplB     ImplC     et al.    et al.
Method           Analytic  Analytic  Analytic  Analytic  Analytic  QR        QR        QR
Bit width        16        20        20        20        20        12        20        20
Data type        floating  floating  fixed     fixed     fixed     fixed     floating  fixed
Device type      Virtex 4  Virtex 4  Virtex 4  Virtex 4  Virtex 4  Virtex 2  Virtex 4  Virtex 4
Slices           1561      2094      702       1400      2808      4400      9117      3584
DSP48s           0         0         4         8         16        NR        22        12
BRAMs            NR        NR        0         0         0         NR        NR        1
Throughput       1.04      0.83      0.38      0.72      1.3       0.28      0.12      0.26
(10^6 × s^-1)

(NR = not reported.)

• J. Eilert, D. Wu, D. Liu, "Efficient Complex Matrix Inversion for MIMO Software Defined Radio", IEEE International Symposium on Circuits and Systems (2007).
• F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
• M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).
Results
Comparison with Previously Published Work: Matrix Inversion

                         Edman et al.  Karkooti et al.  Our       Our       Our
Method                   QR            QR               QR        LU        Cholesky
Bit width                12            20               20        20        20
Data type                fixed         floating         fixed     fixed     fixed
Device type              Virtex 2      Virtex 4         Virtex 4  Virtex 4  Virtex 4
Slices                   4400          9117             3584      2719      3682
DSP48s                   NR            22               12        12        12
BRAMs                    NR            NR               1         1         1
Throughput (10^6×s^-1)   0.28          0.12             0.26      0.33      0.25

(NR = not reported.)

• F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
• M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).
Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Applications
  • Matrix Decomposition Methods
  • Matrix Inversion Methods
  • Mean Variance Framework for Optimal Asset Allocation
• Future Work
• Publications
Asset Allocation
• Asset allocation is the core part of portfolio management.
• An investor can minimize the risk of loss and maximize the return of his portfolio by diversifying his assets.
• Determining the best allocation requires solving a constrained optimization problem: Markowitz's mean variance framework.
Asset Allocation
• Increasing the number of assets provides significantly more efficient allocations.
High Performance Computing
• A higher number of assets and more complex diversification require significant computation.
• Adding FPGAs to existing high performance computers can boost application performance and design flexibility.

Prior FPGA work in computational finance:
• Zhang et al. and Morris et al.: single option pricing,
• Kaganov et al.: credit derivative pricing,
• Thomas et al.: interest rate and value-at-risk simulations.

We are the first to propose hardware acceleration of the mean variance framework using FPGAs.

FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
THE MEAN VARIANCE FRAMEWORK

[Framework diagram: expected prices E{M} and expected covariance Cov{M} feed the computation of the required inputs (5 phases of the MVF); these drive the computation of the efficient frontier (expected return versus standard deviation, i.e. risk); the allocations on the frontier feed the computation of the optimal allocation, which selects the highest utility portfolio.]
IDENTIFICATION OF BOTTLENECKS

[Chart: computation time (sec, log scale) versus number of securities (20-100) for generation of the required inputs, mean variance framework step 1, and mean variance framework step 2; # of portfolios = 100, # of scenarios = 100,000.]
Hardware Architecture for MVF Step 2

[Monte Carlo block: a random number generator produces market vectors M; for an allocation α = [α1, α2, …, αNs], the objective value Ψα = α × M is computed, which requires Ns multiplications; each candidate allocation's expected return and standard deviation (risk) are then evaluated to ask: is this allocation the best?]
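A simplified software model of this loop in MATLAB, using the exponential HARA utility u(ψ) = -e^(-(1/ζ)ψ) listed later; the market model, scenario count, and ζ are assumed purely for illustration:

Ns = 4;  Nscen = 1e4;  zeta = 2;
alphas = rand(3, Ns);  alphas = alphas ./ sum(alphas, 2);  % 3 candidate allocations
M      = 1 + 0.1 * randn(Ns, Nscen);                       % assumed market vectors, one column per scenario
psi    = alphas * M;                                       % objective values (Ns multiplications per entry)
score  = mean(-exp(-psi / zeta), 2);                       % expected utility per allocation
[~, best] = max(score);                                    % "is this allocation the best?"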
Hardware Architecture for MVF Step 2

[Parallelism in the architecture: Ns parallel multipliers per Monte Carlo block, Nm parallel Monte Carlo blocks, Nm parallel utility calculation blocks, and Np parallel satisfaction function calculation blocks feeding the satisfaction function calculator blocks.]
Results
Mean Variance Framework – Step 2

[Chart: computational time (sec, log scale) over 1000 runs versus number of securities (50-100), with 100,000 scenarios and 50 portfolios. Parallel 1 (10 satisfaction blocks; 1 Monte Carlo block with 10 multipliers and 10 utility function calculator blocks) achieves a 151-221× speedup over software; Parallel 2 (10 satisfaction blocks; 1 Monte Carlo block with 20 multipliers and 20 utility function calculator blocks) achieves 280-442×.]

FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Applications
  • Matrix Decomposition Methods
  • Matrix Inversion Methods
  • Mean Variance Framework for Optimal Asset Allocation
• Future Work
• Publications
Thesis Outline and Future Work
1. Introduction
2. Comparison of FPGAs, GPUs and CELLs
   - Possible journal paper,
   - GPU implementation of face recognition for a journal paper.
3. GUSTO Fundamentals
4. Super GUSTO
   - Journal paper for hierarchical design and heterogeneous core design,
   - Employing different instruction scheduling algorithms and analyzing their effects on the implemented architectures.
5. Small code applications of GUSTO
   - Matrix decomposition core (QR, LU, Cholesky) designs with different architectural choices,
   - Matrix inversion core (Analytic, QR, LU, Cholesky) designs with different architectural choices,
   - Design of adaptive weight calculation cores.
6. Large code applications using GUSTO
   - Mean Variance Framework Step 2 implementation,
   - Short preamble processing unit implementation,
   - Optical flow computation algorithm implementation.
7. Conclusions
8. Future Work
9. References
Outline
• Motivation
• GUSTO: Design Tool and Methodology
• Applications
  • Matrix Decomposition Methods
  • Matrix Inversion Methods
  • Mean Variance Framework for Optimal Asset Allocation
• Future Work
• Publications
Publications
[15] An Optimized Algorithm for Leakage Power Reduction of Embedded Memories on FPGAs Through Location
Assignments, Shahnam Mirzaei, Yan Meng, Arash Arfaee, Ali Irturk, Timothy Sherwood, Ryan Kastner, working paper for
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[14] Xquasher: A Tool for Efficient Computation of Multiple Linear Expressions, Arash Arfaee, Ali Irturk, Ryan Kastner,
Farzan Fallah, under review, Design Automation Conference (DAC 2009), July 2009.
[13] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk,
Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009), July 2009.
[12] Energy Benefits of Reconfigurable Hardware for use in Underwater Sensor Nets, Bridget Benson, Ali Irturk, Junguk Cho,
Ryan Kastner, under review, 16th Reconfigurable Architectures Workshop (RAW 2009), May 2009.
[11] Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget
Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking
Conference (WCNC 2009), April 2009.
[10] FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay
Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08
International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
[9] GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson,
Shahnam Mirzaei and Ryan Kastner, under review (2nd round of reviews), Transactions on Embedded Computing
Systems.
[8] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan
Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December
2009.
[7] Survey of Hardware Platforms for an Energy Efficient Implementation of Matching Pursuits Algorithm for
Shallow Water Networks, Bridget Benson, Ali Irturk, Junguk Cho, and Ryan Kastner, In Proceedings of the The
Third ACM International Workshop on UnderWater Networks (WUWNet), in conjunction with ACM
MobiCom 2008, September 2008.
[6] Design Space Exploration of a Cooperative MIMO Receiver for Reconfigurable Architectures, Shahnam
Mirzaei, Ali Irturk, Ryan Kastner, Brad T. Weals and Richard E. Cagley, In Proceedings of the IEEE International
Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2008.
[5] An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson,
Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors
(SASP), June 2008.
[4] An Optimization Methodology for Matrix Computation Architectures, Ali Irturk, Bridget Benson, and Ryan
Kastner, Unsubmitted Manuscript.
[3] FPGA Implementation of Adaptive Weight Calculation Core Using QRD-RLS Algorithm, Ali Irturk, Shahnam
Mirzaei and Ryan Kastner, Unsubmitted Manuscript.
[2] An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition, Ali Irturk,
Shahnam Mirzaei and Ryan Kastner, Unsubmitted Manuscript.
[1] Implementation of QR Decomposition Algorithms using FPGAs, Ali Irturk, MS Thesis, Department of Electrical
and Computer Engineering, University of California, Santa Barbara, June 2007. Advisor: Ryan Kastner
Thank You
[email protected]
MATRIX INVERSION
• Use decomposition methods for
  • Analytic simplicity,
  • Computational convenience.
• Decomposition methods:
  • QR,
  • LU,
  • Cholesky, etc.
• Analytic method.
Matrix Inversion using QR Decomposition

A given matrix is factored as A = Q × R, with Q an orthogonal matrix and R an upper triangular matrix:

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [Q11 Q12 Q13; Q21 Q22 Q23; Q31 Q32 Q33] × [R11 R12 R13; 0 R22 R23; 0 0 R33]

Since Q^T × Q = Q × Q^T = I, we have Q^-1 = Q^T, and therefore A^-1 = R^-1 × Q^T.
Matrix Inversion using QR Decomposition

Three different QR decomposition methods:
• Gram-Schmidt orthonormalization,
• Givens rotations,
• Householder reflections.

Modified Gram-Schmidt (QRD-MGS), where X_i denotes the columns of the matrix, R_ii uses the Euclidean norm, and R_ij is the entry at the intersection of the i-th row and j-th column:

QRD_MGS(A)
1   for i = 1:n
2       X_i = A_i
3   for i = 1:n
4       R_ii = ||X_i||
5       Q_i = X_i / R_ii
6       for j = i+1:n
7           R_ij = <Q_i, X_j>
8           X_j = X_j - R_ij * Q_i

[The intermediate X, Q, and R matrices stream through memory as the columns are orthogonalized.]
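The same algorithm as a directly runnable MATLAB function, transcribing the QRD-MGS pseudocode above:

function [Q, R] = qrd_mgs(A)
% Modified Gram-Schmidt QR decomposition: A = Q*R.
n = size(A, 2);
X = A;  Q = zeros(size(A));  R = zeros(n);
for i = 1:n
    R(i,i) = norm(X(:,i));            % Euclidean norm of the i-th column
    Q(:,i) = X(:,i) / R(i,i);
    for j = i+1:n
        R(i,j) = Q(:,i)' * X(:,j);    % inner product <Q_i, X_j>
        X(:,j) = X(:,j) - R(i,j) * Q(:,i);
    end
end
end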
Matrix Inversion using Analytic Method

A^-1 = (1 / det A) × Adj(A)

The analytic method uses
• The adjoint matrix,
• The determinant of the given matrix.

Determinant of a 2 × 2 matrix:  det [a b; c d] = ad - bc
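A minimal MATLAB sketch of the analytic method for a small matrix, forming the adjoint from signed cofactors (det() on the minors stands in for the cofactor calculation cores):

A = rand(4);  n = size(A, 1);
Adj = zeros(n);
for i = 1:n
    for j = 1:n
        minor = A;  minor(i,:) = [];  minor(:,j) = [];  % delete row i, column j
        Adj(j,i) = (-1)^(i+j) * det(minor);             % signed cofactor, transposed
    end
end
Ainv = Adj / det(A);            % A^-1 = Adj(A) / det(A)
norm(Ainv - inv(A))             % sanity check, ~0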
Adjoint Matrix

[Adjoint matrix calculation for a 4 × 4 matrix: each cofactor is a signed 3 × 3 determinant, e.g.
C11 = (-1)^(1+1) × [A22·(A33·A44 - A34·A43) - A23·(A32·A44 - A34·A42) + A24·(A32·A43 - A33·A42)],
computed by a cofactor calculation core built from multipliers and subtractors.]
Different Implementations of Analytic Approach

[Implementation A uses one cofactor calculation core, Implementation B two cores, and Implementation C four cores in parallel.]
Matrix Inversion using LU Decomposition

A given matrix is factored as A = L × U, with L a unit lower triangular matrix and U an upper triangular matrix:

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [1 0 0; L21 1 0; L31 L32 1] × [U11 U12 U13; 0 U22 U23; 0 0 U33]

and A^-1 = U^-1 × L^-1.
Matrix Inversion using LU Decomposition

LU(A)
1   for j = 1:n
2       for k = 1:j-1
3           for i = k+1:j-1
4               A_ij = A_ij - A_ik * A_kj
5       for k = 1:j-1
6           for i = j:n
7               A_ij = A_ij - A_ik * A_kj
8       for k = j+1:n
9           A_kj = A_kj / A_jj

[Worked 3 × 3 example: the matrix is overwritten in place, column by column, e.g. A21 = A21/A11 and A31 = A31/A11 in the first column, then A22 = A22 - A21·A12, A32 = A32 - A31·A12, A32 = A32/A22 in the second.]
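The pseudocode transcribes directly into a runnable MATLAB function that overwrites A with its LU factors (U on and above the diagonal, the multipliers of the unit lower triangular L below it); no pivoting is performed, matching the slide, so A_jj is assumed nonzero:

function A = lu_inplace(A)
% In-place LU decomposition following the LU(A) pseudocode above.
n = size(A, 1);
for j = 1:n
    for k = 1:j-1               % update entries above the diagonal
        for i = k+1:j-1
            A(i,j) = A(i,j) - A(i,k) * A(k,j);
        end
    end
    for k = 1:j-1               % update the diagonal and below
        for i = j:n
            A(i,j) = A(i,j) - A(i,k) * A(k,j);
        end
    end
    for k = j+1:n               % form the column of L
        A(k,j) = A(k,j) / A(j,j);
    end
end
end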
Matrix Inversion using LU Decomposition

[The same LU(A) pseudocode, stepped through the remaining updates of the 3 × 3 example, e.g. A23 = A23 - A21·A13 and A33 = A33 - A31·A13 - A32·A23.]
Matrix Inversion using Cholesky Decomposition

A given matrix is factored as A = G × G^T, where G is the unique lower triangular matrix (the Cholesky triangle):

[A11 A12 A13; A21 A22 A23; A31 A32 A33] = [G11 0 0; G21 G22 0; G31 G32 G33] × [G11 G21 G31; 0 G22 G32; 0 0 G33]

and A^-1 = (G^-1)^T × G^-1 = (G^T)^-1 × G^-1.
Matrix Inversion using Cholesky Decomposition

Cholesky(A)
1   for k = 1:n
2       G_kk = sqrt(A_kk)
3       for i = k+1:n
4           G_ik = A_ik / G_kk
5       for j = k+1:n
6           for t = j:n
7               A_tj = A_tj - G_tk * G_jk

[Worked 3 × 3 example, stepping through k = 1, 2, 3:
k = 1: G11 = sqrt(A11), G21 = A21/G11, G31 = A31/G11; then A22 = A22 - G21·G21, A32 = A32 - G31·G21, A33 = A33 - G31·G31;
k = 2: G22 = sqrt(A22), G32 = A32/G22; then A33 = A33 - G32·G32;
k = 3: G33 = sqrt(A33).]
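Likewise, the Cholesky pseudocode as a runnable MATLAB function (A is assumed symmetric positive definite, as the method requires):

function G = cholesky_lower(A)
% Cholesky triangle G with A = G*G', following the Cholesky(A) pseudocode above.
n = size(A, 1);
G = zeros(n);
for k = 1:n
    G(k,k) = sqrt(A(k,k));
    for i = k+1:n
        G(i,k) = A(i,k) / G(k,k);
    end
    for j = k+1:n               % update the trailing submatrix
        for t = j:n
            A(t,j) = A(t,j) - G(t,k) * G(j,k);
        end
    end
end
end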
THE MEAN VARIANCE FRAMEWORK

[Framework diagram, revisited: (1) computation of the required inputs, the expected prices E{M} and expected covariance Cov{M}, in 5 phases; (2) computation of the efficient frontier (expected return versus standard deviation, i.e. risk); (3) computation of the optimal allocation, selecting the highest utility portfolio.]
COMPUTATION OF REQUIRED INPUTS

[From publicly available data (prices, covariance, estimation interval) and the investor profile (# of securities, investment horizon, reference allocation, objective), the required inputs E{M} and Cov{M} are produced in five phases:
1) Detect the invariants,
2) Determine the distribution of the invariants,
3) Project the invariants to the investment horizon,
4) Compute the expected return and the covariance matrix,
5) Compute the market vector M from the known data.
The horizon is the time at which the investment is made.]
COMPUTATION OF REQUIRED INPUTS (3)

• Investor objectives:
  • Absolute wealth,
  • Relative wealth,
  • Net profits.

The objective value is Ψα = α × M for an allocation α = [α1, α2, …, αNs].
COMPUTATION OF REQUIRED INPUTS
STEP 5

M is a transformation of the market prices at the investment horizon: M ≡ a + B·P_T+τ, and the objective value is Ψα = α × M, with

K ≡ I_N − (p_T × β′) / (β′ × p_T)

Standard investor objectives:

Objective         (a) Specific form                   (b) Generalized form
Absolute wealth   Ψα = W_T+τ(α)                       a ≡ 0, B ≡ I_N  ⇒  Ψα = α′ P_T+τ
Relative wealth   Ψα = W_T+τ(α) − γ(α)·W_T+τ(β)       a ≡ 0, B ≡ K    ⇒  Ψα = α′ K·P_T+τ
Net profits       Ψα = W_T+τ(α) − w_T(α)              a ≡ −p_T, B ≡ I_N  ⇒  Ψα = α′ (P_T+τ − p_T)
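A minimal MATLAB software model of this step (prices and the reference allocation β are assumed example values; the IP core computes the same quantities in fixed point):

Ns   = 4;
pT   = 1 + rand(Ns, 1);                       % current prices
PTt  = 1 + rand(Ns, 1);                       % prices at the horizon, P_T+tau
beta = ones(Ns, 1) / Ns;                      % reference allocation
K = eye(Ns) - (pT * beta') / (beta' * pT);    % K building block
M_absolute = PTt;                             % a = 0,    B = I_N
M_relative = K * PTt;                         % a = 0,    B = K
M_net      = PTt - pT;                        % a = -p_T, B = I_N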
COMPUTATION OF REQUIRED INPUTS
Each step requires assumptions:
• Invariants,
• Distribution of invariants,
• Estimation interval, …
Our assumptions:
• Compounded returns of stocks as market invariants,
• 3 years of known data,
• 1 week estimation interval,
• 1 year as our horizon.
Phase 5 is a good candidate for hardware implementation.
MVF: STEP 1
Computation of the Efficient Frontier

Inputs: E{M}, Cov{M}, # of portfolios, # of securities, current prices, budget.

α(v) ≡ argmax α′E{M}, v ≥ 0, subject to α ∈ constraints and α′Cov{M}α = v

[Plot: the efficient frontier, expected return E{Ψα} versus risk, Var{Ψα} / standard deviation.]
MVF: STEP 1
Computation of the Efficient Frontier

[Plot: the efficient frontier bounds the achievable risk-return space; the region beyond it is unachievable, and an investor does NOT want to be in the region beneath it!]
MVF: STEP 2
Computing the Optimal Allocation

Inputs: # of scenarios, # of portfolios, # of securities, current prices, satisfaction index.

[Diagram: each allocation on the efficient frontier is evaluated to determine the highest utility portfolio, asking: is this allocation the best?]
MVF: STEP 2
Computing the Optimal Allocation

• Satisfaction indices
  • Represent all the features of a given allocation with one single number,
  • Quantify the investor's satisfaction,
  • Main families: certainty-equivalent, quantile, and coherent indices.
• Certainty-equivalent satisfaction indices are represented by the investor's utility function and objective, u(ψ).
• We use the Hyperbolic Absolute Risk Aversion (HARA) class of utility functions: exponential, quadratic, power, logarithmic, linear.
MVF: STEP 2
Computing the Optimal Allocation

• The Hyperbolic Absolute Risk Aversion (HARA) class of utility functions are
  • Specific forms of the Arrow-Pratt risk aversion model,
  • Defined as A(ψ) = ψ / (γψ² + ζψ + η), where η = 0.

Utility functions:
Exponential utility (ζ > 0 and γ ≡ 0)    u(ψ) = −e^(−(1/ζ)ψ)
Quadratic utility (ζ > 0 and γ ≡ −1)     u(ψ) = ψ − (1/2ζ)ψ²
Power utility (ζ ≡ 0 and γ ≥ 1)          u(ψ) = ψ^(1−1/γ)
Logarithmic utility (limit γ → 1)        u(ψ) = ln(ψ)
Linear utility (limit γ → ∞)             u(ψ) = ψ
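The table transcribes directly into MATLAB function handles (the ζ and γ values are assumed examples within the stated ranges):

zeta = 2;  gam = 4;
u_exp  = @(psi) -exp(-psi / zeta);            % exponential (zeta > 0, gamma = 0)
u_quad = @(psi) psi - psi.^2 / (2 * zeta);    % quadratic (zeta > 0, gamma = -1)
u_pow  = @(psi) psi.^(1 - 1/gam);             % power (zeta = 0, gamma >= 1)
u_log  = @(psi) log(psi);                     % logarithmic (limit gamma -> 1)
u_lin  = @(psi) psi;                          % linear (limit gamma -> infinity)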
IDENTIFICATION OF BOTTLENECKS
• In terms of computation time, the most important variables are:
  • Number of securities,
  • Number of portfolios,
  • Number of scenarios.

[Framework diagram, as before.]
IDENTIFICATION OF BOTTLENECKS

["# of Securities" dominates computation time over "# of Portfolios". 3-D chart: execution time in seconds versus number of securities (50-90) and number of portfolios (50-90); # of scenarios = 100,000.]
IDENTIFICATION OF BOTTLENECKS

["# of Portfolios" dominates computation time over "# of Scenarios". 3-D chart: execution time in seconds versus number of portfolios (50-90) and number of scenarios (1.00-1.04 × 10^5); # of securities = 100.]
IDENTIFICATION OF BOTTLENECKS

[Chart: computation time (sec, log scale) versus number of portfolios (10-100) for mean variance framework steps 1 and 2; # of securities = 100, # of scenarios = 100,000.]
IDENTIFICATION OF BOTTLENECKS

[Chart: computation time (sec, log scale) versus number of scenarios for mean variance framework step 2; # of securities = 100, # of portfolios = 100.]
Generation of Required Inputs – Phase 5

[Market vector calculator IP core. A K building block computes K ≡ I_N − (p_T × β′) / (β′ × p_T) from p_T and the reference allocation β′ using multipliers, a divider, and a subtractor; multiplexers driven by the control inputs cntrl_a and cntrl_b then form the market vector M for the selected objective:

Objective     cntrl_a   cntrl_b   Output
Absolute      0         0         Ψα = α′ P_T+τ          (a ≡ 0, B ≡ I_N)
Relative      1         0         Ψα = α′ K·P_T+τ        (a ≡ 0, B ≡ K)
Net profits   0         1         Ψα = α′ (P_T+τ − p_T)  (a ≡ −p_T, B ≡ I_N)

The datapath is built up incrementally across the slides: first the β′·p_T and p_T·β′ products, then the division and subtraction from I_N forming K, then the output multiplexers selecting among P_T+τ, K·P_T+τ, and P_T+τ − p_T.]
Hardware Architecture for MVF Step 1

α(v) ≡ argmax α′E{M}, v ≥ 0, subject to α ∈ constraints and α′Cov{M}α = v

• A popular approach to solving such constrained maximization problems is the Lagrangian multiplier method.
Hardware Architecture for MVF Step 1

L = α′E{M} + λ(v − α′Cov{M}α)

For two securities, with E{M} = [P1; P2]:

L = α1·P1 + α2·P2 + λ(v − [α1 α2] × [Cov11 Cov12; Cov21 Cov22] × [α1; α2])

∂L/∂α1 = P1 − λ[2·α1·Cov11 + α2·(Cov21 + Cov12)] = 0
∂L/∂α2 = P2 − λ[2·α2·Cov22 + α1·(Cov21 + Cov12)] = 0
∂L/∂λ = 0  ⇒  α1²·Cov11 + α1·α2·(Cov21 + Cov12) + α2²·Cov22 = v

On the order of "number of securities" such equations must be solved to determine the efficient allocation for a given risk.
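A small numerical sketch of the two-security system above, solved with fsolve from the Optimization Toolbox (the prices, covariance, and risk target are assumed example values):

P   = [1.05; 1.10];                 % assumed E{M}
Cov = [0.04 0.01; 0.01 0.09];       % assumed Cov{M}
v   = 0.05;                         % target risk
eqs = @(x) [ P(1) - x(3)*(2*x(1)*Cov(1,1) + x(2)*(Cov(2,1) + Cov(1,2)));
             P(2) - x(3)*(2*x(2)*Cov(2,2) + x(1)*(Cov(2,1) + Cov(1,2)));
             x(1)^2*Cov(1,1) + x(1)*x(2)*(Cov(2,1) + Cov(1,2)) + x(2)^2*Cov(2,2) - v ];
sol = fsolve(eqs, [0.5; 0.5; 1]);   % sol = [alpha1; alpha2; lambda]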
Hardware Architecture for MVF Step 1

[Np parallel cores: core i receives v_i, E{M}, and Cov{M} and produces the allocation α = [α1, α2, …, αNs] for that risk level, one core per point on the frontier.]
Hardware Architecture for MVF Step 2

[Monte Carlo block: a random number generator feeds the objective-value computation; evaluating one allocation α1 = [α11, α12, …, α1Ns] requires Ns multiplications.]

Utility functions implemented:
Exponential utility (ζ > 0 and γ ≡ 0)    u(ψ) = −e^(−(1/ζ)ψ)
Quadratic utility (ζ > 0 and γ ≡ −1)     u(ψ) = ψ − (1/2ζ)ψ²
Results
Generation of Required Inputs – Phase 5

[Chart: computational time (sec, log scale) over 1000 runs versus number of securities (50-100) for software, parallel (Ns arithmetic resources in parallel), and fully parallel implementations: a 6-9.6× speedup for the parallel version, and 629× for the fully parallel version (for 50 securities).]
Conclusion
• The Mean Variance Framework's inherent parallelism makes it an ideal candidate for an FPGA implementation;
• We are bound by hardware resources rather than by the parallelism the Mean Variance Framework offers;
• However, there are many different architectural choices for implementing the Mean Variance Framework's steps.