Lucian Codrescu
Sr. Director, Technology
Qualcomm Technologies, Inc.
Qualcomm Hexagon
DSP: An architecture
optimized for mobile
multimedia and
communications
Qualcomm Technologies, Inc. All Rights Reserved
1
Hexagon™ DSP processors in Snapdragon products
• aDSP: Real-time
media & sensor
processing
Snapdragon
800
Camera
Adreno
GPU
Display
JPEG
Video
Krait
CPU
Krait
CPU
Krait
CPU
Krait
CPU
Other
Audio
Sensors
Hexagon
aDSP
Misc.
Connectivity
2MB L2
Multimedia Fabric
System Fabric
Fabric & Memory Controller
LPDDR3
LPDDR3
Modem
Hexagon
mDSP
• mDSP: Dedicated
modem processing
Qualcomm Technologies, Inc. All Rights Reserved
2
Expansion of Hexagon DSP use cases beyond audio
Image Enhancement
Camera, Still, Video
HexagonV4 based products
HexagonV2/V3
Computer Vision &
Augmented Reality
HexagonV4 based products
Video
HexagonV5 based products
Voice
Audio
Sensors
HexagonV5 based products
Hexagon DSP is evolving for use beyond voice and audio to
computer vision, video and imaging features
Qualcomm Technologies, Inc. All Rights Reserved
3
The Hexagon DSP evolution
Generational improvements in performance and power efficiency driven by
both architecture and implementation
V4M
V3M
V1
65nm
Oct 2006
28nm
Apr 2011
45nm Nov
2009
65nm
Dec 2007
V3C
V4C
45nm Aug
2009
28nm
Dec 2012
V4L
V3L
V2
V5A
28nm
Dec 2010
45nm
June 2009
28nm
Dec 2010
V5H
28nm
Dec 2012
Time
Qualcomm Technologies, Inc. All Rights Reserved
4
Key characteristics of
modem & multimedia applications
Requirements
• Require fixed real-time
performance level
(fps, Mbit/sec, etc.)
• Extremely aggressive
power & area targets
Characteristics
• Mix of signal processing
& control code
− For modem, Qualcomm does not
use a split CPU/DSP architecture.
All processing is done on Hexagon
DSP
− Multimedia apps have significant
control in the RTOS & frameworks
• Heavy L2$ misses
− Multimedia is data intensive
− Modem is code intensive
Qualcomm Technologies, Inc. All Rights Reserved
5
Hexagon DSP blends features targeted to modem &
multimedia
VLIW
• Need multi-issue to
meet performance
• Low complexity for
Area & Power
Multi-Threading
• To reduce L2$ miss
penalty without the need
for a large L2
• Increases
instructions/VLIW packet
because compiler doesn’t
need to schedule latency
Hexagon
DSP
Innovate in ISA to
maximize IPC
• More work/VLIW packet
reduces energy/instruction
• Keep the pipelines full for
MIPS/mm2
• Target both Signal
Processing & Control code
Qualcomm Technologies, Inc. All Rights Reserved
6
VLIW: Area & power efficient multi-issue
Variable sized
instruction packets
(1 to 4 instructions
per Packet)
Device
DDR
Memory
• Dual 64-bit
load/store
units
• Also 32-bit
ALU
• Dual 64-bit execution units
• Standard 8/16/32/64bit data
types
• SIMD vectorized MPY / ALU
/ SHIFT, Permute, BitOps
• Up to 8 16b MAC/cycle
• 2 SP FMA/cycle
Instruction
Cache
Instruction Unit
L2
Cache
/ TCM
Data Unit
(Load/
Store/
ALU)
Data Unit
(Load/
Store/
ALU)
Execution
Unit
(64-bit
Vector)
Data Cache
Register File/Thread
Register File
Register File
Execution
Unit
(64-bit
Vector)
•
•
•
Unified 32x32bit
General Register
File is best for
compiler.
No separate Address
or Accum Regs
Per-Thread
Qualcomm Technologies, Inc. All Rights Reserved
7
Maximizing the signal processing code work/packet
Example from inner loop of FFT: Executing 29 “simple RISC ops” in 1 cycle
64-bit Load and
64-bit Store with
post-update
addressing
{ R17:16 = MEMD(R0++M1)
MEMD(R6++M1) = R25:24
R20 = CMPY(R20, R8):<<1:rnd:sat
R11:10 = VADDH(R11:10, R13:12)
}:endloop0
Complex multiply with
round and saturation
Rs
I
R
I
R
Rs
Rt
I
R
I
R
Rt
*
Zero-overhead loops
Vector 4x16-bit Add
*
32
0x8000
*
32
<<0-1
<<0-1
*
32
32
<<0-1
<<0-1
-
• Dec count
• Compare
• Jump top
Add
Add
Sat_32
+
+
+
+
0x8000
Sat_32
High 16bits
I
High 16bits
R
Rd
Qualcomm Technologies, Inc. All Rights Reserved
8
Maximizing the control code work/packet
Hexagon DSP ISA improves control code efficiency
over traditional VLIW
Example C code
void example(int *ptr, int val) {
if (ptr!=0) {
*ptr = *ptr + val + 2;
}}
Tradional VLIW
Assembly Code
①
Hexagon DSP:
Compound ALU
p0 = cmp.eq(r0,#0)
{
if (!p0) r2=memw(r0)
if (p0) jumpr:nt r31
}
r2 = add(r2,#2)
r1 = add(r1,r2)
{
memw(r0) = r1
⑤
jumpr r31
p0 = cmp.eq (r0,#0)
①
②
③
if (!p0.new) r2=memw(r0)
if (p0.new) jumpr:nt r31
}
r2 = add(r2,#2)
r1 = add(r1,r2)
{
memw(r0) = r1
④
Hexagon DSP:
New-Value Store
{
{
{
②
③
④
Hexagon DSP:
Dot-New Predication
jumpr r31
p0 = cmp.eq(r0,#0)
①
②
if (!p0.new) r2=memw(r0)
}
}
{
r1 = add(r1,add(r2,#2))
memw(r0) = r1
③
jumpr r31
}
if (!p0.new) r2=memw(r0)
if (p0.new) jumpr:nt r31
if (p0.new) jumpr:nt r31
{
p0 = cmp.eq(r0,#0)
①
r1 = add(r1,add(r2,#2))
②
memw(r0) = r1.new
jumpr r31
}
}
}
Instr/Packet =
7 instr/5 packets = 1.4
Instr/Packet =
7 instr/2packets = 3.5
Qualcomm Technologies, Inc. All Rights Reserved
9
High avg. instructions/packet for targeted use cases
Average Instructions / VLIW Packet
Compound instructions count as 2
5
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
Computer
Vision
Source: Qualcomm internal measurements
Video
Imaging
Control
Audio
Qualcomm Technologies, Inc. All Rights Reserved
10
Programmer’s view of Hexagon DSP HW
multi-threading
• Hexagon V5 includes three hardware threads
• Architected to look like a multi-core with communication
through shared memory
Shared Instruction Cache
Thread 0
D
U
D
U
X
U
Thread 1
X
U
Register File
D
U
D
U
X
U
Thread 2
X
U
Register File
D
U
D
U
X
U
X
U
L2
Cache /
TCM
Register File
Shared Data Cache
Qualcomm Technologies, Inc. All Rights Reserved
11
Hexagon DSP V1-V4: Interleaved multi-threading
Simple round-robin thread scheduling
• Number of threads match execution pipe depth
(three threads three execute stages)
• All instructions complete before next packet dispatch
• Compiler schedules for zero-latency which helps to increase
instructions/VLIW packet
T0: {
Thread 0 Dispatch
Thread 1 Dispatch
Ld
Add Cmp } T1: {
St
Ld
Mpy Add
}
T2: {
Ld
Add Jump
}
T0: {
Ld
Ld
Add Cmp }
T1: {
St
Ld
Mpy Add
}
T0: {
Ld
Ld
Add Cmp }
Ld
Thread 2 Dispatch
Qualcomm Technologies, Inc. All Rights Reserved
12
Hexagon DSP V5: Dynamic HW multi-threading
Recover some performance when threads idle or stalled
• Remove a thread from IMT rotation
− On L2 cache misses
− When in wait-for-interrupt or off
mode
• Additional forwarding to support
2-cycle packets
Coremarks/
MHz
8
5
7
4.5
4
6
3.5
5
• VLIW packets with dependencies
between long latency instructions
will stall
− But many VLIW packets with
simple instructions can
complete in 2 processor clocks
3
4
2.5
3
2
1.5
2
1
1
0.5
0
0
IMT
Source: Qualcomm internal measurements
Dhrystone
DMIPS/MHz
DMT
IMT
DMT
Qualcomm Technologies, Inc. All Rights Reserved
13
Hexagon DSP instructions per cycle
Average Instructions / Cycle
5
Multi-Threaded Apps
4.5
4
3.5
3
2.5
2
1.5
Single-Threaded Apps
IPC_DMT
IPC_IMT
1
0.5
0
Source: Qualcomm internal measurements
Qualcomm Technologies, Inc. All Rights Reserved
14
Qualcomm Hexagon DSP architecture
BDTImark2000™/MHz
DSP Performance per MHz
Highly efficient mobile application processor—designed for more
performance per MHz
20
18
16
14
12
10
8
6
4
2
0
Clock Rate (MHz)
DSP Performance (BDTImark2000)
Mobile Competitor
Qualcomm HXGN V4 (1 thread)
Qualcomm HXGN V4 (3 threads)
430-520
100-233
300-700
4730-5720
1810-4220
5440-12660*
* - Projected best case score for 3-threads
Source: BDTI - For more detailed information see www.BDTI.com. All scores ©2013 BDTI
Qualcomm Technologies, Inc. All Rights Reserved
15
Hexagon DSP Power Benefits
Qualcomm Technologies, Inc. All Rights Reserved
16
MP3 playback power for competitive smartphones
Lower is better
Power
Competitor A
Qualcomm /
Competitor B
Hexagon-based
Competitor C
Competitor D
Competitor E
Competitor F
Competitor G
• Power measured at the battery for various phones
• Includes everything: DSP, CPU, memory, analog components, etc
Source: Qualcomm internal measurements
Qualcomm Technologies, Inc. All Rights Reserved
17
Computer vision offload – ARM/neon to Hexagon DSP
App CPU
Augmented Reality Java App finding objects in
image using FastCV Feature Detect
Comparison of Feature Detect run on:
• App CPU (ARM/Neon)
• App DSP (Hexagon)
Augmented Reality
Java Application
Call Feature
Detect
FastCV Call Router
VeNum
App DSP
ARM/VeNum
FastCV Library
ARM Only
FastCV Library
Feature
Detect
Function
CPU Utilization (%)
52% Less CPU
Hexagon (QDSP6)
FastCV Library
Feature
Detect
Function
Feature
Detect
Function
Detection Time (%)
7% Less Time
Source: Qualcomm internal measurements. * Power measured at the device battery
Total Device Power (%)
32% Less Power*
Qualcomm Technologies, Inc. All Rights Reserved
18
Hexagon DSP power for different thread utilizations
• Excellent near-linear power scalability
(as threads go idle, power used by the thread is nearly eliminated)
• Achieved through optimized clock tree design & clock gating
Dhrystone Power,
IMT Mode
FIR Power,
IMT Mode
100%
100%
90%
90%
80%
80%
70%
70%
60%
60%
50%
50%
40%
30%
Actual
Ideal
40%
30%
20%
20%
10%
10%
0%
0%
Source: Qualcomm internal measurements
Actual
Ideal
Qualcomm Technologies, Inc. All Rights Reserved
19
Hexagon DSP Software Development
Qualcomm Technologies, Inc. All Rights Reserved
20
Independent Algorithm Developers on Hexagon DSP
Qualcomm Technologies, Inc. All Rights Reserved
21
Announcing the Hexagon DSP SDK
See the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)
Visit http://developer.qualcomm.com for more information.
Qualcomm Technologies, Inc. All Rights Reserved
22
Thank you
Follow us on:
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
©2013 Qualcomm Technologies, Inc.
Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States
and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other
product and brand names may be trademarks or registered trademarks of their respective owners.
Hexagon is a product of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. All Rights Reserved
23
© Copyright 2025 Paperzz