Submitted by:
Meni Orenbach
Roman Kaplan
Instructors:
Zvika Guz
Koby Gottlieb
Project Contents:
1. Prologue:
1.1. Introduction
1.2. Project goals
1.3. About Flake
2. First steps:
2.1. WAV format structure short explanation
2.2. FLAC format structure short explanation
2.3. The algorithm and program structures
2.4. Measuring the program via a benchmark
2.5. Running Vtune
2.6. Finding hot-spots
2.7. Setting our goals
3. Optimization of the code:
3.1. Understanding of the code
3.2. Parallelizing Flake
3.3. A little about SIMD
3.4. Working with SSE commands
3.5. SSE's total performance improvement
3.6. Micro-architectural issues
4. Epilogue:
4.1. Measuring performance after all changes
4.2. Working with the Intel compiler
4.3. Optimization summary
4.4. References
1. Prologue
1.1 Introduction
This project deals with speeding up an open-source program taken from the
web. The speedup is achieved via multithreading, SIMD instructions and
other code modification techniques.
1.2 Project goals
1. Finding a suitable program: one that is CPU-bound, runs as a single
thread, does not already contain SIMD instructions, and, most
importantly, is interesting.
2. Working with the Vtune analyzer and learning how to analyze programs
quickly and efficiently.
3. Enhancing the program so that it achieves a decent speedup compared to
the starting point, while keeping 1:1 bit compatibility with the
original version's output.
4. Contributing our improvements back to the open-source community.
1.3 About Flake
In our quest to find a decent program that would stand up to all the criteria we
set, we found a nice small program called 'Flake', written entirely in C.
As mentioned, Flake is an open-source program that encodes .wav format
audio files into .FLAC format audio files.
FLAC stands for Free Lossless Audio Codec, which means Flake takes
an audio file and compresses it into a significantly smaller audio file without
losing any data in the process.
Flake can encode wav files in different manners that are controlled by the
user via the command line (e.g. deciding the compression ratio of a wav file by
choosing the desired block size parameter, etc.).
2. First Steps
2.1 Wav format structure short explanation
Wav files use the standard RIFF structure, which groups the file's contents
(sample format, digital audio samples, etc.) into separate chunks, each containing
its own header and data bytes. The chunk header specifies the type and size of
the chunk's data bytes.
This organization allows programs that do not use or recognize
particular types of chunks to easily skip over them and continue processing
the known chunks that follow.
There are quite a few types of chunks defined for Wav files. Many Wav files
contain only two of them, specifically the Format Chunk and the Data Chunk.
These are the two chunks needed to describe the format of the digital audio
samples and the samples themselves.
The format chunk ('fmt') contains information about how the waveform data is
stored and should be played back, including the type of compression used,
number of channels, sample rate, bits per sample and other attributes.
The number of channels specifies how many separate audio signals are
encoded in the wav data chunk. Multi-channel digital audio samples are stored
as interlaced wave data, which simply means that the audio samples of a
multi-channel (such as stereo and surround) wav file are stored by cycling through
the audio samples of each channel before advancing to the next sample time.
(Figure: example of the RIFF chunk structure of a wav file.)
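To make the layout concrete, here is a minimal C sketch of the generic chunk header and the main fields of the format chunk; the type and field names are our own illustration, not taken from Flake's source:

    #include <stdint.h>

    /* Generic RIFF chunk header: every chunk starts with these 8 bytes. */
    typedef struct {
        char     id[4];    /* chunk type, e.g. "fmt " or "data" */
        uint32_t size;     /* number of data bytes following the header */
    } RiffChunkHeader;

    /* Body of the format ("fmt ") chunk for uncompressed PCM audio. */
    typedef struct {
        uint16_t format_tag;      /* 1 = PCM (uncompressed) */
        uint16_t channels;        /* 1 = mono, 2 = stereo, ... */
        uint32_t sample_rate;     /* samples per second, e.g. 44100 */
        uint32_t byte_rate;       /* sample_rate * block_align */
        uint16_t block_align;     /* channels * bits_per_sample / 8 */
        uint16_t bits_per_sample; /* e.g. 16 */
    } WavFormatChunk;

A reader that encounters a chunk id it does not recognize simply seeks forward by 'size' bytes to reach the next chunk header.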
2.2 FLAC format structure short explanation
FLAC is specifically designed for efficient packing of audio data, unlike
general-purpose lossless algorithms such as ZIP and gzip. While ZIP may compress a
CD-quality audio file by 10-20%, FLAC achieves compression rates of 30-50% for
most music, with significantly greater compression for voice recordings.
FLAC uses linear prediction to convert the audio samples into a series of small,
uncorrelated numbers (known as the residual), which are stored efficiently using
Golomb-Rice coding (a data compression scheme based on entropy encoding).
The FLAC format exploits the fact that audio data typically has a high degree of
sample-to-sample correlation in order to achieve a high degree of compression.
2.3 The algorithm and program structures
Flake is written entirely in C. We started by compiling Flake on Windows
as a Visual Studio 2008 project and got an executable file.
We then saw that Flake indeed takes a wav file and compresses it into a
smaller FLAC file that is playable with mainstream audio players.
Afterwards we skimmed the code, trying to understand the program flow.
The idea behind the algorithm is to take a wav file as input from the user,
read it frame by frame and encode each frame's channels separately. The
program then writes the encoded version into a byte array and eventually to a
new, much smaller FLAC file.
The encoding process is done with the following algorithm:
A FLAC file includes several METADATA blocks. The decoder is allowed to skip
any METADATA block it does not understand, except for the STREAMINFO block,
which contains the sample rate, the number of channels, the maximum block size
and the MD5 checksum that can verify the file's integrity.
After the METADATA blocks comes an AUDIO DATA block that contains the
encoded audio data. Like most audio codecs, FLAC splits the unencoded audio
data into blocks and encodes each block separately.
The block size is an important parameter of the encoding process. Too small,
and the frame overhead will lower the compression rate; too large, and the
modeling stage of the compressor will not be able to generate an efficient
model. The model is an approximation of the signal via a function that requires
fewer bits per sample to encode.
FLAC's approximation can be done in two ways:
Fitting a simple polynomial to the signal.
General linear predictive coding (also called LPC).
The first method is faster but less accurate than the second (LPC), as
illustrated by the sketch below.
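To make the first method concrete, here is a minimal sketch of a fixed polynomial predictor, assuming a second-order predictor (FLAC's fixed predictors range over orders 0 to 4); the function name is ours, not Flake's:

    /* Order-2 fixed predictor: predict x[i] as 2*x[i-1] - x[i-2]
       and emit the residual; n is the block length (n >= 2). */
    static void fixed_order2_residual(const int32_t *x, int32_t *res, int n)
    {
        int i;
        res[0] = x[0];    /* warm-up samples are stored verbatim */
        res[1] = x[1];
        for (i = 2; i < n; i++)
            res[i] = x[i] - 2*x[i-1] + x[i-2];
    }

For slowly varying audio the residual values are much smaller than the samples themselves, which is exactly what makes them cheap to encode.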
Once the model is generated, the encoder subtracts the approximation from the
original signal to get the residual (error) signal. The error signal is then
losslessly coded. To do this, FLAC takes advantage of the fact that the error
signal generally has a Laplacian (two-sided geometric) distribution, and that
there is a set of special Huffman codes called Rice codes that can encode these
kinds of signals quickly, efficiently and without needing a dictionary.
Rice coding involves finding a single parameter that matches the signal's
distribution, and then using that parameter to generate the codes, roughly as in
the sketch below.
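As a rough sketch of the idea (our own illustration, not Flake's implementation): given the parameter k, a non-negative value v is split into a quotient v >> k, sent in unary, and the k low-order bits of v, sent verbatim. The helpers write_bit()/write_bits() are hypothetical stand-ins for a real bit writer:

    /* Rice-encode one non-negative value with parameter k (sketch). */
    static void rice_encode(uint32_t v, int k)
    {
        uint32_t q = v >> k;                  /* quotient, sent in unary */
        uint32_t i;
        for (i = 0; i < q; i++)
            write_bit(0);                     /* q zero bits ... */
        write_bit(1);                         /* ... terminated by a one */
        write_bits(v & ((1u << k) - 1), k);   /* k low-order remainder bits */
    }

A well-chosen k makes the unary part short for typical residual values, which is why the encoder searches for the parameter that best matches the distribution.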
An audio frame is preceded by a frame header and trailed by a frame footer. The
header starts with a sync code and contains the minimum information necessary
for a decoder to play the stream, like sample rate, bits per sample, etc. It also
contains the block or sample number and an 8-bit CRC of the frame header
(sketched below). The sync code, frame header CRC, and block/sample number
allow resynchronization and seeking even in the absence of seek points. The
frame footer contains a 16-bit CRC of the entire encoded frame for error
detection. If the reference decoder detects a CRC error it will generate a
silent block.
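For reference, a bitwise sketch of such an 8-bit CRC. We believe FLAC's frame-header CRC uses the polynomial x^8 + x^2 + x + 1 (0x07) with an initial value of 0, but treat these constants as assumptions rather than a statement of the spec:

    /* Bitwise CRC-8; poly 0x07 and init 0 are assumed FLAC parameters. */
    static uint8_t crc8(const uint8_t *buf, int len)
    {
        uint8_t crc = 0;
        int i, b;
        for (i = 0; i < len; i++) {
            crc ^= buf[i];
            for (b = 0; b < 8; b++)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                                   : (uint8_t)(crc << 1);
        }
        return crc;
    }

The 16-bit frame-footer CRC works the same way with a wider register and a different polynomial.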
It is important to say that just by reading the code we did not fully understand
what every function does and where the program spends most of its time, until
we ran the Vtune Analyzer.
Side note: we will later show a call graph image taken from Vtune that explains the
program structure and execution order.
2.4 Measuring the program via a benchmark
A benchmark is a tool needed in order to compare different versions of
the same program. In our case we needed a .wav file large enough in size
that we could see a significant difference while optimizing Flake's
performance. It was also needed in order to check that Flake is CPU-bound,
which was one of the criteria for choosing it.
Our platform is an Intel 64-bit Core 2 Duo 2.4 GHz processor with two
cores, 2 GB of RAM and the Windows XP operating system.
The benchmark we used is called 'Yes.wav' and it is a 238 MB song.
We measured Flake with our benchmark and saw that it took
T = 105.156 sec to complete the encoding.
We can now measure the speedup we gain by optimizing Flake's
code in comparison to this initial T = 105.156 sec baseline.
Side note: we ran Flake with the highest compression level, using the
command-line parameters -h and -12, which set: block size = 4608, prediction
type = Levinson-Durbin recursion, prediction order = max, prediction order
selection method = "full search", Rice partition order = 8, and stereo
decorrelation method = "mid-side".
2.5 Running Vtune
At first we studied Vtune from its tutorial. We learned how it collects data on
a program, which events it can measure, and how all of that can benefit us in
understanding Flake's code.
The first run of Vtune was done as a "Quick Performance Wizard" project and
gave us a call graph, sampling results, and counter monitor information.
(Figure: the call graph result.)
From the call graph we were able to visually understand the control flow
between the functions and better understand how Flake works.
We also found out which of the functions are the most time consuming.
(Figure: the sampling result.)
Using the sampling results we were able to find the hot-spot functions' system
information (such as CPU_CLK_UNHALTED.CORE samples,
INSTRUCTION_RETIRED samples and
CLOCKS_PER_INSTRUCTION_RETIRED - CPI) and to accurately measure our
benchmark's execution time (currently 105.156 sec to complete the encoding).
Lastly, we skimmed through the counter monitor results and figured out how
Flake utilizes the system's memory, along with other parameters such as the
number of memory pages it occupies while running, etc.
2.6 Finding Hot-spots
As stated above, we used Vtune to find the hot-spots in Flake's code,
and we decided to start optimizing there, because they took most of the
execution time.
We focused on 3 main functions which took most of the encoding process time
(the rest were nowhere near as influential).
Those 3 functions were (from longest to shortest):
encode_residual_lpc()
calc_rice_params()
compute_autocorr()
We then looked at the code of each function, understood how they work and
considered how to optimize them.
We decided to optimize calc_rice_params() and compute_autocorr() using SIMD
instructions because their loop structure could be easily unrolled, as we will
see in chapter 3. When we tried to implement SIMD instructions in
encode_residual_lpc(), we did not get any speedup, because its loop structure
could not be unrolled, which prevented us from implementing SIMD
instructions efficiently.
Side note: we will further explain how we optimized these functions in the proper
chapter of this book.
2.7 Setting our goals
In this project we aim to achieve the following goals:
Optimizing Flake, thereby achieving a lower number of clock cycles needed to
complete the encoding process.
Better utilization of system resources such as multiple cores (done by
parallelizing the program), system memory (reducing the memory needed for
Flake's execution) and SSE technology.
Retaining 1:1 bit compatibility between our optimized version's output and
the original's output.
Integrating our version back into the open source community.
3. Optimization of the code
3.1 Understanding of the code
When we finished going over the code and the call graph, we decided to draw
a flow chart to make things a little more understandable:
(Figure: flow chart of Flake's encoding process.)
As we can see from the flow chart, each file is split into blocks called 'frames'.
Every frame contains the audio samples divided into channels, and every
channel is encoded separately. When the encoding process is finished for every
channel, the results are written together into the output file.
3.2 Parallelizing Flake
One route to performance enhancement is parallelization of serial code. In our
case, Flake was managed by one thread, and with the help of the WIN32 API
thread functions we were able to speed up our program's computation. Before we
started writing the code, we needed to find a place to parallelize Flake, so we
ran a Vtune quick analyzer project to check Flake's call graph and sampling;
this way we would know the program's structure and where it is best to
parallelize.
We then studied the subject of parallelizing an audio encoder and found a few
common ways to do so:
Parallelizing the reads from the input file and/or the writes to the
output file.
Parallelizing the encoding process for each frame.
Parallelizing the encoding process for each channel of each frame.
After looking at the output we got from Vtune, and at the possible ways to
parallelize a program, we decided that it is most suitable for Flake to be
parallelized by encoding each channel separately.
To do so we located the place where the encoding takes place. It starts in the
following way:
for (ch=0; ch<ctx->channels; ch++) {
    if (encode_residual(ctx, ch) < 0) {
        return -1;
    }
}
After quickly examining the code, we found out that ctx is a structure that
holds the information needed for the encoding phase.
Luckily this data was already divided per channel, so we decided to create a
structure, passed to each encoding thread, that holds the following items:
typedef struct pa {
    int ch, type;
    FlacEncodeContext* ctx;
} erParams;
It was done this way because there is no common data shared between our worker
threads, so we do not have to use mutexes (and do not suffer from the overhead
they cause), and we can still pass our threads all the information they need
without any collisions. We also kept ctx as a pointer, since that saved us
yet another overhead of copying the data for each channel's encoding phase.
Note: we tried copying ctx into two different structures, because we were afraid
that the data used by the two threads shares the same cache line, causing cache
misses (false sharing). However, the overhead of the copy process was too big,
so we did not use that method.
This structure is used to pass our threads the information needed for the
encoding phase, and it suits our needs given the following parallelization
method:
At first we tried the trivial way, meaning we created as many threads as there
are channels in the given .wav file. We then made every thread run the encoding
function and terminate.
The code for this method:
HANDLE chthread[NUM_OF_CH];
erParams* thread_params[NUM_OF_CH];

for (i=0; i<chNumber; i++)
{
    thread_params[i] = (erParams*) malloc(sizeof(erParams));
    thread_params[i]->ch = i;
    thread_params[i]->ctx = fec;
    if (type == ENCODE_REGULAR)
        chthread[i] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)
            thread_encode_residual, (LPVOID) thread_params[i], 0, &threadId);
    else
        chthread[i] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)
            thread_encode_residual_verbatim, (LPVOID) thread_params[i], 0, &threadId);
}
And each thread function now looks like this:

LPVOID thread_encode_residual_verbatim(LPVOID param)
{
    FlacEncodeContext* ctx = ((erParams*) param)->ctx;
    int ch = ((erParams*) param)->ch;
    reencode_residual_verbatim(ctx, ch);
    return NULL;
}

LPVOID thread_encode_residual(LPVOID param)
{
    FlacEncodeContext* ctx = ((erParams*) param)->ctx;
    int ch = ((erParams*) param)->ch;
    encode_residual(ctx, ch);
    return NULL;
}
However, this solution was not a good one. When we ran Intel's Vtune Thread
Checker, we saw a massive overhead caused by constantly creating and
destroying threads.
(Figure: Thread Checker's result for the proposed solution.)
Note: there were a lot of open threads, as you can deduce from the scroll bar's length.
(Figure: the sampling result for the run time of the new solution.)
The speedup achieved by this solution is 105.156 / 62.469 = 1.683.
We then sought a way to minimize this overhead by reusing the same threads
for different channels throughout the entire encoding process.
We investigated ways of doing so, and in the end decided to use the ThreadPool
API, which was written for Windows XP and above (mainly used in Vista). This
led to yet another improvement in our program's speedup. We used an Event
variable to synchronize between our working threads, and we added it to the
structure we pass to each thread.
The main idea behind this code was to have each working thread deal with each
channel's encoding separately (we can do this since the number of channels is
constant during the entire encoding phase of a .wav file).
The code was then changed to look like this:

void startMultiThreading(FlacEncodeContext* fec, int chNumber, int type)
{
    int i;
#if defined (_WIN32) && defined (USE_THREADS)
    BOOL res;
    erParams* thread_params[NUM_OF_CH];
    for (i=0; i<chNumber; i++)
    {
        thread_params[i] = (erParams*) malloc(sizeof(erParams));
        thread_params[i]->ch = i;
        thread_params[i]->ctx = fec;
        thread_params[i]->haha = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (type == ENCODE_REGULAR)
            res = QueueUserWorkItem(
                thread_encode_residual,            //callback function
                (PVOID) thread_params[i],          //arguments passed to the function
                WT_EXECUTEINPERSISTENTTHREAD);
        else
            res = QueueUserWorkItem(
                thread_encode_residual_verbatim,   //callback function
                (PVOID) thread_params[i],          //arguments passed to the function
                WT_EXECUTEINPERSISTENTTHREAD);
    }
    //Wait for all threads to finish here, meaning: 'thread_join'.
    for (i=0; i<chNumber; i++)
        WaitForSingleObject(thread_params[i]->haha, INFINITE);
    //Finished encoding, now "close" all threads.
    for (i=0; i<chNumber; i++)
        free(thread_params[i]);
#endif
}
DWORD CALLBACK thread_encode_residual_verbatim(PVOID param)
{
    FlacEncodeContext* ctx = ((erParams*) param)->ctx;
    int ch = ((erParams*) param)->ch;
    reencode_residual_verbatim(ctx, ch);
    SetEvent(((erParams*) param)->haha);
    return 0;
}

DWORD CALLBACK thread_encode_residual(PVOID param)
{
    FlacEncodeContext* ctx = ((erParams*) param)->ctx;
    int ch = ((erParams*) param)->ch;
    encode_residual(ctx, ch);
    SetEvent(((erParams*) param)->haha);
    return 0;
}
Yet this solution was not sufficient either. When we ran Intel's Vtune Thread
Checker we saw that although fewer threads were created, there were still more
threads open than the number of channels in the input wave file (2 channels in
our benchmark). We tried changing the parameters given to the ThreadPool API,
yet we did not succeed in making it open only as many threads as we needed
(and no more).
Note: only 4 threads were opened, which was a great improvement but still not enough.
To conclude, we still could not control how many threads were opened when
using the ThreadPool API, so we decided to implement the mechanism ourselves
with the basic thread functions and the Event synchronization object (which can
be used much like a mutex).
The structure passed to the threads was changed to the following:

typedef struct pa {
    int ch, type;
    FlacEncodeContext* ctx;
    HANDLE threadHandler, endEvent, waitEvent;
} erParams;

Since we now need to synchronize all the threads by ourselves, we needed more
event variables; we also needed to save in each structure the handle of the
thread that uses it.
The idea behind the following code resembles the one we used in the former
implementation (using the ThreadPool API), meaning we create a fixed number of
threads up front (one per core, as explained below) and we do not close them
until the entire encoding process is done.
We now take note of how many cores our computer has, using the following calls:

    SYSTEM_INFO siSysInfo;
    GetSystemInfo(&siSysInfo);
    coreNumber = siSysInfo.dwNumberOfProcessors;   /* number of logical cores */

This way we will not open more threads than the number of cores our computer
has, and will not cause a slowdown by oversubscribing the machine.
The code was changed to the following (we call this function at the very
beginning of the program):

void startMultiThreading(HANDLE** threads_arr, erParams** params_arr, int coreNumber)
{
#if defined (_WIN32) && defined (USE_THREADS)
    int i;
    DWORD threadId;
    for (i=0; i<coreNumber; i++)
    {
        /* Initializing the PVOID parameter which will be sent to the thread function. */
        params_arr[i]->ch = i;
        params_arr[i]->type = ENCODE_REGULAR;   //equals 0.
        params_arr[i]->ctx = NULL;
        params_arr[i]->threadHandler = (*threads_arr[i]);
        params_arr[i]->endEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
        params_arr[i]->waitEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
        (*threads_arr[i]) = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)
            thread_encode_residual, (LPVOID) params_arr[i], 0, &threadId);
    }
#endif
}
In this function we start our threads and initialize their erParams structures
(used to pass them information later on).
Note: we save the thread and erParams handles so that we can use them later,
when we actually need to do the encoding.
We call the following function after the encoding is finished and we no longer
need the threads, so we close them and release the resources we used:

void stopMultiThreading(HANDLE** threads_arr, erParams** params_arr, int coreNumber)
{
#if defined (_WIN32) && defined (USE_THREADS)
    int i;
    for (i=0; i<coreNumber; i++)
    {
        CloseHandle(*threads_arr[i]);
        CloseHandle(params_arr[i]->endEvent);
        CloseHandle(params_arr[i]->waitEvent);
        free(threads_arr[i]);
        free(params_arr[i]);
    }
#endif
}
We call this function every time we encode a frame (meaning it is called as
many times as there are frames in the .wav file being encoded):

void encodeMT(erParams** params_arr, FlacEncodeContext* fec, int chNumber, int type,
              int coreNumber)
{
#if defined (_WIN32) && defined (USE_THREADS)
    int i=0, j, min;
    min = (coreNumber<chNumber) ? coreNumber : chNumber;
    while (i < chNumber)
    {
        for (j=0; ((j<coreNumber) && (i<chNumber)); j++, i++)
        {
            params_arr[j]->type = type;
            params_arr[j]->ctx = fec;
            params_arr[j]->ch = i;
            SetEvent(params_arr[j]->waitEvent);
        }
        /*
           The threads are now encoding each channel separately.
           Waiting until they are done!
        */
        for (j=0; j<min; j++)
        {
            WaitForSingleObject(params_arr[j]->endEvent, INFINITE);
        }
    }
#endif
}
In this function we set the correct parameters for the encoding threads and set
an event to let each thread know it can start encoding (since we have supplied
the threads with new data to process). We then wait on the end events to know
that every thread has finished, and then we can continue writing the data to
the output file.
This is the function that every thread runs from the moment it is created
until it is closed down:

DWORD CALLBACK thread_encode_residual(LPVOID param)
{
    FlacEncodeContext* ctx;
    int ch, type;
    while (1)
    {
        WaitForSingleObject(((erParams*) param)->waitEvent, INFINITE);
        ctx = ((erParams*) param)->ctx;
        ch = ((erParams*) param)->ch;
        type = ((erParams*) param)->type;
        if (type == ENCODE_REGULAR)
            encode_residual(ctx, ch);
        else
            reencode_residual_verbatim(ctx, ch);
        /* Finished, now signaling the main thread. */
        SetEvent(((erParams*) param)->endEvent);
    }
    return 0;
}
This function waits until an event occurs; then the thread knows there is new
data to encode and sends it to the appropriate encoding function
(encode_residual() or reencode_residual_verbatim(), depending on the encoding
type), and finally it sets an end event to let the main thread know it has
finished encoding.
When we finished writing the code and checking that we maintained 1:1 bit
compatibility with the original version, we used Intel's Vtune Thread Checker,
and this time we were satisfied: only 2 encoding threads were opened (since our
benchmark is a wave file with 2 channels, and our computer has 2 cores):
(Figure: Thread Checker's result for the final version.)
Note: 3 threads in total were opened, 2 for encoding and 1 which waits and
writes the encoded data to the output file.
We then ran a Vtune quick performance analysis to measure the speedup we
gained from parallelizing Flake:
(Figure: Vtune performance results for the multithreaded version.)
From the data above we realized that we gained a speedup of
105.156 / 56.376 = 1.865 by using MT in Flake.
After examining the sampling results from Vtune, we saw that the original
program executed 183,329 instructions (that is the Instructions Retired value,
which counts the actual instructions the program executed!). After the
parallelization the new instruction count, as seen from Vtune's sampling, was
183,620, which is almost the same as the original. Thus we conclude that the
threads added only minimal overhead to Flake, while reaching a great speedup
of x1.865.
Note: the best speedup we could have reached is x2.0; however, the program
still has serial parts that we cannot parallelize in our design, so we achieved
the best result possible for our design, as the back-of-the-envelope check
below shows.
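As a back-of-the-envelope check (our own addition, using Amdahl's law): if a fraction p of the run time is parallelized across N cores, the expected speedup is S = 1 / ((1 - p) + p/N). Solving 1.865 = 1 / ((1 - p) + p/2) gives p ≈ 0.93, so roughly 93% of the original run time was parallelized, and the remaining ~7% of serial work is what keeps the speedup below x2.0.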
To conclude, we show the speedup gain in a graph:
(Figure: "SpeedUp" chart with bars for Original, MT1 and MT2.)
3.3 A little about SIMD
SIMD stands for Single Instruction Multiple Data. It is a way of packing N
(usually a power of 2) identical operations (e.g. 8 adds) into a single
instruction. The data for the instruction's operands is packed into registers
capable of holding 128 bits, which is four 32-bit integers or two
double-precision FP values. The advantage of this format is that for the cost
of a single instruction, N instructions' worth of work is performed. This can
translate into very large speedups for parallelizable algorithms.
(Figure: visual example of a SIMD operation.)
Intel's implementation of the idea is an instruction set called SSE, which
stands for Streaming SIMD Extension. SSE's addition came with a new set of
128-bit registers, called XMMi (i is the register's index, for example XMM0,
XMM5). SSE technology developed gradually, starting with SSE, which was first
introduced in the Pentium III processor. Later came SSE2 as a major enhancement
to SSE, and afterwards SSE3, SSSE3 (Supplemental SSE3) and SSE4.
In our project we used mostly SSE and SSE2 instructions. SSE instructions
mostly deal with integer types and single-precision floating point; SSE2
instructions mostly deal with double-precision floating point.
The kind of SIMD instructions we used are "intrinsics": assembly instructions
wrapped in C functions. Such a function passes its arguments to the appropriate
assembly instruction, executes it directly, and returns the appropriate value,
as illustrated below.
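As a minimal, self-contained illustration of the intrinsics style (our own example, not code from Flake):

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Adds two arrays of four 32-bit integers with one SSE2 instruction. */
    void add4(const int32_t *a, const int32_t *b, int32_t *out)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)a); /* load 4 ints (unaligned) */
        __m128i vb = _mm_loadu_si128((const __m128i*)b);
        __m128i vc = _mm_add_epi32(va, vb);              /* 4 additions at once */
        _mm_storeu_si128((__m128i*)out, vc);             /* store the 4 results */
    }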
3.4 Working with SSE commands
In order to implement SIMD instructions in the code, we needed to find
functions that consume a considerable amount of time doing a lot of repetitive
calculations (mostly "for" loops) on large arrays of data. Using Vtune (see
section 2.5) we found two functions that satisfy these conditions. The first
function is calc_rice_params(), which contains the following loop (all the
variables are integers):

    for (i=0; i<n; i++) {
        udata[i] = (2*data[i]) ^ (data[i]>>31);
    }

A typical value for n is 4608. This loop takes an integer, shifts it left by
1 bit, arithmetically shifts it right by 31 bits and XORs the two results;
this is the standard zigzag mapping that turns signed values into non-negative
ones before Rice coding. The SIMD code we wrote:
for (i=0; i<n; i+=4) {
    temp1 = _mm_load_si128((__m128i*)(data+i));    //Loads 4 integers
    temp2 = _mm_slli_epi32(temp1, 1);              //Shifts temp1 left by 1
    temp3 = _mm_srai_epi32(temp1, 31);             //Shifts temp1 right by 31
    temp1 = _mm_xor_si128(temp2, temp3);           //Bitwise XOR
    _mm_store_si128((__m128i*)(udata+i), temp1);   //Stores the result
}
for (i=(n - (n%4)); i<n; i++) {                    //Leftovers from the unrolled loop
    udata[i] = (2*data[i]) ^ (data[i]>>31);
}
The array 'data' was unaligned, so we aligned it in order to load data from
memory into SSE registers using the faster '_mm_load_si128' instruction (which
works only on 16-byte-aligned memory) instead of '_mm_loadu_si128'. In addition
we aligned the 'udata' array, which is defined right above the first "for"
loop, so we can also store the calculation results using the faster
_mm_store_si128 instruction instead of _mm_storeu_si128.
Original allocation: udata = malloc(n * sizeof(uint32_t));
New allocation: udata = _aligned_malloc(n * sizeof(uint32_t), 16);
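One related caveat (standard MSVC CRT behavior, noted here as our own addition): memory obtained with _aligned_malloc must be released with _aligned_free rather than with plain free, so the matching cleanup call changes as well:

    _aligned_free(udata);   /* free() must not be used on _aligned_malloc'ed memory */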
The Vtune results for the new code were:
The time spent in this function was: 28.144 sec.
The new time spent in this function is: 24.629 sec.
Hence, the speedup we got for this function alone is 28.144 / 24.629 = 1.14,
and the overall speedup for the entire program is 105.156 / 101.422 = 1.037.
The second function we optimized was compute_autocorr(), which has an inline
function integrated within it called apply_welch_window(). Both of these
functions contain multiple floating-point calculations that can be vectorized.
The function we started working on was apply_welch_window(), because it is the
first to do calculations. It has one main loop with up to about 65000
iterations. Each iteration has conversions from integer to double FP and FP
multiplications, so it is suitable for SSE2 instruction integration.
The original code is:

    c = (2.0 / (len - 1.0)) - 1.0;
    for (i=0; i<(len >> 1); i++) {
        w = 1.0 - ((c-i) * (c-i));
        w_data[i] = data[i] * w;
        w_data[len-1-i] = data[len-1-i] * w;
    }

Notes:
- The variable 'c' is constant through the entire loop.
- 'data' is an array of 32-bit signed integers.
- 'w_data' is an array of double-precision FP.
As we can see from the original code, in order to use SSE instructions we need
to unroll the loop by 4 (an XMM register can hold four 32-bit integers) and do
the calculations in parallel.
The new code is:
c = (2.0 / (len - 1.0)) - 1.0;                        //Same as in the old code
c_d    = _mm_set1_pd(c);                              //Loaded into an xmm register
four   = _mm_set1_pd(4.0);
one    = _mm_set1_pd(1.0);
j_low  = _mm_set_pd(1.0, 0.0);
j_high = _mm_set_pd(3.0, 2.0);
for (i=0; i<n; i+=4) {
    w_d_low  = _mm_sub_pd(c_d, j_low);
    w_d_high = _mm_sub_pd(c_d, j_high);
    w_d_low  = _mm_mul_pd(w_d_low, w_d_low);
    w_d_high = _mm_mul_pd(w_d_high, w_d_high);
    w_d_low  = _mm_sub_pd(one, w_d_low);
    w_d_high = _mm_sub_pd(one, w_d_high);
    iup_align = _mm_load_si128((__m128i*)(data+i));   //Loading the integers
    fpup = _mm_cvtepi32_pd(iup_align);                //Converting the integers
    fpup = _mm_mul_pd(fpup, w_d_low);                 //Multiplying fp with w
    _mm_store_pd(w_data+i, fpup);                     //Storing the fp values
    iup_align = _mm_shuffle_epi32(iup_align, _MM_SHUFFLE(1,0,3,2));
    fpup = _mm_cvtepi32_pd(iup_align);                //Converting the integers
    fpup = _mm_mul_pd(fpup, w_d_high);                //Multiplying fp with w
    _mm_store_pd(w_data+i+2, fpup);                   //Storing the fp values
    idown = _mm_loadu_si128((__m128i*)(data+len-i-4));
    w_d_low = _mm_shuffle_pd(w_d_low, w_d_low, _MM_SHUFFLE2(0,1));
    idown = _mm_shuffle_epi32(idown, _MM_SHUFFLE(1,0,3,2));
    fpdown = _mm_cvtepi32_pd(idown);
    fpdown = _mm_mul_pd(fpdown, w_d_low);
    _mm_store_pd(w_data+len-i-2, fpdown);
    j_low = _mm_add_pd(four, j_low);
    idown = _mm_shuffle_epi32(idown, _MM_SHUFFLE(1,0,3,2));
    w_d_high = _mm_shuffle_pd(w_d_high, w_d_high, _MM_SHUFFLE2(0,1));
    fpdown = _mm_cvtepi32_pd(idown);
    fpdown = _mm_mul_pd(fpdown, w_d_high);
    _mm_store_pd(w_data+len-i-4, fpdown);
    j_high = _mm_add_pd(four, j_high);
}
for (i=(n - (n%4)); i<n; i++) {                       //Leftovers from the unrolled loop
    w_low = 1.0 - ((c-i) * (c-i));
    w_data[i] = data[i] * w_low;
}
There are two arrays involved, 'data' and 'wdata': 'data' is an array of 32-bit
integers and 'wdata' is an array of double-precision FP. In the original code
these arrays were not aligned, so we had to load and store using only the
'unaligned' instructions. Later we aligned the arrays using the following code:
Note: 'data1' is passed from compute_autocorr() as 'wdata'.
Original code: data1 = malloc((len+16) * sizeof(double));
New code: data1 = _aligned_malloc((len+16) * sizeof(double), 16);
Later, we optimized the function compute_autocorr(). This function uses the
array 'data1' that apply_welch_window() calculated ('wdata' in
apply_welch_window()) in order to do its own work.
The original code is:
for (i=0; i<=lag; ++i) {
    temp = 1.0;
    temp2 = 1.0;
    for (j=0; j<=lag-i; ++j)
        temp += data1[j+i] * data1[j];
    for (j=lag+1; j<=len-1; j+=2) {
        temp += data1[j] * data1[j-i];
        temp2 += data1[j+1] * data1[j+1-i];
    }
    autoc[i] = temp + temp2;
}
As we can see, there is one main loop with index 'i' (up to 34 iterations,
depending on the input parameters) which holds two internal loops with index
'j'. The first is short (up to 34 iterations) and the second has up to about
65000 iterations per value of 'i'. Every loop has multiplication instructions
in it, so we can integrate SSE instructions into the code. Because the second
'j' loop is already unrolled by 2, we needed to also unroll the first 'j' loop
and then unroll the main 'i' loop.
The new code is:
one = _mm_set1_pd(1.0);
for (i=0; i<=lag-1; i+=2) {
    c_low = c_high = one;
    temp = 0.0;
    for (j=0; j<=lag-i-1; j+=2) {
        a_high = a_low = _mm_load_pd(data1+j);
        b_low = _mm_load_pd(data1+j+i);
        a_low = _mm_mul_pd(a_low, b_low);
        c_low = _mm_add_pd(a_low, c_low);
        if (j != lag-i-1) {
            b_high = _mm_loadu_pd(data1+j+i+1);
            a_high = _mm_mul_pd(a_high, b_high);
            c_high = _mm_add_pd(a_high, c_high);
        }
        else {
            if (lag % 2 == 1) {
                temp = data1[lag] * data1[lag-i-1];
                e = _mm_set_pd(0.0, temp);
                c_high = _mm_add_pd(e, c_high);
            }
        }
        if ((j == lag-i-2) && (lag % 2 == 0)) {
            temp = data1[lag] * data1[lag-i];
            e = _mm_set_pd(0.0, temp);
            c_low = _mm_add_pd(e, c_low);
        }
    }
    for (j=lag+1; j<=len-1; j+=2) {
        if (lag % 2 == 0) {
            a_high = a_low = _mm_loadu_pd(data1+j);
            b_low = _mm_loadu_pd(data1+j-i);
            b_high = _mm_load_pd(data1+j-i-1);
        }
        else {
            a_high = a_low = _mm_load_pd(data1+j);
            b_low = _mm_load_pd(data1+j-i);
            b_high = _mm_loadu_pd(data1+j-i-1);
        }
        a_low = _mm_mul_pd(a_low, b_low);
        c_low = _mm_add_pd(a_low, c_low);
        a_high = _mm_mul_pd(a_high, b_high);
        c_high = _mm_add_pd(a_high, c_high);
    }
    a_low = _mm_shuffle_pd(c_low, c_low, _MM_SHUFFLE2(0,1));
    a_low = _mm_add_pd(c_low, a_low);                    //Merging calculations
    a_high = _mm_shuffle_pd(c_high, c_high, _MM_SHUFFLE2(0,1));
    a_high = _mm_add_pd(c_high, a_high);                 //Merging calculations
    e = _mm_shuffle_pd(a_low, a_high, _MM_SHUFFLE2(1,0));
    _mm_store_pd(autoc+i, e);
}
if (lag % 2 == 0) {                                      //One extra loop needed when lag is even
    c_low = one;
    temp = data1[lag] * data1[0];
    e = _mm_set_pd(0.0, temp);
    c_low = _mm_add_pd(e, c_low);
    for (j=lag+1; j<=len-1; j+=2) {
        a_low = _mm_loadu_pd(data1+j);
        b_low = _mm_loadu_pd(data1+j-lag);
        a_low = _mm_mul_pd(a_low, b_low);
        c_low = _mm_add_pd(a_low, c_low);
    }
    b_low = _mm_shuffle_pd(c_low, c_low, _MM_SHUFFLE2(0,1));
    b_low = _mm_add_pd(b_low, c_low);
    _mm_storel_pd(autoc+lag, b_low);
}
When we were done writing and correcting the code, we ran Vtune to see the
improvement we got. The results were:
The time spent in this function was: 5.296 sec.
The new time spent in this function is: 2.753 sec.
Hence, the speedup we got for this function alone is 5.296 / 2.753 = 1.92,
and the overall speedup for the entire program is 105.156 / 101.917 = 1.032.
3.5 SSE's total performance improvement
Now we need to see the total improvement we got by using SIMD in our program.
This is the conclusive Vtune output regarding the use of SSE instructions:
(Figure: conclusive Vtune output for the SSE version.)
The overall speedup we got using SIMD instructions is 105.156 / 98.483 = 1.068.
As we can see, the overall speedup isn't very high although we improved the
second and third most time consuming functions in our program. The main
speedup we gained was from MT, as we've seen earlier.
3.6 Micro-architectural issues
In order to find micro-architectural problems we used Vtune's sampling mode to
sample all the events of Flake's improved code. We then got the following
results from the Tuning Assistant:
(Figure: Tuning Assistant events summary.)
As we can see from the Tuning Assistant's events summary, there were no major
micro-architectural problems with the improved code.
4. Epilogue
4.1 Measuring performance after all changes
After having seen the speedup gained from each optimization method separately,
we now combine them all to get the total performance speedup achieved. This
includes multithreading and all the SIMD additions.
The final results are:
(Figure: Vtune results for the final version.)
The total speedup we got is 105.156 / 52.969 = 1.985.
All the changes and additions we made increased Flake's performance by a factor
of almost x2.0!
4.2 Working with Intel Compiler
We used the Intel compiler on our final version of the code to see whether it
would improve performance or not. These are the results we got from using it:
(Figure: Vtune results for the Intel-compiled build.)
As we can see, the run time is 62.031 seconds, which is slower than the result
we got using the Visual Studio 8.0 compiler, so it is actually a slowdown.
The slowdown is 52.969 / 62.031 = 0.854, i.e. almost a 15% slowdown, so we
will not use Intel's compiler on our final version of the code.
4.3 Optimization summary
Here we show all the results we got in a graphical manner, so we can easily see
the improvement process of Flake:
(Figure: "SpeedUp" chart with bars for Original, SIMD, MT, All Together and
Intel Compiler.)
As we can see from the speedup plot, the best results are achieved when we use
multithreading combined with SIMD instructions. The total speedup for both
optimizations together is almost x2.0. There is 1:1 bit compatibility between
the improved Flake's output and the original Flake's output, meaning we did not
compromise the output's correctness in exchange for speed.
We kept the option in our code to use the original version (i.e. to run without
SSE instructions and/or without multithreading), which might prove useful on
older processors.
4.4 References
http://softlab.technion.ac.il/
http://msdn.microsoft.com
http://en.wikipedia.org/wiki/
http://www.google.co.il/
http://arstechnica.com/
http://www.intel.com/
http://sourceforge.net/
http://www.sonicspot.com/guide/wavefiles.html/