Pipeline Parallelism Transcript
Pipeline Parallelism Transcript was developed by Michelle Buchecker. Additional contributions were
made by Cheryl Doninger, Glenn Horton, Merry Rabb, and Christine Riddiough. Editing and production
support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are trademarks of their respective companies.
Copyright © 2009 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of
America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written
permission of the publisher, SAS Institute Inc.
Book code E1414, course code RLSPCN09, prepared date 31Mar2009.
RLSPCN09_001
ISBN 978-1-59994-972-7
For Your Information
Table of Contents
Lecture Description
Prerequisites
Pipeline Parallelism
   1. Overview of Pipeline Parallelism
   2. Piping Code
   3. Considerations and Benchmarking
Appendix A: Demonstration Programs
   Using Pipeline Parallelism
Lecture Description
This is the fourth e-lecture of a five-lecture series on parallel processing with SAS/CONNECT software.
This lecture teaches you the code necessary to use SAS/CONNECT software to perform pipeline parallel
processing, executing dependent steps simultaneously without writing the data out to disk.
To learn more…
For information on other courses in the curriculum, contact the SAS Education
Division at 1-800-333-7660, or send e-mail to [email protected]. You can also
find this information on the Web at support.sas.com/training/ as well as in the
Training Course Catalog.
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at
1-800-727-3228 or send e-mail to [email protected]. Customers outside the
USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a
complete list of books and a convenient order form.
Prerequisites
Before listening to this lecture, you should be able to
• write DATA and PROC steps
• understand error messages in the SAS log and debug your programs
• log on to a remote SAS session through either a SAS spawner or a SAS script file
• use an RSUBMIT statement to submit code to a remote machine
• use a LIBNAME statement to access SAS data libraries.
Pipeline Parallelism
1. Overview of Pipeline Parallelism
2. Piping Code
3. Considerations and Benchmarking
Welcome to the Pipeline Parallelism e-lecture. My name is Michelle and I will be your instructor for this
lecture. I have been an instructor for SAS for over 15 years and my specialties include SAS/CONNECT
software.
Lectures Available
•  Introduction to Parallel Processing
•  Using Parallel Processing on a Single Machine (Scaling Up)
•  Using Parallel Processing Across Multiple Machines (Scaling Out)
•  Pipeline Parallelism
•  Managing Asynchronous Execution
This lecture series consists of five separate lectures. The first lecture is Introduction to Parallel
Processing. The second lecture is Using Parallel Processing on a Single Machine (Scaling Up). The third
lecture is Using Parallel Processing Across Multiple Machines (Scaling Out). The fourth lecture is
Pipeline Parallelism. And the fifth lecture is Managing Asynchronous Execution.
This is the fourth lecture in the series. We encourage you to listen to all five lectures to get a full
understanding of how to perform parallel processing.
In this lecture you will learn how to perform pipeline parallelism, which is executing dependent code
simultaneously without writing the data out to disk.
1. Overview of Pipeline Parallelism
I’m going to start off with an overview of pipeline parallelism, followed by the code you will need to
implement it, and lastly talk about some factors to take into consideration and the benchmarking we have
performed.
Objectives
•  Define pipeline parallelism.
•  Determine the benefits of pipeline parallelism.
•  Determine the requirements of pipeline parallelism.
In this section, I’ll define pipeline parallelism, talk about its benefits, and cover its requirements.
Technology Today
Prior to SAS®9, a step that created an output SAS data set wrote that
SAS data set to disk, which could be read by a subsequent step or
steps.
Starting in SAS®9, pipeline parallelism enables a step to bypass writing
to disk by directly writing to a pipe, which a subsequent step can then
read from. This technique saves I/O time, as well as disk space.
Regardless of whether you use sequential processing or parallel processing of independent steps, the data
between steps still needs to be written out to disk. In SAS®9, SAS introduced pipeline parallelism, which
removes the need to write the data to disk between dependent steps. This process allows you to pipe the
data directly from one step to the next through a TCP/IP pipe and allows some steps to work in parallel.
Since you are no longer writing the data to disk, you save both disk space and I/O time. Pipeline
parallelism is available only as part of SAS/CONNECT software.
Days Before Pipeline Parallelism
[Diagram: a DATA step creating data set A, followed by a PROC step reading data set A]
For example, let’s say you have a DATA step that is reading data from a raw data file and creating a SAS
data set named “A”, and then a PROC step that processes that data set.
[Diagram: the DATA step writes data set A to disk while the PROC step waits for it to complete]
Traditionally, while the DATA step is executing it writes the data out to disk. The PROC step that reads
that SAS data set has to wait until the DATA step has completed…
[Diagram: once the DATA step finishes, the PROC step reads data set A from disk]
… before it can begin reading the data from the disk and start processing.
Pipeline Parallelism
[Diagram: with SAS/CONNECT software, the DATA step writes data set A to a TCP/IP socket (port)
and the PROC step reads from it]
Now, with pipeline parallelism, the DATA step can write the data to a TCP/IP port and the PROC step
can read from that port and begin processing the data before the DATA step has finished.
Pipeline Parallelism
•  Pipeline parallelism is when multiple steps depend on each other, but the execution can overlap
   and the output of one step is streamed as input to the next step.
•  Pipeline parallelism is possible when Step B requires output from Step A, but it does not need
   all the output before it can begin.
•  Because the data flows in a continual stream from one task into another through a TCP/IP socket,
   program execution time can be dramatically shortened.
Piping is a SAS®9 extension of the MP CONNECT functionality whose purpose is to address pipeline
parallelism. The pipeline can be extended to include any number of steps and can even extend between
different physical machines.
Pipeline Parallelism
[Diagram: a DATA step streaming observations, one at a time, to a PROC SORT step]
For example, PROC SORT really does not need all the data output by the DATA step before it can start.
Sorting can start with only two records and then continue by adding and sorting more records as they
become available.
So let’s say I have a DATA step that is reading a raw data file, processing the record, and then outputting
the observation. I then have a PROC SORT to sort that data.
So the DATA step reads the first record, processes it, and outputs it to the TCP/IP pipe.
The PROC SORT reads from the pipe and, well, there’s not anything it can do with one observation, so it
just holds on to it.
The DATA step, meanwhile, reads the second record, processes it, and outputs it to the pipe.
Now PROC SORT has two observations and it can sort those observations.
The DATA step reads the next record…
… and sends it to PROC SORT, which puts it in the appropriate place.
So the DATA step reads, …
… and the PROC SORT sorts at the same time.
Read, …
… and sort. Read, …
… and sort. As you can see, this is going to result in much faster elapsed time…
… than not using pipeline parallelism. Eventually, the DATA step will run out of records, …
… and PROC SORT will finish soon after.
Pipeline Parallelism
[Diagram: a "read and feed": the DATA step writes data set A to a TCP/IP socket (port) and
PROC SORT reads it as it arrives]
So as we saw, PROC SORT can easily read the data coming from a DATA step through a TCP/IP pipe
since it doesn’t need to have all of the data to start doing this work. This is what I like to call a “read and
feed”.
Pipeline Parallelism
[Diagram: the DATA step pipes data set A to PROC SORT (read and feed), but a following PROC PRINT
must wait for all the data to be sorted before reading from the second pipe (read and impede)]
However, if I followed that PROC SORT with a PROC PRINT, PROC SORT needs to finish before
PROC PRINT can begin producing the report. I call this a “read and impede”. You can still send the data
through the pipe as long as some considerations are met, and I’ll discuss those considerations at the end of
this e-lecture.
Types of Execution
[Diagram: sequential execution (synchronous) runs DATA Step A, PROC SORT A, DATA Step B, and
PROC SORT B one after another; independent parallelism (asynchronous) runs the A and B streams
at the same time, shortening elapsed time]
So thinking about the different types of executions then, our first type of execution is the good old
sequential execution, also known as synchronous execution. In this method, no step can start until the
previous step has finished, even if it is working on different data. This program is going to take the
longest.
In previous e-lectures in this series, I discussed independent parallelism, also known as asynchronous
execution. This method allows me to have steps execute at the same time as long as they are working with
different, or independent, data. This can drastically reduce the overall elapsed time needed. In this
method though, if two steps use the same data, the second step must still wait for the first step to finish
and write the data out to disk.
Types of Execution
[Diagram: pipeline parallelism (piping), combined with asynchronous execution, overlaps DATA Step A
with PROC SORT A and DATA Step B with PROC SORT B, shortening elapsed time further]
With pipeline parallelism, I can overlap steps even if they are using the same data, as long as it makes
sense to do so, as with a DATA step and a PROC SORT step. I can combine this technique with
asynchronous processing to really reduce the overall elapsed time.
As a reminder, both pipeline parallelism and asynchronous processing need SAS/CONNECT software to
work, even if you are doing this processing on the same physical machine.
Benefits of Piping
The benefits of piping include:
•  overlapped execution of PROC and/or DATA steps
•  elimination of intermediate writes to disk
•  improved performance
•  reduced disk space requirements
What are the benefits of piping? You can overlap execution of DATA steps and PROC steps and
eliminate intermediate writes to disk, which saves disk space and reduces I/O time. All of this means an
increase in performance.
If you do not have sufficient resources on a single machine, you can pipe between remote machines.
Considerations and Requirements
•  You must have sufficient CPU and I/O resources when implementing piping.
•  Piping, like all scalability solutions, is most effective when the execution time of an application
   is substantial.
•  Piping requires a SAS/CONNECT software license because it is part of the SAS/CONNECT product.
•  Piping requires SAS®9. If you are piping across machines, both machines must be running SAS®9.
What are some of the issues with piping? First, you must have sufficient CPU and I/O resources. You
must be using SAS/CONNECT under SAS®9, and it is not a good choice if your application already runs
quickly. It’s best used for long-running programs.
Piping does introduce a small amount of overhead: the time to do a SIGNON, the CPU overhead of
additional processes, and the added complexity of the application.
Limitations of Piping
[Diagram: the DATA step writes data set A to a TCP/IP socket (port) instead of disk, and PROC SORT
reads it in a single pass]
A limitation of piping is that it supports single-pass, sequential data processing. Because piping stores
data for reading and writing in TCP/IP ports instead of disks, the data is never permanently stored.
Instead, after the data is read from a port, the data is removed entirely from that port and the data cannot
be read again. If your data requires multiple passes for processing, piping cannot be used. So, once it has
been processed it is no longer available for a second pass.
Sequential Steps
Examples of SAS steps that process single-pass, sequential data:
•  SORT
•  SUMMARY
•  GANTT
•  COPY
•  PRINT *
•  DATA step *
* = exceptions apply
So what steps are good for piping? PROC SORT, PROC SUMMARY, PROC GANTT, and PROC COPY
are great PROCs for reading data from a pipe. As we saw earlier, though, if they are writing to a pipe,
the step must complete before the next step executes.
Now PROC PRINT is a special case. In most cases I can take the data from a DATA step and
send it to PROC PRINT through a pipe, because by default PROC PRINT makes only a single pass
through the data. The exception to this rule is if you use the UNIFORM option in the PROC PRINT
statement. The UNIFORM option forces PROC PRINT to spin through the data twice, calculating the
longest data value for each variable.
The DATA step can also read the data from a pipe unless there is a KEY= or POINT= option in the
DATA step. Those options require direct, not sequential, access to the data.
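As a sketch of that distinction (the librefs and data set names here are hypothetical, with INLIB assumed to be assigned with the SASESOCK engine), a sequential SET statement can read from a pipe, while POINT= cannot:

```sas
/* Hypothetical librefs: INLIB is assumed to be assigned with   */
/* the SASESOCK engine, so INLIB.SALES arrives through a pipe.  */

/* Works with a pipe: SET reads the observations sequentially,  */
/* in a single pass.                                            */
data work.flagged;
   set inlib.sales;
   if amount > 1000 then flag = 1;
run;

/* Does not work with a pipe: POINT= requests direct access to  */
/* specific observations, which a one-way TCP/IP pipe cannot    */
/* provide.                                                     */
data work.sample;
   do obsnum = 100, 200, 300;
      set inlib.sales point=obsnum;
      output;
   end;
   stop;
run;
```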
2. Piping Code
Now that you have a basic idea of what pipeline parallelism is, let’s see the code needed to make it work.
Objectives
•  Use the SASESOCK engine.
•  Apply the pipe in more than one RSUBMIT block.
•  Perform piping on a single machine.
•  Perform piping across multiple machines.
In this section you will learn how to use the SASESOCK engine in the LIBNAME statement, and how to
specify it in your RSUBMIT blocks. We’ll see how to perform the piping on a single machine and then
discuss how to scale out to another machine.
Piping Syntax
General form of the SASESOCK engine:
   LIBNAME libref SASESOCK "port-specifier" <TIMEOUT=time-in-seconds>;
To specify that you want to use pipeline parallelism, this information is specified in the LIBNAME
statement. The syntax is LIBNAME, followed by a library reference, then the keyword SASESOCK.
Next in quotes is the port-specifier and optionally a TIMEOUT= option, which I’ll discuss in more detail
in a moment.
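For example, assuming a port service named pipe1 is defined in the SERVICES file (the libref here is illustrative), the TIMEOUT= option controls how long the engine waits for the other end of the pipe:

```sas
/* Illustrative libref; ":pipe1" must be defined in the     */
/* SERVICES file. TIMEOUT=120 tells the SASESOCK engine to  */
/* wait up to 120 seconds for the other end of the pipe to  */
/* connect before giving up.                                */
libname pipelib sasesock ":pipe1" timeout=120;
```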
Port Specifiers
port-specifier can be represented in these ways:
":explicit-port"
   specifies an explicit port on the machine where the asynchronous RSUBMIT is executing.
   It is signified by a hardcoded port number.
   libname payroll sasesock ":256";
One option for a port-specifier is to explicitly type in the port number of the machine the code is running
on. This port number is preceded by a colon.
You do not have to configure a port number in the SERVICES file in order to use it. Be aware that if the
port that you specify is already in use, you will be denied access.
Port Specifiers
":port service"
   specifies the name of the service on the machine where the asynchronous RSUBMIT is executing.
   It is signified by the name of the port service.
   libname payroll sasesock ":pipe1";
An example of what would be placed in your SERVICES file:
   pipe1   13001/tcp
   pipe2   13002/tcp
   pipe3   13003/tcp
Under Windows: C:\WINNT\system32\drivers\etc\services
Alternatively, if a port service is configured in the SERVICES file you can use it. So in this case in quotes
I have a colon followed by the name of the port I want to use.
Your SERVICES file is typically found in a location where the operating system is installed. For
Windows, it may be found in C:\WINNT\system32\drivers\etc\services.
Once again, if the port is already in use, you will be denied access.
Scaling Up Piping
   /* start session READTASK and */
   /* execute first data step    */
   /* asynchronously to a pipe   */
   signon readtask sascmd='!sascmd -nosyntaxcheck';
   rsubmit readtask wait=no;
      libname outlib sasesock ":pipe1";
      data outlib.equipment;
         /* data step statements */
      run;
   endrsubmit;
To implement piping, we need to do this in asynchronous tasks. In this case I’m going to pipe to my same
machine, so I will use the scaling up technique.
My first statement is to sign on, and I’m naming this task readtask and using the SASCMD option to
kick off a separate SAS session on my current machine.
Next is an RSUBMIT statement to readtask with the WAIT=NO option to process code asynchronously.
Now I have a LIBNAME statement creating a library reference named outlib using the SASESOCK
engine and pointing to :pipe1. So if you think about what a LIBNAME statement normally does, it
normally points to a physical path location on disk. Here we are just pointing it to a TCP/IP pipe instead.
In my DATA step it is the same code you would have if you were writing to disk, namely DATA
outlib.equipment;. So this creates a data set named equipment and outputs to wherever outlib is
pointing to, which is our TCP/IP pipe named pipe1.
Then we have whatever DATA step code is needed including a RUN statement, and finally, an
ENDRSUBMIT.
Scaling Up Piping
   /* start session SORTTASK and  */
   /* execute second step which   */
   /* gets its input from a pipe  */
   signon sorttask sascmd='!sascmd -nosyntaxcheck';
   rsubmit sorttask wait=no;
      libname inlib sasesock ":pipe1";
      libname newlib 'c:\workshop\winsas\mpdp';
      proc sort data=inlib.equipment
                out=newlib.final;
         by costprice_per_unit;
      run;
   endrsubmit;
To receive this data for another step, I’ll write another SIGNON statement creating a SAS session named
SORTTASK (don’t you love the names I came up with? Very creative I think!).
In the RSUBMIT block for SORTTASK, you’ll notice that once again I have a LIBNAME statement with
the SASESOCK engine that is pointing to that same pipe, :pipe1. I named this libref inlib just to
distinguish it from the other LIBNAME statement. But you certainly could have used the same libref
because this code will be processed in a separate SAS session.
Now this RSUBMIT block is doing a PROC SORT. Remember that a TCP/IP pipe is one direction. So
PROC SORT can read from the pipe but can’t write the data back. And in this case I wouldn’t want to do
that anyway, because eventually I do want the data stored on disk somewhere.
So notice the second LIBNAME statement creating a libref of newlib pointing to a physical path on my
C: drive.
In my PROC SORT, it’s data=inlib (the name of the libref of the pipe) dot equipment (the name of the
data set the DATA step was creating), OUT=newlib (my disk libref) dot final.
So this is the code that you need to pipe data from one step to the next. Again the keys are the separate
RSUBMIT blocks and the SASESOCK engine.
Scaling Up Piping
   waitfor _all_ readtask sorttask;
   signoff readtask;
   signoff sorttask;
Of course once all of the processing has completed, you will want to sign off of both tasks.
So the two steps must be placed in two separate RSUBMIT blocks, or one RSUBMIT block and the
parent process. This is required so that the two steps can run simultaneously.
The first DATA step writes the results to the TCP/IP pipe. The job of the second step is to remove this
data from the TCP/IP pipe. Because of the limited amount of buffer space in the pipe, the processes must
run at the same time.
Scaling Up Piping – Alternative Solution
   /* start session SORTTASK and   */
   /* execute step which gets its  */
   /* input from a pipe eventually */
   signon sorttask sascmd='!sascmd -nosyntaxcheck';
   rsubmit sorttask wait=no;
      libname inlib sasesock ":pipe1";
      libname newlib 'c:\workshop\winsas\mpdp';
      proc sort data=inlib.equipment
                out=newlib.final;
         by costprice_per_unit;
      run;
   endrsubmit;
In that last scenario I kicked off two RSUBMIT SAS sessions, which was additional overhead and meant
my parent SAS session wasn’t doing anything. Alternatively, I could just RSUBMIT the two LIBNAME
statements and the PROC SORT, which will open up the pipe.
Scaling Up Piping – Alternative Solution
   /* In Parent Session submit... */
   libname outlib sasesock ":pipe1";
   data outlib.equipment;
      /* data step statements */
   run;
And then in my parent SAS session I have the LIBNAME statement that writes to the pipe, along with the
DATA step. This will be slightly more efficient than the earlier example since there is less overhead from
kicking off an additional SAS session.
I need to make sure my RSUBMIT block with the WAIT=NO option is submitted before the code that
the parent SAS session is processing. Otherwise the parent SAS session will just start processing the
DATA step and, in good traditional SAS fashion, will not move past that step until it is completed, which
means the RSUBMIT code will not process at the same time.
Scaling Up Piping – Alternative Solution
   waitfor _all_ sorttask;
   signoff sorttask;
You still need to make sure you wait for the sorttask session to complete, though, before you sign off of
it.
Multiple Pipes
   signon readtask sascmd='!sascmd -nosyntaxcheck';
   rsubmit readtask wait=no;
      libname outlib1 sasesock ":pipe1";
      libname outlib2 sasesock ":pipe2";
      data outlib1.equip outlib2.clothes;
         /* data step statements */
         if type='EQUIPMENT' then output outlib1.equip;
         else if type='CLOTHES' then output outlib2.clothes;
      run;
   endrsubmit;
I can also work with multiple pipes. In this RSUBMIT block I have two LIBNAME statements, one with
a libref of OUTLIB1 pointing to pipe1, and one with a libref of OUTLIB2 pointing to pipe2.
In my DATA statement, I have two data sets listed. Then in my DATA step I have a series of IF/THEN
statements. So if the variable type has a value of “EQUIPMENT”, then I’ll write the observations to the
OUTLIB1.EQUIP data set. Otherwise if the type is “CLOTHES”, then write the observation out to the
OUTLIB2.CLOTHES data set.
Multiple Pipes
   signon equpsort sascmd='!sascmd -nosyntaxcheck';
   rsubmit equpsort wait=no;
      libname inlib sasesock ":pipe1";
      libname newlib 'c:\workshop\winsas\mpdp';
      proc sort data=inlib.equip
                out=newlib.final1;
         by costprice_per_unit;
      run;
   endrsubmit;
Next I’ll have another RSUBMIT block that reads the data from pipe1, which was the pipe I wrote the
equipment observations to. So this RSUBMIT block can start sorting those observations from the pipe as
soon as the DATA step in the first RSUBMIT block starts writing to the pipe.
Notice in the PROC SORT there is an OUT= option writing the sorted data to the NEWLIB.FINAL1 data
set, where NEWLIB is pointing to C:\workshop\winsas\mpdp.
Multiple Pipes
   signon clthsort sascmd='!sascmd -nosyntaxcheck';
   rsubmit clthsort wait=no;
      libname inlib sasesock ":pipe2";
      libname newlib 'c:\workshop\winsas\mpdp';
      proc sort data=inlib.clothes
                out=newlib.final2;
         by costprice_per_unit;
      run;
   endrsubmit;
While this is executing I have another RSUBMIT block reading from pipe2, which is where the clothes
observations had been written to. Once again this PROC SORT is writing out to a new data set as
identified by the OUT= option, and this new data set name is NEWLIB.FINAL2. And the NEWLIB libref
in this RSUBMIT block is also pointing to C:\workshop\winsas\mpdp.
Multiple Pipes
   waitfor _all_ readtask clthsort equpsort;
   signoff readtask;
   signoff clthsort;
   signoff equpsort;
And as always, I have a WAITFOR statement waiting for all RSUBMIT blocks to complete before
signing off.
Using Pipeline Parallelism
[Diagram: three DATA steps read ACCESSORIES.CSV, CLOTHES.CSV, and CHILDREN.CSV, each
writing to its own pipe, and a fourth step concatenates the data]
Let me go ahead and demo how to use piping on a single machine. I need to read the ACCESSORIES
raw data file, the CLOTHES raw data file, and the CHILDREN raw data file using separate DATA
steps. I’ll take that data and send it through separate pipes and then concatenate the data and write it out to
disk.
Notice in my code I have four different SIGNON statements, one for each of the raw data files to be
processed, and one that will perform the concatenation.
In my first RSUBMIT block, named accwin, I have a LIBNAME statement creating a libref of outlib
using the SASESOCK engine and pointing to a pipe from my SERVICES file named :pipe1.
In my DATA statement I’m creating OUTLIB.ACCESSORIES, so the accessories data set will be
written to pipe1. Next I have a traditional INFILE statement pointing to the accessories.csv file and an
INPUT statement.
In the second RSUBMIT block, named childwin, I have a similar LIBNAME statement creating a libref
of outlib but this time pointing to :pipe2. This INFILE statement points to the children raw data file.
The third RSUBMIT block is the same, but this time named clothwin and pointing to :pipe3 with the
INFILE statement reading the clothes raw data file.
In my fourth RSUBMIT block I have three LIBNAME statements creating librefs of READ1, READ2,
and READ3 pointing to the pipes of :pipe1, :pipe2, and :pipe3 respectively.
I have a fourth LIBNAME statement creating a libref of ORION actually point to a path on my machine.
In the next statement is the DATA statement which creates a SAS data set named concat in the libref
ORION which is the one pointing to disk.
In the SET statement, it is SET READ3.CLOTHES READ2.CHILDRENS READ1.ACCESSORIES;.
This will concatenate the data.
After that RSUBMIT block is a WAITFOR statement that waits for all tasks to finish, and then a SIGNOFF
statement for each task. I could have said SIGNOFF _ALL_;, as that is a new option in SAS 9.2. If you are
using SAS 9.1, you need a separate SIGNOFF statement for each task.
I’m going to go ahead and submit this code and wait while it is processed.
Once the code has finished processing, I’m going to activate the log and scroll up to the top.
All of the SIGNON statements look good. Scrolling down farther I see the LIBNAME statement that
points to :pipe1 and the note says successfully assigned. Notice the engine name says SASESOCK and
the physical name is pipe1.
The notes following the DATA step indicate that the data set OUTLIB.ACCESSORIES has 73,507
observations.
Scrolling down there is a similar note about the libref being successfully assigned for pipe2, and the
OUTLIB.CHILDRENS data set has 88,482 observations.
Scrolling down even more is another successful libref pointing to pipe3 and a data set of
OUTLIB.CLOTHES with 110,032 observations.
In the last RSUBMIT block you see the three LIBNAME statements pointing to pipe1, pipe2, and pipe3.
The last LIBNAME statement successfully points to a path on the C: drive.
So all three data sets get concatenated and the resulting permanent data set ORION.CONCAT has
272,021 observations.
So this technique worked really well to prevent all of that intermediate writing to disk that would have
normally occurred without piping.
Port Specifiers
"parent-machine-name:port-number"
specifies an explicit port number on the machine specified
by parent-machine-name.
libname payroll sasesock "orion.finance.com:256";
45
continued...
What if your machine only has a single processor and you attempt to run pipeline parallelism? The piping
test case will probably take longer than the sequential test case. This is a result of contention for the I/O
channel, the CPU, and memory.
If you work in an environment such as this, piping can still work for you. Simply farm the work out to
other machines.
In the LIBNAME statement, just type the name of the remote machine and a colon before the port number.
When specifying a port on a remote machine, you can only read from, not write to, that pipe.
Port Specifiers
"machine-name:port-service"
specifies the name of the service on the machine
specified by machine-name.
libname payroll sasesock "orion.finance.com:pipe1";
46
Or, as we saw before, you can specify a service name instead. Here I have
orion.finance.com:pipe1.
When specifying a port on a remote machine, you can only read from, not write to, that pipe.
Ensure that the port that the output is written to is on the same machine that the asynchronous process is
run on. However, a SAS procedure that reads from that port can run on another machine.
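Putting those two rules together, a sketch (the machine name and service name are taken from the slide example and are illustrative): the writing task runs on the machine that owns the pipe and uses a local port specifier, while the reading task on another machine names that machine explicitly.

```sas
/* On orion.finance.com: the writing task uses a local pipe */
libname outlib sasesock ":pipe1";

/* On a different machine: the reading task names the writer's machine.
   A remote port specifier can only be read from, not written to. */
libname inlib sasesock "orion.finance.com:pipe1";
```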
3.
Considerations and Benchmarking
Pipeline Parallelism
1. Overview of Pipeline Parallelism
2. Piping Code
3. Considerations and Benchmarking
47
Lastly, let’s talk in more detail about factors you need to consider and the benchmarking statistics.
Objectives
• Factor into consideration issues associated with piping.
• Compare the benchmarks of piping versus not piping.
48
In this section you will learn about issues to consider with regard to piping. You will also see some
benchmarks of using piping versus not using piping.
Additional Considerations
• The benefits of piping should be weighed against the cost of potential CPU or I/O bottlenecks. If
execution time for a SAS procedure or statement is relatively short, piping is probably
counterproductive.
• If you have only a few I/O channels, I/O-intensive code (like PROC SORT) will clog up the I/O
channels.
• You might minimize port access collisions on the same machine by reserving a range of ports in the
SERVICES file.
• Be sure that the task reading the data does not complete before the task writing the data.
49
There are many items you need to consider before rewriting all of your code to support piping. The first
item is that piping should not be used if your execution time is already relatively short. You won’t gain
anything, and piping could have a negative impact because it requires more CPU time and can create I/O
bottlenecks.
Secondly, if you have few I/O channels available, I/O intensive code like PROC SORT will clog up the
I/O channels. How do you know how many I/O channels are available? Ask your systems administrator.
Also, we recommend that you reserve a range of ports in your SERVICES file and that helps reduce port
access collisions.
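For example, reserving a small range of service names might look like this in the SERVICES file (the port numbers shown are only placeholders; use a range your systems administrator assigns):

```
pipe1   5001/tcp    # reserved for SAS pipeline parallelism
pipe2   5002/tcp    # reserved for SAS pipeline parallelism
pipe3   5003/tcp    # reserved for SAS pipeline parallelism
```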
And lastly, the task reading the data must not complete before the task writing the data.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A with the OBS=10 option.]
50
For example, if the DATA step produces a large number of observations and PROC PRINT only prints
the first few observations specified by the OBS= option, this can result in the reading task closing the pipe
after the first few observations are printed. This would cause an error for the DATA step because it would
continue to try to write to the pipe, which was closed.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A with the OBS=10 option.]
1. DATA step opens port and begins writing data.
51
continued...
Let me show you what this would look like. Let’s say I have a DATA step that creates data set A, and a
PROC STEP that only processes the first 10 observations. The first step is that the DATA step opens the
port and begins writing to the pipe.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A with the OBS=10 option.]
1. DATA step opens port and begins writing data.
2. PROC step begins reading data.
52
continued...
The PROC step then begins reading the data from that pipe.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A with the OBS=10 option.]
1. DATA step opens port and begins writing data.
2. PROC step begins reading data.
3. PROC step finishes reading data and closes port.
53
continued...
Once the PROC step finishes, it says “OK, I’m done” and closes the port.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A with the OBS=10 option.]
1. DATA step opens port and begins writing data.
2. PROC step begins reading data.
3. PROC step finishes reading data and closes port.
4. DATA step can no longer write to port and generates an error message.
54
Because the port is now closed, the DATA step can’t write to it anymore and generates an error message.
While the task that does the writing generates an error and will not complete, the task that reads will
complete successfully. You can ignore the error in the writing task if the completion of this task is not
required (as is the case with the DATA step and PROC PRINT example here). Alternatively, move
the OBS=10 option to the DATA step so that the reader does not finish before the writer.
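As a sketch of that last suggestion (the data set names are hypothetical), putting OBS=10 on the writer's input means the writer stops after ten observations, so the reader can never outrun it:

```sas
/* Writer: only ten observations ever enter the pipe */
libname outlib sasesock ":pipe1";
data outlib.a;
   set work.big(obs=10);
run;

/* Reader: prints everything it receives -- no OBS= option needed */
libname inlib sasesock ":pipe1";
proc print data=inlib.a;
run;
```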
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A.]
1. DATA step opens port and processes for a long time; no data written out yet.
55
continued...
Another issue with regard to timing is that if the task that reads from the pipe opens the pipe to read but
data isn’t written to the pipe due to some kind of delay, then the reading step times out and closes the
pipe.
So pictorially, once again let’s say we have a DATA step writing to a pipe and a PROC step reading from
that pipe. This particular DATA step does a lot of processing on an observation.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A.]
1. DATA step opens port and processes for a long time; no data written out yet.
2. PROC step waits for data in port; gives up after value specified with the TIMEOUT= option.
56
continued...
The PROC step keeps waiting for data, and after a while it gets bored. How long is “a while”? By default it
is 10 seconds, as that is the default value of the TIMEOUT= option in the LIBNAME statement using the
SASESOCK engine.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A.]
1. DATA step opens port and processes for a long time; no data written out yet.
2. PROC step waits for data in port; gives up after value specified with the TIMEOUT= option.
3. PROC step closes port.
57
continued...
As soon as that TIMEOUT value is reached without having data in the pipe, the PROC step closes the
pipe.
Closing a Pipe Early
[Diagram: DATA step creating data set A → pipe → PROC step reading data set A.]
1. DATA step opens port and processes for a long time; no data written out yet.
2. PROC step waits for data in port; gives up after value specified with the TIMEOUT= option.
3. PROC step closes port.
4. DATA step can no longer write to port and generates an error message.
58
Eventually the DATA step is finished with the first observation and tries to output the observation to the
port. But because the port is closed, it can’t and you will get an error message.
TIMEOUT= Option
LIBNAME libref SASESOCK "port-specifier" <TIMEOUT=time-in-seconds>;
59
If you need the pipe to stay open longer, just use the TIMEOUT= option in the LIBNAME statement to
increase the timeout value for the task that is reading. This causes the reading task to “wait” longer for the
writing task to begin writing to the pipe. This enables the initial steps in the writing task to complete and
the DATA step or SAS procedure to begin writing to the pipe before the reader times out.
The value for the TIMEOUT= option is the number of seconds to keep the pipe open.
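For instance, to give the reading task in the earlier demo 60 seconds instead of the default 10 before it gives up waiting on pipe1:

```sas
/* In the reading task: wait up to 60 seconds for data to appear in the pipe */
libname read1 sasesock ":pipe1" timeout=60;
```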
Performance Gains
Look at an example scenario to compare the following:
• traditional sequential processing
• MP CONNECT + Piping
• MP CONNECT + Piping + Threading
60
MP CONNECT, pipeline parallelism, and threading improve performance, but by how much? Let’s look
at some different scenarios. First, we’ll use traditional sequential processing. Next, you will see
asynchronous processing plus piping. And the last scenario is to use asynchronous processing plus piping
plus threading.
This example was executed using the following criteria:
• an eight-way 900 MHz UNIX box
• two raw input files (~1.5G each)
• two DATA steps, two SUMMARY procedures, and a DATA step merge
Sequential SAS Job
[Diagram: equip.csv → DATA Step → PROC SUMMARY; outdoors.csv → DATA Step → PROC SUMMARY; the two summary outputs feed a DATA Step Merge.]
61
In this scenario I have a DATA step to read the EQUIP.CSV raw data file and do some calculations, and a
DATA step to read the OUTDOORS.CSV file and do similar processing.
Then each data set goes through a PROC SUMMARY, and the resulting output data sets are merged
using a DATA step.
Execution Times
Sequential implementation
• 1210 seconds
MP CONNECT without threads
• Up Next
MP CONNECT with threads
• On Deck
62
With traditional processing, the sequential implementation took a little over 1200 seconds.
MP CONNECT and Piping
[Diagram: EQUIP.CSV → DATA step → Summary and OUTDOORS.CSV → DATA step → Summary run simultaneously; both feed a Merge step.]
63
Now we put in the code to process the equipment data and outdoors data simultaneously, using one
processor per task. We are also using piping to reduce disk space, I/O operations, and CPU time.
Then lastly, merge the data sets together and write it out to disk.
Piping works well here because there is no need to retain the temporary data sets created by the two
DATA steps and the subsequent PROC SUMMARY steps. Therefore, we can use piping to pipe the data
from one step to the next. This gives us overlapped execution of all of the steps and eliminates writing
intermediate results to disk.
Execution Times
Sequential implementation
• 1210 seconds
MP CONNECT and Piping without threads
• 620 seconds (49% improvement)
MP CONNECT and Piping with threads
• Up Next
64
The overall elapsed time for this technique was almost half of the sequential processing.
Next, let’s look at adding threading to this scenario.
MP CONNECT with Piping and Threads
[Diagram: EQUIP.CSV → DATA step → Summary and OUTDOORS.CSV → DATA step → Summary run simultaneously; both feed a Merge step.]
65
PROC SUMMARY turns threading off by default for BY-group processing. However, because these
BY-groups are fairly large, I can turn on threading by adding the THREADS option to the PROC
SUMMARY statement. This lets additional CPUs process the PROC SUMMARY code.
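A minimal sketch of that change (the input data set, BY variable, and analysis variable are assumptions for illustration; the source does not show this code):

```sas
/* THREADS re-enables threading for BY-group processing in PROC SUMMARY */
proc summary data=read1.equip threads;
   by product_id;
   var total_retail_price;
   output out=equipsum sum=;
run;
```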
Execution Times
Sequential implementation
• 1210 seconds
MP CONNECT and Piping without threads
• 620 seconds (49% improvement)
MP CONNECT and Piping with threads
• 382 seconds (70% improvement)
66
Now, with threading turned on, the overall elapsed time is reduced to 382 seconds, which is a 70%
improvement over the sequential execution.
MP CONNECT and Threaded Summary
total improvement in elapsed time of 70%
67
So graphically, the top bar represents the relative elapsed time necessary to run the original sequential test
case.
The second bar represents the time savings from the MP CONNECT solution and piping.
The bottom bar represents the final implementation combining MP CONNECT functionality, piping, and
threaded PROC SUMMARY. The combination gives us an improvement in total elapsed time of 70%.
Summary
68
Pipeline parallelism can significantly reduce the overall elapsed time a program takes to run. By taking
advantage of pipes you can eliminate the need to write the data to disk and overlap processing of steps
that process the same data.
Credits
Pipeline Parallelism was developed by M. Michelle Buchecker.
Additional contributions were made by Cheryl Doninger, Glenn Horton,
Merry Rabb, and Chris Riddiough.
69
That concludes the Pipeline Parallelism e-lecture.
Comments?
We would like to hear what you think.
• Do you have any comments about this lecture?
• Did you find the information in this lecture useful?
• What other e-lectures would you like to see SAS develop in the future?
Please e-mail your comments to [email protected]
70
If you have any comments on this lecture or other lectures you would like to see, please e-mail
[email protected].
Copyright
SAS and all other SAS Institute Inc. product or service names
are registered trademarks or trademarks of SAS Institute Inc.
in the USA and other countries.
® indicates USA registration. Other brand and product names
are trademarks of their respective companies.
Copyright © 2009 by SAS Institute Inc., Cary, NC 27513, USA.
All rights reserved.
71
I hope you learned a lot in this e-lecture and can put the topics to good use.
Appendix A Demonstration Programs
1. Using Pipeline Parallelism ............................................................................................ A-3
1. Using Pipeline Parallelism
Section 2, Slide 44
signon accwin sascmd='!sascmd -nosyntaxcheck';
signon childwin sascmd='!sascmd -nosyntaxcheck';
signon clothwin sascmd='!sascmd -nosyntaxcheck';
signon combine sascmd='!sascmd -nosyntaxcheck';
rsubmit accwin wait=no;
libname outlib sasesock ":pipe1";
data outlib.accessories;
infile 'c:\workshop\winsas\mpdp\accessories.csv' dsd;
input order_id order_item_num product_id
quantity total_retail_price : comma7.
costprice_per_unit : comma7. discount :percent7.
product_name :$45. supplier_id
product_level product_ref_id;
run;
endrsubmit;
rsubmit childwin wait=no;
libname outlib sasesock ":pipe2";
data outlib.childrens;
infile 'c:\workshop\winsas\mpdp\children.csv' dsd;
input order_id order_item_num product_id
quantity total_retail_price : comma7.
costprice_per_unit : comma7. discount :percent7.
product_name :$45. supplier_id
product_level product_ref_id;
run;
endrsubmit;
rsubmit clothwin wait=no;
libname outlib sasesock ":pipe3";
data outlib.clothes;
infile 'c:\workshop\winsas\mpdp\clothes.csv' dsd;
input order_id order_item_num product_id
quantity total_retail_price : comma7.
costprice_per_unit : comma7. discount :percent7.
product_name :$45. supplier_id
product_level product_ref_id;
run;
endrsubmit;
rsubmit combine wait=no;
libname read1 sasesock ":pipe1";
libname read2 sasesock ":pipe2";
libname read3 sasesock ":pipe3";
libname orion 'c:\workshop\winsas\mpdp';
data orion.concat;
set read3.clothes read2.childrens read1.accessories;
run;
endrsubmit;
waitfor _all_ accwin childwin clothwin combine;
signoff accwin;
signoff childwin;
signoff clothwin;
signoff combine;