Pipeline Parallelism Transcript Pipeline Parallelism Transcript was developed by Michelle Buchecker. Additional contributions were made by Cheryl Doninger, Glenn Horton, Merry Rabb, and Christine Riddiough. Editing and production support was provided by the Curriculum Development and Support Department. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Pipeline Parallelism Transcript Copyright © 2009 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc. Book code E1414, course code RLSPCN09, prepared date 31Mar2009. RLSPCN09_001 ISBN 978-1-59994-972-7 For Your Information Table of Contents Lecture Description ..................................................................................................................... iv Prerequisites ................................................................................................................................. v Pipeline Parallelism ........................................................................................................ 1 1. Overview of Pipeline Parallelism ....................................................................................... 6 2. Piping Code....................................................................................................................... 31 3. Considerations and Benchmarking ................................................................................... 50 Appendix A 1. 
Demonstration Programs ................................................................... A-1 Using Pipeline Parallelism .............................................................................................. A-3 iii iv For Your Information Lecture Description This is the fourth e-lecture of a five-lecture series on parallel processing with SAS/CONNECT software. This lecture teaches you the necessary code to use SAS/CONNECT software to perform pipeline parallel processing to execute dependent code simultaneously without writing the data out to disk. To learn more… For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to [email protected]. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog. For a list of other SAS books that relate to the topics covered in these Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to [email protected]. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form. For Your Information Prerequisites Before listening to this lecture, you should be able to • write DATA and PROC steps • understand error messages in the SAS log and debug your programs • log on to a remote SAS session through either a SAS spawner or a SAS script file • use an RSUBMIT statement to submit code to a remote machine • use a LIBNAME statement to access SAS data libraries. v vi For Your Information Pipeline Parallelism 1. Overview of Pipeline Parallelism ..................................................................................... 6 2. Piping Code...................................................................................................................... 31 3.
Considerations and Benchmarking ............................................................................... 50 2 Pipeline Parallelism 1. Overview of Pipeline Parallelism 3 Pipeline Parallelism Welcome to the Pipeline Parallelism e-lecture. My name is Michelle and I will be your instructor for this lecture. I have been an instructor for SAS for over 15 years and my specialties include SAS/CONNECT software. 4 Pipeline Parallelism Lectures Available Introduction to Parallel Processing Using Parallel Processing on a Single Machine (Scaling Up) Using Parallel Processing Across Multiple Machines (Scaling Out) Pipeline Parallelism Managing Asynchronous Execution 2 This lecture series consists of five separate lectures. The first lecture is Introduction to Parallel Processing. The second lecture is Using Parallel Processing on a Single Machine (Scaling Up). The third lecture is Using Parallel Processing Across Multiple Machines (Scaling Out). The fourth lecture is Pipeline Parallelism. And the fifth lecture is Managing Asynchronous Execution. This is the fourth lecture in the series. We encourage you to listen to all five lectures to get a full understanding of how to perform parallel processing. 1. Overview of Pipeline Parallelism Pipeline Parallelism 1. Overview of Pipeline Parallelism 2. Piping Code 3. Considerations and Benchmarking 3 In this lecture you will learn how to perform pipeline parallelism, which is to execute dependent code simultaneously without writing the data out to disk. 5 6 Pipeline Parallelism 1. Overview of Pipeline Parallelism Pipeline Parallelism 1. Overview of Pipeline Parallelism 2. Piping Code 3. Considerations and Benchmarking 4 I’m going to start off with an overview of pipeline parallelism, followed by what code you will need to implement it, and lastly talk about some factors to take into consideration and benchmarking we have performed. 1. Overview of Pipeline Parallelism Objectives Define pipeline parallelism.
Determine benefits of pipeline parallelism. Determine requirements of pipeline parallelism. 5 In this section, I’ll define pipeline parallelism, talk about the benefits, and discuss what the requirements are. 7 8 Pipeline Parallelism Technology Today Prior to SAS®9, a step that created an output SAS data set wrote that SAS data set to disk, which could be read by a subsequent step or steps. Starting in SAS®9, pipeline parallelism enables a step to bypass writing to disk by directly writing to a pipe, which a subsequent step can then read from. This technique saves I/O time, as well as disk space. 6 Regardless of whether you use sequential processing or parallel processing of independent steps, the data between steps still needs to be written out to disk. In SAS®9, SAS introduced pipeline parallelism, which negates the need for writing the data to disk between dependent steps. This process allows you to pipe the data directly from one step to the next through a TCP/IP pipe and allows some steps to work in parallel. Since you are no longer writing the data to disk, you are saving both disk space and I/O time. Pipeline parallelism is available only as part of SAS/CONNECT software. 1. Overview of Pipeline Parallelism 9 Days Before Pipeline Parallelism DATA Step Creating Data Set A 7 PROC Step Reading Data Set A For example, let’s say you have a DATA step that is reading data from a raw data file and creating a SAS data set named “A”. And then a PROC step that processes that data set. 10 Pipeline Parallelism Days Before Pipeline Parallelism DATA Step Creating Data Set A A Disk Wait 8 PROC Step Reading Data Set A Traditionally, while the DATA step is executing it writes the data out to disk. The PROC step that reads that SAS data set has to wait until the DATA step has completed… 1. Overview of Pipeline Parallelism Days Before Pipeline Parallelism DATA Step Creating Data Set A A 9 Disk PROC Step Reading Data Set A … before it can begin reading the data from the disk and start processing.
11 12 Pipeline Parallelism Pipeline Parallelism SAS/CONNECT Software DATA Step Creating Data Set A A TCP/IP Socket (port) PROC Step Reading Data Set A 10 Now, with pipeline parallelism, the DATA step can write the data to a TCP/IP port and the PROC step can read from that port and begin processing the data before the DATA step has finished. 1. Overview of Pipeline Parallelism Pipeline Parallelism Pipeline parallelism is when multiple steps depend on each other, but the execution can overlap and the output of one step is streamed as input to the next step. Pipeline parallelism is possible when Step B requires output from Step A, but it does not need all the output before it can begin. Because the data flows in a continual stream from one task into another through a TCP/IP socket, program execution time can be dramatically shortened. 11 ® Piping is a SAS 9 extension of the MP CONNECT functionality whose purpose is to address pipeline parallelism. The pipeline can be extended to include any number of steps and can even extend between different physical machines. 13 14 Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 12 For example, PROC SORT really does not need all the data output by the DATA step before it can start. Sorting can start with only two records and then continue by adding and sorting more records as they become available. So let’s say I have a DATA step that is reading a raw data file, processing the record, and then outputting the observation. I then have a PROC SORT to sort that data. So the DATA step reads the first record, processes it, and outputs it to the TCP/IP pipe. 1. Overview of Pipeline Parallelism 15 Pipeline Parallelism DATA Step PROC SORT Step 13 The PROC SORT reads from the pipe and, well, there’s not anything it can do with one observation, so it just holds on to it. The DATA step, meanwhile, reads the second record, processes it, and outputs it to the pipe. 
16 Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 14 Now PROC SORT has two observations and it can sort those observations. The DATA step reads the next record… 1. Overview of Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 15 … and sends it to PROC SORT, which puts it in the appropriate place. So the DATA step reads, … 17 18 Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 16 … and the PROC SORT sorts at the same time. Read, … 1. Overview of Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 17 … and sort. Read, … 19 20 Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 18 … and sort. As you can see, this is going to result in much faster elapsed time… 1. Overview of Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 19 … than not using pipeline parallelism. Eventually, the DATA step will run out of records, … 21 22 Pipeline Parallelism Pipeline Parallelism DATA Step PROC SORT Step 20 … and PROC SORT will finish soon after. 1. Overview of Pipeline Parallelism 23 Pipeline Parallelism DATA Step Creating Data Set A TCP/IP Socket (port) Read and Feed A PROC SORT Reading Data Set A 21 So as we saw, PROC SORT can easily read the data coming from a DATA step through a TCP/IP pipe since it doesn’t need to have all of the data to start doing this work. This is what I like to call a “read and feed”. 24 Pipeline Parallelism Pipeline Parallelism DATA Step Creating Data Set A TCP/IP Socket (port) A PROC SORT Reading Data Set A TCP/IP Socket (port) 22 Read and Feed A Read and Impede Wait for all data to be sorted PROC PRINT Reading Data Set A However, if I followed that PROC SORT with a PROC PRINT, PROC SORT needs to finish before PROC PRINT can begin producing the report. I call this a “read and impede”. You can still send the data through the pipe as long as some considerations are met, and I’ll discuss those considerations at the end of this e-lecture. 1. 
Overview of Pipeline Parallelism 25 Types of Execution Sequential Execution (Synchronous) DATA Step A PROC SORT A DATA Step B PROC SORT B Independent Parallelism (Asynchronous) DATA Step A DATA Step B 23 0 PROC SORT A PROC SORT B elapsed time So thinking about the different types of executions then, our first type of execution is the good old sequential execution, also known as synchronous execution. In this method, no step can start until the previous step has finished, even if it is working on different data. This program is going to take the longest. In previous e-lectures in this series, I discussed independent parallelism, also known as asynchronous execution. This method allows me to have steps execute at the same time as long as they are working with different, or independent, data. This can drastically reduce the overall elapsed time needed. In this method though, if two steps use the same data, the second step must still wait for the first step to finish and write the data out to disk. 26 Pipeline Parallelism Types of Execution Pipeline Parallelism (Piping) Asynchronous DATA Step A 24 PROC SORT A DATA Step B PROC SORT B 0 elapsed time With pipeline parallelism, I can overlap steps even if they are using the same data, as long as that makes sense to do so, like with a DATA step and PROC SORT step. I can combine this technique with asynchronous processing to really give me reduction in overall elapsed time. As a reminder, both pipeline parallelism and asynchronous processing need SAS/CONNECT software to work, even if you are doing this processing on the same physical machine. 1. Overview of Pipeline Parallelism Benefits of Piping The benefits of piping include: overlapped execution of PROC and/or DATA steps elimination of intermediate writes to disk improved performance reduced disk space requirements 25 What are the benefits of piping? 
You can overlap execution of DATA steps and PROC steps and eliminate writing to disk, which saves disk space and reduces I/O time. And all of this means an increase in performance. If you do not have sufficient resources on a single machine, you can pipe between remote machines. 27 28 Pipeline Parallelism Considerations and Requirements You must have sufficient CPU and I/O resources when implementing piping. Piping, like all scalability solutions, is most effective when the execution time of an application is substantial. Piping requires a SAS/CONNECT software license because it is part of the SAS/CONNECT product. Piping requires SAS®9. If you are piping across machines, both machines must be running SAS®9. 26 What are some of the issues with piping? First, you must have sufficient CPU and I/O resources. You must be using SAS/CONNECT under SAS®9, and it is not a good choice if your application already runs quickly. It’s best used for long-running programs. Piping does introduce a small overhead from doing a SIGNON, CPU overhead of additional processes, and a complexity overhead to the application. 1. Overview of Pipeline Parallelism 29 Limitations of Piping DATA Step Creating Data Set A Disk TCP/IP Socket (port) A PROC SORT Reading Data Set A 27 A limitation of piping is that it supports only single-pass, sequential data processing. Because piping stores data for reading and writing in TCP/IP ports instead of disks, the data is never permanently stored. Instead, after the data is read from a port, the data is removed entirely from that port and the data cannot be read again. If your data requires multiple passes for processing, piping cannot be used. So, once it has been processed it is no longer available for a second pass. 30 Pipeline Parallelism Sequential Steps Examples of SAS steps that process single-pass, sequential data: SORT SUMMARY GANTT COPY PRINT * DATA step * * = exceptions apply 28 So what steps are good for piping?
PROC SORT, PROC SUMMARY, PROC GANTT, and PROC COPY. These are great PROCs to read in data from a pipe. As we saw earlier, if they are writing to a pipe, though, the step must complete before the next step executes. Now PROC PRINT is a special case scenario. In most cases I can take the data from a DATA step and send it to PROC PRINT through a pipe, because by default PROC PRINT makes only a single pass through the data. The exception to this rule is if you use the UNIFORM option in the PROC PRINT statement. The UNIFORM option forces PROC PRINT to spin through the data twice, calculating the longest data value for each variable. The DATA step can also read the data from a pipe unless there is a KEY= or POINT= option in the DATA step. Those options require direct, not sequential, access to the data. 2. Piping Code 31 Piping Code Pipeline Parallelism 1. Overview of Pipeline Parallelism 2. Piping Code 3. Considerations and Benchmarking 29 Now that you have a basic idea of what pipeline parallelism is, let’s see the code needed to make it work. 32 Pipeline Parallelism Objectives Use the SASESOCK engine. Apply the pipe in more than one RSUBMIT block. Perform piping on a single machine. Perform piping across multiple machines. 30 In this section you will learn how to use the SASESOCK engine in the LIBNAME statement, and how to specify it in your RSUBMIT blocks. We’ll see how to perform the piping on a single machine and then discuss how to scale out to another machine. 2. Piping Code 33 Piping Syntax General form of the SASESOCK engine: LIBNAME libref SASESOCK "port-specifier" <TIMEOUT=time-in-seconds>; 31 To specify that you want to use pipeline parallelism, this information is specified in the LIBNAME statement. The syntax is LIBNAME, followed by a library reference, then the keyword SASESOCK. Next in quotes is the port-specifier and optionally a TIMEOUT= option, which I’ll discuss in more detail in a moment.
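As a quick sketch of that syntax (the explicit port number 5001 and the 60-second timeout are illustrative values I chose, not from the lecture, and the interpretation of TIMEOUT= in the comments is my assumption):

```sas
/* Explicit port on the local machine. TIMEOUT=60 is assumed here   */
/* to mean: wait up to 60 seconds for the partner step to open the  */
/* pipe before the step fails.                                      */
libname outlib sasesock ":5001" timeout=60;

/* Named port service; pipe1 must be defined in the SERVICES file */
libname outlib sasesock ":pipe1" timeout=60;
```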
34 Pipeline Parallelism Port Specifiers port-specifier can be represented in these ways: ":explicit-port " specifies an explicit port on the machine where the asynchronous RSUBMIT is executing. It is signified by a hardcoded port number. libname payroll sasesock ":256"; 32 continued... One option for a port-specifier is to explicitly type in the port number of the machine the code is running on. This port number is preceded by a colon. You do not have to configure a port number in the SERVICES file in order to use it. Be aware that if the port that you specify is already in use, you will be denied access. 2. Piping Code 35 Port Specifiers ":port service" specifies the name of the service on the machine where the asynchronous RSUBMIT is executing. It is signified by the name of the port service. libname payroll sasesock ":pipe1"; An example of what would be placed in your SERVICES file: pipe1 pipe2 pipe3 33 13001/tcp 13002/tcp 13003/tcp Under Windows: C:\WINNT\system32\drivers\etc\services Alternatively, if a port service is configured in the SERVICES file you can use it. So in this case in quotes I have a colon followed by the name of the port I want to use. Your SERVICES file is typically found in a location where the operating system is installed. For Windows, it may be found in C:\WINNT\system32\drivers\etc\services. Once again, if the port is already in use, you will be denied access. 36 Pipeline Parallelism Scaling Up Piping /* start session READTASK and */ /* execute first data step */ /* asynchronously to a pipe */ signon readtask sascmd='!sascmd –nosyntaxcheck'; rsubmit readtask wait=no; libname outlib sasesock ":pipe1"; data outlib.equipment; /* data step statements */ run; endrsubmit; 34 continued... To implement piping, we need to do this in asynchronous tasks. In this case I’m going to pipe to my same machine, so I will use the scaling up technique. 
My first statement is to sign on, and I’m naming this task readtask and using the SASCMD option to kick off a separate SAS session on my current machine. Next is an RSUBMIT statement to readtask with the WAIT=NO option to process code asynchronously. Now I have a LIBNAME statement creating a library reference named outlib using the SASESOCK engine and pointing to :pipe1. So if you think about what a LIBNAME statement normally does, it normally points to a physical path location on disk. Here we are just pointing it to a TCP/IP pipe instead. In my DATA step it is the same code you would have if you were writing to disk, namely DATA outlib.equipment;. So this creates a data set named equipment and outputs to wherever outlib is pointing to, which is our TCP/IP pipe named pipe1. Then we have whatever DATA step code is needed including a RUN statement, and finally, an ENDRSUBMIT. 2. Piping Code 37 Scaling Up Piping /* start session READTASK and */ /* execute first data step */ /* asynchronously to a pipe */ signon readtask sascmd='!sascmd –nosyntaxcheck'; rsubmit readtask wait=no; libname outlib sasesock ":pipe1"; data outlib.equipment; /* data step statements */ run; /* start session SORTTASK and */ endrsubmit; /* execute second step which */ /* gets its input from a pipe */ signon sorttask sascmd='!sascmd -nosyntaxcheck'; rsubmit sorttask wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equipment out=newlib.final; by costprice_per_unit; run; endrsubmit; 35 continued... To receive this data for another step, I’ll write another SIGNON statement creating a SAS session named SORTTASK (don’t you love the names I came up with? Very creative I think!). In the RSUBMIT block for SORTTASK, you’ll notice that once again I have a LIBNAME statement with the SASESOCK engine that is pointing to that same pipe, :pipe1. I named this libref inlib just to distinguish it from the other LIBNAME statement. 
But you certainly could have used the same libref because this code will be processed in a separate SAS session. Now this RSUBMIT block is doing a PROC SORT. Remember that a TCP/IP pipe is one direction. So PROC SORT can read from the pipe but can’t write the data back. And in this case I wouldn’t want to do that anyway, because eventually I do want the data stored on disk somewhere. So notice the second LIBNAME statement creating a libref of newlib pointing to a physical path on my C: drive. In my PROC SORT, it’s data=inlib (the name of the libref of the pipe) dot equipment (the name of the data set the DATA step was creating), OUT=newlib (my disk libref) dot final. So this is the code that you need to pipe data from one step to the next. Again the keys are the separate RSUBMIT blocks and the SASESOCK engine. 38 Pipeline Parallelism Scaling Up Piping /* start session READTASK and execute first data step asynchronously to a pipe */ signon readtask sascmd='!sascmd -nosyntaxcheck'; rsubmit readtask wait=no; libname outlib sasesock ":pipe1"; data outlib.equipment; /* data step statements */ run; endrsubmit; /* start session SORTTASK and execute second step which gets its input from a pipe */ signon sorttask sascmd='!sascmd -nosyntaxcheck'; rsubmit sorttask wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equipment out=newlib.final; by costprice_per_unit; run; endrsubmit; waitfor _all_ readtask sorttask; signoff readtask; signoff sorttask; 36 Of course once all of the processing has completed, you will want to sign off of both tasks. So the two steps must be placed in two separate RSUBMIT blocks, or one RSUBMIT block and the parent process. This is required so that the two steps can run simultaneously. The first DATA step writes the results to the TCP/IP pipe. The job of the second step is to remove this data from the TCP/IP pipe.
Because of the limited amount of buffer space in the pipe, the processes must run at the same time. 2. Piping Code 39 Scaling Up Piping – Alternative Solution /* start session SORTTASK and execute step which gets its input from a pipe eventually */ signon sorttask sascmd='!sascmd -nosyntaxcheck'; rsubmit sorttask wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equipment out=newlib.final; by costprice_per_unit; run; endrsubmit; 37 continued... In that last scenario I kicked off two RSUBMIT SAS sessions, which was additional overhead and meant my parent SAS session wasn’t doing anything. Alternatively, I could just RSUBMIT the two LIBNAME statements and the PROC SORT, which will open up the pipe. 40 Pipeline Parallelism Scaling Up Piping – Alternative Solution /* start session SORTTASK and execute step which gets its input from a pipe eventually */ signon sorttask sascmd='!sascmd -nosyntaxcheck'; rsubmit sorttask wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equipment out=newlib.final; by costprice_per_unit; run; endrsubmit; /* In Parent Session submit... */ libname outlib sasesock ":pipe1"; data outlib.equipment; /* data step statements */ run; 38 continued... And then in my parent SAS session I have the LIBNAME statement that writes to the pipe with the DATA step. This will be slightly more efficient than the earlier example since there is less overhead to kick off an additional SAS session. I need to make sure my RSUBMIT block with the WAIT=NO option is submitted before the code that the parent SAS session is processing. Otherwise the parent SAS session will just start processing the DATA step and, in good, traditional SAS fashion, is not going to move past that step until it is completed, which means the RSUBMIT code will not process at the same time. 2.
Piping Code 41 Scaling Up Piping – Alternative Solution /* start session SORTTASK and */ /* execute step which gets its */ /* input from a pipe eventually */ signon sorttask sascmd='!sascmd -nosyntaxcheck'; rsubmit sorttask wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equipment out=newlib.final; by costprice_per_unit; run; endrsubmit; /* In Parent Session submit... */ libname outlib sasesock ":pipe1"; data outlib.equipment; /* data step statements */ run; 39 waitfor _all_ sorttask; signoff sorttask; You still need to make sure you wait for the sorttask session to complete, though, before you sign off of it. 42 Pipeline Parallelism Multiple Pipes signon readtask sascmd='!sascmd -nosyntaxcheck'; rsubmit readtask wait=no; libname outlib1 sasesock ":pipe1"; libname outlib2 sasesock ":pipe2"; data outlib1.equip outlib2.clothes; /* data step statements */ if type='EQUIPMENT' then output outlib1.equip; else if type='CLOTHES' then output outlib2.clothes; run; endrsubmit; 40 continued... I can also work with multiple pipes. In this RSUBMIT block I have two LIBNAME statements, one with a libref of OUTLIB1 pointing to pipe1, and one with a libref of OUTLIB2 pointing to pipe2. In my DATA statement, I have two data sets listed. Then in my DATA step I have a series of IF/THEN statements. So if the variable type has a value of “EQUIPMENT”, then I’ll write the observations to the OUTLIB1.EQUIP data set. Otherwise if the type is “CLOTHES”, then write the observation out to the OUTLIB2.CLOTHES data set. 2. 
Piping Code 43 Multiple Pipes signon readtask sascmd='!sascmd -nosyntaxcheck'; rsubmit readtask wait=no; libname outlib1 sasesock ":pipe1"; libname outlib2 sasesock ":pipe2"; data outlib1.equip outlib2.clothes; /* data step statements */ if type='EQUIPMENT' then output outlib1.equip; else if type='CLOTHES' then output outlib2.clothes; run; endrsubmit; signon equpsort sascmd='!sascmd -nosyntaxcheck'; rsubmit equpsort wait=no; libname inlib sasesock ":pipe1"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.equip out=newlib.final1; by costprice_per_unit; run; endrsubmit; 41 continued... Next I’ll have another RSUBMIT block that reads the data from pipe1, which was the pipe I wrote the equipment observations to. So this RSUBMIT block can start sorting those observations from the pipe as soon as the DATA step in the first RSUBMIT block starts writing to the pipe. Notice in the PROC SORT there is an OUT= option writing the sorted data to the NEWLIB.FINAL1 data set, where NEWLIB is pointing to C:\workshop\winsas\mpdp. 44 Pipeline Parallelism Multiple Pipes signon clthsort sascmd='!sascmd -nosyntaxcheck'; rsubmit clthsort wait=no; libname inlib sasesock ":pipe2"; libname newlib 'c:\workshop\winsas\mpdp'; proc sort data=inlib.clothes out=newlib.final2; by costprice_per_unit; run; endrsubmit; 42 continued... While this is executing I have another RSUBMIT block reading from pipe2, which is where the clothes observations had been written to. Once again this PROC SORT is writing out to a new data set as identified by the OUT= option, and this new data set name is NEWLIB.FINAL2. And the NEWLIB libref in this RSUBMIT block is also pointing to C:\workshop\winsas\mpdp. 2. Piping Code Multiple Pipes waitfor _all_ readtask clthsort equpsort; signoff readtask; signoff clthsort; signoff equpsort; 43 And as always, I have a WAITFOR statement waiting for all RSUBMIT blocks to complete before signing off. 45 46 Pipeline Parallelism Using Pipeline Parallelism ACCESSORIES.CSV DATA step CLOTHES.CSV DATA step CHILDREN.CSV DATA step Concatenate 44 Let me go ahead and demo how to use piping on a single machine. I need to read the ACCESSORIES raw data file, the CLOTHES raw data file, and the CHILDREN raw data file using separate DATA steps. I’ll take that data and send it through separate pipes and then concatenate the data and write it out to disk. Notice in my code I have four different SIGNON statements, one for each of the raw data files to be processed, and one that will perform the concatenation.
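The demonstration code itself is not printed in the transcript; as narrated, it has roughly this shape. The raw data file paths, the INPUT statement contents, and the name of the fourth session (here called concat) are placeholders I supplied, not details from the demo:

```sas
signon accwin   sascmd='!sascmd -nosyntaxcheck';
signon childwin sascmd='!sascmd -nosyntaxcheck';
signon clothwin sascmd='!sascmd -nosyntaxcheck';
signon concat   sascmd='!sascmd -nosyntaxcheck';

rsubmit accwin wait=no;
   libname outlib sasesock ":pipe1";
   data outlib.accessories;
      infile 'c:\workshop\winsas\mpdp\accessories.csv' dsd;  /* placeholder path */
      input /* variable list */;
   run;
endrsubmit;

/* childwin and clothwin follow the same pattern, reading */
/* children.csv to :pipe2 and clothes.csv to :pipe3       */

rsubmit concat wait=no;
   libname read1 sasesock ":pipe1";
   libname read2 sasesock ":pipe2";
   libname read3 sasesock ":pipe3";
   libname orion 'c:\workshop\winsas\mpdp';
   data orion.concat;
      set read3.clothes read2.childrens read1.accessories;
   run;
endrsubmit;

waitfor _all_ accwin childwin clothwin concat;
signoff accwin;   signoff childwin;
signoff clothwin; signoff concat;
```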
In my first RSUBMIT block, named accwin, I have a LIBNAME statement creating a libref of outlib using the SASESOCK engine and pointing to a pipe from my SERVICES file named :pipe1. In my DATA statement I’m creating OUTLIB.ACCESSORIES, so the accessories data set will be written to pipe1. Next I have a traditional INFILE statement pointing to the accessories.csv file and an INPUT statement. In the second RSUBMIT block, named childwin, I have a similar LIBNAME statement creating a libref of outlib but this time pointing to :pipe2. This INFILE statement points to the children raw data file. The third RSUBMIT block is the same, but this time named clothwin and pointing to :pipe3 with the INFILE statement reading the clothes raw data file. 2. Piping Code 47 In my fourth RSUBMIT block I have three LIBNAME statements creating librefs of READ1, READ2, and READ3 pointing to the pipes of :pipe1, :pipe2, and :pipe3 respectively. I have a fourth LIBNAME statement creating a libref of ORION that actually points to a path on my machine. The next statement is the DATA statement, which creates a SAS data set named concat in the libref ORION, which is the one pointing to disk. In the SET statement, it is SET READ3.CLOTHES READ2.CHILDRENS READ1.ACCESSORIES;. This will concatenate the data. After that RSUBMIT block is a WAITFOR all tasks to finish, and then a SIGNOFF statement for each task. I could have said SIGNOFF _ALL_;, as that is a new option in SAS 9.2. If you are using SAS 9.1, you need a separate SIGNOFF statement for each task. I’m going to go ahead and submit this code. And wait while the code is processed. Once the code has finished processing, I’m going to activate the log and scroll up to the top. All of the SIGNON statements look good. Scrolling down farther I see the LIBNAME statement that points to :pipe1 and the note says successfully assigned. Notice the engine name says SASESOCK and the physical name is pipe1.
The notes following the DATA step indicate that the data set OUTLIB.ACCESSORIES has 73,507 observations. Scrolling down there is a similar note about the libref being successfully assigned for pipe2, and the OUTLIB.CHILDRENS data set has 88,482 observations. Scrolling down even more is another successful libref pointing to pipe3 and a data set, OUTLIB.CLOTHES, with 110,032 observations. In the last RSUBMIT block you see the three LIBNAME statements pointing to pipe1, pipe2, and pipe3. The last LIBNAME statement successfully points to a path on the C: drive. All three data sets get concatenated, and the resulting permanent data set ORION.CONCAT has 272,021 observations. So this technique worked really well to prevent all of the intermediate writing to disk that would normally occur without piping.

Port Specifiers (Slide 45)

"parent-machine-name:port-number" specifies an explicit port number on the machine specified by parent-machine-name.

   libname payroll sasesock "orion.finance.com:256";

What if your machine has only a single processor and you attempt to run pipeline parallelism? The piping test case will probably take longer than the sequential test case, as a result of contention for the I/O channel, the CPU, and memory. If you work in an environment such as this, piping can still work for you: simply farm the work out to other machines. In the LIBNAME statement, before specifying the port number, just type the name of the remote machine and then a colon.

Port Specifiers (Slide 46)

"machine-name:service-name" specifies the name of a service on the machine specified by machine-name.

   libname payroll sasesock "orion.finance.com:pipe1";

Or, as we saw before, you can specify a service name instead of a port number. Here I have orion.finance.com:pipe1. When specifying a port on a remote machine you can only read from, not write to, that pipe.
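Putting the two port-specifier forms to work, a remote-read setup might be sketched as follows. The host name orion.finance.com comes from the slide; the task names, data sets, and BY variable are hypothetical, and remember that a remote pipe can only be read from, not written to:

```sas
/* Writer: must run on the machine that owns the pipe */
rsubmit writetask wait=no;
   libname outlib sasesock ":pipe1";
   data outlib.payroll;
      set work.payroll_raw;     /* hypothetical source data */
   run;
endrsubmit;

/* Reader: can run on a different machine, pointing at the pipe's owner */
rsubmit readtask wait=no;
   libname inlib sasesock "orion.finance.com:pipe1";
   proc sort data=inlib.payroll out=work.payroll_sorted;
      by employee_id;           /* hypothetical BY variable */
   run;
endrsubmit;

waitfor _all_ writetask readtask;
```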
Ensure that the port that the output is written to is on the same machine that the asynchronous process runs on. However, a SAS procedure that reads from that port can run on another machine.

3. Considerations and Benchmarking

Lastly, let's talk in more detail about the factors you need to consider and the benchmarking statistics.

Objectives (Slide 48)
- Factor into consideration issues associated with piping.
- Compare the benchmarks of piping versus not piping.

In this section you will learn about issues to consider with regard to piping. You will also see some benchmarks of using piping versus not using piping.

Additional Considerations (Slide 49)
- The benefits of piping should be weighed against the cost of potential CPU or I/O bottlenecks.
- If execution time for a SAS procedure or statement is relatively short, piping is probably counterproductive.
- If you have only a few I/O channels, I/O-intensive code (like PROC SORT) will clog up the I/O channels.
- You might minimize port access collisions on the same machine by reserving a range of ports in the SERVICES file.
- Be sure that the task reading the data does not complete before the task writing the data.

There are many items you need to consider before rewriting all of your code to support piping. The first is that piping should not be used if your execution time is already relatively short. You won't gain anything, and piping could have a negative impact because it requires more CPU time and can introduce I/O bottlenecks. Second, if you have few I/O channels available, I/O-intensive code like PROC SORT will clog up the I/O channels. How do you know how many I/O channels are available? Ask your systems administrator. Also, we recommend that you reserve a range of ports in your SERVICES file, which helps reduce port access collisions.
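For reference, reserving a block of adjacent ports in the SERVICES file might look like the following sketch (the port numbers are arbitrary examples; check with your systems administrator for ports that are actually free on your machine):

```
pipe1    5001/tcp    # reserved for SAS/CONNECT piping
pipe2    5002/tcp    # reserved for SAS/CONNECT piping
pipe3    5003/tcp    # reserved for SAS/CONNECT piping
```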
And lastly, the task reading the data must not complete before the task writing the data.

Closing a Pipe Early (Slides 50-53)
[Slides: a DATA step creating data set A writes through a pipe to a PROC step reading data set A with the OBS=10 option.]

For example, if the DATA step produces a large number of observations and PROC PRINT prints only the first few observations specified by the OBS= option, the reading task can close the pipe after those few observations are printed. This causes an error for the DATA step, because it continues trying to write to the pipe, which is now closed.

Let me show you what this would look like. Say I have a DATA step that creates data set A, and a PROC step that processes only the first 10 observations.
1. The DATA step opens the port and begins writing data to the pipe.
2. The PROC step begins reading the data from that pipe.
3. Once the PROC step finishes reading, it says "OK, I'm done" and closes the port.
4. Because the port is now closed, the DATA step can no longer write to it and generates an error message.

While the task that does the writing generates an error and will not complete, the task that reads will complete successfully. You can ignore the error in the writing task if its completion is not required (as is the case with the DATA step and PROC PRINT example here). Alternatively, move the OBS=10 option to the DATA step so that the reader does not finish before the writer.

Another timing issue: if the task that reads from the pipe opens the pipe but no data is written to it because of some kind of delay, the reading step times out and closes the pipe. So pictorially, once again let's say we have a DATA step writing to a pipe and a PROC step reading from that pipe (Slides 55-58). This particular DATA step does a lot of processing on each observation.
1. The DATA step opens the port and processes for a long time; no data is written out yet.
2. The PROC step keeps waiting for data in the port, and after awhile it gets bored. How long is "awhile"? By default, 10 seconds, which is the default value of the TIMEOUT= option in the LIBNAME statement using the SASESOCK engine.
3. As soon as that TIMEOUT value is reached without data appearing in the pipe, the PROC step closes the pipe.
4. Eventually the DATA step finishes with the first observation and tries to output it to the port. But because the port is closed, it can't, and you will get an error message.

TIMEOUT= Option (Slide 59)

   LIBNAME libref SASESOCK "port-specifier" <TIMEOUT=time-in-seconds>;

If you need the pipe to stay open longer, use the TIMEOUT= option in the LIBNAME statement to increase the timeout value for the task that is reading. This causes the reading task to wait longer for the writing task to begin writing to the pipe, which lets the initial steps in the writing task complete and the DATA step or SAS procedure begin writing to the pipe before the reader times out. The value of the TIMEOUT= option is the number of seconds to keep the pipe open.

Performance Gains (Slide 60)

Look at an example scenario to compare the following:
- traditional sequential processing
- MP CONNECT + piping
- MP CONNECT + piping + threading

MP CONNECT, pipeline parallelism, and threading improve performance, but by how much? Let's look at some different scenarios. First, we'll use traditional sequential processing. Next, you will see asynchronous processing plus piping. And the last scenario uses asynchronous processing plus piping plus threading.
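Looking back at the TIMEOUT= option, the reader's LIBNAME statement is where the larger value goes. A sketch, with an illustrative 60-second value:

```sas
/* Reader waits up to 60 seconds (instead of the default 10)
   for the writer to start putting data into the pipe */
libname inlib sasesock ":pipe1" timeout=60;
```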
This example was executed using the following criteria:
- an eight-way 900 MHz UNIX box
- two raw input files (~1.5 GB each)
- two DATA steps, two SUMMARY procedures, and a DATA step merge

Sequential SAS Job (Slide 61)
[Slide: EQUIP.CSV and OUTDOORS.CSV each feed a DATA step, then a PROC SUMMARY, and the results feed a DATA step merge.]

In this scenario I have a DATA step to read the EQUIP.CSV raw data file and do some calculations, and a DATA step to read the OUTDOORS.CSV file and do similar processing. Then each data set goes through a PROC SUMMARY, and the resulting output data sets are merged using a DATA step.

Execution Times (Slide 62)
- Sequential implementation: 1210 seconds
- MP CONNECT without threads: up next
- MP CONNECT with threads: on deck

With traditional processing, the sequential implementation took a little over 1200 seconds.

MP CONNECT and Piping (Slide 63)
[Slide: the two DATA steps and two PROC SUMMARY steps run simultaneously and pipe into the merge.]

Now we put in the code to process the equipment data and the outdoors data simultaneously, using one processor per task. We are also using piping to reduce disk space, I/O operations, and CPU time. Lastly, we merge the data sets together and write the result out to disk. Piping works well here because there is no need to retain the temporary data sets created by the two DATA steps and the subsequent PROC SUMMARY steps, so we can pipe the data from one step to the next. This gives us
- overlapped execution of all of the steps
- no writes to disk for intermediate results.

Execution Times (Slide 64)
- Sequential implementation: 1210 seconds
- MP CONNECT and piping without threads: 620 seconds (49% improvement)
- MP CONNECT and piping with threads: up next

The overall elapsed time for this technique was almost half that of the sequential processing. Next, let's look at adding threading to this scenario.
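The threading change discussed next is a one-word addition to each PROC SUMMARY statement. A hedged sketch (the librefs, BY variable, and analysis variable are hypothetical, not taken from the benchmark code):

```sas
proc summary data=inlib.equip threads;   /* THREADS re-enables threading for BY-group processing */
   by product_group;                     /* hypothetical BY variable */
   var total_retail_price;               /* hypothetical analysis variable */
   output out=outlib.equip_summary sum=;
run;
```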
MP CONNECT with Piping and Threads (Slide 65)

PROC SUMMARY turns threading off by default for BY-group processing. However, because these BY-groups are fairly large, I can turn threading on by adding the THREADS option to the PROC SUMMARY statement. This lets additional CPUs process the PROC SUMMARY code.

Execution Times (Slide 66)
- Sequential implementation: 1210 seconds
- MP CONNECT and piping without threads: 620 seconds (49% improvement)
- MP CONNECT and piping with threads: 382 seconds (70% improvement)

With threading turned on, the overall elapsed time dropped to 382 seconds, which is a 70% improvement over the sequential execution.

MP CONNECT and Threaded Summary (Slide 67)
[Slide: a bar chart comparing relative elapsed times; total improvement in elapsed time of 70%.]

So graphically, the top bar represents the relative elapsed time necessary to run the original sequential test case. The second bar represents the time savings from the MP CONNECT solution and piping. The bottom bar represents the final implementation combining MP CONNECT functionality, piping, and threaded PROC SUMMARY. The combination gives us an improvement in total elapsed time of 70%.

Summary (Slide 68)

Pipeline parallelism can significantly reduce the overall elapsed time a program takes to run. By taking advantage of pipes you can eliminate the need to write the data to disk and overlap the processing of steps that work on the same data.

Credits (Slide 69)

Pipeline Parallelism was developed by M. Michelle Buchecker. Additional contributions were made by Cheryl Doninger, Glenn Horton, Merry Rabb, and Chris Riddiough.

That concludes the Pipeline Parallelism e-lecture.

Comments? We would like to hear what you think. Do you have any comments about this lecture? Did you find the information in this lecture useful?
What other e-lectures would you like to see SAS develop in the future? If you have any comments on this lecture or other lectures you would like to see, please e-mail [email protected].

Copyright (Slide 71)

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2009 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved.

I hope you learned a lot in this e-lecture and can put the topics to good use.

Appendix A Demonstration Programs

1. Using Pipeline Parallelism (Section 2, Slide 44)

   signon accwin sascmd='!sascmd -nosyntaxcheck';
   signon childwin sascmd='!sascmd -nosyntaxcheck';
   signon clothwin sascmd='!sascmd -nosyntaxcheck';
   signon combine sascmd='!sascmd -nosyntaxcheck';

   rsubmit accwin wait=no;
      libname outlib sasesock ":pipe1";
      data outlib.accessories;
         infile 'c:\workshop\winsas\mpdp\accessories.csv' dsd;
         input order_id order_item_num product_id quantity
               total_retail_price :comma7.
               costprice_per_unit :comma7.
               discount :percent7.
               product_name :$45.
               supplier_id product_level product_ref_id;
      run;
   endrsubmit;

   rsubmit childwin wait=no;
      libname outlib sasesock ":pipe2";
      data outlib.childrens;
         infile 'c:\workshop\winsas\mpdp\children.csv' dsd;
         input order_id order_item_num product_id quantity
               total_retail_price :comma7.
               costprice_per_unit :comma7.
               discount :percent7.
               product_name :$45.
               supplier_id product_level product_ref_id;
      run;
   endrsubmit;

   rsubmit clothwin wait=no;
      libname outlib sasesock ":pipe3";
      data outlib.clothes;
         infile 'c:\workshop\winsas\mpdp\clothes.csv' dsd;
         input order_id order_item_num product_id quantity
               total_retail_price :comma7.
               costprice_per_unit :comma7.
               discount :percent7.
               product_name :$45.
               supplier_id product_level product_ref_id;
      run;
   endrsubmit;

   rsubmit combine wait=no;
      libname read1 sasesock ":pipe1";
      libname read2 sasesock ":pipe2";
      libname read3 sasesock ":pipe3";
      libname orion 'c:\workshop\winsas\mpdp';
      data orion.concat;
         set read3.clothes read2.childrens read1.accessories;
      run;
   endrsubmit;

   waitfor _all_ accwin childwin clothwin combine;
   signoff accwin;
   signoff childwin;
   signoff clothwin;
   signoff combine;