Getting started with SAS University Edition - InfoLab

University of Southern California
INF 550 : Overview of Data Informatics in Large Data Environments
TABLE OF CONTENTS
Installing SAS University Edition
Getting a Dataset: Kaggle's "Titanic: Machine Learning from Disaster"
Analysing the Data
Importing training dataset into SAS
Who were the passengers on the Titanic? (Ages, Gender, Class, etc.)
What factors played a role in a person's survival?
INSTALLING SAS UNIVERSITY EDITION
1. Visit the link below and follow the instructions given there
http://www.sas.com/en_us/software/university-edition/download-software.html
The Quick Start document and videos provided there are precise and clear.
GETTING A DATASET: KAGGLE'S "TITANIC: MACHINE LEARNING FROM DISASTER"
1. Create an account on the website https://www.kaggle.com
2. Go to https://www.kaggle.com/c/titanic/data
3. Download the file train.csv
The file contains information on some of the passengers, such as Age, Sex, Fare paid, Passenger
Class and Cabin and, most importantly, whether they survived the wreck.
The objective of the problem is to predict whether a passenger survived the shipwreck, given
the other attributes of that passenger.
4. Copy the file train.csv into a "kaggle_titanic" subfolder of the shared folder ("myfolders") that you
created while setting up the SAS University Edition server. The import code in the next section
reads the file from /folders/myfolders/kaggle_titanic/train.csv.
ANALYSING THE DATA
IMPORTING TRAINING DATASET INTO SAS
1. Click the "Server Files and Folders" tab in the top-left region of your screen
2. Click New -> SAS Program
3. In the SAS program window, enter the following code
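/*Read the Kaggle CSV into the dataset WORK.TRAIN, overwriting any earlier copy*/
PROC IMPORT DATAFILE="/folders/myfolders/kaggle_titanic/train.csv"
    OUT=WORK.TRAIN DBMS=CSV REPLACE;
    DELIMITER=",";
    /*Use the first row of the file as the column names*/
    GETNAMES=YES;
RUN;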
4. You can now see the imported data as a table in the Results tab
5. The dataset can also be seen under the "Work" library
6. Go back to the SAS program window and enter the following code
PROC CONTENTS DATA=WORK.TRAIN;
RUN;
7. This displays details about the imported data, such as each variable's name, type and length
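Before moving on, it can help to look at a few rows and check for gaps in the data. The following is a minimal sketch, assuming the import above created WORK.TRAIN; the choice of five rows and of the N, NMISS, MIN and MAX statistics is only illustrative.

```sas
/*Show the first five rows of the imported table*/
PROC PRINT DATA=WORK.TRAIN (OBS=5);
RUN;

/*For every numeric column, count non-missing (N) and missing (NMISS) values*/
PROC MEANS DATA=WORK.TRAIN N NMISS MIN MAX;
RUN;
```

A non-zero NMISS for a column such as Age means that later steps which compare or plot that column need to account for missing values.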
WHO WERE THE PASSENGERS ON THE TITANIC? (AGES, GENDER, CLASS, ETC.)
CHECKING THE GENDER RATIO
/*Tabular Format*/
PROC FREQ DATA=WORK.TRAIN;
TABLES Sex;
RUN;
/*Visualizing the table*/
TITLE 'Gender distribution onboard';
ODS graphics / reset width=3in height=4.8in imagemap;
PROC SGPLOT data=WORK.TRAIN;
/*--Bar chart settings--*/
VBAR Sex;
yaxis GRID;
RUN;
SEPARATING GENDER BY CLASS
/*Tabular Format*/
PROC FREQ DATA=WORK.TRAIN;
/*TABLES Pclass*Sex / NOROW NOPERCENT;*/
TABLES Pclass*Sex;
RUN;
/*Visualizing the table*/
TITLE 'Gender distribution by class';
ODS graphics / reset width=4in height=4in imagemap;
/*--SGPLOT proc statement--*/
PROC SGPLOT data=WORK.TRAIN;
/*--Bar chart settings--*/
/*--Group by Sex: the Gender column is only created in the next section--*/
VBAR Pclass / GROUP=Sex groupdisplay=Cluster;
yaxis GRID;
RUN;
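The commented-out TABLES statement in the tabular step above suggests trimming the cross-tabulation. As a sketch of one such variant, the options below keep only the counts and the row percentages, so each row shows the male/female split within one passenger class. The choice of NOCOL and NOPERCENT here is an assumption about what is most useful, not part of the original analysis.

```sas
TITLE 'Share of each sex within a class';
PROC FREQ DATA=WORK.TRAIN;
    /*NOCOL drops column percentages, NOPERCENT drops overall percentages*/
    TABLES Pclass*Sex / NOCOL NOPERCENT;
RUN;
```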
SEPARATING POPULATION BASED ON MALE/FEMALE/CHILD
/*Adding a new column "Gender" based on age*/
DATA WORK.TRAIN;
SET WORK.TRAIN;
LENGTH Gender $ 6;
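/*SAS treats a missing Age as smaller than any number, so exclude it explicitly*/
IF NOT MISSING(Age) AND Age < 16 THEN Gender = "child";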
ELSE Gender = Sex;
RUN;
TITLE 'Distribution by male/female/child';
ODS graphics / reset width=4in height=4in imagemap;
/*--SGPLOT proc statement--*/
PROC SGPLOT data=WORK.TRAIN;
/*--Bar chart settings--*/
VBAR Pclass / GROUP=Gender groupdisplay=Cluster;
yaxis GRID;
RUN;
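The section heading also mentions ages, so a look at the age distribution itself may be useful alongside the male/female/child split. This is a minimal sketch, assuming the Gender column from the DATA step above exists; the title text, axis label and kernel density overlay are illustrative choices.

```sas
/*Counts for the derived male/female/child column*/
PROC FREQ DATA=WORK.TRAIN;
    TABLES Gender;
RUN;

TITLE 'Age distribution onboard';
ODS graphics / reset width=4in height=4in imagemap;
PROC SGPLOT data=WORK.TRAIN;
    /*Histogram of Age with a kernel density estimate on top*/
    HISTOGRAM Age;
    DENSITY Age / TYPE=KERNEL;
    xaxis LABEL='Age (years)';
    yaxis GRID;
RUN;
```

Passengers with a missing Age are left out of the histogram automatically.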
WERE PEOPLE TRAVELLING ALONE OR WITH FAMILY?
DATA WORK.TRAIN;
SET WORK.TRAIN;
LENGTH Alone $ 12;
IF SibSp + Parch > 0 THEN Alone = "With Family";
ELSE Alone = "Alone";
RUN;
TITLE 'People travelling alone or with family';
ODS graphics / reset width=5in height=4in imagemap;
/*--SGPLOT proc statement--*/
PROC SGPLOT data=WORK.TRAIN;
/*--Bar chart settings--*/
VBAR Alone / FILLATTRS=(color=CX4c60a2);
yaxis GRID;
RUN;
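The Alone flag collapses family size into two groups. To see the sizes behind it, one option is a short PROC SQL query. This is only a sketch: FamilySize and Passengers are hypothetical column names introduced for illustration, and the sum SibSp + Parch counts relatives on board, excluding the passenger.

```sas
TITLE 'Number of passengers by family size';
PROC SQL;
    /*FamilySize and Passengers are illustrative names, not columns of train.csv*/
    SELECT SibSp + Parch AS FamilySize,
           COUNT(*) AS Passengers
    FROM WORK.TRAIN
    GROUP BY CALCULATED FamilySize
    ORDER BY FamilySize;
QUIT;
```

The row with FamilySize equal to 0 should match the "Alone" bar in the chart above.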
WHAT FACTORS PLAYED A ROLE IN A PERSON'S SURVIVAL?
UPDATING "SURVIVED" COLUMN FOR BETTER READABILITY
DATA WORK.TRAIN;
LENGTH Survived $ 3;
SET WORK.TRAIN (RENAME=(Survived=Survival));
IF Survival = 0 THEN Survived = "No";
ELSE Survived = "Yes";
RUN;
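A quick way to confirm the recoding is to tabulate the original numeric column against the new text column. This is a sketch that assumes the DATA step above has been run once, so that Survival holds the 0/1 values and Survived holds "No"/"Yes".

```sas
TITLE 'Check of the Survived recoding';
PROC FREQ DATA=WORK.TRAIN;
    /*LIST prints one line per combination; MISSING also shows missing values*/
    TABLES Survival*Survived / LIST MISSING;
RUN;
```

Only two combinations are expected: 0 with "No" and 1 with "Yes". Note that running the recoding DATA step a second time would fail, because the column Survival already exists after the first run.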
WHAT WERE THE SURVIVAL RATES FOR EACH GENDER?
/*Tabular Format*/
ODS NOPROCTITLE;
PROC FREQ DATA=WORK.TRAIN;
TABLES Survived*Gender / NOROW NOPERCENT;
RUN;
/*Visualizing the table*/
TITLE 'Survival rate based on gender';
ODS graphics / reset width=4in height=4in imagemap;
/*--SGPLOT proc statement--*/
PROC SGPLOT data=WORK.TRAIN;
/*--Bar chart settings--*/
VBAR Gender / GROUP=Survived groupdisplay=Cluster;
yaxis GRID;
RUN;
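The table and chart above show counts. Because the numeric Survival column is coded 0/1, its mean within a group is that group's survival rate, so the rates can also be read off directly. A minimal sketch, assuming the Gender and Survival columns created earlier; MAXDEC=3 is only a display choice.

```sas
TITLE 'Survival rate by male/female/child';
PROC MEANS DATA=WORK.TRAIN N MEAN MAXDEC=3;
    /*One output row per Gender value*/
    CLASS Gender;
    /*The mean of a 0/1 column is the proportion of 1s, i.e. the share who survived*/
    VAR Survival;
RUN;
```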
HOW DOES SURVIVAL VARY WITH AGE AND CLASS?
Here we will use the SAS Studio task interface, instead of writing code, to produce an analysis of covariance plot
1. Click the "Tasks" tab in the left column of the SAS Studio screen
   • A list of task categories that can be performed through the UI will appear
2. Click the arrow to the left of the "Statistics" task category
3. Scroll down the list that appears to the "Analysis of Covariance" task and double-click it
4. A new task window opens in the right portion of the screen
5. Select the following values for the options you see on the screen
   • Data: WORK.TRAIN
   • Dependent variable: Survival
   • Categorical variable: Pclass
   • Continuous covariate: Age
6. As you select the options, you will see the corresponding code being generated
7. Press the F3 function key to run the generated code
8. Scroll down the "Results" tab to see the generated covariance plot
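For readers who prefer code, the selections above correspond roughly to a PROC GLM call with one classification variable and one continuous covariate. The following is a hand-written sketch under that assumption. The code the task actually generates may differ in its options and plot requests.

```sas
ODS GRAPHICS ON;
TITLE 'Survival by age and class (analysis of covariance)';
PROC GLM DATA=WORK.TRAIN;
    /*Pclass is the categorical variable*/
    CLASS Pclass;
    /*Survival is the numeric 0/1 column; Age is the continuous covariate*/
    MODEL Survival = Pclass Age;
RUN;
QUIT;
```

Rows with a missing Age are excluded from the fit, so the plot reflects only passengers whose age is recorded.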