Systat: Statistical Visualization Software
Hilary R. Hafner
Jennifer L. DeWinter
Steven G. Brown
Theresa E. O'Brien
Sonoma Technology, Inc.
Petaluma, CA
Presented in Toledo, OH
October 28, 2011
STI-910019-3946
Topics to Cover
• Systat basics
– What is Systat?
– Why use Systat?
– Overview of the user interface
– Resources
– Command language vs. menus
• Importing data
– Accepted file types
– Formatting
– Limitations
– Tips and tricks
• Analysis tools
– Graphs and analyses
– Statistics
• Data manipulation
– Command language vs. menus
– Creating variables, appends/merges, transformations, selections, and grouping
• Saving output
• Graph customizations
• Advanced graphs and analyses
– Regression, significance tests, nonparametric tests, factor analysis, cluster analysis, and
analysis of variance (ANOVA)
TOPICS
What Is Systat?
Systat is statistical and graphical analysis
software that allows you to explore your data
using both menus and a batch command
language (similar to macros)
[Figure: example Systat graphs. Recoverable labels: TNMOC, BENZW, HOUR, Count, YEAR (1994, 1995)]
INTRODUCTION
Why Use Systat?
• In data analysis, we nearly always need to
investigate central tendencies, correlations,
trends, and other statistical descriptions of data
• Systat's graphical interface allows the analyst to
immediately see the data and rapidly generate
and regenerate graphs for review
• Systat contains statistical functions not found in
Excel or Access
Systat Basics – Graphical User Interface
• Viewspace
• Workspace
• Commandspace
Systat Basics – File Types
Output (filename.syo)
Data (filename.syd, .syz)
Command (filename.syc)
Systat Basics – Resources
Help – a click away
• Index
• Search
• Mouse-overs, F1 key
• ? button
• Command line
• Manuals
• Examples
Training videos at http://www.systat.com/downloads/
Useful: Interface, data, graph, help
Command Language vs. Menus
• Systat is a menu-driven Windows package, but the command
language fully covers everything available in the menus
• Commands are useful for repetitive analyses (and we
almost never do anything just once!)
• Commands help the analyst document analyses that
have been performed and where the output is stored
• Commands can be used in future analyses
• Log window in Systat records most actions
• Commands = faster!
Importing Data into Systat
• Accepted file formats
• Limitations
• Data formatting
• Tips and tricks
IMPORTING DATA
• $ signifies a text field
• “.” signifies missing data
• Names longer than one word require an underscore character
• Text fields are left-justified
Tricks and Tips with Excel
• Data sets can be processed in Excel prior to bringing them into
Systat
• Make date/time conversions and calculations in Excel (convert
date/time into separate fields for day of week, month, day, year,
etc.)
• Prepare sums and other calculations easily performed in Excel
• Copy/paste values to remove all formulae
• Check that records are continuous
• Replace missing values (e.g., -999) with '.' (Systat's missing
value code)
• Save as Excel (designate by NAME_sys.xls)
• Note that only one page of a workbook can be selected per
import
Hot tip: Systat doesn’t like the variable name “temp”
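The same preparation steps can also be scripted before import instead of being done by hand in Excel. The following is a minimal Python sketch, not part of Systat: it splits a date/time field into separate columns, replaces the missing-value sentinel with ".", converts spaces in names to underscores, and renames a column called "temp". The input column names (datetime, o3, temp), the sentinel set, and the replacement name TEMPC are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

MISSING_CODES = {"-999", "-999.0", ""}  # assumed sentinel values for missing data
RENAMES = {"temp": "TEMPC"}             # hypothetical replacement for the name "temp"


def prepare_rows(reader):
    """Yield cleaned rows: split the timestamp, fix names, mark missing values with '.'."""
    for row in reader:
        # "datetime" is a hypothetical column name holding the combined date and time.
        stamp = datetime.strptime(row.pop("datetime"), "%Y-%m-%d %H:%M")
        clean = {
            "YEAR": stamp.year,
            "MONTH": stamp.month,
            "DAY": stamp.day,
            "HOUR": stamp.hour,
            "DOW": stamp.isoweekday(),  # 1 = Monday ... 7 = Sunday
        }
        for name, value in row.items():
            # Rename problem columns, then use upper case and underscores in names.
            name = RENAMES.get(name.lower(), name).upper().replace(" ", "_")
            value = value.strip()
            clean[name] = "." if value in MISSING_CODES else value
        yield clean


# Small in-memory example standing in for an exported data file.
raw = io.StringIO(
    "datetime,o3,temp\n"
    "2008-07-04 14:00,62,31.5\n"
    "2008-07-04 15:00,-999,32.0\n"
)
rows = list(prepare_rows(csv.DictReader(raw)))
for row in rows:
    print(row)
```

The cleaned rows could then be written out with csv.DictWriter as a delimited text file for import.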
Exploring Your Variables
Ozone data: right-click on “variable statistics”
Common Graphs and Analyses
• The analyst must determine the appropriate plot(s) to answer different types
of questions
• Systat can create numerous types of graphs and plots and perform many statistical
functions
[Figures: example plots. Recoverable axis labels: O3, YEAR, WDWE, TEMP]
DATA ANALYSIS
Commonly Used Plots and
Statistical Functions
• Summary statistics – quantify data characteristics
• Histograms – understand data distribution
• Bar charts – compare quantities (counts or means)
• Scatter plots – understand relationships
• Box plots – compare distribution and central tendencies
• Scatter plot matrices – compare many relationships
• Correlation analysis – quantify relationships
• Linear regression – identify predictive variables
Open (WY_Site0123_data_ct.syz)
Summary Statistics Used for Trends Plots
[Figure: Average Ozone. Conc. (ppb) versus Hour, one series per year for 2005, 2006, 2007, and 2008]
• Diurnal trends in median ozone concentrations for a Wyoming site from 2005 to 2008
• Overall increase in average ozone concentrations observed – less titration?
• Plot was created in Excel from Systat summary statistics by year and hour
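To make the "summary statistics by year and hour" step concrete, here is a minimal Python sketch of the grouping behind such a trends plot. It is an illustration only, not the Systat procedure used for the slide; the records are made-up values, and None is assumed to mark a missing observation.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical hourly records: (year, hour, ozone in ppb); None marks a missing value.
records = [
    (2005, 14, 48.0), (2005, 14, 52.0), (2005, 14, 56.0),
    (2008, 14, 55.0), (2008, 14, None), (2008, 14, 61.0),
]


def summarize(records):
    """Group ozone values by (year, hour) and compute n, mean, and median per group."""
    groups = defaultdict(list)
    for year, hour, o3 in records:
        if o3 is not None:  # skip missing values rather than treating them as zero
            groups[(year, hour)].append(o3)
    return {
        key: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in sorted(groups.items())
    }


summary = summarize(records)
for (year, hour), stats in summary.items():
    print(year, hour, stats["n"], stats["mean"], stats["median"])
```

Each (year, hour) row of the result corresponds to one point on one line of the trends plot.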
Scatter Plots
Scatter plots are useful for determining relationships between variables
[Figure: Sulfur vs. Sulfate. S25LC_7 versus SO425LC_7]
[Figure: NO2 vs. Ozone. NO2 versus O3]
These plots are useful for both data validation and analysis
• Are there outliers, and if so, how do they affect comparisons?
• What are the similarities/differences between parameters?
Example of Scatter Plot
Do we see the expected relationships?
REM The following command (PLOT)
REM creates a scatter plot of NO2
REM concentrations by wind direction
REM and year.
PLOT NO2*RD / OVERLAY GROUP={YEAR}
[Figure: scatter plot of NO2 (0 to 40) versus RD (0 to 360), grouped by YEAR (2005 to 2008)]
This graphic explores NO2 concentrations and resultant wind direction as a function of year.
Is there a change in the direction of high concentrations in this time period?
Box-Whisker Plots
• Sample box-whisker plot and a notched box-whisker plot as defined by Systat
• Always define this plot because different packages have different definitions
A confidence interval (CI) for a population parameter is an interval with an
associated probability p that is generated from a random sample of an underlying
population such that if the sampling were repeated numerous times and the
confidence interval recalculated from each sample according to the same method,
a proportion p of the confidence intervals would contain the population
parameter in question.
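The repeated-sampling definition of a confidence interval can be checked by simulation. The minimal Python sketch below assumes a normal population with known standard deviation and uses the z-interval for the mean; the population values (mean 50, standard deviation 10), sample size, and number of trials are arbitrary illustration choices and are unrelated to how Systat computes box-plot notches.

```python
import random
from statistics import NormalDist, mean


def coverage(p=0.95, n=30, trials=2000, mu=50.0, sigma=10.0, seed=1):
    """Return the fraction of simulated confidence intervals that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + p / 2)  # two-sided critical value for level p
    half_width = z * sigma / n ** 0.5      # known-sigma interval half-width
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # The interval is sample mean +/- half_width; count it if it covers mu.
        if abs(mean(sample) - mu) <= half_width:
            hits += 1
    return hits / trials


observed = coverage()
print(observed)
```

With enough trials, the printed proportion should settle near the chosen p, which is what the definition states.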
Example of a Notched Box-Whisker Plot
Notched box-whisker plots are useful for showing the central trends of
the data (i.e., the median) while also showing variability (i.e., the box
and whiskers)
REM The following command (DENSITY)
REM creates a notched box plot of ozone
REM concentrations by year.
DENSITY O3 * YEAR / BOX NOTCH COLOR=BLACK
O3 = ozone (ppb)
Linear and Nonlinear Regression
Regression analyses identify and quantify predictive
relationships between variables
Options
• Multiple linear regression
• Stepwise regression
• Automatic outlier and
influential point detection
• Plots of residuals vs.
predicted values
• Many nonlinear regression
forms
Example Linear Regression Analysis
• Before performing linear regression, it is vital to examine a
scatter plot of the data!
• Outliers at the ends of the data set strongly influence linear
regression
Total nonmethane organic
compounds (TNMOC) and
NOx at 7 a.m. in an urban
setting should have
relatively good correlation
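The influence of an outlier at the end of the data range can be shown with a small least-squares fit. This is a minimal Python sketch on made-up numbers (not the TNMOC and NOx data): ten points lie exactly on y = 2x + 1, and one extra point is added far to the right with a low y value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: ten points exactly on the line y = 2x + 1.
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs]

slope_clean, intercept_clean = fit_line(xs, ys)

# Add a single outlier at the far end of the x range.
slope_outlier, intercept_outlier = fit_line(xs + [20], ys + [0])

print(slope_clean, intercept_clean)
print(slope_outlier, intercept_outlier)
```

One point out of eleven is enough to flatten the fitted slope almost completely, which is why the scatter plot should be examined first.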
Example Results
Effect    Coefficient  Standard Error  Std. Coefficient  Tolerance  t       p-Value
CONSTANT  28.134       2.953           0.000             .          9.527   0.000
NOX       2.485        0.112           0.706             1.000      22.283  0.000
Final equation: TNMOC = 2.5(NOx) + 28.1
Case 344 is an Outlier (Studentized Residual: 11.168)
Case 2,360 has large Leverage (Leverage: 0.053)
Case 2,576 has large Leverage (Leverage: 0.047)
Case 2,648 has large Leverage (Leverage: 0.038)
Case 2,936 has large Leverage (Leverage: 0.036)
Case 5,408 has large Leverage (Leverage: 0.036)
Case 8,028 is an Outlier (Studentized Residual: 5.155)
Case 11,490 has large Leverage (Leverage: 0.047)
Case 14,536 has large Leverage (Leverage: 0.060)
Case 16,240 has large Leverage (Leverage: 0.040)
Case 17,488 has large Leverage (Leverage: 0.045)
Case 18,256 has large Leverage (Leverage: 0.047)
Case 19,432 is an Outlier (Studentized Residual: -4.275)
Dependent Variable: TNMOC
N: 502
Multiple R: 0.706
Squared Multiple R: 0.498
Adjusted Squared Multiple R: 0.497
Standard Error of Estimate: 41.259
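The reported numbers are related to one another, and a short check can help when reading output like this. The minimal Python sketch below uses only the values printed above: t is the coefficient divided by its standard error, the squared multiple R is the square of multiple R, and the adjusted value applies the usual correction for sample size with one predictor. Small differences from the printed t for NOX are expected because the printed coefficient and standard error are rounded. The function name predict_tnmoc is a hypothetical helper for this illustration.

```python
# Values copied from the regression output: (coefficient, standard error).
coefficients = {"CONSTANT": (28.134, 2.953), "NOX": (2.485, 0.112)}
n_cases = 502
n_predictors = 1
multiple_r = 0.706

# t statistic = coefficient / standard error.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

# Squared multiple R and its adjusted form.
r_squared = multiple_r ** 2
adjusted_r_squared = 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


def predict_tnmoc(nox):
    """Predicted TNMOC from the fitted line (hypothetical helper for illustration)."""
    slope = coefficients["NOX"][0]
    intercept = coefficients["CONSTANT"][0]
    return slope * nox + intercept


print(t_values)
print(round(r_squared, 3), round(adjusted_r_squared, 3))
print(predict_tnmoc(100.0))
```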
Figure annotation: random scatter desired
Summary
• Systat is a powerful graphical statistical
tool
• Explore options and learn statistics through
use of the Help facility and examples
• Share your command files, tips, and tricks
with other users
SUMMARY
Appendix – Key Systat Commands
• Box plot in black and white
– DENSITY benz*year / BOX NOTCH COLOR=BLACK
• Save output, graphs
– OSAVE “file path and name”/rtf (best for multiple graphs
such as with “by” command)
– GSAVE “file path and name”/wmf (also saves .bmp, .emf,
.pct, .eps, .pg, and .cgm formats)
– SAVE “file path and name”
• Export to Excel
– EXPORT “file path and name.xls”/type=excel
– Note that this saves only about 16,000 lines (the Excel 3.0 format is limited to 16,384 rows)!
APPENDIX
Appendix – Key Systat Commands
• Select range of data
– SELECT QCS=0 AND MONTH>5 AND MONTH<10
• Scatter plot matrix
– SPLOM var1 var2 etc. / half color=black
• Setting coordinates
– DENSITY benz*hour / BOX NOTCH COLOR=BLACK
xmin=0 xmax=24 xtick=6
Appendix – Troubleshooting Ideas
• Importing
– Remove any formulas or formatting in file
– Make sure there are no gaps (empty lines)
– Make sure each column is uniquely named
– Save as Excel 3.0 or tab-delimited .txt
• Scripts
– Go via menu and compare log with script
– Move the “stats” line one line up or down