No.12 Basic Stata

29/10/57
Basic Concepts of Medical Statistics
Pawin Numthavaj M.D.
Section for Clinical Epidemiology and Biostatistics
Faculty of Medicine Ramathibodi Hospital
Mahidol University
Outline
•Types of data
•Descriptive statistics
•Inferential statistics
•Statistical software
•How to use the Stata program
Types of Data
•Categorical (Qualitative)
  •Nominal
  •Ordinal
•Numerical (Quantitative)
  •Discrete
  •Continuous
Categorical data
• Data whose possible entries are “categories”
1. Nominal
• Categories have no inherent ‘order’
• Examples:
  • Blood group: A / B / AB / O
  • Sex: Male / Female
  • Cancer status: Have cancer / Do not have cancer
• Special case with only 2 categories: binary (dichotomous) data
2. Ordinal
• Categories have an ‘order’: we can see which is bigger/smaller
• Examples:
  • Cancer staging: I / II / III / IV
  • Degree of disease: Mild / Moderate / Severe
Numerical data
•Data whose possible entries are numbers
1. Discrete
•Values move in steps (counts)
•Usually without fractions
•Examples:
  • Number of positive lymph nodes after surgery
  • Number of heartbeats per minute
2. Continuous
•Values can include fractions
•Examples:
  • Age (years)
  • Cholesterol level (mg/dL)
Types of data
•The type of data determines the right statistical analysis for:
  •Summarizing
  •Estimation
  •Hypothesis testing
Post test for types of data
Data                              Type of data
Weight (kg)                       Numerical
History of smoking (Y/N)          Categorical (dichotomous)
Serum potassium level             Numerical
Length of stay in hospital (day)  Numerical
Patient status (dead/alive)       Categorical (dichotomous)
Descriptive statistics
•Summaries of data
•Methods for describing a set of data by using graphs or summary measures
•Provide an ‘overview’ of the general features of a set of data
Summarizing of Data
•Categorical Data
  - Frequency
  - Percentage
•Numerical Data
  - Measures of central tendency
  - Measures of dispersion
Summarizing categorical data
•Frequency
•Percentage
Summarizing Categorical Data
•Ex1. Summary of sex among 70 patients
Sex      Frequency  Percentage
Male            56          80
Female          14          20
Total           70         100
Summarizing Categorical Data
•Ex2. Summary of cancer staging
Stage of cancer  Frequency  Percentage
I                      120
II                     320
III                    160
IV                     200
Total                  800       100.0
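The percentage column of Ex2 can be filled in with Stata itself. The sketch below is only an illustration: it types the four frequencies in by hand and assumes the fourth row (printed as a second "III" on the slide) is stage IV, coded here as 4. The variable names stage and freq are made up for this example.

```stata
* Enter the Ex2 frequency table as a small dataset
* (stage 4 is an assumption: the slide prints "III" twice)
clear
input stage freq
1 120
2 320
3 160
4 200
end
* fweight treats each row as freq identical patients
tabulate stage [fweight=freq]
```

Each percentage is the frequency divided by the total of 800, so the table should show 15, 40, 20 and 25 percent.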
Statistical Software and Stata Program
Statistical Software
•SPSS
  Website: http://www.ibm.com/software/analytics/spss/
  Price: $$$$$
  Features: ++++
  Ease of use: ++++
  Note: Need to purchase separate modules for complicated analyses (such as Survival Analysis). Available from MU (http://softwaredownload.mahidol/)
•Stata
  Website: http://www.stata.com/
  Price: $$$$
  Features: ++++
  Ease of use: +++
  Note: Ramathibodi access (CEB server)
•R
  Website: http://www.r-project.org/
  Price: (Free)
  Features: +++
  Ease of use: +
  Note: R-commander is a nice add-on
•SAS
  Website: http://www.sas.com/
  Price: $$$$$
  Features: ++++
  Ease of use: 0
  Note: Needs programming skill
Stata window
•Console
•Command box
•Review pane
•Variables
•Properties
Stata commands
•Type a Stata command in the Command box and then see the result
•Most commands also have an equivalent menu and dialog
•Beware of UPPER and lower case: Stata is case-sensitive
•Beware of commas, colons, and spaces
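A minimal sketch of these rules, using the auto example dataset that ships with Stata (an assumption for illustration, not the course data):

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear
* Commands and variable names are typed in lower case
summarize mpg
* Options go after a comma
summarize mpg, detail
* Wrong case: Stata does not recognize "Summarize".
* capture noisily shows the error message but lets a do-file keep running
capture noisily Summarize mpg
* A prefix such as bysort ends with a colon
bysort foreign: summarize mpg
```

Comment lines start with an asterisk, so the block can be pasted into the Command box or saved as a do-file.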
Let’s use Stata as a calculator
•Calculate the sum of 23, 25, 29, and 30
display 23+25+29+30
•Open the help file for a command
help display
•di is the abbreviation of display
di 23+25+29+30
di sqrt(400) + (12^3)
Let’s load data into Stata
•File – Import
•Check “Import first row as variable names”
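The same import can be typed as a command. This is a sketch under assumptions: auto_demo.xlsx is a hypothetical file name, created here from Stata's example dataset only so that the block runs on its own.

```stata
* Create a small Excel file to import (hypothetical file name)
sysuse auto, clear
export excel using "auto_demo.xlsx", firstrow(variables) replace
* firstrow: treat the first row as variable names
* clear: drop the data currently in memory
import excel using "auto_demo.xlsx", firstrow clear
* Check what was loaded
describe
```

For your own data, replace the file name with the path to your spreadsheet. Typing browse or edit afterwards opens the Data Editor.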
Data Editor (Edit/Browse)
Description of Data
•Codebook
•Display information about data
codebook
•Missing values in Stata
misstable summarize
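A short sketch of both commands on Stata's example dataset auto (an assumption for illustration; the course dataset is not reproduced here):

```stata
sysuse auto, clear
* One line per variable for the whole dataset
codebook, compact
* Full codebook entry for a single variable
codebook rep78
* Count missing values; only variables with missing values are listed
misstable summarize
```

In this example dataset, rep78 is the variable to watch because it contains missing values.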
List data
• List
  • List variables
list <variable> <variable> …
• List if
  • List variables for observations that meet a condition
list <variable> <variable> … if <condition>
• Bysort
  • Run the command after the colon separately for each value of the first variable
bysort <variable>: list <variable> <variable> …
• list hn age sex
• list hn age sex if revision == 1
• bysort sex: list hn age surgerydate
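The three forms above can be tried on Stata's example dataset auto. This is an illustrative sketch; the variables make, price, mpg and foreign belong to that dataset, not to the course data.

```stata
sysuse auto, clear
* List three variables for the first 5 observations
list make price mpg in 1/5
* List only observations that meet a condition (note the double ==)
list make price mpg if foreign == 1 & price > 10000
* Repeat the listing for each value of foreign
bysort foreign: list make mpg if mpg > 30
```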
Summarize Categorical Data
•Tabulate
• Display table of variable with frequency and percentage
tab <variable>
• tab sex
• bysort revision: tab sex
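A few variations on tabulate, sketched on Stata's example dataset auto (an assumption; foreign and rep78 are variables of that dataset):

```stata
sysuse auto, clear
* Frequency and percentage of one categorical variable
tabulate foreign
* The missing option adds a row for missing values
tabulate rep78, missing
* One table per level of foreign
bysort foreign: tabulate rep78
* Two-way table with row percentages
tabulate foreign rep78, row
```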
Summarizing Numerical Data
•Measures of Central Tendency
  •Mean
  •Median
  •Mode
•Measures of Dispersion
  •Standard Deviation
  •Range
Summarized Data
• sum age, detail
• sum age if revision == 1, detail
• bysort sex: sum age, detail
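After summarize with the detail option, the individual statistics are kept as stored results and can be displayed one by one. A minimal sketch on Stata's example dataset auto (an assumption for illustration):

```stata
sysuse auto, clear
summarize mpg, detail
* summarize leaves its results in r()
display "Mean   = " r(mean)
display "Median = " r(p50)
display "SD     = " r(sd)
display "Range  = " r(min) " to " r(max)
```

Typing return list after summarize shows every stored result that is available.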
Distribution of Data
•Normal distribution
  •Central tendency: use mean
  •Dispersion: use standard deviation
•Non-normal distribution
  •Central tendency: use median
  •Dispersion: use range
Skewness
Checking for normal distribution
•Histogram
•Normal probability plot
•Compare mean and median
•Compare mean and standard deviation
1. Histogram
• histogram <variable>
• histogram age
[Figure: histogram of CD4 count (x-axis, 0 to 400) against frequency (y-axis)]
2. Normal probability plot
[Figure: normal probability plot of CD4 count, Normal F[(cd4c-m)/s] against Empirical P[i] = i/(N+1)]
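Both plots can be drawn with one command each. The sketch below uses the variable price from Stata's example dataset auto as a stand-in for a skewed measurement such as CD4 count (an assumption; the CD4 data are not reproduced here).

```stata
sysuse auto, clear
* frequency puts counts on the y-axis; normal overlays a normal curve
histogram price, frequency normal
* Standardized normal probability plot: points close to the diagonal suggest normality
pnorm price
```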
3. Compare mean and median
• Mean 62.4
• Median 30.5
• Distribution of CD4 data is skewed to the right (mean > median)
[Figure: histogram of CD4 count against frequency]
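4. Compare mean and standard deviation
Variable      Mean   SD    Mean±2SD
Height (cm)   164.7  7.8   149.1, 180.3
CD4 count     62.4   74.4  -86.4, 211.2
• A CD4 count cannot be negative, so a lower limit of -86.4 suggests that the CD4 data are not normally distributed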
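The Mean±2SD limits in the table can be reproduced with display, using the means and SDs given above. The %6.1f format is only a choice made here to round to one decimal place.

```stata
* Height (cm): mean 164.7, SD 7.8
display %6.1f 164.7 - 2*7.8
display %6.1f 164.7 + 2*7.8
* CD4 count: mean 62.4, SD 74.4
display %6.1f 62.4 - 2*74.4
display %6.1f 62.4 + 2*74.4
```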
Summarizing Numerical Data
Variable      Mean (SD)
Age (year)    49.6 (14.3)
Weight (kg)   95.6 (21.7)
Height (cm)   171.5 (9.2)
BMI           32.5 (7.1)
CD4           62.4 (74.4)
Dummy table
Characteristics                                 N(%)
Gender
  Male
  Female
Age; years; Mean(SD)
Preoperative clinical score; Median(Range)
Preoperative function score; Median(Range)
Postoperative clinical score; Median(Range)
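The cells of a dummy table like this one can be filled from a few commands. A rough sketch on Stata's example dataset auto (an assumption; foreign, mpg and price stand in for gender, age and the clinical scores):

```stata
sysuse auto, clear
* N (%) for a categorical characteristic
tabulate foreign
* Mean (SD) for a roughly normal variable
tabstat mpg, statistics(n mean sd)
* Median (range) for a skewed variable
tabstat price, statistics(n median min max)
```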