Introduction to R
Francesco Comandatore
Post-doc Bioinformatician at
Università degli studi di Milano
Università degli studi di Pavia
Four lectures
17th November, 14:00-17:00, room 3 B1
24th November, 14:00-17:00, room 3 B1
1st December, 14:00-17:00, room 3 B1
15th December, 14:00-17:00, room 3 B1
Exam structure
Three sections
1) Load a file + extract rows and/or columns from the data frame + perform a
computation
Functions: read.csv, mean, sum, sd
2) Select a subset of rows from the data frame + generate a scatter plot
Functions: subset, plot
3) Perform a statistical analysis: compare two distributions/averages,
correlation analysis
Functions: boxplot, lm, plot, abline, shapiro.test, wilcox.test, t.test
Download the CSV file
File name
WHO_mortality_data_ITA.csv
Launch R and set the working
directory accordingly (look at
Lesson 2 slides for a brief
tutorial)
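As a reminder of how the working directory is set, here is a minimal sketch. It uses tempdir() only so that it runs anywhere; in the exam you would pass the folder where you saved the CSV file instead (that folder is whatever you chose, it is not given in these slides).

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# Assumption: tempdir() stands in for the folder holding the downloaded CSV.
setwd(tempdir())

# Check where R is now looking for files, and whether the CSV is visible there.
getwd()
file.exists("WHO_mortality_data_ITA.csv")

# Go back to the original directory.
setwd(old_wd)
```

If file.exists() returns FALSE, read.csv will fail with a "cannot open file" error, so it is worth checking before loading.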
Test 1
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the column "Deaths_at_all_ages"
1.3 Calculate the average of the extracted column
2.1 Select, from the entire data frame, the rows for which the cause
of death is Acute_poliomyelitis in the period between 1950 and 2000
(included)
2.2 Using the obtained data frame, generate two scatter plots, one for
males and the other for females: plot Deaths_at_all_ages vs. year
3.1 Consider male and female deaths (Deaths_at_all_ages) from
Acute_poliomyelitis between 1951 and 1978. Visualize the two
distributions (boxplot) and compute the averages. Are the two
distributions significantly different (remember to check for normality)?
Test 2
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the rows from 50 to 250
1.3 Calculate the sum of the column Deaths_at_age_4_years
2.1 Select, from the entire data frame, the rows for which the cause of
death is Injury_resulting_from_operations_of_war in the period between
1950 and 2000 (included)
2.2 Using the obtained data frame, generate two scatter plots, one for
males and the other for females: plot Deaths_at_all_ages vs. year
3.1 Consider male and female deaths (Deaths_at_all_ages) from
Injury_resulting_from_operations_of_war between 1951 and 1960.
Visualize the two distributions (boxplot) and compute the averages. Are
the two distributions significantly different (remember to check for
normality)?
Test 3
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the rows from 134 to 350
1.3 From the obtained data frame, calculate the sum of the column
Deaths_at_age_15.19_years
2.1 Select, from the entire data frame, the rows for which the cause of death is
Suicide_and_self-inflicted_injury
2.2 Using the obtained data frame, generate two scatter plots, one for males and the
other for females: plot Deaths_at_all_ages vs. year
3.1 Perform a correlation analysis between the yearly numbers of male and female
deaths by suicide: a) plot male deaths vs. female deaths (due to suicide); b) compute
and plot the linear regression line; c) evaluate the results of the linear regression. Is
the correlation statistically significant?
3.2 Repeat the analyses of 3.1 on the yearly numbers of male and female
deaths by suicide between 1970 and 1978 (included)
List of the skills (part 1)
- Open a csv file in R
read.csv("File name")
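A minimal sketch of loading a CSV file. So that it runs without the course file, it first writes a small toy table to a temporary file; the columns Year and Sex are hypothetical and may not match the real WHO file.

```r
# Toy stand-in for the course file; Year and Sex are hypothetical column names.
toy <- data.frame(
  Year = c(1950, 1951, 1952),
  Sex = c("Male", "Female", "Male"),
  Deaths_at_all_ages = c(120, 95, 130)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# read.csv returns a data frame; by default the first line is used as header.
df <- read.csv(csv_path)

str(df)   # column names and types
head(df)  # first rows of the table
```

In the exam the call would simply be df <- read.csv("WHO_mortality_data_ITA.csv"), once the working directory points to the folder containing the file.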
Extract rows/columns on the basis of names
- Extract column
df[,"column_name"] or df[,column_number] or df$column_name
- Extract rows
df["row_name",] or df[row_number,]
Note: Look at the slides of the lessons for more (important) details
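The extraction forms above can be tried on a small toy data frame. This is only a sketch: the row names r1 to r4 and the Year column are made up for illustration.

```r
# Toy data frame with explicit (hypothetical) row names.
df <- data.frame(
  Year = c(1950, 1951, 1952, 1953),
  Deaths_at_all_ages = c(120, 95, 130, 88),
  row.names = c("r1", "r2", "r3", "r4")
)

# Three equivalent ways to extract one column as a vector.
by_name   <- df[, "Deaths_at_all_ages"]
by_number <- df[, 2]
by_dollar <- df$Deaths_at_all_ages

# Rows by position (a range) and by row name; both return data frames.
rows_2_to_3 <- df[2:3, ]
row_r4      <- df["r4", ]
```

Note the comma: what comes before it selects rows, what comes after it selects columns. A range such as 50:250 selects rows 50 to 250 in the same way.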
Extract rows on the basis of a pattern
- Extract rows
subset(df, df$column == pattern)
subset(df, df$column == pattern & df$column2 == pattern2)
subset(df, df$column == pattern | df$column2 == pattern2)
Note: Look at the slides of the lessons for more (important) details
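A sketch of subset with combined conditions, on a toy data frame. The column names Cause, Year and Sex and the values "A" and "B" are assumptions for illustration; check the real names with names(df) after loading the file.

```r
# Toy data; Cause, Year and Sex are hypothetical column names.
df <- data.frame(
  Cause = c("A", "A", "B", "A", "B", "A"),
  Year  = c(1949, 1950, 1975, 2000, 2001, 1980),
  Sex   = c("Male", "Female", "Male", "Male", "Female", "Female"),
  Deaths_at_all_ages = c(5, 7, 3, 9, 2, 6)
)

# AND (&): cause "A" within a period, both limits included.
sel <- subset(df, df$Cause == "A" & df$Year >= 1950 & df$Year <= 2000)

# A further subset of the result, e.g. males only.
males <- subset(sel, sel$Sex == "Male")

# OR (|): rows matching at least one of the two conditions.
either <- subset(df, df$Cause == "B" | df$Year < 1950)
```

Using >= and <= keeps the boundary years, which is what "(included)" asks for in the tests.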
Computation
- Compute the average
mean(vector)
- Compute the sum
sum(vector)
- Compute the standard deviation
sd(vector)
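A short sketch of the three computations on a made-up vector. The last lines show what happens with missing values (NA), which may or may not occur in the course file.

```r
deaths <- c(10, 20, 30, 40)  # made-up values

avg   <- mean(deaths)  # 25
total <- sum(deaths)   # 100
sdev  <- sd(deaths)    # sample standard deviation (divides by n - 1)

# With missing values the result is NA unless na.rm = TRUE is given.
with_na <- c(10, NA, 30)
mean(with_na)                 # NA
mean(with_na, na.rm = TRUE)   # 20
```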
List of the skills (part 2)
Regression analysis
- Regression analysis
model <- lm(vector1 ~ vector2), then summary(model)
- Scatter plot
plot(vector1 ~ vector2)
- Plot regression line
abline(model)
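A sketch of the regression steps on made-up numbers (the vectors males and females are invented, not taken from the WHO file). In a formula y ~ x, y is the response and goes on the vertical axis.

```r
# Made-up yearly counts, for illustration only.
males   <- c(10, 12, 15, 19, 22, 26)
females <- c(4, 5, 7, 8, 10, 11)

# Fit the linear model and store its summary.
model <- lm(females ~ males)
fit   <- summary(model)

fit$coefficients  # estimates, standard errors, t values and p-values
fit$r.squared     # proportion of variance explained

# p-value of the slope: this is the one to look at for significance.
slope_p <- fit$coefficients["males", "Pr(>|t|)"]

# Scatter plot with the fitted regression line on top.
plot(females ~ males)
abline(model, col = "red")
```

Printing summary(model) shows the same numbers in a single table; a small slope p-value (conventionally below 0.05) indicates a statistically significant linear relationship.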
Compare averages
- Plot boxplot
boxplot(vector_values ~ vector_categories)
- Select the statistical test
shapiro.test(vector_values)
- Parametric test
t.test(vector_values ~ vector_categories)
- Non-parametric test
wilcox.test(vector_values ~ vector_categories)
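One possible way to chain these steps, on made-up values. Two choices here are assumptions of this sketch rather than rules from the slides: normality is checked separately in each group, and 0.05 is used as the threshold.

```r
# Made-up values for two groups of six observations each.
values <- c(12.1, 15.3, 14.2, 16.0, 13.4, 15.8,
            22.5, 25.1, 24.3, 23.0, 26.2, 24.9)
groups <- factor(rep(c("Female", "Male"), each = 6))

# Visualize the two distributions and compute the two averages.
boxplot(values ~ groups)
group_means <- tapply(values, groups, mean)

# Shapiro-Wilk normality test in each group (assumed 0.05 threshold).
p_norm <- tapply(values, groups, function(v) shapiro.test(v)$p.value)

# Parametric test if both groups look normal, non-parametric otherwise.
if (all(p_norm > 0.05)) {
  result <- t.test(values ~ groups)
} else {
  result <- wilcox.test(values ~ groups)
}
result$p.value
```

A Shapiro-Wilk p-value above the threshold means normality is not rejected, so the t-test is reasonable; otherwise the Wilcoxon test is the safer choice.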
Save plot in a file
- Save plot in a jpeg file
jpeg("File_name.jpg") → plotting command → dev.off()
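A sketch of the three steps for saving a plot, assuming the R installation has JPEG support (checked with capabilities("jpeg")). The data are made up and the file is written to a temporary folder only so that the example runs anywhere.

```r
# Made-up data for the plot.
year   <- 1950:1959
deaths <- c(30, 28, 25, 27, 22, 20, 18, 15, 12, 10)

# Assumption: a temporary folder; in practice just use "File_name.jpg".
out_file <- file.path(tempdir(), "File_name.jpg")

if (capabilities("jpeg")) {
  jpeg(out_file)        # 1) open the file device
  plot(deaths ~ year)   # 2) plotting command(s), drawn into the file
  dev.off()             # 3) close the device so the file is written
}

file.exists(out_file)
```

Nothing appears on screen between jpeg() and dev.off(); the plot goes to the file. Forgetting dev.off() leaves the file incomplete.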