Introduction to
Francesco Comandatore
[email protected]
Post-doc Bioinformatician at
Università degli studi di Milano
Università degli studi di Pavia

Four lectures
17th November, 14:00-17:00, room 3 B1
24th November, 14:00-17:00, room 3 B1
1st December, 14:00-17:00, room 3 B1
15th December, 14:00-17:00, room 3 B1

Exam structure
Three sections
1) Load a file + extract rows and/or columns from the data frame + perform a computation
   Functions: read.csv, mean, sum, sd
2) Select a subset of rows from the data frame + generate a scatter plot
   Functions: subset, plot
3) Perform a statistical analysis: compare two distributions/averages, correlation analysis
   Functions: boxplot, lm, plot, abline, shapiro.test, wilcox.test, t.test

Download the CSV file
File name: WHO_mortality_data_ITA.csv
Launch R and set the working directory accordingly (see the Lesson 2 slides for a brief tutorial)

Test 1
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the column "Deaths_at_all_ages"
1.3 Calculate the average of the extracted column
2.1 Select, from the entire data frame, the rows for which the cause of death is Acute_poliomyelitis in the period between 1950 and 2000 (included)
2.2 Using the obtained data frame, generate two scatter plots, one for males and the other for females: plot Deaths_at_all_ages vs. year
3.1 Consider the male and female deaths (Deaths_at_all_ages) caused by Acute_poliomyelitis between 1951 and 1978. Visualize the two distributions (boxplot) and compute the averages. Are the two distributions significantly different (remember to check for normality)?
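The first two sections of a test can be sketched as follows. This is a minimal sketch, not the course solution: it runs on a small invented data frame written to a temporary CSV file instead of WHO_mortality_data_ITA.csv, and the column names Cause, Year and Sex are assumptions (the slides only name Deaths_at_all_ages). Check the real names with names(df) before adapting it.

```r
# Invented stand-in for WHO_mortality_data_ITA.csv (values are NOT real data).
# Cause, Year and Sex are assumed column names; check yours with names(df).
toy <- data.frame(
  Cause = rep(c("Acute_poliomyelitis", "Other_cause"), times = c(6, 2)),
  Year = c(1949, 1950, 2000, 1950, 2000, 2001, 1975, 1975),
  Sex = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female"),
  Deaths_at_all_ages = c(10, 20, 30, 40, 50, 60, 70, 80)
)
csv_path <- file.path(tempdir(), "toy_mortality.csv")
write.csv(toy, csv_path, row.names = FALSE)

# 1.1 Load the csv file
df <- read.csv(csv_path)

# 1.2 Extract a column by name, or a range of rows by number
deaths <- df[, "Deaths_at_all_ages"]
rows_2_to_4 <- df[2:4, ]

# 1.3 Compute on the extracted data
avg_deaths <- mean(deaths)
sum_rows <- sum(rows_2_to_4$Deaths_at_all_ages)

# 2.1 Rows for one cause of death between 1950 and 2000 (included)
polio <- subset(df, df$Cause == "Acute_poliomyelitis" &
                  df$Year >= 1950 & df$Year <= 2000)

# 2.2 One scatter plot per sex: deaths vs. year
males <- subset(polio, polio$Sex == "Male")
females <- subset(polio, polio$Sex == "Female")
plot(males$Year, males$Deaths_at_all_ages,
     xlab = "Year", ylab = "Deaths at all ages", main = "Males")
plot(females$Year, females$Deaths_at_all_ages,
     xlab = "Year", ylab = "Deaths at all ages", main = "Females")
```

On this toy data, avg_deaths is 45 and polio keeps 4 of the 8 rows: the 1949 and 2001 rows fall outside the period, and the Other_cause rows do not match the cause.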
Test 2
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the rows from 50 to 250
1.3 Calculate the sum of the column Deaths_at_age_4_years
2.1 Select, from the entire data frame, the rows for which the cause of death is Injury_resulting_from_operations_of_war in the period between 1950 and 2000 (included)
2.2 Using the obtained data frame, generate two scatter plots, one for males and the other for females: plot Deaths_at_all_ages vs. year
3.1 Consider the male and female deaths (Deaths_at_all_ages) caused by Injury_resulting_from_operations_of_war between 1951 and 1960. Visualize the two distributions (boxplot) and compute the averages. Are the two distributions significantly different (remember to check for normality)?

Test 3
1.1 Load the csv file WHO_mortality_data_ITA.csv
1.2 Extract the rows from 134 to 350
1.3 From the obtained data frame, calculate the sum of the column Deaths_at_age_15.19_years
2.1 Select, from the entire data frame, the rows for which the cause of death is Suicide_and_self-inflicted_injury
2.2 Using the obtained data frame, generate two scatter plots, one for males and the other for females: plot Deaths_at_all_ages vs. year
3.1 Perform a correlation analysis between the yearly numbers of male and female deaths by suicide: a) plot male deaths vs. female deaths (due to suicide); b) compute and plot the linear regression line; c) evaluate the results of the linear regression. Is the correlation statistically significant?
3.2 Repeat the analyses of 3.1 on the yearly numbers of male and female deaths by suicide between 1970 and 1978 (included)

List of the skills (part 1)

Open a csv file in R
- Open a csv file in R: read.csv("File name")

Extract rows/columns on the basis of names
- Extract column: df[, "column_name"] or df[, column_number] or df$column_name
- Extract rows: df["row_name", ] or df[row_number, ]
Note: Look at the slides of the lessons for more (important) details

Extract rows on the basis of a pattern
- Extract rows:
  subset(df, df$column == pattern)
  subset(df, df$column == pattern & df$column2 == pattern2)
  subset(df, df$column == pattern | df$column2 == pattern2)
Note: Look at the slides of the lessons for more (important) details

Computation
- Compute the average: mean(vector)
- Compute the sum: sum(vector)
- Compute the standard deviation: sd(vector)

List of the skills (part 2)

Regression analysis
- Regression analysis: lm(vector1 ~ vector2) → summary
- Scatter plot: plot(vector1 ~ vector2)
- Plot regression line: abline()

Compare averages
- Plot boxplot: boxplot(vector_values ~ vector_categories)
- Select the statistical test: shapiro.test(vector_values)
- Parametric test: t.test(vector_values ~ categories)
- Non-parametric test: wilcox.test(vector_values ~ categories)

Save plot in a file
- Save plot in a jpeg file: jpeg("File_name.jpg") → plotting command → dev.off()
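The skills in part 2 can be combined as in the sketch below. It uses invented yearly counts rather than the WHO file, and applying shapiro.test to each group separately with a 0.05 threshold is a common convention assumed here, not something the slides specify. The capabilities("jpeg") guard is also an addition, so the sketch still runs on an R build without jpeg support.

```r
# Invented yearly death counts for eight years (NOT real WHO data).
female <- c(10, 12, 11, 14, 13, 16, 15, 17)
male <- c(21, 23, 22, 26, 24, 27, 25, 28)

# Long format: one vector of values, one vector of categories.
deaths <- c(female, male)
sex <- rep(c("Female", "Male"), each = length(female))

# Visualize the two distributions and compute the averages.
boxplot(deaths ~ sex)
mean_female <- mean(female)
mean_male <- mean(male)

# Check normality in each group (assumed threshold: 0.05), then choose the test.
normal <- shapiro.test(female)$p.value > 0.05 &&
  shapiro.test(male)$p.value > 0.05
if (normal) {
  comparison <- t.test(deaths ~ sex)       # parametric
} else {
  comparison <- wilcox.test(deaths ~ sex)  # non-parametric
}
print(comparison$p.value)

# Regression analysis: scatter plot, fitted line, and summary.
fit <- lm(male ~ female)
plot(male ~ female)
abline(fit)
fit_summary <- summary(fit)
print(fit_summary)
slope_p <- fit_summary$coefficients["female", "Pr(>|t|)"]

# Save the same plot in a jpeg file (skipped if this R build lacks jpeg support).
if (capabilities("jpeg")) {
  jpeg_path <- file.path(tempdir(), "regression.jpg")
  jpeg(jpeg_path)
  plot(male ~ female)
  abline(fit)
  dev.off()
}
```

A p-value below 0.05, either from the comparison test or for the slope in summary(fit), is the usual reading of "statistically significant". The threshold itself is a convention, so state it when reporting the result.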