Introduction

• R is an extremely popular tool for statistical analysis and data mining
• It is free and open source, and can be installed with ease on any major platform
• It is the result of the collective work of several researchers and experts in data mining
• Prestigious organizations like SAP, Oracle, Tableau etc. allow R to be integrated with their powerful applications
• Visit https://cran.r-project.org/ for software downloads, packages, documents and the latest on R
• Google is awash with valuable information, blogs, tips etc. on R; you will find answers to almost everything

Getting Started
R Data Types (Simple)

• Numeric
• Integer
• Complex
• Logical
• Character

Try the following in the R Console at the command prompt (>)
"<-" is R's assignment operator; "=" can be used interchangeably in most contexts
> X <- 10
> class(X)
Repeat the same with the following and check class(X) each time
> X <- 10.5
> X <- 10 + 5i
> X <- "a"
Even a whole number is by default considered "numeric"; it can be changed to "integer" as follows
> X <- 10
> X <- as.integer(X)
In the same way, as.character(X) converts a variable to character
is.integer(X) or is.character(X) can be used to check whether a variable is an integer or a character

R Data Types (Complex)

Vector
A vector is a sequence of data elements of the same basic type.
Vectors can be made up of integers, numerics, characters and so on, but all elements of a vector need to be of the same data type. If they are not, R coerces the elements to a common data type.
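The simple types and the coercion rule above can be tried in one go. This is only a small sketch; the variable names x, x_int and mixed are made up for illustration.

```r
# class() reports the type of each simple value
x <- 10
class(x)              # "numeric", even though 10 is a whole number
class(10 + 5i)        # "complex"
class("a")            # "character"
class(TRUE)           # "logical"

# Explicit conversion and type checks
x_int <- as.integer(x)
class(x_int)          # "integer"
is.integer(x_int)     # TRUE
is.character(x_int)   # FALSE

# A vector holds one type only, so c() coerces mixed input.
# Here everything becomes character because one element is a string.
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"
mixed                 # "1" "a" "TRUE"
```

Character wins over numeric, and numeric wins over logical, which is why c(1, TRUE) would give a numeric vector instead.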
Try the following in the R Console at the command prompt
> v1 <- c(1,2,3)        creates two vectors, v1 and v2
> v2 <- c(4,5,6)
> add <- v1 + v2        creates two vectors, add and sub, by
> sub <- v2 - v1        arithmetic operations on them
R is case sensitive, so v1 and V1 are different names.
Vectors for these operations should be of the same length. Otherwise R recycles the shorter vector, and throws a warning when the longer length is not a multiple of the shorter one.
• length(v) gives the length of the vector
• v[i] retrieves the ith member of the vector
• v[-i] retrieves all but the ith member of the vector
• v[i:j] retrieves the ith to jth members of the vector
• class(v1) gives the type of the vector's elements (numeric), not "vector"

Matrix
A matrix is similar to a vector but arranged in 2D format with rows and columns.
Run the following sequence of commands to build a matrix
> mat <- c(1,2,3,4,5,6)
> mat <- matrix(mat, nrow = 2, ncol = 3)
> dim(mat)              returns nrow and ncol for a matrix
> mat[n, ]              returns the nth row
> mat[ ,n]              returns the nth column
> mat[ , n:m]           returns the nth to mth columns
> mat[ , c(n,m)]        returns the nth and mth columns
> t(mat)                returns the transpose of the matrix
> solve(mat)            is the inverse of mat; mat has to be a square matrix (the 2 x 3 mat above is not)
> A %*% B               is the matrix multiplication of A and B
> rbind(A, B)           merges two matrices by rows
> cbind(A, B)           merges two matrices by columns
• Create two matrices and try these commands in the R console
• rbind needs matrices with the same number of columns; cbind needs matrices with the same number of rows

List
A list is similar to a vector but can be heterogeneous, i.e. it can contain different data types.
Below, lst is the name of a list.
> lst[[i]]              refers to the ith element of the list. It could be a vector, a matrix or a single numeric
Vector or list elements can have names too.
> names(v1) <- c("first", "second", "third")
> names(lst) <- c("first", "second", "third")
• These assign names to the elements of the vector v1 or of the list
• A particular element can then be retrieved by its name too.
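The vector, matrix and list operations described so far can be sketched together as follows. The square matrix A and the list lst are made-up examples, not part of the original slides.

```r
# Vectors: element-wise arithmetic and indexing
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
add <- v1 + v2                   # 5 7 9
v1[-1]                           # all but the first element: 2 3
# Recycling: the shorter vector is repeated. No warning here because
# 4 is a multiple of 2; c(1, 2, 3) + c(10, 20) would give a warning.
recycled <- c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

# Matrices: matrix() fills column by column by default
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
dim(mat)                         # 2 3
mat[2, ]                         # second row: 2 4 6
t(mat)                           # 3 x 2 transpose

A <- matrix(c(2, 0, 0, 2), nrow = 2)    # square, so solve() works
A_inv <- solve(A)
A %*% A_inv                      # identity matrix
A %*% mat                        # (2 x 2) times (2 x 3) gives 2 x 3
stacked <- rbind(mat, mat)       # 4 x 3, needs equal column counts
side_by_side <- cbind(mat, mat)  # 2 x 6, needs equal row counts

# Lists: elements of different types, optionally named
lst <- list(v1, mat, 42)
names(lst) <- c("first", "second", "third")
lst[["third"]]                   # 42
lst$first                        # the vector v1

names(v1) <- c("first", "second", "third")
v1["second"]                     # named element with value 2
```

Note that `%*%` is matrix multiplication; a plain `*` between two matrices of the same shape multiplies element by element.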
> v1["name"]            fetches the named element of a vector
> lst[["name"]]
> lst$name
• Both fetch the named element of a list (notice the double brackets for a list)
• Using $ on a vector returns an error: $ operator is invalid for atomic vectors

Data Frames
The data frame is perhaps the most important data type for data mining purposes.
It is used to store data in a tabular or spreadsheet fashion.
Its named columns hold fields, and its rows represent records or data points.
Below, df is the name of the data frame.
> colnames(df)          is used to name the columns
> rownames(df) <- NULL  removes the names of rows, desirable sometimes
A column of a df can be retrieved by these commands
> df[["name"]]
> df$name
The ith row can be retrieved
> df[i, ]
The jth column can be retrieved
> df[ ,j]
The cell in the ith row and jth column can be retrieved
> df[i,j]

R comes loaded with several sample data sets.
One such data set is "mtcars", which has 32 car models with 11 measurements (miles per gallon, number of cylinders etc.)
Let us work with this data set
> data <- data.frame(mtcars)
Try these on the mtcars data set
> data$mpg              gives the "mpg" column from data
> data[["mpg"]]         gives the "mpg" column from data
> data[2,]              gives the 2nd row from data
> data[,4]              gives the 4th column from data
> data[4,5]             gives the value in the 4th row and 5th column
View the data by clicking on the object in RStudio.

Create a new data frame from data
> data1 <- data.frame(data$mpg, data$hp)
You may have to rename the columns as below
> colnames(data1) <- c("mpg", "hp")
Try some descriptive statistics on this
> mean(data1$mpg)       mean of all mpg values
> sd(data1$mpg)         standard deviation of all mpg values
Try some graphical statistics on this
> hist(data1$mpg)       generates a histogram of the mpg values
> plot(data1$hp, data1$mpg)    generates a scatter plot of hp vs. mpg

Clean up RStudio before exiting, for others to use
• To clear objects from the Workspace
• To clear the console: Ctrl + L (lower-case L)
• To clear plots from the Plots window

R Packages
• R is the collective work of several researchers and experts in data mining
• They contribute libraries, or sets of functions, called packages
• R relies on these packages for specific tasks, and there is a very long list of them
• They provide multiple ways to achieve the same result in R
• This provides "enormous power" but can be "confusing" at times
• One should know the right package, install it and load it as follows
> install.packages("package_name")
> library("package_name")
All the packages required for this tutorial have already been installed; you cannot install packages yourself here.

Subsetting Data Frames
Let us work with one of these packages, called sqldf, to perform a very important task on a data frame.
We will subset a data frame, i.e., pick part of the data by certain conditions.
> library("sqldf")      already installed, no need to call the install command
Try out the following and view the results to understand them
> dt <- sqldf("select * from data where mpg > 20.0")
> dt <- sqldf("select * from data where mpg > 20.0 and disp > 200")
> temp <- data[c("mpg", "cyl")]
> temp <- data[c(1, 2)]         the same result is returned
> temp <- data[which(data$mpg == 21.0 & data$hp == 110), ]
> temp <- subset(data, mpg > 20.0 & disp > 200)
In the last two commands the column names are not quoted: "mpg" == 21.0 would compare the text "mpg" with a number instead of using the column.

Data Mining with R
Let us build a multiple regression model on the cars data.
The functions for regression modelling are part of base R, so one need not load any package.
Let us attempt to do it step by step
> result <- lm(mpg ~ ., data)
This is the standard format for building predictive models in R
> Y ~ x1 + x2 + x3 + …., data   Y is the response, the x's are predictors and data is the data frame to be used
> Y ~ ., data                   the dot (.)
means including all the variables except Y in model building
Let us review the result
> summary(result)
result is actually a "list" which has several components. summary displays the main ones, and we can look at the components separately too
> result$coefficients
> result$residuals

Multiple Linear Regression
Let us do some data discovery before attempting regression modelling
> plot(data$mpg, data$cyl)
> plot(data$mpg, data$disp)
> plot(data$mpg, data$hp)
> plot(data$mpg, data$drat)
> plot(data$mpg, data$wt)
> plot(data$mpg, data$qsec)
Based on correlation, we use only 4 of them as below
> result <- lm(mpg ~ disp + hp + drat + wt, data)
Use result$coefficients or result$residuals to view them in detail.
They can be manipulated separately
> Residual <- data.frame(result$residuals)
and you have the residual for each car.

    Call:
    lm(formula = mpg ~ disp + hp + drat + wt, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.5077 -1.9052 -0.5057  0.9821  5.6883

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
    disp         0.003815   0.010805   0.353  0.72675
    hp          -0.034784   0.011597  -2.999  0.00576 **
    drat         1.768049   1.319779   1.340  0.19153
    wt          -3.479668   1.078371  -3.227  0.00327 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2.602 on 27 degrees of freedom
    Multiple R-squared:  0.8376,  Adjusted R-squared:  0.8136
    F-statistic: 34.82 on 4 and 27 DF,  p-value: 2.704e-10

Multiple Linear Regression the other way
Now let us try
> result <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data)
Any difference in results?
• The coefficients are the same as in the lm output above; the glm summary reports deviance and AIC in place of the residual standard error, R-squared and F-statistic
• We have used glm (Generalized Linear Model), of which lm (Linear Model, i.e. regression) is a special case; note family = "gaussian"
• In the same way, we can use family = "binomial" to fit a binary logistic regression
• This emphasizes the point that there are various ways to achieve the same objective in R; we have to weigh the options
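The data frame and regression steps can be sketched end to end as below. This is a minimal sketch under some assumptions: it uses only base R, so subset() stands in for sqldf, and the names high_mpg, exact, fit_lm, fit_glm and residual_df are illustrative rather than taken from the slides.

```r
# mtcars ships with R, so no download or package is needed
data <- data.frame(mtcars)

# Column, row and cell access
mpg_col    <- data$mpg
second_row <- data[2, ]
one_cell   <- data[4, 5]         # 4th row, 5th column (drat)

# Subsetting: column names are used unquoted inside subset(),
# and via data$... inside which()
high_mpg <- subset(data, mpg > 20.0 & disp > 200)
exact    <- data[which(data$mpg == 21.0 & data$hp == 110), ]

# The same linear model fitted two ways
fit_lm  <- lm(mpg ~ disp + hp + drat + wt, data = data)
fit_glm <- glm(mpg ~ disp + hp + drat + wt,
               family = "gaussian", data = data)

# The coefficient estimates agree up to numerical tolerance
all.equal(coef(fit_lm), coef(fit_glm))

# Components can be pulled out and handled separately
residual_df <- data.frame(residual = residuals(fit_lm))
head(residual_df)                # one residual per car, labelled by model

# AIC is what the glm summary reports; it can also be asked for directly
AIC(fit_glm)
```

coef() and residuals() are the accessor functions for the same components that result$coefficients and result$residuals expose.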