Data Mining

Introduction
• R is an extremely popular tool for statistical analysis and data mining.
• It is free and open source, and can be installed with ease on all major platforms.
• It is the result of the collective work of many researchers and experts in data mining.
• Organizations such as SAP, Oracle and Tableau allow R to be integrated with their applications.
• Visit https://cran.r-project.org/ for software downloads, packages, documentation and the latest news on R.
• Google turns up a wealth of information, blogs and tips on R; you will find answers to most questions.
Getting Started
R Data Types (Simple)
• Numeric
• Integer
• Complex
• Logical
• Character
Try the following in the R Console at the command prompt (>)
"<-" is R's assignment operator; "=" can be used interchangeably for top-level assignments
> X <- 10
> class(X)
Repeat the same with the following and check class(X) each time
> X <- 10.5
> X <- 10 + 5i
> X <- TRUE
> X <- "a"
Even a whole number such as 10 is considered "numeric" by default; it can be changed to "integer" as follows
> X <- 10
> X <- as.integer(X)
In the same way, as.character(X) converts a variable to character
is.integer(X) or is.character(X) can be used to check whether a variable is an integer or a character
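The checks and conversions above can be tried in one go. This is a minimal sketch; the variable names x, y, z and s are only illustrative.

```r
# Inspect the class of a few simple values
x <- 10
class(x)             # "numeric": whole numbers are stored as doubles by default

y <- 10L             # the L suffix creates an integer directly
class(y)             # "integer"

z <- as.integer(x)   # explicit conversion from numeric to integer
is.integer(z)        # TRUE

s <- as.character(x) # conversion to character
s                    # "10"
is.character(s)      # TRUE

class(10 + 5i)       # "complex"
class(TRUE)          # "logical"
```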
Getting Started
R Data Types (Complex)
Vector
• A vector is a sequence of data elements of the same basic type.
• Vectors can be made up of integers, numerics, characters and so on.
• All elements of a vector need to be of the same data type; if they are not, R coerces them to a common type.
Try the following in the R Console at the command prompt
> v1 <- c(1,2,3)     # creates two vectors, v1 and v2
> v2 <- c(4,5,6)
> add <- v1 + v2     # creates two vectors, add and sub, by arithmetic operations on them
> sub <- v2 - v1
• Vectors in these operations should be of the same length. If they are not, R recycles the shorter vector, and throws a warning only when the longer length is not a multiple of the shorter one.
• length(v) gives the length of the vector
• v[i] retrieves the ith member of the vector
• v[-i] retrieves all but the ith member of the vector
• v[i:j] retrieves the ith to jth members of the vector
• class(v1) gives the type of the vector's elements, here "numeric", and not "vector"
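A short sketch of the vector behaviour described above, including recycling and coercion. The vector named mixed is an illustrative addition.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

v1 + v2                      # 5 7 9  (element-wise addition)
v2 - v1                      # 3 3 3

# Recycling: the shorter vector is repeated to match the longer one
v1 + 10                      # 11 12 13
c(1, 2, 3, 4) + c(10, 20)    # 11 22 13 24, no warning because 4 is a multiple of 2

length(v1)                   # 3
v1[2]                        # 2
v1[-2]                       # 1 3
v1[2:3]                      # 2 3

# Mixing types coerces every element to a common type
mixed <- c(1, "a", TRUE)
class(mixed)                 # "character"
```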
Matrix
A matrix is similar to a vector, but arranged in a 2D format with rows and columns.
Run the following sequence of commands to build a matrix
> mat <- c(1,2,3,4,5,6)
> mat <- matrix(mat, nrow = 2, ncol = 3)
> dim(mat)           # returns nrow and ncol of the matrix
> mat[n, ]           # returns the nth row
> mat[ ,n]           # returns the nth column
> mat[ , n:m]        # returns the nth to mth columns
> mat[ , c(n,m)]     # returns the nth and mth columns
> t(mat)             # returns the transpose of the matrix
> solve(mat)         # returns the inverse; the matrix has to be square, so this fails for the 2 x 3 mat above
> A %*% B            # matrix multiplication of A and B
> rbind(A, B)        # merges two matrices by rows
> cbind(A, B)        # merges two matrices by columns
• Create two matrices and try them in the R console
• rbind needs matrices with the same number of columns; cbind needs matrices with the same number of rows
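One possible pair of matrices for trying these commands is sketched below. The values of A and B are arbitrary choices for illustration; any square, invertible A would do for solve().

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)  # filled column by column
B <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2)  # the 2 x 2 identity matrix

A %*% B             # matrix product; equals A because B is the identity
A_inv <- solve(A)   # the inverse exists because A is square and non-singular
A %*% A_inv         # the identity matrix, up to floating-point rounding

dim(rbind(A, B))    # 4 2: stacked by rows (needs equal column counts)
dim(cbind(A, B))    # 2 4: placed side by side (needs equal row counts)
t(A)                # transpose
```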
List
A list is similar to a vector, but it can be heterogeneous and contain elements of different data types.
Here list is the name of a list object you have created
> list[[i]]          # refers to the ith element of the list; it could be a vector, a matrix or a single numeric
Vector or list elements can have names too.
> names(v1) <- c("first", "second", "third")
> names(list) <- c("first", "second", "third")
• These assign names to the elements of the vector v1 or of the list
• A particular element can then be retrieved by its name too
> v1["name"]         # fetches the named element of a vector
> list[["name"]]     # fetches the named element of a list (notice the double brackets)
> list$name          # the same, using the $ shorthand
• Using $ on a vector returns an error: $ operator is invalid for atomic vectors
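A small sketch of list access by position and by name. The object name my_list and its contents are made up for illustration; using a name other than list avoids confusion with R's built-in list() function.

```r
# A heterogeneous list: a vector, a matrix and a single number
my_list <- list(c(1, 2, 3), matrix(1:4, nrow = 2), 42)
names(my_list) <- c("first", "second", "third")

my_list[[1]]                # the numeric vector 1 2 3
my_list[["third"]]          # 42
my_list$third               # 42

class(my_list["third"])     # "list": single brackets keep the list wrapper
class(my_list[["third"]])   # "numeric": double brackets extract the element

v <- c(first = 10, second = 20)
v["second"]                 # named element of a vector
```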
Data Frames
The data frame is perhaps the most important data type for data mining purposes. It stores data in a tabular, spreadsheet-like fashion. Its named columns contain the fields, and its rows represent records or data points.
Here df is the name of the data frame
> colnames(df)            # returns the column names; assign to it to name the columns
> rownames(df) <- NULL    # removes the row names, which is desirable sometimes
> df[["name"]]            # either of these two commands retrieves a column of df by name
> df$name
> df[i, ]                 # retrieves the ith row
> df[ ,j]                 # retrieves the jth column
> df[i,j]                 # retrieves the cell in the ith row and jth column
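To try these commands without a data set at hand, a tiny data frame can be built by hand. The column names and values below are invented for illustration.

```r
df <- data.frame(
  name  = c("a", "b", "c"),
  score = c(90, 85, 77)
)

colnames(df)       # "name" "score"
df$score           # 90 85 77
df[["score"]]      # the same column, selected by name
df[2, ]            # second row
df[, 2]            # second column
df[3, 2]           # 77

colnames(df) <- c("id", "marks")   # rename the columns
```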
Data Frames
R comes loaded with several sample Data sets
One such data set is "mtcars", which has 32 car models with 11 measurements each
(miles per gallon, number of cylinders etc.)
Let us work with this data set
> data <- data.frame(mtcars)
Try these on the mtcars data set
> data$mpg           # gives the "mpg" column from data
> data[["mpg"]]      # gives the "mpg" column from data
> data[2,]           # gives the 2nd row from data
> data[,4]           # gives the 4th column from data
> data[4,5]          # gives the value in the 4th row and 5th column
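Before indexing, it can help to check the shape and layout of the data set. A brief sketch using standard base R inspection functions:

```r
data <- data.frame(mtcars)

dim(data)          # 32 11: 32 car models, 11 measurements
head(data, 3)      # first three rows
str(data)          # column names and types at a glance

colnames(data)[4]  # "hp": the name of the 4th column
data[4, 5]         # 3.08: row "Hornet 4 Drive", column "drat"
```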
View Data by clicking on the Object
Load Data
Data Frames
Create a new data frame from data
> data1 <- data.frame(data$mpg, data$hp)
You may have to rename the columns as below
> colnames(data1) <- c("mpg", "hp")
Try some descriptive statistics on this
> mean(data1$mpg)               # mean of all mpg values
> sd(data1$mpg)                 # standard deviation of all mpg values
Try some graphical statistics on this
> hist(data1$mpg)               # generates a histogram of the mpg values
> plot(data1$hp, data1$mpg)     # generates a scatter plot of hp vs. mpg
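An alternative sketch: selecting the columns by name keeps the original column names, so no renaming step is needed. The correlation at the end is an added illustration of the relationship visible in the scatter plot.

```r
data1 <- mtcars[, c("mpg", "hp")]   # column names "mpg" and "hp" are preserved

summary(data1$mpg)          # min, quartiles, mean and max in one call
mean(data1$mpg)             # about 20.09
sd(data1$mpg)               # about 6.03
cor(data1$hp, data1$mpg)    # about -0.78: higher horsepower goes with lower mpg
```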
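Clean up RStudio before exiting, so that it is ready for others to use
• To clear objects from the Workspace: use the broom icon in the Environment pane
• To clear the console: Ctrl + L (lower case)
• To clear plots from the Plots window: use the broom icon in the Plots pane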
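The clean-up can also be done from the console. This is a sketch using base R functions; the object names tmp_a and tmp_b are placeholders.

```r
tmp_a <- 1
tmp_b <- "x"

rm(tmp_a, tmp_b)      # remove specific objects from the workspace
exists("tmp_a")       # FALSE

# rm(list = ls())     # would remove every object in the workspace at once

graphics.off()        # close all open graphics devices, which clears the plots
```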
R Packages
R is the collective work of many researchers and experts in data mining
They contribute libraries, or sets of functions, called packages
R relies on these packages for specific tasks, and the list of them is practically endless
They provide multiple ways to achieve the same result in R
This provides "enormous power" but can be "confusing" at times
One should know the right package, install it and load it as follows
> install.packages("package_name")
> library("package_name")
All the packages required for this tutorial have already been installed;
you cannot install packages yourself on these machines
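On a machine where you are allowed to install packages, a common pattern is to install only when the package is missing. The sketch below uses "stats", which ships with R, as a stand-in package name so that nothing is actually downloaded.

```r
pkg <- "stats"   # stand-in name; replace with the package you need

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)

pkg %in% loadedNamespaces()   # TRUE once the package is loaded
```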
Subsetting Data Frames
Let us work with one of these packages, called sqldf, to perform a very important task on a data frame
We will be subsetting a data frame, i.e., picking part of the data by certain conditions
> library("sqldf")     # already installed, no need to call the install command
Try out the following and view the results to understand them
> dt <- sqldf("select * from data where mpg > 20.0")
> dt <- sqldf("select * from data where mpg > 20.0 and disp > 200")
> temp <- data[c("mpg", "cyl")]
> temp <- data[c(1, 2)]     # the same result is returned
> temp <- data[which(data$mpg == 21.0 & data$hp == 110), ]
> temp <- subset(data, mpg > 20.0 & disp > 200)
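The same row filters can be written in base R without sqldf. In this sketch the names hits1, hits2 and hits3 are illustrative; note that column names are used unquoted inside subset() and with data$ inside the brackets.

```r
data <- data.frame(mtcars)

# Logical indexing: rows where both conditions hold, all columns
hits1 <- data[data$mpg > 20 & data$disp > 200, ]

# subset() evaluates the condition inside the data frame
hits2 <- subset(data, mpg > 20 & disp > 200)

rownames(hits1)   # "Hornet 4 Drive"

# Rows matched with which(), restricted to two columns
hits3 <- data[which(data$mpg == 21.0 & data$hp == 110), c("mpg", "hp")]
nrow(hits3)       # 2
```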
Data Mining with R
Let us build a multiple regression model on the cars data
The functions for regression modelling are part of base R, so one need not load any extra package
Let us attempt to do it step-by-step
> result <- lm(mpg ~ ., data)     # this is the standard format for building predictive models in R
  Y ~ x1 + x2 + x3 + …., data     Y is the response, the x's are the predictors and data is the data frame to be used
  Y ~ ., data                     the dot (.) means including all the variables except Y in model building
Let us review the result
> summary(result)     # result is actually a "list" with several components; summary gives an overview, and we can look at the components separately too
> result$coefficients
> result$residuals
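A short sketch of the formula interface in use. The smaller model with wt and hp and the hypothetical car in new_car are illustrative choices, not part of the original walkthrough.

```r
data <- data.frame(mtcars)

# The dot stands for every column except the response
fit_all <- lm(mpg ~ ., data = data)
length(coef(fit_all))        # 11: the intercept plus 10 predictors

# Naming the predictors explicitly
fit_two <- lm(mpg ~ wt + hp, data = data)
coef(fit_two)                # both slopes are negative

# Predict mpg for a hypothetical car (wt is in units of 1000 lbs)
new_car <- data.frame(wt = 3.0, hp = 120)
predict(fit_two, newdata = new_car)
```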
Multiple Linear Regression
Let us do some data discovery before attempting regression modelling
> plot(data$mpg, data$cyl)
> plot(data$mpg, data$disp)
> plot(data$mpg, data$hp)
> plot(data$mpg, data$drat)
> plot(data$mpg, data$wt)
> plot(data$mpg, data$qsec)
Based on correlation, we use only 4 of them as below
> result <- lm(mpg ~ disp + hp + drat + wt, data)
Use result$coefficients or result$residuals to view them in detail
They can be manipulated separately
> Residual <- data.frame(result$residuals)
And you have the residuals for each car
Output of summary(result):

Call:
lm(formula = mpg ~ disp + hp + drat + wt, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5077 -1.9052 -0.5057  0.9821  5.6883

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.148738   6.293588   4.631  8.2e-05 ***
disp         0.003815   0.010805   0.353  0.72675
hp          -0.034784   0.011597  -2.999  0.00576 **
drat         1.768049   1.319779   1.340  0.19153
wt          -3.479668   1.078371  -3.227  0.00327 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.602 on 27 degrees of freedom
Multiple R-squared: 0.8376, Adjusted R-squared: 0.8136
F-statistic: 34.82 on 4 and 27 DF, p-value: 2.704e-10
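The correlations mentioned above can be computed directly instead of being read off the plots. This is a sketch; ranking the variables by absolute correlation with mpg is an assumption about how one might screen them, not necessarily the selection rule used here.

```r
data <- data.frame(mtcars)

# Correlation of every column with mpg, ranked by absolute value
cors   <- cor(data)[, "mpg"]
cors   <- cors[names(cors) != "mpg"]
ranked <- sort(abs(cors), decreasing = TRUE)
round(ranked, 2)             # wt has the strongest correlation with mpg

# Refit the four-predictor model and collect the per-car residuals
result   <- lm(mpg ~ disp + hp + drat + wt, data = data)
Residual <- data.frame(residual = result$residuals)
head(Residual)               # one residual per car, labelled by car name
summary(result)$r.squared    # about 0.8376
```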
Multiple Linear Regression the other way
Now let us try
> result <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data = data)
Any difference in results?
• The coefficient table is the same as with lm; the glm summary reports the deviance and AIC in place of the R-squared and F-statistic
• We have used glm (Generalized Linear Model), of which lm (Linear Model - Regression) is a special case; note family = "gaussian"
• In the same way, we can use family = "binomial" to model a binary logistic regression
• This emphasizes the point that there are various ways to achieve the same objective in R; we have to weigh the options
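The comparison can be checked in code. In this sketch the logistic model of am (0 = automatic, 1 = manual) on wt is an illustrative choice of response and predictor, picked only to show the "binomial" family on the same data set.

```r
data <- data.frame(mtcars)

fit_lm  <- lm(mpg ~ disp + hp + drat + wt, data = data)
fit_glm <- glm(mpg ~ disp + hp + drat + wt, family = "gaussian", data = data)

# Same coefficient estimates from both functions
all.equal(coef(fit_lm), coef(fit_glm))   # TRUE

AIC(fit_glm)          # the AIC value reported in the glm summary

# Binary logistic regression with the binomial family
fit_logit <- glm(am ~ wt, family = "binomial", data = data)
coef(fit_logit)       # negative wt slope: heavier cars are less likely to be manual
```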