Detroit Blight Analysis

DManh

May 8, 2016

Contents

1 Introduction
2 Getting to know the data
  2.1 Extracting GPS location data
  2.2 Visualizing incident data
3 Constructing a balanced data set with label: blighted
4 Feature Engineering
  4.1 The first and second features: the number of violations
  4.2 Features from crime data
  4.3 Features from 311 calls
  4.4 Features from total fee
5 Model Selection
6 Further discussion
7 Links
Abstract: This preliminary paper studies the phenomenon of abandoned buildings in Detroit based on data
from blight violations, crimes, Detroit 311 calls and demolition permits. I construct spatial features such as
the number of blight violation incidents, the number of crimes, the number of 311 calls, and the total fees
charged by blight violations within roughly 1.5 km of each building. I find that the number of blight violation
incidents within about 15 m of a building is the most critical predictor. The number of blight violations and
the number of crimes within the 1.5 km neighborhood are also significant, increasing the probability of "being
blighted". Interestingly, property-related crimes increase the probability of being blighted, whereas other
crimes decrease it. The 311 calls and the total fees charged are not significant predictors of blight. I used
several models to evaluate the data set: logistic regression, linear and quadratic discriminant analysis, and a
tree model. I find a test error of about 28%. More refined models could be built by taking the temporal
element of the data sets into account; this is left for a more advanced analysis.
1 Introduction

Since 2005, a third of Detroit's properties have been foreclosed (140k of 384k properties). Mortgage defaults
and unpaid taxes are the two principal reasons. If the 2007-2008 subprime crisis is the direct cause of the
ongoing Detroit foreclosure crisis, the shift of Detroit's main economic activity elsewhere is the long-term
cause. This preliminary study explores data sets from the Detroit open data portal to predict the probability
of a building being "blighted": the blight violations data, the 311 calls, the crimes data and data on
demolition permits.
The data are rich in spatiotemporal detail, with each incident precisely recorded by GPS location and time.
Given the scope of this study, we restrict ourselves to the spatial data and do not yet explore the time
elements of the data sets. We do, however, find insights that are very interesting and promising for a more
advanced analysis. Much of the time is spent extracting useful information from the data sets and dealing
with spatial data (a non-trivial challenge). The models used in the analysis are logistic regression, linear
discriminant analysis, quadratic discriminant analysis and a tree model. We evaluate the first three by 5-fold
cross validation and the last by its misclassification rate. Quadratic discriminant analysis aside, they all yield
a consistent validation error of about 28%, showing that these models function roughly equally well. The
test error is estimated to be about 28% as well.
Finally, the analysis is conducted in R and the source code can be found at
https://github.com/dmanh/DetroitBlight-Analysis. This study adopts the outline proposed by the Coursera
Data Science at Scale course from the University of Washington. We start by exploring the data sets, then
clean them, construct a first training data set, train a simple model and finally engineer more features. The
study concludes by discussing directions for further analysis.
2 Getting to know the data

The goal of this section is to visualize the spatial data of blight violations, 311 calls, crimes and demolition
permits on a map of Detroit. As we will show, these incidents happen all over Detroit; a few incidents fall
outside Detroit (due to measurement errors), but they are very few.
The blight violations file provided in the Coursera course is corrupt (the extracted GPS locations and the
postal locations do not match), so we use the original data from the Detroit open data portal at
https://data.detroitmi.gov/Property-Parcels/Blight-Violations/teu6-anhh.
2.1 Extracting GPS location data
viols <- read.csv("./data/Blight_Violations.csv")
names(viols)
##  [1] "FileID"                "TicketID"
##  [3] "TicketNumber"          "AgencyName"
##  [5] "ViolName"              "ViolationStreetNumber"
##  [7] "ViolationStreetName"   "MailingStreetNumber"
##  [9] "MailingStreetName"     "MailingCity"
## [11] "MailingState"          "MailingZipCode"
## [13] "NonUsAddressCode"      "Country"
## [15] "TicketIssuedDT"        "TicketIssuedTime"
## [17] "HearingDT"             "CourtTime"
## [19] "ViolationCode"         "ViolDescription"
## [21] "Disposition"           "FineAmt"
## [23] "AdminFee"              "StateFee"
## [25] "LateFee"               "CleanUpCost"
## [27] "LienFilingFee"         "JudgmentAmt"
## [29] "PaymentStatus"         "Void"
## [31] "ViolationCategory"     "ViolationLocation"
## [33] "MailingLocation"
As said previously, we are only interested in spatial data, so:
viols <- viols[,c("ViolationStreetNumber","ViolationStreetName","ViolDescription",
"FineAmt","AdminFee","StateFee","LateFee","CleanUpCost",
"LienFilingFee","JudgmentAmt","ViolationLocation")]
str(viols)
## 'data.frame':    318762 obs. of  11 variables:
##  $ ViolationStreetNumber: int  8430 18610 5819 4658 16133 8219 17397 17153 17517 2566 ...
##  $ ViolationStreetName  : Factor w/ 1843 levels "10TH ST","11TH ST",..: 95 1504 1670 120 419 204 1389 ...
##  $ ViolDescription      : Factor w/ 294 levels "Allowing bulk solid waste to lie or accumulate on or ...
##  $ FineAmt              : Factor w/ 60 levels "$0.00","$1.00",..: 3 3 24 31 3 3 3 3 3 14 ...
##  $ AdminFee             : Factor w/ 2 levels "$0.00","$20.00": 2 2 2 1 1 1 2 1 2 2 ...
##  $ StateFee             : Factor w/ 2 levels "$0.00","$10.00": 2 2 2 1 1 1 2 1 2 2 ...
##  $ LateFee              : Factor w/ 58 levels "$0.00","$0.10",..: 3 3 23 1 1 1 3 1 3 13 ...
##  $ CleanUpCost          : Factor w/ 245 levels "$0.00","$1.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LienFilingFee        : Factor w/ 3 levels "","$0.00","$48.00": 1 1 1 2 2 2 1 2 1 1 ...
##  $ JudgmentAmt          : Factor w/ 390 levels "$0.00","$1004.80",..: 53 53 165 1 1 1 53 1 53 80 ...
##  $ ViolationLocation    : Factor w/ 109314 levels "","0 10TH ST\nDETROIT, MI\n",..: 100875 45444 8624 ...
We need to clean up the fee-related variables by converting them to numeric values, and from the
"ViolationLocation" column we extract the latitude and longitude values. We will see that quite a few
"ViolationLocation" entries have no GPS coordinates. We also create another variable holding the total of all
related fees.
########################### 1. Getting to know your data ####################
################## 1.1 blight violation data
## Parsing GPS coordinates
## define regular expression pattern
gpsParsing <- function(addr, p="\\(.*\\)")
{
r <- regexpr(p, addr)
out <- rep(NA, length(r))
out[r != -1] <- regmatches(addr, r)
## strip the \\( and \\)
out <- gsub("[()]", "", out)
lats <- unlist(lapply(out, function(x) as.numeric(strsplit(x, split=",")[[1]][1])))
lngs <- unlist(lapply(out, function(x) as.numeric(strsplit(x, split=",")[[1]][2])))
list(lat=lats, lng=lngs)
}
latlngs <- gpsParsing(viols$ViolationLocation)
viols$lat <- latlngs$lat
viols$lng <- latlngs$lng
## convert the fee-related factor columns ("$xx.xx") to numeric, NA -> 0
feecols <- c("FineAmt", "AdminFee", "StateFee", "LateFee",
             "CleanUpCost", "LienFilingFee", "JudgmentAmt")
for (col in feecols) {
  viols[[col]] <- as.numeric(gsub("\\$", "", as.character(viols[[col]])))
  viols[[col]][is.na(viols[[col]])] <- 0
}
## total fee charged per violation
viols$totalfee <- rowSums(viols[, feecols])
We use geosphere's distHaversine distance to compute the distance between two GPS coordinates. Two
points whose latitude and longitude each differ by 10^-4 degrees are about 16 m apart; a difference of 10^-3
gives about 160 m, and 10^-2 about 1.6 km (somewhat less at Detroit's latitude). So we create latR and
lngR, the latitude and longitude rounded to 4 decimal places.
library(geosphere)
## Loading required package: sp
distHaversine(c(0.0,0.0),c(0.0001,0.0001))
## [1] 15.74295
distHaversine(c(0.0,0.0),c(0.001,0.001))
## [1] 157.4295
distHaversine(c(0.0,0.0),c(0.01,0.01))
## [1] 1574.295
viols$latR <- round(viols$lat, digits=4)
viols$lngR <- round(viols$lng, digits=4)
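The calls above are evaluated at the equator. As a small added sanity check (not part of the original analysis), the same 0.01-degree offset can be evaluated at Detroit's approximate latitude, where a degree of longitude is shorter:
## distHaversine takes coordinates as c(lon, lat); at Detroit's latitude
## (about 42.39 N) a 0.01-degree offset in both coordinates comes to
## roughly 1.38 km, which is why we treat 0.01 degrees as about 1.3-1.5 km
distHaversine(c(-83.10, 42.39), c(-83.10 + 0.01, 42.39 + 0.01))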
For the 311 calls, crimes and permits, we do the same:
################### 1.2. Detroit 311 calls
dcalls <- read.csv("./data/detroit-311.csv")
dcalls <- dcalls[,c("issue_type","lat","lng")]
dcalls$latR <- round(dcalls$lat, digits=4)
dcalls$lngR <- round(dcalls$lng, digits=4)
################### 1.3. Detroit Crimes Data
crimes <- read.csv("./data/detroit-crime.csv")
crimes <- crimes[,c("CATEGORY","LON","LAT")]
names(crimes) <- c("category","lng","lat")
crimes$latR <- round(crimes$lat, digits=4)
crimes$lngR <- round(crimes$lng, digits=4)
#################### 1.4. Detroit Demolition permits
permits <- read.csv("./data/detroit-demolition-permits.tsv", sep ="\t")
permits <- permits[,c("BLD_PERMIT_TYPE","site_location")]
p.latlngs <- gpsParsing(permits$site_location)
permits$lat <- p.latlngs$lat
permits$lng <- p.latlngs$lng
permits$latR <- round(permits$lat, digits=4)
permits$lngR <- round(permits$lng, digits=4)
2.2 Visualizing incident data

We use the ggmap package to fetch a map of Detroit from Google and add points from the blight violations,
311 calls, crimes and permits data, so we can see how these points are distributed:
###################### 2. Visualization ####################################
## getting Detroit map from google
library(ggmap)
## Loading required package: ggplot2
detroitmap <- get_googlemap(center = c(lon = -83.119128, lat = 42.384713),
                            maptype = "roadmap", size = c(640, 640), zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=42.384713,-83.119128&zoom=11&size=640x640...
ggmap(detroitmap)
[Figure: base map of Detroit (lon vs. lat)]
## Blight violation data: we sample about 4000 points
ggmap(detroitmap) + geom_point(aes(x=lng,y=lat),
data=viols[sample(1:nrow(viols), 4000),],
color = I("red"))
## Warning: Removed 756 rows containing missing values (geom_point).
[Figure: a sample of blight violation incidents plotted on the Detroit map]
## crimes data: we sample 4000 points
ggmap(detroitmap) + geom_point(aes(x=lng,y=lat),
data=crimes[sample(1:nrow(crimes), 4000),],
color = I("red"))
## Warning: Removed 16 rows containing missing values (geom_point).
[Figure: a sample of crime incidents plotted on the Detroit map]
## 311 calls data
ggmap(detroitmap) + geom_point(aes(x=lng,y=lat),
data=dcalls[sample(1:nrow(dcalls), 4000),],
color = I("red"))
[Figure: a sample of 311 calls plotted on the Detroit map]
## blighted buildings: data on demolition permits
ggmap(detroitmap) + geom_point(aes(x=lng,y=lat),
data=permits[sample(1:nrow(permits), 4000),],
color = I("red"))
## Warning: Removed 449 rows containing missing values (geom_point).
[Figure: a sample of demolition permit locations plotted on the Detroit map]
We can see that blighted buildings are concentrated in the west, southwest, northeast and east of Detroit.
3 Constructing a balanced data set with label: blighted

Once we have some idea about the data, we construct a balanced, labelled data set. From the demolition
data set we have about 4500 data points of type Dismantle/DMLT, which we can consider "blighted"
buildings. We need to construct about 4500 other random points inside Detroit that are not blighted. This
part is a little tricky since it involves the notion of a polygon. We can download the Detroit polygon data
set from Detroit Multipolygon Data; in short, it contains the coordinates of Detroit's boundary points. We
download the .json file and parse it to get all the boundary coordinates. We used Python to parse this file
to obtain the file detroit_multipolygon.csv; an equivalent sketch in R is shown below.
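For reference, the parsing step can also be sketched in R with the jsonlite package. This is a hedged sketch, not the Python script actually used; it assumes the downloaded file is named detroit.json and that its first feature holds the city MultiPolygon:
## A sketch of the parsing step, assuming a GeoJSON file whose first
## feature's geometry is the Detroit MultiPolygon (hypothetical file name)
library(jsonlite)
gj <- fromJSON("./data/detroit.json", simplifyVector = FALSE)
ring <- gj$features[[1]]$geometry$coordinates[[1]][[1]]  # outer boundary ring
det_polygon <- data.frame(lng = sapply(ring, function(p) p[[1]]),
                          lat = sapply(ring, function(p) p[[2]]))
write.csv(det_polygon, "./data/clean/detroit_multipolygon.csv", row.names = FALSE)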
R's sp package provides the function spsample, which is very convenient for sampling points inside a polygon,
as we do below:
######################### 3. Generating random detroit point ####################
library(sp)
det_polygon <- read.csv("./data/clean/detroit_multipolygon.csv")
det_sample <- spsample(Polygon(det_polygon), n=2*nrow(permits), type = "random")
det_sample <- as.data.frame(det_sample)
# head(det_sample)
names(det_sample) <- c("lng","lat")
det_sample$lngR <- round(det_sample$lng, digits=4)
det_sample$latR <- round(det_sample$lat, digits=4)
## we can visualize the data to see they are really inside Detroit city
ggmap(detroitmap) + geom_point(aes(x=lng,y=lat),data=det_sample,
color = I("red"))
[Figure: the randomly sampled points all fall inside the Detroit city boundary]
## Now we extract dismantled locations
blighted <- permits[,c("latR","lngR")]
blighted <- unique(blighted)
blighted$blighted <- TRUE ## the "blighted" label
## In order to work with spatial data, we need to create an id column
## for all points in both det_sample and blighted
allbuildings <- rbind(blighted[,c("latR","lngR")], det_sample[,c("latR","lngR")])
allbuildings <- unique(allbuildings)
allbuildings$id <- 1:nrow(allbuildings)
## We need an auxiliary function that returns an id for a GPS location
################### FINDING THE BUILDING ID ##############################
findIds <- function(x = c(0.0, 0.0), batsID, bycolumn="id", digits=4)
{
  # This function takes a point (lat, lng) and the aggregate building index
  # batsID as arguments, and returns the corresponding id in batsID.
  # Condition: the point x = (lat, lng) comes from the population of (lat, lng)
  # points, which is huge, so this is not valid for outsider points;
  # for outsiders or NAs it just returns NA.
  x <- round(x, digits=digits)
  return(batsID[which(batsID[,"latR"] == x[1] &
                      batsID[,"lngR"] == x[2]), "id"])
}
## Now we create the id column for blighted and det_sample:
## apply the function findIds by rows
blighted$id <- as.numeric(apply(matrix(c(blighted$latR, blighted$lngR), ncol=2),
                                1, findIds, allbuildings))
det_sample$id <- as.numeric(apply(matrix(c(det_sample$latR, det_sample$lngR), ncol=2),
                                  1, findIds, allbuildings))
## Now, we construct the sample by labelling data in det_sample
det_sample$blighted <- FALSE
det_sample[which(det_sample$id %in% blighted$id),"blighted"] <- TRUE
## We use sample_n function from package dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
df_false <- sample_n(det_sample[which(!det_sample$blighted),], nrow(blighted))
df_false <- df_false[,c("latR","lngR","id","blighted")]
## We combine the blighted and random non-blighted buildings
data <- rbind(df_false,blighted)
data[sample(1:nrow(data),5),]
##         latR     lngR    id blighted
## 9402 42.3405 -83.2143 14218    FALSE
## 6582 42.4241 -83.0486 11409    FALSE
## 9398 42.3547 -83.1404 14214    FALSE
## 1669 42.3097 -83.1353   748     TRUE
## 7718 42.4448 -83.0438 12542    FALSE
Now the "data" data set is ready. It is balanced, and the non-blighted buildings are chosen at random from
inside the Detroit polygon.
4 Feature Engineering

4.1 The first and second features: the number of violations

We create the first feature, which will prove to be the most significant, as the number of blight violations
within about 15 m of each building. The second feature is the number of violations within about 1.5 km of
each building; it captures the spatiality of the data. The intuition is clear: if I am surrounded by buildings
that are likely blighted, I am more likely to be blighted myself. This is because buildings within a few blocks
share construction characteristics such as age, the state of the block environment and similar economic
activity (prices being close, for example). The threshold of about 1.5 km (or .01 in latitude and longitude
coordinates) is quite reasonable: it is moderate enough to capture the neighborhood factor. We construct the
two variables "nbviols" and "neighbor_nbviols".
4.2 Features from crime data

We divide the types of crime into two classes: building-related ("ARSON", "DAMAGE TO PROPERTY",
"ENVIRONMENT", "RUNAWAY") and other crimes ("AGGRAVATED ASSAULT", "DRUNKENNESS",
"EMBEZZLEMENT", "HOMICIDE", "JUSTIFIABLE HOMICIDE", "LARCENY", "NEGLIGENT
HOMICIDE", "OTHER BURGLARY", "OUIL DISPOSE OF VEHICLE TO AVOID FORFEITURE",
"STOLEN VEHICLE", "VAGRANCY (OTHER)", "ASSAULT", "BURGLARY", "DANGEROUS DRUGS",
"HEALTH-SAFETY", "IMMIGRATION", "KIDNAPING", "LIQUOR", "WEAPONS OFFENSES",
"STOLEN PROPERTY", "ROBBERY") that are indirectly related to the building's location, while excluding
types such as CONGRESS, ELECTION, etc. We will see that the first class increases the likelihood of a
building being blighted while the second class decreases it. The former is fairly intuitive; the latter can be
explained by noting that these crimes are "intentional" types associated with some "motivation", so the
neighborhood must be in a reasonably "good state" for there to be something to take advantage of. This is
why these crimes decrease the likelihood of a building being "blighted". We construct the two variables
"bd.crimes" and "other.crimes" for this.
4.3 Features from 311 calls

We exclude the types "Traffic Sign Issue", "Traffic Signal Issue", "Street Light Pole Down" and "Test
(internal use only, public issue)", which have no connection to the degradation of the location. We construct
the variable nb_311 for this purpose.
4.4 Features from total fee

To gauge the gravity of the blight violations, we can use the total fee related to the violations as a proxy. We
construct voisin_fee for this.
Note that nb_311, voisin_fee, bd.crimes, other.crimes and neighbor_nbviols are all calculated by totalling
the incidents within .01 in latitude and longitude coordinates.
5 Model Selection

We first split the data set into training and testing sets. We then fit a tree model, as well as logistic
regression, linear discriminant analysis and quadratic discriminant analysis, using 5-fold cross validation to
estimate the training error. All yield an estimated training error of about 28% and a testing error of about
28% (except for quadratic discriminant analysis, which is higher).
6 Further discussion

Since we have time data for each incident, we could compute these features year by year and take the lag
effect between years into account; this would be a spatiotemporal analysis (see the sketch below). Also, since
blighted buildings are concentrated on the northeast, east, west and southwest sides, it would be interesting
to obtain data from nearby cities so that we can count incidents that happen close to locations in Detroit.
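As a hedged illustration of the temporal idea (not part of the original analysis), yearly violation counts per rounded GPS cell could be built roughly as follows; it assumes the TicketIssuedDT column is kept when subsetting viols and that its dates parse in the format below:
## Sketch only: yearly violation counts with a one-year lag feature.
## Assumption: TicketIssuedDT was retained in viols and uses "%m/%d/%Y".
viols$year <- as.numeric(format(as.Date(as.character(viols$TicketIssuedDT),
                                        format = "%m/%d/%Y"), "%Y"))
yearly <- plyr::count(viols, vars = c("latR", "lngR", "year"))
## join last year's count onto this year's cell as a lag feature
prev <- transform(yearly, year = year + 1, freq_lag1 = freq)
yearly <- merge(yearly, prev[, c("latR", "lngR", "year", "freq_lag1")],
                by = c("latR", "lngR", "year"), all.x = TRUE)
yearly$freq_lag1[is.na(yearly$freq_lag1)] <- 0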
As noted at the beginning of the study, mortgage defaults and unpaid taxes are the main reasons for the
Detroit foreclosure crisis, so it would be useful to integrate economic data disaggregated into blocks, for
example real estate prices. Another direction for further analysis would be to vary the 1.5 km distance that
we assumed by default: can we find the optimal distance, or measure the effect of this distance on the
accuracy of our model? Since we have spatial data, we can explore this by simulation, as sketched below.
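A hedged sketch of that distance sweep (not part of the original analysis): recompute the neighborhood feature for several values of epsilon and track the 5-fold cross-validated error of a simple logistic model, reusing the neighbors_total helper and the features constructed in the code below:
## Sketch only: sweep the neighborhood size epsilon and track the CV error.
## Assumes `data`, the violation counts and neighbors_total from the
## feature-engineering code below.
df <- plyr::count(viols, vars = c("latR", "lngR"))
data <- na.omit(data)
data <- data[sample(nrow(data)), ]                  # shuffle before folding
folds <- cut(seq_len(nrow(data)), breaks = 5, labels = FALSE)
epsilons <- c(0.0025, 0.005, 0.01, 0.02)
cv_error <- sapply(epsilons, function(eps) {
  data$neighbor_nbviols <- apply(as.matrix(data[, c("latR", "lngR")]), 1,
                                 neighbors_total, df = df, epsilon = eps)
  mean(sapply(1:5, function(i) {
    valid <- which(folds == i)
    fit <- glm(blighted ~ log(1 + nbviols) + log(1 + neighbor_nbviols),
               data = data[-valid, ], family = binomial)
    preds <- predict(fit, newdata = data[valid, ], type = "response") > 0.5
    mean(preds != data$blighted[valid])
  }))
})
epsilons[which.min(cv_error)]  ## candidate "optimal" neighborhood size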
Finally, check out my source code for improvements. Stay in touch!
######################## 4. FEATURE ENGINEERING ##################################
###################### RETURN total nbviols of neighbors #############################
# given x = (lat, lng), return the total count of its neighbors from the
# data frame df, which has 3 columns: latR, lngR and freq
neighbors_total <- function(x=c(0.0, 0.0), df, epsilon=.01)
{
  ## we define neighbors as all locations within about .01 in
  ## longitude and latitude, or about 1.3 km in Haversine distance
  sum(df[which(abs(df[,"latR"] - x[1]) < epsilon &
               abs(df[,"lngR"] - x[2]) < epsilon), "freq"])
}
############################## 4.1. nbviols and neighbor_nbviols
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## nbviols: total violations within 0.0001 of the latitude and longitude coordinates
df <- plyr::count(viols, vars = c("latR","lngR"))
data$nbviols <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df=df, epsilon = 0.0001),
na.rm=FALSE)
## neighbor_nbviols: total blight violations within 0.01 of the latitude
## and longitude coordinates
df <- plyr::count(viols, vars = c("latR","lngR"))
data$neighbor_nbviols <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df=df), na.rm=FALSE)
########################## 4.2. CRIME DATA
buildings.crimes <- c("ARSON", "DAMAGE TO PROPERTY", "ENVIRONMENT","RUNAWAY")
related.crimes <- c("AGGRAVATED ASSAULT", "DRUNKENNESS", "EMBEZZLEMENT",
"HOMICIDE","JUSTIFIABLE HOMICIDE","LARCENY","NEGLIGENT HOMICIDE",
"OTHER BURGLARY","OUIL DISPOSE OF VEHICLE TO AVOID FORFEITURE",
"STOLEN VEHICLE","VAGRANCY (OTHER)", "ASSAULT","BURGLARY",
"DANGEROUS DRUGS", "HEALTH-SAFETY", "IMMIGRATION", "KIDNAPING",
"LIQUOR", "WEAPONS OFFENSES", "STOLEN PROPERTY", "ROBBERY"
)
## bd.crimes feature: building related crimes
df <- plyr::count(crimes[which(crimes$category %in% buildings.crimes),],
vars = c("latR","lngR"))
data$bd.crimes <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df), na.rm=FALSE)
## other.crimes feature: other crimes indirectly related to the location
df <- plyr::count(crimes[which(crimes$category %in% related.crimes),],
vars = c("latR","lngR"))
data$other.crimes <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df), na.rm=FALSE)
######################### 4.3. 311 calls data
excluded.types <- c("Traffic Sign Issue", "Traffic Signal Issue", "Street Light Pole Down",
                    "Test (internal use only, public issue)")
df <- plyr::count(dcalls[-which(dcalls$issue_type %in% excluded.types),], vars = c("latR","lngR"))
data$nb_311 <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df=df), na.rm=FALSE)
######################### 4.4. Violation total fee of neighbors
df <- viols[,c("latR","lngR","totalfee")]
names(df) <- c("latR","lngR","freq")
data$voisin_fee <- as.numeric(apply(as.matrix(data[,c("latR","lngR")]), 1,
neighbors_total, df=df), na.rm=FALSE)
summary(data)
##       latR            lngR              id            blighted
##  Min.   :42.26   Min.   :-83.29   Min.   :    1   Mode :logical
##  1st Qu.:42.36   1st Qu.:-83.17   1st Qu.: 2424   FALSE:4845
##  Median :42.39   Median :-83.10   Median : 4846   TRUE :4845
##  Mean   :42.39   Mean   :-83.08   Mean   : 7152   NA's :0
##  3rd Qu.:42.42   3rd Qu.:-83.01   3rd Qu.:11880
##  Max.   :69.00   Max.   : 67.00   Max.   :19059
##  NA's   :1       NA's   :1        NA's   :1
##     nbviols       neighbor_nbviols   bd.crimes      other.crimes
##  Min.   : 0.000   Min.   :   0     Min.   :  0    Min.   :   0.0
##  1st Qu.: 0.000   1st Qu.:1745     1st Qu.: 72    1st Qu.: 418.0
##  Median : 0.000   Median :2624     Median :103    Median : 589.0
##  Mean   : 1.259   Mean   :2727     Mean   :106    Mean   : 608.6
##  3rd Qu.: 1.000   3rd Qu.:3813     3rd Qu.:134    3rd Qu.: 786.0
##  Max.   :51.000   Max.   :6485     Max.   :296    Max.   :2047.0
##      nb_311        voisin_fee
##  Min.   :  0.0   Min.   :      0
##  1st Qu.:128.0   1st Qu.:1102964
##  Median :169.0   Median :1717867
##  Mean   :170.1   Mean   :1829348
##  3rd Qu.:207.0   3rd Qu.:2575991
##  Max.   :722.0   Max.   :5157415
############################### 5. MODEL SELECTION ##########################
################# 5.1 Training and test set
Train <- sample(nrow(data), size = nrow(data)*.8)
training <- data[Train,]
testing <- data[-Train,]
################ 5.2 Creating 5-fold cross validation
training <- training[sample(nrow(training)),]
# Create 5 equally sized folds
folds <- cut(seq(1,nrow(training)),breaks=5,labels=FALSE)
################ 5.3. CV error with glm
#Perform 5 fold cross validation
error <- 1:5
for(i in 1:5){
  # Segment the data by fold using the which() function
  validIndexes <- which(folds==i, arr.ind=TRUE)
  validData <- training[validIndexes, ]
  trainData <- training[-validIndexes, ]
  # Fit on the training partition and evaluate on the validation partition
  mod_fit <- glm(blighted ~ log(1+nbviols) + log(1+neighbor_nbviols)
                 + log(1 + bd.crimes) + log(1 + other.crimes) + log(1 + nb_311) +
                 log(1 + voisin_fee),
                 data=trainData, family = binomial)
  mod_probs <- predict(mod_fit, newdata = validData, type="response")
  mod_preds <- rep(FALSE, length(mod_probs))
  mod_preds[mod_probs > 0.5] <- TRUE
  error[i] <- mean(mod_preds != validData$blighted)
}
## Estimated training error for logistic regression
mean(error) ## about 27.9%
## [1] 0.2794088
## other.crimes decreases the likelihood of being blighted and
## bd.crimes increases it. Both nbviols and neighbor_nbviols are
## significant, while nb_311 and voisin_fee are not.
summary(mod_fit)
##
## Call:
## glm(formula = blighted ~ log(1 + nbviols) + log(1 + neighbor_nbviols) +
##     log(1 + bd.crimes) + log(1 + other.crimes) + log(1 + nb_311) +
##     log(1 + voisin_fee), family = binomial, data = trainData)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.2591  -0.9559   0.0356   1.1393   2.7145
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)               -3.65877    1.12365  -3.256 0.001129 **
## log(1 + nbviols)           1.95638    0.07440  26.297  < 2e-16 ***
## log(1 + neighbor_nbviols)  0.74399    0.20715   3.591 0.000329 ***
## log(1 + bd.crimes)         0.49512    0.12790   3.871 0.000108 ***
## log(1 + other.crimes)     -0.99628    0.14632  -6.809 9.85e-12 ***
## log(1 + nb_311)            0.18104    0.10874   1.665 0.095935 .
## log(1 + voisin_fee)        0.03093    0.18279   0.169 0.865618
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 8596.4  on 6200  degrees of freedom
## Residual deviance: 6705.0  on 6194  degrees of freedom
## AIC: 6719
##
## Number of Fisher Scoring iterations: 6
############### 5.4. CV error with linear discriminant analysis lda
library(MASS) # for the lda function

##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
##     select
error <- 1:5
for(i in 1:5){
  # Segment the data by fold using the which() function
  validIndexes <- which(folds==i, arr.ind=TRUE)
  validData <- training[validIndexes, ]
  trainData <- training[-validIndexes, ]
  # Fit on the training partition and evaluate on the validation partition
  mod_fit <- lda(blighted ~ log(1+nbviols) + log(1+neighbor_nbviols)
                 + log(1 + bd.crimes) + log(1 + other.crimes) + log(1 + nb_311) +
                 log(1 + voisin_fee),
                 data=trainData)
  mod_probs <- predict(mod_fit, validData)
  mod_preds <- mod_probs$class
  error[i] <- mean(mod_preds != validData$blighted)
}
### Estimated training error for lda
mean(error) ## about 28.2%
## [1] 0.2817307
############### 5.5. CV error with quadratic discriminant analysis
library(MASS) # for qda function
error <- 1:5
for(i in 1:5){
  # Segment the data by fold using the which() function
  validIndexes <- which(folds==i, arr.ind=TRUE)
  validData <- training[validIndexes, ]
  trainData <- training[-validIndexes, ]
  # Fit on the training partition and evaluate on the validation partition
  mod_fit <- qda(blighted ~ log(1+nbviols) + log(1+neighbor_nbviols)
                 + log(1 + bd.crimes) + log(1 + other.crimes) + log(1 + nb_311) +
                 log(1 + voisin_fee),
                 data=trainData)
  mod_probs <- predict(mod_fit, validData)
  mod_preds <- mod_probs$class
  error[i] <- mean(mod_preds != validData$blighted)
}
mean(error) ## about 36.9% error
## [1] 0.3689363
############### 5.6. test error with tree
library(tree)
data$blighted <- factor(data$blighted)
training$blighted <- factor(training$blighted)
testing$blighted <- factor(testing$blighted)
blighted.tree <- tree(blighted ~ nbviols + neighbor_nbviols
+ bd.crimes + other.crimes + nb_311
+ voisin_fee, data = training)
blighted.tree.pred <- predict(blighted.tree, testing, type="class")
mean(blighted.tree.pred != testing$blighted) # test error: about 27.7%
## [1] 0.2765738
summary(blighted.tree) ## misclassification error rate about 28%

##
## Classification tree:
## tree(formula = blighted ~ nbviols + neighbor_nbviols + bd.crimes +
##     other.crimes + nb_311 + voisin_fee, data = training)
## Variables actually used in tree construction:
## [1] "nbviols"    "voisin_fee"
## Number of terminal nodes:  3
## Residual mean deviance:  1.085 = 8406 / 7749
## Misclassification error rate: 0.2793 = 2165 / 7752
7 Links

1. Source code
2. Detroit Blight Violations Data
3. Detroit Multipolygon Data
4. Detroit Polygon Points - clean points
5. This file