Detecting Gentrification with Spatial Regression

The following analysis aims at detecting the geography of gentrification in Washington DC through regrious statistical and econmetrics modeling techniques. We shall assess gentrification as indiciated by the rise of middle or high income class at census tract level. The data contains the change of values between the years 2000 and 2015.

The first part covers some basic EDA, OLS and GWR analyses as well as tests for assumptions of a linear model. After dtecting spatial association using Moran's I and the type of association using ML tests, we'll introduce the Generalized S2SLS model to predict and visualize gentrification. The analysis is conducted in R/R-Studio environment and other tools such as Geoda/GeodaSpace or Stata can be also used. You can download the Rmd file and the data here.

The data

gd<-readOGR("gentrification_data.shp")
OGR data source with driver: ESRI Shapefile
Source: "gentrification_data.shp", layer: "gentrification_data"
with 179 features
It has 17 fields
Integer64 fields read as strings:  change_hhs change_vcr 
names(gd)
 [1] "ALAND"      "AWATER"     "INTPTLAT"   "INTPTLON"   "GEO_id"
 [6] "p_chg_wh"   "p_chg_bk"   "p_chg_edc"  "change_inc" "change_hhs"
[11] "change_ren" "p_chg_ownd" "p_chg_rent" "p_chg_pvt"  "change_vcr"
[16] "p_chg_chfm" "chg_md_age"

Data Source: National Census Bearau, Geolytics.com, The National Archive of Criminal Justice Data.

Dimension: 179 X 13 (excluding metadata)

Key Description
GEO_id Geographic (unique) id
p_chg_wh Percentage change in white population
p_chg_bk Percentage chagne in black population
p_chg_edc Percentage change in the 25 years old and above with a Bachelors or Higher academic degree (s)
change_inc Change in median household income($) (USD, adjusted to infilation)
change_hhs Change in median house value ($) (USD, adjusted to infilation)
change_ren Change in gross median rent ($) (USD, adjsted to infiliation)
p_chg_ownd Percentage change in the number of households owned
p_chg_rent Percentage change in the number of households rented
p_cg_pvt Percentage change in the number of people below poverty line
change_vcr Change in the number of Violent Crimes
p_chg_chfm Percentage change in the number of families with children
chg_md_age Change in median age

EDA

Summary

    p_chg_wh          p_chg_bk        p_chg_edc         change_inc
 Min.   :-14.400   Min.   :-74.57   Min.   :-15.380   Min.   :-37259.1
 1st Qu.: -0.155   1st Qu.:-21.25   1st Qu.:  3.235   1st Qu.:  -823.7
 Median :  3.780   Median : -7.58   Median :  7.870   Median : 10750.5
 Mean   :  9.666   Mean   :-13.45   Mean   : 12.439   Mean   : 15910.3
 3rd Qu.: 18.385   3rd Qu.: -1.21   3rd Qu.: 20.800   3rd Qu.: 31858.0
 Max.   : 66.850   Max.   :  8.03   Max.   : 56.070   Max.   : 95491.5

   change_hhs    change_ren        p_chg_ownd        p_chg_rent
 -104513:  1   Min.   : -33.39   Min.   :-22.700   Min.   :-28.2500
 -13737 :  1   1st Qu.: 240.32   1st Qu.: -3.220   1st Qu.: -5.5750
 -158222:  1   Median : 418.39   Median :  0.100   Median : -1.0900
 -164277:  1   Mean   : 478.16   Mean   :  1.339   Mean   : -0.9529
 -203462:  1   3rd Qu.: 637.17   3rd Qu.:  5.225   3rd Qu.:  3.7550
 -21100 :  1   Max.   :1951.96   Max.   : 26.620   Max.   : 33.0500
 (Other):173
   p_chg_pvt        change_vcr    p_chg_chfm       chg_md_age
 Min.   :-50.46   -124   :  3   Min.   :-45.83   Min.   :-15.3000
 1st Qu.: -7.83   -28    :  3   1st Qu.:  8.03   1st Qu.: -2.5000
 Median : -1.44   -43    :  3   Median : 17.06   Median : -0.4000
 Mean   : -2.41   -53    :  3   Mean   : 18.57   Mean   : -0.2715
 3rd Qu.:  3.68   -54    :  3   3rd Qu.: 28.23   3rd Qu.:  2.0500
 Max.   : 17.14   -58    :  3   Max.   : 65.97   Max.   : 12.5000
                  (Other):161                                      

Structure

'data.frame':   179 obs. of  12 variables:
 $ p_chg_wh  : num  -0.63 3.61 2.42 2.68 5.4 ...
 $ p_chg_bk  : num  -0.43 -4.48 -4.83 -5.07 -5.44 ...
 $ p_chg_edc : num  3.2 7.44 12.17 6.22 7.21 ...
 $ change_inc: num  -2683 -10890 33582 -4377 7279 ...
 $ change_hhs: Factor w/ 179 levels "-104513","-13737",..: 16 23 29 161 138 156 53 73 62 76 ...
 $ change_ren: num  206 156 520 154 285 ...
 $ p_chg_ownd: num  -4.95 0.87 22.38 14.52 3.65 ...
 $ p_chg_rent: num  1.52 15.57 4.44 -28.25 -6.84 ...
 $ p_chg_pvt : num  10.8 4.34 -24.88 5.27 -7.43 ...
 $ change_vcr: Factor w/ 127 levels "-100","-102",..: 15 38 7 34 96 4 25 8 117 108 ...
 $ p_chg_chfm: num  30.9 14.9 14.7 44.6 18 ...
 $ chg_md_age: num  -12.7 0 2.7 -0.8 3.1 -1.2 -0.7 -4.2 -6.5 -1.3 ...

Plots

plot1<-ggplot(gd_v2, aes(p_chg_wh, p_chg_bk)) + geom_point( color ="#5ce32d", alpha = 0.9) +theme_bw() +labs ( title ="Demographic Change", x = "Change in White Population (%) ", y = " Change in Black Population (%) ") + theme(legend.position = "none")

plot2<-ggplot(gd_v2, aes(p_chg_edc, change_inc)) + geom_point( color ="#a67ff9", alpha = 0.9) +theme_bw() +labs ( title ="Household Income", x = " Change in Educated (%)  ", y = " Change in Income ($)") + theme(legend.position = "none")

grid.arrange(plot1, plot2, ncol = 2, widths = c(1, 1))

plot3<-ggplot(gd_v2, aes(change_vcr)) + geom_histogram( bins =15, fill ="#adadfe", color ="#ffffff", alpha = 0.9) +theme_bw() +labs ( title ="Violent Crime", x = " Change in the number of incidents", y = " Census Tracts ") + theme(legend.position = "none")

plot4<-ggplot(gd_v2, aes(change_hhs, change_ren)) + geom_boxplot(color ="#f35e80", fill ="#ffc9d5", alpha = 0.9) +theme_bw() +labs ( title ="Housing Price", x = " Change in House Value ($)", y = "Change in Rent ($)") + theme(legend.position = "none")

grid.arrange(plot3, plot4, ncol = 2, widths = c(1,1))

plot5<-ggplot(gd_v2, aes(p_chg_pvt, change_ren)) + geom_violin(color ="#a67f9c", fill ="#ffedfa", alpha = 0.9) +theme_bw() +labs ( title = "Rent", x = " Share of Low income residents (%)  ", y = " Change in rent ($)") + theme(legend.position = "none")

plot6<-ggplot(gd_v2, aes(p_chg_rent, p_chg_ownd)) + geom_hex( color ="#ffffff", alpha = 0.8) +theme_bw() +labs ( title = "Property Type", x = "Rented Properties (%)  ", y = "Owned Properties (%)") + theme(legend.position = "none")

grid.arrange(plot5, plot6, ncol = 2, widths = c(1, 1))

Normality test

Basic plots

par(mfrow=c(1,3))
Income_change <-gd_v2$change_inc
plot(Income_change, main = "Scatter Plot")
h <- hist(Income_change, col = "lightgray", xlab = "Accuracy", main = "Histogram")
xfit <- seq(min(Income_change), max(Income_change), length = 40)
yfit <- dnorm(xfit, mean = mean(Income_change), sd = sd(Income_change))
yfit <- yfit * diff(h$mids[1:2]) * length(Income_change)
lines(xfit, yfit, col = "blue", lwd = 1)
qqnorm(Income_change)
qqline(Income_change, col = "red")

Statistical tests

Skewness<-skew(Income_change)
Kurtosis<-kurtosi(Income_change) # Taildness
cbind(Skewness, Kurtosis) 
      Skewness  Kurtosis
[1,] 0.9246697 0.7904545
shapiro.test(Income_change) # Shapiro Wilk test

    Shapiro-Wilk normality test

data:  Income_change
W = 0.93937, p-value = 7.244e-07
jb.norm.test(Income_change, nrepl=2000) #Jarque–Bera test

    Jarque-Bera test for normality

data:  Income_change
JB = 31.118, p-value = 0.0015

Correlation

cor.gd_sub<-cor(gd_v2)
corrplot(cor.gd_sub, method ="number")

Feature Selection

par(mfrow=c(1:2))
reg.best <- regsubsets(change_inc~., data = gd_v2, nvmax = 19)
plot(reg.best, scale = "adjr2", main = "Adjusted R^2")
plot(reg.best, scale = "bic", main = "BIC")

Selected Explanatory Variables:

  • Percentage change in White population (p_chg_wh)
  • Change in Rent (change_ren)
  • Percentage change in households rented (p_chg_rent)
  • Percentage change in People living under poverty line (p_chg_pvt)

Formula <- change_inc~p_chg_wh+change_ren+p_chg_pvt+p_chg_rent

OLS Regression Model

model.lm<-lm(Formula, data =gd)
summary(model.lm)

Call:
lm(formula = Formula, data = gd)

Residuals:
   Min     1Q Median     3Q    Max
-32795  -7002    145   6181  40995

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4530.115   1671.908  -2.710  0.00741 **
p_chg_wh      518.399     83.867   6.181 4.39e-09 ***
change_ren     28.600      3.434   8.329 2.35e-14 ***
p_chg_pvt    -551.159    136.592  -4.035 8.16e-05 ***
p_chg_rent   -446.655    108.955  -4.099 6.34e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11930 on 174 degrees of freedom
Multiple R-squared:  0.7342,    Adjusted R-squared:  0.7281
F-statistic: 120.2 on 4 and 174 DF,  p-value: < 2.2e-16

Residuals

Plots

par(mfrow=c(2,2))
plot(model.lm, which = 1:4)

bptest(model.lm, data=sub_data, studentize=FALSE)

    Breusch-Pagan test

data:  model.lm
BP = 24.251, df = 4, p-value = 7.114e-05

Heteroscedasticity

Breusch–Pagan test

bptest(model.lm, data=sub_data, studentize=FALSE)

    Breusch-Pagan test

data:  model.lm
BP = 24.251, df = 4, p-value = 7.114e-05

Multicollinearity

Varrience Infilation Factor (VIF)

data.frame(vif(model.lm))
           vif.model.lm.
p_chg_wh        1.875582
change_ren      1.574237
p_chg_pvt       1.895434
p_chg_rent      1.115477

GWR model

Geographically weighted regression (GWR) is an exploratory technique mainly intended to indicate where non-stationarity is taking place on the map, that is where locally weighted regression coefficients move away from their global values.

Bandwidth

model.bnw<-gwr.sel(Formula, data = gd, gweight =gwr.Gauss, verbose = F)
model.bnw
[1] 3.119498

Result

model.gwr<-gwr(Formula, data = gd, bandwidth = model.bnw, gweight =gwr.Gauss, hatmatrix = TRUE)

model.gwr
Call:
gwr(formula = Formula, data = gd, bandwidth = model.bnw, gweight = gwr.Gauss,
    hatmatrix = TRUE)
Kernel function: gwr.Gauss
Fixed bandwidth: 3.119498
Summary of GWR coefficient estimates at data points:
                   Min.    1st Qu.     Median    3rd Qu.       Max.
X.Intercept. -6808.0532 -6138.5644 -4814.0667 -3235.7688    84.9717
p_chg_wh       379.8810   464.4163   515.2463   549.5800   691.5876
change_ren       8.3169    28.2562    29.3029    30.2283    35.5242
p_chg_pvt     -930.1638  -791.5239  -589.2421  -409.3736  -291.8086
p_chg_rent    -719.0812  -567.8842  -473.2876  -345.5224  -137.6871
               Global
X.Intercept. -4530.11
p_chg_wh       518.40
change_ren      28.60
p_chg_pvt     -551.16
p_chg_rent    -446.65
Number of data points: 179
Effective number of parameters (residual: 2traceS - traceS'S): 23.13577
Effective degrees of freedom (residual: 2traceS - traceS'S): 155.8642
Sigma (residual: 2traceS - traceS'S): 11652.51
Effective number of parameters (model: traceS): 17.01369
Effective degrees of freedom (model: traceS): 161.9863
Sigma (model: traceS): 11430.2
Sigma (ML): 10873.42
AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3875.569
AIC (GWR p. 96, eq. 4.22): 3852.273
Residual sum of squares: 21163414626
Quasi-global R2: 0.7727706 

Spatial Autoregressive Model

Contiguity & Weight matrix

First Order Queen Contiguity matrix
proj4string(gd)<-CRS("+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0") # Create projection just in case .pjr file doesn't come alone with the shp file.
gd_NAD<-spTransform(gd, CRS("+init=ESRI:102685")) # Reproject to ESRI:102685 (NAD 1983 Stateplane Maryland FIPS 1900)

queen.cont<-poly2nb(gd)
coords<-coordinates(gd)
plot(gd)
plot(queen.cont, coords, add =TRUE, col ="red")

Weight matrix

queen.con.wmx<-nb2listw(queen.cont, style ="B") #B stands for binary matrix in which neighbors are assigned in 1 & 0
queen.con.wmx
Characteristics of weights list object:
Neighbour list object:
Number of regions: 179
Number of nonzero links: 1070
Percentage nonzero weights: 3.339471
Average number of links: 5.977654

Weights style: B
Weights constants summary:
    n    nn   S0   S1    S2
B 179 32041 1070 2140 28504

Spatial autocorrelation

Moran’s I

moran.test(gd_NAD$change_inc, listw=queen.con.wmx)

    Moran I test under randomisation

data:  gd_NAD$change_inc
weights: queen.con.wmx

Moran I statistic standard deviate = 10.395, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance
      0.433376410      -0.005617978       0.001783346 

Scatterplot

locm<-localmoran(gd_NAD$change_inc, queen.con.wmx)
gd$scInc<-scale(gd$change_inc) # Scale
gd$lagged_scInc<-lag.listw(queen.con.wmx, gd$scInc)

plot(x = gd$scInc, y = gd$lagged_scInc, main ="Moran Scatterplot of Income")
abline(h =, v =0)
abline(lm(gd$lagged_scInc ~ gd$scInc), lty =3, lwd = 4, col = "red")

Local Indicators of Spatial Association (LISA) map

gd$quad_sig <- NA
gd@data[(gd$scInc >= 0 & gd$lagged_scInc >= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 1
gd@data[(gd$scInc <= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 2
gd@data[(gd$scInc >= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 3
gd@data[(gd$scInc >= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 4
gd@data[(gd$scInc <= 0 & gd$lagged_scInc >= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 5

breaks <- seq(1, 5, 1) #Set the breaks for the thematic map classes

labels <- c("high-High", "low-Low", "High-Low", "Low-High", "Not Signif.")  #thematic map classes

np <- findInterval(gd$quad_sig, breaks) #?findInterval 
# Assign colors to each map class
colors <- c("red", "blue", "lightpink", "skyblue2", "white")
plot(gd, col = colors[np])  #colors[np] manually sets the color for each county
mtext("Local Moran's I", cex = 1.5, side = 3, line = 1)
legend("bottomleft", legend = labels, fill = colors, bty = "n")

Lagrange Multiplier Tests

Dignostics for Spatial Dependence

queen.con.wmx.row <-nb2listw(queen.cont) # Spatial Weights matrix row standardized

lm.result<-lm.LMtests(model.lm, queen.con.wmx, zero.policy=NULL, test="all", spChk=NULL, naSubset=TRUE)

t.lm.result<-t(sapply(lm.result, function(x) c(x$statistic, x$parameter, x$p.value)))

colnames(t.lm.result) <-c("Statistic", "df", "p-value")
printCoefmat((t.lm.result))
       Statistic       df p-value
LMerr    6.51749  1.00000  0.0107
LMlag   16.95051  1.00000  0.0000
RLMerr   0.55979  1.00000  0.4543
RLMlag  10.99281  1.00000  0.0009
SARMA   17.51030  2.00000  0.0002

All except Robust Lagrange Multiplier in the error term (RMLerr) are statistically signficant. Hence, based on RMlag result, the spatial dependence is in the Spatial lag.

Generalized Spatial 2SLS model

The fitting implementation fits a spatial lag model:

     Yn = Xnβ + λWnYn + Un, |λ| < 1       (1)
     Un = ρMnUn + εn,       |ρ| < 1       (2)

where: Yn is the n ×1 vector of observations on the dependent variable, Xn is the n × k matrix of observations on k exogenous variables, Wn and Mn are n × n spatial weighting matrices of known constants, β is the k × 1 vector of regression parameters, λ and ρ are scalar autoregressive parameters, Un is the n × 1 vector of regression disturbances, and εn is an n × 1 vecto of innovations Wnyn and Mnun are spatial lags of Yn and Un, respectively.

The models uses spatially lagged X variables as instruments for the spatially lagged dependent variable. In the above equation, ρ (Rho) is a scalar parameter that indicates the effect of the dependent variable in the neighbors on the dependent variable in the focal area.

‘spdep’ Package

gd_v2scale<-data.frame(scale(gd_v2))
model.s2sls<-stsls(Formula, data =gd, queen.con.wmx, robust = TRUE) #estimates adjusted for Hetroscadesticity

summary(model.s2sls)

Call:
stsls(formula = Formula, data = gd, listw = queen.con.wmx, robust = TRUE)

Residuals:
     Min       1Q   Median       3Q      Max
-32467.9  -6059.2  -1322.8   6433.1  33332.2

Coefficients:
               Estimate HC0 std. Error z value  Pr(>|z|)
Rho          3.5127e-02     1.0547e-02  3.3307 0.0008663
(Intercept) -6.3013e+03     1.4917e+03 -4.2243 2.397e-05
p_chg_wh     3.8762e+02     9.5397e+01  4.0632 4.841e-05
change_ren   2.7516e+01     3.5610e+00  7.7270 1.110e-14
p_chg_pvt   -5.4613e+02     1.2041e+02 -4.5357 5.742e-06
p_chg_rent  -3.9241e+02     1.0298e+02 -3.8107 0.0001386

Residual variance (sigma squared): 131330000, (sigma: 11460)
RMSE<-model.s2sls$sse/179
RMSE
[1] 126929451

Quantile Maps

Residuals

gd$Residuals<-model.s2sls$residuals

res.map<-qtm(gd, fill="Residuals", fill.style ="quantile", fill.alpha  = 0.9, fill.palette ="RdBu", borders ="#f1f1f1")+ tm_legend(legend.position =c("left", "bottom"))
lf.res<-tmap_leaflet(res.map)
lf.res

Predicted Values

Predicted.val<- (gd$change_inc - gd$Residuals)
gd$Predicted.values<-Predicted.val

pred.map<-qtm(gd, fill="Predicted.values", fill.style = "quantile", fill.palette ="RdBu", borders ="#f1f1f1")

lf.pr<-tmap_leaflet(pred.map)
lf.pr