The following analysis aims at detecting the geography of gentrification in Washington DC through regrious statistical and econmetrics modeling techniques. We shall assess gentrification as indiciated by the rise of middle or high income class at census tract level. The data contains the change of values between the years 2000 and 2015.
The first part covers some basic EDA, OLS and GWR analyses as well as tests for assumptions of a linear model. After dtecting spatial association using Moran's I and the type of association using ML tests, we'll introduce the Generalized S2SLS model to predict and visualize gentrification. The analysis is conducted in R/R-Studio environment and other tools such as Geoda/GeodaSpace or Stata can be also used. You can download the Rmd file and the data here.
gd<-readOGR("gentrification_data.shp")
OGR data source with driver: ESRI Shapefile
Source: "gentrification_data.shp", layer: "gentrification_data"
with 179 features
It has 17 fields
Integer64 fields read as strings: change_hhs change_vcr
names(gd)
[1] "ALAND" "AWATER" "INTPTLAT" "INTPTLON" "GEO_id"
[6] "p_chg_wh" "p_chg_bk" "p_chg_edc" "change_inc" "change_hhs"
[11] "change_ren" "p_chg_ownd" "p_chg_rent" "p_chg_pvt" "change_vcr"
[16] "p_chg_chfm" "chg_md_age"
Data Source: National Census Bearau, Geolytics.com, The National Archive of Criminal Justice Data.
Dimension: 179 X 13 (excluding metadata)
Key | Description |
---|---|
GEO_id | Geographic (unique) id |
p_chg_wh | Percentage change in white population |
p_chg_bk | Percentage chagne in black population |
p_chg_edc | Percentage change in the 25 years old and above with a Bachelors or Higher academic degree (s) |
change_inc | Change in median household income($) (USD, adjusted to infilation) |
change_hhs | Change in median house value ($) (USD, adjusted to infilation) |
change_ren | Change in gross median rent ($) (USD, adjsted to infiliation) |
p_chg_ownd | Percentage change in the number of households owned |
p_chg_rent | Percentage change in the number of households rented |
p_cg_pvt | Percentage change in the number of people below poverty line |
change_vcr | Change in the number of Violent Crimes |
p_chg_chfm | Percentage change in the number of families with children |
chg_md_age | Change in median age |
p_chg_wh p_chg_bk p_chg_edc change_inc
Min. :-14.400 Min. :-74.57 Min. :-15.380 Min. :-37259.1
1st Qu.: -0.155 1st Qu.:-21.25 1st Qu.: 3.235 1st Qu.: -823.7
Median : 3.780 Median : -7.58 Median : 7.870 Median : 10750.5
Mean : 9.666 Mean :-13.45 Mean : 12.439 Mean : 15910.3
3rd Qu.: 18.385 3rd Qu.: -1.21 3rd Qu.: 20.800 3rd Qu.: 31858.0
Max. : 66.850 Max. : 8.03 Max. : 56.070 Max. : 95491.5
change_hhs change_ren p_chg_ownd p_chg_rent
-104513: 1 Min. : -33.39 Min. :-22.700 Min. :-28.2500
-13737 : 1 1st Qu.: 240.32 1st Qu.: -3.220 1st Qu.: -5.5750
-158222: 1 Median : 418.39 Median : 0.100 Median : -1.0900
-164277: 1 Mean : 478.16 Mean : 1.339 Mean : -0.9529
-203462: 1 3rd Qu.: 637.17 3rd Qu.: 5.225 3rd Qu.: 3.7550
-21100 : 1 Max. :1951.96 Max. : 26.620 Max. : 33.0500
(Other):173
p_chg_pvt change_vcr p_chg_chfm chg_md_age
Min. :-50.46 -124 : 3 Min. :-45.83 Min. :-15.3000
1st Qu.: -7.83 -28 : 3 1st Qu.: 8.03 1st Qu.: -2.5000
Median : -1.44 -43 : 3 Median : 17.06 Median : -0.4000
Mean : -2.41 -53 : 3 Mean : 18.57 Mean : -0.2715
3rd Qu.: 3.68 -54 : 3 3rd Qu.: 28.23 3rd Qu.: 2.0500
Max. : 17.14 -58 : 3 Max. : 65.97 Max. : 12.5000
(Other):161
'data.frame': 179 obs. of 12 variables:
$ p_chg_wh : num -0.63 3.61 2.42 2.68 5.4 ...
$ p_chg_bk : num -0.43 -4.48 -4.83 -5.07 -5.44 ...
$ p_chg_edc : num 3.2 7.44 12.17 6.22 7.21 ...
$ change_inc: num -2683 -10890 33582 -4377 7279 ...
$ change_hhs: Factor w/ 179 levels "-104513","-13737",..: 16 23 29 161 138 156 53 73 62 76 ...
$ change_ren: num 206 156 520 154 285 ...
$ p_chg_ownd: num -4.95 0.87 22.38 14.52 3.65 ...
$ p_chg_rent: num 1.52 15.57 4.44 -28.25 -6.84 ...
$ p_chg_pvt : num 10.8 4.34 -24.88 5.27 -7.43 ...
$ change_vcr: Factor w/ 127 levels "-100","-102",..: 15 38 7 34 96 4 25 8 117 108 ...
$ p_chg_chfm: num 30.9 14.9 14.7 44.6 18 ...
$ chg_md_age: num -12.7 0 2.7 -0.8 3.1 -1.2 -0.7 -4.2 -6.5 -1.3 ...
plot1<-ggplot(gd_v2, aes(p_chg_wh, p_chg_bk)) + geom_point( color ="#5ce32d", alpha = 0.9) +theme_bw() +labs ( title ="Demographic Change", x = "Change in White Population (%) ", y = " Change in Black Population (%) ") + theme(legend.position = "none")
plot2<-ggplot(gd_v2, aes(p_chg_edc, change_inc)) + geom_point( color ="#a67ff9", alpha = 0.9) +theme_bw() +labs ( title ="Household Income", x = " Change in Educated (%) ", y = " Change in Income ($)") + theme(legend.position = "none")
grid.arrange(plot1, plot2, ncol = 2, widths = c(1, 1))
plot3<-ggplot(gd_v2, aes(change_vcr)) + geom_histogram( bins =15, fill ="#adadfe", color ="#ffffff", alpha = 0.9) +theme_bw() +labs ( title ="Violent Crime", x = " Change in the number of incidents", y = " Census Tracts ") + theme(legend.position = "none")
plot4<-ggplot(gd_v2, aes(change_hhs, change_ren)) + geom_boxplot(color ="#f35e80", fill ="#ffc9d5", alpha = 0.9) +theme_bw() +labs ( title ="Housing Price", x = " Change in House Value ($)", y = "Change in Rent ($)") + theme(legend.position = "none")
grid.arrange(plot3, plot4, ncol = 2, widths = c(1,1))
plot5<-ggplot(gd_v2, aes(p_chg_pvt, change_ren)) + geom_violin(color ="#a67f9c", fill ="#ffedfa", alpha = 0.9) +theme_bw() +labs ( title = "Rent", x = " Share of Low income residents (%) ", y = " Change in rent ($)") + theme(legend.position = "none")
plot6<-ggplot(gd_v2, aes(p_chg_rent, p_chg_ownd)) + geom_hex( color ="#ffffff", alpha = 0.8) +theme_bw() +labs ( title = "Property Type", x = "Rented Properties (%) ", y = "Owned Properties (%)") + theme(legend.position = "none")
grid.arrange(plot5, plot6, ncol = 2, widths = c(1, 1))
par(mfrow=c(1,3))
Income_change <-gd_v2$change_inc
plot(Income_change, main = "Scatter Plot")
h <- hist(Income_change, col = "lightgray", xlab = "Accuracy", main = "Histogram")
xfit <- seq(min(Income_change), max(Income_change), length = 40)
yfit <- dnorm(xfit, mean = mean(Income_change), sd = sd(Income_change))
yfit <- yfit * diff(h$mids[1:2]) * length(Income_change)
lines(xfit, yfit, col = "blue", lwd = 1)
qqnorm(Income_change)
qqline(Income_change, col = "red")
Statistical tests
Skewness<-skew(Income_change)
Kurtosis<-kurtosi(Income_change) # Taildness
cbind(Skewness, Kurtosis)
Skewness Kurtosis
[1,] 0.9246697 0.7904545
shapiro.test(Income_change) # Shapiro Wilk test
Shapiro-Wilk normality test
data: Income_change
W = 0.93937, p-value = 7.244e-07
jb.norm.test(Income_change, nrepl=2000) #Jarque–Bera test
Jarque-Bera test for normality
data: Income_change
JB = 31.118, p-value = 0.0015
cor.gd_sub<-cor(gd_v2)
corrplot(cor.gd_sub, method ="number")
par(mfrow=c(1:2))
reg.best <- regsubsets(change_inc~., data = gd_v2, nvmax = 19)
plot(reg.best, scale = "adjr2", main = "Adjusted R^2")
plot(reg.best, scale = "bic", main = "BIC")
Selected Explanatory Variables:
Formula <- change_inc~p_chg_wh+change_ren+p_chg_pvt+p_chg_rent
model.lm<-lm(Formula, data =gd)
summary(model.lm)
Call:
lm(formula = Formula, data = gd)
Residuals:
Min 1Q Median 3Q Max
-32795 -7002 145 6181 40995
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4530.115 1671.908 -2.710 0.00741 **
p_chg_wh 518.399 83.867 6.181 4.39e-09 ***
change_ren 28.600 3.434 8.329 2.35e-14 ***
p_chg_pvt -551.159 136.592 -4.035 8.16e-05 ***
p_chg_rent -446.655 108.955 -4.099 6.34e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11930 on 174 degrees of freedom
Multiple R-squared: 0.7342, Adjusted R-squared: 0.7281
F-statistic: 120.2 on 4 and 174 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model.lm, which = 1:4)
bptest(model.lm, data=sub_data, studentize=FALSE)
Breusch-Pagan test
data: model.lm
BP = 24.251, df = 4, p-value = 7.114e-05
Breusch–Pagan test
bptest(model.lm, data=sub_data, studentize=FALSE)
Breusch-Pagan test
data: model.lm
BP = 24.251, df = 4, p-value = 7.114e-05
Varrience Infilation Factor (VIF)
data.frame(vif(model.lm))
vif.model.lm.
p_chg_wh 1.875582
change_ren 1.574237
p_chg_pvt 1.895434
p_chg_rent 1.115477
Geographically weighted regression (GWR) is an exploratory technique mainly intended to indicate where non-stationarity is taking place on the map, that is where locally weighted regression coefficients move away from their global values.
model.bnw<-gwr.sel(Formula, data = gd, gweight =gwr.Gauss, verbose = F)
model.bnw
[1] 3.119498
model.gwr<-gwr(Formula, data = gd, bandwidth = model.bnw, gweight =gwr.Gauss, hatmatrix = TRUE)
model.gwr
Call:
gwr(formula = Formula, data = gd, bandwidth = model.bnw, gweight = gwr.Gauss,
hatmatrix = TRUE)
Kernel function: gwr.Gauss
Fixed bandwidth: 3.119498
Summary of GWR coefficient estimates at data points:
Min. 1st Qu. Median 3rd Qu. Max.
X.Intercept. -6808.0532 -6138.5644 -4814.0667 -3235.7688 84.9717
p_chg_wh 379.8810 464.4163 515.2463 549.5800 691.5876
change_ren 8.3169 28.2562 29.3029 30.2283 35.5242
p_chg_pvt -930.1638 -791.5239 -589.2421 -409.3736 -291.8086
p_chg_rent -719.0812 -567.8842 -473.2876 -345.5224 -137.6871
Global
X.Intercept. -4530.11
p_chg_wh 518.40
change_ren 28.60
p_chg_pvt -551.16
p_chg_rent -446.65
Number of data points: 179
Effective number of parameters (residual: 2traceS - traceS'S): 23.13577
Effective degrees of freedom (residual: 2traceS - traceS'S): 155.8642
Sigma (residual: 2traceS - traceS'S): 11652.51
Effective number of parameters (model: traceS): 17.01369
Effective degrees of freedom (model: traceS): 161.9863
Sigma (model: traceS): 11430.2
Sigma (ML): 10873.42
AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3875.569
AIC (GWR p. 96, eq. 4.22): 3852.273
Residual sum of squares: 21163414626
Quasi-global R2: 0.7727706
proj4string(gd)<-CRS("+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0") # Create projection just in case .pjr file doesn't come alone with the shp file.
gd_NAD<-spTransform(gd, CRS("+init=ESRI:102685")) # Reproject to ESRI:102685 (NAD 1983 Stateplane Maryland FIPS 1900)
queen.cont<-poly2nb(gd)
coords<-coordinates(gd)
plot(gd)
plot(queen.cont, coords, add =TRUE, col ="red")
queen.con.wmx<-nb2listw(queen.cont, style ="B") #B stands for binary matrix in which neighbors are assigned in 1 & 0
queen.con.wmx
Characteristics of weights list object:
Neighbour list object:
Number of regions: 179
Number of nonzero links: 1070
Percentage nonzero weights: 3.339471
Average number of links: 5.977654
Weights style: B
Weights constants summary:
n nn S0 S1 S2
B 179 32041 1070 2140 28504
moran.test(gd_NAD$change_inc, listw=queen.con.wmx)
Moran I test under randomisation
data: gd_NAD$change_inc
weights: queen.con.wmx
Moran I statistic standard deviate = 10.395, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.433376410 -0.005617978 0.001783346
locm<-localmoran(gd_NAD$change_inc, queen.con.wmx)
gd$scInc<-scale(gd$change_inc) # Scale
gd$lagged_scInc<-lag.listw(queen.con.wmx, gd$scInc)
plot(x = gd$scInc, y = gd$lagged_scInc, main ="Moran Scatterplot of Income")
abline(h =, v =0)
abline(lm(gd$lagged_scInc ~ gd$scInc), lty =3, lwd = 4, col = "red")
gd$quad_sig <- NA
gd@data[(gd$scInc >= 0 & gd$lagged_scInc >= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 1
gd@data[(gd$scInc <= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 2
gd@data[(gd$scInc >= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 3
gd@data[(gd$scInc >= 0 & gd$lagged_scInc <= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 4
gd@data[(gd$scInc <= 0 & gd$lagged_scInc >= 0) & (locm[, 5] <= 0.05), "quad_sig"] <- 5
breaks <- seq(1, 5, 1) #Set the breaks for the thematic map classes
labels <- c("high-High", "low-Low", "High-Low", "Low-High", "Not Signif.") #thematic map classes
np <- findInterval(gd$quad_sig, breaks) #?findInterval
# Assign colors to each map class
colors <- c("red", "blue", "lightpink", "skyblue2", "white")
plot(gd, col = colors[np]) #colors[np] manually sets the color for each county
mtext("Local Moran's I", cex = 1.5, side = 3, line = 1)
legend("bottomleft", legend = labels, fill = colors, bty = "n")
Dignostics for Spatial Dependence
queen.con.wmx.row <-nb2listw(queen.cont) # Spatial Weights matrix row standardized
lm.result<-lm.LMtests(model.lm, queen.con.wmx, zero.policy=NULL, test="all", spChk=NULL, naSubset=TRUE)
t.lm.result<-t(sapply(lm.result, function(x) c(x$statistic, x$parameter, x$p.value)))
colnames(t.lm.result) <-c("Statistic", "df", "p-value")
printCoefmat((t.lm.result))
Statistic df p-value
LMerr 6.51749 1.00000 0.0107
LMlag 16.95051 1.00000 0.0000
RLMerr 0.55979 1.00000 0.4543
RLMlag 10.99281 1.00000 0.0009
SARMA 17.51030 2.00000 0.0002
All except Robust Lagrange Multiplier in the error term (RMLerr) are statistically signficant. Hence, based on RMlag result, the spatial dependence is in the Spatial lag.
The fitting implementation fits a spatial lag model:
Yn = Xnβ + λWnYn + Un, |λ| < 1 (1) Un = ρMnUn + εn, |ρ| < 1 (2)
where: Yn is the n ×1 vector of observations on the dependent variable, Xn is the n × k matrix of observations on k exogenous variables, Wn and Mn are n × n spatial weighting matrices of known constants, β is the k × 1 vector of regression parameters, λ and ρ are scalar autoregressive parameters, Un is the n × 1 vector of regression disturbances, and εn is an n × 1 vecto of innovations Wnyn and Mnun are spatial lags of Yn and Un, respectively.
The models uses spatially lagged X variables as instruments for the spatially lagged dependent variable. In the above equation, ρ (Rho) is a scalar parameter that indicates the effect of the dependent variable in the neighbors on the dependent variable in the focal area.
gd_v2scale<-data.frame(scale(gd_v2))
model.s2sls<-stsls(Formula, data =gd, queen.con.wmx, robust = TRUE) #estimates adjusted for Hetroscadesticity
summary(model.s2sls)
Call:
stsls(formula = Formula, data = gd, listw = queen.con.wmx, robust = TRUE)
Residuals:
Min 1Q Median 3Q Max
-32467.9 -6059.2 -1322.8 6433.1 33332.2
Coefficients:
Estimate HC0 std. Error z value Pr(>|z|)
Rho 3.5127e-02 1.0547e-02 3.3307 0.0008663
(Intercept) -6.3013e+03 1.4917e+03 -4.2243 2.397e-05
p_chg_wh 3.8762e+02 9.5397e+01 4.0632 4.841e-05
change_ren 2.7516e+01 3.5610e+00 7.7270 1.110e-14
p_chg_pvt -5.4613e+02 1.2041e+02 -4.5357 5.742e-06
p_chg_rent -3.9241e+02 1.0298e+02 -3.8107 0.0001386
Residual variance (sigma squared): 131330000, (sigma: 11460)
RMSE<-model.s2sls$sse/179
RMSE
[1] 126929451
gd$Residuals<-model.s2sls$residuals
res.map<-qtm(gd, fill="Residuals", fill.style ="quantile", fill.alpha = 0.9, fill.palette ="RdBu", borders ="#f1f1f1")+ tm_legend(legend.position =c("left", "bottom"))
lf.res<-tmap_leaflet(res.map)
lf.res
Predicted.val<- (gd$change_inc - gd$Residuals)
gd$Predicted.values<-Predicted.val
pred.map<-qtm(gd, fill="Predicted.values", fill.style = "quantile", fill.palette ="RdBu", borders ="#f1f1f1")
lf.pr<-tmap_leaflet(pred.map)
lf.pr