Machine Learning for Image Classification

Part Two

On this chapter, we will classify two satellite images covering the Washington DC Metropolitan Area and examine the land cover change that took place in the years between 2001 and 2017. The topics covered are:

1. Image Classificaiton
    - Normality Test
    - Stratified Random Sampling
    - Random Forest Classifier
2. Post image classificaiton Change Detection

In part four, we classified one satellite image into three classes and assessed the accuracy of the classified image. This time, we will look into two time period and classify two images using sample points that can be identified in both.

In [1]:
# for file management
import os
import logging
from docopt import docopt

# For analysis
import numpy as np
from osgeo import gdal
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
from itertools import product
from sklearn.metrics import cohen_kappa_score
# For visualization
% matplotlib inline
from matplotlib import pyplot as plt

We will use two landsat images for the years 2001 and 2017 that were collected Landsat-5 and Landsat-8 missions, respectively. We will start by setting the path to the directories.

In [2]:
raster_data_path01 = "data/rf_data/DC_final/dc_ls5_2001_c.tif"
raster_data_path17 = "data/rf_data/DC_final/dc_ls8_2017_c.tif"
train_data_path = "data/rf_data/train/"
test_data_path = "data/rf_data/test/"

Next, we'll read the images into Python environment. First, let's load The 2001 Landsat 5 image

In [3]:
# Read raster
raster_dataset01 = gdal.Open(raster_data_path01, gdal.GA_ReadOnly)
# Get Projection system 
geo_transform = raster_dataset01.GetGeoTransform()
proj = raster_dataset01.GetProjectionRef()

# Read bands as arrays
bands_data01 = []
for b in range(1, raster_dataset01.RasterCount+1):
    band = raster_dataset01.GetRasterBand(b)
    bands_data01.append(band.ReadAsArray())

#Stack arrays together 
bands_data01 = np.dstack(bands_data01)

# Change NA values to zero
bands_data01[np.isnan(bands_data01)] = 0

# Get the shape of the image 
rows, cols, n_bands = bands_data01.shape

The 2017 Landsat image:

In [4]:
# Read raster
raster_dataset17 = gdal.Open(raster_data_path17, gdal.GA_ReadOnly)
# Get Projection system 
geo_transform = raster_dataset17.GetGeoTransform()
proj = raster_dataset17.GetProjectionRef()

# Read bands as arrays
bands_data17 = []
for b in range(1, raster_dataset17.RasterCount+1):
    band = raster_dataset17.GetRasterBand(b)
    bands_data17.append(band.ReadAsArray())

#Stack arrays together 
bands_data17 = np.dstack(bands_data17)

# Change NA values to zero
bands_data17[np.isnan(bands_data17)] = 0

# Get the shape of the image 
rows, cols, n_bands = bands_data17.shape

Let's go ahead and visualize the images.

In [5]:
fig, ax = plt.subplots(1,2, figsize =(18,15))
# Select False color bands
b4_01 = bands_data01[:,:,3]
b3_01 = bands_data01[:,:,2]
b2_01 = bands_data01[:,:,1]

# Stack them
false_color01 = np.dstack([b4_01,b3_01*2,b2_01*2])

# Select False color bands
b4_17 = bands_data17[:,:,3]
b3_17 = bands_data17[:,:,2]
b2_17 = bands_data17[:,:,1]

# Stack them
false_color17 = np.dstack([b4_17,b3_17*2,b2_17*2])

#Plot
#Plot
plt.subplot(121), plt.imshow(false_color01*2), plt.title("2001")
plt.subplot(122), plt.imshow(false_color17*2), plt.title("2017")

plt.show()

Normality Test

Assessing the the distribution of pixels on an image can help us refine our methodology and by generateing a probability plot of sample data against the quantiles the normal distribution of the image, we can evaluate if the assumption that pixels are normally distributed holds can hold.

In [6]:
# Read image as array
img_arr01 = raster_dataset01.ReadAsArray()
img_arr17 = raster_dataset17.ReadAsArray()

# change the array type to float 32
img_arr01_f = img_arr01.astype(np.float32)
img_arr17_f = img_arr17.astype(np.float32)

# Flatten the array to a one dimension array
img01_flatten= img_arr01_f.flatten()
img17_flatten= img_arr17_f.flatten()

The raster data converted into six stacked arrays was flattened into an array of one dimension. The dimension of the two images are:

In [7]:
print "1XD Array of the 2001 image = %s" %img01_flatten.shape
print "1XD Array of the 2017 image = %s" %img17_flatten.shape
1XD Array of the 2001 image = 37530006
1XD Array of the 2017 image = 37530006
In [8]:
fig = plt.figure(figsize=(15, 6))
# Calculate and show probability data on the flatten array
ax1 = fig.add_subplot(121)
stats.probplot(img01_flatten, dist="norm", plot =ax1)
ax1.set_title("Probability Plot for the 2001 image")
ax2 = fig.add_subplot(122)
stats.probplot(img17_flatten, dist="norm", plot =ax2)
ax2.set_title("Probability Plot for the 2017 image")
plt.show()

The probability plots of both images indicates a distribution that deviates from the normal distribution. In this case, we will apply a non-parametric classifier and attempt to assess an unbiased estimate.

Training and Testing Sample Data

To collect training and testing sample data, for each of the three major landcover classes, polygons(strata) were created in QGIS environment from which points were generated randomly. The following chart shows how th strata were created and identified as a class</br>

            id         class_name            class_id
             1         builtup_dark                 1
             2         builtup_md_dark              1
             3         builtup_bright               1
             -----------------------------------------
             1         vegetation_dense             2
             2         vegetation_park              2
             3         vegetation_sparse            2
             -----------------------------------------
             1         water_body_river             3
             2         water_body_lake              3
             3         water_bodY_wetland           3

The id simply identifies each row and the class name describes each of our sample data. The class_id is a unique identifier we will request the classifier to recognize and classify our images as. Hence, Builtup areas will be assigned 1 while vegetation cover and water bodies will be assigned as 2 and 3 respectively.

We'll first create two functions will allow us read the shapefiles and extract the consistant values from the images.

In [9]:
# Create mask
def create_mask_from_vector(vector_data_path, cols, rows, geo_transform,
                            projection, target_value=1):
    data_source = gdal.OpenEx(vector_data_path, gdal.OF_VECTOR)

    # Read each layer
    layer = data_source.GetLayer(0)

    # Get the data type
    driver = gdal.GetDriverByName('MEM')  # In memory dataset
    target_ds = driver.Create('', cols, rows, 1, gdal.GDT_UInt16)

    # Set the projection the same as the images
    target_ds.SetGeoTransform(geo_transform)
    target_ds.SetProjection(projection)

    # Rasterize mask
    gdal.RasterizeLayer(target_ds, [1], layer, burn_values=[target_value])
    return target_ds


# Rasterize vector
def vectors_to_raster(file_paths, rows, cols, geo_transform, projection):
    """Rasterize the vectors in the given directory in a single image."""
    labeled_pixels = np.zeros((rows, cols))

    # Create arrays of sample data
    for i, path in enumerate(file_paths):
        label = i+1
        ds = create_mask_from_vector(path, cols, rows, geo_transform,
                                     projection, target_value=label)
        band = ds.GetRasterBand(1)
        labeled_pixels += band.ReadAsArray()
        ds = None
    return labeled_pixels

Training a model

In order to train a model, we will need to load the shapefiles containing the randomly generated points and convert them to raster masks. We will use these masks to extract corresponding pixel values from both images.

In [10]:
# Get all the files ending with .shp
train_files = [f for f in os.listdir(train_data_path) if f.endswith('.shp')]

# Get the name of shapefiles and save it as classes
training_classes = [f.split('.')[0] for f in train_files]

#Join directory path with file names
train_shpfile = [os.path.join(train_data_path, f)
              for f in train_files if f.endswith('.shp')]

# Rasterize vectors
labeled_pixels = vectors_to_raster(train_shpfile, rows, cols, geo_transform, proj)

# Return non-zero values
is_train = np.nonzero(labeled_pixels)

#Get label and sample data
training_labels = labeled_pixels[is_train]
training_samples01 = bands_data01[is_train]
training_samples17 = bands_data17[is_train]
In [11]:
label_vals=[]
sample_vals=[]
for i in np.unique(training_labels):
    label_tot = sum(training_labels==i)
    sample_tot = (sum(training_labels==i))*6
    label_vals.append(label_tot)
    sample_vals.append(sample_tot)

stats_all= [sample_vals, label_vals]


stat_summary = pd.DataFrame(data =stats_all,
            index = ['Training samples','Label/target samples'],
            columns = ['Builtup','Vegetation','Water'])


print "Total number of Training samples: %s points" %training_labels.size
print "Total number of Label samples %s" %training_samples01.size
print " Across the three classes: "
print ""
print stat_summary
Total number of Training samples: 665 points
Total number of Label samples 3990
 Across the three classes:

                      Builtup  Vegetation  Water
Training samples         2346        1170    474
Label/target samples      391         195     79

After having the sample training (X_train) and label (Y_train) data ready, the next step is to run the classifier. Due to its ability to prevent overfitting, we will use the machine learning algorithm Random Forests.

Random Forest Classifier

Random Forests also known as Random Decision Forests is an ensemble learning method that operates by constructing a several decision trees to generate the final mode of the classes. Ensembles are a group of “weak learners” building up on each other to form a “strong learner”. </br>

Some of the benefits of using RF are:

- It Prevents Overfitting
- Performs well with multiple classes
- Efficency with large datasets

That being said, lets go ahead train the classifier for both 2001 and 2017.

Here is for the 2001 image.

In [12]:
#Train the classifier the default parameters
classifier01 = RandomForestClassifier()

# Fit the classifier
classifier_01 = classifier01.fit(training_samples01, training_labels)
classifier_01
Out[12]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Variable Importance

One advantage of using Random Forest is the insight into variable importance. In this case, we can evaluate which of the six bands are more important to the classifier.

In [13]:
# For the 2001 image classifier
bands = [1, 2, 3, 4, 5, 6]
for b, imp in zip(bands, classifier_01.feature_importances_):
    print('Band {b} importance: {imp}'.format(b=b, imp=imp))
Band 1 importance: 0.133329124791
Band 2 importance: 0.15360598606
Band 3 importance: 0.171452576554
Band 4 importance: 0.217794757495
Band 5 importance: 0.20975326421
Band 6 importance: 0.11406429089

We can see that the surface reflectance captured in the Short Wave Infrared Red band (Band 5) alone contributed 25 percent.

Next, we'll train the model for 2017 image.

In [14]:
#Train the classifier with the default parameters
classifier17 =RandomForestClassifier()

# Fit the classifier
classifier_17 = classifier17.fit(training_samples17, training_labels)
classifier_17
Out[14]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

The variable importance of our second classifier

In [15]:
# For the 2017 image classifier
for b, imp in zip(bands, classifier_17.feature_importances_):
    print('Band {b} importance: {imp}'.format(b=b, imp=imp))
Band 1 importance: 0.259419822009
Band 2 importance: 0.108911016862
Band 3 importance: 0.138152269744
Band 4 importance: 0.147705026262
Band 5 importance: 0.217347444715
Band 6 importance: 0.128464420408

The values in the 4th and 5th bands alone contributed close to 50 percent of the variance for the model to detect classes. Next, we will change the dimension of the images and conduct prediction. This where things get exciting. After having trained the model using a sample of 655 pixels, we will now ask the model to predict for the entire image which is compsed of 6.25 million pixels.

In [16]:
# Grab the dimension
n_samples = rows*cols

# Reshape images dimension
flat_pixels01 = bands_data01.reshape((n_samples, n_bands))
flat_pixels17 = bands_data17.reshape((n_samples, n_bands))

# Predict
result_01 = classifier_01.predict(flat_pixels01)
result_17 = classifier_17.predict(flat_pixels17)


# Get the rows and columns
classification_2001 = result_01.reshape((rows, cols))
classification_2017 = result_17.reshape((rows, cols))

Let's visualize the classified images.

In [17]:
fig, ax = plt.subplots(1,2, figsize =(20,16))

plt.subplot(121), plt.imshow(classification_2001, cmap ='YlGnBu'),plt.title("2001")
plt.subplot(122), plt.imshow(classification_2017, cmap ='YlGnBu'),plt.title("2017")
plt.show()

In visualizing the classified images, we can see that the metropolitan seem to have gone through little to no change. Before moving on to detecting the change, let's first assess the accuracy of the classifications. In the same we extracted training samples, we will go a head get the validation samples as well.

Accuracy Assessment

After training and builiding the models for the two image, our next task will be evaluating the accuracy of the prediction. To do so, we've prepared a sample data for testing the model in a similar way the training sample data was prepared.

In [18]:
test_files  = [f for f in os.listdir(test_data_path) if f.endswith('.shp')]

# Get the name of shapefiles and save it as classes
testing_classes = [f.split('.')[0] for f in test_files]


# Get the test shapefile data
test_shpfile = [os.path.join(test_data_path, "%s.shp" % c)
              for c in testing_classes]

# Rasterize shapefiles
test_pixels = vectors_to_raster(test_shpfile, rows, cols,
                                        geo_transform, proj)

#Create non-zero images
test_image = np.nonzero(test_pixels)


#Get test outputs and labels
testing_labels = test_pixels[test_image]
predicted_labels01 = classification_2001[test_image]
predicted_labels17 = classification_2017[test_image]

For each of the three classes, the sample pixels were created using stratified random sampling. A total of 335 pixels were collected with respect to their overall distribution and how they were represented in the models.

Accuracy assessment of the 2001 image

In [19]:
target_names = ['Class %s' % s for s in testing_classes]
print("Classification report:\n%s" %
      metrics.classification_report(testing_labels, predicted_labels01,
                                    target_names=target_names))
print("Overall accuracy: %f" %
      metrics.accuracy_score(testing_labels, predicted_labels01))

print ("Kappa Statistics: %f " %
       cohen_kappa_score(testing_labels, predicted_labels01))
Classification report:
                       precision    recall  f1-score   support

   Class Builtup_test       0.85      1.00      0.92       159
Class Vegetation_test       1.00      0.77      0.87       117
     Class Water_test       1.00      1.00      1.00        59

          avg / total       0.93      0.92      0.92       335

Overall accuracy: 0.919403
Kappa Statistics: 0.868226

Accuracy assessment of the 2017 image

In [20]:
target_names = ['Class %s' % s for s in testing_classes]
print("Classification report:\n%s" %
      metrics.classification_report(testing_labels, predicted_labels17,
                                    target_names=target_names))
print("Overall accuracy: %f" %
      metrics.accuracy_score(testing_labels, predicted_labels17))
print ("Kappa Statistics: %f " %
       cohen_kappa_score(testing_labels, predicted_labels17))
Classification report:
                       precision    recall  f1-score   support

   Class Builtup_test       0.95      1.00      0.97       159
Class Vegetation_test       1.00      0.92      0.96       117
     Class Water_test       1.00      1.00      1.00        59

          avg / total       0.97      0.97      0.97       335

Overall accuracy: 0.973134
Kappa Statistics: 0.956554

The overall accuracy of both images is greater than 90 percent although the classification for the 2001 image performed less, most likely, due to the noise from soil and urban tree canopy.

Post Image Classification Change Detection

Now that the raw images for 2001 and 2017 are classifed in to three major classes, we can start detecting the landcover changes in 16 years in between. The logic behind the post image classification change detection is depicted below.

(classification_2001==1 and classification_2017==1) = 0
(classification_2001==1 and classification_2017==2) = 2
(classification_2001==1 and classification_2017==3) = 3
(classification_2001==2 and classification_2017==1) = 1
(classification_2001==2 and classification_2017==2) = 0
(classification_2001==2 and classification_2017==3) = 3
(classification_2001==3 and classification_2017==1) = 1
(classification_2001==3 and classification_2017==2) = 2
(classification_2001==3 and classification_2017==3) = 0

In this case, after conducting a pixel to pixel comparison, for those pixels that went through no change a value of 0 will be assigned. The pixels for vegetation cover in 2001 that were changed to Builtup are will be re-coded as 1. Similarly, pixels in each class group will be assigned a newer value if change is detected.

In [21]:
# Change each image to an integer array type
arr1 = np.array(classification_2001, dtype = int)
arr2 = np.array(classification_2017, dtype = int)

# Iterate to check for change
lookup = np.array(((0,2,3),
                   (1,0,3),
                   (1,2,0)), dtype=int)

detection_layer = np.zeros_like(arr1)

for r, c in product(*[range(x) for x in arr1.shape]):
    a, b = arr1[r,c], arr2[r,c]
    detection_layer[r,c] = lookup[a-1,b-1]

Let's summerize values on the change detection layer.

In [22]:
change_unique = np.unique(detection_layer)
class_change = []

# Count Pixels
for group in change_unique:
    class_change.append(len(detection_layer[detection_layer == group]))

# Change Pixels to Sq.Km
class_size = []
for i in class_change:
    area = i *0.0009
    area_arr = np.array(area).round(2)
    class_size.append(area_arr)


print "%r pixels or %r Km.Sq area showed no change "  %(class_change[0], class_size[0])
print "%r pixels %r Km.Sq changed to Builtup area" %(class_change[1],  class_size[1])
print "%r pixels %r Km.Sq changed to Vegetation" %(class_change[2], class_size[2])
print "%r pixels %r Km.Sq changed to Water body" %(class_change[3], class_size[3])
4867952 pixels or 4381.16 Km.Sq area showed no change
745772 pixels 671.19 Km.Sq changed to Builtup area
632184 pixels 568.97 Km.Sq changed to Vegetation
9093 pixels 8.18 Km.Sq changed to Water body

Accordingly, in the years between 2001 and 2017, the metropolitan area went through a signficant change in builtup surface area. As discussed above, the signficant change to vegetation cover and water body most likely is the result of mis-classified pixels, particularly on the 2001 image.

Let's visualize the newly created Builtup up surface.

In [23]:
fig = plt.subplots( figsize =(8,8))
plt.imshow(detection_layer==1, cmap="Blues" )
plt.show()

The change detected in Builtup area indicates urban expansion in Virginia (Reston, Chantilly, Alexandria), in Maryland (Germantown, Waldroof, Bowe, Laurel) as well as in District of Columbia(Chavey Chase, Barnaby Woods).

Summary

On this chapter, we covered image classification with a goal in mind of detecting changes in landcover. We classified images for the years 2001 and 2017 using Random Forest and the result shows 92 and 97% overall accuracy, respectively. We then used these images to detect a change in landcover through which we were able to detect, particularly, the change in builtup area across several cities.