Machine Learning for Image Classification

Part One

Satellite image classification is the process of extracting information by grouping pixels with relatively similar values (digital numbers) into major classes. Doing so helps us quantify, assess, and better understand land cover types at scale.

In this section, the topics covered are:

1. Unsupervised Image Classification
        with K-Means Clustering
2. Supervised Image Classification
        with Random Forest

Let's go ahead and load the libraries we will be working with.

In [1]:
# For data management
import os
import numpy as np
from osgeo import gdal

# For analysis 
from sklearn import metrics
from scipy.cluster.vq import *
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestClassifier

# For visualization
from matplotlib import pyplot as plt
%matplotlib inline

Unsupervised Image Classification

Unsupervised image classification is a technique for grouping spectral values into classes without the need to guide or dictate the learning algorithm. Pixel values are assigned to a class by the algorithm according to a given criterion.

Although there are several unsupervised classification methods, for our case we will use K-Means clustering and attempt to classify the image into three groups. We will also classify the same image later using supervised classification so that we can compare the outputs from the two.

Let's go ahead and open the image (whose first component holds about 76% of the variance), read it as an array, and convert it to a float32 data type.

In [2]:
# Read image
image_data = gdal.Open(r"data/2001_dc_sub.tif")

# Convert the image to a float32 array
image_data_arr = image_data.ReadAsArray()
image_data_arr32 = image_data_arr.astype(np.float32)
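
As a quick optional check (not part of the original notebook), we can inspect the raster's size and band count using standard GDAL dataset properties; this is the layout to keep in mind when flattening and reshaping the data later on.

In [ ]:
# Optional sanity check: raster width, height, and number of bands
print(image_data.RasterXSize, image_data.RasterYSize, image_data.RasterCount)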

K-Means

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Each cluster centroid is therefore a collection of feature values that defines the resulting group.
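
Formally, K-means seeks cluster assignments $S_1, \dots, S_k$ with means $\mu_1, \dots, \mu_k$ that minimize the within-cluster sum of squared distances:

$$\underset{S}{\operatorname{arg\,min}} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

In our case each observation is a single pixel value, so the distance to a centroid is simply the absolute difference from that cluster's mean.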

In this case, pixels will be grouped into one of three clusters based on their proximity to each cluster's mean. Before applying K-means, the first thing we need to do is flatten the array so the entire dataset sits in a one-dimensional feature space.

In [40]:
image_data_flatten = image_data_arr32.flatten()
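
As another optional sanity check (an addition, not in the original), the flattened vector should hold bands × rows × cols values; this is the length we will undo when reshaping the clustered result for display.

In [ ]:
# Optional: confirm the original and flattened shapes
print(image_data_arr32.shape)      # (bands, rows, cols)
print(image_data_flatten.shape)    # (bands * rows * cols,)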

Next, we will calculate the three centroids for the flattened image using the kmeans function from the SciPy library.

In [41]:
# Calculate the 3 centroids; kmeans returns a (codebook, distortion) tuple
centroids = kmeans(image_data_flatten, 3)

The three means (centroids) generated, along with the mean distortion, are:

In [29]:
print(centroids)
(array([0.92940545, 0.5272824 , 0.00930272], dtype=float32), 0.058260694)

Next, we will quantize the vectors using vq (vector quantization) and assign each pixel to one of the three classes.

In [ ]:
# Assign each pixel to its nearest centroid; centroids[0] is the codebook,
# and vq also returns the distances, which we discard
k_classes, _ = vq(image_data_flatten, centroids[0])

Now that the three classes have been generated, the next step is to visualize the classified image. To do so, however, we first need to return the pixels to their original dimensions. Let's go through it step by step and plot the classes.

In [375]:
# Image dimensions
img_rows = 2171
img_cols = 1971
img_bands = 6
img_shape = img_rows*img_cols

# Reshape the 1-D clustered vector to (bands, rows*cols)
k_classes_2d = k_classes.reshape(img_bands, img_shape)

# Keep the last band's class assignments
k_classes_columns = k_classes_2d[-1:, :]

# Reshape back to (rows, cols) for display
k_classes_final = k_classes_columns.reshape((img_rows, img_cols))

Now, we can plot both the classified and original image side by side.

In [376]:
# Select the false color bands from the float32 array
b_4 = image_data_arr32[3]
b_3 = image_data_arr32[2]
b_2 = image_data_arr32[1]
# Stack them
false_color_img = np.dstack([b_4, b_3, b_2])

fig = plt.subplots(1,2, figsize=(15,12))
plt.subplot(121),plt.imshow(false_color_img),plt.title('Original Image')
plt.subplot(122), plt.imshow(k_classes_final, cmap ="viridis"), plt.title('Clustered Image')
plt.show()

The three-cluster K-Means classification identified vegetation cover and built-up areas. However, it did a very poor job of detecting water bodies. The built-up areas were also split into two groups, with bright pixels separated from darker ones. Let's see what happens when we classify the same image using Random Forest.

Supervised Image Classification

Supervised image classification is the most widely used technique for grouping spectral values into classes in remote sensing. It differs from the unsupervised technique in how the learning algorithm learns from the data: supervised learning is guided (trained) to create a link between the "observed" data and the "target" data. The targets, also known as "labels", are created by a user and fed into the learning method to guide how it learns from the training data.

While there are several well-known classification methods, this lesson leverages a decision-tree-based ensemble learner known as Random Forest. The algorithm evaluates pixel values against the classes (targets) provided by constructing many decision trees and returning the mode of their predicted classes. The script below is modified from a blog post by <a href="#">machinials.com</a>.
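
Before applying this to the image, here is a minimal, hypothetical sketch of the fit/predict pattern we will follow with scikit-learn's RandomForestClassifier; the toy features and labels below are made up purely for illustration and are not part of the dataset.

In [ ]:
# Minimal sketch of the fit/predict pattern (toy data, purely illustrative)
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_toy = np.array([[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.9]])  # fake pixel features
y_toy = np.array([1, 2, 1, 2])                                          # fake class labels

toy_rf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_rf.fit(X_toy, y_toy)

# Each tree votes on a class; the forest returns the mode of the votes
print(toy_rf.predict([[0.12, 0.22]]))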

We will begin by setting up the directories for our data. We will use the same image we classified earlier. Unlike the unsupervised approach, this time we will also set the paths to the training and validation sample datasets.

In [6]:
raster_data_path = "data/rf_data/2001_dc_sub.tif"
train_data_path = "data/rf_data/train/"
validation_data_path = "data/rf_data/test/"

Let's go ahead and open the raster file, read its components (bands) as arrays, stack them together, and get the dimensions of the data.

In [7]:
raster_dataset = gdal.Open(raster_data_path, gdal.GA_ReadOnly)
geo_transform = raster_dataset.GetGeoTransform()
proj = raster_dataset.GetProjectionRef()

bands_data = []
for b in range(1, raster_dataset.RasterCount+1):
    band = raster_dataset.GetRasterBand(b)
    bands_data.append(band.ReadAsArray())

bands_data = np.dstack(bands_data)
rows, cols, n_bands = bands_data.shape

At this point, the training and validation data paths only contain point shapefiles for the three classes. We have sample data for built-up, vegetation, and water bodies, the same classes used in the unsupervised approach. The next step, therefore, is to rasterize the vector files so that we can create comparable datasets and data types.

In [8]:
# Define a function to create a mask
def create_mask_from_vector(vector_data_path, cols, rows, geo_transform,
                            projection, target_value=1):
    # Open the layer
    data_source = gdal.OpenEx(vector_data_path, gdal.OF_VECTOR)
    layer = data_source.GetLayer(0)
    driver = gdal.GetDriverByName('MEM')  # In memory dataset

    # Create a container and set original image information
    target_ds = driver.Create('', cols, rows, 1, gdal.GDT_UInt16)
    target_ds.SetGeoTransform(geo_transform)
    target_ds.SetProjection(projection)

    # Rasterize Layer
    gdal.RasterizeLayer(target_ds, [1], layer, burn_values=[target_value])
    return target_ds

# Define a function to convert vectors to a raster
def vectors_to_raster(file_paths, rows, cols, geo_transform, projection):

    # Create an empty array and accumulate the rasterized label values
    labeled_pixels = np.zeros((rows, cols))
    for i, path in enumerate(file_paths):
        label = i+1
        ds = create_mask_from_vector(path, cols, rows, geo_transform,
                                     projection, target_value=label)
        band = ds.GetRasterBand(1)
        labeled_pixels += band.ReadAsArray()
        ds = None
    return labeled_pixels

The next step is to gather the target data: retrieve the classes, the training labels, and the training samples.

In [9]:
# Get all the files ending with .shp
files = [f for f in os.listdir(train_data_path) if f.endswith('.shp')]

# Get the name of shapefiles and save it as classes
classes = [f.split('.')[0] for f in files]

# Join directory path with file names
shapefiles = [os.path.join(train_data_path, f)
              for f in files if f.endswith('.shp')]

# Rasterize vectors
labeled_pixels = vectors_to_raster(shapefiles, rows, cols, geo_transform, proj)

# Return non-zero values
is_train = np.nonzero(labeled_pixels)

# Get the label and sample data
training_labels = labeled_pixels[is_train]
training_samples = bands_data[is_train]

Before we run the classifier, let's take a look at our training data.

In [10]:
print "The three classes are: %s" %classes
print "Total number of training labels: %s" %training_labels.size
print "Total number of training sample size: %s" %training_samples.size
The three classes are: ['built_up', 'vegetation', 'water']
Total number of training labels: 29
Total number of training sample size: 174

The next step is to classify the image using the Random Forest classifier.

In [11]:
# Dispatch computation across all available CPU cores (n_jobs=-1)
classifier = RandomForestClassifier(n_jobs=-1)

# Fit the classifier
classifier.fit(training_samples, training_labels)
Out[11]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
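
The output above reflects older scikit-learn defaults, including only 10 trees. As an optional, hedged tweak (not part of the original run; n_estimators=100 and random_state=0 are assumed values), one could grow more trees and fix the random seed for more stable, reproducible results:

In [ ]:
# Optional: more trees and a fixed seed (slower to train, usually more stable)
classifier = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
classifier.fit(training_samples, training_labels)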

Great! We now have the model trained on the sample data. Let's go ahead and make predictions for the entire image.

In [12]:
# Total number of pixels in the image
n_samples = rows*cols

# Reshape the bands data to (n_pixels, n_bands)
flat_pixels = bands_data.reshape((n_samples, n_bands))

# Predict a class for every pixel
result = classifier.predict(flat_pixels)

# Reshape the output back to two dimensions (rows, cols)
classification = result.reshape((rows, cols))

We can now visualize the classified image and compare it to the original one.

In [13]:
fig, ax = plt.subplots(1,2, figsize =(18,15))
# Select False color bands
b4 = bands_data[:,:,3]
b3 = bands_data[:,:,2]
b2 = bands_data[:,:,1]

# Stack them
false_color = np.dstack([b4,b3,b2])

# Plot
plt.subplot(121), plt.imshow(false_color), plt.title("Original Image")
plt.subplot(122), plt.imshow(classification),plt.title("Classified Image")

plt.show()

The classified image clearly depicts the three classes. However, we still need to assess its accuracy using the test/validation dataset. Let's load the shapefiles for testing the model.

In [16]:
# Get all the files ending with .shp
test_files = [f for f in os.listdir(validation_data_path) if f.endswith('.shp')]

# Get the name of shapefiles and save it as classes
classes_2 = [f.split('.')[0] for f in test_files]

# Build the shapefile paths in the validation directory
test_shpfiles = [os.path.join(validation_data_path, "%s.shp" % c)
              for c in classes_2]

verification_pixels = vectors_to_raster(test_shpfiles, rows, cols,
                                        geo_transform, proj)
# Get non-zero pixels
for_verification = np.nonzero(verification_pixels)

# Get the labels
verification_labels = verification_pixels[for_verification]
predicted_labels = classification[for_verification]

In [17]:
test_shpfiles
Out[17]:
['data/rf_data/test/built_up.shp',
 'data/rf_data/test/vegetation.shp',
 'data/rf_data/test/water.shp']

Next, print the accuracy report.

In [18]:
target_names = ['Class %s' % s for s in classes_2]
print("Classification report:\n%s" %
      metrics.classification_report(verification_labels, predicted_labels,
                                    target_names=target_names))
print("Classification accuracy: %f" %
      metrics.accuracy_score(verification_labels, predicted_labels))
Classification report:
                  precision    recall  f1-score   support

  Class built_up       1.00      0.80      0.89         5
Class vegetation       0.83      1.00      0.91         5
     Class water       1.00      1.00      1.00         5

     avg / total       0.94      0.93      0.93        15

Classification accuracy: 0.933333

The report indicates that we classified the image with about 93% accuracy. Compared to the image classified with the unsupervised K-Means clustering algorithm, the image classified with the supervised Random Forest classifier clearly identified all three land cover classes.
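
To see where the remaining errors occur, one could also print the confusion matrix; this is an optional addition whose output is not shown in the original notebook.

In [ ]:
# Optional: rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(verification_labels, predicted_labels))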

Summary

In this tutorial, we classified a satellite image using unsupervised K-Means clustering and a supervised classifier called Random Forest. Although the clustering algorithm grouped pixels with similar values based on their proximity to the nearest mean, it failed to properly separate the land cover classes. As expected, the supervised approach did a better job of classifying the image.