This is an introductory and exploratory image analysis tutorial. The topics covered are:
1. Reading satellite images
1.1 Accessing meta data
1.2 Visualizing bands
2. Descriptive Statistics
2.1 Univarite Metrics
2.2 Histogram and Normality Test
2.3 Measures of Dispersion
In order to better understand the satellite image we will be going to work with, the first basic step would be conducting exploratory data analysis. There are a wide variety of python libraries that can help us do this but for our purpose we will focus on GDAL, Rasterio, Matplotlib, Numpy and Sk-Learn. You can follow along with your own data or download the complete dataset from here.
Let's go head and load the libraries.
# For reading files
from osgeo import gdal
import rasterio
# For visualization
%matplotlib inline
import pylab
import matplotlib.pyplot as plt
from rasterio import plot
from rasterio.plot import show
# For Summary and Analysis
import numpy as np
import pandas as pd
from scipy import stats
GDAL (Geospatial Data Abstract Library) is one of the most powerful and extensively used python libraries. In addition to supporting a wide variety of data types and functionalities, GDAL offers manipulation from a command-line. Rasterio is another open source python library used for reading and writing image. It's built up on GDAL and attempts to simplify processing and accessibility for cloud computation.
The satellite image we will use is a Landsat 4 Themathic Mapper data product covering the Metropolitan Washington DC area. Let's go ahead and read the image using GDAL.
image_data = gdal.Open(r"data/2001_dc_sub.tif")
And, to read the image in Rasterio:
image_data_2 = rasterio.open(r"data/2001_dc_sub.tif")
Now that the image is opened, we will move on to assessing some of its attributes.
Number of Bands
total_number_of_bands = image_data.RasterCount
print total_number_of_bands
Rows and Columns
rows = image_data.RasterYSize
columns = image_data.RasterXSize
print "rows: %s pixels" %columns
print "columns: %s pixels" %rows
Type of Projection
projection = image_data.GetProjection()
print projection
GeoTransformation
geotransform = image_data.GetGeoTransform()
print geotransform
Meta Data
img_meta_data = image_data.GetMetadata()
print img_meta_data
The basic assessment entails that the six bands satellite image has a 30 meters resolution and the projection system WGS 84 UTM Zone 18N. In order to visualize the image by creating different composites, let's go ahead and convert the image to an array and subset each band.
band_1 = image_data.GetRasterBand(1).ReadAsArray()
band_2 = image_data.GetRasterBand(2).ReadAsArray()
band_3 = image_data.GetRasterBand(3).ReadAsArray()
band_4 = image_data.GetRasterBand(4).ReadAsArray()
band_5 = image_data.GetRasterBand(5).ReadAsArray()
band_6 = image_data.GetRasterBand(6).ReadAsArray()
Let's take a look at the digital numbers or pixel values the image composed of.
print band_1
Each of these pixel values represents surface reflectance captured by a satellite in 30 by 30 meters land surface. In geospatial analysis, the focus is therefore, to analyze the relationships of these values to each other with-in and among bands.
Creating different composites by arranging bands in different order often help us better visualize and examine patterns. Next, we will create four of the most commonly used composites: The True, False and False Natural Color I and II.
true_color = np.dstack([band_3, band_2, band_1])
color_infrared = np.dstack([band_4, band_3, band_2])
false_natural_color = np.dstack([band_5, band_4, band_3])
false_color_II = np.dstack([band_6, band_5, band_3])
To visualize these composites, we will use the the matplotlib library.
# Set figure size
fig = plt.subplots(2,2, figsize=(15,20))
# Position, title & plot of each band
plt.subplot(221), plt.imshow(true_color),plt.title("Natural Color")
plt.subplot(222), plt.imshow(color_infrared), plt.title("Color Infrared")
plt.subplot(223), plt.imshow(false_natural_color), plt.title("False Color I")
plt.subplot(224), plt.imshow(false_color_II), plt.title(" False Color II")
plt.show()
The True or Natural Color shows the reflectance captured in the visible spectrum while the False and the False Natural composites highlights reflectance captured in the Near Infrared (NIR) and Short Wave Infrared Red (SWIR) bands respectively.
We can also visualize individual bands and to do so we will use Rasterio's show function. The plots below are zoomed-in to the District of Columbia.
# Set figure size
fig, ((ax1, ax2), (ax3,ax4)) = plt.subplots(2,2, figsize=(15,20))
# Plot each band, zoom-in, set location color and title
show(band_1[1150:1575, 850:1200], ax=ax1, cmap='Blues', title='Blue Band')
show(band_2[1150:1575, 850:1200], ax=ax2, cmap='Greens', title='Green Band')
show(band_3[1150:1575, 850:1200], ax=ax3, cmap='Reds', title='Red Band')
show(band_4[1150:1575, 850:1200], ax=ax4, cmap='brg', title='NIR Band')
# Remove grid from plots
(ax1).grid(b=False)
(ax2).grid(b=False)
(ax3).grid(b=False)
# show all
plt.show()
Statistical assessment of satellite images involves examining pixel values to detect relationships, anomalies and unique patterns. The univariate descriptive metrics we will focus on below include measures of central tendancy and dispersion. We will begin by converting the entire image to an array.
# Read image as Array
image_data_arr = image_data.ReadAsArray()
Next, we'll create empty lists that will contain the metrics we are going to calculate.
stat_min = [] # Minimum
stat_max = [] # Maximum
stat_ave = [] # Mean
stat_med = [] # Median
stat_P25 = [] # 25 percent quantile
stat_P75 = [] # 75 percent quantile
stat_var = [] # Variance
stat_std = [] # Standard Deviation
# put all in one list
stat_all = [stat_min, stat_max, stat_ave, stat_P25, stat_P75, stat_med, stat_var, stat_std]
Now, calculate the values for each metric and append output on the empty list created above.
# Calcualate and append value to lists
for i in image_data_arr:
mn =i.min() # minimum
stat_min.append(mn)
mx = i.max() #maximum
stat_max.append(mx)
av = i.mean() #Average
stat_ave.append(av)
p25 = np.percentile(i, 25) #1st quantile
stat_P25.append(p25)
md = np.median(i)# median
stat_med.append(md)
p75 = np.percentile(i, 75) # 3rd quantile
stat_P75.append(p75)
va = i.var() # variance
stat_var.append(va)
st = i.std() # standard deviation
stat_std.append(st)
# Rename Column and Row headers
stat_summary = pd.DataFrame(data =stat_all,
index = ['Minimum','Maximum','Mean','25th Quantile','Median',
'75th Quantile','Variance','Standard Deviation'],
columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])
stat_summary.round(3)
A Histogram represents the distribution of numerical data and plotting the frequency of digital numbers allow us to understand the over all pattern of pixels. To plot the histogram, we will use the show_hist function from Rasterio.
fig, ax = plt.subplots(1, figsize=(9, 7))
ax.grid(linestyle='None')
plot.show_hist(image_data_2, bins=50, lw=2.0,
stacked=True, alpha=0.4,
histtype='stepfilled',
normed=True, ax=ax)
plt.show()
The histogram above shows that some of the bands are skewed to the left while others exhibit a sharp peak. We can further assess these patterns by calculating for Kurtosis and Skewness. Also, the Shapiro Wilk Test can help us examin the normality of their distribution.
print "Skewness = %.3f" %np.mean(stats.skew(image_data_arr))
print "Kurtosis = %.3f" %np.mean(stats.kurtosis(image_data_arr))
print "Shapiro test = %.3f, P-value = %.3f" %stats.shapiro(image_data_arr)
The pixel values are slightly skewed to the left with negative excess kurtosis indicating the thinner tails.Based on Shapiro Test, We can reject the null hypothesis that the bands/pixel values are normally distributed. </br>
However, The Shapiro Test is often not a reliable instrument for large dataset and to better support our assessment, we will investigate the distribution of pixels using The Normal Probability Plot. The plot is a graphical technique for assessing whether or not a data set is approximately normally distributed.
fig, ax = plt.subplots(1, figsize=(9, 7))
image_data_arr = image_data.ReadAsArray()
image_data_arr32 = image_data_arr.astype(np.float32)
image_data_flatten= image_data_arr32.flatten()
stats.probplot(image_data_flatten, dist="norm", plot=pylab)
pylab.show()
The probability plot indicates that when the pixel values approach the minimum and maximum values, their pattern becomes non-normal. This is a very critical information in that if we are interested in, for instance, supervised image classification or conducting change detection, we now know that relying on a Maximum Likelihood Classifier will most likely lead us to generate biased estimates.
In addition of univarite metrics, measures of dispersion about the mean of the distribution of a data provide valuable information. For our purpose, we will focus on the instruments Covariance and Correlation to help us understand the relationshp that exist among bands.
Covariance or Cross-Variance is measure of the directional relationship of two or more variables (bands, in our case). It's calulated by averaging the square deviation of all possible observations. </br>
To evaluate the dispersion of pixel values through covariance and correlation matrices, we will begin by reshaping the dimension of the array from a three to a two dimensions array.
# Reshaping dimension
# get the format of the orignial data
n_samples, nx, ny = image_data_arr.shape
# reshape the three dimension to two
image_data_arr_2_dim= image_data_arr.reshape(n_samples, nx*ny)
Next, from the numpy python library, we'll call the cov function for covariance on the newly reshaped array values.
# Calcuate covariance
cov_mat = np.cov(image_data_arr_2_dim)
# Create dataframe and rename headers
covMat_Df = pd.DataFrame(data = cov_mat,
index = ['Band 1','Band 2','Band 3','Band 4','Band 6', 'Band 6'],
columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])
# Round values to 3
covMat_Df.round(3)
Correlation: is a measure of the statistical relationships involving dependence. In imagery analysis, correlation matrix allows us to examine how strongly pairs of bands are related. Since the dimension of the array has already been reshaped, we will simply call the corrcoef function (correlation coefficent) from the numpy library and pass the array.
# Calcualte correlation
d_corr = np.corrcoef(image_data_arr_2_dim)
# Create data frame and rename headers
corMat_Df = pd.DataFrame(data = d_corr,
index = ['Band 1','Band 2','Band 3','Band 4','Band 6', 'Band 6'],
columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])
corMat_Df.round(3)
The correlation matrix indicates that band 1, 2, 3 are highly correlated. On the other hand, Band 4 turns out to be the least correlated bands. Hence, we can reduce the file size of the image by simply focusing on the least correlated bands there by reducing the time and computational power it takes to conduct further image analysis.
On this part, we covererd basic techniques of reading, visualizing and examining a satellite image. We visualized the image by creating different composites as well as by plotted individual bands. We also attempted to better understand the image using univariate metrics.
Next, we will cover image enhancement and component reduction techniques.