Exploratory Image Analysis¶

This is an introductory and exploratory image analysis tutorial. The topics covered are:

 1. Reading satellite images
      1.1 Accessing meta data
      1.2 Visualizing bands
 2. Descriptive Statistics
      2.1 Univarite Metrics
      2.2 Histogram and Normality Test
      2.3 Measures of Dispersion

Introduction¶

In order to better understand the satellite image we will be going to work with, the first basic step would be conducting exploratory data analysis. There are a wide variety of python libraries that can help us do this but for our purpose we will focus on GDAL, Rasterio, Matplotlib, Numpy and Sk-Learn. You can follow along with your own data or download the complete dataset from here.

Let's go head and load the libraries.

# For reading files
from osgeo import gdal
import rasterio

# For visualization
%matplotlib inline
import pylab
import matplotlib.pyplot as plt
from rasterio import plot
from rasterio.plot import show

# For Summary and Analysis
import numpy as np
import pandas as pd
from scipy import stats

Reading Images¶

GDAL (Geospatial Data Abstract Library) is one of the most powerful and extensively used python libraries. In addition to supporting a wide variety of data types and functionalities, GDAL offers manipulation from a command-line. Rasterio is another open source python library used for reading and writing image. It's built up on GDAL and attempts to simplify processing and accessibility for cloud computation.

The satellite image we will use is a Landsat 4 Themathic Mapper data product covering the Metropolitan Washington DC area. Let's go ahead and read the image using GDAL.

image_data = gdal.Open(r"data/2001_dc_sub.tif")

And, to read the image in Rasterio:

image_data_2 = rasterio.open(r"data/2001_dc_sub.tif")

Basic Attributes Data¶

Now that the image is opened, we will move on to assessing some of its attributes.

Number of Bands

total_number_of_bands = image_data.RasterCount
print total_number_of_bands

6

Rows and Columns

rows = image_data.RasterYSize
columns = image_data.RasterXSize
print "rows: %s pixels" %columns
print "columns: %s pixels"  %rows

rows: 1971 pixels
columns: 2171 pixels

Type of Projection

projection = image_data.GetProjection()
print projection

PROJCS["WGS 84 / UTM zone 18N",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Transverse_Mercator"],PARAMETER["latitude_of_origin",0],PARAMETER["central_meridian",-75],PARAMETER["scale_factor",0.9996],PARAMETER["false_easting",500000],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","32618"]]

GeoTransformation

geotransform = image_data.GetGeoTransform()
print geotransform

(294495.0, 30.0, 0.0, 4347615.0, 0.0, -30.0)

Meta Data

img_meta_data = image_data.GetMetadata()
print img_meta_data

{'TIFFTAG_RESOLUTIONUNIT': '1 (unitless)', 'TIFFTAG_XRESOLUTION': '1', 'TIFFTAG_SOFTWARE': 'IMAGINE TIFF Support\nCopyright 1991 - 1999 by ERDAS, Inc. All Rights Reserved\n@(#)$RCSfile: etif.c $ $Revision: 1.10.1.9.1.9.2.11 $ $Date: 2004/09/15 18:42:01EDT $', 'TIFFTAG_YRESOLUTION': '1', 'AREA_OR_POINT': 'Area'}

The basic assessment entails that the six bands satellite image has a 30 meters resolution and the projection system WGS 84 UTM Zone 18N. In order to visualize the image by creating different composites, let's go ahead and convert the image to an array and subset each band.

band_1 = image_data.GetRasterBand(1).ReadAsArray()
band_2 = image_data.GetRasterBand(2).ReadAsArray()
band_3 = image_data.GetRasterBand(3).ReadAsArray()
band_4 = image_data.GetRasterBand(4).ReadAsArray()
band_5 = image_data.GetRasterBand(5).ReadAsArray()
band_6 = image_data.GetRasterBand(6).ReadAsArray()

Let's take a look at the digital numbers or pixel values the image composed of.

print band_1

[[ 76  73  75 ... 119 113 110]
 [ 72  74  73 ... 124 127 115]
 [ 72  74  72 ... 117 113 106]
 ...
 [ 85  81  90 ...  80  77  79]
 [ 91 105 105 ...  78  77  81]
 [114 104 102 ...  76  76  77]]

Each of these pixel values represents surface reflectance captured by a satellite in 30 by 30 meters land surface. In geospatial analysis, the focus is therefore, to analyze the relationships of these values to each other with-in and among bands.

Visualization¶

Creating different composites by arranging bands in different order often help us better visualize and examine patterns. Next, we will create four of the most commonly used composites: The True, False and False Natural Color I and II.

true_color = np.dstack([band_3, band_2, band_1])
color_infrared = np.dstack([band_4, band_3, band_2])
false_natural_color = np.dstack([band_5, band_4, band_3])
false_color_II = np.dstack([band_6, band_5, band_3])

To visualize these composites, we will use the the matplotlib library.

# Set figure size
fig = plt.subplots(2,2, figsize=(15,20))

# Position, title & plot of each band
plt.subplot(221), plt.imshow(true_color),plt.title("Natural Color")
plt.subplot(222), plt.imshow(color_infrared), plt.title("Color Infrared")
plt.subplot(223), plt.imshow(false_natural_color), plt.title("False Color I")
plt.subplot(224), plt.imshow(false_color_II), plt.title(" False Color II")
plt.show()

The True or Natural Color shows the reflectance captured in the visible spectrum while the False and the False Natural composites highlights reflectance captured in the Near Infrared (NIR) and Short Wave Infrared Red (SWIR) bands respectively.

We can also visualize individual bands and to do so we will use Rasterio's show function. The plots below are zoomed-in to the District of Columbia.

# Set figure size
fig, ((ax1, ax2), (ax3,ax4)) = plt.subplots(2,2, figsize=(15,20))

# Plot each band, zoom-in, set location color and title
show(band_1[1150:1575, 850:1200], ax=ax1, cmap='Blues', title='Blue Band')
show(band_2[1150:1575, 850:1200], ax=ax2, cmap='Greens', title='Green Band')
show(band_3[1150:1575, 850:1200], ax=ax3, cmap='Reds', title='Red Band')
show(band_4[1150:1575, 850:1200], ax=ax4, cmap='brg', title='NIR Band')


# Remove grid from plots
(ax1).grid(b=False)
(ax2).grid(b=False)
(ax3).grid(b=False)

# show all
plt.show()

Descriptive Statistics¶

Statistical assessment of satellite images involves examining pixel values to detect relationships, anomalies and unique patterns. The univariate descriptive metrics we will focus on below include measures of central tendancy and dispersion. We will begin by converting the entire image to an array.

# Read image as Array
image_data_arr = image_data.ReadAsArray()

Next, we'll create empty lists that will contain the metrics we are going to calculate.

stat_min = [] # Minimum
stat_max = [] # Maximum
stat_ave = [] # Mean
stat_med = [] # Median
stat_P25 = [] # 25 percent quantile
stat_P75 = [] # 75 percent quantile
stat_var = [] # Variance
stat_std = [] # Standard Deviation

# put all in one list
stat_all = [stat_min, stat_max, stat_ave, stat_P25, stat_P75, stat_med, stat_var, stat_std]

Now, calculate the values for each metric and append output on the empty list created above.

# Calcualate and append value to lists
for i in image_data_arr:
    mn =i.min() # minimum
    stat_min.append(mn)
    mx = i.max() #maximum
    stat_max.append(mx)
    av = i.mean() #Average
    stat_ave.append(av)
    p25 = np.percentile(i, 25) #1st quantile
    stat_P25.append(p25)
    md = np.median(i)# median
    stat_med.append(md)
    p75 = np.percentile(i, 75) # 3rd quantile
    stat_P75.append(p75)
    va = i.var() # variance
    stat_var.append(va)
    st = i.std() # standard deviation
    stat_std.append(st)

# Rename Column and Row headers
stat_summary = pd.DataFrame(data =stat_all,
            index = ['Minimum','Maximum','Mean','25th Quantile','Median',
                     '75th Quantile','Variance','Standard Deviation'],
            columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])

stat_summary.round(3)

Histogram¶

A Histogram represents the distribution of numerical data and plotting the frequency of digital numbers allow us to understand the over all pattern of pixels. To plot the histogram, we will use the show_hist function from Rasterio.

fig, ax = plt.subplots(1, figsize=(9, 7))
ax.grid(linestyle='None')
plot.show_hist(image_data_2, bins=50, lw=2.0,
               stacked=True, alpha=0.4,
               histtype='stepfilled',
               normed=True, ax=ax)
plt.show()

Normality Test¶

The histogram above shows that some of the bands are skewed to the left while others exhibit a sharp peak. We can further assess these patterns by calculating for Kurtosis and Skewness. Also, the Shapiro Wilk Test can help us examin the normality of their distribution.

print "Skewness = %.3f" %np.mean(stats.skew(image_data_arr))
print "Kurtosis = %.3f" %np.mean(stats.kurtosis(image_data_arr))
print "Shapiro test = %.3f, P-value = %.3f" %stats.shapiro(image_data_arr)

Skewness = 0.211
Kurtosis = -1.022
Shapiro test = 0.973, P-value = 0.000

The pixel values are slightly skewed to the left with negative excess kurtosis indicating the thinner tails.Based on Shapiro Test, We can reject the null hypothesis that the bands/pixel values are normally distributed. </br>

However, The Shapiro Test is often not a reliable instrument for large dataset and to better support our assessment, we will investigate the distribution of pixels using The Normal Probability Plot. The plot is a graphical technique for assessing whether or not a data set is approximately normally distributed.

fig, ax = plt.subplots(1, figsize=(9, 7))
image_data_arr = image_data.ReadAsArray()
image_data_arr32 = image_data_arr.astype(np.float32)
image_data_flatten= image_data_arr32.flatten()
stats.probplot(image_data_flatten, dist="norm", plot=pylab)
pylab.show()

The probability plot indicates that when the pixel values approach the minimum and maximum values, their pattern becomes non-normal. This is a very critical information in that if we are interested in, for instance, supervised image classification or conducting change detection, we now know that relying on a Maximum Likelihood Classifier will most likely lead us to generate biased estimates.

Measures of Dispersion¶

In addition of univarite metrics, measures of dispersion about the mean of the distribution of a data provide valuable information. For our purpose, we will focus on the instruments Covariance and Correlation to help us understand the relationshp that exist among bands.

Covariance or Cross-Variance is measure of the directional relationship of two or more variables (bands, in our case). It's calulated by averaging the square deviation of all possible observations. </br>

To evaluate the dispersion of pixel values through covariance and correlation matrices, we will begin by reshaping the dimension of the array from a three to a two dimensions array.

# Reshaping dimension

# get the format of the orignial data
n_samples, nx, ny = image_data_arr.shape

# reshape the three dimension to two
image_data_arr_2_dim= image_data_arr.reshape(n_samples, nx*ny)

Next, from the numpy python library, we'll call the cov function for covariance on the newly reshaped array values.

# Calcuate covariance
cov_mat = np.cov(image_data_arr_2_dim)

# Create dataframe and rename headers
covMat_Df = pd.DataFrame(data = cov_mat,
                         index = ['Band 1','Band 2','Band 3','Band 4','Band 6', 'Band 6'],
                         columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])
# Round values to 3
covMat_Df.round(3)

Correlation: is a measure of the statistical relationships involving dependence. In imagery analysis, correlation matrix allows us to examine how strongly pairs of bands are related. Since the dimension of the array has already been reshaped, we will simply call the corrcoef function (correlation coefficent) from the numpy library and pass the array.

# Calcualte correlation
d_corr = np.corrcoef(image_data_arr_2_dim)

# Create data frame and rename headers
corMat_Df = pd.DataFrame(data = d_corr,
                         index = ['Band 1','Band 2','Band 3','Band 4','Band 6', 'Band 6'],
                         columns = ['Band 1', 'Band 2', 'Band 3','Band 4','Band 5','Band 6'])
corMat_Df.round(3)

The correlation matrix indicates that band 1, 2, 3 are highly correlated. On the other hand, Band 4 turns out to be the least correlated bands. Hence, we can reduce the file size of the image by simply focusing on the least correlated bands there by reducing the time and computational power it takes to conduct further image analysis.

Summary¶

On this part, we covererd basic techniques of reading, visualizing and examining a satellite image. We visualized the image by creating different composites as well as by plotted individual bands. We also attempted to better understand the image using univariate metrics.

Next, we will cover image enhancement and component reduction techniques.

	Band 1	Band 2	Band 3	Band 4	Band 5	Band 6
Minimum	47.000	24.000	3.000	6.000	0.000	0.000
Maximum	255.000	255.000	255.000	255.000	255.000	255.000
Mean	79.109	62.513	53.665	91.040	83.146	45.061
25th Quantile	70.000	51.000	37.000	81.000	70.000	31.000
Median	84.000	68.000	62.000	104.000	94.000	54.000
75th Quantile	75.000	59.000	47.000	93.000	79.000	41.000
Variance	195.243	254.870	536.298	398.553	546.602	385.093
Standard Deviation	13.973	15.965	23.158	19.964	23.380	19.624

	Band 1	Band 2	Band 3	Band 4	Band 5	Band 6
Band 1	195.243	216.222	308.813	-97.523	175.986	222.842
Band 2	216.222	254.870	363.416	-82.310	246.768	276.996
Band 3	308.813	363.416	536.298	-149.087	350.425	407.831
Band 4	-97.523	-82.310	-149.087	398.553	154.324	-34.451
Band 6	175.986	246.768	350.425	154.324	546.602	399.643
Band 6	222.842	276.996	407.831	-34.451	399.643	385.093

	Band 1	Band 2	Band 3	Band 4	Band 5	Band 6
Band 1	1.000	0.969	0.954	-0.350	0.539	0.813
Band 2	0.969	1.000	0.983	-0.258	0.661	0.884
Band 3	0.954	0.983	1.000	-0.322	0.647	0.897
Band 4	-0.350	-0.258	-0.322	1.000	0.331	-0.088
Band 6	0.539	0.661	0.647	0.331	1.000	0.871
Band 6	0.813	0.884	0.897	-0.088	0.871	1.000