Spatial Prediction using ML in Python
Learning Objectives
Fit machine learning models and use them to make spatial predictions
Use sklearn pipelines, cross-validation and hyperparameter tuning for spatial data
Predict land cover classes or continuous variables
Make predictions using time series data
Create Land Use Classification using Geowombat & Sklearn#
One of the most common tasks for remotely sensed data is creating a land cover classification. In this tutorial we walk through how to train an ML model using raster data. These methods depend heavily on the sklearn_xarray package. To understand the pipeline commands, please see its documentation and examples.
Supervised Classification in Python#
In the following example we use Landsat data and a set of training polygons to train a supervised sklearn model. To do this we first need land cover labels for a set of points or polygons. In this case we have four polygons with the classes ['water', 'crop', 'tree', 'developed']. The first step is to use LabelEncoder to convert these names to integer categories, which we store in a new column called 'lc'.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_polygons
from geowombat.ml import fit, predict, fit_predict
import geopandas as gpd
from sklearn_xarray.preprocessing import Featurizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
le = LabelEncoder()
# The labels are string names, so here we convert them to integers
labels = gpd.read_file(l8_224078_20200518_polygons)
labels['lc'] = le.fit(labels.name).transform(labels.name)
print(labels)
        name                                           geometry  lc
0      water  POLYGON ((737544.502 -2795232.772, 737544.502 ...   3
1       crop  POLYGON ((742517.658 -2798160.232, 743046.717 ...   0
2       tree  POLYGON ((742435.360 -2801875.403, 742458.874 ...   2
3  developed  POLYGON ((738903.667 -2811573.845, 738926.586 ...   1
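Note that the integer codes follow the alphabetical order of the class names, not the row order of the polygons. As a minimal sketch, assuming a plain Python list of the four class names stands in for the GeoDataFrame column, the fitted encoder can also translate integer codes back to names:

```python
from sklearn.preprocessing import LabelEncoder

# Class names in the same row order as the polygons printed above
# (a plain list used here as a stand-in for labels['name'])
names = ['water', 'crop', 'tree', 'developed']

le = LabelEncoder()
codes = le.fit_transform(names)

# classes_ holds the sorted unique names; a name's position is its integer code
print(le.classes_.tolist())   # ['crop', 'developed', 'tree', 'water']
print(codes.tolist())         # [3, 0, 2, 1]

# inverse_transform maps integer codes back to the original names
print(le.inverse_transform([0, 3]).tolist())   # ['crop', 'water']
```

Keeping the fitted encoder around is useful later, when the integer values in a predicted raster need to be reported as class names.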
Next we build our sklearn pipeline (the sklearn documentation has a simple tutorial on pipelines). A pipeline chains a defined sequence of operations so that an array is passed through each step in order. In this case the data passes through the following steps:
StandardScaler: Normalizes all variables by removing the mean and scaling to unit variance.
PCA: Calculates principal components to reduce dimensionality.
GaussianNB: Fits a Gaussian Naive Bayes model for a quick classification.
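To see what these three steps do outside of geowombat, here is a minimal sketch that runs the same kind of pipeline on a small synthetic numpy array. The array, the class layout and the names X_demo, y_demo and demo_pl are assumptions made for illustration and are not part of the tutorial data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pixel samples: 3 "bands", two well-separated classes
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(10.0, 1.0, size=(50, 3))])
y_demo = np.repeat([0, 1], 50)

demo_pl = Pipeline([('scaler', StandardScaler()),
                    ('pca', PCA()),
                    ('clf', GaussianNB())])

# fit() fits and applies each transformer in order, then fits the classifier
demo_pl.fit(X_demo, y_demo)

# predict() pushes data through the same fitted transformers before classifying
pred = demo_pl.predict(X_demo)
print(pred.shape)                  # (100,)
print(list(demo_pl.named_steps))   # ['scaler', 'pca', 'clf']
```

The step names ('scaler', 'pca', 'clf') are arbitrary labels, but they matter later because hyperparameter tuning refers to each step by its name.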
In this example we fit and predict the model in two steps. The fit method returns three objects: X, a transformed version of the original dataset that can be used by sklearn; Xy, a tuple (X, y) containing the data used for training, where any data outside the polygons is removed; and clf, the trained pipeline object.
import matplotlib.pyplot as plt

# Use a data pipeline
pl = Pipeline([('scaler', StandardScaler()),
               ('pca', PCA()),
               ('clf', GaussianNB())])

fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

# Fit the classifier
with gw.config.update(ref_res=150):
    with gw.open(l8_224078_20200518, nodata=0) as src:
        X, Xy, clf = fit(src, pl, labels, col="lc")
        y = predict(src, X, clf)
        y.plot(robust=True, ax=ax)
plt.tight_layout(pad=1)
To fit and predict on our original data in one step, we simply use fit_predict:
from geowombat.ml import fit_predict

fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

with gw.config.update(ref_res=150):
    with gw.open(l8_224078_20200518, nodata=0) as src:
        y = fit_predict(src, pl, labels, col='lc')
        y.plot(robust=True, ax=ax)
plt.tight_layout(pad=1)
Unsupervised Classification in Python#
Unsupervised classification takes a different approach. Here we don't have to provide examples of different land cover types. Instead we rely on the algorithm to identify distinct clusters of similar data and to apply a unique label to each cluster. For instance, water and trees look very different spectrally: water reflects more blue light and absorbs nearly all near infrared, while trees reflect little blue and a lot of near infrared. Water pixels and tree pixels should therefore form two separate clusters when plotted by their blue and near infrared reflectances. Every pixel in a cluster is assigned that cluster's integer value, e.g. 1 for one cluster and 2 for another. Later, the end user needs to go back and assign a land cover label to each numbered cluster, e.g. 1=water, 2=trees.
In this example we use k-means to do our clustering. To run it we need to decide a priori how many clusters we want to identify. Typically you want to request roughly double the number of expected classes and then recombine the clusters later into the desired labels. This helps to better understand and categorize the variation in your image.
from sklearn.cluster import KMeans

cl = Pipeline([('clf', KMeans(n_clusters=6, random_state=0))])

fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

# Fit and predict with the unsupervised classifier
with gw.config.update(ref_res=150):
    with gw.open(l8_224078_20200518, nodata=0) as src:
        y = fit_predict(src, cl)
        y.plot(robust=True, ax=ax)
plt.tight_layout(pad=1)
In this case we can see that it effectively labels different clusters of data, and it is now up to us to determine which clusters should be categorized as water, trees, fields, etc.
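The recombination step can be sketched with plain numpy and sklearn. The example below is only an illustration under stated assumptions: the blue and near infrared reflectance values are synthetic, and the relabelling rule (a near infrared threshold of 0.2 on the cluster centers) is a hypothetical choice rather than anything geowombat provides. In practice the mapping usually comes from inspecting the clusters on a map.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic reflectances (columns: blue, near infrared) for two cover types
rng = np.random.default_rng(0)
water = rng.normal([0.10, 0.02], 0.01, size=(100, 2))
trees = rng.normal([0.03, 0.45], 0.03, size=(100, 2))
pixels = np.vstack([water, trees])

# Ask for twice the expected number of classes, then merge afterwards
km = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters = km.fit_predict(pixels)

# Hypothetical relabelling rule: clusters whose center has NIR < 0.2 are water
nir_centers = km.cluster_centers_[:, 1]
cluster_to_class = np.where(nir_centers < 0.2, 'water', 'trees')

# Look up the merged class for every pixel from its cluster number
merged = cluster_to_class[clusters]
print(np.unique(merged).tolist())   # ['trees', 'water']
```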
Spatial prediction with time series stack using Geowombat & Sklearn#
If you have a stack of time series data, you can apply the same method as described previously, except that we need to open multiple images, set stack_dim to 'time', and set the time_names. Note that we are just pretending we have two dates of Landsat imagery here.
fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

with gw.config.update(ref_res=150):
    with gw.open([l8_224078_20200518, l8_224078_20200518],
                 time_names=['t1', 't2'],
                 stack_dim='time',
                 nodata=0) as src:
        y = fit_predict(src, pl, labels, col='lc')
        print(y)
        # Plot the prediction for one time period
        y.sel(time='t1').plot(robust=True, ax=ax)
<xarray.DataArray (time: 2, band: 1, y: 372, x: 408)>
dask.array<xarray-<this-array>, shape=(2, 1, 372, 408), dtype=float64, chunksize=(2, 1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
* x (x) float64 7.174e+05 7.176e+05 7.177e+05 ... 7.783e+05 7.785e+05
* y (y) float64 -2.777e+06 -2.777e+06 ... -2.833e+06 -2.833e+06
* time (time) object 't1' 't2'
targ (time, y, x) uint8 dask.array<chunksize=(2, 256, 256), meta=np.ndarray>
* band (band) <U4 'targ'
Attributes: (12/13)
transform: (150.0, 0.0, 717345.0, 0.0, -150.0, -2776995.0)
crs: 32621
res: (150.0, 150.0)
is_tiled: 0
nodatavals: (0, 0, 0)
_FillValue: 0
... ...
offsets: (0.0, 0.0, 0.0)
filename: ['LC08_L1TP_224078_20200518_20200518_01_RT.TIF', 'LC...
resampling: nearest
AREA_OR_POINT: Area
_data_are_separate: 1
_data_are_stacked: 1
If you want to do more sophisticated model tuning using sklearn it is also possible to break up your fit and predict steps as follows:
fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

with gw.config.update(ref_res=150):
    with gw.open(l8_224078_20200518, nodata=0) as src:
        X, Xy, clf = fit(src, pl, labels, col="lc")
        y = predict(src, X, clf)
        y.plot(robust=True, ax=ax)
Cross-validation and Hyperparameter Tuning with Spatial Prediction#
One of the most important parts of successfully building a model is a careful assessment of model performance. To do this we will leverage some of sklearn's built-in tools. One of the most common cross-validation methods is called k-fold, where your data is split into independent sets of training and testing data multiple times. Each time, the model is trained on the 'training' data and its ability to predict the outcome of the 'testing' data is scored. This gives us a measure of how well our model will work on data it has never seen before.
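As a minimal sketch of how k-fold splitting works, the example below uses ten integer indices as a stand-in for training samples (an assumption for illustration only) and prints the train and test indices for each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)   # stand-in for 10 training samples
kf = KFold(n_splits=5)

test_counts = np.zeros(len(samples), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(samples)):
    # Training and testing indices never overlap within a fold
    assert set(train_idx).isdisjoint(test_idx)
    test_counts[test_idx] += 1
    print(fold, train_idx.tolist(), test_idx.tolist())

# Every sample is used for testing exactly once across the 5 folds
print(test_counts.tolist())   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

With spatial data, neighbouring pixels are often strongly correlated, so scores from folds that split pixels of the same polygon may be optimistic.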
In this case we use our supervised classification pipeline pl from earlier, with k-fold cross-validation. To use KFold with geowombat we need to wrap it in CrossValidatorWrapper, as seen in the example below, so that it works with xarray objects.
We often also need to tune our model's hyperparameters. In this case we will see whether we should keep 1, 2, or 3 PCA components. We might also want to test whether scaling the data range affects our performance by changing whether or not StandardScaler divides variables by their standard deviation.
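To make the effect of the with_std option concrete, here is a small sketch on two synthetic "bands" with very different spreads. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic "bands" with very different spreads
rng = np.random.default_rng(0)
bands = np.column_stack([rng.normal(100.0, 5.0, 200),
                         rng.normal(0.3, 0.01, 200)])

scaled = StandardScaler(with_std=True).fit_transform(bands)
centered = StandardScaler(with_std=False).fit_transform(bands)

# Both options remove the mean of each band ...
print(np.allclose(scaled.mean(axis=0), 0.0))     # True
print(np.allclose(centered.mean(axis=0), 0.0))   # True

# ... but only with_std=True rescales each band to unit variance
print(np.allclose(scaled.std(axis=0), 1.0))                   # True
print(np.allclose(centered.std(axis=0), bands.std(axis=0)))   # True
```

Because PCA is sensitive to the variance of each input, leaving bands unscaled lets the band with the largest spread dominate the leading components, which is why this option is worth including in the search.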
To do hyperparameter tuning with GridSearchCV in a pipeline we need to set up the parameter grid. This part can be a little confusing. To help, let's isolate the Pipeline and param_grid from the example below:
pl = Pipeline([('scaler', StandardScaler()),
               ('pca', PCA()),
               ('clf', GaussianNB())])

param_grid = {"scaler__with_std": [True, False],
              "pca__n_components": [1, 2, 3]}
Notice that each step in the pipeline is labeled (e.g. 'scaler', 'pca', 'clf'). To try out different parameters for each step we need to reference them by name in our param_grid dictionary. The dictionary follows this convention:
(step_name)__(parameter_name): [value_1, value_2]
So "pca__n_components": [1, 2, 3] says that for the pca step of the pipeline we will try out three different values for the parameter n_components, allowing us to choose the one that performs best at predicting our 'testing' data.
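One way to check a parameter grid before running a search is sketched below. It uses sklearn's ParameterGrid to list the combinations and the pipeline's get_params() to confirm that every key is a valid name. The variable names pl_demo and candidates are chosen here for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pl_demo = Pipeline([('scaler', StandardScaler()),
                    ('pca', PCA()),
                    ('clf', GaussianNB())])

param_grid = {"scaler__with_std": [True, False],
              "pca__n_components": [1, 2, 3]}

# Every key must match a name reported by the pipeline's get_params()
valid_names = pl_demo.get_params()
print(all(key in valid_names for key in param_grid))   # True

# The grid expands to every combination: 2 x 3 = 6 candidate settings
candidates = list(ParameterGrid(param_grid))
print(len(candidates))   # 6
for params in candidates:
    print(params)
```

A misspelled step or parameter name (for example a single underscore instead of two) fails the get_params() check here, which is cheaper than discovering the error partway through a grid search.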
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_xarray.model_selection import CrossValidatorWrapper

pl = Pipeline([('scaler', StandardScaler()),
               ('pca', PCA()),
               ('clf', GaussianNB())])

cv = CrossValidatorWrapper(KFold())
gridsearch = GridSearchCV(pl, cv=cv, scoring='balanced_accuracy',
                          param_grid={
                              "scaler__with_std": [True, False],
                              "pca__n_components": [1, 2, 3]
                          })

fig, ax = plt.subplots(dpi=200, figsize=(5, 5))

with gw.config.update(ref_res=150):
    with gw.open(l8_224078_20200518, nodata=0) as src:
        # Fit a model to get the Xy data used for training
        X, Xy, pipe = fit(src, pl, labels, col="lc")

        # Fit the cross-validation and parameter tuning
        # NOTE: Xy is a tuple (X, y), so unpack it with *
        gridsearch.fit(*Xy)
        print(gridsearch.cv_results_)
        print(gridsearch.best_score_)
        print(gridsearch.best_params_)

        # Set the tuned parameters and make the prediction
        # Note: predict(gridsearch.best_estimator_) not currently supported
        pipe.set_params(**gridsearch.best_params_)
        y = predict(src, X, pipe)
        y.plot(robust=True, ax=ax)
plt.tight_layout(pad=1)
{'mean_fit_time': array([0.04351978, 0.05786901, 0.03855677, 0.03912234, 0.03584881,
0.04793367]), 'std_fit_time': array([0.01413411, 0.00869281, 0.0079048 , 0.01420253, 0.01127494,
0.01434204]), 'mean_score_time': array([0.02882466, 0.03413048, 0.02843852, 0.02275014, 0.02360048,
0.03220682]), 'std_score_time': array([0.00803809, 0.00528087, 0.00888135, 0.00699913, 0.01426182,
0.01118042]), 'param_pca__n_components': masked_array(data=[1, 1, 2, 2, 3, 3],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object), 'param_scaler__with_std': masked_array(data=[True, False, True, False, True, False],
mask=[False, False, False, False, False, False],
fill_value='?',
dtype=object), 'params': [{'pca__n_components': 1, 'scaler__with_std': True}, {'pca__n_components': 1, 'scaler__with_std': False}, {'pca__n_components': 2, 'scaler__with_std': True}, {'pca__n_components': 2, 'scaler__with_std': False}, {'pca__n_components': 3, 'scaler__with_std': True}, {'pca__n_components': 3, 'scaler__with_std': False}], 'split0_test_score': array([0.23076923, 0.38461538, 0.69230769, 0.46153846, 0.69230769,
0.38461538]), 'split1_test_score': array([1. , 0.85714286, 1. , 1. , 1. ,
1. ]), 'split2_test_score': array([1. , 0.83333333, 1. , 1. , 1. ,
1. ]), 'split3_test_score': array([1., 1., 1., 1., 1., 1.]), 'split4_test_score': array([1. , 0.88888889, 1. , 1. , 1. ,
1. ]), 'mean_test_score': array([0.84615385, 0.79279609, 0.93846154, 0.89230769, 0.93846154,
0.87692308]), 'std_test_score': array([0.30769231, 0.21192572, 0.12307692, 0.21538462, 0.12307692,
0.24615385]), 'rank_test_score': array([5, 6, 1, 3, 1, 4], dtype=int32)}
0.9384615384615385
{'pca__n_components': 2, 'scaler__with_std': True}
To create a model with the optimal parameters we use gridsearch.best_params_, which holds a dictionary of each parameter and its optimal value. To use these values we update the parameters held in our returned pipeline, pipe, with the .set_params method. We use ** to unpack the dictionary into keyword arguments.
Notice that gridsearch has a few attributes of interest: the results of all the k-fold rounds (.cv_results_), the best score obtained (.best_score_), and the ideal set of parameters to use in the pipeline (.best_params_). This last one, .best_params_, is used to update our pipe pipeline for prediction.
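For comparison, the same search can be sketched in plain scikit-learn on synthetic arrays, without geowombat or xarray. This is an illustration under stated assumptions (the data, fold settings and variable names are made up), not the geowombat workflow. It shows two things that are easy to miss: with the default refit=True, GridSearchCV exposes a refitted best_estimator_, and in plain scikit-learn set_params only changes an estimator's configuration, so the estimator has to be fitted again before the new settings affect predictions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-band samples for two classes (illustrative only)
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
                   rng.normal(6.0, 1.0, size=(40, 3))])
y_syn = np.repeat([0, 1], 40)

pl_syn = Pipeline([('scaler', StandardScaler()),
                   ('pca', PCA()),
                   ('clf', GaussianNB())])

search = GridSearchCV(pl_syn,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0),
                      scoring='balanced_accuracy',
                      param_grid={"scaler__with_std": [True, False],
                                  "pca__n_components": [1, 2, 3]})
search.fit(X_syn, y_syn)
print(search.best_params_, search.best_score_)

# Option 1: use the estimator that GridSearchCV refitted on all the data
best = search.best_estimator_
pred_best = best.predict(X_syn)

# Option 2: copy the best parameters onto the pipeline, then fit it again
pl_syn.set_params(**search.best_params_)
pl_syn.fit(X_syn, y_syn)
pred_refit = pl_syn.predict(X_syn)

# Both routes end with the tuned number of PCA components in the fitted model
print(pl_syn.named_steps['pca'].n_components_)
```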