Raster Data Extraction
Learning Objectives
Subset bands by index or name
Extract raster data by row and column number
Extract data by bounding window
Extract raster data by coordinates
Extract raster data by geometry (point, polygon)
Raster Data Extraction#
Raster data is often of little use unless we can extract and summarize it. For instance, extracting raster values at point coordinates lets us pass them to machine learning models for land cover classification or cloud masking.
Subsetting rasters#
We can subset sections of the data array in a number of ways. In this case we create a slice based on row and column location to subset Landsat data using a rasterio.windows.Window. Either a rasterio.windows.Window object or a tuple can be used with geowombat.open.
import geowombat as gw
from geowombat.data import rgbn
from rasterio.windows import Window
# Window covering the first 100 rows and 100 columns
w = Window(row_off=0, col_off=0, height=100, width=100)
with gw.open(rgbn,
             band_names=['blue', 'green', 'red'],
             num_workers=8,
             indexes=[1, 2, 3],
             window=w,
             out_dtype='float32') as src:
    print(src)
<xarray.DataArray 'array-444557cdb992603699462903d82d3616' (band: 3, y: 100,
x: 100)>
dask.array<array, shape=(3, 100, 100), dtype=float32, chunksize=(3, 64, 64), chunktype=numpy.ndarray>
Coordinates:
* band (band) <U5 'blue' 'green' 'red'
* y (y) float64 2.05e+06 2.05e+06 2.05e+06 ... 2.05e+06 2.05e+06
* x (x) float64 7.93e+05 7.93e+05 7.93e+05 ... 7.935e+05 7.935e+05
Attributes:
transform: | 5.00, 0.00, 792988.00|\n| 0.00,-5.00, 2050382.00|\...
crs: +init=epsg:32618
res: (5.0, 5.0)
is_tiled: 1
nodatavals: (nan, nan, nan, nan)
offsets: (0.0, 0.0, 0.0, 0.0)
_data_are_separate: 0
_data_are_stacked: 0
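Conceptually, a window is just a pair of row and column slices. The sketch below illustrates this with a hypothetical NumPy array standing in for a raster; the array and the helper `window_to_slices` are assumptions made for illustration, not part of geowombat or rasterio, and this is not how geowombat reads data internally.

```python
import numpy as np

# Hypothetical 4-band raster stored as (band, row, col); values are arbitrary
data = np.arange(4 * 200 * 300).reshape(4, 200, 300)


def window_to_slices(row_off, col_off, height, width):
    """Translate window offsets and sizes into row and column slices."""
    rows = slice(row_off, row_off + height)
    cols = slice(col_off, col_off + width)
    return rows, cols


rows, cols = window_to_slices(row_off=0, col_off=0, height=100, width=100)

# Keep the first three bands and the 100 x 100 block in the upper-left corner
subset = data[:3, rows, cols]
print(subset.shape)  # (3, 100, 100)
```

The resulting shape matches the (band: 3, y: 100, x: 100) dimensions reported for the windowed read above.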
We can also slice a subset of data using a tuple of bounding coordinates in map units (here ordered left, bottom, right, top).
bounds = (793475.76, 2049033.03, 794222.03, 2049527.24)
with gw.open(rgbn,
             band_names=['green', 'red', 'nir'],
             num_workers=8,
             indexes=[2, 3, 4],
             bounds=bounds,
             out_dtype='float32') as src:
    print(src)
The configuration manager provides an alternative method to subset rasters. See the configuration tutorial (tutorial-config) for more details.
with gw.config.update(ref_bounds=bounds):
    with gw.open(rgbn) as src:
        print(src)
By default, the subset is anchored at the upper-left coordinates of the bounds, which can shift cell alignment relative to the reference raster. To subset a raster and keep it aligned to the same grid, use the ref_tar keyword. This is equivalent to a “snap raster” in ArcGIS.
with gw.config.update(ref_bounds=bounds, ref_tar=rgbn):
    with gw.open(rgbn) as src:
        print(src)
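To make the idea of snapping concrete, here is a minimal sketch that expands a bounding box outward to the nearest cell edges of a reference grid. It is one possible convention, not geowombat's implementation: the helper `snap_bounds` is hypothetical, and the grid origin and 5 m cell size are taken from the transform printed for the rgbn image earlier.

```python
import math


def snap_bounds(bounds, x_origin, y_origin, res):
    """Expand (left, bottom, right, top) outward to the reference grid edges."""
    left, bottom, right, top = bounds
    # Columns are counted eastward from the grid's left edge
    snapped_left = x_origin + math.floor((left - x_origin) / res) * res
    snapped_right = x_origin + math.ceil((right - x_origin) / res) * res
    # Rows are counted southward from the grid's top edge
    snapped_top = y_origin - math.floor((y_origin - top) / res) * res
    snapped_bottom = y_origin - math.ceil((y_origin - bottom) / res) * res
    return snapped_left, snapped_bottom, snapped_right, snapped_top


bounds = (793475.76, 2049033.03, 794222.03, 2049527.24)
snapped = snap_bounds(bounds, x_origin=792988.0, y_origin=2050382.0, res=5.0)
print(snapped)  # (793473.0, 2049032.0, 794223.0, 2049532.0)
```

Every snapped edge now lies a whole number of cells from the grid origin, so the subset's cells line up with those of the reference raster.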
Extracting data by coordinates#
To extract values at a coordinate pair, translate the map coordinates into array indices. For extraction by geometry, for instance with a shapefile, see the section on extracting data with point geometry.
import geowombat as gw
from geowombat.data import l8_224078_20200518
# Coordinates in map projection units
y, x = -2823031.15, 761592.60
with gw.open(l8_224078_20200518) as src:
    # Transform the map coordinates to data indices
    j, i = gw.coords_to_indices(x, y, src)
    # Subset by index: all bands at row i, column j
    data = src[:, i, j].data.compute()
print(data.flatten())
[7448 6882 6090]
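The arithmetic behind a coordinate-to-index conversion can be sketched in a few lines. This is an illustration only, not geowombat's code: it assumes a north-up raster with no rotation, the helper name `coords_to_indices_sketch` is hypothetical, and the origin and cell size are borrowed from the rgbn transform shown earlier rather than from the Landsat image used above.

```python
import math


def coords_to_indices_sketch(x, y, x_origin, y_origin, x_res, y_res):
    """Return (column, row) of the cell containing map coordinate (x, y)."""
    # Columns increase eastward from the left edge of the raster
    col = math.floor((x - x_origin) / x_res)
    # Rows increase southward from the top edge of the raster
    row = math.floor((y_origin - y) / y_res)
    return col, row


# Upper-left corner at (792988, 2050382) with 5 m cells
j, i = coords_to_indices_sketch(
    793011.0, 2050370.2,
    x_origin=792988.0, y_origin=2050382.0,
    x_res=5.0, y_res=5.0,
)
print(j, i)  # 4 2
```

As in the geowombat example, the column index comes first in the returned pair, while the array itself is indexed as (band, row, column).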
A latitude/longitude pair can be extracted after converting to the map projection.
import geowombat as gw
from geowombat.data import l8_224078_20200518
# Coordinates in latitude/longitude
lat, lon = -25.50142964, -54.39756038
with gw.open(l8_224078_20200518) as src:
    # Transform the coordinates to map units
    x, y = gw.lonlat_to_xy(lon, lat, src)
    # Transform the map coordinates to data indices
    j, i = gw.coords_to_indices(x, y, src)
    data = src[:, i, j].data.compute()
print(data.flatten())
[7448 6882 6090]
Extracting data with point geometry#
In the example below, l8_224078_20200518_points is a GeoPackage of point locations, and the output df is a GeoPandas GeoDataFrame. To extract the raster values at the point locations, use geowombat.extract.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points
with gw.open(l8_224078_20200518) as src:
    df = src.gw.extract(l8_224078_20200518_points)
print(df)
name geometry id 1 2 3
0 water POINT (741522.314 -2811204.698) 0 7966 7326 6254
1 crop POINT (736140.845 -2806478.364) 1 8030 7490 8080
2 tree POINT (745919.508 -2805168.579) 2 7561 6874 6106
3 developed POINT (739056.735 -2811710.662) 3 8302 8202 8111
4 water POINT (737802.183 -2818016.412) 4 8277 7982 7341
5 tree POINT (759209.443 -2828566.230) 5 7398 6711 6007
Note
The line df = src.gw.extract(l8_224078_20200518_points) could also have been written as df = gw.extract(src, l8_224078_20200518_points).
In the previous example, the point vector had a CRS that matched the raster (i.e., EPSG:32621, or UTM zone 21N). If the CRS does not match, geowombat.extract transforms the points to the raster CRS on the fly.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points
import geopandas as gpd
point_df = gpd.read_file(l8_224078_20200518_points)
print(point_df.crs)
# Transform the CRS to WGS84 lat/lon
point_df = point_df.to_crs('epsg:4326')
print(point_df.crs)
with gw.open(l8_224078_20200518) as src:
    df = src.gw.extract(point_df)
print(df)
EPSG:32621
epsg:4326
name geometry id 1 2 3
0 water POINT (741522.314 -2811204.698) 0 7966 7326 6254
1 crop POINT (736140.845 -2806478.364) 1 8030 7490 8080
2 tree POINT (745919.508 -2805168.579) 2 7561 6874 6106
3 developed POINT (739056.735 -2811710.662) 3 8302 8202 8111
4 water POINT (737802.183 -2818016.412) 4 8277 7982 7341
5 tree POINT (759209.443 -2828566.230) 5 7398 6711 6007
Set the band names with sensor='bgr', which assigns the names blue, green, and red.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points
with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_points,
                            band_names=src.band.values.tolist())
print(df)
name geometry id blue green red
0 water POINT (741522.314 -2811204.698) 0 7966 7326 6254
1 crop POINT (736140.845 -2806478.364) 1 8030 7490 8080
2 tree POINT (745919.508 -2805168.579) 2 7561 6874 6106
3 developed POINT (739056.735 -2811710.662) 3 8302 8202 8111
4 water POINT (737802.183 -2818016.412) 4 8277 7982 7341
5 tree POINT (759209.443 -2828566.230) 5 7398 6711 6007
Extracting time series images by point geometry#
We can also extract a time series of raster images. Extracted pixel values are returned in ‘wide’ format with one labeled column per time period and band; for instance, the column ‘t2_blue’ holds the blue band for the second time period.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points
with gw.config.update(sensor='bgr'):
    with gw.open([l8_224078_20200518, l8_224078_20200518],
                 time_names=['t1', 't2'],
                 stack_dim='time') as src:
        # Extract by point geometry
        df = src.gw.extract(l8_224078_20200518_points)
print(df)
name geometry id t1_blue t1_green t1_red \
0 water POINT (741522.314 -2811204.698) 0 7966 7326 6254
1 crop POINT (736140.845 -2806478.364) 1 8030 7490 8080
2 tree POINT (745919.508 -2805168.579) 2 7561 6874 6106
3 developed POINT (739056.735 -2811710.662) 3 8302 8202 8111
4 water POINT (737802.183 -2818016.412) 4 8277 7982 7341
5 tree POINT (759209.443 -2828566.230) 5 7398 6711 6007
t2_blue t2_green t2_red
0 7966 7326 6254
1 8030 7490 8080
2 7561 6874 6106
3 8302 8202 8111
4 8277 7982 7341
5 7398 6711 6007
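Wide output is convenient for inspection, but many time series tools expect one row per observation. The following sketch reshapes wide columns such as ‘t1_blue’ into ‘long’ format with plain pandas. The small frame is re-typed by hand from the first two rows of the output above (geometry and name omitted), and the column names `column`, `value`, `time`, and `band` are choices made for this illustration.

```python
import pandas as pd

# First two points from the extracted table above, without geometry
wide = pd.DataFrame({
    'id': [0, 1],
    't1_blue': [7966, 8030], 't1_green': [7326, 7490], 't1_red': [6254, 8080],
    't2_blue': [7966, 8030], 't2_green': [7326, 7490], 't2_red': [6254, 8080],
})

# Stack the time/band columns into rows, one value per row
long_df = wide.melt(id_vars='id', var_name='column', value_name='value')

# Split labels such as 't2_blue' into a time period and a band name
long_df[['time', 'band']] = long_df['column'].str.split('_', n=1, expand=True)
long_df = long_df.drop(columns='column')

print(long_df.sort_values(['id', 'time', 'band']))
```

Each point now contributes one row per time period and band, which makes grouping or plotting by time straightforward.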
Extracting data by polygon geometry#
To extract values within polygons, use the same geowombat.extract function. Each row of the output is one pixel, labeled with the id of the polygon that contains it.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_polygons
with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_polygons,
                            band_names=src.band.values.tolist())
print(df)
id point geometry name blue green \
0 0 0 POINT (737559.502 -2795247.772) water 7994 7423
1 0 1 POINT (737589.502 -2795247.772) water 8017 7428
2 0 2 POINT (737619.502 -2795247.772) water 8008 7446
3 0 3 POINT (737649.502 -2795247.772) water 8008 7412
4 0 4 POINT (737679.502 -2795247.772) water 8018 7398
.. .. ... ... ... ... ...
667 3 667 POINT (739038.667 -2811819.677) developed 8567 8564
668 3 668 POINT (739068.667 -2811819.677) developed 8099 7676
669 3 669 POINT (739098.667 -2811819.677) developed 10151 9651
670 3 670 POINT (739128.667 -2811819.677) developed 8065 7735
671 3 671 POINT (739158.667 -2811819.677) developed 9343 8987
red
0 6272
1 6292
2 6292
3 6291
4 6250
.. ...
667 8447
668 7332
669 10153
670 7501
671 9247
[672 rows x 7 columns]
Calculate mean pixel value by polygon#
To calculate the mean value of the pixels within each polygon, group the extracted rows by the polygon id column with the pandas groupby function. Other statistics, such as the minimum, maximum, or median, can be computed the same way.
import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_polygons
with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_polygons,
                            band_names=src.band.values.tolist())
# Drop the geometry column so that only numeric columns are averaged
df.drop(columns=['geometry'], inplace=True)
# Use pandas groupby to calculate the pixel mean per polygon
df = df.groupby(['id', 'name']).mean()
print(df)
point blue green red
id name
0 water 103.5 7990.038462 7387.918269 6264.846154
1 crop 304.0 7692.481865 7037.419689 7571.207254
2 tree 497.0 7506.901554 6838.704663 6091.932642
3 developed 632.5 8698.397436 8328.294872 8365.487179
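As a sketch of how other statistics could be computed, the example below applies several aggregations at once with pandas `agg`. The frame is typed in by hand from a few of the pixel rows printed in the polygon extraction output above (three water pixels and three developed pixels), so the results describe only this small sample, not the full polygons.

```python
import pandas as pd

# A handful of extracted pixels copied from the polygon output above
pixels = pd.DataFrame({
    'id': [0, 0, 0, 3, 3, 3],
    'name': ['water', 'water', 'water', 'developed', 'developed', 'developed'],
    'blue': [7994, 8017, 8008, 8567, 8099, 10151],
    'red': [6272, 6292, 6292, 8447, 7332, 10153],
})

# Select the band columns explicitly so bookkeeping columns are not summarized
stats = pixels.groupby(['id', 'name'])[['blue', 'red']].agg(['min', 'max', 'median'])
print(stats)
```

Selecting the band columns before aggregating also avoids summarizing the point column, whose mean in the table above is just an average of pixel counters.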