Learning Objectives

  • Subset bands by index or name

  • Extract raster data by row and column number

  • Extract data by bounding window

  • Extract raster data by coordinates

  • Extract raster data by geometry (point, polygon)


Raster Data Extraction#

Raster data is often of little use unless we can extract and summarize the data. For instance, extracting raster to points by coordinates allows us to pass data to machine learning models for land cover classification or cloud masking.

Subsetting rasters#

We can subset sections of the data array in a number of ways. In this case we will create a slice based on row and column location to subset LandSat data using a rasterio.window.Window.

Either a rasterio.window.Window object or tuple can be used with geowombat.open.

import geowombat as gw
from geowombat.data import rgbn

from rasterio.windows import Window
w = Window(row_off=0, col_off=0, height=100, width=100)

with gw.open(rgbn,
                band_names=['blue', 'green', 'red'],
                num_workers=8,
                indexes=[1, 2, 3],
                window=w,
                out_dtype='float32') as src:
    print(src)
<xarray.DataArray 'array-444557cdb992603699462903d82d3616' (band: 3, y: 100,
                                                            x: 100)>
dask.array<array, shape=(3, 100, 100), dtype=float32, chunksize=(3, 64, 64), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U5 'blue' 'green' 'red'
  * y        (y) float64 2.05e+06 2.05e+06 2.05e+06 ... 2.05e+06 2.05e+06
  * x        (x) float64 7.93e+05 7.93e+05 7.93e+05 ... 7.935e+05 7.935e+05
Attributes:
    transform:           | 5.00, 0.00, 792988.00|\n| 0.00,-5.00, 2050382.00|\...
    crs:                 +init=epsg:32618
    res:                 (5.0, 5.0)
    is_tiled:            1
    nodatavals:          (nan, nan, nan, nan)
    offsets:             (0.0, 0.0, 0.0, 0.0)
    _data_are_separate:  0
    _data_are_stacked:   0

We can also slice a subset of data using a tuple of bounded coordinates.

bounds = (793475.76, 2049033.03, 794222.03, 2049527.24)

with gw.open(rgbn,
                band_names=['green', 'red', 'nir'],
                num_workers=8,
                indexes=[2, 3, 4],
                bounds=bounds,
                out_dtype='float32') as src:
    print(src)

The configuration manager provides an alternative method to subset rasters. See tutorial-config for more details.

with gw.config.update(ref_bounds=bounds):

    with gw.open(rgbn) as src:
        print(src)

By default, the subset will be returned by the upper left coordinates of the bounds, potentially shifting cell alignment with the reference raster. To subset a raster and align it to the same grid, use the ref_tar keyword. This is equivalent to a “snap raster” in ArcGIS.

with gw.config.update(ref_bounds=bounds, ref_tar=rgbn):

    with gw.open(rgbn) as src:
        print(src)

Extracting data by coordinates#

To extract values at a coordinate pair, translate the coordinates into array indices. For extraction by geometry, for instance with a shapefile, see extract by point geometry.

import geowombat as gw
from geowombat.data import l8_224078_20200518

# Coordinates in map projection units
y, x = -2823031.15, 761592.60

with gw.open(l8_224078_20200518) as src:
    # Transform the map coordinates to data indices
    j, i = gw.coords_to_indices(x, y, src)
    # Subset by index
    data = src[:, i, j].data.compute()

print(data.flatten())
[7448 6882 6090]

A latitude/longitude pair can be extracted after converting to the map projection.

import geowombat as gw
from geowombat.data import l8_224078_20200518

# Coordinates in latitude/longitude
lat, lon = -25.50142964, -54.39756038

with gw.open(l8_224078_20200518) as src:
    # Transform the coordinates to map units
    x, y = gw.lonlat_to_xy(lon, lat, src)
    # Transform the map coordinates to data indices
    j, i = gw.coords_to_indices(x, y, src)
    data = src[:, i, j].data.compute()

print(data.flatten())
[7448 6882 6090]

Extracting data with point geometry#

In the example below, ‘l8_224078_20200518_points’ is a GeoPackage of point locations, and the output df is a GeoPandas GeoDataFrame. To extract the raster values at the point locations, use geowombat.extract.

import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points

with gw.open(l8_224078_20200518) as src:
    df = src.gw.extract(l8_224078_20200518_points)

print(df)
        name                         geometry  id     1     2     3
0      water  POINT (741522.314 -2811204.698)   0  7966  7326  6254
1       crop  POINT (736140.845 -2806478.364)   1  8030  7490  8080
2       tree  POINT (745919.508 -2805168.579)   2  7561  6874  6106
3  developed  POINT (739056.735 -2811710.662)   3  8302  8202  8111
4      water  POINT (737802.183 -2818016.412)   4  8277  7982  7341
5       tree  POINT (759209.443 -2828566.230)   5  7398  6711  6007

Note

The line df = src.gw.extract(l8_224078_20200518_points) could also have been written as df = gw.extract(src, l8_224078_20200518_points).

In the previous example, the point vector had a CRS that matched the raster (i.e., EPSG=32621, or UTM zone 21N). If the CRS had not matched, the geowombat.extract function transforms the CRS on-the-fly.

import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points
import geopandas as gpd

point_df = gpd.read_file(l8_224078_20200518_points)
print(point_df.crs)

# Transform the CRS to WGS84 lat/lon
point_df = point_df.to_crs('epsg:4326')
print(point_df.crs)

with gw.open(l8_224078_20200518) as src:
    df = src.gw.extract(point_df)

print(df)
EPSG:32621
epsg:4326
        name                         geometry  id     1     2     3
0      water  POINT (741522.314 -2811204.698)   0  7966  7326  6254
1       crop  POINT (736140.845 -2806478.364)   1  8030  7490  8080
2       tree  POINT (745919.508 -2805168.579)   2  7561  6874  6106
3  developed  POINT (739056.735 -2811710.662)   3  8302  8202  8111
4      water  POINT (737802.183 -2818016.412)   4  8277  7982  7341
5       tree  POINT (759209.443 -2828566.230)   5  7398  6711  6007

Set the data band names using sensor = 'bgr', which assigns the band names blue, green, red.

import geowombat as gw
from geowombat.data import l8_224078_20200518, l8_224078_20200518_points

with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_points,
                            band_names=src.band.values.tolist())

print(df)
        name                         geometry  id  blue  green   red
0      water  POINT (741522.314 -2811204.698)   0  7966   7326  6254
1       crop  POINT (736140.845 -2806478.364)   1  8030   7490  8080
2       tree  POINT (745919.508 -2805168.579)   2  7561   6874  6106
3  developed  POINT (739056.735 -2811710.662)   3  8302   8202  8111
4      water  POINT (737802.183 -2818016.412)   4  8277   7982  7341
5       tree  POINT (759209.443 -2828566.230)   5  7398   6711  6007

Extracting time series images by point geometry#

We can also easily extract a time series of raster images. Extracted pixel values are provided in ‘wide’ format with appropriate labels, for instance the column ‘t2_blue’ would be the blue band for the second time period

from geowombat.data import l8_224078_20200518, l8_224078_20200518_points

with gw.config.update(sensor='bgr'):
    with gw.open([l8_224078_20200518, l8_224078_20200518],
            time_names=['t1', 't2'],
            stack_dim='time') as src:

        # Extract and by point geometry
        df = src.gw.extract(l8_224078_20200518_points)

print(df)
        name                         geometry  id  t1_blue  t1_green  t1_red  \
0      water  POINT (741522.314 -2811204.698)   0     7966      7326    6254   
1       crop  POINT (736140.845 -2806478.364)   1     8030      7490    8080   
2       tree  POINT (745919.508 -2805168.579)   2     7561      6874    6106   
3  developed  POINT (739056.735 -2811710.662)   3     8302      8202    8111   
4      water  POINT (737802.183 -2818016.412)   4     8277      7982    7341   
5       tree  POINT (759209.443 -2828566.230)   5     7398      6711    6007   

   t2_blue  t2_green  t2_red  
0     7966      7326    6254  
1     8030      7490    8080  
2     7561      6874    6106  
3     8302      8202    8111  
4     8277      7982    7341  
5     7398      6711    6007  

Extracting data by polygon geometry#

To extract values within polygons, use the same geowombat.extract function.

from geowombat.data import l8_224078_20200518, l8_224078_20200518_polygons

with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_polygons,
                            band_names=src.band.values.tolist())

    print(df)
     id  point                         geometry       name   blue  green  \
0     0      0  POINT (737559.502 -2795247.772)      water   7994   7423   
1     0      1  POINT (737589.502 -2795247.772)      water   8017   7428   
2     0      2  POINT (737619.502 -2795247.772)      water   8008   7446   
3     0      3  POINT (737649.502 -2795247.772)      water   8008   7412   
4     0      4  POINT (737679.502 -2795247.772)      water   8018   7398   
..   ..    ...                              ...        ...    ...    ...   
667   3    667  POINT (739038.667 -2811819.677)  developed   8567   8564   
668   3    668  POINT (739068.667 -2811819.677)  developed   8099   7676   
669   3    669  POINT (739098.667 -2811819.677)  developed  10151   9651   
670   3    670  POINT (739128.667 -2811819.677)  developed   8065   7735   
671   3    671  POINT (739158.667 -2811819.677)  developed   9343   8987   

       red  
0     6272  
1     6292  
2     6292  
3     6291  
4     6250  
..     ...  
667   8447  
668   7332  
669  10153  
670   7501  
671   9247  

[672 rows x 7 columns]

Calculate mean pixel value by polygon#

It is simple then to calculate the mean value of pixels within each polygon by using the polygon id column and pandas groupby function. You can easily calculate other statistics like min, max, median etc.

from geowombat.data import l8_224078_20200518, l8_224078_20200518_polygons

with gw.config.update(sensor='bgr'):
    with gw.open(l8_224078_20200518) as src:
        df = src.gw.extract(l8_224078_20200518_polygons,
                            band_names=src.band.values.tolist())
        # use pandas groupby to calc pixel mean  
        df.drop(columns=['geometry'], inplace=True)
        df = df.groupby(['id', 'name']).mean()
    print(df)
              point         blue        green          red
id name                                                   
0  water      103.5  7990.038462  7387.918269  6264.846154
1  crop       304.0  7692.481865  7037.419689  7571.207254
2  tree       497.0  7506.901554  6838.704663  6091.932642
3  developed  632.5  8698.397436  8328.294872  8365.487179