Data preprocessing

July 20, 2024 · View on GitHub

All data preprocessing utilities are self-contained in the folder preprocessingand can be used independently of the other utilities of the package.

LiDAR

To preprocess LiDAR, use the prepare_lidar.py script. Do not forget to complete the associated config prepare_lidar_config in the config folder with your own paths. The script downloads LiDAR HD tiles from the IGN server and process them to get canopy height models (CHM) and classification rasters. The script create_vrt.pycan be used to get a virtual file from all processed tiles.

The canopy height maps are calculated by taking the maximum difference between the height of each point and the one of its nearest point classified as ground within its pixel, interpolating values in areas without data.

⚠️ Warning : Processing one $1km \times 1km$ tile can take a few minutes and requires between 10 to 20GB of RAM. It takes approximately one week to process one third of France on a cluster with 70 jobs running in parallel.

NB: as of July 12th 2024, IGN has started to release their own CHMs, and we recommend to use them wherever available.

SPOT 6-7

To preprocess SPOT 6-7 tiles, you must first download them from DINAMIS Data-Terra server (Etalab licence). Then use the pansharpen.py script to pansharpen the pairs of panchromatic and spectral images at 1.5m resolution with the weighted Brovey algorithm. Do not forget to complete the associated config pansharpen_config in the config folder with your own paths.

Splits

Use the create_splits.py with the associated config create_splits_configif you need to create new train/val/test splits, although we advise to use the same test splits as in the Open-Canopy paper in order to be able to compare new models with the benchmark of the paper. If you want to add train tiles, you can just append their geometries to the geometries.geojsonfile, with value "train" for column "split", and their acquisition year in column "lidar_year".