A Unified Python Interface for Water Resource Dataset Acquisition and Harmonization

October 17, 2025

AquaFetch is a Python package designed for the automated downloading, parsing, cleaning, and harmonization of freely available water resource datasets related to rainfall-runoff processes, surface water quality, and wastewater treatment. The package currently supports approximately 70 datasets, each containing from one to hundreds of parameters. It downloads raw data and transforms it into consistent, easy-to-use, analysis-ready formats, so users can access and use the data directly without labor-intensive and time-consuming preprocessing.

The package comprises three submodules, each representing a different type of water resource data: rr for rainfall-runoff processes, wq for surface water quality, and wwt for wastewater treatment. The rr submodule offers data for 47,291 catchments worldwide, encompassing both dynamic and static features for each catchment. The dynamic features consist of observed streamflow and meteorological time series, averaged over the catchment area, available at daily and/or hourly time steps. Static features include constant parameters such as land use, soil, topography, and other physiographical characteristics, along with catchment boundaries. This submodule not only provides access to established rainfall-runoff datasets such as CAMELS and LamaH but also introduces new datasets compiled for the first time from publicly accessible online data sources. The wq submodule offers access to 17 surface water quality datasets, each containing various water quality parameters measured at different locations and times. The wwt submodule provides access to over 20,000 experimental measurements related to wastewater treatment techniques such as adsorption, photocatalysis, membrane filtration, and sonolysis.

The development of AquaFetch was inspired by the growing availability of diverse water resource datasets in recent years. As a community-driven project, the codebase is structured to allow contributors to easily add new datasets, ensuring the package continues to expand and evolve to meet future needs.

Installation

You can install AquaFetch using pip:

pip install aqua-fetch

The package can also be installed from the master branch on GitHub:

python -m pip install git+https://github.com/hyex-research/AquaFetch.git

To install from a specific branch, such as the dev branch, which contains more recent code:

python -m pip install git+https://github.com/hyex-research/AquaFetch.git@dev

The commands above install only the minimal dependencies required to use the library: numpy, pandas and requests. To install the library with the full list of dependencies, use the all option during installation:

python -m pip install "aqua-fetch[all] @ git+https://github.com/hyex-research/AquaFetch.git"

This installs the additional optional dependencies, which include xarray, fiona, netCDF4 and easy_mpl.
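
Because some features depend on the optional packages, it can help to check which of them are present in the current environment. The following is a minimal sketch that uses only the standard library; the helper name `optional_dependency_status` is hypothetical and not part of AquaFetch, and the import names are assumed to match the package names listed above.

```python
from importlib.util import find_spec

# Import names of the optional dependencies listed above (assumed to match
# the package names; this helper is illustrative and not part of AquaFetch).
OPTIONAL_DEPENDENCIES = ("xarray", "fiona", "netCDF4", "easy_mpl")


def optional_dependency_status(names=OPTIONAL_DEPENDENCIES):
    """Return a mapping of module name -> True if the module can be imported."""
    # find_spec returns None for a top-level module that is not installed,
    # and it does not import the module itself.
    return {name: find_spec(name) is not None for name in names}


for name, installed in optional_dependency_status().items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If any entry is reported as missing, reinstalling with the all option shown above should pull it in.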

Usage

The following sections briefly describe the usage of datasets from each of the three submodules, i.e. rr, wq and wwt. For detailed usage examples, see the docs.

The core of the rr submodule is the RainfallRunoff class. This class fetches dynamic features (catchment-averaged hydrometeorological data at daily or sub-daily time steps), static features (catchment characteristics related to topography, soil, land use-land cover, or hydrological indices that have constant values over time) and the catchment boundary. The following example demonstrates how to fetch data for CAMELS_SE; the method is the same for all available rainfall-runoff datasets.

from aqua_fetch import RainfallRunoff
dataset = RainfallRunoff('CAMELS_SE')  # instead of CAMELS_SE, you can provide any other dataset name

# get data by station id
_, dynamic = dataset.fetch(stations='5', as_dataframe=True)
df = dynamic['5'] # dynamic is a dictionary with station names as keys and DataFrames as values
df.shape   # ->    (21915, 4)

# get name of all stations as list
stns = dataset.stations()
len(stns)  # -> 50

# get data of 10 % of stations as dataframe
_, dynamic = dataset.fetch(0.1, as_dataframe=True)
len(dynamic)  # 5

# dynamic is a dictionary whose values are dataframes of dynamic features
[df.shape for df in dynamic.values()]   # [(21915, 4), (21915, 4), (21915, 4), (21915, 4), (21915, 4)]

# get the data of a single (randomly selected) station
_, dynamic = dataset.fetch(stations=1, as_dataframe=True)
len(dynamic)  # 1

# get names of available dynamic features
dataset.dynamic_features

# get only selected dynamic features
_, dynamic = dataset.fetch('5', as_dataframe=True,
    dynamic_features=['pcp_mm', 'airtemp_C_mean', 'q_cms_obs'])
dynamic['5'].shape  # (21915, 3)

# get names of available static features
dataset.static_features

# get data of 10 random stations
_, dynamic = dataset.fetch(10, as_dataframe=True)
len(dynamic)  # 10

# If we want to get both static and dynamic data
static, dynamic = dataset.fetch(stations='5', static_features="all", as_dataframe=True)
static.shape, len(dynamic), dynamic['5'].shape   # ((1, 76), 1, (21915, 4))

# If we don't set as_dataframe=True and have xarray installed, the returned data will be an xarray Dataset
_, dynamic = dataset.fetch(10)
type(dynamic)   # -> xarray.core.dataset.Dataset

dynamic.dims   # -> FrozenMappingWarningOnValuesAccess({'time': 21915, 'dynamic_features': 4})

len(dynamic.data_vars)   # -> 10

# get coordinates of all stations
coords = dataset.stn_coords()
coords.shape  #     (50, 2)
# get coordinates of station whose id is 5
dataset.stn_coords('5')       # 68.035599	21.9758
# get coordinates of two stations
dataset.stn_coords(['5', '736'])

# get area of a single station
dataset.area('5')
# get area of two stations
dataset.area(['5', '736'])

# if fiona library is installed we can get the boundary as fiona Geometry
dataset.get_boundary('5')
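
In the example above, the first argument of fetch takes several forms: a station id such as '5', a fraction such as 0.1 (10 % of stations), or an integer such as 10 (that many randomly selected stations). The sketch below illustrates one way such a selection could be resolved into a list of station ids. It is an assumption-laden illustration, not AquaFetch's actual implementation: the function `resolve_stations`, its `seed` parameter, and the made-up station ids are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Union

Selection = Union[str, int, float, Sequence[str]]


def resolve_stations(
    selection: Selection,
    available: Sequence[str],
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: turn a station selection into a list of station ids."""
    rng = random.Random(seed)
    if isinstance(selection, str):
        # A single station id such as '5'.
        if selection not in available:
            raise ValueError(f"unknown station id: {selection!r}")
        return [selection]
    if isinstance(selection, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("booleans are not a valid station selection")
    if isinstance(selection, float):
        # A fraction of all stations, e.g. 0.1 for 10 %.
        if not 0.0 < selection <= 1.0:
            raise ValueError("a fractional selection must lie in (0, 1]")
        count = max(1, round(len(available) * selection))
        return rng.sample(list(available), count)
    if isinstance(selection, int):
        # A number of randomly chosen stations.
        if not 1 <= selection <= len(available):
            raise ValueError("station count must be between 1 and the number of stations")
        return rng.sample(list(available), selection)
    # Otherwise treat the selection as an explicit collection of ids.
    ids = list(selection)
    unknown = [station for station in ids if station not in available]
    if unknown:
        raise ValueError(f"unknown station ids: {unknown}")
    return ids


# 50 made-up ids for illustration; these are not the real CAMELS_SE station ids.
station_ids = [str(i) for i in range(1, 51)]
print(resolve_stations("5", station_ids))               # one station, by id
print(len(resolve_stations(0.1, station_ids, seed=0)))  # 10 % of 50 stations
print(len(resolve_stations(10, station_ids, seed=0)))   # 10 random stations
```

Note that under this reading the integer 1 means one random station while the float 1.0 would mean all stations, so the type of the argument matters.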

The datasets related to surface water quality are available through a functional or object-oriented API, depending on the complexity of the dataset. The following example shows the usage of two surface water quality datasets. For the complete list of Python functions and classes, see the documentation.

from aqua_fetch import busan_beach
dataframe = busan_beach()
dataframe.shape  # (1446, 14)

dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
dataframe.shape  # (1446, 15)

from aqua_fetch import GRQA
ds = GRQA(path="/path/to/data")
print(ds.parameters)

len(ds.parameters)    # 42
country = "Pakistan"
len(ds.fetch_parameter('TEMP', country=country))

The datasets for wastewater treatment are all available through a functional API. These datasets consist of experiments conducted to remove certain pollutants from wastewater. For the complete list of functions, see the documentation.

from aqua_fetch import ec_removal_biochar
data, *_ = ec_removal_biochar()
data.shape  # -> (3757, 27)

data, encoders = ec_removal_biochar(encoding="le")
data.shape  # -> (3757, 27)


from aqua_fetch import mg_degradation
mg_data, encoders = mg_degradation()
mg_data.shape  # -> (1200, 12)

# the default encoding is None, but if we want to use one hot encoder
mg_data_ohe, encoders = mg_degradation(encoding="ohe")
mg_data_ohe.shape  # -> (1200, 31)
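
The shapes above show the practical difference between the two encodings: with "le" the column count stays the same, while with "ohe" it grows (from 12 to 31 for mg_degradation). The sketch below illustrates why on a toy table, using only the standard library. It is not the encoder AquaFetch uses internally; the helper functions, column names, and values are made up for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def label_encode(rows: List[Row], column: str) -> Tuple[List[Row], Dict[str, int]]:
    """Replace each category in `column` with an integer code (column count unchanged)."""
    categories = sorted({str(row[column]) for row in rows})
    codes = {category: index for index, category in enumerate(categories)}
    encoded = [{**row, column: codes[str(row[column])]} for row in rows]
    return encoded, codes


def one_hot_encode(rows: List[Row], column: str) -> Tuple[List[Row], List[str]]:
    """Replace `column` with one 0/1 column per category (column count grows)."""
    categories = sorted({str(row[column]) for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(str(row[column]) == category)
        encoded.append(new_row)
    return encoded, categories


# Toy rows with made-up values; one categorical column and two numeric ones.
rows = [
    {"catalyst": "TiO2", "time_min": 30, "efficiency": 71.0},
    {"catalyst": "ZnO", "time_min": 30, "efficiency": 64.5},
    {"catalyst": "BiOBr", "time_min": 60, "efficiency": 88.2},
]

le_rows, codes = label_encode(rows, "catalyst")
ohe_rows, categories = one_hot_encode(rows, "catalyst")

print(len(rows[0]), len(le_rows[0]), len(ohe_rows[0]))  # 3 3 5
```

One categorical column with three distinct values becomes three indicator columns under one-hot encoding, so the width grows by the number of categories minus one for each encoded column.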

Summary of Rainfall-Runoff Datasets

Name | Num. of daily stations | Num. of hourly stations | Num. of dynamic features | Num. of static features | Temporal Coverage | Spatial Coverage | Ref.
Arcticnet10627351979 - 2003Arctic (Russia)R-Arcticnet
Bull484552141990 - 2020SpainAparicio et al., 2024
CABra73513871980 - 2010BrazilAlmagro et al., 2021
CAMELSH5767137791900 - 2024United States of AmericaTran et al., (2025)
CAMELS_AUS222, 56128166, 1871900 - 2018AustraliaFlower et al., 2021
CAMELS_BR89710671920 - 2019BrazilChagas et al., 2020
CAMELS_COL34762551981 - 2022ColombiaJimenez et al., 2025
CAMELS_CH33192091981 - 2020Switzerland, Austria, France, Germany, ItalyHoege et al., 2023
CAMELS_CL516121041913 - 2018ChileAlvarez-Garreton et al., 2018
CAMELS_DK304131191989 - 2023DenmarkLiu et al., 2024
CAMELS_DE1582211111951 - 2020GermanyLoritz et al., 2024
CAMELS_FI3201111963 - 2023FinlandSeppä, I et al., 2025
CAMELS_FR654223441970 - 2021FranceDelaigue et al., 2024
CAMELS_GB671101451970 - 2015BritainCoxon et al., 2020
CAMELS_IND472202101980 - 2020IndiaMangukiya et al., 2024
CAMELS_LUX565625612004 - 2021LuxembourgNijzink et al., 2025
CAMELS_SE504761961 - 2020SwedenTeutschbein et al., 2024
CAMELS_SK178172152000 - 2019South KoreaKim et al., 2025
CAMELS_NZ3693695401972 - 2024New ZealandBushra, et al., 2025
CAMELS_US6718591980 - 2014USANewman et al., 2014
Caravan_DK308382111981 - 2020DenmarkKoch, J. (2022)
CCAM102161241990 - 2020ChinaHao et al., 2021
Finland669102142012 - 2023FinlandNascimento et al., 2024 & ymparisto.fi
GRDCCaravan5357392111950 - 2023GlobalFaerber et al., 2023
HYSETS1442520301950 - 2018North AmericaArsenault et al., 2020
HYPE561931985 - 2019Costa RicaArciniega-Esparza and Birkel, 2020
Ireland464102141992 - 2020IrelandNascimento et al., 2024 & EPA Ireland
Italy294102141992 - 2020ItalyNascimento et al., 2024 & hiscentral.isprambiente.gov.it
Japan75169627351979 - 2022JapanPeirong et al., 2023 & river.go.jp
LamaHCE85985922801981 - 2019Central EuropeKlingler et al., 2021
LamaHIce111111361541950 - 2021IcelandHelgason and Nijssen 2024
NPCTRCatchments-714142013 - 2019CanadaKorver et al., 2022
Poland1287102141992 - 2020PolandNascimento et al., 2024 & danepubliczne.imgw.pl
Portugal280102141992 - 2020PortugalNascimento et al., 2024 & SNIRH Portugal
RRLuleaSweden1202016 - 2019Lulea (Sweden)Broekhuizen et al., 2020
Simbi7032321920 - 1940HaitiBathelemy et al., 2024
Slovenia11732141950 - 2023SloveniaNascimento et al., 2024 & vode.arso.gov.si
Spain88927351979 - 2020SpainPeirong et al., 2023 & ceh-flumen64
Thailand7327351980 - 1999ThailandPeirong et al., 2023 & RID project
USGS1200415415271950 - 2018USAUSGS nwis
WaterBenchIowa125372011 - 2018Iowa (USA)Demir et al., 2022

Summary of Water Quality Datasets

Name | Variables Covered | Number of Stations | Temporal Coverage | Spatial Coverage | Ref.
Busan Beach1412018 - 2019Busan, South KoreaJang et al., 2021
Buzzards Bay641992 - 2018Buzzards Bay (USA)Jakuba et al., 2021
CamelsChem286711980 - 2018Conterminous USASterle et al., 2024
CamelsCHChem401151980 - 2020SwitzerlandNascimento et al., 2025
Ecoli Mekong River102011 - 2021Mekong river (Houay Pano)Boithias et al., 2022
Ecoli Mekong River (Laos)102011 - 2021Mekong River (Laos)Boithias et al., 2022
Ecoli Houay Pano (Laos)102011 - 2021Houay Pano (Laos)Boithias et al., 2022
GRQA421898 - 2020GlobalVirro et al., 2021
GRiMeDB150291973 - 2021GlobalStanley et al., 2023
Oligotrend1718461986 - 2022GlobalMinaudo et al., 2025
Quadica1013861950 - 2018GermanyEbeling et al., 2022
RC4USCoast211401850 - 2020USAGomez et al., 2022
SanFrancisco Bay181969 - 2015San Francisco Bay (USA)Cloern et al., 2017
Selune River52021 - 2022Selune River (France)Moustapha Ba et al., 2023
Sylt Roads1531973 - 2019North Sea (Arctic)Rick et al., 2023
SWatCh24263221960 - 2022GlobalLobke et al., 2022
White Clay Creek21977 - 2017White Clay Creek (USA)Newbold and Damiano 2013

Summary of Wastewater Treatment Datasets

Treatment Process | Parameters | Target Pollutant | Data Points | Reference
Adsorption | 26 | Emerg. Contaminants | 3,757 | Jaffari et al., 2023
Adsorption | 15 | Cr | 219 | Ishtiaq et al., 2024
Adsorption | 30 | (Cr(VI), Co(II), Sr(II), Ba(II), I, and Fe) | 1,518 | Jaffari et al., 2023
Adsorption | 30 | po4 | 5,014 | Iftikhar et al., 2024
Adsorption | 12 | Industrial Dye | 1,514 | Iftikhar et al., 2023
Adsorption | 17 | Cu, Zn, Pb, Cd, Ni, and As | 689 | Shen et al., 2023
Adsorption | 8 | P | 504 | Leng et al., 2024
Adsorption | 8 | N | 211 | Leng et al., 2024
Adsorption | 13 | As | 1,605 | Huang et al., 2024
Photocatalysis | 11 | Malachite Green | 1,200 | Jaffari et al., 2023
Photocatalysis | 23 | Dyes | 1,527 | Kim et al., 2024
Photocatalysis | 15 | 2,4-Dichlorophenoxyacetic acid | 1,044 | Kim et al., 2024
Photocatalysis | - | - | 2,078 | submitted et al., 2024
Photocatalysis | 8 | Tetracycline | 374 | Abdi et al., 2022
Photocatalysis | 7 | TiO2 | 446 | Jiang et al., 2020
Photocatalysis | 8 | multiple | 457 | Jiang et al., 2020
membrane | 18 | micropollutants | 1,906 | Jeong et al., 2021
membrane | 18 | salts | 1,586 | Jeong et al., 2023
sonolysis | 6 | Cyanobacteria | 314 | Jaffari et al., 2024