Anofox Statistics - DuckDB Extension

June 22, 2026 · View on GitHub

A statistical analysis extension for DuckDB, providing regression analysis, diagnostics, and inference capabilities directly within your database.

License: BSL 1.1 DuckDB Version

Important

This extension is in early development, so bugs and breaking changes are expected. Please use the issues page to report bugs or request features.

Features

Regression Methods

MethodFunctionDescription
OLSols_fit, ols_fit_aggOrdinary Least Squares
Huberhuber_fit, huber_fit_aggRobust M-estimator (outlier-resistant; reports MAD scale + outlier mask)
RANSACransac_fit, ransac_fit_aggRobust consensus regression (outlier-resistant; reports inlier mask + trial count)
Theil-Sentheilsen_fit, theilsen_fit_aggRobust nonparametric regression via spatial-median over OLS subsamples
Ridgeridge_fit, ridge_fit_aggL2 regularization
Elastic Netelasticnet_fit, elasticnet_fit_aggCombined L1+L2 regularization
WLSwls_fit, wls_fit_aggWeighted Least Squares
RLSrls_fit, rls_fit_aggRecursive Least Squares (online)
Poissonpoisson_fit_aggGLM for count data
Binomialbinomial_fit_aggGLM for success-rate data (logit / probit / cloglog links)
Negative Binomialnegbinom_fit_aggGLM for overdispersed counts (dispersion α estimated jointly)
Tweedietweedie_fit_aggGLM for positive-skew continuous outcomes (power parameter; default 1.5 for compound Poisson-Gamma)
Gammagamma_fit_aggGLM for strictly-positive continuous outcomes (claims, durations)
Logisticlogistic_fit_aggBinary classification (binomial GLM with logit link); reports accuracy + threshold echo, optional L2 penalty
ALMalm_fit_agg24 error distributions
BLS/NNLSbls_fit_agg, nnls_fit_aggBounded/Non-negative LS
PLSpls_fit, pls_fit_aggPartial Least Squares
Isotonicisotonic_fit, isotonic_fit_aggMonotonic regression
Quantilequantile_fit, quantile_fit_aggQuantile/median regression

Statistical Hypothesis Tests

CategoryFunctionDescription
Normalityshapiro_wilk_agg, jarque_bera_agg, dagostino_k2_aggNormality tests
Parametrict_test_agg, one_way_anova_agg, yuen_agg, brown_forsythe_aggParametric tests
Nonparametricmann_whitney_u_agg, kruskal_wallis_agg, wilcoxon_signed_rank_agg, brunner_munzel_agg, permutation_t_test_aggNonparametric tests
Correlationpearson_agg, spearman_agg, kendall_agg, distance_cor_agg, icc_aggCorrelation tests
Categoricalchisq_test_agg, chisq_gof_agg, g_test_agg, fisher_exact_agg, mcnemar_aggContingency table tests
Effect Sizecramers_v_agg, phi_coefficient_agg, contingency_coef_agg, cohen_kappa_aggAssociation measures
Proportionprop_test_one_agg, prop_test_two_agg, binom_test_aggProportion tests
Equivalencetost_t_test_agg, tost_paired_agg, tost_correlation_aggTOST equivalence tests
Distributionenergy_distance_agg, mmd_aggDistribution comparison
Forecastdiebold_mariano_agg, clark_west_aggForecast evaluation

Diagnostics & Utilities

FunctionDescription
vif, vif_aggVariance Inflation Factor
aic, bicModel selection criteria
residuals_diagnostics_aggResidual analysis
aid_agg, aid_anomaly_aggDemand pattern classification

Fit-Predict Table Macros (*_fit_predict_by)

Table macros for easy per-group model fitting and prediction with a single function call:

MacroDescription
ols_fit_predict_byOLS per-group fit + predict
huber_fit_predict_byHuber robust per-group fit + predict
ransac_fit_predict_byRANSAC robust per-group fit + predict
theilsen_fit_predict_byTheil-Sen robust per-group fit + predict
ridge_fit_predict_byRidge per-group fit + predict
elasticnet_fit_predict_byElasticNet per-group fit + predict
wls_fit_predict_byWLS per-group fit + predict
rls_fit_predict_byRLS per-group fit + predict
bls_fit_predict_byBounded LS per-group fit + predict
alm_fit_predict_byALM per-group fit + predict
poisson_fit_predict_byPoisson GLM per-group fit + predict
pls_fit_predict_byPLS per-group fit + predict
isotonic_fit_predict_byIsotonic per-group fit + predict
quantile_fit_predict_byQuantile per-group fit + predict

Key Capabilities

  • Aggregate Functions: All methods support GROUP BY for per-group analysis
  • Window Functions: Regression methods support OVER clause for rolling/expanding windows
  • Prediction Intervals: Confidence and prediction intervals for forecasts
  • Full Inference: t-statistics, p-values, confidence intervals for coefficients
  • Table Macros: *_fit_predict_by macros simplify per-group workflows

Quick Start

Important: All functions in this extension use positional parameters, not named parameters (:= syntax). Parameters must be provided in the order shown in the examples below.

Naming Convention

All functions follow the anofox_stats_* naming convention, with convenient aliases without the prefix:

  • Primary: anofox_stats_ols_fit(...)
  • Alias: ols_fit(...) - Shorter and more convenient!

Both work identically - use whichever you prefer!

-- Load the extension
LOAD 'anofox_statistics';

-- Simple OLS regression (using alias - recommended)
SELECT * FROM ols_fit(
    [1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[],  -- y: response variable
    [[1.1], [2.1], [2.9], [4.2], [4.8]]::DOUBLE[][],  -- x: feature matrix
    MAP{'intercept': true}                 -- options
);

-- Or use the full name if you prefer
SELECT * FROM anofox_stats_ols_fit(
    [1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[],  -- y: response variable
    [[1.1], [2.1], [2.9], [4.2], [4.8]]::DOUBLE[][],  -- x: feature matrix
    MAP{'intercept': true}                 -- options
);

-- Multiple regression with 3 predictors
SELECT * FROM ols_fit(
    [1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[],  -- y: response variable
    [[1.1, 2.0, 3.0], [2.1, 3.0, 4.0], [2.9, 4.0, 5.0],
     [4.2, 5.0, 6.0], [4.8, 6.0, 7.0]]::DOUBLE[][],  -- x: feature matrix
    MAP{'intercept': true}                 -- options
);

-- Get coefficient inference using fit with full_output
SELECT
    coefficients,
    intercept,
    coefficient_p_values,
    intercept_p_value,
    f_statistic,
    f_statistic_pvalue
FROM ols_fit(
    [1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[],                  -- y
    [[1.0], [2.0], [3.0], [4.0], [5.0]]::DOUBLE[][],      -- x (matrix format)
    {'intercept': true, 'full_output': true, 'confidence_level': 0.95}  -- options
);

-- Per-group regression with aggregate functions
SELECT
    category,
    result.coefficients[1] as price_effect,
    result.intercept,
    result.r2
FROM (
    SELECT
        category,
        ols_fit_agg(
            sales,
            [price],
            MAP{'intercept': true}
        ) as result
    FROM sales_data
    GROUP BY category
) sub;

-- EASIER: Per-group fit + predict using table macros
-- Returns one row per observation with predictions
SELECT * FROM ols_fit_predict_by(
    'sales_data',      -- table name (as string)
    category,          -- group column
    sales,             -- y column
    [price, quantity]  -- x columns as list
);
-- Returns: group_id, y, x, yhat, yhat_lower, yhat_upper, is_training

-- With options (e.g., confidence level)
SELECT * FROM ridge_fit_predict_by(
    'sales_data',
    category,
    sales,
    [price, quantity],
    {'alpha': 0.5, 'confidence_level': 0.99}  -- options
);

Installation

Install directly from our public distribution bucket. DuckDB must be started with the -unsigned flag, since the extension binary is not signed by the DuckDB Foundation:

INSTALL 'anofox_statistics' FROM 'http://get.erpl.io';
LOAD 'anofox_statistics';

This pulls the right binary for your platform (Linux amd64/arm64, macOS amd64/arm64, Windows amd64). No build toolchain, no vcpkg, no submodules required.

Community Extension

INSTALL anofox_statistics FROM community;
LOAD anofox_statistics;

Documentation

Guides are available in the guides/ directory:

Dependencies

  • DuckDB: v1.4.5 LTS or v1.5.4 (latest)
  • Rust: Stable toolchain (for building from source)
  • anofox-regression: Rust regression library (via Cargo)
  • faer: Linear algebra library (via Cargo, no default features)
  • ICU Extension (Optional): Required for date/time operations in documentation examples
    INSTALL icu;
    LOAD icu;
    

Telemetry

This extension collects anonymous usage telemetry to help improve the product. Telemetry is enabled by default and includes:

  • Extension load events (extension name, version, platform)
  • Function execution events (which functions are used)
  • No personal data or query contents are collected

Disabling Telemetry

Environment Variable:

export DATAZOO_DISABLE_TELEMETRY=1

DuckDB SQL Setting:

SET anofox_telemetry_enabled = false;

For more information, see the posthog-telemetry repository.

License

This project is licensed under the Business Source License 1.1 (BSL 1.1).

Key Terms

  • Usage Grant: Free to use, modify, and distribute for non-production purposes
  • Production Use: Permitted after 4 years from release date, or under a commercial license
  • Change Date: [Release Date + 4 years]
  • Change License: Apache License 2.0

See LICENSE for full terms.

Why BSL?

The BSL allows:

  • Free use for development, testing, and research
  • Open source collaboration and contributions
  • Academic and educational use
  • Small-scale production use

While ensuring:

  • Sustainable development funding
  • Protection for the project's long-term viability
  • Future conversion to fully open source (Apache 2.0)

For commercial production use before the Change Date, please contact: [contact@datazoo.de]

Contributing

Contributions are welcome. To get started:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

For questions, open an issue on GitHub.

Areas for Contribution

  • Additional statistical tests (Durbin-Watson, Breusch-Pagan, etc.)
  • Visualization helpers for diagnostics
  • Documentation and examples
  • Bug reports and fixes
  • Performance optimizations
  • Internationalization

Support

Citation

If you use this extension in research, please cite:

@software{anofox_statistics,
  title = {Anofox Statistics: Statistical Analysis Extension for DuckDB},
  author = {DataZoo DE},
  year = {2025},
  url = {https://github.com/DataZooDE/anofox-statistics},
  version = {1.0.0}
}

Validation

The regression algorithms in this extension are powered by the anofox-regression Rust crate, which is validated against R's statistical functions. The test suite compares results with R's lm(), glmnet, and other standard statistical packages to ensure numerical accuracy.

Acknowledgments

  • DuckDB Team: For the database and extension framework
  • anofox-regression: Rust regression library with R-validated implementations
  • Open Source Community: For contributions and feedback