EEGauge

June 21, 2026 · View on GitHub

Audit public EEG/BCI datasets and reproduce simple baselines honestly. EEGauge is a small command-line tool that turns a public dataset into a plain-language "dataset card" - stating the evaluation split, chance level, per-class metrics, and leakage risks - so a reported accuracy can't be read out of context.

It works with motor-imagery datasets from MOABB (BNCI2014_001, BNCI2014_004, Zhou2016, Weibo2014) and reads metadata from EEGDash's 700+ OpenNeuro/NEMAR datasets. The goal is not a medical device - it is benchmark hygiene: making dataset assumptions, split strategy, and limitations easy to see.

Why it exists: on BNCI2014_001, the same CSP+LDA baseline scores 0.751 within-subject but only 0.429 across subjects (chance is 0.25). Quoting the optimistic number as if a model "works" is the exact mistake this tool helps you catch. See the worked example.

What This Project Does

Generates Markdown dataset cards for public BCI datasets, including license, DOI, and citation.
Saves benchmark results as JSON.
Runs a simple classical baseline for motor imagery: CSP + LDA.
Reports the chance level and an (approximate) test of whether the score beats chance.
Reports per-class precision/recall/F1 and a confusion matrix, not just overall accuracy.
Reads metadata for 700+ EEGDash (OpenNeuro/NEMAR) datasets and records an evaluation-provenance JSON (the exact cohort used).
States leakage warnings and limitations explicitly.
Avoids claims about diagnosis, treatment, emotion detection, or reliable assistive use.

What This Project Does Not Do

It does not read thoughts.
It does not diagnose or treat medical conditions.
It does not stream from EEG hardware.
It does not replace clinical BCI systems.
It does not prove a model will work on real users outside the tested dataset.

Install

Use Python 3.11 or newer. A virtual environment is strongly recommended.

git clone https://github.com/YG-paaleee/eegauge.git
cd eegauge
py -m venv .venv
.\.venv\Scripts\Activate.ps1
py -m pip install --upgrade pip
py -m pip install -e .[bci,test]

If you only want to run the mocked unit tests without installing MOABB/MNE:

py -m pip install -e .[test]
py -m pytest

Usage

Generate a dataset card:

eegauge scan --dataset BNCI2014_001

Run a tiny benchmark and update the card:

eegauge benchmark --dataset BNCI2014_001 --subjects 1 2 3

Outputs:

cards/BNCI2014_001.md
results/BNCI2014_001.json
results/BNCI2014_001.png

The first real MOABB run may download EEG data. That can take time.

Scanning EEGDash datasets (metadata only)

EEGDash catalogs hundreds of OpenNeuro/NEMAR datasets. The eegdash backend reads a dataset's metadata and records without downloading any signals, and additionally writes an evaluation-provenance record:

pip install -e .[eegdash]
eegauge scan --backend eegdash --dataset ds002718

Outputs:

cards/ds002718.md
results/ds002718.provenance.json

The card surfaces modalities, record/subject counts, license, DOI, BIDS validation status (when EEGDash provides it), and honest leakage risk factors (e.g. "multiple sessions per subject - use subject-wise splits"). The provenance.json (schema bci-evaluation-card/0.1) records the exact cohort so an evaluation can be reproduced. No baseline is run for EEGDash datasets yet (they span many modalities/tasks); this step is metadata only.

See examples/ds002718.md and examples/ds002718.provenance.json for real output from a live scan of the Wakeman face-processing dataset (18 subjects).

Dataset Downloads On Windows

MOABB/MNE may download public EEG files the first time you run a real benchmark. Keep those files outside the repo:

$env:MNE_DATA = "$HOME\mne_data"
$env:MNE_DATASETS_BNCI_PATH = "$HOME\mne_data"

The .gitignore blocks common local data folders and .mat files so dataset downloads do not get committed by accident.

Current Scope

Supported:

Motor-imagery datasets: BNCI2014_001, BNCI2014_004, Zhou2016, Weibo2014
CSP + LDA baseline
Subject-aware splitting when multiple subjects are provided
Stratified holdout when only one subject is provided
Chance level + binomial significance check
Per-class metrics and a confusion matrix
License / DOI / citation surfaced in the card
EEGDash backend: metadata-only scan + evaluation-provenance records (no signal download)

Not supported yet:

P300, SSVEP, c-VEP, or resting-state datasets
Deep learning baselines
Real-time hardware streams
Web dashboards

Example

examples/BNCI2014_001.md is real output from an actual run on the public BNCI2014_001 motor-imagery dataset (MOABB 1.5.0, MNE 1.12.1, scikit-learn 1.9.0, seed 42):

eegauge benchmark --dataset BNCI2014_001 --subjects 1 2 3

Why this tool exists: within-subject vs cross-subject

The same CSP + LDA baseline gives very different numbers depending on how you split:

Evaluation	Split method	Accuracy
Subject 1 only	stratified holdout (within-subject)	0.751
Subjects 1, 2, 3	leave-one-subject-out (cross-subject)	0.429

Chance is 0.25 for these four classes, so both results contain real signal. But the within-subject score is far more optimistic than the cross-subject score. Quoting the 0.751 number as if a model "works" is exactly the evaluation mistake this tool helps beginners notice and avoid. The generated card always states the split method so the number can't be read out of context.

Within-subject (optimistic)	Cross-subject (honest)

Beyond accuracy

The card also reports the chance level and whether the score beats it (an approximate binomial test), plus per-class precision/recall/F1 and a confusion matrix. On this run, cross-subject accuracy is 0.429 vs a 0.25 chance level (binomial p < 0.001), but the per-class breakdown shows the baseline barely detects the "feet" class - detail a single accuracy number hides.

Confusion matrix

Development

See CONTRIBUTING.md for setup and contribution guidance, and CHANGELOG.md for release notes.

Run tests:

py -m pytest

Run CLI smoke checks:

eegauge --help
eegauge scan --help
eegauge benchmark --help

Responsible Claims

This project is research tooling for public datasets. EEG signals are noisy, dataset-specific, and easy to overfit. Reported metrics are only meaningful with the stated split method, subject count, preprocessing, and dataset limitations.

Do not describe this project as medical software, mind reading, diagnosis, treatment, or a validated assistive communication system.