qsv Stats Definitions

July 3, 2026


stats

Here are all the statistics produced by the qsv stats command, sourced from src/cmd/stats.rs.

Each statistic is categorized by its relevant section, with its identifier (column name), summary, computation method, and level (File or Variable).

Note: "Streaming" statistics are computed in constant memory. "Non-Streaming" statistics require loading the column data into memory (or multiple passes) and may use approximation or exact calculation depending on configuration.

Important: Unlike the sniff command, stats data type inferences are GUARANTEED, as the entire file is scanned, not just sampled. This makes stats a central command in qsv that underpins other "smart" commands (describegpt, frequency, joinp, pivotp, schema, sqlp, tojsonl) which use cached statistical information to work smarter & faster.

The command supports various caching options to improve performance on subsequent runs. See --stats-jsonl and --cache-threshold options for details.

Streaming vs Non-Streaming Statistics

Streaming Statistics (computed in constant memory, always emitted alongside the field and type identifier columns; those two identifier columns are always present in the output but are not counted as statistics here) — 27 stats:

| # | Identifier | Group |
|---|---|---|
| 1 | is_ascii | Metadata |
| 2 | sum | Descriptive |
| 3 | min | Descriptive |
| 4 | max | Descriptive |
| 5 | range | Descriptive |
| 6 | sort_order | Descriptive |
| 7 | sortiness | Descriptive |
| 8 | min_length | String length |
| 9 | max_length | String length |
| 10 | sum_length | String length |
| 11 | avg_length | String length |
| 12 | stddev_length | String length |
| 13 | variance_length | String length |
| 14 | cv_length | String length |
| 15 | mean | Central tendency |
| 16 | sem | Central tendency |
| 17 | geometric_mean | Central tendency |
| 18 | harmonic_mean | Central tendency |
| 19 | stddev | Central tendency |
| 20 | variance | Central tendency |
| 21 | cv | Central tendency |
| 22 | nullcount | Quality |
| 23 | n_negative | Quality |
| 24 | n_zero | Quality |
| 25 | n_positive | Quality |
| 26 | max_precision | Quality |
| 27 | sparsity | Quality |

Non-Streaming Statistics (opt-in via flag or --everything; all but zero_padded_numeric load/retain per-column data in memory) — 21 stats:

zero_padded_numeric is the exception in this group: it is computed during the normal streaming scan in constant memory (two bookkeeping flags per column) and is listed here only because it is opt-in, not because it requires in-memory processing.

| # | Identifier | Flag |
|---|---|---|
| 28 | median | --median (suppressed when --quartiles is also on; supplied by q2_median instead) |
| 29 | mad | --mad |
| 30 | lower_outer_fence | --quartiles |
| 31 | lower_inner_fence | --quartiles |
| 32 | q1 | --quartiles |
| 33 | q2_median | --quartiles |
| 34 | q3 | --quartiles |
| 35 | iqr | --quartiles |
| 36 | upper_inner_fence | --quartiles |
| 37 | upper_outer_fence | --quartiles |
| 38 | skewness | --quartiles |
| 39 | cardinality | --cardinality |
| 40 | uniqueness_ratio | --cardinality |
| 41 | mode | --mode |
| 42 | mode_count | --mode |
| 43 | mode_occurrences | --mode |
| 44 | antimode | --mode |
| 45 | antimode_count | --mode |
| 46 | antimode_occurrences | --mode |
| 47 | percentiles | --percentiles (single column containing the comma-separated values listed in --percentile-list) |
| 48 | zero_padded_numeric | --zero-padded-numeric (true when leading/padding zeros would be lost if the column were cast to a number: the inferred type is String, since a leading zero is exactly what forces an otherwise-numeric column to infer as String, and every non-null value is numeric-shaped: an all-digit integer (zip codes, barcodes, padded IDs), a zero-padded decimal code such as 007.1/05.10 (ICD-9, Dewey Decimal, Harmonized System codes), or a plain number mixed in; empty otherwise. Only String-typed columns are ever flagged) |

Total: 48 statistics (27 streaming + 21 non-streaming, beyond the field/type identifiers). 48 is the size of the union across all flag combinations, not the column count of any one run. Any single run emits at most 47 stats columns, because median (#28) and q2_median (#33) are mutually exclusive: median is only emitted under --median alone, while q2_median replaces it whenever --quartiles or --everything is set. Runs without --everything emit fewer columns (only the streaming 27 plus whichever opt-in groups are enabled). The non-streaming statistics that retain per-column samples (every stat in this group except zero_padded_numeric, which is streaming/constant-memory) use memory-aware chunking for large files, dynamically calculating chunk size based on available memory and record sampling. The enumeration above is the source of truth for the "48 summary statistics" count quoted in README.md and docs/help/stats.md; it is sourced from the Stats::stat_headers builder in src/cmd/stats.rs.
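The enumeration above can be tallied mechanically. The following is a minimal sketch, not qsv code: the function name stats_column_count and the flag spellings used as dictionary keys are illustrative assumptions, and the per-group column counts are simply read off the two tables above.

```python
STREAMING = 27  # streaming stats, always emitted (field/type not counted)

# Opt-in groups and the number of columns each adds, read off the table above.
OPT_IN = {
    "median": 1,
    "mad": 1,
    "quartiles": 9,            # fences, q1, q2_median, q3, iqr, skewness
    "cardinality": 2,
    "mode": 6,
    "percentiles": 1,
    "zero_padded_numeric": 1,
}


def stats_column_count(flags):
    """Count stats columns for a set of enabled opt-in groups (hypothetical helper)."""
    flags = set(flags)
    if "everything" in flags:
        flags = set(OPT_IN)
    total = STREAMING
    for name in flags:
        if name == "median" and "quartiles" in flags:
            continue  # q2_median stands in for median when quartiles are on
        total += OPT_IN[name]
    return total
```

The size of the union is STREAMING + sum(OPT_IN.values()), while a single run with every group enabled drops the separate median column.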

Weighted Statistics

When the --weight <column> option is specified, all statistics are computed using weighted algorithms. The weight column must be numeric and is automatically excluded from statistics computation. Missing or non-numeric weights default to 1.0. Zero and negative weights are ignored and do not contribute to the statistics.

Weighted statistics use weighted versions of the standard algorithms:

  • Weighted mean/variance/stddev: Weighted Welford's algorithm (West, 1979)
  • Weighted geometric mean: exp(Σ(w_i * ln(x_i)) / Σ(w_i)) for positive values
  • Weighted harmonic mean: Σ(w_i) / Σ(w_i / x_i) for non-zero values
  • Weighted median/quartiles/percentiles: Weighted nearest-rank method
  • Weighted MAD: Weighted median of absolute deviations
  • Weighted modes/antimodes: Based on weight values rather than frequency counts

The output filename will be <FILESTEM>.stats.weighted.csv to distinguish from unweighted statistics.
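The weighted mean formulas listed above can be sketched as follows. This is an illustration under stated assumptions, not the qsv implementation: the function name weighted_means is hypothetical, and the sketch assumes that the Σ(w_i) denominator of the geometric and harmonic means only covers the values that qualify (positive and non-zero, respectively).

```python
import math


def weighted_means(values, weights):
    """Return (mean, geometric_mean, harmonic_mean) for weighted data."""
    # Zero and negative weights are ignored, as described above.
    pairs = [(x, w) for x, w in zip(values, weights) if w > 0]

    total_w = sum(w for _, w in pairs)
    mean = sum(w * x for x, w in pairs) / total_w

    # Geometric mean: only positive values contribute (assumed denominator).
    pos = [(x, w) for x, w in pairs if x > 0]
    geometric = math.exp(
        sum(w * math.log(x) for x, w in pos) / sum(w for _, w in pos)
    )

    # Harmonic mean: only non-zero values contribute (assumed denominator).
    nonzero = [(x, w) for x, w in pairs if x != 0]
    harmonic = sum(w for _, w in nonzero) / sum(w / x for x, w in nonzero)

    return mean, geometric, harmonic
```

For values 1, 2, 4 with weights 1, 1, 2, the weighted mean is (1 + 2 + 8) / 4 = 2.75 and the weighted harmonic mean is 4 / (1 + 0.5 + 0.5) = 2.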

Date/DateTime Statistics

Date and DateTime statistics are only computed when --infer-dates is enabled. Date inference is an expensive operation that matches date candidates against 19 possible date formats with multiple variants.

Formatting:

  • DateTime results are in RFC3339 format (e.g., "2023-01-15T10:30:00Z")
  • Date results are in "yyyy-mm-dd" format (UTC timezone)
  • If timezone is not specified in the data, it is set to UTC

Units:

  • Date range, stddev, variance, MAD, and IQR are returned in days (not milliseconds)
  • These values are rounded to a precision of 1e-5 days (sub-second), with trailing zeros trimmed in the displayed output
  • Mean, geometric mean, and harmonic mean for dates/datetimes are returned in RFC3339 format
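As a rough illustration of the unit conversion described above, the sketch below turns a duration in milliseconds into days rounded to 5 decimal places with trailing zeros trimmed. The helper name ms_to_days and the assumption that the input is a millisecond count are illustrative, and the exact trimming rules qsv applies (for example, whether a bare "1" or "1.0" is shown) are not confirmed here.

```python
MS_PER_DAY = 86_400_000  # 24 * 60 * 60 * 1000


def ms_to_days(ms):
    """Format a millisecond duration as days, rounded to 1e-5 days."""
    days = round(ms / MS_PER_DAY, 5)
    # Fixed 5-decimal rendering, then trim trailing zeros and a dangling dot.
    return f"{days:.5f}".rstrip("0").rstrip(".")
```

One second is about 0.0000116 days, so a resolution of 1e-5 days is just under a second.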

Date Column Selection:

  • By default, --dates-whitelist is set to sniff, which uses two-stage date inferencing: first runs qsv sniff on the input file, then only infers dates for the columns that sniff identifies as date/datetime candidates. This is much faster than all, and more convenient than manually specifying patterns in the whitelist
  • Alternatively, set --dates-whitelist to a comma-separated, case-insensitive list of patterns to match against column names (e.g., date,time,due,open,close,created). Only columns whose names contain one of the patterns will be checked
  • Examples of column names that trigger date inference (with a manual whitelist like date,time,due,open,close,created): "start_date", "Observation Time", "timestamp", "Date Closed"
  • Examples that do NOT trigger: "start_dt", "create_dt", "tmstmp", "close_dt" (unless added to whitelist)
  • Use --dates-whitelist all to inspect all fields (may cause false positives with numeric data like Unix epoch timestamps)
  • Use --prefer-dmy to parse dates in day/month/year format instead of month/day/year

Metadata & Type Inference

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| field | Variable | The name of the column/header (or its index if --no-headers is used). | Extracted from the CSV header row. |
| type | Variable | Inferred data type of the column. | Inferred by checking values against: NULL, Integer, Float, Date, DateTime, Boolean (optional), and fallback to String. Data type inferences are GUARANTEED as stats scans the entire file. |
| is_ascii | Variable | Indicates if all characters in the string column are ASCII. | Checked during UTF-8 validation; true if bytes are valid ASCII. |
| zero_padded_numeric | Variable | true when leading/padding zeros would be lost if the column were cast to a number: a String-typed column whose every non-null value is numeric-shaped, meaning zero-padded integers/codes (zip codes, barcodes, padded IDs) or zero-padded decimal codes (e.g. 007.1, 05.10), possibly mixed with plain numbers; empty otherwise. Only String-typed columns are ever flagged (a leading zero is exactly what forces an otherwise-numeric column to infer as String). Opt-in via --zero-padded-numeric or --everything. | While scanning, a column stays "qualified" only if every non-null value is numeric-shaped; the first non-numeric value disqualifies it (the disqualification is sticky). A value counts as numeric-shaped if it is (a) an all-ASCII-digit integer; (b) a zero-padded decimal code, i.e. a leading 0 in the integer part followed by another digit, with a single decimal point (007.1, 05.10); or (c) a plain number, either freshly inferred as Float, or (once the column has already widened to String) re-parsed as a float, so a 3.5 arriving after a 007.1 does not disqualify the column. At output, true is emitted when the column qualified, saw at least one value, AND its final type is String. For an all-numeric-shaped column, that implies at least one value carried a leading zero (a plain all-digit column infers as Integer, a plain decimal column as Float; both are never flagged). |

Zero-Padded Numeric Detection:

  • qsv deliberately keeps zero-padded numerics — integers with leading zeros (e.g. 07306) AND zero-padded decimal codes (e.g. 007.1) — as String rather than Integer/Float, to avoid silently dropping the leading zeros. The flag is therefore only ever emitted for String-typed columns; what it adds over the String type alone is the guarantee that every value is numeric-shaped, i.e. the column is a numeric code, not free text.
  • zero_padded_numeric surfaces both so such columns are not mistakenly re-typed/cast as numeric when loaded into SQL, SPSS, SAS, Stata, or other tools.
  • Detection is strict: a column with any non-numeric value (e.g. "N/A", "Main St") is not flagged, matching the "only if all values are numeric" intent.
  • The zero-padded decimal-code rule mirrors the integer one: a 0 in the integer part immediately followed by another digit (007.1, 05.10, 0601.10). This deliberately excludes ordinary fractions like 0.5/0.25 (a single 0 before the decimal point); those infer as Float and are never flagged. Pure trailing-zero codes (7.10/7.1) are likewise not flagged: with no leading zero the column infers as Float, and trailing padding is indistinguishable from a rounded measurement without the original string anyway. Multi-dot codes like 0601.10.00 are neither all-digit nor parseable floats, so they disqualify the column (they count as non-numeric values) and it is never flagged, even though qsv does keep such columns as String.
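The detection rules above can be approximated with a small classifier. This is a simplified sketch, not the logic in src/cmd/stats.rs: the function names are hypothetical, a regular expression stands in for qsv's float parsing, signs on padded values are not handled, and "final type is String" is approximated as "at least one value has a leading zero followed by another digit".

```python
import re

ALL_DIGITS = re.compile(r"[0-9]+")                     # 07306, 10001
PADDED_DECIMAL = re.compile(r"0[0-9]+\.[0-9]+")        # 007.1, 05.10
PLAIN_NUMBER = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")   # stand-in for float parsing


def is_numeric_shaped(value):
    """True if the whole value matches one of the three numeric shapes."""
    return any(
        p.fullmatch(value) for p in (ALL_DIGITS, PADDED_DECIMAL, PLAIN_NUMBER)
    )


def has_leading_zero(value):
    """A 0 immediately followed by another digit: 07306, 007.1 (not 0.5 or 0)."""
    return re.match(r"0[0-9]", value) is not None


def zero_padded_numeric(column):
    """Return True when the column would be flagged, False otherwise."""
    values = [v for v in column if v != ""]  # empty fields treated as NULL
    if not values or not all(is_numeric_shaped(v) for v in values):
        return False  # any non-numeric value disqualifies the column
    # Assumed proxy for the column having been inferred as String.
    return any(has_leading_zero(v) for v in values)
```

The real command emits true or an empty cell rather than a Python boolean, and it decides the final type through its own inference pass.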

Date and DateTime Type Inference:

  • See the Date/DateTime Statistics section for full details on how date columns are selected, the default --dates-whitelist, example column names that do and do not trigger inference, and the list of supported date formats.
  • In summary, Date and DateTime types are only inferred when --infer-dates is enabled, and inference relies on matching candidate values against the supported date formats.

Boolean Type Inference:

  • Boolean type is inferred when --infer-boolean is enabled
  • A column is inferred as Boolean when its cardinality is 2 and the two values match the boolean patterns specified by --boolean-patterns (default: 1:0,t*:f*,y*:n*)
  • Boolean inference automatically enables --cardinality computation
  • Patterns are case-insensitive and support prefix matching with * wildcards
  • Example: With the default patterns "t*:f*,y*:n*", a column is inferred as Boolean only when it contains exactly two distinct values—one matching a "true" pattern (for example, "true", "truthy", "Truth") and one matching a "false" pattern (for example, "false", "f", "no"); any additional distinct values (such as "falsified" or "falseness") would increase the cardinality above 2 and therefore prevent Boolean inference.

Descriptive Statistics (Numerical & General)

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| sum | Variable | Sum of all values in the column. | Rolling sum. Integers sum to Integer until a Float is encountered, then switches to Float. Integer sums that overflow/underflow show *OVERFLOW* or *UNDERFLOW*. For Floats, returns NaN as the string "NaN", positive infinity as "inf", and negative infinity as "-inf". |
| min | Variable | Minimum value found. | Tracks minimum value during the scan. |
| max | Variable | Maximum value found. | Tracks maximum value during the scan. |
| range | Variable | Difference between Max and Min. | max - min. |
| sort_order | Variable | Sorting status of the column. | Checked during scan. Returns "Ascending", "Descending", or "Unsorted". |
| sortiness | Variable | Measure of how sorted the column is. | Returns a score between -1.0 and 1.0: 1.0 indicates perfectly ascending order, -1.0 indicates perfectly descending order, values in between indicate the general tendency towards ascending or descending order, and 0.0 indicates either no clear ordering or empty/single-element collections. |

Central Tendency & Dispersion (Streaming)

Computed using Welford's online algorithm for single-pass accuracy. When --weight <column> is specified, weighted versions are computed using weighted Welford's algorithm (West, 1979).

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| mean | Variable | Arithmetic mean (average). | Welford's algorithm mean. Weighted: Σ(w_i * x_i) / Σ(w_i). |
| sem | Variable | Standard Error of the Mean. | stddev / sqrt(count). |
| geometric_mean | Variable | Geometric mean. | Online calculation using logarithms. Weighted: exp(Σ(w_i * ln(x_i)) / Σ(w_i)) for positive values. |
| harmonic_mean | Variable | Harmonic mean. | Online calculation using reciprocals. Weighted: Σ(w_i) / Σ(w_i / x_i) for non-zero values. |
| stddev | Variable | Standard deviation (sample). | Welford's algorithm standard deviation. Weighted: uses frequency weight definition. |
| variance | Variable | Variance (sample). | Square of standard deviation. Weighted: S_n / (W_n - 1) where S_n is sum of squared differences. |
| cv | Variable | Coefficient of Variation. | (stddev / mean) * 100. Returns NaN when mean is 0. |
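For readers unfamiliar with Welford's online algorithm, here is a minimal single-pass sketch of the unweighted case, with sem and cv derived as in the table above. The class name Welford and its method names are illustrative, not taken from qsv, and the sketch assumes at least two values before variance is requested.

```python
import math


class Welford:
    """Single-pass mean/variance accumulator (unweighted)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    def variance(self):
        return self.m2 / (self.count - 1)  # sample variance, needs count >= 2

    def stddev(self):
        return math.sqrt(self.variance())

    def sem(self):
        return self.stddev() / math.sqrt(self.count)

    def cv(self):
        if self.mean == 0:
            return float("nan")
        return self.stddev() / self.mean * 100
```

Each value is seen once and only three numbers are kept per column, which is what makes these statistics streaming.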

String Statistics

NOTE: Length statistics are only computed for columns with a String data type. Lengths are byte lengths, not character lengths, as some UTF-8 characters take more than one byte.

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| min_length | Variable | Length of the shortest string. | Tracks minimum length in bytes. |
| max_length | Variable | Length of the longest string. | Tracks maximum length in bytes. |
| sum_length | Variable | Sum of lengths of all strings. | Accumulates length of every value. Shows *OVERFLOW* when sum exceeds u64::MAX. |
| avg_length | Variable | Average string length. | sum_length / count. Shows *OVERFLOW* when sum_length overflowed. |
| stddev_length | Variable | Standard deviation of string lengths. | Welford's algorithm on lengths. Shows *OVERFLOW* when sum_length overflowed. |
| variance_length | Variable | Variance of string lengths. | Square of stddev_length. Shows *OVERFLOW* when sum_length overflowed. |
| cv_length | Variable | Coefficient of Variation of lengths. | stddev_length / avg_length (unitless ratio, not multiplied by 100 unlike the numeric cv above). Shows *OVERFLOW* when sum_length overflowed. |
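Because lengths are measured in bytes, non-ASCII values count for more than their character length. A short sketch of the four simplest length statistics follows; the helper name length_stats is hypothetical, and overflow handling is omitted.

```python
def length_stats(values):
    """Byte-length statistics for a list of non-empty string values."""
    lengths = [len(v.encode("utf-8")) for v in values]  # bytes, not characters
    total = sum(lengths)
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "sum_length": total,
        "avg_length": total / len(lengths),
    }
```

For example, "é" is one character but two UTF-8 bytes, and "日本" is two characters but six bytes.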

Quality & Distribution

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| nullcount | Variable | Count of NULL (empty) values. | Incremented when a field is empty (or matches custom NULL). |
| n_negative | Variable | Count of negative values. | Computed for Integer and Float types only. |
| n_zero | Variable | Count of zero values. | Computed for Integer and Float types only. |
| n_positive | Variable | Count of positive values. | Computed for Integer and Float types only. |
| max_precision | Variable | Maximum decimal precision found (Floats). | Tracks the maximum number of digits after the decimal point. |
| sparsity | Variable | Fraction of missing (NULL) values. | nullcount / record_count. |

Median & Quartiles (Non-Streaming)

Requires loading data into memory and sorting. When --weight <column> is specified, weighted versions are computed using weighted nearest-rank method.

Note on Date/DateTime types: For Date and DateTime types, range, stddev, variance, MAD, and IQR are returned in days (not milliseconds). These values are rounded to a precision of 1e-5 days (sub-second precision); trailing zeros may be omitted in the output.

Requirements:

  • median requires --median or --everything (unless --quartiles is specified, in which case median is not returned separately as it's the same as q2_median)
  • mad requires --mad or --everything
  • Quartile statistics require --quartiles or --everything
  • When --quantile-method approx is set, median, q1, q2_median, q3, iqr, the four fences, skewness, and percentiles are computed from a t-digest sketch (Apache DataSketches port of Dunning's MergingDigest, ~200 centroids, ~1% rank error, more accurate at the tails). See Approximate Algorithms (Opt-In). Under that mode, --mad is auto-disabled with a warning, --weight is rejected, and results may differ slightly across runs with different --jobs values (pin --jobs 1 for run-to-run determinism).

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| median | Variable | Median value (50th percentile). | Middle value of sorted data (or average of two middle values). Weighted: uses weighted nearest-rank method. When --quartiles is specified, this is not returned separately as it's the same as q2_median. |
| mad | Variable | Median Absolute Deviation, a robust measure of variability. | Median of the absolute deviations from the data's median. Weighted: weighted median of absolute deviations. For dates/datetimes, returned in days. |
| q1 | Variable | First Quartile (25th percentile). | Value at 25% rank using Method 3. Weighted: value at which cumulative weight first reaches 25% of total weight. |
| q2_median | Variable | Second Quartile (Median). | Same as median (50th percentile). |
| q3 | Variable | Third Quartile (75th percentile). | Value at 75% rank using Method 3. Weighted: value at which cumulative weight first reaches 75% of total weight. |
| iqr | Variable | Interquartile Range. | q3 - q1. For dates/datetimes, returned in days with at least 5 decimal places. |
| lower_outer_fence | Variable | Lower bound for extreme outliers. | q1 - (3.0 * iqr), used to identify extreme outliers. For dates/datetimes, returned in RFC3339 format. |
| lower_inner_fence | Variable | Lower bound for outliers. | q1 - (1.5 * iqr), used to identify mild outliers. For dates/datetimes, returned in RFC3339 format. |
| upper_inner_fence | Variable | Upper bound for outliers. | q3 + (1.5 * iqr), used to identify mild outliers. For dates/datetimes, returned in RFC3339 format. |
| upper_outer_fence | Variable | Upper bound for extreme outliers. | q3 + (3.0 * iqr), used to identify extreme outliers. For dates/datetimes, returned in RFC3339 format. |
| skewness | Variable | Measure of asymmetry of the probability distribution. | Quantile-based skewness: (q3 - (2.0 * q2) + q1) / iqr. |
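Once the three quartiles are known, the remaining columns in the table above follow by arithmetic. The sketch below takes q1, q2 and q3 as given (it does not implement the Method 3 quartile rule) and assumes a non-zero iqr; the function name quartile_derived is illustrative.

```python
def quartile_derived(q1, q2, q3):
    """Derive iqr, the four fences, and quantile-based skewness."""
    iqr = q3 - q1
    return {
        "lower_outer_fence": q1 - 3.0 * iqr,
        "lower_inner_fence": q1 - 1.5 * iqr,
        "iqr": iqr,
        "upper_inner_fence": q3 + 1.5 * iqr,
        "upper_outer_fence": q3 + 3.0 * iqr,
        # Assumes iqr != 0; how qsv reports a zero iqr is not covered here.
        "skewness": (q3 - 2.0 * q2 + q1) / iqr,
    }
```

With q1 = 10, q2 = 20 and q3 = 40, the iqr is 30, the inner fences are -35 and 85, the outer fences are -80 and 130, and the skewness is 10 / 30, a positive value indicating a longer upper tail.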

Cardinality & Modes (Non-Streaming)

Requirements:

  • cardinality and uniqueness_ratio require --cardinality or --everything
  • mode, mode_count, mode_occurrences, antimode, antimode_count, antimode_occurrences require --mode or --everything
  • By default, cardinality is computed exactly (via the same HashMap that backs mode tracking). Pass --cardinality-method approx to swap in a HyperLogLog sketch (Apache DataSketches port, lg_k=12, ~5KB/column, ~1.5% relative standard error) — useful on very-high-cardinality columns where exact counting is wasted work. See Approximate Algorithms (Opt-In). --infer-boolean forces exact (boolean inference needs cardinality == 2 exactness); a one-time warning is emitted.
  • The --mode-cardinality-cap <n> option (default 0 = unbounded) bounds the per-column memory used to track modes/antimodes on high-cardinality columns. When the cap fires, mode/antimode columns emit the sentinel *HIGH_CARDINALITY, and (under --cardinality-method exact only) the cardinality column emits >=<n>. The >= prefix breaks downstream parsers that expect a plain integer, which is why the cap is opt-in. Under --cardinality-method approx, the cap does not affect the cardinality column (HLL emits its own 1.5%-RSE estimate at fixed memory); only mode/antimode tracking is gated. The cap measures total samples added under unweighted mode (~ row count) and number of unique values under --weight (HashMap len(), == true cardinality).

When --weight <column> is specified, weighted versions are computed. For weighted modes, mode_occurrences is the maximum weight (rounded). For weighted antimodes, antimode_occurrences is the minimum weight (rounded).

Multiple modes/antimodes are separated by the QSV_STATS_SEPARATOR environment variable (default: |).

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| cardinality | Variable | Count of unique values. | Count of distinct entries in the column. Weighted: count of unique values (weights are not considered for uniqueness). Use --cardinality-method approx for a HyperLogLog estimate (~1.5% RSE, fixed ~5KB memory); see Approximate Algorithms (Opt-In). When --mode-cardinality-cap <n> fires under exact mode, this column emits the sentinel >=<n>. |
| uniqueness_ratio | Variable | Ratio of unique values to total records. | cardinality / record_count. Interpretation: 1.0 = All unique values (e.g., primary keys). Close to 1.0 = Mostly unique values (e.g., user IDs, timestamps). Close to 0.0 = Many repeated values (e.g., categorical labels like "Male/Female" or "Yes/No"). |
| mode | Variable | The most frequent value(s) in the column. | Value(s) with the highest frequency count. Weighted: value(s) with the highest weight. Multimodal-aware. If there are multiple modes, they are separated by QSV_STATS_SEPARATOR. |
| mode_count | Variable | Number of modes found. | Count of values tied for highest frequency. |
| mode_occurrences | Variable | Frequency count of the mode. | Number of times the mode(s) appear. Weighted: maximum weight (rounded). |
| antimode | Variable | The least frequent non-zero/non-null value(s) in the column. | Value(s) with the lowest frequency count (non-zero). Returns *ALL if all values are unique. Limited to first 10 values, truncating after 100 characters (configurable with QSV_ANTIMODES_LEN). If truncated, includes *PREVIEW: prefix. Weighted: value(s) with the lowest weight. If there are multiple antimodes, they are separated by QSV_STATS_SEPARATOR. |
| antimode_count | Variable | Number of antimodes found. | Count of values tied for lowest frequency. |
| antimode_occurrences | Variable | Frequency count of the antimode. | Number of times the antimode(s) appear. Weighted: minimum weight (rounded). |
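The exact, unweighted definitions in the table above can be sketched with a frequency counter. This is an illustration only: the function name is hypothetical, the input is assumed to contain non-null values, tied values are sorted purely to make the output deterministic, and the 10-value / 100-character antimode preview is not modeled. The *ALL sentinel is applied to the antimode column only; what the count columns hold in that case is not specified here, so they are left as plain tallies.

```python
from collections import Counter


def cardinality_and_modes(values, sep="|"):
    """Exact cardinality, uniqueness ratio, modes and antimodes."""
    counts = Counter(values)
    highest = max(counts.values())
    lowest = min(counts.values())
    modes = sorted(v for v, c in counts.items() if c == highest)
    antimodes = sorted(v for v, c in counts.items() if c == lowest)
    all_unique = len(counts) == len(values)
    return {
        "cardinality": len(counts),
        "uniqueness_ratio": len(counts) / len(values),
        "mode": sep.join(modes),
        "mode_count": len(modes),
        "mode_occurrences": highest,
        "antimode": "*ALL" if all_unique else sep.join(antimodes),
        "antimode_count": len(antimodes),
        "antimode_occurrences": lowest,
    }
```

The counter grows with the number of distinct values, which is why these statistics are opt-in and why the cap and the HyperLogLog alternative exist.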

Percentiles (Non-Streaming)

Requires loading data into memory and sorting. When --weight <column> is specified, weighted percentiles are computed using weighted nearest-rank method.

Requirements: --percentiles or --everything

Computed using the nearest-rank method.

| Identifier | Level | Summary | Computation |
|---|---|---|---|
| percentiles | Variable | Custom percentiles of sorted values. | Nearest rank method for user-defined list. Weighted: weighted nearest-rank method. Multiple percentiles separated by QSV_STATS_SEPARATOR (default: \|). Special values: "deciles" → "10,20,30,40,50,60,70,80,90", "quintiles" → "20,40,60,80". Default: "5,10,40,60,90,95". For dates/datetimes, values in RFC3339 format. |
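The nearest-rank method picks an actual data value rather than interpolating. Below is a minimal sketch of the textbook definition and of the weighted variant as described in the quartile table ("value at which cumulative weight first reaches p% of total weight"). Function names are illustrative, and qsv's own rank arithmetic may differ in edge cases.

```python
import math


def nearest_rank(sorted_values, p):
    """Textbook nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    n = len(sorted_values)
    rank = max(1, math.ceil(p * n / 100))
    return sorted_values[rank - 1]


def weighted_nearest_rank(pairs, p):
    """pairs: (value, weight); first value whose cumulative weight reaches p%."""
    pairs = sorted((v, w) for v, w in pairs if w > 0)
    target = p / 100 * sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= target:
            return value
    return pairs[-1][0]


def percentiles(values, plist=(5, 10, 40, 60, 90, 95), sep="|"):
    """Join the requested percentiles into a single separated string."""
    data = sorted(values)
    return sep.join(str(nearest_rank(data, p)) for p in plist)
```

For the five values 15, 20, 35, 40, 50, the 40th percentile has rank ceil(0.4 * 5) = 2, so it is 20.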

File-Level Metadata (JSON Cache)

When stats are cached, the .stats.csv.json file includes file-level metadata that enables data fingerprinting and cache validation:

| Field | Description | Computation |
|---|---|---|
| canonical_input_path | Canonical path to the input file. | Filesystem canonical (absolute) path. |
| canonical_stats_path | Canonical path to the stats output file. | Filesystem canonical (absolute) path. |
| record_count | Total number of rows (records). | Count of records processed (excluding header). |
| field_count | Total number of columns. | Count of fields in the header/first record. |
| filesize_bytes | Total file size in bytes. | Filesystem metadata size. |
| date_generated | When the stats were generated. | RFC3339 timestamp (UTC). |
| compute_duration_ms | Time taken to compute stats. | Elapsed wall-clock time in milliseconds. |
| qsv_version | Version of qsv used to generate stats. | CARGO_PKG_VERSION at compile time. Used for cache invalidation when qsv is upgraded. |
| hash.blake3 | BLAKE3 fingerprint hash of the dataset's stats. | BLAKE3 hash of the cached stats record's streaming-stats portion up to the FINGERPRINT_HASH_COLUMNS limit (29 columns in the default output, i.e. without --typesonly; effectively min(FINGERPRINT_HASH_COLUMNS, record.len()) columns in reduced-column modes such as --typesonly), plus dataset metadata (record_count, field_count, filesize_bytes). The limit is controlled by the FINGERPRINT_HASH_COLUMNS constant in src/cmd/stats.rs, which is kept in sync with the streaming-column count in stats_headers(). This allows users to quickly detect duplicate files without having to load the entire file to compute the hash. Especially useful for detecting duplicates of very large files with pre-existing stats cache metadata. |

Whitespace Visualization

The --vis-whitespace option visualizes whitespace characters in the output to make them visible. Note that spaces will only be visualized (using 《_》) if the entire value is composed of spaces.

The following whitespace markers are used (as defined in the Rust reference):

| Character | Visualization | Description |
|---|---|---|
| \t | 《→》 | Tab |
| \n | 《¶》 | Newline |
| \r | 《⏎》 | Carriage return |
| \u{000B} | 《⋮》 | Vertical tab |
| \u{000C} | 《␌》 | Form feed |
| \u{0085} | 《␤》 | Next line |
| \u{200E} | 《␎》 | Left-to-right mark |
| \u{200F} | 《␏》 | Right-to-left mark |
| \u{2028} | 《␊》 | Line separator |
| \u{2029} | 《␍》 | Paragraph separator |
| \u{00A0} | 《⍽》 | Non-breaking space |
| \u{2003} | 《emsp》 | Em space |
| \u{2007} | 《figsp》 | Figure space |
| \u{200B} | 《zwsp》 | Zero width space |
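The marker table above amounts to a character-by-character substitution. The sketch below is an approximation, not the Rust implementation: the function name is hypothetical, and it assumes that an all-spaces value is rendered with one 《_》 marker per space, which the text above does not specify.

```python
MARKERS = {
    "\t": "《→》", "\n": "《¶》", "\r": "《⏎》",
    "\u000b": "《⋮》", "\u000c": "《␌》", "\u0085": "《␤》",
    "\u200e": "《␎》", "\u200f": "《␏》",
    "\u2028": "《␊》", "\u2029": "《␍》",
    "\u00a0": "《⍽》", "\u2003": "《emsp》",
    "\u2007": "《figsp》", "\u200b": "《zwsp》",
}


def visualize_whitespace(value):
    """Replace special whitespace characters with visible markers."""
    if value and all(ch == " " for ch in value):
        # Spaces are only visualized when the whole value is spaces.
        # One marker per space is an assumption of this sketch.
        return "《_》" * len(value)
    return "".join(MARKERS.get(ch, ch) for ch in value)
```

Ordinary spaces inside a value such as "New York" are left untouched, so only suspicious whitespace stands out.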

Performance & Caching

The stats command is central to qsv and underpins other "smart" commands (describegpt, frequency, joinp, pivotp, schema, sqlp, tojsonl) that use cached statistical information to work smarter & faster.

Caching Behavior:

  • Statistics are cached in <FILESTEM>.stats.csv and optionally <FILESTEM>.stats.csv.data.jsonl (with --stats-jsonl)
  • The arguments and file-level metadata used to generate cached stats are saved in <FILESTEM>.stats.csv.json
  • If stats have already been computed with similar arguments and the file hasn't changed, stats are loaded from cache instead of recomputing
  • Use --force to force recomputing stats even if valid cache exists
  • Use --cache-threshold to control caching behavior (default: 5000ms)

Memory-Aware Chunking:

  • For non-streaming statistics, stats dynamically calculates the chunk size based on available memory and record sampling
  • Override with QSV_STATS_CHUNK_MEMORY_MB environment variable (0 for dynamic sizing, positive for fixed limit, -1 for CPU-based chunking)
  • Enables processing of arbitrarily large "real-world" files

Approximate Algorithms (Opt-In)

By default, stats produces exact, deterministic results. Three opt-in flags swap exact accumulators for Apache DataSketches ports — Rust ports of streaming sketches — that trade a small, bounded error for constant (or near-constant) memory and faster compute on very-large columns.

| Flag | Default | Sketch | Memory | Error | Restrictions / Notes |
|---|---|---|---|---|---|
| --quantile-method approx | exact | t-digest (Apache DataSketches port of Dunning's MergingDigest, ~200 centroids) | O(K) per numeric column | ~1% rank error (more accurate at the tails) | Replaces the sort-based median/q1/q2_median/q3/iqr/fences/skewness/percentiles pipeline. --mad is auto-disabled with a warning (MAD requires a second pass that t-digest does not support). --weight is rejected (the upstream datasketches crate does not expose a weighted-update API). Results may differ ~1% across runs with different --jobs values (TDigestMut::merge is associative but not chunk-count-invariant); pin --jobs 1 for run-to-run determinism. |
| --cardinality-method approx | exact | HyperLogLog (Apache DataSketches port, lg_k=12) | ~5KB per column | ~1.5% relative standard error | Replaces exact cardinality/uniqueness_ratio. Reproducible across --jobs values (the HLL union used at merge time is associative and order-invariant, so chunk completion order does not affect the final estimate). --infer-boolean forces exact (boolean inference needs cardinality == 2 exactness); a one-time warning is emitted. The --mode-cardinality-cap >=<n> sentinel is never emitted under approx; only mode/antimode columns remain gated by the cap. |
| --mode-cardinality-cap <n> | 0 (unbounded) | bounds the mode/antimode tracker | per-column cap on tracker entries | exact when ≤ cap; sentinels otherwise | When the tracker grows past <n>, qsv drops it and emits *HIGH_CARDINALITY for mode/antimode columns. Under --cardinality-method exact, the cardinality column emits >=<n> (the >= prefix breaks downstream integer parsers, which is why the cap is opt-in). Under --cardinality-method approx, the cap does not affect the cardinality column (HLL emits its estimate at fixed memory). The cap measures total samples added under unweighted mode (~ row count) and number of unique values under --weight (HashMap len(), == true cardinality). |

Output validation: stats uses simdutf8 for SIMD-accelerated UTF-8 validation on the output path — a perf detail with no behavioral change.

OOM auto-fallback: Whenever stats takes the non-parallel path with non-streaming columns, it runs an in-memory load check via util::mem_file_check. By default the check is NORMAL mode (file size vs. total memory − headroom). Passing --memcheck (or setting the QSV_MEMORY_CHECK env var) switches to CONSERVATIVE mode (file size vs. available + free_swap × platform_factor − headroom), which is stricter and trips OOM far more readily. If the check fails in either mode, stats layers two fallbacks before propagating the OOM error:

  1. Auto-create an index (when no index exists and input is not stdin) to switch to parallel/indexed processing.
  2. Auto-enable approx DataSketches estimators — flips --quantile-method and --cardinality-method from exact to approx where the explicit-validation guards would have accepted them. Specifically:
    • --quantile-method auto-enables unless --weight is set; if --mad or --everything is also set, MAD is auto-disabled (mirroring the existing --quantile-method approx guard).
    • --cardinality-method auto-enables unless --infer-boolean is set.

A wwarn! is emitted listing each auto-enabled estimator. The original OOM error is only propagated when neither fallback engages. The sketch fallback can fire even when an index is already present and the OOM check still trips (e.g., with --jobs 1 on a pre-indexed file) — that is a behavior change from the previous "error out" path in this narrow case. Users can disable the auto-enable by passing --quantile-method exact or --cardinality-method exact explicitly; the OOM arm scans argv for these flag names (since docopt fills in the default exact value either way) and skips the auto-enable when either flag was explicitly provided.

See also: t-digest paper (Dunning, 2019), HyperLogLog (Flajolet et al., 2007), Apache DataSketches.

moarstats

Here are all the additional statistics produced by the qsv moarstats command, sourced from src/cmd/moarstats.rs.

The moarstats command extends an existing stats CSV file (created by the stats command) by computing additional statistics that can be derived from existing stats columns and/or by scanning the original CSV file.

How it works:

  • Looks for <FILESTEM>.stats.csv for a given CSV input
  • If the stats CSV file doesn't exist, it will first run the stats command with configurable options (via --stats-options, default: --infer-dates --infer-boolean --cardinality --mode --mad --quartiles --percentiles --force --stats-jsonl) to establish baseline stats
  • If the .stats.csv file is found, it skips running stats and just appends the additional stats columns
  • Statistics are rounded using Bankers Rounding (Midpoint Nearest Even) to the specified number of decimal places (default: 4, configurable with --round)
  • Uses parallel processing when an index is available for large files
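Banker's rounding (round half to even) differs from the schoolbook rule only at exact midpoints. The sketch below shows the rule with Python's decimal module; it illustrates the rounding mode only and is not how moarstats is implemented. The function name round_half_even is hypothetical, and places mirrors the --round default of 4.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value, places=4):
    """Round a decimal string or number to `places` digits, midpoints to even."""
    quantum = Decimal(1).scaleb(-places)  # e.g. 0.0001 for places=4
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
```

At four places, 0.12345 rounds down to 0.1234 (4 is even) while 0.12355 rounds up to 0.1236 (5 is odd), which avoids the upward bias of always rounding midpoints up.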

Requirements:

  • All statistics are computed only for numeric and date/datetime columns (except Shannon Entropy which works for all field types)
  • Derived statistics require specific base statistics to be present in the stats CSV
  • Advanced statistics require --advanced flag and reading the entire CSV file
  • Outlier statistics require quartiles (and thus fences) to be computed in the baseline stats
  • Winsorized/trimmed means require either Q1/Q3 or percentiles to be available

Count Reference

moarstats documentation cites "up to an additional 55 statistical measures." That figure is the union of the three groups below; each is enumerated explicitly in this document so the total can be audited against the source-of-truth in src/cmd/moarstats.rs.

Counting convention. Each conceptual statistical measure counts once even when it emits multiple companion columns; the count is therefore over distinct concepts, not over output column names. Three grouping rules are applied below:

  • Jarque-Bera (#17) counts as one measure even though it emits two columns (jarque_bera plus its jarque_bera_pvalue); the p-value is a derived companion of the test statistic, not an independent measure.
  • Winsorized Mean + Trimmed Mean (#25) count together as a single robust-mean pair (one measure entry) even though they emit 12 columns combined (each mean plus 5 companion stddev/variance/cv/range/stddev_ratio columns). They share a single robust-mean pipeline driven by --use-percentiles/--pct-thresholds and are conceptually the same measure under two different boundary policies.
  • Covariance (bivariate #4) counts as one measure even though it emits two columns (covariance_sample and covariance_population); they differ only in the divisor.

A reader regenerating the count by tallying named output columns in src/cmd/moarstats.rs will arrive at a higher number; collapsing each of the three multi-column measures listed above into a single entry yields 55.

Univariate measures (25) — see Derived Statistics, Advanced Statistics and Robust Statistics (Winsorized & Trimmed Means):

| # | Measure | Section / Flag |
| --- | --- | --- |
| 1 | Pearson's Second Skewness Coefficient (pearson_skewness) | Derived |
| 2 | Range to StdDev Ratio (range_stddev_ratio) | Derived |
| 3 | Quartile Coefficient of Dispersion (quartile_coefficient_dispersion) | Derived |
| 4 | Z-Score of Mode (mode_zscore) | Derived |
| 5 | Relative Standard Error (relative_standard_error) | Derived |
| 6 | Z-Score of Min (min_zscore) | Derived |
| 7 | Z-Score of Max (max_zscore) | Derived |
| 8 | Median-to-Mean Ratio (median_mean_ratio) | Derived |
| 9 | IQR-to-Range Ratio (iqr_range_ratio) | Derived |
| 10 | MAD-to-StdDev Ratio (mad_stddev_ratio) | Derived |
| 11 | Trimean (trimean) | Derived |
| 12 | Midhinge (midhinge) | Derived |
| 13 | Robust CV (robust_cv) | Derived |
| 14 | XSD type (xsd_type) | Derived |
| 15 | Kurtosis (kurtosis) | --advanced |
| 16 | Bimodality Coefficient (bimodality_coefficient) | --advanced |
| 17 | Jarque-Bera test (jarque_bera + jarque_bera_pvalue) | --advanced (emits 2 columns) |
| 18 | Gini Coefficient (gini_coefficient) | --advanced |
| 19 | Atkinson Index (atkinson_index_(<ε>), e.g. atkinson_index_(1) with the default --epsilon 1.0) | --advanced --epsilon |
| 20 | Theil Index (theil_index) | --advanced |
| 21 | Mean Absolute Deviation from mean (mean_ad) | --advanced |
| 22 | Shannon Entropy (shannon_entropy) | --advanced |
| 23 | Normalized Entropy (normalized_entropy) | --advanced (when cardinality is present) |
| 24 | Simpson's Diversity Index (simpsons_diversity_index) | --advanced |
| 25 | Winsorized Mean (winsorized_mean + 5 companion columns) and Trimmed Mean (trimmed_mean + 5 companion columns) | Robust (counted as one measure pair, emits 12 columns) |

Outlier measures (24) — see Outlier Statistics:

| # | Group | Identifiers |
| --- | --- | --- |
| 1–7 | Outlier counts | outliers_extreme_lower_cnt, outliers_mild_lower_cnt, outliers_normal_cnt, outliers_mild_upper_cnt, outliers_extreme_upper_cnt, outliers_total_cnt, outliers_percentage |
| 8–13 | Outlier descriptive | outliers_mean, non_outliers_mean, outliers_to_normal_mean_ratio, outliers_min, outliers_max, outliers_range |
| 14–20 | Outlier variance / spread | outliers_stddev, outliers_variance, non_outliers_stddev, non_outliers_variance, outliers_cv, non_outliers_cv, outliers_normal_stddev_ratio |
| 21–22 | Outlier impact | outlier_impact, outlier_impact_ratio |
| 23–24 | Outlier boundary | lower_outer_fence_zscore, upper_outer_fence_zscore |

Bivariate measures (6, written to <FILESTEM>.stats.bivariate.csv under --bivariate) — see Bivariate Statistics:

| # | Measure |
| --- | --- |
| 1 | Pearson's correlation (pearson_correlation) |
| 2 | Spearman's rank correlation (spearman_correlation) |
| 3 | Kendall's tau (kendall_tau) |
| 4 | Covariance (covariance_sample + covariance_population; counted as one measure, emits 2 columns) |
| 5 | Mutual Information (mutual_information) |
| 6 | Normalized Mutual Information (normalized_mutual_information) |

Total: 25 + 24 + 6 = 55 statistical measures. Note that several measures expand into more than one output column (e.g. Jarque-Bera → 2 columns, Winsorized/Trimmed Means → 12 columns combined, Covariance → 2 columns), so the actual column count in a <FILESTEM>.stats.csv extended by moarstats --advanced plus its bivariate sidecar is higher than 55.

Derived Statistics

These statistics are computed directly from existing stats columns without scanning the original CSV file. They require specific base statistics to be present in the stats CSV.

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| pearson_skewness | Variable | Pearson's Second Skewness Coefficient. Measures asymmetry of the distribution. Positive values indicate right skew, negative values indicate left skew. | 3 * (mean - median) / stddev. Requires: mean, median (or q2_median), stddev. Returns None if stddev is zero. See: Skewness |
| range_stddev_ratio | Variable | Range to Standard Deviation Ratio. Normalizes the spread of data. Higher values indicate more extreme outliers relative to the variability. | range / stddev. Requires: range, stddev. Returns None if stddev is zero. |
| quartile_coefficient_dispersion | Variable | Quartile Coefficient of Dispersion. Measures relative variability using quartiles. Useful for comparing dispersion across different scales. | (Q3 - Q1) / (Q3 + Q1). Requires: q1, q3. Returns None if Q1 >= Q3 or if denominator is zero. See: Quartile Coefficient of Dispersion |
| mode_zscore | Variable | Z-Score of Mode. Indicates how typical the mode is relative to the distribution. Values near 0 suggest the mode is near the mean. | (mode - mean) / stddev. Requires: mode, mean, stddev. If multiple modes exist, uses the first mode. Returns None if stddev is zero. |
| relative_standard_error | Variable | Relative Standard Error. Measures precision of the mean estimate relative to its magnitude. Lower values indicate more reliable estimates. | sem / mean. Requires: sem, mean. Returns None if mean is zero. |
| min_zscore | Variable | Z-Score of Min. Shows how extreme the minimum value is. Large negative values indicate outliers or heavy left tail. | (min - mean) / stddev. Requires: min, mean, stddev. Returns None if stddev is zero. |
| max_zscore | Variable | Z-Score of Max. Shows how extreme the maximum value is. Large positive values indicate outliers or heavy right tail. | (max - mean) / stddev. Requires: max, mean, stddev. Returns None if stddev is zero. |
| median_mean_ratio | Variable | Median-to-Mean Ratio. Indicates skewness direction. Ratio < 1 suggests right skew, > 1 suggests left skew, = 1 suggests symmetry. | median / mean. Requires: median (or q2_median), mean. Returns None if mean is zero. |
| iqr_range_ratio | Variable | IQR-to-Range Ratio. Measures concentration of data. Higher values (closer to 1) indicate more data concentrated in the middle 50%. | iqr / range. Requires: iqr, range. Returns None if range is zero. |
| mad_stddev_ratio | Variable | MAD-to-StdDev Ratio. Compares robust vs non-robust spread measures. Lower values suggest that outliers are inflating stddev, since outliers affect stddev far more than MAD. | mad / stddev. Requires: mad, stddev. Returns None if stddev is zero. |
| trimean | Variable | Tukey's Trimean. Robust estimator of central tendency combining median with the midhinge. More robust than mean, more efficient than median alone. | (Q1 + 2*median + Q3) / 4. Requires: q1, median (or q2_median), q3. See: Trimean |
| midhinge | Variable | Midhinge. Midpoint of the middle 50% of data. A robust central tendency measure that complements the mean and median. | (Q1 + Q3) / 2. Requires: q1, q3. See: Midhinge |
| robust_cv | Variable | Robust Coefficient of Variation. Non-negative, outlier-resistant alternative to CV using MAD and the magnitude of the median instead of stddev and mean. | MAD / abs(median). Requires: mad, median (or q2_median). Returns None if median is zero. See: Robust measures of scale |
| xsd_type | Variable | Inferred W3C XML Schema datatype. Infers the most specific XSD type based on field type and min/max values. Works for all field types. | Computed from type, min, and max columns. For Integer types, refines to most specific type (e.g., byte, short, int, long, unsignedByte, unsignedShort, unsignedInt, unsignedLong, positiveInteger, nonNegativeInteger, negativeInteger, nonPositiveInteger, or integer) based on min/max ranges. Also detects Gregorian date types (gYear, gYearMonth, gMonthDay, gDay, gMonth) with confidence markers (? = more confident from thorough scan, ?? = less confident from quick scan). For other types: Float → decimal, String → string, Date → date, DateTime → dateTime, Boolean → boolean, NULL → empty string. If min/max are not available for Integer types, defaults to integer. See: XML Schema Part 2: Datatypes |
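
To make the formulas in the table concrete, here is a minimal Python sketch that derives several of these columns from one row of base statistics. The function names and the dictionary layout are assumptions made for illustration (the real implementation lives in src/cmd/moarstats.rs and is written in Rust); the formulas and the "returns None" rules are the ones stated in the table.

```python
from typing import Optional


def safe_div(num: float, den: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero."""
    return None if den == 0 else num / den


def derived_stats(s: dict) -> dict:
    """Compute a subset of the derived statistics from one stats row `s`."""
    q1, median, q3 = s["q1"], s["median"], s["q3"]
    mean, stddev = s["mean"], s["stddev"]
    return {
        "pearson_skewness": safe_div(3 * (mean - median), stddev),
        "range_stddev_ratio": safe_div(s["range"], stddev),
        # None when Q1 >= Q3 or when Q3 + Q1 is zero
        "quartile_coefficient_dispersion": (
            None if q1 >= q3 else safe_div(q3 - q1, q3 + q1)
        ),
        "relative_standard_error": safe_div(s["sem"], mean),
        "min_zscore": safe_div(s["min"] - mean, stddev),
        "max_zscore": safe_div(s["max"] - mean, stddev),
        "median_mean_ratio": safe_div(median, mean),
        "iqr_range_ratio": safe_div(s["iqr"], s["range"]),
        "mad_stddev_ratio": safe_div(s["mad"], stddev),
        "trimean": (q1 + 2 * median + q3) / 4,
        "midhinge": (q1 + q3) / 2,
        "robust_cv": safe_div(s["mad"], abs(median)),
    }


# A made-up stats row, used only to exercise the formulas.
row = {"q1": 2.0, "median": 4.0, "q3": 6.0, "mean": 5.0, "stddev": 2.0,
       "min": 1.0, "max": 11.0, "range": 10.0, "sem": 0.5, "iqr": 4.0, "mad": 1.0}
print(derived_stats(row)["pearson_skewness"])  # 1.5
```

None of these need a second pass over the CSV, which is why the derived group is cheap compared with the advanced and outlier groups.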

Advanced Statistics

These statistics require the --advanced flag and reading the entire CSV file to collect all values for computation. They are computationally expensive.

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| kurtosis | Variable | Excess kurtosis. Measures the "tailedness" of the distribution. Positive values indicate heavy tails, negative values indicate light tails. Values near 0 indicate tails similar to those of a normal distribution. | Computed from all values in the column. Uses precalculated mean and variance from baseline stats for efficiency. Requires: mean, variance (or stddev). See: Kurtosis |
| bimodality_coefficient | Variable | Bimodality Coefficient. Measures whether a distribution has two modes (peaks) or is unimodal. BC < 0.555 indicates unimodal, BC >= 0.555 indicates bimodal/multimodal. | Computed as (skewness² + 1) / (kurtosis + 3). Requires: skewness (from base stats) and kurtosis (from --advanced flag). See: Bimodality |
| jarque_bera | Variable | Jarque-Bera test statistic. Standard test for normality using skewness and kurtosis. Higher values indicate greater departure from normality. | Computed as (n/6) * (S² + K²/4) where S is skewness and K is excess kurtosis. Requires: skewness (from base stats), kurtosis (from --advanced flag), and sample size n (from n_positive + n_negative + n_zero). See: Jarque-Bera test |
| jarque_bera_pvalue | Variable | P-value for the Jarque-Bera test. Low values (< 0.05) indicate the data is NOT normally distributed. | Computed from the chi-squared distribution with 2 degrees of freedom: p = e^(-JB/2). Requires: jarque_bera. |
| gini_coefficient | Variable | Gini Coefficient. Measures inequality/dispersion in the distribution. Values range from 0 (perfect equality) to 1 (maximum inequality). | Computed from all values in the column. Uses precalculated sum from baseline stats for efficiency. Requires: sum. See: Gini Coefficient |
| atkinson_index_(<ε>) | Variable | Atkinson Index. Measures inequality in the distribution with a sensitivity parameter. The column name interpolates the epsilon value (e.g. atkinson_index_(1) with the default --epsilon 1.0). Values range from 0 (perfect equality) to 1 (maximum inequality). The Atkinson Index is a more general form of the Gini coefficient that allows for different sensitivity to inequality. | Computed from all values in the column. Uses precalculated mean from baseline stats for efficiency. The epsilon (ε) parameter controls sensitivity to inequality (configurable via --epsilon, default: 1.0). Higher epsilon values indicate greater sensitivity to inequality. Requires: mean. See: Atkinson Index |
| theil_index | Variable | Theil Index (Generalized Entropy GE(1)). Measures inequality/concentration in the distribution. Unlike Gini, it is decomposable into within-group and between-group components. Only computed for positive values. | Computed as (1/n) * Σ((x_i / mean) * ln(x_i / mean)) for positive values. Computes mean from positive values only (not the overall precalculated mean). Requires positive values in the column. See: Theil Index |
| mean_ad | Variable | Mean Absolute Deviation from mean. Average absolute distance of values from the arithmetic mean. Less robust than MAD (which uses median) but more statistically efficient. | Computed as (1/n) * Σ abs(x_i - mean). |
| shannon_entropy | Variable | Shannon Entropy. Measures the information content/uncertainty in the distribution. Higher values indicate more diversity, lower values indicate more concentration. Values range from 0 (all values identical) to log2(n) where n is the number of unique values. | Computed using the frequency command with --limit 0 to collect all frequencies, then calculates: H(X) = -Σ p_i * log2(p_i) where p_i is the probability of value i. Works for all field types (not just numeric). For all-unique fields, returns log2(n). See: Entropy (Information Theory) |
| normalized_entropy | Variable | Normalized Entropy. Normalized version of Shannon Entropy scaled to [0, 1]. Values range from 0 (all values identical) to 1 (all values equally distributed). | Computed as shannon_entropy / log2(cardinality). Requires: shannon_entropy (from --advanced flag) and cardinality (from base stats). If cardinality is 0 or 1, returns 0. |
| simpsons_diversity_index | Variable | Simpson's Diversity Index. Probability that two randomly chosen values are different. More intuitive than entropy for many users. Ranges from 0 (all identical) to 1 (all unique). | Computed as 1 - Σ(p_i²) where p_i are value proportions from frequency data. Computed alongside Shannon Entropy. Works for all field types. For all-unique fields, returns 1 - 1/n. See: Simpson's Diversity Index |
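
The closed-form entries in the table can be checked with a small Python sketch. The function names are hypothetical, and the sketch counts values in memory with `collections.Counter` instead of calling the frequency command; the formulas are the ones given in the table.

```python
import math
from collections import Counter


def jarque_bera(n: int, skewness: float, excess_kurtosis: float):
    """Return (JB, p-value) with JB = (n/6) * (S^2 + K^2/4) and p = e^(-JB/2)."""
    jb = (n / 6) * (skewness**2 + excess_kurtosis**2 / 4)
    return jb, math.exp(-jb / 2)


def bimodality_coefficient(skewness: float, excess_kurtosis: float) -> float:
    """(skewness^2 + 1) / (kurtosis + 3), using excess kurtosis."""
    return (skewness**2 + 1) / (excess_kurtosis + 3)


def diversity(values):
    """Return (shannon_entropy, normalized_entropy, simpsons_diversity_index)."""
    counts = Counter(values)
    n = len(values)
    probs = [c / n for c in counts.values()]
    entropy = sum(-p * math.log2(p) for p in probs)
    cardinality = len(counts)
    # Cardinality 0 or 1 yields 0, matching the rule in the table.
    normalized = 0.0 if cardinality <= 1 else entropy / math.log2(cardinality)
    simpson = 1 - sum(p * p for p in probs)
    return entropy, normalized, simpson


print(diversity(["a", "a", "b", "b"]))  # (1.0, 1.0, 0.5)
```

A uniform distribution has zero skewness and an excess kurtosis of -1.2, which gives a bimodality coefficient of 1/1.8 ≈ 0.5556; that is where the 0.555 cut-off in the table comes from.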

Bivariate Statistics

These statistics examine relationships between pairs of columns in a dataset. They are computed when the --bivariate flag is used and require an indexed CSV file (index will be auto-created if missing). Bivariate statistics are output to a separate file: <FILESTEM>.stats.bivariate.csv.

Note: Bivariate statistics require reading the entire CSV file and are computationally expensive. For large files (>= 10k records), parallel chunked processing is used when an index is available. For smaller files or when no index exists, sequential processing is used.

Performance Optimizations:

  • Date parsing cache to avoid re-parsing same date strings
  • String interning to reduce allocations for repeated values
  • Batch string conversions to process multiple field pairs efficiently
  • Early termination for zero-variance fields (skip all correlation computations)
  • Streaming algorithms (Welford's online) for Pearson correlation and covariance
  • Lazy value collection (only store values if needed for Spearman/Kendall)

Multi-Dataset Bivariate Statistics: When using --join-inputs, multiple datasets can be joined internally before computing relationships. This allows analyzing relationships across datasets that share common join keys. The joined dataset is automatically indexed before bivariate statistics computation. Output file: <FILESTEM>.stats.bivariate.joined.csv.

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| pearson_correlation | Pairwise | Pearson product-moment correlation coefficient. Measures linear correlation between two numeric/date fields. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no linear correlation. | Computed using Welford's online algorithm for efficient streaming computation across chunks. Requires both fields to be numeric or date types. Formula: covariance / (stddev_x * stddev_y). See: Pearson Correlation |
| spearman_correlation | Pairwise | Spearman's rank correlation coefficient. Measures monotonic relationship between two numeric/date fields (not just linear). Values range from -1 to +1. More robust to outliers than Pearson correlation. | Computed by ranking both fields and then computing Pearson correlation on the ranks. Handles ties by averaging ranks. Requires both fields to be numeric or date types. See: Spearman's Rank Correlation |
| kendall_tau | Pairwise | Kendall's tau rank correlation coefficient. Measures ordinal association between two numeric/date fields. Values range from -1 to +1. More robust to outliers and handles ties better than Spearman. | Computed by counting concordant and discordant pairs using efficient O(n log n) merge sort algorithm. Formula accounts for ties in both variables. Requires both fields to be numeric or date types. See: Kendall's Tau |
| covariance_sample | Pairwise | Sample covariance. Measures how two numeric/date fields vary together. Positive values indicate positive relationship, negative values indicate inverse relationship. | Computed using Welford's online algorithm. Formula: sum((x - mean_x) * (y - mean_y)) / (n - 1). Requires both fields to be numeric or date types. |
| covariance_population | Pairwise | Population covariance. Same as sample covariance but uses population formula (divides by n instead of n-1). | Computed using Welford's online algorithm. Formula: sum((x - mean_x) * (y - mean_y)) / n. Requires both fields to be numeric or date types. |
| mutual_information | Pairwise | Mutual Information. Measures the amount of information obtained about one field by observing another. Values range from 0 (independent) to positive infinity. Works for all field types (numeric, date, string). | Computed from joint and marginal probability distributions. Formula: MI(X,Y) = sum(p(x,y) * log2(p(x,y) / (p(x) * p(y)))). Higher values indicate stronger relationship. Can be expensive for high-cardinality fields (use --cardinality-threshold to skip). See: Mutual Information |
| normalized_mutual_information | Pairwise | Normalized Mutual Information. Normalized version of mutual information, scaled by the geometric mean of individual entropies. Values range from 0 (independent) to 1 (perfectly dependent). | Computed as MI(X,Y) / sqrt(H(X) * H(Y)) where H(X) and H(Y) are Shannon entropies of individual fields. Requires mutual information computation. See: Normalized Mutual Information |
| n_pairs | Pairwise | Number of valid pairs used in computation. Indicates how many non-null value pairs were available for computing the relationship statistics. | Count of records where both fields have non-empty values. |
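
The mutual information rows can be illustrated with a short Python sketch that builds the joint and marginal distributions from value counts. The function names are hypothetical; the treatment of empty strings as missing follows the n_pairs description, and returning None for NMI when either entropy is zero is an assumption of this sketch, not documented behavior.

```python
import math
from collections import Counter


def entropy(counts: Counter, n: int) -> float:
    """Shannon entropy in bits from a table of counts."""
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Return (MI, NMI, n_pairs) for two equally long sequences of field values."""
    # Keep only records where both fields are non-empty (this is n_pairs).
    pairs = [(x, y) for x, y in zip(xs, ys) if x != "" and y != ""]
    n = len(pairs)
    if n == 0:
        return None, None, 0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
    denom = math.sqrt(entropy(px, n) * entropy(py, n))
    nmi = None if denom == 0 else mi / denom  # assumption: undefined when H is 0
    return mi, nmi, n


print(mutual_information(["a", "a", "b", "b"], ["1", "1", "2", "2"]))
```

In the example, each field fully determines the other, so MI equals the one bit of entropy in either field and NMI is 1. The joint table grows with the number of distinct value pairs, which is why a cardinality threshold is useful for skipping high-cardinality fields.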

Configuration Options:

  • --bivariate-stats: Select specific statistics (pearson, spearman, kendall, covariance, mi, nmi) or use "all" or "fast" (pearson + covariance). Default: fast.
  • --cardinality-threshold: Skip mutual information for field pairs where either field exceeds cardinality threshold (default: 1,000,000)
  • --join-inputs: Join multiple datasets before computing bivariate statistics
  • --join-keys: Specify join keys for each dataset
  • --join-type: Specify join type (inner, left, right, full; default: inner)

Robust Statistics (Winsorized & Trimmed Means)

These statistics require scanning the original CSV file. They provide robust alternatives to the standard mean by handling extreme values differently.

Winsorized Mean: Replaces values below/above thresholds with threshold values, then computes mean. All values are included in the calculation, but extreme values are capped at thresholds.

Trimmed Mean: Excludes values outside thresholds, then computes mean. Only values within thresholds are included in the calculation.

Threshold Options:

  • Default: Uses Q1 and Q3 as thresholds (25% winsorization/trimming)
  • With --use-percentiles: Uses configurable percentiles (default: 5th/95th) as thresholds via --pct-thresholds

Requirements:

  • Default mode: Requires q1 and q3 in baseline stats
  • Percentile mode: Requires percentiles in baseline stats and --use-percentiles flag

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| winsorized_mean_25pct | Variable | Winsorized mean using Q1/Q3 thresholds (25% winsorization). | All values are included, but values below Q1 are set to Q1 and values above Q3 are set to Q3, then mean is computed. For dates/datetimes, returned in RFC3339 format. |
| winsorized_mean_5pct | Variable | Winsorized mean using percentile thresholds (5th/95th percentiles). | Only computed when --use-percentiles is set. Column name varies based on --pct-thresholds (e.g., winsorized_mean_10pct for 10th/90th percentiles). |
| winsorized_stddev | Variable | Standard deviation of winsorized values. | Sample standard deviation computed from winsorized values. |
| winsorized_variance | Variable | Variance of winsorized values. | Sample variance computed from winsorized values. |
| winsorized_cv | Variable | Coefficient of variation for winsorized values. | winsorized_stddev / winsorized_mean. Returns None if mean is zero. |
| winsorized_range | Variable | Range of winsorized values. | max_winsorized - min_winsorized. |
| winsorized_stddev_ratio | Variable | Ratio of winsorized stddev to overall stddev. | winsorized_stddev / stddev. Compares robust vs non-robust spread. Returns None if overall stddev is zero. |
| trimmed_mean_25pct | Variable | Trimmed mean using Q1/Q3 thresholds (25% trimming). | Only values within Q1 and Q3 are included in the mean calculation. For dates/datetimes, returned in RFC3339 format. |
| trimmed_mean_5pct | Variable | Trimmed mean using percentile thresholds (5th/95th percentiles). | Only computed when --use-percentiles is set. Column name varies based on --pct-thresholds. |
| trimmed_stddev | Variable | Standard deviation of trimmed values. | Sample standard deviation computed from trimmed values (only values within thresholds). |
| trimmed_variance | Variable | Variance of trimmed values. | Sample variance computed from trimmed values. |
| trimmed_cv | Variable | Coefficient of variation for trimmed values. | trimmed_stddev / trimmed_mean. Returns None if mean is zero. |
| trimmed_range | Variable | Range of trimmed values. | max_trimmed - min_trimmed. |
| trimmed_stddev_ratio | Variable | Ratio of trimmed stddev to overall stddev. | trimmed_stddev / stddev. Compares robust vs non-robust spread. Returns None if overall stddev is zero. |
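
The difference between the two robust means is easiest to see side by side. The following Python sketch is illustrative only: the function names are hypothetical, the thresholds are passed in directly instead of being read from the stats CSV, and treating values equal to a threshold as "within" it is an assumption.

```python
from statistics import mean


def winsorized_mean(values, lower, upper):
    """Cap every value to [lower, upper], then average all of them."""
    return mean([min(max(v, lower), upper) for v in values])


def trimmed_mean(values, lower, upper):
    """Average only the values inside [lower, upper] (bounds assumed inclusive)."""
    kept = [v for v in values if lower <= v <= upper]
    return mean(kept) if kept else None


data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mean(data))                         # 22.0, pulled up by the outlier
print(winsorized_mean(data, 2.0, 4.0))    # 3.0, from [2, 2, 3, 4, 4]
print(trimmed_mean(data, 2.0, 4.0))       # 3.0, from [2, 3, 4]
```

Winsorizing keeps the sample size at five while trimming reduces it to three, which is why the companion stddev and variance columns can differ between the two even when the means agree.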

See: Winsorized Mean, Truncated Mean

Outlier Statistics

These statistics require scanning the original CSV file and comparing each value against fence thresholds. Fences are computed using the IQR method: inner fences at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, outer fences at Q1 - 3.0 × IQR and Q3 + 3.0 × IQR.

Requirements:

  • Requires --quartiles or --everything in baseline stats (to compute fences)
  • Requires: lower_outer_fence, lower_inner_fence, upper_inner_fence, upper_outer_fence in baseline stats

Outlier Classification:

  • Extreme Lower: Values below the lower outer fence
  • Mild Lower: Values between lower outer and inner fences
  • Normal: Values between inner fences (non-outliers)
  • Mild Upper: Values between upper inner and outer fences
  • Extreme Upper: Values above the upper outer fence
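
The fence computation and the five-way classification above can be sketched in a few lines of Python. The function names are hypothetical, the fences are recomputed from Q1 and Q3 here instead of being read from the baseline stats columns, and counting a value that sits exactly on a fence as belonging to the inner side is an assumption.

```python
from collections import Counter


def fences(q1: float, q3: float) -> dict:
    """Inner fences at 1.5 x IQR and outer fences at 3.0 x IQR from the quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer_fence": q1 - 3.0 * iqr,
        "lower_inner_fence": q1 - 1.5 * iqr,
        "upper_inner_fence": q3 + 1.5 * iqr,
        "upper_outer_fence": q3 + 3.0 * iqr,
    }


def classify(value: float, f: dict) -> str:
    """Assign a value to one of the five classes (boundary handling assumed)."""
    if value < f["lower_outer_fence"]:
        return "extreme_lower"
    if value < f["lower_inner_fence"]:
        return "mild_lower"
    if value <= f["upper_inner_fence"]:
        return "normal"
    if value <= f["upper_outer_fence"]:
        return "mild_upper"
    return "extreme_upper"


def outlier_counts(values, q1: float, q3: float):
    """Return (per-class counts, total outliers, outlier percentage)."""
    f = fences(q1, q3)
    counts = Counter(classify(v, f) for v in values)
    total = sum(n for label, n in counts.items() if label != "normal")
    pct = 100 * total / len(values) if values else None
    return counts, total, pct
```

With Q1 = 10 and Q3 = 20 the IQR is 10, so the fences fall at -20, -5, 35 and 50; a value of 40 is then a mild upper outlier and 60 an extreme one.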

See: Outlier

Outlier Counts

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| outliers_extreme_lower_cnt | Variable | Count of values below the lower outer fence. | Count of extreme lower outliers. |
| outliers_mild_lower_cnt | Variable | Count of values between lower outer and inner fences. | Count of mild lower outliers. |
| outliers_normal_cnt | Variable | Count of values between inner fences (non-outliers). | Count of normal (non-outlier) values. |
| outliers_mild_upper_cnt | Variable | Count of values between upper inner and outer fences. | Count of mild upper outliers. |
| outliers_extreme_upper_cnt | Variable | Count of values above the upper outer fence. | Count of extreme upper outliers. |
| outliers_total_cnt | Variable | Total count of all outliers (sum of extreme and mild outliers). | Sum of all outlier counts (extreme + mild, both lower and upper). |
| outliers_percentage | Variable | Percentage of values that are outliers. | (outliers_total_cnt / total_count) * 100. |

Outlier Descriptive Statistics

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| outliers_mean | Variable | Mean value of outliers. | Mean of all outlier values (extreme and mild, lower and upper). For dates/datetimes, returned in RFC3339 format. |
| non_outliers_mean | Variable | Mean value of non-outliers. | Mean of all normal (non-outlier) values. For dates/datetimes, returned in RFC3339 format. |
| outliers_to_normal_mean_ratio | Variable | Ratio of outlier mean to non-outlier mean. | outliers_mean / non_outliers_mean. Returns None if non_outliers_mean is zero. |
| outliers_min | Variable | Minimum value among outliers. | Minimum value across all outliers. For dates/datetimes, returned in RFC3339 format. |
| outliers_max | Variable | Maximum value among outliers. | Maximum value across all outliers. For dates/datetimes, returned in RFC3339 format. |
| outliers_range | Variable | Range of outlier values. | outliers_max - outliers_min. |

Outlier Variance/Spread Statistics

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| outliers_stddev | Variable | Standard deviation of outlier values. | Sample standard deviation computed from outlier values. Requires at least 2 outliers. |
| outliers_variance | Variable | Variance of outlier values. | Sample variance computed from outlier values. Requires at least 2 outliers. |
| non_outliers_stddev | Variable | Standard deviation of non-outlier values. | Sample standard deviation computed from normal (non-outlier) values. Requires at least 2 non-outliers. |
| non_outliers_variance | Variable | Variance of non-outlier values. | Sample variance computed from normal (non-outlier) values. Requires at least 2 non-outliers. |
| outliers_cv | Variable | Coefficient of variation for outliers. | outliers_stddev / outliers_mean. Returns None if outliers_mean is zero or if stddev cannot be computed. |
| non_outliers_cv | Variable | Coefficient of variation for non-outliers. | non_outliers_stddev / non_outliers_mean. Returns None if non_outliers_mean is zero or if stddev cannot be computed. |
| outliers_normal_stddev_ratio | Variable | Ratio of outlier stddev to non-outlier stddev. | outliers_stddev / non_outliers_stddev. Compares spread of outliers vs non-outliers. Returns None if non_outliers_stddev is zero or if either stddev cannot be computed. |

Outlier Impact Statistics

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| outlier_impact | Variable | Difference between overall mean and non-outlier mean. | overall_mean - non_outliers_mean. Measures how much outliers affect the overall mean. |
| outlier_impact_ratio | Variable | Relative impact of outliers. | outlier_impact / non_outliers_mean. Normalized measure of outlier impact. Returns None if non_outliers_mean is zero. |
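
A short worked example may help in reading the two impact columns. The Python function below is a hypothetical helper that applies the two formulas from the table to means that have already been computed.

```python
def outlier_impact(overall_mean: float, non_outliers_mean: float):
    """Return (outlier_impact, outlier_impact_ratio) as defined in the table."""
    impact = overall_mean - non_outliers_mean
    ratio = None if non_outliers_mean == 0 else impact / non_outliers_mean
    return impact, ratio


# Made-up column: [1, 2, 3, 4, 100], where 100 is the only outlier.
overall = (1 + 2 + 3 + 4 + 100) / 5   # 22.0
normal = (1 + 2 + 3 + 4) / 4          # 2.5
print(outlier_impact(overall, normal))
```

Here the single outlier shifts the mean by 19.5, which is 7.8 times the non-outlier mean, so a large ratio signals that the overall mean says little about typical values.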

Outlier Boundary Statistics

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| lower_outer_fence_zscore | Variable | Z-score of the lower outer fence boundary. | (lower_outer_fence - mean) / stddev. Shows how extreme the lower outlier boundary is relative to the distribution. Returns None if stddev is zero. |
| upper_outer_fence_zscore | Variable | Z-score of the upper outer fence boundary. | (upper_outer_fence - mean) / stddev. Shows how extreme the upper outlier boundary is relative to the distribution. Returns None if stddev is zero. |

pragmastat

The pragmastat command computes robust, median-of-pairwise statistics using the Pragmastat library (v12.1.0). Designed for messy, heavy-tailed, or outlier-prone data where mean/stddev can mislead.

Sourced from src/cmd/pragmastat.rs.

Key Features:

  • Only finite numeric values are used; non-numeric/NaN/Inf values are ignored
  • Date/DateTime columns are supported when a stats cache is available (run qsv stats -E --infer-dates --stats-jsonl first); dates are converted to epoch milliseconds for analysis, then center/bounds are formatted as dates and spread/shift as days
  • Each column is treated as its own sample (two-sample compares columns, not rows)
  • Non-numeric columns appear with n=0 and empty estimator cells
  • Loads all numeric values into memory

Modes

pragmastat has five mutually exclusive output modes: a default and four selected by a mode flag. The default (no mode flag) extends the existing stats cache the way moarstats does; the four flagged modes always produce a standalone CSV.

| Mode flag | Behavior | Output |
| --- | --- | --- |
| (none) | Default. Appends 7 ps_* columns to the existing .stats.csv cache file. If no cache exists, runs stats first using --stats-options. | Extended stats CSV |
| --standalone | One-sample point/bound estimates as a fresh CSV, without touching the stats cache. | Standalone CSV |
| -t / --twosample | Two-sample estimators for every unordered column pair. | Standalone CSV |
| --compare1 <spec> | One-sample confirmatory analysis: tests center / spread against user-defined thresholds. | Standalone CSV |
| --compare2 <spec> | Two-sample confirmatory analysis: tests shift / ratio / disparity against user-defined thresholds. | Standalone CSV |

Default Mode (Stats Cache Append)

Adds 7 ps_* columns to each row of the existing stats CSV (the same row-per-column layout stats and moarstats use). If no stats cache is present, one is generated first using --stats-options (default: --infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl — note: no --percentiles, unlike moarstats).

ps_* columns that already exist in the cache are left untouched unless --force is set.

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| ps_n | Variable | Count of values used by pragmastat estimators. | Count after filtering: finite numerics for numeric columns, or parsed epoch-ms values for Date/DateTime columns (when supported via the stats cache). Non-numeric / NaN / Inf / unparsable values are excluded. |
| ps_center | Variable | Hodges-Lehmann estimator: robust location. | Median of pairwise averages. Tolerates up to 29% corrupted data. |
| ps_spread | Variable | Shamos estimator: robust dispersion. | Median of pairwise absolute differences. Same units as data; also tolerates up to 29% corrupted data. |
| ps_center_lower | Variable | Lower confidence bound for ps_center. | Exact under weak symmetry, with error rate = misrate. |
| ps_center_upper | Variable | Upper confidence bound for ps_center. | Exact under weak symmetry, with error rate = misrate. |
| ps_spread_lower | Variable | Lower confidence bound for ps_spread. | Randomized (bootstrap); error rate = misrate. |
| ps_spread_upper | Variable | Upper confidence bound for ps_spread. | Randomized (bootstrap); error rate = misrate. |
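
The two point estimators in the table can be written down directly from their definitions. The Python sketch below is a naive O(n²) illustration, not the Pragmastat library's algorithm; the function names are hypothetical, and including each value paired with itself in the center (Walsh averages) is an assumption.

```python
from itertools import combinations, combinations_with_replacement
from statistics import median


def center(xs):
    """Hodges-Lehmann location: median of pairwise averages (self-pairs assumed)."""
    if not xs:
        return None
    return median((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))


def spread(xs):
    """Shamos dispersion: median of pairwise absolute differences."""
    if len(xs) < 2:
        return None
    return median(abs(a - b) for a, b in combinations(xs, 2))


data = [1, 2, 3, 4, 100]
print(center(data))  # 3.0, versus a mean of 22
print(spread(data))  # 2.5
```

Because every estimate is a median over pairs, one wild value out of five barely moves either result. Materialising all pairs is only practical for small samples, so this form is for understanding the definitions, not for production use.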

Standalone Mode (--standalone)

One-sample mode that produces a fresh standalone CSV instead of extending the stats cache. (This is the legacy default behavior preserved behind a flag.)

Output columns: field, n, center, spread, center_lower, center_upper, spread_lower, spread_upper

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| field | Variable | Column name (or 1-based index if --no-headers). | From CSV header. |
| n | Variable | Count of finite numeric values. | Count after filtering non-numeric, NaN, Inf. |
| center | Variable | Hodges-Lehmann estimator: robust location. | Median of pairwise averages. Tolerates up to 29% corrupted data. Like the mean but stable with outliers. |
| spread | Variable | Shamos estimator: robust dispersion. | Median of pairwise absolute differences. Same units as data. Also tolerates up to 29% corrupted data. |
| center_lower | Variable | Lower confidence bound for center. | Exact under weak symmetry, with error rate = misrate. |
| center_upper | Variable | Upper confidence bound for center. | Exact under weak symmetry, with error rate = misrate. |
| spread_lower | Variable | Lower confidence bound for spread. | Randomized (bootstrap); error rate = misrate. |
| spread_upper | Variable | Upper confidence bound for spread. | Randomized (bootstrap); error rate = misrate. |

Two-Sample Mode (-t / --twosample)

Computes statistics for all unordered column pairs. Always produces a standalone CSV.

Output columns: field_x, field_y, n_x, n_y, shift, ratio, disparity, shift_lower, shift_upper, ratio_lower, ratio_upper, disparity_lower, disparity_upper

| Identifier | Level | Summary | Computation |
| --- | --- | --- | --- |
| field_x, field_y | Pairwise | Column names being compared. | From CSV header. |
| n_x, n_y | Pairwise | Counts of finite numeric values. | Per-column counts after filtering non-numeric/NaN/Inf. |
| shift | Pairwise | Hodges-Lehmann difference: robust location difference. | Median of pairwise differences between columns. Negative means first column tends to be lower. |
| ratio | Pairwise | Robust multiplicative ratio. | exp(shift(log x, log y)). Use for positive-valued quantities (latency, price, concentration). Requires all values > 0. Suppressed for Date/DateTime pairs (depends on the arbitrary 1970 epoch origin). |
| disparity | Pairwise | Robust effect size. | shift / (average spread of x and y). |
| shift_lower, shift_upper | Pairwise | Confidence bounds for shift. | Exact; error rate = misrate. If bounds exclude 0, the difference is reliable. With tied values the bounds may be conservative. |
| ratio_lower, ratio_upper | Pairwise | Confidence bounds for ratio. | Exact; error rate = misrate. If bounds exclude 1, the difference is reliable. Requires all values > 0. |
| disparity_lower, disparity_upper | Pairwise | Confidence bounds for disparity. | Randomized (Bonferroni combination); error rate = misrate. If bounds exclude 0, the disparity is reliable. |
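
The three two-sample point estimates follow directly from the formulas in the table. This Python sketch is a naive illustration with hypothetical function names; in particular, using a plain unweighted average of the two spreads for disparity is an assumption, and the library may weight by sample size.

```python
import math
from itertools import combinations, product
from statistics import median


def shift(xs, ys):
    """Median of all pairwise differences x - y."""
    return median(x - y for x, y in product(xs, ys))


def ratio(xs, ys):
    """exp(shift(log x, log y)); undefined unless every value is positive."""
    if any(v <= 0 for v in (*xs, *ys)):
        return None
    return math.exp(shift([math.log(x) for x in xs], [math.log(y) for y in ys]))


def spread(xs):
    """Median of pairwise absolute differences within one sample."""
    return median(abs(a - b) for a, b in combinations(xs, 2))


def disparity(xs, ys):
    """shift / average spread (unweighted average assumed for this sketch)."""
    avg_spread = (spread(xs) + spread(ys)) / 2
    return None if avg_spread == 0 else shift(xs, ys) / avg_spread


x, y = [1, 2, 3], [2, 4, 6]
print(shift(x, y))  # -2
```

For these samples the ratio comes out at about 0.5 (the first column is typically half the second) and the disparity at about -1.33, since the average spread is 1.5.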

Compare1 Mode (--compare1 <spec>)

One-sample confirmatory analysis. Tests one-sample estimates (center / spread) against user-supplied thresholds and renders a verdict per (column, threshold) pair. Always produces a standalone CSV.

Threshold format: comma-separated metric:value pairs, e.g. center:42.0 or center:42.0,spread:0.5. Valid metrics: center, spread.

Output columns: field, n, metric, threshold, estimate, lower, upper, verdict

| Identifier | Level | Summary |
| --- | --- | --- |
| field | Variable | Column name (or 1-based index if --no-headers). |
| n | Variable | Count of finite numeric values. |
| metric | Variable | The metric being tested (center or spread). |
| threshold | Variable | The user-supplied threshold from --compare1 metric:value. |
| estimate | Variable | Point estimate of the chosen metric for this column. |
| lower, upper | Variable | Confidence bounds for the estimate (error rate = misrate). |
| verdict | Variable | One of less (estimate statistically below threshold), greater (statistically above), or inconclusive (interval contains threshold). |

Incompatible with --no-bounds (the verdict requires bounds).
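
The verdict rule described in the table can be sketched in a few lines of Python. This is an illustration of the stated semantics, not qsv's code; how a threshold that sits exactly on a bound is treated is an assumption here (it is reported as inconclusive).

```python
def verdict(lower, upper, threshold):
    """Map a confidence interval and a threshold to a verdict string."""
    if upper < threshold:
        return "less"          # whole interval lies below the threshold
    if lower > threshold:
        return "greater"       # whole interval lies above the threshold
    return "inconclusive"      # interval contains the threshold


print(verdict(10.0, 12.0, 42.0))  # less
print(verdict(45.0, 50.0, 42.0))  # greater
print(verdict(40.0, 44.0, 42.0))  # inconclusive
```

Because the rule needs both bounds, it also shows why the mode cannot be combined with --no-bounds.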

Compare2 Mode (--compare2 <spec>)

Two-sample confirmatory analysis. Tests two-sample estimates (shift / ratio / disparity) against user-supplied thresholds and renders a verdict per (column pair, threshold). Always produces a standalone CSV.

Threshold format: comma-separated metric:value pairs, e.g. shift:0 or shift:0,disparity:0.8. Valid metrics: shift, ratio, disparity.
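
As an illustration of the threshold format, the following Python sketch parses a spec string into (metric, value) pairs. The function name and error messages are hypothetical and qsv's own parser may differ in details such as whitespace handling.

```python
def parse_spec(spec, valid_metrics):
    """Parse 'metric:value[,metric:value,...]' into a list of (metric, float) pairs."""
    pairs = []
    for item in spec.split(","):
        metric, sep, value = item.partition(":")
        metric = metric.strip()
        if not sep or metric not in valid_metrics:
            raise ValueError(f"invalid threshold item: {item!r}")
        pairs.append((metric, float(value)))
    return pairs


COMPARE2_METRICS = {"shift", "ratio", "disparity"}
print(parse_spec("shift:0,disparity:0.8", COMPARE2_METRICS))
```

The same helper covers --compare1 by passing the set of center and spread instead.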

Output columns: field_x, field_y, n_x, n_y, metric, threshold, estimate, lower, upper, verdict

| Identifier | Level | Summary |
| --- | --- | --- |
| field_x, field_y | Pairwise | Column names being compared. |
| n_x, n_y | Pairwise | Per-column counts of finite numeric values. |
| metric | Pairwise | The metric being tested (shift, ratio, or disparity). |
| threshold | Pairwise | The user-supplied threshold from --compare2 metric:value. |
| estimate | Pairwise | Point estimate of the chosen metric for this column pair. |
| lower, upper | Pairwise | Confidence bounds for the estimate (error rate = misrate). |
| verdict | Pairwise | One of less, greater, or inconclusive (same semantics as compare1). |

ratio rows are suppressed for Date/DateTime pairs (see Two-Sample Mode above). Incompatible with --no-bounds.

Options

| Option | Default | Description |
| --- | --- | --- |
| --twosample / -t | off | Compute two-sample estimators for all column pairs. Mutually exclusive with --compare1 / --compare2. |
| --compare1 <spec> | | One-sample confirmatory analysis. Format: metric:value[,metric:value,...]. Valid metrics: center, spread. Mutually exclusive with --twosample / --compare2. |
| --compare2 <spec> | | Two-sample confirmatory analysis. Format: metric:value[,metric:value,...]. Valid metrics: shift, ratio, disparity. Mutually exclusive with --twosample / --compare1. |
| --select <cols> / -s | all numeric columns (when stats cache fresh) | Column selection using qsv's column-selection syntax. Non-numeric columns appear with n=0. In two-sample mode, all pairs of selected columns are computed. |
| --misrate <n> / -m | 0.001 | Probability that bounds fail to contain the true parameter. Lower values produce wider bounds. Must be achievable for the given sample size. Use 1e-3 for everyday analysis or 1e-6 for critical decisions. |
| --standalone | off | Force one-sample mode to emit a standalone CSV instead of extending the stats cache. No effect with --twosample / --compare1 / --compare2 (which are always standalone). |
| --stats-options <arg> | --infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl | Options passed to the stats command when baseline stats need to be generated. Note: this default differs from moarstats by omitting --percentiles. |
| --round <n> | 4 | Round statistics to <n> decimal places. Uses Midpoint Nearest Even (Bankers Rounding). |
| --force | off | Force recomputing ps_* columns in the stats cache even if they already exist. |
| --subsample <N> | off | Partial Fisher-Yates shuffle keeping only N values per column before computing. ~100× speedup on large datasets while preserving statistical robustness. Recommended: 10,000–50,000 for exploratory analysis. Incompatible with the default cache-append mode (approximate results would be silently reused as if computed from the full dataset); must be combined with --standalone or one of the other non-cache modes. |
| --seed <N> | 42 (when --subsample is set) | Seed for reproducible subsampling. |
| --no-bounds | off | Skip confidence bound computation (~2× speedup) when only point estimates are needed. Incompatible with --compare1 / --compare2, and with the default cache-append mode (the cache would store empty bounds that a subsequent run would silently reuse); must be combined with --standalone. |
| --output <file> / -o | stdout | Write output to file instead of stdout. |
| --delimiter <c> / -d | , | Field delimiter for reading/writing CSV data. |
| --no-headers / -n | off | When set, the first row will not be treated as headers. |
| --jobs <arg> / -j | number of CPUs | The number of jobs to run in parallel. When not set, defaults to the number of CPUs detected. |
| --memcheck | off | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. Not valid for stdin. |
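
The rounding rule named for --round (Midpoint Nearest Even) can be illustrated with Python's decimal module. This is a sketch of the rule itself, not of qsv's Rust implementation; the helper name is illustrative, and values are passed as strings so that exact midpoints are representable.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value, places=4):
    """Round to `places` decimals; exact midpoints go to the even neighbour."""
    quantum = Decimal(1).scaleb(-places)  # places=4 -> Decimal("0.0001")
    return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_even("0.12345"))     # midpoint, rounds down to the even digit 4
print(round_half_even("0.12355"))     # midpoint, rounds up to the even digit 6
print(round_half_even("2.5", 0), round_half_even("3.5", 0))
```

Under this rule midpoints are not always rounded up, which avoids a systematic upward bias when many rounded values are later aggregated.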

Performance Characteristics

Algorithmic Complexity

All Pragmastat estimators avoid naïve O(n²) pairwise enumeration by using implicit‑matrix selection and binary‑search techniques (see pragmastat.dev/synopsis):

| Estimator(s) | Complexity | Technique |
| --- | --- | --- |
| center, center_bounds | O(n log n) | Monahan's implicit‑matrix selection + SignedRankMargin |
| spread, spread_bounds | O(n log n) | Monahan's selection for differences + disjoint‑pair sign‑test inversion |
| shift, shift_bounds, ratio, ratio_bounds | O((n+m) log L) | Value‑space binary search over pairwise differences; L = value range |
| disparity, disparity_bounds | O((n+m) log L + n log n + m log m) | Bonferroni split combining shift bounds + average spread bounds |

These are per‑column (one‑sample) or per‑pair (two‑sample) complexities. Randomization primitives (xoshiro256++) are O(1) per draw.

qsv Implementation Optimizations

  • Parallel computation — Columns (one‑sample) and column pairs (two‑sample) are processed in parallel via Rayon; controlled by --jobs.
  • Parallel indexed CSV reading — Files with ≥ 10,000 rows and a .csv.idx index are read in parallel chunks using a ThreadPool with crossbeam channels and deterministic seeking.
  • --subsample N — Partial Fisher‑Yates shuffle keeps only N values per column before computing, with deterministic per‑column seeding (--seed defaults to 42). Provides ~100× speedup on large datasets while preserving statistical robustness. Recommended: 10,000–50,000 for exploratory analysis.
  • --no-bounds — Skips confidence bound computation for ~2× speedup when only point estimates are needed.
  • Combined --subsample + --no-bounds — ~200× speedup for quick exploratory analysis on large datasets.
  • Stats cache integration — When a fresh .stats.csv.data.jsonl cache exists, non‑numeric columns are automatically filtered out before computation, and Date/DateTime type detection is read from the cache rather than re‑inferred.
  • Pre‑computed log arrays — In two‑sample mode, ln() transformations are computed once per column in parallel and shared across all pairs, avoiding redundant O(n) passes.
  • SIMD‑accelerated parsing — Uses simdutf8 for UTF‑8 validation, fast_float2 for float parsing, and simd_json (little‑endian) for cache deserialization.
  • Pre‑allocated buffers — Column vectors are sized to estimated row counts, and chunk buffers are pre‑allocated before parallel reads to minimize reallocations.
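
The partial Fisher-Yates shuffle behind --subsample can be sketched in Python as follows. It is a minimal illustration of the technique: only the first k positions are shuffled, so the cost is proportional to k rather than to the column length. The function name is illustrative, and Python's random.Random stands in for qsv's generator, so the selected values will not match qsv's output for the same seed.

```python
import random


def partial_fisher_yates(values, k, seed=42):
    """Return k values drawn without replacement via a partial shuffle."""
    rng = random.Random(seed)       # assumption: stand-in for qsv's RNG
    pool = list(values)             # copy so the caller's data is untouched
    n = len(pool)
    k = min(k, n)
    for i in range(k):
        j = rng.randrange(i, n)     # pick from the not-yet-fixed tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


column = list(range(1_000_000))
sample = partial_fisher_yates(column, 10_000)
print(len(sample), len(set(sample)))
```

Reusing the same seed reproduces the same subsample, which is the property the --seed option relies on.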

When Values Are Blank

Cells are empty (blank) when:

  • No numeric data (n=0): The column contains no finite numeric values
  • Positivity required: ratio, ratio_lower, and ratio_upper require all values > 0
  • Date/DateTime pairs: ratio is suppressed for --twosample and --compare2 because it depends on the arbitrary 1970 epoch origin and isn't meaningful for dates; shift, disparity, and their bounds remain populated
  • Sparsity required: spread, spread_lower, spread_upper, disparity, disparity_lower, and disparity_upper need real variability (not tie-dominant data)
  • Insufficient data for bounds: All bounds columns need enough data for the requested misrate; try a higher misrate or more data

See: Pragmastat manual (PDF), pragmastat.dev

frequency

The frequency command computes frequency distribution tables for CSV columns (exact by default, with an opt-in Frequent Items sketch for bounded-memory top-K), with support for multiple output formats, ranking strategies, and weighted frequencies.

Key Features:

  • Computes exact frequency counts by default; opt into a Misra-Gries heavy-hitters sketch via --sketch-method frequent_items for bounded-memory top-K on very-high-cardinality streams (see Frequent Items Sketch (Approximate Top-K))
  • Multiple ranking strategies for handling tied values
  • Weighted frequency support using a specified weight column
  • CSV and JSON/TOON output modes
  • Integration with stats cache for memory optimization
  • Memory-aware chunking for large datasets
  • Parallel processing support with indexing

Stats Cache Integration: When the stats cache exists (created by qsv stats --stats-jsonl), the frequency command uses it to:

  • Detect ID columns (cardinality == rowcount) and short-circuit frequency compilation
  • Pre-allocate appropriate hashmap capacity based on cardinality
  • Avoid building hashmaps for all-unique columns

Without stats cache, frequency will compute frequencies for ALL columns, even ID columns, which can use significant memory.

For more examples, see https://github.com/dathere/qsv/blob/master/tests/test_frequency.rs.

Frequency Table Output

In CSV output mode (default), the table is formatted as CSV data with the following columns:

| Column | Description |
| --- | --- |
| field | Column name (or 1-based index if --no-headers is used) |
| value | The value from the column |
| count | Frequency count (or weighted sum if --weight is used) |
| percentage | Percentage of total (count/total * 100) |
| rank | Rank based on count (1 = most frequent, or least frequent if --asc is used) |

Formatting Options:

  • --pct-dec-places <arg> — decimal places for percentage rounding (default: -5). When negative, the number of decimal places is automatically determined to the minimum needed to represent the percentage accurately, up to the absolute value of the negative number.
  • --no-trim — don't trim whitespace from values when computing frequencies. By default, leading and trailing whitespace is trimmed.
  • --vis-whitespace — visualize whitespace characters in the output using the same markers as stats (see Whitespace Visualization).
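
One way to read the negative --pct-dec-places behaviour is "round to at most |n| places, then drop trailing zeros". The Python sketch below follows that reading; it is an interpretation of the description above, not qsv's formatting code, and the function name is hypothetical.

```python
def format_pct(pct, places=-5):
    """Format a percentage; negative `places` means 'up to abs(places), minimal'."""
    if places >= 0:
        return f"{pct:.{places}f}"            # fixed number of decimals
    text = f"{pct:.{-places}f}"               # round to the maximum allowed
    if "." in text:
        text = text.rstrip("0").rstrip(".")   # keep only the digits needed
    return text


print(format_pct(50.0))       # 50
print(format_pct(12.5))       # 12.5
print(format_pct(100 / 3))    # 33.33333
print(format_pct(12.5, 2))    # 12.50
```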

Special Values:

  • <ALL_UNIQUE> (configurable via --all-unique-text): For ID columns detected via stats cache, indicates all values are unique. Count equals row count, percentage is 100%, rank is 0.
  • Other (N) (configurable via --other-text, default: Other): When --limit is used, remaining values are grouped into this category. N indicates the count of unique values grouped. Rank is 0. Use --no-other (alias for --other-text "<NONE>") to exclude the "Other" category entirely.
  • (NULL) (configurable via --null-text, default: (NULL)): Represents empty/missing values. Can be excluded with --no-nulls (alias for --null-text "<NONE>").

Limit Behavior:

  • --limit N (positive): Keep only top N most frequent values
  • --limit -N (negative): Keep only values with count >= N
  • --limit 0: No limit, return all values
  • --unq-limit N: For all-unique columns, limit to N sample values (default: 10)
  • --lmt-threshold N: Only apply limits when unique count >= N (default: 0 = always apply)
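
The three --limit cases can be sketched in Python on a small count table. This is an illustration only: it ignores --unq-limit and --lmt-threshold, and it assumes that values dropped by a negative limit are grouped into the "Other (N)" row in the same way as for a positive limit.

```python
def apply_limit(counts, limit, other_text="Other"):
    """Return (value, count) rows after applying a --limit style cutoff."""
    # Most frequent first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if limit > 0:
        kept = ranked[:limit]                             # top N values
    elif limit < 0:
        kept = [kv for kv in ranked if kv[1] >= -limit]   # count >= N
    else:
        kept = list(ranked)                               # 0 means no limit
    rest = ranked[len(kept):]
    if rest:
        kept.append((f"{other_text} ({len(rest)})", sum(c for _, c in rest)))
    return kept


counts = {"a": 5, "b": 3, "c": 3, "d": 1}
print(apply_limit(counts, 2))
print(apply_limit(counts, -3))
print(apply_limit(counts, 0))
```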

Sorting:

  • Default: Descending order by count (most frequent first)
  • --asc: Ascending order by count (least frequent first). Note: This also reverses ranking - least frequent values get rank 1.
  • --other-sorted: Include "Other" category in sorted order instead of at the end

Ranking Strategies

The --rank-strategy option controls how ranks are assigned when multiple values have the same count. See https://en.wikipedia.org/wiki/Ranking for more info.

| Strategy | Description | Example (counts: 4, 3, 3, 2) |
| --- | --- | --- |
| dense | Consecutive integers regardless of ties (1223 ranking) | 1, 2, 2, 3 |
| min | Tied items receive minimum rank position (1224 ranking) | 1, 2, 2, 4 |
| max | Tied items receive maximum rank position (1334 ranking) | 1, 3, 3, 4 |
| ordinal | Next rank is current rank plus 1 (1234 ranking) | 1, 2, 3, 4 |
| average | Tied items receive average of positions (1 2.5 2.5 4 ranking) | 1, 2.5, 2.5, 4 |

Note: Tied values with the same rank are sorted alphabetically within their rank group.
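
The five strategies can be reproduced with a short Python sketch that walks tie groups in a list of counts already sorted in descending order. The function name is illustrative and the code is not taken from qsv; it only mirrors the definitions in the table.

```python
def assign_ranks(counts, strategy="dense"):
    """Rank counts (sorted descending) using the named tie-handling strategy."""
    ranks = []
    dense_rank = 0
    i = 0
    while i < len(counts):
        j = i
        while j < len(counts) and counts[j] == counts[i]:
            j += 1                      # j ends one past the tie group
        dense_rank += 1
        first, last = i + 1, j          # 1-based positions of the tie group
        for position in range(first, last + 1):
            if strategy == "dense":
                ranks.append(dense_rank)
            elif strategy == "min":
                ranks.append(first)
            elif strategy == "max":
                ranks.append(last)
            elif strategy == "ordinal":
                ranks.append(position)
            elif strategy == "average":
                ranks.append((first + last) / 2)
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        i = j
    return ranks


for name in ("dense", "min", "max", "ordinal", "average"):
    print(name, assign_ranks([4, 3, 3, 2], name))
```

Running it on the counts 4, 3, 3, 2 yields the same ranks as the example column of the table.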

NULL Handling

The frequency command provides several options for controlling how NULL (empty) values are handled:

| Option | Default | Description |
| --- | --- | --- |
| --null-text <arg> | (NULL) | Customize the display text for NULL values. Set to <NONE> to exclude NULLs entirely. |
| --no-nulls | off | Don't include NULLs in the frequency table. Alias for --null-text "<NONE>". |
| --null-sorted | off | Sort NULL entries with other values by count instead of placing them at the end of the frequency table (after "Other" if present). |
| --pct-nulls | off | Include NULL values in percentage and rank calculations. When disabled (default), percentages are "valid percentages" with NULLs excluded from the denominator, and NULL entries display empty percentage and rank values. When enabled, NULLs are included in the denominator (original behavior). Has no effect when --no-nulls is set. |
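
The difference between "valid percentages" and --pct-nulls comes down to the denominator. A minimal Python sketch, with an illustrative function name and None standing in for the empty percentage cell:

```python
def percentages(counts, null_text="(NULL)", pct_nulls=False):
    """Percentage per value; NULLs leave the denominator unless pct_nulls is set."""
    if pct_nulls:
        denom = sum(counts.values())
    else:
        denom = sum(c for v, c in counts.items() if v != null_text)
    result = {}
    for value, count in counts.items():
        if value == null_text and not pct_nulls:
            result[value] = None            # empty percentage for the NULL row
        else:
            result[value] = 100 * count / denom
    return result


counts = {"A": 6, "B": 2, "(NULL)": 2}
print(percentages(counts))                  # A and B computed out of 8 rows
print(percentages(counts, pct_nulls=True))  # all three computed out of 10 rows
```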

Column Filtering

The frequency command supports filtering columns from the frequency analysis:

| Option | Description |
| --- | --- |
| --no-float <cols> | Exclude Float columns from frequency analysis. Floats typically contain continuous values where frequency tables are not meaningful. Use --no-float "*" to exclude ALL Float columns, or specify a comma-separated list of Float columns to INCLUDE as exceptions (e.g., --no-float price,rate excludes all Floats except "price" and "rate"). Requires stats cache for type detection. |
| --stats-filter <expr> | Filter columns based on their statistics using a Luau expression. Columns where the expression evaluates to true are EXCLUDED. Available fields include: field, type, is_ascii, cardinality, nullcount, sum, min, max, range, sort_order, min_length, max_length, mean, stddev, variance, cv, sparsity, q1, q2_median, q3, iqr, mad, skewness, mode, antimode, n_negative, n_zero, n_positive, etc. Examples: "nullcount > 1000", "type == 'Float'", "cardinality > 500 and nullcount > 0". Requires stats cache and the luau feature. |

Weighted Frequencies

When the --weight <column> option is specified, frequency counts are multiplied by the weight value for each row.

Weight Handling:

  • Weight column must be numeric
  • Weight column is automatically excluded from frequency computation
  • Missing or unparsable weights default to 1.0
  • Zero, negative, NaN, and infinite weights are ignored and do not contribute to frequencies
  • Weight tolerance calculation uses stats cache (stddev/range/mean) for scale-aware tolerance when available

Output:

  • Count column shows weighted sum (displayed as rounded integer)
  • Percentage calculated as: weight / total_weight * 100
  • Ranking based on weighted sums
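
The weight-handling rules listed above can be sketched in Python. The helper names are illustrative, and the sketch leaves out the scale-aware tolerance and the integer rounding of the displayed count.

```python
import math
from collections import defaultdict


def parse_weight(raw):
    """Return the weight to use, or None when the row must be ignored."""
    try:
        weight = float(raw)
    except (TypeError, ValueError):
        return 1.0                      # missing or unparsable -> 1.0
    if not math.isfinite(weight) or weight <= 0:
        return None                     # zero, negative, NaN, infinite -> ignored
    return weight


def weighted_frequency(rows):
    """rows: iterable of (value, raw_weight). Returns {value: (total, pct)}."""
    totals = defaultdict(float)
    for value, raw in rows:
        weight = parse_weight(raw)
        if weight is not None:
            totals[value] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {v: (t, 100 * t / grand_total) for v, t in totals.items()}


rows = [("A", "2"), ("B", ""), ("A", "-1"), ("B", "nan"), ("A", "0.5"), ("B", "inf")]
print(weighted_frequency(rows))
```

In this sample, A accumulates 2.5 and B accumulates 1.0: the blank weight counts as 1.0, while the negative, NaN, and infinite weights contribute nothing.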

Stats Cache Integration

The frequency command leverages the stats cache (created by qsv stats --stats-jsonl) to optimize memory usage and performance:

ID Column Detection: When stats cache exists, columns where cardinality == rowcount are detected as ID columns. For these columns:

  • Frequency compilation is short-circuited (no hashmap built)
  • Output shows single <ALL_UNIQUE> entry with count = rowcount, percentage = 100%, rank = 0
  • Saves significant memory for large datasets with ID columns

Memory Optimization:

  • Hashmap capacity pre-allocated based on cardinality from stats cache
  • For parallel processing, capacity divided by number of chunks
  • Reduces allocations and improves performance

Disabling Stats Cache: Set QSV_STATSCACHE_MODE=none to force computing frequencies for ALL columns including ID columns. Useful when you need a "complete" frequency table even for ID columns. In this case, use --unq-limit to avoid memory issues with large cardinality columns.

Creating Stats Cache:

# Create stats cache with cardinality for frequency optimization
qsv stats --cardinality --stats-jsonl data.csv

# Or create with all stats
qsv stats --everything --stats-jsonl data.csv

Frequency Cache

The --frequency-jsonl flag writes a complete frequency distribution as a JSONL cache file (FILESTEM.freq.csv.data.jsonl). When a valid (fresh) cache exists, subsequent frequency runs automatically reuse it instead of recomputing from the CSV.

Cache Options:

| Option | Default | Description |
| --- | --- | --- |
| --frequency-jsonl | off | Write the frequency distribution as a JSONL cache. Requires a file input (not stdin). |
| --high-card-threshold <arg> | 100 | Absolute cardinality threshold for <HIGH_CARDINALITY> classification. Can also be set with QSV_FREQ_HIGH_CARD_THRESHOLD env var (env var takes precedence when CLI value equals the default). |
| --high-card-pct <arg> | 90 | Percentage of rowcount threshold for <HIGH_CARDINALITY> classification. Must be between 1 and 100. Can also be set with QSV_FREQ_HIGH_CARD_PCT env var (env var takes precedence when CLI value equals the default). |
| --force | off | Force recomputation and cache regeneration even when a valid frequency cache exists. |

HIGH_CARDINALITY Sentinel: Columns whose cardinality exceeds the smaller of --high-card-threshold and --high-card-pct percent of rowcount are classified as HIGH_CARDINALITY. These get a single <HIGH_CARDINALITY> sentinel entry (count = rowcount, percentage = 100%, rank = 0), analogous to the <ALL_UNIQUE> sentinel for ID columns.
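
The two sentinels can be summarized with a small Python sketch of the classification rule as described. The function name and the "full" label are illustrative, and checking the ID-column case before the high-cardinality case is an assumption about ordering.

```python
def classify_column(cardinality, rowcount, threshold=100, pct=90):
    """Decide which kind of entry a column gets in the frequency cache."""
    if rowcount > 0 and cardinality == rowcount:
        return "<ALL_UNIQUE>"               # ID column
    # "The smaller of" the absolute threshold and pct percent of rowcount.
    limit = min(threshold, rowcount * pct / 100)
    if cardinality > limit:
        return "<HIGH_CARDINALITY>"
    return "full"                           # complete distribution is stored


print(classify_column(1000, 1000))  # <ALL_UNIQUE>
print(classify_column(150, 1000))   # <HIGH_CARDINALITY> (limit is 100)
print(classify_column(48, 50))      # <HIGH_CARDINALITY> (limit is 45)
print(classify_column(40, 50))      # full
```

With the defaults, the absolute threshold of 100 dominates on any file with more than about 111 rows; the percentage only matters for small files.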

Cache Validation:

  • The cache is considered valid when the CSV file's mtime is older than the cache file's mtime
  • Metadata compatibility is checked: --no-nulls, --no-headers, and --delimiter must match the cached settings
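
The two validation checks can be sketched in Python as an mtime comparison followed by a settings comparison. The metadata keys and the dictionary layout are hypothetical; only the rule itself (CSV older than cache, matching --no-nulls, --no-headers, and --delimiter) comes from the text above.

```python
import os
import tempfile


def cache_is_valid(csv_path, cache_path, cached_meta, requested_meta):
    """True when the cache is newer than the CSV and the settings match."""
    if not os.path.exists(cache_path):
        return False
    if os.path.getmtime(csv_path) >= os.path.getmtime(cache_path):
        return False                         # CSV changed after the cache was written
    keys = ("no_nulls", "no_headers", "delimiter")  # hypothetical key names
    return all(cached_meta.get(k) == requested_meta.get(k) for k in keys)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    cache_path = os.path.join(tmp, "data.freq.csv.data.jsonl")
    for path in (csv_path, cache_path):
        with open(path, "w") as handle:
            handle.write("")
    meta = {"no_nulls": False, "no_headers": False, "delimiter": ","}

    os.utime(csv_path, (1_000, 1_000))       # CSV older than cache
    os.utime(cache_path, (2_000, 2_000))
    fresh = cache_is_valid(csv_path, cache_path, meta, dict(meta))
    mismatch = cache_is_valid(csv_path, cache_path, meta, {**meta, "delimiter": ";"})

    os.utime(csv_path, (3_000, 3_000))       # CSV modified after the cache
    stale = cache_is_valid(csv_path, cache_path, meta, dict(meta))

print(fresh, mismatch, stale)
```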

Incompatibilities: The --frequency-jsonl flag produces an error when combined with:

  • --ignore-case — case folding changes computed values
  • --no-trim — whitespace handling changes computed values
  • --weight — weighted frequencies change computed values

Partial Cache Hits: When the cache is valid, columns with full cached data are served directly from the cache. HIGH_CARDINALITY columns (which store only a sentinel) are recomputed via parallel processing against the original CSV.

Frequent Items Sketch (Approximate Top-K)

By default, frequency computes exact counts by tracking every distinct value in a HashMap. For columns with very high cardinality, this can be memory-prohibitive. The --sketch-method frequent_items flag swaps the HashMap for the Misra-Gries heavy-hitters sketch (Apache DataSketches port), which tracks the top-K most frequent values in constant memory with bounded additive error.

| Option | Default | Description |
| --- | --- | --- |
| --sketch-method <m> | exact | Algorithm for the frequency table. Choices: exact (HashMap, exact counts) or frequent_items (Misra-Gries sketch, approximate top-K). |
| --sketch-map-size <n> | 4096 | Maximum map size for the Frequent Items sketch. Must be a power of two and ≥ 8. Larger values tighten the error bound at the cost of more memory. Only used when --sketch-method frequent_items. |

Counts are estimates. The sketch reports each item's upper-bound frequency estimate; tail items not retained in the sketch are aggregated into a single "Other" row (no unique-count suffix, since the sketch cannot recover the true number of distinct tail items). The sketch's natural ordering is top-K by estimate descending; tied counts use the sketch's hash-table iteration order. The frequency cache is bypassed under this mode.
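
For intuition, here is the textbook Misra-Gries summary in Python. It is not the Apache DataSketches port that qsv uses (that variant purges counters differently and requires a power-of-two map size); the sketch only shows how a bounded map can yield lower and upper frequency bounds for heavy hitters.

```python
def misra_gries(stream, k):
    """Textbook Misra-Gries with at most k counters.

    Returns (counters, offset): counters[item] is a lower bound on the true
    frequency and counters[item] + offset is an upper bound.
    """
    counters = {}
    offset = 0                               # number of decrement rounds so far
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            offset += 1                      # map full: decrement every counter
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, offset


stream = ["a"] * 60 + ["b"] * 25 + [f"x{i}" for i in range(15)]
counters, offset = misra_gries(stream, k=4)
for item, count in sorted(counters.items(), key=lambda kv: -kv[1]):
    print(item, "lower:", count, "upper:", count + offset)
```

The 15 one-off values never survive in the map, which is why tail items can only be reported as a single aggregated row.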

Rejected flags (the command errors out if any of these are combined with --sketch-method frequent_items):

| Flag | Reason |
| --- | --- |
| --asc | The sketch tracks heavy hitters only; least-frequent items are not recoverable. |
| --weight | The Apache DataSketches Frequent Items sketch operates on unit-weight streams. |
| --ignore-case | Case folding changes computed values; not supported in streaming sketch mode. |
| --no-trim | Whitespace handling changes computed values; not supported in streaming sketch mode. |
| --other-sorted | The sketch always emits the "Other" row at the end. |
| --null-sorted | The sketch ranks NULL alongside other values by estimate; no reordering support. |
| --frequency-jsonl | The frequency cache is bypassed under sketch mode. |
| --stats-filter | Incompatible with sketch-mode dispatch. |
| --json / --pretty-json / --toon | Only CSV output is supported under sketch mode. |

Silently ignored flags under frequent_items (no error, no effect):

  • --rank-strategy — the sketch's natural top-K-by-estimate ordering is used.
  • --lmt-threshold — the sketch always tracks at most --sketch-map-size candidates.
  • --unq-limit — the sketch's bounded map is itself the unique-limit.

"Other" row divergence: Under frequent_items, the "Other" row label is the bare --other-text (no (N) unique-count suffix, since the sketch cannot recover the true count of items not in the top-K), and rank is 0 to match the existing convention for the exact mode's "Other" row.

When to use: prefer frequent_items for streaming over wide tables with many high-cardinality string columns where you only care about the heavy hitters and want predictable, fixed memory. Prefer exact (default) for small/medium cardinality, weighted streams, or whenever you need a complete frequency distribution.

See also: Misra-Gries summary (Wikipedia), Apache DataSketches Frequent Items.

JSON/TOON Output

The --json or --pretty-json flags output frequency tables as nested JSON. The --toon flag outputs in TOON format (compact, human-readable encoding for LLM prompts).

JSON Structure:

{
  "input": "filename.csv",
  "description": "command arguments",
  "rowcount": 1000,
  "fieldcount": 5,
  "rank_strategy": "dense",
  "fields": [
    {
      "field": "column_name",
      "type": "String",
      "cardinality": 10,
      "nullcount": 0,
      "sparsity": 0.0,
      "uniqueness_ratio": 0.01,
      "stats": [
        {"name": "sum", "value": 1000},
        {"name": "min", "value": "A"},
        {"name": "max", "value": "Z"},
        {"name": "range", "value": null},
        {"name": "sort_order", "value": "UNSORTED"},
        {"name": "min_length", "value": 1},
        {"name": "max_length", "value": 1},
        {"name": "avg_length", "value": 1},
        {"name": "mean", "value": null},
        {"name": "stddev", "value": null},
        ...
      ],
      "frequencies": [
        {"value": "A", "count": 500, "percentage": 50.0, "rank": 1},
        {"value": "B", "count": 300, "percentage": 30.0, "rank": 2},
        ...
      ]
    }
  ]
}

Additional Stats in JSON Output:

When --no-stats is NOT set and the column type is not empty, NULL, or Boolean, the per-field stats array contains up to 17 statistics (only those present in the underlying stats record are emitted):

  1. sum — Sum of numeric values
  2. min — Minimum value
  3. max — Maximum value
  4. range — Range (max - min)
  5. sort_order — ASCENDING, DESCENDING, or UNSORTED
  6. min_length — Shortest string length (String types)
  7. max_length — Longest string length (String types)
  8. sum_length — Total of all string lengths (String types)
  9. avg_length — Average string length (String types)
  10. stddev_length — Standard deviation of string lengths (String types)
  11. variance_length — Variance of string lengths (String types)
  12. cv_length — Coefficient of variation of string lengths (String types)
  13. mean — Arithmetic mean
  14. sem — Standard error of the mean
  15. stddev — Standard deviation
  16. variance — Variance
  17. cv — Coefficient of variation

Note: cardinality, nullcount, sparsity, and uniqueness_ratio are emitted as top-level properties of each FrequencyField (see the JSON Structure example above), not inside the per-field stats array. geometric_mean, harmonic_mean, and max_precision are not included in frequency JSON output even when present in the stats cache — use qsv stats directly if you need them.

Memory-Aware Processing

The frequency command defaults to dynamic, memory-aware chunking for large datasets to avoid out-of-memory errors. Both stats and frequency default to memory-aware sizing; frequency is documented separately here because it builds hash tables and therefore benefits from predictable per-chunk memory budgeting. CPU-based chunking can be requested explicitly by setting QSV_FREQ_CHUNK_MEMORY_MB=-1 (the same convention stats uses with QSV_STATS_CHUNK_MEMORY_MB).

Chunking Behavior:

  • Automatically enabled for indexed files
  • Chunk size calculated based on:
    • Available memory
    • Record sampling (samples first 1000 records)
    • Estimated hashmap overhead for frequency tables
  • Controlled by QSV_FREQ_CHUNK_MEMORY_MB environment variable:
    • Not set or 0: Dynamic sizing based on available memory and sampling
    • Positive N: Fixed memory limit of N MB per chunk
    • -1: CPU-based chunking (num_records / num_CPUs)
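
The three settings of QSV_FREQ_CHUNK_MEMORY_MB can be summarized with a small Python sketch. The function and its return shape are hypothetical; rounding the CPU-based chunk size up and treating other negative values as dynamic are assumptions.

```python
def resolve_chunking(env_value, num_records, num_cpus):
    """Map the environment variable's value to a (mode, parameter) pair."""
    raw = (env_value or "").strip()
    setting = int(raw) if raw else 0
    if setting == -1:
        # CPU-based: records split evenly across CPUs (ceiling division).
        return ("cpu", max(1, -(-num_records // num_cpus)))
    if setting > 0:
        return ("fixed", setting)            # N MB per chunk
    return ("dynamic", None)                 # size from memory and sampling


print(resolve_chunking(None, 1000, 4))   # ('dynamic', None)
print(resolve_chunking("512", 1000, 4))  # ('fixed', 512)
print(resolve_chunking("-1", 1000, 4))   # ('cpu', 250)
```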

Parallel Processing:

  • Requires an index file (qsv index data.csv)
  • Automatically enabled when index exists (disable with --jobs 1)
  • Each chunk processed independently, then merged
  • For unindexed files, falls back to sequential processing

Memory Estimation: The command estimates memory per record as:

  • Base record size (sum of field lengths)
  • Hashmap overhead (~24 bytes per entry + value size)
  • Additional overhead for Vec capacity (~25%)
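
A back-of-the-envelope version of this estimate can be written in Python. It is one possible reading of the three components above (one hashmap entry per field, 24 bytes of entry overhead, 25% slack), not qsv's actual heuristic, and both function names are hypothetical.

```python
def estimate_bytes_per_record(sample_records, entry_overhead=24, vec_slack=0.25):
    """Average estimated memory per record over a sample of parsed rows."""
    if not sample_records:
        return 0.0
    total = 0
    for record in sample_records:
        sizes = [len(field.encode("utf-8")) for field in record]
        base = sum(sizes)                                   # raw field bytes
        hashmap = sum(entry_overhead + s for s in sizes)    # entry + stored value
        total += base + hashmap
    return total / len(sample_records) * (1 + vec_slack)


def records_per_chunk(budget_mb, bytes_per_record):
    """How many records fit in a chunk with the given memory budget."""
    if bytes_per_record <= 0:
        raise ValueError("bytes_per_record must be positive")
    return max(1, int(budget_mb * 1024 * 1024 // bytes_per_record))


per_record = estimate_bytes_per_record([["ab", "cde"]])
print(per_record)                        # 72.5
print(records_per_chunk(1, per_record))  # 14463
```

In the example, the row ("ab", "cde") costs 5 bytes of data plus 53 bytes of hashmap entries, scaled by 1.25.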

Auto-Index Creation: If memory check fails and file is not indexed:

  • Attempts to auto-create index
  • Switches to parallel processing if successful
  • Falls back to sequential if index creation fails

For configuration details, see https://github.com/dathere/qsv/blob/master/docs/ENVIRONMENT_VARIABLES.md


Processing Very Large Files

This section consolidates guidance for running stats and frequency on files that are large relative to available RAM (rule of thumb: any file whose CSV size approaches or exceeds 50% of free memory, or any file with columns whose true cardinality could approach the row count). For configuration knobs referenced below, see ENVIRONMENT_VARIABLES.md.

When to Worry

Most stats/frequency invocations on multi-GB files do not need special handling, because the defaults already cover the common cases:

  • Streaming stats are constant-memory. All 27 streaming statistics (list above) — sum, min, max, min_length/max_length/avg_length, mean/stddev/variance/cv, sem, geometric_mean/harmonic_mean, nullcount/sparsity, type-counts, sort_order/sortiness, is_ascii, max_precision, and range — run in O(1) memory regardless of row count. A plain qsv stats huge.csv on a 1 TB file with no non-streaming flags will not OOM.
  • frequency uses memory-aware chunking by default. With an index, it samples the first 1000 records, estimates per-record + HashMap overhead, and sizes chunks to fit available memory.
  • OOM auto-fallback is on by default. When non-streaming stats or frequency would exceed the memory-check budget, qsv first tries to auto-create an index (for non-stdin inputs), then auto-enables DataSketches estimators where the flags allow. A wwarn! line is printed listing each auto-enabled estimator.

You only need the recipes in this section when:

  1. You are requesting non-streaming stats (any of --mode, --cardinality, --quartiles, --median, --mad, --percentiles, or --everything) on a file too large to materialize the per-column state in memory, or
  2. You are running frequency on columns whose distinct-value count could blow up the HashMap (UUIDs, free-text, timestamps with sub-second precision), or
  3. You are reading from stdin (no index can be auto-created — the OOM fallback can still engage the sketch path, but cannot use indexed parallelism), or
  4. You are on a big-endian target (see Platform Note).

Memory Pressure Points

| Command | Pressure point | Scales with | Mitigation |
| --- | --- | --- | --- |
| stats streaming | none | O(1) | |
| stats --cardinality (exact) | distinct-value HashMap per column | per-column cardinality | --cardinality-method approx (HLL, ~5 KB/col) or --mode-cardinality-cap <n> |
| stats --mode / --everything (unweighted) | mode tracker Vec | row count (every cell pushed) | --mode-cardinality-cap <n> (this is the only knob; mode is not sketched even under --cardinality-method approx) |
| stats --mode --weight=… | mode tracker HashMap | per-column cardinality | --mode-cardinality-cap <n> |
| stats --median / --quartiles / --mad / --percentiles | sort buffer per column | column row count | --quantile-method approx (t-digest, ~200 centroids/col). Note: --mad is auto-disabled under approx (needs a second pass that t-digest cannot serve). |
| frequency (exact) | per-column HashMap of distinct values | per-column cardinality | --sketch-method frequent_items (Misra-Gries, fixed --sketch-map-size slots) |

Two things are easy to miss:

  • mode/antimode is the most common surprise. Under unweighted mode tracking, every cell in the column is pushed onto an underlying Vec, so the tracker grows with row count, not cardinality. The --cardinality-method approx HLL replaces only the cardinality column — mode/antimode is not sketched. The fix is --mode-cardinality-cap <n>: when the tracker grows past n, qsv drops it and emits *HIGH_CARDINALITY for the mode/antimode columns.
  • --mode-cardinality-cap 0 is the default (no cap). It is opt-in because, under --cardinality-method exact, an exceeded cap emits >=<n> in the cardinality column — and the >= prefix breaks downstream integer parsers. Under --cardinality-method approx, the cap does not affect the cardinality column (HLL emits its estimate regardless), so combining the two is safe.
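
The effect of the cap on the unweighted mode tracker can be modeled in Python. This is a deliberately simplified model of the behaviour described above, not qsv's data structure: every cell is appended to a list, and the tracker is abandoned once it grows past the cap.

```python
from collections import Counter

HIGH_CARDINALITY = "*HIGH_CARDINALITY"


def mode_with_cap(values, cap=0):
    """Return the mode, or the sentinel when the tracker outgrows the cap."""
    tracker = []
    for value in values:
        tracker.append(value)                # grows with row count
        if cap and len(tracker) > cap:
            return HIGH_CARDINALITY          # tracker dropped; cap of 0 means no cap
    if not tracker:
        return None
    return Counter(tracker).most_common(1)[0][0]


print(mode_with_cap(["a", "b", "a"]))          # a
print(mode_with_cap(["a", "b", "a"], cap=2))   # *HIGH_CARDINALITY
```

In this model the cap is hit by the number of cells pushed, so even a low-cardinality column can trigger it.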

Recipe: stats on Very Large Files

Maximum-safety invocation on a multi-GB CSV when you need the full non-streaming stat set:

# one-time: index enables parallel chunking
qsv index huge.csv

qsv stats huge.csv \
  --everything \
  --quantile-method approx \
  --cardinality-method approx \
  --mode-cardinality-cap 1000000 \
  --stats-jsonl \
  -o huge.stats.csv

What each non-default flag contributes:

  • --quantile-method approx — t-digest for median/quartiles/percentiles/skewness.
  • --cardinality-method approx — HyperLogLog for cardinality/uniqueness_ratio.
  • --mode-cardinality-cap 1000000 — bound the mode/antimode trackers.
  • --stats-jsonl — also write the stats cache for downstream "smart" commands.

What this gives you, in order of memory savings:

  1. Indexed parallel processing. Without an index, stats runs sequentially; with an index, work is split into memory-aware chunks (sized by QSV_STATS_CHUNK_MEMORY_MB) processed in parallel and merged.
  2. t-digest for quantiles. Median, q1/q2/q3, IQR, fences, skewness, and --percentiles all read from a ~200-centroid t-digest per column instead of sorting the full column. Error is ~1% rank error, more accurate at the tails. Caveat: TDigestMut::merge is associative but not chunk-count-invariant, so different --jobs values can yield ~1% differences across runs. Pin --jobs 1 for run-to-run determinism. --mad is auto-disabled with a warning.
  3. HLL for cardinality. cardinality and uniqueness_ratio come from a ~5 KB HyperLogLog per column. ~1.5% RSE. The HLL union is associative and order-invariant, so the estimate is reproducible across --jobs values.
  4. Cap on mode/antimode tracker. Without this, the unweighted mode tracker grows linearly with row count. With it set to, e.g., 1000000, columns whose tracker exceeds that drop to *HIGH_CARDINALITY for mode/antimode while every other statistic remains valid.

Flags that block this recipe (you must drop them or fall back to exact mode):

  • --weight <col> — t-digest has no weighted-update API upstream, so --quantile-method approx is rejected with --weight.
  • --infer-boolean — needs cardinality == 2 exactness, so --cardinality-method approx is rejected (or, under OOM auto-enable, suppressed) with --infer-boolean.

If neither of those applies, you can omit the explicit method flags and rely on the OOM auto-fallback: qsv will flip them on automatically when util::mem_file_check trips. You can disable the auto-enable by passing --quantile-method exact or --cardinality-method exact explicitly (the OOM arm scans argv for these flag names, so docopt's default-fill does not count as an explicit opt-out).

Recipe: frequency on Very Large Files

If you only care about the top-K most frequent values (a common analyst case), use the Misra-Gries sketch:

qsv index huge.csv
qsv frequency huge.csv \
  --sketch-method frequent_items \
  --sketch-map-size 4096 \
  --limit 100 \
  -o huge.freq.csv

--sketch-map-size must be a power of two and ≥ 8. --limit 100 emits the top 100 values per column.

--sketch-map-size sets the upper bound on map slots; the sketch's worst-case additive error is bounded by the stream length minus the active map total, so doubling the map size roughly halves the error bound at the cost of doubling memory. 4096 is a reasonable starting point; bump to 16384 or 65536 for tighter bounds.
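
For intuition about the map-size/error trade-off, here is a textbook Misra-Gries summary in Python. It is a sketch under stated assumptions, not the DataSketches Frequent Items implementation that qsv uses (which sizes and purges its map differently): misra_gries, error_bound, and valid_map_size are hypothetical helper names, and `k` stands in for the number of counters.

```python
def valid_map_size(n):
    """Mirror the documented constraint: a power of two and at least 8."""
    return n >= 8 and (n & (n - 1)) == 0


def misra_gries(stream, k):
    """Classic Misra-Gries: keep at most k counters over the stream."""
    counters = {}
    n = 0
    for item in stream:
        n += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Map is full: decrement every counter and drop those at zero.
            # The incoming item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, n


def error_bound(counters, n, k):
    """Textbook bound: each count is underestimated by at most this much."""
    return (n - sum(counters.values())) / (k + 1)
```

In this textbook form, estimates never exceed the true count, and the undercount is at most (stream length minus the sum of retained counters) divided by k + 1, which is why a larger map gives a tighter bound.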

Flags that are rejected under --sketch-method frequent_items (the full list is in the Frequent Items Sketch section): --asc, --weight, --ignore-case, --no-trim, --other-sorted, --null-sorted, --frequency-jsonl, --stats-filter, --json/--pretty-json/--toon. If you need any of these, you must run in exact mode and rely on memory-aware chunking. The OOM auto-enable will not help here, because it is blocked by the same flag set.

Silently ignored under FI mode: --rank-strategy, --lmt-threshold, --unq-limit (the sketch's bounded map is itself the unique-limit, and ordering is fixed at top-K by estimate descending).

"Other" row divergence: the Other label has no (N) unique-count suffix and rank is 0, since the sketch cannot recover the true tail count.

Indexing for Parallelism

For both stats and frequency, an index is the single highest-leverage prerequisite for large-file processing:

qsv index huge.csv      # creates huge.csv.idx; updated automatically when stale

What an index unlocks:

  • Parallel chunking. Work is split across cores (-j N or auto-detected). Each chunk is processed independently and merged.
  • Memory-aware chunk sizing. With QSV_STATS_CHUNK_MEMORY_MB / QSV_FREQ_CHUNK_MEMORY_MB unset (the default), qsv samples the first 1000 records, estimates per-record memory, and picks a chunk size that fits available memory.
  • OOM fallback for stats. When util::mem_file_check trips and no index exists, qsv attempts to auto-create one before falling back to sketches. Auto-creation is skipped for stdin (not seekable), so cat huge.csv | qsv stats … cannot benefit from indexed parallelism — pipe to a file first if you can.
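
The memory-aware sizing idea can be sketched in a few lines of Python. This is an illustration of the general approach only (sample, estimate per-record cost, divide the budget); estimate_chunk_rows and its parameters are hypothetical, and qsv's actual heuristic may weigh things differently.

```python
def estimate_chunk_rows(sample_record_bytes, available_bytes, n_workers, total_rows):
    """Pick a rows-per-chunk value that should fit each worker's memory share.

    sample_record_bytes: estimated in-memory size of each sampled record
    available_bytes:     memory budget for all workers combined (assumed input)
    """
    avg_record = sum(sample_record_bytes) / len(sample_record_bytes)
    budget_per_worker = available_bytes / n_workers
    rows = int(budget_per_worker // avg_record)
    # Never return zero rows, and never more rows than the file has.
    return max(1, min(rows, total_rows))
```

For example, with 1000 sampled records of about 100 bytes each, an 8,000,000-byte budget, and 8 workers, each chunk would hold 10,000 rows.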

You can also auto-build the index by setting QSV_AUTOINDEX_SIZE=<bytes> — any CSV larger than that threshold gets an index created on first use.

Environment Variable Cheat Sheet

Most relevant for large-file work (see ENVIRONMENT_VARIABLES.md for the full list):

| Variable | Effect |
| --- | --- |
| QSV_AUTOINDEX_SIZE | Minimum file size (bytes) for automatic index creation. Set this so big inputs always get indexed. |
| QSV_MEMORY_CHECK | Switches util::mem_file_check from NORMAL (file size vs. total memory − headroom) to CONSERVATIVE (file size vs. available + free_swap × platform_factor − headroom). Trips OOM far more readily, so the auto-fallback engages sooner. |
| QSV_FREEMEMORY_HEADROOM_PCT | Free-memory headroom for the memory check (default 20%). Set to 0 to skip the check entirely (use at your own risk). |
| QSV_STATS_CHUNK_MEMORY_MB | Per-chunk memory cap for stats (positive integer in MB). 0 = dynamic sizing. -1 = CPU-based chunking (chunks = rows/cores; ignores memory). |
| QSV_FREQ_CHUNK_MEMORY_MB | Same semantics as above, for frequency. |
| QSV_ANTIMODES_LEN | Truncation length for the antimodes preview (default 100 chars). 0 disables truncation. |
| QSV_STATS_STRING_MAX_LENGTH | Truncate min/max for String columns at this length (useful when a column contains GeoJSON / Shapefile geometry blobs that would otherwise blow up downstream parsers). |
| QSV_MAX_JOBS | Cap on parallel workers across all multithreaded qsv commands. Useful when each chunk's in-memory state is large (lower QSV_MAX_JOBS to leave headroom). |
| QSV_FREQ_HIGH_CARD_THRESHOLD / QSV_FREQ_HIGH_CARD_PCT | Cardinality cutoffs for the --frequency-jsonl cache to emit a HIGH_CARDINALITY sentinel instead of a full frequency entry. Useful for keeping the cache compact on wide tables with ID-like columns. |
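
The two memory-check modes differ only in the memory figure they compare against. The Python sketch below restates the formulas from the QSV_MEMORY_CHECK entry; it is not qsv's util::mem_file_check. The function name, its parameters, and the platform_factor default of 1.0 are placeholders, and it assumes the headroom percentage is taken from the same base figure.

```python
def fits_in_memory(file_size, total, available, free_swap,
                   headroom_pct=20, conservative=False, platform_factor=1.0):
    """Return True if the file passes the (assumed) memory check.

    All sizes are in bytes. platform_factor is a placeholder value.
    """
    if headroom_pct == 0:
        return True  # a headroom of 0 skips the check entirely
    if conservative:
        base = available + free_swap * platform_factor
    else:
        base = total
    headroom = base * headroom_pct / 100
    return file_size <= base - headroom
```

On a busy host, available memory is usually far below total memory, which is why the CONSERVATIVE comparison trips sooner for the same file.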

Worked Example: a Multi-GB CSV

For a 30 GB CSV with ~200 columns on a 32 GB host, where some columns are UUIDs:

# 1. Index up front so all subsequent passes are parallel + chunked.
qsv index big.csv

# 2. Stats: full non-streaming set, but bound mode tracking and use sketches.
QSV_STATS_CHUNK_MEMORY_MB=512 \
qsv stats big.csv \
  --everything \
  --quantile-method approx \
  --cardinality-method approx \
  --mode-cardinality-cap 1000000 \
  --stats-jsonl \
  -o big.stats.csv

# 3. Frequency: top-100 per column, sketch-mode for fixed memory.
qsv frequency big.csv \
  --sketch-method frequent_items \
  --sketch-map-size 16384 \
  --limit 100 \
  -o big.freq.csv

# 4. (Optional) Tighten the memory check if you're sharing the host:
QSV_MEMORY_CHECK=1 QSV_FREEMEMORY_HEADROOM_PCT=40 qsv stats big.csv

If you forgot any of the sketch flags and stats hits the memory check, the OOM auto-fallback will print a wwarn! line such as:

OOM during memory check: auto-enabling DataSketches estimators
(--quantile-method approx, --cardinality-method approx).
Re-run with explicit --quantile-method exact / --cardinality-method exact
to disable the auto-enable.

Which estimators are auto-enabled depends on which incompatible flags are set (--weight blocks t-digest; --infer-boolean blocks HLL; --mad/--everything causes MAD to be auto-disabled under approx). The corresponding line for frequency mentions --sketch-method frequent_items and reports the map size.

Hard Limits (What Will Still OOM)

The DataSketches integration is a major step toward unbounded inputs, but it does not guarantee that stats/frequency will fit in memory in every case. You can still hit memory exhaustion in these situations:

  1. Big-endian targets (s390x, PowerPC BE). DataSketches is unavailable — all --quantile-method approx, --cardinality-method approx, and --sketch-method frequent_items paths are rejected, and the OOM auto-enable compiles to a no-op stub. On these targets, fall back to --mode-cardinality-cap, smaller QSV_STATS_CHUNK_MEMORY_MB, and --limit / --unq-limit on frequency.
  2. Unweighted mode/antimode without a cap. The tracker grows with row count regardless of --cardinality-method. Solution: set --mode-cardinality-cap to a value you can afford, or drop --mode/--everything.
  3. Frequency in exact mode with unbounded distinct values. If the column is truly unique-per-row (a UUID column on a 1 B-row CSV), exact mode needs ~1 B HashMap entries. Solution: switch to --sketch-method frequent_items, or pre-bucket the column.
  4. --weight blocks t-digest and --infer-boolean blocks HLL for stats. If both flags are set, neither auto-enable engages, and the memory check will simply fail. Solution: drop the blocking flag, run a separate boolean-inference pass with stats alone (no --weight), or accept exact mode with adequate RAM.
  5. frequency flag combinations that reject Frequent Items. If you need --asc, --ignore-case, --no-trim, --weight, --other-sorted, --null-sorted, --frequency-jsonl, --stats-filter, or --json/--pretty-json/--toon, the sketch path is unavailable. Solution: exact mode with sufficient RAM, or do without that flag.
  6. Stdin input for stats. Stdin is not seekable, so the auto-index path is skipped. The sketch auto-enable still runs, but you lose parallelism. Solution: write the stream to a file first (… > /tmp/in.csv), then run qsv stats /tmp/in.csv so the file can be indexed.
  7. Explicit --*-method exact opt-out. The OOM auto-enable scans argv for --quantile-method / --cardinality-method / --sketch-method; if you passed any of those (even exact), auto-enable is suppressed for that method. Drop the explicit opt-out to re-enable the fallback.

Platform Note: Big-Endian Targets

Apache DataSketches' Rust port is gated to little-endian targets (verified upfront in stats::run, frequency::run, and the OOM fallback paths). On big-endian targets:

  • --quantile-method approx, --cardinality-method approx, and --sketch-method frequent_items are all rejected with a clear error.
  • try_enable_approx_sketches (stats) and can_enable_frequent_items (frequency) compile to no-op stubs, so the OOM path falls through to error rather than silently degrading.

If you maintain qsv on a big-endian platform, the practical large-file toolkit is:

  • --mode-cardinality-cap for stats (bounds mode/antimode tracking only).
  • Smaller QSV_STATS_CHUNK_MEMORY_MB / QSV_FREQ_CHUNK_MEMORY_MB to keep per-chunk state small.
  • Pre-bucketing high-cardinality columns (e.g., truncate timestamps to the hour) before running frequency.
  • For adjacent sort/dedup steps, use extsort/extdedup (external on-disk variants) instead of sort/dedup.
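
As one way to pre-bucket a timestamp column before running frequency, the following Python sketch truncates ISO 8601 timestamps to the hour. It is a standalone illustration that assumes the column holds ISO 8601 strings; bucket_to_hour, bucket_column, and the column name "ts" are hypothetical, and any preprocessing tool with the same effect would work.

```python
import csv
import io
from datetime import datetime


def bucket_to_hour(ts):
    """Truncate an ISO 8601 timestamp string to the start of its hour."""
    dt = datetime.fromisoformat(ts)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()


def bucket_column(rows, column):
    """Yield rows (dicts) with `column` replaced by its hourly bucket."""
    for row in rows:
        row = dict(row)  # copy so the caller's row is left untouched
        row[column] = bucket_to_hour(row[column])
        yield row


# Example: two events in the same hour collapse to one distinct value.
data = io.StringIO("ts,value\n2026-07-03T14:37:12,1\n2026-07-03T14:05:00,2\n")
bucketed = list(bucket_column(csv.DictReader(data), "ts"))
```

Bucketing turns a near-unique column into one with at most 24 distinct values per day, which keeps the exact-mode frequency table small.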

Notes for moarstats and pragmastat

The DataSketches fallback applies to stats and frequency only. Two adjacent commands have their own characteristics:

  • moarstats computes the Advanced, Bivariate, Robust, and Outlier statistic families. Most require either two passes or a full in-memory column (e.g., outlier detection needs the IQR + every value; correlation needs paired columns held together). There is no sketch fallback — for very large inputs, sample first with qsv sample and run moarstats on the sample, or pre-filter columns to the ones you actually need.
  • pragmastat (one-sample mode and two-sample mode) computes deterministic robust estimators that require full-sample residuals. It is designed for inputs that fit comfortably in memory; for very large inputs, sample down first.