qsv Stats Definitions
July 3, 2026 · View on GitHub
Table of Contents
- stats
- Streaming vs Non-Streaming Statistics
- Weighted Statistics
- Date/DateTime Statistics
- Metadata & Type Inference
- Descriptive Statistics (Numerical & General)
- Central Tendency & Dispersion
- String Statistics
- Quality & Distribution
- Median & Quartiles (Non-Streaming)
- Cardinality & Modes (Non-Streaming)
- Percentiles (Non-Streaming)
- File-Level Metadata (JSON Cache)
- Whitespace Visualization
- Performance & Caching
- Approximate Algorithms (Opt-In)
- moarstats
- pragmastat
- frequency
- Processing Very Large Files
- When to Worry
- Memory Pressure Points
- Recipe: stats on Very Large Files
- Recipe: frequency on Very Large Files
- Indexing for Parallelism
- Environment Variable Cheat Sheet
- Worked Example: a Multi-GB CSV
- Hard Limits (What Will Still OOM)
- Platform Note: Big-Endian Targets
- Notes for moarstats and pragmastat
stats
Here are all the statistics produced by the qsv stats command, sourced from src/cmd/stats.rs.
Each statistic is categorized by its relevant section, with its identifier (column name), summary, computation method, and level (File or Variable).
Note: "Streaming" statistics are computed in constant memory. "Non-Streaming" statistics require loading the column data into memory (or multiple passes) and may use approximation or exact calculation depending on configuration.
Important: Unlike the sniff command, stats data type inferences are GUARANTEED, as the entire file is scanned, not just sampled. This makes stats a central command in qsv that underpins other "smart" commands (describegpt, frequency, joinp, pivotp, schema, sqlp, tojsonl) which use cached statistical information to work smarter & faster.
The command supports various caching options to improve performance on subsequent runs. See --stats-jsonl and --cache-threshold options for details.
Streaming vs Non-Streaming Statistics
Streaming Statistics (computed in constant memory, always emitted alongside the field and type identifier columns; those two identifier columns are always present in the output but are not counted as statistics here) — 27 stats:
| # | Identifier | Group |
|---|---|---|
| 1 | is_ascii | Metadata |
| 2 | sum | Descriptive |
| 3 | min | Descriptive |
| 4 | max | Descriptive |
| 5 | range | Descriptive |
| 6 | sort_order | Descriptive |
| 7 | sortiness | Descriptive |
| 8 | min_length | String length |
| 9 | max_length | String length |
| 10 | sum_length | String length |
| 11 | avg_length | String length |
| 12 | stddev_length | String length |
| 13 | variance_length | String length |
| 14 | cv_length | String length |
| 15 | mean | Central tendency |
| 16 | sem | Central tendency |
| 17 | geometric_mean | Central tendency |
| 18 | harmonic_mean | Central tendency |
| 19 | stddev | Central tendency |
| 20 | variance | Central tendency |
| 21 | cv | Central tendency |
| 22 | nullcount | Quality |
| 23 | n_negative | Quality |
| 24 | n_zero | Quality |
| 25 | n_positive | Quality |
| 26 | max_precision | Quality |
| 27 | sparsity | Quality |
Non-Streaming Statistics (opt-in via flag or --everything; all but zero_padded_numeric load/retain per-column data in memory) — 21 stats:
zero_padded_numericis the exception in this group: it is computed during the normal streaming scan in constant memory (two bookkeeping flags per column) and is listed here only because it is opt-in, not because it requires in-memory processing.
| # | Identifier | Flag |
|---|---|---|
| 28 | median | --median (suppressed when --quartiles is also on; supplied by q2_median instead) |
| 29 | mad | --mad |
| 30 | lower_outer_fence | --quartiles |
| 31 | lower_inner_fence | --quartiles |
| 32 | q1 | --quartiles |
| 33 | q2_median | --quartiles |
| 34 | q3 | --quartiles |
| 35 | iqr | --quartiles |
| 36 | upper_inner_fence | --quartiles |
| 37 | upper_outer_fence | --quartiles |
| 38 | skewness | --quartiles |
| 39 | cardinality | --cardinality |
| 40 | uniqueness_ratio | --cardinality |
| 41 | mode | --mode |
| 42 | mode_count | --mode |
| 43 | mode_occurrences | --mode |
| 44 | antimode | --mode |
| 45 | antimode_count | --mode |
| 46 | antimode_occurrences | --mode |
| 47 | percentiles | --percentiles (single column containing the comma-separated values listed in --percentile-list) |
| 48 | zero_padded_numeric | --zero-padded-numeric (true when leading/padding zeros would be lost if the column were cast to a number: the inferred type is String — a leading zero is exactly what forces an otherwise-numeric column to infer as String — and every non-null value is numeric-shaped: an all-digit integer (zip codes, barcodes, padded IDs), a zero-padded decimal code such as 007.1/05.10 (ICD-9, Dewey Decimal, Harmonized System codes), or a plain number mixed in; empty otherwise. Only String-typed columns are ever flagged) |
Total: 48 statistics (27 streaming + 21 non-streaming, beyond the field/type identifiers). 48 is the maximum — it's the size of the union across all flag combinations. The actual emitted column count for any particular run depends on which flags are set: any single run emits at most 48 stats columns because median (#28) and q2_median (#33) are mutually exclusive — median is only emitted under --median alone, while q2_median replaces it whenever --quartiles or --everything is set. Runs without --everything emit fewer columns (only the streaming 27 plus whichever opt-in groups are enabled). The non-streaming statistics that retain per-column samples (every stat in this group except zero_padded_numeric, which is streaming/constant-memory) use memory-aware chunking for large files, dynamically calculating chunk size based on available memory and record sampling. The enumeration above is the source-of-truth for the "48 summary statistics" count quoted in README.md and docs/help/stats.md; it is sourced from the Stats::stat_headers builder in src/cmd/stats.rs.
Weighted Statistics
When the --weight <column> option is specified, all statistics are computed using weighted algorithms. The weight column must be numeric and is automatically excluded from statistics computation. Missing or non-numeric weights default to 1.0. Zero and negative weights are ignored and do not contribute to the statistics.
Weighted statistics use weighted versions of the standard algorithms:
- Weighted mean/variance/stddev: Weighted Welford's algorithm (West, 1979)
- Weighted geometric mean:
exp(Σ(w_i * ln(x_i)) / Σ(w_i))for positive values - Weighted harmonic mean:
Σ(w_i) / Σ(w_i / x_i)for non-zero values - Weighted median/quartiles/percentiles: Weighted nearest-rank method
- Weighted MAD: Weighted median of absolute deviations
- Weighted modes/antimodes: Based on weight values rather than frequency counts
The output filename will be <FILESTEM>.stats.weighted.csv to distinguish from unweighted statistics.
Date/DateTime Statistics
Date and DateTime statistics are only computed when --infer-dates is enabled. Date inference is an expensive operation that matches date candidates against 19 possible date formats with multiple variants.
Formatting:
- DateTime results are in RFC3339 format (e.g., "2023-01-15T10:30:00Z")
- Date results are in "yyyy-mm-dd" format (UTC timezone)
- If timezone is not specified in the data, it is set to UTC
Units:
- Date range, stddev, variance, MAD, and IQR are returned in days (not milliseconds)
- These values are rounded to a precision of 1e-5 days (sub-second), with trailing zeros trimmed in the displayed output
- Mean, geometric mean, and harmonic mean for dates/datetimes are returned in RFC3339 format
Date Column Selection:
- By default,
--dates-whitelistis set tosniff, which uses two-stage date inferencing: first runsqsv sniffon the input file, then only infers dates for the columns that sniff identifies as date/datetime candidates. This is much faster thanall, and more convenient than manually specifying patterns in the whitelist - Alternatively, set
--dates-whitelistto a comma-separated, case-insensitive list of patterns to match against column names (e.g.,date,time,due,open,close,created). Only columns whose names contain one of the patterns will be checked - Examples of column names that trigger date inference (with a manual whitelist like
date,time,due,open,close,created): "start_date", "Observation Time", "timestamp", "Date Closed" - Examples that do NOT trigger: "start_dt", "create_dt", "tmstmp", "close_dt" (unless added to whitelist)
- Use
--dates-whitelist allto inspect all fields (may cause false positives with numeric data like Unix epoch timestamps) - Use
--prefer-dmyto parse dates in day/month/year format instead of month/day/year
Metadata & Type Inference
| Identifier | Level | Summary | Computation |
|---|---|---|---|
field | Variable | The name of the column/header (or its index if --no-headers is used). | Extracted from the CSV header row. |
type | Variable | Inferred data type of the column. | Inferred by checking values against: NULL, Integer, Float, Date, DateTime, Boolean (optional), and fallback to String. Data type inferences are GUARANTEED as stats scans the entire file. |
is_ascii | Variable | Indicates if all characters in the string column are ASCII. | Checked during UTF-8 validation; true if bytes are valid ASCII. |
zero_padded_numeric | Variable | true when leading/padding zeros would be lost if the column were cast to a number: a String-typed column whose every non-null value is numeric-shaped — zero-padded integers/codes (zip codes, barcodes, padded IDs) or zero-padded decimal codes (e.g. 007.1, 05.10), possibly mixed with plain numbers; empty otherwise. Only String-typed columns are ever flagged (a leading zero is exactly what forces an otherwise-numeric column to infer as String). Opt-in via --zero-padded-numeric or --everything. | While scanning, a column stays "qualified" only if every non-null value is numeric-shaped; the first non-numeric value disqualifies it (the disqualification is sticky). A value counts as numeric-shaped if it is (a) an all-ASCII-digit integer; (b) a zero-padded decimal code — a leading 0 in the integer part followed by another digit, with a single decimal point (007.1, 05.10); or (c) a plain number — either freshly inferred as Float, or (once the column has already widened to String) re-parsed as a float, so a 3.5 arriving after a 007.1 does not disqualify the column. At output, true is emitted when the column qualified, saw at least one value, AND its final type is String — which, for an all-numeric-shaped column, implies at least one value carried a leading zero (a plain all-digit column infers as Integer, a plain decimal column as Float; both are never flagged). |
Zero-Padded Numeric Detection:
- qsv deliberately keeps zero-padded numerics — integers with leading zeros (e.g.
07306) AND zero-padded decimal codes (e.g.007.1) — asStringrather thanInteger/Float, to avoid silently dropping the leading zeros. The flag is therefore only ever emitted forString-typed columns; what it adds over theStringtype alone is the guarantee that every value is numeric-shaped, i.e. the column is a numeric code, not free text. zero_padded_numericsurfaces both so such columns are not mistakenly re-typed/cast as numeric when loaded into SQL, SPSS, SAS, Stata, or other tools.- Detection is strict: a column with any non-numeric value (e.g.
"N/A","Main St") is not flagged, matching the "only if all values are numeric" intent. - The zero-padded decimal-code rule mirrors the integer one: a
0in the integer part immediately followed by another digit (007.1,05.10,0601.10). This deliberately excludes ordinary fractions like0.5/0.25(a single0before the decimal point) — those infer asFloatand are never flagged. Pure trailing-zero codes (7.10→7.1) are likewise not flagged: with no leading zero the column infers asFloat, and trailing padding is indistinguishable from a rounded measurement without the original string anyway. Multi-dot codes like0601.10.00are neither all-digit nor parseable floats, so they disqualify the column (they count as non-numeric values) and it is never flagged — even though qsv does keep such columns asString.
Date and DateTime Type Inference:
- See the Date/DateTime Statistics section for full details on how date columns are selected, the default
--dates-whitelist, example column names that do and do not trigger inference, and the list of supported date formats. - In summary, Date and DateTime types are only inferred when
--infer-datesis enabled, and inference relies on matching candidate values against the supported date formats.
Boolean Type Inference:
- Boolean type is inferred when
--infer-booleanis enabled - A column is inferred as Boolean when its cardinality is 2 and the two values match the boolean patterns specified by
--boolean-patterns(default:1:0,t*:f*,y*:n*) - Boolean inference automatically enables
--cardinalitycomputation - Patterns are case-insensitive and support prefix matching with
*wildcards - Example: With the default patterns
"t*:f*,y*:n*", a column is inferred as Boolean only when it contains exactly two distinct values—one matching a "true" pattern (for example, "true", "truthy", "Truth") and one matching a "false" pattern (for example, "false", "f", "no"); any additional distinct values (such as "falsified" or "falseness") would increase the cardinality above 2 and therefore prevent Boolean inference.
Descriptive Statistics (Numerical & General)
| Identifier | Level | Summary | Computation |
|---|---|---|---|
sum | Variable | Sum of all values in the column. | Rolling sum. Integers sum to Integer until a Float is encountered, then switches to Float. Integer sums that overflow/underflow show *OVERFLOW* or *UNDERFLOW*. For Floats, returns NaN as the string "NaN", positive infinity as "inf", and negative infinity as "-inf". |
min | Variable | Minimum value found. | Tracks minimum value during the scan. |
max | Variable | Maximum value found. | Tracks maximum value during the scan. |
range | Variable | Difference between Max and Min. | max - min. |
sort_order | Variable | Sorting status of the column. | Checked during scan. Returns "Ascending", "Descending", or "Unsorted". |
sortiness | Variable | Measure of how sorted the column is. | Returns a score between -1.0 and 1.0: 1.0 indicates perfectly ascending order, -1.0 indicates perfectly descending order, values in between indicate the general tendency towards ascending or descending order, and 0.0 indicates either no clear ordering or empty/single-element collections. |
Central Tendency & Dispersion (Streaming)
Computed using Welford's online algorithm for single-pass accuracy. When --weight <column> is specified, weighted versions are computed using weighted Welford's algorithm (West, 1979).
| Identifier | Level | Summary | Computation |
|---|---|---|---|
mean | Variable | Arithmetic mean (average). | Welford's algorithm mean. Weighted: Σ(w_i * x_i) / Σ(w_i). |
sem | Variable | Standard Error of the Mean. | stddev / sqrt(count). |
geometric_mean | Variable | Geometric mean. | Online calculation using logarithms. Weighted: exp(Σ(w_i * ln(x_i)) / Σ(w_i)) for positive values. |
harmonic_mean | Variable | Harmonic mean. | Online calculation using reciprocals. Weighted: Σ(w_i) / Σ(w_i / x_i) for non-zero values. |
stddev | Variable | Standard deviation (sample). | Welford's algorithm standard deviation. Weighted: uses frequency weight definition. |
variance | Variable | Variance (sample). | Square of standard deviation. Weighted: S_n / (W_n - 1) where S_n is sum of squared differences. |
cv | Variable | Coefficient of Variation. | (stddev / mean) * 100. Returns NaN when mean is 0. |
String Statistics
NOTE: Length statistics are only computed for columns with a String data type. Lengths are byte lengths, not character lengths, as some UTF-8 characters take more than one byte.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
min_length | Variable | Length of the shortest string. | Tracks minimum length in bytes. |
max_length | Variable | Length of the longest string. | Tracks maximum length in bytes. |
sum_length | Variable | Sum of lengths of all strings. | Accumulates length of every value. Shows *OVERFLOW* when sum exceeds u64::MAX. |
avg_length | Variable | Average string length. | sum_length / count. Shows *OVERFLOW* when sum_length overflowed. |
stddev_length | Variable | Standard deviation of string lengths. | Welford's algorithm on lengths. Shows *OVERFLOW* when sum_length overflowed. |
variance_length | Variable | Variance of string lengths. | Square of stddev_length. Shows *OVERFLOW* when sum_length overflowed. |
cv_length | Variable | Coefficient of Variation of lengths. | stddev_length / avg_length (unitless ratio, not multiplied by 100 unlike the numeric cv above). Shows *OVERFLOW* when sum_length overflowed. |
Quality & Distribution
| Identifier | Level | Summary | Computation |
|---|---|---|---|
nullcount | Variable | Count of NULL (empty) values. | Incremented when a field is empty (or matches custom NULL). |
n_negative | Variable | Count of negative values. | Computed for Integer and Float types only. |
n_zero | Variable | Count of zero values. | Computed for Integer and Float types only. |
n_positive | Variable | Count of positive values. | Computed for Integer and Float types only. |
max_precision | Variable | Maximum decimal precision found (Floats). | Tracks the maximum number of digits after the decimal point. |
sparsity | Variable | Fraction of missing (NULL) values. | nullcount / record_count. |
Median & Quartiles (Non-Streaming)
Requires loading data into memory and sorting. When --weight <column> is specified, weighted versions are computed using weighted nearest-rank method.
Note on Date/DateTime types: For Date and DateTime types, range, stddev, variance, MAD, and IQR are returned in days (not milliseconds). These values are rounded to a precision of 1e-5 days (sub-second precision); trailing zeros may be omitted in the output.
Requirements:
medianrequires--medianor--everything(unless--quartilesis specified, in which casemedianis not returned separately as it's the same asq2_median)madrequires--mador--everything- Quartile statistics require
--quartilesor--everything - When
--quantile-method approxis set,median,q1,q2_median,q3,iqr, the four fences,skewness, andpercentilesare computed from a t-digest sketch (Apache DataSketches port of Dunning's MergingDigest, ~200 centroids, ~1% rank error — more accurate at the tails). See Approximate Algorithms (Opt-In). Under that mode,--madis auto-disabled with a warning,--weightis rejected, and results may differ slightly across runs with different--jobsvalues (pin--jobs 1for run-to-run determinism).
| Identifier | Level | Summary | Computation |
|---|---|---|---|
median | Variable | Median value (50th percentile). | Middle value of sorted data (or average of two middle values). Weighted: uses weighted nearest-rank method. When --quartiles is specified, this is not returned separately as it's the same as q2_median. |
mad | Variable | Median Absolute Deviation, a robust measure of variability. | Median of the absolute deviations from the data's median. Weighted: weighted median of absolute deviations. For dates/datetimes, returned in days. |
q1 | Variable | First Quartile (25th percentile). | Value at 25% rank using Method 3. Weighted: value at which cumulative weight first reaches 25% of total weight. |
q2_median | Variable | Second Quartile (Median). | Same as median (50th percentile). |
q3 | Variable | Third Quartile (75th percentile). | Value at 75% rank using Method 3. Weighted: value at which cumulative weight first reaches 75% of total weight. |
iqr | Variable | Interquartile Range. | q3 - q1. For dates/datetimes, returned in days with at least 5 decimal places. |
lower_outer_fence | Variable | Lower bound for extreme outliers. | q1 - (3.0 * iqr), used to identify extreme outliers. For dates/datetimes, returned in RFC3339 format. |
lower_inner_fence | Variable | Lower bound for outliers. | q1 - (1.5 * iqr), used to identify mild outliers. For dates/datetimes, returned in RFC3339 format. |
upper_inner_fence | Variable | Upper bound for outliers. | q3 + (1.5 * iqr), used to identify mild outliers. For dates/datetimes, returned in RFC3339 format. |
upper_outer_fence | Variable | Upper bound for extreme outliers. | q3 + (3.0 * iqr), used to identify extreme outliers. For dates/datetimes, returned in RFC3339 format. |
skewness | Variable | Measure of asymmetry of the probability distribution. | Quantile-based skewness: (q3 - (2.0 * q2) + q1) / iqr. |
Cardinality & Modes (Non-Streaming)
Requirements:
cardinalityanduniqueness_ratiorequire--cardinalityor--everythingmode,mode_count,mode_occurrences,antimode,antimode_count,antimode_occurrencesrequire--modeor--everything- By default,
cardinalityis computed exactly (via the same HashMap that backs mode tracking). Pass--cardinality-method approxto swap in a HyperLogLog sketch (Apache DataSketches port, lg_k=12, ~5KB/column, ~1.5% relative standard error) — useful on very-high-cardinality columns where exact counting is wasted work. See Approximate Algorithms (Opt-In).--infer-booleanforces exact (boolean inference needscardinality == 2exactness); a one-time warning is emitted. - The
--mode-cardinality-cap <n>option (default0= unbounded) bounds the per-column memory used to track modes/antimodes on high-cardinality columns. When the cap fires,mode/antimodecolumns emit the sentinel*HIGH_CARDINALITY, and (under--cardinality-method exactonly) thecardinalitycolumn emits>=<n>. The>=prefix DOES break downstream parsers expecting a plain integer, so the cap is opt-in. Under--cardinality-method approx, the cap does not affect thecardinalitycolumn (HLL emits its own1.5%-RSE estimate at fixed memory); only mode/antimode tracking is gated. The cap measures total samples added under unweighted mode (row count) and number of unique values under--weight(HashMaplen(), == true cardinality).
When --weight <column> is specified, weighted versions are computed. For weighted modes, mode_occurrences is the maximum weight (rounded). For weighted antimodes, antimode_occurrences is the minimum weight (rounded).
Multiple modes/antimodes are separated by the QSV_STATS_SEPARATOR environment variable (default: |).
| Identifier | Level | Summary | Computation |
|---|---|---|---|
cardinality | Variable | Count of unique values. | Count of distinct entries in the column. Weighted: count of unique values (weights are not considered for uniqueness). Use --cardinality-method approx for a HyperLogLog estimate (~1.5% RSE, fixed ~5KB memory) — see Approximate Algorithms (Opt-In). When --mode-cardinality-cap <n> fires under exact mode, this column emits the sentinel >=<n>. |
uniqueness_ratio | Variable | Ratio of unique values to total records. | cardinality / record_count. Interpretation: 1.0 = All unique values (e.g., primary keys). Close to 1.0 = Mostly unique values (e.g., user IDs, timestamps). Close to 0.0 = Many repeated values (e.g., categorical labels like "Male/Female" or "Yes/No"). |
mode | Variable | The most frequent value(s) in the column. | Value(s) with the highest frequency count. Weighted: value(s) with the highest weight. Multimodal-aware. If there are multiple modes, they are separated by QSV_STATS_SEPARATOR. |
mode_count | Variable | Number of modes found. | Count of values tied for highest frequency. |
mode_occurrences | Variable | Frequency count of the mode. | Number of times the mode(s) appear. Weighted: maximum weight (rounded). |
antimode | Variable | The least frequent non-zero/non-null value(s) in the column. | Value(s) with the lowest frequency count (non-zero). Returns *ALL if all values are unique. Limited to first 10 values, truncating after 100 characters (configurable with QSV_ANTIMODES_LEN). If truncated, includes *PREVIEW: prefix. Weighted: value(s) with the lowest weight. If there are multiple antimodes, they are separated by QSV_STATS_SEPARATOR. |
antimode_count | Variable | Number of antimodes found. | Count of values tied for lowest frequency. |
antimode_occurrences | Variable | Frequency count of the antimode. | Number of times the antimode(s) appear. Weighted: minimum weight (rounded). |
Percentiles (Non-Streaming)
Requires loading data into memory and sorting. When --weight <column> is specified, weighted percentiles are computed using weighted nearest-rank method.
Requirements: --percentiles or --everything
Computed using the nearest-rank method.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
percentiles | Variable | Custom percentiles of sorted values. | Nearest rank method for user-defined list. Weighted: weighted nearest-rank method. Multiple percentiles separated by QSV_STATS_SEPARATOR (default: |). Special values: "deciles" → "10,20,30,40,50,60,70,80,90", "quintiles" → "20,40,60,80". Default: "5,10,40,60,90,95". For dates/datetimes, values in RFC3339 format. |
File-Level Metadata (JSON Cache)
When stats are cached, the .stats.csv.json file includes file-level metadata that enables data fingerprinting and cache validation:
| Field | Description | Computation |
|---|---|---|
canonical_input_path | Canonical path to the input file. | Filesystem canonical (absolute) path. |
canonical_stats_path | Canonical path to the stats output file. | Filesystem canonical (absolute) path. |
record_count | Total number of rows (records). | Count of records processed (excluding header). |
field_count | Total number of columns. | Count of fields in the header/first record. |
filesize_bytes | Total file size in bytes. | Filesystem metadata size. |
date_generated | When the stats were generated. | RFC3339 timestamp (UTC). |
compute_duration_ms | Time taken to compute stats. | Elapsed wall-clock time in milliseconds. |
qsv_version | Version of qsv used to generate stats. | CARGO_PKG_VERSION at compile time. Used for cache invalidation when qsv is upgraded. |
hash.blake3 | BLAKE3 fingerprint hash of the dataset's stats. | BLAKE3 hash of the cached stats record's streaming-stats portion up to the FINGERPRINT_HASH_COLUMNS limit (29 columns in the default/non---typesonly output; effectively min(FINGERPRINT_HASH_COLUMNS, record.len()) columns in reduced-column modes such as --typesonly), plus dataset metadata (record_count, field_count, filesize_bytes). The limit is controlled by the FINGERPRINT_HASH_COLUMNS constant in src/cmd/stats.rs, which is kept in sync with the streaming-column count in stats_headers(). This allows users to quickly detect duplicate files without having to load the entire file to compute the hash. Especially useful for detecting duplicates of very large files with pre-existing stats cache metadata. |
Whitespace Visualization
The --vis-whitespace option visualizes whitespace characters in the output to make them visible. Note that spaces will only be visualized (using 《_》) if the entire value is composed of spaces.
The following whitespace markers are used (as defined in the Rust reference):
| Character | Visualization | Description |
|---|---|---|
\t | 《→》 | Tab |
\n | 《¶》 | Newline |
\r | 《⏎》 | Carriage return |
\u{000B} | 《⋮》 | Vertical tab |
\u{000C} | 《␌》 | Form feed |
\u{0085} | 《》 | Next line |
\u{200E} | 《␎》 | Left-to-right mark |
\u{200F} | 《␏》 | Right-to-left mark |
\u{2028} | 《␊》 | Line separator |
\u{2029} | 《␍》 | Paragraph separator |
\u{00A0} | 《⍽》 | Non-breaking space |
\u{2003} | 《emsp》 | Em space |
\u{2007} | 《figsp》 | Figure space |
\u{200B} | 《zwsp》 | Zero width space |
Performance & Caching
The stats command is central to qsv and underpins other "smart" commands (describegpt, frequency, joinp, pivotp, schema, sqlp, tojsonl) that use cached statistical information to work smarter & faster.
Caching Behavior:
- Statistics are cached in
<FILESTEM>.stats.csvand optionally<FILESTEM>.stats.csv.data.jsonl(with--stats-jsonl) - The arguments and file-level metadata used to generate cached stats are saved in
<FILESTEM>.stats.csv.json - If stats have already been computed with similar arguments and the file hasn't changed, stats are loaded from cache instead of recomputing
- Use
--forceto force recomputing stats even if valid cache exists - Use
--cache-thresholdto control caching behavior (default: 5000ms)
Memory-Aware Chunking:
- For non-streaming statistics, dynamically calculate chunk size based on available memory and record sampling
- Override with
QSV_STATS_CHUNK_MEMORY_MBenvironment variable (0 for dynamic sizing, positive for fixed limit, -1 for CPU-based chunking) - Enables processing of arbitrarily large "real-world" files
Approximate Algorithms (Opt-In)
By default, stats produces exact, deterministic results. Three opt-in flags swap exact accumulators for Apache DataSketches ports — Rust ports of streaming sketches — that trade a small, bounded error for constant (or near-constant) memory and faster compute on very-large columns.
| Flag | Default | Sketch | Memory | Error | Restrictions / Notes |
|---|---|---|---|---|---|
--quantile-method approx | exact | t-digest (Apache DataSketches port of Dunning's MergingDigest, ~200 centroids) | O(K) per numeric column | ~1% rank error (more accurate at the tails) | Replaces the sort-based median/q1/q2_median/q3/iqr/fences/skewness/percentiles pipeline. --mad is auto-disabled with a warning (MAD requires a second pass that t-digest does not support). --weight is rejected (the upstream datasketches crate does not expose a weighted-update API). Results may differ ~1% across runs with different --jobs values (TDigestMut::merge is associative but not chunk-count-invariant); pin --jobs 1 for run-to-run determinism. |
--cardinality-method approx | exact | HyperLogLog (Apache DataSketches port, lg_k=12) | ~5KB per column | ~1.5% relative standard error | Replaces exact cardinality/uniqueness_ratio. Reproducible across --jobs values (the HLL union used at merge time is associative and order-invariant, so chunk completion order does not affect the final estimate). --infer-boolean forces exact (boolean inference needs cardinality == 2 exactness); a one-time warning is emitted. The --mode-cardinality-cap >=<n> sentinel is never emitted under approx — only mode/antimode columns remain gated by the cap. |
--mode-cardinality-cap <n> | 0 (unbounded) | bounds the mode/antimode tracker | per-column cap on tracker entries | exact when ≤ cap; sentinels otherwise | When the tracker grows past <n>, qsv drops it and emits *HIGH_CARDINALITY for mode/antimode columns. Under --cardinality-method exact, the cardinality column emits >=<n> (the >= prefix breaks downstream integer parsers — that's why the cap is opt-in). Under --cardinality-method approx, the cap does not affect the cardinality column (HLL emits its estimate at fixed memory). The cap measures total samples added under unweighted mode (~ row count) and number of unique values under --weight (HashMap len(), == true cardinality). |
Output validation: stats uses simdutf8 for SIMD-accelerated UTF-8 validation on the output path — a perf detail with no behavioral change.
OOM auto-fallback: Whenever stats takes the non-parallel path with non-streaming columns, it runs an in-memory load check via util::mem_file_check. By default the check is NORMAL mode (file size vs. total memory − headroom). Passing --memcheck (or setting the QSV_MEMORY_CHECK env var) switches to CONSERVATIVE mode (file size vs. available + free_swap × platform_factor − headroom), which is stricter and trips OOM far more readily. If the check fails in either mode, stats layers two fallbacks before propagating the OOM error:
- Auto-create an index (when no index exists and input is not stdin) to switch to parallel/indexed processing.
- Auto-enable approx DataSketches estimators — flips
--quantile-methodand--cardinality-methodfromexacttoapproxwhere the explicit-validation guards would have accepted them. Specifically:--quantile-methodauto-enables unless--weightis set; if--mador--everythingis also set, MAD is auto-disabled (mirroring the existing--quantile-method approxguard).--cardinality-methodauto-enables unless--infer-booleanis set.
A wwarn! is emitted listing each auto-enabled estimator. The original OOM error is only propagated when neither fallback engages. The sketch fallback can fire even when an index is already present and the OOM check still trips (e.g., with --jobs 1 on a pre-indexed file) — that is a behavior change from the previous "error out" path in this narrow case. Users can disable the auto-enable by passing --quantile-method exact or --cardinality-method exact explicitly; the OOM arm scans argv for these flag names (since docopt fills in the default exact value either way) and skips the auto-enable when either flag was explicitly provided.
See also: t-digest paper (Dunning, 2019), HyperLogLog (Flajolet et al., 2007), Apache DataSketches.
moarstats
Here are all the additional statistics produced by the qsv moarstats command, sourced from src/cmd/moarstats.rs.
The moarstats command extends an existing stats CSV file (created by the stats command) by computing additional statistics that can be derived from existing stats columns and/or by scanning the original CSV file.
How it works:
- Looks for
<FILESTEM>.stats.csvfor a given CSV input - If the stats CSV file doesn't exist, it will first run the
statscommand with configurable options (via--stats-options, default:--infer-dates --infer-boolean --cardinality --mode --mad --quartiles --percentiles --force --stats-jsonl) to establish baseline stats - If the
.stats.csvfile is found, it skips running stats and just appends the additional stats columns - Statistics are rounded using Bankers Rounding (Midpoint Nearest Even) to the specified number of decimal places (default: 4, configurable with
--round) - Uses parallel processing when an index is available for large files
Requirements:
- All statistics are computed only for numeric and date/datetime columns (except Shannon Entropy which works for all field types)
- Derived statistics require specific base statistics to be present in the stats CSV
- Advanced statistics require
--advancedflag and reading the entire CSV file - Outlier statistics require quartiles (and thus fences) to be computed in the baseline stats
- Winsorized/trimmed means require either Q1/Q3 or percentiles to be available
Count Reference
moarstats documentation cites "up to an additional 55 statistical measures." That figure is
the union of the three groups below; each is enumerated explicitly in this document so the
total can be audited against the source-of-truth in src/cmd/moarstats.rs.
Counting convention. Each conceptual statistical measure counts once even when it emits multiple companion columns; the count is therefore over distinct concepts, not over output column names. Three grouping rules are applied below:
- Jarque-Bera (#17) counts as one measure even though it emits two columns
(
jarque_beraplus itsjarque_bera_pvalue); the p-value is a derived companion of the test statistic, not an independent measure. - Winsorized Mean + Trimmed Mean (#25) count together as a single robust-mean pair
(one measure entry) even though they emit 12 columns combined (each mean plus 5
companion stddev/variance/cv/range/stddev_ratio columns). They share a single robust-mean
pipeline driven by
--use-percentiles/--pct-thresholdsand are conceptually the same measure under two different boundary policies. - Covariance (bivariate #4) counts as one measure even though it emits two columns
(
covariance_sampleandcovariance_population); they differ only in the divisor.
A reader regenerating the count by tallying named output columns in src/cmd/moarstats.rs
will arrive at a higher number; arrive at 55 by collapsing the three groups above.
Univariate measures (25) — see Derived Statistics, Advanced Statistics and Robust Statistics (Winsorized & Trimmed Means):
| # | Measure | Section / Flag |
|---|---|---|
| 1 | Pearson's Second Skewness Coefficient (pearson_skewness) | Derived |
| 2 | Range to StdDev Ratio (range_stddev_ratio) | Derived |
| 3 | Quartile Coefficient of Dispersion (quartile_coefficient_dispersion) | Derived |
| 4 | Z-Score of Mode (mode_zscore) | Derived |
| 5 | Relative Standard Error (relative_standard_error) | Derived |
| 6 | Z-Score of Min (min_zscore) | Derived |
| 7 | Z-Score of Max (max_zscore) | Derived |
| 8 | Median-to-Mean Ratio (median_mean_ratio) | Derived |
| 9 | IQR-to-Range Ratio (iqr_range_ratio) | Derived |
| 10 | MAD-to-StdDev Ratio (mad_stddev_ratio) | Derived |
| 11 | Trimean (trimean) | Derived |
| 12 | Midhinge (midhinge) | Derived |
| 13 | Robust CV (robust_cv) | Derived |
| 14 | XSD type (xsd_type) | Derived |
| 15 | Kurtosis (kurtosis) | --advanced |
| 16 | Bimodality Coefficient (bimodality_coefficient) | --advanced |
| 17 | Jarque-Bera test (jarque_bera + jarque_bera_pvalue) | --advanced (emits 2 columns) |
| 18 | Gini Coefficient (gini_coefficient) | --advanced |
| 19 | Atkinson Index (atkinson_index_(<ε>), e.g. atkinson_index_(1) with the default --epsilon 1.0) | --advanced --epsilon |
| 20 | Theil Index (theil_index) | --advanced |
| 21 | Mean Absolute Deviation from mean (mean_ad) | --advanced |
| 22 | Shannon Entropy (shannon_entropy) | --advanced |
| 23 | Normalized Entropy (normalized_entropy) | --advanced (when cardinality is present) |
| 24 | Simpson's Diversity Index (simpsons_diversity_index) | --advanced |
| 25 | Winsorized Mean (winsorized_mean + 5 companion columns) and Trimmed Mean (trimmed_mean + 5 companion columns) | Robust (counted as one measure pair, emits 12 columns) |
Outlier measures (24) — see Outlier Statistics:
| # | Group | Identifiers |
|---|---|---|
| 1–7 | Outlier counts | outliers_extreme_lower_cnt, outliers_mild_lower_cnt, outliers_normal_cnt, outliers_mild_upper_cnt, outliers_extreme_upper_cnt, outliers_total_cnt, outliers_percentage |
| 8–13 | Outlier descriptive | outliers_mean, non_outliers_mean, outliers_to_normal_mean_ratio, outliers_min, outliers_max, outliers_range |
| 14–20 | Outlier variance / spread | outliers_stddev, outliers_variance, non_outliers_stddev, non_outliers_variance, outliers_cv, non_outliers_cv, outliers_normal_stddev_ratio |
| 21–22 | Outlier impact | outlier_impact, outlier_impact_ratio |
| 23–24 | Outlier boundary | lower_outer_fence_zscore, upper_outer_fence_zscore |
Bivariate measures (6, written to <FILESTEM>.stats.bivariate.csv under --bivariate) — see Bivariate Statistics:
| # | Measure |
|---|---|
| 1 | Pearson's correlation (pearson_correlation) |
| 2 | Spearman's rank correlation (spearman_correlation) |
| 3 | Kendall's tau (kendall_tau) |
| 4 | Covariance (covariance_sample + covariance_population — counted as one measure, emits 2 columns) |
| 5 | Mutual Information (mutual_information) |
| 6 | Normalized Mutual Information (normalized_mutual_information) |
Total: 25 + 24 + 6 = 55 statistical measures. Note that several measures expand into more than one output column (e.g. Jarque-Bera → 2 columns, Winsorized/Trimmed Means → 12 columns combined, Covariance → 2 columns), so the actual column count in a <FILESTEM>.stats.csv extended by moarstats --advanced plus its bivariate sidecar is higher than 55.
Derived Statistics
These statistics are computed directly from existing stats columns without scanning the original CSV file. They require specific base statistics to be present in the stats CSV.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
pearson_skewness | Variable | Pearson's Second Skewness Coefficient. Measures asymmetry of the distribution. Positive values indicate right skew, negative values indicate left skew. | 3 * (mean - median) / stddev. Requires: mean, median (or q2_median), stddev. Returns None if stddev is zero. See: Skewness |
range_stddev_ratio | Variable | Range to Standard Deviation Ratio. Normalizes the spread of data. Higher values indicate more extreme outliers relative to the variability. | range / stddev. Requires: range, stddev. Returns None if stddev is zero. |
quartile_coefficient_dispersion | Variable | Quartile Coefficient of Dispersion. Measures relative variability using quartiles. Useful for comparing dispersion across different scales. | (Q3 - Q1) / (Q3 + Q1). Requires: q1, q3. Returns None if Q1 >= Q3 or if denominator is zero. See: Quartile Coefficient of Dispersion |
mode_zscore | Variable | Z-Score of Mode. Indicates how typical the mode is relative to the distribution. Values near 0 suggest the mode is near the mean. | (mode - mean) / stddev. Requires: mode, mean, stddev. If multiple modes exist, uses the first mode. Returns None if stddev is zero. |
relative_standard_error | Variable | Relative Standard Error. Measures precision of the mean estimate relative to its magnitude. Lower values indicate more reliable estimates. | sem / mean. Requires: sem, mean. Returns None if mean is zero. |
min_zscore | Variable | Z-Score of Min. Shows how extreme the minimum value is. Large negative values indicate outliers or heavy left tail. | (min - mean) / stddev. Requires: min, mean, stddev. Returns None if stddev is zero. |
max_zscore | Variable | Z-Score of Max. Shows how extreme the maximum value is. Large positive values indicate outliers or heavy right tail. | (max - mean) / stddev. Requires: max, mean, stddev. Returns None if stddev is zero. |
median_mean_ratio | Variable | Median-to-Mean Ratio. Indicates skewness direction. Ratio < 1 suggests right skew, > 1 suggests left skew, = 1 suggests symmetry. | median / mean. Requires: median (or q2_median), mean. Returns None if mean is zero. |
iqr_range_ratio | Variable | IQR-to-Range Ratio. Measures concentration of data. Higher values (closer to 1) indicate more data concentrated in the middle 50%. | iqr / range. Requires: iqr, range. Returns None if range is zero. |
mad_stddev_ratio | Variable | MAD-to-StdDev Ratio. Compares robust vs non-robust spread measures. Higher values suggest presence of outliers affecting stddev. | mad / stddev. Requires: mad, stddev. Returns None if stddev is zero. |
trimean | Variable | Tukey's Trimean. Robust estimator of central tendency combining median with the midhinge. More robust than mean, more efficient than median alone. | (Q1 + 2*median + Q3) / 4. Requires: q1, median (or q2_median), q3. See: Trimean |
midhinge | Variable | Midhinge. Midpoint of the middle 50% of data. A robust central tendency measure that complements the mean and median. | (Q1 + Q3) / 2. Requires: q1, q3. See: Midhinge |
robust_cv | Variable | Robust Coefficient of Variation. Non-negative, outlier-resistant alternative to CV using MAD and the magnitude of the median instead of stddev and mean. | MAD / abs(median). Requires: mad, median (or q2_median). Returns None if median is zero. See: Robust measures of scale |
xsd_type | Variable | Inferred W3C XML Schema datatype. Infers the most specific XSD type based on field type and min/max values. Works for all field types. | Computed from type, min, and max columns. For Integer types, refines to most specific type (e.g., byte, short, int, long, unsignedByte, unsignedShort, unsignedInt, unsignedLong, positiveInteger, nonNegativeInteger, negativeInteger, nonPositiveInteger, or integer) based on min/max ranges. Also detects Gregorian date types (gYear, gYearMonth, gMonthDay, gDay, gMonth) with confidence markers (? = more confident from thorough scan, ?? = less confident from quick scan). For other types: Float → decimal, String → string, Date → date, DateTime → dateTime, Boolean → boolean, NULL → empty string. If min/max are not available for Integer types, defaults to integer. See: XML Schema Part 2: Datatypes |
Advanced Statistics
These statistics require the --advanced flag and reading the entire CSV file to collect all values for computation. They are computationally expensive.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
kurtosis | Variable | Excess kurtosis. Measures the "tailedness" of the distribution. Positive values indicate heavy tails, negative values indicate light tails. Values near 0 indicate a normal distribution. | Computed from all values in the column. Uses precalculated mean and variance from baseline stats for efficiency. Requires: mean, variance (or stddev). See: Kurtosis |
bimodality_coefficient | Variable | Bimodality Coefficient. Measures whether a distribution has two modes (peaks) or is unimodal. BC < 0.555 indicates unimodal, BC >= 0.555 indicates bimodal/multimodal. | Computed as (skewness² + 1) / (kurtosis + 3). Requires: skewness (from base stats) and kurtosis (from --advanced flag). See: Bimodality |
jarque_bera | Variable | Jarque-Bera test statistic. Standard test for normality using skewness and kurtosis. Higher values indicate greater departure from normality. | Computed as (n/6) * (S² + K²/4) where S is skewness and K is excess kurtosis. Requires: skewness (from base stats), kurtosis (from --advanced flag), and sample size n (from n_positive + n_negative + n_zero). See: Jarque-Bera test |
jarque_bera_pvalue | Variable | P-value for the Jarque-Bera test. Low values (< 0.05) indicate the data is NOT normally distributed. | Computed from the chi-squared distribution with 2 degrees of freedom: p = e^(-JB/2). Requires: jarque_bera. |
gini_coefficient | Variable | Gini Coefficient. Measures inequality/dispersion in the distribution. Values range from 0 (perfect equality) to 1 (maximum inequality). | Computed from all values in the column. Uses precalculated sum from baseline stats for efficiency. Requires: sum. See: Gini Coefficient |
atkinson_index_(<ε>) | Variable | Atkinson Index. Measures inequality in the distribution with a sensitivity parameter. The column name interpolates the epsilon value (e.g. atkinson_index_(1) with the default --epsilon 1.0). Values range from 0 (perfect equality) to 1 (maximum inequality). The Atkinson Index is a more general form of the Gini coefficient that allows for different sensitivity to inequality. | Computed from all values in the column. Uses precalculated mean from baseline stats for efficiency. The epsilon (ε) parameter controls sensitivity to inequality (configurable via --epsilon, default: 1.0). Higher epsilon values indicate greater sensitivity to inequality. Requires: mean. See: Atkinson Index |
theil_index | Variable | Theil Index (Generalized Entropy GE(1)). Measures inequality/concentration in the distribution. Unlike Gini, it is decomposable into within-group and between-group components. Only computed for positive values. | Computed as (1/n) * Σ((x_i / mean) * ln(x_i / mean)) for positive values. Computes mean from positive values only (not the overall precalculated mean). Requires positive values in the column. See: Theil Index |
mean_ad | Variable | Mean Absolute Deviation from mean. Average absolute distance of values from the arithmetic mean. Less robust than MAD (which uses median) but more statistically efficient. | Computed as `(1/n) * Σ |
shannon_entropy | Variable | Shannon Entropy. Measures the information content/uncertainty in the distribution. Higher values indicate more diversity, lower values indicate more concentration. Values range from 0 (all values identical) to log2(n) where n is the number of unique values. | Computed using the frequency command with --limit 0 to collect all frequencies, then calculates: H(X) = -Σ p_i * log2(p_i) where p_i is the probability of value i. Works for all field types (not just numeric). For all-unique fields, returns log2(n). See: Entropy (Information Theory) |
normalized_entropy | Variable | Normalized Entropy. Normalized version of Shannon Entropy scaled to [0, 1]. Values range from 0 (all values identical) to 1 (all values equally distributed). | Computed as shannon_entropy / log2(cardinality). Requires: shannon_entropy (from --advanced flag) and cardinality (from base stats). If cardinality is 0 or 1, returns 0. |
simpsons_diversity_index | Variable | Simpson's Diversity Index. Probability that two randomly chosen values are different. More intuitive than entropy for many users. Ranges from 0 (all identical) to 1 (all unique). | Computed as 1 - Σ(p_i²) where p_i are value proportions from frequency data. Computed alongside Shannon Entropy. Works for all field types. For all-unique fields, returns 1 - 1/n. See: Simpson's Diversity Index |
Bivariate Statistics
These statistics examine relationships between pairs of columns in a dataset. They are computed when the --bivariate flag is used and require an indexed CSV file (index will be auto-created if missing). Bivariate statistics are output to a separate file: <FILESTEM>.stats.bivariate.csv.
Note: Bivariate statistics require reading the entire CSV file and are computationally expensive. For large files (>= 10k records), parallel chunked processing is used when an index is available. For smaller files or when no index exists, sequential processing is used.
Performance Optimizations:
- Date parsing cache to avoid re-parsing same date strings
- String interning to reduce allocations for repeated values
- Batch string conversions to process multiple field pairs efficiently
- Early termination for zero-variance fields (skip all correlation computations)
- Streaming algorithms (Welford's online) for Pearson correlation and covariance
- Lazy value collection (only store values if needed for Spearman/Kendall)
Multi-Dataset Bivariate Statistics: When using --join-inputs, multiple datasets can be joined internally before computing relationships. This allows analyzing relationships across datasets that share common join keys. The joined dataset is automatically indexed before bivariate statistics computation. Output file: <FILESTEM>.stats.bivariate.joined.csv.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
pearson_correlation | Pairwise | Pearson product-moment correlation coefficient. Measures linear correlation between two numeric/date fields. Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no linear correlation. | Computed using Welford's online algorithm for efficient streaming computation across chunks. Requires both fields to be numeric or date types. Formula: covariance / (stddev_x * stddev_y). See: Pearson Correlation |
spearman_correlation | Pairwise | Spearman's rank correlation coefficient. Measures monotonic relationship between two numeric/date fields (not just linear). Values range from -1 to +1. More robust to outliers than Pearson correlation. | Computed by ranking both fields and then computing Pearson correlation on the ranks. Handles ties by averaging ranks. Requires both fields to be numeric or date types. See: Spearman's Rank Correlation |
kendall_tau | Pairwise | Kendall's tau rank correlation coefficient. Measures ordinal association between two numeric/date fields. Values range from -1 to +1. More robust to outliers and handles ties better than Spearman. | Computed by counting concordant and discordant pairs using efficient O(n log n) merge sort algorithm. Formula accounts for ties in both variables. Requires both fields to be numeric or date types. See: Kendall's Tau |
covariance_sample | Pairwise | Sample covariance. Measures how two numeric/date fields vary together. Positive values indicate positive relationship, negative values indicate inverse relationship. | Computed using Welford's online algorithm. Formula: sum((x - mean_x) * (y - mean_y)) / (n - 1). Requires both fields to be numeric or date types. |
covariance_population | Pairwise | Population covariance. Same as sample covariance but uses population formula (divides by n instead of n-1). | Computed using Welford's online algorithm. Formula: sum((x - mean_x) * (y - mean_y)) / n. Requires both fields to be numeric or date types. |
mutual_information | Pairwise | Mutual Information. Measures the amount of information obtained about one field by observing another. Values range from 0 (independent) to positive infinity. Works for all field types (numeric, date, string). | Computed from joint and marginal probability distributions. Formula: MI(X,Y) = sum(p(x,y) * log2(p(x,y) / (p(x) * p(y)))). Higher values indicate stronger relationship. Can be expensive for high-cardinality fields (use --cardinality-threshold to skip). See: Mutual Information |
normalized_mutual_information | Pairwise | Normalized Mutual Information. Normalized version of mutual information, scaled by the geometric mean of individual entropies. Values range from 0 (independent) to 1 (perfectly dependent). | Computed as MI(X,Y) / sqrt(H(X) * H(Y)) where H(X) and H(Y) are Shannon entropies of individual fields. Requires mutual information computation. See: Normalized Mutual Information |
n_pairs | Pairwise | Number of valid pairs used in computation. Indicates how many non-null value pairs were available for computing the relationship statistics. | Count of records where both fields have non-empty values. |
Configuration Options:
--bivariate-stats: Select specific statistics (pearson, spearman, kendall, covariance, mi, nmi) or use "all" or "fast" (pearson + covariance). Default:fast.--cardinality-threshold: Skip mutual information for field pairs where either field exceeds cardinality threshold (default: 1,000,000)--join-inputs: Join multiple datasets before computing bivariate statistics--join-keys: Specify join keys for each dataset--join-type: Specify join type (inner, left, right, full; default: inner)
Robust Statistics (Winsorized & Trimmed Means)
These statistics require scanning the original CSV file. They provide robust alternatives to the standard mean by handling extreme values differently.
Winsorized Mean: Replaces values below/above thresholds with threshold values, then computes mean. All values are included in the calculation, but extreme values are capped at thresholds.
Trimmed Mean: Excludes values outside thresholds, then computes mean. Only values within thresholds are included in the calculation.
Threshold Options:
- Default: Uses Q1 and Q3 as thresholds (25% winsorization/trimming)
- With
--use-percentiles: Uses configurable percentiles (default: 5th/95th) as thresholds via--pct-thresholds
Requirements:
- Default mode: Requires
q1andq3in baseline stats - Percentile mode: Requires
percentilesin baseline stats and--use-percentilesflag
| Identifier | Level | Summary | Computation |
|---|---|---|---|
winsorized_mean_25pct | Variable | Winsorized mean using Q1/Q3 thresholds (25% winsorization). | All values are included, but values below Q1 are set to Q1 and values above Q3 are set to Q3, then mean is computed. For dates/datetimes, returned in RFC3339 format. |
winsorized_mean_5pct | Variable | Winsorized mean using percentile thresholds (5th/95th percentiles). | Only computed when --use-percentiles is set. Column name varies based on --pct-thresholds (e.g., winsorized_mean_10pct for 10th/90th percentiles). |
winsorized_stddev | Variable | Standard deviation of winsorized values. | Sample standard deviation computed from winsorized values. |
winsorized_variance | Variable | Variance of winsorized values. | Sample variance computed from winsorized values. |
winsorized_cv | Variable | Coefficient of variation for winsorized values. | winsorized_stddev / winsorized_mean. Returns None if mean is zero. |
winsorized_range | Variable | Range of winsorized values. | max_winsorized - min_winsorized. |
winsorized_stddev_ratio | Variable | Ratio of winsorized stddev to overall stddev. | winsorized_stddev / stddev. Compares robust vs non-robust spread. Returns None if overall stddev is zero. |
trimmed_mean_25pct | Variable | Trimmed mean using Q1/Q3 thresholds (25% trimming). | Only values within Q1 and Q3 are included in the mean calculation. For dates/datetimes, returned in RFC3339 format. |
trimmed_mean_5pct | Variable | Trimmed mean using percentile thresholds (5th/95th percentiles). | Only computed when --use-percentiles is set. Column name varies based on --pct-thresholds. |
trimmed_stddev | Variable | Standard deviation of trimmed values. | Sample standard deviation computed from trimmed values (only values within thresholds). |
trimmed_variance | Variable | Variance of trimmed values. | Sample variance computed from trimmed values. |
trimmed_cv | Variable | Coefficient of variation for trimmed values. | trimmed_stddev / trimmed_mean. Returns None if mean is zero. |
trimmed_range | Variable | Range of trimmed values. | max_trimmed - min_trimmed. |
trimmed_stddev_ratio | Variable | Ratio of trimmed stddev to overall stddev. | trimmed_stddev / stddev. Compares robust vs non-robust spread. Returns None if overall stddev is zero. |
See: Winsorized Mean, Truncated Mean
Outlier Statistics
These statistics require scanning the original CSV file and comparing each value against fence thresholds. Fences are computed using the IQR method: inner fences at Q1/Q3 ± 1.5IQR, outer fences at Q1/Q3 ± 3.0IQR.
Requirements:
- Requires
--quartilesor--everythingin baseline stats (to compute fences) - Requires:
lower_outer_fence,lower_inner_fence,upper_inner_fence,upper_outer_fencein baseline stats
Outlier Classification:
- Extreme Lower: Values below the lower outer fence
- Mild Lower: Values between lower outer and inner fences
- Normal: Values between inner fences (non-outliers)
- Mild Upper: Values between upper inner and outer fences
- Extreme Upper: Values above the upper outer fence
See: Outlier
Outlier Counts
| Identifier | Level | Summary | Computation |
|---|---|---|---|
outliers_extreme_lower_cnt | Variable | Count of values below the lower outer fence. | Count of extreme lower outliers. |
outliers_mild_lower_cnt | Variable | Count of values between lower outer and inner fences. | Count of mild lower outliers. |
outliers_normal_cnt | Variable | Count of values between inner fences (non-outliers). | Count of normal (non-outlier) values. |
outliers_mild_upper_cnt | Variable | Count of values between upper inner and outer fences. | Count of mild upper outliers. |
outliers_extreme_upper_cnt | Variable | Count of values above the upper outer fence. | Count of extreme upper outliers. |
outliers_total_cnt | Variable | Total count of all outliers (sum of extreme and mild outliers). | Sum of all outlier counts (extreme + mild, both lower and upper). |
outliers_percentage | Variable | Percentage of values that are outliers. | (outliers_total_cnt / total_count) * 100. |
Outlier Descriptive Statistics
| Identifier | Level | Summary | Computation |
|---|---|---|---|
outliers_mean | Variable | Mean value of outliers. | Mean of all outlier values (extreme and mild, lower and upper). For dates/datetimes, returned in RFC3339 format. |
non_outliers_mean | Variable | Mean value of non-outliers. | Mean of all normal (non-outlier) values. For dates/datetimes, returned in RFC3339 format. |
outliers_to_normal_mean_ratio | Variable | Ratio of outlier mean to non-outlier mean. | outliers_mean / non_outliers_mean. Returns None if non_outliers_mean is zero. |
outliers_min | Variable | Minimum value among outliers. | Minimum value across all outliers. For dates/datetimes, returned in RFC3339 format. |
outliers_max | Variable | Maximum value among outliers. | Maximum value across all outliers. For dates/datetimes, returned in RFC3339 format. |
outliers_range | Variable | Range of outlier values. | outliers_max - outliers_min. |
Outlier Variance/Spread Statistics
| Identifier | Level | Summary | Computation |
|---|---|---|---|
outliers_stddev | Variable | Standard deviation of outlier values. | Sample standard deviation computed from outlier values. Requires at least 2 outliers. |
outliers_variance | Variable | Variance of outlier values. | Sample variance computed from outlier values. Requires at least 2 outliers. |
non_outliers_stddev | Variable | Standard deviation of non-outlier values. | Sample standard deviation computed from normal (non-outlier) values. Requires at least 2 non-outliers. |
non_outliers_variance | Variable | Variance of non-outlier values. | Sample variance computed from normal (non-outlier) values. Requires at least 2 non-outliers. |
outliers_cv | Variable | Coefficient of variation for outliers. | outliers_stddev / outliers_mean. Returns None if outliers_mean is zero or if stddev cannot be computed. |
non_outliers_cv | Variable | Coefficient of variation for non-outliers. | non_outliers_stddev / non_outliers_mean. Returns None if non_outliers_mean is zero or if stddev cannot be computed. |
outliers_normal_stddev_ratio | Variable | Ratio of outlier stddev to non-outlier stddev. | outliers_stddev / non_outliers_stddev. Compares spread of outliers vs non-outliers. Returns None if non_outliers_stddev is zero or if either stddev cannot be computed. |
Outlier Impact Statistics
| Identifier | Level | Summary | Computation |
|---|---|---|---|
outlier_impact | Variable | Difference between overall mean and non-outlier mean. | overall_mean - non_outliers_mean. Measures how much outliers affect the overall mean. |
outlier_impact_ratio | Variable | Relative impact of outliers. | outlier_impact / non_outliers_mean. Normalized measure of outlier impact. Returns None if non_outliers_mean is zero. |
Outlier Boundary Statistics
| Identifier | Level | Summary | Computation |
|---|---|---|---|
lower_outer_fence_zscore | Variable | Z-score of the lower outer fence boundary. | (lower_outer_fence - mean) / stddev. Shows how extreme the lower outlier boundary is relative to the distribution. Returns None if stddev is zero. |
upper_outer_fence_zscore | Variable | Z-score of the upper outer fence boundary. | (upper_outer_fence - mean) / stddev. Shows how extreme the upper outlier boundary is relative to the distribution. Returns None if stddev is zero. |
pragmastat
The pragmastat command computes robust, median-of-pairwise statistics using the Pragmastat library (v12.1.0). Designed for messy, heavy-tailed, or outlier-prone data where mean/stddev can mislead.
Sourced from src/cmd/pragmastat.rs.
Key Features:
- Only finite numeric values are used; non-numeric/NaN/Inf values are ignored
- Date/DateTime columns are supported when a stats cache is available (run
qsv stats -E --infer-dates --stats-jsonlfirst); dates are converted to epoch milliseconds for analysis, thencenter/bounds are formatted as dates andspread/shiftas days - Each column is treated as its own sample (two-sample compares columns, not rows)
- Non-numeric columns appear with n=0 and empty estimator cells
- Loads all numeric values into memory
Modes
pragmastat has four mutually exclusive output modes. The default (no mode flag) extends the existing stats cache the way moarstats does; the other modes always produce a standalone CSV.
| Mode flag | Behavior | Output |
|---|---|---|
| (none) | Default. Appends 7 ps_* columns to the existing .stats.csv cache file. If no cache exists, runs stats first using --stats-options. | Extended stats CSV |
--standalone | One-sample point/bound estimates as a fresh CSV, without touching the stats cache. | Standalone CSV |
-t / --twosample | Two-sample estimators for every unordered column pair. | Standalone CSV |
--compare1 <spec> | One-sample confirmatory analysis — tests center / spread against user-defined thresholds. | Standalone CSV |
--compare2 <spec> | Two-sample confirmatory analysis — tests shift / ratio / disparity against user-defined thresholds. | Standalone CSV |
Default Mode (Stats Cache Append)
Adds 7 ps_* columns to each row of the existing stats CSV (the same row-per-column layout stats and moarstats use). If no stats cache is present, one is generated first using --stats-options (default: --infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl — note: no --percentiles, unlike moarstats).
ps_* columns that already exist in the cache are left untouched unless --force is set.
| Identifier | Level | Summary | Computation |
|---|---|---|---|
ps_n | Variable | Count of values used by pragmastat estimators. | Count after filtering: finite numerics for numeric columns, or parsed epoch-ms values for Date/DateTime columns (when supported via the stats cache). Non-numeric / NaN / Inf / unparsable values are excluded. |
ps_center | Variable | Hodges-Lehmann estimator — robust location. | Median of pairwise averages. Tolerates up to 29% corrupted data. |
ps_spread | Variable | Shamos estimator — robust dispersion. | Median of pairwise absolute differences. Same units as data; also tolerates up to 29% corrupted data. |
ps_center_lower | Variable | Lower confidence bound for ps_center. | Exact under weak symmetry, with error rate = misrate. |
ps_center_upper | Variable | Upper confidence bound for ps_center. | Exact under weak symmetry, with error rate = misrate. |
ps_spread_lower | Variable | Lower confidence bound for ps_spread. | Randomized (bootstrap); error rate = misrate. |
ps_spread_upper | Variable | Upper confidence bound for ps_spread. | Randomized (bootstrap); error rate = misrate. |
Standalone Mode (--standalone)
One-sample mode that produces a fresh standalone CSV instead of extending the stats cache. (This is the legacy default behavior preserved behind a flag.)
Output columns: field, n, center, spread, center_lower, center_upper, spread_lower, spread_upper
| Identifier | Level | Summary | Computation |
|---|---|---|---|
field | Variable | Column name (or 1-based index if --no-headers). | From CSV header. |
n | Variable | Count of finite numeric values. | Count after filtering non-numeric, NaN, Inf. |
center | Variable | Hodges-Lehmann estimator — robust location. | Median of pairwise averages. Tolerates up to 29% corrupted data. Like the mean but stable with outliers. |
spread | Variable | Shamos estimator — robust dispersion. | Median of pairwise absolute differences. Same units as data. Also tolerates up to 29% corrupted data. |
center_lower | Variable | Lower confidence bound for center. | Exact under weak symmetry, with error rate = misrate. |
center_upper | Variable | Upper confidence bound for center. | Exact under weak symmetry, with error rate = misrate. |
spread_lower | Variable | Lower confidence bound for spread. | Randomized (bootstrap); error rate = misrate. |
spread_upper | Variable | Upper confidence bound for spread. | Randomized (bootstrap); error rate = misrate. |
Two-Sample Mode (-t / --twosample)
Computes statistics for all unordered column pairs. Always produces a standalone CSV.
Output columns: field_x, field_y, n_x, n_y, shift, ratio, disparity, shift_lower, shift_upper, ratio_lower, ratio_upper, disparity_lower, disparity_upper
| Identifier | Level | Summary | Computation |
|---|---|---|---|
field_x, field_y | Pairwise | Column names being compared. | From CSV header. |
n_x, n_y | Pairwise | Counts of finite numeric values. | Per-column counts after filtering non-numeric/NaN/Inf. |
shift | Pairwise | Hodges-Lehmann difference — robust location difference. | Median of pairwise differences between columns. Negative means first column tends to be lower. |
ratio | Pairwise | Robust multiplicative ratio. | exp(shift(log x, log y)). Use for positive-valued quantities (latency, price, concentration). Requires all values > 0. Suppressed for Date/DateTime pairs (depends on the arbitrary 1970 epoch origin). |
disparity | Pairwise | Robust effect size. | shift / (average spread of x and y). |
shift_lower, shift_upper | Pairwise | Confidence bounds for shift. | Exact; error rate = misrate. If bounds exclude 0, the difference is reliable. Ties may be conservative. |
ratio_lower, ratio_upper | Pairwise | Confidence bounds for ratio. | Exact; error rate = misrate. If bounds exclude 1, the difference is reliable. Requires all values > 0. |
disparity_lower, disparity_upper | Pairwise | Confidence bounds for disparity. | Randomized (Bonferroni combination); error rate = misrate. If bounds exclude 0, the disparity is reliable. |
Compare1 Mode (--compare1 <spec>)
One-sample confirmatory analysis. Tests one-sample estimates (center / spread) against user-supplied thresholds and renders a verdict per (column, threshold) pair. Always produces a standalone CSV.
Threshold format: comma-separated metric:value pairs, e.g. center:42.0 or center:42.0,spread:0.5. Valid metrics: center, spread.
Output columns: field, n, metric, threshold, estimate, lower, upper, verdict
| Identifier | Level | Summary |
|---|---|---|
field | Variable | Column name (or 1-based index if --no-headers). |
n | Variable | Count of finite numeric values. |
metric | Variable | The metric being tested (center or spread). |
threshold | Variable | The user-supplied threshold from --compare1 metric:value. |
estimate | Variable | Point estimate of the chosen metric for this column. |
lower, upper | Variable | Confidence bounds for the estimate (error rate = misrate). |
verdict | Variable | One of less (estimate statistically below threshold), greater (statistically above), or inconclusive (interval contains threshold). |
Incompatible with --no-bounds (the verdict requires bounds).
Compare2 Mode (--compare2 <spec>)
Two-sample confirmatory analysis. Tests two-sample estimates (shift / ratio / disparity) against user-supplied thresholds and renders a verdict per (column pair, threshold). Always produces a standalone CSV.
Threshold format: comma-separated metric:value pairs, e.g. shift:0 or shift:0,disparity:0.8. Valid metrics: shift, ratio, disparity.
Output columns: field_x, field_y, n_x, n_y, metric, threshold, estimate, lower, upper, verdict
| Identifier | Level | Summary |
|---|---|---|
field_x, field_y | Pairwise | Column names being compared. |
n_x, n_y | Pairwise | Per-column counts of finite numeric values. |
metric | Pairwise | The metric being tested (shift, ratio, or disparity). |
threshold | Pairwise | The user-supplied threshold from --compare2 metric:value. |
estimate | Pairwise | Point estimate of the chosen metric for this column pair. |
lower, upper | Pairwise | Confidence bounds for the estimate (error rate = misrate). |
verdict | Pairwise | One of less, greater, or inconclusive (same semantics as compare1). |
ratio rows are suppressed for Date/DateTime pairs (see Two-Sample Mode above). Incompatible with --no-bounds.
Options
| Option | Default | Description |
|---|---|---|
--twosample / -t | off | Compute two-sample estimators for all column pairs. Mutually exclusive with --compare1 / --compare2. |
--compare1 <spec> | — | One-sample confirmatory analysis. Format: metric:value[,metric:value,...]. Valid metrics: center, spread. Mutually exclusive with --twosample / --compare2. |
--compare2 <spec> | — | Two-sample confirmatory analysis. Format: metric:value[,metric:value,...]. Valid metrics: shift, ratio, disparity. Mutually exclusive with --twosample / --compare1. |
--select <cols> / -s | all numeric columns (when stats cache fresh) | Column selection using qsv's column-selection syntax. Non-numeric columns appear with n=0. In two-sample mode, all pairs of selected columns are computed. |
--misrate <n> / -m | 0.001 | Probability that bounds fail to contain the true parameter. Lower values produce wider bounds. Must be achievable for the given sample size. Use 1e-3 for everyday analysis or 1e-6 for critical decisions. |
--standalone | off | Force one-sample mode to emit a standalone CSV instead of extending the stats cache. No effect with --twosample / --compare1 / --compare2 (which are always standalone). |
--stats-options <arg> | --infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl | Options passed to the stats command when baseline stats need to be generated. Note: this default differs from moarstats by omitting --percentiles. |
--round <n> | 4 | Round statistics to <n> decimal places. Uses Midpoint Nearest Even (Bankers Rounding). |
--force | off | Force recomputing ps_* columns in the stats cache even if they already exist. |
--subsample <N> | off | Partial Fisher-Yates shuffle keeping only N values per column before computing. ~100× speedup on large datasets while preserving statistical robustness. Recommended: 10,000–50,000 for exploratory analysis. Incompatible with the default cache-append mode (approximate results would be silently reused as if computed from the full dataset) — must be combined with --standalone or one of the other non-cache modes. |
--seed <N> | 42 (when --subsample is set) | Seed for reproducible subsampling. |
--no-bounds | off | Skip confidence bound computation (~2× speedup) when only point estimates are needed. Incompatible with --compare1 / --compare2, and with the default cache-append mode (the cache would store empty bounds that a subsequent run would silently reuse) — must be combined with --standalone. |
--output <file> / -o | stdout | Write output to file instead of stdout. |
--delimiter <c> / -d | , | Field delimiter for reading/writing CSV data. |
--no-headers / -n | off | When set, the first row will not be treated as headers. |
--jobs <arg> / -j | number of CPUs | The number of jobs to run in parallel. When not set, defaults to the number of CPUs detected. |
--memcheck | off | Check if there is enough memory to load the entire CSV into memory using CONSERVATIVE heuristics. Not valid for stdin. |
Performance Characteristics
Algorithmic Complexity
All Pragmastat estimators avoid naïve O(n²) pairwise enumeration by using implicit‑matrix selection and binary‑search techniques (see pragmastat.dev/synopsis):
| Estimator(s) | Complexity | Technique |
|---|---|---|
center, center_bounds | O(n log n) | Monahan's implicit‑matrix selection + SignedRankMargin |
spread, spread_bounds | O(n log n) | Monahan's selection for differences + disjoint‑pair sign‑test inversion |
shift, shift_bounds, ratio, ratio_bounds | O((n+m) log L) | Value‑space binary search over pairwise differences; L = value range |
disparity, disparity_bounds | O((n+m) log L + n log n + m log m) | Bonferroni split combining shift bounds + average spread bounds |
These are per‑column (one‑sample) or per‑pair (two‑sample) complexities. Randomization primitives (xoshiro256++) are O(1) per draw.
qsv Implementation Optimizations
- Parallel computation — Columns (one‑sample) and column pairs (two‑sample) are processed in parallel via Rayon; controlled by
--jobs. - Parallel indexed CSV reading — Files with ≥ 10,000 rows and a
.csv.idxindex are read in parallel chunks using a ThreadPool with crossbeam channels and deterministic seeking. --subsample N— Partial Fisher‑Yates shuffle keeps only N values per column before computing, with deterministic per‑column seeding (--seeddefaults to 42). Provides ~100× speedup on large datasets while preserving statistical robustness. Recommended: 10,000–50,000 for exploratory analysis.--no-bounds— Skips confidence bound computation for ~2× speedup when only point estimates are needed.- Combined
--subsample+--no-bounds— ~200× speedup for quick exploratory analysis on large datasets. - Stats cache integration — When a fresh
.stats.csv.data.jsonlcache exists, non‑numeric columns are automatically filtered out before computation, and Date/DateTime type detection is read from the cache rather than re‑inferred. - Pre‑computed log arrays — In two‑sample mode,
ln()transformations are computed once per column in parallel and shared across all pairs, avoiding redundant O(n) passes. - SIMD‑accelerated parsing — Uses
simdutf8for UTF‑8 validation,fast_float2for float parsing, andsimd_json(little‑endian) for cache deserialization. - Pre‑allocated buffers — Column vectors are sized to estimated row counts, and chunk buffers are pre‑allocated before parallel reads to minimize reallocations.
When Values Are Blank
Cells are empty (blank) when:
- No numeric data (n=0): The column contains no finite numeric values
- Positivity required:
ratio,ratio_lower, andratio_upperrequire all values > 0 - Date/DateTime pairs:
ratiois suppressed for--twosampleand--compare2because it depends on the arbitrary 1970 epoch origin and isn't meaningful for dates;shift,disparity, and their bounds remain populated - Sparsity required:
spread,spread_lower,spread_upper,disparity,disparity_lower, anddisparity_upperneed real variability (not tie-dominant data) - Insufficient data for bounds: All bounds columns need enough data for the requested misrate; try a higher misrate or more data
See: Pragmastat manual (PDF), pragmastat.dev
frequency
The frequency command computes frequency distribution tables for CSV columns (exact by default, with an opt-in Frequent Items sketch for bounded-memory top-K), with support for multiple output formats, ranking strategies, and weighted frequencies.
Key Features:
- Computes exact frequency counts by default; opt into a Misra-Gries heavy-hitters sketch via
--sketch-method frequent_itemsfor bounded-memory top-K on very-high-cardinality streams (see Frequent Items Sketch (Approximate Top-K)) - Multiple ranking strategies for handling tied values
- Weighted frequency support using a specified weight column
- CSV and JSON/TOON output modes
- Integration with stats cache for memory optimization
- Memory-aware chunking for large datasets
- Parallel processing support with indexing
Stats Cache Integration:
When the stats cache exists (created by qsv stats --stats-jsonl), the frequency command uses it to:
- Detect ID columns (cardinality == rowcount) and short-circuit frequency compilation
- Pre-allocate appropriate hashmap capacity based on cardinality
- Avoid building hashmaps for all-unique columns
Without stats cache, frequency will compute frequencies for ALL columns, even ID columns, which can use significant memory.
For more examples, see https://github.com/dathere/qsv/blob/master/tests/test_frequency.rs.
Frequency Table Output
In CSV output mode (default), the table is formatted as CSV data with the following columns:
| Column | Description |
|---|---|
field | Column name (or 1-based index if --no-headers is used) |
value | The value from the column |
count | Frequency count (or weighted sum if --weight is used) |
percentage | Percentage of total (count/total * 100) |
rank | Rank based on count (1 = most frequent, or least frequent if --asc is used) |
Formatting Options:
--pct-dec-places <arg>— decimal places for percentage rounding (default: -5). When negative, the number of decimal places is automatically determined to the minimum needed to represent the percentage accurately, up to the absolute value of the negative number.--no-trim— don't trim whitespace from values when computing frequencies. By default, leading and trailing whitespace is trimmed.--vis-whitespace— visualize whitespace characters in the output using the same markers asstats(see Whitespace Visualization).
Special Values:
<ALL_UNIQUE>(configurable via--all-unique-text): For ID columns detected via stats cache, indicates all values are unique. Count equals row count, percentage is 100%, rank is 0.Other (N)(configurable via--other-text, default:Other): When--limitis used, remaining values are grouped into this category. N indicates the count of unique values grouped. Rank is 0. Use--no-other(alias for--other-text "<NONE>") to exclude the "Other" category entirely.(NULL)(configurable via--null-text, default:(NULL)): Represents empty/missing values. Can be excluded with--no-nulls(alias for--null-text "<NONE>").
Limit Behavior:
--limit N(positive): Keep only top N most frequent values--limit -N(negative): Keep only values with count >= N--limit 0: No limit, return all values--unq-limit N: For all-unique columns, limit to N sample values (default: 10)--lmt-threshold N: Only apply limits when unique count >= N (default: 0 = always apply)
Sorting:
- Default: Descending order by count (most frequent first)
--asc: Ascending order by count (least frequent first). Note: This also reverses ranking - least frequent values get rank 1.--other-sorted: Include "Other" category in sorted order instead of at the end
Ranking Strategies
The --rank-strategy option controls how ranks are assigned when multiple values have the same count. See https://en.wikipedia.org/wiki/Ranking for more info.
| Strategy | Description | Example (counts: 4, 3, 3, 2) |
|---|---|---|
dense | Consecutive integers regardless of ties (1223 ranking) | 1, 2, 2, 3 |
min | Tied items receive minimum rank position (1224 ranking) | 1, 2, 2, 4 |
max | Tied items receive maximum rank position (1334 ranking) | 1, 3, 3, 4 |
ordinal | Next rank is current rank plus 1 (1234 ranking) | 1, 2, 3, 4 |
average | Tied items receive average of positions (1 2.5 2.5 4 ranking) | 1, 2.5, 2.5, 4 |
Note: Tied values with the same rank are sorted alphabetically within their rank group.
NULL Handling
The frequency command provides several options for controlling how NULL (empty) values are handled:
| Option | Default | Description |
|---|---|---|
--null-text <arg> | (NULL) | Customize the display text for NULL values. Set to <NONE> to exclude NULLs entirely. |
--no-nulls | off | Don't include NULLs in the frequency table. Alias for --null-text "<NONE>". |
--null-sorted | off | Sort NULL entries with other values by count instead of placing them at the end of the frequency table (after "Other" if present). |
--pct-nulls | off | Include NULL values in percentage and rank calculations. When disabled (default), percentages are "valid percentages" with NULLs excluded from the denominator, and NULL entries display empty percentage and rank values. When enabled, NULLs are included in the denominator (original behavior). Has no effect when --no-nulls is set. |
Column Filtering
The frequency command supports filtering columns from the frequency analysis:
| Option | Description |
|---|---|
--no-float <cols> | Exclude Float columns from frequency analysis. Floats typically contain continuous values where frequency tables are not meaningful. Use --no-float "*" to exclude ALL Float columns, or specify a comma-separated list of Float columns to INCLUDE as exceptions (e.g., --no-float price,rate excludes all Floats except "price" and "rate"). Requires stats cache for type detection. |
--stats-filter <expr> | Filter columns based on their statistics using a Luau expression. Columns where the expression evaluates to true are EXCLUDED. Available fields include: field, type, is_ascii, cardinality, nullcount, sum, min, max, range, sort_order, min_length, max_length, mean, stddev, variance, cv, sparsity, q1, q2_median, q3, iqr, mad, skewness, mode, antimode, n_negative, n_zero, n_positive, etc. Examples: "nullcount > 1000", "type == 'Float'", "cardinality > 500 and nullcount > 0". Requires stats cache and the luau feature. |
Weighted Frequencies
When the --weight <column> option is specified, frequency counts are multiplied by the weight value for each row.
Weight Handling:
- Weight column must be numeric
- Weight column is automatically excluded from frequency computation
- Missing or unparsable weights default to 1.0
- Zero, negative, NaN, and infinite weights are ignored and do not contribute to frequencies
- Weight tolerance calculation uses stats cache (stddev/range/mean) for scale-aware tolerance when available
Output:
- Count column shows weighted sum (displayed as rounded integer)
- Percentage calculated as:
weight / total_weight * 100 - Ranking based on weighted sums
Stats Cache Integration
The frequency command leverages the stats cache (created by qsv stats --stats-jsonl) to optimize memory usage and performance:
ID Column Detection:
When stats cache exists, columns where cardinality == rowcount are detected as ID columns. For these columns:
- Frequency compilation is short-circuited (no hashmap built)
- Output shows single
<ALL_UNIQUE>entry with count = rowcount, percentage = 100%, rank = 0 - Saves significant memory for large datasets with ID columns
Memory Optimization:
- Hashmap capacity pre-allocated based on cardinality from stats cache
- For parallel processing, capacity divided by number of chunks
- Reduces allocations and improves performance
Disabling Stats Cache:
Set QSV_STATSCACHE_MODE=none to force computing frequencies for ALL columns including ID columns. Useful when you need a "complete" frequency table even for ID columns. In this case, use --unq-limit to avoid memory issues with large cardinality columns.
Creating Stats Cache:
# Create stats cache with cardinality for frequency optimization
qsv stats --cardinality --stats-jsonl data.csv
# Or create with all stats
qsv stats --everything --stats-jsonl data.csv
Frequency Cache
The --frequency-jsonl flag writes a complete frequency distribution as a JSONL cache file (FILESTEM.freq.csv.data.jsonl). When a valid (fresh) cache exists, subsequent frequency runs automatically reuse it instead of recomputing from the CSV.
Cache Options:
| Option | Default | Description |
|---|---|---|
--frequency-jsonl | off | Write the frequency distribution as a JSONL cache. Requires a file input (not stdin). |
--high-card-threshold <arg> | 100 | Absolute cardinality threshold for <HIGH_CARDINALITY> classification. Can also be set with QSV_FREQ_HIGH_CARD_THRESHOLD env var (env var takes precedence when CLI value equals the default). |
--high-card-pct <arg> | 90 | Percentage of rowcount threshold for <HIGH_CARDINALITY> classification. Must be between 1 and 100. Can also be set with QSV_FREQ_HIGH_CARD_PCT env var (env var takes precedence when CLI value equals the default). |
--force | off | Force recomputation and cache regeneration even when a valid frequency cache exists. |
HIGH_CARDINALITY Sentinel:
Columns whose cardinality exceeds the smaller of --high-card-threshold and --high-card-pct percent of rowcount are classified as HIGH_CARDINALITY. These get a single <HIGH_CARDINALITY> sentinel entry (count = rowcount, percentage = 100%, rank = 0), analogous to the <ALL_UNIQUE> sentinel for ID columns.
Cache Validation:
- The cache is considered valid when the CSV file's mtime is older than the cache file's mtime
- Metadata compatibility is checked:
--no-nulls,--no-headers, and--delimitermust match the cached settings
Incompatibilities:
The --frequency-jsonl flag produces an error when combined with:
--ignore-case— case folding changes computed values--no-trim— whitespace handling changes computed values--weight— weighted frequencies change computed values
Partial Cache Hits: When the cache is valid, columns with full cached data are served directly from the cache. HIGH_CARDINALITY columns (which store only a sentinel) are recomputed via parallel processing against the original CSV.
Frequent Items Sketch (Approximate Top-K)
By default, frequency computes exact counts by tracking every distinct value in a HashMap. For columns with very high cardinality, this can be memory-prohibitive. The --sketch-method frequent_items flag swaps the HashMap for the Misra-Gries heavy-hitters sketch (Apache DataSketches port), which tracks the top-K most frequent values in constant memory with bounded additive error.
| Option | Default | Description |
|---|---|---|
--sketch-method <m> | exact | Algorithm for the frequency table. Choices: exact (HashMap, exact counts) or frequent_items (Misra-Gries sketch, approximate top-K). |
--sketch-map-size <n> | 4096 | Maximum map size for the Frequent Items sketch. Must be a power of two and ≥ 8. Larger values tighten the error bound at the cost of more memory. Only used when --sketch-method frequent_items. |
Counts are estimates. The sketch reports each item's upper-bound frequency estimate; tail items not retained in the sketch are aggregated into a single "Other" row (no unique-count suffix, since the sketch cannot recover the true number of distinct tail items). The sketch's natural ordering is top-K by estimate descending; tied counts use the sketch's hash-table iteration order. The frequency cache is bypassed under this mode.
Rejected flags (the command errors out if any of these are combined with --sketch-method frequent_items):
| Flag | Reason |
|---|---|
--asc | The sketch tracks heavy hitters only — least-frequent items are not recoverable. |
--weight | The Apache DataSketches Frequent Items sketch operates on unit-weight streams. |
--ignore-case | Case folding changes computed values; not supported in streaming sketch mode. |
--no-trim | Whitespace handling changes computed values; not supported in streaming sketch mode. |
--other-sorted | The sketch always emits the "Other" row at the end. |
--null-sorted | The sketch ranks NULL alongside other values by estimate; no reordering support. |
--frequency-jsonl | The frequency cache is bypassed under sketch mode. |
--stats-filter | Incompatible with sketch-mode dispatch. |
--json / --pretty-json / --toon | Only CSV output is supported under sketch mode. |
Silently ignored flags under frequent_items (no error, no effect):
--rank-strategy— the sketch's natural top-K-by-estimate ordering is used.--lmt-threshold— the sketch always tracks at most--sketch-map-sizecandidates.--unq-limit— the sketch's bounded map is itself the unique-limit.
"Other" row divergence: Under frequent_items, the "Other" row label is the bare --other-text (no (N) unique-count suffix, since the sketch cannot recover the true count of items not in the top-K), and rank is 0 to match the existing convention for the exact mode's "Other" row.
When to use: prefer frequent_items for streaming over wide tables with many high-cardinality string columns where you only care about the heavy hitters and want predictable, fixed memory. Prefer exact (default) for small/medium cardinality, weighted streams, or whenever you need a complete frequency distribution.
See also: Misra-Gries summary (Wikipedia), Apache DataSketches Frequent Items.
JSON/TOON Output
The --json or --pretty-json flags output frequency tables as nested JSON. The --toon flag outputs in TOON format (compact, human-readable encoding for LLM prompts).
JSON Structure:
{
"input": "filename.csv",
"description": "command arguments",
"rowcount": 1000,
"fieldcount": 5,
"rank_strategy": "dense",
"fields": [
{
"field": "column_name",
"type": "String",
"cardinality": 10,
"nullcount": 0,
"sparsity": 0.0,
"uniqueness_ratio": 0.01,
"stats": [
{"name": "sum", "value": 1000},
{"name": "min", "value": "A"},
{"name": "max", "value": "Z"},
{"name": "range", "value": null},
{"name": "sort_order", "value": "UNSORTED"},
{"name": "min_length", "value": 1},
{"name": "max_length", "value": 1},
{"name": "avg_length", "value": 1},
{"name": "mean", "value": null},
{"name": "stddev", "value": null},
...
],
"frequencies": [
{"value": "A", "count": 500, "percentage": 50.0, "rank": 1},
{"value": "B", "count": 300, "percentage": 30.0, "rank": 2},
...
]
}
]
}
Additional Stats in JSON Output:
When --no-stats is NOT set and the column type is not empty, NULL, or Boolean, the per-field stats array contains up to 17 statistics (only those present in the underlying stats record are emitted):
sum— Sum of numeric valuesmin— Minimum valuemax— Maximum valuerange— Range (max - min)sort_order— ASCENDING, DESCENDING, or UNSORTEDmin_length— Shortest string length (String types)max_length— Longest string length (String types)sum_length— Total of all string lengths (String types)avg_length— Average string length (String types)stddev_length— Standard deviation of string lengths (String types)variance_length— Variance of string lengths (String types)cv_length— Coefficient of variation of string lengths (String types)mean— Arithmetic meansem— Standard error of the meanstddev— Standard deviationvariance— Variancecv— Coefficient of variation
Note:
cardinality,nullcount,sparsity, anduniqueness_ratioare emitted as top-level properties of eachFrequencyField(see the JSON Structure example above), not inside the per-fieldstatsarray.geometric_mean,harmonic_mean, andmax_precisionare not included infrequencyJSON output even when present in the stats cache — useqsv statsdirectly if you need them.
Memory-Aware Processing
The frequency command defaults to dynamic, memory-aware chunking for large datasets to avoid out-of-memory errors. Both stats and frequency default to memory-aware sizing; frequency is documented separately here because it builds hash tables and therefore benefits from predictable per-chunk memory budgeting. CPU-based chunking can be requested explicitly via QSV_FREQ_CHUNK_MEMORY_MB = -1 (the same convention stats uses with QSV_STATS_CHUNK_MEMORY_MB).
Chunking Behavior:
- Automatically enabled for indexed files
- Chunk size calculated based on:
- Available memory
- Record sampling (samples first 1000 records)
- Estimated hashmap overhead for frequency tables
- Controlled by
QSV_FREQ_CHUNK_MEMORY_MBenvironment variable:- Not set or
0: Dynamic sizing based on available memory and sampling - Positive N: Fixed memory limit of N MB per chunk
-1: CPU-based chunking (num_records / num_CPUs)
- Not set or
Parallel Processing:
- Requires an index file (
qsv index data.csv) - Automatically enabled when index exists (disable with
--jobs 1) - Each chunk processed independently, then merged
- For unindexed files, falls back to sequential processing
Memory Estimation: The command estimates memory per record as:
- Base record size (sum of field lengths)
- Hashmap overhead (~24 bytes per entry + value size)
- Additional overhead for Vec capacity (~25%)
Auto-Index Creation: If memory check fails and file is not indexed:
- Attempts to auto-create index
- Switches to parallel processing if successful
- Falls back to sequential if index creation fails
For configuration details, see https://github.com/dathere/qsv/blob/master/docs/ENVIRONMENT_VARIABLES.md
Processing Very Large Files
This section consolidates guidance for running stats and frequency on files that are large relative to available RAM (rule of thumb: any file whose CSV size approaches or exceeds 50% of free memory, or any file with columns whose true cardinality could approach the row count). For configuration knobs referenced below, see ENVIRONMENT_VARIABLES.md.
When to Worry
Most stats/frequency invocations on multi-GB files do not need special handling, because the defaults already cover the common cases:
- Streaming stats are constant-memory. All 27 streaming statistics (list above) —
sum,min,max,min_length/max_length/avg_length,mean/stddev/variance/cv,sem,geometric_mean/harmonic_mean,nullcount/sparsity, type-counts,sort_order/sortiness,is_ascii,max_precision, andrange— run in O(1) memory regardless of row count. A plainqsv stats huge.csvon a 1 TB file with no non-streaming flags will not OOM. frequencyuses memory-aware chunking by default. With an index, it samples the first 1000 records, estimates per-record + HashMap overhead, and sizes chunks to fit available memory.- OOM auto-fallback is on by default. When non-streaming
statsorfrequencywould exceed the memory-check budget, qsv first tries to auto-create an index (for non-stdin inputs), then auto-enables DataSketches estimators where the flags allow. Awwarn!line is printed listing each auto-enabled estimator.
You only need the recipes in this section when:
- You are requesting non-streaming stats (any of
--mode,--cardinality,--quartiles,--median,--mad,--percentiles, or--everything) on a file too large to materialize the per-column state in memory, or - You are running
frequencyon columns whose distinct-value count could blow up the HashMap (UUIDs, free-text, timestamps with sub-second precision), or - You are reading from stdin (no index can be auto-created — the OOM fallback can still engage the sketch path, but cannot use indexed parallelism), or
- You are on a big-endian target (see Platform Note).
Memory Pressure Points
| Command | Pressure point | Scales with | Mitigation |
|---|---|---|---|
stats streaming | none | O(1) | — |
stats --cardinality (exact) | distinct-value HashMap per column | per-column cardinality | --cardinality-method approx (HLL, ~5 KB/col) or --mode-cardinality-cap <n> |
stats --mode / --everything (unweighted) | mode tracker Vec | row count (every cell pushed) | --mode-cardinality-cap <n> (this is the only knob; mode is not sketched even under --cardinality-method approx) |
stats --mode --weight=… | mode tracker HashMap | per-column cardinality | --mode-cardinality-cap <n> |
stats --median / --quartiles / --mad / --percentiles | sort buffer per column | column row count | --quantile-method approx (t-digest, ~200 centroids/col). Note: --mad is auto-disabled under approx (needs a second pass that t-digest cannot serve). |
frequency (exact) | per-column HashMap of distinct values | per-column cardinality | --sketch-method frequent_items (Misra-Gries, fixed --sketch-map-size slots) |
Two things are easy to miss:
mode/antimodeis the most common surprise. Under unweighted mode tracking, every cell in the column is pushed onto an underlying Vec, so the tracker grows with row count, not cardinality. The--cardinality-method approxHLL replaces only thecardinalitycolumn — mode/antimode is not sketched. The fix is--mode-cardinality-cap <n>: when the tracker grows pastn, qsv drops it and emits*HIGH_CARDINALITYfor the mode/antimode columns.--mode-cardinality-cap 0is the default (no cap). It is opt-in because, under--cardinality-method exact, an exceeded cap emits>=<n>in thecardinalitycolumn — and the>=prefix breaks downstream integer parsers. Under--cardinality-method approx, the cap does not affect the cardinality column (HLL emits its estimate regardless), so combining the two is safe.
Recipe: stats on Very Large Files
Maximum-safety invocation on a multi-GB CSV when you need the full non-streaming stat set:
# one-time: index enables parallel chunking
qsv index huge.csv
qsv stats huge.csv \
--everything \
--quantile-method approx \
--cardinality-method approx \
--mode-cardinality-cap 1000000 \
--stats-jsonl \
-o huge.stats.csv
What each non-default flag contributes:
--quantile-method approx— t-digest for median/quartiles/percentiles/skewness.--cardinality-method approx— HyperLogLog forcardinality/uniqueness_ratio.--mode-cardinality-cap 1000000— bound the mode/antimode trackers.--stats-jsonl— also write the stats cache for downstream "smart" commands.
What this gives you, in order of memory savings:
- Indexed parallel processing. Without an index,
statsruns sequentially; with an index, work is split into memory-aware chunks (sized byQSV_STATS_CHUNK_MEMORY_MB) processed in parallel and merged. - t-digest for quantiles. Median, q1/q2/q3, IQR, fences, skewness, and
--percentilesall read from a ~200-centroid t-digest per column instead of sorting the full column. Error is ~1% rank error, more accurate at the tails. Caveat:TDigestMut::mergeis associative but not chunk-count-invariant, so different--jobsvalues can yield ~1% differences across runs. Pin--jobs 1for run-to-run determinism.--madis auto-disabled with a warning. - HLL for cardinality.
cardinalityanduniqueness_ratiocome from a ~5 KB HyperLogLog per column. ~1.5% RSE. The HLL union is associative and order-invariant, so the estimate is reproducible across--jobsvalues. - Cap on mode/antimode tracker. Without this, the unweighted mode tracker grows linearly with row count. With it set to, e.g.,
1000000, columns whose tracker exceeds that drop to*HIGH_CARDINALITYfor mode/antimode while every other statistic remains valid.
Flags that block this recipe (you must drop them or fall back to exact mode):
--weight <col>— t-digest has no weighted-update API upstream, so--quantile-method approxis rejected with--weight.--infer-boolean— needscardinality == 2exactness, so--cardinality-method approxis rejected (or, under OOM auto-enable, suppressed) with--infer-boolean.
If neither of those applies, you can omit the explicit method flags and rely on the OOM auto-fallback: qsv will flip them on automatically when util::mem_file_check trips. You can disable the auto-enable by passing --quantile-method exact or --cardinality-method exact explicitly (the OOM arm scans argv for these flag names, so docopt's default-fill does not count as an explicit opt-out).
Recipe: frequency on Very Large Files
If you only care about the top-K most frequent values (a common analyst case), use the Misra-Gries sketch:
qsv index huge.csv
qsv frequency huge.csv \
--sketch-method frequent_items \
--sketch-map-size 4096 \
--limit 100 \
-o huge.freq.csv
--sketch-map-size must be a power of two and ≥ 8; larger values tighten the error bound at the cost of more memory. --limit 100 emits the top 100 values per column.
--sketch-map-size sets the upper bound on map slots; the sketch's worst-case additive error is bounded by the stream length minus the active map total, so doubling the map size roughly halves the error bound at the cost of doubling memory. 4096 is a reasonable starting point; bump to 16384 or 65536 for tighter bounds.
Flags that are rejected under --sketch-method frequent_items (the full list is in the Frequent Items Sketch section): --asc, --weight, --ignore-case, --no-trim, --other-sorted, --null-sorted, --frequency-jsonl, --stats-filter, --json/--pretty-json/--toon. If you need any of these, you must run in exact mode and rely on memory-aware chunking (and possibly the OOM auto-enable, which is itself blocked by the same flag set).
Silently ignored under FI mode: --rank-strategy, --lmt-threshold, --unq-limit (the sketch's bounded map is itself the unique-limit, and ordering is fixed at top-K by estimate descending).
"Other" row divergence: the Other label has no (N) unique-count suffix and rank is 0, since the sketch cannot recover the true tail count.
Indexing for Parallelism
For both stats and frequency, an index is the single highest-leverage prerequisite for large-file processing:
qsv index huge.csv # creates huge.csv.idx; updated automatically when stale
What an index unlocks:
- Parallel chunking. Work is split across cores (
-j Nor auto-detected). Each chunk is processed independently and merged. - Memory-aware chunk sizing. With
QSV_STATS_CHUNK_MEMORY_MB/QSV_FREQ_CHUNK_MEMORY_MBunset (the default), qsv samples the first 1000 records, estimates per-record memory, and picks a chunk size that fits available memory. - OOM fallback for
stats. Whenutil::mem_file_checktrips and no index exists, qsv attempts to auto-create one before falling back to sketches. Auto-creation is skipped for stdin (not seekable), socat huge.csv | qsv stats …cannot benefit from indexed parallelism — pipe to a file first if you can.
You can also auto-build the index by setting QSV_AUTOINDEX_SIZE=<bytes> — any CSV larger than that threshold gets an index created on first use.
Environment Variable Cheat Sheet
Most relevant for large-file work (see ENVIRONMENT_VARIABLES.md for the full list):
| Variable | Effect |
|---|---|
QSV_AUTOINDEX_SIZE | Minimum file size (bytes) for automatic index creation. Set this so big inputs always get indexed. |
QSV_MEMORY_CHECK | Switches util::mem_file_check from NORMAL (file size vs. total memory − headroom) to CONSERVATIVE (file size vs. available + free_swap × platform_factor − headroom). Trips OOM far more readily, so the auto-fallback engages sooner. |
QSV_FREEMEMORY_HEADROOM_PCT | Free-memory headroom for the memory check (default 20%). Set to 0 to skip the check entirely (use at your own risk). |
QSV_STATS_CHUNK_MEMORY_MB | Per-chunk memory cap for stats (positive integer in MB). 0 = dynamic sizing. -1 = CPU-based chunking (chunks = rows/cores; ignores memory). |
QSV_FREQ_CHUNK_MEMORY_MB | Same semantics as above, for frequency. |
QSV_ANTIMODES_LEN | Truncation length for the antimodes preview (default 100 chars). 0 disables truncation. |
QSV_STATS_STRING_MAX_LENGTH | Truncate min/max for String columns at this length (useful when a column contains GeoJSON / Shapefile geometry blobs that would otherwise blow up downstream parsers). |
QSV_MAX_JOBS | Cap on parallel workers across all multithreaded qsv commands. Useful when each chunk's in-memory state is large (lower QSV_MAX_JOBS to leave headroom). |
QSV_FREQ_HIGH_CARD_THRESHOLD / QSV_FREQ_HIGH_CARD_PCT | Cardinality cutoffs for the --frequency-jsonl cache to emit a HIGH_CARDINALITY sentinel instead of a full frequency entry. Useful for keeping the cache compact on wide tables with ID-like columns. |
Worked Example: a Multi-GB CSV
For a 30 GB CSV with ~200 columns on a 32 GB host, where some columns are UUIDs:
# 1. Index up front so all subsequent passes are parallel + chunked.
qsv index big.csv
# 2. Stats: full non-streaming set, but bound mode tracking and use sketches.
QSV_STATS_CHUNK_MEMORY_MB=512 \
qsv stats big.csv \
--everything \
--quantile-method approx \
--cardinality-method approx \
--mode-cardinality-cap 1000000 \
--stats-jsonl \
-o big.stats.csv
# 3. Frequency: top-100 per column, sketch-mode for fixed memory.
qsv frequency big.csv \
--sketch-method frequent_items \
--sketch-map-size 16384 \
--limit 100 \
-o big.freq.csv
# 4. (Optional) Tighten the memory check if you're sharing the host:
QSV_MEMORY_CHECK=1 QSV_FREEMEMORY_HEADROOM_PCT=40 qsv stats big.csv …
If you forgot any of the sketch flags and stats hits the memory check, the OOM auto-fallback will print a wwarn! line such as:
OOM during memory check: auto-enabling DataSketches estimators
(--quantile-method approx, --cardinality-method approx).
Re-run with explicit --quantile-method exact / --cardinality-method exact
to disable the auto-enable.
The exact estimators auto-enabled depend on which incompatible flags are set (--weight blocks t-digest; --infer-boolean blocks HLL; --mad/--everything causes MAD to be auto-disabled under approx). The corresponding line for frequency mentions --sketch-method frequent_items and reports the map size.
Hard Limits (What Will Still OOM)
The DataSketches integration is a major step toward unbounded inputs, but it does not make stats/frequency truly unconditional. Cases where you can still hit memory exhaustion:
- Big-endian targets (s390x, PowerPC BE). DataSketches is unavailable — all
--quantile-method approx,--cardinality-method approx, and--sketch-method frequent_itemspaths are rejected, and the OOM auto-enable compiles to a no-op stub. On these targets, fall back to--mode-cardinality-cap, smallerQSV_STATS_CHUNK_MEMORY_MB, and--limit/--unq-limitonfrequency. - Unweighted mode/antimode without a cap. The tracker grows with row count regardless of
--cardinality-method. Solution: set--mode-cardinality-capto a value you can afford, or drop--mode/--everything. - Frequency in exact mode with unbounded distinct values. If the column is truly unique-per-row (a UUID column on a 1 B-row CSV), exact mode needs ~1 B HashMap entries. Solution: switch to
--sketch-method frequent_items, or pre-bucket the column. --weightblocks t-digest and--infer-booleanblocks HLL forstats. If both flags are set, neither auto-enable engages, and the memory check will simply fail. Solution: drop the blocking flag, run a separate boolean-inference pass withstatsalone (no--weight), or accept exact mode with adequate RAM.frequencyflag combinations that reject Frequent Items. If you need--asc,--ignore-case,--no-trim,--weight,--other-sorted,--null-sorted,--frequency-jsonl,--stats-filter, or--json/--pretty-json/--toon, the sketch path is unavailable. Solution: exact mode with sufficient RAM, or do without that flag.- Stdin input for
stats. Stdin is not seekable, so the auto-index path is skipped. The sketch auto-enable still runs, but you lose parallelism. Solution: tee to a file first (tee /tmp/in.csv | qsv stats …orqsv stats /tmp/in.csv). - Explicit
--*-method exactopt-out. The OOM auto-enable scansargvfor--quantile-method/--cardinality-method/--sketch-method; if you passed any of those (evenexact), auto-enable is suppressed for that method. Drop the explicit opt-out to re-enable the fallback.
Platform Note: Big-Endian Targets
Apache DataSketches' Rust port is gated to little-endian targets (verified upfront in stats::run, frequency::run, and the OOM fallback paths). On big-endian targets:
--quantile-method approx,--cardinality-method approx, and--sketch-method frequent_itemsare all rejected with a clear error.try_enable_approx_sketches(stats) andcan_enable_frequent_items(frequency) compile to no-op stubs, so the OOM path falls through to error rather than silently degrading.
If you maintain qsv on a big-endian platform, the practical large-file toolkit is:
--mode-cardinality-capforstats(bounds mode/antimode tracking only).- Smaller
QSV_STATS_CHUNK_MEMORY_MB/QSV_FREQ_CHUNK_MEMORY_MBto keep per-chunk state small. - Pre-bucketing high-cardinality columns (e.g., truncate timestamps to the hour) before running
frequency. - For sort/dedup adjacencies, use
extsort/extdedup(external on-disk variants) instead ofsort/dedup.
Notes for moarstats and pragmastat
The DataSketches fallback applies to stats and frequency only. Two adjacent commands have their own characteristics:
moarstatscomputes the Advanced, Bivariate, Robust, and Outlier statistic families. Most require either two passes or a full in-memory column (e.g., outlier detection needs the IQR + every value; correlation needs paired columns held together). There is no sketch fallback — for very large inputs, sample first withqsv sampleand runmoarstatson the sample, or pre-filter columns to the ones you actually need.pragmastat(one-sample mode and two-sample mode) computes deterministic robust estimators that require full-sample residuals. It is designed for inputs that fit comfortably in memory; for very large inputs, sample down first.