classify_scalar
May 12, 2026

High-performance, header-only scalar classification for C++11 and newer based on the leading byte of a string.
Motivation
I was building a general-purpose ETL tool for CSV files and needed a fast, reliable way to determine whether a field was a number, a bool, a basic string, an ISO 8601 timestamp, and so on.
I experimented with other methods, but I wanted something that was easily extensible to support arbitrary user-provided types.
Built-In Examples
The library supports parsing booleans (true or false of varying casings), ISO 8601 timestamps, floats, and ints out of the box.
#include <classify_scalar.hpp>
using namespace classify_scalar;
auto text = classify_scalar("hello"); // scalar_string
auto empty = classify_scalar(""); // scalar_null
auto whitespace = classify_scalar(" "); // scalar_null
auto yes = classify_scalar("true"); // scalar_bool
auto no = classify_scalar("FALSE"); // scalar_bool
auto integer_kind = classify_scalar("42"); // scalar_int8
auto negative_kind = classify_scalar("-17"); // scalar_int8
auto hex_kind = classify_scalar("0x2a"); // scalar_int8
auto too_large = classify_scalar("9223372036854775808"); // scalar_bigint
auto float_kind = classify_scalar("3.14159"); // scalar_float
auto integral_exp = classify_scalar("-1.25e2"); // scalar_int8
auto fractional_exp = classify_scalar("1e-3"); // scalar_float
auto date = classify_scalar("2024-01-31"); // scalar_timestamp
auto datetime = classify_scalar("2024-01-31T23:59:58Z"); // scalar_timestamp
auto offset = classify_scalar("2021-04-05T10:14:57-0600"); // scalar_timestamp
Timestamp offsets may use Z, the +HH:MM/-HH:MM form, or the compact
+HHMM/-HHMM form.
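As a rough illustration of the three offset spellings, the sketch below converts an offset suffix into signed minutes. The name parse_utc_offset and the hour/minute range limits are assumptions made for this example; the library's own timestamp scanner may be structured differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of classify_scalar: converts an offset suffix
// ("Z", "+HH:MM", "-HH:MM", "+HHMM", "-HHMM") into signed minutes east of UTC.
inline bool parse_utc_offset(const char* first, const char* last, int& minutes) {
    assert(first <= last);
    const std::ptrdiff_t n = last - first;
    if (n == 1 && *first == 'Z') { minutes = 0; return true; }
    if (n != 5 && n != 6) return false;               // compact or colon form only
    if (*first != '+' && *first != '-') return false;
    if (n == 6 && first[3] != ':') return false;
    const char* mm = first + (n == 6 ? 4 : 3);        // start of the minute digits
    const char digits[4] = { first[1], first[2], mm[0], mm[1] };
    for (char c : digits) {
        if (c < '0' || c > '9') return false;
    }
    const int hours = (digits[0] - '0') * 10 + (digits[1] - '0');
    const int mins = (digits[2] - '0') * 10 + (digits[3] - '0');
    if (hours > 23 || mins > 59) return false;        // assumed range limits
    minutes = (hours * 60 + mins) * (*first == '-' ? -1 : 1);
    return true;
}
```

With this sketch, "-0600" yields -360 minutes and "+05:30" yields 330 minutes.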
ASCII boundary whitespace is trimmed by default:
auto trimmed = classify_scalar::classify_scalar(" 42 "); // scalar_int8
auto exact = classify_scalar::classify_scalar<ScalarKind, false>(" 42 "); // scalar_string
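Boundary trimming on a pointer span can be sketched as follows. The helper names and the exact whitespace set (space, tab, LF, VT, FF, CR) are assumptions for illustration, not the library's definitions.

```cpp
#include <cassert>

// Assumed whitespace set for this sketch: space plus the contiguous ASCII
// range '\t' (9) through '\r' (13), which covers tab, LF, VT, FF, and CR.
inline bool is_ascii_space(char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Narrows [first, last) in place so that neither end is ASCII whitespace.
// No bytes are copied; only the two pointers move.
inline void trim_ascii(const char*& first, const char*& last) {
    assert(first <= last);
    while (first != last && is_ascii_space(*first)) ++first;
    while (last != first && is_ascii_space(*(last - 1))) --last;
}
```

A field consisting only of whitespace trims to an empty span, which is consistent with " " classifying as scalar_null in the examples above.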
Calling classify_scalar(...) without an output policy means classify only.
Use output_refs(number, integer, boolean) when you want built-in parsed values
stored:
std::int64_t integer = 0;
long double number = 0;
bool boolean = false;
auto kind = classify_scalar::classify_scalar(
" -0x2a ",
classify_scalar::output_refs(number, integer, boolean));
// kind == scalar_int8
// integer == -42
// number is unchanged; floating-point values write to number
Integer classification writes the integer output only. It deliberately does
not also convert integers to long double; callers that want floating storage
can parse or cast explicitly.
Explicit Parsing
Use classify_scalar(...) when you want conservative inference. Use
parse_scalar<kind>(...) for built-in kinds when you already know which grammar
you want:
std::int64_t hex = 0;
bool ok = classify_scalar::parse_hex("DEADBEEF", hex);
// ok == true
// hex == 0xDEADBEEF
// classify_scalar("DEADBEEF") would still be scalar_string
double number = 0;
const char* float_first = "1e-3";
classify_scalar::parse_scalar<classify_scalar::scalar_float>(
float_first,
float_first + 4,
number);
std::uint64_t timestamp = 0;
const char* timestamp_first = "2024-01-31T23:59:58Z";
classify_scalar::parse_scalar<classify_scalar::scalar_timestamp>(
timestamp_first,
timestamp_first + 20,
timestamp);
parse_scalar<scalar_int64> reuses the normal numeric classifier and accepts
any signed integer width that fits in int64. parse_scalar<std::int32_t> and
other signed C++ integer types are also supported when you want boundary checks
for a concrete storage type. Use parse_hex when bare hexadecimal should be
accepted explicitly without making inference classify DEADBEEF as an integer.
The parser supports optional ASCII-boundary trimming through its second template
argument.
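To make the bare-hexadecimal case concrete, here is a minimal sketch of an explicit hex parser with an int64 boundary check. The name parse_bare_hex is hypothetical, and the real parse_hex may differ in details such as sign or prefix handling.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for an explicit hex parser: accepts only hex digits
// (no sign, no 0x prefix) and fails if the value does not fit in int64.
inline bool parse_bare_hex(const char* first, const char* last, std::int64_t& out) {
    assert(first <= last);
    if (first == last) return false;
    const std::uint64_t max =
        static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());
    std::uint64_t value = 0;
    for (; first != last; ++first) {
        const char c = *first;
        unsigned digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<unsigned>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<unsigned>(c - 'a') + 10u;
        else if (c >= 'A' && c <= 'F') digit = static_cast<unsigned>(c - 'A') + 10u;
        else return false;                     // not a hex digit
        if (value > (max >> 4)) return false;  // the next shift would exceed int64
        value = (value << 4) | digit;
    }
    out = static_cast<std::int64_t>(value);
    return true;
}
```

Keeping this path explicit means a caller opts in to treating DEADBEEF as a number, while inference can keep reporting it as a string.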
parse_scalar<scalar_timestamp> returns a JavaScript-style Unix timestamp in
milliseconds, normalized to UTC, and currently requires a non-negative result
because the natural home is std::uint64_t. scalar_bigint remains
classification-only because the built-in path deliberately avoids allocating or
storing the full integer.
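The conversion from calendar fields to UTC milliseconds can be sketched as below. The civil_time struct and to_unix_ms are hypothetical names for this example; the day count uses Howard Hinnant's well-known days_from_civil algorithm, which is not necessarily what the library uses internally.

```cpp
#include <cassert>
#include <cstdint>

// Days from 1970-01-01 to the given proleptic Gregorian date
// (days_from_civil, after Howard Hinnant).
inline std::int64_t days_from_civil(std::int64_t y, unsigned m, unsigned d) {
    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    y -= (m <= 2) ? 1 : 0;                                       // year starts in March
    const std::int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);   // [0, 399]
    const unsigned mp = (m > 2) ? m - 3 : m + 9;                 // March = 0
    const unsigned doy = (153 * mp + 2) / 5 + d - 1;             // [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;  // [0, 146096]
    return era * 146097 + static_cast<std::int64_t>(doe) - 719468;
}

// Hypothetical container for already-scanned timestamp fields.
struct civil_time { int year; unsigned month, day, hour, minute, second; };

// Normalizes to UTC and returns milliseconds since the Unix epoch.
// Fails for instants before the epoch, mirroring the non-negative requirement.
inline bool to_unix_ms(const civil_time& t, int offset_minutes, std::uint64_t& out) {
    std::int64_t secs = days_from_civil(t.year, t.month, t.day) * 86400
                      + static_cast<std::int64_t>(t.hour) * 3600
                      + static_cast<std::int64_t>(t.minute) * 60
                      + static_cast<std::int64_t>(t.second);
    secs -= static_cast<std::int64_t>(offset_minutes) * 60;  // local minus offset is UTC
    if (secs < 0) return false;
    out = static_cast<std::uint64_t>(secs) * 1000u;
    return true;
}
```

Under this sketch, 2024-01-31T23:59:58Z maps to 1706745598000, and a local time with offset -0600 maps to the same instant as the UTC time six hours later.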
Extending the Classifier/Parser
To add a custom scalar type, define:
- an enum value for the new kind
- a policy that recognizes and parses it
- an output object if you want to store parsed values
Policies are ordinary types. A policy provides:
- matches_leading(unsigned char), used to build the compile-time dispatch table.
- on_dispatch(parse_state&, output&), which returns a ScalarKind.
Custom policies can use an application enum for their own scalar kinds:
enum class app_scalar_kind : int {
CLASSIFY_SCALAR_BUILTINS,
telephone
};
struct telephone_policy {
static constexpr bool matches_leading(unsigned char c) noexcept {
return c == '+' || c == '(' || (c >= '0' && c <= '9');
}
template<typename Output>
app_scalar_kind on_dispatch(classify_scalar::parse_state& state, Output& out) const noexcept {
// Parse the field and call out.set_telephone(...) here.
return app_scalar_kind::telephone;
}
};
auto kind = classify_scalar::classify_scalar<app_scalar_kind>(
"(212) 555-1212",
outputs,
classify_scalar::policy_pack<
telephone_policy,
classify_scalar::builtin_numeric_policy<>,
classify_scalar::builtin_timestamp_policy,
classify_scalar::builtin_bool_policy>());
CLASSIFY_SCALAR_BUILTINS copies the library's built-in kind ids into your enum
and positions the next enum value at scalar_custom_begin. The typed
classify_scalar<app_scalar_kind>(...) overload returns your enum directly, so
built-ins return values such as app_scalar_kind::scalar_int8 and custom
policies can return values such as app_scalar_kind::telephone.
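One way such a macro could work is sketched below with entirely hypothetical names (demo_builtin_kind, DEMO_SCALAR_BUILTINS, demo_app_kind). It is only meant to show the enumerator-splicing idea; the real CLASSIFY_SCALAR_BUILTINS may be defined differently.

```cpp
// Hypothetical stand-in for the library's built-in kind ids.
enum demo_builtin_kind : int {
    demo_scalar_null = 0,
    demo_scalar_string = 1,
    demo_scalar_int8 = 2,
    demo_scalar_custom_begin = 3   // first id available to applications
};

// Hypothetical stand-in for CLASSIFY_SCALAR_BUILTINS: re-declares each built-in
// id by value. Because the last spliced id is custom_begin - 1, the enumerator
// that follows the macro continues at demo_scalar_custom_begin.
#define DEMO_SCALAR_BUILTINS             \
    scalar_null = demo_scalar_null,      \
    scalar_string = demo_scalar_string,  \
    scalar_int8 = demo_scalar_int8

enum class demo_app_kind : int {
    DEMO_SCALAR_BUILTINS,
    telephone
};

static_assert(static_cast<int>(demo_app_kind::telephone) == demo_scalar_custom_begin,
              "custom kinds start at the reserved custom range");
static_assert(static_cast<int>(demo_app_kind::scalar_int8) == demo_scalar_int8,
              "built-in ids keep their values inside the application enum");
```

Because the ids are copied by value, a built-in result can be converted to the application enum with a plain cast and no lookup table.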
Custom policies can provide their own output object with matching set_* hooks.
The simple path is to derive from builtin_output_refs and add domain setters:
struct app_outputs : classify_scalar::builtin_output_refs {
app_outputs(long double& n, std::int64_t& i, bool& b, std::uint64_t& phone)
: builtin_output_refs(n, i, b), telephone(phone) {}
void set_telephone(std::uint64_t value) const noexcept { telephone = value; }
std::uint64_t& telephone;
};
See tests/test_telephone_policy.cpp for a complete custom NANP telephone
number recognizer that returns a user-defined scalar kind while falling through
to the built-in numeric, bool, and timestamp policies.
Design
The classifier uses compile-time ASCII tables for leading-byte dispatch. The
top-level classifier selects a parser family such as numeric, timestamp, or
bool, and the built-in numeric policy owns the decimal, exponent, and hex-prefix
scan.
Custom policies receive a mutable parse state with raw pointer context
(first, last, current) and scanner facts such as the first sign.
User policy packs are ordered by priority: the first policy whose
matches_leading(unsigned char) returns true receives the trimmed span through
on_dispatch(parse_state&, output&). If that policy returns scalar_string,
the pack falls through to the next policy that matches the same leading byte.
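The dispatch-and-fall-through behavior described above can be sketched with a much smaller, self-contained model. Everything here (demo_kind, zip_policy, int_policy, demo_pack, dispatch_table, demo_classify) is hypothetical and simplified: policies take a raw pointer pair instead of parse_state and an output object, and the table uses C++14 constexpr loops for brevity, whereas the library itself targets C++11.

```cpp
#include <cassert>

enum demo_kind { demo_string, demo_int, demo_zip };

// Recognizes exactly five decimal digits; anything else falls through.
struct zip_policy {
    static constexpr bool matches_leading(unsigned char c) { return c >= '0' && c <= '9'; }
    static demo_kind on_dispatch(const char* first, const char* last) {
        if (last - first != 5) return demo_string;
        for (; first != last; ++first)
            if (*first < '0' || *first > '9') return demo_string;
        return demo_zip;
    }
};

// Recognizes an optionally signed run of decimal digits.
struct int_policy {
    static constexpr bool matches_leading(unsigned char c) {
        return (c >= '0' && c <= '9') || c == '-' || c == '+';
    }
    static demo_kind on_dispatch(const char* first, const char* last) {
        if (*first == '-' || *first == '+') ++first;
        if (first == last) return demo_string;
        for (; first != last; ++first)
            if (*first < '0' || *first > '9') return demo_string;
        return demo_int;
    }
};

template<typename... Policies> struct demo_pack;

template<> struct demo_pack<> {
    static constexpr unsigned mask(unsigned char, unsigned = 1u) { return 0u; }
    static demo_kind run(unsigned, unsigned, const char*, const char*) { return demo_string; }
};

template<typename P, typename... Rest> struct demo_pack<P, Rest...> {
    // Bit i is set when the i-th policy accepts leading byte c.
    static constexpr unsigned mask(unsigned char c, unsigned bit = 1u) {
        return (P::matches_leading(c) ? bit : 0u) | demo_pack<Rest...>::mask(c, bit << 1);
    }
    // Tries matching policies in pack order; demo_string means "fall through".
    static demo_kind run(unsigned m, unsigned bit, const char* first, const char* last) {
        if (m & bit) {
            const demo_kind k = P::on_dispatch(first, last);
            if (k != demo_string) return k;
        }
        return demo_pack<Rest...>::run(m, bit << 1, first, last);
    }
};

// One bitmask per possible leading byte (supports up to eight policies here).
template<typename Pack> struct dispatch_table {
    unsigned char bits[256];
    constexpr dispatch_table() : bits() {
        for (int c = 0; c < 256; ++c)
            bits[c] = static_cast<unsigned char>(Pack::mask(static_cast<unsigned char>(c)));
    }
};

template<typename Pack>
demo_kind demo_classify(const char* first, const char* last) {
    static constexpr dispatch_table<Pack> table{};  // built during compilation
    assert(first <= last);
    if (first == last) return demo_string;
    return Pack::run(table.bits[static_cast<unsigned char>(*first)], 1u, first, last);
}
```

With demo_pack<zip_policy, int_policy>, the field "90210" is claimed by zip_policy, while "42" is offered to zip_policy first, rejected, and then classified by int_policy. A field such as "hello" has an empty mask for its leading byte, so no policy runs at all.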
Integer conversion uses the bundled parser in all language modes. When compiled
as C++17 or newer, floating-point conversion uses std::from_chars when the
standard library provides floating-point overloads; older builds and libc++
builds without those overloads use the bundled fallback parser.
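A feature-detection scheme along these lines might look like the sketch below. The DEMO_HAS_FP_FROM_CHARS macro and demo_parse_double are hypothetical, the check relies on the standard __cpp_lib_to_chars feature-test macro, and std::strtod stands in for the bundled fallback parser, which this README does not show.

```cpp
#include <cstdlib>
#include <string>
#include <system_error>
#if __cplusplus >= 201703L
#include <charconv>
#endif

// Assumption: a standard library defines __cpp_lib_to_chars only once its
// from_chars implementation, including floating point, is complete.
#if defined(__cpp_lib_to_chars) && __cpp_lib_to_chars >= 201611L
#define DEMO_HAS_FP_FROM_CHARS 1
#else
#define DEMO_HAS_FP_FROM_CHARS 0
#endif

// Parses the whole span [first, last) as a double, or reports failure.
inline bool demo_parse_double(const char* first, const char* last, double& out) {
#if DEMO_HAS_FP_FROM_CHARS
    const std::from_chars_result r = std::from_chars(first, last, out);
    return r.ec == std::errc() && r.ptr == last;
#else
    // Stand-in fallback: strtod needs a NUL-terminated copy and is laxer than
    // from_chars (it also accepts leading whitespace, for example).
    const std::string copy(first, last);
    char* end = nullptr;
    out = std::strtod(copy.c_str(), &end);
    return !copy.empty() && end == copy.c_str() + copy.size();
#endif
}
```

Both branches expose the same signature, so callers do not need to know which conversion path was compiled in.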
Decimal and 0x integer literals classify to the narrowest signed width
(scalar_int8, scalar_int16, scalar_int32, or scalar_int64). Decimal
integer literals outside int64 classify as scalar_bigint without allocating or
storing the full integer text. Unsigned scalar ids are reserved for a future
unsigned policy, but the default classifier currently reports signed kinds.
Hexadecimal inference is limited to 0x/0X prefixes by default; bare hex
strings such as FF remain strings.
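The narrowest-width rule and the allocation-free bigint check can be illustrated with a small decimal-only sketch. The enum and function names are hypothetical, and accepting a leading '+' is an assumption; the library's numeric scanner also handles 0x prefixes, decimals, and exponents, which are omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

enum width_kind { width_not_int, width_int8, width_int16, width_int32, width_int64, width_bigint };

// Classifies an optionally signed decimal literal by the narrowest signed
// width that holds it. Overflow is tracked with a flag, so text of any length
// is handled without storing the digits.
inline width_kind classify_decimal(const char* first, const char* last) {
    assert(first <= last);
    bool negative = false;
    if (first != last && (*first == '-' || *first == '+')) {
        negative = (*first == '-');
        ++first;
    }
    if (first == last) return width_not_int;
    const std::uint64_t extra = negative ? 1u : 0u;  // negatives reach one further
    const std::uint64_t limit =
        static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max()) + extra;
    std::uint64_t magnitude = 0;
    bool overflow = false;
    for (; first != last; ++first) {
        if (*first < '0' || *first > '9') return width_not_int;
        const std::uint64_t digit = static_cast<std::uint64_t>(*first - '0');
        // magnitude * 10 + digit > limit, rearranged to avoid wrapping.
        if (!overflow && magnitude > (limit - digit) / 10) overflow = true;
        if (!overflow) magnitude = magnitude * 10 + digit;
    }
    if (overflow) return width_bigint;
    if (magnitude <= 0x7Fu + extra) return width_int8;
    if (magnitude <= 0x7FFFu + extra) return width_int16;
    if (magnitude <= 0x7FFFFFFFu + extra) return width_int32;
    return width_int64;
}
```

In this sketch "-128" still fits the int8 bucket while "128" moves to int16, and "9223372036854775808" is reported as a bigint although its negation still fits in int64.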
Well-formed floating-point literals that cannot be represented by the built-in
finite double conversion path classify as scalar_bigfloat. That includes
range errors such as overflow or underflow; callers that need those exact values
should route the original span to an arbitrary-precision floating-point parser.
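The range-error distinction can be sketched with std::strtod, used here purely as a stand-in for the conversion step. The names are hypothetical, and note that whether strtod reports underflow through errno is implementation-defined, so only the overflow case is portable in this sketch.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

enum float_result { fr_not_float, fr_float, fr_bigfloat };

// Assumes the text was already scanned and is float-shaped. strtod reports
// overflow (and, on many platforms, underflow) by setting errno to ERANGE.
inline float_result classify_float_text(const std::string& text, double& out) {
    if (text.empty()) return fr_not_float;
    char* end = nullptr;
    errno = 0;
    const double value = std::strtod(text.c_str(), &end);
    if (end != text.c_str() + text.size()) return fr_not_float;  // trailing junk
    if (errno == ERANGE) return fr_bigfloat;  // well-formed, but out of double range
    out = value;
    return fr_float;
}
```

The output is written only for in-range values, so a caller that sees the out-of-range result still has the original text available to hand to an arbitrary-precision parser.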
Policy Packs
Hot-path behavior is selected at compile time. TrimAsciiWhitespace is the
public classifier template knob; scalar families are selected by policy pack.
using no_bool_pack = classify_scalar::policy_pack<
classify_scalar::builtin_numeric_policy<>,
classify_scalar::builtin_timestamp_policy>;
auto no_bools = classify_scalar::classify_scalar(
"true",
classify_scalar::classify_only_output(),
no_bool_pack());
For numeric-only inference, pass numeric_policy_pack:
auto kind = classify_scalar::classify_scalar(
"2024-01-31",
classify_scalar::classify_only_output(),
classify_scalar::numeric_policy_pack()); // scalar_string
The built-in numeric policy recognizes 0x-prefixed hexadecimal integers by default. Its first template argument selects the decimal separator:
using comma_decimal_pack = classify_scalar::policy_pack<
classify_scalar::builtin_numeric_policy<','>,
classify_scalar::builtin_bool_policy>;
auto value = classify_scalar::classify_scalar("3,14",
classify_scalar::classify_only_output(),
comma_decimal_pack()); // scalar_float
For explicit float parsing with a runtime decimal symbol, use parse_float.
It supports the common '.' and ',' decimal symbols:
double parsed = 0;
bool ok = classify_scalar::parse_float(first, last, parsed, decimal_symbol);
The default numeric policy returns the narrowest signed integer kind for
integral-valued floating syntax such as 1e3 and -1.25e2. If an integration
needs scientific or decimal syntax to stay scalar_float, use the second
numeric policy template argument:
using floating_syntax_pack = classify_scalar::policy_pack<
classify_scalar::builtin_numeric_policy<'.', false>>;
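Whether floating syntax denotes an integral value can be decided from the digit positions and the exponent alone, without converting to double. The sketch below is one way to do that; is_integral_valued is a hypothetical name, and it assumes the input was already validated as [sign] digits [. digits] [e [sign] digits] with '.' as the separator.

```cpp
#include <cassert>

// Returns true when the literal's value has no fractional part, for example
// "-1.25e2" (-125) or "100e-2" (1), and false for "1e-3" or "3.14159".
inline bool is_integral_valued(const char* first, const char* last) {
    assert(first <= last);
    if (first != last && (*first == '+' || *first == '-')) ++first;
    long digits_seen = 0;   // mantissa digits consumed so far
    long int_digits = -1;   // mantissa digits before the decimal point
    long last_nonzero = 0;  // 1-based position of the last nonzero mantissa digit
    for (; first != last && *first != 'e' && *first != 'E'; ++first) {
        if (*first == '.') { int_digits = digits_seen; continue; }
        ++digits_seen;
        if (*first != '0') last_nonzero = digits_seen;
    }
    if (int_digits < 0) int_digits = digits_seen;  // no decimal point present
    long exponent = 0;
    if (first != last) {  // positioned at 'e' or 'E'
        ++first;
        bool negative = false;
        if (first != last && (*first == '+' || *first == '-')) {
            negative = (*first == '-');
            ++first;
        }
        for (; first != last; ++first)
            if (exponent < 100000) exponent = exponent * 10 + (*first - '0');  // clamp
        if (negative) exponent = -exponent;
    }
    // The exponent shifts the decimal point; the value is integral when no
    // nonzero digit remains to the right of the shifted point (or it is zero).
    return last_nonzero == 0 || last_nonzero <= int_digits + exponent;
}
```

A policy configured to keep scientific or decimal syntax as scalar_float would simply skip a check like this one.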
Benchmarks
The benchmark suite lives in benchmarks/ and compares the default classifier,
numeric-only policy packs, a copy of csv-parser's old data_type() logic, and
plain std::from_chars parse paths.
The practical takeaways:
- Numeric-first policy order keeps timestamp support cheap. If your data has numbers but no timestamps, incidental ISO timestamp handling adds little to the common numeric path.
- numeric_policy_pack is available when an integration wants only null/string/int/float/bigint behavior.
- Float classification scans first, then calls std::from_chars once the input is known to be float-shaped. The classifier is only marginally more expensive than parsing directly, while also returning scalar kind information.
- Default classification includes bool, ISO timestamp, 0x hex, bigint, and string fallback behavior; direct from_chars benchmarks are useful baselines, not equivalent feature sets.
Standards
- The public header remains C++11 compatible. The core API and implementation use const char* pointer spans.
- C++17 builds also provide thin std::string_view overloads.
- C++20 builds add concepts to improve diagnostics for malformed custom policy packs.
C++17 and newer builds use std::from_chars for floating point parsing when the
standard library provides those overloads. C++11, C++14, and standard libraries
without floating from_chars use the bundled fallback parser.
The header defines CLASSIFY_SCALAR_VERSION_MAJOR,
CLASSIFY_SCALAR_VERSION_MINOR, CLASSIFY_SCALAR_VERSION_PATCH, and numeric
CLASSIFY_SCALAR_VERSION macros. If multiple vendored copies are included in
one translation unit, an older copy included after a newer copy is skipped; a
newer copy included after an older copy fails loudly because the older
definitions have already been emitted.
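The skip-or-fail behavior for vendored copies can be modeled in a single file. In this sketch the two guarded regions stand in for two copies of a header; DEMO_LIB_VERSION, demo_active_version, and the major*10000 + minor*100 + patch encoding are assumptions for illustration, not the library's actual macros or numbering.

```cpp
// ---- vendored copy A, version 1.2.0 (included first) ----
#if !defined(DEMO_LIB_VERSION)
#  define DEMO_LIB_VERSION 10200  // assumed encoding: major*10000 + minor*100 + patch
inline int demo_active_version() { return DEMO_LIB_VERSION; }
#elif DEMO_LIB_VERSION < 10200
#  error "demo_lib 1.2.0 was included after an older copy"
#endif

// ---- vendored copy B, version 1.1.0 (included second) ----
// DEMO_LIB_VERSION is already 10200, which is not older than 1.1.0,
// so this copy emits nothing and the definitions from copy A stay active.
#if !defined(DEMO_LIB_VERSION)
#  define DEMO_LIB_VERSION 10100
inline int demo_active_version() { return DEMO_LIB_VERSION; }
#elif DEMO_LIB_VERSION < 10100
#  error "demo_lib 1.1.0 was included after an older copy"
#endif
```

Swapping the order of the two regions would trigger the #error in the 1.2.0 copy, which corresponds to the loud failure described above.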