README.md

June 22, 2026

glazer is a fast Erlang NIF encoder/decoder for JSON, YAML, and CSV, built around hand-rolled recursive-descent decoders and direct term-to-text encoders that produce and consume native Erlang terms in a single pass. The JSON implementation was inspired by the glaze C++ library; glazer has since matured into a standalone implementation with no external C++ dependencies and extended the same approach to YAML and CSV. In the benchmarks below it outperforms the other libraries tested for these formats in nearly every measurement.

Features

JSON

  • Decoding straight to Erlang terms: maps, lists, binaries, integers (including bignums), floats, booleans, and null
  • Encoding Erlang terms straight to JSON, including big integers
  • Incremental/streaming decoding of partial input (e.g. NDJSON over a socket) via stream_decoder/0,1, stream_feed/2, stream_eof/1
  • Configurable representation of JSON null and JSON object keys
  • minify/1 and prettify/1 helpers
  • Standalone big-integer encode/decode helpers (encode_integer/1, decode_integer/1, try_decode_integer/1)
  • query/2,3: run a jq filter over a JSON document, returning decoded Erlang terms (requires glazer to be built with libjq available — see jq filter support)
  • glazer:find/2 and glazer:compile_path/1: look up value(s) in a decoded term using a small subset of jq path syntax (.a.b[].c[0]), with no libjq dependency

YAML

  • Decoding YAML mappings/sequences/scalars to Erlang maps/lists/scalars, including big integers
  • Encoding Erlang terms to YAML in block style
  • Configurable representation of YAML null and mapping keys, with optional YAML 1.1 boolean compatibility (yes/no/on/off)

CSV

  • RFC 4180 CSV encoding/decoding via decode/1,2 and encode/1,2, with optional header-row support
  • Incremental/streaming CSV decoding via stream_decoder/0,1, stream_feed/2, stream_eof/1

Installation

Erlang (rebar.config):

{deps, [
  {glazer, "~> 0.5"}
]}.

Elixir (mix.exs):

def deps do
  [
    {:glazer, "~> 0.5"}
  ]
end

Building

Building the NIF requires a C++23 compiler (GCC 12+ or Clang 16+) and make. There are no external C++ library dependencies — all C++ code is self-contained in c_src/. A plain

make

builds priv/glazer.so and compiles the Erlang sources. For the fastest performance, run a Profile-Guided Optimisation (PGO) build instead:

make optimize

or

OPTIMIZE=1 make

This performs three steps automatically: compiles an instrumented binary, runs the test suite to collect real branch-frequency data, then recompiles with those profiles applied. The resulting .so typically outperforms a plain -O3 build by 5–15% on realistic JSON workloads.

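glazer is an Erlang application with a rebar3-based C++ NIF build. When it is used as a Mix dependency, mix invokes the same top-level Makefile/rebar3 compile path described above, so the same C++23 compiler requirement applies. Once compiled, call the glazer_json, glazer_yaml, and glazer_csv modules directly from Erlang, or as :glazer_json, :glazer_yaml, and :glazer_csv from Elixir: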

Erlang:

1> glazer_json:decode(~"{\"a\":1,\"b\":[true,null,3.5]}").
#{<<"a">> => 1,<<"b">> => [true,null,3.5]}

Elixir:

iex> :glazer_json.encode(%{"a" => 1, "b" => [true, :null, 3.5]})
"{\"a\":1,\"b\":[true,null,3.5]}"

Use the use_nil/{null_term, nil} option (see Null term configuration below) to get idiomatic Elixir nil instead of the atom :null.

Testing

make test

runs the EUnit test suite via rebar3 eunit.

Benchmarking

Performance

  • JSON: faster than the other libraries benchmarked on both encoding and decoding in all but one measurement in the table below (torque, a Rust sonic-rs NIF, is slightly ahead on esad decode), and well ahead of simdjsone, jiffy, and the pure-BEAM libraries jason, thoas, euneus, and OTP's built-in json.
  • YAML: 2–7× faster than yaml_rustler and fast_yaml, and ~25–75× faster than the pure-Erlang yamerl/ymlr.
  • CSV: 4–12× faster than nimble_csv, and tens to hundreds of times faster than csv and erl_csv (which time out on large inputs).

[Charts: small, medium, and large file benchmarks (JSON/YAML/CSV)]

Each chart compares glazer against other libraries for JSON/YAML/CSV decode and encode on a representative small/medium/large file. Charts are generated from the tables below via scripts/gen_bench_charts.py.

Benchmarking data tables:

JSON

Usage

1> glazer_json:decode(<<"{\"a\":1,\"b\":[true,null,3.5]}">>).
#{<<"a">> => 1, <<"b">> => [true, null, 3.5]}

2> glazer_json:encode(#{<<"a">> => 1, <<"b">> => [true, null, 3.5]}).
<<"{\"a\":1,\"b\":[true,null,3.5]}">>

3> glazer_json:encode(#{a => 1}, [pretty]).
<<"{\n  \"a\": 1\n}">>

4> glazer_json:minify(<<" { \"a\" : 1 } ">>).
{ok, <<"{\"a\":1}">>}

5> glazer_json:prettify(<<"{\"a\":1}">>).
{ok, <<"{\n  \"a\": 1\n}">>}

Streaming

For input that arrives in chunks — e.g. reading a large document incrementally, or consuming newline-delimited JSON (NDJSON) from a socket or file — stream_decoder/0,1 provides a small stateful wrapper that buffers partial input and decodes each JSON value as soon as it's complete, without re-parsing bytes you've already seen:

1> D0 = glazer_json:stream_decoder(),
2> {Vals1, D1} = glazer_json:stream_feed(D0, <<"{\"a\":1} {\"b\":">>),
3> Vals1.
[#{<<"a">> => 1}]

4> {Vals2, D2} = glazer_json:stream_feed(D1, <<"2}">>),
5> Vals2.
[#{<<"b">> => 2}]

6> glazer_json:stream_eof(D2).
{ok, []}

stream_feed/2 returns the list of values completed by the chunk just fed (possibly empty, possibly more than one if the chunk completes several values) along with the updated decoder state to pass to the next call. Once the input is exhausted, call stream_eof/1 to flush any trailing bare scalar (numbers, strings, etc. have no closing delimiter of their own) and surface an error if the buffer holds an incomplete value:

1> D0 = glazer_json:stream_decoder(),
2> {[], D1} = glazer_json:stream_feed(D0, <<"   42">>),
3> glazer_json:stream_eof(D1).
{ok, [42]}

stream_decoder/1 accepts the same options as decode/2 (e.g. {keys, atom}, use_nil) and applies them to every decoded value.

A typical read loop calls stream_feed/2 for each chunk while more data may still arrive, and stream_eof/1 once the socket closes to flush any trailing value:

loop(Socket, D0) ->
  case gen_tcp:recv(Socket, 0) of
    {ok, Chunk} ->
      {Vals, D1} = glazer_json:stream_feed(D0, Chunk),
      handle_values(Vals),
      loop(Socket, D1);
    {error, closed} ->
      case glazer_json:stream_eof(D0) of
        {ok, Trailing}  -> handle_values(Trailing);
        {error, Reason} -> handle_truncated_stream(Reason)
      end
  end.

Efficiency

stream_feed/2 only scans for value boundaries incrementally — the scanner carries a small resumable cursor (scan_state()) that remembers how far it has already looked (nesting depth, whether it's inside a string, escape state, …), so each call to scan/2 resumes from where the previous one left off rather than re-walking the whole buffer from byte zero. Once a complete value's end offset is known, that slice is decoded exactly once via the same NIF-backed decoder used by decode/2 — there's no intermediate tokenization or tree representation, and no byte is ever scanned or decoded twice. The only buffering cost is concatenating newly-arrived chunks onto the not-yet-complete tail of the input.

This makes stream_feed/2 well suited to byte-at-a-time or small-chunk feeding (e.g. consuming a gen_tcp/gen_statem socket buffer as it fills) without the quadratic-rescan cost a naive "concatenate and retry full decode" loop would incur on large or slow-arriving documents.
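
The resumable-cursor idea can be illustrated with a small pure-Erlang sketch. This is not glazer's scanner (which lives in the NIF); the module name resumable_scan_sketch, the cur record, and the restriction to values that start with { or [ are assumptions made purely for illustration.

```erlang
-module(resumable_scan_sketch).
-export([new/0, scan/2]).

%% Hypothetical cursor: bytes already examined, nesting depth,
%% and whether we are inside a string / right after a backslash.
-record(cur, {pos = 0, depth = 0, in_str = false, esc = false}).

new() -> #cur{}.

%% scan(Buffer, Cursor) -> {complete, EndOffset} | {incomplete, Cursor1}
%% Buffer is the whole accumulated input; bytes before Cursor#cur.pos
%% are skipped, so nothing is examined twice.
scan(Buf, #cur{pos = Pos} = C) ->
    <<_:Pos/binary, Rest/binary>> = Buf,
    walk(Rest, C).

walk(<<>>, C) ->
    {incomplete, C};
%% Inside a string: honour escapes, look for the closing quote.
walk(<<_, R/binary>>, #cur{in_str = true, esc = true, pos = P} = C) ->
    walk(R, C#cur{esc = false, pos = P + 1});
walk(<<$\\, R/binary>>, #cur{in_str = true, pos = P} = C) ->
    walk(R, C#cur{esc = true, pos = P + 1});
walk(<<$", R/binary>>, #cur{in_str = true, pos = P} = C) ->
    walk(R, C#cur{in_str = false, pos = P + 1});
walk(<<_, R/binary>>, #cur{in_str = true, pos = P} = C) ->
    walk(R, C#cur{pos = P + 1});
%% Outside a string: track bracket depth.
walk(<<$", R/binary>>, #cur{pos = P} = C) ->
    walk(R, C#cur{in_str = true, pos = P + 1});
walk(<<B, R/binary>>, #cur{depth = D, pos = P} = C) when B =:= ${; B =:= $[ ->
    walk(R, C#cur{depth = D + 1, pos = P + 1});
walk(<<B, _/binary>>, #cur{depth = 1, pos = P}) when B =:= $}; B =:= $] ->
    {complete, P + 1};
walk(<<B, R/binary>>, #cur{depth = D, pos = P} = C)
  when D > 1, (B =:= $} orelse B =:= $]) ->
    walk(R, C#cur{depth = D - 1, pos = P + 1});
walk(<<_, R/binary>>, #cur{pos = P} = C) ->
    walk(R, C#cur{pos = P + 1}).
```

Feeding <<"{\"a\":">> yields an incomplete cursor at offset 5; passing that cursor back with the extended buffer <<"{\"a\":1}">> examines only the last two bytes before reporting the value as complete.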

Under the hood, stream_feed/2 is built on scan/1,2 — a low-level primitive that scans a buffer for the byte offset where the next JSON value ends (or reports that more input is needed) without doing a full decode. It's exposed directly for callers that want to implement their own framing/buffering strategy:

1> glazer_json:scan(<<"{\"a\":1} {\"b\":2}">>).
{complete, 7}

2> glazer_json:scan(<<"{\"a\":">>).
{incomplete, ScanState}

3> glazer_json:scan(<<"{\"a\":1}">>, ScanState).
{complete, 7}

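The stream_decoder/0,1, stream_feed/2, stream_eof/1 and scan/1,2 functions described here belong to glazer_json and handle JSON only. glazer_csv provides its own stream decoder and YAML has no incremental decoder; see YAML streaming and CSV streaming below.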

Null term configuration

By default, JSON/YAML null decodes to (and null encodes from) the atom null, and this same atom is used as the default null term throughout the library (e.g. for the CSV on_failure => null field option). This can be overridden:

  • Application-wide, via the null environment key — set this once in the application's config and every call uses it as the default:

    Erlang (sys.config):

    {glazer, [{null, nil}]}
    

    Elixir (config.exs):

    config :glazer, null: nil
    
  • Per call, with the use_nil shorthand or the {null_term, Atom} option (see Decode options below). Per-call options always take precedence over the application-wide default.

JSON decode options

| Option | Description |
|---|---|
| object_as_tuple | Decode JSON objects as {[{Key, Value}]} proplist tuples (jiffy-style) instead of maps (default) |
| use_nil | Use the atom nil for JSON null |
| {null_term, Atom} | Use Atom for JSON null |
| {keys, atom} | Decode object keys as atoms (via binary_to_atom/2-equivalent) |
| {keys, existing_atom} | Decode object keys as existing atoms, falling back to binaries for unknown atoms |
| {keys, binary} | Decode object keys as binaries (default) |
| dedupe_keys | With object_as_tuple, eliminate duplicate object keys, keeping the last occurrence's value (and position) |
| copy_strings | Always allocate a fresh binary for each decoded string, instead of a zero-copy sub-binary of the input (see Performance Optimization Details) |
| return_trailer | Allow trailing non-whitespace data after the decoded value instead of rejecting it; on a match, return {has_trailer, Term, Rest} with Rest as a zero-copy sub-binary of the unconsumed remainder |

1> glazer_json:decode(<<"{\"a\":1}">>, [object_as_tuple]).
{[{<<"a">>, 1}]}

2> glazer_json:decode(<<"{\"a\":1}">>, [{keys, atom}]).
#{a => 1}

3> glazer_json:decode(<<"null">>, [use_nil]).
nil

4> glazer_json:decode(<<"null">>, [{null_term, undefined}]).
undefined

5> glazer_json:decode(<<"{\"a\":1,\"a\":2}">>).
#{<<"a">> => 2}

6> glazer_json:decode(<<"{\"a\":1,\"a\":2}">>, [object_as_tuple]).
{[{<<"a">>, 1}, {<<"a">>, 2}]}

7> glazer_json:decode(<<"{\"a\":1,\"a\":2}">>, [object_as_tuple, dedupe_keys]).
{[{<<"a">>, 2}]}

8> glazer_json:decode(<<"1 2">>, [return_trailer]).
{has_trailer, 1, <<"2">>}
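
As a sketch of how return_trailer can be used, the loop below decodes a buffer holding several concatenated JSON values by feeding each Rest back into decode/2. The module name json_concat_sketch and the function decode_all/1 are hypothetical, and the loop assumes that decode/2 returns the bare term once no trailer is left, which is how the option description above reads.

```erlang
-module(json_concat_sketch).
-export([decode_all/1]).

%% Decode every JSON value in Bin, in order.
decode_all(Bin) ->
    decode_all(Bin, []).

decode_all(Bin, Acc) ->
    case glazer_json:decode(Bin, [return_trailer]) of
        {has_trailer, Term, Rest} ->
            %% More input follows: keep Term and continue with Rest,
            %% which is a sub-binary of the original input.
            decode_all(Rest, [Term | Acc]);
        Term ->
            %% Assumed: no trailer means the plain decoded term.
            lists:reverse([Term | Acc])
    end.
```

For inputs that arrive in chunks, stream_decoder/0,1 is likely the better fit, since this loop needs the whole buffer up front.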

Note

A JSON object with duplicate keys cannot be represented as an Erlang map, so decoding to maps (the default, including with {keys, atom | existing_atom}) always deduplicates keys, with the last value winning, regardless of dedupe_keys. With object_as_tuple, duplicate keys are preserved as-is unless dedupe_keys is given.
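
The two behaviours can be mimicked in plain Erlang. The sketch below is not glazer's code: dedupe_sketch and dedupe_last/1 are hypothetical names that only illustrate "last occurrence wins, at the last occurrence's position" for a key/value list, next to the map behaviour that maps:from_list/1 already provides.

```erlang
-module(dedupe_sketch).
-export([dedupe_last/1, as_map/1]).

%% Keep only the last occurrence of each key, preserving the
%% relative order of the surviving pairs.
dedupe_last(Pairs) ->
    {Kept, _Seen} =
        lists:foldr(
          fun({K, _V} = Pair, {Acc, Seen}) ->
                  case maps:is_key(K, Seen) of
                      true  -> {Acc, Seen};               % a later pair already won
                      false -> {[Pair | Acc], Seen#{K => true}}
                  end
          end,
          {[], #{}},
          Pairs),
    Kept.

%% Maps cannot hold duplicates: the right-most value for a key wins.
as_map(Pairs) ->
    maps:from_list(Pairs).
```

For the pairs [{a,1},{b,2},{a,3}], dedupe_last/1 drops {a,1} and keeps {b,2} followed by {a,3}.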

JSON encode options

| Option | Description |
|---|---|
| pretty | Pretty-print the JSON output with two-space indentation |
| uescape | Escape non-ASCII characters as \uXXXX sequences |
| force_utf8 | Replace invalid UTF-8 byte sequences with U+FFFD before encoding |
| use_nil | Encode the atom nil as JSON null |
| {null_term, Atom} | Encode Atom as JSON null |

1> glazer_json:encode(#{a => 1}, [pretty]).
<<"{\n  \"a\": 1\n}">>

2> glazer_json:encode(<<"héllo"/utf8>>, [uescape]).
<<"\"h\\u00e9llo\"">>

3> glazer_json:encode(nil, [use_nil]).
<<"null">>

Option force_utf8:

Note

force_utf8 is an encode-only option. decode/1,2 does not validate that JSON strings in the input are valid UTF-8 — bytes are copied through to the resulting binaries as-is, regardless of options.

Binaries may contain arbitrary bytes, including byte sequences that are not valid UTF-8. By default, such bytes are copied into the output verbatim, which can produce a result that is not valid UTF-8/JSON:

1> glazer_json:encode(<<"a", 128, "b">>).
<<"\"a", 128, "b\"">>

With force_utf8, each invalid byte (or byte sequence) is replaced with the Unicode replacement character U+FFFD (encoded as 0xEF 0xBF 0xBD):

2> glazer_json:encode(<<"a", 128, "b">>, [force_utf8]).
<<"\"a", 239, 191, 189, "b\"">>

A literal U+FFFD already present in the input is left untouched (it is not re-replaced). Combining force_utf8 with uescape further escapes the replacement character as \ufffd:

3> glazer_json:encode(<<"a", 128, "b">>, [force_utf8, uescape]).
<<"\"a\\ufffdb\"">>

jq filter support

If libjq and its headers (jq.h/jv.h) are available when glazer is built, query/2,3 runs a jq filter program against a JSON document and returns one Erlang term per value produced by the filter (decoded using the same options as decode/2):

1> glazer_json:query(<<"{\"a\":[1,2,3]}">>, <<".a[]">>).
{ok, [1, 2, 3]}

2> glazer_json:query(<<"{\"a\":1}">>, <<".b">>).
{ok, [null]}

3> glazer_json:query(<<"{\"a\":{\"b\":2}}">>, <<".">>, [{keys, atom}]).
{ok, [#{a => #{b => 2}}]}

4> glazer_json:query(<<"not json">>, <<".">>).
{error, invalid_input}

5> glazer_json:query(<<"{\"a\":1}">>, <<"bad syntax (((">>).
{error, jq_decode_error}

If libjq was not available at build time, query/2,3 returns {error, jq_not_available}. Build detection is automatic — make probes for jq.h/libjq and only enables this feature if found, so glazer still builds and works without libjq installed.

Elixir's Phoenix json_library() compliance

Phoenix supports a pluggable :json_library configuration (see phoenix) that lets applications swap in an alternative JSON implementation for Phoenix's JSON API module by configuring a module that exports:

  • decode!/1
  • encode!/1
  • encode_to_iodata!/1

glazer_json exports these under the equivalent (quoted) Erlang names 'decode!'/1, 'encode!'/1, and 'encode_to_iodata!'/1, as thin aliases for decode/1 and encode/1, so glazer_json can be configured directly as a json_library(). To match Elixir's JSON module, where JSON null maps to nil rather than the atom :null, these three functions automatically apply use_nil, so no extra configuration is needed:

config :phoenix, :json_library, :glazer_json

1> glazer_json:'decode!'(<<"{\"a\":1,\"b\":null}">>).
#{<<"a">> => 1, <<"b">> => nil}

2> glazer_json:'encode!'(#{<<"a">> => 1, <<"b">> => nil}).
<<"{\"a\":1,\"b\":null}">>

3> glazer_json:'encode_to_iodata!'(#{<<"a">> => 1, <<"b">> => nil}).
<<"{\"a\":1,\"b\":null}">>

1> glazer_json:'decode!'(<<"{\"a\":null}">>).
#{<<"a">> => nil}

2> glazer_json:'encode!'(#{<<"a">> => nil}).
<<"{\"a\":null}">>

API

All functions below are in glazer_json.

| Function | Description |
|---|---|
| decode/1, decode/2 | Decode a JSON binary or iolist to an Erlang term |
| try_decode/1, try_decode/2 | Decode a JSON binary or iolist, returning {ok, Term} or {error, {parse_error, Msg}} instead of raising |
| encode/1, encode/2 | Encode an Erlang term to a JSON binary; raises {encode_error, {Msg, Term}} on failure |
| 'decode!'/1 | Decode a JSON binary or iolist to an Erlang term (alias for decode/1) |
| 'encode!'/1 | Encode an Erlang term to a JSON binary (alias for encode/1) |
| 'encode_to_iodata!'/1 | Encode an Erlang term to JSON as iodata (alias for encode/1) |
| minify/1 | Remove unnecessary whitespace from a JSON document |
| prettify/1 | Pretty-print a JSON document with two-space indentation |
| read_file/1, read_file/2 | Read a file and decode its contents as JSON |
| write_file/2, write_file/3 | Encode a term to JSON and write it to a file |
| scan/1, scan/2 | Scan a buffer for the end offset of the next complete JSON value |
| stream_decoder/0, stream_decoder/1 | Create an incremental-decode state for chunked input |
| stream_feed/2 | Feed a chunk to a stream decoder, returning completed values |
| stream_eof/1 | Flush a stream decoder at end-of-input |
| query/2, query/3 | Run a jq filter over a JSON document, returning {ok, [Term]} (requires libjq) |

Benchmarking JSON

A comparison benchmark against other JSON libraries (simdjsone, jiffy, jason, thoas, euneus, OTP's built-in json, and torque) is available via:

$ PARALLEL=2 make bench-json
==> Running benchmarks with parallelism: 1 (optimization: O3 - PGO)

(numbers in µs)
JSON        twitter (616.7K)   twitter2 (758.0K)     openrtb (1.2K)       esad (1.3K)         small (0.1K)
            decode   encode     decode   encode     decode   encode     decode   encode     decode   encode
-------------------------------------------------------------------------------------------------------------
glazer      2369.8   1053.4     2563.1   1807.0        5.2      3.7        3.8      2.1        0.9      0.7
torque      3249.5   1191.1     2795.4   1865.7        5.9      4.6        3.5      3.4        1.2      1.0
simdjsone   3158.1   2658.1     5070.6   5300.0        9.8     12.9        6.5      8.5        1.1      1.7
jiffy       5648.8   1877.9     7186.2   3660.6       10.9      9.9        7.4      5.5        1.8      1.5
jason       7821.5   7818.7    15858.5  14425.2       22.1     20.4       13.8     14.7        3.0      2.3
json        8351.3   5291.7    11191.1   9945.7       17.9     13.5       10.3      7.8        2.1      1.9
thoas       8997.9   7852.7    15110.9  15454.8       22.8     20.9       14.8     17.0        2.7      2.1
euneus      8395.7   6192.9    11569.4  11719.4       21.2     15.8       10.9     11.2        2.6      2.1

(requires the bench/dev Mix dependencies — see mix.exs).

YAML

Usage

decode/1,2 decodes a YAML document to an Erlang term — mappings become maps, sequences become lists, and scalars become the matching Erlang type (binaries, numbers, booleans, or null):

1> glazer_yaml:decode(<<"a: 1\nb:\n  - true\n  - null\n  - 3.5\n">>).
#{<<"a">> => 1, <<"b">> => [true, null, 3.5]}

2> glazer_yaml:encode(#{<<"a">> => 1, <<"b">> => [true, null, 3.5]}).
<<"a: 1\nb:\n  - true\n  - null\n  - 3.5\n">>

encode/1,2 encodes an Erlang term to YAML in block style (2-space indentation, sequences at the same indentation as the mapping key that owns them). It raises {encode_error, {Msg, Term}} if the data contains a value that cannot be represented as YAML.

Streaming

There is no incremental YAML decoder. YAML's block styles have no closing delimiter — a mapping or sequence simply ends at a dedent or end-of-input — so there is no way to scan a partial buffer for "is this value complete yet?" the way scan/1,2 does for JSON's bracket-balanced syntax. Decode full YAML documents with decode/1,2 once they are fully buffered.

YAML decode options

| Option | Description |
|---|---|
| use_nil | Use the atom nil for YAML null/~/empty values |
| {null_term, Atom} | Use Atom for YAML null/~/empty values |
| {keys, atom} | Decode mapping keys as atoms |
| {keys, existing_atom} | Decode mapping keys as existing atoms, falling back to binaries for unknown atoms |
| {keys, binary} | Decode mapping keys as binaries (default) |
| yaml_1_1_bools | Additionally treat yes/no/on/off (and case variants) as booleans, per the YAML 1.1 core schema. By default (YAML 1.2 core schema) only true/false are recognized as booleans |
| copy_strings | Always allocate a fresh binary for each decoded scalar, instead of a zero-copy sub-binary of the input for single-line plain scalars (see Performance Optimization Details) |

1> glazer_yaml:decode(<<"a: ~\n">>, [use_nil]).
#{<<"a">> => nil}

2> glazer_yaml:decode(<<"a: 1\n">>, [{keys, atom}]).
#{a => 1}

3> glazer_yaml:decode(<<"a: yes\n">>, [yaml_1_1_bools]).
#{<<"a">> => true}

YAML encode options

| Option | Description |
|---|---|
| use_nil | Treat the atom nil as YAML null |
| {null_term, Atom} | Treat Atom as YAML null |

1> glazer_yaml:encode(#{<<"a">> => nil}, [use_nil]).
<<"a: null\n">>

API

All functions below are in glazer_yaml.

| Function | Description |
|---|---|
| decode/1, decode/2 | Decode a YAML binary or iolist to an Erlang term |
| try_decode/1, try_decode/2 | Decode YAML, returning {ok, Term} or {error, Msg} instead of raising |
| encode/1, encode/2 | Encode an Erlang term to a YAML binary in block style; raises {encode_error, {Msg, Term}} on failure |
| read_file/1, read_file/2 | Read a file and decode its contents as YAML |
| write_file/2, write_file/3 | Encode a term to YAML and write it to a file |

Benchmarking YAML

$ PARALLEL=2 make bench-yaml
==> Running benchmarks with parallelism: 1 (optimization: O3 - PGO)

(numbers in µs)
YAML             openrtb (1.3K)       esad (1.3K)         small (0.1K)
                decode   encode     decode   encode     decode   encode
-------------------------------------------------------------------------
glazer            18.4      8.5        9.2      3.3        1.4      0.8
yaml_rustler     104.8      n/a       66.3      n/a       10.3      n/a
fast_yaml        130.7     51.1       79.0     31.6       15.4      5.8
yamerl          1108.5      n/a      859.0      n/a      422.5      n/a
ymlr               n/a     39.0        n/a     36.4        n/a      4.3

CSV

Usage

decode/1,2 decodes an RFC 4180 CSV document to #{headers => nil|[...], data => Rows}, where Rows is a list of rows, each row a list of binary fields by default:

1> glazer_csv:decode(<<"name,age\nAlice,30\nBob,25\n">>).
#{headers => nil,
  data    => [[<<"name">>,<<"age">>],[<<"Alice">>,<<"30">>],[<<"Bob">>,<<"25">>]]}

2> glazer_csv:encode([[<<"name">>, <<"age">>], [<<"Alice">>, 30]]).
<<"name,age\r\nAlice,30\r\n">>

With the headers option, the first row is captured as column names in headers and each subsequent row decodes to a map when combined with {return, map}; encode/2 with headers does the reverse, deriving the header row from the first map's keys:

1> glazer_csv:decode(<<"name,age\nAlice,30\n">>, [headers, {return, map}]).
#{headers => [<<"name">>,<<"age">>],
  data    => [#{<<"name">> => <<"Alice">>, <<"age">> => <<"30">>}]}

2> glazer_csv:encode([#{<<"name">> => <<"Alice">>, <<"age">> => 30}], [headers]).
<<"name,age\r\nAlice,30\r\n">>

Fields containing the delimiter, a double quote, or a line break are quoted automatically on encode (with embedded quotes doubled), and unquoted on decode. The delimiter defaults to , and can be changed via {delimiter, Char}; the encoded line ending defaults to \r\n per RFC 4180 and can be changed to \n via {line_ending, lf}.
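
The quoting rule described above can be written out in a few lines of plain Erlang. This is a sketch of the RFC 4180 rule only, not glazer's encoder; csv_quote_sketch and quote_field/2 are hypothetical names.

```erlang
-module(csv_quote_sketch).
-export([quote_field/2]).

%% Quote Field if it contains the delimiter, a double quote, or a
%% line break; embedded double quotes are doubled. Delim is a single
%% byte such as $, or $;.
quote_field(Field, Delim) when is_binary(Field), is_integer(Delim) ->
    Special = [<<Delim>>, <<"\"">>, <<"\r">>, <<"\n">>],
    case binary:match(Field, Special) of
        nomatch ->
            Field;
        _Found ->
            Doubled = binary:replace(Field, <<"\"">>, <<"\"\"">>, [global]),
            <<$", Doubled/binary, $">>
    end.
```

With the default delimiter, a plain field passes through unchanged, a field containing a comma is wrapped in quotes, and a field containing quotes has each one doubled before being wrapped.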

Streaming

For input that arrives in chunks, stream_decoder/0,1 provides the same kind of stateful wrapper as JSON streaming: it buffers partial input and decodes each row as soon as its terminating line break is seen, via decode/2 on that single row. A small scanner tracks whether the cursor is inside a quoted field across chunks, so a \n/\r\n inside a quoted field doesn't end the row:

1> D0 = glazer_csv:stream_decoder(),
2> {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2\n3,">>),
3> Rows1.
[[<<"a">>,<<"b">>],[<<"1">>,<<"2">>]]

4> {Rows2, D2} = glazer_csv:stream_feed(D1, <<"4\n">>),
5> Rows2.
[[<<"3">>,<<"4">>]]

6> glazer_csv:stream_eof(D2).
{ok, []}

stream_feed/2 returns the rows completed by the chunk just fed (possibly empty, possibly more than one) along with the updated decoder state. Once the input is exhausted, call stream_eof/1 to flush a trailing row that has no terminating line break, or surface an error if the buffered bytes don't form a valid row:

1> D0 = glazer_csv:stream_decoder(),
2> {Rows1, D1} = glazer_csv:stream_feed(D0, <<"a,b\n1,2">>),
3> Rows1.
[[<<"a">>,<<"b">>]]

4> glazer_csv:stream_eof(D1).
{ok, [[<<"1">>,<<"2">>]]}

stream_decoder/1 accepts the same options as decode/2. With the headers option, the first complete row is captured as the header and used to decode every subsequent row (as a map when combined with {return, map}); no row is emitted for the header itself. Blank lines are skipped, matching decode/2.
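
A file can be consumed the same way as a socket. The sketch below reads a CSV file in 64 KiB chunks and collects the decoded rows; csv_file_stream_sketch, rows/1,2, and the chunk size are hypothetical choices, and only the stream_decoder/1, stream_feed/2, and stream_eof/1 calls come from the API described above.

```erlang
-module(csv_file_stream_sketch).
-export([rows/1, rows/2]).

rows(Path) ->
    rows(Path, []).

%% Opts are passed straight to glazer_csv:stream_decoder/1.
rows(Path, Opts) ->
    {ok, Fd} = file:open(Path, [read, binary, raw]),
    try
        loop(Fd, glazer_csv:stream_decoder(Opts), [])
    after
        file:close(Fd)
    end.

loop(Fd, D0, Acc) ->
    case file:read(Fd, 65536) of
        {ok, Chunk} ->
            %% Rows completed by this chunk; the partial tail stays in D1.
            {Rows, D1} = glazer_csv:stream_feed(D0, Chunk),
            loop(Fd, D1, [Rows | Acc]);
        eof ->
            %% Flush a final row that has no terminating line break.
            case glazer_csv:stream_eof(D0) of
                {ok, Trailing} ->
                    {ok, lists:append(lists:reverse([Trailing | Acc]))};
                {error, Reason} ->
                    {error, Reason}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```

Collecting every row in memory is only for brevity; a real consumer would more likely hand each batch of rows to a callback as it arrives.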

CSV decode options

| Option | Description |
|---|---|
| {delimiter, Char} | Field delimiter (default $,) |
| headers | Treat the first row as column names (shorthand for {headers, binary}) |
| {headers, [Name, ...]} | Use the given list of atoms or binaries as column names; the first data row is not consumed as a header |
| {headers, binary} | First row is binary column names (same as bare headers) |
| {headers, string} | Alias for {headers, binary} |
| {headers, atom} | First row → atom column names (via binary_to_atom/2-equivalent) |
| {headers, existing_atom} | First row → existing-atom column names, falling back to binaries for unknown atoms |
| {headers, charlist} | First row → column names as lists of Unicode codepoints |
| {return, list} | Data rows are lists of field values (default) |
| {return, tuple} | Data rows are tuples of field values |
| {return, map} | Data rows are maps keyed by column names; requires headers or {headers, ...}. Raises duplicate_header on duplicate column names |
| {fields, Specs} | Convert each column's field from a binary, positionally; see Field type conversion |
| {skip, N} | Skip the first N data rows (after any header row) |
| {skip, {From, To}} | Process only data rows From..To (1-based inclusive); equivalent to {skip, From-1} plus {limit, To-From+1} |
| {limit, N} | Process at most N data rows (after skipping) |
| {null_term, Atom} | Use Atom as the value produced by on_failure => null (default null) |
| copy_strings | Always allocate a fresh binary for each decoded field, instead of a zero-copy sub-binary of the input (see Performance Optimization Details) |
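
The {skip, {From, To}} equivalence is simple arithmetic, shown here on a plain list. csv_range_sketch, range_opts/1, and select/2 are hypothetical helpers for illustration; they do not call glazer.

```erlang
-module(csv_range_sketch).
-export([range_opts/1, select/2]).

%% Translate a 1-based inclusive row range into skip/limit options.
range_opts({From, To})
  when is_integer(From), is_integer(To), From >= 1, To >= From ->
    [{skip, From - 1}, {limit, To - From + 1}].

%% Apply the same skip/limit to an in-memory list of rows.
select(Rows, Range) ->
    [{skip, Skip}, {limit, Limit}] = range_opts(Range),
    lists:sublist(Rows, Skip + 1, Limit).
```

Rows 3..5 therefore correspond to skipping 2 rows and then taking at most 3.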

Field type conversion

The {fields, Specs} decode option converts each column's field from a binary to the given Erlang type. Specs is a list applied positionally — the Nth spec applies to the Nth column, regardless of whether headers is set. Columns beyond the end of Specs are left as binaries.

1> glazer_csv:decode(<<"name,age,active,joined\nAlice,30,true,2024-01-15T10:30:00Z\n">>,
..                    [headers, {return, map},
..                     {fields, [binary, integer, boolean,
..                               {datetime, <<"%Y-%m-%dT%H:%M:%SZ">>}]}]).
[#{<<"name">> => <<"Alice">>, <<"age">> => 30, <<"active">> => true,
   <<"joined">> => 1705314600}]

Each element of Specs is either a Type directly, or a map #{type => Type, default => Term, on_failure => OnFailure} for more control (see below). Type is one of:

| Type | Description |
|---|---|
| integer | Parse the field as an integer |
| {float, Precision} | Parse the field as a float, rounded to Precision decimal digits |
| boolean | Parse "true"/"false" (any case) as true/false |
| {datetime, InputFormat} | Parse with a strptime-like format string and convert to Unix epoch seconds (UTC) |
| binary | Leave the field as a binary (default) |
| charlist | Convert the field to a list of Unicode code points |
| existing_atom | Convert to an existing atom, falling back to a binary if no such atom exists |
| {atom, ExistingAtoms} | Convert to an atom only if the field's text matches (and exists as) one of ExistingAtoms, falling back to a binary otherwise |

InputFormat supports the directives %Y %y %m %d %H %M %S %f %z (and %% for a literal %); any other character must match the input literally, and a space matches a run of one-or-more whitespace characters. %z accepts Z, +HHMM, or +HH:MM-style offsets; fractional seconds (%f) are parsed but discarded. The result is always in UTC.
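
The epoch value in the example above (1705314600 for 2024-01-15T10:30:00Z) can be checked with OTP's calendar module. The sketch below shows the conversion and how a %z offset shifts the result to UTC; csv_epoch_sketch and to_epoch/2 are hypothetical names, and this is not glazer's parser.

```erlang
-module(csv_epoch_sketch).
-export([to_epoch/2]).

%% Convert a parsed local date-time plus its UTC offset (in seconds,
%% e.g. 7200 for +02:00) to Unix epoch seconds.
to_epoch({{_Y, _Mo, _D}, {_H, _Mi, _S}} = DateTime, OffsetSeconds) ->
    UnixEpoch = calendar:datetime_to_gregorian_seconds({{1970, 1, 1}, {0, 0, 0}}),
    calendar:datetime_to_gregorian_seconds(DateTime) - UnixEpoch - OffsetSeconds.
```

10:30:00 with offset Z and 12:30:00 with offset +02:00 name the same instant, so both map to the same epoch value.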

default and on_failure

Using the map form #{type => Type, default => Term, on_failure => OnFailure}:

  • default (when given) is used in place of the converted value whenever the raw CSV field is empty.

  • on_failure controls what happens when a non-empty field fails to convert to Type (default binary):

    | on_failure | Behavior |
    |---|---|
    | binary | Leave the field as the original binary (default) |
    | raise | Raise {invalid_field_value, Row, Column} (1-based), or return {error, Reason} from try_decode/2 |
    | default | Use the spec's default value (falls back to binary if no default is given) |
    | null | Use the configured null term: {null_term, Atom} if given, otherwise the library-wide null term (see Null term configuration and {null_term, Atom} below) |

1> glazer_csv:decode(<<"1\nbad\n">>,
..                    [{fields, [#{type => integer, on_failure => raise}]}]).
** exception error: {invalid_field_value,2,1}

2> glazer_csv:decode(<<"1\nbad\n">>,
..                    [{fields, [#{type => integer, default => 0, on_failure => default}]}]).
[[1],[0]]

3> glazer_csv:decode(<<"1\nbad\n">>,
..                    [{null_term, nil},
..                     {fields, [#{type => integer, on_failure => null}]}]).
[[1],[nil]]

{null_term, Atom} only affects on_failure => null for that call. Without it, on_failure => null falls back to the library-wide null term — null by default, or whatever atom is configured via the Null term configuration application env var ({glazer, [{null, Atom}]}).

CSV encode options

| Option | Description |
|---|---|
| {delimiter, Char} | Field delimiter (default $,) |
| headers | Input is a list of maps; the first map's keys become the header row, and subsequent maps are encoded as rows in that column order (missing keys produce empty fields) |
| {headers, [Name, ...]} | Input is a list of maps; uses the given list of atoms or binaries (matching the maps' key type) as the column order and header row, instead of deriving it from the first map's keys (missing keys produce empty fields) |
| {line_ending, lf \| crlf} | Line terminator (default crlf, per RFC 4180) |

API

All functions below are in glazer_csv.

| Function | Description |
|---|---|
| decode/1, decode/2 | Decode a CSV binary or iolist to a list of rows (or maps with headers) |
| try_decode/1, try_decode/2 | Decode CSV, returning {ok, Rows} or {error, Reason} instead of raising |
| encode/1, encode/2 | Encode a list of rows (or maps with headers) to a CSV binary; raises {encode_error, {Msg, Term}} on failure |
| read_file/1, read_file/2 | Read a file and decode its contents as CSV |
| write_file/2, write_file/3 | Encode rows to CSV and write them to a file |
| stream_decoder/0, stream_decoder/1 | Create an incremental CSV decode state for chunked input |
| stream_feed/2 | Feed a chunk to a CSV stream decoder, returning completed rows |
| stream_eof/1 | Flush a CSV stream decoder at end-of-input |

Benchmarking CSV

$ PARALLEL=2 make bench-csv
==> Running benchmarks with parallelism: 1 (optimization: O3 - PGO)

(numbers in µs)
CSV               small (1.3K)          medium (130.9K)         large (3433.1K)
                decode     encode       decode     encode       decode     encode
-----------------------------------------------------------------------------------
glazer             9.3        3.8        676.9      239.1      20867.1     9657.3
nimble_csv        29.6       27.7       3469.0     2694.8     144525.9   100152.0
rusty_csv         27.8        n/a        740.6        n/a      22251.9        n/a
csv               65.3      156.3       5733.1    16888.9     298011.0   467137.4
erl_csv          440.8      296.6      37380.3    22897.5      TIMEOUT    TIMEOUT

glazer vs rusty_csv

Note: rusty_csv is a Rust NIF (via rustler) and the closest performance comparison for glazer's CSV decoder: both use SIMD (AVX2/SSE2) to scan for delimiters/quotes and return zero-copy sub-binaries for unescaped fields. It's excluded from the default make bench-csv run because it can't be fetched with deps.get alongside yaml_rustler (incompatible rustler version constraints; see the BENCH_SET note in the Makefile). Run it explicitly with make bench-csv BENCH_SET=csv:

$ PARALLEL=2 make bench-csv BENCH_SET=csv

The benchmark table above merges the results of runs with and without BENCH_SET=csv.

(rusty_csv has no CSV encoder, so its encode column is n/a.)

Decode performance is close on medium and large files: glazer is ahead by roughly 7–9% in the table above, a margin small enough to be partly run-to-run noise rather than a structural gap. On small inputs glazer is about 3× faster, where per-call overhead dominates (no per-call Rust/NIF marshalling layer beyond rustler's own). Profiling glazer's large-file decode (3.4 MB / 25K rows / 150K fields) by incrementally stubbing out parts of the pipeline shows where the time actually goes:

| Stage | Share of decode time |
|---|---|
| SIMD scan (find delimiters/quotes) | ~7% |
| enif_make_sub_binary per field | ~31% |
| enif_make_list_from_array per row | ~26% |
| Remaining bookkeeping (field/row vectors, outer list) | ~34% |

Scanning is a small fraction of the total; the dominant cost is the NIF term-construction calls inherent to the [[field, ...], ...] row-of-lists shape both libraries return — rusty_csv pays the same enif_make_sub_binary and list-construction costs per field/row, just batched at the end of a two-phase scan-then-extract design instead of interleaved during scanning like glazer. There's no scanning-strategy change available that would close the remaining gap without changing the output term shape itself (e.g. {return, tuple}, which avoids rebuilding a list per row).

Big integers

JSON/YAML/CSV numbers that don't fit into a 64-bit integer are decoded as Erlang big integers (and big integers are encoded back to their exact decimal representation).

API

| Function | Description |
|---|---|
| encode_integer/1 | Encode an integer to its JSON decimal-string representation |
| decode_integer/1 | Decode a JSON number string to an Erlang integer, raising on invalid input |
| try_decode_integer/1 | Decode a JSON number string to an Erlang integer, returning {ok, Int} or {error, invalid_number_format} |

encode_integer/1, decode_integer/1, and try_decode_integer/1 expose the same conversion routines directly, independent of JSON/YAML/CSV parsing and encoding:

1> glazer:encode_integer(123456789012345678901234567890).
<<"123456789012345678901234567890">>

2> glazer:decode_integer(<<"123456789012345678901234567890">>).
123456789012345678901234567890

3> glazer:try_decode_integer(<<"not a number">>).
{error, invalid_number_format}

See the module's documentation (src/glazer.erl) for full type specs and details.
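The split between the 64-bit path and the big-integer path described at the top of this section can be sketched in C++ as follows. This is only an illustration under stated assumptions, not glazer's code: parse_int64 is a hypothetical helper name, and it accepts only an optional minus sign followed by decimal digits. When it reports failure on a well-formed but oversized number, a decoder would hand the digits to its arbitrary-precision routine instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: parse an optionally signed decimal string into int64_t.
// Returns false if the text is not a plain integer or does not fit in 64 bits.
inline bool parse_int64(const std::string& s, std::int64_t& out) {
    bool neg = false;
    std::size_t i = 0;
    if (!s.empty() && s[0] == '-') { neg = true; i = 1; }
    if (i == s.size()) return false;                  // "" or "-"

    // Accumulate as a negative number: |INT64_MIN| > INT64_MAX, so the
    // negative range can hold every representable magnitude.
    const std::int64_t kMin = std::numeric_limits<std::int64_t>::min();
    std::int64_t acc = 0;
    for (; i < s.size(); ++i) {
        if (s[i] < '0' || s[i] > '9') return false;   // not a digit
        const int digit = s[i] - '0';
        // acc * 10 - digit would drop below INT64_MIN: report overflow.
        if (acc < (kMin + digit) / 10) return false;
        acc = acc * 10 - digit;
    }
    assert(acc <= 0);
    if (!neg) {
        if (acc == kMin) return false;                // +2^63 does not fit
        acc = -acc;
    }
    out = acc;
    return true;
}
```

Accumulating in the negative range lets the same loop accept the most negative 64-bit value without a special case.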

Limitations

Scope

glazer targets formats that map naturally onto a tree of Erlang maps/lists/scalars — JSON and YAML both fit this model directly, so a single decode/encode pair can convert losslessly between the format and native terms. XML is intentionally not planned: its data model (tagged elements, attributes, mixed text/element content, namespaces, processing instructions, entities) has no single natural Erlang term representation, and any choice (xmerl-style tuples, JSON-like maps with @attr/#text keys, etc.) is a lossy or awkward fit compared to formats that are already trees of scalars and collections. Erlang's standard library already ships xmerl for XML; there's little value in duplicating it here with a different, opinionated term shape.

Nesting depth

The JSON and YAML decoders both cap recursion at 256 levels of nesting (arrays/objects for JSON; mappings/sequences for YAML). Inputs that exceed this limit are rejected with a decode error rather than crashing the VM by overflowing the C stack.

| Format | Limit | Error returned |
|---|---|---|
| JSON | 256 | {error, <<"exceeded maximum nesting depth at offset N">>} |
| YAML | 256 | {error, <<"exceeded maximum nesting depth at offset N">>} |

256 levels is sufficient for any reasonable real-world document; it is deliberately not configurable, because the limit exists to protect the Erlang VM process (the NIF runs on the scheduler thread) from runaway recursive descent on adversarial input.
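A depth cap of this kind is usually a counter threaded through the recursive descent and checked before each recursion. The sketch below shows the idea on a toy grammar (nested brackets and braces around a single-character scalar). The names kMaxDepth, ParseResult, parse_value, and parse_document are hypothetical, and whether glazer counts the outermost container as level 0 or level 1 is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical constant name; the cap of 256 is the value stated above.
constexpr int kMaxDepth = 256;

enum class ParseResult { Ok, Malformed, TooDeep };

// Toy grammar: a value is "[" value* "]", "{" value* "}", or the scalar 'x'.
// Only the depth bookkeeping is meant to carry over to a real decoder.
static ParseResult parse_value(const std::string& in, std::size_t& pos, int depth) {
    assert(depth >= 0 && depth <= kMaxDepth);
    if (pos >= in.size()) return ParseResult::Malformed;
    const char c = in[pos];
    if (c == 'x') { ++pos; return ParseResult::Ok; }
    if (c != '[' && c != '{') return ParseResult::Malformed;

    // Check before recursing, so the C stack never grows past the cap.
    if (depth >= kMaxDepth) return ParseResult::TooDeep;
    const char close = (c == '[') ? ']' : '}';
    ++pos;
    while (pos < in.size() && in[pos] != close) {
        const ParseResult r = parse_value(in, pos, depth + 1);
        if (r != ParseResult::Ok) return r;
    }
    if (pos >= in.size()) return ParseResult::Malformed;  // missing closer
    ++pos;                                                 // consume closer
    return ParseResult::Ok;
}

inline ParseResult parse_document(const std::string& in) {
    std::size_t pos = 0;
    const ParseResult r = parse_value(in, pos, 0);
    if (r != ParseResult::Ok) return r;
    return pos == in.size() ? ParseResult::Ok : ParseResult::Malformed;
}
```

With this bookkeeping, 256 nested containers parse and the 257th opener is rejected before any further stack is consumed.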

Performance Optimization Details

In the benchmarks above, glazer matches or beats every other library tested on both encoding and decoding across JSON, YAML, and CSV. On JSON decoding it has a slight edge over torque (a Rust sonic-rs NIF) on every benchmarked workload, and on encoding it leads by ~10–30%. Both sit well ahead of the remaining contenders (simdjsone, jiffy, and the pure Erlang/Elixir libraries jason, thoas, euneus, and OTP's built-in json). On CSV its closest competitor is rusty_csv, also Rust-backed, which decodes at roughly the same speed but has no encoder. Some observations about glazer's design:

  • No tuple-of-binaries intermediate representation. glazer decodes straight to native Erlang terms (maps, lists, binaries, numbers) and encodes straight from them, in a single pass, with no generic JSON-tree staging step — minimizing allocation and copying on both the decode and encode paths.
  • Big integer support. Numbers that overflow 64 bits decode to Erlang bignums (and encode back to their exact decimal form); see Big integers.
  • No external C++ dependencies. The NIF is fully self-contained: no CMake and no vendored third-party library to pull at build time. This makes it easier to use as a dependency than Rust-based NIFs, which need a Rust toolchain and crates such as sonic-rs to build.

A few implementation techniques in c_src/glazer_nif.cpp account for most of the gap over the slower contenders:

  • Single-pass, zero-copy decode/encode. As noted above, there's no intermediate generic JSON tree — the decoder builds Erlang terms directly from the input bytes and the encoder writes JSON bytes directly from Erlang terms. This removes a whole staging allocate-and-copy pass that tree-based decoders pay for.

  • Sub-binary string/field values (zero allocation on decode). Shared across the JSON, YAML, and CSV decoders: unescaped scalars are returned as enif_make_sub_binary terms — a slice of the original input binary — rather than newly allocated copies. No memcpy or heap allocation occurs for the common case (JSON strings with no \ escapes, CSV fields without embedded quotes, single-line YAML plain scalars). Only values that need unescaping or reassembly (escaped JSON strings, quoted/folded YAML scalars, CSV fields with doubled quotes) pay the copy cost. The copy_strings decode option opts back into copying for every value when decoded results are long-lived and the input is large (keeping one sub-binary alive would otherwise pin the entire input buffer in memory).

  • Inline, growable output buffer (OutBuf). Encoding writes into a 4 KB stack-allocated buffer first; only documents that exceed that spill to the heap, growing geometrically via malloc/realloc (the latter resizes in place when possible, avoiding a copy on every growth — a plain new[]/delete[] doubling strategy can't do this).

  • Pre-reserved worst-case output, raw-pointer inner loop. Before encoding any string, json_escape_string and emit_double_quoted (YAML) call out.ensure(len * 6 + 2) once — the absolute worst case of six output bytes per input byte (\uXXXX) plus two quote characters. After that single reservation the inner loop writes through a raw char* pointer with no further bounds checks or ensure() calls. This removes a branch and a potential realloc from every character in the hot path.

  • Dense escape table (ESCAPE_TAB). Instead of a per-character switch statement, a 256-entry constexpr table maps each byte to a {len, seq[7]} struct. Emitting an escape sequence is a single indexed table load followed by one memcpy(dst, e.seq, e.len), which is branch-free and inlined by the compiler. The same table is shared by the JSON and YAML encoders via glazer_common.hpp.

  • Key cache for repeated object keys (KeyCache). Real-world JSON documents reuse the same small set of key strings heavily (e.g. a Twitter feed has ~13K key occurrences across only ~94 distinct keys). KeyCache is an open-addressed hash table (power-of-two size, linear probing, FNV-1a hash with a precomputed-hash fast-reject before the memcmp) that lets a repeated key reuse the same already-built ERL_NIF_TERM binary instead of paying enif_make_new_binary + memcpy again. It's only engaged for inputs above a size threshold (KEY_CACHE_MIN_SIZE), since small payloads (RPC-sized messages) rarely repeat keys enough to amortize the lookup cost.

  • Epoch-counter lazy clearing. Both KeyCache and the scratch buffers it touches need to start "empty" on every decode call, but zero-initializing a multi-KB table for every single call — including tiny documents that never populate it — would cost more than the cache saves. Instead each cache entry carries a generation/epoch tag; a slot is considered live only if its epoch matches the cache's current m_epoch (itself seeded from a process-wide monotonically-increasing counter, so leftover garbage from a prior stack frame can never coincidentally look live). This makes cache construction effectively free, regardless of table size.

  • SIMD string scanning (NEON / AVX2 / SSE2). A shared find_escape_pos function in glazer_common.hpp scans for ", \, and control characters (c < 0x20) using an architecture cascade: AArch64 NEON (16 bytes/iter), x86 AVX2 (32 bytes/iter), SSE2 (16 bytes/iter), then a byte-table scalar fallback. Control-character detection uses a bias trick — XOR with 0x80 shifts the unsigned < 0x20 range into a region where a single signed vclt/cmpgt instruction covers all 32 values at once, avoiding 32 separate equality checks. The same scanner is used by both the JSON and YAML string encoders. Separate SIMD scanners handle format-specific stop sets: find_break (YAML line-break scanner), find_dq_special (YAML " / \ / LF / CR), find_field_end (CSV delimiter | LF | CR), and find_csv_special (CSV quoting check) — all with NEON, AVX2, and SSE2 paths.

  • SWAR whitespace skipping. skip_ws checks the next byte before paying for any wider load, then — for runs of whitespace — scans 8 bytes at a time using branch-free bit-twiddling ("SIMD within a register") to find the first non-whitespace byte. Minified JSON (the overwhelmingly common case) has little or no structural whitespace, so the single-byte fast path dominates; the 8-byte path handles pretty-printed inputs.

  • Fast integer formatting. Integers are written to JSON using a lookup-table-based digit-pair algorithm (avoiding division for small values) with a vendored lltoa fallback for larger numbers — faster than routing every integer through snprintf.
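The inline, growable output buffer from the list above can be sketched roughly as follows. This is a minimal illustration rather than glazer's actual OutBuf: the member names, the bad_alloc error handling, and the doubling factor are assumptions, and only the 4 KB inline size and the malloc/realloc growth come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Sketch of an output buffer with inline storage that spills to the heap.
class OutBuf {
public:
    static constexpr std::size_t kInline = 4096;       // 4 KB inline storage

    OutBuf() : data_(inline_), size_(0), cap_(kInline) {}
    ~OutBuf() { if (data_ != inline_) std::free(data_); }
    OutBuf(const OutBuf&) = delete;
    OutBuf& operator=(const OutBuf&) = delete;

    // Make room for at least `extra` more bytes.
    void ensure(std::size_t extra) {
        if (size_ + extra <= cap_) return;
        std::size_t new_cap = cap_;
        while (new_cap < size_ + extra) new_cap *= 2;  // geometric growth
        char* p;
        if (data_ == inline_) {
            // First spill: the inline bytes have to be copied once.
            p = static_cast<char*>(std::malloc(new_cap));
            if (!p) throw std::bad_alloc();
            std::memcpy(p, inline_, size_);
        } else {
            // Later growth: realloc may extend the block in place.
            p = static_cast<char*>(std::realloc(data_, new_cap));
            if (!p) throw std::bad_alloc();
        }
        data_ = p;
        cap_ = new_cap;
    }

    void append(const char* src, std::size_t n) {
        ensure(n);
        assert(size_ + n <= cap_);
        std::memcpy(data_ + size_, src, n);
        size_ += n;
    }

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return data_ != inline_; }

private:
    char* data_;
    std::size_t size_;
    std::size_t cap_;
    char inline_[kInline];
};
```

Small documents never touch the allocator in this design; the heap is involved only once the output passes the inline capacity.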
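The worst-case reservation and the dense escape table from the list above fit together as in the sketch below. It is an assumption-laden illustration, not the code in glazer_common.hpp: std::string stands in for the output buffer, escape_json, kEscapeTab, and make_escape_table are hypothetical names, and the choice of which bytes get short escapes is a common JSON convention rather than something the description specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// One table entry: the bytes to emit for an input byte, and how many.
struct Esc { unsigned char len; char seq[7]; };
struct EscTable { Esc e[256]; };

// Build the 256-entry table at compile time (needs C++14 constexpr).
constexpr EscTable make_escape_table() {
    EscTable t{};
    const char hex[] = "0123456789abcdef";
    for (int i = 0; i < 256; ++i) {
        if (i < 0x20) {                          // control bytes: \u00XX
            t.e[i].len = 6;
            t.e[i].seq[0] = '\\'; t.e[i].seq[1] = 'u';
            t.e[i].seq[2] = '0';  t.e[i].seq[3] = '0';
            t.e[i].seq[4] = hex[i >> 4];
            t.e[i].seq[5] = hex[i & 15];
        } else {                                 // everything else: itself
            t.e[i].len = 1;
            t.e[i].seq[0] = static_cast<char>(i);
        }
    }
    // Short two-byte escapes override the generic entries.
    const char from[] = {'"', '\\', '\b', '\f', '\n', '\r', '\t'};
    const char to[]   = {'"', '\\', 'b',  'f',  'n',  'r',  't'};
    for (int k = 0; k < 7; ++k) {
        Esc& x = t.e[static_cast<unsigned char>(from[k])];
        x.len = 2; x.seq[0] = '\\'; x.seq[1] = to[k];
    }
    return t;
}
constexpr EscTable kEscapeTab = make_escape_table();

// Append a quoted, escaped JSON string for `in` to `out`.
inline void escape_json(const std::string& in, std::string& out) {
    const std::size_t start = out.size();
    // Reserve the worst case once: 6 output bytes per input byte + 2 quotes.
    out.resize(start + in.size() * 6 + 2);
    char* dst = &out[start];
    *dst++ = '"';
    for (unsigned char c : in) {
        const Esc& e = kEscapeTab.e[c];
        std::memcpy(dst, e.seq, e.len);          // no capacity check per byte
        dst += e.len;
    }
    *dst++ = '"';
    const std::size_t used = static_cast<std::size_t>(dst - out.data());
    assert(used <= out.size());
    out.resize(used);                            // trim the unused slack
}
```

Because the reservation covers six bytes per input byte, the inner loop can write through the raw pointer without ever re-checking capacity.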
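The key cache and its epoch-based lazy clearing, both described in the list above, can be illustrated together. The sketch makes several simplifying assumptions: a "term" is just an index handed out by a hypothetical TermFactory (standing in for building an Erlang binary), the cache is a reusable object whose reset() bumps the epoch (rather than a per-call table seeded from a process-wide counter), and the table size of 256 slots is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for term construction: counts how many "terms" were built.
struct TermFactory {
    std::vector<std::string> terms;
    std::size_t built = 0;
    std::uint32_t make(const char* s, std::size_t n) {
        terms.emplace_back(s, n);
        ++built;
        return static_cast<std::uint32_t>(terms.size() - 1);
    }
};

inline std::uint32_t fnv1a(const char* s, std::size_t n) {
    std::uint32_t h = 2166136261u;               // FNV-1a 32-bit offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(s[i]);
        h *= 16777619u;                          // FNV-1a 32-bit prime
    }
    return h;
}

struct Slot {
    std::uint32_t epoch;   // live only if equal to the cache's current epoch
    std::uint32_t hash;
    std::uint32_t term;
    const char* key;       // points into the input; valid while it is alive
    std::size_t len;
};

class KeyCache {
public:
    static constexpr std::size_t kSlots = 256;   // power of two
    static_assert((kSlots & (kSlots - 1)) == 0, "mask indexing needs 2^n");

    KeyCache() : slots_(), epoch_(1) {}          // slots start at epoch 0: dead

    // O(1) "clear": slots written under an older epoch become dead.
    // (A production version would also have to handle epoch wrap-around.)
    void reset() { ++epoch_; }

    std::uint32_t intern(const char* key, std::size_t len, TermFactory& f) {
        assert(key != nullptr);
        const std::uint32_t h = fnv1a(key, len);
        std::size_t i = h & (kSlots - 1);
        for (std::size_t probes = 0; probes < kSlots; ++probes) {
            Slot& s = slots_[i];
            if (s.epoch != epoch_) {             // empty or stale: insert here
                s = Slot{epoch_, h, f.make(key, len), key, len};
                return s.term;
            }
            // Hash and length checks reject most mismatches before memcmp.
            if (s.hash == h && s.len == len && std::memcmp(s.key, key, len) == 0)
                return s.term;                   // hit: reuse the built term
            i = (i + 1) & (kSlots - 1);          // linear probing
        }
        return f.make(key, len);                 // table full: build uncached
    }

private:
    Slot slots_[kSlots];
    std::uint32_t epoch_;
};
```

A repeated key costs one hash, one probe, and one memcmp instead of a fresh allocation, and starting a new document costs a single increment regardless of table size.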
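The bias trick mentioned under SIMD string scanning is easiest to see on a single byte. The scalar model below shows what one lane of the signed vector compare computes; is_control_biased and check_all_bytes are hypothetical names, and the real scanner would apply the same XOR and compare to 16 or 32 lanes at once.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one SIMD lane: true when c < 0x20 as an unsigned byte.
inline bool is_control_biased(std::uint8_t c) {
    // Flipping the top bit maps unsigned order onto signed order:
    // 0x00..0x1F becomes 0x80..0x9F, which is -128..-97 as int8_t.
    const std::int8_t biased = static_cast<std::int8_t>(c ^ 0x80);
    const std::int8_t limit  = static_cast<std::int8_t>(0x20 ^ 0x80);  // -96
    return biased < limit;   // one signed compare covers all 32 control bytes
}

// Exhaustive self-check against the plain unsigned comparison.
inline void check_all_bytes() {
    for (int c = 0; c < 256; ++c) {
        assert(is_control_biased(static_cast<std::uint8_t>(c)) == (c < 0x20));
    }
}
```

The bias is needed because byte-wise vector compares on SSE2 and AVX2 are signed, so an unbiased compare would misclassify bytes at or above 0x80.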
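The SWAR whitespace skipping described in the list above can be sketched as below. This is one possible formulation, not necessarily the one glazer uses: it assumes a little-endian target, the GCC/Clang builtin __builtin_ctzll, and the four JSON whitespace bytes (space, tab, LF, CR). The helper names zero_lanes, eq_lanes, and is_ws are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t kLo7 = 0x7F7F7F7F7F7F7F7FULL;
constexpr std::uint64_t kHi  = 0x8080808080808080ULL;

// 0x80 in every byte lane of x that is zero, 0x00 elsewhere. The addition
// cannot carry across lanes because each lane is masked to 7 bits first.
inline std::uint64_t zero_lanes(std::uint64_t x) {
    return ~(((x & kLo7) + kLo7) | x | kLo7);
}

// 0x80 in every byte lane of w that equals b.
inline std::uint64_t eq_lanes(std::uint64_t w, unsigned char b) {
    return zero_lanes(w ^ (0x0101010101010101ULL * b));
}

inline bool is_ws(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Returns the index of the first non-whitespace byte in [p, p + n), or n.
inline std::size_t skip_ws(const char* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    // Fast path: minified input usually has no whitespace here at all.
    if (n == 0 || !is_ws(p[0])) return 0;

    std::size_t i = 0;
    while (i + 8 <= n) {
        std::uint64_t w;
        std::memcpy(&w, p + i, 8);               // unaligned-safe 8-byte load
        const std::uint64_t ws = eq_lanes(w, ' ')  | eq_lanes(w, '\t') |
                                 eq_lanes(w, '\n') | eq_lanes(w, '\r');
        const std::uint64_t non_ws = ~ws & kHi;  // 0x80 marks non-whitespace
        if (non_ws != 0) {
            // Little-endian: the lowest set bit belongs to the earliest byte.
            return i + static_cast<std::size_t>(__builtin_ctzll(non_ws) >> 3);
        }
        i += 8;
    }
    while (i < n && is_ws(p[i])) ++i;            // scalar tail (< 8 bytes)
    return i;
}
```

The single-byte check up front keeps the common minified case to one comparison, and the 8-byte loop only runs once whitespace has actually been seen.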

License

glazer is released under the MIT License. You can use the source code freely in any project, including commercial applications, as long as you give credit by publishing the contents of the LICENSE file somewhere in your documentation.