libfyaml Python Binding

March 16, 2026

The libfyaml Python binding exposes the high-performance libfyaml C library directly. Parsed documents are represented as FyGeneric objects — lazy wrappers that defer conversion to Python types until you ask for them. This keeps memory low and lets you navigate large documents without materialising every node.

Quick Start
Parsing
- Parse modes
- Parser options
The FyGeneric Type
Serialisation
- Scalar styles
Converting Python objects
Path navigation
Mutability
FyDocumentState
Memory management
Error handling
Comparison with PyYAML

Quick Start

import libfyaml as fy

# Parse a YAML string
doc = fy.loads("name: Alice\nage: 30")
print(doc["name"])   # FyGeneric wrapping "Alice"
print(str(doc["name"]))  # "Alice"
print(doc.to_python())   # {'name': 'Alice', 'age': 30}

# Parse a file
doc = fy.load("config.yaml")

# Serialise back to YAML
print(fy.dumps(doc))

# Parse JSON
data = fy.loads('{"x": 1}', mode='json')

Parsing

`loads(s, **options) → FyGeneric`

Parse a YAML or JSON string. Raises ValueError if the input contains more than one document — use loads_all for multi-document streams.

doc = fy.loads("key: value")
docs = fy.loads_all("---\na: 1\n---\nb: 2")  # list of FyGeneric

`load(file, **options) → FyGeneric`

Parse from a file path (string — uses mmap internally) or any file-like object with a .read() method.

doc = fy.load("data.yaml")

with open("data.yaml") as f:
    doc = fy.load(f)

`loads_all(s, **options) → list[FyGeneric]`

`load_all(file, **options) → list[FyGeneric]`

Return all documents in a multi-document stream as a list.

docs = fy.loads_all("---\n1\n---\n2\n---\n3")
# [FyGeneric(1), FyGeneric(2), FyGeneric(3)]

Parse modes

The mode parameter controls which YAML dialect is accepted:

Mode string	Meaning
`'yaml'`, `'yaml1.2'`, `'1.2'`	YAML 1.2 — the default
`'yaml1.1'`, `'1.1'`	YAML 1.1 (accepts merge keys `<<`, sexagesimal numbers, etc.)
`'yaml1.1-pyyaml'`, `'pyyaml'`	YAML 1.1 with PyYAML-compatible quirks (used by the compat layer)
`'json'`	Strict JSON
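The merge key `<<` accepted in YAML 1.1 mode can be pictured in plain-dict terms: merged keys fill in defaults, and keys written explicitly in the mapping win. The sketch below illustrates that semantics on ordinary Python dicts. It does not call libfyaml, and `apply_merge` is a hypothetical helper name used only for illustration.

```python
def apply_merge(mapping):
    """Resolve a top-level '<<' merge key on a plain dict.

    Hypothetical helper for illustration; not part of libfyaml.
    """
    merged = {}
    sources = mapping.get("<<", [])
    # '<<' may hold a single mapping or a list of mappings.
    if isinstance(sources, dict):
        sources = [sources]
    for source in sources:
        for key, value in source.items():
            # Earlier mappings in the list take precedence over later ones.
            merged.setdefault(key, value)
    # Keys written explicitly always override merged ones.
    merged.update({k: v for k, v in mapping.items() if k != "<<"})
    return merged


defaults = {"timeout": 30}
server = apply_merge({"<<": defaults, "host": "localhost"})
print(server)  # {'timeout': 30, 'host': 'localhost'}
```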

# Merge keys only work in YAML 1.1
doc = fy.loads("""
defaults: &defaults
  timeout: 30

server:
  <<: *defaults
  host: localhost
""", mode='yaml1.1')

Parser options

All four parse functions accept the same keyword options:

Option	Default	Description
`mode`	`'yaml'`	Dialect — see above
`dedup`	`True`	Use the deduplication allocator (saves memory for documents with repeated content)
`trim`	`True`	Release unused allocator memory after parsing
`mutable`	`False`	Produce mutable `FyGeneric` objects (required for `__setitem__` and `set_at_path`)
`collect_diag`	`False`	Attach parse diagnostics to the result instead of raising
`create_markers`	`False`	Record byte/line/column positions for every node
`keep_comments`	`False`	Preserve YAML comments in the document
`keep_style`	`False`	Preserve original scalar styles (literal, folded, quoted, …)

The FyGeneric Type

FyGeneric is the type returned by all parse functions. It wraps a C fy_generic value without copying data. Conversion to Python only happens when you explicitly ask for it.

doc = fy.loads("x: 42")
type(doc)          # <class 'libfyaml._libfyaml.FyGeneric'>
doc.__class__      # <class 'dict'>  — the Python equivalent class

Type checking

Eight predicate methods are available; each returns bool:

v = fy.loads("42")
v.is_null()       # False
v.is_bool()       # False
v.is_int()        # True
v.is_float()      # False
v.is_string()     # False
v.is_sequence()   # False
v.is_mapping()    # False
v.is_indirect()   # True if the value carries a tag or anchor

Converting to Python

doc = fy.loads("items: [1, 2, 3]")

# Recursive — the whole document becomes plain Python
doc.to_python()   # {'items': [1, 2, 3]}

# Scalar coercions
n = fy.loads("99")
int(n)    # 99
float(n)  # 99.0
bool(n)   # True
str(n)    # "99"

to_python() raises TypeError if a mapping key is unhashable (e.g. a nested mapping used as a key).

Container access

Sequences and mappings support the standard Python container protocol:

doc = fy.loads("fruits: [apple, banana, cherry]")
fruits = doc["fruits"]

len(fruits)      # 3
fruits[0]        # FyGeneric("apple")
str(fruits[0])   # "apple"
"banana" in fruits  # True (linear scan)

for item in fruits:
    print(str(item))

# Mappings
doc["fruits"]           # FyGeneric sequence
doc.keys()              # ['fruits']
doc.values()            # [FyGeneric sequence]
doc.items()             # [('fruits', FyGeneric sequence)]

Attribute access on mappings delegates to the underlying dict:

doc = fy.loads("host: localhost\nport: 8080")
str(doc.host)   # "localhost"
int(doc.port)   # 8080

Numeric operations on integer and float values work directly:

v = fy.loads("10")
v + 5    # 15
v * 2    # 20
v > 5    # True

Tags and anchors

doc = fy.loads("value: !!int '42'")
v = doc["value"]
v.has_tag()    # True
v.get_tag()    # "tag:yaml.org,2002:int"

doc2 = fy.loads("x: &myanchor hello\ny: *myanchor")
doc2["x"].has_anchor()   # True
doc2["x"].get_anchor()   # "myanchor"

Source markers

Markers record the byte offset, line, and column of each node in the original source. Enable them at parse time with create_markers=True.

doc = fy.loads("host: localhost\nport: 8080", create_markers=True)

m = doc["host"].get_marker()
# (start_byte, start_line, start_col, end_byte, end_line, end_col)
# e.g. (6, 0, 6, 15, 0, 15)

doc["host"].has_marker()   # True
doc["port"].get_marker()   # (22, 1, 6, 26, 1, 10)

Lines and columns are zero-based. get_marker() returns None when markers were not enabled.
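A marker tuple can be turned into an editor-style position or used to slice the original source. The sketch below works on the example tuple shown for doc["host"] and assumes that end offsets are exclusive and that byte offsets refer to the UTF-8 encoding of the source; `marker_to_span` and `slice_source` are hypothetical helper names, not part of the binding.

```python
def marker_to_span(marker):
    """Format a zero-based marker tuple as a 1-based 'line:col-line:col' string."""
    _start_byte, start_line, start_col, _end_byte, end_line, end_col = marker
    return f"{start_line + 1}:{start_col + 1}-{end_line + 1}:{end_col + 1}"


def slice_source(source, marker):
    """Return the source text covered by a marker, using its byte offsets.

    Assumption: offsets index the UTF-8 bytes and the end offset is exclusive.
    """
    start_byte, _, _, end_byte, _, _ = marker
    return source.encode("utf-8")[start_byte:end_byte].decode("utf-8")


source = "host: localhost\nport: 8080"
host_marker = (6, 0, 6, 15, 0, 15)  # example tuple for doc["host"]

print(marker_to_span(host_marker))        # 1:7-1:16
print(slice_source(source, host_marker))  # localhost
```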

Comments

Preserve YAML comments by parsing with keep_comments=True.

yaml_text = """\
# Server settings
host: localhost  # primary
port: 8080
"""
doc = fy.loads(yaml_text, keep_comments=True)
doc["host"].get_comment()   # "# primary"
doc["host"].has_comment()   # True

Diagnostics

With collect_diag=True parse errors are attached to the document rather than raised immediately. This lets you process partially-valid input.

doc = fy.loads("good: ok\nbad: {unclosed", collect_diag=True)
doc.has_diag()   # True
doc.get_diag()   # FyGeneric describing the error(s)

Serialisation

`dumps(obj, *, compact=False, json=False, style=None, indent=0) → str`

Serialise a FyGeneric or plain Python object to a YAML (or JSON) string.

doc = fy.loads("name: Alice\nscores: [10, 20, 30]")
print(fy.dumps(doc))
# name: Alice
# scores:
#   - 10
#   - 20
#   - 30

print(fy.dumps(doc, compact=True))
# {name: Alice, scores: [10, 20, 30]}

print(fy.dumps(doc, json=True))
# {"name": "Alice", "scores": [10, 20, 30]}

indent sets the indentation width (2–8 spaces; 0 uses the library default).

`dump(file, obj, *, mode='yaml', compact=False)`

Write to a file path (string) or file-like object. mode accepts 'yaml' or 'json'.

fy.dump("output.yaml", doc)

with open("output.json", "w") as f:
    fy.dump(f, doc, mode='json')

`dumps_all(documents, *, compact=False, json=False, style=None) → str`

`dump_all(file, documents, *, compact=False, json=False)`

Serialise a list of documents with --- separators.

docs = fy.loads_all("---\na: 1\n---\nb: 2")
print(fy.dumps_all(docs))
# ---
# a: 1
# ---
# b: 2

Individual node serialisation

FyGeneric objects have their own .dump() method:

import sys

doc = fy.loads("x: 1\ny: 2")
doc["x"].dump()                          # returns "1\n"
doc["x"].dump(strip_newline=True)        # returns "1"
doc["x"].dump("node.yaml")               # writes to file
doc["x"].dump(sys.stdout, mode='json')   # writes to file object

Scalar styles

The style parameter controls how scalar values are written. Accepted values:

Style	Effect
`None` or `'default'`	Library default (usually plain)
`'original'`	Preserve the style from the parsed input (requires `keep_style=True` at parse time)
`'block'`	Block scalars (literal `|` or folded `>`)
`'flow'`	Flow / inline style
`'pretty'`	Readable multi-line format
`'compact'`	Compact single-line
`'oneline'`	Force everything onto one line

doc = fy.loads("text: 'hello world'")
print(fy.dumps(doc, style='block'))
print(fy.dumps(doc, style='flow'))

Converting Python objects

`from_python(obj, *, tag=None, style=None, mutable=False, dedup=True) → FyGeneric`

Convert a plain Python object (dict, list, str, int, float, bool, None) to a FyGeneric. Useful for attaching tags or styles before serialisation.

# Attach a YAML tag
v = fy.from_python("hello", tag="!mytag")
print(fy.dumps(v))   # !mytag hello

# Control the scalar style
text = fy.from_python("line one\nline two\n", style='|')
print(fy.dumps(text))
# |
#   line one
#   line two

Scalar style values accepted by from_python:

Style	Meaning
`'|'`	Literal block scalar
`'>'`	Folded block scalar
`"'"`	Single-quoted
`'"'`	Double-quoted
`'plain'` or `''`	Plain (unquoted)

Path navigation

`get_at_path(path) → FyGeneric`

`get_at_unix_path(path_str) → FyGeneric`

Navigate into a nested document. A path is a list of keys (strings) and indices (integers).

doc = fy.loads("""
servers:
  - host: web01
    port: 80
  - host: web02
    port: 443
""")

doc.get_at_path(["servers", 0, "host"])      # FyGeneric("web01")
doc.get_at_unix_path("/servers/0/host")      # FyGeneric("web01")
doc.get_at_unix_path("/servers/1/port")      # FyGeneric(443)

get_at_path and get_at_unix_path raise KeyError if the path does not exist.
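For data that has already been converted with to_python(), the same list-of-steps idea can be sketched on plain dicts and lists. This is an illustration of the lookup semantics described above, not the binding's implementation; `walk_path` is a hypothetical helper.

```python
def walk_path(data, path):
    """Follow a list of keys (str) and indices (int) through nested containers."""
    node = data
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError) as exc:
            # Report every kind of miss as KeyError, mirroring the documented behaviour.
            raise KeyError(f"path {path!r} not found at step {step!r}") from exc
    return node


data = {
    "servers": [
        {"host": "web01", "port": 80},
        {"host": "web02", "port": 443},
    ]
}

print(walk_path(data, ["servers", 0, "host"]))  # web01
print(walk_path(data, ["servers", 1, "port"]))  # 443
```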

`get_path() → tuple` / `get_unix_path() → str`

Return the path of a node within its document (useful when iterating):

doc = fy.loads("a:\n  b:\n    c: 42")
v = doc.get_at_unix_path("/a/b/c")
v.get_unix_path()    # "/a/b/c"
v.get_path()         # ('a', 'b', 'c')

Path utility functions

fy.path_list_to_unix_path(["servers", 0, "host"])   # "/servers/0/host"
fy.unix_path_to_path_list("/servers/0/host")         # ["servers", 0, "host"]
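
The conversion between the two path forms can be pictured with a small pure-Python sketch. It assumes that all-digit components denote sequence indices and ignores any escaping of `/` inside keys; the library's own functions may handle more cases. `to_unix_path` and `from_unix_path` are hypothetical names.

```python
def to_unix_path(path):
    """Join keys and indices into a '/'-separated path string."""
    return "/" + "/".join(str(step) for step in path)


def from_unix_path(unix_path):
    """Split a '/'-separated path into keys (str) and indices (int)."""
    steps = []
    for part in unix_path.split("/"):
        if not part:
            continue  # skip the empty piece before the leading '/'
        # Assumption: ASCII all-digit components are sequence indices.
        if part.isascii() and part.isdigit():
            steps.append(int(part))
        else:
            steps.append(part)
    return steps


print(to_unix_path(["servers", 0, "host"]))   # /servers/0/host
print(from_unix_path("/servers/0/host"))      # ['servers', 0, 'host']
```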

Mutability

By default FyGeneric objects are immutable. Pass mutable=True to the parse function (or from_python) to allow in-place modification.

doc = fy.loads("x: 1\ny: 2", mutable=True)

doc["x"] = 99
str(doc["x"])   # "99"

doc.set_at_path(["y"], "updated")
doc.set_at_unix_path("/x", 0)

print(fy.dumps(doc))
# x: 0
# y: updated

Attempting to modify an immutable object raises TypeError.

FyDocumentState

FyDocumentState carries the YAML directives that appeared before a document. Access it via FyGeneric.document_state.

doc = fy.loads("%YAML 1.2\n---\nkey: value")
ds = doc.document_state

ds.version           # (1, 2)
ds.version_explicit  # True
ds.json_mode         # False
ds.tags              # list of {'handle': ..., 'prefix': ...} dicts
ds.tags_explicit     # True if %TAG directives were present

document_state is None for values that are not document roots.

Memory management

Allocator strategy

The dedup=True default uses a deduplication allocator that stores only one copy of repeated strings or scalars. This pays off for large documents with repeated content (e.g. YAML files with many identical keys or values): in the benchmark appendix, the RSS delta on a Kubernetes CRD bundle drops from +10 MB to +3 MB.

Set dedup=False to use the standard allocator, which may be faster for small documents or documents with little repetition.
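
The idea behind deduplication can be shown with a toy value pool in pure Python. This is only an analogy for the concept of storing one copy per distinct value; it says nothing about how libfyaml's C allocator is actually implemented, and `DedupStore` is a hypothetical class.

```python
class DedupStore:
    """Toy pool that keeps a single object for each distinct string."""

    def __init__(self):
        self._pool = {}
        self.requests = 0

    def store(self, value):
        self.requests += 1
        # setdefault returns the pooled object when an equal value already exists.
        return self._pool.setdefault(value, value)

    @property
    def unique(self):
        return len(self._pool)


store = DedupStore()
keys = ["name", "type", "description"] * 1000   # heavy repetition, as in schemas
refs = [store.store(k) for k in keys]

print(store.requests, store.unique)  # 3000 3
```

With highly varied content the pool would hold almost as many entries as there were requests, which is the case where the lookup work buys little.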

Trim

trim=True (default) releases unused allocator pages after parsing is complete. Disable with trim=False if you will be building on the document after parsing and want to avoid reallocation.

Manual trim

doc = fy.loads(large_yaml, trim=False)
# ... do some work ...
doc.trim()   # release unused memory now

Clone

clone() creates an independent copy of a FyGeneric value, decoupled from the original document's allocator:

original = fy.load("big.yaml")
part = original.get_at_unix_path("/config/server").clone()
del original   # can now be collected

Error handling

Exception	Raised when
`ValueError`	Parse error; invalid mode string; invalid style; multiple documents where one was expected
`TypeError`	Wrong argument type; mutation on an immutable object; unhashable mapping key in `to_python()` or `items()`
`KeyError`	Path not found in `get_at_path` / `get_at_unix_path`
`RuntimeError`	Internal builder or emitter failure; file write error
`AttributeError`	Attribute access on a non-mapping `FyGeneric`
`NotImplementedError`	`del` on a `FyGeneric` item

try:
    doc = fy.loads("key: [unclosed")
except ValueError as e:
    print(f"Parse error: {e}")

# Or collect errors without raising:
doc = fy.loads("key: [unclosed", collect_diag=True)
if doc.has_diag():
    print(doc.get_diag().to_python())

Comparison with PyYAML

This section describes how the core libfyaml binding relates to PyYAML.

Where they are similar

Function names: load, load_all, dump, and dump_all mirror PyYAML's function names; the string-based loads and dumps follow the naming of Python's json module, since PyYAML has no loads or dumps.
Python types out: both ultimately produce dict, list, str, int, float, bool, and None. Call .to_python() on a FyGeneric to get the plain Python value.
YAML tag handling: both support !!str, !!int, !!float, !!bool, !!null, !!seq, !!map, !!binary, and custom tags.
Multi-document streams: both support documents separated by `---` via load_all; libfyaml additionally offers loads_all for strings.

Where they diverge

Return type

The most immediate difference: loads returns a FyGeneric, not a native Python object. You must call .to_python() (or use the object directly via the container/numeric protocols) to get a plain dict or list.

# PyYAML
import yaml
result = yaml.safe_load("x: 1")
type(result)          # dict

# libfyaml
import libfyaml as fy
result = fy.loads("x: 1")
type(result)          # FyGeneric
type(result.to_python())  # dict

API shape: mode instead of Loader

PyYAML selects behaviour through Loader classes (SafeLoader, FullLoader, BaseLoader). libfyaml uses a mode string:

# PyYAML
yaml.load(s, Loader=yaml.SafeLoader)
yaml.safe_load(s)

# libfyaml
fy.loads(s)                      # YAML 1.2 (roughly equivalent to SafeLoader)
fy.loads(s, mode='yaml1.1-pyyaml')  # closest to PyYAML's SafeLoader behaviour

There are no Loader or Dumper classes in the core binding.

Default YAML version: 1.2 not 1.1

libfyaml defaults to YAML 1.2. PyYAML implements YAML 1.1. This affects implicit type resolution:

Input	PyYAML (1.1)	libfyaml default (1.2)
`yes` / `no` / `on` / `off`	`True` / `False`	string
`0755`	`493` (octal int)	string
`1:30` (sexagesimal)	`90` (int)	string
`1.5e3`	`1500.0`	`1500.0`
`.inf` / `.nan`	`inf` / `nan`	`inf` / `nan`

Use mode='yaml1.1' or mode='yaml1.1-pyyaml' to get YAML 1.1 resolution.
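
The YAML 1.1 values in the table follow from simple arithmetic, which the sketch below reproduces in plain Python. It covers only unsigned integers and is meant as a worked example of the base-60 and leading-zero octal rules, not as a resolver; both function names are hypothetical.

```python
def sexagesimal_to_int(text):
    """Convert a YAML 1.1 base-60 integer such as '1:30' to its value."""
    value = 0
    for part in text.split(":"):
        # Each further component shifts the running value by one base-60 digit.
        value = value * 60 + int(part)
    return value


def yaml11_octal_to_int(text):
    """Convert a YAML 1.1 octal literal with a leading zero, such as '0755'."""
    return int(text, 8)


print(sexagesimal_to_int("1:30"))       # 90
print(sexagesimal_to_int("190:20:30"))  # 685230
print(yaml11_octal_to_int("0755"))      # 493
```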

Strictness differences in YAML 1.1 mode

Even in yaml1.1-pyyaml mode a few corner cases differ because libfyaml follows the YAML specification more strictly than PyYAML does:

Situation	PyYAML	libfyaml
Duplicate anchor (`&a 1 ... &a 2`)	`ComposerError`	accepted (spec §3.2.2.2 allows redefinition)
Unknown `%DIRECTIVE`	`ScannerError`	warning, continues (spec §6.8.1 says SHOULD warn)
`?` in anchor name (`&?foo`)	`ScannerError`	accepted (`?` is a valid `ns-anchor-char` per spec §6.9.2)
Sexagesimal integers (`190:20:30`)	`685230`	string (not resolved)
Sexagesimal floats (`190:20:30.15`)	`685230.15`	string (not resolved)
Single dot (`.`)	string	`0.0` (float — C library bug)
`---` as flow scalar	string	`null` (C library bug)

Error messages

libfyaml and PyYAML produce different human-readable error messages for the same parse errors. Code that pattern-matches exception strings will need adjustment; code that only catches the exception type will be fine.
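
A message-independent handler might look like the sketch below. So that it runs without the binding installed, `json.loads` stands in for the parse function; it raises a ValueError subclass on bad input, the same exception type the error table lists for parse errors. `try_parse` is a hypothetical helper.

```python
import json


def try_parse(parse, text):
    """Call parse(text); return (value, None) or (None, message) on ValueError."""
    try:
        return parse(text), None
    except ValueError as exc:
        # Branch on the exception type only; the message text is library-specific.
        return None, str(exc)


value, error = try_parse(json.loads, '{"x": 1}')
print(value, error)  # {'x': 1} None

value, error = try_parse(json.loads, '{"x": [unclosed')
print(value is None, error is not None)  # True True
```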

Block scalar emission

libfyaml follows the YAML spec strictly when choosing scalar styles, which means it will refuse to use a block scalar (| or >) in contexts where the spec does not permit one — for example as a value inside a flow collection. PyYAML emits block scalars in those contexts anyway, producing output that is technically non-conformant. If you serialise a document that PyYAML would render with block scalars inside flow collections, libfyaml will choose a flow-compatible style (double-quoted) instead.

Unicode line separators (U+2028 / U+2029)

The YAML 1.2 spec (§6.5) classifies U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line-break characters. libfyaml honours this in block scalars, treating them as line breaks during both parsing and emission. PyYAML predates this clarification and treats them as ordinary non-breaking characters throughout. If your data contains these code points, block-style round-trips will produce different results between the two libraries. Use double-quoted scalars to preserve them unambiguously in either library.
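
A pre-flight check for these code points can be written with the standard library alone. The sketch below is one possible way to find them before choosing a scalar style; `find_unicode_line_separators` is a hypothetical helper. Python's own str.splitlines() also treats both characters as line breaks, which makes the effect easy to see.

```python
import unicodedata


def find_unicode_line_separators(text):
    """Return (index, Unicode name) for each U+2028 / U+2029 in text."""
    return [
        (index, unicodedata.name(char))
        for index, char in enumerate(text)
        if char in "\u2028\u2029"
    ]


sample = "first\u2028second"
print(find_unicode_line_separators(sample))  # [(5, 'LINE SEPARATOR')]
print(sample.splitlines())                   # ['first', 'second']
```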

!!binary tag syntax

libfyaml accepts inline !!binary scalars (!!binary aGVsbG8=) in addition to the block form that PyYAML requires (!!binary |\n aGVsbG8=). Both forms decode to bytes.
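
The payload in both forms is ordinary base64, so the expected bytes can be checked with the standard library. This sketch decodes the example payload directly and does not involve either YAML library.

```python
import base64

inline_payload = "aGVsbG8="
print(base64.b64decode(inline_payload))  # b'hello'

# The block form may spread the payload over indented lines;
# strip all whitespace before decoding.
block_payload = "aGVs\n  bG8=\n"
print(base64.b64decode("".join(block_payload.split())))  # b'hello'
```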

Features not in PyYAML

The core binding provides capabilities that PyYAML has no equivalent for:

Source markers (create_markers=True) — byte/line/column positions for every node, without the overhead of PyYAML's Mark objects on events.
Comment preservation (keep_comments=True).
Style preservation (keep_style=True) — round-trip the original scalar style (literal, folded, single-quoted, etc.).
Path navigation — get_at_unix_path, set_at_unix_path for direct document surgery without tree traversal code.
Deduplication allocator — dramatically lower memory usage for documents with repeated content.
FyDocumentState — programmatic access to %YAML and %TAG directives.

Appendix: Parse performance

Methodology

Configurations were measured by running docs/benchmark-parse.py against two real-world YAML files. Each configuration runs in an isolated subprocess so that allocations from earlier runs cannot inflate later measurements.

All libraries are imported before the baseline RSS is measured so that library load cost (the .so footprint) is excluded from the delta. The RSS delta therefore reflects only the memory added by parsing that specific file — the data structures created, the source text mapped, the allocator pages used.

Five timed repetitions were taken per configuration; the tables report the median and minimum parse time and the median peak RSS delta across those runs.
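
The measurement approach can be sketched as follows. This is not the contents of docs/benchmark-parse.py; it is a minimal illustration of the same ideas (one subprocess per run, baseline taken after imports, median over several runs), with `json.loads` standing in for the parser under test. It relies on the Unix-only `resource` module, where ru_maxrss is reported in kilobytes on Linux.

```python
import json
import statistics
import subprocess
import sys

# Child script: import everything first, record the baseline peak RSS,
# then time a single parse and report the peak RSS growth.
CHILD = r"""
import json, resource, sys, time
text = sys.stdin.read()
baseline_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
start = time.perf_counter()
parsed = json.loads(text)   # stand-in for the parser under test
elapsed_ms = (time.perf_counter() - start) * 1000
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(json.dumps({"ms": elapsed_ms, "rss_kb": peak_kb - baseline_kb}))
"""


def benchmark(text, runs=5):
    """Run the child script `runs` times, each in a fresh interpreter."""
    samples = []
    for _ in range(runs):
        completed = subprocess.run(
            [sys.executable, "-c", CHILD],
            input=text, capture_output=True, text=True, check=True,
        )
        samples.append(json.loads(completed.stdout))
    times = [sample["ms"] for sample in samples]
    return {
        "median_ms": statistics.median(times),
        "min_ms": min(times),
        "median_rss_kb": statistics.median(s["rss_kb"] for s in samples),
    }


result = benchmark(json.dumps({"items": list(range(10000))}))
print(result)
```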

The benchmark can be reproduced on any YAML file:

python3 docs/benchmark-parse.py <file.yaml> [--runs N] [--multi]

Use --multi for files containing multiple documents separated by `---`.

Note on PyYAML compatibility. PyYAML's SafeLoader and CLoader do not recognise tag:yaml.org,2002:value, the tag YAML 1.1 assigns to a bare = scalar. YAML 1.2 treats = as a plain string, and it appears legitimately in both test files (e.g. as an enum value in Kubernetes CRD schemas). The benchmark registers a one-line constructor fix so PyYAML can parse these files; libfyaml handles them correctly without any patching.

Environment

Item	Version
CPU	AMD Ryzen 5 5600X
Python	3.12.3
PyYAML	6.0.1
libyaml (CLoader)	0.2.5
libfyaml	v0.9.3-278 (release build)

Results — 6.4 MB (`AtomicCards-2-cleaned-small.yaml`, single-doc)

Magic: The Gathering card database — highly varied text content with moderate key repetition.

xychart-beta horizontal
    title "Parse time — AtomicCards 6.4 MB (ms, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "ms" 0 --> 7500
    bar [7155, 1228, 115, 102]

xychart-beta horizontal
    title "RSS delta — AtomicCards 6.4 MB (MB, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "MB" 0 --> 175
    bar [164, 123, 28, 25]

Configuration	Median	Min	RSS delta
PyYAML `safe_load` (pure Python)	7155 ms	7033 ms	+164 MB
PyYAML `CLoader` (libyaml)	1228 ms	1172 ms	+123 MB
libfyaml `dedup=True` (default)	115 ms	114 ms	+28 MB
libfyaml `dedup=False`	102 ms	101 ms	+25 MB

Results — 4.3 MB (`bundle.yaml`, multi-doc, 24 documents)

Prometheus Operator CRD bundle: structured Kubernetes schemas with heavy key repetition (name, type, description, properties, spec recurring throughout).

xychart-beta horizontal
    title "Parse time — bundle.yaml 4.3 MB (ms, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "ms" 0 --> 3200
    bar [2964, 274, 53, 48]

xychart-beta horizontal
    title "RSS delta — bundle.yaml 4.3 MB (MB, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "MB" 0 --> 20
    bar [16, 14, 3, 10]

Configuration	Median	Min	RSS delta
PyYAML `safe_load` (pure Python)	2964 ms	2919 ms	+16 MB
PyYAML `CLoader` (libyaml)	274 ms	267 ms	+14 MB
libfyaml `dedup=True` (default)	53 ms	52 ms	+3 MB
libfyaml `dedup=False`	48 ms	48 ms	+10 MB

Analysis

Speed. libfyaml is roughly 5–6× faster than CLoader on the CRD bundle and 11–12× faster on the card database; against pure-Python PyYAML it is roughly 56–70× faster across both files. The gap against the pure-Python loader is expected: PyYAML constructs every node as a heap-allocated Python object while iterating the event stream in interpreted bytecode. The gap against CLoader is more meaningful: both parsers are written in C, but libfyaml uses mmap for file I/O, a purpose-built allocator, and avoids the two-phase parse/construct split that libyaml's event model requires.

Memory. libfyaml consistently uses less RSS than PyYAML for the parsed data structure. PyYAML allocates a heap object (dict, list, str, int, …) for every node in the document; libfyaml stores values in its arena allocator with FyGeneric wrappers created lazily on access. On the card database, libfyaml uses ~77–80% less RSS than CLoader (+25–28 MB vs +123 MB); on the CRD bundle it uses ~29–79% less (+3–10 MB vs +14 MB), with the larger saving coming from dedup=True.

Note that libfyaml's .so file itself has a significant up-front import cost (~50 MB RSS), which is a fixed one-time overhead amortised across all subsequent load() calls and not included in the delta figures above.

dedup vs no-dedup. On the card database, dedup=True adds ~13 ms and uses ~3 MB more RSS than dedup=False (+28 MB vs +25 MB): the text content is highly varied, so the dedup allocator finds little to share. On the CRD bundle, dedup=True adds ~5 ms but saves 7 MB compared to dedup=False because Kubernetes schemas repeat the same field names (name, type, description, properties, …) thousands of times across 24 documents. The deduplication allocator is the right default for structured configuration and API-schema YAML; for documents with unique free-form text, dedup=False is marginally faster.
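
The ratios discussed in this analysis can be recomputed from the median and RSS columns of the two results tables. The sketch below copies those figures by hand and derives the speed-up factors and RSS savings relative to CLoader; the dictionary layout and function names are illustrative only.

```python
# Median parse times (ms) copied from the two results tables.
MEDIAN_MS = {
    "AtomicCards": {"safe_load": 7155, "CLoader": 1228, "dedup_on": 115, "dedup_off": 102},
    "bundle": {"safe_load": 2964, "CLoader": 274, "dedup_on": 53, "dedup_off": 48},
}
# RSS deltas (MB) copied from the same tables.
RSS_MB = {
    "AtomicCards": {"CLoader": 123, "dedup_on": 28, "dedup_off": 25},
    "bundle": {"CLoader": 14, "dedup_on": 3, "dedup_off": 10},
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


def saving_percent(baseline, candidate):
    """Percentage by which the candidate undercuts the baseline."""
    return 100 * (1 - candidate / baseline)


for name, ms in MEDIAN_MS.items():
    for config in ("dedup_on", "dedup_off"):
        rss = RSS_MB[name]
        print(
            f"{name} {config}: "
            f"{speedup(ms['CLoader'], ms[config]):.1f}x vs CLoader, "
            f"{speedup(ms['safe_load'], ms[config]):.1f}x vs safe_load, "
            f"{saving_percent(rss['CLoader'], rss[config]):.0f}% less RSS than CLoader"
        )
```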