libfyaml Python Binding
March 16, 2026
The libfyaml Python binding exposes the high-performance libfyaml C library
directly. Parsed documents are represented as FyGeneric objects — lazy
wrappers that defer conversion to Python types until you ask for them. This
keeps memory low and lets you navigate large documents without materialising
every node.
Table of Contents
- Quick Start
- Parsing
- The FyGeneric Type
- Serialisation
- Converting Python objects
- Path navigation
- Mutability
- FyDocumentState
- Memory management
- Error handling
- Comparison with PyYAML
Quick Start
import libfyaml as fy
# Parse a YAML string
doc = fy.loads("name: Alice\nage: 30")
print(doc["name"]) # FyGeneric wrapping "Alice"
print(str(doc["name"])) # "Alice"
print(doc.to_python()) # {'name': 'Alice', 'age': 30}
# Parse a file
doc = fy.load("config.yaml")
# Serialise back to YAML
print(fy.dumps(doc))
# Parse JSON
data = fy.loads('{"x": 1}', mode='json')
Parsing
loads(s, **options) → FyGeneric
Parse a YAML or JSON string. Raises ValueError if the input contains
more than one document — use loads_all for multi-document streams.
doc = fy.loads("key: value")
docs = fy.loads_all("---\na: 1\n---\nb: 2") # list of FyGeneric
load(file, **options) → FyGeneric
Parse from a file path (string — uses mmap internally) or any file-like
object with a .read() method.
doc = fy.load("data.yaml")
with open("data.yaml") as f:
doc = fy.load(f)
loads_all(s, **options) → list[FyGeneric]
load_all(file, **options) → list[FyGeneric]
Return all documents in a multi-document stream as a list.
docs = fy.loads_all("---\n1\n---\n2\n---\n3")
# [FyGeneric(1), FyGeneric(2), FyGeneric(3)]
Parse modes
The mode parameter controls which YAML dialect is accepted:
| Mode string | Meaning |
|---|---|
'yaml', 'yaml1.2', '1.2' | YAML 1.2 — the default |
'yaml1.1', '1.1' | YAML 1.1 (accepts merge keys <<, sexagesimal numbers, etc.) |
'yaml1.1-pyyaml', 'pyyaml' | YAML 1.1 with PyYAML-compatible quirks (used by the compat layer) |
'json' | Strict JSON |
# Merge keys only work in YAML 1.1
doc = fy.loads("""
defaults: &defaults
timeout: 30
server:
<<: *defaults
host: localhost
""", mode='yaml1.1')
Parser options
All four parse functions accept the same keyword options:
| Option | Default | Description |
|---|---|---|
mode | 'yaml' | Dialect — see above |
dedup | True | Use the deduplication allocator (saves memory for documents with repeated content) |
trim | True | Release unused allocator memory after parsing |
mutable | False | Produce mutable FyGeneric objects (required for __setitem__ and set_at_path) |
collect_diag | False | Attach parse diagnostics to the result instead of raising |
create_markers | False | Record byte/line/column positions for every node |
keep_comments | False | Preserve YAML comments in the document |
keep_style | False | Preserve original scalar styles (literal, folded, quoted, …) |
The FyGeneric Type
FyGeneric is the type returned by all parse functions. It wraps a C
fy_generic value without copying data. Conversion to Python only happens
when you explicitly ask for it.
doc = fy.loads("x: 42")
type(doc) # <class 'libfyaml._libfyaml.FyGeneric'>
doc.__class__ # <class 'dict'> — the Python equivalent class
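The idea of deferring conversion can be sketched in pure Python. LazyScalar below is a hypothetical toy class, not part of libfyaml: it keeps the raw text and only builds a Python int when asked.

```python
class LazyScalar:
    """Toy lazy wrapper (hypothetical; not the real FyGeneric)."""

    def __init__(self, raw: str):
        self._raw = raw          # stored as-is, nothing converted yet
        self.conversions = 0     # counts how often a Python value was built

    def __int__(self) -> int:
        self.conversions += 1
        return int(self._raw)

    def __str__(self) -> str:
        return self._raw


n = LazyScalar("99")
before = n.conversions   # 0: wrapping did not convert anything
value = int(n)           # conversion happens here, on request
after = n.conversions    # 1
```

The real binding does this at the C level; the sketch only shows why untouched nodes cost nothing on the Python side.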
Type checking
Eight predicate methods are available, each returning bool:
v = fy.loads("42")
v.is_null() # False
v.is_bool() # False
v.is_int() # True
v.is_float() # False
v.is_string() # False
v.is_sequence() # False
v.is_mapping() # False
v.is_indirect() # True if the value carries a tag or anchor
Converting to Python
doc = fy.loads("items: [1, 2, 3]")
# Recursive — the whole document becomes plain Python
doc.to_python() # {'items': [1, 2, 3]}
# Scalar coercions
n = fy.loads("99")
int(n) # 99
float(n) # 99.0
bool(n) # True
str(n) # "99"
to_python() raises TypeError if a mapping key is unhashable (e.g. a
nested mapping used as a key).
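The restriction comes from Python itself: dict keys must be hashable, and a dict is not. A minimal demonstration using only built-in types (no libfyaml involved):

```python
# A YAML mapping used as a key would have to become a dict key in Python,
# but dicts are unhashable, so building the result fails with TypeError.
nested_key = {"a": 1}

try:
    result = {nested_key: "value"}
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```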
Container access
Sequences and mappings support the standard Python container protocol:
doc = fy.loads("fruits: [apple, banana, cherry]")
fruits = doc["fruits"]
len(fruits) # 3
fruits[0] # FyGeneric("apple")
str(fruits[0]) # "apple"
"banana" in fruits # True (linear scan)
for item in fruits:
print(str(item))
# Mappings
doc["fruits"] # FyGeneric sequence
doc.keys() # ['fruits']
doc.values() # [FyGeneric sequence]
doc.items() # [('fruits', FyGeneric sequence)]
Attribute access on mappings delegates to the underlying dict:
doc = fy.loads("host: localhost\nport: 8080")
str(doc.host) # "localhost"
int(doc.port) # 8080
Numeric operations on integer and float values work directly:
v = fy.loads("10")
v + 5 # 15
v * 2 # 20
v > 5 # True
Tags and anchors
doc = fy.loads("value: !!int '42'")
v = doc["value"]
v.has_tag() # True
v.get_tag() # "tag:yaml.org,2002:int"
doc2 = fy.loads("x: &myanchor hello\ny: *myanchor")
doc2["x"].has_anchor() # True
doc2["x"].get_anchor() # "myanchor"
Source markers
Markers record the byte offset, line, and column of each node in the original
source. Enable them at parse time with create_markers=True.
doc = fy.loads("host: localhost\nport: 8080", create_markers=True)
m = doc["host"].get_marker()
# (start_byte, start_line, start_col, end_byte, end_line, end_col)
# e.g. (6, 0, 6, 15, 0, 15)
doc["host"].has_marker() # True
doc["port"].get_marker() # (22, 1, 6, 26, 1, 10)
Lines and columns are zero-based. get_marker() returns None when markers
were not enabled.
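To make the tuple layout concrete, the sketch below recomputes the six numbers for a substring of an ASCII source string. span_of is a hypothetical helper written for this illustration, not a libfyaml function, and it assumes ASCII input so that byte and character offsets coincide.

```python
def span_of(source: str, needle: str):
    """Return (start_byte, start_line, start_col, end_byte, end_line, end_col).

    Hypothetical helper for illustration; assumes ASCII text and that
    `needle` occurs in `source` without spanning a line break.
    """
    start = source.index(needle)
    end = start + len(needle)
    line = source.count("\n", 0, start)            # zero-based line number
    line_start = source.rfind("\n", 0, start) + 1  # offset where that line begins
    return (start, line, start - line_start, end, line, end - line_start)


src = "host: localhost\nport: 8080"
print(span_of(src, "localhost"))  # (6, 0, 6, 15, 0, 15)
```

End offsets are exclusive here: the value occupies bytes 6 through 14, so the end position is 15.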
Comments
Preserve YAML comments by parsing with keep_comments=True.
yaml_text = """\
# Server settings
host: localhost # primary
port: 8080
"""
doc = fy.loads(yaml_text, keep_comments=True)
doc["host"].get_comment() # "# primary"
doc["host"].has_comment() # True
Diagnostics
With collect_diag=True parse errors are attached to the document rather than
raised immediately. This lets you process partially-valid input.
doc = fy.loads("good: ok\nbad: {unclosed", collect_diag=True)
doc.has_diag() # True
doc.get_diag() # FyGeneric describing the error(s)
Serialisation
dumps(obj, *, compact=False, json=False, style=None, indent=0) → str
Serialise a FyGeneric or plain Python object to a YAML (or JSON) string.
doc = fy.loads("name: Alice\nscores: [10, 20, 30]")
print(fy.dumps(doc))
# name: Alice
# scores:
# - 10
# - 20
# - 30
print(fy.dumps(doc, compact=True))
# {name: Alice, scores: [10, 20, 30]}
print(fy.dumps(doc, json=True))
# {"name": "Alice", "scores": [10, 20, 30]}
indent sets the indentation width (2–8 spaces; 0 uses the library default).
dump(file, obj, *, mode='yaml', compact=False)
Write to a file path (string) or file-like object. mode accepts 'yaml' or
'json'.
fy.dump("output.yaml", doc)
with open("output.json", "w") as f:
fy.dump(f, doc, mode='json')
dumps_all(documents, *, compact=False, json=False, style=None) → str
dump_all(file, documents, *, compact=False, json=False)
Serialise a list of documents with --- separators.
docs = fy.loads_all("---\na: 1\n---\nb: 2")
print(fy.dumps_all(docs))
# ---
# a: 1
# ---
# b: 2
Individual node serialisation
FyGeneric objects have their own .dump() method:
import sys
doc = fy.loads("x: 1\ny: 2")
doc["x"].dump() # returns "1\n"
doc["x"].dump(strip_newline=True) # returns "1"
doc["x"].dump("node.yaml") # writes to file
doc["x"].dump(sys.stdout, mode='json') # writes to file object
Scalar styles
The style parameter controls how scalar values are written. Accepted values:
| Style | Effect |
|---|---|
None or 'default' | Library default (usually plain) |
'original' | Preserve the style from the parsed input (requires keep_style=True at parse time) |
'block' | Block scalars (literal | or folded >) |
'flow' | Flow / inline style |
'pretty' | Readable multi-line format |
'compact' | Compact single-line |
'oneline' | Force everything onto one line |
doc = fy.loads("text: 'hello world'")
print(fy.dumps(doc, style='block'))
print(fy.dumps(doc, style='flow'))
Converting Python objects
from_python(obj, *, tag=None, style=None, mutable=False, dedup=True) → FyGeneric
Convert a plain Python object (dict, list, str, int, float, bool,
None) to a FyGeneric. Useful for attaching tags or styles before
serialisation.
# Attach a YAML tag
v = fy.from_python("hello", tag="!mytag")
print(fy.dumps(v)) # !mytag hello
# Control the scalar style
text = fy.from_python("line one\nline two\n", style='|')
print(fy.dumps(text))
# |
# line one
# line two
Scalar style values accepted by from_python:
| Style | Meaning |
|---|---|
'|' | Literal block scalar |
'>' | Folded block scalar |
"'" | Single-quoted |
'"' | Double-quoted |
'plain' or '' | Plain (unquoted) |
Path navigation
get_at_path(path) → FyGeneric
get_at_unix_path(path_str) → FyGeneric
Navigate into a nested document. A path is a list of keys (strings) and indices (integers).
doc = fy.loads("""
servers:
- host: web01
port: 80
- host: web02
port: 443
""")
doc.get_at_path(["servers", 0, "host"]) # FyGeneric("web01")
doc.get_at_unix_path("/servers/0/host") # FyGeneric("web01")
doc.get_at_unix_path("/servers/1/port") # FyGeneric(443)
get_at_path raises KeyError if the path does not exist.
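The lookup semantics can be modelled on plain Python containers. walk below is a hypothetical stand-in that follows a path list through nested dicts and lists and raises KeyError for a missing step; it illustrates the behaviour described here rather than the library's implementation.

```python
def walk(data, path):
    """Follow a list of keys/indices through nested dicts and lists.

    Hypothetical pure-Python model of path lookup; raises KeyError when
    any step of the path is missing.
    """
    node = data
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            raise KeyError(path) from None
    return node


doc = {"servers": [{"host": "web01", "port": 80},
                   {"host": "web02", "port": 443}]}

print(walk(doc, ["servers", 0, "host"]))  # web01
print(walk(doc, ["servers", 1, "port"]))  # 443
```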
get_path() → tuple / get_unix_path() → str
Return the path of a node within its document (useful when iterating):
doc = fy.loads("a:\n b:\n c: 42")
v = doc.get_at_unix_path("/a/b/c")
v.get_unix_path() # "/a/b/c"
v.get_path() # ('a', 'b', 'c')
Path utility functions
fy.path_list_to_unix_path(["servers", 0, "host"]) # "/servers/0/host"
fy.unix_path_to_path_list("/servers/0/host") # ["servers", 0, "host"]
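A rough pure-Python equivalent shows the mapping between the two forms. The function names are hypothetical, and the sketch ignores any escaping rules the real helpers may apply to keys that contain / or that look like numbers.

```python
def to_unix_path(path):
    """Join keys and indices with '/' (hypothetical sketch, no escaping)."""
    return "/" + "/".join(str(step) for step in path)


def from_unix_path(text):
    """Split on '/' and turn all-digit components into integer indices."""
    parts = [p for p in text.split("/") if p]
    return [int(p) if p.isascii() and p.isdigit() else p for p in parts]


print(to_unix_path(["servers", 0, "host"]))  # /servers/0/host
print(from_unix_path("/servers/0/host"))     # ['servers', 0, 'host']
```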
Mutability
By default FyGeneric objects are immutable. Pass mutable=True to the parse
function (or from_python) to allow in-place modification.
doc = fy.loads("x: 1\ny: 2", mutable=True)
doc["x"] = 99
str(doc["x"]) # "99"
doc.set_at_path(["y"], "updated")
doc.set_at_unix_path("/x", 0)
print(fy.dumps(doc))
# x: 0
# y: updated
Attempting to modify an immutable object raises TypeError.
FyDocumentState
FyDocumentState carries the YAML directives that appeared before a document.
Access it via FyGeneric.document_state.
doc = fy.loads("%YAML 1.2\n---\nkey: value")
ds = doc.document_state
ds.version # (1, 2)
ds.version_explicit # True
ds.json_mode # False
ds.tags # list of {'handle': ..., 'prefix': ...} dicts
ds.tags_explicit # True if %TAG directives were present
document_state is None for values that are not document roots.
Memory management
Allocator strategy
The dedup=True default uses a deduplication allocator that stores only one
copy of repeated strings or scalars. This is a significant win for large
documents with repeated content (e.g. YAML files with many identical keys or
values).
Set dedup=False to use the standard allocator, which may be faster for
small documents or documents with little repetition.
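The effect of deduplication can be illustrated with a small interning pool in Python. This is only a conceptual model with hypothetical names; the real allocator works on C-level storage, not Python objects.

```python
class InternPool:
    """Store each distinct string once and hand back the shared copy."""

    def __init__(self):
        self._pool = {}

    def intern(self, text: str) -> str:
        # setdefault returns the already-stored copy if one exists
        return self._pool.setdefault(text, text)

    def __len__(self) -> int:
        return len(self._pool)


pool = InternPool()
keys = ["name", "type", "description"] * 1000   # heavy key repetition
stored = [pool.intern(k) for k in keys]

print(len(stored), len(pool))  # 3000 3
```

The lookup on every insert is the price paid for the saving, which is why the trade-off depends on how repetitive the document is.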
Trim
trim=True (default) releases unused allocator pages after parsing is
complete. Disable with trim=False if you will be building on the document
after parsing and want to avoid reallocation.
Manual trim
doc = fy.loads(large_yaml, trim=False)
# ... do some work ...
doc.trim() # release unused memory now
Clone
clone() creates an independent copy of a FyGeneric value, decoupled from
the original document's allocator:
original = fy.load("big.yaml")
part = original.get_at_unix_path("/config/server").clone()
del original # can now be collected
Error handling
| Exception | Raised when |
|---|---|
ValueError | Parse error; invalid mode string; invalid style; multiple documents where one was expected |
TypeError | Wrong argument type; mutation on an immutable object; unhashable mapping key in to_python() or items() |
KeyError | Path not found in get_at_path / get_at_unix_path |
RuntimeError | Internal builder or emitter failure; file write error |
AttributeError | Attribute access on a non-mapping FyGeneric |
NotImplementedError | del on a FyGeneric item |
try:
doc = fy.loads("key: [unclosed")
except ValueError as e:
print(f"Parse error: {e}")
# Or collect errors without raising:
doc = fy.loads("key: [unclosed", collect_diag=True)
if doc.has_diag():
print(doc.get_diag().to_python())
Comparison with PyYAML
This section describes how the core libfyaml binding relates to PyYAML.
Where they are similar
- Function names: load, loads, dump, dumps follow the same naming convention as PyYAML's yaml.safe_load / yaml.dump.
- Python types out: both ultimately produce dict, list, str, int, float, bool, and None. Call .to_python() on a FyGeneric to get the plain Python value.
- YAML tag handling: both support !!str, !!int, !!float, !!bool, !!null, !!seq, !!map, !!binary, and custom tags.
- Multi-document streams: both support documents separated by --- via load_all / loads_all.
Where they diverge
Return type
The most immediate difference: loads returns a FyGeneric, not a native
Python object. You must call .to_python() (or use the object directly via
the container/numeric protocols) to get a plain dict or list.
# PyYAML
import yaml
result = yaml.safe_load("x: 1")
type(result) # dict
# libfyaml
import libfyaml as fy
result = fy.loads("x: 1")
type(result) # FyGeneric
type(result.to_python()) # dict
API shape: mode instead of Loader
PyYAML selects behaviour through Loader classes (SafeLoader,
FullLoader, BaseLoader). libfyaml uses a mode string:
# PyYAML
yaml.load(s, Loader=yaml.SafeLoader)
yaml.safe_load(s)
# libfyaml
fy.loads(s) # YAML 1.2 (roughly equivalent to SafeLoader)
fy.loads(s, mode='yaml1.1-pyyaml') # closest to PyYAML's SafeLoader behaviour
There are no Loader or Dumper classes in the core binding.
Default YAML version: 1.2 not 1.1
libfyaml defaults to YAML 1.2. PyYAML implements YAML 1.1. This affects implicit type resolution:
| Input | PyYAML (1.1) | libfyaml default (1.2) |
|---|---|---|
yes / no / on / off | True / False | string |
0755 | 493 (octal int) | string |
1:30 (sexagesimal) | 90 (int) | string |
1.5e3 | 1500.0 | 1500.0 |
.inf / .nan | inf / nan | inf / nan |
Use mode='yaml1.1' or mode='yaml1.1-pyyaml' to get YAML 1.1 resolution.
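The YAML 1.1 column follows from simple arithmetic: a leading 0 marks an octal integer, and colon-separated groups are base-60 digits. The helper below is a hypothetical sketch of that arithmetic for non-negative integers, not the resolver either library uses.

```python
def sexagesimal_to_int(text: str) -> int:
    """Interpret 'a:b:c' as base-60 digits (non-negative integers only)."""
    value = 0
    for group in text.split(":"):
        value = value * 60 + int(group)
    return value


print(int("0755", 8))                   # 493
print(sexagesimal_to_int("1:30"))       # 90
print(sexagesimal_to_int("190:20:30"))  # 685230
```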
Strictness differences in YAML 1.1 mode
Even in yaml1.1-pyyaml mode a few corner cases differ because libfyaml
follows the YAML specification more strictly than PyYAML does:
| Situation | PyYAML | libfyaml |
|---|---|---|
Duplicate anchor (&a 1 ... &a 2) | ComposerError | accepted (spec §3.2.2.2 allows redefinition) |
Unknown %DIRECTIVE | ScannerError | warning, continues (spec §6.8.1 says SHOULD warn) |
? in anchor name (&?foo) | ScannerError | accepted (? is a valid ns-anchor-char per spec §6.9.2) |
Sexagesimal integers (190:20:30) | 685230 | string (not resolved) |
Sexagesimal floats (190:20:30.15) | 685230.15 | string (not resolved) |
Single dot (.) | string | 0.0 (float — C library bug) |
--- as flow scalar | string | null (C library bug) |
Error messages
libfyaml and PyYAML produce different human-readable error messages for the same parse errors. Code that pattern-matches exception strings will need adjustment; code that only catches the exception type will be fine.
Block scalar emission
libfyaml follows the YAML spec strictly when choosing scalar styles, which
means it will refuse to use a block scalar (| or >) in contexts where
the spec does not permit one — for example as a value inside a flow
collection. PyYAML emits block scalars in those contexts anyway, producing
output that is technically non-conformant. If you serialise a document that
PyYAML would render with block scalars inside flow collections, libfyaml will
choose a flow-compatible style (double-quoted) instead.
Unicode line separators (U+2028 / U+2029)
The YAML 1.2 spec (§6.5) classifies U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line-break characters. libfyaml honours this in block scalars, treating them as line breaks during both parsing and emission. PyYAML predates this clarification and treats them as ordinary non-breaking characters throughout. If your data contains these code points, block-style round-trips will produce different results between the two libraries. Use double-quoted scalars to preserve them unambiguously in either library.
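A quick way to find out whether a string is affected is to scan it for the two code points before choosing a scalar style. The helper name below is hypothetical and the check uses only the standard library.

```python
LINE_SEPARATORS = ("\u2028", "\u2029")  # LINE SEPARATOR, PARAGRAPH SEPARATOR


def has_unicode_line_separator(text: str) -> bool:
    """Return True if text contains U+2028 or U+2029."""
    return any(ch in text for ch in LINE_SEPARATORS)


print(has_unicode_line_separator("plain text"))         # False
print(has_unicode_line_separator("first\u2028second"))  # True
```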
!!binary tag syntax
libfyaml accepts inline !!binary scalars (!!binary aGVsbG8=) in addition
to the block form that PyYAML requires (!!binary |\n aGVsbG8=). Both forms
decode to bytes.
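The payload in both forms is ordinary Base64, so the expected bytes can be checked with the standard library alone:

```python
import base64

# The scalar payload from the examples above, decoded without any YAML parser.
payload = base64.b64decode("aGVsbG8=")
print(payload)  # b'hello'
```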
Features not in PyYAML
The core binding provides capabilities that PyYAML has no equivalent for:
- Source markers (create_markers=True): byte/line/column positions for every node, without the overhead of PyYAML's Mark objects on events.
- Comment preservation (keep_comments=True).
- Style preservation (keep_style=True): round-trip the original scalar style (literal, folded, single-quoted, etc.).
- Path navigation: get_at_unix_path and set_at_unix_path for direct document surgery without tree traversal code.
- Deduplication allocator: dramatically lower memory usage for documents with repeated content.
- FyDocumentState: programmatic access to %YAML and %TAG directives.
Appendix: Parse performance
Methodology
Configurations were measured by running docs/benchmark-parse.py against
two real-world YAML files. Each configuration runs in an isolated
subprocess so that allocations from earlier runs cannot inflate later
measurements.
All libraries are imported before the baseline RSS is measured so that
library load cost (the .so footprint) is excluded from the delta. The RSS
delta therefore reflects only the memory added by parsing that specific file —
the data structures created, the source text mapped, the allocator pages used.
Five timed repetitions were taken per configuration; the tables report the median parse time and median peak RSS delta across those runs.
The benchmark can be reproduced on any YAML file:
python3 docs/benchmark-parse.py <file.yaml> [--runs N] [--multi]
Use --multi for files containing multiple documents separated by ---.
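A timing loop of the shape described in this section can be sketched with the standard library alone. This is not the code of docs/benchmark-parse.py: time_runs is a hypothetical helper, it measures wall-clock time only, and it omits the subprocess isolation and RSS sampling.

```python
import statistics
import time


def time_runs(func, runs: int = 5):
    """Call func `runs` times and return (median_ms, min_ms)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), min(samples)


# Stand-in workload; replace with the parse call you want to measure.
median_ms, min_ms = time_runs(lambda: sum(range(100_000)))
print(f"median {median_ms:.2f} ms, min {min_ms:.2f} ms")
```

Reporting the median rather than the mean keeps a single slow run, for example one hit by a cold page cache, from skewing the figure.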
Note on PyYAML compatibility. PyYAML's SafeLoader and CLoader do not
recognise tag:yaml.org,2002:value, the tag YAML 1.1 assigns to a bare =
scalar. YAML 1.2 treats = as a plain string, and it appears legitimately in
both test files (e.g. as an enum value in Kubernetes CRD schemas). The
benchmark registers a one-line constructor fix so PyYAML can parse these files;
libfyaml handles them correctly without any patching.
Environment
| Item | Version |
|---|---|
| CPU | AMD Ryzen 5 5600X |
| Python | 3.12.3 |
| PyYAML | 6.0.1 |
| libyaml (CLoader) | 0.2.5 |
| libfyaml | v0.9.3-278 (release build) |
Results — 6.4 MB (AtomicCards-2-cleaned-small.yaml, single-doc)
Magic: The Gathering card database — highly varied text content with moderate key repetition.
xychart-beta horizontal
title "Parse time — AtomicCards 6.4 MB (ms, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "ms" 0 --> 7500
bar [7155, 1228, 115, 102]
xychart-beta horizontal
title "RSS delta — AtomicCards 6.4 MB (MB, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "MB" 0 --> 175
bar [164, 123, 28, 25]
| Configuration | Median | Min | RSS delta |
|---|---|---|---|
PyYAML safe_load (pure Python) | 7155 ms | 7033 ms | +164 MB |
PyYAML CLoader (libyaml) | 1228 ms | 1172 ms | +123 MB |
libfyaml dedup=True (default) | 115 ms | 114 ms | +28 MB |
libfyaml dedup=False | 102 ms | 101 ms | +25 MB |
Results — 4.3 MB (bundle.yaml, multi-doc, 24 documents)
Prometheus Operator CRD bundle (source)
— structured Kubernetes schemas with heavy key repetition (name, type,
description, properties, spec recurring throughout).
xychart-beta horizontal
title "Parse time — bundle.yaml 4.3 MB (ms, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "ms" 0 --> 3200
bar [2964, 274, 53, 48]
xychart-beta horizontal
title "RSS delta — bundle.yaml 4.3 MB (MB, lower is better)"
x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
y-axis "MB" 0 --> 20
bar [16, 14, 3, 10]
| Configuration | Median | Min | RSS delta |
|---|---|---|---|
PyYAML safe_load (pure Python) | 2964 ms | 2919 ms | +16 MB |
PyYAML CLoader (libyaml) | 274 ms | 267 ms | +14 MB |
libfyaml dedup=True (default) | 53 ms | 52 ms | +3 MB |
libfyaml dedup=False | 48 ms | 48 ms | +10 MB |
Analysis
Speed. With the default settings, libfyaml is roughly 5–11× faster than CLoader (53 ms vs 274 ms on the CRD bundle, 115 ms vs 1228 ms on the card database) and 56–62× faster than pure-Python PyYAML. The gap against the pure Python loader is expected: PyYAML constructs every node as a heap-allocated Python object while iterating the event stream in interpreted bytecode. The gap against CLoader is more meaningful: both parsers are written in C, but libfyaml uses mmap for file I/O, a purpose-built allocator, and avoids the two-phase parse/construct split that libyaml's event model requires.
Memory. libfyaml consistently uses far less RSS than PyYAML for the
parsed data structure. PyYAML allocates a heap object (dict, list, str, int,
…) for every node in the document; libfyaml stores values in its arena
allocator with FyGeneric wrappers created lazily on access. On the card
database, libfyaml uses ~77–80% less RSS than CLoader (+25–28 MB vs +123 MB);
on the CRD bundle it uses ~29–79% less (+3–10 MB vs +14 MB).
Note that libfyaml's .so file itself has a significant up-front import cost
(~50 MB RSS), which is a fixed one-time overhead amortised across all subsequent
load() calls and not included in the delta figures above.
dedup vs no-dedup. On the card database, dedup=True adds ~13 ms and uses ~3 MB
more RSS (+28 MB vs +25 MB): the text content is highly varied, so the dedup
allocator finds little to share. On the CRD bundle, dedup=True saves 7 MB compared to
dedup=False because Kubernetes schemas repeat the same field names (name,
type, description, properties, …) thousands of times across 24 documents.
The deduplication allocator is the right default for structured configuration
and API-schema YAML; for documents with unique free-form text, dedup=False is
marginally faster.
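Speedup and memory ratios can be derived directly from the median times and RSS deltas in the two results tables. The snippet below does that for the default libfyaml configuration (dedup=True); the dictionary layout and names are just for this illustration.

```python
# Median parse time (ms) and RSS delta (MB), copied from the results tables.
results = {
    "AtomicCards": {"safe_load": (7155, 164), "CLoader": (1228, 123),
                    "fy_dedup": (115, 28), "fy_nodedup": (102, 25)},
    "bundle":      {"safe_load": (2964, 16), "CLoader": (274, 14),
                    "fy_dedup": (53, 3), "fy_nodedup": (48, 10)},
}

for name, rows in results.items():
    fy_ms, fy_mb = rows["fy_dedup"]
    speedup_c = rows["CLoader"][0] / fy_ms       # time ratio vs libyaml
    speedup_py = rows["safe_load"][0] / fy_ms    # time ratio vs pure Python
    rss_saving = 1 - fy_mb / rows["CLoader"][1]  # fraction of RSS saved
    print(f"{name}: {speedup_c:.1f}x vs CLoader, "
          f"{speedup_py:.1f}x vs safe_load, "
          f"{rss_saving:.0%} less RSS than CLoader")
```

Swapping "fy_dedup" for "fy_nodedup" gives the corresponding figures for dedup=False.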