Serialization

March 26, 2026

Prism ships with the ability to serialize a syntax tree to a single string. The string can then be deserialized back into a syntax tree using a language other than C. This makes it possible to reuse prism's parsing logic in other tools without having to write a parser in that language. The deserialized syntax tree still requires a copy of the original source because, for the most part, it contains only byte offsets into the source string.

Types

Let us define some simple types for readability.

varuint

A variable-length unsigned integer with the value fitting in uint32_t using between 1 and 5 bytes, using the LEB128 encoding. This drastically cuts down on the size of the serialized string, especially when the source file is large.
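
As an illustration of the encoding described above, here is a minimal sketch of a LEB128 reader. The function name `read_varuint` and its signature are hypothetical and are not part of the prism API; the sketch only assumes the standard unsigned LEB128 layout, where each byte carries 7 bits of the value and a set high bit means another byte follows.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not a prism function. Decodes one unsigned LEB128
// value starting at data[*offset]. Returns 1 on success and advances *offset
// past the bytes consumed. Returns 0 if the input is truncated or the value
// needs more than 5 bytes; in that case *offset is left where decoding stopped.
static int
read_varuint(const uint8_t *data, size_t size, size_t *offset, uint32_t *value) {
  uint32_t result = 0;

  for (unsigned int index = 0; index < 5; index++) {
    if (*offset >= size) return 0;

    uint8_t byte = data[(*offset)++];

    // The low 7 bits are payload, least-significant group first.
    result |= (uint32_t) (byte & 0x7F) << (7 * index);

    // A clear high bit marks the final byte of the value.
    if ((byte & 0x80) == 0) {
      *value = result;
      return 1;
    }
  }

  return 0;
}
```

Values below 128 take a single byte, which is why offsets and lengths in small files serialize so compactly.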

varsint

A variable-length signed integer with the value fitting in int32_t using between 1 and 5 bytes, using ZigZag encoding on top of LEB128.
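
The ZigZag step can be sketched as follows, assuming the conventional mapping in which small magnitudes of either sign map to small unsigned values (0, -1, 1, -2, ... become 0, 1, 2, 3, ...). The function names are hypothetical, and the input to `zigzag_decode` is the unsigned value already decoded from LEB128.

```c
#include <stdint.h>

// Hypothetical helper, not a prism function. Maps a ZigZag-encoded value
// back to a signed integer: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, 4 -> 2, ...
static int32_t
zigzag_decode(uint32_t encoded) {
  // The lowest bit holds the sign; the remaining bits hold the magnitude.
  return (int32_t) ((encoded >> 1) ^ (0u - (encoded & 1u)));
}

// The inverse mapping, shown only to illustrate the round trip.
static uint32_t
zigzag_encode(int32_t value) {
  return ((uint32_t) value << 1) ^ (value < 0 ? UINT32_MAX : 0u);
}
```

Without this step, a small negative number such as -1 would occupy all 5 bytes in plain LEB128; with it, -1 encodes as a single byte.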

string

| # bytes | field |
| --- | --- |
| varuint | the length of the string in bytes |
| ... | the string bytes |

location

| # bytes | field |
| --- | --- |
| varuint | byte offset into the source string where this location begins |
| varuint | length of the location in bytes in the source string |
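
A reader for this type might look like the sketch below. The struct `example_location_t` and both functions are hypothetical names for illustration, and the LEB128 reader is repeated so the example stands alone. Because a location is only meaningful relative to the original source, the sketch also includes a bounds check against the source length.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical struct for illustration; not a prism type.
typedef struct {
  uint32_t start;  // byte offset into the source string
  uint32_t length; // length in bytes
} example_location_t;

// Hypothetical unsigned LEB128 reader (at most 5 bytes).
static int
read_varuint(const uint8_t *data, size_t size, size_t *offset, uint32_t *value) {
  uint32_t result = 0;
  for (unsigned int index = 0; index < 5; index++) {
    if (*offset >= size) return 0;
    uint8_t byte = data[(*offset)++];
    result |= (uint32_t) (byte & 0x7F) << (7 * index);
    if ((byte & 0x80) == 0) { *value = result; return 1; }
  }
  return 0;
}

// Reads the start offset followed by the length, in that order.
static int
read_location(const uint8_t *data, size_t size, size_t *offset, example_location_t *location) {
  return read_varuint(data, size, offset, &location->start)
    && read_varuint(data, size, offset, &location->length);
}

// Returns 1 if the location lies entirely within a source of the given length.
static int
location_in_bounds(const example_location_t *location, size_t source_length) {
  return location->start <= source_length
    && location->length <= source_length - location->start;
}
```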

comment

The comment type is one of:

  • 0=INLINE (# comment)
  • 1=EMBEDDED_DOCUMENT (=begin/=end)

| # bytes | field |
| --- | --- |
| 1 | comment type |
| location | the location in the source of this comment |

magic comment

| # bytes | field |
| --- | --- |
| location | the location of the key of the magic comment |
| location | the location of the value of the magic comment |

error

| # bytes | field |
| --- | --- |
| varuint | type |
| string | error message (ASCII-only characters) |
| location | the location in the source this error applies to |
| 1 | the level of the error: 0 for fatal, 1 for argument, 2 for load |

warning

| # bytes | field |
| --- | --- |
| varuint | type |
| string | warning message (ASCII-only characters) |
| location | the location in the source this warning applies to |
| 1 | the level of the warning: 0 for default and 1 for verbose |

integer

| # bytes | field |
| --- | --- |
| 1 | 1 if the integer is negative, 0 if the integer is positive |
| varuint | the number of words in this integer |
| varuint+ | the words of the integer, least-significant to most-significant |
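
The sketch below shows one way to reassemble such an integer when it is small enough for a 64-bit magnitude. It assumes each word is a 32-bit digit (base 2^32), which follows from the varuint range but is not stated explicitly in the table, so treat it as an assumption. The function names are hypothetical, and larger integers would need an arbitrary-precision type instead.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical unsigned LEB128 reader (at most 5 bytes).
static int
read_varuint(const uint8_t *data, size_t size, size_t *offset, uint32_t *value) {
  uint32_t result = 0;
  for (unsigned int index = 0; index < 5; index++) {
    if (*offset >= size) return 0;
    uint8_t byte = data[(*offset)++];
    result |= (uint32_t) (byte & 0x7F) << (7 * index);
    if ((byte & 0x80) == 0) { *value = result; return 1; }
  }
  return 0;
}

// Reads the sign byte, the word count, and then the words themselves.
// Assumption: each word is a 32-bit digit. Returns 0 on truncated input or
// when the integer has more than two words and does not fit in 64 bits.
static int
read_integer_magnitude(const uint8_t *data, size_t size, size_t *offset, int *negative, uint64_t *magnitude) {
  if (*offset >= size) return 0;
  *negative = data[(*offset)++] != 0;

  uint32_t word_count;
  if (!read_varuint(data, size, offset, &word_count)) return 0;
  if (word_count < 1 || word_count > 2) return 0;

  uint64_t result = 0;
  for (uint32_t index = 0; index < word_count; index++) {
    uint32_t word;
    if (!read_varuint(data, size, offset, &word)) return 0;

    // Words arrive least-significant first.
    result |= (uint64_t) word << (32 * index);
  }

  *magnitude = result;
  return 1;
}
```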

Structure

The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool. The header contains information like the version of prism that serialized the tree. The body contains the actual nodes in the tree. The constant pool contains constants that were interned while parsing.

The header is structured like the following table:

| # bytes | field |
| --- | --- |
| 5 | "PRISM" |
| 1 | major version number |
| 1 | minor version number |
| 1 | patch version number |
| 1 | 1 indicates only semantics fields were serialized, 0 indicates all fields were serialized (including location fields) |
| string | the encoding name |
| varsint | the start line |
| varuint | number of newline offsets |
| varuint* | newline offsets |
| varuint | number of comments |
| comment* | comments |
| varuint | number of magic comments |
| magic comment* | magic comments |
| location? | the optional location of the __END__ keyword and its contents |
| varuint | number of errors |
| error* | errors |
| varuint | number of warnings |
| warning* | warnings |
| 1 | 1 if the source is continuable (incomplete but could become valid with more input), 0 otherwise |
| 4 | constant pool offset |
| varuint | constant pool size |
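
A deserializer would typically begin by validating the fixed-width fields at the start of the header before reading anything else. The following is a minimal sketch of that first step only; `example_header_prefix_t` and `read_header_prefix` are hypothetical names, and which versions a reader accepts is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical struct holding the first nine bytes of the header.
typedef struct {
  uint8_t major;
  uint8_t minor;
  uint8_t patch;
  uint8_t semantics_only; // 1 if only semantics fields were serialized
} example_header_prefix_t;

// Checks the "PRISM" magic and copies out the version and the
// semantics-only flag. Returns 0 if the buffer is too short or the magic
// does not match.
static int
read_header_prefix(const uint8_t *data, size_t size, example_header_prefix_t *prefix) {
  if (size < 9) return 0;
  if (memcmp(data, "PRISM", 5) != 0) return 0;

  prefix->major = data[5];
  prefix->minor = data[6];
  prefix->patch = data[7];
  prefix->semantics_only = data[8];
  return 1;
}
```

The variable-length fields that follow (encoding name, start line, newline offsets, and so on) would then be read in table order starting at byte 9.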

After the header comes the body of the serialized string. The body consists of a sequence of nodes built by a pre-order (prefix) traversal of the syntax tree, so each node is written before its children. Each node is structured like the following table:

| # bytes | field |
| --- | --- |
| 1 | node type |
| varuint | node identifier |
| location | node location |
| varuint | node flags |

Every field on the node is then appended to the serialized string. The fields can be determined by referencing config.yml. Depending on the type of field, it could take a couple of different forms, described below:

  • double - A field that is a double. This is structured as a sequence of 8 bytes in native endian order.
  • node - A field that is a node. This is structured just like the parent node.
  • node? - A field that is a node that is optionally present. If the node is not present, then a single 0 byte will be written in its place. If it is present, then it will be structured just like the parent node.
  • node[] - A field that is an array of nodes. This is structured as a variable-length integer length, followed by the child nodes themselves.
  • string - A field that is a string. For example, this is used as the name of the method in a call node, since it cannot directly reference the source string (as in @- or foo=). This is structured as a variable-length integer byte length, followed by the string bytes (without a trailing null byte).
  • constant - A variable-length integer that represents an index in the constant pool.
  • constant? - An optional variable-length integer that represents an index in the constant pool. If it's not present, then a single 0 byte will be written in its place.
  • integer - A field that represents an arbitrary-sized integer. The structure is listed above.
  • location - A field that is a location. This is structured as a variable-length integer start followed by a variable-length integer length.
  • location? - A field that is a location that is optionally present. If the location is not present, then a single 0 byte will be written in its place. If it is present, then it will be structured just like a location field.
  • uint8 - A field that is an 8-bit unsigned integer. This is structured as a single byte.
  • uint32 - A field that is a 32-bit unsigned integer. This is structured as a variable-length integer.

After the syntax tree, the constant pool is serialized. This is a list of constants that were referenced from within the tree. The constant pool begins at the offset specified in the header. Every constant is embedded in the serialization. Each constant is structured as follows:

| # bytes | field |
| --- | --- |
| 4 | the byte offset in the serialization for the contents of the constant |
| 4 | the byte length in the serialization |

After the constant pool, the contents of the constants are serialized. This is just a sequence of bytes that represent the contents of the constants. At the end of the serialization, the buffer is null terminated.
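
Looking up a constant therefore takes two steps: read the 8-byte pool entry, then slice the contents out of the same buffer. The sketch below assumes the 4-byte fields are stored in the byte order of the machine that produced the serialization, which the tables above do not specify, so verify this before relying on it across platforms. The names are hypothetical, and `position` is the zero-based position of the entry within the pool.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical struct for one constant pool entry.
typedef struct {
  uint32_t offset; // where the contents start in the serialization
  uint32_t length; // length of the contents in bytes
} example_constant_t;

// Reads the pool entry at the given zero-based position. pool_offset is the
// constant pool offset taken from the header. Returns 0 if the entry or the
// contents it points to fall outside the buffer.
static int
read_constant(const uint8_t *data, size_t size, uint32_t pool_offset, uint32_t position, example_constant_t *constant) {
  // Each entry is 8 bytes: a 4-byte offset followed by a 4-byte length.
  uint64_t entry = (uint64_t) pool_offset + (uint64_t) position * 8;
  if (entry > size || size - entry < 8) return 0;

  // Assumption: native byte order. memcpy avoids unaligned reads.
  memcpy(&constant->offset, data + entry, 4);
  memcpy(&constant->length, data + entry + 4, 4);

  if (constant->offset > size || size - constant->offset < constant->length) return 0;
  return 1;
}
```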

APIs

The relevant APIs and struct definitions are listed below:

// A pm_buffer_t is a simple memory buffer that stores data in a contiguous
// block of memory. It is used to store the serialized representation of a
// prism tree.

// Parse and serialize the AST represented by the given source to the given
// buffer.
void pm_serialize_parse(pm_buffer_t *buffer, const uint8_t *source, size_t length, const char *data);

Typically you would allocate a pm_buffer_t and call pm_serialize_parse, as in:

void
serialize(const uint8_t *source, size_t length) {
  pm_buffer_t *buffer = pm_buffer_new();
  pm_serialize_parse(buffer, source, length, NULL);

  // Do something with the serialized string.

  pm_buffer_free(buffer);
}

The final argument to pm_serialize_parse is an optional string that controls the options to the parse function. It covers all of the normal options that could be passed to pm_parser_init through a pm_options_t struct, but serialized as a string to make them easier for callers to pass through FFI. Note that no varuint values are used here: fixed-width fields are easier for the caller to produce, and serialized size is less important here. The format of the data is structured as follows:

| # bytes | field |
| --- | --- |
| 4 | the length of the filepath |
| ... | the filepath bytes |
| 4 | the line number |
| 4 | the length of the encoding |
| ... | the encoding bytes |
| 1 | frozen string literal |
| 1 | command line flags |
| 1 | syntax version, see pm_options_version_t for valid values |
| 1 | whether or not the encoding is locked (should almost always be false) |
| 4 | the number of scopes |
| ... | the scopes |

Command line flags are a bitset. By default every flag is 0. It includes the following values:

  • 0x1 - the -a option
  • 0x2 - the -e option
  • 0x4 - the -l option
  • 0x8 - the -n option
  • 0x10 - the -p option
  • 0x20 - the -x option
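
Since the flags form a bitset, several options are combined with a bitwise OR into the single command line flags byte. The constant names below are hypothetical and chosen for this illustration only; the values are the ones listed above.

```c
#include <stdint.h>

// Hypothetical names for the bit values listed above; these are not prism
// identifiers.
enum {
  EXAMPLE_COMMAND_LINE_A = 0x1,
  EXAMPLE_COMMAND_LINE_E = 0x2,
  EXAMPLE_COMMAND_LINE_L = 0x4,
  EXAMPLE_COMMAND_LINE_N = 0x8,
  EXAMPLE_COMMAND_LINE_P = 0x10,
  EXAMPLE_COMMAND_LINE_X = 0x20
};

// Returns 1 if the given flag bit is set in the flags byte.
static int
command_line_has(uint8_t flags, uint8_t flag) {
  return (flags & flag) != 0;
}
```

For example, the equivalent of passing both -n and -p would be `EXAMPLE_COMMAND_LINE_N | EXAMPLE_COMMAND_LINE_P`, which is the byte 0x18.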

Scopes are ordered from the outermost scope to the innermost one.

Each scope is laid out as follows:

| # bytes | field |
| --- | --- |
| 4 | the number of locals |
| ... | the locals |

Each local is laid out as follows:

| # bytes | field |
| --- | --- |
| 4 | the length of the local |
| ... | the local bytes |
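
Putting the scope and local layouts together, a caller could append one scope to the options data as in the sketch below. `append_scope` is a hypothetical helper, the locals are taken from NUL-terminated strings for convenience, and the 4-byte fields are written in native byte order, which is an assumption since the layout does not name a byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper. Appends one scope at out[offset]: a 4-byte count of
// locals, then each local as a 4-byte length followed by its bytes.
// Returns the offset just past the scope, or 0 if out is too small.
static size_t
append_scope(uint8_t *out, size_t capacity, size_t offset, const char *const *locals, uint32_t count) {
  if (offset > capacity || capacity - offset < 4) return 0;

  // Assumption: native byte order for all 4-byte fields.
  memcpy(out + offset, &count, 4);
  offset += 4;

  for (uint32_t index = 0; index < count; index++) {
    size_t local_length = strlen(locals[index]);
    if (local_length > UINT32_MAX) return 0;
    if (capacity - offset < 4 || capacity - offset - 4 < local_length) return 0;

    uint32_t length = (uint32_t) local_length;
    memcpy(out + offset, &length, 4);
    memcpy(out + offset + 4, locals[index], local_length);
    offset += 4 + local_length;
  }

  return offset;
}
```

Calling this once per scope, outermost first, produces the scopes section that follows the 4-byte scope count.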

The data can be NULL (as seen in the example above).
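
When options are needed, the data string can be assembled field by field in the order given in the table. The sketch below builds options with zero scopes. `build_options` is a hypothetical helper, the 4-byte fields are written in native byte order (an assumption, as the layout does not name a byte order), and the caller is responsible for choosing valid values for the single-byte fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper. Writes the options layout into out and returns the
// number of bytes written, or 0 if capacity is too small.
static size_t
build_options(uint8_t *out, size_t capacity, const char *filepath, int32_t line, const char *encoding, uint8_t frozen_string_literal, uint8_t command_line, uint8_t version, uint8_t encoding_locked) {
  size_t filepath_length = strlen(filepath);
  size_t encoding_length = strlen(encoding);
  if (filepath_length > UINT32_MAX || encoding_length > UINT32_MAX) return 0;

  // Three 4-byte integers, four single-byte fields, and a 4-byte scope count.
  size_t needed = 20 + filepath_length + encoding_length;
  if (capacity < needed) return 0;

  uint8_t *cursor = out;
  uint32_t length = (uint32_t) filepath_length;

  // Assumption: native byte order for all 4-byte fields.
  memcpy(cursor, &length, 4); cursor += 4;
  memcpy(cursor, filepath, filepath_length); cursor += filepath_length;
  memcpy(cursor, &line, 4); cursor += 4;

  length = (uint32_t) encoding_length;
  memcpy(cursor, &length, 4); cursor += 4;
  memcpy(cursor, encoding, encoding_length); cursor += encoding_length;

  *cursor++ = frozen_string_literal;
  *cursor++ = command_line;
  *cursor++ = version;
  *cursor++ = encoding_locked;

  // No scopes in this sketch.
  uint32_t scopes = 0;
  memcpy(cursor, &scopes, 4); cursor += 4;

  return (size_t) (cursor - out);
}
```

The resulting bytes would then be passed as the final argument in place of NULL, cast to `const char *` to match the signature of pm_serialize_parse shown earlier.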