Inputs and Parsing
May 30, 2026
Input data is a sequence of objects -- often of type char -- that is intended to be parsed.
An input is a class that adheres to an informal interface and represents some input data.
Contents
- Introduction
- Input Classes
- Ends of Lines
- Parse Function
- Nested Parsing
- Position Classes
- Input Interface
Introduction
Performing a parsing run requires (at least) the following steps.
- The grammar has to be defined.
- An input has to be constructed.
- The parse function has to be called with the grammar and the input.
The following steps are also frequently included to do something useful while parsing.
- Actions have to be implemented and passed to the parse function as template parameter.
- States have to be instantiated and passed to the parse function as additional arguments.
More advanced use cases might also pass a control class to the parsing run, or use other functions to drive the parsing run.
The following incomplete code shows the general outline of performing a parsing run.
using namespace tao::pegtl;
// A state class to collect data while parsing:
class my_state;
// Implementation of required grammar rules:
struct some_rule : ... {};
struct my_grammar : ... {};
// Action class with default case that does nothing:
template< typename Rule >
struct my_actions
: tao::pegtl::nothing< Rule >{};
// Specializations of my_actions as required for rules:
template<>
struct my_actions< some_rule >
{
template< class ActionInput >
static void apply( const ActionInput& in, my_state& state )
{ ... }
};
// Putting everything together to start parsing:
[[nodiscard]] bool my_parse( const std::filesystem::path& file, my_state& state )
{
file_input in( file );
return parse< my_grammar, my_actions >( in, state );
}
Input Classes
The PEGTL includes several input classes for parsing memory, standard library containers, and files. Additionally there are dedicated stream inputs for stream parsing. The stream inputs are only documented on the page dedicated to stream parsing.
All inputs are class templates that can be customized.
Let us classify the inputs according to their names, explaining what e.g. a text_mmap_input is, and then document their template parameters.
Classification
The inputs for data in memory are either
- view inputs that keep a reference (pointer and size) without taking ownership, or
- copy inputs that make a copy of the data in a private container like std::string.
The argv and base inputs are also view inputs according to this classification.
The copy inputs are move aware when presented with an r-value reference during construction.
For all inputs the data must reside in a contiguous piece of memory, i.e. a std::vector or std::array can be used, but not a std::list or std::deque.
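The difference between view and copy inputs can be illustrated with a minimal sketch in plain C++. The types view_sketch and copy_sketch are hypothetical stand-ins, not PEGTL classes; they only show who owns the data and what "move aware" means for a copy input.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a view input: pointer and size, no ownership.
struct view_sketch
{
   const char* data;
   std::size_t size;

   explicit view_sketch( const std::string& s ) noexcept
      : data( s.data() ), size( s.size() )
   {}
};

// Hypothetical stand-in for a copy input: owns the data in a private container.
struct copy_sketch
{
   std::string container;

   // An l-value is copied ...
   explicit copy_sketch( const std::string& s )
      : container( s )
   {}

   // ... while an r-value is moved, which avoids copying the data.
   explicit copy_sketch( std::string&& s ) noexcept
      : container( std::move( s ) )
   {}
};
```

With a view, the referenced data must outlive the input. With a copy, the input is self-contained, at the cost of a copy unless the data can be moved in.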
The inputs for data in a file are either
- read inputs that use std::fread() to read the file into memory upon construction, or
- mmap inputs that use ::mmap(2) (or similar) to map the file into the address space, or
- file inputs that are built with an mmap input when available, and a read input as fallback.
All of the above inputs are also either
- plain inputs whose position information is "simple", most often a count from the start of the input data, or
- text inputs whose position information includes a line and column number based on the Eol parameter explained below.
Note
The column number is just an offset from the start of the input data. It does not take into account that a single Unicode character can be composed of multiple code points, and a single code point can be encoded in multiple code units.
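To make the note concrete, the following sketch counts code points in a UTF-8 string so that the result can be compared with the number of code units (bytes). The function count_code_points is a hypothetical helper, not part of the PEGTL, and it assumes that the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes that the input is valid UTF-8.
[[nodiscard]] inline std::size_t count_code_points( const std::string_view sv ) noexcept
{
   std::size_t n = 0;
   for( const char c : sv ) {
      if( ( static_cast< unsigned char >( c ) & 0xC0 ) != 0x80 ) {
         ++n;
      }
   }
   return n;
}
```

For the two code points "a" and U+00E9, encoded as the three bytes 0x61 0xC3 0xA9, an offset counted in code units advances by three while a human reader sees two characters.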
In addition there is the action_input which is related to the inputs listed here but falls into a slightly different category since it can not be used as an input for a parsing run.
Parameters
The typename Eol template parameter determines which (sequence of) object(s) constitute an end-of-line.
When this parameter is set to void the eol and eolf rules can not be used, and neither can the begin_of_line(), end_of_line_or_file() and line_view_at() input member functions.
All input classes except for argv_input use the default end-of-line rule as default for the Eol template parameter.
On Windows the default end-of-line rule is tao::pegtl::(ascii::)scan::lf_crlf which matches both Unix and MS-DOS line endings.
On Unix and Linux, including macOS and Android, the default end-of-line rule is tao::pegtl::(ascii::)scan::lf which matches Unix line endings.
For an explanation of the three tracking modes and tables of the included end-of-line rules see the ends of lines section below.
The typename Data template parameter determines the type of objects in the input data.
For inputs that have this template parameter it defaults to char and can be changed, for all others it is hard-wired to char.
The typename Source template parameter determines the data type of an optional fixed part of the position information.
The name "source" comes from the most frequent use case, the source filename.
For inputs that have this template parameter it defaults to void, in which case no source information is stored.
For all filesystem inputs this is hard-wired to std::filesystem::path because error messages like "parse error in line 10" without the filename have made me want to throw my computer out the window on too many occasions.
Many inputs have two source types, InputSource and ErrorSource.
In this case the source is stored as type InputSource in the input, but as type ErrorSource in thrown parse_error_base-derived exceptions.
A common use case for the separation is to use std::string_view as InputSource and std::string as ErrorSource; cheap view in the input, more expensive copy in the exception to make it self-contained.
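The separation of the two source types can be sketched in plain C++ as follows. Both input_sketch and error_sketch are hypothetical types for illustration, not PEGTL classes; they assume std::string_view as the cheap source in the input and std::string as the owning source in the exception.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical error type that owns a copy of the source (the "ErrorSource").
struct error_sketch : std::runtime_error
{
   std::string source;

   error_sketch( const std::string& msg, const std::string_view src )
      : std::runtime_error( msg ), source( src )
   {}
};

// Hypothetical input that only keeps a cheap view of the source (the "InputSource").
struct input_sketch
{
   std::string_view source;

   [[noreturn]] void fail( const std::string& msg ) const
   {
      // The more expensive copy is made only when an error is reported.
      throw error_sketch( msg, source );
   }
};
```

Because the exception holds its own copy, it stays valid after the input and the string it refers to have been destroyed.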
The typename Container template parameter determines the container in which copy inputs keep the input data.
It defaults to std::string but can be changed to match an already existing object to enable moving.
Ends of Lines
There are three ways text inputs can use the Eol rule to track or calculate the line and column numbers during a parsing run.
Rule and scan tracking are both eager while lazy tracking is lazy.
Rule Tracking
Rule tracking continuously updates the line and column numbers during a parsing run.
The column number is updated with every successful rule match.
The line number is only updated when the eol or eolf rules match.
- Low overhead while parsing, but
- care has to be taken to not accidentally match the character(s) (or object(s)) constituting an end-of-line with a rule that is not eol or eolf.
- There are no limitations or requirements for the Eol rule.
Rules that enable rule tracking are just any normal rules in the ASCII and Unicode namespaces (tao::pegtl::ascii, tao::pegtl::utf8 etc.)
| Class | ASCII | Unicode |
|---|---|---|
| ...::cr | rule | rule |
| ...::lf | rule | rule |
| ...::crlf | rule | rule |
| ...::cr_lf | rule | rule |
| ...::cr_crlf | rule | rule |
| ...::lf_crlf | rule | rule |
| ...::cr_lf_crlf | rule | rule |
| ...::ls | - | rule |
| ...::nel | - | rule |
| ...::ps | - | rule |
| ...::eol1 | - | rule |
| ...::eolu | - | rule |
The table shows the most commonly used end-of-line rules, however anything (outside of the scan and lazy sub-namespaces) can be used.
This tracking mode was introduced in PEGTL 4.0.
Scan Tracking
Scan tracking also continuously updates the line and column numbers during a parsing run.
After every successful rule match the matched portion of the input is scanned for any occurrences of an end-of-line.
- Some overhead while parsing, and
- only works for end-of-line rules with a designated code point that signifies end-of-line, but
- not a problem if a rule distinct from eol and eolf matches an end-of-line.
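The idea behind scan tracking can be sketched in plain C++ as follows. This is not the PEGTL implementation; position_sketch and scan_update are hypothetical names, and the sketch assumes 1-based line and column numbers and a single designated end-of-line character.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical position with 1-based line and column numbers.
struct position_sketch
{
   std::size_t line = 1;
   std::size_t column = 1;
};

// Called after a successful match: scans the matched portion of the input for
// the designated end-of-line character and updates the position accordingly.
inline void scan_update( position_sketch& pos, const std::string_view matched, const char eol = '\n' ) noexcept
{
   for( const char c : matched ) {
      if( c == eol ) {
         ++pos.line;
         pos.column = 1;
      }
      else {
         ++pos.column;
      }
   }
}
```

Since every matched character is inspected, it does not matter which rule consumed the end-of-line; the cost is the additional pass over the matched data.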
Rules that enable scan tracking can be found in the scan sub-namespace of the ASCII and Unicode namespaces (tao::pegtl::ascii, tao::pegtl::utf8 etc.)
| Class | ASCII | Unicode |
|---|---|---|
| ...::scan::cr | rule | rule |
| ...::scan::lf | rule | rule |
| ...::scan::lf_crlf | rule | rule |
| ...::scan::ls | - | rule |
| ...::scan::nel | - | rule |
| ...::scan::ps | - | rule |
This tracking mode corresponds to the eager tracking in PEGTL versions prior to 4.0.
Note that the scan is skipped if it can be statically proven that the matched input does not contain an end-of-line, for example in the case of a tao::pegtl::(ascii::)string where none of the characters in the string is the designated end-of-line character.
Lazy Tracking
Lazy tracking does not continuously update the line and column numbers during a parsing run.
The position information is calculated on demand, i.e. when current_position() or previous_position() are called on the input.
In that case an eol scan is performed on the input data from the start to the point for which position information was requested.
- Zero overhead while parsing, but
- linear complexity in size of input data to calculate line and column number on demand.
- Not a problem if a rule distinct from eol and eolf matches an end-of-line.
- There are no limitations or requirements for the Eol rule, except:
  - it needs to define an appropriate eol_lazy_peek type alias.
This type alias is used by the eol scan to skip to the next place in the input data at which to attempt an Eol match.
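The on-demand calculation can be sketched in plain C++ as follows. Again this is not the PEGTL implementation; lazy_position_sketch and position_at are hypothetical names, and the sketch assumes 1-based numbering and a single-character end-of-line instead of a general Eol rule.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical position with 1-based line and column numbers.
struct lazy_position_sketch
{
   std::size_t line = 1;
   std::size_t column = 1;
};

// Computes the position of data[ offset ] by scanning from the start of the
// input data; nothing is tracked while parsing, so this is linear in offset.
[[nodiscard]] inline lazy_position_sketch position_at( const std::string_view data, const std::size_t offset, const char eol = '\n' ) noexcept
{
   lazy_position_sketch pos;
   const std::size_t limit = ( offset < data.size() ) ? offset : data.size();
   for( std::size_t i = 0; i < limit; ++i ) {
      if( data[ i ] == eol ) {
         ++pos.line;
         pos.column = 1;
      }
      else {
         ++pos.column;
      }
   }
   return pos;
}
```

This trade-off tends to pay off when positions are only needed for the occasional error message rather than for every matched rule.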
Rules that enable lazy tracking can be found in the lazy sub-namespace of the ASCII and Unicode namespaces (tao::pegtl::ascii, tao::pegtl::utf8 etc.)
| Class | ASCII | Unicode |
|---|---|---|
| ...::lazy::cr | rule | rule |
| ...::lazy::lf | rule | rule |
| ...::lazy::crlf | rule | rule |
| ...::lazy::cr_lf | rule | rule |
| ...::lazy::cr_crlf | rule | rule |
| ...::lazy::lf_crlf | rule | rule |
| ...::lazy::cr_lf_crlf | rule | rule |
| ...::lazy::ls | - | rule |
| ...::lazy::nel | - | rule |
| ...::lazy::ps | - | rule |
| ...::lazy::eolu | - | rule |
This tracking mode was extended in PEGTL 4.0 -- previously lazy tracking had the same restrictions as scan tracking.
Note that simple end-of-line rules, i.e. tao::pegtl::(ascii::)one< N > for a single N, use a slightly more optimized scan.
Parse Function
The parse() function is the single most important user-facing function; it starts a parsing run.
template< typename Rule,
template< typename... > class Action = nothing,
template< typename... > class Control = normal,
apply_mode A = apply_mode::enabled,
rewind_mode M = rewind_mode::dontcare,
typename ParseInput,
typename... States >
bool parse( ParseInput& in,
States&&... st );
- The Rule class represents the top-level parsing rule of the grammar and is mandatory.
- The Action defaults to an action that does nothing. It is required to pass a user-defined action for a parsing run to do more, e.g. build some data structure, than validate an input against the grammar.
- The Control defaults to the normal control class that implements the expected and documented behavior. It can be changed for debugging, e.g. printing all rule match attempts and their outcomes, and for some other advanced use cases, e.g. gathering rule invocation statistics.
- The States are the types of the additional state objects st that are passed to all rules' match() functions, all actions' apply() and apply0() functions, and all control functions. What is needed here depends on what the actions (and control functions) expect.
- The apply_mode defaults to apply_mode::enabled which enables actions. Actions can be disabled by passing apply_mode::disabled, or toggled within the grammar with the enable and disable rules.
- The rewind_mode defaults to rewind_mode::dontcare in which case the input might not be rewound to its start when parse() returns false. Rewinding can be enabled by passing rewind_mode::required.
A parsing run can have the same three outcomes as the match function of a rule. Note that the distinction between "local" and "global" failure does not make too much sense at top-level, however for sake of consistency we will use these terms in all appropriate contexts.
TODO
- success, a return value of true,
- local failure, a return value of false,
- global failure, an exception of type tao::pegtl::parse_error, or also
- any other exception thrown during a parsing run.
Nested Parsing
Nested parsing refers to an (inner) parsing run that is performed during another (outer) parsing run, for example when a file being parsed includes another file.
When an exception is thrown within a nested parsing run it will be caught by tao::pegtl::parse_nested() and a new exception thrown via Control< Rule >::raise_nested().
The new exception contains a position from the argument of type OuterInput and the previous exception as nested exception.
The functions in the header tao/pegtl/contrib/nested_exceptions.hpp can be used to work with these nested exceptions.
The inner-most exception that was thrown first will be the "most nested" exception, i.e. the final one in the linked list of nested exceptions.
The position information contained in the nested exceptions allows for error messages like "error in file F1 line L1 included from file F2 line L2 etc."
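The chaining described above follows the nested exception facilities of the C++ standard library. The following sketch uses only std::throw_with_nested() and std::rethrow_if_nested() to show how such a chain is built and then walked from the outer-most to the inner-most exception. The functions outer_run_sketch and collect_messages are hypothetical and do not use the helpers from the contrib header.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Appends the messages of e and all exceptions nested within it to out,
// starting with the outer-most and ending with the inner-most exception.
inline void collect_messages( const std::exception& e, std::string& out )
{
   if( !out.empty() ) {
      out += " <- ";
   }
   out += e.what();
   try {
      std::rethrow_if_nested( e );
   }
   catch( const std::exception& nested ) {
      collect_messages( nested, out );
   }
   catch( ... ) {
      // Nested exceptions of unknown type are ignored in this sketch.
   }
}

// Simulates an inner parsing run that fails within an outer parsing run.
inline void outer_run_sketch()
{
   try {
      throw std::runtime_error( "error in file F1 line L1" );
   }
   catch( ... ) {
      std::throw_with_nested( std::runtime_error( "included from file F2 line L2" ) );
   }
}
```

The exception thrown first ends up as the last element of the chain, which matches the "most nested" description above.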
Calling parse_nested() requires the input from the outer parsing run, or the position within the outer parsing run, as additional first argument ("additional" as compared to parse()).
template< typename Rule,
template< typename... > class Action = nothing,
template< typename... > class Control = normal,
apply_mode A = apply_mode::enabled,
rewind_mode M = rewind_mode::dontcare,
typename OuterInput,
typename ParseInput,
typename... States >
bool parse_nested( const OuterInput& oi,
ParseInput& in,
States&&... st );
The OuterInput will usually be the input from the outer parsing run; it can also be the position obtained from that input.
More precisely, if oi.current_position() is not callable then oi is assumed to be a position itself, otherwise it is called to obtain a position.
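This kind of compile-time dispatch is commonly written with a detection trait and if constexpr. The following is a sketch of the technique under that assumption, not the actual PEGTL code; all names in it, including the two example types, are hypothetical.

```cpp
#include <type_traits>
#include <utility>

// Detects whether T has a callable current_position() const member function.
template< typename T, typename = void >
struct has_current_position
   : std::false_type
{};

template< typename T >
struct has_current_position< T, std::void_t< decltype( std::declval< const T& >().current_position() ) > >
   : std::true_type
{};

// Returns oi.current_position() when it is callable, otherwise oi is assumed
// to be a position itself and is returned as is.
template< typename OuterInput >
[[nodiscard]] auto extract_position_sketch( const OuterInput& oi )
{
   if constexpr( has_current_position< OuterInput >::value ) {
      return oi.current_position();
   }
   else {
      return oi;
   }
}

// Hypothetical position and input types for demonstration purposes.
struct pos_sketch
{
   int line;
};

struct input_with_pos_sketch
{
   [[nodiscard]] pos_sketch current_position() const noexcept
   {
      return { 7 };
   }
};
```

Passing either an input_with_pos_sketch or a pos_sketch to extract_position_sketch() yields a pos_sketch, so the caller does not need to know which one it was given.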
Position Classes
Positions occur as return type of the input functions current_position() and previous_position(), and as template parameter, and therefore position object, in their parse errors which are instances of tao::pegtl::parse_error<>.
All text inputs use tao::pegtl::text_position for their position reporting; when the source parameters are not void the type is tao::pegtl::position_with_source< SourceType, tao::pegtl::text_position >.
For all filesystem inputs the SourceType is std::filesystem::path, for all other inputs it defaults to void.
Most non-text inputs use tao::pegtl::count_position for their position reporting; when the source parameters are not void the type is tao::pegtl::position_with_source< SourceType, tao::pegtl::count_position >.
The exceptions are the base inputs, which are so basic that they neither keep track of nor can compute the number of objects from the start of the input data; they use tao::pegtl::pointer_position instead.
For all filesystem inputs the SourceType is std::filesystem::path, for all other inputs it defaults to void except for the argv input which defaults to std::string.
Input Interface
All input classes adhere to an informally defined interface of which some parts are optional. Some rules or other facilities will not function when the optional interface parts they rely on are not present.
Basic Interface
The basic interface implemented by all inputs.
using namespace tao::pegtl;
using data_t = char ... or something else;
using error_position_t = ...one of the position classes;
using offset_position_t = void;
using rewind_position_t = ...one of the position classes;
#if defined( __cpp_exceptions )
using parse_error_t = parse_error< error_position_t >;
#endif
[[nodiscard]] bool empty() const noexcept;
[[nodiscard]] std::size_t size() const noexcept; // Number of unconsumed input objects.
[[nodiscard]] const data_t* current( const std::size_t offset = 0 ) const noexcept;
[[nodiscard]] const data_t* end() const noexcept;
[[nodiscard]] const data_t* previous( const rewind_position_t saved ) const noexcept;
[[nodiscard]] const data_t* previous( const error_position_t saved ) const noexcept;
template< typename Rule >
void consume( const std::size_t count ) noexcept;
[[nodiscard]] rewind_position_t rewind_position() const noexcept;
void rewind_to_position( const rewind_position_t saved ) noexcept;
[[nodiscard]] error_position_t current_position() const noexcept;
[[nodiscard]] error_position_t previous_position( const rewind_position_t saved ) const noexcept;
Inputs with Start
An input with start is an input that remembers the initial return value of current() and can be restarted from that position.
Most inputs are inputs with start except for base_input -- and all stream inputs.
[[nodiscard]] const data_t* start() const noexcept;
void restart() noexcept;
Inputs with Lines
An input with lines defines an eol_rule type alias which enables use of the eol and eolf rules.
Note that an input with lines does not necessarily include line and column numbers in its position tracking, that is only provided by text inputs.
Note that when one of the user-facing input classes is given void as Eol template parameter it disables the eol_rule type alias and is not considered an input with lines.
For an in-depth explanation of the choices regarding the Eol template parameter please see the ends of lines section above.
using eol_rule = Eol; // template parameter, usually defaults to default_eol
Inputs with lines also implement the following functions that rely on the presence of eol_rule.
[[nodiscard]] const data_t* begin_of_line( const error_position_t&, const std::size_t max = 135 ) const noexcept;
[[nodiscard]] const data_t* end_of_line_or_file( const error_position_t&, const std::size_t max = 135 ) const;
Inputs with Source
An input with source keeps an object that is part of the position but does not change over the parsing run.
For inputs that read from a file the source is the filename in a std::filesystem::path.
There are two source type aliases, input_source_t is the type of the source object embedded in the input, and error_source_t is the type of the source object embedded in the error_position_t which will be some position_with_source<> that is also used in the parse_error<> exceptions.
Note
If either of input_source_t or error_source_t is void then both must be void.
Note
It must be possible to construct an error_source_t from a const input_source_t&.
using input_source_t = ...std::string or user chosen;
using error_source_t = ...std::string or user chosen;
[[nodiscard]] const input_source_t& direct_source() const noexcept;
Input Convenience
All inputs, including action_input, implement the following set of convenience functions.
[[nodiscard]] const data_t& peek( const std::size_t offset = 0 ) const noexcept
{
return *current( offset );
}
template< typename T >
[[nodiscard]] T peek_as( const std::size_t offset = 0 ) const noexcept
{
static_assert( sizeof( T ) == sizeof( data_t ) );
return static_cast< T >( peek( offset ) );
}
[[nodiscard]] char peek_char( const std::size_t offset = 0 ) const noexcept
{
return peek_as< char >( offset );
}
[[nodiscard]] std::byte peek_byte( const std::size_t offset = 0 ) const noexcept
{
return peek_as< std::byte >( offset );
}
[[nodiscard]] std::int8_t peek_int8( const std::size_t offset = 0 ) const noexcept
{
return peek_as< std::int8_t >( offset );
}
[[nodiscard]] std::uint8_t peek_uint8( const std::size_t offset = 0 ) const noexcept
{
return peek_as< std::uint8_t >( offset );
}
[[nodiscard]] std::string string() const
{
static_assert( sizeof( data_t ) == 1 );
return std::string( static_cast< const char* >( this->current() ), this->size() );
}
[[nodiscard]] std::string_view string_view() const noexcept
{
static_assert( sizeof( data_t ) == 1 );
return std::string_view( static_cast< const char* >( this->current() ), this->size() );
}
[[nodiscard]] std::vector< data_t > vector() const
{
return std::vector< data_t >( this->current(), this->current() + this->size() );
}
template< typename Position >
[[nodiscard]] std::string_view line_view_at( const Position& pos )
{
static_assert( sizeof( data_t ) == 1 );
const char* const b = static_cast< const char* >( this->begin_of_line( pos ) );
const char* const e = static_cast< const char* >( this->end_of_line_or_file( pos ) );
return { b, std::size_t( e - b ) };
}
The line_view_at function returns a std::string_view of the line of the input containing pos.
It requires an input in where in.begin_of_line( pos ) and in.end_of_line_or_file( pos ) are valid function calls.
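What such a function computes can be sketched on a plain std::string_view. The function line_view_sketch is a hypothetical stand-in that takes an offset instead of a position object, assumes a single-character end-of-line, and ignores the max limit of the real member functions.

```cpp
#include <cstddef>
#include <string_view>

// Returns the line that contains data[ offset ], without the end-of-line character.
[[nodiscard]] inline std::string_view line_view_sketch( const std::string_view data, const std::size_t offset, const char eol = '\n' ) noexcept
{
   // Begin of line: one past the previous end-of-line, or the start of the data.
   const std::size_t prev = ( offset == 0 ) ? std::string_view::npos : data.rfind( eol, offset - 1 );
   const std::size_t b = ( prev == std::string_view::npos ) ? 0 : prev + 1;
   // End of line: the next end-of-line, or the end of the data.
   const std::size_t next = data.find( eol, offset );
   const std::size_t e = ( next == std::string_view::npos ) ? data.size() : next;
   return data.substr( b, e - b );
}
```

A typical use is to print the offending line below an error message, possibly followed by a marker under the column where the error occurred.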
Stream Compatibility
The PEGTL is designed to minimize the impact of the existence of the stream parsing on the core library. This goal was mostly achieved with the exception of some input functions and how the rules use them. All non-stream input classes implement the following functions for compatibility with the stream inputs.
[[nodiscard]] decltype( auto ) end( const std::size_t /*unused*/ ) const noexcept( auto )
{
return end();
}
[[nodiscard]] std::size_t size( const std::size_t /*unused*/ ) const noexcept( auto )
{
return size();
}
void require( const std::size_t /*unused*/ ) const noexcept
{}
void discard() const noexcept
{}
All rules that need to be compatible with stream inputs need to use the end() and size() variants with argument.
The argument tells the stream input how much data it needs to prefetch for the rule to attempt its match.
That is why, for example, the implementation of consume< Num > uses if( in.size( Num ) >= Num ) instead of if( in.size() >= Num ) to test whether the Input in contains at least Num further objects.
Similarly the require() and discard() functions are implemented for compatibility so that grammars with require and discard can be used on all inputs.
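The pattern can be sketched with a hypothetical non-stream input and a rule-like function in plain C++. Neither memory_input_sketch nor consume_sketch is PEGTL code; they only show why a rule written against size( n ) works unchanged on inputs that ignore the argument.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical non-stream input where all data is already in memory.
struct memory_input_sketch
{
   std::string_view data;

   [[nodiscard]] std::size_t size() const noexcept
   {
      return data.size();
   }

   // The argument is a prefetch hint for stream inputs and is ignored here.
   [[nodiscard]] std::size_t size( const std::size_t /*unused*/ ) const noexcept
   {
      return size();
   }

   void consume( const std::size_t count ) noexcept
   {
      data.remove_prefix( count );
   }
};

// Hypothetical rule-like function that consumes Num objects when available.
template< std::size_t Num, typename Input >
[[nodiscard]] bool consume_sketch( Input& in ) noexcept
{
   if( in.size( Num ) >= Num ) {
      in.consume( Num );
      return true;
   }
   return false;
}
```

A stream input would use the same call to first make sure that up to Num objects are buffered before reporting how many are available.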
This page is part of the PEGTL and its documentation.
Copyright (c) 2014-2026 Dr. Colin Hirsch and Daniel Frey
Distributed under the Boost Software License, Version 1.0
See accompanying file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt