Writing a Pollen lexer in Pygments
After writing a few blog posts about Pollen I started getting annoyed that I didn’t have syntax highlighting for the code snippets. So I did a bit of fooling around with Pygments, and it turns out writing a custom lexer isn’t that unreasonable, so here’s how I did it.
Pollen markup
Pollen’s rules are pretty simple as it’s basically just some extra syntax for embedding Racket in a text file:
- Comments start with ◊;
- You can insert variables with ◊|my-var|
- Run arbitrary Racket code with ◊( ... )
- There’s an extra construction that transforms ◊fun[arg1 arg2]{some text} to ◊(fun arg1 arg2 "some text"), which is useful when you want to send a bunch of interpolated text to a function. (I use it everywhere in my book.)
So the end goal is to be able to highlight this type of code:
◊; A link can just be a standard reference
◊(define dune-audible "https://www.audible.com/pd/Dune-Audiobook/B002V1OF70")
I'm ◊strong{really} looking forward to the upcoming Dune movie!
◊div[#:class "extra-sand"]{
I also recommend the Dune audiobook ◊link[#:ref dune-audible]{on Audible}.
}
Setup and parsing comments
The simplest Pygments lexer might look like this:
"""
Lexer for Pollen
"""
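from pygments.lexer import RegexLexer
from pygments.token import Text

class PollenLexer(RegexLexer):
    name = 'Pollen'
    aliases = ['pollen']
    filenames = ['*.pm']

    tokens = {
        'root': [
            (r'.', Text),
        ],
    }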
Which, if placed in pollen.py, can be run on a file pollen.html.pm like this:
python3 -m pygments -l pollen.py:PollenLexer -x -f html pollen.html.pm
That produces HTML output. Right now it doesn’t do anything interesting, as it only returns a Text token for everything, so let’s change that, shall we?
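The same check can be made from Python instead of the command line. This is only a sketch: MinimalPollenLexer is a hypothetical stand-in for the lexer above, and its single `.+` rule is my assumption about what "a Text token for everything" looks like.

```python
from pygments.lexer import RegexLexer
from pygments.token import Text


class MinimalPollenLexer(RegexLexer):
    """Hypothetical stand-in: every line becomes a plain Text token."""
    name = 'Pollen (minimal sketch)'

    tokens = {
        'root': [
            (r'.+', Text),
        ],
    }


source = "◊; Standard comment\nRegular text ◊; Trailing comment\n"

# get_tokens() yields (token_type, value) pairs for the whole input.
tokens = list(MinimalPollenLexer().get_tokens(source))
for token_type, value in tokens:
    print(token_type, repr(value))
```

Every pair that comes back is Text (or a subtype of it for the newlines), which is why nothing gets highlighted yet.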
I would like to support highlighting comments:
◊; Standard comment
Regular text ◊; Trailing comment
Which should be pretty straightforward. We just need to add a single clause to the root state that matches everything from ◊; to the end of the line:
    tokens = {
        'root': [
            (r'◊;.*?$', Comment),
            (r'.', Text),
        ],
    }
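To see what such a clause matches on its own, here is a small sketch using only the re module. The exact pattern `◊;.*?$` is an assumption on my part (anything that runs from ◊; to the end of the line would do); re.MULTILINE is there because it is the default flag of a Pygments RegexLexer, so `$` matches at every line end.

```python
import re

# Assumed comment pattern: from "◊;" up to the end of the current line.
COMMENT = re.compile(r'◊;.*?$', re.MULTILINE)

sample = "◊; Standard comment\nRegular text ◊; Trailing comment\n"

comments = COMMENT.findall(sample)
print(comments)
```

Both the standalone comment and the trailing one are picked up, while "Regular text " is left for the fallback rule.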
And comments are highlighted!
python3 -m pygments -l pollen2.py:PollenLexer -x -f html pollen.html.pm
◊; Standard comment
Regular text ◊; Trailing comment
That’s hard to read so I’ll embed the output from now on:

If you’re reading this but don’t understand why it works, you might want to look up regular expressions in Python; in this post I’ll assume you’re familiar with them.
Embedding variables
Let’s move on to embedding variables via ◊|var|.
A first attempt could be like this:
        'root': [
            ...
            (r'(◊\|)(\w+)(\|)',
             bygroups(Name.Variable.Magic, Name.Variable, Name.Variable.Magic)),
Which splits the highlighting into three groups (◊|, var and |) and gives them different tokens using the special bygroups construction. The tokens you choose are slightly arbitrary, but should generally follow the advice in the Pygments tokens reference. I chose Name.Variable.Magic instead of something like Punctuation because I wanted them to stand out more.
This works, but there are two changes I’d like to make. The immediate problem is that we now only match word characters with \w+, but a Racket variable can contain a bunch of different symbols. For example, the dune-audible variable from the start of this post contains a hyphen, which \w+ won’t match.
If we look at the existing Racket lexer, it defines a symbol like this:
    _delimiters = r'()[\]{}",\'`;\s'
    _symbol = r'(?:\|[^|]*\||\\[\w\W]|[^|\\%s]+)+' % _delimiters
Which we can steal and copy to our class and use when we build our regex:
            (r'(◊\|)(' + _symbol + r')(\|)',
             bygroups(Name.Variable.Magic, Name.Variable, Name.Variable.Magic)),
To make this work out of the box we also need to add the regex flags:
    flags = re.IGNORECASE | re.MULTILINE
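The difference between \w+ and the Racket symbol regex is easy to check in isolation. A sketch, assuming the two definitions below match what the RacketLexer source contained at the time (newer Pygments releases may spell them differently); the names word_only and racket_symbol are mine.

```python
import re

# Assumed to mirror the definitions in Pygments' RacketLexer.
_delimiters = r'()[\]{}",\'`;\s'
_symbol = r'(?:\|[^|]*\||\\[\w\W]|[^|\\%s]+)+' % _delimiters

# First attempt: only word characters between the pipes.
word_only = re.compile(r'(◊\|)(\w+)(\|)')
# Second attempt: any Racket symbol between the pipes.
racket_symbol = re.compile(r'(◊\|)(' + _symbol + r')(\|)')

text = "◊|dune-audible|"

word_match = word_only.match(text)
symbol_match = racket_symbol.match(text)

print(word_match)             # no match, "-" is not a word character
print(symbol_match.groups())  # the three groups handed to bygroups
```

The hyphen is enough to make the \w+ version fail, while the symbol version splits the text into the three groups we want.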
The other thing I want to do is introduce another state. It’s not strictly needed in this case, but as ◊ can be followed by different cases it makes the lexer easier to follow. Like this:
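        'root': [
            (r'◊;.*?$', Comment),
            (r'◊', Name.Variable.Magic, 'post-magic'),
            (r'.', Text),
        ],
        'post-magic': [
            (r'(\|)(' + _symbol + r')(\|)',
             bygroups(Name.Variable.Magic, Name.Variable, Name.Variable.Magic),
             '#pop'),
        ],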
Now when we parse the ◊ we mark it as Name.Variable.Magic and then push the state post-magic onto the stack. It’s important to note that we don’t replace the existing state, so the stack will now have post-magic on top and root below.
Then when we try to parse the next character, post-magic is responsible for matching it. The new thing there is the magic '#pop' value that pops from the stack, so after we’re done matching ◊|var| we hand back control to root.
This should now be able to highlight embedding variables:

Unfortunately I didn’t figure out how to debug the state transitions in an easy manner. If we mess up, Pygments will insert err classes, but you can also play around with different colored tokens during development, for example letting root return a Keyword token so you can see that we return to the right state.
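One low-tech way to find where things go wrong is to ask the lexer for its raw tokens and list the Error ones together with their offsets. This is a sketch under assumptions: TracingLexer is a hypothetical cut-down lexer and find_errors is a helper I made up for the illustration, not something Pygments provides.

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Error, Name, Text


class TracingLexer(RegexLexer):
    """Hypothetical cut-down lexer, used only to demonstrate the check."""
    name = 'Tracing sketch'

    tokens = {
        'root': [
            (r'◊;.*?$', Comment),
            (r'◊', Name.Variable.Magic, 'post-magic'),
            (r'[^◊\n]+', Text),
        ],
        'post-magic': [
            (r'\|\w+\|', Name.Variable, '#pop'),
        ],
    }


def find_errors(lexer, text):
    """Return (offset, value) for every Error token the lexer emits."""
    return [(pos, value)
            for pos, token_type, value in lexer.get_tokens_unprocessed(text)
            if token_type is Error]


# "◊?" has no matching rule in post-magic, so "?" comes back as an Error.
errors = find_errors(TracingLexer(), "◊|ok| then ◊?\n")
print(errors)
```

The offsets point straight at the character no rule in the current state could handle, which narrows down which state is missing a case.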
Highlighting Racket code
Our next step is to try to highlight Racket code inside ◊( ... ).
I thought this was going to be really hard, but Pygments supports this in various ways. The way I chose was to delegate the lexing of different parts to the existing RacketLexer.
First the import:
from pygments.lexers.lisp import RacketLexer
And the case is simply:
        'post-magic': [
            ...
            (r'\(.+\)',
             using(RacketLexer, state='unquoted-datum'),
             '#pop'),
The interesting part is using(RacketLexer, state='unquoted-datum'), which delegates the lexing to RacketLexer, starting in state unquoted-datum. How did I figure out which initial state to start in? I tried to read the code and make an educated guess…
But we also need to make sure we use the regex flag that allows the dot to match newlines as well, otherwise we won’t match multiline Racket expressions:
    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
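Here is a self-contained sketch of the delegation, small enough to run on its own. RacketInPollenLexer is a hypothetical cut-down lexer that only knows about ◊( ... ), and the `\(.+\)` pattern is my assumption of what the rule looks like; the using(..., state=...) call and the RacketLexer import are regular Pygments APIs.

```python
import re

from pygments.lexer import RegexLexer, using
from pygments.lexers.lisp import RacketLexer
from pygments.token import Name, Text


class RacketInPollenLexer(RegexLexer):
    """Hypothetical cut-down lexer: only ◊( ... ) is recognised."""
    name = 'Racket-in-Pollen sketch'
    # DOTALL lets "." cross line breaks inside the parentheses.
    flags = re.MULTILINE | re.DOTALL

    tokens = {
        'root': [
            (r'◊', Name.Variable.Magic, 'post-magic'),
            (r'[^◊]+', Text),
        ],
        'post-magic': [
            # Hand the whole parenthesised expression to RacketLexer.
            (r'\(.+\)', using(RacketLexer, state='unquoted-datum'), '#pop'),
        ],
    }


source = '◊(define answer\n  42)\n'
tokens = list(RacketInPollenLexer().get_tokens(source))
for token_type, value in tokens:
    print(token_type, repr(value))
```

The tokens after the ◊ come from RacketLexer, so define, the name and the number each get their own Racket token type even though the expression spans two lines.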
And all embedded Racket code is highlighted:

Recursive brackets
Now lexing ◊var[arg1]{text args} is a bit more involved, but builds on the concepts we’ve already seen.
First let’s support the simpler ◊var{text args} case.
Matching ◊var is straightforward:
        'post-magic': [
            ...
            (_symbol, Name.Variable, ('#pop', 'curly-start')),
        ],
We could do more here, but we’re preparing for the future where we can also match against an optional [...] after the variable, so we’ll delegate to another state. ('#pop', 'curly-start') essentially means we’ll replace the current state post-magic with the new curly-start state.
        'curly-start': [
            (r'\{', Name.Variable.Magic, ('#pop', 'curly-end')),
        ],
Here again we could’ve done more, but we want to be able to do recursive matching inside { ... } as well. This is what the curly-end state does:
        'curly-end': [
            (r'\}', Name.Variable.Magic, '#pop'),
            include('root'),
        ],
include('root') does what you might expect it to do: it copies all cases from our root state into the curly-end state. This avoids code duplication.
And this can indeed highlight ◊var{ ... } recursively!
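To make the stack behaviour concrete, here is a runnable sketch with just the brace handling. BraceLexer is a hypothetical cut-down lexer; the state names follow the text, but the regexes and the choice of Name.Variable.Magic for the braces are assumptions.

```python
from pygments.lexer import RegexLexer, include
from pygments.token import Name, Text


class BraceLexer(RegexLexer):
    """Hypothetical cut-down lexer: only ◊name{ ... } is recognised."""
    name = 'Pollen braces sketch'

    tokens = {
        'root': [
            (r'◊', Name.Variable.Magic, 'post-magic'),
            (r'[^◊}]+', Text),
            (r'\}', Text),  # a stray "}" outside any tag is plain text
        ],
        'post-magic': [
            # Replace post-magic with curly-start once the name is read.
            (r'[\w-]+', Name.Variable, ('#pop', 'curly-start')),
        ],
        'curly-start': [
            (r'\{', Name.Variable.Magic, ('#pop', 'curly-end')),
        ],
        'curly-end': [
            # Must come before the included rules so "}" closes the tag.
            (r'\}', Name.Variable.Magic, '#pop'),
            include('root'),
        ],
    }


source = "◊strong{really ◊em{nested} text} after\n"
tokens = list(BraceLexer().get_tokens(source))
magic = [value for token_type, value in tokens
         if token_type is Name.Variable.Magic]
print(magic)
```

Each opening brace pushes a curly-end state and each closing brace pops one, so the inner ◊em{...} is closed before the outer ◊strong{...}.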

To support an optional [ ... ] we can add another state before curly-end that matches against either square brackets or curly brackets:
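        'post-magic': [
            ...
            (_symbol, Name.Variable, ('#pop', 'post-var')),
        ],
        'post-var': [
            (r'(\[)(.+?)(\])',
             bygroups(Name.Variable.Magic,
                      using(RacketLexer, state='unquoted-datum'),
                      Name.Variable.Magic),
             ('#pop', 'curly-start')),
            include('curly-start'),
        ],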
We’ve already seen these things before. The only thorn in my side here is that we use a non-greedy match (.+?) to match between brackets, but we used a greedy (.+) earlier. I don’t have a good answer for this… This could probably be improved in some way.
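The difference between the two quantifiers shows up as soon as a line contains more than one bracketed group. A small sketch with the re module; the sample line is made up for the illustration.

```python
import re

# Hypothetical input with two tags that both take bracketed arguments.
line = '◊a[#:x "1"]{one} and ◊b[#:y "2"]{two}'

# Greedy: runs from the first "[" all the way to the last "]".
greedy = re.search(r'\[(.+)\]', line).group(1)
# Non-greedy: stops at the first "]" it can.
lazy = re.search(r'\[(.+?)\]', line).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version swallows everything between the two tags, which is the same risk the greedy parenthesis match runs if a later ")" shows up further down in the file.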
But hey! Think positive! It works for the cases I need. For example:

The result
Putting it all together, we can now highlight the code we looked at at the start of this post:

And this is the complete lexer:
"""
Lexer for Pollen
"""
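import re

from pygments.lexer import RegexLexer, bygroups, include, using
from pygments.lexers.lisp import RacketLexer
from pygments.token import Comment, Name, Text

class PollenLexer(RegexLexer):
    name = 'Pollen'
    aliases = ['pollen']
    filenames = ['*.pm']

    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL

    # Copied from RacketLexer
    _delimiters = r'()[\]{}",\'`;\s'
    _symbol = r'(?:\|[^|]*\||\\[\w\W]|[^|\\%s]+)+' % _delimiters

    tokens = {
        'root': [
            (r'◊;.*?$', Comment),
            (r'◊', Name.Variable.Magic, 'post-magic'),
            (r'.', Text),
        ],
        'post-magic': [
            (r'(\|)(' + _symbol + r')(\|)',
             bygroups(Name.Variable.Magic, Name.Variable, Name.Variable.Magic),
             '#pop'),
            (r'\(.+\)',
             using(RacketLexer, state='unquoted-datum'),
             '#pop'),
            (_symbol, Name.Variable, ('#pop', 'post-var')),
        ],
        'post-var': [
            (r'(\[)(.+?)(\])',
             bygroups(Name.Variable.Magic,
                      using(RacketLexer, state='unquoted-datum'),
                      Name.Variable.Magic),
             ('#pop', 'curly-start')),
            include('curly-start'),
        ],
        'curly-start': [
            (r'\{', Name.Variable.Magic, ('#pop', 'curly-end')),
        ],
        'curly-end': [
            (r'\}', Name.Variable.Magic, '#pop'),
            include('root'),
        ],
    }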