Site Update pt II

2022-12-21

In the previous site update I mentioned that the ‘right’ fix would be to update this site’s rendering pipeline to use a different flavor of Markdown, preferably CommonMark.

That has happened! And it wasn’t too bad of an adjustment.

Why CommonMark

I use Pandoc as the core of how the content on this site gets created, and its default ‘Pandoc Flavored Markdown’ has served me well.

However, CommonMark-flavored Markdown seems to have taken over in the spaces I operate in day-to-day: GitLab and GitHub both use a variation on CommonMark for their Markdown rendering, as does the in-IDE Markdown preview in VS Code (which is where I typically do my editing).

I can generate this site locally to preview what’s going to go out, but having this site’s renderer match the other renderers I use day-to-day will make for a smoother content-editing experience.

What Did I Do

At the most basic level, I adjusted the ‘from’ parameter I pass to pandoc when building content for this site:

- PANDOC_OPTS := -f markdown -t html5 -s --template $(template) --shift-heading-level-by=1
+ PANDOC_OPTS := -f commonmark_x -t html5 -s --template $(template) --shift-heading-level-by=1

The first thing that happened was that I got loads of warnings from pandoc: the template I’m using requires a ‘title’, but no title was available. The ‘recent posts’ section of this site also broke. That’s no good!

Pandoc supports placing metadata about a document inline in the document itself, where it is then available to the output template. There are two ways the author can format it:

With a %-prefixed header-block at the top of the document:

% Post Title
% Post Author
% 2022-12-21

Post body with *Markdown* syntax

Or as a block of YAML, with whatever fields your template expects:

---
title: Post title
date: 2022-12-21
...

Post body with *Markdown* syntax

(I’ve left out the ‘author’ field because my template doesn’t use it.)

All of my posts used the first method. However, the commonmark_x parser in Pandoc only supports the second method (which makes sense - it’s more flexible).

The update was tedious but straightforward:
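As an illustration only (this is not the script actually used for this site), a conversion like that could be sketched in Go roughly as follows. The helper name convertHeader is made up, and it assumes every post starts with exactly the three-line title/author/date form shown above, with no wrapped continuation lines:

```go
package main

import (
	"fmt"
	"strings"
)

// convertHeader rewrites a leading %-prefixed title block
// (title, author, date, in that order) into a YAML metadata block.
// Documents that do not start with "%" are returned unchanged.
// Hypothetical helper: the name and the header shape are assumptions.
func convertHeader(doc string) string {
	lines := strings.Split(doc, "\n")

	// Collect the consecutive %-prefixed lines at the top of the document.
	var fields []string
	for _, l := range lines {
		if !strings.HasPrefix(l, "%") {
			break
		}
		fields = append(fields, strings.TrimSpace(strings.TrimPrefix(l, "%")))
	}
	if len(fields) == 0 {
		return doc
	}

	var b strings.Builder
	b.WriteString("---\n")
	// Quote the title so characters such as ':' stay valid YAML
	// (Go-style quoting, which is fine for ordinary titles).
	fmt.Fprintf(&b, "title: %q\n", fields[0])
	// The author (fields[1]) is dropped, since the template does not use it.
	if len(fields) >= 3 {
		fmt.Fprintf(&b, "date: %s\n", fields[2])
	}
	b.WriteString("...\n")
	// Everything after the old header is copied through untouched.
	b.WriteString(strings.Join(lines[len(fields):], "\n"))
	return b.String()
}

func main() {
	fmt.Print(convertHeader("% Post Title\n% Post Author\n% 2022-12-21\n\nPost body with *Markdown* syntax\n"))
}
```

Running something like this over each post and diffing the result before committing would keep the tedious part reviewable.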

The new parser, which pulls the title and date out of that YAML block, is brutal but effective:

    // r is an io.Reader holding the raw markdown content
    s := bufio.NewScanner(r)

    // Format:
    //  ---
    //  title: some-title
    //  date:  some-date
    //  ...

    if !s.Scan() {
        return
    }
    if !bytes.Equal(s.Bytes(), []byte("---")) {
        return
    }

    var done bool
    var lines [][]byte
    lines = append(lines, slices.Clone(s.Bytes()))
    for s.Scan() {
        lines = append(lines, slices.Clone(s.Bytes()))
        if bytes.Equal(s.Bytes(), []byte("...")) {
            done = true
            break
        }
    }
    if !done {
        return
    }

    var parsed struct {
        Title string               `yaml:"title"`
        Date  *localdate.LocalDate `yaml:"date"`
    }

    data := bytes.Join(lines, []byte("\n"))

    err = yaml.Unmarshal(data, &parsed)
    if err != nil {
        return
    }

    title = parsed.Title
    date = parsed.Date

    return

(This parser assumes that if a document starts with --- it will find ... quickly. If this parser were exposed to an adversary I would need to be more defensive and make sure the look-ahead/memory-consumption is bounded.)
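A more defensive variant might look something like the sketch below. It only covers the bounded read of the header block, not the YAML decoding, and the names readHeader and headerOf as well as both limits are assumptions picked for illustration:

```go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// Hypothetical limits; tune to taste.
const (
	maxHeaderLines = 32   // give up if "..." has not appeared by this line
	maxLineBytes   = 4096 // upper bound on the scanner's line buffer
)

var errNoHeader = errors.New("no metadata block found within limits")

// readHeader returns the raw "---" ... "..." block at the top of r.
// It reads at most maxHeaderLines lines, each shorter than maxLineBytes,
// so memory use stays bounded no matter what the input looks like.
func readHeader(r io.Reader) ([]byte, error) {
	s := bufio.NewScanner(r)
	// Cap the scanner's buffer so one huge line cannot exhaust memory;
	// an over-long line makes Scan return false.
	s.Buffer(make([]byte, 0, maxLineBytes), maxLineBytes)

	if !s.Scan() || !bytes.Equal(s.Bytes(), []byte("---")) {
		return nil, errNoHeader
	}

	var buf bytes.Buffer
	buf.WriteString("---\n")
	for n := 1; n < maxHeaderLines && s.Scan(); n++ {
		buf.Write(s.Bytes()) // Write copies, so the scanner may reuse its buffer
		buf.WriteByte('\n')
		if bytes.Equal(s.Bytes(), []byte("...")) {
			return buf.Bytes(), nil
		}
	}
	return nil, errNoHeader
}

// headerOf is a small convenience wrapper for in-memory documents.
func headerOf(doc string) (string, error) {
	h, err := readHeader(strings.NewReader(doc))
	return string(h), err
}

func main() {
	h, err := headerOf("---\ntitle: Post title\ndate: 2022-12-21\n...\n\nPost body\n")
	fmt.Printf("%q %v\n", h, err)
}
```

With these limits the worst case is roughly maxHeaderLines × maxLineBytes of buffered header, and a document that opens with --- but never closes the block is rejected instead of being read to the end.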

I also lost some formatting - Pandoc Markdown allows placing a \ at the end of a line to insert a line-break but also keep the rendered lines together (I think this is called ‘keep with next’ in MS Word, for example):

Line one\
Line two

The CommonMark spec claims to support this syntax, but it doesn’t work in the contexts where I’m using it, so I stripped it out by hand. I might be able to do something similar with <br> tags, but it’s not crucial.
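The <br> substitution could be done mechanically. A minimal sketch, assuming every line that ends in a single backslash is meant as a hard break; it does not try to skip code blocks, and the name backslashToBR is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// backslashToBR replaces a single trailing backslash on a line with a
// literal <br> tag. Lines ending in an escaped backslash (`\\`) are left alone.
// Hypothetical helper; it is not aware of code blocks.
func backslashToBR(doc string) string {
	lines := strings.Split(doc, "\n")
	for i, l := range lines {
		if strings.HasSuffix(l, `\`) && !strings.HasSuffix(l, `\\`) {
			lines[i] = strings.TrimSuffix(l, `\`) + "<br>"
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Print(backslashToBR("Line one\\\nLine two\n"))
}
```

Whether the rendered result keeps the two lines together the way the Pandoc Markdown version did would still need checking against the actual template.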