Link Rewriting

2019-1-6

This site is generated from a collection of markdown documents which are converted to HTML using Pandoc. The generated HTML is then stuffed into a template, which supplies the standard metadata and styling. You can see more details here.

Problem

Originally this site had all styles directly in-line in <style> tags in the template:

<!DOCTYPE html>
<html>
<head>
    <style type="text/css">
      /* styles go here! */
    </style>
</head>

This worked great, but I wanted styles in a separate CSS file to ease maintenance.

Because posts can appear at any path I would need to use an absolute path in the template:

<link href="/main.css" rel="stylesheet">

This worked when published, however I was no longer able to use a local build of the site for testing - the stylesheet would get resolved to file:///main.css which doesn’t exist.
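To see why the local build breaks, here is a small sketch of how an absolute path resolves against each kind of page URL, using the standard WHATWG `URL` class that ships with Node. Both page URLs are made-up examples, not the real layout of this site.

```typescript
// Hypothetical location of a locally built page, opened straight from disk.
const localPage = "file:///home/me/site/out/posts/link-rewriting.html";

// An absolute path replaces the whole path of the base URL, so the
// stylesheet is looked up at the root of the filesystem.
const resolvedLocally = new URL("/main.css", localPage).href;
console.log(resolvedLocally); // file:///main.css

// Hypothetical published location of the same page.
const publishedPage = "https://example.com/posts/link-rewriting.html";

// On a web server the same link resolves against the site root, which is
// exactly what the template wants.
const resolvedPublished = new URL("/main.css", publishedPage).href;
console.log(resolvedPublished); // https://example.com/main.css
```

The link itself is fine; it is the meaning of "root" that changes between a web server and a `file://` URL.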

Solution!

I decided it’d be simple enough to take a pass over the generated HTML and re-write any links found to make them work the way I want them to.

I already have a general ‘utility’ executable which performs other steps in the build - I decided to add a sub-command to it to fix-up links. The process would be:

  1. Load the HTML file contents
  2. Parse the HTML into some sort of structure
  3. Edit the parsed-HTML
  4. Write the updated HTML back out

I’ve already been working in TypeScript/JavaScript for this project, so I dug around to see what was popular on NPM and settled on htmlparser2 to parse the HTML, and then domhandler and domutils to work with the processed HTML (they look like they’re from the same author and designed to work together).

TypeScript

The HTML wrangling in TypeScript is pretty straightforward.

DefinitelyTyped didn’t have any types for domutils or domhandler so I made some locally-defined interfaces to cast the imports to.1

import * as htmlparser from "htmlparser2";
import * as path from "path";
import * as fs from "./fs";

const domutils = require("domutils") as domutils;
const DomHandler = require("domhandler") as DomHandlerConstructor;

This program is supplied with a set of HTML files to edit and a base-folder to pre-pend in front of all domain-local absolute paths (that is, links beginning with “/”).
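As a rough illustration of that rule, the decision for a single link can be sketched as below. The function name `rewriteHref` and the example folder are hypothetical, `path.posix` is used only so the result is the same on every platform, and the guard for protocol-relative URLs (`//host/...`) is an assumption about what should be left alone.

```typescript
import * as path from "path";

// Sketch only: decide what a single href should become.
// `rewriteHref` is an illustrative name, not part of the real utility.
function rewriteHref(href: string, baseFolder: string): string {
    // Relative links and full URLs do not start with "/": leave them alone.
    if (!href.startsWith("/")) {
        return href;
    }
    // Assumption: "//host/path" points at another host, so it is not
    // a domain-local path and is left alone as well.
    if (href.startsWith("//")) {
        return href;
    }
    // Domain-local absolute path: pre-pend the base folder.
    return path.posix.join(baseFolder, href);
}

// Hypothetical local output folder.
const exampleBase = "/home/me/site/out";

console.log(rewriteHref("/main.css", exampleBase)); // /home/me/site/out/main.css
console.log(rewriteHref("../about.html", exampleBase)); // ../about.html
console.log(rewriteHref("https://example.com/", exampleBase)); // https://example.com/
```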

We’ll start with some boiler-plate to loop over input files and do file-IO.2

export async function main(args: { files: string[]; baseFolder: string }) {
    const { files, baseFolder } = args;
    if (baseFolder === "/") {
        // nothing to do
        return;
    }
    await Promise.all(
        (files || []).map(file =>
            mapPathsInFile(file, path.resolve(baseFolder)),
        ),
    );
}

async function mapPathsInFile(file: string, baseFolder: string): Promise<void> {
    const htmlContents = await fs.readFile(file, "utf8");
    const updatedContents = updateFileContents(htmlContents, baseFolder);
    await fs.writeFile(file, updatedContents, { encoding: "utf8" });
}

And next is the fun bit. I think this is a nice, small example of how htmlparser2 and domhandler can be used to get things done.

It would be more efficient to use some sort of event-based or stream-based HTML processor, but this looked like the easiest thing to get started with, and I doubt the cost of constructing a document-object-model will end up mattering much for a site like mine.3

function updateFileContents(htmlContents: string, baseFolder: string) {
    let htmlResult: string = htmlContents;
    const handler = new DomHandler((err, dom) => {
        if (err) {
            throw err;
        }
        // Find all anchor and link tags, update href
        for (const link of domutils.findAll(isLink, dom)) {
            if (domutils.isTag(link) && domutils.hasAttrib(link, "href")) {
                const origHref = domutils.getAttributeValue(link, "href")!;
        
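                // Skip relative links, and protocol-relative URLs ("//host/...")
                // which point at another host rather than a local path
                if (!origHref.startsWith("/") || origHref.startsWith("//")) {
                    continue;
                }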

                // Add absolute path to base
                link.attribs.href = path.join(baseFolder, origHref);
            }
        }

        // Wrap the call so map's index argument isn't passed along as options
        htmlResult = dom.map(node => domutils.getOuterHTML(node)).join("");
    });

    const parser = new htmlparser.Parser(handler);
    parser.write(htmlContents);
    parser.end();

    return htmlResult;
}

function isLink(elem: DomElement): boolean {
    return domutils.isTag(elem) && (elem.name === "a" || elem.name === "link");
}

The typings for domutils are only a little bit interesting so I’ve left them off to the side over here.
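For readers without the libraries to hand, here is a self-contained sketch of the same find-and-edit pass over a tiny hand-built tree. The node types and the `findAll` helper are stand-ins written for this example; they only loosely mirror what domhandler and domutils provide and are not their real typings or implementation.

```typescript
import * as path from "path";

// Stand-in node types (assumptions for illustration only).
interface TextNode { type: "text"; data: string; }
interface TagNode {
    type: "tag";
    name: string;
    attribs: Record<string, string>;
    children: DomNode[];
}
type DomNode = TextNode | TagNode;

// Depth-first search collecting every node that passes the test.
function findAll(test: (n: DomNode) => boolean, nodes: DomNode[]): DomNode[] {
    const found: DomNode[] = [];
    for (const node of nodes) {
        if (test(node)) found.push(node);
        if (node.type === "tag") found.push(...findAll(test, node.children));
    }
    return found;
}

function isLinkNode(node: DomNode): boolean {
    return node.type === "tag" && (node.name === "a" || node.name === "link");
}

// Pre-pend baseFolder to every href on <a> and <link> that starts with "/".
function rewriteLinks(dom: DomNode[], baseFolder: string): void {
    for (const node of findAll(isLinkNode, dom)) {
        if (node.type !== "tag") continue;
        const href = node.attribs.href;
        if (href !== undefined && href.startsWith("/")) {
            node.attribs.href = path.posix.join(baseFolder, href);
        }
    }
}

// A tiny document: one stylesheet, two anchors and an image.
const stylesheet: TagNode = { type: "tag", name: "link", attribs: { rel: "stylesheet", href: "/main.css" }, children: [] };
const absoluteAnchor: TagNode = { type: "tag", name: "a", attribs: { href: "/posts/index.html" }, children: [{ type: "text", data: "Posts" }] };
const relativeAnchor: TagNode = { type: "tag", name: "a", attribs: { href: "other.html" }, children: [{ type: "text", data: "Other" }] };
const image: TagNode = { type: "tag", name: "img", attribs: { src: "/logo.png" }, children: [] };
const sampleDom: DomNode[] = [
    { type: "tag", name: "head", attribs: {}, children: [stylesheet] },
    { type: "tag", name: "body", attribs: {}, children: [absoluteAnchor, relativeAnchor, image] },
];

rewriteLinks(sampleDom, "/home/me/site/out"); // hypothetical output folder
console.log(stylesheet.attribs.href, absoluteAnchor.attribs.href, relativeAnchor.attribs.href, image.attribs.src);
```

Note that only `href` on `a` and `link` is considered, so an absolute `src` on an image or script would pass through unchanged, both in this sketch and in the function above.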

Makefile

I already had a step in the Makefile to call into Pandoc to generate HTML, so it was pretty easy to fix up all of the links at the same time:

site_base_folder ?= $(site_output)

$(site_output)/%.html: $(site_source)/%.md $(template) $(utilJs)
    mkdir -p $(@D)
    pandoc $(PANDOC_OPTS) $< -o $@
    $(util) fix-paths $@ --baseFolder $(site_base_folder)

This calls the above script to pre-pend our output-folder-path to the front of all absolute paths in links, which makes the build work when I preview it on my local machine.

This isn’t what I want when building the production site - so I override the site_base_folder variable in my CI configs:

variables:
  site_base_folder: /

(This is in the build stage of the site’s gitlab-ci.yml file.)

Et voilà!


  1. Presumably I could have used dts-gen to build these in a separate file but I was having trouble making that work.↩︎

  2. I never actually pass more than one file at a time, but I didn’t know it was going to end up that way …↩︎

  3. If I switched over to Go as the base language, it looks like the html package has a tokenizer with a built-in String() method that does everything I want, but I didn’t want to add a separate language’s tooling and ecosystem to this project.↩︎