Coda to the discussion on converting the HTML s6 documentation

From: Alexis <flexibeast_at_gmail.com>
Date: Wed, 02 Sep 2020 19:59:10 +1000

Hi all,

i've received an email offlist asking some clarifying questions
about automating the conversion of the current HTML s6
documentation, and i thought it might be useful to post some of
the things i noted in my reply.

The issue isn't that the HTML is unparseable (it's not). A tool
like `pandoc` can be used to convert the pages into other formats,
including roff. Over at Void, we recently tried to make use of
`pandoc` to create a man page for Érico's neat `void-docs` script,
which allows viewing the Void Handbook locally in a number of
formats. What i found is that pandoc produced roff that was fine
visually, but which relied on presentational rather than semantic
markup. i'll return to this issue below.

The issue is twofold:

* Things like bare "<em>" tags (i.e. without a 'class' attribute
  describing their contents) are used in the HTML to convey
  multiple types of information that mdoc/roff
  distinguishes. Sometimes an "<em>" is used for an argument (Ar
  in mdoc), sometimes it's simply used for emphasis (Em in
  mdoc). Similarly, bare "<tt>" tags are used for a path (Pa in
  mdoc), function types (Ft in mdoc), functions (Fn in mdoc),
  libraries (which could have a man page that should be
  cross-referenced with an Xr macro), and so on. A human is needed
  to decide the semantics involved (e.g. for Casper's putative
  IL), based on context.

* Many things /simply aren't marked up at all/. The example i gave
  in my earlier post was environment variables: again, a human is
  needed to decide whether something in ALLCAPS is an env var, a
  cpp macro, or something else altogether (like a reference to the
  'TAI64' concept).
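
To illustrate the second point, here is a minimal Python sketch of
what an automated pass can and cannot do with unmarked ALLCAPS
words. The sample sentence and the name 'BUFSIZE' are made up for
illustration and are not quoted from the s6 documentation; the
function name is likewise hypothetical.

```python
import re

# Hypothetical sentence, written for illustration only.
SAMPLE = (
    "If GID is set, the program drops privileges; timestamps use TAI64 "
    "labels, and buffers are sized by BUFSIZE at build time."
)

# An uppercase letter followed by one or more uppercase letters,
# digits or underscores, delimited by word boundaries.
ALLCAPS = re.compile(r"\b[A-Z][A-Z0-9_]+\b")


def allcaps_candidates(text):
    """Return ALLCAPS tokens in order of first appearance, without duplicates."""
    seen = []
    for token in ALLCAPS.findall(text):
        if token not in seen:
            seen.append(token)
    return seen


print(allcaps_candidates(SAMPLE))
```

The pattern finds the candidates, but nothing in the text says
which one is an env var, which is a cpp macro, and which is a
concept: that classification still has to come from a human.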

The question might be asked: "Well, who cares? Why care about
semantic markup? As long as the visual output is the same, what's
the issue?" Two things:

* Having the documentation source use semantic markup as much as
  possible facilitates conversion between formats. `mandoc(1)`
  doesn't only output man pages from mdoc source: it can also
  produce HTML (used on man.voidlinux.org, with some custom CSS
  for Void theming), PDF, PostScript, Markdown and plain ASCII. So
  if things like flags, arguments, paths, environment variables,
  variable types, variables, function types, functions etc. are
  marked up in the mdoc source, a PDF (for example) can be styled
  appropriately for each case.

* Additionally, extensive semantic markup has a direct benefit to
  end-users: the ability to use the functionality of `apropos` to
  find appropriate content. For example, say one wished to find
  all uses of the 'GID' env var in the s6 man pages. One could use
  `apropos 'Ev=GID' | grep s6-`. (This sort of use-case is part of
  why i've made sure the names of all the man pages i'm creating
  are prefixed with "s6-".) Similarly, one could search
  for all mentions of the 'notification-fd' file with `apropos
  'Pa~.*notification-fd'`, with the '~' indicating an extended
  regular expression. However, this won't work without the
  relevant markup in the sources.
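
As a rough illustration of why the markup matters for searching,
the Python sketch below pulls the arguments of a given macro out of
a small mdoc fragment. It is not how `apropos` works internally
(that queries a database built by `makewhatis`), and it ignores
most of mdoc's real parsing rules; the fragment and the function
name are hypothetical.

```python
# Hand-written mdoc fragment, for illustration only.
MDOC_SAMPLE = """\
.Sh ENVIRONMENT
.Bl -tag -width Ds
.It Ev GID
Hypothetical description text that mentions UID without markup.
.El
.Sh FILES
.Bl -tag -width Ds
.It Pa notification-fd
Hypothetical description text.
.El
"""


def macro_args(source, macro):
    """Return the word following each use of `macro` on a macro line.

    Crude on purpose: only lines starting with '.' are treated as
    macro lines, and quoting and multi-word arguments are ignored.
    """
    found = []
    for line in source.splitlines():
        if not line.startswith("."):
            continue  # plain text line: no semantic information to index
        tokens = line[1:].split()
        for i, tok in enumerate(tokens[:-1]):
            if tok == macro:
                found.append(tokens[i + 1])
    return found


print(macro_args(MDOC_SAMPLE, "Ev"))
print(macro_args(MDOC_SAMPLE, "Pa"))
```

'GID' and 'notification-fd' are found because they are arguments
to Ev and Pa; the 'UID' in the running text is invisible to a
search keyed on Ev.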

Fwiw, my suggestion, for those interested in converting the
documentation to One True Format as decided by Laurent, would be
to leverage my efforts to use semantic markup extensively in the
man pages. Once the s6-man-pages repo is ready, use `mandoc -T
html` to convert the pages to HTML, which will contain consistent
semantic markup (e.g. '<h1 class="Sh" id="DESCRIPTION">'). That
HTML can then be parsed and converted to the One True Format, an
authoritative source from which man pages and HTML can be
produced.
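
A minimal sketch of the parsing step, in Python, might look like
the following. The HTML sample is hand-written to resemble the kind
of class-annotated output described above and is not actual
`mandoc -T html` output; the class and function names in the code
are hypothetical, and the choice of element names in the sample is
an assumption. The collector only looks at 'class' attributes.

```python
from html.parser import HTMLParser

# Hand-written sample; element names are assumptions, only the
# 'class' attributes matter to the collector below.
HTML_SAMPLE = """\
<h1 class="Sh" id="DESCRIPTION">DESCRIPTION</h1>
<p class="Pp">Reads <code class="Ev">GID</code> and writes to
<span class="Pa">notification-fd</span>.</p>
"""


class SemanticCollector(HTMLParser):
    """Collect (class, text) pairs for elements with a wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []  # one entry per open element: its class, or None
        self.items = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(cls if cls in self.wanted else None)

    def handle_endtag(self, tag):
        # Assumes every element is explicitly closed, as in the sample;
        # unclosed void elements such as <br> would need extra handling.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] is not None:
            self.items.append((self.stack[-1], text))


def collect(html, wanted):
    parser = SemanticCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.items


print(collect(HTML_SAMPLE, ["Sh", "Ev", "Pa"]))
```

Each pair carries the mdoc macro name as its class, so a converter
can map 'Ev', 'Pa' and so on to whatever the target format uses,
with no guessing from context.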


Alexis.
Received on Wed Sep 02 2020 - 09:59:10 UTC
