The Case for HTML Word Processors

If you like my thoughts on formats, publishing systems, development methods, and open source subscribe to my newsletter.

Making a case for HTML editors as stealth desktop word processors… the strategy has been so stealthy that not even the developers realised what they were building.

We use all over-complicated software to create desktop documents. Microsoft Word, LibreOffice, whatever you like – we know them. They are one of the core apps in any user’s operating system. We also know that they are slow, unwieldy and have lots of quirky ways of doing things. However, most of us just accept that this is the way it is and we try not to bother ourselves by noticing just how awful this software actually is.

So, I think it might be interesting to ask just this simple question – what if we used desktop HTML editors instead of word processors to do word processing? It might sound like an irrational proposition… word processors are, after all, created for word processing. HTML editors are for creating…well, …HTML. But let’s just forget that. What if we could allow ourselves to imagine we used an HTML editor for all our word processing needs and HTML replaced .docx and .odt and all those other over-burdened word processing formats. What would we win and what would we lose?

The first thing to recognise is that word processors and HTML editors actually look and work in kinda the same way. They have a big blank page to start with – the empty text canvas. They have similar toolbars with similar tools. They both essentially just allow you to write words on a page and place other stuff on it. You can also change font sizes, styles, colours, backgrounds etc and add images, tables, whatever you like.

There are two big differences that are apparent in their interfaces, however:

  • in a Word Processor there are nice margins to write within
  • in an HTML editor there is a ‘view source’ allowing you to see the markup behind the display version

Seeing the source is quite nice for those that know how to edit raw HTML. That is certainly an advantage, but for those that do not know how to edit HTML then this difference means relatively little. However, having margins in the Word Processor feels pretty necessary. It is the legacy from the age of print that we just can’t seem to shake. We still need electronic word processors to create interfaces that conform to the standard paper sizes of our region. In the US, it’s default US Letter, and in Europe, it is A4. As crazy as it sounds, margin-less word processing is going to take a long time to take off because of our legacy attachment to paper. That is why Google Docs looks like a page. It doesn’t make sense but it does make a difference, especially in adoption.

The good news is… adding margins to an HTML editor is easy because we can just add CSS to the document and there you go… in fact I think you can add quite nice margins, much nicer (and easier to change) than you do for the typical word processor document. If you define the print region CSS to the page size you want to have printed, then HTML docs can look, feel, and work pretty much the same way word processor docs do. With some CSS trickery, it is even possible to include pagination.

But what about storage? Word processors store nice single files on your computer. HTML files, however, have all these messy attachments. CSS files, JS, images… scattered all over the show.

But but but!…. a .docx file, the format created by the latest MS Word, is just a compressed archive containing many files. It is actually a zip file. You can try this for yourself. Grab a .docx file, change the suffix from ‘.docx’ to ‘.zip’ and then open it with whatever you use to open zip archives. Taadaa! A folder containing a whole bunch of XML files and other crufty stuff. We think of .docx as a file but it is not, it is a collection of files stored in a compressed container (zip).

So, isn’t that cheating a little? It’s a cheap way to clean up a file system and lucky for us HTML has a companion technology that does just the same thing – EPUB. EPUB is an ebook format which is also just a zip file. You can do the same trick to open an EPUB as you used to open a .docx file. So why not use EPUB as a local storage format for desktop word processing?

But HTML editors don’t allow you to export to EPUB…well, that is one of these side steps we will have to recognise if we went down the path of the HTML word processor. We must think about the page, as well as the application as being the component that offers user functionality. If we can make this conceptual side step then we can see that it would be quite possible to add EPUB export functionality to the document using Javascript… we don’t have to build these features into the core application. The tools are already there… there are plenty of desktop HTML editors (BlueGriffin, Kompozer, Dreamweaver (eek! proprietary!) etc) out there we just need some really smart looking and easy to use, feature-full templates (HTML files)…

So, could an HTML editor with nice margins, and output stored as EPUBs on your file system to keep things clean, be used as a tool for word processing? I just can’t see a reason why not. The main thing in the way is our own stupidity. We think that:

  • HTML is for the web
  • HTML editors are for creating ‘web pages’
  • EPUBs are for ‘ebooks’

But these are conventions. Conventions don’t have to stand. We can pull them down if they don’t make sense and these particular class differences just don’t seem to make much sense. We are making very stupid category mistakes and it is preventing a lot of innovation and efficiencies.

If we could break the way we think of HTML editors down a little and re-imagine them as document creators whose format happens to be HTML, then we would get to some very interesting places very fast. It would help us break free of lock-in legacy ‘ways’ rained down upon us by the creators of out of date technologies like LibreOffice and MS Word.

What’s Wrong with Markdown?

Markdown (.MD) is a text format that lazy people use to write HTML. Unfortunately once those same lazies are used to the format, their eyes glaze over and they start to believe .MD is the solution for all the world’s problems. They share a lot in common with githubbies who think github is the solution for book production, open source, bad democracies etc..

.MD files are common in the geek world. Programmers love them. The design of .MD is simple and efficient. If you know the syntax, you can write basic text documents with headers and bullet lists, blockquotes, bold & emphasis etc. pretty quick. That makes it a handy tool for the elite of text workers – programmers – to develop simple text documents, quickly. So it’s a popular format for writing, for example, human-readable README files that tell you a little bit about the software you are about to install. However, that is where the use case ends. .MD is not a useful format for many other cases unless you want to prove to the programmers that you too can do tricky stuff in plain text. For the rest of us, it has little value.

Markdown was originally developed by John Gruber in 2004 and you can read about some of the reasons why he developed it here. The original purpose of .MD is that it can be read without converting it to another format like HTML. Presumably, John Gruber wanted a format that could be read easily by the eye, allowing the user to be able to quickly understand which part of a text is a heading, which part is a list, which part is a paragraph etc .MD is designed not to be rendered for display, it is meant to be the display.

For example, a list written in .MD would look something like this:

* item 1 
* item 2
* item 3

That actually looks like a list. In HTML we would do something similar and it would look like this:

  • item 1
  • item 2
  • item 3

An asterisk looks like a bullet, so there is little cognisant drag here. Pretty readable.

However, things start to fall apart pretty fast. Can we really parse at speed, for example, nested bold and emphasis in .MD like this:

The quick *brown fox* **jumps** *over* the lazy **dog**  

Is that or the following easier to read?

The quick brown fox jumps over the lazy dog

I’m going to say the second is easier to read. Way easier to read. So, that is just the start of the problems; from here on in it goes downhill pretty quick for use cases beyond simple READMEs.

MarkDown isn’t designed for creating HTML

So, let’s assume we can agree Markdown is readable in a very limited number of scenarios – and move on to rationalise (grasp at straws) other needs for the format. That is pretty much where we are today, with the big selling point being that Markdown is an easy way to create HTML. But let’s face it, even programmers don’t like reading raw Markdown and even in the most popular .MD repository of them all – github – the Markdown files are rendered in the browser as nicely formatted HTML. Great! A use case we can stand behind – use Markdown to create HTML.

However .MD is really a pretty bad way to create HTML. Firstly, you need something to convert the .MD into HTML. So if you use just a plain text editor to create .MD files and load it into the browser you will see just plain, boring Markdown. No nicely formatted documents for you. There are tools that programmers like, and so the rest of us are also expected to like them, for converting .MD to .html. After all, according to the technically gifted, converting a .MD file to HTML is “really really easy.”

One of the most common tools for doing this is Pandoc. Pandoc is a great software and extremely useful. However having to install and learn how to run Pandoc – a complex tool at best – to convert a text file to something readable – sounds a little like the long way home. And  that’s not where the rot ends, far from it, the rot has only just set in and the worst is to come. If everyone was to use Pandoc to convert .MD to HTML we would have consistent conversion results. Unfortunately, that’s not what happens. Each to their own, and we have a lot of different tools with which to do these conversions, and hence we have different results created from the same source. Ugh. That is a file format nightmare right there.

And let’s say you want to add a little colour to your text. Perhaps a highlight? Forget about it. Markdown lacks the tools to enable you to do it. Pandoc might help, however – let’s add some colour highlights to a code block with this easy to remember command from the Pandoc manual:

pandoc code.text -s --highlight-style pygments -o example18a.html

So, first of all, do you know what a command is? Do you know what a terminal is? Happy using one? Oh..that’s actually not ok for you? No problems, there are plenty of online introductory courses on the command line. So before you write that funding document, “about” page on your website or scholarly research document – just whip through a quick course on the command line and you’ll be all set! (don’t forget to read the sections about installing software from the command line, you’ll need it to get Pandoc working).

Problems with conversion tools aside – Markdown struggles to find a nice way to represent HTML. It’s just a bad fit. Use Markdown for creating HTML and you will find all sorts of little formatting gotchas that will cause you frustration. It is why many markdown environments/conversion tools also support HTML tags.

All HTML is valid Markdown. If you’re stuck, not able to format your content as you would like (for example using tables), you can always use plain HTML instead of Markdown. http://support.ghost.org/markdown-guide/

So if you want to really write HTML with Markdown you have to, well, write HTML. Klaro.

Markdown was never intended for writing HTML. It wasn’t designed that way and for good reason – it doesn’t do it well.

Codified text

As mentioned above, by design, the original markdown has a very small subset of elements that can be converted to HTML. As John Gruber says in his philosophy:

Markdown is not a replacement for HTML, or even close to it…The idea for Markdown is to make it easy to read, write, and edit prose.

So, Markdown is not actually designed to be a good format for creating HTML. And it lives up to its design. It is for this reason that some Markdown formats ‘extend’ Markdown to include HTML code, and there are also other forks of Markdown that do some really weird stuff that I can hardly explain. For example, Ghost Markdown, the version of Markdown used for the (Open Source) Ghost blogging platform, tries to wrangle image formatting into Markdown. To place an image you have to write the following:

![]()

Intuitive, right? Nope.

The above is really a leap from ‘readable’ to ‘codified’. It is codified text and in order to be able to work with it, you need to know how to de-code the text… I’m sorry, but I just don’t get it. Markdown adds another level of codified complexity which I then need to de-code first (according to non-standardised, and not-standard rules written in some help file somewhere if I’m lucky), so that I can then sally forth and read the content? No thanks.

Say no to codified text.

Non-standardised formats suck

Efforts to take Markdown and extend it to meet a wider variety of formatting needs are actually where the big trouble starts. Markdown has gone off in a hundred-and-one different directions, each with its own syntax.

That means, if I want to write a Ghost blog (I love Ghost by the way, no disrespect to them) in Markdown (their required format) then it is not enough to learn Markdown. I must learn Ghost markdown …their particular reading of what a good markdown format is… So, that leads us to one of the really big problems. Markdown is not standardised.

Can any of us think of another non-standardised text format and where that leads us? Does MS Word and ‘world of pain’ ring any bells? Yes, Markdown is non-standardised and that is a very big no-no. It is, in fact, quite shocking that programmers, big on standards, do not quite see that by advocating Markdown they advocate dropping some central best practices. Can’t say anything more about that really.

But Markdown is structured!

I often hear the term “structured text” when referring to Markdown. For example, the opening lines from the CommonMark pre-amble.:

Markdown is a plain text format for writing structured documents

Sounds good doesn’t it? Sounds very techy and convincing. But what is structured text? Structured text means basically that we can see if something is a heading or something is a bulleted list, or something is a paragraph. Huh? But that describes just about any text document. Structured text is the basic requirement of any text you create – without it, you just have a flat plain-text document with no headings, no bullet lists etc. So… we might as well start every sentence about documents designed to be read as being ‘structured’. I think tomorrow I will go and buy a structured book. Or perhaps I will write a structured narrative on my text-structuring word processor. Excuse me word processor sales person, do you have structured text word processors? Ugh. Meaningless.

What is left?

Markdown is good for limited use cases. Use it for README files on github.

If you have a good dose of Markdown cool aid then don’t let me bring you down from your sugar high. Markdown away. However, if you have heard that there is this cool format available and it cures all your textual needs and it is just really really easy to use and really really quick… then think about the elixir you are being offered and re-read this document before slurping away…

Some comments on this article on Hacker News here:
https://news.ycombinator.com/item?id=8403783

MarkDown, ugh

MarkDown…lets call it what it is – A Less Useful MarkUp (ALUMU?)…We should also invent a new category for both ALUMU and MarkUp. I don’t have a good name for this superset but it can be characterised easily as “non-standardised, codified text”.

Why anyone would want to use non-standardised, codified text is very hard to understand. I thought we had learnt that mistake from MS Word.

Detained Text

Desktop documents such as MS Word and ODT could be called ‘detained text’ formats. A work might exist on the desktop of any number of people at the same time. However, if more than one person needs to work on that file, we have a problem. There is no way to synchronise the changes made on every copy of the file across all those desktops.

The work-around is for one person to work on the file and then email it to the next person. Hence the text is detained by the current person working on the document. All those waiting down the line must wait until the text is released to them. The further down the line they are, the longer they wait.

Detained text slows down any process where more than one person is involved – whether it be writing, processing, editing, reviewing or other such activities. Detained text slows down text production.

The Collaborative Spectrum

We often talk about collaboration in text production as if it is a switch with default position ‘off’. There is no in-between state. Either collaboration is turned on, or it is turned off.

Collaboration is a broad spectrum of activities and affects all text production. Any given work can be discussed as a position on a spectrum of collaborative activity. At one end we have ‘weak collaboration’, and at the other ‘strong collaboration’. The character of production can be described by a collection of attributes which identify where it sits (at a given time) on this continuum.

Weak Collaboration

  • coordination through technical frameworks
  • little or no human coordination
  • coordination rather than collaboration
  • isolated producers
  • small contributions
  • no shared visions on the individual level
  • discrete parallelised production outputs

Strong collaboration

  • the technical framework is an enabler
  • strong human facilitation
  • intensely connected contributors
  • large contributions
  • shared vision
  • synthesised production

A text may move along this spectrum as it undergoes different stages of production.

Collaborative Knowledge Production

Collaborative production of any knowledge artefact could be called ‘Collaborative Knowledge Production’ or CKP. Understanding the right strategy for CKP is reliant on an understanding of the content to be produced, the greater context, and the resources available. From these, a process can be designed.