Its not US and THEM, its a TEAM, stupid

hi y’all

I have just been reading some posts on Scholarly Kitchen about content creation and the next wave of authoring systems.

It seems the STM sector has long been in need of developing a solution to get their publishing processes out of various traps. The most obvious trap is MS Word. A horrible format, to be sure, but it has long been the default file format for manuscript production, with a small tip of the hat to LaTeX for the technologically gifted. Their recent discussions have mainly been about online ‘authoring systems,’ going beyond MS Word to anticipate documents that are fully transparent to whatever combination of machine and human interactions play a part in understanding and processing the information.

This ‘get-out-of-MS-Word-free’ card is a very attractive proposition. MS Word is basically a binary blob to most publishing systems (even though in actual fact the format of .docx is XML -thereby also abruptly ending the false argument that XML inherently brings structure). As a ‘binary’ (go with me on this for now) MS Word is not transparent to the publication system, there is no record of when the author has worked on it; finding out what they have done since version xxx.xxx and version xxxx.xxxxxxxx is very difficult; nobody else can work on it when the author is also working on it, and there is no control over structure etc etc etc

So getting away from reliance on MS Word is the aim. But getting into (what I might rename as) an ‘authoring only‘ platform – is not the solution.

What is interesting about the SK forum, is that there seems to be a very clear distinction in the minds of publishers between the worlds of the author and the publisher. Most of the comments make this split, and there is much talk of ‘authoring systems’.

It seems a little bizarre to me, as I don’t think it’s wise to think about the author and the publisher as being distinct entities. It’s not a matter of author and publisher working on separate processes to shepherd a manuscript through to publication: it is very much a team effort. Authors and publishers work together in a way that should not be dichotomised: they are a team.

If we don’t acknowledge that, then we will not be able to design good publication systems. There is a lot of unclear thinking around this topic at the moment. The “authoring system” model assumes that content is made in an authoring system by a writer, and then migrates to the publisher’s submission, processing and publishing system, where the publisher does some stuff, and then at various times pings the author back to make changes to metadata, submission information, the manuscript and attendant assets (eg figures)…

In this model, next the author takes the manuscript out of the publisher’s system, ingests to the old authoring system, works on it, exports it, and re-ingests it into the publisher’s system… Hmmm…this cycle is exactly one of the pain points we were trying to avoid by getting away from Microsoft Word.

It seems to me that the current trend to build better authoring systems is a mistake. It is based on the false assumption that ‘MS Word’ is the problem, without realising that there is more to it. Word has been seen as the problem only because it has been the only problem in town. We don’t need better ‘authoring systems’ that repeat the separation between writing and publishing that is inherent in reliance on MS Word. We shouldn’t invest in new authoring systems and believe in them purely because they are ‘not Microsoft Word’. Rather, we need documents to be contained within submission and processing systems for the entire duration of their life, and they need to be completely operational and transparent within that system to all parties that must work on them. Without understanding that need, we are merely mitigating the problem by small steps whilst fooling ourselves that we have solved the larger problem.

We don’t want the author-publisher response/change cycle (a collaborative effort by the team which includes author and publisher) to be in separate systems. We want them working together in the same system. We need teams to work together in the most efficient way possible, and that is in the same (real world- or cyber-) space. Teams work best when they work in the same *space.

Though I see the current efforts towards authoring system development to be interesting, unless they are integrated with processing and workflow features, they will sooner or later be made redundant.

Colophon: written by Adam in 30 mins in a tizz. Tinkered with by Raewyn for another 30 mins. Written using Ghost software (free software!)

Building Book Production Platforms p2.

Amongst the core requirements for a book production platform are the source file format and the editor, and of course, these are intimately linked. The development team is usually faced with choosing the format first, then the editor.

Choosing a format

The choice is pretty much HTML? or not HTML?

Currently, HTML is the ruling choice of format for a web-based book production platform. HTML is native to the browser and has associated standards-compliant support, such as CSS and javascript. Inversely, not choosing HTML puts you in a bit of a hole and can create a lot of overhead.

It might be interesting to look back a little and learn from some others since there have already been projects in this space that started down non-HTML roads and then gave it up for HTML. Kathi Fletcher, originally the project manager and technical director for Connexions (now OpenStax) which built a custom XML editing environment for academic materials, later researched in-browser XML vs HTML editing environments for her Shuttleworth Foundation-funded OERPUB project. Kathi became convinced HTML was the way to go and did some great work on HTML editor usability with the Aloha HTML editor.

We have chosen to use HTML5 as the canonical format for open textbooks, because developers and tools are more plentiful for web technologies than XML technologies.

http://www.w3.org/2012/12/global-publisher/statements-of-interest/29-oerpub.html

The (closed source) O’Reilly Atlas platform also started with the complex AsciiDoc format (a form of markdown) and eventually awoke to the power of HTML in 2012.

HTML5-based authoring offers a streamlined production workflow for producing both print and digital outputs, facilitates “digital first” content development, and is a perfect fit for creating a WYSIWYG, web-based writing experience.

http://radar.oreilly.com/2013/09/html5-is-the-future-of-book-authorship.html

They then got an extra dose of religion and started a project called HTML Book which is a suggested ‘spec’ for a subset of HTML elements to be used in books.

So far I have not seen a book production platform travel the reverse direction, from HTML to something else. Instead, we are seeing more and more platforms start with, or change to, HTML as a source file format.

Markdown

Markdown is sometimes put forward as the way to go but I’m not going to go into that in too much detail here. I have talked about this elsewhere. The only additional thing I will say is that markdown causes even more issues for book production platforms than those included in that article. Namely, in an in-browser markdown environment, the markdown will most likely be displayed as rendered HTML next to the authoring pane. That is a huge amount of lost screen space and extra UI junk for no apparent gain. Think of the UX cost. If you don’t have that rendered display then you will most likely only see pure markdown in a text field with no rendered display. The user won’t really know if their document looks right until it is rendered somewhere down the line, which is also a tremendous cost to the user for no apparent gain. Markdown: all pain, no gain.

NB: There is a possible good use case for markdown as a helpful add-on for HTML WYSI editors but I will cover that later.

LaTeX

There is a more valid use case for LaTeX in the browser since some scientists and academics will never use anything else, and you’ll never convince them to adopt HTML regardless of the benefits. You are up against the great Church of Knuth and I don’t fancy your chances. If your audience is comprised of LaTeX addicts, then I think you have no choice other than to support that.

Many times I have talked about remedies for unstructured MS Word documents (for scientific manuscripts) only to have someone earnestly comment that if everyone just learned LaTeX we would be in a much better position… They might be right, but I’m pretty sure it’s never going to happen.

The preference for LaTeX is a legacy issue, and problematic, but needs to be dealt with. (Unfortunately, today’s Markdown heroes are growing legacy issues like this with each passing day, and that is going to cost us down the road).

Recently there has been some interesting work on in-browser LaTeX editing including the (closed source) Authorea platform and, most notably the (open source) ShareLatex platform. ShareLatex round trips the LaTeX syntax displayed and edited in a text area (in the browser), renders that to a bitmap on the server, and returns it to the browser for a side-by-side ‘WYSIWYG’. The effect is that you can see a just-in-time rendered view of the LaTeX as you type. It’s a neat trick and effective if you insist on LaTeX in a web-based platform. Then you just have to live with the UI costs. However ,you only need this approach if you wish to support the full LaTeX syntax. If you wish to just support LaTeX equations, you can use an HTML editor with a LaTeX plugin based on MathJax or the Khan Academies KaTeX(and there are some other solutions such as Mathoid).

Incidentally, if you need to support full LaTeX I highly recommend checking out ShareLaTeX over WriteLaTeX. They both have the same approach but WriteLaTeX is proprietary whereas you can pick up the ShareLaTeX code and integrate it straight away. You could even build your own ShareLaTeX-like interface, it’s not too tricky – together with a colleague – Rizwan Reza – and I (Riz did all the hard work) we managed to develop a workable prototype in about 2 days, but there are many gotchas setting up the LaTeX compiler correctly.

Not many book projects need LaTeX, so I will leave this as an interesting edge case. There are solutions if you need it, but not many people need it.

XML

I think I will just leave it to the words of the brilliant Dave Cramer (Hachette Book Group):

So we’ve chosen to describe our content with HTML, and build our production system around HTML.

When I tell people that, they smile condescendingly, and chuckle a bit. “That’s cute. Why don’t you use real XML?”

I then ask them what you can do in Docbook (or TEI, or NLM) that you can’t do in XHTML? I haven’t heard a good answer to that question yet. XHTML is XML, by definition. Calling something “para” rather than “p” doesn’t get you anything, except carpal tunnel syndrome and invoices from consultants

The problem with non-HTML XML is that it is essentially just XML the browser can’t use. Hence you lose all that other good stuff like WYSI editors, CSS design tools, cool tricks with JavaScript, and all the cool tools that are being developed for HTML. XML just can’t compete, plus you are going to need to convert the XML into HTML anyway. So don’t make life more complicated than it already is – continue your love affair with XML as long as it’s XHTML!

HTML

HTML is king in the browser and it gives you all you need to make books. I don’t want to spend a lot of time arguing the merits of HTML in this post as there is a lot to say and I want to bring that in at other points of the conversation. But in brief:

  • HTML is supported by JS and CSS.
  • The DOM is known natively by the browser.
  • HTML is standards-based.
  • It is straightforward.
  • HTML is easy to read and easy to clean.
  • HTML is the most popular file format on the planet.
  • You can use HTML to build structure in documents with assigned class and id values, or microdata formats.
  • HTML is the native file format for EPUB.
  • PDF can be rendered directly from HTML in the browser (more on this later).
  • HTML can be paginated in the browser.
  • CSS is moving towards supporting more and more page based elements.
  • The browser can act as a design environment.
  • You can create real what-you-see-is (WYSI) production environments.
  • Basic editing is built into the format itself.
  • HTML is supported by an enormous number of tools for conversion (in and out).
  • HTML is supported by an enormous repository of examples (the web).
  • HTML is cheap to develop with.
  • Even book designers are getting used to it.
  • Some schools teach it.
  • It has a million free tutorials online to help you use it.
  • A lot of people know HTML.
  • HTML is supported by a rapidly proliferating body of JavaScripts for typography, graph production, animation, interactions, dynamic rendering etc etc etc etc

The basic idea really comes down to this.

  • HTML is the cheapest format of our time.
  • HTML is the most popular format of our time.
  • HTML is the networked document format of our time.

Increasingly HTML is the way stories are told, whether that is in books or on the web. It’s a trite analogy perhaps, but HTML is the paper of our time. As Dave Cramer says:

why start with something other than HTML, when you have to turn it into HTML anyway?

It should be noted that Cramer also turns HTML into paper, and the Hachette Book Group have produced many beautiful paper books using HTML as the source format. Many of these books you will now find in the best-selling sections of your local brick and mortar bookstore.

Other print producers are also using HTML as the source. Print-on-demand services, used to producing very ugly books by ingesting MS Word and dealing with all that ugly conversion, are also adopting HTML production environments. Books on Demand, Germany’s largest Print on Demand service, adopted Booktype so their customers could have an easy in-browser book production environment. The source format is HTML but the users don’t know that, and the books look better. That’s the beauty of HTML.

Finally, helped a lot by the efforts of Dave Cramer and the Hachette Book Group, Sourcefabric, the people at O’Reilly, and others adopting HTML, we might be starting to see the very beginning of the changing of the guard.

HTML is the way to go for Book Production Platforms. If you choose another format you will find you inherit a lot of costs and additional overheads and, sadly, you will soon be left behind. There is just no format going forward at the same speed as HTML. Not even close. So, my advice is to first ask the question – can HTML do what you need? Push your team to answer that question. Will format X give you anything HTML can’t? As an exercise ask your team to prove HTML is a bad choice, and if the answer is not-HTML, then contact me and let me try and talk you into it!