XSweet – Adam Hyde

Whats Involved in Building Editoria

We have a webinar coming up soon for Editoria. In anticipation of that I thought I’d write a little overview of what was involved in building the product. So… Starting with a little hand drawn schematic…

The above shows all the moving parts that we had to consider, and build, to make Editoria what it is. From the user’s perspective, it looks like one app – the Editoria interface – where they can upload a docx file (or files) and start work on producing a book.

From an ‘under the hood’ perspective, it looks like the following…

xsweet

As it happens, we had to build all of the above to make it possible… So, here is a list of all the parts starting from ingest…

Ingest

Ingest consists of 2 parts:

INK
XSweet

INK is a job processing workhorse. Essentially it will run any kind of job you want and ‘steps’ are written for specific kinds of jobs/processes. These are a kind of ‘wrapper’ around executable code. INK has an API that enables platforms like Editoria to send files to it and request a specific set of steps to be run, and then return the result. In this instance, INK runs a set of steps that wrap up XSweet.

XSweet is a set of scripts that take a docx file and convert it to HTML. It is very modular and can be pretty easily extended and improved. It can also be customised per use case.

So, on the ingest stage it looks like the following to the user:

user logins in to Editoria
they create a new book from the book dash
they vlivk on the new book
from the book builder interface they choose ‘upload multiple word files’ or ‘upload word’.
they select the word file(s) from a system dialog
these files are automagically appear ‘in the book’ and can be edited
1. note : in the case of multiple docs files, if the files have been named according to a simple convention, Editoria will create each new chapter and automatically populate the structure of the book and load the content in the right place

From the tech side the following happens:

when a word doc file or files are selected, the files are sent to INK via the INK API, with a request to convert to docx using the XSweet steps
INK runs the docx file(s) through the XSweet steps, creating HTML
when the job is done, INK signals Editoria that the files are good to go
Editoria grabs each file when it is ready and puts in the the right chapter, creating the chapter if multiple word files are uploaded at once

So… we had to build all of the above. First, INK exists because, crazy as it sounds, we could not find an elegant open source pipeline manager to manage jobs like this. So we decided to build from zero. In addition, which also sounds crazy, there is not a comprehensive open source library out there for converting docx to HTML that was easy to extend or customise for different use cases. We certainly know of many scripts for docx->html (many are listed on the XSweet site), but they are either unmaintained, or difficult to extend. So, unfortunately, we had to build that also.

Editoria App

The Editoria App is what people think of ‘as Editoria’. Everything the user interacts with is, on the surface, Editoria interfaces in the browser. However, Editoria comprises of 3 main parts:

PubSweet
the Editoria interfaces
Wax

PubSweet is the ‘under the hood’ CMS that enables apps like Editoria to be built on top. PubSweet is a Node/React application written entirely in Javascript and it sits on the server and provides a ‘wrapper’ for the Editoria interfaces. You can think of PubSweet as a ‘headless’ CMS that thinks like a publishing system. Once again, when we started (and still today) there was nothing out there that could realise multiple publishing use cases, as diverse as micropubs, books, journals, content aggregation etc, built on top of the same CMS. So we had to build it ourselves. PubSweet is now pretty mature and indeed supports multiple publishing platforms – at last count I think that totals 6 platforms built on top of PubSweet (and it is early days).

The Editoria interfaces are the ‘web pages’ that make up the Editoria application. This includes the dashboard, the book builder etc… these are all custom code written on top of PubSweet to realise the Editoria use case / platform.

Wax is an online editor, it is so sophisticated we prefer to call it a web-based word processor. It is actually an Editoria interface, except that it is so complex and such a critical part of any online publishing tool, that it is a product in its own right. Nothing like this existed either, so we built our own (typical WYSIWYG editors are not sophisticated enough for the publishing use case).

PagedJS

From Editoria, you can press a button and have a book appear in the browser. If you print from the browser, you will then get a PDF ready to take to the printers with all the typographical bells and whistles that books need. It looks magical. We currently use Vivliostyle for this but we have hit many limitations with the software. We reached out to the org that produces the code but they weren’t ready to collaborate to improve the product. Hence, unfortunately, we had to build our own. Soon we will replace Vivliostyle with this new offering. The speed of dev has been incredibly fast on paged.js (working title) and so watch this space for some exciting news.. there are some early demos in the pagedjs repo if you are interested.

Summary

…. So, as you can see, Editoria is more complex than it appears to the eye of the user. Also, the crazy thing is that most of these parts should have existed already, but, crazy as it is, they didn’t exist. There has been no full page, open source, editor out there gives the source control and extensability required for the publishing use case. The docx conversion tools available do not yet do a good enough job of the conversion, and are not easy to customise. There is no headless CMS that ‘thinks like a publisher’ and would help us build Editoria; and we had to create a pagination engine (paged.js) because the one open source option couldn’t give us what we needed and we failed at trying to help them improve the product.

Its not like we wanted to build all this, but the sad fact is for the publishing industry the web is still an innovation they haven’t embraced (except as a distribution medium). So while all of these parts (or many of them) should exist, the sad truth is they did not exist. We did a pretty good look around to see if we could save ourselves the effort before embarking on these projects. We did, for example, bank on Substance Lens Writer for our editor, but it was not maintained and proved unstable, and our eventual assessment was that it was better to start again – hence Wax. We also have gone a long way down the path of using and promoting Vivliostyle, until it became apparent that neither the technology nor the technology providers could get us to where we needed to be… consequently, after implementing Vivliostyle (it is still baked into Editoria) we started a new approach to this problem – Pagedjs.

So, we had to build all of this ourselves… the good news is, that all of these moving parts are reusable in other scenarios… so you can take whatever you like from the above, and add it to your own product and hopefully that will fuel the publisher sectors ability to innovate and transform their products and workflows.

XSweet 1.0!

We have 1.0 of the docx -> HTML transformation tool XSweet out today! It also has a new site:

http://xsweet.coko.foundation/

XSweet is a finely crafted tool. It takes docx files, those horrible mangy MS Word files, and translates them into clean, lovely, HTML. XSweet is open source, modular, and very nicely done.

A huge tip of the hat to Wendell Piez (XML guru), and to Alex Theg. As geeky as it sounds, I loved watching these two chat about the issues they encountered making this software. The attention to detail was really unbelievable. Amazing work. XSweet is a finely crafted tool.

More info on Coko https://coko.foundation/announcing-xsweet-1-0/

If you want to do a deep dive into why I think this is important, I wrote this some time ago – https://www.adamhyde.net/typescript-redistributing-labor/

Workflow Tweaking at Coko

Yannis and the Athens crew have been tweaking our workflow for the products built on top of PubSweet. This includes (at present) Editoria (book production system) and xpub-collabra (Journal system).

Essentially, after the initial design and build process we slip into faster iterations where I step out of the way and the ‘client’ and developers work in very fast iterations to tweak features and fix bugs. We loop in Julien for UX as necessary. Then we move into a stabilization phase for landing the product.

Right now both Editoria and xpub-collabra are in the tweaks and bugs phase, and we have started adding a ‘roadmap’ to each of the READMEs to list what we are working on now. See:

https://gitlab.coko.foundation/xpub/xpub#roadmap

https://gitlab.coko.foundation/editoria/editoria#roadmap

We’ll be adding a similar table to XSweet and other products. It is a simple, readable, format.

The big 4

I’ve been in the business of trying to work out how to get Publishing (capital ‘P’) into the web. From the start, there have been some ‘big ticket’ items that have needed to be solved. Some are more urgent than others, but by and large we are cracking these nuts one by one. I have considered for a long time the big 4 to be:

MS Word to HTML conversion
an open source web-based word processor
paginated output via the browser to print-ready copy
in-browser designer

1,2 an 3 are the ‘now’ critical items, number 4 is necessary but a little further down the line. Thankfully, at Coko we are solving these first 3 problems. To solve (1) we are building XSweet, a comprehensive (open source) XSLT conversion suite for converting MS Word to HTML. We are also building Wax to solve (2), Wax is an open source web based word processor based on the Substance.io libs. And for (3) we are using Vivliostyle for in-browser rendering. Number 4) is still on the cards.

Interestingly, the pagination technology (3) might need re-evaluating since pagination will eventually be required for the editor and the in-browser designer.

While pagination inside a web-based processor is not critical for publishers, it is critical for authors and small offices etc and if we are going to get publishers to use a web-based word processor then it would be better that they share infrastructure rather than sit on their own island of technology ie. eventually we need authors to use these tools too. By sharing infrastructure I don’t mean they need to use exactly the same tools, they just need to use compatible tools. So, eventually, we need to migrate authors into the web. It is not critical now, but over time, as the workflow for authors and publishers inevitably becomes more integrated, it will turn out to be necessary.

For in-browser design we need pagination support also so we can work off a single source for the content and then design in the browser to output to the various formats publishers need. Think Gimp or InDesign in the browser. It’s not as far away as it might sound, but to do this we need to be able to paginate inside the browser and have that update with live style changes to CSS.

So far, we are solving the big ticket issues 1-3, but for the next stage of changes we may need to change the tools we use for pagination so we can live-update content and styles and reflow in an editor and in-browser designer. That may mean we to start looking for a different pagination solution.

XSweet 1.0 coming, HTML to JATS also

We (Coko) are going to work through some smaller issues on XSweet. XSweet is our XSLT scripts that convert MS Word to HTML Typescript. We need to keep filling out small gotchas but mainly we look into table conversion and some font fixes. So far though, XSweet is looking pretty sweeeeeeeeeeeeeeeeeet…..

In other news – just putting together a HTML to JATS pipeline using INK and building some INK Steps around Martin Fenner’s pandoc-JATS. We’ll experiment with this but may ultimately move to a HTML to JATS XSLT solution that doesn’t need to convert to markdown as an intermediary (as pandoc does).

What this means, is that in a week or so we could create a new recipe that chains together the following INK Steps:

XSweet (MS Word to HTML conversion)
HTML to JATS Typescript (a version of HTML ready for ingestion into pandoc)
pandoc-JATS for HTML to JATS conversion

We don’t currently have a use case for this, but as you can see it takes MS Word through this chain and will produce JATS. We’ll test it out and see how far it gets us. In the journal world, we anticipate this would be useful to some but more interesting is to go from MS Word to HTML, ingest this into a PubSweet journal system, improve with the Wax Editor, and then output this to JATS at publish time. This way we don’t try and optimise the file too early which, IMHO, is a big mistake.

Coko Products

I am currently planning how to keep all the Coko projects balanced and moving forward. It gave me a moment to reflect on just how productive we have been. At present we have 6 major products, all moving forward at an excellent pace, they include :

PubSweet – the API toolkit for building publishing workflows (website coming soon).
INK – the file prosessing (conversion, extraction, enrichment etc) framework.
Editoria – the monograph production platform for publishers
XSweet – the XSLT production for converting MS Word to HTML (HTML Typescript)
Wax – the web based word processor built on top of Substance.io libs
xpub – the (early stage) journal platform.

All this in less than 18 months, which is amazing enough but also consider that Coko was only 3 people (with Jure being the only developer) until 12 months ago. Its kind of astonishing to me.

Why HTML Typescript is a good idea

Many publishers start with MS Word docs and need to move to more structured file formats. To do this they are often advised to go straight from the badly structured markup of docx to a structured document format (some form of XML). However, I believe this to be a mistake.

Wendell Piez and I wrote about a strategy for MS Word-to-HTML conversions. We call it ‘HTML Typescript‘, the core principle being to first make a faithful representation of the docx file in HTML, without rushing too quickly to interpolating structure or translating first into descriptive XML intermediary format (which is the common way to do this), the idea being that clean HTML is a better place to start improving the structure of the document than either ugly docx markup or overly complex XML of some other variety.

Once we have clean HTML, we can then more easily work with it, manually or programmatically adding structure, and we can more easily parse it, for example, for entity extraction etc…

I did a demo of this today at Project MUSE in Baltimore (for a very nice bunch of people) and I wanted to share the following screenshots that more viscerally illustrate the benefits of this strategy. Displayed below are two screenshots of source code. The first is the markup of a docx file, followed by a screenshot showing the result of converting that docx to HTML using the HTML Typescript converters (XSweet) that we built.

And the HTML result after running the same file through XSweet (docx -> HTML Typescript converter).

The question is – which would you rather work with?

Coko and Communities

Kristen Ratan (co-Founder of Coko) and I have been pondering ‘the next phase’ of Coko. We have an organisation in place, with great people, and we have developed products (Editoria, and more coming soon) and frameworks (INK and PubSweet). Also, of course, we have developed the Cabbage Tree Method for facilitating ‘users’ (use-case specialists) to design their own products.

So…next up… community. It seems to me it is an interesting next phase which will require a lot of thinking about. The complexity comes from the fact that we have multiple primary stakeholders at play. A rough breakdown (note, each category is a complex ecosystem of diverse roles – all of which we still have to think through):

PubSweet backend – this is a ‘headless CMS’ built for publishing workflows… however, it is a technical software whose primary community would be developers that commit to its continued development. Additionally, I would argue, its use-case is a whole lot broader than capital P Publishing… any organisation that wants to develop workflows for producing content (which is pretty much every organisation that exists) could benefit from this software. Publishers are just a tiny subset.
INK – the web service for (primarily) managing file conversion pipelines. This is also a technical ‘backend’ whose primary community would be developers. INK’s use-case is also broader than ‘just’ publishers.
INK Steps – INK processes files through steps which are small plugins. These can do anything from file conversion, to validation, content parsing etc…. the primary community for this is possibly more tightly connected to publishing requirements, but the builders of steps are more likely to be those involved on the production side of the workflow ie. file conversion experts, content miners etc
XSweet – a very specific content conversion pipeline for producing HTML from docx. Primary community would be file conversion experts.
Editoria – the monograph production platform. Primary community – Publishers but also interesting to any org that produces books.
Journals – we are on our way to producing a Journal platform. The primary community for this is also publishing organisations…
Ecosystem – we are building out an ecosystem of projects. So as much as possible we need to consider how we interact with and build community/collaboration with other open source/open access (etc) projects.

It is a complex stack..our job is to work out all parts of this stack, understand what they look like, and think through why they would want to be involved in a community, and how…