Many publishers start with MS Word docs and need to move to more structured file formats. To do this they are often advised to go straight from the badly structured markup of docx to a structured document format (some form of XML). However, I believe this to be a mistake.
Wendell Piez and I wrote about a strategy for MS Word-to-HTML conversions. We call it ‘HTML Typescript‘, the core principle being to first make a faithful representation of the docx file in HTML, without rushing too quickly to interpolating structure or translating first into descriptive XML intermediary format (which is the common way to do this), the idea being that clean HTML is a better place to start improving the structure of the document than either ugly docx markup or overly complex XML of some other variety.
Once we have clean HTML, we can then more easily work with it, manually or programmatically adding structure, and we can more easily parse it, for example, for entity extraction etc…
I did a demo of this today at Project MUSE in Baltimore (for a very nice bunch of people) and I wanted to share the following screenshots that more viscerally illustrate the benefits of this strategy. Displayed below are two screenshots of source code. The first is the markup of a docx file, followed by a screenshot showing the result of converting that docx to HTML using the HTML Typescript converters (XSweet) that we built.
And the HTML result after running the same file through XSweet (docx -> HTML Typescript converter).
The question is – which would you rather work with?