Why Vivliostyle is Important

Vivliostyle advertises itself as a “publishing workflow tool.” It is a great tool, but this isn’t a terribly accurate or descriptive tagline, since publishers can’t really use it without changing everything else they do. Hence my preference for referring to it as an HTML pagination engine (although I used to refer to this process as Browser Typesetting). It can be part of a publisher’s workflow, but only after they have transformed everything else they do into an HTML-first workflow.

On this point, I think Vivliostyle have their sales pitch wrong but it is hard to know how they might do it better since Vivliostyle imagines, as do many other tools, a radically different way of doing things to how publishers operate now. Vivliostyle is one necessary part of an approach that is, in effect, a paradigm shift in how publishers work. This ‘new way’ of working is exciting and transformative, but it is also hard to capture paradigmatic changes in a few catch phrases. So, to understand the value of Vivliostyle, you first have to understand how publishers work now, why it is a terrible way to work, and what HTML-first workflows can offer. Once you understand that, you can see how exciting Vivliostyle could be for publishing.

So… what does it do? Vivliostyle is the latest in a long list of tools that, among other things (more on these in later posts), can transform HTML into PDF. These tools have typically been used to produce PDF for printing – book-formatted PDF. There are a few such tools out there, but most are proprietary and only a handful have been Open Source (notably CaSSius and BookJS, which I was involved in a long time ago). Vivliostyle is the latest in this family tree, the root category of which I would describe as an HTML Pagination Engine.

What it does is this – it takes an HTML file and paginates it. It flows the HTML through the ‘boxes’ (pages) you have defined and lays the content out nicely in each box, one box after the other, flowing all the content through it with the right margins (and more, see below). So instead of scrolling through the web page, you page through the document on screen. Vivliostyle converts the HTML into ‘pages’. It is an HTML pagination engine.
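To make the idea of ‘flowing content through boxes’ concrete, here is a toy sketch in Python. It is not how Vivliostyle is implemented; it assumes every block of content has a known height and is never split across pages, whereas a real engine measures rendered content and can break inside a paragraph. The names `Page` and `paginate` are invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    blocks: list = field(default_factory=list)
    used_height: int = 0


def paginate(blocks, page_height):
    """Flow (text, height) blocks into fixed-height pages, in order.

    Toy model: each block has a known height and is never split.
    """
    pages = [Page(number=1)]
    for text, height in blocks:
        current = pages[-1]
        # Start a new page (box) when the block no longer fits, unless the
        # page is still empty (an oversized block simply gets its own page).
        if current.blocks and current.used_height + height > page_height:
            current = Page(number=current.number + 1)
            pages.append(current)
        current.blocks.append(text)
        current.used_height += height
    return pages


if __name__ == "__main__":
    content = [("Chapter title", 40), ("Paragraph one", 120),
               ("Figure", 200), ("Paragraph two", 90)]
    for page in paginate(content, page_height=300):
        print(page.number, page.blocks)
```

Because each page records its own number, adding a page number or running header to every box is a small step from here, which is the point the next paragraph makes.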

It is then possible to do a lot more with these pages, including adding page numbers, using CSS (the style rules used by browsers) to define the look and feel of text, images, etc., and adding headers, notes, and footers. In other words, you can add to each page everything you need to make the result ‘look like a book’.

You can do this transformation in the browser because Vivliostyle is written in JavaScript. If you want to see this in action, check out their demos page. For example, look at this page showing the raw HTML of a book by Lea Verou (published by O’Reilly), then open this page (give it a minute to render). Now right click and print (best in Chrome or Chromium browsers – you may need to turn margins off and background images on when printing). The result should give you a one-to-one correlation between the paginated content in the browser (which is HTML) and the paginated print-ready PDF.

Since the browser can also print to PDF, you can take this newly styled ‘book looking’ HTML and print it to PDF from the browser. From this, you have a book-formatted PDF that is ready to send to the printer to be printed and bound. I’ve worked this way for many years and printed many books this way for organisations from the World Bank to Cisco. The system works and the printed books look great. Vivliostyle is, at this point, the most sophisticated open source tool for doing this.

It is pretty amazing stuff. From HTML to PDF to printed book at the press of a button. Magic. This process is also catching on. Hachette produce their trade paperbacks using an HTML-to-PDF rendering engine with styling via CSS. So it is no longer a process reserved for small experimental players.

So, why is this interesting? Well, it is interesting almost without an explanation! Transforming HTML in this manner seems pretty magical and it’s kinda neat just to look at the demos and marvel at it. However, the really interesting part comes into play when we start talking about workflows. This is where Vivliostyle can be part of transforming publishing workflows.

As a publisher, if you were to take Vivliostyle ‘out of the box’ it would not be of much use to you. How many web pages do you have that you want to turn into PDF? Not many, if any. How much of your book or journal content in your current workflow is stored in HTML? Probably none. In all likelihood, HTML in your business is restricted to your website and, perhaps, EPUB if you produce them (EPUB is just a zip archive containing HTML files and some other stuff). But in most publishers’ workflows the EPUB is an end-of-line format. Publishers take the completed copy in MS Word, or (sometimes, regrettably) PDF, and send it to a vendor (typically in India) to transform into EPUB and send back. So chances are, HTML is not the format being used as the basis for your book or journal workflows (apart from possibly being an end-of-line format).
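If the claim that an EPUB is ‘just a zip archive containing HTML files’ sounds too simple, it is easy to check. The following is a minimal sketch using Python’s standard `zipfile` module; the function name `list_epub_html` and the file name `book.epub` are placeholders chosen for this example.

```python
import zipfile


def list_epub_html(epub_path):
    """Return the names of the (X)HTML content files inside an EPUB."""
    with zipfile.ZipFile(epub_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        )


if __name__ == "__main__":
    # "book.epub" is a hypothetical path; point it at any EPUB you have.
    for name in list_epub_html("book.epub"):
        print(name)
```

The rest of the archive (the ‘other stuff’) is typically stylesheets, images, and a few metadata files describing the reading order.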

As a publisher, you don’t have manuscript copy in HTML, so Vivliostyle is not going to fit snugly into your workflow. In order to utilise it, you need to transform the way that you work. You have to start working in HTML or, with more difficulty, work in some format that can be transformed into a very tightly controlled HTML output so that Vivliostyle can work with it.

I don’t like the latter style of workflow. This is where you work in something like XML (of some sort) and then transform to HTML at the end of the process. It’s an ugly workflow, not friendly for non-techie users and typically full of workflow redundancy. If you want good HTML, then just work with, and in, HTML. This comes with additional benefits since from good HTML you can get to any format you want PLUS you have the advantage of now being able to move your workflow into the browser. And this is where Vivliostyle fits into a toolset, an approach, that could transform how you work – the HTML-first production environment.

The current ‘state of the nation’ in publishing is pretty terrible. Most publishers use MS Word docx as their document format, and Track Changes and email are their primary workflow tools. This means that there is a single document of record – the collection of Word files. These are shareable, in the sense that you can email people copies, but you cannot have multiple people working on them at the same time. In effect, the MS Word files are like digital paper in the worst possible way. There is only one ‘up to date’ version and only one person can work on that version while they hold onto the files. There is no easy way to follow a document’s history, revert to specific versions, or identify who made what change when. Further, there is no inherent backup strategy built into standalone MS Word files. Everything must be done manually. That means organizing the files in directory structures with naming conventions known only to those in the know (since there is no standard way of doing this). There are also problems with email as a collaboration tool. Did it send? Did they get it? Did they get the right version? Plus there is no way of understanding the status of the documents unless you ask via email or there is some other system that is manually updated for status tracking. The system is not transparent. Further, changing workflows when using systems like this, even for small optimisations, is quite difficult, and the larger the team the more difficult it gets.

Additionally, using Word and email like this is really placing unnecessary gateway mechanisms on the content. If I have the up-to-date versions then you can’t have them or work on them. There is really only one copy and only one person can work on it. That makes for linear workflows and strongly delineated roles. No one can ‘jump in and help’ and it is very difficult to alter the linearity of the process or redistribute the labor to achieve efficiencies.

If publishing is to move on, then workflows need to migrate to the browser. With browser-based workflows, there is no need to have multiple copies of the same file, versioning is taken care of as is document history, it is easy to add and remove people from the process, and labor can be better distributed over both roles and time to create more elegant, efficient workflows. I wrote about this in an earlier post and will write more in posts to come since there is a lot more to it. But suffice to say that moving publishing workflows to the browser is a little like ‘sucking all the gaps’ out of the current Word-email workflows (plus a whole lot of other benefits). No more checking your Inbox while you wait for status updates or someone to send you the files so you, and you alone, can work on the next little part of the process while everyone else waits. Additionally, there can be full transparency as to what needs to be, has been, and is being done (and by whom). There is the opportunity to break down larger tasks into smaller tasks and have them all in play concurrently. There is the opportunity to share the same tools and hence enhance communication and redistribute the work to where (who) it makes most sense. There is so much to be gained.

This is not to say that browser-based workflows are ‘anything goes’ workflows (which is what most publishers think this way of working amounts to). You can still assert rules of who has access and when. But… in my experience, when you migrate workflows to the browser, publishers start rethinking how they work and you often hear comments like “but we don’t need to do it like that anymore”… They then start designing radically better workflows themselves.

So, the point of all this is that Vivliostyle by itself does not achieve this. It is not, in itself, a workflow tool for publishers. You first need all the other things that enable an HTML-first workflow to be in place and once they are there, then you can utilize Vivliostyle to transform the HTML (at the push of a button) to the PDF you need for printing. That is the radical improvement Vivliostyle can offer. Cut out the file conversion vendors and render the content according to templated style sheets (automated typesetting can produce beautiful results). This means you can check what the book will look like at any moment, plus the CSS stylesheets you use can also be included in your EPUBs (also rendered at the push of a button since the original content is already in HTML, the content filetype for EPUB) so your printed book and the EPUB look the same.

So, Vivliostyle is a necessary tool for HTML workflows and with an HTML workflow you will radically improve what you do.

This is why Vivliostyle is important to publishers, but you cannot consider it in isolation. You must consider it with regard to migrating to an HTML-first workflow. If you migrate to this kind of workflow then not only will you experience the efficiencies described above but your organisational culture will be transformed and the types of content you can then produce will become a lot more open ended. This is the vision that Vivliostyle, and other tools that enable HTML-first workflows (including those developed by UCP and Coko), are imagining and building towards.

Dear reader, out of principle, I do not use proprietary social media platforms and networks. So, if you like this content, please use your channels to promote it – email it to a friend (for example). Many thanks! Adam

Booki to Booktype, BookJS and beyond…

Many years ago I was the Product Manager and Project Lead for Booktype at Sourcefabric. We developed many interesting technologies including Booktype itself, Objavi, StyleJS, BookJS, Booktype Renderer, and Booktype Designer, amongst others.

Booktype is still going very well and has also spawned the very interesting Omnibook service. Due to the recent interest in this project, I revisited this old video which documents some of the exploratory thinking I had when leading the Booktype team at Sourcefabric. It was recorded May 2012 at #dev8ed in Birmingham, UK. At the time I was leading a small team, having just migrated Booki (FLOSS Manuals) to Booktype (at Sourcefabric).

I found the video really interesting as it covers my thinking at the time (developed over many years of experimenting in this area) on many issues, including rendering books in the browser and using the browser as a design environment for books. There are some nice quotes which accurately reflect how I was thinking then:

“there is no one taking responsibility for designing environments where you can target both flowable text as an output like Kindle or EPUBS, and at the same time, target fixed page outputs like paper books. So we are trying to work this out at the moment. How do you deal with this? […] We are trying to work out how can you possibly find a paradigm that fits both flow-based, and fixed page, design” [36min 25s]

and

“what we want to see [in the browser] is when you are outputting to book-formatted PDF, we want to see like you see in Google Docs – exactly the page dimensions that you are going to get when you output the PDF. Google Docs does some sort of magic where that is possible, we haven’t yet cracked it ourselves, but for fixed page design we think it is quite important that what you see in the HTML page is what you would eventually get in the PDF.” [41min 37s]

“…how do you actually render one to one representation of a book-formatted PDF in a browser?” [49min 49s]

“…we can have JavaScript playing a role in rendering elements of pages for book-formatted PDF.” [16min 58s]

“…we take the Booktype content as HTML, HTML as the base format, and Objavi formats that into one long HTML page for which we have specific CSS rules to structure the book in a specific way. Then we run WKHTML over the top of it, and a number of other tools, and we assemble a book out of it, book-formatted PDF” [18min 38s]

“That’s because WKHTMLTOPDF is webkit, the browsing engine behind Chrome and Safari, … so you can use CSS, and JavaScript and everything from webkit, and turn it into a PDF” [19min 50s]

“…the advantage of using webkit as part of the rendering environment, as webkit is a browser, [is that] if you design in the browser you have a one to one correlation between content creation environment and output environment” [33min 49s]
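For readers who want a feel for the assembly step described in these quotes (chapters joined into one long HTML page, with CSS rules that structure it as a book), here is a rough Python sketch. It is not Objavi’s code: the function name `assemble_book`, the `chapter` class name, and the stylesheet are all assumptions made for illustration. The only CSS used is the standard `page-break-before` property.

```python
from html import escape

# Illustrative stylesheet: each chapter starts on a new page when printed.
BOOK_CSS = """
section.chapter { page-break-before: always; }
section.chapter:first-of-type { page-break-before: avoid; }
"""


def assemble_book(title, chapters, css=BOOK_CSS):
    """Join (heading, html_body) chapters into one long HTML document."""
    sections = "\n".join(
        f'<section class="chapter">\n<h1>{escape(heading)}</h1>\n{body}\n</section>'
        for heading, body in chapters
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html>\n<head>\n<meta charset="utf-8">\n<title>{escape(title)}</title>\n'
        f"<style>{css}</style>\n</head>\n<body>\n{sections}\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    book = assemble_book("Demo Book", [("One", "<p>First chapter.</p>"),
                                       ("Two", "<p>Second chapter.</p>")])
    print(book)
```

The resulting single page is what would then be saved and handed to a browser-engine renderer such as WKHTMLTOPDF to produce the book-formatted PDF.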

To be clear, we were already using browser engines to make books for quite some time, and Douglas Bagnall, a friend who also worked with me at FLOSS Manuals, even investigated collaborating with the Gecko (Mozilla layout engine) developers to add widows and orphans controls and the CSS page-break control (which we needed for books), in 2010 or so. Actually, it was pretty cool because Douglas, myself and Robert O’Callahan (Mozilla layout engine dev) were all New Zealanders. But FLOSS Manuals had been making books for many years with browser engines since Behdad Esfahbod advised me to explore this, many years earlier. We knew browsers could be used for producing book-formatted PDF and we had been doing it for years.

However, as I have learned over the years, there is an important role for vision, experimentation, and theoretical exploration prior to developing good software. Hence, I was now exploring how you could take these positions further to design books in the browser client. Rendering PDF was one part of the story, the other was working out the tools to take book design to the browser. This was what Adobe was also after, I believe, when they implemented CSS Regions in webkit and started on their Adobe Edge Reflow line of products that leveraged the browser as a ‘design surface’. They were interesting times.

But back to the Booktype story. The video is a demo from May 2012, about a month before I hired anyone (in June) to start on what eventually became BookJS. It took us a while to get there but after much discussion, further experimentation, and some months of development, I was able to introduce BookJS in Oct 2012 on the Sourcefabric blog.

Terrible profile pic of me!

While BookJS didn’t quite get to be the design environment I was (and still am) after, it was still a good tool. In an attempt to get to a design and rendering solution in the browser, we later took the Booktype Designer (demonstrated in the video) ideas to a JavaScript prototype called StyleJS for integrating with BookJS but, unfortunately, it didn’t make it to production. StyleJS enabled a kind of ‘WYSIWYG’ tool for styling a page live, which makes it an interesting prototype for future exploration of in-browser book production.

Work continued on BookJS and it has had a useful life despite some quirky turns in the road. During this time, the Booktype team worked with several people on the development of BookJS and received good advice and contributions from Mihai Balan (from the Adobe CSS Regions team), Phil Schatz (from Connexions), Maria Fraser (University College London) and others. As with many software projects, contributions like this deserve a lot of credit, as I have written elsewhere, since these contributions are not always preserved in the code.

Another quirk that happened is that the Google team, in an unexpected move which surprised many people and turned into a bit of a CSS heavy hitters ‘discussion’, removed CSS Regions from Blink. Many people were pretty shocked. This, I think (but I don’t know the inside story), spelled the end for Adobe’s vision of the browser as a design surface using CSS Regions, and the Adobe Edge Reflow product has been discontinued.

In the Booktype world, Juan Gutierrez (who worked on BookJS at Sourcefabric, and now works with me at Coko) extended BookJS to support the CSS Regions polyfill. It is still in use now with Book Sprints for rendering books. Consequently, we are still very grateful that Booktype and Sourcefabric kept the BookJS product AGPL after I left the project so we could extend it. Hurray for Open Source!

It is good to see Booktype going strong, Sourcefabric still invested in Open Source, and a growing interest around Omnibook. I know the team there, Micz Flor (co-founder of Sourcefabric and Managing Director of Booktype) being an old friend, and Julian Sorge also makes a great Booktype Managing Director. They have brought their own vision to the Booktype products, pushing them in new directions, and it is really great to see. I’m hoping they will continue to go from strength to strength.

In summary, these were interesting, productive times. Sourcefabric provided the opportunity for Booktype to grow, and I experimented a lot, as I had done at FLOSS Manuals (and continue to do now), with new ideas and approaches. There were some great software, books, and ideas that came out of that period. Some of the books we made I have even kept with me through my travels. In the video, for example, I demonstrate the Booktype Designer. We built the Designer before and during the Sandberg Institute workshop I led in Amsterdam, and used it, in the same month as the presentation, to create this wonderful artist’s book. I carried it with me all over the world and still have it on my bookshelf now!

Waag Society/Remko Siemerink 2012. https://creativecommons.org/licenses/by/4.0/

Nice to find this old vid.

Original url for the video: https://vimeo.com/43591376

Review: http://devcsi.ukoln.ac.uk/2012/05/29/dev8ed-workshop-booktype/

 

The Old Days

First Book Sprint using Booki, Berlin, 2010

Wow… I was browsing some old archives to update this new version of my site. I found the most incredible stuff in the Internet Archive’s Wayback Machine, including the outline of a description of Booki (2010), many years before it became Booktype. Amazing! I didn’t think I had the product manager in me, but it seems that once upon a time I was really focused on this kind of acute detail for product management. I had forgotten!

Forgive the long post, it’s just pure indulgent nostalgia for me. In any case, here is one of the emails I found really fascinating, from back in 2010, talking about features for Booki and Objavi (book renderer). This has been taken from the zip of a public list we used for dev at the time:
https://web.archive.org/web/20111029143503/http://lists.flossmanuals.net/pipermail/booki-dev-flossmanuals.net/

I’m astonished at how much of my thinking recorded in this email carries through to the way we are approaching product development for Coko now. The statement:

You might have noticed that I prefer to take the easy road for features, leaving as much open as possible, and then refine according to use. That is because, from experience, I have learned that when designing software it is better to be led by the user rather than force them into an imagined work flow.

Might as well be out of the Collaborative Product Design manifesto.

It’s kind of incredible. The email documents so much of how we were thinking at the time, including using HTML and CSS to create paginated books using browser engines:

* Objavi utilises Webkit for PDF generation. Later Gecko will be added.

…and later in the product description…

3.2.2 CSS Book Design
Status: High Priority, Implemented
Function: The default PDF rendering engine for Booki is now Webkit and will eventually be Mozilla Firefox hence there is full CSS support for creating book formatted PDF in Booki. This changes the language of design from Indesign to CSS - which means any web native can control the design of the book.

Pretty interesting, if only to me! Anyway, the email is below. It documents some features we built on commission for Sourcefabric before they eventually took over the project. Thank you for indulging me 🙂

From adam at flossmanuals.net Wed Jul 28 09:11:21 2010
From: adam at flossmanuals.net (adam hyde)
Date: Wed, 28 Jul 2010 18:11:21 +0200
Subject: [Booki-dev] notes to meeting
Message-ID: <1280333481.1582.143.camel@esetera>

hi Frank,

It was good to meet you and I'm glad Source Fabric is considering working with us and you to develop features they and we need (Aco is also keen for this). 

I have sent this email to the dev list and to you and Micz. It might be good for you both to consider joining the list.
http://lists.flossmanuals.net/listinfo.cgi/booki-dev-flossmanuals.net

Below the content of this email is a very basic requirements doc. It does not outline the notes tab, so I thought I would make some notes here for your (and Micz's) consideration should Source Fabric decide they wish to commission all or part of this development. 

In essence, I think that the notes tab could nest the following:
1. To do list
2. Book notes
3. Style guide

These could be hidden via a dropdown or accordian style interface. Our plan is to keep everything as simple as possible so I would imagine a page with three headings and clicking on each reveals the information behind it.

Some ideas:
1. To do list
The basic form could be a Jquery to do as we looked at today:
http://demo.tutorialzine.com/2010/03/ajax-todo-list-jquery-php-mysql-css/demo.php

If this is the format, it would be good enough as it is. The good news is that this is done using Jquery so I imagine this is a very easy implementation. What you would need to work out, however, is how Aco implements the dynamic updates so that when a to do is altered everyone has that info updated.

If there was room to take this development a step further, I would recommend considering adding the following fields:
* assigned to
* due date
* priority

I am not married to those ideas though as I think we need to insure that the interface does not have too many things going on. So I would actually recommend we start with the basic implementation and move on. When users have tried it then we can consider extending it with these items.

2. Book Notes
Something like etherpad would be good but too complex (see.
http://piratepad.net/ )
I would suggest considering either a) the same interface as we have now in the notes pad except with a very very simple WYSIWYG or b) a threaded comment system. I think the best would again be to do the easiest and simplest - what we have now with a WYSIWYG interface (and no need to press 'save'). Then when users use it we extend according to demand for most-needed features. 

3. Style Guide
This is pretty much the same as (2) except it would be used for storing the Style Guide. A style guide is optional but many people request it in FLOSS Manuals and some go out of their way to create one so I think this would be a very good feature to anticipate based on our user experience so far.


I think all of the 3 above are simple and I think Source Fabric's working process (especially for the forthcoming Sprints) would benefit a lot from them.

You might have noticed that I prefer to take the easy road for features, leaving as much open as possible, and then refine according to use. That is because from experience I have learned that when designing software it is better to be led by the user rather than force them into an imagined workflow.

It has worked well for us so far - everything you now see in Booki is pretty much that way because we have tried similar ideas in FLOSS Manuals and seen their effect. I would prefer to continue to work this way with Booki. 

So...there was one more feature we discussed - Chapter Level notes. I think this would be extremely useful for Source Fabric (but Micz needs to comment on this) but we need to be careful that we get it right because it is not so obvious how this might work. 

I think the notes have to be associated with the chapter page when you edit it - however there is very little space there. One possibility is to build this into the WYSIWYG editor - Xinha - as a 'notes server' or some such. ie. it opens from the WYSIWYG editor but stores the content (chapter notes) in the booki db. The risk here is that people will not know that the notes are there...so we need to consider this. Another possibility is to build this into a 'sliding tab' as Micz suggested. I think that might be ok but it would have to be done carefully as it might look too much like a gimmick.

The other issue with chapter level notes is that I strongly believe that an overview of all chapter notes for a book should be able to be seen somewhere, in one place. Otherwise it would mean checking each chapter which would be a tedious job (books easily have 30+ chapters). So if you consider Chapter notes then you must also consider how to do this. 

So on this I am not so clear what would work well for Chapter level notes and because of this I think it's not such a good feature for our first adventure working together. I would recommend instead the first three to be done all together - however this is up to Micz.

My feeling is that the first 3 are an extremely quick development, first however you need to know how it all fits together so i would suggest emailing this list when you have questions and I am sure Aco will answer your questions...

Also, Aco is currently working on the Booki site update so I expect the GIT repo is not updated but will be within the next days once the booki www is updated....

also you should meet Doug - doug is on this list and he is the Objavi (PDF generator) developer....doug - frank, frank - doug

also, meet John who does the Booki manual and other essential tasks intro intro :)


:)

adam






1 INTRODUCTION

1.1 Description
Booki is designed to help you produce books, either by yourself or collaboratively. A book in this context is a "comprehensive text" which can be output to book-formatted PDF (for book production), epub, odt, screen readable PDF, templated HTML and other formats.

Booki supports the rapid development of content. Booki has tools to support the development of content in 'Book Sprints'. Book Sprints are intensive collaborative events where collaborators in real and remote space focus on writing a book together in 3-5 days. 

While you can use Booki to support very traditional book production processes, the feature set matches the rapid pace of publishing possible in the era of print on demand and electronic readers. Booki can output content immediately to multiple electronic formats. Print ready source (book formatted PDF) can be immediately generated, and then uploaded to your favorite Print on Demand (PoD) service, taken to a local printer, or delivered to a publisher.

1.2 Purpose
Booki embraces social and collaborative networked environments as the new production spaces for comprehensive (book) content. 
 
1.3 Scope
Booki is available online as a networked service (http://www.booki.cc) for free. This service is a production tool for the creation of free content and not a publishing/hosting service. Content produced within Booki.cc is intended to be published elsewhere, either under another domain, in paper form (ie. books), distributed in electronic formats, or re-used in other content. 

Booki can be installed by anyone wishing to utilise this software under their own domain or within private or local networks. 
 
 
2 OVERALL DESCRIPTION

2.1 Product Perspective
Booki takes what was learned from building the FLOSS Manuals tool set and posits these lessons within a more suitable architecture. 

Booki is the name of the collaborative production environment, however there are 2 associated softwares that provide all the services required :
Booki - production environment
Objavi - import and export engine
This document refers to Booki 1.5 and Objavi 2.2

2.2 Booki Functions
* User account creation requiring minimal information
* One click book creation
* Drag and drop Table of Contents creation
* One click editing of chapters
* Chapter level locks
* Live chat on a book and group level
* Live book status reports (editing, saving, chapter creation) delivered
to the chat window
* Drop down chapter status markers
* One click to join a group
* One click to add a book to a group
* One click exporting to epub, screen pdf, book formatted pdf, odt, html with default templates
* Easily accessible advanced styling options for export (CSS controlled)
* User profile control (status, image, bio)
* One click group creation
* Easy importing of book content from Archive.org, Mediawiki, other Booki installations
* Option to upload content to Archive.org
 

2.3 User Characteristics
2.3.1 Contributor
The majority of users will be contributors to an existing project. They may contribute to one or more project and may produce text and/or images, provide feedback or encouragement, proof, spell check, or edit content. These are the primary users and the tool set should first meet their needs.

2.3.2 Maintainer
These are advanced users that create their own books or have been elevated to maintainer status for a book by group admins. Maintainers have associated administrative tools for the books they maintain which are not available to other users.

2.3.3 Group admin
These are advanced users that wish to establish and administrate their own group. They have maintenance tools for every book in their group plus additional group admin tools.

2.4 Operating Environment
Booki is designed primarily for standards-based Open Source browser comparability but is tested against other browsers. 
 
2.5 General Constraints
* Booki and Objavi are Python-based.
* Booki is built with the (bare) Django framework.
* Booki uses Jquery for dynamic user interface elements. 
* Booki uses Postgres as the database but sqlite3 can also be used
* Redis is used by Booki for persistent data storage to mediate dynamic data delivery to the user interface
* Objavi utilises Webkit for PDF generation. Later Gecko will be added. 
* Rendering of .odt by Objavi requires OpenOffice to be installed with unoconv. 
* The Booki Web/IRC gateway may eventually (and optionally) require a dedicated standalone IRC service hosted on domain. 
* Content editing in Booki is done by default with the Xinha WYSIWYG editor
* XHTML is the file format for content. 
* Content will ultimately be stored in GIT.
* Localisation in Booki is managed with Portable Object files (.po).
* The code repository for both projects is GIT with a dedicated Trac for bug reporting and milestone tracking :
http://booki-dev.flossmanuals.net 
* A Dev mailing list is maintained here:
http://lists.flossmanuals.net/listinfo.cgi/booki-dev-flossmanuals.net 
* Developers can be reached in IRC (freenode, #flossmanuals)
* Each release will be as source. Beta and later releases will also be available as Debian .deb packages. 
* User and API Documentation will be maintained in the FLOSS Manuals Booki Group.
* For development we use Apache2 for http delivery
* The license is GPL2+ for all softwares

2.6 User Documentation
Maintained here : http://www.booki.cc/booki-user-guide/


3 SYSTEM FEATURES

3.1 Booki Features

3.1.1 Booki-zip (Internal File Format)
Status: High Priority, Implemented
Function: A Booki-specific file structure for describing books 
Interface: Used for internal data exchange between Booki and Objavi. 
Notes: booki-zip definition maintained here :
http://booki-dev.flossmanuals.net/git?p=objavi2.git;a=blob_plain;f=htdocs/booki-zip-standard.txt

3.1.2 Account Creation
Status: High Priority, Partially Implemented
Function: Quick access to a registration form from the front page for account creation 
Interface: Requires only username, password, email and real name (required for attribution). Email is sent to the user with autogenerated link for verification
Notes: email confirmation mechanism missing

3.1.3 Sign in
Status: High Priority, Implemented
Function: Quick access to a sign-in form from the front page 
Interface: Username and Password form and submit button. Username and password are remembered.

3.1.4 Profile Control
Status: Medium Priority, Implemented
Function: When logged in the user can access a profile settings page to set personal details (email, name, bio, image). Personal details can be browsed by other users
Interface: "My Settings" link in user-specific menu on left gives access to a form for changing the details.

3.1.5 Book Creation
Status: High Priority, Implemented
Function: Users can create a book from their homepage ("My Profile").
Interface: User can click on "My Profile" link from the user-specific menu on the left. On the Profile page a text field for the name of the book, and a license drop down menu (license *must* be set) is presented.
Clicking on "Create" adds the empty book with edit button to the listing of the users books on the same page.

3.1.6 Archive.org Book Import
Status: Medium Priority, Implemented
Function: Users can import books from Archive.org
Interface: "My Books" link in the user-specific menu on the left presents the user with a field for inputting the ID of any book from Archive.org. The book is then imported when the user clicks "Import".
Notes : Interface is through Booki but Objavi does the importing and returns Booki zip to Booki. Relies on Archive.org successfully delivering epub for each book but this is not always happening. Needs error catching and user friendly progress/error messages.

3.1.7 Wikibooks Book Import
Status: Medium Priority, Implemented
Function: Users can import books from Wikibooks (http://en.wikibooks.org)
Interface: "My Books" link in the user-specific menu on the left presents the user with a field for inputting the URL of any book from Wikibooks. The book is then imported when the user clicks "Import".
Notes : Interface is through Booki but Objavi does the importing and returns Booki zip to Booki. Needs thorough testing as it is sometimes failing possibly due to time-outs. Needs error catching and user friendly progress/error messages. Should be extended to be a "mediawiki import" tool, not just for Wikibooks.

3.1.8 Epub Book Import
Status: Medium Priority, Implemented
Function: Users can import any epub available online
Interface: "My Books" link in the user-specific menu on the left presents the user with a field for inputting the URL of any epub. The book is then imported when the user clicks "Import".
Notes : Interface is through Booki but Objavi does the importing and returns Booki zip to Booki. Needs thorough testing as it is sometimes failing possibly due to time-outs. Needs error catching and user friendly progress/error messages.

3.1.9 Group Creation
Status: High Priority, Implemented
Function: Users can create groups. 
Interface: "My Groups" link in the user-specific menu on the left presents user with 2 text fields - group name, and description. When a name for a group is entered and "Create" is clicked then the group is created.
Notes: Group admin features missing.

3.1.10 Joining Groups
Status: High Priority, Implemented
Function: Users can join groups with one click.
Interface: "Groups" link in the general menu on the left presents a list of all Groups, by clicking on link the user is transported to the homepage for that group. At the bottom of the page the user can click "Join this group" and they are subscribed.

3.1.11 Adding Books to Groups
Status: High Priority, Implemented
Function: Users can add their own books to groups they belong to.
Interface: While on a Group page that the user is subscribed to the user can add their own books to the group. 
Notes: When Group Admin features are in place we will change this so that Group Admins set who can and cannot add books to groups. At present a book can only belong to one group.

3.1.12 Readable Book Display
Status: High Priority, Implemented
Function: Users can read stable content in Booki without the need to log-in.
Interface: Upon clicking on the "Books" link in the general menu on the left a page listing all books is presented. Clicking on any of these presents a basic readable version of the stable content. Alternatively users can link to a book on the url http://[booki install domain]/[book name]

3.1.13 Edit Page
Status: High Priority, Implemented
Function: Page for editing content.
Interface: The edit page is accessed by clicking on "edit" next to the name of a book in "My Books" or "Books" (all books) listings. The user is then presented with a page with tabs for : editing, notes, exporting, history

3.1.14 Edit Tab
Status: High Priority, Implemented
Function: Edit interface for chapters.
Interface: Clicking "edit" on a chapter title will open the Xinha WYSIWYG editor with the content in place.

3.1.15 Notes Tab
Status: High Priority, Implemented
Function: A place for contributors to keep notes on the development of the book
Interface: User clicks on the Notes tab for a book and is presented with a shared notepad for recording issues or discussing the development.
Notes : Implemented but future direction TBD 

3.1.16 History Tab
Status: High Priority, Implemented
Function: Shows edit history of the book
Interface: User clicks on the history tab and can see the edit history with edit notes. 
Notes: Implemented but unreadable. Users should also be able to access diffs here.

3.1.17 Export Tab
Status: High Priority, Implemented
Function: Export content to various formats
Interface: User clicks on the Export tab and is presented with a form for choosing export options. Clicking "Export" returns the desired output for download. 

3.1.18 Version Tab
Status: High priority, Not Implemented
Function: Maintainers can easily freeze content at stable stages while work continues on the unstable version.
Interface: From the Edit Page a maintainer sees an extra tab "Version". From here a maintainer can click "create stable version" - the last stable version is archived and the current version becomes the new stable version.

3.1.19 Subscribe to edit notifications
Status: High Priority, Not Implemented
Function: Users can subscribe to edit notifications
Interface: User clicks "Subscribe to edit notifications" from the Edit Page for a book. If there are edits made a synopsis is emailed with basic edit information (time, chapter, person who made the change, version numbers) and a link to the diff.

3.1.20 Chat
Status: High priority, Implemented
Function: A real time chat (web / IRC gateway).
Interface: Persistent on the edit page for any book. 

3.1.21 Localisation
Status: High priority, Not Implemented
Function: Booki needs to be available in any language where it is needed. Hence we may integrate the Pootle code base into Booki to enable localisation of the environment.
Interface: TBD

3.1.22 Translation
Status: High priority, Not Implemented
Function: Content can be forked and marked for translation. A translation version of a book will provide link backs to the original material, be placed in a translation work flow, and edited in a side-by-side view where the translator can also see the original source.
Interface: TBD

3.1.23 Copyright Tracking (Attribution)
Status: High Priority, Implemented 
Function: Any user contributions are recorded and attributed.
Interface: All attributions are automated in Booki. Book level attribution is output in any chapter that contains the string "##AUTHORS##"
Note: there should be a syntax for producing Attribution notes on a per-chapter basis eg. "##CHAPTER-AUTHORS##"
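
As an illustration of the marker mechanism described above, here is a minimal sketch, not Booki's actual code. The function name, the contributor names, and the comma-separated output format are all assumptions made for the example; only the "##AUTHORS##" string comes from the spec.

```python
AUTHORS_MARKER = "##AUTHORS##"  # the marker string named in the spec


def expand_authors(chapter_html, contributors):
    """Replace the ##AUTHORS## marker with a de-duplicated, sorted credit line.

    Hypothetical helper: the real Booki output format is not documented here.
    """
    if AUTHORS_MARKER not in chapter_html:
        return chapter_html  # chapters without the marker are left untouched
    names = sorted(set(contributors), key=str.lower)
    return chapter_html.replace(AUTHORS_MARKER, ", ".join(names))


# Made-up chapter and contributor list, for illustration only.
chapter = "<h1>Credits</h1><p>##AUTHORS##</p>"
print(expand_authors(chapter, ["Mika", "adam", "Mika"]))
```

A per-chapter marker such as "##CHAPTER-AUTHORS##" could follow the same pattern with a chapter-specific contributor list.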
 

3.2 Objavi Features

3.2.1 Book-Formatted PDF Output
Status: High Priority, Implemented
Function: the server side creation of Book Formatted PDF is a pivotal feature. This is managed by Objavi which runs as a separate service. The book formatted PDF supports Unicode, bi-directional text, and reverse binding for printing right-to-left texts on a left-to-right press and vice versa. The formatting engine outputs customisable sizes including split column PDF suitable for printing on large scale newsprint.
Interface: This feature is managed by Objavi, an API is functional and feature rich but not well documented at present. Objavi also presents a web interface for those wanting more nuanced control (see http://objavi.flossmanuals.net/).

3.2.2 CSS Book Design
Status: High Priority, Implemented
Function: The default PDF rendering engine for Booki is now Webkit and will eventually be Mozilla Firefox, hence there is full CSS support for creating book-formatted PDF in Booki. This changes the language of design from InDesign to CSS - which means any web native can control the design of the book.

3.2.3 Export Formats
Status: High Priority, Implemented 
Function: Users also can export to self contained templated (tar.gz) HTML, to .odt (OpenOffice rich text format), epub, and screen readable PDF. Other XML output options can be developed as required. 


I guess I can never again claim not to have project management experience. Darn it.

Building Book Production Platforms p4

The renderer

Note: this is an early version. It has been cleaned up some, but is still needing links and screenshots…. Apologies if the rawness offends you 🙂

This series is skipping around the toolchain, depending on what’s most in my mind at the moment. Today it’s file conversion, otherwise known as ‘rendering’. This is the process of converting one file type to another, for example, HTML-to-EPUB or Word-to-HTML, and so on.

It’s important to have file conversion in the book production world because we often want to convert the HTML to a book format – like book-formatted PDF, or EPUB, mobi and so on, or to import into a new document existing content contained in a file like MS Word.

Manual conversions

It is, of course, quite possible to do all your file conversion manually.

Should you wish to convert HTML into a nice book-formatted PDF, one possible strategy is to go out to InDesign or Scribus and lay it all out like our ancestors did as recently as 2014. Or, if you want to convert MS Word, for example, to HTML, you can just save it as HTML in Word… Yes, Word copies across a lot of formatting junk, but you can clean it up using purpose-built freely available software (such as HTMLTidy and CleanUp HTML), online services (like DirtyMarkup), or a handy app (such as Word HTML Cleaner)…

Manual conversion is not too bad a strategy, as long as it doesn’t take you too long, and it is often more efficient and faster than those convoluted hand-holding technical systems which promise to do it for you in one step. Despite the utopian promises made by automation… you often get better results doing the conversion manually.

I sometimes hear people in Book Sprints, for example, complain something to the tune of “why can’t I just click a button and import part of this paragraph from Wikipedia into the chapter, and then if the entry is updated in Wikipedia, I can just click the button again and it will be updated here”…

I try not to sigh too loudly when I hear this ‘I have all the solutions!’ kind of ‘question’. Some day that may be feasible, but in the meantime, all the knowledge production platforms I have built have an OS-independent trans-format import mechanism which allows those handy keyboard shortcuts ‘control c’ and ‘control v’… sigh. Don’t knock copy and paste! It can get you a long way.

You can also build an EPUB by hand…

But, who really wants to do any of this? Isn’t it better to just push a button and taaadaaa! out pops the format of choice! (I have all the solutions! haha).

I think we can agree it is better if you are able to use a smart tool to convert your files, and the good news is that within certain parameters and for loads of use cases, this is possible. But don’t under-estimate the amount of tweaking for individual docs that might, at times (not always), be required.

Import and export are the same thing

The process of ‘importing’ a document is also sometimes known as ingestion. Before delving down into this, the first gotcha with file transformation is to avoid thinking about import and export as separate technical systems. That can, and has, caused a lot of extra work when building file conversion into a toolchain.

Both import and export are, actually, file conversion. The formats might differ, import might solely be Word-to-HTML in your system and the export HTML-to-EPUB. However, the process of file conversion has many needs that can be abstracted and applied to both of these cases. A quick example – file conversion is often processor and memory intensive. So effective management of these processes is quite important, and in addition, fallbacks for errors or fails need to be managed nicely. These two measures are required independent of the filetypes you are converting from or to. So don’t think about pipelining specific formats, try and identify as many requirements as possible for building just one file conversion system, not an import system plus an export system.
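
To make the "one conversion system" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the class name, the format labels, the stand-in converters); it only illustrates that import and export can share one registry and one error-handling path.

```python
from typing import Callable, Dict, Tuple

Converter = Callable[[bytes], bytes]


class ConversionError(Exception):
    """Raised when no converter is registered or a converter fails."""


class ConversionService:
    """One registry for every conversion, whether it feels like import or export."""

    def __init__(self) -> None:
        self._converters: Dict[Tuple[str, str], Converter] = {}

    def register(self, source: str, target: str, converter: Converter) -> None:
        self._converters[(source, target)] = converter

    def convert(self, data: bytes, source: str, target: str) -> bytes:
        try:
            converter = self._converters[(source, target)]
        except KeyError:
            raise ConversionError(f"no converter for {source} -> {target}") from None
        try:
            return converter(data)
        except Exception as exc:
            # Shared fallback: every format pair fails in the same, catchable way.
            raise ConversionError(f"{source} -> {target} failed: {exc}") from exc


# Stand-in converters; real ones would call out to actual conversion tools.
service = ConversionService()
service.register("docx", "html", lambda data: b"<p>" + data + b"</p>")
service.register("html", "epub", lambda data: b"EPUB:" + data)

print(service.convert(b"hello", "docx", "html"))
```

Queuing, timeouts, and resource limits would be added once, inside `convert`, rather than once per direction.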

Ingestion

In importing documents to an HTML system, the big use case is MS Word. Converting from MS Word is a road full of potholes and gotchas. The first problem is that there is no single ‘MS Word’ file format, rather there are many many different file formats that all call themselves MS Word. So to initiate a transformation, you need to know what variety of MS Word you are dealing with.

Your life is made much easier if you can stipulate that your system requires one variety – .docx. If you do have to deal with other forms of Word, then it is possible to do transformations on the backend from miscellaneous Word file type X to .docx and then from .docx to HTML. Libreoffice, for example, offers binaries that do this in a ‘headless’ state (it can be executed from the command line without the need to fire up the GUI). However, the more transformations you undertake, the more errors in the conversion you are likely to introduce. Obviously, this then causes QA issues and will increase your workload per transform required.
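
A minimal sketch of that backend step, assuming LibreOffice is installed and its `soffice` binary is on the PATH. The helper names, file names, and the 120-second timeout are assumptions for illustration; the `--headless`, `--convert-to` and `--outdir` options are LibreOffice's own command line switches.

```python
import subprocess
from pathlib import Path


def soffice_command(source, out_dir, target_format="docx"):
    """Build the headless LibreOffice command line for one conversion."""
    return [
        "soffice",
        "--headless",                    # no GUI
        "--convert-to", target_format,   # e.g. docx
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_docx(source, out_dir):
    """Run the conversion and return the expected output path.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if LibreOffice hangs.
    """
    subprocess.run(soffice_command(source, out_dir), check=True, timeout=120)
    return Path(out_dir) / (Path(source).stem + ".docx")


print(" ".join(soffice_command("old-report.doc", "converted")))
```

Each extra hop like this is another place for conversion errors to creep in, which is the QA cost described above.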

Another real problem with MS Word versions before .docx is that they are opaque. A .docx file is transparent: it is actually just XML, so you can view what you are dealing with. Versions before this were horrible binaries – a big clump of ones and zeros – and after that a bunch of gunk. That same problem also exists when you use binaries like soffice (the Libreoffice binary for headless conversions) as it is also a big bucket of numbers. You can’t easily get your head into improving transformations with soffice unless you want to learn to etch code into your CPU with a protractor.
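
To see that transparency for yourself: a .docx file is a zip archive, and the main text lives in an XML part called word/document.xml. The sketch below uses only Python's standard library. The `make_toy_docx` helper is a hypothetical stand-in that builds a tiny archive for the demo; it is not a valid Word file, so point `read_document_xml` at a real .docx to see real markup.

```python
import io
import zipfile


def read_document_xml(docx_bytes):
    """Return the main XML part of a .docx as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def make_toy_docx(xml_text):
    """Build a toy archive with only word/document.xml (demo only, not valid Word)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml_text)
    return buffer.getvalue()


toy = make_toy_docx("<w:document><w:body><w:p>Hello</w:p></w:body></w:document>")
print(read_document_xml(toy))
```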

If you have to deal with MS Word at all, I recommend stipulating .docx as the accepted MS Word format. I am not a file type expert, far from it, but from people who do know a lot about file formats I know that .docx looks like it has been designed by a committee… and possibly, a committee whose members never spoke to each other. Additionally, Microsoft, being Microsoft, likes to bully people into doing things their way. .docx is a notable move away from that strategy, and does make it substantially easier to interoperate with other formats, however, there are some horrible gotchas like .docx having its own non-standard version of MathML. Yikes. So, life in the .docx lane is easier, but not necessarily as easy as it should be if we were all playing in the same sandbox like grownups.

I have tried many strategies for Word to HTML conversion. There are many open source solutions out there, but oddly, not as many good ones as you would hope. Recently I looked at these three rather closely:

  • Calibre’s Python based ebook converter script
  • OxGarage
  • soffice (Libreoffice)

There are others…I can’t even remember which ones I have looked at in detail over the years. I have trawled Sourceforge and Github and Gitorious and other places. But the web is enormous these days and maybe there is just the oh-so-perfect solution that I have missed. If you know it then please email it to me, I’ll be ever so grateful (only Open Source solutions please!).

These three are all good solutions, but at the end of the day, I like OxGarage. I won’t go into too much detail about all of them but a quick top-of-mind whys and why-nots would include:

  • Calibre’s scripts are awesome and extendable if you know Python, however they don’t support MS MathML to ‘real’ MathML conversions. That’s a show stopper for me.
  • On the good side, though, Calibre’s developer community is awesome, and they are heroes in this field and deserve support, so if you are a Python coder or dev shop then, by all means, please pitch in and help them improve their .docx to HTML transforms. The world will be a better place for it.
  • soffice does an ok job but it’s a black box, who knows what magic is inside? It tends to make really complex HTML and it is also really heavy on your poor hardware. I have used it a lot but I’m not that big a fan.
  • OxGarage…well…I love OxGarage, so I really recommend this option…

OxGarage was developed by a European Commission-funded project and then, as is common for these kinds of projects, it dried up and was left on a shelf. Along came Sebastian Rahtz, a guru of file transformation, big Open Source guy, and also a force behind the Text Encoding Initiative. Sebastian is also the head of Academic IT Services at Oxford University. The guy has credentials! Also, he’s a terribly nice and helpful guy. He has so much experience in this area that I feel the trivialness of my questions about our .docx to HTML woes at PLOS… afraid he might absentmindedly swipe me out of the way like I was an inconsequential little midge… but he’s such a nice chap, instead he invites midges out to lunch.

So, Sebastian picked up the Java code and added some better conversions. OxGarage is essentially a Java framework that manages multiple different types of conversions. You feed it and are fed from it by a simple web API. It doesn’t have the best error handling, but it does do a good job. The .docx to HTML conversion is multi-step. First, the .docx is converted to TEI – a very rich, complex markup, and then from TEI via XSL to HTML. That means that all you really need to worry about is tweaking the XSL to improve the transformation and that’s not too tricky. It could be argued that the TEI conversion is a redundant step. I think it is. But OxGarage works out of the box and does a pretty good job so we have adopted it for the project I am working on for PLOS, and we are happy with it. We have added some special (Open) Sauce but I’ll get to that later. We are using it and will shoot for more elegant solutions later (and we have designed a framework to make this an easy future path).

If you are looking for Word-to-HTML conversion tools, I recommend OxGarage. I’m not saying it’s the optimal way to do things, but it will save you having to build another file conversion system from scratch, and from what I can tell from Sebastian, that would take considerable effort.

HTML to books

The other side of the tracks is the conversion of the HTML you have into a book file format. We live in a rather tangled semantic world when it comes to this part of the toolchain. Firstly, it’s hard to know what a book file format actually is these days… on a normal day, I would say a book file format is a file format that can display a human readable structured narrative. Yikes. That’s not particularly helpful… Let’s just say for now that a book file format is – EPUB, book formatted PDF, HTML, and Mobi.

So, transforming from HTML to HTML sounds pretty easy. It is! The question is really how do you want your book to appear on the web? Make that decision first, and then build it. Since you are starting with HTML this should be rather easy and could be done in any programming language.

The next easiest is EPUB. EPUB contains the content in HTML files stored in a zip file with the .epub suffix. That is also easy to create and, depending on your programming language, there are plenty of libraries to help you do this. So moving on…
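
To show how little is involved, here is a minimal sketch that writes a one-chapter EPUB with Python's standard library. It assumes an EPUB 3 style layout; the file names, the placeholder identifier, and the single-chapter structure are assumptions for illustration, and the output has not been run through a validator such as epubcheck.

```python
import io
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter1"/>
  </spine>
</package>
"""

NAV_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
  <head><title>Contents</title></head>
  <body>
    <nav epub:type="toc"><ol><li><a href="chapter1.xhtml">{title}</a></li></ol></nav>
  </body>
</html>
"""

CHAPTER_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body>{body}</body>
</html>
"""


def build_epub(target, title, chapter_body_xhtml):
    """Write a one-chapter EPUB to a path or file-like object."""
    safe_title = escape(title)
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    files = [
        ("META-INF/container.xml", CONTAINER_XML),
        ("OEBPS/content.opf", PACKAGE_OPF.format(title=safe_title, modified=modified)),
        ("OEBPS/nav.xhtml", NAV_XHTML.format(title=safe_title)),
        ("OEBPS/chapter1.xhtml", CHAPTER_XHTML.format(title=safe_title, body=chapter_body_xhtml)),
    ]
    with zipfile.ZipFile(target, "w") as epub:
        # The mimetype entry must be the first file and must be stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        for name, text in files:
            epub.writestr(name, text, compress_type=zipfile.ZIP_DEFLATED)


buffer = io.BytesIO()
build_epub(buffer, "A Tiny Book", "<h1>Chapter 1</h1><p>Hello, reader.</p>")
with zipfile.ZipFile(buffer) as epub:
    print(epub.namelist())
```

In practice an existing EPUB library for your language will handle the metadata, images, and table of contents details this sketch skips.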

Mobi. Ok.. mobi is a proprietary format and rather horrible. It contains some HTML, some DB stuff…  I don’t know…  a bit of bad magic, frogs legs… that kind of thing. My recommendation is to first create your EPUB and then use Calibre’s awesome ebook converter script to create the mobi on the backend. Actually, if you use this strategy, you get all the other Calibre output formats for free, including (groan) .docx if you need it. Honestly, go give those Calibre guys all your love, some dev time, and a bit of cash. They are making our world a whole lot easier.
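
A minimal sketch of that backend step, assuming Calibre is installed and its `ebook-convert` command is on the PATH. The helper names and file names are made up for the example; `ebook-convert` takes an input file and an output file and picks the output format from the output file's extension.

```python
import shutil
import subprocess
from pathlib import Path


def ebook_convert_commands(epub_path, formats=("mobi", "docx")):
    """Build one ebook-convert command per requested output format."""
    source = Path(epub_path)
    return [
        ["ebook-convert", str(source), str(source.with_suffix("." + fmt))]
        for fmt in formats
    ]


def convert_all(epub_path, formats=("mobi", "docx")):
    """Run the conversions one after the other, failing loudly on errors."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("Calibre's ebook-convert was not found on PATH")
    for command in ebook_convert_commands(epub_path, formats):
        subprocess.run(command, check=True)


for command in ebook_convert_commands("book.epub"):
    print(" ".join(command))
```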

Ok… the holy grail… people still like paper books, and paper books are printed from PDF. Paper these days is a post-digital artifact. So first you need that awkward sounding book-formatted PDF.

Here there are an array of options and then there is this very exciting world that can open to you if you are willing to live a little on the bleeding edge…. I’m referring to CSS Regions… but let’s come back to that.

First, I want to say I am disappointed that some ‘Open Source’ projects use proprietary code for HTML-to-PDF conversion. That includes Press Books and Wikipedia. Wikipedia is re-tooling their entire book-formatted-PDF conversion process to be based on LaTeX and that is an awesome decision. However, right now they use the proprietary PrinceXML, as does Press Books. I like both projects, but I get a little disheartened when projects with a shared need don’t put some effort into an Open Source solution for their toolchain.

All book production platforms that produce paper books need an HTML-to-PDF renderer to do the job. If it is closed source then I think it needs to be stated that the project is partially Open Source. I’m a stickler for this kind of stuff but also, I am saddened that adoption of proprietary components stops the effort to develop the Open Source solutions we need, while simultaneously enabling proprietary solutions to gain market dominance – which, if you follow the logic through, traps the effort to develop competitive Open Source solutions in a vicious circle. I wish that more people would try, like the Wikimedia Foundation is trying, to break that cycle.

The browser as renderer

There is one huge Open Source hero in this game. Jakob Truelsen. He created WKHTMLTOPDF when he was a university tutor because he wanted his students to be able to write in HTML and give him nicely formatted PDF for evaluation. So he grabbed a headless Webkit, added some Qt magic, some tweaks, and made a command line application that converts HTML to book-formatted PDF. We used it in the early days of FLOSS Manuals and it is still one of the renderer choices in the Booktype file conversion suite (Objavi). It was particularly helpful when we needed to produce books in Farsi which contain right to left text. No HTML to PDF renderer supported this at the time except WKHTMLTOPDF because it was based on a browser engine that had RTL support built in.
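
For a feel of how that command line application is driven from a backend, here is a minimal sketch assuming the `wkhtmltopdf` binary is installed. The helper names, file names, page size, and margin values are assumptions for the example; the `--page-size` and `--margin-*` options are as I understand them from the tool's help text, so check `wkhtmltopdf --help` for your build.

```python
import subprocess


def wkhtmltopdf_command(html_path, pdf_path, page_size="A5", margin="20mm"):
    """Build a wkhtmltopdf command line with a page size and uniform margins."""
    command = ["wkhtmltopdf", "--page-size", page_size]
    for side in ("top", "bottom", "left", "right"):
        command += [f"--margin-{side}", margin]
    return command + [str(html_path), str(pdf_path)]


def render_pdf(html_path, pdf_path):
    """Run the renderer; raises CalledProcessError if wkhtmltopdf fails."""
    subprocess.run(wkhtmltopdf_command(html_path, pdf_path), check=True)


print(" ".join(wkhtmltopdf_command("book.html", "book.pdf")))
```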

Some years later WKHTMLTOPDF was floundering, mainly because Jakob was too busy, and I tried to help create a consortium around the project to find developers and finance. However I didn’t have the skills, and there was little interest. Thankfully the problem solved itself over time, and WKHTMLTOPDF is now a thriving project and very much in demand.

WKHTMLTOPDF really does a lot of cool stuff, but more than this, I firmly believe the approach is the right approach. The application uses a browser to render the PDF… that is a HUGE innovation and Jakob should be recognised for it. What this means is – if you are making your book in HTML in the browser, you have at your fingertips lots of really nice tools like CSS and JavaScript. So, for example, you can style your book with CSS or add JavaScript to support the rendering of Math, or use typography JavaScripts to do cool stuff… When you render your book to PDF with a browser, you get all that stuff for free. So your HTML authoring environment and your rendering environment are essentially the same thing… I can’t tell you how much that idea excites me. It is just crazy! This means that all those nice JavaScripts you used, and all that nice CSS which gave you really good looking content in the browser, will give you the same results when rendered to PDF. This is the right way to do it and there is even more goodness to pile on, as this also means that your rendering environment is standards-based and open source…

Awesome. This is the future. And the future is actually even brighter for this approach than I have stated. If you are looking to create dynamic content – let’s say cool little interactive widgets based on the incredible Tangle library – for ebooks (including web-based HTML) … if you use a browser to render the PDF you can actually render the first display state of the dynamic content in your PDF. So, if you make an interactive widget, in the paper book you will see the ‘frozen’ version, and in the ebook/HTML version you get the dynamic version – without having to change anything. I tested this a long time ago and I am itching to get my teeth into designing content production tools to do this.

So many things to do. You can get an idea how it works by visiting that Tangle link above… try the interactive widgets in the browser, and then just try printing to PDF using the browser… you can see the same interactive widgets you played with also print nicely in a ‘static’ state. That gets the principle across nicely.

So a browser-based renderer is the right approach, and Prince, which is, it must be pointed out, partly owned by Håkon Wium Lie, is trying to be a browser by any other name. It started with HTML and CSS to PDF conversion and now…oo!… they added JavaScript… so…are they a browser? No? I think they are actually building a proprietary browser to be used solely as a rendering engine. It just sounds like a really bad idea to me. Why not drop that idea, contribute to an actual open source browser, and use that? And those projects that use Prince, why not contribute to an effort to create browser-based renderers for the book world? It’s actually easier than you think. If you don’t want to put your hands into the innards of WebKit, then do some JavaScript and work with CSS Regions (see below).

This brings us to another part of the browser-as-renderer story, but first I think two other projects need calling out for thanks. ReportLab for a long time was one of the only command line book-formatted-PDF rendering solutions. It was proprietary but had a community license. That’s not all good news, but at least they had one foot in the Open Source camp. However, what really made ReportLab useful was Dirk Holtwick’s Pisa project that provided a layer on top of ReportLab so you could convert HTML to book-formatted-PDF.

The bleeding edge

So, to the bleeding edge. CSS Regions is the future for browser-based PDF rendering of all kinds. Interestingly Håkon Wium Lie has said, in a very emphatic way, that CSS Regions is bad for the web… perhaps he means bad for the PrinceXML business model? I’m not sure, I can only say he seemed to protest a little too much. As a result, Google pulled CSS Regions out of Chrome. Argh.

However CSS Regions are supported in Safari, and in some older versions of Chrome and Chromium (which you can still find online if you snoop around). Additionally, Adobe has done some awesome work in this area (they were behind the original implementation of CSS Regions in WebKit – the browser engine that used to be behind Chrome and which is still used by Safari). Adobe built the CSS Regions polyfill – a JavaScript that plays the same role as built-in CSS Regions.

When CSS Regions came online in early 2012, Remko Siemerink and I experimented with CSS Regions at an event at the Sandberg (Amsterdam) for producing book-formatted PDF. I’m really happy to see that one of these experiments is still online (NB this needs to be viewed in a browser supporting CSS Regions).

It was obviously the solution for pagination on the web, and once you can paginate in the browser, you can convert those web pages to PDF pages for printing. This was the step needed for a really flexible browser-based book-formatted-PDF rendering solution. It must be pointed out however, that it’s not just a good solution for books… at BookSprints.net we use CSS Regions to create a nicely formatted and paginated form in the browser to fill out client details. Then we print it out to PDF and send it…

Adobe is on to this stuff. They seem to believe that the browser is the ‘design surface’ of the future. Which seems to be why they are putting so much effort into CSS Regions. I’m not a terribly big fan of InDesign and proprietary Adobe strategies and products, but credit where credit is due. Without Adobe, CSS Regions would just be an idea, and they have done it all under open source licenses (according to Alan Stearns from Adobe, the Microsoft and IE teams also contributed to this quite substantially).

At the time CSS Regions were inaugurated, I was in charge of a small team building Booktype in Berlin, and we followed on from Remko’s work, grabbed CSS Regions, and experimented with a JavaScript book renderer. In late 2012, book.js was born (it was a small team but I was lucky enough to be able to dedicate one of my team, Johannes Wilm, to the task). It’s a JavaScript that leverages CSS Regions to create paginated content in the browser, complete with a table of contents, headers, footers, left-right margin control, front matter, title pages…etc… We have also experimented with adding contenteditable to the mix so you can create paginated content, tweak it by editing it directly in the browser, and output it to PDF. It works pretty well and I have used it to produce 40 or 50 books, maybe more. The Fidus Writer team has since forked the code to pagination.js which I haven’t looked at too closely yet as I’m quite happy with the job book.js does.

CSS Regions is the way to go. It means you can see the book in the browser and then print to PDF and get the exact same results. It needs some CSS wizardry to get it right, but when you get it right, it just works. Additionally, you can compile a browser in a headless state and run it on the command line if you want to render the book on the backend.

Wrapping it all up

There is one part of this story left to be told. If you are going to go down this path, I thoroughly recommend you create an architecture that will manage all these conversion processes and which is relatively agnostic to what is coming in and going out. For Booktype, Douglas Bagnall and Luka Frelih built the original Objavi, which is a Python-based standalone system that accepts a specially formatted zip file (booki.zip) and outputs whatever format you need. It manages this through an API, and it serves Booktype pretty well. Sourcefabric still maintains it and it has evolved into Objavi 2.

However, I don’t think it’s the optimal approach. There are many things to improve with Objavi. Possibly the most important is that EPUB should be the file format accepted, and that after the conversion process takes place, EPUB should be returned to the book production platform with the assets wrapped up inside. If you can do this, you have a standards-based format for conversion transactions, and then any project that wants to can use it. More on this in another post. Enough to say that the team at PLOS are building exactly this and adding on some other very interesting things to make ‘configurable pipelines’ that might take format X through an initial conversion, through a clean up process, and then a text mining process, stash all the metadata in the EPUB and return it to the platform. But that’s a story for another day…
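
To make the ‘EPUB in, EPUB out’ idea a little more concrete, here is a minimal sketch of a configurable pipeline. It is not how Objavi or the PLOS system works; the step functions, the `Files` type and the `META-INF/word-count.txt` file are all hypothetical names invented for illustration. The only thing taken from the EPUB container format is that the archive is a zip whose first entry is an uncompressed `mimetype` file.

```python
import io
import zipfile
from typing import Callable

# A step receives the files of an EPUB (name -> bytes) and returns updated files.
Files = dict[str, bytes]
Step = Callable[[Files], Files]


def read_epub(data: bytes) -> Files:
    """Unpack an EPUB (a zip archive) into a name -> bytes mapping."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


def write_epub(files: Files) -> bytes:
    """Pack files back into a zip, with 'mimetype' first and uncompressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mimetype", files.get("mimetype", b"application/epub+zip"),
                         compress_type=zipfile.ZIP_STORED)
        for name, content in files.items():
            if name != "mimetype":
                archive.writestr(name, content)
    return buffer.getvalue()


def run_pipeline(epub: bytes, steps: list[Step]) -> bytes:
    """Pass the EPUB through each configured step and return a new EPUB."""
    files = read_epub(epub)
    for step in steps:
        files = step(files)
    return write_epub(files)


def clean_up(files: Files) -> Files:
    """Hypothetical clean-up step: strip trailing whitespace in XHTML files."""
    cleaned = dict(files)
    for name, content in files.items():
        if name.endswith(".xhtml"):
            lines = content.decode("utf-8").splitlines()
            cleaned[name] = "\n".join(line.rstrip() for line in lines).encode("utf-8")
    return cleaned


def add_word_count(files: Files) -> Files:
    """Hypothetical 'text mining' step: stash a crude token count in the EPUB."""
    total = sum(len(content.decode("utf-8").split())
                for name, content in files.items() if name.endswith(".xhtml"))
    return {**files, "META-INF/word-count.txt": str(total).encode("utf-8")}
```

A platform would call something like `run_pipeline(epub_bytes, [clean_up, add_word_count])` and get an EPUB back. Because every step speaks the same format, steps can be reordered or swapped without the platform needing to know.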

book.js

I mentioned in an earlier post that the movable type of Gutenberg’s time has become realtime: in a very real sense, each book is typeset as we read it. Content is dynamically re-flowed for each device depending on display dimensions and individualised settings. Since ebooks are web pages, browsers have come to play a central role in digital e-readers.
Three books produced by the book.js in-browser typesetting library. Photo by Kristin Tretheway.
What is interesting here is that the browser can also reflow content into fixed page formats such as PDF, which means that the browser is on its way to becoming the typesetting engine for print. book.js demonstrates nicely the role of the browser as print typesetting engine.

book.js is a JavaScript library that you can use to turn a web page into a PDF formatted for printing as a book. Take a web page, add the JavaScript, and you will see the page transformed into a paginated book complete with page breaks, margins, page numbers, table of contents, front matter, headers and so on. When you print that page, you have a book-formatted PDF ready to print. It’s that simple.
Plain HTML file with book content
Same file with book.js applied
A page with an illustration
Illustration of Table of Contents automatically generated by book.js

It brings us closer to in-browser print design and a step closer to the demise of desktop publishing. Although book.js is in an alpha form, it is a clear demonstration that the browser is fast becoming the new environment for print design.

That is an enormous leap, one that not only means print design environments can be developed using browser-based technology, which will surely lead to enormous innovation, but it radically changes the process of design. The design of books and paper products enters a networked environment. This will enable more possibilities for collaborative design and bring print production into the workflow of online content production. There will be no need to exit browser-based environments to take content from source to final output. This means there is no need to juggle multiple sources for different stages of production, there can be efficiency gains through integrated workflow, and, most interestingly, content production and design can occur simultaneously…

It is also important to realise that these same technologies, book.js and others that will follow it, can make the same things possible for ebook production. Flowing text into PDF for a paper book, or into e-reader screen display dimensions, is the same thing. This enables synchronous in-browser design and production on a single source for multiple output formats.
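
The point that print and e-reader output are the same problem can be sketched in a few lines. This is not book.js code; it is a simplified, assumed model in which each block of content has a known height and the only thing that changes between targets is the height of the page it flows into.

```python
def paginate(block_heights: list[int], page_height: int) -> list[list[int]]:
    """Flow blocks, in order, into pages of a fixed height.

    Returns one list of block indexes per page. A block taller than the
    page still gets a page of its own rather than being split.
    """
    pages: list[list[int]] = [[]]
    used = 0
    for index, height in enumerate(block_heights):
        if pages[-1] and used + height > page_height:
            pages.append([])  # page break: start filling the next box
            used = 0
        pages[-1].append(index)
        used += height
    return pages


# Hypothetical block heights for one chapter, in arbitrary units.
blocks = [30, 50, 40, 20, 60]

print_pages = paginate(blocks, page_height=100)   # a taller paper page
ereader_pages = paginate(blocks, page_height=60)  # a shorter screen
```

The same source and the same function give three pages for the taller target and four for the shorter one. A real engine also has to split paragraphs, handle images and keep headings with their text, but the flow is the same.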

book.js is Open Source, developed originally by and for Booktype, but the team is looking to collaborate with whoever would like to push this code base further. It is at the alpha stage and a lot of work still needs to be done, so please consider jumping in, improving the code and contributing back into the public repository.

book.js demo and information can be found here. Note: This is strictly for the geeks to try as it requires the latest version of Chrome; see the demo information.

Originally posted on O’Reilly, 29 October 2012
http://toc.oreilly.com/2012/10/bookjs-turns-your-browser-into-a-print-typesetting-engine.html

Circumvention Book Sprint II

I just finished facilitating a Book Sprint about circumvention called “How to Bypass Internet Censorship”. We spent 5 days outside of Berlin updating the book we first created in a Sprint in 2008. It was a ‘re-sprint’ if you like and was extremely successful.

New Update – http://www.lulu.com/product/paperback/how-to-bypass-internet-censorship/15054026

Right now, you can buy this book from lulu.com and you can also contribute to it through the FLOSS manuals installation of Booki – http://booki.flossmanuals.net/bypassing-censorship/edit

It will also be available shortly on the FLOSS Manuals website – I just need to finish the integration with Booki.

The first version of this book was extremely successful – being translated into Burmese, simplified Chinese, Russian, Vietnamese, Spanish, French, and Farsi. Most of these were also distributed in perfect book form.

English, Russian, Arabic, Spanish versions

The book-formatted PDFs for the above books, including those with bi-directional text (Farsi, Arabic etc.), were all generated using Booki.

The new book is *much* better, with beautiful illustrations and a cover provided by Laleh Torabi, many new chapters, updates of old chapters and some new sections. Buy it now or wait a few days for the free version…

Importing Archive.org Books with Booki

For some months, Booki has been able to import Archive.org books. This development was sponsored by Archive.org. When importing a book, Booki requests an ePub from Archive.org, converts this to the ‘native file format’ (booki-zip) and loads this into the Booki database. It is then possible to export the same book back into an ePub file.

So, if Booki can import an Archive.org ePub and then export it as ePub what is the point? Seems like Booki is an unnecessary conduit. Well, one point is that with Booki you can export the book into multiple formats – such as book-formatted PDF. That means you can take any of those luscious out-of-copyright books, import them into Booki and make real books from them. This is pretty exciting when you see just how lovely some of these books are. Take for example the copy of Cinderella in the American Libraries section of Archive.org.

Cinderella original edition
Cinderella Edward Dalziel, 1865

This version of Cinderella is out-of-copyright and you can republish it as you like. This is a pretty exciting prospect, opening the door for anyone to start their own publishing house by importing content into Booki, styling it, and exporting it to print-formatted PDF for printing.

However, there are a few steps that you may need to go through first, and this is the real reason why we have implemented importing from Archive.org. All the books in the Archive.org libraries have been created using OCR (Optical Character Recognition) scanning. The process involves loading books onto book scanners and scanning each page.

Archive.org Book Scanner.

However, scanning creates a certain number of errors. OCR doesn’t render all text correctly and cannot tell the difference between text on a page and text in an image. Hence images with embedded text are usually split up, with the text elements saved as plain text and the surrounding image saved as multiple smaller images. So the OCR-scanned books need proofing, and the import feature in Booki enables proofing of OCR-scanned books from Archive.org. This means that teams can get together remotely, choose a selection of Archive.org books, and get to work improving them.

While this is all working, we want to build a tighter workflow and a few extra tools to assist the proofing process (if you are a developer familiar with Python and interested in helping us with this good cause then let us know). Douglas Bagnall (Booki/Objavi developer) recently extended the import functionality so that all the metadata imported from Archive.org is preserved. This opens the door to utilising this information to assist proofing of the content – we hope, for example, to eventually be able to show the complete digital image of the original scan, before it was reduced to OCR, alongside the OCR pages to assist proofing. Watch this space!

Incidentally, Booki can import any ePub, so this means that the way is open for the same proofing process to be applied to other OCR scanning projects. If you have a project like this then let us know, maybe we can help.