Importing Archive.org Books with Booki

For some months, Booki has been able to import Archive.org books. This development was sponsored by Archive.org. When importing a book, Booki requests an ePub from Archive.org, converts this to the ‘native file format’ (booki-zip) and loads this into the Booki database. It is then possible to export the same book back into an ePub file.

So, if Booki can import an Archive.org ePub and then export it as ePub what is the point? Seems like Booki is an unnecessary conduit. Well, one point is that with Booki you can export the book into multiple formats – such as book-formatted PDF. That means you can take any of those luscious out-of-copyright books, import them into Booki and make real books from them. This is pretty exciting when you see just how lovely some of these books are. Take for example the copy of Cinderella in the American Libraries section of Archive.org.

Cinderella original edition
Cinderella Edward Dalziel, 1865

This version of Cinderella is out-of-copyright and you can republish as you like. This is a pretty exciting prospect, opening the door for anyone to start their own publishing house importing content from Booki, styling, and exporting to print-formatted-PDF for printing.

However, there are a few steps that you may need to go through first, and this is the real reason why we have implemented importing from Archive.org. All the books in the Archive.org libraries have been created using OCR (Optical Character Recognition) scanning. The process involves loading books onto book scanners and scanning each page.

Archive.org Book Scanner.

However, scanning creates a certain amount of errors. OCR doesn’t render all text correctly and cannot tell the difference between text on a page and text in an image. Hence images with embedded text are usually split up, with the text elements saved as plain text and the surrounding image saved as multiple smaller images. So the OCR-scanned books need proofing and the import feature in Booki enables proofing of OCR scanned books from Archive.org. This means that teams can get together remotely, choose a selection of Archive.org books, and get to work improving them.

While this is all working, we want to build a tighter workflow and a few extra tools to assist the proofing process (if you are a developer familiar with Python and interested in helping us with this good cause then let us know). Douglas Bagnall (Booki/Objavi developer) recently extended the import functionality so that all the metadata imported from Archive.org is preserved. This opens the door to utilising this information to assist proofing of the content – we hope, for example, to eventually be able to show the complete digital image of the original scan, before it was reduced to OCR, alongside the OCR pages to assist proofing. Watch this space!

Incidentally, Booki can import any ePub, so this means that the way is open for the same proofing process to be applied to other OCR scanning projects. If you have a project like this then let us know, maybe we can help.