Importing Archive.org Books with Booki

For some months, Booki has been able to import Archive.org books. This development was sponsored by Archive.org. When importing a book, Booki requests an ePub from Archive.org, converts it to Booki’s native file format (booki-zip), and loads the result into the Booki database. It is then possible to export the same book back into an ePub file.
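
As a rough illustration of the first step (requesting an ePub from Archive.org), here is a minimal sketch in Python. The URL pattern, the function names and the example identifier are all assumptions made for illustration; this is not Booki’s actual import code.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: Archive.org serves an item's derived ePub at this path.
# Check the item's download page if a request fails.
EPUB_URL_TEMPLATE = "https://archive.org/download/{id}/{id}.epub"


def epub_url(identifier):
    """Build the (assumed) download URL for an Archive.org item's ePub."""
    safe_id = quote(identifier, safe="")  # escape characters unsafe in a URL path
    return EPUB_URL_TEMPLATE.format(id=safe_id)


def fetch_epub(identifier, timeout=30):
    """Download the ePub and return its raw bytes (hypothetical helper)."""
    with urlopen(epub_url(identifier), timeout=timeout) as response:
        return response.read()
```

Calling `fetch_epub("some-item-identifier")` would return the bytes of the ePub, which an importer could then unpack and convert. The identifier here is a placeholder, not a real item.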

So, if Booki can import an Archive.org ePub and then export it as an ePub, what is the point? It might seem that Booki is an unnecessary conduit. Well, one point is that with Booki you can export the book into multiple formats, such as book-formatted PDF. That means you can take any of those luscious out-of-copyright books, import them into Booki and make real books from them. This is pretty exciting when you see just how lovely some of these books are. Take, for example, the copy of Cinderella in the American Libraries section of Archive.org.

Cinderella, original edition (Edward Dalziel, 1865)

This version of Cinderella is out of copyright, and you can republish it as you like. This is a pretty exciting prospect: it opens the door for anyone to start their own publishing house by importing content into Booki, styling it, and exporting it to print-formatted PDF for printing.

However, there are a few steps that you may need to go through first, and this is the real reason why we have implemented importing from Archive.org. All the books in the Archive.org libraries have been created using OCR (Optical Character Recognition) scanning. The process involves loading books onto book scanners and scanning each page.

Archive.org Book Scanner.

However, scanning introduces a certain number of errors. OCR doesn’t render all text correctly and cannot tell the difference between text on a page and text in an image. Hence images with embedded text are usually split up, with the text elements saved as plain text and the surrounding image saved as multiple smaller images. So the OCR-scanned books need proofing, and the import feature in Booki makes it possible to proof OCR-scanned books from Archive.org. This means that teams can get together remotely, choose a selection of Archive.org books, and get to work improving them.

While this is all working, we want to build a tighter workflow and a few extra tools to assist the proofing process (if you are a developer familiar with Python and interested in helping us with this good cause then let us know). Douglas Bagnall (Booki/Objavi developer) recently extended the import functionality so that all the metadata imported from Archive.org is preserved. This opens the door to utilising this information to assist proofing of the content – we hope, for example, to eventually be able to show the complete digital image of the original scan, before it was reduced to OCR, alongside the OCR pages to assist proofing. Watch this space!

Incidentally, Booki can import any ePub, so the same proofing process can be applied to other OCR scanning projects. If you have a project like this, let us know; maybe we can help.
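
If you want to check what metadata an ePub carries before importing it, the sketch below may help. An ePub is a ZIP archive whose META-INF/container.xml points to a package (OPF) file holding Dublin Core metadata such as title and language. The function name is made up for illustration, and this is not how Booki itself reads the file; it only uses Python’s standard library.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the ePub container format and Dublin Core.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_epub_metadata(epub_file):
    """Return (opf_path, metadata) for an ePub given as a path or file object.

    metadata maps Dublin Core element names (e.g. "title") to lists of values.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml names the package (OPF) file inside the archive.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    metadata = {}
    prefix = "{" + DC_NS + "}"
    for element in package.iter():
        # Collect only Dublin Core elements (dc:title, dc:creator, ...).
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            metadata.setdefault(name, []).append((element.text or "").strip())
    return opf_path, metadata
```

For example, `read_epub_metadata("cinderella.epub")` (a placeholder filename) would return the path of the package file and a dictionary of whatever Dublin Core fields the ePub declares.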

Booki, OLPC and OER

You may be familiar with the One Laptop Per Child (OLPC) project. It’s pretty well known and aims to provide free laptops to children all over the world who otherwise could not afford them.

The OLPC laptop is also a pretty good ebook reader, as demonstrated here:

eBook on the OLPC

The above image is taken from Reading and Sugar, an excellent manual by James Simmons about working with ebooks on the OLPC. The image shows a book taken from Archive.org and imported into Booki. Booki then exported it as an ePub, which was opened on the OLPC as shown.

In the same manual, James talks about using Booki on the OLPC to author ebooks. To quote James:

“Booki is one of the best tools available for Sugar users to create e-books.  It can be used on the XO or from Sugar on a Stick.  It supports many authors collaborating on a single book.  It supports translating books into many languages.  It can create PDFs and EPUBs.  It can create books formatted for print-on-demand services.  It can create documents in Open Office ODT format (which Open Office can convert to MS Word format).  It can even be used to download, proofread, and correct EPUBs created by the Internet Archive.

Booki is an excellent option for teachers preparing textbooks, but it can be used by students for their own projects too.”

Below is an image from the same manual showing Booki being used in the Browse activity (the OLPC browser).

Booki on the OLPC

We are hoping the good work James has been doing will help raise awareness of Booki as a platform for book authoring on the OLPC. That would open up the world of publishing considerably and, we hope, open up exciting possibilities for OER (Open Educational Resources).