The Four Phases of Knowledge Production

Knowledge production consists of four basic phases – manage, create, process, and share. When designing knowledge production systems, it pays to keep these four phases in mind and to build platforms with an eye on this high-level abstraction. The phases can follow one another as linear dependencies, overlap, or run concurrently.

‘Thinking from above’ can help us to better understand where the needs of each phase might be placed within the system we are designing.

Manage – when producing a knowledge asset, there needs to be some management of the context. This mainly covers setting up and managing users, roles, permissions, and groups. However, ‘manage’ is also an appropriate way to think about the ways users access content. A dashboard is a management interface, as are interfaces for creating a Table of Contents (book) or keeping track of a Collection (eg a set of related journals). The usefulness of thinking about management needs in this way is that it highlights that a dashboard (consider Google Drive as a dashboard) and a Table of Contents interface for an online book production system are the same category of interface. They might have a different treatment according to the use case, but they facilitate much the same kinds of activities.
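
As a rough sketch (the names here are invented, not taken from any particular platform), a dashboard and a Table of Contents can be modelled as the same generic ‘collection of assets’ interface, differing only in labels and in who is permitted to manage them:

    # Hypothetical sketch: a dashboard and a book TOC as one interface category.
    from dataclasses import dataclass, field

    @dataclass
    class CollectionView:
        title: str                                        # "My Documents", "Table of Contents", ...
        items: list = field(default_factory=list)         # documents, chapters, journals
        allowed_roles: set = field(default_factory=set)   # who may manage the collection

        def add_item(self, item, user_role):
            if user_role not in self.allowed_roles:
                raise PermissionError(f"{user_role} may not manage this collection")
            self.items.append(item)

    # The same abstraction, two use cases:
    dashboard = CollectionView("My Documents", allowed_roles={"owner"})
    toc = CollectionView("Table of Contents", allowed_roles={"author", "editor"})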

Create – content needs to be created, so we need content-creation interfaces. We are familiar with these: think about blogging and where you write your posts – that is a content creation interface; now think about a book production system where you write a chapter – that is also a content creation interface. Both belong in the same high-level abstract container.

We need to think of content creation interfaces at a high level of abstraction when designing these kinds of systems for many reasons.

First, it is useful to think about how components may be re-used. What, for example, is the difference between an interface used to create a blog post and one used to create a chapter? For a great many use cases, the answer is: nothing. This re-use approach is taken by PressBooks, for example. PressBooks uses WordPress as a book production platform (note: I don’t agree this is a good idea, as WordPress is an entire suite best suited to blogs and not books, but I am pointing out the similarity of the content production needs at a very high level).

Second, it is interesting to ask ourselves what kind of content we are trying to produce and whether we have the right type of content production interface. Think, for example, of a book production system. All (except one) of the online book production tools I am familiar with have one kind of content creation interface – a WYSIWYG editor attached to a blank page. You use the editor to fill up the page until your chapter is done.

But what if you need to produce a glossary? Most book production systems use the same tool. That doesn’t seem like a good idea. Glossaries have very specific needs – users need to be able to sort entries, create new items, perhaps even translate terms into other languages and sort by those languages, and so on. A ‘waterfall’ cascading content creation interface (WYSIWYG and a blank page) doesn’t meet that need very well. What if you wish to produce two-page spreads (in the case of paper books), or an index? Then imagine that instead of books, we are looking to produce annotated data sets.

We need to liberate ourselves from the one-size-fits-all approach to content production, and placing these needs in a high-level abstract container allows us to think of the needs and not the UI.
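
To make that concrete, here is a minimal sketch (hypothetical, not from any existing system) of a glossary modelled as its own content type rather than as text on a blank page, so that the needs (sorting, translating) drive the interface rather than the other way around:

    # Hypothetical sketch: a glossary as structured content rather than a blank page.
    from dataclasses import dataclass, field

    @dataclass
    class GlossaryEntry:
        term: str
        definition: str
        translations: dict = field(default_factory=dict)  # e.g. {"de": "Begriff"}

    @dataclass
    class Glossary:
        entries: list = field(default_factory=list)

        def sorted_by(self, lang=None):
            # Sort by the original term, or by its translation in a given language.
            key = (lambda e: e.translations.get(lang, e.term)) if lang else (lambda e: e.term)
            return sorted(self.entries, key=key)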

Process – one thing I have come to learn through working in STM publishing is that publishers add value to content by improving it. That is pretty much what every publisher does. In my mind, we should be re-framing publishers and calling them processors. After all, making content public these days is a doddle. And the difference between ‘self-publishing’ and ‘publishing’… is the processing bit. Self-publishers (generally speaking) do not have access to the same level of experienced and useful processing that can improve a work. As time goes on, we will place less and less value on the ‘publishing’ part, and in the case of STM, the long wait before science is made public is already an impediment to progress.

The processing of content is an important part of knowledge production. If we want good knowledge, we need to be able to bring into play all those people and (sometimes) machines at the right moment to improve the work. That is essentially what workflow is all about. Processing is a high-level abstraction for workflow. It is important to consider processing at this level of abstraction because far too many systems build hard-coded workflow pipelines into their platforms, and that closes off the opportunity to reconsider, optimise, and even radicalise workflows. I also consider other UI elements, such as discussions, to belong in the ‘process’ bucket. Discussions are workflow, and they are often the only mechanism that can account for the high level of specificity and issue resolution required per knowledge asset (eg issues for each manuscript, chapter etc).
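
As a sketch of what ‘not hard-coding the workflow’ can mean in practice (the stage names below are invented for illustration), the pipeline becomes data that the platform reads rather than logic baked into it:

    # Hypothetical sketch: workflow as configuration, not hard-coded pipeline logic.
    workflow = [
        {"stage": "submission", "roles": ["author"]},
        {"stage": "review",     "roles": ["reviewer", "editor"]},
        {"stage": "copy-edit",  "roles": ["copy-editor", "author"]},
        {"stage": "publish",    "roles": ["editor"]},
    ]

    def advance(current_stage, workflow):
        """Return the next configured stage for a knowledge asset (manuscript, chapter...)."""
        stages = [step["stage"] for step in workflow]
        i = stages.index(current_stage)
        return stages[i + 1] if i + 1 < len(stages) else current_stage

    # advance("review", workflow) -> "copy-edit"
    # Reordering or adding stages means editing the configuration, not the platform code.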

Share – formerly this meant getting the book to the bookshop, but under current conditions it is something else entirely. In the age of digital assets and networked communications, sharing a work is all about file formats, APIs, and syndication. Avenues for sharing, and the requirements of this process, are proliferating daily, and the needs can be so complex that any system built to manage ‘sharing’ needs to be extremely flexible.

Thinking about sharing at a high level of abstraction helps us enormously here. For example, ingestion of .docx into an HTML-based knowledge production system is actually an act of sharing. It consists of file conversion and feeding the result into the production system for others to access. That ‘feeding’ of content into our production system is a type of syndication. And the later ‘export’ of the finished product to some other system (eg a book sales system) requires a further process of file conversion (to the target book format) and feeding that format into the target system. It is exactly the same high-level process as ‘ingestion’. Both belong in the ‘sharing’ bucket.

So we can see that ‘import/ingestion’ and ‘export’ are actually the same thing. Consequently, we can save ourselves a lot of effort by building a framework in any knowledge production system that will ‘do both’ (ie recognise that there is no difference between import and export, and handle both rather than building redundant parallel processes).
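
A minimal sketch of that idea (the function and system names are hypothetical): both ingestion and export are ‘convert, then feed into a target system’, so a single framework can drive both directions:

    # Hypothetical sketch: one 'sharing' framework for both ingestion and export.
    def share(source_file, convert, feed):
        """Convert a file, then feed the result into a target system."""
        converted = convert(source_file)
        return feed(converted)

    # Ingestion: .docx -> HTML, fed into the production system, e.g.
    #   share("manuscript.docx", convert=docx_to_html, feed=production_system.add)
    # Export: HTML -> EPUB, fed into a book sales system, e.g.
    #   share("book.html", convert=html_to_epub, feed=sales_system.upload)
    # (docx_to_html, production_system, etc are made-up names for illustration.)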

The beauty of this ‘conceptual schema’ for knowledge production is that we can apply it to a wide variety of use cases to understand the knowledge production process at hand, and the variations and similarities between any of them. For example, Book Sprints traverse each of these four phases, as does a typical book from a publisher, as does a Wikipedia article, as does a grant application to a funder. Thinking of the process that way helps us see where the variance is – and helps us to better focus on designing for the needs of each. This suggests that a really good, efficient, single system can be designed to enable the production of a vast range of knowledge types, and to accommodate apparently different processes which were formerly housed in standalone, single-use-case platforms.

Colophon: written in Piha after walking on the beach, then cleaned somewhat by Raewyn. Written using Ghost blogging software (free software!).

Why Persistent Identifiers are the Wrong Idea

I think I will rewrite this. It seems to me it’s only half the solution… more coming shortly.

This afternoon I was reading Elizabeth Eisenstein’s “The Printing Revolution in Early Modern Europe” when I came across this passage:

To consult different books, it was no longer so essential to be a wandering scholar...The era of the glossator and the commentator came to an end, and a new "era of intense cross-referencing between one book and another" began.

The point here is that after the printing press came about, there were more books available. An obvious point. As a result, cross-referencing became a feature, since it was possible to access and, consequently, reference other literature more readily.

This made me think about the current discussions around persistent identifiers for scholarly content. It seems the current solution is to offer a layer of indirection: this enables a stable identifier to persist, and should the ‘actual location’ of the content be changed, then we can re-configure the redirect to point to the new location.

Martin Fenner and Geoff Bilder point to this solution in their very good postings on this topic. However, this method does not overcome the real problem. What if either:

  • no one updates the redirection after the location has changed
  • the content really goes offline (through loss of domain, for example)

It appears to me that we have the wrong solution. There is really no way to solve this issue with URIs. We can only minimise it.

So, how to go about resolving this (so to speak)?

One way to get some insight into the issues is to wind back the clock and look at the way content was located in the age of the newly-born printing press. In this age, scholars were liberated because they didn’t have to wander the world looking for a particular book. Instead, identical printed copies proliferated, and it was just a matter of finding a copy of the work you were pursuing. Preferably you found a copy in a library or shop nearby. To find that work, one merely needed the cross-reference information to track it down. That “persistent identifier”, composed of the author’s name, the book’s title, the page reference for the quoted material, the publisher’s name and location, and the publication date, was commonly referred to as a citation (we still use that term). The citation helped the reader or researcher find a copy of the work cited. Not a particular printed book, but a copy of that book.

So, how is it that in an age of digital media we have gone backwards? Where copying something is even easier than in the printed age, why are we still pointing to ‘one’ authoritative copy? In essence, we are still referencing a book by stating the exact book that sits in a specific institution, on a particular shelf, with the blue (not green) cover.

It feels a little like the great leap backwards to me.

A way to get around this problem would be simply to allow and encourage content to be copied. Let digital media do what it does best – copy and distribute itself. The ‘unique identifier’ would then not be a URL (with a layer of indirection) but would take the form of a checksum or hash. Finding the right work would then be a matter of searching for a copy of the material with the right checksum.
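
A minimal sketch of what that could look like (the lookup function at the end is made up): compute a cryptographic hash of the content and use it as the identifier, so any faithful copy, wherever it lives, can be matched against the citation:

    # Minimal sketch: a content hash as the 'persistent identifier'.
    import hashlib

    def content_id(path):
        """Return the SHA-256 hash of a file's bytes; any exact copy yields the same id."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Finding the work is then a matter of searching any index or network for
    # copies matching the hash, e.g. lookup_copies(content_id("paper-v2.pdf"))
    # (lookup_copies and the filename are made up for illustration).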

I don’t know. I’m probably missing something. But it seems we have no problem tracking down YouTube videos when they spawn into the ether. We can also tell one version of software from another, no matter where it is and how it is labelled. Why not just let the content go and provide mechanisms to find a specific version of the content via hash search (also solving the issues of versioning URIs)?

Colophon: written on a lovely Sunday afternoon in the Mission. Adam got up Monday morning with that nagging feeling… rethought it. Talked to Raewyn. Rewrite coming. Written using Ghost (free) software.

It’s not US and THEM, it’s a TEAM, stupid

hi y’all

I have just been reading some posts on Scholarly Kitchen about content creation and the next wave of authoring systems.

It seems the STM sector has long needed a solution to get its publishing processes out of various traps. The most obvious trap is MS Word. A horrible format, to be sure, but it has long been the default file format for manuscript production, with a small tip of the hat to LaTeX for the technologically gifted. The recent discussions have mainly been about online ‘authoring systems’ that go beyond MS Word to anticipate documents that are fully transparent to whatever combination of machine and human interactions play a part in understanding and processing the information.

This ‘get-out-of-MS-Word-free’ card is a very attractive proposition. MS Word is basically a binary blob to most publishing systems (even though, in actual fact, the .docx format is XML, which also abruptly ends the false argument that XML inherently brings structure). As a ‘binary’ (go with me on this for now), MS Word is not transparent to the publication system: there is no record of when the author has worked on it; finding out what they have done between version xxx.xxx and version xxxx.xxxxxxxx is very difficult; nobody else can work on it while the author is working on it; and there is no control over structure, etc.
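
For the curious, this is easy to verify; a .docx file is just a ZIP archive of XML parts:

    # A .docx is a ZIP archive of XML parts; 'manuscript.docx' is a made-up filename.
    import zipfile

    with zipfile.ZipFile("manuscript.docx") as docx:
        print(docx.namelist())                 # word/document.xml, word/styles.xml, ...
        body = docx.read("word/document.xml")  # the manuscript body, as XML
        print(body[:200])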

So getting away from reliance on MS Word is the aim. But getting into what I might rename an ‘authoring-only’ platform is not the solution.

What is interesting about the SK forum is that there seems to be a very clear distinction in the minds of publishers between the worlds of the author and the publisher. Most of the comments make this split, and there is much talk of ‘authoring systems’.

It seems a little bizarre to me, as I don’t think it’s wise to think about the author and the publisher as being distinct entities. It’s not a matter of author and publisher working on separate processes to shepherd a manuscript through to publication: it is very much a team effort. Authors and publishers work together in a way that should not be dichotomised: they are a team.

If we don’t acknowledge that, then we will not be able to design good publication systems. There is a lot of unclear thinking around this topic at the moment. The “authoring system” model assumes that content is made in an authoring system by a writer, and then migrates to the publisher’s submission, processing and publishing system, where the publisher does some stuff, and then at various times pings the author back to make changes to metadata, submission information, the manuscript and attendant assets (eg figures)…

In this model, the author next takes the manuscript out of the publisher’s system, ingests it into the old authoring system, works on it, exports it, and re-ingests it into the publisher’s system… Hmmm… this cycle is exactly one of the pain points we were trying to avoid by getting away from Microsoft Word.

It seems to me that the current trend to build better authoring systems is a mistake. It is based on the false assumption that ‘MS Word’ is the problem, without realising that there is more to it. Word has been seen as the problem only because it has been the only problem in town. We don’t need better ‘authoring systems’ that repeat the separation between writing and publishing that is inherent in reliance on MS Word. We shouldn’t invest in new authoring systems and believe in them purely because they are ‘not Microsoft Word’. Rather, we need documents to be contained within submission and processing systems for the entire duration of their life, and they need to be completely operational and transparent within that system to all parties that must work on them. Without understanding that need, we are merely mitigating the problem by small steps whilst fooling ourselves that we have solved the larger problem.

We don’t want the author-publisher response/change cycle (a collaborative effort by the team, which includes author and publisher) to happen in separate systems. We want them working together in the same system. We need teams to work together in the most efficient way possible, and that is in the same (real-world or cyber) space. Teams work best when they work in the same space.

Though I see the current efforts towards authoring system development as interesting, unless they are integrated with processing and workflow features they will sooner or later be made redundant.

Colophon: written by Adam in 30 mins in a tizz. Tinkered with by Raewyn for another 30 mins. Written using Ghost software (free software!)