Tools for digital philology: Transcription


In the last few months I've had the opportunity to revisit all of the decisions I made in 2007 and 2008 about how I transcribe my manuscripts. In this post I'll talk about why I make full transcriptions in the first place, the system I devised six years ago, my migration to T-PEN, and a Python tool I've written to convert my T-PEN data into usable TEI XML.

Transcription vs. collation

When I make a critical edition of a text, I start with a full transcription of the manuscripts that I'm working with, fairly secure in the knowledge that I'll be able to get the computer to do 99% of the work of collating them. There are plenty of textual scholars out there who will regard me as crazy for this. Transcribing a manuscript is a lot of work, after all, and wouldn't it just be faster to do the collation myself in the first place? But my instinctive answer has always been no, and I'll begin this post by trying to explain why.

When I transcribe my manuscripts, I'm working with a plain-text copy of the text that was made via OCR of the most recent (in this case, 117-year-old) printed edition. So in a sense the transcription I do is itself a collation against that edition text - I make a file copy of the edition text and follow along in it with reference to the manuscript, and wherever the manuscript varies, I make a change in the file copy. At the same time, I can add notation for where line breaks, page divisions, scribal corrections, chapter or section markings, catchwords, colored ink, etc. all occur in the manuscript. By the end of this process, which is in principle no different from what I would be doing if I were constructing a manual collation, I have a reasonably faithful transcription of the manuscript I started with.

But there are two things about this process that make it, in my view, simpler and faster than constructing that collation. The first is the act I'm performing on the computer, and the second is the number of simultaneous comparisons and decisions I have to make at each point in the process. When I transcribe I'm correcting a single text copy, typing in my changes and moving on, in a lines-and-paragraphs format that is pretty similar to the text I'm looking at. The physical process is fairly similar to copy-editing. If I were collating, I would be working - most probably - in a spreadsheet program, trying to follow the base text word-by-word in a single column and the manuscript in its paragraphs, which are two very different shapes for text. Wherever the text diverged, I would first have to make a decision about whether to record it (that costs mental energy), then locate the correct cell in which to record the difference (that costs both mental energy and time spent switching from keyboard to mouse entry), and then decide exactly how to record the change in the appropriate cell (switching back from mouse to keyboard), thinking also about how it coordinates with any parallel variants in manuscripts already collated. Quite frankly, when I think about doing work like that I not only get a headache, but my tendinitis-prone hands also start aching in sympathy.

Making a transcription

So for my own editorial work I am committed to the path of making transcriptions now and comparing them later. I was introduced to the TEI for this purpose many years ago, and conceptually it suits my transcription needs. XML, however, is not a great format for anyone to write out by hand, and if I were to try, the transcription process would quickly become as slow and painful as the manual collation I have just described.

As part of my Ph.D. work I solved this problem by creating a sort of markup pidgin, in which I used single-character symbols to represent the XML tags I wanted to use. The result was that, when I had a manuscript line like this one:

[Image: a line from the manuscript]

whose plaintext transcription is this:

Եւ յայնժամ սուրբ հայրապետն պետրոս և իշխանքն ելին առ աշոտ. և

and whose XML might look something like this:

<lb/><hi rend="red">Ե</hi>ւ յայնժ<ex>ա</ex>մ ս<ex>ուր</ex>բ 
հ<ex>ա</ex>յր<ex>ա</ex>պ<ex>ե</ex>տն պետրոս և 
իշխ<ex>ա</ex>նքն ելին առ աշոտ. և

I typed this into my text editor

*(red)Ե*ւ յայնժ\ա\մ ս\ուր\բ հ\ա\յր\ա\պ\ե\տն պետրոս և իշխ\ա\նքն 
ելին առ աշոտ. և

and let a script do the work of turning that into full-fledged XML. The system was effective, and had the advantage that the text was rather easier to compare with the manuscript image than full XML would be, but it was not particularly user-friendly - I had to have all my symbols and their tag mappings memorized, I had to make sure that my symbols were well-balanced, and I often ran into situations (e.g. any tag that spanned more than one line) where my script was not quite able to produce the right result. Still, it worked well enough, I know at least one person who was actually willing to use it for her own work, and I even wrote an online tool to do the conversion and highlight any probable errors that could be detected.
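
To give a flavour of how this worked, here is a much-simplified sketch in Python of the kind of substitution the script performed. This is an illustration rather than the script I actually used: only two of the symbols are handled, and the function name and symbol table are invented for the example.

import re

# Two of the pidgin symbols described above: *(red)X* for coloured initials
# and \xyz\ for expanded abbreviations. The real script knew many more
# symbols and also checked that they were balanced.
SUBSTITUTIONS = [
    (re.compile(r'\*\((\w+)\)(.*?)\*'), r'<hi rend="\1">\2</hi>'),
    (re.compile(r'\\(.*?)\\'), r'<ex>\1</ex>'),
]

def pidgin_to_xml(line):
    """Turn one line of pidgin transcription into TEI-flavoured XML."""
    out = '<lb/>' + line
    for pattern, replacement in SUBSTITUTIONS:
        out = pattern.sub(replacement, out)
    return out

print(pidgin_to_xml('*(red)Ե*ւ յայնժ\\ա\\մ ս\\ուր\\բ'))
# prints: <lb/><hi rend="red">Ե</hi>ւ յայնժ<ex>ա</ex>մ ս<ex>ուր</ex>բ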

My current solution

Last October I was at a collation workshop in Münster, where I saw a presentation by Alison Walker about T-PEN, an online tool for manuscript transcription. Now I've known about T-PEN since 2010, and had done a tiny bit of experimental work with it when it was released, but had not really thought much about it since. During that meeting I fired up T-PEN for the first time in years, really, and started working on some manuscript transcription, and actually it was kind of fun!

What T-PEN does is to take the manuscript images you have, find the individual lines of text, and then let you do the transcription line-by-line directly in the browser. The interface looks like this (click for a full-size version):

[Screenshot: the T-PEN transcription interface]

which makes it just about the ideal transcription environment from a user-interface perspective. You would have to try very hard to inadvertently skip a line; your eyes don't have to travel very far to get between the manuscript image and the text rendition; when it's finished, you have not only the text but also the information you need to link the text to the image for later presentation.

The line recognition is not perfect, in my experience, but it is often pretty good, and the user is free to correct the results. It is pretty important to have good images to work with - cropped to include only the pages themselves, rotated and perhaps de-skewed so that the lines are straight, and with good contrast. I have had the good fortune this term to have an intern, and we have been using ImageMagick to do the manuscript image preparation as efficiently as we can. It may be possible to do this fully automatically - I think that OCR software like FineReader has similar functionality - but so far I have not looked seriously into the possibility.
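
By way of illustration, this kind of batch preparation can be scripted with a thin Python wrapper around ImageMagick's convert command. The crop geometry, deskew threshold, contrast values, and directory names below are placeholders; the right numbers vary from manuscript to manuscript.

import subprocess
from pathlib import Path

def prepare_image(src, dest, crop="2400x3600+300+200"):
    """Crop, deskew, and contrast-stretch one page image with ImageMagick."""
    subprocess.run([
        "convert", str(src),
        "-crop", crop,                  # keep only the page itself
        "-deskew", "40%",               # straighten slightly rotated lines
        "-contrast-stretch", "1%x1%",   # boost contrast for line detection
        str(dest),
    ], check=True)

for img in sorted(Path("scans").glob("*.jpg")):
    prepare_image(img, Path("prepared") / img.name)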

T-PEN does not actively support TEI markup, or any other sort of markup. What it does offer is the ability to define buttons (accessible by clicking the 'XML Tags' button underneath the transcription box) that will apply a certain tag to any portion of text you choose. I have defined the TEI tags I use most frequently in my transcriptions, and using them is fairly straightforward.

Getting data back out

There are a few listed options for exporting a transcription done in T-PEN. I found that none of them were quite satisfactory for my purpose, which was to turn the transcription I'd made automatically into TEI XML, so that I can do other things with it. One of the developers on the project, Patrick Cuba, who has been very helpful in answering all the queries I've had so far, pointed out to me the (so far undocumented) possibility of downloading the raw transcription data - stored on their system using the Shared Canvas standard - in JSON format. Once I had that it was the work of a few hours to write a Python module that will convert the JSON transcription data into valid TEI XML, and will also tokenize valid TEI XML for use with a collation tool such as CollateX.
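
To give an idea of the shape of that conversion - and emphatically not of tpen2tei's actual interface, nor of the real structure of the Shared Canvas JSON, which is rather more deeply nested and carries the image coordinates of every line - the core of the task looks something like this:

import json
from xml.etree import ElementTree as etree

def lines_to_tei(lines):
    """Wrap a flat list of transcribed lines into a minimal TEI body.

    Here 'lines' is assumed to be a list of {'page': ..., 'text': ...}
    records; the real transcription data also carries line coordinates,
    embedded tags, corrections, and so on.
    """
    tei = etree.Element('TEI', {'xmlns': 'http://www.tei-c.org/ns/1.0'})
    body = etree.SubElement(etree.SubElement(tei, 'text'), 'body')
    para = etree.SubElement(body, 'p')
    current_page = None
    for line in lines:
        if line['page'] != current_page:
            current_page = line['page']
            etree.SubElement(para, 'pb', {'n': str(current_page)})
        lb = etree.SubElement(para, 'lb')
        lb.tail = line['text']
    return tei

with open('transcription.json') as f:
    print(etree.tostring(lines_to_tei(json.load(f)), encoding='unicode'))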

The tpen2tei module isn't quite in a state where I'm willing to release it to PyPI. For starters, most of the tests are still stubs; also, I suspect that I should be using an event-based parser for the word tokenization, rather than the DOM parser I'm using now. Still, it's on Github and there for the using, so if it is the sort of tool you think you might need, go wild.


There are a few things that T-PEN does not currently do, that I wish it did. The first is quite straightforward: on the website it is possible to enter some metadata about the manuscript being transcribed (library information, year of production, etc.), but this metadata doesn't make it back into the Shared Canvas JSON. It would be nice if I had a way to get all the information about my manuscript in one place.

The second is also reasonably simple: I would like to be able to define an XML button that is a milestone element. Currently the interface assumes that XML elements will have some text inside them, so the button will insert a <tag> and a </tag> but never a <tag/>. This isn't hard to patch up manually - I just close the tag myself - but from a usability perspective a one-click milestone button would be really handy.

The third has to do with resource limits currently imposed by T-PEN: although there doesn't seem to be a limit to the number of manuscripts you upload, each manuscript can contain only up to 200MB of image files. If your manuscript is bigger, you will have to split it into multiple projects and combine the transcriptions after the fact. Relatedly, you cannot add new images to an existing manuscript, even if you're under the 200MB limit. I'm told that an upcoming version of T-PEN will address at least this second issue.

The other two things I miss in T-PEN have to do with the linking between page area and text flow, and aren't quite so simple to solve. Occasionally a manuscript has a block of text written in the margin; sometimes the block is written sideways. There is currently no good mechanism for dealing with blocks of text with weird orientations; the interface assumes that all zones should be interpreted right-side-up. Relatedly, T-PEN makes the assumption (when it is called upon to make any assumption at all) that text blocks should be interpreted from top left to bottom right. It would be nice to have a way to define a default - perhaps I'm transcribing a Syriac manuscript? - and to specify a text flow in a situation that doesn't match the default. (Of course, there are also situations where it isn't really logical or correct to interpret the text as a single sequence! That is part of what makes the problem interesting.)


If someone who is starting an edition project today asks me for advice on transcription, I would have little reservation in pointing them to T-PEN. The only exception I would make is for anyone working on a genetic or documentary edition of authors' drafts or the like. The T-PEN interface does assume that the documents being transcribed are relatively clean manuscripts without a lot of editorial scribbling. Apart from that caveat, though, it is really the best tool for the task that I have seen. It has a great user interface for the task, it is an open source tool, its developers have been unfailingly helpful, and it provides a way to get out just about all of the data you put into it. In order to turn that data into XML, you may have to learn a little Python first, but I hope that the module I have written will give someone else a head start on that front too!

Coming back to proper (digital) philology


For the last three or four months I have been engaging in proper critical text edition, of the sort that I haven't done since I finished my Ph.D. thesis. Transcribing manuscripts, getting a collation, examining the collation to derive a critical text, and all. I haven't had so much fun in ages.

The text in question is the same one that I worked on for the Ph.D. - the Chronicle of Matthew of Edessa. I have always intended to get back to it, but the realities of modern academic life simply don't allow a green post-doc the leisure to spend several more years on a project just because it was too big for a Ph.D. thesis in the first place. Of course I didn't abandon textual scholarship entirely - I transferred a lot of my thinking about how text traditions can be structured and modelled and analyzed to the work that became my actual post-doctoral project. But Matthew of Edessa had to be shelved throughout much of this, since I was being paid to do other things.

Even so, in the intervening time I have been pressed into service repeatedly as a sort of digital-edition advice columnist. I'm by no means the only person ever to have edited text using computational tools, and it took me a couple of years after my own forays into text edition to put it online in any form, but all the work I've done since 2007 on textual-criticism-related things has given me a reasonably good sense of what can be done digitally in theory and in practice, for someone who has a certain amount of computer skill as well as for someone who remains a bit intimidated by these ornery machines.

Since the beginning of this year, I've had two reasons to finally take good old Matthew off the shelf and get back to what will be the long, slow work of producing an edition. The first is a rash commitment I made to contribute to a Festschrift in Armenian studies. I thought it might be nice to provide an edited version of the famous (if you're a Byzantinist) letter purportedly written by the emperor Ioannes Tzimiskes to the Armenian king Ashot Bagratuni in the early 970s, preserved in Matthew's Chronicle. The second is even better: I've been awarded a grant from the Swiss National Science Foundation to spend the next three years leading a small team not only to finish the edition, but also to develop the libraries, tools, and data models (including, of course, integration of ones already developed by others!) necessary to express the edition as digitally, accessibly, and sustainably as I can possibly dream of doing, and to offer it as a model for other digital work on medieval texts within Switzerland and, hopefully, beyond. I have been waiting six years for this moment, and I am delighted that it's finally arrived.

The technology has moved on in those six years, though. When I worked on my Ph.D. I essentially wrote all my own tools to do the editing work, and there was very little focus on usability, generalizability, or sustainability. Now the landscape of digital tools for text critical edition is much more interesting, and one of my tasks has been to get to grips with all the things I can do now that I couldn't practically do in 2007-9.

Over the next few weeks, as I prepare the article that I promised, I will use this blog to provide something of an update to what I published over the years on the topic of "how to make a digital edition". I'm not going to explore here every last possibility, but I am going to talk about what tools I use, how I choose to use them, and how (if at all) I have to modify or supplement them in order to do the thing I am trying to do. With any luck this will be helpful to others who are starting out now with their own critical editions, no matter their comfort with computers. I'll try to provide a sense of what is easy, what has a good user interface, what is well-designed for data accessibility or sustainability. And of course I'd be very happy to have discussion from others who have walked similar roads, to say what has worked for them.

SOLVED! The mystery of the character encoding


Update, two hours later: we have a solution! And it's pretty disgusting. Read on below.

Two posts in a row about the deep technical guts of something I'm working on. Well I guess this is a digital humanities blog.

Yesterday I got a wonderful present in my email - a MySQL dump of a database full of all sorts of historical goodness. The site that it powers displays snippets of relevant primary sources in their original language, including things like Arabic and Greek. Since the site has been around for rather longer than MySQL has had any Unicode support to speak of, it is not all that surprising that these snippets of text in their original language are rather badly mis-encoded.

Not too much of a problem, I naïvely thought to myself. I'll just fix the encoding knowing what it's supposed to have been.

A typical example looks like this. The Greek displayed on the site is: μηνὶ Νοἐμβρίω εἰς τὰς κ ´ ινδικτιῶνος ε ´ ἔτους ,ς

but what I get from the database dump is: μηνὶ Νοἐμβρίω εἰς Ï„á½°Ï‚ κ ´ ινδικτιῶνος ε ´ á¼"τοÏ...Ï‚ ,Ï‚

Well, I recognise that kind of garbage, I thought to myself. It's double-encoded UTF-8. So all I ought to need to do is to undo the spurious re-encoding and save the result. Right?

Sadly, it's not that easy, and here is where I hope I can get comments from some DB/encoding wizards out there because I would really like to understand what's going on.

It starts easily enough in this case - the first letter is μ. In Unicode, that is character 3BC (notated in hexadecimal.) When you convert this to UTF-8, you get two bytes: CE BC. Unicode character CE is indeed Î, and Unicode character BC is indeed ¼. As I suspected, each of these UTF-8 bytes that make up μ has been treated as a character in its own right, and further encoded to UTF-8, so that μ has become Î¼. That isn't hard to undo.

But then we come to that ω further down the line, which has become Ï‰. That is Unicode character 3C9, which in UTF-8 becomes CF 89. Unicode CF is the character Ï as we expect, but Unicode 89 is a non-printing control character. Now it is perfectly possible to render 89 as UTF-8 (it would become C2 89), but instead I'm getting a rather inexplicable character whose Unicode value is 2030 (UTF-8 E2 80 B0)! And here the system starts to break down - I cannot figure out what possible mathematical transformation has taken place to make 89 become 2030.

There seems to be little mathematical pattern to the results I'm getting, either. From the bad characters in this sample:

ρ -> 3C1 -> CF 81 --> CF 81    (correct!!)
ς -> 3C2 -> CF 82 --> CF 201A
τ -> 3C4 -> CF 84 --> CF 201E
υ -> 3C5 -> CF 85 --> CF 2026
ω -> 3C9 -> CF 89 --> CF 2030

Ideas? Comments? Do you know MySQL like the back of your hand and have you spotted immediately what's going on here? I'd love to crack this mystery.

After this post went live, someone observed to me that the 'per mille' sign, i.e. that double-percent thing at Unicode value 2030, has the value 89 in...Windows CP-1250! And, perhaps more relevantly, Windows CP-1252. (In character encodings just as in almost everything else, Windows always liked to have their own standards that are different from the ISO standards. Pre-Unicode, most Western European characters were represented in an eight-bit encoding called ISO Latin 1 everywhere except Windows*, where they used this CP-1252 instead. For Eastern Europe, it was ISO Latin 2 / CP-1250.)

So what we have here is: MySQL is interpreting its character data as Unicode, and expressing it as UTF-8, as we requested. Only then it hits a value like 89, which is not a printable character at all. But instead of passing it through and letting us deal with it, MySQL says "hm, they must have meant the Latin 1 value here. Only when I say Latin 1 I really mean CP-1252. So I'll just take this value (89 in our example), see that it is the 'per mille' sign in CP-1252, and substitute the correct Unicode for 'per mille'. That will make the user happy!"

Hint: It really, really, doesn't make the user happy.
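
If you want to verify the diagnosis for yourself, the substitution is easy to reproduce in a few lines of Python. The helper below uses the standard cp1252 codec and passes through the five byte values that CP-1252 leaves undefined, which is effectively what MySQL's 'latin1' does; it reproduces the table above exactly.

def mysql_latin1_char(byte):
    """What MySQL's 'latin1' makes of a single byte: the CP-1252 character
    where one is defined, otherwise the byte value itself."""
    try:
        return bytes([byte]).decode('cp1252')
    except UnicodeDecodeError:   # 0x81, 0x8d, 0x8f, 0x90, 0x9d
        return chr(byte)

for ch in 'ρςτυω':
    garbled = ''.join(mysql_latin1_char(b) for b in ch.encode('utf-8'))
    print(ch, '->', ' '.join('%X' % ord(c) for c in garbled))
# ρ -> CF 81, ς -> CF 201A, τ -> CF 201E, υ -> CF 2026, ω -> CF 2030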

So here is the Perl script that will take the garbage I got and turn it back into Greek. Maybe it will be useful to someone else someday!

#!/usr/bin/env perl

use strict;
use warnings;
use Encode;
use Encode::Byte;

while(<>) {
    my $line = decode_utf8( $_ );
    my @chr;
    foreach my $c ( map { ord( $_ ) } split( '', $line ) ) {
        if( $c > 255 ) {
            # This character is really a single byte that was run through
            # the CP-1252 table; map it back to that byte value.
            $c = ord( encode( 'cp1252', chr( $c ) ) );
        }
        push( @chr, $c );
    }
    # Re-assemble the original UTF-8 byte string and print it out.
    my $newline = join( '', map { chr( $_ ) } @chr );
    print $newline;
}

[*]Also, as I realized after posting this, except Mac, which used MacRoman. Standards are great! Let's all have our own!

How to have several Catalyst apps behind one Apache server


Since I've changed institutions this year, I am in the process of migrating Stemmaweb from its current home (on my family's personal virtual server) to the academic cloud service being piloted by SWITCH. Along the way, I ran into a Perl Catalyst configuration issue that I thought would be useful to write about here, in case others run into a similar problem.

I have several Catalyst applications - Stemmaweb, my edition-in-progress of Matthew of Edessa, and pretty much anything else I will develop with Perl in the future. I also have other things (e.g. this blog) on the Web, and being somewhat stuck in my ways, I still prefer Apache as a webserver. So basically I need a way to run all these standalone web applications behind Apache, with a suitable URL prefix to distinguish them.

There is already a good guide to getting a single Catalyst application set up behind an Apache front end. The idea is that you start up the application as its own process, listening on a local network port, and then configure Apache to act as a proxy between the outside world and that application. My problem is that I want to have more than one application, and I want to reach each different application via its own URL prefix (e.g. /stemmaweb, /ChronicleME, /ncritic, and so on.) The difficulty with a reverse proxy in that situation is this:

  • I send my request to http://my.public.server/stemmaweb/
  • It gets proxied to http://localhost:5000/ and returned
  • But then all my images, JavaScript, CSS, etc. are at the root of localhost:5000 (the backend server) and so look like they're at the root of my.public.server, instead of neatly within the stemmaweb/ directory!
  • And so I get a lot of nasty 404 errors and a broken application.

What I need here is an extra plugin: Plack::Middleware::ReverseProxyPath. I install it (in this case with the excellent 'cpanm' tool):

$ cpanm -S Plack::Middleware::ReverseProxyPath

And then I edit my application's PSGI file to look like this:

use strict;
use warnings;

use lib '/var/www/catalyst/stemmaweb/lib';
use stemmaweb;
use Plack::Builder;

builder {
        enable( "Plack::Middleware::ReverseProxyPath" );
        my $app = stemmaweb->apply_default_middlewares(stemmaweb->psgi_app);
        $app;
};

where /var/www/catalyst/stemmaweb is the directory that my application lives in.

In order to make it all work, my Apache configuration needs a couple of extra lines too:

    # Configuration for Catalyst proxy apps. This should eventually move
    # to its own named virtual host.
    RewriteEngine on
    <Location /stemmaweb>
            RequestHeader set X-Forwarded-Script-Name /stemmaweb
            RequestHeader set X-Traversal-Path /
            ProxyPass http://localhost:5000/ 
            ProxyPassReverse http://localhost:5000/
    </Location>
    RewriteRule ^/stemmaweb$ stemmaweb/ [R]

The RequestHeaders inform the backend (Catalyst) that what we are calling "/stemmaweb" is the thing that it is calling "/", and that it should translate its URLs accordingly when it sends us back the response.

The second thing I needed to address was how to start these things up automatically when the server turns on. The guide gives several useful configurations for starting a single service, but again, I want to make sure that all my Catalyst applications (and not just one of them) start up properly. I am running Ubuntu, which uses Upstart to handle its services; to start all my applications I use a pair of scripts and the 'instance' keyword. First, the master control script:

description "Starman master upstart control"
author      "Tara L Andrews ("
# Control all Starman jobs via this script
start on filesystem or runlevel [2345] 
stop on runlevel [!2345]
# No daemon of our own, but here's how we start them
pre-start script
  # Assign ports in sequence from 5000, in the order that 'ls' returns
  # the application directories
  PORT=5000
  for dir in `ls /var/www/catalyst`; do
    start starman-app APP=$dir PORT=$PORT || :
    PORT=$((PORT + 1))
  done
end script
# and here's how we stop them
post-stop script
  for inst in `initctl list|grep "^starman-app "|awk '{print $2}'|tr -d ')'|tr -d '('`; do
    stop starman-app APP=$inst PORT= || :
  done
end script

The application script, which gets called by the control script for each application in /var/www/catalyst:

description "Starman upstart application instance"
author      "Tara L Andrews ("
respawn limit 10 5 
setuid www-data
umask 022 
instance $APP$PORT
exec /usr/local/bin/starman -l localhost:$PORT /var/www/catalyst/$APP/$APP.psgi

There is one thing about this solution that is not so elegant, which is that each application has to start on its own port and I need to specify the correct port in the Apache configuration file. As it stands the ports will be assigned in sequence (5000, 5001, 5002, ...) according to the way the application directory names sort with the 'ls' command (which roughly means, alphabetically.) So whenever I add a new application I will have to remember to adjust the port numbers in the Apache configuration. I would welcome a more elegant solution if anyone has one!

Enabling the science of history

One of the great ironies of my academic career was that, throughout my Ph.D. work on a digital critical edition of parts of the text of Matthew of Edessa's Chronicle, I had only the vaguest inkling that anyone else was doing anything similar. I had heard of Peter Robinson and his COLLATE program, of course, but when I met him in 2007 he only confirmed to me that the program was obsolete and, if I needed automatic text collation anytime soon, I had better write my own program. Through blind chance I was introduced to James Cummings around the same time, who told me of the existence of the TEI guidelines and suggested I use them.

It was, in fact, James who finally gave me a push into the world of digital humanities. I was in the last panicked stages of writing up the thesis when he arranged an invitation for me to attend the first 'bootcamp' held by the Interedition project, whose subject was to be none other than text collation tools. By the time the meeting was held I was in that state of anxious bliss of having submitted my thesis and having nothing to do but wait for the viva, so I could bend all my hyperactive energy in that direction. Through Interedition I made some first-rate friends and colleagues with whom I have continued to work and hack to this day, and it was through that project that I met various people within KNAW (the Royal Netherlands Academy of Arts and Sciences.)

After I joined Interedition I very frequently found myself talking to its head, Joris van Zundert, about all manner of things in this wide world of digital humanities. At the time I knew pretty much nothing of the people within DH and its nascent institutional culture, and was moreover pretty ignorant of how much there was to know, so as often as not we ended up in some kind of debate or argument over the TEI, over the philosophy of science, over what constitutes worthwhile research. The main object of these debates was to work out who was holding what unstated assumption or piece of background context.

One evening we found ourselves in a heated argument about the application of the scientific method to humanities research. I don't remember quite how we got there, but Joris was insisting (more or less) that humanities research needed to be properly scientific, according to the scientific method, or else it was rubbish, nothing more than creative writing with a rhetorical flourish, and not worth anyone's time or attention. Historians needed to demonstrate reproducibility, falsifiability, the whole works. I was having none of it--while I detest evidence-free assumption-laden excuses for historical argument as much as any scholar with a proper science-based education would, surely Joris and everyone else must understand that medieval history is neither reproducible nor falsifiable, and that the same goes for most other humanities research? What was I to do, write a Second Life simulation to re-create the fiscal crisis of the eleventh century, complete with replica historical personalities, and simulate the whole to see if the same consequences appeared? Ridiculous. But of course, I was missing the point entirely. What Joris was pushing me to do, in an admittedly confrontational way, was to make clear my underlying mental model for how history is done. When I did, it became really obvious to me how and where historical research ultimately stands to gain from digital methods.

OK, that's a big claim, so I had better elucidate this mental model of mine. It should be borne in mind that my experience is drawn almost entirely from Near Eastern medieval history, which is grossly under-documented and fairly starved of critical attention in comparison to its Western cousin, so if any of you historians of other places or eras have a wildly different perspective or model, I'd be very interested to hear about it!

When we attempt a historical re-construction or create an argument, we begin with a mixture of evidence, report, and prior interpretation. The evidence can be material (mostly archaeological) or documentary, and we almost always wish we had roughly ten times as much of it as we actually do. The reports are usually those of contemporaneous historians, which are of course very valuable but must be examined in themselves for what they aren't telling us, or what they are misrepresenting, as much as for what they positively tell us. The prior interpretation easily outweighs the evidence, and even the reports, for sheer volume, and it is this that constitutes the received wisdom of our field.

So we can imagine a rhetorical structure of dependency that culminates in a historical argument, or a reconstruction. We marshal our evidence, we examine our reports, we make interpretations in the light of received wisdom and prior interpretations. In effect it is a huge and intricate connected structure of logical dependencies that we carry around in our head. If our argument goes unchallenged or even receives critical acceptance, this entire structure becomes a 'black box' of the sort described by Bruno Latour, labelled only with its main conclusion(s) and ready for inclusion in the dependency structure of future arguments.

Now what if some of our scholarship, some of the received wisdom even, is wide of the mark? Pretty much any historian will relish the opportunity to demonstrate that "everything we thought we knew is wrong", and in Near Eastern history in particular these opportunities come thick and fast. This is a fine thing in itself, but it poses a thornier problem. When the historian demonstrates that a particular assumption or argument doesn't hold water--when the paper is published and digested and its revised conclusion accepted--how quickly, or slowly, will the knock-on effects of this new bit of insight make themselves clear? How long will it take for the implications to sort themselves out fully? In practice, the weight of tradition and patterns of historical understanding for Byzantium and the Near East are so strong, and have gone for so long unchallenged, that we historians simply haven't got the capacity to identify all the black boxes, to open them up and find the problematic components, to re-assess each of these conclusions with these components altered or removed. And this, I think, is the biggest practical obstacle to the work of historians being accepted as science rather than speculation or storytelling.

Well. Once I had been made to put all of this into words, it became clear what the most useful and significant contribution of digital technology to the study of history must eventually be. Big data and statistical analysis of the contents of documentary archives is all well and good, but what if we could capture our very arguments, our black boxes of historical understanding, and make them essentially searchable and available for re-analysis when some of the assumptions have changed? They would even be, dare I say it, reproducible and/or falsifiable. Even, perish the thought, computable.

[Image: book cover of Understanding Digital Humanities]
A few months after this particular debate, I was invited to join Joris and several other members of the Alfalab project at KNAW in preparing a paper for the 'Computational Turn' workshop in early 2010, which was eventually included in a collection that arose from the workshop. In the article we take a look at the processes by which knowledge is formalized in various fields in the humanities, and how the formalization can be resisted by scholars within each field. Among other things we presented a form of this idea for the formalization of historical research. Three years later I am still working on making it happen.

I was very pleased to find that Palgrave Macmillan makes its author self-archiving policies clear on their website, for books of collected papers as well as for journals. Unfortunately the policy is that the chapter is under embargo until 2015, so I can't post it publicly until then, but if you are interested meanwhile and can't track down a copy of the book then please get in touch!

J. J. van Zundert, S. Antonijevic, A. Beaulieu, K. van Dalen-Oskam, D. Zeldenrust, and T. L. Andrews, 'Cultures of Formalization - Towards an encounter between humanities and computing', in Understanding Digital Humanities, edited by D. Berry (London: Palgrave Macmillan, 2012), pp. 279-94.

Early-career encyclopedism

So there I was, a newly-minted Ph.D. enjoying my (all too brief) summer of freedom in 2009 from major academic responsibilities. There must be some sort of scholarly pheromone signal that gets emitted in cases like these, some chemical signature that senior scholars are attuned to that reads 'I am young and enthusiastic and am not currently crushed by the burden of a thousand obligations'. I was about to meet the Swarm of Encyclopedists.

It started innocently enough, actually even before I had submitted, when Elizabeth Jeffreys (who had been my MPhil degree supervisor) offered me the authorship of an article on the Armenians to go into an encyclopedia that she was helping to edit. As it happened, this didn't intrude again on my consciousness until the following year--I was duly signed up as author, but my email address was entered incorrectly in a database so I was blissfully ignorant of what exactly I had committed to until I began to get mysterious messages in 2010 from a project I hadn't really even heard of, demanding to know where my contribution was.

Lesson learned: you can almost always get a deadline extended in these large collaborative projects. After all, what alternatives do the editors have, really?

The second lure came quite literally the evening following my DPhil defense, when Tim Greenwood (who had been my MPhil thesis supervisor) got in touch to tell me about a project on Christian-Muslim relations being run out of Birmingham by David Parker, and that I would seem to be the perfect person to write an entry on Matthew of Edessa and his Chronicle. Flush with victory and endorphins, of course I accepted within the hour. Technically speaking this was a 'bibliographical history' rather than an 'encyclopedia', but the approach to writing my piece was very similar, and it was more or less the ideal moment for me to summarize everything I knew about Matthew.

For a little bit of doctoral R&R, academic style, I flew off a few days later to Los Angeles for the 2009 conference of the Society of Armenian Studies. There in the sunshine I must have been positively telegraphing my relaxation and lack of obligations, because Theo van Lint (who had only just ceased being my DPhil supervisor) brought up the subject of a number of encyclopedia articles on Armenian authors that he had promised and was simply not going to have a chance to do. By this time I was beginning to get a little surprised at the number of encyclopedia articles floating around in the academic ether looking for an authorly home, and I was not so naïve as to accept the unworkable deadline that he had, but subject to reasonability I said okay. He assured me that he would send me the details soon.

Around that time, through one of the mailing lists to which I had subscribed in the last month or so of my D.Phil., I got wind of the Encyclopedia of the Medieval Chronicle (EMC). The general editor, Graeme Dunphy, was looking for contributors to take on some of the orphan articles in this project. Matthew of Edessa was on the list, and I was already writing something similar for the Christian-Muslim Relations project, so I wrote to volunteer.

And then everything happened at once. Theo wrote to me with his list, which turned out to be for precisely this EMC project. The project manager at Brill, Ernest Suyver, who knew me from my work on another Brill project, wrote to me to ask if I would consider taking on several of the Armenian articles. Before I could answer either of these, Graeme wrote back to me, offering me not only the article on Matthew of Edessa that I'd asked for--not only the entire set of Armenian articles that both Theo and Ernest had sent in my direction--but the job of section editor for all Armenian and Syriac chronicles! The previous section editor had evidently disappeared from the project and it seems that only someone as young and unburdened as me had any hope of pulling off the organization and project management they needed on the exceedingly short timescale they had, or of being unwise enough to believe it could be done.

But I was at least learning enough by then to expect that any appeal to more senior scholars than myself was likely to be met with "Sorry, I have too much work already" and an unspoken coda of "...and encyclopedia articles are not exactly a priority for me right now." There was the rare exception of course, but I turned pretty quickly to my own cohort of almost- or just-doctored scholars to farm out the articles I couldn't (or didn't want to) write myself. So I suppose by that time even I was beginning to detect the "yes I can" signals coming from the early-career scholars around me. Naturally the articles were not all done on time--it was a pretty ludicrous time frame I was given, after all--but equally naturally, delays in the larger project meant that my part was completed by the time it really needed to be. And so in my first year as a postdoc I had a credit on the editorial team of a big encyclopedia project, and a short-paper-length article, co-authored with Philip Wood, giving an overview of Eastern Christian historiography as a whole. I remain kind of proud of that little piece.

Lesson learned: your authors can almost always get you to agree to a deadline extension in these large collaborative projects. After all, what alternative do you have as editor, short of finding another author, who will need more time anyway, and pissing off the first one by withdrawing the commission?

The only trouble with these articles is that it's awfully hard to know how to express them in the tickyboxes of a typical publications database like KU Leuven's Lirias. Does each of the fifteen entries I wrote get its own line? Should I list the editorship separately, or the longer article on historiography? It's a little conundrum for the CV.

Nevertheless I'm glad I got the opportunity to do the EMC project, definitely. And here's another little secret--if I am able to make the time, I kind of like writing encyclopedia articles. It's a nice way to get to grips with a subject, to cut straight to the essence of "What does the reader--and what do I--really need to know in these 250 words?" This might be why, when yet another project manager for yet another encyclopedia project found me about a year ago, I didn't say no, and so this list will have an addition in the future. After that, though, I might finally have to call a halt.

I have written to Wiley-Blackwell to ask about their author self-archiving policies; I have a PDF offprint but am evidently not allowed to make it public, frustratingly enough. I will update the Lirias record if that changes. Brill has a surprisingly humane policy that allows me to link freely to the offprints of my own contributions in an edited collection, so I have done that here. I don't seem to have an offprint for all the articles I wrote, though, so will need to rectify that.

Andrews, T. (2012). Armenians. In: Encyclopedia of Ancient History, ed. R. Bagnall et al. Malden, MA: Wiley-Blackwell.

Andrews, T. (2012). Matthew of Edessa. In: Christian-Muslim Relations. A Bibliographical History. Volume 3 (1050-1200), ed. D. Thomas and B. Roggema. Leiden: Brill.

Andrews, T. and P. Wood. (2012). Historiography of the Christian East. In: Encyclopedia of the Medieval Chronicle, general editor G. Dunphy. Leiden: Brill.
(Additional articles on Agatʿangełos, Aristakēs Lastivertcʿi, Ełišē, Kʿartʿlis Cxovreba, Łazar Pʿarpecʿi, Mattʿēos Uṙhayecʿi, Movsēs Dasxurancʿi, Pʿawstos Buzand, Smbat Sparapet, Stepʿanos Asołik, Syriac Short Chronicles (with J. J. van Ginkel), Tʿovma Arcruni, Yovhannēs Drasxanakertcʿi.)

Public accountability, #acwrimo, and The Book

Over the course of 2011, among the long-delayed things I finally managed to do was to put together a book proposal for the publication of my Ph.D. research. While I am reasonably pleased with the thesis I produced, it is no exception to the general rule that it would not make a very good book if I tried to publish it as it stands.  As it happens there is a reasonably well-known series by a well-respected publisher, edited by someone I know, where my research fits in rather nicely. Even more nicely, they accepted my proposal.

Now here is where I have to humblebrag a little: I wrote my Ph.D. thesis kind of quickly, and much more quickly than I would recommend to any current Ph.D. students. Part of this was luck--once I hit upon my main theme, a lot of it just started falling into place--but part of it was the sheer terror of an externally-imposed deadline. I had rather optimistically applied for a British Academy post-doctoral fellowship in October 2008, figuring that either I'd be rejected and it would make no difference at all, or that I'd be shortlisted and have a deadline of 1 April 2009 to have my thesis finished and defended.  At the time I applied I had a reasonable outline, one more or less completed chapter and the seeds for two more, and software that was about 1/3 finished.  By the beginning of January I was only a little farther along, and I realized that the BA was going to make its shortlisting decisions very soon and, unless I made a serious and concerted effort to produce some thesis draft, I may as well withdraw my name.  Amazingly enough this little self-motivational talk worked wonders and I spent the middle two weeks of January writing like crazy and dosing myself with ibuprofen for the increasingly severe tendinitis in my hands. (See? Not recommended.) Then, wonder of wonders, I was shortlisted and I got to dump the entire thing in my supervisor's lap and say "Read this, now!" The next month was a panic-and-endorphin-fuelled rush to get the thing ready for submission by 20 February, so that I could have my viva by the end of March.  This involved some fairly amusing-in-retrospect scenes. I had to enlist my husband to draw a manuscript stemma for me in OmniGraffle because my hands were too wrecked to operate a trackpad. I imposed a series of strict deadlines on my own supervisor for reading and commenting on my draft, and met him on the morning of Deadline Day to incorporate the last set of his corrections, which involved directly hacking a horribly complicated (and programmatically generated) LaTeX file that contained the edited text I had produced. (Yes, *very* poor programming practice that, and I am still suffering the consequences of not having taken the time to do it properly.)

In the end the British Academy rejected me anyway, but what did I care? I had a Ph.D.

With that experience in mind, I set myself an ambitious and optimistic target of 'spring 2012' for having a draft of the book. For the record the conversion requires light-to-moderate revision of five existing chapters, complete re-drafting of the introductory chapter, and addition of a chapter that involves a small chunk of further research.  It was in this context, last October, that I saw the usual buzz surrounding the ramp-up to NaNoWriMo and thought to myself "you know, it would be kind of cool to have an academic version of that."

It turns out I'm not the only one who thought this thought--there actually was an "Ac[ademic ]Bo[ok ]WriMo" last year. In the end the project that was paying my salary demanded too much of my attention to even think about working on the book, and the idea went by the wayside. The target of spring 2012 for production of the complete draft was also a little too optimistic, even by my standards, and that deadline whizzed right on by.

Here it is November again, though, and AcWriMo is still a thing (though they have dropped the explicit 'book' part of it), and my book still needs to be finished, and this year I don't have any excuses. So I signed myself up, and I am using this post to provide that extra little bit of public accountability for my good intentions.  I am excusing myself from weekend work on account of family obligations, but for the weekdays (except *possibly* for the days of ESTS) I am requiring of myself a decent chunk of written work, with one week each dedicated to the two chapters that need major revision or drafting de novo.

I won't be submitting the thing to the publisher on 30 November, but I am promising myself (and now the world) that by the first of December, all that will remain is bibliographic cleanup and cosmetic issues. I am really looking forward to my Christmas present of a finished manuscript, and I am counting on public accountability to help make sure I get it.  Follow me on Twitter (if you don't already) and harass me if I don't update!

Conference-driven doctoral theses

In the computer programming world I have occasionally come across the concept of 'conference-driven development' (and, let's be honest, I've engaged in it myself a time or two.) This is the practice of submitting a talk to a conference that describes the brilliant software that you have written and will be demonstrating, where by "have written" you actually mean "will have written". Once the talk gets accepted, well, it would be downright embarrassing to withdraw it so you had better get busy.

It turns out that this concept can also work in the field of humanities research (as, I suspect, certain authors of Digital Humanities conference abstracts are already aware.) Indeed, the fact that I am writing this post is testament to its workability even as a means of getting a doctoral thesis on track. (Graduate students take note!)

In the autumn of 2007 I was afloat on that vast sea of Ph.D. research, no definite outline of land (i.e. a completed thesis) in sight, and not much wind in the sails of my reading and ideas to provide the necessary direction. I had set out to create a new critical edition of the Chronicle of Matthew of Edessa, but it had been clear for a few months that I was not going to be able to collect the necessary manuscript copies within a reasonable timeframe. Even if I had, the text was far too long and copied far too often for the critical edition ever to have been feasible.

One Wednesday evening, after the weekly Byzantine Studies department seminar, an announcement was made about the forthcoming Cambridge International Chronicles Symposium to be held in July 2008. It was occurring to me by this point that it might be time to branch out from graduate-student conferences and try to get something accepted in 'grown-up' academia, and a symposium devoted entirely to medieval chronicles seemed a fine place to start. I only needed a paper topic.

Matthew wrote his Chronicle a generation after the arrival of the First Crusade had changed pretty much everything about the dynamics of power within the Near East, and his city Edessa was no exception. Early in his text he features a pair of dire prophetic warnings attributed to the monastic scholar John Kozern; the last of these ends with a rather spectacular prediction of the utter overthrow of Persian (read: Muslim, but given the cultural context you may as well read "Persian" too) power by the victorious Roman Emperor, and Christ's peace until the end of time. It is a pretty clearly apocalyptic vision, and much of the Chronicle clearly shows Matthew struggling to make sense of the fact that some seriously apocalyptic events (to wit, the Crusade) occurred and yet it was pretty apparent forty years later that the world was not yet drawing to an end with the return of Christ.

Post-apocalyptic history, I thought to myself, that's nicely attention-getting, so I made it the theme of my paper. This turned out to be a real stroke of luck - I spent the next six months considering the Chronicle from the perspective of somewhat frustrated apocalyptic expectations, and little by little a lot of strange features of Matthew's work began to fall into place. The paper was presented in July 2008; in October I submitted it for publication and turned it into the first properly completed chapter of my thesis. Although this wasn't the first article I submitted, it was the first one that appeared in print.

Announcing Stemmaweb


[Cross-posted from the Tree of Texts project blog]

The Tree of Texts project formally comes to an end in a few days; it's been a fun two years and it is now time to look at the fruits of our research. We (that is, Tara) gave a talk at the DH 2012 conference in July about the project and its findings; we also participated in a paper led by our colleagues in the Leuven CS department about computational analysis of stemma graph models, which was presented at the CoCoMILE workshop during the European Conference on Artificial Intelligence. We are now engaged in writing the final project paper; following up on the success of our DH talk, we will submit it for inclusion in the DH-related issue of LLC. Alongside all this, work on the publication of proceedings from our April workshop continues apace; nearly all the papers are in and the collection will soon be sent to the publisher.

More excitingly, from the perspective of text scholars and critical editors who have an interest in stemmatic analysis, we have made our analysis and visualization tools available on the Web! We are pleased to present Stemmaweb, which was developed in cooperation with members of the Interedition project and which provides an online interface for examining text collations and their stemmata. Stemmaweb has two homes: the official KU Leuven site, and Tara's personal server (less official, but much faster).

If you have a Google account or another OpenID account, you can use that to log in; once there you can view the texts that others have made public, and even upload your own. For any of your texts you can create a stemma hypothesis and analyze it with the tools we have used for the project; we will soon provide a means of generating a stemma hypothesis from a phylogenetic tree, and we hope to link our tools to those emerging soon from the STAM group at the Helsinki Institute for Information Technology.

Like almost all tools for the digital humanities, these are highly experimental. Unexpected things might happen, something might go wrong, or you might have a purpose for a tool that we never imagined.  So send us feedback! We would love to hear from you.

Hamburg here I come

As I write this I am on my way to Hamburg for DH2012. I'm very much looking forward to the conference this year, not only because of the wide variety of interesting papers and the chance to explore a city I've heard a lot of nice things about, but also because this year I feel like I have some substantial research of my own to contribute.

My speaking slot is on Friday morning (naturally opposite a lot of other interesting and influential speakers, but that seems to be the perpetual curse of DH.)  In preparation for that, I thought I might set down the background for the project I have been working on for the last two years, and discuss a little of what I will be presenting on Friday. After all, if I can set it down in a blog post then I can present it, right?

The project is titled The Tree of Texts, and its aim is to provide a basis for empirical modelling of text transmission. It grows out of the problem of text stemmatology, and specifically the stemmatology of medieval texts that were transmitted through manual copies by scribes who were almost never the author of the original text (if, indeed, a single original text ever existed.)

It is well known that texts vary as they are copied, whether through mistakes, changes in dialect, or intentional adaptation of the text to its context; almost as long as texts have been copied, therefore, scholars have tried in one way or another to get past these variations to what they believe to be the original text.  Even in cases where there was never a written original text, or where the interest of the scholar is more in the adaptation than in the starting point, there is a lot to be gained if we can understand how the text changed over time.

Stemmatology, the formal reconstruction of the genesis of a text, developed as a discipline over the course of the nineteenth century; the most common ("Lachmannian") method is based on the principle that if two or more manuscripts share a copying error, they are likely to have been copied either one from the other or both from the same (lost) exemplar. There has been a lot of effort, scholarship, and argument on the subject of how one distinguishes 'error' from an original (or archetypal) reading, and how one distinguishes genealogical error (e.g. the misreading of a few words in a nigh-irreversible way so that the meaning of the entire sentence is changed) from coincidental error (e.g. variation in word spelling or dialect, which probably says more about the scribe than about the manuscript being copied).  The classical Lachmannian method requires the practitioner to decide in advance which variants are likely to have been in the original; more recent and computationally-based neo-Lachmannian methods allow the scholar to withhold that particular pre-judgment, but still require a distinction to be made concerning which shared variants are likely or unlikely to have been coincidental or reversible.

A method that requires the scholar to know the answer in advance was always likely to encounter opposition, and Lachmannian stemmatology has spawned entire sub-disciplines in protest at the sheer arrogance (so an anti-Lachmannian might describe it) of claiming to know in advance what is important and what is trivial. Nevertheless the problem remains: how to trace the history of a text, particularly if we begin with the assumption that we know no more, and perhaps considerably less, than the scribes who made the copies?  The first credible answer was borrowed from the field of evolutionary biology, where they have a similar problem in trying to understand the order in which features of species might have evolved and the specific relationships to each other of members of a group.  This is the discipline of phylogenetics, and there are several statistical methods to reconstruct likely family trees based upon nothing more than the DNA sequences of species living today.  Treat a manuscript as an organism, imagine that its text is its DNA sequence, et voilà - you can create an instant family tree.

And yet phylogenetics, if you ask the Lachmannians and other text scholars besides, has its own problems.  First, the phylogenetic model assumes that any species living today is by definition not an ancestor species, and therefore must appear only at the edge of the family tree; in contrast we certainly still possess manuscripts that served as the 'parent' of other extant manuscripts.  Second, in evolutionary terms it is reasonable to model the tree as a bifurcating one - that is, a species only ever divides into two, and then as time progresses either or both of these may divide further.  This also fails to match the manuscript model, where it is easy to see a single text spawning two, three, or ten direct copies.  Third, where the evolutionary model assumes that lineages branch continuously and never merge, it is well known that a manuscript can be copied with reference to two, three, or even four exemplars. This is next to impossible to represent in a tree (and indeed is not usually handled in a Lachmannian stemma either, serving more often as a reason why a stemma was not attempted.)  Fourth is the problem of significance of variants--while some scholars will insist that variants should simply not be pre-judged in terms of their significance, most will acknowledge the probable truth that some sorts of variation are more telling than other sorts.  Most phylogenetic programs do not by default take variant significance into account, and most users of phylogenetic trees don't even try.

In a recent paper, some of the luminaries of text phylogeny argue that none of these problems is insurmountable. Neighbor net diagrams can give some clues regarding multiple text parentage; some more recent and specialized algorithms such as Semstem are able to build trees so that a text can be an ancestor of another text, and so that a text can have more (or even fewer) than two descendants.  The authors also argue that the problem of significance can be handled trivially in the phylogenetic analysis by anyone who cares to assign weighting factors to the variant sets s/he provides to the program.

While it is undoubtedly true that automated algorithms can handle assignment of significance (that is, weighting), it also remains true that there are only two options for assigning these weightings:

  1. Treat all variants as equal
  2. Assign the weights arbitrarily, according to philological 'common sense', personal experience, or any other criterion that takes your fancy.

This is exactly the 'missing link' in text stemmatology: what sorts of variants occurred in medieval copying, how common were they, how commonly were they copied, and how commonly were they changed?  If we can build a realistic picture of what, statistically speaking, variation actually looked like in medieval times, it will be an enormous step toward reconstructing the stemmata by whatever means the philologist chooses, be it neo-Lachmannian, phylogenetic, or a method yet to be invented.

What we have done in the Tree of Texts project is to create a model for representing text variation, and a model for representing stemmata, and methods for analyzing the text against the stemma in order to answer exactly the questions of what sort of variation occurred when and how.  I'll be presenting all of these methods on Friday, as well as some preliminary results of the number crunching. If you are at DH I hope to see you there!

