SOLVED! The mystery of the character encoding


Update, two hours later: we have a solution! And it's pretty disgusting. Read on below.

Two posts in a row about the deep technical guts of something I'm working on. Well I guess this is a digital humanities blog.

Yesterday I got a wonderful present in my email - a MySQL dump of a database full of all sorts of historical goodness. The site that it powers displays snippets of relevant primary sources in their original language, including things like Arabic and Greek. Since the site has been around for rather longer than MySQL has had any Unicode support to speak of, it is not all that surprising that these snippets of text in their original language are rather badly mis-encoded.

Not too much of a problem, I naïvely thought to myself. I'll just fix the encoding knowing what it's supposed to have been.

A typical example looks like this. The Greek displayed on the site is: μηνὶ Νοἐμβρίω εἰς τὰς κ ´ ινδικτιῶνος ε ´ ἔτους ,ς

but what I get from the database dump is: μηνὶ Νοἐμβρίω εἰς Ï„á½°Ï‚ κ ´ ινδικτιῶνος ε ´ á¼”τοÏ…Ï‚ ,Ï‚

Well, I recognise that kind of garbage, I thought to myself. It's double-encoded UTF-8. So all I ought to need to do is to undo the spurious re-encoding and save the result. Right?

Sadly, it's not that easy, and here is where I hope I can get comments from some DB/encoding wizards out there because I would really like to understand what's going on.

It starts easily enough in this case - the first letter is μ. In Unicode, that is character 3BC (notated in hexadecimal.) When you convert this to UTF-8, you get two bytes: CE BC. Unicode character CE is indeed Î, and Unicode character BC is indeed ¼. As I suspected, each of these UTF-8 bytes that make up μ has been treated as a character in its own right, and further encoded to UTF-8, so that μ has become Î¼. That isn't hard to undo.

But then we get along to that ω further down the line, which has become Ï‰. That is Unicode character 3C9, which in UTF-8 becomes CF 89. Unicode CF is the character Ï as we expect, but there is no printable Unicode character 89. Now it is perfectly possible to render 89 as UTF-8 (it would become C2 89) but instead I'm getting a rather inexplicable character whose Unicode value is 2030 (UTF-8 E2 80 B0)! And here the system starts to break down - I cannot figure out what possible mathematical transformation has taken place to make 89 become 2030.

There seems to be little mathematical pattern to the results I'm getting, either. From the bad characters in this sample:

ρ -> 3C1 -> CF 81 --> CF 81    (correct!!)
ς -> 3C2 -> CF 82 --> CF 201A
τ -> 3C4 -> CF 84 --> CF 201E
υ -> 3C5 -> CF 85 --> CF 2026
ω -> 3C9 -> CF 89 --> CF 2030

Ideas? Comments? Do you know MySQL like the back of your hand and have you spotted immediately what's going on here? I'd love to crack this mystery.

After this post went live, someone observed to me that the 'per mille' sign, i.e. that double-percent thing at Unicode value 2030, has the value 89 in...Windows CP-1250! And, perhaps more relevantly, Windows CP-1252. (In character encodings just as in almost everything else, Windows always liked to have their own standards that are different from the ISO standards. Pre-Unicode, most Western European characters were represented in an eight-bit encoding called ISO Latin 1 everywhere except Windows*, where they used this CP-1252 instead. For Eastern Europe, it was ISO Latin 2 / CP-1250.)

So what we have here is: MySQL is interpreting its character data as Unicode, and expressing it as UTF-8, as we requested. Only then it hits a Unicode value like 89 which is not actually a printable character at all. But instead of passing it through and letting us deal with it, MySQL says "hm, they must have meant the Latin 1 value here. Only when I say Latin 1 I really mean CP-1252. So I'll just take this value (89 in our example), see that it is the 'per mille' sign in CP-1252, and substitute the correct Unicode for 'per mille'. That will make the user happy!"

Hint: It really, really, doesn't make the user happy.
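
For anyone who wants to see the mechanism in action, the whole mangling is easy to reproduce with Perl's Encode module. Here is a minimal sketch (an illustration only, separate from the actual fix-up script below) that follows the ω example through the same steps:

#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw( encode decode );

my $greek = "\x{3c9}";                   # ω, Unicode character 3C9
my $bytes = encode( 'UTF-8', $greek );   # its UTF-8 form, the bytes CF 89
# Read those bytes back as CP-1252 text, the way MySQL apparently does:
# CF becomes Ï (00CF), and 89 becomes the 'per mille' sign, 2030.
my $mangled = decode( 'cp1252', $bytes );
printf "%04X -> %s\n", ord( $greek ),
    join( ' ', map { sprintf( '%04X', ord( $_ ) ) } split( //, $mangled ) );
# prints "03C9 -> 00CF 2030", i.e. the last row of the table above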

So here is the Perl script that will take the garbage I got and turn it back into Greek. Maybe it will be useful to someone else someday!

#!/usr/bin/env perl

use strict;
use warnings;
use Encode;
use Encode::Byte;

while(<>) {
    # Decode the doubly-encoded UTF-8 line once, giving the intermediate
    # mojibake characters (Î, ¼, ‰ and so on).
    my $line = decode_utf8( $_ );
    my @chr;
    foreach my $c ( map { ord( $_ ) } split( '', $line ) ) {
        if( $c > 255 ) {
            # Any code point above 0xFF is one of MySQL's CP-1252
            # substitutions; encoding it back to CP-1252 recovers the
            # original byte value.
            $c = ord( encode( 'cp1252', chr( $c ) ) );
        }
        push( @chr, $c );
    }
    # The recovered byte values now spell out the original, single-encoded UTF-8.
    my $newline = join( '', map { chr( $_ ) } @chr );
    print $newline;
}
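
The script reads from standard input or from any files named on the command line, so it can simply be run as a filter over the dump, something like this (the filenames here are placeholders, of course):

$ perl fix_encoding.pl dump.sql > dump_fixed.sql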

[*]Also, as I realized after posting this, except Mac, which used MacRoman. Standards are great! Let's all have our own!

How to have several Catalyst apps behind one Apache server


Since I've changed institutions this year, I am in the process of migrating Stemmaweb from its current home (on my family's personal virtual server) to the academic cloud service being piloted by SWITCH. Along the way, I ran into a Perl Catalyst configuration issue that I thought would be useful to write about here, in case others run into a similar problem.

I have several Catalyst applications - Stemmaweb, my edition-in-progress of Matthew of Edessa, and pretty much anything else I will develop with Perl in the future. I also have other things (e.g. this blog) on the Web, and being somewhat stuck in my ways, I still prefer Apache as a webserver. So basically I need a way to run all these standalone web applications behind Apache, with a suitable URL prefix to distinguish them.

There is already a good guide to getting a single Catalyst application set up behind an Apache front end. The idea is that you start up the application as its own process, listening on a local network port, and then configure Apache to act as a proxy between the outside world and that application. My problem was, I want to have more than one application, and I want to reach each different application via its own URL prefix (e.g. /stemmaweb, /ChronicleME, /ncritic, and so on.) The difficulty with a reverse proxy in that situation is this:

  • I send my request to http://my.public.server/stemmaweb/
  • It gets proxied to http://localhost:5000/ and returned
  • But then all my images, JavaScript, CSS, etc. are at the root of localhost:5000 (the backend server) and so look like they're at the root of my.public.server, instead of neatly within the stemmaweb/ directory!
  • And so I get a lot of nasty 404 errors and a broken application.

What I need here is an extra plugin: Plack::Middleware::ReverseProxyPath. I install it (in this case with the excellent 'cpanm' tool):

$ cpanm -S Plack::Middleware::ReverseProxyPath

And then I edit my application's PSGI file to look like this:

use strict;
use warnings;

use lib '/var/www/catalyst/stemmaweb/lib';
use stemmaweb;
use Plack::Builder;

builder {
        enable( "Plack::Middleware::ReverseProxyPath" );
        my $app = stemmaweb->apply_default_middlewares(stemmaweb->psgi_app);
        $app;
}

where /var/www/catalyst/stemmaweb is the directory that my application lives in.

In order to make it all work, my Apache configuration needs a couple of extra lines too:

    # Configuration for Catalyst proxy apps. This should eventually move
    # to its own named virtual host.
    RewriteEngine on
    <Location /stemmaweb>
            RequestHeader set X-Forwarded-Script-Name /stemmaweb
            RequestHeader set X-Traversal-Path /
            ProxyPass http://localhost:5000/ 
            ProxyPassReverse http://localhost:5000/
    </Location>
    RewriteRule ^/stemmaweb$ stemmaweb/ [R]

The RequestHeaders inform the backend (Catalyst) that what we are calling "/stemmaweb" is the thing that it is calling "/", and that it should translate its URLs accordingly when it sends us back the response.
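
Before wiring in the real application, it can be reassuring to check that these headers are doing their job. Here is a throwaway diagnostic .psgi (just a sketch, not part of Stemmaweb) that echoes the path variables back; put it behind the same Apache stanza and request http://my.public.server/stemmaweb/foo - SCRIPT_NAME should come back as /stemmaweb rather than empty, which is exactly what Catalyst needs in order to build its URLs under the right prefix.

use strict;
use warnings;
use Plack::Builder;

# Trivial app that reports the path variables the backend actually sees.
my $echo = sub {
    my $env = shift;
    my $body = sprintf( "SCRIPT_NAME=%s\nPATH_INFO=%s\n",
        $env->{SCRIPT_NAME}, $env->{PATH_INFO} );
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ $body ] ];
};

builder {
        enable( "Plack::Middleware::ReverseProxyPath" );
        $echo;
}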

The second thing I needed to address was how to start these things up automatically when the server turns on. The guide gives several useful configurations for starting a single service, but again, I want to make sure that all my Catalyst applications (and not just one of them) start up properly. I am running Ubuntu, which uses Upstart to handle its services; to start all my applications I use a pair of scripts and the 'instance' keyword.

description "Starman master upstart control"
author      "Tara L Andrews (tla@mit.edu)"
# Control all Starman jobs via this script
start on filesystem or runlevel [2345] 
stop on runlevel [!2345]
# No daemon of our own, but here's how we start them
pre-start script
  port=5000
  for dir in `ls /var/www/catalyst`; do
    start starman-app APP=$dir PORT=$port || :
    port=$((port+1))
  done
end script
# and here's how we stop them
post-stop script
  for inst in `initctl list|grep "^starman-app "|awk '{print $2}'|tr -d ')'|tr -d '('`; do
    stop starman-app APP=$inst PORT= || :
  done
end script

The application script, which gets called by the control script for each application in /var/www/catalyst:

description "Starman upstart application instance"
author      "Tara L Andrews (tla@mit.edu)"
respawn limit 10 5 
setuid www-data
umask 022 
instance $APP$PORT
exec /usr/local/bin/starman --listen localhost:$PORT /var/www/catalyst/$APP/$APP.psgi

There is one thing about this solution that is not so elegant, which is that each application has to start on its own port and I need to specify the correct port in the Apache configuration file. As it stands the ports will be assigned in sequence (5000, 5001, 5002, ...) according to the way the application directory names sort with the 'ls' command (which roughly means, alphabetically.) So whenever I add a new application I will have to remember to adjust the port numbers in the Apache configuration. I would welcome a more elegant solution if anyone has one!
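
For the record, each additional application currently gets its own stanza along these lines in the Apache configuration (the name and port here are only illustrative; the real port depends on where the directory lands in the 'ls' ordering):

    <Location /ncritic>
            RequestHeader set X-Forwarded-Script-Name /ncritic
            RequestHeader set X-Traversal-Path /
            ProxyPass http://localhost:5001/
            ProxyPassReverse http://localhost:5001/
    </Location>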

Enabling the science of history

One of the great ironies of my academic career was that, throughout my Ph.D. work on a digital critical edition of parts of the text of Matthew of Edessa's Chronicle, I had only the vaguest inkling that anyone else was doing anything similar. I had heard of Peter Robinson and his COLLATE program, of course, but when I met him in 2007 he only confirmed to me that the program was obsolete and, if I needed automatic text collation anytime soon, I had better write my own program. Through blind chance I was introduced to James Cummings around the same time, who told me of the existence of the TEI guidelines and suggested I use them.

It was, in fact, James who finally gave me a push into the world of digital humanities. I was in the last panicked stages of writing up the thesis when he arranged an invitation for me to attend the first 'bootcamp' held by the Interedition project, whose subject was to be none other than text collation tools. By the time the meeting was held I was in that state of anxious bliss of having submitted my thesis and having nothing to do but wait for the viva, so I could bend all my hyperactive energy in that direction. Through Interedition I made some first-rate friends and colleagues with whom I have continued to work and hack to this day, and it was through that project that I met various people within KNAW (the Royal Netherlands Academy of Arts and Sciences.)

After I joined Interedition I very frequently found myself talking to its head, Joris van Zundert, about all manner of things in this wide world of digital humanities. At the time I knew pretty much nothing of the people within DH and its nascent institutional culture, and was moreover pretty ignorant of how much there was to know, so as often as not we ended up in some kind of debate or argument over the TEI, over the philosophy of science, over what constitutes worthwhile research. The main object of these debates was to work out who was holding what unstated assumption or piece of background context.

One evening we found ourselves in a heated argument about the application of the scientific method to humanities research. I don't remember quite how we got there, but Joris was insisting (more or less) that humanities research needed to be properly scientific, according to the scientific method, or else it was rubbish, nothing more than creative writing with a rhetorical flourish, and not worth anyone's time or attention. Historians needed to demonstrate reproducibility, falsifiability, the whole works. I was having none of it--while I detest evidence-free assumption-laden excuses for historical argument as much as any scholar with a proper science-based education would, surely Joris and everyone else must understand that medieval history is neither reproducible nor falsifiable, and that the same goes for most other humanities research? What was I to do, write a Second Life simulation to re-create the fiscal crisis of the eleventh century, complete with replica historical personalities, and simulate the whole to see if the same consequences appeared? Ridiculous. But of course, I was missing the point entirely. What Joris was pushing me to do, in an admittedly confrontational way, was to make clear my underlying mental model for how history is done. When I did, it became really obvious to me how and where historical research ultimately stands to gain from digital methods.

OK, that's a big claim, so I had better elucidate this mental model of mine. It should be borne in mind that my experience is drawn almost entirely from Near Eastern medieval history, which is grossly under-documented and fairly starved of critical attention in comparison to its Western cousin, so if any of you historians of other places or eras have a wildly different perspective or model, I'd be very interested to hear about it!

When we attempt a historical re-construction or create an argument, we begin with a mixture of evidence, report, and prior interpretation. The evidence can be material (mostly archaeological) or documentary, and we almost always wish we had roughly ten times as much of it as we actually do. The reports are usually those of contemporaneous historians, which are of course very valuable but must be examined in themselves for what they aren't telling us, or what they are misrepresenting, as much as for what they positively tell us. The prior interpretation easily outweighs the evidence, and even the reports, for sheer volume, and it is this that constitutes the received wisdom of our field.

So we can imagine a rhetorical structure of dependency that culminates in a historical argument, or a reconstruction. We marshal our evidence, we examine our reports, we make interpretations in the light of received wisdom and prior interpretations. In effect it is a huge and intricate connected structure of logical dependencies that we carry around in our head. If our argument goes unchallenged or even receives critical acceptance, this entire structure becomes a 'black box' of the sort described by Bruno Latour, labelled only with its main conclusion(s) and ready for inclusion in the dependency structure of future arguments.

Now what if some of our scholarship, some of the received wisdom even, is wide of the mark? Pretty much any historian will relish the opportunity to demonstrate that "everything we thought we knew is wrong", and in Near Eastern history in particular these opportunities come thick and fast. This is a fine thing in itself, but it poses a thornier problem. When the historian demonstrates that a particular assumption or argument doesn't hold water--when the paper is published and digested and its revised conclusion accepted--how quickly, or slowly, will the knock-on effects of this new bit of insight make themselves clear? How long will it take for the implications to sort themselves out fully? In practice, the weight of tradition and patterns of historical understanding for Byzantium and the Near East are so strong, and have gone for so long unchallenged, that we historians simply haven't got the capacity to identify all the black boxes, to open them up and find the problematic components, to re-assess each of these conclusions with these components altered or removed. And this, I think, is the biggest practical obstacle to the work of historians being accepted as science rather than speculation or storytelling.

Well. Once I had been made to put all of this into words, it became clear what the most useful and significant contribution of digital technology to the study of history must eventually be. Big data and statistical analysis of the contents of documentary archives is all well and good, but what if we could capture our very arguments, our black boxes of historical understanding, and make them essentially searchable and available for re-analysis when some of the assumptions have changed? They would even be, dare I say it, reproducible and/or falsifiable. Even, perish the thought, computable.

Book cover, Understanding Digital Humanities
A few months after this particular debate, I was invited to join Joris and several other members of the Alfalab project at KNAW in preparing a paper for the 'Computational Turn' workshop in early 2010, which was eventually included in a collection that arose from the workshop. In the article we take a look at the processes by which knowledge is formalized in various fields in the humanities, and how the formalization can be resisted by scholars within each field. Among other things we presented a form of this idea for the formalization of historical research. Three years later I am still working on making it happen.

I was very pleased to find that Palgrave Macmillan makes its author self-archiving policies clear on their website, for books of collected papers as well as for journals. Unfortunately the policy is that the chapter is under embargo until 2015, so I can't post it publicly until then, but if you are interested meanwhile and can't track down a copy of the book then please get in touch!

J. J. van Zundert, S. Antonijevic, A. Beaulieu, K. van Dalen-Oskam, D. Zeldenrust, and T. L. Andrews, 'Cultures of Formalization - Towards an encounter between humanities and computing', in Understanding Digital Humanities, edited by D. Berry (London: Palgrave Macmillan, 2012), pp. 279-94.

Early-career encyclopedism

So there I was, a newly-minted Ph.D. enjoying my (all too brief) summer of freedom in 2009 from major academic responsibilities. There must be some sort of scholarly pheromone signal that gets emitted in cases like these, some chemical signature that senior scholars are attuned to that reads 'I am young and enthusiastic and am not currently crushed by the burden of a thousand obligations'. I was about to meet the Swarm of Encyclopedists.

It started innocently enough, actually even before I had submitted, when Elizabeth Jeffreys (who had been my MPhil degree supervisor) offered me the authorship of an article on the Armenians to go into an encyclopedia that she was helping to edit. As it happened, this didn't intrude again on my consciousness until the following year--I was duly signed up as author, but my email address was entered incorrectly in a database so I was blissfully ignorant of what exactly I had committed to until I began to get mysterious messages in 2010 from a project I hadn't really even heard of, demanding to know where my contribution was.

Lesson learned: you can almost always get a deadline extended in these large collaborative projects. After all, what alternatives do the editors have, really?

The second lure came quite literally the evening following my DPhil defense, when Tim Greenwood (who had been my MPhil thesis supervisor) got in touch to tell me about a project on Christian-Muslim relations being run out of Birmingham by David Thomas, and that I would seem to be the perfect person to write an entry on Matthew of Edessa and his Chronicle. Flush with victory and endorphins, of course I accepted within the hour. Technically speaking this was a 'bibliographical history' rather than an 'encyclopedia', but the approach to writing my piece was very similar, and it was more or less the ideal moment for me to summarize everything I knew about Matthew.

For a little bit of doctoral R&R, academic style, I flew off a few days later to Los Angeles for the 2009 conference of the Society of Armenian Studies. There in the sunshine I must have been positively telegraphing my relaxation and lack of obligations, because Theo van Lint (who had only just ceased being my DPhil supervisor) brought up the subject of a number of encyclopedia articles on Armenian authors that he had promised and was simply not going to have a chance to do. By this time I was beginning to get a little surprised at the number of encyclopedia articles floating around in the academic ether looking for an authorly home, and I was not so naïve as to accept the unworkable deadline that he had, but subject to reasonability I said okay. He assured me that he would send me the details soon.

Around that time, through one of the mailing lists to which I had subscribed in the last month or so of my D.Phil., I got wind of the Encyclopedia of the Medieval Chronicle (EMC). The general editor, Graeme Dunphy, was looking for contributors to take on some of the orphan articles in this project. Matthew of Edessa was on the list, and I was already writing something similar for the Christian-Muslim Relations project, so I wrote to volunteer.

And then everything happened at once. Theo wrote to me with his list, which turned out to be for precisely this EMC project. The project manager at Brill, Ernest Suyver, who knew me from my work on another Brill project, wrote to me to ask if I would consider taking on several of the Armenian articles. Before I could answer either of these, Graeme wrote back to me, offering me not only the article on Matthew of Edessa that I'd asked for--not only the entire set of Armenian articles that both Theo and Ernest had sent in my direction--but the job of section editor for all Armenian and Syriac chronicles! The previous section editor had evidently disappeared from the project and it seems that only someone as young and unburdened as me had any hope of pulling off the organization and project management they needed on the exceedingly short timescale they had, or of being unwise enough to believe it could be done.

But I was at least learning enough by then to expect that any appeal to more senior scholars than myself was likely to be met with "Sorry, I have too much work already" and an unspoken coda of "...and encyclopedia articles are not exactly a priority for me right now." There was the rare exception of course, but I turned pretty quickly to my own cohort of almost- or just-doctored scholars to farm out the articles I couldn't (or didn't want to) write myself. So I suppose by that time even I was beginning to detect the "yes I can" signals coming from the early-career scholars around me. Naturally the articles were not all done on time--it was a pretty ludicrous time frame I was given, after all--but equally naturally, delays in the larger project meant that my part was completed by the time it really needed to be. And so in my first year as a postdoc I had a credit on the editorial team of a big encyclopedia project, and a short-paper-length article, co-authored with Philip Wood, giving an overview of Eastern Christian historiography as a whole. I remain kind of proud of that little piece.

Lesson learned: your authors can almost always get you to agree to a deadline extension in these large collaborative projects. After all, what alternative do you have as editor, short of finding another author, who will need more time anyway, and pissing off the first one by withdrawing the commission?

The only trouble with these articles is that it's awfully hard to know how to express them in the tickyboxes of a typical publications database like KU Leuven's Lirias. Does each of the fifteen entries I wrote get its own line? Should I list the editorship separately, or the longer article on historiography? It's a little conundrum for the CV.

Nevertheless I'm glad I got the opportunity to do the EMC project, definitely. And here's another little secret--if I am able to make the time, I kind of like writing encyclopedia articles. It's a nice way to get to grips with a subject, to cut straight to the essence of "What does the reader--and what do I--really need to know in these 250 words?" This might be why, when yet another project manager for yet another encyclopedia project found me about a year ago, I didn't say no, and so this list will have an addition in the future. After that, though, I might finally have to call a halt.

I have written to Wiley-Blackwell to ask about their author self-archiving policies; I have a PDF offprint but am evidently not allowed to make it public, frustratingly enough. I will update the Lirias record if that changes. Brill has a surprisingly humane policy that allows me to link freely to the offprints of my own contributions in an edited collection, so I have done that here. I don't seem to have an offprint for all the articles I wrote, though, so will need to rectify that.

Andrews, T. (2012). Armenians. In: Encyclopedia of Ancient History, ed. R. Bagnall et al. Malden, MA: Wiley-Blackwell.

Andrews, T. (2012). Matthew of Edessa. In: Christian-Muslim Relations. A Bibliographical History. Volume 3 (1050-1200), ed. D. Thomas and B. Roggema. Leiden: Brill.

Andrews, T. and P. Wood. (2012). Historiography of the Christian East. In: Encyclopedia of the Medieval Chronicle, general editor G. Dunphy. Leiden: Brill.
(Additional articles on Agatʿangełos, Aristakēs Lastivertcʿi, Ełišē, Kʿartʿlis Cxovreba, Łazar Pʿarpecʿi, Mattʿēos Uṙhayecʿi, Movsēs Dasxurancʿi, Pʿawstos Buzand, Smbat Sparapet, Stepʿanos Asołik, Syriac Short Chronicles (with J. J. van Ginkel), Tʿovma Arcruni, Yovhannēs Drasxanakertcʿi.)

Public accountability, #acwrimo, and The Book

Over the course of 2011, among the long-delayed things I finally managed to do was to put together a book proposal for the publication of my Ph.D. research. While I am reasonably pleased with the thesis I produced, it is no exception to the general rule that it would not make a very good book if I tried to publish it as it stands.  As it happens there is a reasonably well-known series by a well-respected publisher, edited by someone I know, where my research fits in rather nicely. Even more nicely, they accepted my proposal.

Now here is where I have to humblebrag a little: I wrote my Ph.D. thesis kind of quickly, and much more quickly than I would recommend to any current Ph.D. students. Part of this was luck--once I hit upon my main theme, a lot of it just started falling into place--but part of it was the sheer terror of an externally-imposed deadline. I had rather optimistically applied for a British Academy post-doctoral fellowship in October 2008, figuring that either I'd be rejected and it would make no difference at all, or that I'd be shortlisted and have a deadline of 1 April 2009 to have my thesis finished and defended.  At the time I applied I had a reasonable outline, one more or less completed chapter and the seeds for two more, and software that was about 1/3 finished.  By the beginning of January I was only a little farther along, and I realized that the BA was going to make its shortlisting decisions very soon and, unless I made a serious and concerted effort to produce some thesis draft, I may as well withdraw my name.  Amazingly enough this little self-motivational talk worked wonders and I spent the middle two weeks of January writing like crazy and dosing myself with ibuprofen for the increasingly severe tendinitis in my hands. (See? Not recommended.) Then, wonder of wonders, I was shortlisted and I got to dump the entire thing in my supervisor's lap and say "Read this, now!" The next month was a panic-and-endorphin-fuelled rush to get the thing ready for submission by 20 February, so that I could have my viva by the end of March.  This involved some fairly amusing-in-retrospect scenes. I had to enlist my husband to draw a manuscript stemma for me in OmniGraffle because my hands were too wrecked to operate a trackpad. I imposed a series of strict deadlines on my own supervisor for reading and commenting on my draft, and met him on the morning of Deadline Day to incorporate the last set of his corrections, which involved directly hacking a horribly complicated (and programmatically generated) LaTeX file that contained the edited text I had produced. (Yes, *very* poor programming practice that, and I am still suffering the consequences of not having taken the time to do it properly.)

In the end the British Academy rejected me anyway, but what did I care? I had a Ph.D.

With that experience in mind, I set myself an ambitious and optimistic target of 'spring 2012' for having a draft of the book. For the record the conversion requires light-to-moderate revision of five existing chapters, complete re-drafting of the introductory chapter, and addition of a chapter that involves a small chunk of further research.  It was in this context, last October, that I saw the usual buzz surrounding the ramp-up to NaNoWriMo and thought to myself "you know, it would be kind of cool to have an academic version of that."

It turns out I'm not the only one who thought this thought--there actually was an "Ac[ademic ]Bo[ok ]WriMo" last year. In the end the project that was paying my salary demanded too much of my attention to even think about working on the book, and the idea went by the wayside. The target of spring 2012 for production of the complete draft was also a little too optimistic, even by my standards, and that deadline whizzed right on by.

Here it is November again, though, and AcWriMo is still a thing (though they have dropped the explicit 'book' part of it), and my book still needs to be finished, and this year I don't have any excuses. So I signed myself up, and I am using this post to provide that extra little bit of public accountability for my good intentions.  I am excusing myself from weekend work on account of family obligations, but for the weekdays (except *possibly* for the days of ESTS) I am requiring of myself a decent chunk of written work, with one week each dedicated to the two chapters that need major revision or drafting de novo.

I won't be submitting the thing to the publisher on 30 November, but I am promising myself (and now the world) that by the first of December, all that will remain is bibliographic cleanup and cosmetic issues. I am really looking forward to my Christmas present of a finished manuscript, and I am counting on public accountability to help make sure I get it.  Follow me on Twitter or App.net (if you don't already) and harass me if I don't update!

Conference-driven doctoral theses

In the computer programming world I have occasionally come across the concept of 'conference-driven development' (and, let's be honest, I've engaged in it myself a time or two.) This is the practice of submitting a talk to a conference that describes the brilliant software that you have written and will be demonstrating, where by "have written" you actually mean "will have written". Once the talk gets accepted, well, it would be downright embarrassing to withdraw it so you had better get busy.

It turns out that this concept can also work in the field of humanities research (as, I suspect, certain authors of Digital Humanities conference abstracts are already aware.) Indeed, the fact that I am writing this post is testament to its workability even as a means of getting a doctoral thesis on track. (Graduate students take note!)

In the autumn of 2007 I was afloat on that vast sea of Ph.D. research, no definite outline of land (i.e. a completed thesis) in sight, and not much wind in the sails of my reading and ideas to provide the necessary direction. I had set out to create a new critical edition of the Chronicle of Matthew of Edessa, but it had been clear for a few months that I was not going to be able to collect the necessary manuscript copies within a reasonable timeframe. Even if I had, the text was far too long and copied far too often for the critical edition ever to have been feasible.

One Wednesday evening, after the weekly Byzantine Studies department seminar, an announcement was made about the forthcoming Cambridge International Chronicles Symposium to be held in July 2008. It was occurring to me by this point that it might be time to branch out from graduate-student conferences and try to get something accepted in 'grown-up' academia, and a symposium devoted entirely to medieval chronicles seemed a fine place to start. I only needed a paper topic.

Matthew wrote his Chronicle a generation after the arrival of the First Crusade had changed pretty much everything about the dynamics of power within the Near East, and his city Edessa was no exception. Early in his text he features a pair of dire prophetic warnings attributed to the monastic scholar John Kozern; the last of these ends with a rather spectacular prediction of the utter overthrow of Persian (read: Muslim, but given the cultural context you may as well read "Persian" too) power by the victorious Roman Emperor, and Christ's peace until the end of time. It is a pretty clearly apocalyptic vision, and much of the Chronicle clearly shows Matthew struggling to make sense of the fact that some seriously apocalyptic events (to wit, the Crusade) occurred and yet it was pretty apparent forty years later that the world was not yet drawing to an end with the return of Christ.

Post-apocalyptic history, I thought to myself, that's nicely attention-getting, so I made it the theme of my paper. This turned out to be a real stroke of luck - I spent the next six months considering the Chronicle from the perspective of somewhat frustrated apocalyptic expectations, and little by little a lot of strange features of Matthew's work began to fall into place. The paper was presented in July 2008; in October I submitted it for publication and turned it into the first properly completed chapter of my thesis. Although this wasn't the first article I submitted, it was the first one that appeared in print.

Announcing Stemmaweb


[Cross-posted from the Tree of Texts project blog]

The Tree of Texts project formally comes to an end in a few days; it's been a fun two years and it is now time to look at the fruits of our research. We (that is, Tara) gave a talk at the DH 2012 conference in July about the project and its findings; we also participated in a paper led by our colleagues in the Leuven CS department about computational analysis of stemma graph models, which was presented at the CoCoMILE workshop during the European Conference on Artificial Intelligence. We are now engaged in writing the final project paper; following up on the success of our DH talk, we will submit it for inclusion in the DH-related issue of LLC. Alongside all this, work on the publication of proceedings from our April workshop continues apace; nearly all the papers are in and the collection will soon be sent to the publisher.

More excitingly, from the perspective of text scholars and critical editors who have an interest in stemmatic analysis, we have made our analysis and visualization tools available on the Web! We are pleased to present Stemmaweb, which was developed in cooperation with members of the Interedition project and which provides an online interface for examining text collations and their stemmata. Stemmaweb has two homes:

http://treeoftexts.arts.kuleuven.be/stemmaweb/ (the official KU Leuven site)
http://byzantini.st/stemmaweb/ (Tara's personal server, less official but much faster)

If you have a Google account or another OpenID account, you can use that to log in; once there you can view the texts that others have made public, and even upload your own. For any of your texts you can create a stemma hypothesis and analyze it with the tools we have used for the project; we will soon provide a means of generating a stemma hypothesis from a phylogenetic tree, and we hope to link our tools to those emerging soon from the STAM group at the Helsinki Institute for Information Technology.

Like almost all tools for the digital humanities, these are highly experimental. Unexpected things might happen, something might go wrong, or you might have a purpose for a tool that we never imagined.  So send us feedback! We would love to hear from you.

Hamburg here I come

As I write this I am on my way to Hamburg for DH2012. I'm very much looking forward to the conference this year, not only because of the wide variety of interesting papers and the chance to explore a city I've heard a lot of nice things about, but also because this year I feel like I have some substantial research of my own to contribute.

My speaking slot is on Friday morning (naturally opposite a lot of other interesting and influential speakers, but that seems to be the perpetual curse of DH.)  In preparation for that, I thought I might set down the background for the project I have been working on for the last two years, and discuss a little of what I will be presenting on Friday. After all, if I can set it down in a blog post then I can present it, right?

The project is titled The Tree of Texts, and its aim is to provide a basis for empirical modelling of text transmission. It grows out of the problem of text stemmatology, and specifically the stemmatology of medieval texts that were transmitted through manual copies by scribes who were almost never the author of the original text (if, indeed, a single original text ever existed.)

It is well known that texts vary as they are copied, whether through mistakes, changes in dialect, or intentional adaptation of the text to its context; almost as long as texts have been copied, therefore, scholars have tried in one way or another to get past these variations to what they believe to be the original text.  Even in cases where there was never a written original text, or where the interest of the scholar is more in the adaptation than in the starting point, there is a lot to be gained if we can understand how the text changed over time.

Stemmatology, the formal reconstruction of the genesis of a text, developed as a discipline over the course of the nineteenth century; the most common ("Lachmannian") method is based on the principle that if two or more manuscripts share a copying error, they are likely to have been copied either one from the other or both from the same (lost) exemplar. There has been a lot of effort, scholarship, and argument on the subject of how one distinguishes 'error' from an original (or archetypal) reading, how one distinguishes genealogical error (e.g. the misreading of a few words in a nigh-irreversible way so that the meaning of the entire sentence is changed) from coincidental error (e.g. variation in word spelling or dialect, which probably says more about the scribe than about the manuscript being copied).  The classical Lachmannian method requires the practitioner to decide in advance which variants are likely to have been in the original; more recent and computationally-based neo-Lachmannian methods allow the scholar to withhold that particular pre-judgment, but still require a distinction to be made concerning which shared variants are likely or unlikely to have been coincidental or reversible.

A method that requires the scholar to know the answer in advance was always likely to encounter opposition, and Lachmannian stemmatology has spawned entire sub-disciplines in protest at the sheer arrogance (so an anti-Lachmannian might describe it) of claiming to know in advance what is important and what is trivial. Nevertheless the problem remains: how to trace the history of a text, particularly if we begin with the assumption that we know no more, and perhaps considerably less, than the scribes who made the copies?  The first credible answer was borrowed from the field of evolutionary biology, where they have a similar problem in trying to understand the order in which features of species might have evolved and the specific relationships to each other of members of a group.  This is the discipline of phylogenetics, and there are several statistical methods to reconstruct likely family trees based upon nothing more than the DNA sequences of species living today.  Treat a manuscript as an organism, imagine that its text is its DNA sequence, et voilà - you can create an instant family tree.

And yet phylogenetics, if you ask the Lachmannians and other text scholars besides, has its own problems.  First, the phylogenetic model assumes that any species living today is by definition not an ancestor species, and therefore must appear only at the edge of the family tree; in contrast we certainly still possess manuscripts that served as the 'parent' of other extant manuscripts.  Second, in evolutionary terms it is reasonable to model the tree as a bifurcating one - that is, a species only ever divides into two, and then as time progresses either or both of these may divide further.  This also fails to match the manuscript model, where it is easy to see a single text spawning two, three, or ten direct copies.  Third, where the evolutionary model is assumed to be continuously branching, it is well known that a manuscript can be copied with reference to two, three, or even four exemplars. This is next to impossible to represent in a tree (and indeed is not usually handled in a Lachmannian stemma either, serving more often as a reason why a stemma was not attempted.)  Fourth is the problem of significance of variants--while some scholars will insist that variants should simply not be pre-judged in terms of their significance, most will acknowledge the probable truth that some sorts of variation are more telling than other sorts.  Most phylogenetic programs do not by default take variant significance into account, and most users of phylogenetic trees don't even try.

In a recent paper, some of the luminaries of text phylogeny argue that none of these problems are insurmountable. Neighbor net diagrams can give some clues regarding multiple text parentage; some more recent and specialized algorithms such as Semstem are able to build trees so that a text can be an ancestor of another text, and so that a text can have more (or even fewer) than two descendants.  The authors also argue that the problem of significance can be handled trivially in the phylogenetic analysis by anyone who cares to assign weighting factors to the variant sets s/he provides to the program.

While it is undoubtedly true that automated algorithms can handle assignment of significance (that is, weighting), it also remains true that there are only two options for assigning these weightings:

  1. Treat all variants as equal
  2. Assign the weights arbitrarily, according to philological 'common sense', personal experience, or any other criterion that takes your fancy.

This is exactly the 'missing link' in text stemmatology: what sorts of variants occurred in medieval copying, how common were they, how commonly were they copied, and how commonly were they changed?  If we can build a realistic picture of what, statistically speaking, variation actually looked like in medieval times, it will be an enormous step toward reconstructing the stemmata by whatever means the philologist chooses, be it neo-Lachmannian, phylogenetic, or a method yet to be invented.

What we have done in the Tree of Texts project is to create a model for representing text variation, and a model for representing stemmata, and methods for analyzing the text against the stemma in order to answer exactly the questions of what sort of variation occurred when and how.  I'll be presenting all of these methods on Friday, as well as some preliminary results of the number crunching. If you are at DH I hope to see you there!

Of circumstance and Armenian chroniclers

I promised to start blogging an inventory of my publications back in April. Yes, it's now July. It turns out that my breezy confidence concerning the ease of discovery of my rights to my own work was...misguided.

My first publication arose from my M.Phil. thesis. The thesis itself was an enormous logic and date-accounting puzzle, which I thought was all kinds of fun but which, when described to fellow students, tended to get the reaction "I'm so sorry, that sounds horribly boring!"  That says something about the geek disposition, I suppose.

The topic of my thesis, and the eventual paper, was the chronological weirdness of the first book of the Chronicle of Matthew of Edessa. There is a back story there, on how a vaguely Byzantium-fancying computer geek came to be writing about an Armenian historical chronicle concerned in large part with a topic (the Crusades) that, had I been asked in 2003, I would have found utterly uninteresting.  It's also a tale of how the smallest sorts of circumstance can shape a career.

I began grad school on the heels of the Great Dot-com Bust.  My bachelor's degree was a strange MIT hybrid ("Humanities and Engineering") which really meant that I had been on course to do a computer science degree when I realized that I could have a lot more fun doing half my coursework in history, and at the end of it I would still probably get a programming job at some Internet startup.  So it came to pass, but I could never shake the urge to go back and give history a more proper study.  In the end the universe did me a perverse sort of favor when my company laid me off just as I was finally resolving to prepare those grad school applications.

This is how I found myself in a room at Exeter College one gorgeous afternoon working out, together with the other new master's students, what I ought to be doing for the next two years. Among the decisions we needed to make was the language we would study for the examination requirement; the (rather fantastic-sounding) options were Greek, Latin, Armenian, Syriac, Church Slavonic, and Arabic. I had enough Greek and Latin to be getting on with, but my powers as a dead-language autodidact had already failed me once when confronted with Armenian. Why not get some actual tuition in it and see how I did?

Of such whims are career paths made.  Once I had expressed a guarded interest in Armenian language, well, it seemed evident to the assembled dons that I should apply it by studying some Armenian history.  That turned out to be a field so very under-studied that potential thesis topics were lurking under nearly every assigned primary text and journal article.  I resolved eventually to write a thesis on the subject of the Armenian economy of the tenth and eleventh centuries, seeing what we might piece together by looking critically at literary and epigraphic sources. I dutifully began to read, and by August I had a collection of notes on the three main historians of the era (dots indicate approximate note volume):

  • [..]  Aristakes of Lastivert
  • [....]  Stephen of Taron       
  • [.........................................................................]  Matthew of Edessa    

Hm. Clearly my thesis had chosen a direction, even if I hadn't.  It was not Matthew's poetic writing, vivid narrative, or historical accuracy that had caught my attention - in the latter case, rather the opposite. How could such a vast history be so very full of such obvious mistakes? Was there any rhyme or reason to them? Could we trust *anything* that Matthew was trying to tell us? If so, what? It took a few months more for the thesis topic to resolve itself to these chronological mistakes, but I got there in the end. The whole process began to turn into an intriguing logic puzzle that I had a lot of fun trying to solve, and it seemed a little unbelievable that no one had beaten me to it.

It took me three years (and another job in industry) to condense the thesis to an article suitable for publication, but I finally submitted it in 2008 to the standard journal for Armenian scholarship, the Revue des études arméniennes. My reward was a charming hand-written letter from the editor acknowledging my contribution and that he would be happy to publish it, though he wondered what my view was on certain issues I hadn't addressed. I got to pretend for a moment that I was about fifty years older than I am, initiated into the academic community in an era where scholarship was carried on through personal correspondence.

As I have not heard anything from Peeters (and cannot find any information online) concerning author rights, and as I don't believe I actually signed anything handing over any rights in any event, I have chosen to go with the safest reasonable option for open access: the final version of the article content, before typesetting.

Andrews, Tara L., 'The Chronology of the Chronicle: An Explanation of the Dating Errors within Book 1 of the Chronicle of Matthew of Edessa', Revue des études arméniennes 32 (2010): 141-64.

Introduction, inclusion, and open access

For as long as I have been part of the wider Digital Humanities community, I have felt like an outsider. On the periphery. Not part of the "clique", although for the most part I have found DH people to be pretty welcoming.  So I was a little struck by the reports from the Cologne Dialogue on the Digital Humanities that took place this week, as (according to Twitter) disputant after disputant also claimed to be on the periphery of DH.  What is it with this field, that so many of its apparent members claim that they are not part of the "in-crowd?"

I don't have a full answer to that (for now), but it was a similar thought process that got me to start this blog.  I'm a Byzantinist, I'm a computer hacker, I combine the two as often as I can--there are really no grounds here for exclusion.  What I did realize is that, more than most geeky pursuits, perceived membership in the "DH club" has almost everything to do with how often and how visibly you speak up.

So, by way of a general introduction, I am going to take a leaf from the book (ebook? blog?) of Melissa Terras, and make a series of posts about the work I have done to date and the publications that my work has led to.  Along the way I will check the open-access policy for each of the publishers, and make sure that anything that can be open access is, and post a link to it.  Unlike Melissa's, mine will be a chronological history; given my odd hybrid career, it is best to avoid backtracking.  And really I think it is a great idea for us scholars to take advantage of whatever rights our publishers allow us to retain over our own work (which is more than I would have thought, for many journal publishers), and get that work out there and indexed in search engines.

This should be fun! Coming soon: how I ended up learning Armenian, and proof that I am indeed a hopeless nerd.
