An amateur archaeologist's use of XML

John Palmer

February 2004

Previously published in Mark Bell's XML Newsletter 2 and reprinted here with his permission.

I would like to describe briefly the XML set-up that I use to maintain the work-in-progress files for my research on the Roman Purbeck stone industry.

Origins

This study began in 1996 when I found myself employed only three days a week but fortunately not suffering a corresponding reduction of income; I decided to spend some of my spare time studying archaeology at King Alfred's College in Winchester, in which city I had lived for 25 years. For my first long project (over the summer vacation) I put forward a proposal in these terms:

Proposed area of study:
Shale, stone {and salt} industries of Purbeck in the Roman period

Primary sources:
Dorset County Museum collection
Poole and Wimborne museums
Sites: asking advice from County Museum

Secondary sources:
Royal Commission on Historic Monuments inventory, Dorset South-east, 1952

This being accepted (on the understanding that the braces round {salt} meant that I would only go into this subject if I ran out of sources for shale and stone), I spent some time that summer visiting Dorset and exploring both the field and the library sources for the Roman stone and shale industries. From this came a paper which was duly submitted as coursework. (You can read it at www.palmyra.me.uk/purbeck1996.html.)

It was about two years after this that (being fully employed once more) I returned to the subject of the stone industry. By this time I had dropped the Kimmeridge Shale industry, feeling that it was already well covered by other workers. (For an introduction to Kimmeridge Shale, try Calkin 1953.) On the other hand Purbeck Stone was relatively neglected, the last major review of the subject being 30 years old (Beavis 1970). I determined from the start to put my provisional findings and working notes on the World Wide Web, so that others with related interests might note what I was doing and hopefully make suggestions, corrections and comments and maybe join in the project. I have certainly never regretted this decision, and I am very grateful to the people who have shown interest and helped guide my efforts.

At this point it would be well worth your while to view my current presentation of the data at www.palmyra.me.uk/pur-preface.html. You should bear in mind that in its present form it is far larger and more complex than when I began it. At the beginning the matter on the Web consisted of two files only: the database, being basically a list of Roman Purbeck stone artefacts, and the bibliography, basically the reference-list from my 1996 paper augmented by other citations which I had added since then. These two were quite easy to maintain as HTML files, being each no more than a few tens of thousands of characters long.

The database was basically a long unordered list <ul>, containing items of the following general kind:

<li>
  <ul>
    <li><b>name</b> ..identifying name of artefact.. </li>
    <li><b>site</b> ..where found and when.. </li>
    <li><b>publ</b> <a href="..">..reference to publication..</a> </li>
    <li><b>desc</b> ..description.. </li>
    <!-- other properties of the artefact added here -->
  </ul>
</li>

The bibliography was also a long <ul> but contained items like this one:

<li>
  <a name="Bidwell1979">Bidwell PT 1979</a>,
  <em>The legionary Bath House and Basilica and Forum at Exeter</em>,
  Exeter City Council and Univ of Exeter:
  Exeter Archaeological Reports <b>1</b>
</li>

Most of the hrefs in the database were naturally to items in the bibliography, but from the start I allowed myself unlimited references from anywhere in my files to anywhere else in my own website and also to other resources on the WWW, as I felt, then as now, that these were important guides that would assist my own analysis of the data and could also be useful to other readers.

Soon these two simple files grew. By early 2000 I had moved to Dorset and into semi-retirement. I was now in the actual country of my study and had easy access to the excellent library of the Dorset Natural History and Archaeological Society, which I had joined back in 1996. The database had split into several files, such as

Mortars (stone grinding bowls)
Other vessels (baths, basins, etc.)
Other portable artefacts
Roofing tiles of stone
Paving material
Other architectural stone
Inscriptions
Quarry sites
etc.

Moreover the internal organisation of the data on each artefact had become quite varied. Naturally there were often many publ citations for each artefact, and often several desc descriptions, sometimes one from each author cited. Not every artefact even had a distinctive name; but new properties of items, like map-references, location (i.e. in what museum), substance (real Purbeck marble, other Purbeck stone, etc.), and date (1st century, etc.), had been added in many cases. The number and order of these properties varied greatly, and this was making it difficult to study and to update the data, which by now were becoming a resource of some archaeological importance, as I described in a paper in the Dorset Proceedings (Palmer 2001).

The problem of keeping these data in order was not assisted by the fact that the syntax of HTML (any version) is designed for specifying logical subdivisions of a text, but not the significant properties of any particular kind of subject-matter (such as stone artefacts). Although I was accustomed to using nsgmls (James Clark) to verify the conformance of my HTML to the appropriate DTD (document type definition), I decided that I needed a DTD more closely related to the subject I was studying.

This DTD took shape in the summer of 2001.

The articles recorded in each file constitute a collection. Each article in the collection is an item. I allow myself to group the items by inserting subheads at suitable points in the list, but this is little more than a presentational device.

<!ELEMENT collection - - ( (subhead | item)* ) >

For convenience in defining elements in the DTD I introduce an entity:

<!ENTITY % textvar "(#PCDATA|br|em|b|a|code|img)*" >

And also this entity, to allow myself some non-ascii characters:

<!ENTITY % ISOlat1 SYSTEM "/usr/html/sgml-lib/ISOlat1.ent">
%ISOlat1;

A subhead is just a few words:

<!ELEMENT subhead - - (%textvar;) >

An item, however, has a fixed structure in which the subdivisions always appear in the same order. This to me is an important aid to reading and understanding the data. (In the old HTML notation there was nothing to enforce this order.)

<!ELEMENT item - - ( name?,
                     number?,
                     cat*,
                     site+,
                     grid*,  
                     source*, publ*, desc*,
                     loc*, subst*, date*,
                     interp*, comment*, cont* ) >

The meanings and uses of the inner elements are described below and on my website.

You'll observe that an item must have at least one site, but all the other parts are optional; more than one is allowed of every part except name and number. (Actually, number is not used at all and is only in the DTD in case I should want to start cataloguing artefacts in the style of the great corpuses (corpora?) like RIB (Collingwood and Wright 1965).)
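To make the structure concrete, here is a skeleton of a conforming item; the placeholder text is mine, in the style of the HTML example earlier, and does not record a real artefact:

<item>
  <name> ..identifying name of artefact.. </name>
  <site> ..where found and when.. </site>
  <publ> <a href="#Bidwell1979">Bidwell 1979</a> </publ>
  <desc> ..description by one author.. </desc>
  <desc> ..description by another author.. </desc>
  <subst> ..real Purbeck marble, other Purbeck stone, etc.. </subst>
</item>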

br, em and b are mere presentational devices and mean what they do in HTML, i.e. linebreak, emphasise, and bold-face.

<!ELEMENT br - - (#PCDATA) --will normally be empty-->
<!ELEMENT em - - (%textvar;) >
<!ELEMENT b - - (%textvar;) >

a corresponds to its namesake in HTML and has some of the same attributes. It is a bit old-fashioned in using "name" rather than "id" for the label that is the target of a link.

<!ELEMENT a - - (%textvar;) >
<!ATTLIST a
  href CDATA #IMPLIED
  name CDATA #IMPLIED
  target CDATA #IMPLIED >

"target" is another merely presentational device: as in HTML, it hints to the displaying program that it is worth opening a secondary window. code is also presentational and corresponds to its namesake in HTML.

<!ELEMENT code - - (%textvar;) >

img introduces a picture, as in HTML.

<!ELEMENT img - - (#PCDATA) --will normally be empty-->

All the elements listed above from br to img can be used inside any of the elements listed below, which are the main categories of information about an item. For the meaning and use of the latter, see my website at http://www.palmyra.me.uk/.

<!ELEMENT name - - (%textvar;) >
<!ELEMENT number - - (%textvar;) >
<!ELEMENT cat - - (%textvar;) >
<!ELEMENT site - - (%textvar;) >
<!ELEMENT grid - - (%textvar;) >
<!ELEMENT source - - (%textvar;) >
<!ELEMENT publ - - (%textvar;) >
<!ELEMENT desc - - (%textvar;) >
<!ELEMENT loc - - (%textvar;) >
<!ELEMENT subst - - (%textvar;) >
<!ELEMENT date - - (%textvar;) >
<!ELEMENT interp - - (%textvar;) >
<!ELEMENT comment - - (%textvar;) >
<!ELEMENT cont - - (%textvar;) >
<!--finis-->

(The above data-structure is sufficiently restrictive for my purpose, which was to help me to be regular and consistent in the recording of my data. Observant eyes will note that it does permit me to do some things that make little sense, for instance to put one a element inside another, or to insert some textual content into br or img elements. However I feel no inclination to do these things and don't need the added complication of the code necessary to forbid them.)

Having chosen a data-structure, the first problem was to convert the existing HTML data to the new form, bearing in mind that the component parts of each item had to be forced into a new order to fit the restrictions of the new DTD. There are many ways of doing this, and if mine seems odd, the reader should bear in mind that I was familiar with programming in Perl and inclined to stick to the techniques that I knew best. My ad-hoc program html2xml reads the HTML data and converts it to the new DTD; it uses the SGML parser nsgmls to convert the HTML to a canonical form and creates a structure of Perl objects corresponding to the elements of the HTML; these are then picked off in the appropriate order to create new items with correctly ordered inner parts. Apart from the time in 2001 when I first introduced the new DTD, I have not used html2xml again except on one occasion when I deleted one of my XML files by mistake!
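To give the flavour of the reordering step (a much-simplified sketch, not the program itself): suppose each part of an item has already been extracted as a line of the form "publ: ..text..", with a blank line between items; an assumption made purely to keep the illustration short. Forcing the parts into the DTD's order then amounts to bucketing them by name and emitting the buckets in a fixed sequence:

#!/usr/bin/perl
# Sketch only, not the real html2xml: reorder the parts of each item.
use strict; use warnings;

my @order = qw(name number cat site grid source publ desc
               loc subst date interp comment cont);

local $/ = "";                       # read one item (paragraph) at a time
while (my $item = <>) {
    my %parts;                       # bucket each part by its name
    while ($item =~ /^(\w+):\s*(.*)$/mg) {
        push @{ $parts{$1} }, $2;
    }
    print "<item>\n";
    for my $p (@order) {             # emit in the order the DTD requires
        print "  <$p>$_</$p>\n" for @{ $parts{$p} || [] };
    }
    print "</item>\n";
}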

I now had my data stored in XML in my new DTD in files called *.xml. From summer 2001 onwards, all amendments and additions to the data have been made by editing the XML files; this has kept a degree of discipline in my data which was hard to achieve using raw HTML. Of course, every time I amend an XML file, I have to ensure good order by validating it against the DTD described above; I do this with nsgmls, which is so quick and convenient I can use it many times over within a single data-entry session.
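For anyone who has not used it, the check is a one-line affair. Assuming each master-file begins with a DOCTYPE declaration naming the DTD (the file names here are examples only):

<!DOCTYPE collection SYSTEM "purbeck.dtd">

then

nsgmls -s mortars.xml

prints nothing at all when the file is valid (-s suppresses the parser's normal output) and a list of errors with line numbers when it is not.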

I have not attempted to put my XML on the Web directly, as I think it is important not to assume that all my readers will be using the very latest in Web-browsing software! In fact, after amending any of my XML master-files, I create a corresponding file in HTML by means of a Perl program which goes by the name updatehtml. (Although this program will produce correct HTML provided that the master-file is correct XML, I occasionally verify the generated HTML using nsgmls.) The automatically-generated HTML is, at the time of writing, XHTML 1.0.

The conversion to HTML is much simpler than the conversion out of it, for it involves little more than a succession of string-substitutions, the style of which will be familiar to anyone who has used Perl or any of its antecedent programs like sed or vi. The program works on tags, not on elements, which is satisfactory in this case provided that matching operations are performed on both the start- and the end-tag for the same element.

For instance, <collection> becomes <ul>:

$_ =~ s/<collection>/<ul>/;
$_ =~ s/<\/collection>/<\/ul>/;

<item> becomes a <ul> inside a <li>:

$_ =~ s/<item>/<li>\n<ul>/;
$_ =~ s/<\/item>/<\/ul><\/li>/;

The various parts of an item are all treated alike. First the start-tags:

$_ =~ s/<name> */<li><b>name<\/b> /;
$_ =~ s/<number> */<li><b>number<\/b> /;
$_ =~ s/<cat> */<li><b>cat<\/b> /;
$_ =~ s/<site> */<li><b>site<\/b> /;
$_ =~ s/<grid> */<li><b>grid<\/b> /;
$_ =~ s/<source> */<li><b>source<\/b> /;
$_ =~ s/<publ> */<li><b>publ<\/b> /;
$_ =~ s/<desc> */<li><b>desc<\/b> /;
$_ =~ s/<loc> */<li><b>loc<\/b> /;
$_ =~ s/<subst> */<li><b>subst<\/b> /;
$_ =~ s/<date> */<li><b>date<\/b> /;
$_ =~ s/<interp> */<li><b>interp<\/b> /;
$_ =~ s/<comment> */<li><b>comment<\/b> /;
$_ =~ s/<cont> */<li><b>cont<\/b> /;

and the end-tags:

$_ =~ s/<\/name>/<\/li>/;
$_ =~ s/<\/number>/<\/li>/;
$_ =~ s/<\/cat>/<\/li>/;
$_ =~ s/<\/site>/<\/li>/;
$_ =~ s/<\/grid>/<\/li>/;
$_ =~ s/<\/source>/<\/li>/;
$_ =~ s/<\/publ>/<\/li>/;
$_ =~ s/<\/desc>/<\/li>/;
$_ =~ s/<\/loc>/<\/li>/;
$_ =~ s/<\/subst>/<\/li>/;
$_ =~ s/<\/date>/<\/li>/;
$_ =~ s/<\/interp>/<\/li>/;
$_ =~ s/<\/comment>/<\/li>/;
$_ =~ s/<\/cont>/<\/li>/;
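(These twenty-eight substitutions could just as well be generated from a single list of the part names; a sketch of the idea, though not what updatehtml actually contains:)

my @parts = qw(name number cat site grid source publ desc
               loc subst date interp comment cont);
for my $p (@parts) {
    $_ =~ s/<$p> */<li><b>$p<\/b> /;   # start-tag, as above
    $_ =~ s/<\/$p>/<\/li>/;            # end-tag, as above
}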

As hinted before, I try to remain compatible with older browsers while not neglecting new W3C recommendations, so I ensure that each a element that is the target of a link has both "id" and "name" attributes, both with the same value:

$_ =~ s/ name=(".*?")/ id=$1 name=$1/g; # 2003-01-14
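Applied to the bibliography entry shown earlier, this turns

<a name="Bidwell1979">

into

<a id="Bidwell1979" name="Bidwell1979">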

It remains for the program to copy the front- and back-matter from the old version of the HTML file, changing only minor details (most importantly the date of revision wherever it appears).

Spotmaps

The front-matter of many of my HTML files includes a sketch-map of the province of Britannia indicating the geographical distribution of the relevant class of artefacts. This is generated from the XML files in the following way: a program spotmap scans the file for grid-references (element grid), and generates TeX code that places a suitable symbol at the appropriate spot on the map according to the National Grid. The map is then drawn and annotated using TeX, including a coastal outline, which was obtained from the website of the (United States) National Oceanic and Atmospheric Administration and is stated by NOAA to be in the public domain.
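To give the flavour of it, here is a much-simplified sketch, not the program itself; the format of the grid element, the handful of 100-km squares, and the use of a LaTeX picture environment scaled to one unit per kilometre of National Grid are all assumptions made for the illustration:

# Sketch: emit one \put for each <grid> reference found on stdin,
# assuming references of the form "SY 978 787".
my %square = ( SY => [300, 0],   SZ => [400, 0],      # 100-km squares:
               ST => [300, 100], SU => [400, 100] );  # km east, km north
while (<>) {
    while (/<grid>\s*([A-Z]{2})\s*(\d{3})\s*(\d{3})\s*<\/grid>/g) {
        my ($sq, $e, $n) = ($1, $2, $3);
        next unless $square{$sq};                     # unknown square: skip
        printf "\\put(%.1f,%.1f){\\circle*{2}}\n",
               $square{$sq}[0] + $e / 10, $square{$sq}[1] + $n / 10;
    }
}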

Owing to differences of geographic projection between NOAA data and the British National Grid, there may be small errors in the placement of some points on the maps. Coastlines on a true National Grid basis can be obtained from the British Ordnance Survey, but at present I prefer to avoid their licensing procedures and possible charges.

(Postscript, June 2009: in about December 2007 I converted NOAA's (latitude, longitude) to National Grid by means of the Perl package Geography::NationalGrid, which has made a big improvement to the accuracy of the maps.)

TeX is of course the typesetting program devised by Donald Knuth. For an introduction try http://www.tug.org/, the site of the TeX User Group.

Printable versions

Just a few words on the most recent enhancement. Besides the Web-presentation of my data I need a printed version in a ring-binder, which I can carry about with me and refer to when working in a library or in the field. I began by using the printing facilities of my Web-browser, but rapidly felt the need for something that would rewrite the data in a more compact form. I now have another Perl program that rewrites the XML data as input to LaTeX, which gives a more compact layout than anything I've managed to achieve using a Web-browser; it has reduced the thickness of the file I sometimes carry about by about half.

(LaTeX is an application of TeX, invented by Leslie Lamport (LaTeX: A Document Preparation System, 2nd ed., Addison-Wesley 1996), which has been much extended by the later contributions of users.)
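That program works in the same substitution style as updatehtml, only with LaTeX markup on the right-hand side. A sketch of the idea (the particular LaTeX constructs here are illustrative, not necessarily what my program emits):

$_ =~ s/<collection>/\\begin{itemize}/;
$_ =~ s/<\/collection>/\\end{itemize}/;
$_ =~ s/<item>/\\item /;
$_ =~ s/<\/item>//;
# run the parts together in one paragraph, each with a bold label
$_ =~ s/<(name|site|publ|desc)> */\\textbf{$1} /g;
$_ =~ s/<\/(name|site|publ|desc)>/; /g;

and so on for the remaining parts. Running the parts together, instead of giving each its own list line as the HTML version does, is what buys the compactness on paper.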

One word about XSL, XSLT and all that: I feel somewhat guilty about not having used them, but I really haven't yet felt the need. I find that, as I'm reasonably fluent in Perl and my source XML has a very simple structure, I can more easily make an ad-hoc program in Perl to convert the XML to whatever I want. One incidental benefit is that I can even carry the comments in the XML over into the output file!

Future developments

At the moment my bibliography is kept as a simple HTML file which is hand-edited rather than created from a source file in some other notation. This has been satisfactory up till now, but the increasing size of the bibliography (it now holds over 600 citations) makes me think of improvements.

The ideal would be to rewrite the bibliography as a BibTeX database (an example entry follows the list below), from which I could generate

  1. a full listing typeset with LaTeX,
  2. an HTML version of the above, for presentation on the Web, and
  3. a list of references for any paper I may write (using LaTeX of course), in whatever style I (or the journal I was aiming at) wanted.
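As an illustration of the target notation, the Bidwell entry shown near the beginning of this article might become something like:

@book{Bidwell1979,
  author    = {Bidwell, P. T.},
  year      = {1979},
  title     = {The legionary Bath House and Basilica and Forum at Exeter},
  publisher = {Exeter City Council and University of Exeter},
  series    = {Exeter Archaeological Reports},
  number    = {1}
}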

The main thing that has made me defer this plan till now is that converting the bibliography from the present HTML form to BibTeX is not straightforward and cannot be done by a simple conversion program; the problem is analogous to converting a page-description (such as a wordprocessor document or a PostScript file) to a logically structured notation (such as LaTeX, or a relational database). Probably I should take this task in hand before the bibliography becomes any larger! (Postscript, June 2004: I have now done so.)