Tuesday, October 7, 2014

xEAC at DCMI 2014

I am heading to Austin, Texas this week to present xEAC, illustrating linked open data principles applied to archival authorities and collections. This presentation is part of a full-day pre-conference workshop at DCMI 2014 on the latest advances in digital archives, entitled "Fonds & Bonds: Archival Metadata, Tools, and Identity Management." Below is my presentation:

Friday, October 3, 2014

Semantic Web Updates to xEAC

After implementing better semantic web standards in other Orbeon-based projects I'm working on, I have applied these changes to xEAC. At present, xEAC supports export of RDF/XML in three different models: a default archival-based model, CIDOC-CRM, and one that conforms to the SNAP ontology. All three are proofs of concept and incomplete.

xEAC now supports the delivery of the xEAC default model in Turtle and JSON-LD, through both REST and content negotiation. URIs for record pages now accept the following content types through the Accept header: text/html, application/xml (EAC-CPF), application/rdf+xml (default model), application/json (JSON-LD), text/turtle, application/tei+xml, and application/vnd.google-earth.kml+xml (KML). Requesting an unsupported content type results in an HTTP 406 Not Acceptable error.

For example:

curl -H "Accept: application/json" http://numismatics.org/authority/elder

Furthermore, content negotiation has been implemented in the browse page. While Solr-based Atom results have been available through their own REST interface, you can now get them by requesting application/atom+xml. You can also get raw Solr XML back from application/xml. This might be useful to developers. I might implement the Solr JSON response, if there is interest (this would require a little more work).

Friday, August 29, 2014

xEAC pre-production release ready for wider testing

xEAC (https://github.com/ewg118/xEAC), an open source, XForms-based framework for the creation and publication of EAC-CPF records (for archival authorities or scholarly prosopographies), is now ready for another round of testing. While xEAC is still under development, it is essentially production-ready for small-to-medium collections of authority records (fewer than 100,000).

xEAC handles the majority of the elements in the EAC-CPF schema, with particular focus on enhancing controlled vocabulary with external linked open data systems and the semantic linking of relations between entities. The following LOD lookup mechanisms are supported:

  • Geography: Geonames, LCNAF, Getty TGN, Pleiades Gazetteer of Ancient Places
  • Occupations/Functions: Getty AAT
  • Misc. linking and data import: VIAF, DBpedia, nomisma.org, and SNAC

xEAC supports transformation of EAC-CPF into rudimentary forms of three different RDF models and posting data into an RDF triplestore by optionally connecting the system to a SPARQL endpoint. Additionally, EADitor (https://github.com/ewg118/eaditor), an open source framework for EAD finding aid creation and publication, can hook into a xEAC installation for controlled vocabulary as well as posting to a triplestore, making it possible to link archival authorities and content through LOD methodologies.

The recently released American Numismatic Society biographies (http://numismatics.org/authorities/) and the new version of the archives (http://numismatics.org/archives/) illustrate this architecture. For example, the authority record for Edward T. Newell (http://numismatics.org/authority/newell) contains a dynamically generated list of archival resources (from a SPARQL query). This method is more scalable and sustainable in the long run than using the EAC resourceRelation element. Now that SPARQL has successfully been implemented in xEAC, I will begin to integrate social network analysis interfaces into the application.
More information:

Extended Linked Data Controlled Vocabulary in xEAC and EADitor

Getty TGN

Last week, the Getty announced the latest installment of their linked open data vocabularies: the Thesaurus of Geographic Names. Like the previously released AAT, the TGN is available through a SPARQL endpoint. After returning from the Semantic Technology and Business conference in San Jose (which I have discussed in another blog post), I set out to integrate TGN lookups into the various cultural heritage data frameworks that I'm developing.

Both xEAC and EADitor have been extended to enable lookups of the Getty TGN through their editing interfaces. The functionality is identical to the occupation and function lookups in both systems:

  1. The user performs a text search for a term.
  2. The XForms engine submits a SPARQL query to the Getty endpoint.
  3. The user selects the appropriate item from a list generated from the SPARQL response.

See the example from xEAC, below:


The geographic lookup mechanism in xEAC also includes an option for geographic names in the Library of Congress Name Authority File.
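
For the Getty lookup, the query behind step 2 above takes the same shape as the AAT occupation and function queries shown in the posts below, with the TGN scheme swapped in. Roughly (Getty prefixes and the luc: full-text extension assumed, as elsewhere; the exact query in xEAC may differ slightly):

SELECT ?place ?label WHERE {
  ?place skos:inScheme tgn: ;
         gvp:prefLabelGVP/xl:literalForm ?label ;
         luc:term "SEARCH_QUERY*"
}
LIMIT 25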

 

SNAC Integration

In addition to extending the geographic lookup functionality in both EADitor and xEAC, I have also implemented a SNAC lookup in both applications. With the addition of two URL parameters, the search results page in SNAC can provide the raw cross query XML response instead of the default HTML. I hope that SNAC will eventually provide a documented search API that returns results in a more formal standard, like Atom.

In xEAC, the lookup will embed the SNAC URI into the otherRecordId and source in the EAC-CPF control. Nothing else is pulled from SNAC at the moment, either into the EAC record or into the public user interface, although this could change eventually.

In EADitor, the persname, corpname, and famname element components have been extended to include the SNAC lookup in addition to VIAF and xEAC (if a xEAC instance has been added into the EADitor settings). The SNAC URI is stored in the @authfilenumber of the associated EAD element.

SNAC URIs that are embedded into EAD finding aids (like the URIs from other linked open data vocabulary systems) will be included in the RDF serialization of the archival collection data. This may pave the way for users of EADitor to make their content accessible through SNAC, or whatever international archival entity system evolves from SNAC, by means of linked open data technologies.

Thursday, June 26, 2014

First Newell notebook published in Archer

The first of several dozen digitized Greek coin hoard research notebooks written by Edward Newell has been published in the newly relaunched ANS archival resource, Archer. This is the first publicly accessible product of a $7,500 grant from The Gladys Krieble Delmas Foundation to digitize these valuable resources. This notebook, written in 1939, includes notes on a handful of Greek coin hoards that were eventually published in An Inventory of Greek Coin Hoards (IGCH).

Specifications

We wanted to go beyond simply scanning page images and offering each notebook as an open access PDF. We wanted zoomable images, basic pagination functionality, and annotation of particular items on each page. We decided not to go as far as offering full transcriptions, but the annotations can include free text or links to other resources on the web: coins in the ANS collection, contemporary scholars mentioned in the text who have entries in VIAF or DBpedia, IGCH entries and other identifiers for mints, regions, or numismatic authorities defined on nomisma.org, books in the ANS library, and place names in Geonames. Furthermore, these terms should be indexed in Solr to populate the facet terms in Archer's browse interface, and RDF should be made available for linking resources together.

Technical Underpinnings

We chose TEI XML as the vessel for encoding data about the notebooks. The TEI files include bibliographic metadata in the header and a facsimile element for each page, which links to the page image and contains a surface element for each annotation, with xy coordinates and mixed content of free text and links. We have hooked into Rainer Simon's Annotorious library in both the public user interface and the XForms-based backend. Both interfaces are now part of EADitor's core functionality.

It should be stressed that the TEI editing functionality in EADitor is tailored toward annotation of facsimile images. The header is not yet editable, nor is the body, but this functionality may be enhanced in the future. Nevertheless, when you open the TEI form, there is a Facsimiles tab. In this tab are two columns. On the left is a list of thumbnails. Clicking on a thumbnail will load the large image into OpenLayers and make it annotatable through Annotorious.


The JSON generated by Annotorious is parsed by Orbeon (the XForms processor), transformed into TEI, and inserted into the XML document: not only the text of the annotations, but also the coordinate values, which are crosswalked into TEI attributes.
<facsimile xml:id="nnan0-187715_X007">
  <graphic url="0-187715_X007" n="Loose 2"/>
  <surface xml:id="aj6xve43lj5" ulx="-0.86506513142592" uly="1.3798999767388" lrx="-0.45771458478717" lry="1.2864639140252">
    <desc>
      <ref target="http://nomisma.org/id/sardes">Sardis</ref>
    </desc>
  </surface>
</facsimile>
URIs are parsed and made into TEI "ref" elements. The label of the element may be extracted from an API, depending on the content of the URI. For example, for a URI matching "http://numismatics.org/collection", ".xml" is appended and the title is pulled from the NUDS/XML serialization. Similarly, an AACR2-conformant place name label is generated by querying the Geonames API, and RDF is pulled from nomisma.org, DBpedia, or VIAF to extract the skos:prefLabel or rdfs:label (see the XPL and XSLT for more).


Once a new annotation is created and its URL parsed, the link becomes clickable. When you click on a different thumbnail in the left-hand column to load another image for annotation, the TEI file is saved to eXist, and the document is re-published to Solr, if it had previously been designated as public, and re-published to the RDF triplestore, if the TEI document is both public and EADitor's config contains endpoint URLs for updating and posting data.

Public User Interface

Since the EAC-CPF based authority records delivered through xEAC and Archer are both hooked into an RDF triplestore and SPARQL endpoint, the notebooks appear in the list of related resources in the Newell authority record, together with finding aids (encoded in EAD) in which Newell is a creator or correspondent, and photographs (encoded in MODS) in which he appears.
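
The list is generated by a query much like the resource relation example in the May post below; a minimal sketch, assuming dcterms:creator and dcterms:subject as the linking properties (the actual Archer model may use others):

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?uri ?title WHERE {
  { ?uri dcterms:creator <http://numismatics.org/authority/newell> }
  UNION
  { ?uri dcterms:subject <http://numismatics.org/authority/newell> }
  ?uri dcterms:title ?title
}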

The interface for the notebook itself is not markedly different from other record types in EADitor. In the right column is OpenLayers+Annotorious (in read-only mode) for showing images and annotations. The annotations are rendered by querying a REST service built into EADitor, which transforms the TEI facsimile element into Annotorious' JSON model based on a few request parameters. Images can be paged through by clicking on thumbnails, next/previous links, or a drop-down menu.


In the left column are bibliographic metadata and a list of unique terms that appear in the TEI document. The terms are clickable links that direct the user to the results page for that term's query. A user may click on the external link icon to load the target URL (for example, to view the VIAF or Geonames page). Lastly, under each term are the page numbers on which the term appears in the document. Clicking on one of these page number links loads the image and annotations in OpenLayers.

There are still improvements to make to the system, especially in making the TEI header editable, but I think this framework has great potential for lowering the barriers to creating, editing, and publishing TEI, especially in a large-scale linked open data system.

Monday, June 23, 2014

ANS' Archer v2 has gone live: EAD + EAC-CPF + SPARQL

I have spent much of the last two weeks deploying the newest versions of EADitor and xEAC into production for version 2 of Archer, the Archives of the American Numismatic Society. As mentioned in earlier blog posts, EADitor has been hooked up to xEAC for personal, corporate, and family name lookups. Furthermore, both applications will serialize data into RDF and post it into a triplestore to further connect archival content with authorities. EADitor supports the publication of MODS records in addition to EAD finding aids (although there is no editing form for MODS as of yet), and we will be publishing our first TEI files in the coming weeks (annotated facsimiles from the Greek numismatic research notebooks of Edward T. Newell).

Archives: http://numismatics.org/archives/
Authorities: http://numismatics.org/authorities/

Process

I first needed to parse out all of the personal and corporate names from the EAD finding aids in order to remove duplicate entities with slight variations in name form (because the earliest finding aids were created with an earlier version of EADitor that used an inconsistent autosuggest feature). The origination element contained only plain text, so these needed to be matched with the normalized personal or corporate names to insert corpname or persname elements. Furthermore, the MODS records needed to be reprocessed because all terms were categorized as subjects, regardless of whether they were indeed subject topics or genres, people, corporate bodies, etc.

The personal or corporate names that appeared in the EAD origination formed the basis for the new EAC-CPF collection. The EAD and MODS files were updated to point to the newly minted URIs for these entities, and matches to VIAF URIs were made for many of the other personal or corporate names that appear in the controlled access headings in the finding aids or in the subject terms in the MODS records.

After generating more than 100 EAC-CPF stubs (which included a biogHist extracted from the EAD), I went through the list of biographies of prominent members of the ANS on the website, filling in gaps as necessary. I added images into the EAC-CPF records, where applicable, designated as resource relations with an xlink:arcrole of foaf:depiction, which is defined as a semantic localTypeDeclaration in the EAC-CPF control element. The namespace for foaf: is defined in the declaration, making it possible to include a foaf:depiction in the RDF serialization (see http://numismatics.org/authority/anthon.rdf). I also inserted some CPF relations as necessary, either internally or externally to entities defined by VIAF URIs. Finally, I added a plethora of occupations (with dates and places, if known), most of which were looked up from the Getty AAT SPARQL endpoint through xEAC, as well as life events and associated places (looked up from the Geonames API in xEAC). I'm looking forward to leveraging the VIAF URIs stored in the EAC records in order to generate lists of monographs or journal articles by or about these entities, pulled from WorldCat's APIs.
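
Since these records are also pushed into the triplestore upon publication, the depictions can be retrieved with a simple query once the triples are loaded, for example:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?depiction WHERE {
  <http://numismatics.org/authority/anthon> foaf:depiction ?depiction
}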

Results

All in all, it is a fairly comprehensive and powerful research tool, even though there are fewer than 150 entities in the system. Because the authority records and the archival resources are pushed into the RDF triplestore upon publication, a researcher can get to archival content directly from an authority record, and a user viewing a finding aid can see an abstract dynamically extracted from the EAC-CPF record for the creator of the archival content.

I believe I inserted just enough places and chronological events to make the map and timeline an adequate demonstration of geotemporal visualization for most EAC-CPF records. Certainly, we could spend more time enhancing the biographical context of each record. We will begin this process, and the records will continue to evolve. The user interface will continue to evolve as well, as I aim to introduce social network graph visualizations based on SPARQL queries, and I will create an XQuery interface to accompany the SPARQL endpoint interface to facilitate a wider variety of complex queries.

Conclusions

There are no firm ontology or data model standards for linked open archival description, so the RDF model used in this interface should be considered to be a beta, and it will be adapted when stronger standards emerge from the archival community. The system is beyond a proof of concept at this phase, but it has not been tested for larger scale implementation. I believe that the system can accommodate hundreds of thousands (even into the low millions) of EAC-CPF records and many millions of triples, but I have yet to test at this scale. In any case, I believe that this more modularized/linked approach to archival collections represents the direction of digital archives, or more broadly, cultural heritage materials.

Look for a more official announcement in the near future, probably after we publish a few annotated TEI-based notebooks from the Newell project, which we received modest funding to digitize last year.

Friday, May 30, 2014

Enabling a SPARQL endpoint in EADitor and xEAC

Not long ago, I discussed the enhancement of both xEAC and EADitor by connecting them through a SPARQL endpoint. I have extended this further by enabling a wrapper for this endpoint in both xEAC and EADitor. It didn't take long to implement, as I basically copied and pasted some files and code from the nomisma.org Github repository.

Essentially, once SPARQL endpoint URLs have been entered into the config for xEAC or EADitor, a checkbox is made available to enable a wrapper for the endpoint. By wrapper, I mean a pipeline is created for the SPARQL query interface and the query response. This wrapper interacts with the SPARQL endpoint directly, but xEAC's/EADitor's SPARQL page carries the same style as the rest of the site, and if HTML results are selected, the SPARQL XML response is transformed through an XSLT stylesheet in xEAC/EADitor into HTML that also conforms to the inherent style of the application. It's essentially identical to the functionality on http://kerameikos.org/sparql: the page and query response are merely interfaces to Fuseki's endpoint.


Why is the SPARQL endpoint useful?

SPARQL is a complicated beast, and I do plan on writing documentation on the ontologies used and models implemented in both xEAC and EADitor. Most users will likely not use the SPARQL query endpoint directly. But the major point is: it exists for a small subset of users that want to perform really sophisticated queries on the dataset. In the same vein, xEAC will also eventually expose an XQuery interface for performing different types of complex queries on the dataset.

The real advantage: building UIs on SPARQL

As demonstrated in a variety of other projects that I work on, the best uses of SPARQL are the ones users don't even realize are there. On http://kerameikos.org/id/red_figure, the timeline/map, list of thumbnails, and chart (if generated) showing the distribution of a particular typology are all generated by SPARQL queries and rendered into something that is far more visually understandable to a human being. Likewise with a chart showing the change in weight of Roman imperial denarii from the start of the Roman Empire in 27 B.C. to about A.D. 220 (the latest point for which we currently have data): http://numismatics.org/ocre/visualize?measurement=weight&chartType=line&interval=5&fromDate=-30&toDate=220&sparqlQuery=nm%3Adenomination+%3Chttp%3A%2F%2Fnomisma.org%2Fid%2Fdenarius%3E#measurements .
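
To give a sense of the kind of query hiding behind that chart, an average weight for denarii can be computed with something like the following. This is only a sketch: the nmo:hasDenomination and nmo:hasWeight property names are taken from the Nomisma ontology and are an assumption here, not the actual OCRE query.

PREFIX nmo: <http://nomisma.org/ontology#>

SELECT (AVG(?weight) AS ?averageWeight) WHERE {
  ?coin nmo:hasDenomination <http://nomisma.org/id/denarius> ;
        nmo:hasWeight ?weight
}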

While there is still much work to do in refining the ontologies and models used for representing EAC-CPF or EAD records as RDF, now that the SPARQL publication mechanism actually functions, it will be possible to begin to build more sophisticated visualizations on top of these queries--to incorporate social network graph visualizations into xEAC, dynamically generated by SPARQL and easily manipulated and navigated by users.

The first step, however, is to be able to generate lists of related resources, such as a test finding aid linked to Edward T. Newell, below:


 
We plan to deploy xEAC and the current version of EADitor into production at the American Numismatic Society fairly soon for Archer 2.0. While the ANS' archives are fairly small, you can get an idea of the potential for a large collection of entity records, like SNAC, when you are able to link millions of entities to many tens of millions of related materials.

Similarly, in EADitor, when a finding aid has been linked to an entity in xEAC (and the @type of the persname, famname, or corpname has been set to xeac:entity [automatically done in the editing interface]), EADitor will extract biographical information directly from the EAC-CPF record:


Friday, May 16, 2014

Linking archival entities and resources with SPARQL

Linked open data methodologies have an important role to play in the future of archival description.

With this in mind, I am moving EADitor and xEAC further in this direction. Both frameworks already support serialization of EAD or EAC-CPF into RDF. xEAC supports three different RDF models, in fact, depending on which community the data are intended to serve. EADitor transforms EAD finding aids into the Arch ontology. There are no true standards yet for representing archival resources as RDF, but I am hopeful that some will emerge. Since EADitor is now capable of embedding xEAC URIs for archival entities into EAD finding aids, the next step in linking resources together is the implementation of an RDF triplestore and SPARQL endpoint into both xEAC and EADitor.


An Example: Resource Relations

EAC-CPF records two primary types of relations: CPF Relations, which link to other corporate, personal, or familial entities, and Resource Relations, which link to other resources by or about an entity that may be available on the web. It makes a lot of sense to me to store CPF relations within the EAC-CPF record, especially if these related entities are stored in the same information system (like xEAC). On the other hand, I don't think it makes sense to store resource relations within the EAC, mainly because it is far too complicated to maintain an ever-growing list of relationships by hand.

Let's suppose you have an entity, Thomas Jefferson, defined by a URI, http://example.org/thomas_jefferson. A significant portion of his collection is held at the University of Virginia and Monticello, but he corresponded with many of his contemporaries, so the papers of John Adams or George Washington may also contain letters from Jefferson. He also corresponded with numerous prominent Europeans, so some of his materials may be held in archives overseas. There may be dozens or hundreds of institutions that hold at least one item by or about Jefferson. If each of these archives adopts a stable URI that identifies Jefferson, then it is much easier to accept RDF derived from EAD, MODS, or a relational database into an RDF triplestore and use SPARQL to gather these materials dynamically when a researcher accesses the http://example.org/thomas_jefferson entity record.

This is the approach that I am implementing in xEAC and EADitor. For example, if the <origination> within the <did> of a finding aid (or individual component) contains an entity URI, the RDF derivative of the finding aid will link the archival resource to the archival entity through the dcterms:creator property. In EADitor, once the SPARQL endpoint URLs for querying, publishing, and updating data have been established, the RDF will be posted into the triplestore when the finding aid has been designated for publication on the web. Likewise, if these endpoint URLs have been added into the xEAC configuration file, the XSLT template for generating HTML from EAC-CPF will query the SPARQL endpoint to list related resources that have been pushed into the triplestore from EADitor. Furthermore, EADitor itself isn't necessary for this functionality in xEAC. RDF may be pushed into the triplestore by other means--from an institutional repository, from ArchivesSpace, or something else. You could even feed data from Europeana into a triplestore to build a prosopography of Impressionist artists. The sky is the limit.




The SPARQL query looks something like this:

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?uri ?title WHERE {
  ?uri dcterms:creator <URI> ;
       dcterms:title ?title
}

There is still work to be done in the UI, but the underlying technological functionality is now available in the Github repository for both applications.

Technical mumbo jumbo

The functionality for linking EADitor and xEAC to SPARQL endpoints is identical in both applications. The URLs are added through the Settings page. Under the SPARQL heading, the user clicks the "Connect" button, which launches a popup window requiring the user to input three separate URLs: the query URL, the Graph Store URL, and the SPARQL/Update URL. These URLs may vary from application to application, especially if the configurations have been changed. Note that a SPARQL 1.1-compliant endpoint is required.


After entering these URLs and clicking "Connect," the XForms engine will test each one individually. First, it will attempt a basic query. Then, it will post RDF into the Graph Store URL and, if successful, execute an XForms submission to delete the graph through SPARQL/Update. If all three processes complete successfully, the URLs are added into the EADitor/xEAC config, and the config can then be saved. The user is also presented with the option to post all records that have been slated for publication (in Solr) into the RDF triplestore.
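
Conceptually, the connection test boils down to requests along these lines (a sketch; the graph URI is only a placeholder, and the actual submissions may differ):

# 1. a basic query against the query URL
SELECT * WHERE { ?s ?p ?o } LIMIT 1

# 2. after posting a small test graph to the Graph Store URL,
#    remove it again through the SPARQL/Update URL
DROP SILENT GRAPH <http://example.org/test-graph>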

From the admin page, when a user deletes a record from the eXist database or removes the record from publication, the triples are purged from the triplestore just as the documents are deleted from Solr. When a user publishes a record, the record is serialized into RDF and posted into the triplestore in addition to the usual Solr publication. Likewise, triples are updated when the user saves a published record from the EAD/EAC-CPF editing pages.

Ergo, enterprise archival linked open data publishing.

There is still work to do here: the ontology and RDF data models still need work, which is more of a community effort. And of course, I have a lot of plans for enhancing the user experience.

Once this new publication model is fully functional, I will begin SPARQL-based visualizations of social networks and relations between entities and their archival resources.

Incorporating xEAC entities into EAD finding aids

xEAC and EADitor were both conceived as standalone applications. This is what separates xEAC, especially, from the authority management modules that come packaged in larger archival suites like ICA-AtoM and ArchivesSpace. xEAC is applicable to LAM authority control, the primary audience for EAC-CPF, but it can also be applied to scholarly prosopographies (and will eventually support social network analysis built upon linked open data methodologies).

It is now possible to hook xEAC and EADitor together through an intermediate, optional, RDF triplestore and SPARQL endpoint. This will be discussed in greater detail in a later blog post. This particular post will detail the more immediate connection between entities defined in xEAC and personal, family, and corporate names within EAD finding aids.

Since its inception, xEAC has provided a Solr-based Atom feed for published EAC-CPF records. The Atom feed returns results based on the Lucene query syntax. A number of fields are available to narrow the search. For example, the entityType_facet Solr field allows a user to search for a name of a particular entity type, which is defined in the EAC-CPF schema as being either "person," "family," or "corporateBody." See http://admin.numismatics.org/xeac/feed/?q=augustus%20AND%20entityType_facet:person for example. The results are machine readable, and therefore the EADitor XForms application can read and process the search results. The interface for persname, corpname, and famname in EADitor has been adjusted to include xEAC lookups, and the functionality is practically identical to the VIAF lookup mechanism (which returns results in RSS as opposed to Atom).

Integrating a xEAC lookup mechanism into EADitor was incredibly easy as a result, and I managed to implement it in about 30 minutes. The Settings page for EADitor now includes an input for the xEAC home page URL. An XForms submission will process this URL and append 'feed/' to it to ascertain whether Atom XML is available at that resource. If so, the xEAC URL will be committed into the EADitor configuration file. If the xEAC URL is in the config, the persname, corpname, and famname element XBL components within the XForms application will include a radio button to select the xEAC lookup, in addition to VIAF and local vocabulary. When you select an entity after performing the lookup, the entity's URI will be embedded in the @authfilenumber attribute. In EAD3, this attribute will be @vocabularysource (I think), following linked data advancements in EAC-CPF.


When linking the EAD finding aid to an entity defined in xEAC, the @type attribute of the persname, corpname, or famname will be set to 'xeac:entity.' Ideally, I would like to avoid system-defined attributes, but I think they are very useful in this case, as the value indicates to the EADitor UI XSLT stylesheets that the EAC-CPF XML can be retrieved programmatically by appending '.xml' to the entity URI, and therefore biographical information may be extracted directly from the EAC-CPF entity record. This, I believe, is the dream of the creators of EAC-CPF: authority information and archival/biographical context are stored separately from the finding aid, yet the information is made available through the finding aid user interface by means of linked open data methodologies.

I have not yet built these hooks into EADitor's finding aid user interface, but expect them to be available when the new version of EADitor is released later this summer. This feature represents a major advancement in the publication of archival materials.

But wait, there's more.

Friday, April 4, 2014

xEAC in London: SNAP 2014

I was recently in London for the Standards for Networking Ancient Prosopographies (SNAP) meeting, hosted by King's College London, March 31 - April 1, to discuss ANS participation in the project. Most Roman imperial people related to coins already have URIs minted on nomisma.org, and we are about to create nomisma IDs for Roman Republican moneyers. Additionally, we will be creating a large number of Greek authorities in the coming months. The as-of-now institutionally unaffiliated kerameikos.org is another thesaurus that will contain numerous URIs for Greek potters and painters, which may supplement SNAP with biographical information or objects of cultural heritage. Lastly, I discussed the Roman Imperial Social Network (RISN) project, for which we have requested grant funding for further development. This is a prosopography of the Roman Empire built on EAC-CPF from existing open resources on the web (such as Tom Elliott's PIR RDF and DBpedia) and embellished in xEAC to include important life events, occupations, and places (linked to Pleiades).

Meeting Overview

The meeting was a great experience, and I always find these sorts of gatherings (like LAWDI) useful for learning what other people are working on. There was, of course, some overlap with previous LAWDIs, in terms of participants and projects. The Syriac Reference Portal once again made an appearance, as did Trismegistos. Maggie Robb is working on a prosopography of the Roman Republic, so there's a possibility for collaboration between her project and nomisma.org: we are going to create nomisma IDs for Republican moneyers in the coming weeks and will certainly incorporate her URIs into our system. Day 1 included presentations and discussions by SNAP organizers as well as a selected group of participants who either have potential datasets to include in SNAP or are using tools and methodologies relevant to prosopographies. My presentation contained a bit of both: tools and methodologies as well as datasets. On Day 2, the meeting participants split into smaller groups for focused break-out sessions on various topics.

Thursday, March 27, 2014

Incorporating RDF relationship ontologies into xEAC

Several months ago, just after presenting the latest developments in xEAC at MARAC, I wrote about the application's enhanced relationship maintenance capabilities. That system required manual entry of relationships. One of the questions I received at MARAC was, basically, will xEAC be able to harvest from existing ontologies? Now, the answer is "yes."

While this is still very much a prototype (because there may be numerous ways of constructing a relationship ontology in RDF), I have successfully implemented an RDF/XML upload mechanism. The xEAC relationship maintenance section will parse the RDF/XML provided at http://vocab.org/relationship/. The XForms processor will read the relationship properties in the file, creating symmetrical or inverse relationships when applicable. It allows you to select the prefix you would like to use for the ontology and will create the localTypeDeclaration that contains the abbreviation (the prefix) and citation (URI) if it does not already exist in the config.

Therefore, it will take some RDF that looks like this:

<rdf:Description rdf:about="http://purl.org/vocab/relationship/grandchildOf">
 <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
 <owl:equivalentClass rdf:resource="http://www.perceive.net/schemas/relationship/grandchildOf"/>
 <owl:inverseOf rdf:resource="http://purl.org/vocab/relationship/grandparentOf"/>
 <rdfs:subPropertyOf rdf:resource="http://xmlns.com/foaf/0.1/knows"/>
 <rdfs:subPropertyOf rdf:resource="http://www.w3.org/2002/07/owl#differentFrom"/>
 <rdfs:label xml:lang="en">Grandchild Of</rdfs:label>
 <rdfs:label>Grandchild Of</rdfs:label>
 <skos:definition xml:lang="en">A person who is a child of any of this person's children.</skos:definition>
 <rdfs:domain rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
 <rdfs:range rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
 <rdfs:isDefinedBy rdf:resource="http://purl.org/vocab/relationship/"/>
 <skos:historyNote rdf:nodeID="mor53341b13853b8"/>
</rdf:Description>

And turn it into this:


Once you've saved xEAC's settings, these relationships will be available through the @xlink:arcrole in CPF Relations in the EAC-CPF form.


Of course, after you establish a relationship between your source and target person, family, or corporate body record, the target EAC-CPF record will be updated with the symmetrical/inverse relationship which points back to the source. These relationships will be expressed in RDF output generated by xEAC.

Wednesday, March 26, 2014

Serializing EAC-CPF into CIDOC CRM

xEAC supports a fairly rudimentary RDF/XML output by appending '.rdf' onto a URI for an entity. There is an RDF ontology based on EAC-CPF, but I am not sure it has seen wide usage (it will eventually be implemented in xEAC, regardless). The RDF model employed in xEAC out of the box is little more than a proof of concept, a placeholder until a more standard model emerges from the archival community. It is loosely based on Aaron Rubinstein's arch ontology and contains little more than labels for name entries, relations (from CPF relations that contain an RDF predicate in the @xlink:arcrole), and a dcterms:abstract derived from the EAC-CPF abstract element.

There has been some use of CIDOC CRM to model people. Much of this work has been done by Michele Pasin and John Bradley at King's College London (see their paper). I am heading to London next week for the first meeting of the Standards for Networking Ancient Prosopographies project, and I suspect I will hear much more about their work in this regard there. In order to reach the broadest audience, I am making EAC-CPF data available through xEAC, serialized into CIDOC CRM. This is no easy task, but I have gotten the ball rolling a little bit. I will make more progress once I learn more about the model at the SNAP meeting.

The great advantage of the CRM is that, since it is very generalizable, it can be used to model anything. This is a double-edged sword, however: it is so generalizable that a complicated model is sometimes necessary to communicate a relatively simple concept.

Exist Dates

In EAC-CPF, the date range of existence occupies about four lines of XML, and the @standardDate, @notBefore, and @notAfter attributes communicate ISO standard dates and some semantic certainty (or uncertainty). These can be modeled in CRM, but in a more complicated fashion. First, a person (or family or organization) P92i_was_brought_into_existence_by an E63_Beginning_of_Existence event, which P4_has_time-span an E52_Time-Span carrying a human-readable rdfs:label and the machine-readable P82a_begin_of_the_begin and P81a_end_of_the_begin: @notBefore maps to P82a_begin_of_the_begin and @notAfter to P81a_end_of_the_begin. That covers an eac:fromDate; an eac:toDate follows the same pattern with an E64_End_of_Existence event and the begin_of_the_end and end_of_the_end properties. The beginning and end of existence events can have a place as well, but there are some difficulties in translating birth and death places from EAC-CPF into CRM in this regard.
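
To make that concrete, the begin-of-existence portion of the model can be queried with something like the following sketch (the entity URI is a placeholder, and the CRM namespace and property identifiers in the actual xEAC output may differ):

PREFIX crm:  <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label ?notBefore ?notAfter WHERE {
  <http://example.org/id/some_entity> crm:P92i_was_brought_into_existence_by ?beginning .
  ?beginning crm:P4_has_time-span ?span .
  ?span rdfs:label ?label ;
        crm:P82a_begin_of_the_begin ?notBefore ;
        crm:P81a_end_of_the_begin ?notAfter
}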

First, the semantics of an exist date are a bit fuzzy. By definition, the existDates are "The dates of existence of the entity being described, such as dates of establishment and dissolution for corporate bodies and dates of birth and death or floruit for persons." The only way to distinguish between the birth and death dates of a person and a floruit is the localType attribute, and the values of @localType may vary from project to project. Therefore, if the entity being described is a person and the existDates are his or her birth and death, then I should be using properties related to E67_Birth rather than the more generic E63_Beginning_of_Existence (of which E67_Birth is a subclass); without that information, however, I must opt for the more generic class. The same goes for organizations. Ultimately, the solution to this problem is to implement in the xEAC editing interface a checkbox for inserting a @localType designating whether the existDates represent the life or the floruit of the person or organization (e.g., @localType='xeac:birth' or 'xeac:death'), and to do the same for the place of birth or death. That way the XSLT stylesheets can read the system-defined localType attribute, construct the CIDOC CRM model accordingly, and allow for variation between the exist dates for persons and corporate bodies.

This is something I will continue to wrestle with over the coming weeks, but eventually I hope to have a fully compatible crosswalk between EAC-CPF and both CIDOC CRM and TEI. CRM includes properties for relating children with parents, but arguably these types of relationships should be maintained in a separate ontology built specifically for relations. In fact, there could be many relationship ontologies, depending on the needs of the project. This, I am sure, will be a topic of discussion at SNAP.

Resources

Tuesday, March 25, 2014

Exporting EAC-CPF to TEI

As indicated in the TO-DO list in the recent xEAC beta announcement, and as part of the design specifications for our IMLS grant application for the further development of xEAC (and the creation of a prosopographical dataset of the Roman Empire), I have implemented a basic EAC-CPF-to-TEI transformation. It isn't yet a complete crosswalk, but it handles the following:

  • name entries
  • biographical description
  • the generation of a chronological list of events (including descriptions with normalized dates and places that link to either Pleiades or Geonames)
  • a list of relations, implementing semantic relationships defined in the @xlink:arcrole of the cpfRelation element.

There is now a link on the HTML page for an entity record to the TEI export. One can access this alternate model by appending '.tei' to the URI, e.g., http://admin.numismatics.org/xeac/id/id/alexander_the_great.tei.

The XSLT stylesheet is available on GitHub at https://github.com/ewg118/xEAC/blob/master/ui/xslt/serializations/eac/tei.xsl.

The model is based upon the TEI Prosopography documentation and some examples in the Lexicon of Greek Personal Names.

Monday, March 24, 2014

Further AAT Integration into EAC-CPF and EAD

xEAC

Like occupations, the function element in EAC-CPF has also been hooked into the Getty AAT via XForms-based SPARQL queries. The top concept for the function facet in the AAT is http://vocab.getty.edu/aat/300054593. According to Kathleen Roe on the EAD listserv:

The Getty functions vocabulary was built from an analysis of government functions in the U.S. as part of the Government Records Description project (early 1990s) undertaken by 12 or so state archives, including Utah (Jeff Johnson was the state archivist at the time).

 The query appears as follows:

SELECT ?c ?label {
?c a gvp:Concept; skos:inScheme aat: ;
gvp:broaderTransitive aat:300054593 ;
gvp:prefLabelGVP/xl:literalForm ?label ;
luc:term "SEARCH_QUERY*"
} LIMIT 25

 

EADitor

I extended these functionalities to EADitor, with the code practically copied and pasted from the genreform XBL into the function and occupation components (just replacing genreform with occupation/function and updating the SPARQL query). EADitor now supports lookup mechanisms for the following controlled access terms:
  • geogname: Geonames (modern)/Pleiades (ancient)
  • genreform: AAT/LCGFT
  • function: AAT
  • occupation: AAT
  • subject: LCSH
  • persname: VIAF
  • corpname: VIAF



EADitor will eventually hook into SNAC (or whatever evolves from it) for persname, corpname, and famname, and I will extend it to hook into xEAC for linking finding aids with EAC-CPF records. xEAC already has a REST query mechanism that returns results in the form of Atom XML, so this will be pretty easy.

Thursday, March 6, 2014

xEAC beta 2014a ready for testing

I have finally gotten xEAC to a stage where I feel it is ready for wider testing (and I have updated the installation documentation). This has been a few months coming, since I had intended to release the beta shortly after MARAC in November. The xEAC documentation can be found here: http://wiki.numismatics.org/xeac:xeac

Features

  • Create, edit, publish EAC-CPF documents. Most, but not all, EAC-CPF elements are supported.
  • Public user interface migrated to bootstrap 3 to support mobile devices.
  • Maps and timelines for visualization of life events.
  • Basic faceted search and Solr-based Atom feed in the UI.
  • Export in EAC-CPF, KML, and rudimentary RDF/XML. HTML5+RDFa available in entity record pages.
  • Manage semantic relationships between identities (http://eaditor.blogspot.com/2013/11/maintaining-relationships-in-eac-cpf.html). Target records are automatically updated with symmetrical or inverse relationships, where relevant, and relationships are expressed in the RDF output. TODO: parse relationship ontologies defined in RDF (e.g., http://vocab.org/relationship/.rdf) for use in xEAC.

REST interactions


The XForms engine interacts with the following web services to import name authorities, biographical, or geographic information:

When the OCLC linked data service supports queries by VIAF URI, I will create a lookup widget to provide lists of related bibliographic resources.

TODO list

I aim to improve xEAC over the following months and incorporate the following:
  • Finish form: Represent all EAC-CPF elements and attributes
  • Test for scalability
  • Interface with more APIs in the editing interface
  • Employ SPARQL endpoint for more sophisticated querying and visualization, automatically publish to SPARQL on EAC-CPF record save.
  • Improve public interface, especially searching and browsing
  • Incorporate social network graph visualization (see SPARQL, above)
  • Follow evolving best practices in RDF, support export in TEI for prosopographies (http://wiki.tei-c.org/index.php/Prosopography) and CIDOC-CRM.
  • Interact with SNAC or international entity databases which evolve from it.

Wednesday, March 5, 2014

Linking EAC-CPF Occupations to the Getty AAT

The occupation element in xEAC now supports a SPARQL-based lookup mechanism to link EAC-CPF records to terms defined in the newly-released linked open data Getty AAT.

I won't go into great detail about how this works in the back end, because it is basically identical to the process by which I hooked EADitor into the AAT with EAD genreform elements, which I covered in a blog post last month.

One thing to note, however, is that the xEAC occupation lookup filters for terms that contain "Agents Facet" in the gvp:parentStringAbbrev property. There are different categories of terms--object types, agents, stylistic periods, etc.--that are not semantically distinguished, but at least contain a string in a generic field which allows filtering. I hope that the Getty will move forward with a more formal representation of these facets to improve querying efficiency.

Therefore queries for occupations look something like this:

SELECT ?c ?label WHERE {
?c rdf:type gvp:Concept .
?c skos:inScheme aat: .
?c skos:prefLabel ?label .
?c luc:term "president" .
?c gvp:parentStringAbbrev ?facet 
FILTER regex(?facet, "Agents Facet") 
FILTER langMatches(lang(?label), "en")}
ORDER BY ASC(?label)
LIMIT 25

I plan to apply similar filters to the LOD thesaurus editor for kerameikos.org in order to provide a more accurate list of style periods, pottery techniques, wares, and shapes when linking kerameikos URIs to Getty AAT identifiers. For example, "Black Figure" is defined by the Getty as both a technique and a style or period, so "Black Figure" on kerameikos, defined by http://kerameikos.org/ontology#Technique, should point via owl:sameAs to the Getty's technique term rather than the style or period.
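
A technique-restricted lookup would mirror the occupation query above, filtering on a different facet string; a sketch, assuming the technique terms sit beneath the AAT's Activities Facet (the exact string to filter on would need to be verified):

SELECT ?c ?label WHERE {
?c rdf:type gvp:Concept .
?c skos:inScheme aat: .
?c skos:prefLabel ?label .
?c luc:term "black figure" .
?c gvp:parentStringAbbrev ?facet
FILTER regex(?facet, "Activities Facet")
FILTER langMatches(lang(?label), "en")}
ORDER BY ASC(?label)
LIMIT 25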

xEAC: Current and Future Work this Month

I am in the process of migrating various projects to Bootstrap 3, which greatly improves mobile support. Numishare's master branch has been migrated to Bootstrap from jQuery UI (with the exception of multiselect, which is on the agenda). I recently completed the migration of xEAC to Bootstrap (including multiselects on the browse page), and EADitor will be next. Now that I have successfully implemented Bootstrap Multiselect, I will be able to apply these changes back to Numishare. Frankly, the AJAX lookup mechanism for dynamic Solr facet terms is much simpler in Bootstrap Multiselect compared to the older jQuery UI one I had been using for three years--far less javascript required on my end.

While I was at it, and since I'm having Orbeon (the engine powering both the front end user interface and the back end editing in both xEAC and EADitor) output pages in HTML5, I went ahead and applied fairly basic RDFa to EAC-CPF record pages so that machine readable data can be extracted by using the W3C distiller.

I will be traveling to London at the end of this month to participate in the Standards for Networking Ancient Prosopographies meeting to discuss EAC-CPF and xEAC to some degree. The meeting consists mainly of digital humanists who have a lot of experience with TEI and CIDOC-CRM, but may be completely unaware of the emergence of EAC-CPF as a LAM standard for modeling entities and their relationships. Since we at the American Numismatic Society are moving forward with our own prosopography of the Roman Empire (which will tie into other projects, such as Online Coins of the Roman Empire and nomisma.org), we aim to contribute our entity URIs into SNAP, which will facilitate larger-scale aggregation of cultural heritage materials related to ancient people. In order to broaden access and use of our data, we will not only provide the source EAC-CPF XML documents, but also alternative serializations in various forms of RDF (like CIDOC-CRM) and TEI conforming to the prosopography recommendations. By the end of the month, I plan to have some basic CIDOC-CRM and TEI exports functional, as well as possibly hooking xEAC up to an RDF triplestore/SPARQL endpoint as a proof of concept of publishing EAC-CPF as linked open data right out of the box.

Wednesday, February 5, 2014

Integrating EADitor with the Getty linked data AAT

I've been following linked open data developments at the Getty pretty closely over the last few months, especially related to incorporating Getty AAT URIs (and eventually ids from other vocabulary systems) into Nomisma.org and my side-project Kerameikos.org, a LOD thesaurus geared specifically toward Greek pottery.

For some reason, it occurred to me only yesterday that I should adapt EADitor to incorporate Getty AAT identifiers into EAD finding aids. After all, XForms applications communicate nicely with other REST services (such as SPARQL), and I've already done SPARQL query work in XForms with Nomisma's backend. I spent about a half hour this afternoon improving the genreform functionality in EADitor to make the AAT (as opposed to the Library of Congress Genre/Format Terms) the default lookup mechanism.

Here's how it works:

User Interface



  1. Add a genreform element into your controlled access headings in your EAD finding aid.
  2. Click the Getty AAT radio button (selected by default) to activate the query interface.
  3. Type a term and click the search button.
  4. A list of results (limited to 25, filtered by English labels, and arranged alphabetically) will appear in the select list.  After clicking an option, click the "Select" button to set the text of the genreform node to the skos:prefLabel from the Getty SPARQL results and to set the @authfilenumber attribute of the genreform element to the Getty id.

Under the Hood

Clicking on the search button does two things: First it replaces 'SEARCH_QUERY' in the SPARQL query, below, with search text in the XForms input. Then it sends an XForms submission with the following action: http://vocab.getty.edu/sparql?query={encode-for-uri(instance('sparqlQuery'))}&format=xml.

SELECT ?c ?label WHERE {
?c rdf:type gvp:Concept .
?c skos:prefLabel ?label
FILTER langMatches(lang(?label), "en") .
FILTER regex(?label, "SEARCH_QUERY", "i") .
}
ORDER BY ASC(?label)
LIMIT 25 

Assume that the query above includes the necessary SKOS and GVP prefixes. The options in the select box in the user interface are supplied by the SPARQL XML results.  You can see the code here.

What's it do?

Beyond the AAT's value as an excellent controlled vocabulary and universally recognized system of identifiers, incorporating Getty AAT ids into finding aids created with EADitor opens the door to the useful aggregation of content in other large systems.


EADitor's Flickr integration enables the injection of Getty-based machine tags into photo metadata. AAT URIs are treated as dcterms:format in RDF serializations. While the Digital Public Library of America doesn't yet make use of linked open data identifiers, it is on their agenda. Therefore, finding aids that incorporate AAT identifiers, in addition to VIAF, Geonames, and LCSH ids, will be among the most useful to researchers, since these are the most easily categorized and filtered in a large information system such as DPLA.

Improving Date Functionality

The default EAD templates in EADitor have been updated to require the @normal attribute for the encoding of dates, and I have finally gotten around to improving the interface for entering standard ISO-compliant dates (and automatically generating the human-readable text). This will ultimately improve the finding aids created in EADitor by making contents sortable by creation date.




When inserting a date or unit date anywhere in the document, the user may select the Date or Date Range radio button to display the associated data inputs. These values (and the machine-generated human-readable text) are not inserted into the finding aid until they are valid. Therefore, the Date, From Date, and To Date must conform to the xs:date (yyyy-mm-dd), xs:gYearMonth (yyyy-mm), or xs:gYear (yyyy) formats. Furthermore, the To Date must be later than the From Date. This is a small step, but it should have a great impact on the usefulness of the data with respect to querying.