Tuesday, October 7, 2014

xEAC at DCMI 2014

I am heading to Austin, Texas this week to discuss xEAC, with an illustration of linked open data principles applied to archival authorities and collections. This presentation is part of a full day pre-conference workshop at DCMI 2014 detailing the latest advances in digital archives entitled "Fonds & Bonds: Archival Metadata, Tools, and Identity Management." Below is my presentation:

Friday, October 3, 2014

Semantic Web Updates to xEAC

After having implemented better semantic web standards in other projects I'm working on that use Orbeon for the front-end, I have applied these changes to xEAC. At present, xEAC supports export of RDF/XML in three different models: a default archival-based model, CIDOC-CRM, and one that conforms to the SNAP ontology. All three are incomplete proofs of concept.

xEAC now supports the delivery of the xEAC default model in Turtle and JSON-LD, through both REST and content negotiation. URIs for record pages now accept the following content types through the Accept header: text/html, application/xml (EAC-CPF), application/rdf+xml (default model), application/json (JSON-LD), text/turtle, application/tei+xml, and application/vnd.google-earth.kml+xml (KML). Requesting an unsupported content type results in an HTTP 406 Not Acceptable error.

For example:

curl -H "Accept: application/json" http://numismatics.org/authority/elder
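
To make the negotiation rules above a little more concrete, here is a minimal Python sketch of how a server might choose a serialization from an Accept header. It is only an illustration of the behavior described in this post, not the code xEAC runs (xEAC handles this inside Orbeon). The function name `negotiate` and the labels in the table are my own, and wildcards such as `*/*` are deliberately ignored to keep it short.

```python
# Media types listed in this post, mapped to the serialization each returns.
SUPPORTED = {
    "text/html": "HTML",
    "application/xml": "EAC-CPF",
    "application/rdf+xml": "RDF/XML (default model)",
    "application/json": "JSON-LD",
    "text/turtle": "Turtle",
    "application/tei+xml": "TEI",
    "application/vnd.google-earth.kml+xml": "KML",
}


def negotiate(accept_header):
    """Return (HTTP status, serialization label) for an Accept header value."""
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, media_type))
    # Highest q first; ties keep the order in which the client listed them.
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means the client does not accept this type
        if media_type in SUPPORTED:
            return 200, SUPPORTED[media_type]
    return 406, None  # nothing acceptable: HTTP 406 Not Acceptable
```

Calling `negotiate("application/json")` mirrors the curl request above and selects JSON-LD, while an unsupported type such as `image/png` falls through to the 406 case.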

Furthermore, content negotiation has been implemented in the browse page. While Solr-based Atom results have been available through their own REST interface, you can now also get them by requesting application/atom+xml. Requesting application/xml returns the raw Solr XML, which might be useful to developers. I might implement the Solr JSON response if there is interest (this would require a little more work).

Friday, August 29, 2014

xEAC pre-production release ready for wider testing

xEAC (https://github.com/ewg118/xEAC), an open source, XForms-based framework for the creation and publication of EAC-CPF records (for archival authorities or scholarly prosopographies), is now ready for another round of testing. While xEAC is still under development, it is essentially production-ready for small-to-medium collections of authority records (fewer than 100,000).

xEAC handles the majority of the elements in the EAC-CPF schema, with particular focus on enhancing controlled vocabulary with external linked open data systems and the semantic linking of relations between entities. The following LOD lookup mechanisms are supported:

  • Geography: Geonames, LCNAF, Getty TGN, Pleiades Gazetteer of Ancient Places
  • Occupations/Functions: Getty AAT
  • Misc. linking and data import: VIAF, DBpedia, nomisma.org, and SNAC

xEAC supports transformation of EAC-CPF into a rudimentary form of three different RDF models and posting data into an RDF triplestore by optionally connecting the system to a SPARQL endpoint. Additionally, EADitor (https://github.com/ewg118/eaditor), an open source framework for EAD finding aid creation and publication, can hook into a xEAC installation for controlled vocabulary as well as posting to a triplestore, making it possible to link archival authorities and content through LOD methodologies.

The recently released American Numismatic Society biographies (http://numismatics.org/authorities/) and the new version of the archives (http://numismatics.org/archives/) illustrate this architecture. For example, the authority record for Edward T. Newell (http://numismatics.org/authority/newell) contains a list of archival resources generated dynamically from a SPARQL query. This method is more scalable and sustainable in the long run than using the EAC resourceRelation element. Now that SPARQL has successfully been implemented in xEAC, I will begin to integrate social network analysis interfaces into the application.

More information:

Extended Linked Data Controlled Vocabulary in xEAC and EADitor

Getty TGN

Last week, the Getty announced the latest installation of their linked open data vocabularies: the Thesaurus of Geographic Names. Like the previously released AAT, the TGN is available through a SPARQL endpoint. After returning from the Semantic Technology and Business conference in San Jose (which I have discussed in another blog post), I set out to integrate TGN lookups into the various cultural heritage data frameworks that I'm developing.

Both xEAC and EADitor have been extended to enable lookups of the Getty TGN through their editing interfaces. The functionality is identical to the occupation and function lookups in both systems: (1) the user performs a text search for a term, (2) the XForms engine submits a SPARQL query to the Getty endpoint, and (3) the user selects the appropriate item from a list generated from the SPARQL response. See the example from xEAC, below:


The geographic lookup mechanism in xEAC also includes an option for geographic names in the Library of Congress Name Authority File.
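
For readers curious about the second step of the lookup described above, here is a rough Python sketch of building a SPARQL request for a search term. Everything specific in it is an assumption on my part: the endpoint URL, the query shape, and the predicates (skos:inScheme, skos:prefLabel) are illustrative and are not copied from the query the XForms engine actually submits. The `query` parameter itself is the one defined by the standard SPARQL protocol.

```python
from urllib.parse import urlencode

# Assumption: the public Getty SPARQL endpoint (not stated in this post).
GETTY_ENDPOINT = "http://vocab.getty.edu/sparql"

# Illustrative query only: find TGN concepts whose label contains the term.
QUERY_TEMPLATE = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?place ?label WHERE {{
  ?place skos:inScheme <http://vocab.getty.edu/tgn/> ;
         skos:prefLabel ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), "{term}"))
}} LIMIT 25"""


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_lookup_url(term, endpoint=GETTY_ENDPOINT):
    """Return a GET URL that submits the lookup query for a search term."""
    # Collapse whitespace and lowercase so the FILTER comparison lines up.
    cleaned = " ".join(term.split()).lower()
    query = QUERY_TEMPLATE.format(term=escape_literal(cleaned))
    return endpoint + "?" + urlencode({"query": query})
```

The third step would then parse the SPARQL response and present the place URIs and labels as a pick list.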

 

SNAC Integration

In addition to extending the geographic lookup functionality in both EADitor and xEAC, I have also implemented a SNAC lookup in both applications. With the addition of two URL parameters, the search results page in SNAC can provide the raw cross query XML response instead of the default HTML. I hope that SNAC will eventually provide a documented search API that returns results in a more formal standard, like Atom.

In xEAC, the lookup will embed the SNAC URI into the otherRecordId and source in the EAC-CPF control. Nothing else is pulled from SNAC at the moment, either into the EAC record or into the public user interface, although this could change eventually.

In EADitor, the persname, corpname, and famname element components have been extended to include the SNAC lookup in addition to VIAF and xEAC (if a xEAC instance has been added into the EADitor settings). The SNAC URI is stored in the @authfilenumber of the associated EAD element.
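
As a small illustration of where the URI ends up, the following Python sketch sets @authfilenumber on an EAD name element. It is not EADitor's code (the real work happens in the XForms editing interface). The URI in the usage note is a made-up placeholder rather than a real SNAC identifier, and the sketch ignores XML namespaces for brevity.

```python
import xml.etree.ElementTree as ET

# The EAD name elements that carry the lookup, as listed above.
NAME_ELEMENTS = ("persname", "corpname", "famname")


def link_name(element, uri):
    """Store an authority URI in the @authfilenumber of an EAD name element."""
    if element.tag not in NAME_ELEMENTS:
        raise ValueError("not an EAD name element: %s" % element.tag)
    element.set("authfilenumber", uri)
    return element


def serialize(element):
    """Return the element as a string, for inspection."""
    return ET.tostring(element, encoding="unicode")
```

For example, `serialize(link_name(ET.fromstring("<persname>Newell, Edward T.</persname>"), "http://example.org/snac/entity/1"))` yields a persname whose authfilenumber holds the placeholder URI.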

SNAC URIs that are embedded into EAD finding aids (like the URIs from other linked open data vocabulary systems) will be included in the RDF serialization of the archival collection data. This may pave the way for users of EADitor to make their content accessible through SNAC, or whatever international archival entity system evolves from SNAC, by means of linked open data technologies.

Thursday, June 26, 2014

First Newell notebook published in Archer

The first of several dozen digitized Greek coin hoard research notebooks written by Edward Newell has been published into the newly-relaunched ANS archival resource, Archer. This is the first publicly-accessible product from a grant of $7,500 received from The Gladys Krieble Delmas Foundation to digitize these valuable resources. This notebook, written in 1939, includes notes on a handful of Greek coin hoards which were eventually published in An Inventory of Greek Coin Hoards (IGCH).

Specifications

We wanted to go beyond simply scanning page images and offering each notebook as an open access PDF. We wanted to include zoomable images, basic pagination functionality, and annotation of particular items on each page. We decided not to go as far as offering full transcriptions, but the annotations could include free text or links to other resources on the web: coins in the ANS collection, contemporary scholars mentioned in the text who might have entries in VIAF or DBpedia, nomisma.org-defined IGCH entries and other identifiers for mints, regions, or numismatic authorities, books in the ANS library, and place names in Geonames. Furthermore, these terms should be indexed into Solr and populate the facet terms in Archer's browse interface, and RDF should be made available for linking resources together.

Technical Underpinnings

We chose TEI XML as the vessel for encoding data about the notebooks. The TEI files include bibliographic metadata in the header and a list of facsimile elements that link to page images and contain xy coordinates for the annotations. Each facsimile element can hold multiple surface elements, one for each annotation, with mixed content of free text and links. We have hooked into Rainer Simon's Annotorious library in both the public user interface and the XForms-based backend. Both interfaces are now part of EADitor's core functionality.

It should be stressed that the TEI editing functionality in EADitor is tailored toward annotation of facsimile images. The header is not yet editable, nor is the body, but this functionality may be enhanced in the future. Nevertheless, when you open the TEI form, there is a Facsimiles tab. In this tab are two columns. On the left is a list of thumbnails. Clicking on a thumbnail will load the large image into OpenLayers and make it annotatable through Annotorious.


The JSON generated by Annotorious gets parsed by Orbeon (the XForms processor) and transformed into TEI and inserted into the XML document (not just the text annotations, but annotation coordinate values crosswalked into TEI attributes).

<facsimile xml:id="nnan0-187715_X007">
  <graphic url="0-187715_X007" n="Loose 2"/>
  <surface xml:id="aj6xve43lj5" ulx="-0.86506513142592" uly="1.3798999767388" lrx="-0.45771458478717" lry="1.2864639140252">
    <desc>
      <ref target="http://nomisma.org/id/sardes">Sardis</ref>
    </desc>
  </surface>
</facsimile>

URIs are parsed and made into TEI "ref" elements. The label of the element may be extracted from an API, depending on the content of the URI. For example, a URI matching "http://numismatics.org/collection" will have ".xml" appended, and the title will be pulled from the NUDS/XML serialization. Similarly, an AACR2-conformant place name label is generated by querying Geonames' APIs, and RDF is pulled from nomisma.org, DBpedia, or VIAF to extract the skos:prefLabel or rdfs:label (see more with the XPL and XSLT).
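
The crosswalk described above can be sketched in a few lines of Python. To be clear, this is not how EADitor does it (that happens in Orbeon with XPL and XSLT). The sketch assumes an annotation object with a `text` field and a single rectangle whose geometry has `x`, `y`, `width`, and `height`, and it assumes the lower-right corner is simply the origin plus the width and height, which may not match the coordinate orientation OpenLayers uses. It also treats the whole annotation text as either a URI or free text, and takes the label as a parameter instead of fetching it from an API.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this attribute name out as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def annotation_to_surface(annotation, surface_id, label=None):
    """Crosswalk one annotation dict into a TEI <surface> element."""
    box = annotation["shapes"][0]["geometry"]
    surface = ET.Element("surface", {
        XML_ID: surface_id,
        "ulx": str(box["x"]),
        "uly": str(box["y"]),
        "lrx": str(box["x"] + box["width"]),
        "lry": str(box["y"] + box["height"]),
    })
    desc = ET.SubElement(surface, "desc")
    text = annotation["text"].strip()
    if text.startswith(("http://", "https://")):
        # A URI becomes a <ref>; fall back to the URI itself as the label.
        ref = ET.SubElement(desc, "ref", {"target": text})
        ref.text = label or text
    else:
        desc.text = text  # free-text annotation
    return surface
```

Passing an annotation whose text is http://nomisma.org/id/sardes with the label "Sardis" produces a surface shaped like the one in the TEI example above.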


Once a new annotation is created and its URL is parsed, the link will be clickable. When you click on a different thumbnail in the left-hand column to load another image for annotation, the TEI file will be saved to eXist. The document will also be re-published to Solr if it had previously been designated as public, and re-published to the RDF triplestore if it is public and EADitor's config contains endpoint URLs for updating and posting data.
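
The save-and-publish rules above boil down to a small decision, sketched here in Python. The function name and the config keys (`sparql_update`, `graph_store`) are hypothetical stand-ins for whatever EADitor's config actually calls its two endpoint URLs.

```python
def publication_targets(is_public, config):
    """List where a TEI document goes when the editor moves to another page."""
    targets = ["eXist"]  # the file is always saved to the XML database
    if is_public:
        targets.append("Solr")  # re-index documents already marked public
        # The triplestore is updated only when both endpoint URLs are set.
        if config.get("sparql_update") and config.get("graph_store"):
            targets.append("triplestore")
    return targets
```

A private document is only saved to eXist, while a public one in a fully configured installation is pushed to all three.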

Public User Interface

Since the EAC-CPF based authority records delivered through xEAC and Archer are now both hooked into an RDF triplestore and SPARQL endpoint, the notebooks will appear under the list of related resources in the Newell authority record, together with finding aids (encoded in EAD) in which Newell is a creator or correspondent or photographs (encoded in MODS) in which he appears.

The interface for the notebook itself is not markedly different from that of other record types in EADitor. In the right column is OpenLayers+Annotorious (in read-only mode) for showing images and annotations. The annotations are rendered by querying a REST interface built into EADitor, which transforms the TEI facsimile element into Annotorious' JSON model when passed a few request parameters. Images can be paged through by clicking on thumbnails, next/previous links, or a drop down menu.


In the left column is bibliographic metadata and a list of unique terms that appear in the TEI document. The terms are clickable links to direct the user to the results page for that term's query. A user may click on the external link image to load the target URL (for example, to view the VIAF or Geonames page). Lastly, under each term are page numbers on which the term appears in the document. Clicking on one of these page number links will load the image and annotations in OpenLayers.

There are still some more improvements to make with the system, especially in making the TEI Header editable, but I think that this framework has really great potential for lowering the barriers to creating, editing, and publishing TEI, especially in a large scale linked open data system.

Monday, June 23, 2014

ANS' Archer v2 has gone live: EAD + EAC-CPF + SPARQL

I have spent much of the last two weeks working on deploying the newest version of EADitor and xEAC into production for version 2 of Archer, the Archives of the American Numismatic Society. As mentioned in earlier blog posts, EADitor has been hooked up to xEAC for personal, corporate, and family name lookups. Furthermore, both applications will serialize data into RDF and post into a triplestore to further connect archival content with authorities. EADitor supports the publication of MODS records in addition to EAD finding aids (although there is no editing form for MODS as of yet), and we will be publishing our first TEI files in the coming weeks (annotated facsimiles from the Greek numismatic research notebooks of Edward T. Newell).

Archives: http://numismatics.org/archives/
Authorities: http://numismatics.org/authorities/

Process

I first needed to parse out all of the personal and corporate names from the EAD finding aids in order to remove duplicate entities with slight variations in name form (because the earliest finding aids were created with an earlier version of EADitor that used an inconsistent autosuggest feature). The origination element contained only plain text, so these needed to be matched with the normalized personal or corporate names to insert corpname or persname elements. Furthermore, the MODS records needed to be reprocessed because all terms were categorized as subjects, regardless of whether they were indeed subject topics or genres, people, corporate bodies, etc.
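
I won't reproduce the actual cleanup process here, but a simplified Python sketch gives the flavor of spotting name forms that differ only slightly. It is an illustration, not the method used for Archer: it groups names that become identical once case, punctuation, and extra whitespace are ignored, and it would miss variants such as a spelled-out middle name.

```python
import re
from collections import defaultdict


def name_key(name):
    """Normalize a name so trivially different forms compare equal."""
    key = name.casefold()
    key = re.sub(r"[^\w\s]", "", key)  # drop punctuation
    return " ".join(key.split())       # collapse runs of whitespace


def group_variants(names):
    """Return {normalized key: [forms]} for keys with more than one form."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return {key: forms for key, forms in groups.items() if len(set(forms)) > 1}
```

Each group returned is a candidate for merging into a single normalized entity, which still calls for a human decision.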

The personal or corporate names that appeared in the EAD origination formed the basis for the new EAC-CPF collection. The EAD and MODS files were updated to point to the newly minted URIs for these entities, and matches to VIAF URIs were made for many of the other personal or corporate names that appear in the controlled access headings in the finding aids or in the subject terms in the MODS records.

After generating more than 100 EAC-CPF stubs (which included a biogHist extracted from the EAD), I went through the list of biographies of prominent members of the ANS on the website, filling in gaps as necessary. I added images into the EAC-CPF record, where applicable, designated as resource relations with an xlink:arcrole of foaf:depiction, which is defined as a semantic localTypeDeclaration in the EAC-CPF control element. The namespace for foaf: is defined in the declaration, making it possible to include a foaf:depiction in the RDF serialization (see http://numismatics.org/authority/anthon.rdf). I also inserted some CPF relations as necessary, either internally or externally to entities defined by VIAF URIs. Finally, I added a plethora of occupations (with dates and places, if known), most of which were looked up from the Getty AAT SPARQL endpoint through xEAC, as well as life events and associated places (looked up from the Geonames API in xEAC). I'm looking forward to leveraging the VIAF URIs stored in the EAC records in order to generate lists of monographs or journal articles by or about entities, pulled from WorldCat's APIs.
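
To show why declaring the foaf: namespace matters, here is a minimal Python sketch of expanding a prefixed arcrole into a full predicate URI and writing the relation as an N-Triples statement. xEAC itself does this through XSLT, so treat this purely as an illustration. The sketch assumes the record URI is the subject of the triple, and the image URL used in the usage note is a placeholder.

```python
def expand_curie(curie, namespaces):
    """Expand a prefix:local name using the declared namespaces."""
    prefix, _, local = curie.partition(":")
    if prefix not in namespaces or not local:
        raise KeyError("undeclared or malformed prefixed name: %s" % curie)
    return namespaces[prefix] + local


def relation_to_ntriple(subject, arcrole, target, namespaces):
    """Serialize one resource relation as an N-Triples statement."""
    predicate = expand_curie(arcrole, namespaces)
    return "<%s> <%s> <%s> ." % (subject, predicate, target)
```

With `{"foaf": "http://xmlns.com/foaf/0.1/"}` as the declared namespaces, a relation from http://numismatics.org/authority/anthon with an arcrole of foaf:depiction and a target of http://example.org/images/anthon.jpg becomes a triple whose predicate is http://xmlns.com/foaf/0.1/depiction. Without the declaration, the prefix cannot be resolved and no triple can be written.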

Results

All in all, it is a fairly comprehensive and powerful research tool, even though there are fewer than 150 entities in the system. Because the authority records and the archival resources are pushed into the RDF triplestore upon publication, a researcher can reach related archival content directly from an authority record. Conversely, a user viewing a finding aid can view an abstract dynamically extracted from the EAC-CPF record for the creator of the archival content.

I believe I inserted just enough places and chronological events to make the map and timeline an adequate demonstration of geotemporal visualization for most EAC-CPF records. Certainly, we could spend more time enhancing the biographical context of each record. We will begin this process, and the records will continue to evolve. The user interface will continue to evolve as well, as I aim to introduce social network graph visualizations based on SPARQL queries, and I will create an XQuery interface to accompany the SPARQL endpoint interface to facilitate a wider variety of complex queries.

Conclusions

There are no firm ontology or data model standards for linked open archival description, so the RDF model used in this interface should be considered to be a beta, and it will be adapted when stronger standards emerge from the archival community. The system is beyond a proof of concept at this phase, but it has not been tested for larger scale implementation. I believe that the system can accommodate hundreds of thousands (even into the low millions) of EAC-CPF records and many millions of triples, but I have yet to test at this scale. In any case, I believe that this more modularized/linked approach to archival collections represents the direction of digital archives, or more broadly, cultural heritage materials.

Look for a more official announcement in the near future, probably after we publish a few annotated TEI-based notebooks from the Newell project, which we received modest funding to digitize last year.

Friday, May 30, 2014

Enabling a SPARQL endpoint in EADitor and xEAC

Not long ago, I discussed the enhancement of both xEAC and EADitor by connecting them through a SPARQL endpoint. I have extended this further by enabling a wrapper to this endpoint in both xEAC and EADitor. It didn't take long to implement, as I basically copied and pasted some files/code from the nomisma.org GitHub repository.

Essentially, once SPARQL endpoint URLs have been entered into the config for xEAC or EADitor, a checkbox is made available to enable a wrapper to the endpoint. By wrapper, I mean that a pipeline is created for the SPARQL query interface and the query response. This wrapper interacts with the SPARQL endpoint directly, but the xEAC/EADitor SPARQL page carries the same style as the rest of the site. If HTML results are selected, the SPARQL XML response is transformed by an XSLT stylesheet in xEAC/EADitor into HTML that also conforms to the style of the application. It's essentially identical to the functionality on http://kerameikos.org/sparql: the page and query response are merely interfaces to Fuseki's endpoint.
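
The HTML rendering step of the wrapper is done with XSLT in xEAC and EADitor, but the idea can be sketched in Python for anyone who wants to see its shape. This is a stand-in for the stylesheet, not a port of it. It relies only on the standard SPARQL Query Results XML format and produces a bare table with none of the site styling.

```python
import html
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML format.
NS = {"s": "http://www.w3.org/2005/sparql-results#"}


def results_to_html(sparql_xml):
    """Render a SPARQL XML result set as a plain HTML table."""
    root = ET.fromstring(sparql_xml)
    variables = [v.get("name") for v in root.findall("s:head/s:variable", NS)]
    rows = []
    for result in root.findall("s:results/s:result", NS):
        values = {}
        for binding in result.findall("s:binding", NS):
            node = binding[0]  # the <uri>, <literal>, or <bnode> child
            values[binding.get("name")] = node.text or ""
        # Unbound variables become empty cells so the columns stay aligned.
        cells = "".join(
            "<td>%s</td>" % html.escape(values.get(name, "")) for name in variables
        )
        rows.append("<tr>%s</tr>" % cells)
    header = "".join("<th>%s</th>" % html.escape(name) for name in variables)
    return "<table><tr>%s</tr>%s</table>" % (header, "".join(rows))
```

The wrapper page would embed a table like this inside the application's own page template, which is what gives the SPARQL interface the same look as the rest of the site.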


Why is the SPARQL endpoint useful?

SPARQL is a complicated beast, and I do plan on writing documentation on the ontologies used and models implemented in both xEAC and EADitor. Most users will likely not use the SPARQL query endpoint directly. But the major point is: it exists for a small subset of users that want to perform really sophisticated queries on the dataset. In the same vein, xEAC will also eventually expose an XQuery interface for performing different types of complex queries on the dataset.

The real advantage: building UIs on SPARQL

As demonstrated in a variety of other projects that I work on, the best uses of SPARQL are the ones you don't even notice. On http://kerameikos.org/id/red_figure, the timeline/map, list of thumbnails, and chart (if generated) showing the distribution of a particular typology are all generated by SPARQL queries and rendered into something that is far more visually understandable to a human being. Likewise with a chart showing the change in weight of Roman imperial denarii from the start of the Roman Empire in 27 B.C. to about A.D. 220 (the end point for which we currently have data): http://numismatics.org/ocre/visualize?measurement=weight&chartType=line&interval=5&fromDate=-30&toDate=220&sparqlQuery=nm%3Adenomination+%3Chttp%3A%2F%2Fnomisma.org%2Fid%2Fdenarius%3E#measurements .

While there is still much work to do in refining the ontologies and models used for representing EAC-CPF or EAD records as RDF, now that the SPARQL publication mechanism actually functions, it will be possible to begin to build more sophisticated visualizations on top of these queries--to incorporate social network graph visualizations into xEAC, dynamically generated by SPARQL and easily manipulated and navigated by users.

The first step, however, is to be able to generate lists of related resources, such as a test finding aid linked to Edward T. Newell, below:


 
We plan to deploy xEAC and the current version of EADitor into production at the American Numismatic Society fairly soon for Archer 2.0. While the ANS' archives are fairly small, I think you can get the idea of the potential for a large collection of entity records, like SNAC, when you are able to link millions of entities to many tens of millions of related materials.

Similarly, in EADitor, when a finding aid has been linked to an entity in xEAC (and the @type of the persname, famname, or corpname has been set to xeac:entity [automatically done in the editing interface]), EADitor will extract biographical information directly from the EAC-CPF record: