Monday, June 23, 2014

ANS' Archer v2 has gone live: EAD + EAC-CPF + SPARQL

I have spent much of the last two weeks working on deploying the newest version of EADitor and xEAC into production for version of of Archer, the Archives of the American Numismatic Society. As mentioned in earlier blog posts, EADitor has been hooked up to xEAC for personal, corporate, and family name lookups. Furthermore, both applications will serialize data into RDF and post into a triplestore to further connect archival content with authorities. EADitor supports the publication of MODS records in addition to EAD finding aids (although there is no editing form for MODS as of yet), and we will be publishing our first TEI files in the coming weeks (annotated facsimiles from the Greek numismatic research notebooks of Edward T. Newell).

Archives: http://numismatics.org/archives/
Authorities: http://numismatics.org/authorities/

Process

I first needed to parse out all of the personal and corporate names from the EAD finding aids in order to remove duplicate entities with slight variations in name form (because the earliest finding aids were created with an earlier version of EADitor that used an inconsistent autosuggest feature). The origination element contained only plain text, so these needed to be matched with the normalized personal or corporate names to insert corpname or persname elements. Furthermore, the MODS records needed to be reprocessed because all terms were categorized as subjects, regardless of whether they were indeed subject topics or genres, people, corporate bodies, etc.

The personal or corporate names that appeared in the EAD origination formed the basis for the new EAC-CPF collection. The EAD and MODS files were updated to point to the newly minted URIs for these entities, and matches to VIAF URIs were made for many of the other personal or corporate names that appear in the controlled access headings in the finding aids or in the subject terms in the MODS records.

After generating more than 100 EAC-CPF stubs (which included a biogHist extracted from the EAD), I went through the list of biographies of prominent members of the ANS on the website, filling in gaps as necessary. I added images into the EAC-CPF record, where applicable, designated as resource relations with an xlink:arcole of foaf:depiction, which is defined as a semantic localTypeDeclaration in the EAC-CPF control element. The namespace for foaf: is defined in the declaration, making it possible to include a foaf:depiction in the RDF serialization (see http://numismatics.org/authority/anthon.rdf). I also inserted some CPF relations as necessary, either internally or externally to entities defined by VIAF URIs. Finally, I added a plethora of occupations (with dates and places, if known)--most of which were looked up from the Getty AAT SPARQL endpoint through xEAC--, life events, and associated places (looked up from the Geonames API in xEAC). I'm looking forward to leveraging the VIAF URIs stored in the EAC records in order to generate lists of monographs or journal articles by or about entities, pulled from Worldcat's APIs.

Results

All in all, it is a fairly comprehensive and powerful research tool, even if there are only fewer than 150 entities in the system. Because the authority records and the archival resources are pushed into the RDF triplestore upon publication, a researcher can gain access to contents from an authority record. A user viewing a finding aid can view an abstract dynamically extracted from the EAC-CPF record for the creator of the archival content.

I believe I inserted just enough places and chronological events to make the map and timeline an adequate demonstration of geotemporal visualization for most EAC-CPF records. Certainly, we could spend more time enhancing the biographical context of each record. We will begin this process, and the records will continue to evolve. The user interface will continue to evolve as well, as I aim to introduce social network graph visualizations based on SPARQL queries, and I will create an XQuery interface to accompany the SPARQL endpoint interface to facilitate a wider variety of complex queries.

Conclusions

There are no firm ontology or data model standards for linked open archival description, so the RDF model used in this interface should be considered to be a beta, and it will be adapted when stronger standards emerge from the archival community. The system is beyond a proof of concept at this phase, but it has not been tested for larger scale implementation. I believe that the system can accommodate hundreds of thousands (even into the low millions) of EAC-CPF records and many millions of triples, but I have yet to test at this scale. In any case, I believe that this more modularized/linked approach to archival collections represents the direction of digital archives, or more broadly, cultural heritage materials.

Look for a more official announcement in the near future, probably after we publish a few annotated TEI-based notebooks from the Newell project, which we received modest funding to digitize last year.

No comments:

Post a Comment