Wednesday, July 11, 2018

Creating and Updating SNAC constellations directly in xEAC

After 2-3 weeks of work, I have made some very significant updates to xEAC, one which paves the way to making archival materials at the American Numismatic Society (and other potential users of our open source software frameworks) broadly accessible to other researchers. This is especially important for us, since we are a small archive with unique materials that don't reach a general historical audience, and we are now able to fulfill one of the potentialities we outlined in our Mellon-NEH Open Humanities Book project: that we would be able to make 200+ open ebooks available through Social Networks and Archival Context (SNAC).

I have introduced a new feature that interacts with the SNAC JSON API within the XForms backend of xEAC (note that you need to use an XForms 2.0 compliant processor for xEAC in order to make use of JSON data). The feature will create a new constellation if none exists or supplement existing constellations with data from the local EAC-CPF record. While the full range of EAC-CPF components is supported by the SNAC API, I have focused primarily on the integration of the stable URI for the entity in the local authority system (e.g.,, existDates (if they are not already in the constellation), and the biogHist. Importantly, if xEAC users have opted to connect to a SPARQL endpoint that also contains archival or libraries materials, these related resources will be created in SNAC and linked to the constellation.

It should be noted that this system is still in beta and has only been tested with the SNAC development server. There is still work to do with improving the authentication handshake between xEAC and SNAC.

The process


Step 1: Reviewing an existing constellation for content

The first step of the process is executed when the user loads the form. If the EAC-CPF record already contains an entityId that conforms to the permanent, stable SNAC ARK URI, a "read" query will be issued to the SNAC API in order to determine what content already exists in the constellation, including what resources are already available in the constellation vs. the resources extracted from the local archival information system via SPARQL.

The SPARQL query for extracted resources from the endpoint is as follows:

PREFIX rdf:      <>
PREFIX dcterms:  <>
PREFIX foaf:  <>
SELECT ?uri ?role ?title ?type ?genre ?abstract ?extent WHERE {
?uri ?role <> ;
     dcterms:title ?title ;
     rdf:type ?type ;
     dcterms:type ?genre .
  OPTIONAL {?uri dcterms:abstract ?abstract}
  OPTIONAL {?uri dcterms:extent ?extent}
} ORDER BY ASC(?role)

I recently made an update to our Digital Library and Archival software so that every different type of resource (ebooks and notebooks in TEI, photographs in MODS, finding aids in EAD) will include a dcterms:type linking to a Getty AAT URI in the RDF serialization. This AAT URI, in conjunction with the rdf:type of the archival or library object (often a Class), will help determine the type of resource according to SNAC's own parameters (BibliographicResource, ArchivalResource, DigitalArchivalRescource). Additionally, the role of the entity with respect to the resource (dcterms:creator, dcterms:subject) informs the role within the SNAC resource-constellation connection: creatorOf, referencedIn. Abstracts and extents are inserted, if available.

Step 2: Validate authentication

SNAC uses Google user tokens for validation within its own system. There is currently no handshake available between xEAC and SNAC which will facilitate multiple users in xEAC to each have their own credentials in SNAC. At the moment, the "user" information is stored in the xEAC config file. A user will have to enter their Google credentials from the SNAC API Key page into the web form and click the "Confirm User Data" button. xEAC will submit an "edit" to a random constellation to verify the validity of the authentication information. If it is successful, the credentials are then stored back into the config (although the token only lasts about 24 hours) and the constellation is immediately unlocked. The user will then proceed to the create/update constellation interface.

Authenticating through xEAC

Step 3: Creating or updating a constellation

The user will now see several checkboxes to add information into the constellation. Eventually, it will be possible to remove data as well. Below is a synopsis of options:

  1. Same As URI: The URI of the entity in the local authority system will be added into the constellation. This is especially important or establishing concordances between different vocabulary systems.
  2. Exist dates can be added into the constellation if they are not already present.
  3. If there isn't already a biogHist in the constellation and there is one present in the EAC-CPF record, the biogHist will be escaped and published to SNAC. A source will also be created in the constellation in order to link the new biogHist to SNAC control metadata, tying the new biogHist directly to the local URI for the authority. This makes it possible to update or delete only the biogHist associated with your own entity without overwriting other biogHist information that might already be present within the constellation. While SNAC does support multiple biogHists, only the most recently added biogHist will appear in the HTML view of the entity. For this reason (at present), xEAC will only insert a biogHist if there isn't one in the constellation already. In step 1, if the constellation already contains a biogHist associated with the source URI for your authority, it will hash encode the constellation's biogHist and compare it to the hash-encoded biogHist currently in the EAC-CPF record. If there is a difference between these hashes, the constellation will be updated with the current version of the biogHist in the EAC-CPF record.
  4. A list of resource relations derived from SPARQL will be displayed. All will be checked by default in order to first create the resource with the "insert_resource" API command, and second to connect the constellation to that newly created resource with "update_constellation". Each resource entry will display some basic metadata and whether or not it already exists in the constellation, and what action will be taken. It is possible to uncheck the box for a resource that exists in the constellation to remove it from the constellation.
The interface for creating and updating SNAC constellations

Step 4: Saving the ARK back to the EAC-CPF record, if applicable

After the successful issuing of "publish_constellation" to the SNAC API, an entityId with the new SNAC ARK URI will be inserted into the EAC-CPF record, if the constellation is newly created (updates presume the ARK already exists in the EAC record). Saving the EAC record will trigger a re-indexing of the document to Solr and a SPARQL/Update that will insert the ARK as a skos:exactMatch into the concept object for the entity.

PREFIX rdf:      <>
PREFIX skos:  <>
PREFIX foaf:  <>

INSERT { ?concept skos:exactMatch <ARK>  }
WHERE { ?concept foaf:focus <URI> }

The data above are those I consider to most vital to SNAC integration--essential historical or biographical context and related archival or library resources that can be made more broadly accessible. I am not sure how many other authority systems are able to interact with SNAC with this degree of granularity yet, but I am hopeful that these features will propel more unique research materials into the public sphere.

I will briefly touch on these new features when I present our our comprehensive LOD-oriented numismatic research platform at SAA next month (I will upload the slideshow soon).

Thursday, June 7, 2018

SNAC Lookups Updated in xEAC and EADitor

Since the Social Networks and Archival Context has migrated to a new platform, it has published a JSON-based REST API, which they have well-documented. Although EADitor and xEAC have had lookup mechanisms to link personal, corporate, and family entities from SNAC to EAD and EAC-CPF records since 2014 (see here), the lookup mechanisms in the XForms-based backends to these platforms interacted with an unpublicized web service that provided an XML response for simple queries.

With the advent of these new SNAC APIs and JSON processing within the XForms 2.0 spec (present in Orbeon since 2016), I have finally gotten around to overhauling the lookups in both EADitor and xEAC. Following documentation for the Search API, the XForms Submission process now submits (via PUT) an instance that conforms to the required JSON model. The @serialization attribute is set to "application/json" in the submission, and the JSON response from SNAC is serialized back into XML following the XForms 2.0 specification. Side note: the JSON->XML serialization differs between XForms 2.0 and XSLT/XPath 3.0, and so there should be more communication between these groups to standardize JSON->XML across all XML technologies.

The following XML instance is transformed into API-compliant JSON upon submission.

<xforms:instance id="query-json" exclude-result-prefixed="#all">
 <json type="object" xmlns="">

The submission is as follows:

<xforms:submission id="query-snac" ref="instance('query-json')" 
    action="" method="put" replace="instance" 
    instance="snac-response" serialization="application/json">
 <xforms:message ev:event="xforms-submit-error" level="modal">Error transfroming 
into JSON and/or interacting with the SNAC

The SNAC URIs are placed into the entityIds within the cpfDescription/identity in EAC-CPF or as the @authfilenumber for a persname, corpname, or famname in EAD.

The next task to to build APIs into xEAC for pushing data (biographical data, skos:exactMatch URIs, and related archival resources) directly into SNAC. By tomorrow, all (or nearly all) of the authorities in the ANS Archives will be linked to SNAC URIs.

Friday, May 18, 2018

Three new Edward Newell research notebooks added to Archer

Three research notebooks of Edward T. Newell have been added to Archer, the archives of the American Numismatic Society. These had been scanned as part of the larger Newell digitization project, which was migrated into IIIF for display in Mirador (with annotations) in late 2017.

These three notebooks had been scanned, but TEI files had not been generated due to some minor oversight. Generating the TEI files was fairly straightforward--there's a small PHP script that will extract MODS from our Koha-based library catalog. These MODS files are subsequently run through an XSLT 3.0 stylesheet to generate TEI with a facsimile listing of all image files associated with the notebook, linking to the IIIF service URI. XSLT 3.0 comes into play to parse the info.json for each image in order to insert the height and width of the source image directly into the TEI, which is used for the TEI->IIIF Manifest JSON transformation (the canvas and image portions of the manifest), which is now inherent to TEI files published in the EADitor platform.

The notebooks all share the same general theme: they are Newell's notes on the coins in the Berlin M├╝nzkabinett, which we aim to annotate in Mirador over the course of the NEH-funded Hellenistic Royal Coinages project.

A fourth notebook was found to have not yet been scanned, and so it will be published online soon.

Friday, April 6, 2018

117 ANS ebooks published to Digital Library

I have finally put the finishing touches on 117 ANS out-of-print publications that have been digitized into TEI (and made available as EPUB and PDF) as part of the NEH and Mellon-funded Open Humanities Book project. This is the "end" (more details on what an end entails later) of the project, in which about 200 American Numismatic Society monographs were digitized and made freely and openly available to the public.

All of these, plus a selection of numismatic electronic theses and dissertations as well as two other ebooks not funded by the NEH-Mellon project, are available in the ANS Digital Library. The details of this project have been outlined in previous blog posts, but to summarize, the TEI files have been annotated with thousands of links to people, places, and other types of entities defined in a variety of information systems--particularly (for ancient entities), Wikidata, and Geonames (for modern ones).

  • Books have been linked to 153 coins (so far) in the ANS collection identified by accession number. Earlier books cite Newell's personal collection, bequeathed to the ANS and accessioned in 1944. A specialist will have to identify these.
  • 173 total references to coin hoards defined in the Inventory of Greek Coin Hoards, plus several from Kris Lockyear's Coin Hoards of the Roman Republic.
  • 166 references to Roman imperial coin types defined in the NEH-funded Online Coins of the Roman Empire.
  • A small handful of Islamic glass weights in The Metropolitan Museum of Art 
  • One book by Wolfgang Fischer-Bossert, Athenian Decadrachm, has a DOI, connected to his ORCID.
Since each of these annotations is serialized into RDF and published in the ANS archival SPARQL endpoint, the other various information systems (MANTIS, IGCH, OCRE, etc.) query the endpoint for related archival or library materials.

For example, the clipped shilling, 1942.50.1, was minted in Boston, but the note says it was found among a mass of other clippings in London. The findspot is not geographically encoded in our database (and therefore doesn't appear on the map), but this coin is cited in "Part III Finds of American Coins Outside the Americas" in Numismatic finds of the Americas.

Using OpenRefine for Entity Reconciliation

Unlike the first phase of the project, the people and places tagged in these books were extracted into two enormous lists (20,000 total lines) that were reconciled against Wikidata, VIAF, or Nomisma OpenRefine reconciliation APIs. Nomisma was particularly useful because of the high degree of accuracy in matching people and places. Wikidata and VIAF were useful for modern people and places, but these were more challenging in that there might be dozens of American towns with the same name or numerous examples of Charles IV or other regents. I had to evaluate the name within the context of the passage in which it occurred, a tedious process that took nearly two months to complete. The end result, however, has a significantly broader and more accurate coverage than the 85 books in the first iteration of the grant. After painstakingly matching entities to their appropriate identifiers, it only took about a day to write the scripts to incorporate the URIs back into the TEI files, and a few more days of manual, or regex linking for IGCH, ANS coins, etc.

As a result of this effort, and through the concordance between Nomisma identifiers and Pleiades places, there are a total of 3,602 distinct book sections containing 4,304 Pleiades URIs, which can now be made available to scholars through the Pelagios project.

What's Next for ANS Publications?

So while the project concludes in its official capacity, there is room for improvement and further integration. Now that the corpus has been digitized, it will be possible to export all of the references into OpenRefine in an attempt to restructure the TEI and link to URIs defined by Worldcat. We will want to link to other DOIs if possible, and make the references for each book available in Crossref. Some of this relies on the expansion of Crossref itself to support entities identifiers beyond ORCID (e.g., ISNI) and citations for Worldcat. Presently, DOI citation mechanisms allow us to build a network graph of citations for works produced in the last few years, but the extension of this graph to include older journals and monographs will allow us to chart the evolution of scientific and humanistic thought over the course of centuries.

As we know, there is never an "end" to Digital Humanities projects. Only constant improvement. And I believe that the work we have done will open the door to a sort of born-digital approach to future ANS publications.