Thursday, June 7, 2018

SNAC Lookups Updated in xEAC and EADitor

Since the Social Networks and Archival Context (SNAC) project migrated to a new platform, it has published a well-documented JSON-based REST API. EADitor and xEAC have had lookup mechanisms to link personal, corporate, and family entities from SNAC to EAD and EAC-CPF records since 2014 (see here), but the lookup mechanisms in the XForms-based backends of these platforms interacted with an unpublicized web service that provided an XML response for simple queries.

With the advent of these new SNAC APIs and JSON processing within the XForms 2.0 spec (present in Orbeon since 2016), I have finally gotten around to overhauling the lookups in both EADitor and xEAC. Following documentation for the Search API, the XForms Submission process now submits (via PUT) an instance that conforms to the required JSON model. The @serialization attribute is set to "application/json" in the submission, and the JSON response from SNAC is serialized back into XML following the XForms 2.0 specification. Side note: the JSON->XML serialization differs between XForms 2.0 and XSLT/XPath 3.0, and so there should be more communication between these groups to standardize JSON->XML across all XML technologies.

The following XML instance is transformed into API-compliant JSON upon submission.


<xforms:instance id="query-json" exclude-result-prefixes="#all">
 <json type="object" xmlns="">
  <command>search</command>
  <term/>
  <entity_type/>
  <start>0</start>
  <count>10</count>
 </json>
</xforms:instance>


The submission is as follows:


<xforms:submission id="query-snac" ref="instance('query-json')" 
    action="http://api.snaccooperative.org" method="put" replace="instance" 
    instance="snac-response" serialization="application/json">
 <xforms:header>
  <xforms:name>User-Agent</xforms:name>
  <xforms:value>XForms/xEAC</xforms:value>
 </xforms:header>
 <xforms:message ev:event="xforms-submit-error" level="modal">Error transforming into JSON and/or interacting with the SNAC API.</xforms:message>
</xforms:submission> 
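
For readers working outside of XForms, the same request can be sketched in Python. This is a minimal sketch and not the code used in xEAC or EADitor: the function names are hypothetical, the field names and endpoint are copied from the instance and submission above, and sending start and count as numbers rather than strings is an assumption about what the API accepts.

```python
import json
import urllib.request

SNAC_API = "http://api.snaccooperative.org"  # endpoint taken from the submission above


def build_search_payload(term, entity_type="", start=0, count=10):
    """Mirror the fields of the 'query-json' XForms instance as a dict."""
    return {
        "command": "search",
        "term": term,
        "entity_type": entity_type,
        "start": start,   # assumption: numeric values are accepted
        "count": count,
    }


def build_search_request(term, entity_type=""):
    """Prepare (but do not send) a PUT request carrying the JSON body."""
    body = json.dumps(build_search_payload(term, entity_type)).encode("utf-8")
    return urllib.request.Request(
        SNAC_API,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "User-Agent": "XForms/xEAC",
        },
    )
```

Passing the prepared request to urllib.request.urlopen would send it; the sketch stops short of that so it can be read and tested without network access.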

The SNAC URIs are placed into the entityIds within the cpfDescription/identity in EAC-CPF or as the @authfilenumber for a persname, corpname, or famname in EAD.

The next task is to build APIs into xEAC for pushing data (biographical data, skos:exactMatch URIs, and related archival resources) directly into SNAC. By tomorrow, all (or nearly all) of the authorities in the ANS Archives will be linked to SNAC URIs.

Friday, May 18, 2018

Three new Edward Newell research notebooks added to Archer

Three research notebooks of Edward T. Newell have been added to Archer, the archives of the American Numismatic Society. These had been scanned as part of the larger Newell digitization project, which was migrated into IIIF for display in Mirador (with annotations) in late 2017.

These three notebooks had been scanned, but TEI files had not been generated due to some minor oversight. Generating the TEI files was fairly straightforward--there's a small PHP script that will extract MODS from our Koha-based library catalog. These MODS files are subsequently run through an XSLT 3.0 stylesheet to generate TEI with a facsimile listing of all image files associated with the notebook, linking to the IIIF service URI. XSLT 3.0 comes into play to parse the info.json for each image and insert the height and width of the source image directly into the TEI. These dimensions are used for the canvas and image portions of the TEI->IIIF Manifest JSON transformation, which is now built into TEI files published in the EADitor platform.
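
The info.json step is done in XSLT 3.0 in our workflow, but the idea is easy to illustrate in Python. The following is only a sketch under stated assumptions: the function name is hypothetical and the sample document is an invented, heavily trimmed info.json, relying only on the width and height properties that the IIIF Image API defines.

```python
import json


def image_dimensions(info_json_text):
    """Return (width, height) from a IIIF Image API info.json document."""
    info = json.loads(info_json_text)
    return int(info["width"]), int(info["height"])


# Hypothetical, trimmed info.json for illustration only.
sample = '{"@id": "https://example.org/iiif/page1", "width": 3000, "height": 4000}'
width, height = image_dimensions(sample)
```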

The notebooks all share the same general theme: they are Newell's notes on the coins in the Berlin Münzkabinett, which we aim to annotate in Mirador over the course of the NEH-funded Hellenistic Royal Coinages project.

A fourth notebook was found to have not yet been scanned, and so it will be published online soon.

Friday, April 6, 2018

117 ANS ebooks published to Digital Library

I have finally put the finishing touches on 117 ANS out-of-print publications that have been digitized into TEI (and made available as EPUB and PDF) as part of the NEH and Mellon-funded Open Humanities Book project. This is the "end" (more details on what an end entails later) of the project, in which about 200 American Numismatic Society monographs were digitized and made freely and openly available to the public.

All of these, plus a selection of numismatic electronic theses and dissertations as well as two other ebooks not funded by the NEH-Mellon project, are available in the ANS Digital Library. The details of this project have been outlined in previous blog posts, but to summarize, the TEI files have been annotated with thousands of links to people, places, and other types of entities defined in a variety of information systems--particularly Nomisma.org (for ancient entities), Wikidata, and Geonames (for modern ones).

Additionally:
  • Books have been linked to 153 coins (so far) in the ANS collection identified by accession number. Earlier books cite Newell's personal collection, bequeathed to the ANS and accessioned in 1944. A specialist will have to identify these.
  • 173 total references to coin hoards defined in the Inventory of Greek Coin Hoards, plus several from Kris Lockyear's Coin Hoards of the Roman Republic.
  • 166 references to Roman imperial coin types defined in the NEH-funded Online Coins of the Roman Empire.
  • A small handful of Islamic glass weights in The Metropolitan Museum of Art.
  • One book by Wolfgang Fischer-Bossert, Athenian Decadrachm, has a DOI, connected to his ORCID.
Since each of these annotations is serialized into RDF and published in the ANS archival SPARQL endpoint, the other various information systems (MANTIS, IGCH, OCRE, etc.) query the endpoint for related archival or library materials.

For example, the clipped shilling, 1942.50.1, was minted in Boston, but the note says it was found among a mass of other clippings in London. The findspot is not geographically encoded in our database (and therefore doesn't appear on the map), but this coin is cited in "Part III Finds of American Coins Outside the Americas" in Numismatic finds of the Americas.


Using OpenRefine for Entity Reconciliation

Unlike the first phase of the project, the people and places tagged in these books were extracted into two enormous lists (20,000 total lines) that were reconciled against Wikidata, VIAF, or Nomisma OpenRefine reconciliation APIs. Nomisma was particularly useful because of the high degree of accuracy in matching people and places. Wikidata and VIAF were useful for modern people and places, but these were more challenging in that there might be dozens of American towns with the same name or numerous examples of Charles IV or other regents. I had to evaluate the name within the context of the passage in which it occurred, a tedious process that took nearly two months to complete. The end result, however, has a significantly broader and more accurate coverage than the 85 books in the first iteration of the grant. After painstakingly matching entities to their appropriate identifiers, it only took about a day to write the scripts to incorporate the URIs back into the TEI files, and a few more days of manual or regex-based linking for IGCH, ANS coins, etc.

As a result of this effort, and through the concordance between Nomisma identifiers and Pleiades places, there are a total of 3,602 distinct book sections containing 4,304 Pleiades URIs, which can now be made available to scholars through the Pelagios project.


What's Next for ANS Publications?

So while the project concludes in its official capacity, there is room for improvement and further integration. Now that the corpus has been digitized, it will be possible to export all of the references into OpenRefine in an attempt to restructure the TEI and link to URIs defined by Worldcat. We will want to link to other DOIs if possible, and make the references for each book available in Crossref. Some of this relies on the expansion of Crossref itself to support entity identifiers beyond ORCID (e.g., ISNI) and citations for Worldcat. Presently, DOI citation mechanisms allow us to build a network graph of citations for works produced in the last few years, but the extension of this graph to include older journals and monographs will allow us to chart the evolution of scientific and humanistic thought over the course of centuries.

As we know, there is never an "end" to Digital Humanities projects. Only constant improvement. And I believe that the work we have done will open the door to a sort of born-digital approach to future ANS publications.

Tuesday, October 31, 2017

EADitor now supports EAD and MODS to IIIF manifest generation

After migrating the Newell TEI notebooks to support serialization of facsimiles into IIIF manifests and the rendering of these manifests in an embedded Mirador viewer, I implemented a transformation of EAD finding aid image collections and MODS records for photographs into manifests.

EAD updates

The EAD finding aids were updated so that each daogrp, which previously linked to flickr images, now links to thumbnail, reference, and IIIF service URLs (dao[@xlink:role='IIIFService']). An XSLT transformation turns the EAD into manifest JSON, with an intermediate step in XPL in which the Orbeon XForms processor iterates through the IIIFService info.json files to extract the height and width needed to generate a canvas for each image.

The Brett finding aid now includes clickable thumbnails that will launch the zoomable Leaflet viewer in a fancybox popup window. At the top of the page, the user can download the manifest, and there's also a link to view the manifest in our internal Mirador viewer. You can view the EAD XML (link at top) for more details.

MODS updates

The updates to the MODS were twofold. First, in the previous version of Archer, all photographs were suppressed from the public regardless of copyright concerns. We have re-evaluated these concerns by applying one of several Rights Statements. Two of these rights statements are the most permissive, and we therefore display the high resolution image whenever we have every right to do so. In any case, thumbnails are Fair Use, and therefore, they are always visible in the record page and the search results pages.

Where copyright allows us to do so, the MODS file includes a URL for the reference image and a URL[@access='raw object' and @note='IIIFService']. When a IIIFService URL is present in the MODS record, the XSLT transformation will include a Leaflet div and initiate the display of the image. See A Portrait Photograph of Margaret Thompson, for example. Like the finding aid, a manifest is dynamically generated from MODS, but only one XForms processor is called to extract the height and width from the info.json for the single image linked in the MODS file.

Pelagios Updates

Since the Brett collection links many photographs to ancient places defined in the Pleiades Gazetteer of Ancient Places, I have updated the EADitor RDF output for Pelagios. The output now includes IIIF service metadata conforming to the Europeana Data Model specification. Rainer Simon has imported these photographs into Peripleo.

Friday, October 6, 2017

Newell notebooks migrated to IIIF

As part of our transition to IIIF for high resolution photographs for the numismatic collection in MANTIS (see http://numismatics.org/collection/1944.100.45250 for example), I have begun to migrate our archival images into IIIF as well. These new features will be available in our new dedicated server as soon as the migration of Wordpress from one server to another is complete, which I expect in the next few weeks. The implementation of IIIF for our archival resources entails three overhauls of the current metadata model and HTML/IIIF Manifest serialization: TEI (for Newell notebooks of facsimile images), Encoded Archival Description (EAD) finding aids, and MODS. The transformation of the TEI notebooks into IIIF compliance is completed, and the functionality for EAD and MODS has been built, but the XML data have not been fully updated to link to IIIF services (mainly because the high resolution images haven't been uploaded to the server yet).

Annotated Newell notebook IIIF manifest displayed in Mirador


TEI to IIIF Manifest

The first Newell notebook was published to Archer (built on EADitor) more than three years ago. There are now about 50 notebooks published, but only a handful have been annotated to link to people, IGCH hoards, and coins in our collection (we will complete the annotation as part of the Hellenistic Royal Coinages project). To summarize the technical underpinnings, each notebook is a TEI file with facsimile elements for each page. The facsimile contains a link to the image and 0-n surface elements representing annotations. These surface elements were created by roundtripping the Annotorious/OpenLayers annotation JSON <-> TEI. The @ulx, @uly, @lrx, and @lry attributes represent the coordinates of the upper left and lower right hand corners of the annotations, and the coordinates were relative ratios based on OpenLayers bounds.

For IIIF compliance, I ran the TEI through an XSLT 3 transformation that loads the info.json metadata from our IIIF image server to extract the height and width of each image, and then recalculates the coordinates to be more in line with Web Annotation segments. The lower right coordinates are still stored in the TEI, but upon generation of annotation lists for the manifest, the upper left coordinates are subtracted from the lower right coordinates to establish the annotation width and height.

<surface lrx="1540" lry="155" ulx="1182" uly="54" xml:id="aho40v9vbhq7">
   <desc>
      <ref target="http://coinhoards.org/id/igch1516">IGCH 1516</ref>
   </desc>
</surface>

The tei:facsimile to annotation list transformation outputs:

http://numismatics.org/archives/manifest/nnan187715/canvas/nnan0-187715_X006#xywh=1182,54,358,101
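
The arithmetic behind that fragment can be checked with a few lines of Python. This is an illustrative sketch rather than the XSLT used in EADitor, and the function name is hypothetical: width is lrx minus ulx, height is lry minus uly, and the upper left corner supplies x and y.

```python
import xml.etree.ElementTree as ET


def surface_to_xywh(surface_xml):
    """Convert a tei:surface's corner attributes into an xywh fragment value."""
    surface = ET.fromstring(surface_xml)
    ulx, uly = int(surface.get("ulx")), int(surface.get("uly"))
    lrx, lry = int(surface.get("lrx")), int(surface.get("lry"))
    # x, y come from the upper left corner; width and height are the differences
    return f"{ulx},{uly},{lrx - ulx},{lry - uly}"


# Corner values copied from the surface element shown earlier in this post.
example = '<surface lrx="1540" lry="155" ulx="1182" uly="54"/>'
fragment = surface_to_xywh(example)  # "1182,54,358,101"
```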


The tei:graphic was replaced with tei:media[@type='IIIFService'], with the @url pointing to the IIIF service URI instead of an image location. XSLT transformations for the manifest, HTML, RDF, and Solr outputs do the rest.

The Javascript has been updated so that clicking on a page under the index of annotations will force Mirador to change to the correct canvas.

You can see an example here: http://numismatics.org/archives/id/nnan187715

I will post another update on EAD and MODS -> IIIF next week. 

Thursday, August 10, 2017

First DOIs minted for ANS Digital Library items

Several weeks ago, we migrated an older, circa 2002 TEI ebook on the Taranto 1911 hoard, authored by John Kroll and Sebastian Heath, into our Digital Library. The original TEI file and subsequent updates have been loaded into our TEI Github repository. The updates follow transcription precedents that we have set in older ANS-published printed monographs as part of the Mellon-funded Open Humanities Book Program: relevant places, objects, people, etc. have been linked to entities in LOD systems, such as Nomisma.org. All of the objects within this hoard (itself linked to IGCH 1864) are in the British Museum and linked to their URIs. Upon publication into the ANS Digital Library, the document parts are now accessible from the IGCH 1864 record and will (eventually) be accessible in Pelagios, connected to relevant ancient places.

Since Sebastian is an active scholar, with an ORCID, this document served as a proof of concept for the next iteration of ANS digital publication: that our current and future monographs and journal articles, once issued openly online, should be connected to ORCIDs for their authors, and publication metadata should be submitted to Crossref to mint a DOI and enhance accessibility. Furthermore, since there's a direct connection between ORCID and Crossref submissions, this new digital publication workflow would automatically populate an author's scholarly profile with ANS publications. This is a vast improvement over the likes of Academia.edu, which requires manual submission. The broad vision is this:

Regardless of whether an author submits works through the American Numismatic Society Digital Library, Zenodo.org, Humanities Commons, their own institutional repository, or an Open Access journal system, their ORCID profile is the central, canonical aggregation of the entirety of their intellectual output (which includes datasets, software, etc.).

This aggregation system between DOIs and ORCIDs, following Linked Open Data principles, is the future of academic publication. Ideally, it should be expanded beyond citations to modern works with DOIs and ORCIDs to include more historic works defined by Worldcat and linked to historic scholars with ISNI identifiers. It would take a tremendous amount of work, but in theory, it would be possible to create a network graph of citations across all disciplines, going back in history to the advent of the printed book, charting the evolution of how knowledge is generated and disseminated. Therefore, Crossref, ISNI, and ORCID would perhaps play a greater role than providing simple (and superficial) citation metrics in enabling us to develop a broader historiography and analysis of scholarship itself. We plan to mint DOIs for our historical publications eventually, if Crossref extends its XML schema to support ISNI identifiers.

Under the Hood

Some extensions were implemented in ETDPub, the TEI/MODS publication framework that underlies the ANS Digital Library. First, I authored XSLT stylesheets that crosswalk TEI or MODS into the appropriate Crossref XML model according to their schema version 4.4.0. You can see an example of my MA thesis here: http://numismatics.org/digitallibrary/ark:/53695/gruber_roman_numismatics.xref.

XSLT:
If the author/editor URI matches an ORCID URI in the TEI, then the Admin panel in ETDPub will enable the publication of the metadata to Crossref. Similarly, within the MODS ETD editing interface (in XForms), a user can insert a mods:nameIdentifier[@type='orcid'] under the mods:name for an author/editor in order to capture the ORCID. So far, only TEI or MODS records with ORCIDs attached to people are available for submission into Crossref to mint a DOI.

Submission Workflow

In the admin panel, if a document is eligible for submission to Crossref, a checkbox is available. Clicking on this will fire off a series of actions in the XForms engine:
  1. The TEI/MODS-to-Crossref XML transformation is executed and loaded into an XForms instance
  2. The Crossref XML is serialized to /tmp because it must be attached via multipart/form-data
  3. Because I am still having difficulty getting multipart/form-data to execute correctly in the XForms engine, the XForms engine instead interacts with a PHP script in CGI
  4. After the PHP script responds with a successful HTTP code, the MODS/TEI document is loaded in the XForms engine in order to insert the DOI in the proper location within the document
  5. The TEI/MODS file is saved back to eXist, and the standard publication workflow is executed (a chain of XForms submissions), updating the Solr search index and the triplestore/SPARQL endpoint
So far two documents in the Digital Library have DOIs connected to ORCIDs:

Taranto 1911: http://dx.doi.org/10.26608/taranto1911
My thesis (Recent Advancements in Roman Numismatics): http://dx.doi.org/10.26608/gruber_roman_numismatics

Friday, July 14, 2017

Improved mapping in EADitor - Brett archaeology photos as a test

At long last, I have migrated from OpenLayers to Leaflet in EADitor. This required modifications in two areas: the HTML pages for rendering EAD finding aids and the map interface. As a result, I introduced two new serializations:

  • The map interface renders Solr search results as GeoJSON (instead of OpenLayers displaying Solr->KML as before)
  • A transformation of an EAD finding aid into GeoJSON. A GeoJSON point is created for all unique mappable places from Geonames or Pleiades, and coordinates are extracted in real time by reading Geonames APIs or Pleiades RDF. The GeoJSON features include references to all uniquely addressable components that include that place in the controlaccess element. You can append the extension '.geojson' to get a JSON response. Content negotiation will be implemented eventually. See http://numismatics.org/archives/ark:/53695/nnan0037.geojson for example.
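
To give a rough idea of the shape of such a response, here is a minimal Python sketch of a GeoJSON feature for one place. It is not EADitor's actual output: the property names, the place URI, the label, and the coordinates are all hypothetical, and only the Feature/Point/FeatureCollection structure comes from the GeoJSON specification.

```python
import json


def place_feature(uri, label, lon, lat, component_uris):
    """Build one GeoJSON Feature for a place and the components that cite it."""
    return {
        "type": "Feature",
        # GeoJSON coordinate order is longitude, then latitude
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        # Property names here are assumptions, not EADitor's real keys
        "properties": {"uri": uri, "name": label, "components": component_uris},
    }


def feature_collection(features):
    """Wrap a list of features in a FeatureCollection."""
    return {"type": "FeatureCollection", "features": features}


collection = feature_collection([
    place_feature(
        "https://example.org/place/1",  # hypothetical place URI
        "Example place",
        23.7,
        38.0,
        ["http://numismatics.org/archives/ark:/53695/nnan0037#d1e131"],
    )
])
geojson_text = json.dumps(collection, indent=2)
```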

 

Restructuring the Agnes Baldwin Brett finding aid

Agnes Baldwin Brett was a curator at the ANS from 1909-1912 and a prominent scholar of Greek numismatics. Our archives hold a variety of interesting materials, including photographs from her travels around Greece, Italy, and Turkey in the early 1900s. Numerous photos have been digitized, uploaded to flickr Commons, and linked to the Brett EAD finding aid. Some photographs were identified and described (with brief text snippets) by ANS archivist David Hill, but all photographs were placed in a single series-level component. All identifiable places were linked in EADitor's Geonames lookup mechanism in a top-level controlaccess element. There was no direct correlation between individual photographs and the people, places, and things depicted.

In order to demonstrate the full functionality of the new mapping interface, I finally took the time to restructure the finding aid so that each photograph would appear in its own item-level component with a controlaccess element enabling individual identification of the place depicted in the photo. Furthermore, while many finding aids have been linked to modern places defined in Geonames, the Brett collection of archaeological photographs provided an opportunity to link photos to ancient places in Pleiades, which would, in turn, open the door to the integration of these valuable materials into the wider Linked Ancient World Data cloud via Pelagios. The photos feature Mycenaean tombs, Greek temples, and even the Grave Stele of Hegeso.

Identifying individual monuments within Athens


Not only that, some photographs feature other students from the American School of Classical Studies at Athens who went on to be prominent scholars later in life. Since many of these scholars have produced published works and archival materials held at other institutions, they have URIs in the Social Networks and Archival Context project. EADitor has had SNAC lookups for quite some time, and so I was able to link photos to these URIs when applicable. I hope that we can make these photos available to researchers even beyond the ancient world.

Linking people to SNAC
In addition to the tagging of places and people, many photographs feature known archaeological monuments that are notable enough to warrant their own Wikipedia articles, and therefore Wikidata entity URIs. I extended the subject lookup mechanism in EADitor beyond the standard Library of Congress Subject Headings to query the Wikidata API, embedding entity IDs directly into the EAD finding aid, which are then transformed into dcterms:subject URIs upon RDF serialization.

 

EAD to RDF

Since each individual component has an ID in EADitor, each component is uniquely addressable by fragment identifiers, e.g., http://numismatics.org/archives/ark:/53695/nnan0037#d1e131. After I made some minor modifications to the RDF output to conform with the emerging schema.org archival extension, these Wikidata, SNAC, Pleiades, and Geonames URIs are now exposed in the RDF for each component, and the components are hierarchically linked together.

@prefix arch: <http://purl.org/archival/vocab/arch#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://numismatics.org/archives/ark:/53695/nnan0037#d1e131> a schema:ArchiveItem ;
    dcterms:coverage <http://www.geonames.org/264371> ;
    dcterms:date "1900-12-07"^^xsd:date ;
    dcterms:identifier "06-00242" ;
    dcterms:isPartOf <http://numismatics.org/archives/ark:/53695/nnan0037#c_92f631e3f903281a8cdedbfebfca0654> ;
    dcterms:subject <http://socialarchive.iath.virginia.edu/ark:/99166/w61c5qjp> ;
    dcterms:title "American School students wearing bug bags" ;
    dcterms:type <http://vocab.getty.edu/aat/300046300> ;
    foaf:depiction <http://farm9.staticflickr.com/8320/8003385533_c83827b679_o.jpg> ;
    foaf:thumbnail <http://farm9.staticflickr.com/8320/8003385533_55f1f093b1_t.jpg> .

This RDF is posted into Archer's SPARQL endpoint.

Archer RDF → SPARQL → Pelagios RDF

Now that we have numerous uniquely addressable photographs linked to Pleiades URIs published in our SPARQL endpoint, it was a breeze to create an RDF export for Pelagios. The export is essentially a DESCRIBE query whose results, expressed in our RDF model, are run through XSLT into the Pelagios data model.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
DESCRIBE ?s WHERE {
 ?s dcterms:coverage ?place FILTER (strStarts(str(?place), 'https://pleiades.stoa.org'))  
}

The link to the Pelagios VoID is available on the front page of Archer. It is generated by an ASK query similar to the one above to see whether there are any objects in the SPARQL endpoint with Pleiades places expressed by the dcterms:coverage property.
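
As a rough illustration of that check, the sketch below builds a SPARQL Protocol GET URL for an ASK query in Python. It is an approximation, not Archer's code: the endpoint URL is a placeholder, the function name is hypothetical, and the query text is my reconstruction of an ASK version of the DESCRIBE query shown earlier.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed ASK counterpart of the DESCRIBE query; the filter is copied from it.
ASK_QUERY = """PREFIX dcterms: <http://purl.org/dc/terms/>
ASK {
  ?s dcterms:coverage ?place FILTER (strStarts(str(?place), 'https://pleiades.stoa.org'))
}"""


def ask_url(endpoint):
    """Build a SPARQL Protocol GET URL that carries the ASK query."""
    return endpoint + "?" + urlencode({"query": ASK_QUERY})


url = ask_url("https://example.org/sparql")  # placeholder endpoint
```

A true response would mean at least one object has a Pleiades place, which is the condition for showing the VoID link.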

Summary

The Brett collection is incredibly interesting, and I hope that we will be able to digitize more photographs and the corresponding travel diary at some point in the future. There are still many photographs that haven't been identified, and so perhaps we might be able to accomplish this through crowdsourcing. We will implement a IIIF server by the end of summer and begin the transition of our archival materials into IIIF--not only photographs, but also the Newell diaries. Perhaps one day we will be able to annotate the people, places, and things from the Brett diary and photographs with Mirador or a similar IIIF viewer. While Pelagios integration is somewhat imminent, the aggregation of disparate archival holdings through shared SNAC identifiers is still further off on the horizon.