Friday, May 30, 2014

Enabling a SPARQL endpoint in EADitor and xEAC

Not long ago, I discussed the enhancement of both xEAC and EADitor by connecting them through a SPARQL endpoint. I have extended this further by enabling a wrapper to this endpoint in both xEAC and EADitor. It didn't take long to implement, as I basically copied and pasted some files and code from the nomisma.org GitHub repository.

Essentially, once SPARQL endpoint URLs have been entered into the config for xEAC or EADitor, a checkbox is made available to enable a wrapper to the endpoint. By wrapper, I mean that a pipeline is created for the SPARQL query interface and the query response. This wrapper interacts with the SPARQL endpoint directly, but the SPARQL page in xEAC or EADitor carries the same style as the rest of the site. If HTML results are selected, the SPARQL XML response is transformed by an XSLT stylesheet in xEAC/EADitor into HTML that also conforms to the style of the application. It's essentially identical to the functionality on http://kerameikos.org/sparql: the page and query response are merely interfaces to Fuseki's endpoint.
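To make the HTML step of the wrapper concrete, here is a minimal sketch in Python. It is not the XSLT stylesheet that xEAC and EADitor actually use; it only illustrates the same idea of turning a SPARQL XML result set into an HTML table. The function name results_to_html and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"

# Hypothetical sample response, for illustration only.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="uri"/><variable name="title"/></head>
  <results>
    <result>
      <binding name="uri"><uri>http://example.org/findingaid/1</uri></binding>
      <binding name="title"><literal>Letters &amp; papers</literal></binding>
    </result>
  </results>
</sparql>"""


def results_to_html(sparql_xml: str) -> str:
    """Render a SPARQL XML result set as a plain HTML table."""
    root = ET.fromstring(sparql_xml)
    # Column names come from <head><variable name="..."/></head>.
    variables = [v.get("name") for v in root.iter(SPARQL_NS + "variable")]
    rows = []
    for result in root.iter(SPARQL_NS + "result"):
        # Map each bound variable to the text of its <uri> or <literal> child.
        bound = {}
        for binding in result.findall(SPARQL_NS + "binding"):
            bound[binding.get("name")] = "".join(binding.itertext()).strip()
        # Unbound variables become empty cells.
        cells = "".join(f"<td>{escape(bound.get(v, ''))}</td>" for v in variables)
        rows.append(f"<tr>{cells}</tr>")
    header = "".join(f"<th>{escape(v)}</th>" for v in variables)
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"
```

Calling results_to_html(SAMPLE) yields a one-row table with the columns uri and title. A real wrapper would also wrap that table in the application's page template.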


Why is the SPARQL endpoint useful?

SPARQL is a complicated beast, and I do plan on writing documentation on the ontologies used and models implemented in both xEAC and EADitor. Most users will likely not use the SPARQL query endpoint directly. But the major point is: it exists for a small subset of users that want to perform really sophisticated queries on the dataset. In the same vein, xEAC will also eventually expose an XQuery interface for performing different types of complex queries on the dataset.

The real advantage: building UIs on SPARQL

As demonstrated in a variety of other projects that I work on, the best uses of SPARQL are the ones you don't even notice. On http://kerameikos.org/id/red_figure, the timeline/map, list of thumbnails, and chart (if generated) showing the distribution of a particular typology are all generated by SPARQL queries and rendered into something that is far more visually understandable to a human being. The same goes for a chart showing the change in weight of Roman imperial denarii from the start of the Roman Empire in 27 B.C. to about A.D. 220 (the end point for which we currently have data): http://numismatics.org/ocre/visualize?measurement=weight&chartType=line&interval=5&fromDate=-30&toDate=220&sparqlQuery=nm%3Adenomination+%3Chttp%3A%2F%2Fnomisma.org%2Fid%2Fdenarius%3E#measurements .

While there is still much work to do in refining the ontologies and models used for representing EAC-CPF or EAD records as RDF, now that the SPARQL publication mechanism actually functions, it will be possible to begin to build more sophisticated visualizations on top of these queries--to incorporate social network graph visualizations into xEAC, dynamically generated by SPARQL and easily manipulated and navigated by users.

The first step, however, is to be able to generate lists of related resources, such as a test finding aid linked to Edward T. Newell, below:


 
We plan to deploy xEAC and the current version of EADitor into production at the American Numismatic Society fairly soon for Archer 2.0. While the ANS' archives are fairly small, I think you can get the idea of the potential for a large collection of entity records, like SNAC, when you are able to link millions of entities to many tens of millions of related materials.

Similarly, in EADitor, when a finding aid has been linked to an entity in xEAC (and the @type of the persname, famname, or corpname has been set to xeac:entity [automatically done in the editing interface]), EADitor will extract biographical information directly from the EAC-CPF record:


Friday, May 16, 2014

Linking archival entities and resources with SPARQL

Linked open data methodologies have an important role to play in the future of archival description.

With this in mind, I am moving EADitor and xEAC further in this direction. Both frameworks already support serialization of EAD or EAC-CPF into RDF. xEAC supports three different RDF models, in fact, depending on which community the data are intended to serve. EADitor transforms EAD finding aids into the Arch ontology. There are no true standards yet for representing archival resources as RDF, but I am hopeful that some will emerge. Since EADitor is now capable of embedding xEAC URIs for archival entities into EAD finding aids, the next step in linking resources together is the implementation of an RDF triplestore and SPARQL endpoint into both xEAC and EADitor.


An Example: Resource Relations

EAC-CPF records two primary types of relations: CPF Relations, which are links to other corporate, personal, or family entities, and Resource Relations, which are links to other resources by or about an entity that may be available on the web. It makes a lot of sense to me to store CPF relations within the EAC-CPF record, especially if these related entities are stored in the same information system (like xEAC). On the other hand, I don't think it makes sense to store resource relations within the EAC-CPF record, mainly because I think that it's far too complicated to maintain a growing list of relationships.

Let's suppose you have an entity, Thomas Jefferson, defined by a URI, http://example.org/thomas_jefferson. A significant portion of his collection is contained at the University of Virginia and Monticello, but he corresponded with many of his other contemporaries. Therefore, the papers of John Adams or George Washington may also contain letters from Jefferson. He also corresponded with numerous prominent Europeans, so some of his materials may be contained in archives overseas. There may be dozens or hundreds of institutions which contain at least one article by or about Jefferson. If each of these archives adopts a stable URI that defines Jefferson, then it is much easier to accept RDF derived from EAD, MODS, or a relational database into an RDF triplestore, and use SPARQL to gather these materials dynamically from the triplestore when a researcher accesses the http://example.org/thomas_jefferson entity record.

This is the approach that I am implementing in xEAC and EADitor. For example, if the <origination> within the <did> of a finding aid (or individual component) contains an entity URI, the RDF derivative of the finding aid will link the archival resource to the archival entity through the dcterms:creator property. In EADitor, once the SPARQL endpoint URLs for querying, publishing, and updating data have been established, the RDF will be posted into the triplestore when the finding aid has been designated for publication on the web. Likewise, if these endpoint URLs have been added into the xEAC configuration file, the XSLT template for generating HTML from EAC-CPF will query the SPARQL endpoint to list related resources that have been pushed into the triplestore from EADitor. Furthermore, EADitor itself isn't necessary for this functionality in xEAC. RDF may be pushed into the triplestore by other means--from an institutional repository, from ArchivesSpace, or something else. You could even feed data from Europeana into a triplestore to build a prosopography of Impressionist artists. The sky is the limit.




The SPARQL query looks something like this:

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?uri ?title WHERE {
  ?uri dcterms:creator <URI> ;
       dcterms:title ?title
}
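As a rough illustration of how that query could be filled in for a specific entity, the sketch below substitutes an entity URI for the <URI> placeholder and builds a GET request URL. It is written in Python for readability; xEAC itself does this in its XSLT/XForms layer. The function names and the endpoint URL in the usage note are assumptions, not part of either application.

```python
import re
from urllib.parse import urlencode

QUERY_TEMPLATE = """PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?uri ?title WHERE {
  ?uri dcterms:creator <%s> ;
       dcterms:title ?title
}"""


def related_resources_query(entity_uri: str) -> str:
    """Return the related-resources query for one entity URI."""
    # Reject punctuation and whitespace that SPARQL does not allow inside
    # <...>, so a malformed URI cannot alter the query.
    if re.search(r'[<>"{}|^`\\\s]', entity_uri):
        raise ValueError(f"not usable as a SPARQL IRI: {entity_uri!r}")
    return QUERY_TEMPLATE % entity_uri


def query_request_url(query_url: str, entity_uri: str) -> str:
    """Build a GET URL that passes the query in the 'query' parameter."""
    return f"{query_url}?{urlencode({'query': related_resources_query(entity_uri)})}"
```

For example, query_request_url("http://localhost:3030/ds/query", "http://example.org/thomas_jefferson") produces a URL that could be fetched from a Fuseki-style query service, assuming one is running at that hypothetical address.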

There is still work to be done in the UI, but the underlying technological functionality is now available in the GitHub repository for both applications.

Technical mumbo jumbo

The functionality for linking EADitor and xEAC to SPARQL endpoints is identical in both applications. The URLs are added through the Settings page. Under the SPARQL heading, the user clicks the "Connect" button, which launches a popup window requiring the user to input three separate URLs: the query URL, the Graph Store URL, and the SPARQL/Update URL. These URLs may vary from application to application, especially if the configurations have been changed. Note that a SPARQL 1.1-compliant endpoint is required.


After entering these URLs and clicking "Connect," the XForms engine will test each one individually. First, it will attempt a basic query. Then, it will post RDF into the Graph Store URL, and if successful, an XForms submission will be executed to delete the graph through SPARQL/Update. If all three processes complete successfully, the URLs will be added into the EADitor/xEAC config, and then the config can be saved. The user is also presented with the option to post all records that have been slated for publication (in Solr) into the RDF triplestore.
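The three-step check described above can be sketched outside of XForms as three HTTP requests. This is a Python approximation under stated assumptions, not the XForms submissions themselves: the test graph name, the test RDF, the ASK query, and the function names are all hypothetical, and the URLs are assumed not to carry a query string already.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

TEST_GRAPH = "http://example.org/connection-test"  # hypothetical graph name

# A tiny RDF/XML document used only to prove that posting works.
TEST_RDF = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/connection-test/doc">
    <dcterms:title>connection test</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""


def build_test_requests(query_url, graph_store_url, update_url):
    """Return the query, Graph Store, and update requests, in that order."""
    ask = urlencode({"query": "ASK { ?s ?p ?o }"})
    graph = urlencode({"graph": TEST_GRAPH})
    return [
        # 1. A basic query against the query URL.
        Request(f"{query_url}?{ask}",
                headers={"Accept": "application/sparql-results+xml"},
                method="GET"),
        # 2. Post RDF into a named graph through the Graph Store URL.
        Request(f"{graph_store_url}?{graph}", data=TEST_RDF,
                headers={"Content-Type": "application/rdf+xml"},
                method="POST"),
        # 3. Delete that graph again through SPARQL/Update.
        Request(update_url,
                data=f"DROP GRAPH <{TEST_GRAPH}>".encode("utf-8"),
                headers={"Content-Type": "application/sparql-update"},
                method="POST"),
    ]


def check_connection(requests, send=urlopen):
    """Send each request in order; stop and return False on the first failure."""
    for request in requests:
        try:
            with send(request) as response:
                if not 200 <= response.status < 300:
                    return False
        except URLError:
            return False
    return True
```

With a default Fuseki dataset named ds (an assumption), the three URLs would typically end in /ds/query, /ds/data, and /ds/update. The send parameter exists so the check can be exercised without a live triplestore.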

From the admin page, when a user deletes a record from the eXist database or removes the record from publication, the triples will be purged from the endpoint and the docs deleted from Solr. When a user publishes a record, the record will be serialized into RDF and posted into the triplestore in addition to the usual Solr publication. Likewise, triples will be updated when the user saves a published record from the EAD/EAC-CPF editing pages.

Ergo, enterprise archival linked open data publishing.

There is still work to do here: the ontology and RDF data models still need work, which is more of a community effort. And of course, I have a lot of plans for enhancing the user experience.

Once this new publication model is fully functional, I will begin SPARQL-based visualizations of social networks and relations between entities and their archival resources.

Incorporating xEAC entities into EAD finding aids

xEAC and EADitor were both conceived as standalone applications. This is what separates xEAC, especially, from other authority management modules that come packaged in larger archival suites like ICA-AtoM and ArchivesSpace. xEAC is applicable toward LAM authority control, the primary audience for EAC-CPF, but can also be applied to scholarly prosopographies (and will eventually support social network analysis built upon linked open data methodologies).

It is now possible to hook xEAC and EADitor together through an intermediate, optional, RDF triplestore and SPARQL endpoint. This will be discussed in greater detail in a later blog post. This particular post will detail the more immediate connection between entities defined in xEAC and personal, family, and corporate names within EAD finding aids.

Since its inception, xEAC has provided a Solr-based Atom feed for published EAC-CPF records. The Atom feed returns results based on the Lucene query syntax. A number of fields are available to narrow the search. For example, the entityType_facet Solr field allows a user to search for a name of a particular entity type, which is defined in the EAC-CPF schema as being either "person," "family," or "corporateBody." See http://admin.numismatics.org/xeac/feed/?q=augustus%20AND%20entityType_facet:person for example. The results are machine readable, and therefore the EADitor XForms application can read and process the search results. The interface for persname, corpname, and famname in EADitor has been adjusted to include xEAC lookups, and the functionality is practically identical to the VIAF lookup mechanism (which returns results in RSS as opposed to Atom).
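For readers who want to consume the feed outside of EADitor, here is a small Python sketch of the lookup. It builds a feed URL in the same shape as the example above and reads entries from an Atom document. The function names are hypothetical, and the assumption that each entry carries a title and a link whose href is the entity URI has not been checked against xEAC's actual output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

ATOM = "{http://www.w3.org/2005/Atom}"


def feed_url(xeac_home, term, entity_type=None):
    """Build a feed URL, optionally narrowed by the entityType_facet field."""
    query = term
    if entity_type:
        # Lucene syntax: combine the search term with a fielded clause.
        query = f"{term} AND entityType_facet:{entity_type}"
    # Keep ':' unescaped so the URL matches the form used in the example above.
    return f"{xeac_home.rstrip('/')}/feed/?q={quote(query, safe=':')}"


def parse_entries(atom_xml):
    """Return (title, href) pairs for every entry in an Atom feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        entries.append((title, href))
    return entries
```

Here feed_url("http://admin.numismatics.org/xeac/", "augustus", "person") reproduces the example URL given above. Fetching that URL and passing the response body to parse_entries would give a list of candidate entities to choose from.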

Integrating a xEAC lookup mechanism into EADitor was incredibly easy as a result, and I managed to implement it in about 30 minutes. The Settings page for EADitor now includes an input for the xEAC home page URL. An XForms submission will process this URL and append 'feed/' to it to ascertain whether Atom XML is available at that resource. If so, the xEAC URL will be committed into the EADitor configuration file. If the xEAC URL is in the config, the persname, corpname, and famname element XBL components within the XForms application will include a radio button to select the xEAC lookup, in addition to VIAF and local vocabulary. When you select an entity after performing the lookup, the entity's URI will be embedded in the @authfilenumber attribute. In EAD3, this attribute will be @vocabularysource (I think), following linked data advancements in EAC-CPF.


When linking the EAD finding aid to an entity defined in xEAC, the @type attribute of the persname, corpname, or famname will be set to 'xeac:entity'. Ideally, I would like to avoid system-defined attributes, but I think they are very useful in this case: the value indicates to the EADitor UI XSLT stylesheets that the EAC-CPF XML can be retrieved programmatically by appending '.xml' to the entity URI, and therefore biographical information may be extracted directly from the EAC-CPF entity record. This, I believe, is the dream of the creators of EAC-CPF. Authority information and archival/biographical context are stored separately from the finding aid, yet the information is made available through the finding aid user interface by means of linked open data methodologies.
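The rule described here (a name element typed as xeac:entity, with the entity URI in @authfilenumber, resolves to an EAC-CPF document at the URI plus '.xml') can be sketched as follows. This is a Python illustration, not EADitor's XSLT; the function name and the sample URIs in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

NAME_ELEMENTS = ("persname", "corpname", "famname")


def eac_cpf_urls(ead_xml):
    """List the EAC-CPF XML URLs for names linked to xEAC entities."""
    root = ET.fromstring(ead_xml)
    urls = []
    for element in root.iter():
        # Compare local names so namespaced and non-namespaced EAD both work.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in NAME_ELEMENTS and element.get("type") == "xeac:entity":
            uri = element.get("authfilenumber")
            if uri:
                # The EAC-CPF record is assumed to live at the entity URI + '.xml'.
                urls.append(uri + ".xml")
    return urls
```

Given a <did> whose <origination> holds a persname with type="xeac:entity" and authfilenumber="http://example.org/xeac/id/newell", the function returns a list containing http://example.org/xeac/id/newell.xml. Names linked to VIAF or a local vocabulary are skipped because they lack the xeac:entity type.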

I have not yet built these hooks into EADitor's finding aid user interface, but expect them to be available when the new version of EADitor is released later this summer. This feature represents a major advancement in the publication of archival materials.

But wait, there's more.