Monday, August 19, 2013

Adding value to Flickr Commons: Machine Tags

It has been possible since earlier EADitor betas to use the Flickr APIs to pull thumbnail and reference-sized images given the photograph's URI.  If we can use EADitor to create excellent metadata utilizing linked open data web services from Geonames, Pleiades, VIAF, LCSH, and, soon, the Getty thesauri, why not push these metadata back into Flickr itself with machine tags?  Libraries, archives, and museums have for years uploaded historical photographs to Flickr, but meaningful metadata are typically lacking.  Flickr is regarded as a "cool" thing to do, but the service itself is underutilized as an aggregation engine.  What if we took Flickr more seriously as a repository for open content?

Institutions large and small have uploaded photographic collections to Flickr.  But what if you want to see all photos relating to baseball in Flickr Commons?  Or what if you want to see all photographs taken in Chicago?  You can't unless the textual metadata for the images contain your search keywords.

But, as usual, the Pleiades Project is ahead of the curve.  As Sean Gillies has detailed, a growing number of Flickr users have adopted the pleiades namespace and a handful of useful predicates (atteststo, origin, depicts, findspot) to associate their photos with Pleiades places.  There are now almost 12,000 photos in the pleiades namespace.
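To make the search problem concrete, here is a minimal sketch of how one might query Flickr for photos carrying a given Pleiades machine tag.  It only builds the request URL for the flickr.photos.search method and its machine_tags parameter.  The helper name, the placeholder API key, and the example Pleiades ID (579885, Athens) are my own illustrations and not part of EADitor.

```python
from urllib.parse import urlencode

# Flickr's REST endpoint; the helper below is a hypothetical illustration.
FLICKR_REST = "https://api.flickr.com/services/rest/"

def machine_tag_search_url(api_key, predicate, pleiades_id):
    """Build a flickr.photos.search URL for one pleiades machine tag."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,  # your own Flickr API key
        # Machine tags take the form namespace:predicate=value
        "machine_tags": f"pleiades:{predicate}={pleiades_id}",
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)

if __name__ == "__main__":
    # Photos tagged as depicting Athens (example ID, assumed for illustration)
    print(machine_tag_search_url("YOUR_API_KEY", "depicts", "579885"))
```

Nothing here contacts Flickr; fetching the URL with a valid key would return the matching photos as JSON.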

So, suppose you have an archaeological archive.  You've already used EADitor to associate your manuscripts and photographs with Pleiades URIs.  Now, if you're using Flickr as your archive's photo repository, you can inject these IDs back into your image metadata in the form of machine tags.  EADitor now authenticates against Flickr with OAuth so that you can log in as a Flickr user, which in turn allows EADitor to call the flickr.photos.setTags API method on your behalf.
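As a rough illustration of what gets sent, the sketch below formats machine tags and joins them into the single space-delimited string that flickr.photos.setTags expects in its tags argument.  The function names and the example ID are hypothetical, and this is not EADitor's actual XForms code.  The OAuth signing step is deliberately left out.

```python
def machine_tag(namespace, predicate, value):
    """Format one machine tag as namespace:predicate=value."""
    return f"{namespace}:{predicate}={value}"

def tags_argument(tags):
    """Join tags into the space-delimited string used by setTags."""
    return " ".join(tags)

if __name__ == "__main__":
    # Hypothetical example: a photo that depicts and was found at
    # Pleiades place 579885 (Athens).
    tags = [
        machine_tag("pleiades", "depicts", "579885"),
        machine_tag("pleiades", "findspot", "579885"),
    ]
    print(tags_argument(tags))
```

As far as I know, setTags replaces a photo's whole tag list rather than appending to it, so a real client would likely want to merge the new machine tags with the photo's existing tags before making the call.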

The interface is quite simple.  After you have associated a Flickr image URI with your Digital Archival Object Group (daogrp element in EAD), you can click on the "+Tags" XForms trigger to load a window which contains potential machine tags from your daogrp's nearest parent component.  The predicates are bound to drop-down menus that can be specified in your EADitor config.  You can also add custom tags.

Once you click the Apply Tags button, the tags will be pushed to Flickr, and will then be available through the Pleiades interface shortly thereafter.

Flickr provides a lower barrier to entry than, say, the Pelagios Project, since individual users who are unattached to digital humanities projects or organizations can link their photos to Pleiades.  I think there's enormous potential in machine tagging.  Sure, access to imagery is broadened when the Pelagios web page for an ancient place links back to the machine tag search results page on Flickr, but what if you can develop a machine tag system for specific pieces of artwork?  Perhaps it might be possible to crowdsource photographs of museum objects which can be fed into photogrammetry software for the creation and dissemination of open 3D models of statues, vases, or other such artifacts.

Integrating Pleiades/Pelagios into Archaeological Archives

EADitor has for quite some time incorporated Geonames lookups into the EAD-editing interface to link archival collections and subordinate components to modern geographic places.  The EAD geogname XBL component has been updated to support the Pleiades Gazetteer of Ancient Places search services so that place names and Pleiades IDs can be extracted by means of the project's RSS feed response.

So after we select the correct Pleiades place from the list, the @authfilenumber will be populated with the Pleiades ID, and the @source is 'pleiades.'  When the finding aid is published to Solr, a lookup of the RDF serialization is performed on the Pleiades URI, and geographic coordinates are stored in Solr to power the mapping interfaces.  Furthermore, the URI is reconstructed given the @authfilenumber and @source and stored in the Solr index (along with URIs for Geonames, VIAF, LCSH, and other controlled vocabulary sources).
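The URI reconstruction step can be sketched roughly as follows.  Only the 'pleiades' value of @source is attested above; the other keys and all of the URI patterns in the table are my assumptions about how such a mapping might look, not EADitor's actual XSLT.

```python
# Assumed mapping from an EAD @source value to a URI pattern.
# Only the "pleiades" key comes from the post; the rest are illustrative.
URI_TEMPLATES = {
    "pleiades": "https://pleiades.stoa.org/places/{id}",
    "geonames": "http://sws.geonames.org/{id}/",
    "viaf": "http://viaf.org/viaf/{id}",
    "lcsh": "http://id.loc.gov/authorities/subjects/{id}",
}

def reconstruct_uri(source, authfilenumber):
    """Rebuild a linked-data URI from @source and @authfilenumber.

    Returns None when the source vocabulary is not recognised.
    """
    template = URI_TEMPLATES.get(source.strip().lower())
    if template is None:
        return None
    return template.format(id=authfilenumber.strip())

if __name__ == "__main__":
    # Example Pleiades ID (Athens), used purely for illustration
    print(reconstruct_uri("pleiades", "579885"))
```

Storing only the identifier and source in the EAD, and deriving the URI at publication time, keeps the finding aid itself insulated from changes to a vocabulary's URL scheme.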

Like Numishare, EADitor now has pipelines for pelagios.void.rdf and pelagios.rdf so that each finding aid in the system is represented as an oac:Annotation in the Pelagios RDF dump.  Each annotation includes associated Pleiades URIs.

Currently, EADitor only supports the publication of collections as a whole as individual Solr documents, and thus the oac:Annotation reflects the entire finding aid.  However, in the near future, I will implement a different publication system that allows EADitor users to publish individual components on an atomized level: selecting higher level series for publication as Solr docs and even components on the item level.  Already, EADitor has been updated to support the display of individual components at uniquely addressable URIs by appending '/' + the component's @id to the finding aid URI.  The '.rdf' extension can be added to the finding aid or component URI to receive an RDF serialization back, which conforms to Aaron Rubinstein's arch ontology.  I plan to make EADitor's linked open data standards conform to the Linked Archival Metadata (LiAM) Guidebook in the long-term.
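The addressing scheme described here (append '/' plus the component's @id, and optionally '.rdf') is simple enough to sketch.  The function name, the example base URI, and the component id below are hypothetical.

```python
def component_uri(finding_aid_uri, component_id=None, rdf=False):
    """Build the URI of a finding aid or one of its components.

    component_id is the component's @id; rdf=True requests the
    RDF serialization by appending the '.rdf' extension.
    """
    uri = finding_aid_uri.rstrip("/")
    if component_id:
        uri = f"{uri}/{component_id}"
    if rdf:
        uri += ".rdf"
    return uri

if __name__ == "__main__":
    base = "http://example.org/eaditor/id/coll001"  # hypothetical finding aid URI
    print(component_uri(base))                       # the whole finding aid
    print(component_uri(base, "c_0042"))             # one component
    print(component_uri(base, "c_0042", rdf=True))   # its RDF serialization
```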

So what does all of this mean?   Linked-data awareness built into the application increases the archive's potential for meaningful data aggregation.  An archive which consists of content about the ancient world can be incorporated into Pelagios, making this content available to researchers interested more broadly in the ancient world.  Archives and photographs from excavations in Athens, for example, will be available alongside epigraphy, coins, statues, pottery, etc. also created or found in the ancient city.  Linking to Geonames or VIAF  references can improve the user experience of the archival collection.  These resources link to dbpedia, and so biographical or contextual information can be extracted by machine-readable means and displayed in the web interface.

But wait, there's more.

Wednesday, August 7, 2013

Major Revision Coming to EADitor

After letting EADitor sit dormant for more than a year, I have begun a significant revision.  Some aspects are being completely rewritten to gear the software toward a 1.0 release shortly after the finalization of the next version of EAD.  I'll provide a short overview of the new and improved features.  First and foremost, I have migrated the project from Google Code to GitHub, which is a major step in making the code more maintainable in the long run.  The new URL is

Migration to latest Solr and Orbeon

I am migrating EADitor into the latest versions of Solr and Orbeon.  With this comes some general improvements to performance and aesthetic style.  The public user interface and the backend XForms interface have become detached from each other, making it easier to customize the public UI without affecting the backend UI, which is more scalable for higher resolution monitors.  The previous version of EADitor depended on three Solr indexes.  Two have been eliminated, leaving just one for the public interface.  The administrative interface relies more heavily on XQuery for pagination and search, and the controlled vocabulary index has been eliminated since term lookups will be performed directly on the APIs, rather than through autosuggest powered by the Solr index operating in a silo.

EAD v3

I have reviewed the recent beta release of the EAD schema.  I will begin coding support for the beta schema by the start of next week so that when the final schema is released in a few months, I'll only need to make some minor edits to the XForms and XSLT scripts.  My aim is to support EAD v3 editing and publication in EADitor as soon as possible after the final release, which will undoubtedly be well ahead of the competition.  EADitor will support the upload of EAD2002 schema or DTD-based finding aids, which will be preprocessed into EAD v3-compliant files for editing.

API integration

Currently, EADitor supports lookups on VIAF for corporate and personal names and Geonames for geographic terms.  LCSH terms could be linked to subjects in finding aids by performing lookups on an out-of-date Solr index.  I plan to expand this functionality significantly.  Yesterday, I extended EADitor to tap into search services for the Pleiades Gazetteer of Ancient Places, particularly useful for georeferencing archaeological archives.  Lookups can be performed directly on for subjects and genreforms.  I can incorporate other relevant Library of Congress APIs.  The Getty is due to release their thesauri in the form of linked open data eventually, and EADitor will be extended to query their APIs as well.  And, of course, EADitor will query NAAC web services (the National Archival Authority Cooperative, the eventual evolution of SNAC) for personal, corporate, and family names as soon as the services are available (admittedly, EADitor 1.0 will likely be released well ahead of NAAC).

Component-Level Publication

Currently, EADitor supports the traditional form of EAD publication and dissemination: finding aids are published and searched as a single Solr document and viewed as a whole.  I plan to add in support for publishing individual components (series, items, etc.) as atomically searchable and displayable documents.  Additionally, EADitor does currently support the publication and display of MODS records, and will soon support editing and uploading of MODS as well.

Export/Linked Open Data

EADitor already supports OAI-PMH integration, with finding aids represented as fairly simple Dublin Core fragments.  I plan to support more robust export serializations in RDF.  The integration of LCSH, Pleiades, VIAF, Geonames, Getty, etc. URIs into the EAD v3 records directly will enable much better linked data functionality for archives, on both the finding aid and component levels.  Since individual components will be represented in RDF (and other export formats, like MODS), components will have individually addressable URIs.  I'd like administrators of EADitor-based archival collections to be able to provide their data to major harvesters of cultural heritage content in supported standards out of the box, whether that means complying with the Digital Public Library of America or, for archives with a significant level of ancient world geographic content, with the Pelagios project.

Enhanced Flickr Support

EADitor supports linking to Flickr images in the daogrp once a Flickr API key has been entered into the configuration file.  I'd like to take this to the next level by extending the interface to send machine tags back to Flickr for VIAF, Geonames, Pleiades, etc. identifiers associated with the photographs, providing greater metadata context within Flickr itself.