Thursday, November 15, 2018

An American Europeana

This blog is usually reserved for updates or technical explanations of archival/authority software development at the American Numismatic Society, or for experimentation in new modes of archival data publication (mainly Linked Open Data).

However, since I have long been a proponent of open, community-oriented efforts to publish cultural heritage aggregations, like Europeana and DPLA, I wanted to take a bit of time to hash out some thoughts in the form of a blog post instead of starting a series of disjointed Twitter threads [1, 2].

Most of you have likely heard that DPLA laid off six employees, and John S. Bracken went online to speak about his vision and answer some questions. This vision seems to revolve primarily around ebook deals, with cultural heritage aggregation as a secondary function of DPLA. However, DPLA laid off the people who actually know how to do that stuff, so the aggregation aspect of the organization (which is its real and lasting value to the American people) no longer seems viable.

I believe the ultimate solution for an American version of Europeana is tying it into the institutional function of a federally-funded organization like the Library of Congress or Smithsonian, with the backing of Congressional support for the benefit of the American people (which is years away, at least). However, I do think there are some shorter-term solutions that can be undertaken to bootstrap an aggregation system administered by one organization or a small body of institutions working collaboratively. There doesn't need to be a non-profit organization in the middle to manage this system, at least at this phase.

There are a few things to point out regarding the system's political and technical organization:

  1. The real heavy lifting is done by the service/content hubs. It takes more time/money/professional expertise to harvest and normalize the data than it does to build the UI on top of good quality data.
  2. Much of the aggregation software has been written already, but hasn't been shared broadly with the community.
  3. There seems to be wide variation in the granularity and quality of data provided to DPLA. I wrote a harvester for Orbis Cascade that provided them with DPLA Metadata Application Profile-compliant RDF, with some normalization of strings extracted from Dublin Core to Getty AAT and VIAF URIs, modeled properly as SKOS Concepts or EDM Agents. But DPLA couldn't actually ingest RDF conforming to its own data model.
  4. Europeana has already written a ton of tools that can be repurposed. 
  5. There are other off-the-shelf tools that scale and could be adopted for either the UI or the underlying architecture (Blacklight, various open source triplestores such as Apache Jena Fuseki, which I have heard will scale to at least a billion triples).
  6. On a non-technical level, the name "Digital Public Library of America" itself is problematic, because the project has been overwhelmingly driven by R1 research libraries. Cultural Heritage is more than what you find in a Special Collections Library, and museums are notably absent from this picture (in contrast to Europeana).
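The modeling described in point 3 can be sketched concretely. Below is a minimal, dependency-free illustration of the "strings to things" step: a Dublin Core subject literal is reconciled to a Getty AAT URI and re-emitted with the skos:Concept typing the DPLA Metadata Application Profile expects. The record URI and the tiny lookup table are invented for illustration.

```python
# Minimal sketch of reconciling a Dublin Core subject string to a Getty AAT
# URI and emitting the SKOS Concept triples. Record URI and table are
# hypothetical; real reconciliation would query the Getty vocabularies.

AAT = {  # tiny illustrative reconciliation table
    "coins (money)": "http://vocab.getty.edu/aat/300037222",
}

def reconcile_subject(record_uri: str, label: str) -> list[str]:
    """Return N-Triples for one dcterms:subject: a URI object plus SKOS
    typing when reconciled, or a bare literal triple when no match."""
    uri = AAT.get(label)
    if uri is None:
        return [f'<{record_uri}> <http://purl.org/dc/terms/subject> "{label}" .']
    return [
        f'<{record_uri}> <http://purl.org/dc/terms/subject> <{uri}> .',
        f'<{uri}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        '<http://www.w3.org/2004/02/skos/core#Concept> .',
        f'<{uri}> <http://www.w3.org/2004/02/skos/core#prefLabel> "{label}"@en .',
    ]

print("\n".join(reconcile_subject("http://example.org/record/1", "coins (money)")))
```

The point is that the reconciled form carries both the link and the label, so an aggregator can index either without guessing.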

I don't know all of the details, but I had heard that DPLA had scaling issues with their SPARQL endpoint software. I don't know whether this is still an issue with that particular software, but I do believe the data were a problem. Aside from what was produced by those Orbis Cascade member organizations that opted to reconcile their strings to things (sadly, most did not take this additional step), how much of the data ingested by DPLA is actual, honest to God Linked Open Data--with, you know, links? A giant triplestore that's nothing but literals is not very useful, and it's impossible to build UIs for the public that live up to the potential of the data and the architectural principles of LOD.
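That "links versus literals" complaint can be made measurable with a toy link audit: given an N-Triples dump, count how many object positions hold URIs versus bare strings. The sample records below are invented.

```python
# Toy "link audit" of an N-Triples dump: how many triples actually link to
# things, and how many are dead-end literals? Sample data is invented.

sample = """\
<http://example.org/rec/1> <http://purl.org/dc/terms/subject> <http://vocab.getty.edu/aat/300037222> .
<http://example.org/rec/2> <http://purl.org/dc/terms/subject> "coins" .
<http://example.org/rec/3> <http://purl.org/dc/terms/spatial> "Austin (Tex.)" .
"""

def audit(ntriples: str) -> dict:
    """Count URI vs. literal objects, line by line."""
    uris = literals = 0
    for line in ntriples.splitlines():
        if not line.strip():
            continue
        # drop the trailing " .", then take the third field (the object)
        obj = line.rsplit(" .", 1)[0].split(None, 2)[2]
        if obj.startswith("<"):
            uris += 1
        else:
            literals += 1
    return {"uri_objects": uris, "literal_objects": literals}

print(audit(sample))  # -> {'uri_objects': 1, 'literal_objects': 2}
```

Run against a real aggregation dump, a ratio like this would tell you at a glance how much of the "Linked" in Linked Open Data is actually there.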

At some point, there needs to be a minimum data quality barrier to entry into DPLA, and part of this is implementing a required layer of reconciliation of entities to authoritative URIs. I understand this does create more work for individual organizations that wish to participate, but the payoffs are immense:

  1. Reconciliation is a two-way street: it enables you to pull data from external sources to enhance your own public-facing user interface (biographies of people--that sort of thing).
  2. Social Networks and Archival Context should play a vital role in the reconciliation of people, families, and corporate bodies. There should be greater emphasis in the LibTech community on interoperating with SNAC in order to create entities that exist only in local authority files, which would then enable all CPF entities to be normalized to SNAC URIs upon DPLA ingestion.
    • Furthermore, SNAC itself can interact with DPLA APIs in order to populate a more complete listing of cultural heritage objects related to that entity. Therefore, there is an immediate benefit to contributors to DPLA, as their content will simultaneously become available in SNAC to a wide range of researchers and genealogists via LOD methodologies.
    • SNAC is beginning to aggregate content about entities, so it frankly doesn't make sense for there to be two architecturally dissimilar systems that have the same function. DPLA and SNAC should be brought closer together. They need each other in order for both projects to maximize their potential. I strongly believe these projects are inseparable.
  3. With regard to the first two points, content hubs should put greater emphasis on building reconciliation services for non-technical librarians, archivists, curators, etc. to use, with intuitive user interfaces that allow for efficient clean-up. Many people (including myself) have already built systems that look up entities in Geonames, VIAF, SNAC, the Getty AAT/ULAN, Wikidata, etc. This work doesn't need to be done from scratch.
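Under the hood, such a reconciliation service boils down to fuzzy matching a local string against an authority's labels and presenting ranked candidates for a human to confirm. A sketch with the standard library's difflib; the candidate dictionary stands in for a real VIAF or Getty API response.

```python
# Sketch of the matching core of a reconciliation service: rank authority
# candidates against a local name for a cataloger to confirm. The candidate
# list is a stand-in for a real VIAF/Getty lookup response.
import difflib

def rank_candidates(local_name: str, candidates: dict, cutoff: float = 0.6):
    """candidates maps authority label -> URI; return (label, uri, score)
    tuples sorted best-first, keeping only plausible matches."""
    scored = []
    for label, uri in candidates.items():
        score = difflib.SequenceMatcher(None, local_name.lower(), label.lower()).ratio()
        if score >= cutoff:
            scored.append((label, uri, round(score, 2)))
    return sorted(scored, key=lambda t: t[2], reverse=True)

viaf_like = {  # hypothetical authority response
    "Austin, Stephen F. (Stephen Fuller), 1793-1836": "http://viaf.org/viaf/8177161",
    "Austin History Center": "http://example.org/authority/ahc",
}
print(rank_candidates("Austin History Ctr.", viaf_like))
```

The real value a hub adds is wrapping this in an interface where a curator can confirm or reject each candidate quickly, rather than editing raw metadata.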

Because DPLA's data are so simple and unrefined, many of the lowest-hanging fruits in digital collection interfaces have not been achieved, such as basic geographic visualization. Furthermore, facet fields are basically useless because there's no controlled vocabulary.

After expanding the location facet for a basic text search of Austin, I am seeing lists that appear to be Library of Congress-formatted geographic subject headings. The most common heading is "United States - Texas - Travis County - Austin", mainly from the Austin History Center, Austin Public Library. However, there are many more variations of the place name contributed by other organizations.

The many Austins

This is really a problem that needs to be addressed further down the chain from DPLA at the hub level. If you want to build a national aggregation system that reaches its full potential, more emphasis needs to be placed on data normalization.
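As a toy illustration of that hub-level normalization, the Austin variants can be collapsed to a single key and mapped to one place URI. The variant headings below are invented (modeled on what the location facet shows), and the GeoNames ID is assumed to be Austin, Texas.

```python
# Toy hub-level place-name normalization: collapse variant LCSH-style
# headings to one key, mappable to a single URI. Variants are invented;
# the GeoNames ID is assumed to identify Austin, Texas.
GEONAMES_AUSTIN = "http://sws.geonames.org/4671654/"  # assumed GeoNames URI

VARIANTS = [
    "United States - Texas - Travis County - Austin",
    "Austin (Tex.)",
    "Austin, TX",
    "Texas--Austin",
]

def normalize(heading: str) -> str:
    """Reduce an LCSH-style heading to its most specific place-name token."""
    last = heading.replace("--", " - ").split(" - ")[-1]  # most specific part
    return last.split("(")[0].split(",")[0].strip()       # drop qualifiers

# every variant collapses to the same key, which a hub can map to one URI
mapping = {v: GEONAMES_AUSTIN for v in VARIANTS if normalize(v) == "Austin"}
print(mapping)
```

Real headings are messier than this, of course, which is exactly why the work belongs at the hub, where there is expertise and tooling, rather than being repeated badly at each contributing institution.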

DPLA decided to go large scale, low quality. I am much more of a small-scale, good-quality person, because it is easier to scale up later once you have the workflows to produce good-quality data than it is to go back and clean up a pile of poor data. And I don't think that the current form of the DPLA interface is powerful enough to demonstrate the value of entity reconciliation to the librarians, curators, etc. making the most substantial investment of time. You can't get buy-in from that specialist community without demonstrating a powerful user interface that capitalizes on the effort they have made. I know this from experience: we struggled to get buy-in until we built Online Coins of the Roman Empire, and now Nomisma is considered one of the most successful LOD projects out there.

My recommendation is to go back to the drawing board with a small number of data contributors to develop the workflows that are necessary to build a better aggregation system. This process should be completely transparent and can be replicated within the other content hubs. The burden of cleaning data shouldn't fall on the shoulders of DPLA (or whoever comes next).

There are obvious funding issues here, but contributions of staff time and expertise can be more valuable than monetary contributions in this case.