Wednesday, October 3, 2012

Using dbpedia to jumpstart EAC-CPF record creation

Dbpedia offers a wealth of open information in the form of RDF that can be used for all sorts of purposes.  It contains links to resources about the Wikipedia topic available online, the birth and death dates of individuals, subjects, occupations, and a variety of relations.  Additionally, abstracts and names are available in a plethora of languages.  These data can be used to generate fairly sophisticated EAC-CPF stubs, and over the last few weeks I have implemented two approaches to generating EAC from dbpedia RDF.

  1. In xEAC, through the XForms interface.
  2. With a PHP script which is offered open source in the xEAC github repository.

Generating EAC-CPF stubs with XForms in xEAC


I'll first address #1 above.  Suppose, for example, you are creating a new record in xEAC for Alexander the Great, whose resource is represented by http://dbpedia.org/resource/Alexander_the_Great.  Clicking the "Import DBpedia Data" trigger at the top of the page launches a window in which the user enters the resource URI.  Once xEAC has checked that the URI is a viable resource, the user is presented with a list of options to check for importation: names, abstract, exist dates, CPF relations, resource relations, and thumbnail.





xEAC can import all of the names (rdfs:label) from dbpedia as nameEntries.  The label whose @xml:lang matches the xEAC system language (stored in its config file; 'en' by default) is assigned as the preferredForm, and all other names are assigned as alternativeForm.  A WIKIPEDIA convention declaration is inserted into the EAC-CPF control element.  Abstracts are imported in similar fashion.
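
The name handling described above can be sketched in a few lines of PHP.  This is only an illustration of the rule, not xEAC's actual XForms logic: the function name classifyLabels and the array layout are assumptions made for the example.

```php
<?php
// Hypothetical helper (not part of xEAC): classify rdfs:label values into
// EAC-CPF name forms. $labels maps a language code to a label, and
// $systemLang plays the role of the language stored in the xEAC config.
function classifyLabels(array $labels, string $systemLang = 'en'): array
{
    $entries = [];
    foreach ($labels as $lang => $label) {
        $entries[] = [
            'lang' => $lang,
            'name' => $label,
            // Only the label in the system language becomes the preferred form.
            'form' => ($lang === $systemLang) ? 'preferredForm' : 'alternativeForm',
        ];
    }
    return $entries;
}

// Illustrative labels for the Alexander the Great example.
$entries = classifyLabels([
    'en' => 'Alexander the Great',
    'de' => 'Alexander der Große',
    'fr' => 'Alexandre le Grand',
]);

foreach ($entries as $entry) {
    echo $entry['form'], ' (', $entry['lang'], '): ', $entry['name'], PHP_EOL;
}
```

With the default system language of 'en', only the English label is marked as the preferredForm; changing the second argument to 'de' would promote the German label instead.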



The dates of existence can be imported from dbpedia only if both the birth and death dates are expressed in the ISO 8601 yyyy-mm-dd format.  Unfortunately, dbpedia is inconsistent in this respect.

One of the most important aspects of the import process is the use of dbpedia's internal links to other personal, corporate, and family name resources to create CPF relations in the EAC record.  The dbpedia ontology includes links to the mother, father, children, dynasty/family, successor, predecessor, and influences of the resource in question.
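
The date condition mentioned above (both dates must be in yyyy-mm-dd form) might be checked along the following lines.  This is a minimal sketch; the function names isIsoDate and canImportExistDates are hypothetical, and the sample values are purely illustrative.  Note that this simple check rejects negative (BCE) years, which a real importer for ancient figures would need to handle separately.

```php
<?php
// Hypothetical helper: true only for a real calendar date written as yyyy-mm-dd.
function isIsoDate(string $value): bool
{
    if (!preg_match('/^(\d{4})-(\d{2})-(\d{2})$/', $value, $m)) {
        return false;
    }
    // checkdate() takes month, day, year and rejects dates such as 02-30.
    return checkdate((int) $m[2], (int) $m[3], (int) $m[1]);
}

// Exist dates are imported only when both the birth and death dates are usable.
function canImportExistDates(string $birth, string $death): bool
{
    return isIsoDate($birth) && isIsoDate($death);
}

var_dump(canImportExistDates('1879-03-14', '1955-04-18')); // bool(true)
var_dump(canImportExistDates('1879-03-14', 'April 1955')); // bool(false)
```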

Finally, resource relations can be created for each dbpedia-owl:wikiPageExternalLink in the RDF, and a resourceRelation @xlink:role="portrait" may be generated for the dbpedia-owl:thumbnail, enabling the EAC record to link to a freely and openly available image which represents the entity being described.

In addition to importing these data, xEAC will set the dbpedia resource URI as a source in the control as well as an otherRecordId.

Expanding Relations

By default, CPF relations will point to dbpedia resources, but the interface allows the user to create stubs within the xEAC collection or instead link the relation to an existing EAC record.  Clicking the "Create Record" link, pictured behind the popup window in the screenshot below, opens a window in which the user can set the recordId and xlink attributes for the stub, which will be generated when the source record is saved.  The stub created for the Argead Dynasty will contain a CPF relation that points back to Alexander the Great.



Using PHP to crawl dbpedia to generate EAC-CPF records en masse

Several weeks ago, I began work on a PHP script which starts at a given dbpedia resource and adds the CPF relations of that resource to an array that is processed continuously until it reaches an end.  That is to say, if I start with Augustus, then Livia, Julia, Tiberius, and the Julio-Claudian Dynasty will be added to the array and generated as stubs.  Then the relations of these resources will be crawled.  The process continues expanding like a spider web.  Left unchecked, the chain of successors and predecessors, spouses, and children will continue forward through the Byzantine period and backward through the Hellenistic kingdoms to generate a network of the ruling hierarchy of the West for more than 2500 years up to the present.  This mass of EAC records can be uploaded into xEAC.  Currently, my focus is the ancient Greek and Roman period, but the script can be applied to any era.  The stubs are a good starting point when turning over the management of the content to specialists who might add greater chronological or geographical context.
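
The crawling idea can be illustrated with a small queue-based loop.  This is a sketch under stated assumptions and not necessarily how dbpedia-to-eac.php is written: the crawl function, the $fetchRelations callback, the $limit safety cap, and the in-memory $graph are all hypothetical stand-ins for live dbpedia lookups.

```php
<?php
// Hypothetical crawl loop. $fetchRelations stands in for whatever retrieves
// the related resource URIs of a dbpedia resource; $limit is an added safety
// cap so that an unchecked crawl cannot run forever.
function crawl(string $start, callable $fetchRelations, string $end = '', int $limit = 1000): array
{
    $queue = [$start];
    $seen = [$start => true];   // resources already queued, to avoid loops
    $processed = [];

    while ($queue !== [] && count($processed) < $limit) {
        $uri = array_shift($queue);
        $processed[] = $uri;    // an EAC-CPF stub would be generated here
        if ($uri === $end) {
            break;              // stop once the designated final record is reached
        }
        foreach ($fetchRelations($uri) as $related) {
            if (!isset($seen[$related])) {
                $seen[$related] = true;
                $queue[] = $related;
            }
        }
    }
    return $processed;
}

// Toy in-memory graph standing in for live dbpedia lookups.
$graph = [
    'Augustus' => ['Livia', 'Julia', 'Tiberius'],
    'Livia'    => ['Augustus', 'Tiberius'],
    'Tiberius' => ['Livia', 'Caligula'],
];
$order = crawl('Augustus', fn (string $uri): array => $graph[$uri] ?? []);
echo implode(', ', $order), PHP_EOL; // Augustus, Livia, Julia, Tiberius, Caligula
```

Tracking the resources already seen is what keeps mutual links (Augustus to Livia and back) from being processed twice.  It also shows why an end record is not guaranteed to stop the crawl: if the end resource is never reached through the chain of relations, only the queue running dry or the cap will end the loop.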

This script has a few more options to work with:

$start = 'http://dbpedia.org/resource/Augustus';
$end = '';
$lang = 'en';
$options = array(
    'internal' => true,
    'occupations' => true,
    'subjects' => false,
    'birth/death places' => true,
    'children/parents' => true,
    'dynasties' => true,
    'successors/predecessors' => true,
    'spouses' => true,
    'influences' => false,
    'resourceRelations' => true,
    'thumbnail' => true
);

An $end variable can be set to attempt to establish the final record to be processed, but there is no guarantee this will work.  The options allow the user to choose which data to import into the EAC records.  This is similar to the XForms interface, but birth/death places are also supported (with associated dates), and the user can set the 'internal' option to true if the CPF relations' @xlink:href attributes should link to shortened, relative URIs for a single EAC collection (as opposed to linking to the dbpedia resource URI).  Furthermore, if the dbpedia RDF contains a reference to a VIAF ID, the script will attempt to gather birth and death dates from VIAF, as well as insert otherRecordIds as appropriate.
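
As an illustration of the 'internal' option, the choice of @xlink:href value could look something like the following.  This is a guess at the idea rather than the script's actual code: the function name relationHref is hypothetical, and the assumption that the shortened URI is simply the last path segment of the dbpedia URI may not match what the script really produces.

```php
<?php
// Hypothetical helper: choose the @xlink:href value for a CPF relation.
function relationHref(string $dbpediaUri, bool $internal): string
{
    if (!$internal) {
        // Link straight to the dbpedia resource.
        return $dbpediaUri;
    }
    // Assumption: the relative URI is the last path segment of the resource
    // URI, e.g. "Augustus" from "http://dbpedia.org/resource/Augustus".
    $path = parse_url($dbpediaUri, PHP_URL_PATH);
    return basename((string) $path);
}

echo relationHref('http://dbpedia.org/resource/Augustus', true), PHP_EOL;
echo relationHref('http://dbpedia.org/resource/Augustus', false), PHP_EOL;
```

The first call prints the shortened identifier and the second prints the full dbpedia URI unchanged.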

The script is available at https://github.com/ewg118/xEAC/blob/master/misc/dbpedia-to-eac.php
