Graeme Forbes, NLS, opens the CIGS LOD day on a beautifully bright and crisp morning at the Edinburgh Carbon Centre and invites our first speaker to the stage to present Publishing the British National Bibliography as Linked Open Data / Corine Deliot, British Library.
Corine’s presentation describes the development of a linked data instance of the British National Bibliography (BNB) by the British Library (BL). The focus is on the development of an RDF data model and the technical process to convert MARC 21 Bibliographic Data to Linked Data using existing resources. BNB was launched as linked open data in 2011 on a Talis platform. In 2013 it was migrated to a new platform, hosted by TSO.
Corine discusses some of the motivations behind the British Library’s drive to open their data, including publishing for others to use and opening up data to a wider audience. BL felt that open data would further benefit staff developing new skill sets. BNB was published as open data with a CC0 licence, allowing people to modify, adapt and reuse it freely: http://bnb.data.bl.uk/
The British Library created a process for opening their data, considering the design of human-readable URIs and which concept vocabularies to use. The BL terms RDF schema is available at http://www.bl.uk/schemas/bibliographic/blterms#
The BNB data model looks a bit complex but successfully maps out what BL have used and how they refer to their data, be it subjects, authors, series or other unique identifiers: http://www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf
The BL have taken an event-based approach to modelling publication data, considering future publication events and how to model them, including scope for forthcoming and out-of-print publishing events. Births and deaths are modelled as biographical events, and there is extensive use of foaf:focus to relate ‘things in the world’.
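To make the event-based approach concrete, here is a minimal sketch in Python that emits N-Triples in that style: a birth modelled as an event in its own right, and foaf:focus linking a concept to the real-world person. The URIs and the use of the BIO vocabulary here are my assumptions for illustration, not BL’s exact terms.

```python
BIO = "http://purl.org/vocab/bio/0.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical identifiers, not real BNB URIs:
person = "http://example.org/id/person/DickensCharles1812-1870"
concept = "http://example.org/id/concept/DickensCharles1812-1870"
birth = person + "/birth"

triples = [
    (person, RDF + "type", FOAF + "Person"),
    # The birth is an event with its own URI, not just a literal date:
    (birth, RDF + "type", BIO + "Birth"),
    (birth, BIO + "date", "1812"),
    (person, BIO + "event", birth),
    # foaf:focus relates the concept (a thing in the catalogue)
    # to the person (a thing in the world):
    (concept, FOAF + "focus", person),
]

def n_triples(triples):
    """Serialise (s, p, o) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        obj = "<%s>" % o if o.startswith("http") else '"%s"' % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)

print(n_triples(triples))
```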
BL have created a MARC to RDF conversion workflow that documents their process: select the records to be converted, determine the character set conversion (converting to pre-composed UTF-8), generate the URIs, quality check using Jena Eyeball, then create and load the RDF datasets. http://jena.sourceforge.net/Eyeball
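Two of those workflow steps can be sketched with the standard library: converting decomposed Unicode to pre-composed (NFC) form, and minting a human-readable resource URI. The URI pattern below is an illustrative assumption, not BL’s documented scheme.

```python
import unicodedata

def to_precomposed(text):
    """Convert decomposed Unicode (e.g. 'e' + combining diaeresis)
    to pre-composed NFC form, as in BL's character-set step."""
    return unicodedata.normalize("NFC", text)

def mint_resource_uri(bnb_number):
    """Mint a human-readable URI from a BNB number
    (pattern is illustrative, not BL's exact scheme)."""
    return "http://bnb.data.bl.uk/id/resource/%s" % bnb_number

decomposed = "Bronte\u0308"  # 'Bronte' + combining diaeresis
print(to_precomposed(decomposed))      # pre-composed form
print(mint_resource_uri("GB8102507"))
```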
The outcome of BL’s linked data work is two datasets (books and serials), a BNB linked data platform with a SPARQL endpoint http://bnb.data.bl.uk/sparql and SPARQL editor http://bnb.data.bl.uk/flint, and a BNB downloader http://www.bl.uk/bibliographic/download.html that is updated monthly.
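As a sketch of how you might query the endpoint, the snippet below builds a request URL for a simple SPARQL query. The choice of dcterms:title is my assumption about the data; the actual network call is left commented out.

```python
from urllib.parse import urlencode

ENDPOINT = "http://bnb.data.bl.uk/sparql"

# An illustrative query: ten resources and their titles.
# (Property choice is an assumption about the BNB model.)
query = """
SELECT ?book ?title WHERE {
  ?book <http://purl.org/dc/terms/title> ?title .
} LIMIT 10
"""

# A GET request to the endpoint carries the query as a parameter:
request_url = ENDPOINT + "?" + urlencode({"query": query})
print(request_url)
# from urllib.request import urlopen
# results = urlopen(request_url).read()  # network call omitted here
```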
The British Library plan to refine and extend the model, investigate FRBRization, link to further external sources such as DBpedia, GeoNames and DNB bibliographic resources, and expand the scope beyond the current BNB.
Peter McKeague, RCAHMS, takes to the stage to talk about SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Links on behalf of the SENESCHAL Project team
Peter discusses issues arising from the development, implementation and running of a linked data service and also their plans for future developments.
The Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) http://www.rcahms.gov.uk along with its sister organisation in Wales (RCAHMW) and English Heritage are each responsible for a series of heritage thesauri and vocabularies. In Scotland, RCAHMS maintains the Scottish Monuments Thesaurus, and thesauri for archaeological objects and maritime craft.
Although these key thesauri are available online for reference until now they have lacked the persistent Linked Open Data (LOD) URIs needed to allow them to act as vocabulary hubs for the Web of Data.
Peter discusses the drivers for Linked Data for cultural heritage and the historic environment and the creation of thesauri as Linked Data, published through http://www.heritagedata.org/blog/ before discussing the practicalities of implementing Linked Data products in a working environment.
RCAHMS currently publish their datasets on CANMORE http://www.rcahms.gov.uk/canmore.html but find that linked data can further push interoperability among archaeology and other useful datasets. It also aligns with the mandate set out at government level that all datasets on the same subjects must be created, published and released using the same data structure.
Previous projects identified the same old issues in converging datasets, requiring a level of data cleansing and unifying prior to alignment of the published data. SENESCHAL’s main remit is to make it easier to create linked data vocabularies and to enable knowledge exchange among different archaeological datasets and a wider global audience.
You can keep up to date with the project at: http://www.heritagedata.org/blog
Gordon Dunsire, Chair of the IFLA Namespaces Technical Group discusses methods for publishing local metadata records as linked data, using the National Library of Scotland’s database of metadata records for digital and digitised resources as a case study.
Topics covered include developing database structures as element sets, extracting data and creating linked data triples, and creating links from local data to the global Semantic Web. The presentation also includes a demonstration of a primitive linked data OPAC.
Gordon discusses how to get started in open data, taking a pick and mix approach to linked data, using global elements and ensuring that your local elements have the same scope as the elements you choose, so as not to degrade your data or confuse meaning.
Many element sets are available, from the general (Dublin Core, FOAF, SKOS) to the specific (BIBO, FRBR, ISBD, RDA etc.). Searchable registries such as the Open Metadata Registry, LOV or Joinup are also available to help find suitable element sets.
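The pick-and-mix approach in practice often amounts to keeping a small prefix map so local records can borrow properties from several element sets at once. A minimal sketch (the prefix selection is illustrative):

```python
# A small prefix map mixing general and specific element sets:
PREFIXES = {
    "dct":  "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "isbd": "http://iflastandards.info/ns/isbd/elements/",
}

def expand(curie):
    """Expand a compact URI like 'dct:title' to a full property URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("dct:title"))
print(expand("skos:prefLabel"))
```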
Gordon demonstrates the DIY approach with a case study of the NLS digital object database, used to publish the data: http://nlsdata.info/dod/elements/ This gives the NLS the opportunity to map to further external services such as DBpedia.
All the NLS data is stored as RDF triples, held in a MySQL triple store with online access to the data… we are now being treated to a live experimental demo of the triple store… fingers crossed…
So far so good: the demo successfully brings together digital images relating to the Battle of Passchendaele, searched for within the NLS DOD, and pulls in DBpedia data too, enabling the user to leave the DOD dataset and begin ‘following your nose’. Switching search interest, instead of a subject we can link to GeoNames entries too and view maps of the area from Google Maps.
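A relational triple store like the one behind the demo can be sketched in a few lines, here using SQLite in place of MySQL. All subjects, predicates and objects are illustrative stand-ins, but the shape of the ‘follow your nose’ query is the point:

```python
import sqlite3

# A toy triple store: one three-column table, as in the NLS setup
# (SQLite standing in for MySQL; all values are illustrative).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("nls:image123", "dct:subject", "dbpedia:Battle_of_Passchendaele"),
    ("nls:image123", "dct:title", "Troops at Passchendaele"),
    # A link out to DBpedia lets the user leave the local dataset:
    ("dbpedia:Battle_of_Passchendaele", "rdfs:label",
     "Battle of Passchendaele"),
])

def objects(s, p):
    """All objects for a given subject and predicate."""
    rows = db.execute("SELECT o FROM triples WHERE s=? AND p=?", (s, p))
    return [o for (o,) in rows]

# Find the image's subject heading, then follow the link outwards:
for subj in objects("nls:image123", "dct:subject"):
    print(subj, "->", objects(subj, "rdfs:label"))
```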
Stopped for lunch, where I’ve just scoffed a coffee cake the same size as my head!
Everyone refreshed and happy after some ace sarnies and cakes and looking forward to Kate Byrne’s presentation on Can documents be Linked Data? from the Edinburgh University School of Informatics
Kate begins by stating that the semantic web is ancient, first referenced by Berners-Lee way back in 1994; in 2009 Tim further lamented that ‘the web as I envisaged it is not there yet…’
Awesome analogy using The Hobbit opening sentence to express semantic structure!
Plenty of great work is being done on extracting RDF from existing databases to create Linked Data that can be part of the semantic web. But there is a vast amount of information that is not in structured databases. Most of the information we use every day and curate in archives and libraries is in free text form: documents, books, web pages and so forth.
Kate describes some of the research into extracting structured data – which can be turned into Linked Data – from natural language text, some of the issues likely to face anyone attempting this task, and how to go about building connections between datasets to make them full members of the Web of Data. To read more about Kate’s research visit http://homepages.inf.ed.ac.uk/kbyrne3/research.html
Kate brings us back to what is the point of all of this and how it would be really cool to form links with many disparate datasets covering the same concepts or events.
Kate mentions some of the project tools she has used, such as the Unlock Text API run by EDINA http://edina.ac.uk/unlock/texts the Google Ancient Places GAP at http://googleancientplaces.wordpress.com and an online interface called GapVis at http://nrabinowitz.github.io/gapvis/
Gill Hamilton takes to the stage with a real snappy title for her presentation – LO(d) and Behold!: extending to the Giant Global Graph – (catalogue that!). Gill is the Digital Access Manager at the National Library of Scotland and outlines some of the challenges, along with tools, tips and techniques for linking local library linked data to other linked data sets such as DBpedia (the RDF representation of Wikipedia), Library of Congress Subject Headings and GeoNames. Gill discusses the idea that one connection exists for all time and how link after link can build a richer landscape of ‘things’.
Gill explains the different ways we can match data, by large-number processes and by matching strings, and that the big issue is moving from the string to the thing.
Where LCSH, the Getty Thesaurus and other vocabularies are used, this is recorded in the DOD, allowing the NLS to make an exact match between the DOD and controlled vocabularies such as LCSH; this in turn makes an exact match between the DOD triple and the LCSH triple. The Getty Thesaurus isn’t available… yet, but real soon now!
The great opportunity and one of the benefits of making these connections is that once you have linked your local dataset to a global one, you are linked forever.
The work to achieve this paradise is resource heavy, but it’s mainly about maintaining momentum and enabling staff to dip in and out when they can. It’s the beginning of a crowd-sourced way of working, debating what is the nearest or exact match… in the same guise as something like Project Gutenberg in its heyday. This is an excellent example of getting all things Scottish up on the global data graph!
Gordon Dunsire sums up the day by stating that the data cloud is now so big it’s not as useful as it once was, but using a service such as LOV (lov.okfn.org/dataset/lov) allows you to link via the graph to meta-metadata about the vocabularies.
Due to the nature and ideology of the semantic web and linked data, Gordon warns us not to believe everything we link: if we can’t identify the authorities, we can be lost; authority control can give us a little of that trust back. There needs to be a declaration of the authority of the vocabulary. LOV highlights who’s using a particular vocab, but not its authority – it’s neutral. But if it’s clear that a vocab is maintained by an individual or an organisation, authority and trust will generally ensue.
It’s a paradigm shift and the cataloguer’s point of view will fundamentally change – the underlying principles are understood by cataloguers, but the way they will work in the modern environment is not – and cataloguers need to engage and influence the new age.
Gordon tells the audience about the Joinup semantic assets, a European Commission project: https://joinup.ec.europa.eu/asset/all The project allows you to download the schemas and import the Open Metadata Registry elements.
Persistence is fundamental, but no persistence is guaranteed (this should be emblazoned on a t-shirt). Gordon further backs up the presenters of the day saying that one link is all it takes to make it work – but we don’t know which are the correct links, so use loads of them but beware and go to trusted places to find them.
- Slides from the day have been uploaded to the CIG Scotland SlideShare with permission from all of the speakers.
- Tweets about “#cigslod2013”
- Places still available – CIGS LOD seminar, Edinburgh, 18th November
What tool do you use to convert MARC 21 to RDF?
Thanks for the question Pricila. I’ve asked the committee and included an answer from Alan Danskin at the British Library below. Hope it helps!
To convert our MARC 21 records to RDF when we publish the British National Bibliography as Linked Open Data, we use the following tools:
– We pre-process the records for normalization and improved matching (e.g. removal of end of field punctuation, etc.) using MARC Global (http://www.marcofquality.com/soft/mgfeatures.html), a companion to MARC Report developed by TMQ (http://www.marcofquality.com/)
– To convert the records from Unicode decomposed to pre-composed, generate BL URIs and match to external datasets, we use some British Library tools called Catalogue Bridge utilities (these tools are command line tools written in C).
– To convert the data to RDF/XML using XSLT, we use the command line MARC XML utility (http://www.marcofquality.com/wiki/mrt/doku.php?id=236:cmxhelp), part of MARC Report.
– At the development stage, we used the RDF validation service from the W3C (W3C RDF validator – http://www.w3.org/RDF/Validator/)
– To check the output, we use Jena Eyeball (http://jena.apache.org/)
– We generate the N-Triples file from the RDF/XML using a Catalogue Bridge utility.
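For anyone curious what the pre-processing step looks like in code, here is a minimal stand-alone sketch (my own illustration, not BL’s or TMQ’s actual implementation) of stripping end-of-field ISBD punctuation before matching:

```python
import re

def strip_field_punctuation(value):
    """Remove trailing end-of-field ISBD punctuation ( . : ; / , )
    and surrounding whitespace, similar in spirit to the
    normalisation done with MARC Global before matching."""
    return re.sub(r"\s*[.:;/,]+$", "", value).strip()

print(strip_field_punctuation("London :"))      # place of publication
print(strip_field_punctuation("The hobbit /"))  # title proper
```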
Excellent seminar, many thanks to all concerned.