This is an overview of the project team's discussions about citing data.

Starting points from Bryan's blog:

http://home.badc.rl.ac.uk/lawrence/blog/2006/09/21/more_on_citation

http://home.badc.rl.ac.uk/lawrence/blog/2006/09/22/more_on_citation_-_part_two,_mst

http://home.badc.rl.ac.uk/lawrence/blog/2006/09/26/more_on_citation_-_part_three,_delving

What the data citation is for:

Citation - to reference the dataset in a paper, indicating its quality standing, so that the creator can get recognition for the work and build citation metrics (here there could also be discussion about how the Data Manager gets recognition if they have done an enormous amount of work on the dataset)

Discovery - short description of the dataset to enable a prospective user to identify whether the data might be of use and to uniquely identify it ... and provide links to further information. We should not expect citation to provide all the discovery information (as defined here), so let's leave this off the table for now.

Availability - where the user can get more information about, access or obtain the dataset

We need to understand that these are three VERY different options, so it is not appropriate to provide one URL that doesn't indicate which of these options exists at that URL! Hence my discussion about an OPeNDAP URL ...

Or, put a different way:

Setting the context for when someone might come across a dataset (or subset of a dataset) citation: I am assuming that the citation to the dataset is going to be used in a formal document such as a journal article, in a similar way to a citation to another document. At that stage the user will want/gain/infer the following:

- information on what it is and who created it

- information on who "published" it

- enough information to be able to locate it

The first two are bound up with implicit and explicit judgements about the quality and relevance of the information. The final one enables the user to locate it (or in certain cases attempt to locate it - many factors including access rights and availability may mean that it is not possible to find it).

Citation content

Identify the elements that we want in the citation and then decide how to format them. There are already citation conventions, so it is probably best to try to emulate them where possible, and in this respect datasets are probably analogous to (but not the same as) monographs. Or for some datasets a better analogy might be either a series (Lecture Notes in Computer Science, Advances in Physics, etc.) or an anthology.

I think there is agreement on datasets issued by the originating organisation

Proposed ELEMENTS for datasets issued/made available through a 'repository':

Creator (personal or corporate)

Date (year of issue/publication/download) I think we have to separate these out into DateOfPublication/Issue and DateOfDownload (self-explanatory).

Title (dataset title including any temporal information and originating organisation with any ID they assigned) I think these things need to be separated just as they are in print media, and the title should not include temporal information; it can be too ambiguous. I think the only temporal information that makes sense in a citation are the dates above, but ...

Place of Publication (the citation is a snapshot, so whether BADC moves is immaterial; at the time of access it was at Chilton. Also, British ADC might be fine without a place of publication, but Geological Society Data Centre on its own is not - there are many Geological Societies around the world) There is some internal discussion on whether this is necessary in this context.

Publisher (the data centre/repository)

Available (URL to the full metadata record)

Accessed/Downloaded (date of access or download)

In addition we need:

a URI: a unique identifier which is provided by the issuing organisation for the data element. If that unique identifier carries semantics which are of use in the citation (as IS NECESSARY IF WE WANT TO CITE INTO A DATASET ...), then we additionally need a link to something which defines those semantics. However, let's postpone this and see if we can solve the dataset-alone problem first.

I think this is the hub of some of the problems here. Why should it point to a metadata record? When I cite an electronic journal article I cite the thing itself at a DOI! Why should I be forced to make my catalogue persistent when the object itself is more likely to be persistent? The catalogue entry ought to be updated over time (as it will be, for example, when a dataset is growing).

To me, a publisher does something to the original that adds value - in a document this may be pagination and copy editing. So I would distinguish between hosting data that is owned by someone else and data whose original home is in this repository. This is a rather simplistic view; it applies my point about document citation above to datasets and may not be appropriate for the dataset domain.

So the citation becomes:

Creator Date Title. Place of Publication: Publisher. [Available: URL] (access date)

The only problem I have is how we know it is a dataset from this citation. If the title includes the word dataset or data it is obvious, but if the title is like your Radar Facility example it is not.
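As a rough sketch only, the proposed element order could be assembled like this. The class name, field names and the optional `medium` marker are assumptions made for illustration and are not an agreed schema; the `[dataset]` marker is just one possible way of flagging that the cited thing is data when the title does not say so. The example values are taken from the MST citation discussed on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCitation:
    """Citation elements proposed above (field names are illustrative)."""
    creator: str                  # personal or corporate
    date: str                     # year of issue/publication
    title: str
    place: Optional[str]          # place of publication; its necessity is still debated
    publisher: str                # the data centre/repository
    available: str                # URL
    accessed: str                 # access/download date
    medium: Optional[str] = None  # hypothetical marker, e.g. "dataset"

    def render(self) -> str:
        """Creator Date Title. Place: Publisher. [Available: URL] (access date)"""
        title = f"{self.title} [{self.medium}]" if self.medium else self.title
        imprint = f"{self.place}: {self.publisher}" if self.place else self.publisher
        return (f"{self.creator} {self.date} {title}. {imprint}. "
                f"[Available: {self.available}] ({self.accessed})")


example = DatasetCitation(
    creator="Natural Environment Research Council",
    date="2006",
    title="Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth 1990 -",
    place="Chilton, UK",
    publisher="British Atmospheric Data Centre (BADC)",
    available="http://badc.nerc.ac.uk/data/mst/v3/",
    accessed="21 Sep 2006",
    medium="dataset",
)
print(example.render())
```

Keeping the elements as separate fields and only joining them at the end means the formatting can change (or the place of publication can be dropped) without touching the recorded information.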

Alternative citation versions

version 1:

Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn badc.nerc.ac.uk/data/mst/v3/upd15032006, feature 200409031205 http://featuretype.registry/verticalProfile [downloaded Sep 21 2006, available from http://badc.nerc.ac.uk/data/mst/v3/]

version 2:

Natural Environment Research Council 2006 Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth 1990 - [dataset]. Chilton, UK : British Atmospheric Data Centre(BADC). [Available: http://badc.nerc.ac.uk/data/mst/v3/] (21 Sep 2006)

This may only be citing the entire dataset.

(1) There is a key difference between where something is available and the identifier. Consider a journal article. You're quite happy to see vol 13, page 22 in there, and that's a local identifier to that journal. Something similar is needed for data. Hence the URI in my example. One needs a clear distinction between the thing (a URN maybe) that identifies the data, and the thing that identifies where it is found. Even if they are the same!

(2) The more complicated example included the concept of citing INTO a database. I want to, for example, cite a cruise report within a cruise database, and I'm citing the DATA, not the document about the data. I need to give folk some way of a) identifying the thing within the data (hence my feature URI), and b) identifying what lives at that address (hence my link to a feature type description).
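Points (1) and (2) can be sketched as a small data structure that keeps the identifier, the location, and the optional "cite into" parts apart. This is only an illustration: the class and field names are assumptions, and the rendered string loosely follows version 1 above rather than any agreed format. The values are the ones from the MST example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataReference:
    """Keeps 'what it is' apart from 'where it is' (names are illustrative)."""
    identifier: str                     # names the data, e.g. a URN from the issuer
    location: str                       # where the data can be obtained
    feature_id: Optional[str] = None    # the thing within the dataset, if any
    feature_type: Optional[str] = None  # link describing what lives at that feature

    def cites_into_dataset(self) -> bool:
        """True when the reference points at a feature rather than the whole dataset."""
        return self.feature_id is not None

    def render(self) -> str:
        parts = [f"urn {self.identifier}"]
        if self.feature_id is not None:
            feature = f"feature {self.feature_id}"
            if self.feature_type is not None:
                feature += f" {self.feature_type}"
            parts.append(feature)
        return ", ".join(parts) + f" [available from {self.location}]"


whole = DataReference(
    identifier="badc.nerc.ac.uk/data/mst/v3/upd15032006",
    location="http://badc.nerc.ac.uk/data/mst/v3/",
)
profile = DataReference(
    identifier="badc.nerc.ac.uk/data/mst/v3/upd15032006",
    location="http://badc.nerc.ac.uk/data/mst/v3/",
    feature_id="200409031205",
    feature_type="http://featuretype.registry/verticalProfile",
)
print(whole.render())
print(profile.render())
```

The identifier and the location are separate fields even where they happen to hold the same string, which is the distinction argued for in (1).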

Additional issues to consider

In journal article publishing the citation is always to the formally published version; in some work that the NISO Working Group on Journal Article Versions is doing (http://www.niso.org/committees/Journal_versioning/JournalVer_comm.html), this is defined as the Version of Record. So this assumes that there is one "approved" version that can be considered to be definitive, but there may be many physical copies. This approved version in the present information landscape is likely to have been generated/published by an academic publisher. The user may have looked at a copy of the version of record at the publisher's website, at a third party aggregator, electronically from the British Library or from somewhere else, but for document citation it is not important where the document was located as they are all copies of the same thing.

However, this concept of Version of Record and many identical copies in different locations does not transfer as easily to datasets. Citing documents is an easier process as by definition the document is a completed process (no extra information being added, as can happen with a dataset - the MST one that Bryan discussed is a good case in point). I won't go into errata etc. as this is a pointless diversion in this discussion!

Another issue that I haven't really thought enough about to articulate properly is bound up with access rights. In the print world you couldn't get from the citation to the physical copy without doing some resource discovery. In the electronic world you may, depending on the services that your library provides, be able to click on the citation and get to the article - this is all dependent on your access rights and privileges. Datasets also have access rights, so if anyone is able to link directly to the data, what happens if they are not entitled to access it?