The following are brief notes from short interviews with practising scientists, conducted by Sam Pepler. The main question asked was "How would you like to cite data?".

Martin Juckes

Martin's starting point was that there would need to be a human-readable citation, as there is for papers. As an example he referred to the Quarterly Journal of the Royal Met Soc. Data citations would need an author and year-of-publication equivalent and a human-readable title, as well as any IDs such as a DOI. He also said there needs to be an unambiguous repository title, e.g. "British Atmospheric Data Centre".

Martin would like to cite the whole dataset, leaving the details to the text of the paper. For example, a paper using one year of ECMWF ERA-40 data would refer to it in the text as "... this is most evident in July 1998 ECMWF ERA-40 data (Bloggs 2005)." and list it as a reference as "Bloggs et al, 2005, ECMWF 40 year reanalysis, British Atmospheric Data Centre, doi:10.12345/de.era40".
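As a minimal sketch of the citation shape Martin describes, the record below holds the author, year, title, repository and DOI, and renders both the in-text form and the reference-list form. The class name `DataCitation` and its method names are assumptions made for illustration; they do not come from the interviews or from any existing citation library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record for a whole-dataset citation."""

    authors: str      # author string as it should appear, e.g. "Bloggs et al"
    year: int         # year of publication of the dataset
    title: str        # human-readable dataset title
    repository: str   # unambiguous repository title
    doi: str          # identifier without the "doi:" prefix

    def in_text(self) -> str:
        # Short form for the body of a paper: first author surname and year.
        first_author = self.authors.split()[0].rstrip(",")
        return f"({first_author} {self.year})"

    def reference(self) -> str:
        # Full form for the reference list, fields separated by commas.
        return (
            f"{self.authors}, {self.year}, {self.title}, "
            f"{self.repository}, doi:{self.doi}"
        )


era40 = DataCitation(
    authors="Bloggs et al",
    year=2005,
    title="ECMWF 40 year reanalysis",
    repository="British Atmospheric Data Centre",
    doi="10.12345/de.era40",
)
print(era40.in_text())
print(era40.reference())
```

The details of which part of the dataset was used (for example July 1998) stay in the text of the paper, so they are deliberately not fields of the record.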

We also discussed how versions and operational products would be referenced.

Martin is an associate editor of QJ. He also discussed how the peer review process of data would work. He has some information on the peer review process, which he will forward to me.

David Hooper

David is the project scientist for the NERC MST radar facility. As well as writing papers, he is also concerned with getting data from the radar referenced in a more consistent manner.

Like Martin, David would like to cite the overarching dataset, not the lower file level. In particular, David would like to cite the radar facility as a container for all the instruments at the radar site, but also have references to the individual instruments at the site.

The radar produces a data series, with more added every day. This raises the problem of what exactly is being cited and when it was published. The best solution we could think of was to publish, every month, a dataset containing all the data up to that time. Thus "MST data, July 2003" would contain all the MST radar data from its start until July 2003.
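The monthly cumulative publication idea can be sketched as follows. The function names `snapshot_label` and `snapshot_contents` are hypothetical, and representing the data as a list of observation dates is an assumption made only to keep the example short.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def snapshot_label(dataset: str, issued: date) -> str:
    # Citable name of the snapshot published for the month of `issued`.
    return f"{dataset}, {MONTHS[issued.month - 1]} {issued.year}"


def snapshot_contents(observation_days: list[date], issued: date) -> list[date]:
    # A snapshot is cumulative: it holds every observation day from the
    # start of the series up to the end of the issue month.
    cutoff = (issued.year, issued.month)
    return [d for d in observation_days if (d.year, d.month) <= cutoff]


days = [date(2003, 6, 30), date(2003, 7, 31), date(2003, 8, 1)]
print(snapshot_label("MST data", date(2003, 7, 1)))
print(snapshot_contents(days, date(2003, 7, 1)))
```

Because each snapshot contains all earlier data, a citation of one label fixes both what was cited and when it was published, at the cost of successive snapshots overlapping almost entirely.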

Robin Hogan

URL citations are regarded as amateurish.

Permanent references are expected

Versions are a problem; dates of publication are needed.

Chopping datasets into yearly lumps is inconvenient for users and artificial.

An instrument introduction paper is the traditional way to cite a data stream from an instrument.

Text and captions are where the more exact references belong.

The data provider needs acknowledgement. A data citation would solve the data-producer co-authorship problem. This arises when data producers insist on being authors of papers produced using the data, as this is their only mechanism for recognition. If the producer is merely mentioned in the acknowledgements, this is not counted in any measurable way.

Katherine Bouton

Katherine is mainly interested in model datasets, for example climate simulations.

The most likely thing to cite is the model itself, not the data. The run "name" may also be quoted, e.g. "The control run from the HiGEM model". Katherine is working on specifying the initial state of a model via an XML representation, NMM. This is also something she would like to cite.

She thought that a direct reference to the data is less likely, as model data are often not retained; however, some standard runs would be referred to, e.g. ECMWF ERA-40 data.

DOI-type identifiers feel more permanent.

Barry Latter

Barry writes technical reports for ESA. He usually uses URLs to refer to a data source and describes the data in the paper. Barry refers to the top level of the repository web site, as he thinks other URLs are unreliable. Barry needs versioning of data to be explicit; he does not mind whether this is in the citation or in the text. An unambiguous data reference includes satellite, instrument, processing level, processing version and product type.
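The five fields Barry lists could be captured in a small record that refuses to be built when any of them is blank. This is only a sketch: the class name `SatelliteDataReference` is hypothetical, and the values in the usage example are invented placeholders, not a real satellite or product.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SatelliteDataReference:
    """Hypothetical record of the fields needed for an unambiguous reference."""

    satellite: str
    instrument: str
    processing_level: str
    processing_version: str
    product_type: str

    def __post_init__(self) -> None:
        # Every field is required; a blank value makes the reference ambiguous.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("ambiguous data reference, missing: " + ", ".join(missing))

    def describe(self) -> str:
        # One-line description suitable for the text of a report.
        return ", ".join(
            f"{f.name.replace('_', ' ')}: {getattr(self, f.name)}" for f in fields(self)
        )


# Placeholder values, invented for illustration only.
ref = SatelliteDataReference(
    satellite="ExampleSat",
    instrument="ExampleInstrument",
    processing_level="Level 2",
    processing_version="v1.0",
    product_type="ozone profile",
)
print(ref.describe())
```

Whether such a description goes into the citation itself or into the text is left open, matching Barry's indifference on that point.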

Barry quotes the source of the data (i.e. the repository), not the producer of the data.

Katherine Bouton

Here are the responses I have received to date. As expected, the answers range all over the place as far as data granularity goes, though the examples A and B given in my original question seem to be the most popular.

  1. My only experience is with ECMWF data where I would cite sub-sets of their data archive e.g. ERA-40, ECMWF operational data, etc.
  2. I tend to use single simulations or ensembles - eg PRECIS A2 simulation from Indo-UK project, or QUMP ensemble members. In a scheme such as the one you propose, it would be important to me to make it clear exactly what simulation was used.
  3. A) a single HiGEM experiment e.g. Fred's Experiment HiGEM xaabc. That is what I would most often do. But sometimes I cite all the experiments of a particular scenario or type done with a range of models e.g. all the SRES A1B runs in the PCMDI AR4 database. Generally when I refer to a model experiment, I am using one, or a few, diagnostics from it, but only a small number of diagnostics.
  4. I'd be inclined to do either a) or b) from your list below, depending on the number of experiments in question - i.e. if it were more than 5 or so xaabc's I'd just clump them together as in b)

Hope this helps somewhat. I'll send more as I get them

Katherine

Original question:
> With that in mind, could you respond to me via email, how you would 
> like to cite in any paper you published the data you used. I need 
> to know at what level of detail you would like, or be most likely, to 
> cite the dataset(s).
> 
> For example, at the end of your publication you now cite all the 
> references/papers you used in your research.  CLADDIER imagines in 
> addition, you would cite the datasets used.  In citing the datasets you 
> used would you prefer to cite (as an example)
> A) a single HiGEM experiment e.g. Fred's Experiment HiGEM xaabc
> B) a grouping of experiments e.g. HiGEM, RAPID
> C) a result e.g. SST dataset from HiGEM
> D) all of the above
> E) Others ...