This is an output for wp4 and wp5. This page is an ideas clipboard for the project deliverable Report IV: Methodologies and Practices for Data Publication.

Introduction

This report outlines a method of data publication to be used at the BADC. The first section of the report outlines the rationale and methodology for establishing this as the publication system. Subsequent sections describe the main issues for the publication system and the final section describes the publication system itself.

Methods

There are three components to the methodology used to establish the data publication system.

  • Analysis of the conceptual models in the academic publishing and library world and how they map to conceptual models in the data centre world. For the library world the conceptual model used was the FRBR model, and for the data centre domain the concepts came from the NERC DataGrid project (NDG), in particular applying the NDG metadata model to BADC datasets.
  • Interviews with scientists and data providers. These are primarily aimed at working out what people would like to cite and how. See CiteInterviewNotes.
  • Analysis of existing publishing practices. This includes other data publishing projects like ebank, Planetary data service and German climate data, and also standard practice in academic publishing such as peer review mechanics from RMS and the ISO 690 citation standards.

These sources of best practice are then combined with existing BADC procedures for dataset creation, update and review. The result is a data publication methodology which the BADC has implemented.

At the outset of this work it is worth declaring the definitions of publication and citation used, as these in turn define our aim of creating a formal publication method for datasets.

Publication: Implied permanence of a publication; available. This makes it citable. What is a publication system? Elements of a publication system. Quality control.

Citation: The citations we are concerned with are those used in academic articles. These are described in full by ISO 690. Citations for data should be equivalent to those for articles.

How do the scientists want to cite the data?

The answer to the simple question "How do the scientists want to cite the data?" should encapsulate the requirements of both the users and the producers of data.

Data users' requirements for citation

  • unambiguous - data citations should point to a well defined entity.
  • broad scale - people want to cite a few datasets, not millions of unaggregated files.
  • like normal citations - There should be some equivalence, in both syntax and semantics, between data and paper citations. This includes having information on things like publication date, author and title. There is a perception that URLs are "amateurish".
  • permanently meaningful - References are only meaningful if they point to the same item in the future.
  • Solve the data-provider recognition problem. This is when a data producer insists on being an author on papers produced using the data, as this is their only mechanism for recognition. This can be unfair to the paper's authors as, even though the data are critical to the work, they may not be part of the intellectual contribution of the paper. Alternatively, if the data provider is merely mentioned in the paper's acknowledgements, then there is no measurable way in which their contribution is recognised.

A related question is what data producers require of citations:

  • specific
  • unique
  • traceable to producer
  • countable citations
  • recognised as equivalent to papers
  • ability to search for papers citing data

Two types of useful citable object.

  • A citation for a mutable data object that captures the whole dataset. This is for changing datasets and is what is fed to the discovery interfaces.
  • A citation for fixed, immutable data objects that are guaranteed to exist in the same form as quoted.

Comparing conceptual models for publishing of academic papers and data

Citations refer to

Cataloging in libraries

  • FRBR
  • Cataloguing rules

Cataloguing in libraries works from the item up. The librarian is presented with an item. The item is described and in the process the manifestation, expression and work are described.

In this section we compare the conceptual models applied to published academic papers and to data held within the BADC.

What defines a BADC dataset? Answer: Something that is citable.

Relating data to the FRBR model

The BADC datasets are works. Files are manifestations (but not of the datasets).

  • New versions are new works.
  • Need static datasets

NDG model for environmental data: the metadata is very specific to environmental data. Citation is about the specifics of environmental versus other data.

The big differences.

  • dynamic publications
  • ill-defined datasets
  • overlapping datasets
  • more citation levels - file level, parameter, spatial-temporal subset.
  • documentation and metadata citations

Make Static datasets

The citation of electronic documents is handled by ISO 690-2. This uses an explicit time of citation to get round non-static targets. This can also be viewed as a way in which the citer declares that they believe the document is likely to change or disappear. As stated above, the very publication process itself should strive for permanence, and so this approach should be rejected. The alternative is to have a clear versioning process. Non-static citation targets need to be avoided: life is easier for the publisher, the archive and the citer if versions are static and have a static citation.
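As a rough illustration of a static citation that names a fixed version instead of relying on a time of citation, the sketch below formats a versioned data citation. The field names, the layout of the formatted string and the example values are all assumptions made for illustration; the layout is only loosely modelled on an author, year, title, publisher reference and is not claimed to conform to ISO 690.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical record for citing one fixed dataset version."""

    authors: str
    year: int
    title: str
    version: str      # the fixed version label takes the place of an access date
    publisher: str
    identifier: str   # placeholder for whatever persistent identifier is used

    def format(self) -> str:
        # Assumed layout, chosen only to resemble a paper citation.
        return (
            f"{self.authors} ({self.year}). {self.title}, "
            f"version {self.version}. {self.publisher}. {self.identifier}"
        )


# Example values are invented placeholders, not a real dataset.
citation = DataCitation(
    authors="Author, A.",
    year=2006,
    title="Example radar dataset",
    version="1.0",
    publisher="Example Data Centre",
    identifier="example-id-0001",
)
print(citation.format())
```

Because the record is frozen and carries an explicit version, the same citation string always refers to the same fixed object, which is the property argued for above.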

Datasets often change with time. Examples are time series that are added to on a daily basis, data that are reprocessed (as a whole or in part), data where different teams add their contributions at very different times, and the addition of data in new formats. With the premise that each new addition or modification constitutes a new dataset version, there will be a vast array of versions to track. For example, the BADC hosts the data from the MST radar facility. Data produced by the facility have been ingested into the BADC every 15 minutes for the past 15 years in a semi-continuous manner, and now amount to around 400 GB in 300,000 files. Clearly, for these data, keeping a copy of the dataset as a whole for each ingestion cycle is a little expensive, as it would involve storing approximately half a million dataset versions averaging 100 GB each, or 50 PB of data!

An alternative is to regard each file as a dataset in its own right. This produces 300,000 datasets, but a scientist analysing a one-year time series is not going to willingly cite thousands of data files when writing a paper, so some measure of aggregation is needed. If the data are aggregated on, say, a yearly or monthly basis then the same problem arises but with more manageable numbers; a paper analysing all the data will still need to cite 15 yearly datasets. There is a clear need to cite the dataset as a whole.

If keeping fresh copies of the data is not manageable and files are arriving continuously then some form of version tagging is needed, so that any file can be found to be part of a number of versions. The frequency at which the BADC declares a new version for tagging is debatable. A new version can be declared at every ingest cycle, for every new file, periodically (hourly, weekly, yearly), or as a result of some procedure (integrity checks, peer review). Whatever the case, the dataset as it was at the time of that version release needs to be recoverable.
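One way to picture version tagging is sketched below: each file is stored once, and a version is just a marker recording how far ingestion had progressed when the version was declared. This is a minimal sketch that assumes files are only ever appended (reprocessed or replaced files are not handled); the class, method and file names are hypothetical and do not describe an existing BADC system.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedArchive:
    """Toy archive: files are stored once, versions are tags over them."""

    # file name -> ingest sequence number (hypothetical bookkeeping)
    ingested: dict = field(default_factory=dict)
    # version label -> highest ingest sequence number included in that version
    versions: dict = field(default_factory=dict)
    _counter: int = 0

    def ingest(self, filename: str) -> None:
        """Record a newly arrived file; no data are copied."""
        self._counter += 1
        self.ingested[filename] = self._counter

    def declare_version(self, label: str) -> None:
        """Freeze the current state of the dataset under a new label."""
        if label in self.versions:
            raise ValueError(f"version {label!r} already declared")
        self.versions[label] = self._counter

    def files_in(self, label: str) -> list:
        """Recover the dataset as it was when the version was declared."""
        cutoff = self.versions[label]
        return sorted(n for n, seq in self.ingested.items() if seq <= cutoff)

    def versions_of(self, filename: str) -> list:
        """List every declared version that a given file is part of."""
        seq = self.ingested[filename]
        return [v for v, cutoff in self.versions.items() if seq <= cutoff]


archive = TaggedArchive()
archive.ingest("radar_0001.dat")
archive.ingest("radar_0002.dat")
archive.declare_version("v1")
archive.ingest("radar_0003.dat")
archive.declare_version("v2")

print(archive.files_in("v1"))                 # ['radar_0001.dat', 'radar_0002.dat']
print(archive.versions_of("radar_0001.dat"))  # ['v1', 'v2']
```

With this design the storage cost grows with the number of files, not with the number of versions, and how often `declare_version` is called (per ingest cycle, periodically, or after a review) remains a policy choice.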

Current practice in data and academic publication

Analysis of existing publishing practices. This includes other data publishing projects like ebank, Planetary data service and German climate data, and also standard practice in academic publishing such as peer review mechanics from RMS and the ISO 690 citation standards.

Other data publication

ebank: description of how ebank publication works.

Planetary data service: description of how the Planetary data service publication works.

German climate data.

PLoS and PDB: Great review process. Tied to paper publication. Homogeneous data.

Peer review in academic publications

Can we get RMS to do it for us?

The RMS peer review process.

The BADC selection procedures.

PDS data review

Non-peer-reviewed publications: technical reports in epubs and NOCS?

Separating the roles

How should a data centre separate the following roles? As noted in the introduction, publication is a loaded word and means many different and conflicting things to different people. Instead of providing another definition of publishing we will concentrate on the functions pertinent to data centres, libraries and publishers.

In all cases the

  • Distribution (D) - Providing the information to the designated community. Availability (embargo and access restrictions are common with data).
  • Preservation (P) - Supporting the information over a long time period. Guaranteed longevity.
  • Submission (S) - Quality control, peer review, identification, validation checks.

Organisation type           Primary function   Secondary functions
Journal publisher           DS                 P
Library                     D
Institutional repository    PD
Data centre                 DPS
Copyright library/Archive   SP                 D
Data producer               S
Paper author                S

  • conformance quality control
  • semantic quality control - peer review
  • packaging - identification
  • metadata generation
  • preservation
  • content distribution
  • metadata distribution

Conformance quality control

Conformance quality control includes checks on file format, paper layout, syntax, spelling and grammar. These are checks against a set of rules defined by the repository's policies, and they lend themselves to automation.

Semantic quality control

Semantic quality control covers sanity checks. These are checks against a broader and vaguer set of global rules. They often reflect the interpretation or viewpoint of a reviewer and are thus possibly contentious. It is hard to automate these checks, but automation can help. For example, a data scientist decides that it is unreasonable for the air temperature at ground level to be less than -70 C. Automatic checking of all data is now possible; however, the arbitrary figure of -70 C is the data scientist's choice, based on her experience.

The dividing line between conformance and semantic quality checks is repository policy. If a repository has a policy that no ground-level temperatures of less than -70 C are acceptable then this is a conformance check.
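Once the -70 C limit has been written down as a rule, the check itself is easy to automate. The sketch below is a minimal illustration: the class name, the parameter name and the sample values are hypothetical, and the only figure taken from the text is the -70 C lower limit (no upper limit is assumed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRule:
    """Hypothetical repository rule: values must lie within [minimum, maximum]."""

    parameter: str
    minimum: float = float("-inf")
    maximum: float = float("inf")

    def violations(self, values):
        """Return (index, value) pairs that break the rule.

        NaN values also fail the comparison and are therefore reported.
        """
        return [
            (i, v)
            for i, v in enumerate(values)
            if not (self.minimum <= v <= self.maximum)
        ]


# The -70 C threshold is the data scientist's judgement, recorded as policy.
ground_temperature_rule = RangeRule("air_temperature_ground_c", minimum=-70.0)

# Invented sample readings in degrees C, for illustration only.
readings = [12.5, -71.0, -70.0, -3.2]
print(ground_temperature_rule.violations(readings))  # [(1, -71.0)]
```

The judgement stays in the choice of threshold; the code only applies it, which is why the same check counts as a conformance check as soon as the threshold becomes repository policy.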

Packaging

It is clear from the above discussion of citable objects that this is still a major issue for data. For traditional publishing the packages are fairly well defined: journals, papers, books.

Metadata creation

Th

Dataset publication issues to address

  • Standards conformance
  • External review
  • Authorship and ownership
  • Identification
  • Fixity

Suggested publication mechanics

The publication of data would be based on the normal journal paper procedures. These are:

  • Preparation for publication - data additions can be a new dataset or an addition to an old one. This includes the encapsulation of datasets so that they are clearly defined.
  • Licences for data distribution agreed - An agreement from the author to distribute and archive the dataset is needed. Open access principles should be striven for.
  • Peer review - significant dataset changes need an external review process.
  • Publication - Data are made available.
  • Citation in new publications

Non-static data and individual files

  • data preparation by scientists
  • ingest QC by data centre
  • data made available by data centre, citable as current records and proposed data package.

Citable data packages

  • citable data package prepared by data centre in collaboration with authors
  • peer review by scientists
  • data package citation and package definition made available by data centre

Conclusions

  • The ability to cite data is heavily linked to defining data
  • Unlike journal papers, data can be continually evolving and being added to, thus there needs to be fast and frequent publication of new editions.
  • Citation can be used to define the data sets used

References

http://www.ukoln.ac.uk/events/pv-2005/pv-2005-final-papers/029.pdf