Proposed changes to BADC Data Publication Procedures
Citable objects
A reasonable definition of a publication is something that can be cited with confidence that future readers will be able to find and interpret the object cited. The citable objects in the context of the BADC are:
- Dataset descriptions. These are descriptions of groups of files and what they contain.
- Supplementary information documents. These documents contain related information and metadata about a dataset. They may be in a either
- Scholarly works related to the data. These are articles about the dataset.
- Individual data files. These are the files that the data is contained in.
It is clear from the views of scientists that citation of the first three objects is the most important, and it is this that this document concentrates on. However, it should be noted that dataset descriptions must point to (cite) the individual files in order to be useful.
The BADC publishes datasets as records in a catalogue that can be queried from web pages, Z39.50 and, via the NERC DataGrid, OAI-PMH.
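As an illustration of the OAI-PMH route, the sketch below builds a GetRecord request URL. The verb and the identifier and metadataPrefix arguments come from the OAI-PMH protocol, and oai_dc is its standard Dublin Core prefix. The base URL and the record identifier are placeholders, not real BADC or NERC DataGrid addresses.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not a real service.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL for one catalogue record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{BASE_URL}?{query}"


# Hypothetical record identifier.
print(get_record_url("oai:example.org:dataset-1"))
```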
Requirements for data publication procedures
The requirements:
- conformance quality control -
- semantic quality control - peer review
- packaging - identification and metadata generation
- preservation
- distribution - content and metadata
Implementation: Documented QC procedures for each dataset.
Packaging
The packaging of datasets should enable the data to be clearly and unambiguously defined. This should be done by having a document that has an identifier and that can refer to other identified digital objects. These data description documents (DDDs) are immutable and fixed, even if the data to which they refer is changing in some way. The DDD should contain a description of the fixity, authorship and content of the data it refers to, and the provenance of the document itself. The DDD may itself have to conform to some standard; at a minimum, Dublin Core (http://dublincore.org/documents/dcmi-terms/#H2). The elements of the dataset description are analysed in the table below.
| Core Element Name | DC definition | BADC application comments |
| Contributor | An entity responsible for making contributions to the content of the resource. | This is not in the present catalogue. This should be the list of all people and organisations that contributed to making the data. |
| Coverage | The extent or scope of the content of the resource. | Ideally the geographic and temporal extent of the data. Should be automatically generated from the data. |
| Creator | An entity primarily responsible for making the content of the resource. | As DC meaning. Not yet in catalogue. |
| Date | A date associated with an event in the life cycle of the resource. | Creation date of dataset |
| Description | An account of the content of the resource. | Dataset description |
| Format | The physical or digital manifestation of the resource. | in catalogue. |
| Resource Identifier | An unambiguous reference to the resource within a given context. | To be assigned |
| Language | A language of the intellectual content of the resource. | en-GB for written text of metadata. |
| Publisher | An entity responsible for making the resource available | Mostly the source organisation: NERC, ECMWF, Met Office, etc. |
| Relation | A reference to a related resource. | ids of other datasets or related papers |
| Rights Management | Information about rights held in and over the resource. | Rights text. This could include the text of any agreement signed |
| Source | A reference to a resource from which the present resource is derived. | If the data comes from a data production tool, this should be a reference to a description of the tool. Likewise for an observation station or activity. |
| Subject and Keywords | The topic of the content of the resource. | Keywords and parameter lists |
| Title | A name given to the resource. | dataset title |
| Resource Type | The nature or genre of the content of the resource. | dataset http://purl.org/dc/dcmitype/Dataset |
| Non-Core Element Name | DC definition | BADC application comments |
| Abstract | A summary of the content of the resource. | Long description in catalogue |
| Available | Date (often a range) that the resource will become or did become available. | |
| Bibliographic Citation | A bibliographic reference for the resource. | Text of citation in ISO 690 form |
| Conforms To | A reference to an established standard to which the resource conforms. | QC measures |
| Created | Date of creation of the resource. | Processing date |
| Is Referenced By | The described resource is referenced, cited, or otherwise pointed to by the referenced resource. | Needed for trackbacks |
| Is Replaced By | The described resource is supplanted, displaced, or superseded by the referenced resource. | For new versions of operational datasets |
| License | A legal document giving official permission to do something with the resource. | The access licence agreement |
| Provenance | A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity and interpretation. | BADC should store this. |
| References | The described resource references, cites, or otherwise points to the referenced resource. | Needed for citation and linking |
Implementation: Identifier assignment. DDD creation. Authorship rules. DDD syntax.
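The DDD syntax is still to be decided, so the following is only a minimal sketch of DDD creation, assuming a DDD is held as a simple mapping keyed by Dublin Core element names. Treating every core element from the table as mandatory is an assumption, as are the function names and the example identifier. The digest is one possible way to show that an immutable DDD has not been altered.

```python
import hashlib
import json

# Dublin Core core elements, treated here as mandatory for a DDD.
# (Assumption: the real authorship and syntax rules are not yet defined.)
REQUIRED_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description", "format",
    "identifier", "language", "publisher", "relation", "rights", "source",
    "subject", "title", "type",
)


def missing_elements(ddd):
    """Return the required elements that are absent or empty in the DDD."""
    return [name for name in REQUIRED_ELEMENTS if not ddd.get(name)]


def ddd_fingerprint(ddd):
    """Return a SHA-256 digest of a canonical JSON form of the DDD.

    Keys are sorted so that the digest does not depend on insertion order.
    """
    canonical = json.dumps(ddd, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


example_ddd = {
    "identifier": "badc:example-dataset-v1",  # hypothetical identifier
    "title": "Example dataset",
    "type": "http://purl.org/dc/dcmitype/Dataset",
    "language": "en-GB",
}
print(missing_elements(example_ddd))  # elements still to be supplied
print(ddd_fingerprint(example_ddd))   # digest to store alongside the DDD
```

Storing the digest with the identifier would let a later reader confirm that the DDD they retrieved is the one that was published.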
Quality control
There are two forms of quality control: conformance and peer review. The aim of quality control measures is to produce data of usable quality.
Conformance quality control
Conformance quality control includes checks on file format, paper layout, syntax, spelling and grammar. These checks lend themselves to automation. They are made against a set of rules defined by the repository's policies. In the case of the BADC, these rules are agreed with the data supplier at the start of data ingestion. They primarily concern conformance to a set, self-describing file format and the supply of some minimum of descriptive metadata.
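An automated conformance check could look something like the sketch below. The file name pattern and the list of required metadata fields are hypothetical stand-ins for whatever rules are agreed with a data supplier. The function only reports violations and leaves the decision on what to do with them to the ingestion process.

```python
import re

# Hypothetical rules: a real rule set would be agreed with the data supplier.
FILENAME_PATTERN = re.compile(r"^[a-z0-9-]+_\d{8}\.na$")
REQUIRED_METADATA = ("title", "creator", "date")


def check_conformance(filename, metadata):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    if not FILENAME_PATTERN.match(filename):
        problems.append(f"file name {filename!r} breaks the naming convention")
    for key in REQUIRED_METADATA:
        # A field counts as missing if it is absent or only whitespace.
        if not str(metadata.get(key, "")).strip():
            problems.append(f"missing descriptive metadata: {key}")
    return problems


good = check_conformance(
    "utls-ozone_20040101.na",
    {"title": "Ozone", "creator": "A. Scientist", "date": "2004-01-01"},
)
bad = check_conformance("Ozone data.txt", {"title": "Ozone"})
print(good)  # no violations
print(bad)   # one naming violation and two missing metadata fields
```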
Peer review quality control
Peer review covers semantic quality control and sanity checks. For the BADC, a review panel should be convened to discuss all aspects of a dataset. The reviewers should be picked from the user community, but may also incorporate members from data suppliers. The reviewed entity should be the dataset description document. This ensures that the package is sufficiently well described to be of use to the designated community.
Implementation: Peer review terms of reference. Rules on membership.
Preservation
The DDD should describe the preservation policy for the dataset. A mechanism needs to be in place to ensure that the policy is enacted. Datasets marked as permanent should remain intact at the data centre, whereas a temporary version of a dataset may only be available for a set period.
Implementation: DDD parser that extracts the preservation policy for each dataset.
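A preservation check of this kind might be sketched as follows, assuming the DDD has already been parsed into a mapping. The "preservation" field and its "policy" and "available_until" keys are hypothetical, since the DDD syntax has not been fixed.

```python
from datetime import date


def preservation_action(ddd, today):
    """Decide what the archive should do with the dataset a DDD describes.

    Returns "keep", "withdraw", or "review" when no usable policy is found.
    """
    policy = ddd.get("preservation", {})
    kind = policy.get("policy")
    if kind == "permanent":
        return "keep"
    if kind == "temporary" and "available_until" in policy:
        until = date.fromisoformat(policy["available_until"])
        return "keep" if today <= until else "withdraw"
    return "review"  # missing or unrecognised policy: flag for a person


permanent = {"preservation": {"policy": "permanent"}}
temporary = {"preservation": {"policy": "temporary",
                              "available_until": "2006-12-31"}}

print(preservation_action(permanent, date(2030, 1, 1)))  # keep
print(preservation_action(temporary, date(2007, 1, 1)))  # withdraw
print(preservation_action({}, date(2007, 1, 1)))         # review
```

Returning "review" instead of guessing keeps datasets with an unclear policy from being withdrawn automatically.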
Distribution
Datasets should be accessible. All DDDs should be publicly available, even if the data they describe is restricted in some way.
Implementation: Link checking in DDDs. Are all DDDs available?
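The two checks named above could be combined in a small audit, sketched here under assumptions: each DDD is a mapping with hypothetical "identifier", "public" and "links" keys, and the caller supplies a resolves function that says whether a link can be reached. Passing the resolver in keeps the sketch independent of any particular network library.

```python
def audit_ddds(ddds, resolves):
    """Return {identifier: [problems]} for DDDs that fail a distribution check.

    ddds     -- iterable of mappings with 'identifier', 'public', 'links'
    resolves -- callable taking a link and returning True if it is reachable
    """
    report = {}
    for ddd in ddds:
        problems = []
        if not ddd.get("public", False):
            problems.append("DDD is not publicly available")
        for link in ddd.get("links", []):
            if not resolves(link):
                problems.append(f"broken link: {link}")
        if problems:
            report[ddd["identifier"]] = problems
    return report


# Hypothetical catalogue and a stand-in resolver backed by a fixed set.
reachable = {"badc:dataset-a", "badc:paper-1"}
catalogue = [
    {"identifier": "badc:ddd-1", "public": True,
     "links": ["badc:dataset-a", "badc:paper-1"]},
    {"identifier": "badc:ddd-2", "public": False,
     "links": ["badc:dataset-b"]},
]
print(audit_ddds(catalogue, lambda link: link in reachable))
```

In this example only badc:ddd-2 is reported, once for not being public and once for its unreachable link.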
Detailed example
Data from the NERC UTLS programme is stored as a collection of files at the BADC. These files have been checked to ensure that they comply with the NASA-Ames format and that additional information is in the file headers. A strict naming convention for the files has also been applied. Information
