Preparing Your Data File Set

What files are involved and how are they related to each other?

The four relationships to understand in order to prepare the data accurately are:

A collection of data files related to one study is grouped together and is known as a data file set.

Each data file in a data file set is accompanied by two files: the variable file and the code set.

The variable file defines the data file: it gives a description of each field that contains data.

The code set accompanies the variable file and explains the meaning of the measurements or types of data that are provided.
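
As a rough illustration of how these four pieces relate, the structure can be sketched with a few Python dataclasses. The class names, field names, and the "EXAMPLE" file are hypothetical and chosen only for this sketch; they are not part of any CEDR specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CodeSet:
    """Maps each code value to its description."""
    name: str
    codes: Dict[str, str]


@dataclass
class Variable:
    """One row of a variable file: describes a single field of a data file."""
    order: int
    name: str
    description: str
    data_type: str
    length: int
    units: Optional[str] = None
    code_set: Optional[str] = None  # name of a CodeSet, if the field is coded


@dataclass
class DataFile:
    """A data file together with its variable file and code sets."""
    name: str
    variables: List[Variable] = field(default_factory=list)
    code_sets: Dict[str, CodeSet] = field(default_factory=dict)


@dataclass
class DataFileSet:
    """All data files that belong to one study."""
    name: str
    data_files: List[DataFile] = field(default_factory=list)


# Hypothetical example: one data file with two variables and one code set.
gender = CodeSet("GenderCode", {"1": "Female", "2": "Male"})
example_file = DataFile(
    name="EXAMPLE",
    variables=[
        Variable(1, "ID", "Identification number", "Numeric", 8),
        Variable(2, "Gender", "Sex (gender) code", "Character", 1,
                 code_set="GenderCode"),
    ],
    code_sets={gender.name: gender},
)
study = DataFileSet("Example study", [example_file])
```

The key point is the chain of references: a data file set holds data files, each data file is described by its variables, and a coded variable points to a code set by name.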

Data File Set

CEDR is organized by data file sets. Each data file set contains one or more data files with accompanying data documentation. Data file sets vary widely in size and scope. The purpose of a data file set is to describe the set of files collected during the course of a particular study or research project.

Data File Set Template

Name for data file set:
Provide the name used by researchers at the providing organizations.
Description of the data file set:
Describe the study and its major findings:

The following items should be part of a concise narrative: a general description of the study type, purpose, and major findings; geographic site(s) or facility name(s); and type of work process(es). Describe the study population and design, including the number of individuals (total, and by race and sex), enrollment dates, criteria for inclusion or exclusion in the study, use of questionnaires, and response rates.

For dose reconstruction projects or other types of studies where individuals are not under study, provide an abstract describing the research, such as found at the beginning of a journal article, conference paper, or research progress report.

General description of the files:

Indicate the number of files included in the data file set and describe the type(s) of information found in each file. Include the number of individuals in each file and describe the criteria for their inclusion.

Exposures of interest:

Describe the exposures of interest in the context of the operations conducted at the facility. List the exposure data contained in the files and indicate the exposure variables used in the analysis.
Investigator-authored reference(s) pertaining to the data file set:
If the study has not been published, enter "unpublished."

Provide bibliographic citation(s) for documents that describe this data file set or have resulted from its use. The style suggested is the Chicago Manual of Style for referencing social and natural science.
Name of principal investigator or contact
Email address of principal investigator or contact
Notes or additional comments (optional)
Number of data files included with this data set

Download Data File Set Template: To use the layout shown in the example, download the provided Data File Set Template.

Data File Set Documentation Example

Example Table 1 - Data File Set Example
Name for data file set	Description of the data file set	Investigator-authored reference(s) pertaining to the data file set	Name of principal investigator or contact	Email address of principal investigator or contact	Notes or additional comments (optional)	Number of data files included with this data set
Hanford case-cohort lung cancer study	The purpose of the study was to investigate the association between lung cancer risk and occupational radiation exposure with appropriate adjustment for tobacco use. Data were analyzed using methods that took into account both the case-cohort design and the changes over time in the quality of the tobacco-use information that was collected. Tobacco use was not strongly related to the level of radiation exposure and adjustment for tobacco use did not greatly modify results of analyses assessing the association between lung cancer risk and cumulative dose equivalent. With or without adjustment for tobacco use, the estimated risks per unit of cumulative dose equivalent were negative, but the 95% confidence intervals were wide and included values several times those estimated from populations with high levels of irradiation. The single analytic file (HFLUNGCA) contains one record for each of the study years 1965 through 1980 (or year of death if earlier) for each of the workers qualifying as a lung cancer case, or selected as a subcohort member from a stratified random sample of cohort members. White male operations workers who died of lung cancer qualified as cases if they were monitored for external radiation for at least three years and terminated employment on or after January 1, 1965. Questions about tobacco use became a routine part of the periodic medical examination in 1965. Termination in or after this year allowed most workers to have at least one examination during the study period. The criterion for cohort members was identical except for the diagnosis of lung cancer, although this did not exclude their selection. The lung cancer cases were stratified into year-of-birth groups in five-year intervals. These intervals were used as strata for identifying eligible persons for the subcohort. For each stratum, at least five times as many subcohort members as cases were randomly selected. Eighty-six workers qualified as lung cancer cases. This resulted in the random selection of 445 subcohort members from a total of 5445 eligible workers. Thirteen of those selected also qualified as lung cancer cases. One of the 86 cases and three of the 445 subcohort members were excluded from the analyses because their medical records could not be located. Vital status was ascertained through December 31, 1980, the study end date. Of the 442 subcohort members, 344 remained alive through the end of the study. Internal as well as external radiation exposures were examined. Workers at the Hanford Site were involved in a variety of activities that resulted in their exposure to radiation, including reactor operations, chemical separation of reactor fuel to obtain plutonium, treatment and storage of hazardous waste, and biological and engineering research. Personal dosimeters (film or thermoluminescent) have been used since 1944. Annual whole-body doses to penetrating external radiation are presented in units of millisieverts. Quality factors of 10 for fast neutrons, 3 for slow neutrons, and 1 for photons and electrons were used in the conversion of exposure to dose. Bioassay programs to detect exposures to internally deposited radionuclides, primarily transuranics, were also initiated in 1944. The potential for inhalation of uranium in this study was evaluated by reviewing each worker's uranium bioassay records. It was assumed that the number of bioassay measurements provided a rough indication of potential for exposure. Bioassay programs for uranium were primarily concerned with monitoring for uptake by the kidney and did not directly provide indications of lung dose.	Petersen, Gerald R., Ethel S. Gilbert, Jeffrey A. Buchanan and Richard G. Stevens, "A Case-Cohort Study of Lung Cancer, Ionizing Radiation, and Tobacco Smoking Among Males at the Hanford Site", Health Physics 58:3-11, 1990.	Arnold B. Smith	AB_Smith@lal.edu	Agency funding the study is: Office of Health, U.S. Department of Energy	1

Data File

Each data file should contain the data necessary to support the published findings and to recreate the original analysis.

Data files can be submitted in any standard format, including Excel spreadsheets or pipe-delimited text files.

Characteristics of Data Files:

Each physical record should equate to one "logical" record.

All records must have the same variables, in the same order. Variable names should not include embedded spaces.

A pipe character ("|") should be used as the delimiter in delimited text files. Avoid using spaces or tabs as delimiters.

Data files should exclude blank columns not associated with any variable.

No special characters, such as embedded carriage returns or line feeds, and no non-ASCII characters should appear in the data file.
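
Several of the characteristics listed above can be checked automatically before submission. The following is a minimal sketch, not an official CEDR tool: the function name `check_data_file` is hypothetical, and it assumes that the first line of the pipe-delimited file holds the variable names.

```python
def check_data_file(text, delimiter="|"):
    """Return a list of problems found in delimited text (empty if none)."""
    problems = []
    if not text.isascii():
        problems.append("non-ASCII character found")

    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    # Assumption: the first line is a header row of variable names.
    header = lines[0].split(delimiter)
    for name in header:
        if name == "":
            problems.append("blank column in header")
        elif " " in name:
            problems.append(f"variable name contains a space: {name!r}")

    # Every record must have the same number of fields as the header.
    # A stray line feed inside a field shows up here as a short record.
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split(delimiter)
        if len(fields) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} fields, "
                f"found {len(fields)}"
            )
    return problems


good = "ID|CUMRAEXP|Gender\n1|12.5|1\n2|0.0|2\n"
bad = "ID|CUM EXP|Gender\n1|12.5\n"
good_problems = check_data_file(good)
bad_problems = check_data_file(bad)
```

Here `good_problems` is empty, while `bad_problems` reports the embedded space in a variable name and the record with a missing field.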

Data File Template

Provide the following information for each data file being submitted to CEDR.

Name for file:
Provide a name for the data file of no more than eight characters.
Description of the file:
Summarize the purpose of the file and/or major types of data contained in this file. For example, "The (title of file) file contains data for 4,222 persons who have been exposed to internal deposition of radium-226 or radium-228."

In one or more subsequent paragraphs, provide detailed information about the formulation of the file and its contents. For example, indicate what a single (logical) record in the data file uniquely describes (e.g., a single member of the cohort, a single annual exposure record, a single badge reading).
Number of variables:
Indicate the number of variables in the file.
Number of records:
Indicate the total number of records in the file.
Notes or additional comments (optional):
Provide any additional information, such as caveats about the data file that might be useful to researchers using this file.

Download Data File Template: To use the layout shown in the example, download the provided Data File Template.

Data File Documentation Example

Example Table 2 - Data File Example
Name for file	Description of the file	Number of Variables	Number of Records	Notes or additional comments (optional)
HFLUNGCA	This file of 34 variables contains vital statistics, smoking information, external dosimetry data, uranium bioassay data and occupational information.	34	7077

Variable File

Variable File Template

When preparing a file for submission, include a descriptive variable file that identifies each column header (field) in the data file.

We recommend following the template below to establish the relationship between each column and its data:

Note: Please use items one through eight as column headers, from left to right.

Order number of variable
Name of variable
Description of variable
Data type
Length
Measurement units
Code set
Notes or additional comments (optional)

Download Variable File Template: To use the layout shown in the example, download the provided Variable File Template.

Variable Documentation Example

Below is an example of a simple variable file that describes three fields in a data file:

Example Table 3 - Variable File
Order number of variable	Name of variable	Description of variable	Data type	Length	Measurement units	Code set	Notes or additional comments (optional)
1	ID	This is an identification number assigned to each person to ensure confidentiality.	Numeric	8			Variable ID is always the first variable in the data file
2	CUMRAEXP	Cumulative Radiation Exposure - This is the cumulative amount of external radiation exposure the worker received prior to the year of follow-up. Dosimetry information was available through 1978. Cumulative radiation exposure ranges from 0.0 to 637.3 mSv	Numeric	6	Millisieverts
3	Gender	Sex (gender) code for the person	Character	1		GenderCode
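
Because the variable file describes every column, it can also be used to check a data file before submission. The sketch below is illustrative only. The function name and dictionary keys are hypothetical, "Length" is assumed to mean the maximum number of characters in a value, and the sample records are invented for the demonstration.

```python
def check_against_variable_file(header, records, variables):
    """Compare a data file's header and records with its variable file.

    `variables` is a list of dicts with the keys 'order', 'name',
    'data_type', and 'length' (hypothetical keys mirroring the template).
    Returns a list of problem descriptions; an empty list means no problems.
    """
    ordered = sorted(variables, key=lambda v: v["order"])
    expected = [v["name"] for v in ordered]
    if list(header) != expected:
        return [f"header {list(header)} does not match variable file {expected}"]

    problems = []
    for row_number, record in enumerate(records, start=1):
        for variable, value in zip(ordered, record):
            if value == "":
                continue  # empty values are not judged in this sketch
            if len(value) > variable["length"]:
                problems.append(
                    f"record {row_number}: {variable['name']} exceeds "
                    f"length {variable['length']}"
                )
            if variable["data_type"] == "Numeric":
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"record {row_number}: {variable['name']} is not numeric"
                    )
    return problems


# Variable definitions taken from Example Table 3; the records are made up.
variables = [
    {"order": 1, "name": "ID", "data_type": "Numeric", "length": 8},
    {"order": 2, "name": "CUMRAEXP", "data_type": "Numeric", "length": 6},
    {"order": 3, "name": "Gender", "data_type": "Character", "length": 1},
]
header = ["ID", "CUMRAEXP", "Gender"]
records = [["10000001", "637.3", "1"], ["10000002", "n/a", "12"]]
problems = check_against_variable_file(header, records, variables)
```

The first record passes. The second is flagged twice: `CUMRAEXP` is not numeric and `Gender` is longer than one character.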

Code Set

When to Use a Code Set

A code set should be described, as in the code set example below, where the values for a variable conform to a controlled vocabulary or a finite list of possible values. A code set is not generally used for variables recording values that are continuous. However, even variables representing the recording of continuous values (e.g., measured doses) might include a code set if some designated value has a special meaning. For example, a variable describing exposures in rems might include a record with a value of "999." The value "999" might have been intended to mean no reading taken or dosimeter missing. In such cases, a code set should be provided to CEDR that defines the meaning of "999."
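
To see why documenting such a value matters, consider the following sketch, which separates measured values from coded special values before any arithmetic is done. The function name, the readings, and the wording of the code description are hypothetical.

```python
def split_special_values(readings, special_codes):
    """Separate measured values from values that carry a special meaning.

    `special_codes` maps a raw value (as it appears in the data file)
    to its documented description.
    """
    measured, special = [], []
    for raw in readings:
        if raw in special_codes:
            special.append(special_codes[raw])
        else:
            measured.append(float(raw))
    return measured, special


# Hypothetical code set for a dose variable.
dose_codes = {"999": "No reading taken or dosimeter missing"}
measured, special = split_special_values(["0.5", "999", "1.5"], dose_codes)

mean_dose = sum(measured) / len(measured)
```

Without the code set, a researcher could not tell that "999" is not a measurement, and it would badly distort a summary such as `mean_dose`.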

Code Set Template

Please provide the following information for the codes associated with each variable for which a code set was used when originating the data file. Code set documentation may be submitted in a tabular format such as a spreadsheet or table. Refer to the example below as needed.

Code Set Name:
Enter the name of the code set. (The variable file should point to this code set.)
1 Code Value:
Enter the code value as it appears in the data file or describe the special character(s) used.
2 Description of Code:
Enter the description of the code value.

The code set provided should include all non-continuous values that were available to be assigned to records in the file, even if some of those values were not used.

Download Code Set Template: To use the layout shown in the example, download the provided Code Set Template.

Code Set Documentation Example

Below is an example of the structure of a simple code set using a CEDR template for code set documentation (a variable, "sex," has two possible code values, 1=female and 2=male):

Example Table 4 - Code Set
Code Value	Description of Code
1	Female
2	Male
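
A code set like the one above can be used both to decode values and to confirm that every value in the data file is documented. This is a minimal sketch with hypothetical function names and made-up data values.

```python
def undefined_codes(values, code_set):
    """Return the distinct values that are not defined in the code set."""
    return sorted(set(values) - set(code_set))


def decode(values, code_set):
    """Replace each code value with its description."""
    return [code_set[value] for value in values]


# Code set from Example Table 4.
sex_codes = {"1": "Female", "2": "Male"}

column = ["1", "2", "2", "9"]  # made-up data; "9" is not documented
missing = undefined_codes(column, sex_codes)
labels = decode(["1", "2", "2"], sex_codes)
```

The check runs in one direction only: every value in the data must appear in the code set, but the code set may list values that never occur in the data.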

Other Considerations in Preparing Code Sets

Code sets are grouped by data file, i.e., each data file will have associated code set documentation applicable to the variables in that data file. If there are no code sets pertaining to a particular data file, indicate "No code sets for this data file."

Please provide separate descriptions for each code value used in each code set. Additionally, please provide individual code descriptions to indicate the meaning of null values, missing values, or special characters that are used in place of actual values.

For published standard codes, such as the International Classification of Diseases (ICD) or U.S. Census Bureau occupation codes, give a reference including date and version.
