Unit 2.5 Variable metadata

Unit study time

35 minutes

Intended Learning Outcome

By the end of the unit, you will be able to ...

Identify standard variable metadata elements.
Describe how variable metadata supports reuse and cross‑study analysis.
Document variable metadata for a simple dataset.
Recognise how value representations and valid ranges influence interpretation.

Variable metadata was defined in the 'Introduction to metadata' training course; please refer back to it for a refresher, including definitions. In short, variable metadata describes how a single variable is defined, represented, collected, and stored in the dataset.

Why create variable metadata?

If you were presented with the following information...

The dataset contains the variables:

f_n
s
h
b_d
m_s

Would you have enough information to be able to assess whether the dataset is of interest to you before you pay to access it?

Similarly, imagine this is a dataset you created 10 years ago and you're returning to see if it contains useful data for a new project. Would you remember what the variables f_n, s, h, b_d and m_s mean?

What information would you need in order to be able to understand the example dataset and its variables?

Variable metadata

To understand the variables, you will probably need answers to the following questions...

What do the abbreviations mean?
What type of data did you collect for the variable?
How was the variable measured?
How was the variable data collected?
What was the unit of study?
How much data was collected for each variable?
What are the valid inputs for the variable data?

Variable metadata provides the information to answer these questions.

If we didn't have information about the variables, we would have to decode what the variables mean and what type of data they hold. If it is our own data, this may include working back through legacy files to remember how the research was conducted. If it is someone else's data, or an external person is trying to understand ours, the data creator would have to be contacted to clarify what the variables mean and what type of data they hold. If information about the variables isn't available, the dataset is less trustworthy and, as such, less likely to be re-used.

What are the benefits of variable metadata?

Benefits of variable metadata

Personal benefit
By having a clear record of what variable abbreviations mean and the type of data collected, metadata allows you to manage your data during the research process. It also helps your future self when you return to your data, allowing you to quickly understand your dataset and potentially reuse it.

Re-usability of data
With variable metadata, people get a broader and more comprehensive picture of what data exists in a dataset. Even if a variable is not a focal point of the original research project or publication, another person may find this variable and reuse it in their secondary research. Moreover, by knowing what variables a research project contains, researchers may be able to make cross-study comparisons that they did not know would be possible if they were only working from catalogue and dataset metadata.

Discoverability of data
If you're sharing your (meta)data, variable metadata can also help the discoverability of data. Variable metadata is sometimes used by data catalogues as part of their search and filter functions. Some of these services also provide information about variables directly, allowing users to interact with variable metadata in a straightforward way.

In the rest of this unit, we'll go through variable metadata elements and look at what information to capture for each.

Variable Identification

First of all, you need to describe and define the variable itself. Identification metadata helps users understand what the variable is and how to recognise it. This includes...

Variable name
This is the term used for the variable in your dataset e.g. f_n, h, b_d, m_s

Variable label
A short, clear label describing the variable, explaining any abbreviations in the variable name e.g. f_n = first name

Variable description
A longer explanation of what the variable measures. Sometimes this includes information about conditions of the data collected (e.g. 'first name, no nicknames'), how the variable was collected (e.g. 'self-reported biological sex') and what measurement was used (e.g. 'height calculated to nearest cm'). Add enough detail so that someone new to your research can understand the variable. If you are using software packages to manage your variables, be aware that many of them impose a character limit on variable descriptions. It is therefore important to keep descriptions concise and focused. Do not include full question text or lengthy instructions here, as these belong in the question metadata and will be held elsewhere. This will be discussed further in the metadata relationships unit.

For our example dataset above, we could create this variable metadata.

Name	Label	Description
f_n	first name	Full legal first name, no nicknames
s	sex	Self-reported biological sex
h	height	Height calculated to nearest cm
b_d	birthdate	Date of birth yyyy-mm-dd
m_s	marital status	Self-reported marital status
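
As a minimal sketch, identification metadata like this could also be held in a machine-readable form. The `VariableIdentification` class, its field names and the `describe` helper are hypothetical and chosen for illustration; they are not taken from any metadata standard. Only a few rows of the example table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableIdentification:
    """Identification metadata for one variable (illustrative field names)."""
    name: str         # term used for the variable in the dataset
    label: str        # short, clear label explaining the abbreviation
    description: str  # longer explanation of what the variable measures


# A few rows based on the example table.
VARIABLES = [
    VariableIdentification("f_n", "first name", "Full legal first name, no nicknames"),
    VariableIdentification("h", "height", "Height calculated to nearest cm"),
    VariableIdentification("m_s", "marital status", "Self-reported marital status"),
]

# Index by variable name so an abbreviation can be looked up directly.
BY_NAME = {v.name: v for v in VARIABLES}


def describe(name: str) -> str:
    """Return 'name (label): description' for a variable name."""
    v = BY_NAME[name]
    return f"{v.name} ({v.label}): {v.description}"


print(describe("h"))
```

Keeping the three elements as separate fields, rather than one free-text note, makes it easier to check that every variable has a label and a description.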

Variable data descriptions

Next, you need to describe how data appears in the dataset. This metadata tells users and computers what kind of information the variable contains and how to interpret it.

How we describe data may vary across different projects, depending on what type of data is collected and the level of specificity you want to capture. It's ok to have your own way of describing your data. However, it is important that you stay consistent in your descriptions and that each metadata element has a clear purpose and a defined list of allowed values.

Value representation

Value representation describes how the data is represented. This could be text, numeric, date, codelist, scale, geographic or symbol. Documenting this metadata helps users interpret the data correctly in analysis.

For example:

Height: numeric (e.g. 156cm)
First name: text (e.g. 'Laura')
Date of birth: date (e.g. '2011-03-05')
Marital status: codelist (e.g. '1' = single, '2' = married)

Codelists

Codelists were explored in the previous unit. Here we will add more detail on codelists as value representations. While question-level codelists describe the response options shown to the participant, variable‑level codelists describe how those responses are recorded and represented in the dataset. The two may differ because the variable-level codelist can reflect additional processing, for example:

a category may be collapsed or expanded during data cleaning
extra codes may be added for “Don’t know”, “Refused”, “Not asked”, etc.
the question may allow for an "Other" response in addition to the categories provided, which may be coded
the question may allow multiple selections, but the dataset stores derived variables

A codelist may appear in both places, and may be the same, but the metadata role is different. Documenting the variable-level codelist is essential for interpreting the variable correctly, understanding missingness, and enabling others to reuse the dataset.
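
The difference between the two codelists can be sketched in a few lines. The codes 1 = single and 2 = married come from the example above; the negative codes for "Don't know" and "Refused" are hypothetical, as are the names `decode` and `is_substantive`.

```python
# Question-level codelist: the response options shown to the participant.
QUESTION_CODELIST = {1: "single", 2: "married"}

# Variable-level codelist: how responses are recorded in the dataset.
# The negative codes are hypothetical extra codes added during processing.
VARIABLE_CODELIST = {
    **QUESTION_CODELIST,
    -8: "Don't know",
    -9: "Refused",
}
MISSING_CODES = {-8, -9}


def decode(code: int) -> str:
    """Translate a stored code into its category label."""
    if code not in VARIABLE_CODELIST:
        raise ValueError(f"Code {code!r} is not in the variable-level codelist")
    return VARIABLE_CODELIST[code]


def is_substantive(code: int) -> bool:
    """True when the code is a real answer rather than a missing-type code."""
    return code in VARIABLE_CODELIST and code not in MISSING_CODES


stored = [1, 2, -9, 1]  # made-up stored values for illustration
print([decode(c) for c in stored])
print(sum(is_substantive(c) for c in stored), "substantive answers")
```

Without the variable-level codelist, a user could not tell that -9 is a refusal rather than a category, which is why it needs documenting separately from the question.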

Data type

Data type provides information about how the data appears in the dataset. This helps us to understand how to transform or merge datasets without errors.

For example, we've already described the variable 'height' as numeric in the value representation. What else could we specify about the numeric data?

It will only contain positive numbers, as height cannot be negative
In this dataset, the height variable only contains whole numbers, not decimals
In this case, we can describe the data type as positive integer. However, if the data did contain decimal points, for example 178.5cm, we might describe the data type for height as float.

For 'birthdate', the data is written in ISO 8601 format (yyyy-mm-dd), and you can use the data type column to specify this.

For the variables 'sex' and 'marital status' the codelist is expressed as whole numbers in the dataset. So we would put 'integer' as the data type. Alternatively, if the codes were expressed as 'm', 'fe', 'othr', 'ns' (rather than 1, 2, 3, 4), we would put the data type as 'string'.

Name	Label	Description	Value representation	Data type
f_n	first name	Full legal first name, no nicknames	text	string
s	sex	Self-reported biological sex	codelist	integer
h	height	Height calculated to nearest cm	numeric	positive integer
b_d	birthdate	Date of birth yyyy-mm-dd	date	date ISO 8601
m_s	marital status	Self-reported marital status	codelist	integer
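
Documented data types can also be checked automatically. The sketch below is one possible approach, not a prescribed method: the `conforms` function is hypothetical, and it assumes values arrive as raw strings (for example from a CSV file) and that the data type labels match those used in the table above.

```python
from datetime import date


def conforms(value: str, data_type: str) -> bool:
    """Check whether a raw string value fits a declared data type."""
    if data_type == "string":
        return True
    if data_type in ("integer", "positive integer"):
        try:
            number = int(value)
        except ValueError:
            return False  # e.g. "178.5" is not a whole number
        return number > 0 if data_type == "positive integer" else True
    if data_type == "float":
        try:
            float(value)
        except ValueError:
            return False
        return True
    if data_type == "date ISO 8601":
        try:
            date.fromisoformat(value)  # accepts yyyy-mm-dd
        except ValueError:
            return False
        return True
    raise ValueError(f"Unknown data type: {data_type}")


print(conforms("178", "positive integer"))
print(conforms("178.5", "positive integer"))
print(conforms("2011-03-05", "date ISO 8601"))
```

A check like this is only possible because the data type is recorded as its own metadata element with a defined list of allowed values.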

Unit of measurement

Once you have described the variable as it appears in the dataset, it's important to include extra contextual information so we can interpret the data further. For example, for height, we know the data is numeric. However, if the unit of measurement wasn't in the variable description, we wouldn't know if these numbers indicate metres, centimetres or millimetres. The variable description is an open text field and so may not contain this information, so it's important not to rely on it to cover other metadata.

The unit of measurement is simply the unit used to express the quantity or magnitude of the object you're measuring. In the example dataset, the unit of measurement for the variable height is centimetres. However, unit of measurement will not be applicable to some variables, such as first name or marital status. In these cases, you can put N/A.

Documenting unit of measurement can prevent incorrect comparisons (e.g., mixing cm and inches).

In order for research data to be suitable for analysis, it is best practice to use one unit of measurement per concept. For example, the variable data for height shouldn't be recorded as 1m 78cm as this would mix two units of measurement, metre and centimetres. So even if the data was collected in this way, we should transform the data into one measurement unit, e.g. 1.78m or 178 cm, keeping the one-to-one relationship between concept (height) and measure (m or cm).
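
The transformation described above can be sketched as follows. This is a minimal example that assumes heights were recorded as strings of the form '1m 78cm'; the function name `height_to_cm` and the accepted input format are assumptions made for illustration.

```python
import re

# Matches mixed-unit strings such as "1m 78cm" (input format assumed).
MIXED_HEIGHT = re.compile(r"^\s*(\d+)\s*m\s*(\d+)\s*cm\s*$")


def height_to_cm(raw: str) -> int:
    """Convert a 'metres + centimetres' string into whole centimetres."""
    match = MIXED_HEIGHT.match(raw)
    if match is None:
        raise ValueError(f"Unrecognised height format: {raw!r}")
    metres, centimetres = int(match.group(1)), int(match.group(2))
    return metres * 100 + centimetres


print(height_to_cm("1m 78cm"))  # 178
```

Raising an error for unrecognised formats, rather than guessing, keeps values in an unexpected unit from slipping into the dataset unnoticed.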

Valid range

Valid range outlines the allowed values for the variable. Documenting this helps identify invalid, erroneous, or out-of-range values and supports automated quality checks.

For the example dataset ...

For height, you may want to set permitted values to be between 55 and 220 cm (depending on the population you are collecting data from).
If the example dataset only collected data from living adults, you could set a valid range for birthdate of 1910 to 2007.
For text, you may want to specify how many characters are permitted and set a valid range of 1-50 characters.
For variables using codelists (sex and marital status), you could include a reference to the codelist you are using so people can look up the permissible codes and their meanings (codelist metadata was covered in more depth in unit 2.4).

Name	Label	Value representation	Data type	Unit of measurement	Valid range
f_n	first name	text	string	N/A	1-50 characters
s	sex	codelist	integer	N/A	CLsx_01
h	height	numeric	positive integer	cm	55 - 220
b_d	birthdate	date	date ISO 8601	N/A	1910-01-01 to 2007-01-01
m_s	marital status	codelist	integer	N/A	CLms_01
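
Documented valid ranges lend themselves to automated quality checks. The sketch below uses the ranges from the example dataset; the dictionary layout and the function names `in_valid_range` and `valid_name` are hypothetical.

```python
from datetime import date

# Valid ranges from the example dataset; the layout is illustrative only.
VALID_RANGES = {
    "h": (55, 220),                               # height in cm
    "b_d": (date(1910, 1, 1), date(2007, 1, 1)),  # birthdate
}
MAX_NAME_LENGTH = 50  # first name: 1-50 characters


def in_valid_range(variable: str, value) -> bool:
    """True when the value lies within the documented range (inclusive)."""
    low, high = VALID_RANGES[variable]
    return low <= value <= high


def valid_name(value: str) -> bool:
    """True when a text value has between 1 and 50 characters."""
    return 1 <= len(value) <= MAX_NAME_LENGTH


print(in_valid_range("h", 178))
print(in_valid_range("h", 1780))
print(in_valid_range("b_d", date(2011, 3, 5)))
print(valid_name("Laura"))
```

Here a height of 1780 is flagged as out of range, which might point to a value entered in millimetres rather than centimetres.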

Variable provenance

Provenance describes where the variable data originated, how it was collected, and whether it has been transformed. Provenance information increases transparency and improves the trustworthiness and re‑usability of research data. The provenance of a variable relies on additional metadata which is provided around the variable, rather than held in the variable itself. Links can be provided to this metadata rather than duplicated here. This was briefly described in the questions and measurement unit and will be raised again in the metadata relationships unit.

The provenance recorded at the variable level covers any processing or derivation applied after collection. This may include...

Derived (Y/N)
Indicates whether the variable was directly collected or created from one or more other variables through recoding, calculation, or transformation.

Derivation
The logic or algorithm used (can include a formal expression or natural-language description).
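
To illustrate, the sketch below derives a hypothetical `age` variable from birthdate and records its provenance alongside it. The `age` variable is not part of the example dataset, and the `source_variables` field is an assumption added for illustration.

```python
from datetime import date


def derive_age(birthdate: date, reference: date) -> int:
    """Completed years between birthdate and the reference date."""
    had_birthday = (reference.month, reference.day) >= (birthdate.month, birthdate.day)
    return reference.year - birthdate.year - (0 if had_birthday else 1)


# Provenance metadata recorded for the derived variable.
AGE_PROVENANCE = {
    "name": "age",                # hypothetical derived variable
    "derived": "Y",
    "derivation": "Completed years between b_d and the reference date",
    "source_variables": ["b_d"],  # assumed extra field, not a standard element
}

# The day before the 13th birthday, so 12 completed years.
print(derive_age(date(2011, 3, 5), date(2024, 3, 4)))  # 12
```

Writing the derivation down matters because the result depends on choices such as the reference date, which a user could not recover from the values alone.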

Variable statistics

Once we've described the type of data and its parameters, we can give an overview of how complete, valid, and reliable the data values are. This can help identify issues and inform decisions about whether the variable is suitable for analysis in your research. These statistics also support transparent reporting and future reuse.

Missing value code
The missing value code is the specific code used in the dataset to indicate that no valid data was recorded (e.g., -999, .). It is useful to include it for each variable so you can identify where there are gaps in the data. Recording this code ensures that users can correctly identify missing data and avoid treating these values as meaningful numbers.

Number of rows or cases
The total number of records for the variable, including both valid and missing values. Missing value codes are counted within the total number of cases, but are explicitly identified using the missing value code metadata and excluded when calculating valid cases.

Invalid cases
Because the valid range for a variable was outlined earlier, you can count the number of invalid cases, that is, cases with values outside that range. Invalid cases should also include cases that contain other errors which mean the data should not be considered when drawing conclusions.

Valid cases
By subtracting invalid cases from the number of cases, you can calculate the number of valid cases. This can help us understand how effective a research process was and whether further investigation into how data was collected is needed. For example, if the number of invalid cases is high compared with the number of valid cases, we will probably have to look into the trustworthiness of the data before we draw conclusions. Equally, it can point to areas where you could improve your data collection processes.

Name	Label	Missing value code	Number of cases	Invalid cases	Valid cases
f_n	first name	-999	186	4	182
s	sex	-999	186	4	182
h	height	-999	186	7	179
b_d	birthdate	-999	186	0	186
m_s	marital status	-999	186	1	185
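
These statistics can be calculated directly from the data once the missing value code and valid range are documented. The sketch below uses made-up height values and assumes that missing values are counted separately from out-of-range values; if your project counts missing values as invalid cases, combine the two counts. The `summarise` function is hypothetical.

```python
MISSING_CODE = -999
VALID_RANGE = (55, 220)  # height in cm, from the valid range example


def summarise(values: list[int]) -> dict[str, int]:
    """Count cases, missing values, invalid values and valid values."""
    low, high = VALID_RANGE
    missing = sum(1 for v in values if v == MISSING_CODE)
    # Invalid here means recorded (not missing) but outside the valid range.
    invalid = sum(1 for v in values if v != MISSING_CODE and not low <= v <= high)
    return {
        "cases": len(values),
        "missing": missing,
        "invalid": invalid,
        "valid": len(values) - missing - invalid,
    }


heights = [178, 165, -999, 1780, 172, 30]  # made-up values for illustration
print(summarise(heights))
```

Note that without the missing value code, -999 would simply be counted as an out-of-range height, and the reason for the gap in the data would be lost.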

Sub‑Universe and missingness
Variables created from conditioned questions will have structural missing values for respondents who were not eligible to answer. These missing values reflect the sub‑universe defined at the question level, and are not data quality problems. Recording sub‑universes helps interpret valid cases, missingness patterns, and comparability.

Test your knowledge

What is meant by “variable metadata”?

The number of records in a dataset
Information describing the meaning and properties of variables
The software used to analyse data
The location where data are stored

Reveal answer

Variable metadata describes the meaning, structure, and characteristics of variables, helping users understand how to interpret them.

Why is it important to document derived variables?

They are always self-explanatory
They reduce dataset size
Their calculation and assumptions affect interpretation
They replace raw variables

Reveal answer

Derived variables depend on how they are calculated, so documenting this is essential for understanding and reuse.

A researcher uses a variable without checking its metadata and draws conclusions. What is the main risk?

Faster analysis
Incorrect conclusions due to misunderstanding the variable
Reduced dataset size
Improved comparability

Reveal answer

Without consulting metadata, variables may be misinterpreted, leading to invalid conclusions.

Summary

Variable metadata describes how each variable in a dataset is defined, represented, and interpreted.

Key elements include:

Identification: what the variable is
Data description: how data appears and can be interpreted
Provenance: how the variable was created or derived
Statistics: how complete and reliable the data is

Together, these elements allow researchers to understand, assess, and reuse data correctly, supporting transparency, comparability, and high‑quality analysis.