Unit 2.4 Codes and category metadata

Overview

Unit study time

20 minutes

Intended Learning Outcome

By the end of the unit, you will be able to ...

Explain the purpose of codelists at question and variable levels.
Identify the risks of missing or inconsistent codelist metadata.
Evaluate whether a dataset provides sufficient codelist information for reuse.

What is a codelist?

A codelist is a type of value representation or response domain. Put simply, a codelist is a list of codes that represent different categories. The categories are the pre-defined options allowed for a particular variable or question. Each category is assigned a code (usually numeric) so that responses can be recorded, stored, and analysed efficiently. In this way, the categories define the possible answers, while the codes provide a standardised way to represent those answers in data processing and analysis.

Codelists appear in two different parts of your metadata, and although the term is the same, they serve different purposes depending on whether you are documenting the question or the variable.

At the question level, a codelist defines the set of answer options a respondent can choose from. This forms part of the response domain for the question, which was explored in the previous unit. The codelist describes the valid responses presented to the participant: it tells us what the respondent could pick when answering the question.
At the variable level, a codelist forms part of the value representation. This is how the values are recorded and represented in the data file, rather than what the respondent originally saw. Once the data is collected and stored in a dataset, the same categories appear as in the response domain, but additional values may be added (e.g. -1 = Refused, -9 = Missing, 3 = Don’t know), and the format of the codes may differ from what appeared in the questionnaire. This will be explored further in the next unit.

	Response domain codelist	Value representation codelist
Associated with	Question	Variable
Purpose	Defines what the participant could choose for their response to a question	Defines how the data is stored and represented in the data file
What it represents	Valid answers to the question	The codes and categories in the dataset
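
As a minimal sketch of the distinction in the table above, the Python below models a response domain codelist and derives a value representation codelist from it. The Yes/No question and the variable names are illustrative assumptions; the extra codes reuse the -1 and -9 examples mentioned earlier.

```python
# Hypothetical response domain: the options shown to the respondent.
response_domain = {1: "Yes", 2: "No"}

# Extra codes that only exist in the stored data (assumed for illustration).
missing_codes = {-1: "Refused", -9: "Missing"}

# The value representation holds the same categories plus the extra codes.
value_representation = {**response_domain, **missing_codes}

# Codes that a respondent could never have selected on the questionnaire.
data_only_codes = sorted(set(value_representation) - set(response_domain))
print(data_only_codes)
```

Keeping the two codelists separate makes it clear which codes describe the question and which were introduced when the data was stored.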

Why do you think it is important to capture metadata for codelists?

With the following variable metadata...

Name	Label	Value representation	Data type
f_n	first name	text	string
s	sex	codelist	numeric
h	height	numeric	positive integer
b_d	birthdate	date	ISO 8601
m_s	marital status	codelist	numeric

Do we have enough information to understand the data in the example dataset we looked at in unit 2.2...

f_n	h	b_d	s	m_s
John	178	1998-09-02	2	3
Gill	200	1934-06-12	1	4
Alice	182	1922-12-24	4	1
Fred	168	2001-05-16	5	2
Laura	156	2011-03-05	3	1

Do you know what the code '1' means for the variable 'Sex'? Do you know what the code '3' means for the variable 'Marital status'?

Equally, if the codes were represented by letters, would the data become any clearer?

f_n	h	b_d	s	m_s
John	178	1998-09-02	Fm	M
Gill	200	1934-06-12	M	W
Alice	182	1922-12-24	NS	S
Fred	168	2001-05-16	Fd	D
Laura	156	2011-03-05	NB	S

As codes are just a sign to represent a category, we need information about the categories to assign meaning to the codes and understand the data. Sometimes the category can be inferred from the code. For example, you might interpret 'S' for marital status to mean 'Single'. However, guessing the meaning of codes in this way can cause errors: 'S' could equally signify 'Separated'. Moreover, the same code could have a different meaning in different codelists within the same dataset. For example, 'M' in the variable 's' means 'Male', whereas 'M' in the variable 'm_s' means 'Married'. In itself, a code has no meaning. Instead, it is a symbol that stands for the category it has been associated with. As part of our metadata creation, it is important that we define codes so they have meaning.
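
The point that a code only has meaning within its own codelist can be sketched in Python as follows. This is an illustration only: the `codelists` dictionary and the `decode` function are hypothetical names, and only the two meanings of 'M' stated above are included.

```python
# One codelist per variable. Only meanings stated in the text are included.
codelists = {
    "s": {"M": "Male"},
    "m_s": {"M": "Married"},
}


def decode(variable: str, code: str) -> str:
    """Return the category for a code, looked up in that variable's own codelist."""
    categories = codelists[variable]
    if code not in categories:
        # Refuse to guess: an undocumented code has no meaning.
        raise KeyError(f"Code {code!r} is not documented for variable {variable!r}")
    return categories[code]


print(decode("s", "M"))    # Male
print(decode("m_s", "M"))  # Married
```

Calling `decode("m_s", "S")` raises an error rather than guessing between 'Single' and 'Separated', which is the behaviour you want when the codelist metadata is incomplete.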

If you don't have codelist metadata ...

You will not know what the codes mean, and will have to either guess or resort to time-consuming steps such as contacting the data creator.
The meaning of the codes may be forgotten or lost over time.
The value of your variable and question metadata is diminished, as you cannot be sure what data was collected.

::: notes: Representations – Codes and Categories: DDI Alliance Training Library, Version 1.0, DDI Alliance, DDI Train the Trainer Workshop, DDI Training Working Group :::

Codelist metadata

So what metadata can you create for codelists?

First, you need to describe the codelist itself. This is particularly important if your dataset contains different codelists for different variables or questions, you would like to reuse codelists, or you need to ensure consistency in how categories are defined throughout your study.

Codelist name
Codelist name is a unique identifier for that code set.

Codelist label
Codelist label is a short description of what the codelist refers to.

Codelist description
Codelist description provides a longer explanation of what the codelist describes.

Codelist name	Codelist label	Codelist description
person_gender_si	Gender of person	Eight gender categories for self-identification
person_m_s	Marital status of person	Seven marital status categories for self-identification
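
The three codelist-level fields above could be held in a small record type. The sketch below is one possible way to do this in Python; the class name `CodelistMetadata` and the helper `duplicate_names` are hypothetical, and the descriptions are abbreviated. It also checks that codelist names are unique, since the name acts as the identifier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodelistMetadata:
    name: str         # unique identifier for the codelist
    label: str        # short description
    description: str  # longer explanation


def duplicate_names(codelists):
    """Return any codelist names that are used more than once."""
    counts = Counter(c.name for c in codelists)
    return sorted(name for name, n in counts.items() if n > 1)


codelists = [
    CodelistMetadata("person_gender_si", "Gender of person",
                     "Gender categories for self-identification"),
    CodelistMetadata("person_m_s", "Marital status of person",
                     "Marital status categories for self-identification"),
]

print(duplicate_names(codelists))
```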

Code and category metadata

Once you describe the purpose of the codelist and the codes it includes, you then need to describe the codes themselves.

The meanings of the codes are called categories: each category tells you what its code stands for.

Let's look at describing the codes for the codelist person_gender_si.

The two examples using different codes for gender are...

Example 1	Example 2
1	M
2	Fm
3	TM
4	TFm
5	NB
6	Fd
7	NS
8	PN

However, both examples use the same categories for gender. The categories are...

Male
Female
Trans male
Trans female
Non-binary
Gender fluid
Not specified
Prefer not to answer

In our metadata, we can then map the code to the category...

Codelist name	Code	Category
person_gender_si	1	Male
person_gender_si	2	Female
person_gender_si	3	Trans male
person_gender_si	4	Trans female
person_gender_si	5	Non-binary
person_gender_si	6	Gender fluid
person_gender_si	7	Not specified
person_gender_si	8	Prefer not to answer

Codelist name	Code	Category
person_gender_si	M	Male
person_gender_si	Fm	Female
person_gender_si	TM	Trans male
person_gender_si	TFm	Trans female
person_gender_si	NB	Non-binary
person_gender_si	Fd	Gender fluid
person_gender_si	PN	Prefer not to answer
person_gender_si	NS	Not specified
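
Once the mapping is recorded, it can be used to make the data readable. The sketch below takes the letter-code mapping from the table above and applies it to the values of the variable 's' from the letter-coded example dataset. The variable names are assumptions made for illustration.

```python
# Code-to-category mapping for person_gender_si (letter codes).
person_gender_si = {
    "M": "Male",
    "Fm": "Female",
    "TM": "Trans male",
    "TFm": "Trans female",
    "NB": "Non-binary",
    "Fd": "Gender fluid",
    "PN": "Prefer not to answer",
    "NS": "Not specified",
}

# Values of the variable "s" from the letter-coded example dataset.
s_values = ["Fm", "M", "NS", "Fd", "NB"]

# Replace each code with its category.
decoded = [person_gender_si[code] for code in s_values]
print(decoded)

# Any codes found in the data that the codelist does not document.
undocumented = sorted(set(s_values) - set(person_gender_si))
print(undocumented)
```

Checking for undocumented codes is a quick way to evaluate whether a dataset provides enough codelist information for reuse.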

Codelists and controlled vocabularies

When designing a questionnaire and creating codelists for questions, it is best practice to use controlled vocabularies to define your codelists or categories where possible. Using controlled vocabularies for categories will be important when sharing your data more widely, as different people may interpret the same category in different ways. Defining each category ensures you are referring to the same underlying concept rather than relying on individual interpretations. You can add your own definitions to your category metadata, or point to predefined definitions in existing concept-based ontologies (a specific type of controlled vocabulary) using a stable link, that is, a web address intended to stay the same over time (ideally a persistent identifier). We will discuss how codelist categories are concepts in a later unit.

Example ontologies to use for concepts:

MeSH (Medical Subject Headings)
ELSST
Homosaurus (LGBTQ+ vocabulary)

If you use a controlled vocabulary to define a category, you will need to document what controlled vocabulary you used. You can input a persistent identifier (URI/DOI) pointing directly to a category definition within the controlled vocabulary.

You may also want to draw on an existing codelist that has been established by an authoritative institution. While using controlled vocabularies defines the meaning of individual categories, using standardised codelists provides the structure you use to collect and store responses in a consistent, organised way. This improves comparability and interoperability.

Example standardised codelists:

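ISO (International Organization for Standardization)
ONS (Office for National Statistics)
EU Vocabularies code lists (https://op.europa.eu/en/web/eu-vocabularies/code-lists)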

If you use an existing codelist, you will need to document what you used. You can input a URI/DOI pointing directly to a codelist where available. You may need to add additional metadata depending on how the codelist is maintained. For example, ONS provides stable links, but a single link may contain multiple codelists, so you should also record the identifier of the codelist you used. Since codelists may be updated, you should also record the version or the date on which it was used.

Codelist name	Codelist reference	Version
ONS_gender_identity_8a	https://www.ons.gov.uk/census/census2021dictionary/variablesbytopic/sexualorientationandgenderidentityvariablescensus2021/genderidentity/classifications#	25 September 2023
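
A reference to an existing codelist can be recorded and checked for completeness in the same way as other metadata. The sketch below uses the row from the table above; the class name `CodelistReference` and the helper `missing_fields` are hypothetical and are not part of any standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodelistReference:
    name: str       # identifier of the codelist that was used
    reference: str  # stable link to where the codelist is published
    version: str    # version, or the date on which the codelist was used


def missing_fields(ref: CodelistReference) -> list:
    """Return the names of any fields that have been left empty."""
    return [field for field, value in vars(ref).items() if not value.strip()]


ons_gender = CodelistReference(
    name="ONS_gender_identity_8a",
    reference="https://www.ons.gov.uk/census/census2021dictionary/variablesbytopic/sexualorientationandgenderidentityvariablescensus2021/genderidentity/classifications#",
    version="25 September 2023",
)

print(missing_fields(ons_gender))
```

A reference with an empty version would be flagged, which is a reminder that a link alone does not say which edition of the codelist was applied.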

Test your knowledge

Which of the following is an example of a code and its category?

2020 = year of data collection
1 = Male, 2 = Female
Age in years
Household income value

Reveal answer

Codes (e.g. 1, 2) are linked to categories (e.g. Male, Female) to make data interpretable.

Two datasets use different coding schemes for the same concept (e.g. 0/1 vs 1/2). What should you do before combining them?

Ignore the difference
Use the larger dataset only
Delete the variable
Recode values to ensure consistency

Reveal answer

Coding differences must be reconciled (harmonised) to ensure valid comparison or integration.
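
As a rough sketch of what recoding can look like, the Python below harmonises two datasets by matching codes on their categories. The two codelists (0/1 and 1/2 for a Yes/No item) and the data values are invented for illustration; in practice they would come from each dataset's codelist metadata.

```python
# Hypothetical codelists for the same concept in two datasets.
codelist_a = {0: "No", 1: "Yes"}
codelist_b = {1: "Yes", 2: "No"}

# Look up dataset A's code for each category.
category_to_a = {category: code for code, category in codelist_a.items()}

# Build a recode map from dataset B's codes to dataset A's codes.
recode_b_to_a = {
    code: category_to_a[category] for code, category in codelist_b.items()
}

# Hypothetical values from dataset B, recoded to dataset A's scheme.
values_b = [1, 2, 2, 1]
harmonised_b = [recode_b_to_a[value] for value in values_b]
print(harmonised_b)
```

The recode map is built from the categories rather than written by hand, so it can only be created when both datasets document what their codes mean.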

How do codes and category metadata relate to question and measurement metadata?

They provide the representation of measured concepts in the data
Codes replace measurement definitions
They are unrelated
They only affect file structure

Reveal answer

Measurement metadata defines what is being measured, while codes and categories show how those measurements are represented in the dataset.