Unit 2.4 Codes and category metadata
Overview
Unit study time
- 20 minutes
Intended Learning Outcome
By the end of the unit, you will be able to ...
- Explain the purpose of codelists at question and variable levels.
- Identify the risks of missing or inconsistent codelist metadata.
- Evaluate whether a dataset provides sufficient codelist information for reuse.
What is a codelist?
A codelist is a type of value representation or response domain. Put simply, a codelist is a list of codes that represent different categories. The categories are the pre-defined options allowed for a particular variable or question. Each category is assigned a code (usually numeric) so that responses can be recorded, stored, and analysed efficiently. In this way, the categories define the possible answers, while the codes provide a standardised way to represent those answers in data processing and analysis.
Codelists appear in two different parts of your metadata, and although the term is the same, they serve different purposes depending on whether you are documenting the question or the variable.
- At the question level, a codelist defines the set of answer options a respondent can choose from. It forms part of the response domain for the question, which was explored in the previous unit. The codelist describes the valid responses presented to the participant and tells us what the respondent could pick when answering the question.
- At the variable level, a codelist forms part of the value representation. This is how the values are recorded and represented in the data file, rather than what the respondent originally saw. Once the data is collected and stored in a dataset, the same categories appear as in the response domain, but additional values may be added (e.g. -1 = Refused, -9 = Missing, 3 = Don’t know), and the format of the codes may differ from what appeared in the questionnaire. This will be explored further in the next unit.
| | Response domain codelist | Value representation codelist |
|---|---|---|
| Associated with | Question | Variable |
| Purpose | Defines what the participant could choose for their response to a question | Defines how the data is stored and represented in the data file |
| What it represents | Valid answers to the question | The codes and categories in the dataset |
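The difference between the two kinds of codelist can be sketched in a few lines of Python. This is only an illustration: the categories "Yes" and "No" and the variable names are assumptions made for the example, while the extra codes are the ones mentioned in the text above.

```python
# Question level: the answer options shown to the respondent.
# "Yes"/"No" are hypothetical categories chosen for illustration.
response_domain = {1: "Yes", 2: "No"}

# Variable level: the same categories, plus extra values added when the
# data is stored (the extra codes follow the examples in the text above).
value_representation = {
    **response_domain,
    3: "Don't know",
    -1: "Refused",
    -9: "Missing",
}

# Codes that exist only in the data file, not in the questionnaire.
added_codes = sorted(set(value_representation) - set(response_domain))
print(added_codes)  # [-9, -1, 3]
```

Keeping the two mappings separate makes it easy to see which values a respondent could actually choose and which were introduced during data processing.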
Why do you think it is important to capture metadata for codelists?
With the following variable metadata...
| Name | Label | Value representation | Data type |
|---|---|---|---|
| f_n | first name | text | string |
| s | sex | codelist | numeric |
| h | height | numeric | positive integer |
| b_d | birthdate | date | ISO 8601 |
| m_s | marital status | codelist | numeric |
Do we have enough information to understand the data in the example dataset we looked at in unit 2.2...
| f_n | h | b_d | s | m_s |
|---|---|---|---|---|
| John | 178 | 1998-09-02 | 2 | 3 |
| Gill | 200 | 1934-06-12 | 1 | 4 |
| Alice | 182 | 1922-12-24 | 4 | 1 |
| Fred | 168 | 2001-05-16 | 5 | 2 |
| Laura | 156 | 2011-03-05 | 3 | 1 |
Do you know what the code '1' means for the variable 'Sex'? Do you know what the code '3' means for the variable 'Marital status'?
Equally, if the codes were represented by letters, would the data become any clearer?
| f_n | h | b_d | s | m_s |
|---|---|---|---|---|
| John | 178 | 1998-09-02 | Fm | M |
| Gill | 200 | 1934-06-12 | M | W |
| Alice | 182 | 1922-12-24 | NS | S |
| Fred | 168 | 2001-05-16 | Fd | D |
| Laura | 156 | 2011-03-05 | NB | S |
As a code is just a sign that represents a category, we need information about the categories to assign meaning to the codes and understand the data. Sometimes the category can be inferred from the code. For example, you might interpret 'S' for marital status as 'Single'. However, guessing the meaning of codes in this way can cause errors: 'S' could equally signify 'Separated'. Moreover, the same code can have different meanings in different codelists within the same dataset. For example, 'M' in the variable 's' means 'Male', whereas 'M' in the variable 'm_s' means 'Married'. In itself, a code has no meaning. Instead, it is a symbol that points to another term with which it has been associated. As part of our metadata creation, it is important that we define codes so they have meaning.
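A small Python sketch can show why a code only has meaning in the context of its codelist. The codelists here are deliberately partial and hypothetical: only the two 'M' entries come from the text above, and 'S' is given one of the two readings the text mentions. The `decode` helper is an assumption made for illustration.

```python
# Hypothetical, partial codelists keyed by variable name.
codelists = {
    "s": {"M": "Male"},
    "m_s": {"M": "Married", "S": "Single"},
}


def decode(variable: str, code: str) -> str:
    """Return the documented category for a code, or fail loudly."""
    try:
        return codelists[variable][code]
    except KeyError:
        raise KeyError(
            f"No category documented for code {code!r} in variable {variable!r}"
        ) from None


print(decode("s", "M"))    # Male
print(decode("m_s", "M"))  # Married
```

Raising an error for undocumented codes, rather than guessing, mirrors the point made above: without the codelist, the code cannot be interpreted reliably.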
If you don't have codelist metadata ...
- You will not know what the codes mean and will have to either guess or resort to time-consuming steps such as contacting the data creator
- The meaning of the codes can be forgotten or lost
- The value of your variable/question metadata is diminished, as you cannot be sure what data was collected
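One simple way to evaluate whether a dataset has enough codelist information is to check that every variable declared as a codelist actually has a code-to-category mapping. The sketch below assumes the variable metadata from the table above is held in a plain dictionary; the function name and data structures are hypothetical.

```python
# Variable metadata from the table above: name -> value representation.
variables = {
    "f_n": "text",
    "s": "codelist",
    "h": "numeric",
    "b_d": "date",
    "m_s": "codelist",
}

# Codelists actually documented for each variable (none so far).
documented_codelists: dict[str, dict] = {}


def variables_missing_codelists(variables, documented):
    """List variables declared as codelists that have no documented codes."""
    return [
        name
        for name, representation in variables.items()
        if representation == "codelist" and not documented.get(name)
    ]


print(variables_missing_codelists(variables, documented_codelists))
# ['s', 'm_s']
```

A non-empty result signals that the data cannot be interpreted without going back to the data creator.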
::: notes: Representations – Codes and Categories: DDI Alliance Training Library, Version 1.0, DDI Alliance, DDI Train the Trainer Workshop, DDI Training Working Group :::
Codelist metadata
So what code metadata can you create for codelists?
First, you need to describe the codelist itself. This is particularly important if your dataset contains different codelists for different variables or questions, you would like to reuse codelists, or you need to ensure consistency in how categories are defined throughout your study.
Codelist name
Codelist name is a unique identifier for that code set.
Codelist label
Codelist label is a short description of what the codelist refers to.
Codelist description
Codelist description provides a longer explanation of what the codelist describes.
| Codelist name | Codelist label | Codelist description |
|---|---|---|
| person_gender_si | Gender of person | Six gender categories for self-identification |
| person_m_s | Marital status of person | Seven marital status categories for self-identification |
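If you manage your metadata programmatically, the three codelist-level fields could be held in a small record type. This is a sketch under the assumption that a Python dataclass is a suitable container; the class name `CodelistMetadata` is hypothetical, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodelistMetadata:
    name: str         # unique identifier for the code set
    label: str        # short description of what the codelist refers to
    description: str  # longer explanation of what the codelist describes


codelists = [
    CodelistMetadata(
        "person_gender_si",
        "Gender of person",
        "Six gender categories for self-identification",
    ),
    CodelistMetadata(
        "person_m_s",
        "Marital status of person",
        "Seven marital status categories for self-identification",
    ),
]

# Names must be unique so that questions and variables can refer to a
# codelist unambiguously.
names = [c.name for c in codelists]
assert len(names) == len(set(names)), "codelist names must be unique"
```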
Code and category metadata
Once you describe the purpose of the codelist and the codes it includes, you then need to describe the codes themselves.
The meaning of each code is called its category. Categories assign meaning to the codes.
Let's look at describing the codes for the codelist person_gender_si.
The two examples using different codes for gender are...
| Example one | Example two |
|---|---|
| 1, 2, 3, 4, 5, 6, 7, 8 | M, Fm, TM, TFm, NB, Fd, NS, PN |
However, both examples use the same categories for gender. The categories are...
- Male
- Female
- Trans male
- Trans female
- Non-binary
- Gender fluid
- Not specified
- Prefer not to answer
In our metadata, we can then map the code to the category...
| Codelist name | Code | Category |
|---|---|---|
| person_gender_si | 1 | Male |
| person_gender_si | 2 | Female |
| person_gender_si | 3 | Trans male |
| person_gender_si | 4 | Trans female |
| person_gender_si | 5 | Non-binary |
| person_gender_si | 6 | Gender fluid |
| person_gender_si | 7 | Prefer not to answer |
| person_gender_si | 8 | Not specified |
Or, using the letter codes from the second example...
| Codelist name | Code | Category |
|---|---|---|
| person_gender_si | M | Male |
| person_gender_si | Fm | Female |
| person_gender_si | TM | Trans male |
| person_gender_si | TFm | Trans female |
| person_gender_si | NB | Non-binary |
| person_gender_si | Fd | Gender fluid |
| person_gender_si | PN | Prefer not to answer |
| person_gender_si | NS | Not specified |
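Because both sets of codes point to the same categories, the categories can act as a bridge between them. The following Python sketch, which assumes the two mappings above are stored as dictionaries (the variable names are hypothetical), checks that the categories match and converts letter codes to numeric codes.

```python
numeric_codes = {
    1: "Male", 2: "Female", 3: "Trans male", 4: "Trans female",
    5: "Non-binary", 6: "Gender fluid",
    7: "Prefer not to answer", 8: "Not specified",
}
letter_codes = {
    "M": "Male", "Fm": "Female", "TM": "Trans male", "TFm": "Trans female",
    "NB": "Non-binary", "Fd": "Gender fluid",
    "PN": "Prefer not to answer", "NS": "Not specified",
}

# Both schemes describe the same set of categories.
same_categories = set(numeric_codes.values()) == set(letter_codes.values())

# Translate letter codes to numeric codes via the shared category.
category_to_number = {category: code for code, category in numeric_codes.items()}
letter_to_number = {
    letter: category_to_number[category]
    for letter, category in letter_codes.items()
}

print(same_categories)         # True
print(letter_to_number["NB"])  # 5
```

This kind of conversion is only possible because the categories are documented; the codes alone would not tell you that 'NB' and 5 mean the same thing.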
Codelists and controlled vocabularies
When designing a questionnaire and creating codelists for questions, it is best practice to use controlled vocabularies to define your codelists or categories where possible. Using controlled vocabularies for categories is important when sharing your data more widely, as different people may interpret the same category in different ways. Defining each category ensures you are referring to the same underlying concept rather than relying on individual interpretations. You can add your own definitions to your category metadata, or point to predefined definitions in existing concept-based ontologies (a specific type of controlled vocabulary) using a stable link (ideally a persistent identifier). We will discuss how codelist categories are concepts in a later unit.
Example ontologies to use for concepts:
- MeSH (Medical Subject Headings)
- ELSST (European Language Social Science Thesaurus)
- Homosaurus (LGBTQ+ vocabulary)
If you use a controlled vocabulary to define a category, you will need to document what controlled vocabulary you used. You can input a persistent identifier (URI/DOI) pointing directly to a category definition within the controlled vocabulary.
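As a rough illustration, category metadata could carry either your own definition or a pointer to a concept in a controlled vocabulary. The sketch below is an assumption about how this might be modelled, not a prescribed structure: the `Category` class and its field names are hypothetical, and the URI is a placeholder rather than a real Homosaurus identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Category:
    label: str
    definition: Optional[str] = None   # your own definition, if any
    vocabulary: Optional[str] = None   # name of the controlled vocabulary used
    concept_uri: Optional[str] = None  # persistent identifier of the concept

    def is_defined(self) -> bool:
        """True if the category has its own definition or a vocabulary link."""
        return bool(self.definition or (self.vocabulary and self.concept_uri))


# Placeholder URI for illustration only.
non_binary = Category(
    "Non-binary",
    vocabulary="Homosaurus",
    concept_uri="https://example.org/vocab/non-binary",
)
male = Category("Male")

print(non_binary.is_defined())  # True
print(male.is_defined())        # False
```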
You may also want to draw on an already defined codelist that has been established by a core institution. While using controlled vocabularies defines the meaning of individual categories, using standardised codelists provides the structure you use to collect and store responses in a consistent, organised way. This improves comparability and interoperability.
Example standardised codelists:
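- ISO (International Organization for Standardization)
- ONS (Office for National Statistics)
- EU Vocabularies code lists: https://op.europa.eu/en/web/eu-vocabularies/code-lists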
If you use an existing codelist, you will need to document what you used. You can record a URI/DOI pointing directly to the codelist where available. You may need to add further metadata depending on how the codelist is maintained. For example, ONS provides stable links, but because a linked page may contain multiple codelists, you should also record the codelist identifier. And because codelists may be updated, you should also record the version or the date on which you used it.
| Codelist name | Codelist reference | Version |
|---|---|---|
| ONS_gender_identity_8a | https://www.ons.gov.uk/census/census2021dictionary/variablesbytopic/sexualorientationandgenderidentityvariablescensus2021/genderidentity/classifications# | 25 September 2023 |
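A reference record like the one in the table can be checked for completeness before it is published. The sketch below is a minimal example under stated assumptions: the `CodelistReference` class and its `problems` method are hypothetical, and the link check only accepts http(s) URIs, so a bare DOI would need to be written as a resolvable link.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class CodelistReference:
    name: str
    reference: str
    version: str

    def problems(self) -> list[str]:
        """Return a list of issues found in the reference metadata."""
        issues = []
        parsed = urlparse(self.reference)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            issues.append("reference is not an http(s) URI")
        if not self.version.strip():
            issues.append("version or date used is missing")
        return issues


ons = CodelistReference(
    name="ONS_gender_identity_8a",
    reference=(
        "https://www.ons.gov.uk/census/census2021dictionary/variablesbytopic/"
        "sexualorientationandgenderidentityvariablescensus2021/genderidentity/"
        "classifications#"
    ),
    version="25 September 2023",
)

print(ons.problems())  # []
```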