Development of an algorithm for ethnicity recording in cohorts from the UK Clinical Practice Research Datalink primary care and linked Hospital Episode Statistics databases

Objective To evaluate various prioritisation strategies within an algorithm designed to ascertain the most likely ethnicity and create a standardised methodology to benefit future research. Design Retrospective cohort study. Setting The Clinical Practice Research Datalink (CPRD) primary care and linke...

Bibliographic Details
Main Authors: Rachael Williams, Eleanor L Axson, Suhail I Shiekh
Format: Article
Language: English
Published: BMJ Publishing Group 2025-07-01
Series: BMJ Open
Online Access: https://bmjopen.bmj.com/content/15/7/e100533.full
author Rachael Williams
Eleanor L Axson
Suhail I Shiekh
collection DOAJ
description Objective To evaluate various prioritisation strategies within an algorithm designed to ascertain the most likely ethnicity and create a standardised methodology to benefit future research.
Design Retrospective cohort study.
Setting The Clinical Practice Research Datalink (CPRD) primary care and linked Hospital Episode Statistics (HES) data sets.
Participants The population of 54 029 174 patients included all acceptable patients registered at English practices in CPRD GOLD or CPRD Aurum from the May 2023 and May 2022 builds, respectively.
Primary outcome measure Ethnicity data within CPRD and HES data sets were identified by employing established code lists and subsequently categorised into broader ethnic groups. Changes were made to a previously used algorithm to assess their effect on ethnic categorisations. Modifications included prioritising primary over secondary care data, recent over frequent records and ‘non-Other’ ethnicity categories. Different data sources were examined: CPRD with all HES data sets, CPRD with HES Admitted Patient Care (APC) only, CPRD only and HES APC only. Ethnic distributions from these variations were compared using counts and percentages, evaluating inter-rater reliability using Cohen’s kappa. Sensitivity analyses included repetition using only currently registered patients and after removing cases with unknown ethnicity. Ethnic distributions were compared with the English Census 2021.
Results There was almost perfect agreement in ethnicity distributions whether prioritising primary over secondary care data (kappa=1.0000, SE=0.0001), whether prioritising most frequently or most recently recorded data (kappa=0.9824, SE=0.0001) and whether prioritising ‘non-Other’ categories (kappa=0.9705, SE=0.0001). There was moderate agreement in ethnicity distributions when sourcing data from single data sources (CPRD only (kappa=0.5554, SE=0.0001) or HES APC only (kappa=0.5526, SE=0.0001)) compared with combined data sources (CPRD and HES data sets).
Conclusions All variations of the algorithm produced similar population-level ethnicity distributions. Versions using data from multiple sources had higher inter-rater reliability than those using a subset of sources; however, there was little difference in categorisations produced by varying the hierarchical decision-making of the ethnicity algorithm. The CPRD population was representative of the English population in terms of ethnicity. While researchers should remain vigilant of the limitations of using these data, the CPRD Ethnicity Records provide a standardised and pragmatic approach to ascertaining ethnicity for future research.
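The abstract compares algorithm variants (most frequent versus most recent record, with or without preferring ‘non-Other’ categories) and measures their agreement with Cohen’s kappa. The following is a minimal illustrative sketch of that idea only, not the authors' implementation: the function names, the tie-breaking rule, the category labels and the example patients are all assumptions made for illustration, and the primary-versus-secondary-care prioritisation is not modelled.

```python
from collections import Counter
from datetime import date


def most_likely_ethnicity(records, prefer_recent=False, prefer_non_other=False):
    """Pick one category from a patient's (category, record_date) tuples.

    Hypothetical rules, assumed for illustration:
    - no records -> "Unknown"
    - prefer_non_other: ignore "Other" when any other category is recorded
    - prefer_recent: take the latest record; otherwise take the most
      frequent category, breaking ties by the latest record date
    """
    if not records:
        return "Unknown"
    candidates = records
    if prefer_non_other:
        non_other = [r for r in records if r[0] != "Other"]
        if non_other:
            candidates = non_other
    if prefer_recent:
        return max(candidates, key=lambda r: r[1])[0]
    counts = Counter(cat for cat, _ in candidates)
    latest = {}
    for cat, when in candidates:
        latest[cat] = max(latest.get(cat, when), when)
    return max(counts, key=lambda c: (counts[c], latest[c]))


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both variants assign a single identical category
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up example patients (not study data).
patients = {
    "p1": [("White", date(2010, 1, 1)), ("White", date(2012, 1, 1)),
           ("Other", date(2020, 1, 1))],
    "p2": [("Asian", date(2015, 1, 1)), ("Black", date(2019, 1, 1))],
    "p3": [],
}

by_frequency = [most_likely_ethnicity(r) for r in patients.values()]
by_recency = [most_likely_ethnicity(r, prefer_recent=True) for r in patients.values()]
kappa = cohens_kappa(by_frequency, by_recency)
```

In this toy data the two variants disagree only for the first patient, so kappa falls well below the near-perfect values reported in the abstract; with three patients the figure carries no statistical meaning and serves only to show the calculation.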
format Article
id doaj-art-3a43c8c7ccc64674bba3c2846f18e5ab
institution DOAJ
issn 2044-6055
language English
publishDate 2025-07-01
publisher BMJ Publishing Group
record_format Article
series BMJ Open
spelling doaj-art-3a43c8c7ccc64674bba3c2846f18e5ab; 2025-08-20T03:12:56Z; eng; BMJ Publishing Group; BMJ Open; 2044-6055; 2025-07-01; 15; 7; 10.1136/bmjopen-2025-100533; Development of an algorithm for ethnicity recording in cohorts from the UK Clinical Practice Research Datalink primary care and linked Hospital Episode Statistics databases; Rachael Williams (Medicines and Healthcare Products Regulatory Agency, London, England, UK); Eleanor L Axson (Medicines and Healthcare Products Regulatory Agency, London, England, UK); Suhail I Shiekh (Leicester Real World Evidence Unit, Diabetes Research Centre, University of Leicester, Leicester, England, UK); https://bmjopen.bmj.com/content/15/7/e100533.full
title Development of an algorithm for ethnicity recording in cohorts from the UK Clinical Practice Research Datalink primary care and linked Hospital Episode Statistics databases
url https://bmjopen.bmj.com/content/15/7/e100533.full