Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies

Bibliographic Details
Main Authors: Hiroki Matsui, Kiyohide Fushimi, Hideo Yasunaga
Format: Article
Language: English
Published: BMC 2025-04-01
Series: BMC Medical Research Methodology
Online Access: https://doi.org/10.1186/s12874-025-02549-7
collection DOAJ
description Abstract
Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders.
Method: Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4).
Results: Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results.
Conclusion: Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies.
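The covariate construction described for Model 3 (summing the distributed representation weights of the codes recorded on the day of hospitalisation) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the code labels, the three-dimensional vectors, the `admission_day_features` helper, and the choice to skip codes without a vector are hypothetical and are not taken from the article, which derives its vectors with word2vec and does not state these details in the abstract.

```python
from typing import Dict, List, Sequence

# Hypothetical, hand-written embeddings for illustration only.
# In the study these vectors come from word2vec trained on DPC medical codes;
# the code labels and the dimension (3) below are assumptions.
EMBEDDINGS: Dict[str, List[float]] = {
    "DX_I50": [0.5, -0.25, 1.0],
    "RX_FUROSEMIDE": [0.25, 0.5, -0.5],
    "PROC_ECHO": [-0.75, 0.25, 0.5],
}


def admission_day_features(
    codes: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the embedding vectors of all codes recorded on the admission day.

    Codes without an embedding are skipped (an assumption of this sketch).
    The result is one fixed-length covariate vector per patient.
    """
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            continue  # unknown code: ignored in this sketch
        for i, value in enumerate(vec):
            total[i] += value
    return total


# One patient's admission-day codes, including one code with no embedding.
patient_codes = ["DX_I50", "RX_FUROSEMIDE", "UNKNOWN_CODE"]
features = admission_day_features(patient_codes, EMBEDDINGS, dim=3)
print(features)
```

The resulting fixed-length vector could then be entered as covariates in a risk-adjustment model, alone (as in Model 3) or alongside previously reported confounders (as in Model 4).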
id doaj-art-b8d4aa19a5284eacb34d01b181d9cf44
institution DOAJ
issn 1471-2288
affiliation Hiroki Matsui: Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo
Kiyohide Fushimi: Department of Health Policy and Informatics, Institute of Science Tokyo Graduate School of Medical and Dental Sciences
Hideo Yasunaga: Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo
citation BMC Medical Research Methodology, volume 25, issue 1, pages 1-10 (2025-04-01)
topic Distributed representation
Word2vec
Administrative claims data
High-dimensional propensity score
Unmeasured confounder