Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies

Abstract Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress hig...

Bibliographic Details
Main Authors:	Hiroki Matsui, Kiyohide Fushimi, Hideo Yasunaga
Format:	Article
Language:	English
Published:	BMC 2025-04-01
Series:	BMC Medical Research Methodology
Subjects:	Distributed representation Word2vec Administrative claims data High-dimensional propensity score Unmeasured confounder
Online Access:	https://doi.org/10.1186/s12874-025-02549-7

_version_	1849737468770779136
author	Hiroki Matsui Kiyohide Fushimi Hideo Yasunaga
author_facet	Hiroki Matsui Kiyohide Fushimi Hideo Yasunaga
author_sort	Hiroki Matsui
collection	DOAJ
description	Abstract Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders. Method: Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4). Results: Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results. Conclusion: Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies.
format	Article
id	doaj-art-b8d4aa19a5284eacb34d01b181d9cf44
institution	DOAJ
issn	1471-2288
language	English
publishDate	2025-04-01
publisher	BMC
record_format	Article
series	BMC Medical Research Methodology
spelling	doaj-art-b8d4aa19a5284eacb34d01b181d9cf44 2025-08-20T03:06:54Z eng BMC BMC Medical Research Methodology 1471-2288 2025-04-01 251110 10.1186/s12874-025-02549-7 Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies Hiroki Matsui 0 Kiyohide Fushimi 1 Hideo Yasunaga 2 Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo Department of Health Policy and Informatics, Institute of Science Tokyo Graduate School of Medical and Dental Sciences Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo Abstract Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders. Method: Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4). Results: Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results. Conclusion: Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies. https://doi.org/10.1186/s12874-025-02549-7 Distributed representation Word2vec Administrative claims data High-dimensional propensity score Unmeasured confounder
spellingShingle	Hiroki Matsui Kiyohide Fushimi Hideo Yasunaga Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies BMC Medical Research Methodology Distributed representation Word2vec Administrative claims data High-dimensional propensity score Unmeasured confounder
title	Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
title_full	Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
title_fullStr	Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
title_full_unstemmed	Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
title_short	Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
title_sort	development and validation of a distributed representation model of japanese high dimensional administrative claims data for clinical epidemiology studies
topic	Distributed representation Word2vec Administrative claims data High-dimensional propensity score Unmeasured confounder
url	https://doi.org/10.1186/s12874-025-02549-7
work_keys_str_mv	AT hirokimatsui developmentandvalidationofadistributedrepresentationmodelofjapanesehighdimensionaladministrativeclaimsdataforclinicalepidemiologystudies AT kiyohidefushimi developmentandvalidationofadistributedrepresentationmodelofjapanesehighdimensionaladministrativeclaimsdataforclinicalepidemiologystudies AT hideoyasunaga developmentandvalidationofadistributedrepresentationmodelofjapanesehighdimensionaladministrativeclaimsdataforclinicalepidemiologystudies
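
The description field above names the novel adjustment (Model 3) as the sum of distributed representation weights of the claims codes recorded on the day of hospitalisation. The sketch below illustrates only that summation step. It is a minimal illustration under stated assumptions, not the authors' implementation: the code names, the three-dimensional vectors, the helper `sum_code_vectors`, and the choice to skip codes without a vector are all hypothetical, and in the study the vectors would come from word2vec trained on all medical codes.

```python
from typing import Dict, List, Sequence


def sum_code_vectors(
    codes: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Sum the embedding vectors of one patient's admission-day codes."""
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            # Assumption: codes with no learned vector are skipped.
            # The record does not say how the study handled them.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Hypothetical toy embeddings; real ones would be learned by word2vec.
toy_embeddings = {
    "DX_I50": [0.5, -1.0, 0.25],
    "RX_furosemide": [0.25, 0.5, -0.5],
    "PROC_echo": [-0.25, 0.0, 1.0],
}

# Hypothetical codes recorded for one patient on the day of hospitalisation.
admission_day_codes = ["DX_I50", "RX_furosemide", "PROC_echo", "UNKNOWN"]

patient_vector = sum_code_vectors(admission_day_codes, toy_embeddings, dim=3)
print(patient_vector)
```

Each element of `patient_vector` would then enter the risk-adjustment model as a covariate, so every patient is described by a fixed number of values regardless of how many codes were recorded.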


Similar Items