Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies

Bibliographic Details
Main Authors:	Hiroki Matsui, Kiyohide Fushimi, Hideo Yasunaga
Format:	Article
Language:	English
Published:	BMC 2025-04-01
Series:	BMC Medical Research Methodology
Subjects:	Distributed representation; Word2vec; Administrative claims data; High-dimensional propensity score; Unmeasured confounder
Online Access:	https://doi.org/10.1186/s12874-025-02549-7

Description
Summary:	Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders. Method: Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4). Results: Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results. Conclusion: Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies.
ISSN:	1471-2288
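
The novel adjustment described in the summary (model 3) uses the sum of the distributed representation weights of the claims codes recorded on the day of hospitalisation. A minimal sketch of that summation step is given here, under stated assumptions: the code labels, the three-dimensional vectors, and the function name `sum_embeddings` are hypothetical toy values for illustration and are not taken from the study, and skipping codes that have no vector is likewise an assumption. In practice the vectors would come from a trained word2vec model rather than a hand-written table.

```python
from typing import Dict, List, Sequence

# Hypothetical embeddings for made-up code labels (toy values, not study data).
# A real table would hold one vector per medical code from a word2vec model.
EMBEDDINGS: Dict[str, List[float]] = {
    "DX_HF": [0.5, -1.0, 0.25],
    "RX_DIURETIC": [1.0, 0.5, -0.25],
    "PROC_ECHO": [-0.5, 0.0, 1.0],
}


def sum_embeddings(
    codes: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the vectors of all codes into one fixed-length covariate vector."""
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            # Assumption: codes without a learned vector are ignored.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Codes recorded for one hypothetical admission on the day of hospitalisation.
day0_codes = ["DX_HF", "RX_DIURETIC", "PROC_ECHO", "UNKNOWN_CODE"]
patient_vector = sum_embeddings(day0_codes, EMBEDDINGS, dim=3)
print(patient_vector)
```

The resulting vector has the same length regardless of how many codes an admission carries, which is what allows it to be entered as a set of covariates in a risk-adjustment model.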

Similar Items