Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies
Abstract Background Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress hig...
| Main Authors: | Hiroki Matsui, Kiyohide Fushimi, Hideo Yasunaga |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | BMC, 2025-04-01 |
| Series: | BMC Medical Research Methodology |
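| Subjects: | Distributed representation; Word2vec; Administrative claims data; High-dimensional propensity score; Unmeasured confounder |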
| Online Access: | https://doi.org/10.1186/s12874-025-02549-7 |
| _version_ | 1849737468770779136 |
|---|---|
| author | Hiroki Matsui; Kiyohide Fushimi; Hideo Yasunaga |
| author_facet | Hiroki Matsui; Kiyohide Fushimi; Hideo Yasunaga |
| author_sort | Hiroki Matsui |
| collection | DOAJ |
| description | Abstract. Background: Unmeasured confounders pose challenges when observational data are analysed in comparative effectiveness studies. Integrating high-dimensional administrative claims data may help adjust for unmeasured confounders. We determined whether distributed representations can compress high-dimensional administrative claims data to adjust for unmeasured confounders. Method: Using the Japanese Diagnosis Procedure Combination (DPC) database from 1291 hospitals (between April 2018 and March 2020), we applied the word2vec algorithm to create distributed representations for all medical codes. We focused on patients with heart failure (HF) and simulated four risk-adjustment models: 1, no adjustment; 2, adjusting for previously reported confounders; 3, adjusting for the sum of distributed representation weights of administrative claims data on the day of hospitalisation (novel method); and 4, a combination of models 2 and 3. We re-evaluated a previous study on the effect of early rehabilitation in patients with HF and compared these risk-adjustment methods (models 1–4). Results: Distributed representations were generated from the data of 15 998 963 in-patients, and 319 581 HF patients were identified. In the simulation study, Model 3 reduced the impact of unmeasured confounders and achieved better covariate balances than Model 1. Model 4 showed no increase in bias compared with the true model (Model 2) and was used as a reference model in the real-world application. When applied to a previous study, models 3 and 4 showed similar results. Conclusion: Distributed representation can compress detailed administrative claims data and adjust for unmeasured confounders in comparative effectiveness studies. |
| format | Article |
| id | doaj-art-b8d4aa19a5284eacb34d01b181d9cf44 |
| institution | DOAJ |
| issn | 1471-2288 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | BMC |
| record_format | Article |
| series | BMC Medical Research Methodology |
| spelling | doaj-art-b8d4aa19a5284eacb34d01b181d9cf44; 2025-08-20T03:06:54Z; eng; BMC; BMC Medical Research Methodology; 1471-2288; 2025-04-01; 25; 1; 1; 10; 10.1186/s12874-025-02549-7; Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies; Hiroki Matsui (0), Kiyohide Fushimi (1), Hideo Yasunaga (2); (0) Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo; (1) Department of Health Policy and Informatics, Institute of Science Tokyo Graduate School of Medical and Dental Sciences; (2) Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo; https://doi.org/10.1186/s12874-025-02549-7; Distributed representation; Word2vec; Administrative claims data; High-dimensional propensity score; Unmeasured confounder |
| title | Development and validation of a distributed representation model of Japanese high-dimensional administrative claims data for clinical epidemiology studies |
| topic | Distributed representation; Word2vec; Administrative claims data; High-dimensional propensity score; Unmeasured confounder |
| url | https://doi.org/10.1186/s12874-025-02549-7 |
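
The abstract describes Model 3 as adjusting for the sum of distributed representation weights of the claims codes recorded on the day of hospitalisation. The summation step might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the code names, the 3-dimensional vectors, the `patient_vector` helper, and the choice to skip codes without an embedding are all hypothetical, and the toy dictionary stands in for embeddings that the study obtained with word2vec.

```python
from typing import Dict, List, Sequence

# Toy embeddings: hypothetical stand-ins for word2vec output.
# The code names are invented for illustration and are not real DPC codes.
EMBEDDINGS: Dict[str, List[float]] = {
    "CODE_A": [0.5, -1.0, 0.25],
    "CODE_B": [1.5, 0.5, -0.25],
    "CODE_C": [-1.0, 0.0, 2.0],
}


def patient_vector(
    codes: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the embedding vectors of all codes recorded for one patient."""
    total = [0.0] * dim
    for code in codes:
        vec = embeddings.get(code)
        if vec is None:
            # Assumption: codes without an embedding are skipped.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Hypothetical claims codes recorded on the day of hospitalisation.
admission_day_codes = {
    "patient_1": ["CODE_A", "CODE_B"],
    "patient_2": ["CODE_C", "CODE_C", "UNKNOWN"],
}

# One fixed-length covariate vector per patient, regardless of how many
# codes were recorded; repeated codes contribute once per occurrence.
features = {
    pid: patient_vector(codes, EMBEDDINGS, dim=3)
    for pid, codes in admission_day_codes.items()
}

for pid, vec in features.items():
    print(pid, vec)
```

Each patient ends up with a vector of the same length as the embedding dimension, which could then be entered as covariates in a propensity score or outcome model alongside, or instead of, the conventionally chosen confounders.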