The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets

Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models’ outputs. As a standard, categorical data, such as patients’ gender, socioeconomic statu...

Bibliographic Details
Main Authors:	Theresa Willem, Alessandro Wollek, Theodor Cheslerean-Boghiu, Martha Kenney, Alena Buyx
Format:	Article
Language:	English
Published:	JMIR Publications 2025-01-01
Series:	JMIR Medical Informatics
Online Access:	https://medinform.jmir.org/2025/1/e59452

_version_	1832583343858778112
author	Theresa Willem Alessandro Wollek Theodor Cheslerean-Boghiu Martha Kenney Alena Buyx
author_sort	Theresa Willem
collection	DOAJ
description	Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models’ outputs. As a standard, categorical data, such as patients’ gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population. Objective: This study aimed to explore categorical data’s effects on machine learning model outputs, root these effects in the data collection and dataset publication processes, and propose a mixed methods approach to examining datasets’ data categories before using them for machine learning training. Methods: Against the theoretical background of the social construction of categories, we suggest a mixed methods approach to assess categorical data’s utility for machine learning model training. As an example, we applied our approach to a Brazilian dermatological dataset (Dermatological and Surgical Assistance Program at the Federal University of Espírito Santo [PAD-UFES] 20). We first present an exploratory, quantitative study that assesses the effects of including or excluding each of the unique categorical data features of the PAD-UFES 20 dataset when training a transformer-based model using a data fusion algorithm. We then pair our quantitative analysis with a qualitative examination of the data categories based on interviews with the dataset authors. Results: Our quantitative study suggests scattered effects of including categorical data for machine learning model training across predictive classes. Our qualitative analysis gives insights into how the categorical data were collected and why they were published, explaining some of the quantitative effects that we observed. Our findings highlight the social constructedness of categorical data in publicly available datasets, meaning that the data in a category heavily depend on both how these categories are defined by the dataset creators and the sociomedical context in which the data are collected. This reveals relevant limitations of using publicly available datasets in contexts different from those in which their data were collected. Conclusions: We caution against using data features of publicly available datasets without reflection on the social construction and context dependency of their categorical data features, particularly in data-sparse areas. We conclude that social scientific, context-dependent analysis of available data features using both quantitative and qualitative methods is helpful in judging the utility of categorical data for the population for which a model is intended.
format	Article
id	doaj-art-fb8f93c8c7c64f30915104ed2ab71c14
institution	Kabale University
issn	2291-9694
language	English
publishDate	2025-01-01
publisher	JMIR Publications
record_format	Article
series	JMIR Medical Informatics
spelling	doaj-art-fb8f93c8c7c64f30915104ed2ab71c14; 2025-01-28T18:00:35Z; eng; JMIR Publications; JMIR Medical Informatics; 2291-9694; 2025-01-01; 13; e59452; 10.2196/59452; The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets; Theresa Willem (https://orcid.org/0000-0001-7643-8816); Alessandro Wollek (https://orcid.org/0000-0001-9535-0502); Theodor Cheslerean-Boghiu (https://orcid.org/0000-0002-0381-9105); Martha Kenney (https://orcid.org/0000-0002-8900-8791); Alena Buyx (https://orcid.org/0000-0002-5726-7633); https://medinform.jmir.org/2025/1/e59452
title	The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets
title_sort	social construction of categorical data mixed methods approach to assessing data features in publicly available datasets
url	https://medinform.jmir.org/2025/1/e59452

Similar Items