Non-negligible Occurrence of Errors in Gender Description in Public Data Sets

Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public dataset...

Bibliographic Details
Main Authors:	Jong Hwan Kim, Jong-Luyl Park, Seon-Young Kim
Format:	Article
Language:	English
Published:	BioMed Central 2016-03-01
Series:	Genomics & Informatics
Subjects:	blood; DNA methylation; gender identity; gene expression; microarray analysis
Online Access:	http://genominfo.org/upload/pdf/gni-14-34.pdf

_version_	1832572302348255232
author	Jong Hwan Kim, Jong-Luyl Park, Seon-Young Kim
collection	DOAJ
description	Due to advances in omics technologies, numerous genome-wide studies on human samples have been published, and most of the omics data with the associated clinical information are available in public repositories, such as Gene Expression Omnibus and ArrayExpress. While analyzing several public datasets, we observed that errors in gender information occur quite often in public datasets. When we analyzed the gender description and the methylation patterns of gender-specific probes (glucose-6-phosphate dehydrogenase [G6PD], ephrin-B1 [EFNB1], and testis specific protein, Y-linked 2 [TSPY2]) in 5,611 samples produced using Infinium 450K HumanMethylation arrays, we found that 19 samples from 7 datasets were erroneously described. We also analyzed 1,819 samples produced using the Affymetrix U133Plus2 array using several gender-specific genes (X (inactive)-specific transcript [XIST], eukaryotic translation initiation factor 1A, Y-linked [EIF1AY], and DEAD [Asp-Glu-Ala-Asp] box polypeptide 3, Y-linked [DDX3Y]) and found that 40 samples from 3 datasets were erroneously described. We suggest that the users of public datasets should not expect that the data are error-free and, whenever possible, that they should check the consistency of the data.
format	Article
id	doaj-art-cabc725b0e4d42e7b07946fd20805918
institution	Kabale University
issn	1598-866X 2234-0742
language	English
publishDate	2016-03-01
publisher	BioMed Central
record_format	Article
series	Genomics & Informatics
spelling	doaj-art-cabc725b0e4d42e7b07946fd20805918 | 2025-02-02T10:53:20Z | eng | BioMed Central | Genomics & Informatics | 1598-866X | 2234-0742 | 2016-03-01 | 14 | 1 | 34 | 40 | 10.5808/GI.2016.14.1.34 | 188 | Non-negligible Occurrence of Errors in Gender Description in Public Data Sets | Jong Hwan Kim: Genome Structure Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea | Jong-Luyl Park: Epigenome Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea | Seon-Young Kim: Genome Structure Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea | http://genominfo.org/upload/pdf/gni-14-34.pdf | blood; DNA methylation; gender identity; gene expression; microarray analysis
title	Non-negligible Occurrence of Errors in Gender Description in Public Data Sets
topic	blood; DNA methylation; gender identity; gene expression; microarray analysis
url	http://genominfo.org/upload/pdf/gni-14-34.pdf
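
The consistency check recommended in the description can be illustrated with a minimal Python sketch. It is not the authors' procedure: the gene symbols (XIST, EIF1AY, DDX3Y) come from the abstract, but the decision rule (XIST signal compared with the mean Y-linked signal), the record layout, the function names, and the sample values are all assumptions made for illustration.

```python
from statistics import mean

# Gene symbols named in the abstract; the decision rule below is an assumption.
X_GENE = "XIST"
Y_GENES = ("EIF1AY", "DDX3Y")


def infer_sex(expr):
    """Guess sex from log2 expression: XIST above the mean Y-linked signal -> female."""
    y_signal = mean(expr[g] for g in Y_GENES)
    return "female" if expr[X_GENE] > y_signal else "male"


def find_mismatches(samples):
    """Return IDs of samples whose recorded sex disagrees with the inferred one."""
    return [
        s["id"]
        for s in samples
        if infer_sex(s["expr"]) != s["recorded_sex"].strip().lower()
    ]


# Hypothetical values for illustration only; not data from the study.
samples = [
    {"id": "S1", "recorded_sex": "female",
     "expr": {"XIST": 10.2, "EIF1AY": 3.1, "DDX3Y": 3.4}},
    {"id": "S2", "recorded_sex": "male",
     "expr": {"XIST": 3.0, "EIF1AY": 9.5, "DDX3Y": 9.9}},
    {"id": "S3", "recorded_sex": "Male",
     "expr": {"XIST": 10.8, "EIF1AY": 2.9, "DDX3Y": 3.2}},
]

mismatched = find_mismatches(samples)
print(mismatched)
```

A flagged sample is only a candidate for review. With real array data, a fixed comparison like this one would need to be checked against the signal distribution of each dataset before any annotation is treated as wrong.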

Similar Items