A Survey of Data Partitioning and Sampling Methods to Support Big Data Analysis

Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed, including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed, with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
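
As a quick illustration of the classical schemes the abstract refers to, the following minimal Python sketch shows range, hash, and random horizontal partitioning of a toy record set, plus single-pass reservoir sampling. It is not code from the surveyed paper; the function names, range boundaries, and toy data are illustrative assumptions only.

import random
from typing import Hashable, Iterable, Iterator, List, Sequence

def range_partition(records: Sequence[int], boundaries: Sequence[int]) -> List[List[int]]:
    # Place each record in the first key range whose upper boundary exceeds its key.
    parts: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        idx = next((i for i, b in enumerate(boundaries) if r < b), len(boundaries))
        parts[idx].append(r)
    return parts

def hash_partition(records: Iterable[Hashable], n_parts: int) -> List[List[Hashable]]:
    # Place each record in partition hash(key) mod n_parts.
    parts: List[List[Hashable]] = [[] for _ in range(n_parts)]
    for r in records:
        parts[hash(r) % n_parts].append(r)
    return parts

def random_partition(records: Sequence, n_parts: int, seed: int = 0) -> List[List]:
    # Shuffle, then deal records round-robin so every partition looks like
    # a random sample of the whole data set.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    parts: List[List] = [[] for _ in range(n_parts)]
    for i, r in enumerate(shuffled):
        parts[i % n_parts].append(r)
    return parts

def reservoir_sample(stream: Iterator, k: int, seed: int = 0) -> List:
    # Classical reservoir sampling: a uniform size-k sample from a stream
    # of unknown length, in a single pass.
    rng = random.Random(seed)
    reservoir: List = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

if __name__ == "__main__":
    data = list(range(20))
    print(range_partition(data, boundaries=[7, 14]))  # three key ranges
    print(hash_partition(data, n_parts=3))
    print(random_partition(data, n_parts=3))
    print(reservoir_sample(iter(data), k=5))

Shuffling records before dealing them round-robin is what makes every partition behave like a random sample of the whole data set, which is the property block-level sampling on RSP-style partitions relies on.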

Bibliographic Details
Main Authors: Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z. Emara, Kuanishbay Sadatdiynov
Author Affiliation: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China; and Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
Format: Article
Language: English
Published: Tsinghua University Press, 2020-06-01
Series: Big Data Mining and Analytics, Volume 3, Issue 2, pp. 85-101
ISSN: 2096-0654
DOI: 10.26599/BDMA.2019.9020015
Collection: DOAJ
Institution: Kabale University
Subjects: big data analysis; data partitioning; data sampling; distributed and parallel computing; approximate computing
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2019.9020015