Dynamic Deduplication Decision in a Hadoop Distributed File System


Bibliographic Details
Main Authors: Ruay-Shiung Chang, Chih-Shan Liao, Kuo-Zheng Fan, Chia-Ming Wu
Format: Article
Language:English
Published: Wiley 2014-04-01
Series:International Journal of Distributed Sensor Networks
Online Access:https://doi.org/10.1155/2014/630380
Summary:In the era of big data, data are generated and updated tremendously fast by users through all kinds of devices, anytime and anywhere. Coping with such multiform data in real time is a serious challenge. The Hadoop distributed file system (HDFS) is designed to handle data for building a distributed data center. HDFS uses data duplicates to increase data reliability. However, these duplicates require a great deal of extra storage space and infrastructure investment. Deduplication techniques can improve the utilization of storage space effectively. In this paper, we propose a dynamic deduplication decision to improve the storage utilization of a data center that uses HDFS as its file system. The proposed system formulates a suitable deduplication strategy to make full use of the available space on limited storage devices, deleting unneeded duplicates to free storage space. Experimental results show that our method can efficiently improve the storage utilization of a data center based on HDFS.
ISSN:1550-1477
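The core idea the summary describes, keeping a single copy of redundant data and deleting the rest, can be illustrated with a minimal content-hash block-deduplication sketch. The block size, function names, and data structures below are illustrative assumptions for exposition only, not the paper's actual HDFS mechanism:

```python
# Minimal sketch of hash-based block deduplication (illustrative only;
# the paper's dynamic decision strategy inside HDFS is more involved).
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def deduplicate(data: bytes):
    """Split data into fixed-size blocks and store each unique block once.

    Returns (store, recipe): store maps a block's SHA-256 digest to its
    bytes; recipe is the ordered digest list needed to rebuild the data.
    """
    store = {}
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the unique blocks and the recipe."""
    return b"".join(store[d] for d in recipe)
```

Duplicate blocks collapse to a single stored copy, so storage grows with the number of *unique* blocks rather than with total data size, which is the space saving the paper exploits.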