
LoEM: Improving Load Balancing for MapReduce-based Entity Matching

Khadidja Midoun, Malik Loudini, Walid-Khaled Hidouci, Abdelmounaam Rezgui

Abstract


MapReduce is a well-known parallel programming system that facilitates the computation needed for complex big data applications. However, data skew and load balancing problems limit its performance, especially during the reduce phase. In this phase, the partitioning function assigns keys to reducers based on a hash function that usually generates skewed partitions. This is the main cause of the unbalanced workload across reducers. This article addresses the data skew and load balancing problems in MapReduce in the context of entity matching. The authors propose a new approach, LoEM, that assigns keys based on the reducers’ processing capability. The experimental results demonstrate that LoEM improves on the load balancing of Hadoop by up to 20% and 82% in homogeneous and heterogeneous environments, respectively, and on that of the BlockSplit approach by 8% and 76% in homogeneous and heterogeneous environments, respectively.
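The contrast the abstract draws between hash-based partitioning and capability-aware key assignment can be illustrated with a small sketch. This is not the LoEM algorithm, whose details are not given here; it is a generic greedy heuristic (largest workload first, placed on the reducer that would finish it soonest). The function names, the `capacities` values (relative reducer speeds), and the use of CRC32 as the hash are all assumptions made for illustration. The workload of a block is taken to be its number of pairwise comparisons, the usual cost measure in blocking-based entity matching.

```python
import zlib


def pair_comparisons(block_size):
    """Comparisons needed to match all entity pairs within one block."""
    return block_size * (block_size - 1) // 2


def hash_partition(workloads, num_reducers):
    """Hash-style partitioning: reducer index = hash(key) mod number of reducers.

    Ignores both the size of each key's workload and reducer speed,
    so a few heavy keys can overload a single reducer.
    """
    loads = [0] * num_reducers
    for key, work in workloads.items():
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        loads[reducer] += work
    return loads


def capacity_aware_partition(workloads, capacities):
    """Greedy sketch (hypothetical, not LoEM itself).

    Keys are visited from heaviest to lightest; each one goes to the
    reducer whose estimated finish time (load / capacity) would be
    lowest after receiving it.
    """
    loads = [0] * len(capacities)
    assignment = {}
    ordered = sorted(workloads.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, work in ordered:
        best = min(
            range(len(capacities)),
            key=lambda r: (loads[r] + work) / capacities[r],
        )
        assignment[key] = best
        loads[best] += work
    return assignment, loads


def makespan(loads, capacities):
    """Estimated finish time of the slowest reducer."""
    return max(load / cap for load, cap in zip(loads, capacities))


# Illustrative data only: block sizes per blocking key, two reducers,
# the first assumed to be twice as fast as the second.
block_sizes = {"smith": 40, "jones": 12, "lee": 9, "khan": 3}
workloads = {key: pair_comparisons(n) for key, n in block_sizes.items()}
capacities = [2.0, 1.0]

hash_loads = hash_partition(workloads, len(capacities))
assignment, greedy_loads = capacity_aware_partition(workloads, capacities)
print("hash-based finish time:     ", makespan(hash_loads, capacities))
print("capacity-aware finish time: ", makespan(greedy_loads, capacities))
```

In a heterogeneous cluster the capacity term matters most: a faster reducer can absorb a larger share of the comparisons without becoming the straggler, which is consistent with the larger gains the abstract reports for heterogeneous environments.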

Keywords


Entity resolution, MapReduce, Hadoop, Load balancing, Data skew, Task scheduling, Data integration.

Computing Classification System: [Information systems]: MapReduce-based systems


