LoEM: Improving Load Balancing for MapReduce-based Entity Matching

Khadidja Midoun, Malik Loudini, Walid-Khaled Hidouci, Abdelmounaam Rezgui


MapReduce is a widely used parallel programming framework that facilitates the computation needed for complex big data applications. However, data skew and load imbalance limit its performance, especially during the reduce phase. In this phase, the partitioning function assigns keys to reducers using a hash function, which often generates skewed partitions. This is the main cause of unbalanced workloads across reducers. This article addresses the data skew and load balancing problems in MapReduce in the context of entity matching. The authors propose a new approach, LoEM, which assigns keys based on the reducers’ processing capability. The experimental results demonstrate that LoEM improves the load balancing of Hadoop by up to 20% and 82% in homogeneous and heterogeneous environments, respectively, and by 8% and 76% compared to the BlockSplit approach in homogeneous and heterogeneous environments, respectively.
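To make the contrast between hash-based and capability-based key assignment concrete, the following is a minimal illustrative sketch in Python. It is not the LoEM algorithm itself, whose details are not given here. It assumes that each key is a blocking key, that the cost of a block of n entities is its n(n-1)/2 pairwise comparisons, and that each reducer has a known relative speed. The function names, the greedy largest-first heuristic, and the use of CRC32 as a stand-in hash are all assumptions made for illustration.

```python
import zlib


def pair_count(n):
    """Number of pairwise comparisons needed inside a block of n entities."""
    return n * (n - 1) // 2


def hash_partition(block_sizes, num_reducers):
    """Default-style assignment: reducer index = hash(key) mod number of reducers.

    CRC32 is used here only as a deterministic stand-in hash (an assumption).
    Returns the comparison load that ends up on each reducer.
    """
    loads = [0] * num_reducers
    for key, size in block_sizes.items():
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        loads[reducer] += pair_count(size)
    return loads


def capacity_aware_partition(block_sizes, capacities):
    """Hypothetical greedy assignment that accounts for reducer speed.

    Blocks are taken largest first; each block goes to the reducer whose
    estimated finish time (load / capacity) would be smallest afterwards.
    """
    loads = [0] * len(capacities)
    ordered = sorted(block_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for _key, size in ordered:
        cost = pair_count(size)
        reducer = min(
            range(len(capacities)),
            key=lambda i: (loads[i] + cost) / capacities[i],
        )
        loads[reducer] += cost
    return loads


def makespan(loads, capacities):
    """Estimated time of the slowest reducer: max over reducers of load / capacity."""
    return max(load / cap for load, cap in zip(loads, capacities))


if __name__ == "__main__":
    blocks = {"a": 5, "b": 4, "c": 3}   # blocking key -> number of entities
    speeds = [2.0, 1.0]                 # reducer 0 is assumed twice as fast
    print(hash_partition(blocks, len(speeds)))
    print(capacity_aware_partition(blocks, speeds))
```

In this sketch the hash-based assignment ignores both block size and reducer speed, so a single large block can dominate one reducer, whereas the greedy variant steers expensive blocks toward faster or less loaded reducers.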


Entity resolution, MapReduce, Hadoop, Load balancing, Data skew, Task scheduling, Data integration.
Computing Classification System: [Information systems]: MapReduce-based systems

