Big Data Analytics in MapReduce: Literature Review

Big Data comprises both structured and unstructured data collected from various sources. Collecting, managing, storing and analyzing such large datasets requires an efficient tool. Hadoop is an open-source framework for processing large datasets, and MapReduce in Hadoop is an effective programming model that reduces the computation time of large-scale data in a distributed architecture. Implementing machine learning and deep learning algorithms on MapReduce for huge datasets can reduce processing time. This paper studies various MapReduce-based models and algorithms for analyzing huge data. It also outlines how algorithms can be implemented in MapReduce to reduce computing time.


Introduction
Data is being generated in all major sectors, including healthcare, e-commerce, social media, banking and finance, at scales ranging from petabytes to exabytes. Processing such huge datasets with a sequential program increases processing time, whereas processing big data in a distributed architecture with the MapReduce programming model reduces processing time as the number of data nodes increases. Efficient processing can be discovered and automated by combining machine learning and deep learning with the MapReduce framework. Typically, the MapReduce technique in Hadoop processes large-scale datasets in parallel. Implementing a MapReduce-based machine/deep learning algorithm with an increased number of nodes would improve efficiency and reduce processing time. Many authors have proposed MapReduce-based models, systems and approaches to analyze big data effectively. In this paper, those approaches are analyzed and compared [1][2][3][4][5].

Hadoop
Hadoop is an open-source framework from Apache for storing and processing big data in a distributed environment. Hadoop's architecture has two main components:

Distributed File System:
The Hadoop Distributed File System (HDFS) is designed to store and process large datasets and runs on commodity hardware. HDFS is similar to other existing distributed file systems, but its most significant features are that it is highly fault-tolerant and can be deployed on low-cost hardware. In addition, HDFS follows a master/slave architecture in which metadata is stored in the NameNode, which acts as the master server, and application data is stored in DataNodes, which act as slave servers. For reliability, the file content is replicated across DataNodes.
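The division of labor between the NameNode and the DataNodes can be illustrated with a small toy model. The following is only a sketch under stated assumptions: the function names `store_file` and `read_file`, the round-robin placement rule and the in-memory dictionaries are hypothetical and are not the HDFS API or its actual block placement policy. It shows metadata being kept separately from block contents, and replication allowing a read to succeed when one node is unavailable.

```python
def store_file(data, block_size, datanodes, replication):
    """Split data into blocks and place each block on `replication` nodes.

    Returns (metadata, storage): metadata plays the role of the NameNode
    (which node holds which block), storage plays the role of the DataNodes.
    Assumes replication <= len(datanodes). Round-robin placement is an
    assumption made for illustration only.
    """
    storage = {node: {} for node in datanodes}
    metadata = []
    for block_id, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        replicas = [datanodes[(block_id + r) % len(datanodes)]
                    for r in range(replication)]
        for node in replicas:
            storage[node][block_id] = block
        metadata.append((block_id, replicas))
    return metadata, storage


def read_file(metadata, storage, failed=()):
    """Reassemble the file, skipping replicas on nodes listed in `failed`."""
    out = b""
    for block_id, replicas in metadata:
        node = next(n for n in replicas if n not in failed)
        out += storage[node][block_id]
    return out


metadata, storage = store_file(b"abcdefghij", 4, ["dn1", "dn2", "dn3"], 2)
# The file is still readable when one DataNode is down.
recovered = read_file(metadata, storage, failed={"dn2"})
```

With a replication factor of 2, every block survives the loss of any single node in this toy model, which is the reliability property described above.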

MapReduce:
MapReduce is a programming model for processing huge datasets in parallel in a distributed environment. The MapReduce algorithm was proposed by Google to speed up the processing of distributed big data on a cloud platform. The major phases of MapReduce are:
i) Split: This phase splits the input into a fixed number of pieces to be evaluated in the map phase.
ii) Map: In the map phase, the data in each data block is split and key-value pairs are generated for it.
iii) Shuffle: This phase takes the key-value pairs generated by the map function as input and groups together the pairs with similar keys.
iv) Reduce: This phase takes the output of the shuffle phase and aggregates it, reducing the data to a single output value.
Fig. 1 depicts the architecture of the MapReduce framework.
One of the reviewed works proposed automatic image annotation using a MapReduce-based SVM. MRSMO splits the large dataset into smaller subsets, and each subset is allocated to a map task. The map function in each task optimizes its subset in parallel. The output of the map tasks differs depending on linearity. For a linear SVM, the partial weight vectors from the map tasks are passed to the reduce task to obtain the global weight vector; for a nonlinear SVM, the alpha arrays are passed to the reduce task, which produces the global alpha array.
Anan Banharnsakun [4] recommended a MapReduce-based artificial bee colony (MR-ABC) for clustering. This combination aims to minimize the sum of squared Euclidean distances between the data records and the centroids. The map function retrieves the cluster centroids from the ABC, which are stored in HDFS. The centroid values are extracted from each bee to calculate the distance between the centroids and each data record and to obtain the minimum distance. The reduce function groups the values with the same key from the map function to determine the average distance, which is returned as the fitness value.
Daniel Valcarce, Javier Parapar and Alvaro Barreiro [6-8] proposed a MapReduce-based recommender system that implements the Posterior Probability Clustering algorithm, based on matrix factorization, followed by Relevance Models. To reduce complexity, the algorithm is implemented in the distributed MapReduce framework so that recommendations can be computed over huge datasets. Furthermore, two join strategies, replication and broadcast, are used to make the process efficient.
Weizhong Zhao, Huifang Ma and Qing He [9-11] recommended the MapReduce-based PKMeans clustering algorithm to analyze huge datasets effectively. The map function assigns each sample to its closest center, and the reduce function aggregates the samples assigned to each center, determines their total number and updates the center.
Shiva Asadianfam, Mahboubeh Shamsi and Abdolreza Rasouli Kenari [6] proposed the TVD-MRDL algorithm to automate the detection of driver violations using the MapReduce technique. The analysis includes both structured and unstructured data. The proposed system is able to analyze the traffic control center's data and the descriptions predefined by police. To process the images, a deep learning algorithm, the convolutional neural network (CNN), is used.
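The split, map, shuffle and reduce phases, together with the map and reduce roles described for PKMeans, can be sketched as one k-means iteration in plain Python. This is a minimal single-process sketch, not the implementation from the cited work: the function names (`split`, `kmeans_map`, `shuffle`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, the map calls run sequentially rather than on separate nodes, and no combiner step is modeled.

```python
from collections import defaultdict


def split(samples, n_pieces):
    """Split phase: divide the input into a fixed number of pieces."""
    return [samples[i::n_pieces] for i in range(n_pieces)]


def kmeans_map(piece, centers):
    """Map phase: emit (index of closest center, (sample, 1)) per sample."""
    pairs = []
    for sample in piece:
        dists = [sum((a - b) ** 2 for a, b in zip(sample, c)) for c in centers]
        pairs.append((dists.index(min(dists)), (sample, 1)))
    return pairs


def shuffle(mapped):
    """Shuffle phase: group the values that share the same key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def kmeans_reduce(values, dim):
    """Reduce phase: sum samples and counts, return (new center, count)."""
    total = [0.0] * dim
    count = 0
    for sample, n in values:
        total = [t + s for t, s in zip(total, sample)]
        count += n
    return tuple(t / count for t in total), count


def kmeans_iteration(samples, centers, n_pieces=2):
    """Run one k-means update as split -> map -> shuffle -> reduce."""
    pieces = split(samples, n_pieces)
    mapped = [kmeans_map(p, centers) for p in pieces]  # parallel in Hadoop
    groups = shuffle(mapped)
    new_centers = list(centers)  # centers with no samples stay unchanged
    counts = {}
    for key, values in groups.items():
        new_centers[key], counts[key] = kmeans_reduce(values, len(centers[0]))
    return new_centers, counts


samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers, counts = kmeans_iteration(samples, [(1.0, 1.0), (9.0, 9.0)])
```

In a real deployment the iteration would be repeated until the centers stop changing, with each iteration submitted as a separate MapReduce job.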
Mininath Bendre and Ramchandra Manthalkar performed a case study to predict the behavior patterns of UCAM students, opting for the Azure HDInsight big data solution and using its HDFS implementation. The association rules for the events performed by the students were obtained by implementing the Apriori algorithm within the MapReduce framework.
Neha Verma, Dheeraj Malhotra and Jatinder Singh [8] presented a novel approach using association mining for market basket analysis to understand customers' expectations of a retail store. Customers' buying patterns are analyzed using a MapReduce-based Apriori algorithm implemented with the IRM tool.
Vandana Vijay and Ruchi Nanda [9] proposed the MRC-COVID system to store and process the Covid-19 dataset. An in-memory cache can be implemented on MapReduce to reduce superfluous disk I/O operations at runtime. Adding a cache to MapReduce improves performance and reduces the data workload.
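The MapReduce-based Apriori approach mentioned in the studies above can be illustrated with a small sketch of support counting. This is a generic illustration and not the implementation used in either cited work: the function names (`support_map`, `support_reduce`, `generate_candidates`, `mapreduce_apriori`) and the sample baskets are hypothetical, and everything runs in one process. The map step emits a pair (itemset, 1) for each candidate contained in a transaction, and the reduce step sums the counts and keeps the itemsets that meet the minimum support.

```python
from collections import defaultdict
from itertools import combinations


def support_map(transaction, candidates):
    """Map: emit (candidate, 1) for each candidate found in the transaction."""
    basket = set(transaction)
    return [(c, 1) for c in candidates if basket.issuperset(c)]


def support_reduce(pairs, min_support):
    """Reduce: sum counts per itemset and keep those meeting min_support."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return {s: c for s, c in counts.items() if c >= min_support}


def generate_candidates(frequent_prev, k):
    """Build k-item candidates whose (k-1)-item subsets are all frequent."""
    items = sorted({i for s in frequent_prev for i in s})
    prev = set(frequent_prev)
    return [c for c in combinations(items, k)
            if all(sub in prev for sub in combinations(c, k - 1))]


def mapreduce_apriori(transactions, min_support):
    """Run one map/reduce pass per itemset size until no candidates remain."""
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    result = {}
    k = 1
    while candidates:
        pairs = [p for t in transactions for p in support_map(t, candidates)]
        frequent = support_reduce(pairs, min_support)
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return result


baskets = [["bread", "milk"], ["bread", "butter"],
           ["bread", "milk", "butter"], ["milk"]]
frequent_sets = mapreduce_apriori(baskets, min_support=2)
```

Because each transaction is mapped independently, the counting pass for every itemset size can be distributed across nodes, which is where the MapReduce formulation saves time on large transaction logs.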

Conclusion:
MapReduce is an effective framework for processing large datasets in parallel. Implementing machine learning and deep learning algorithms in MapReduce results in better performance. In this paper, the various algorithms, systems and models proposed by authors for big data analytics in the MapReduce model were reviewed, and the efficiency of the proposed algorithms was discussed. There is a dearth of work on feature engineering in these implementations. In future work, an effective MapReduce algorithm for processing feature-engineered large datasets will be proposed.