Comparison of ECB-LDA and DSO-RBNN in Diabetes Prediction using Big Data Analytics

Diabetes is a chronic disease prevalent all over the world. It affects people of all ages, including newborns. Various machine learning algorithms have already been used to predict diabetes. This work compares two algorithms used for diabetes prediction: Enhanced CatBoost with Linear Discriminant Analysis (ECB-LDA) and Dolphin Swarm Optimization with Radial Basis Neural Network (DSO-RBNN). Hospitals and other clinical centers also face problems in handling large amounts of data. Big data analytics is used to address this problem and to predict diabetes early. The comparison shows that DSO-RBNN achieves better accuracy than ECB-LDA.


Introduction
Nowadays, society handles an enormous amount of data, referred to as Big Data. Various industries face many issues while handling such data. Big Data Analytics (BDA) analyses big data and helps in predicting future events. This work focuses on disease prediction in the health industry, which generates data in bulk. The sources are clinical reports, diagnostic reports, laboratory reports, doctors' prescriptions, medical images, pharmacy information, Electronic Health Records, health insurance reports, etc. The information gathered from these sources constitutes Big Data, and analysing and dealing with it is unavoidable. Many chronic diseases, which persist over the long term, remain a challenge for the health industry. These diseases could be cured if predicted early. Big Data Analytics plays a vital role in analysing the data and predicting the disease early [1][2][3][4][5]. This work aims to improve the prediction of chronic diseases. For experimentation, the PIMA diabetes dataset is used as the input data, and good results were obtained.

Fig 1: 5 V's in Big Data
Big data is characterized by five dimensions (Volume, Velocity, Variety, Veracity, and Value), which are necessary for managing the flood of data.

V's in Big Data
•Volume: It indicates the large amount of data collected from different sources. Approximately 2.5 quintillion bytes of data are generated every day. Big Data plays an essential role in handling such voluminous data.
•Velocity: It denotes the speed at which data is generated. On social media, for example, millions of items are uploaded to Facebook, Twitter, YouTube, and Google daily. Big Data helps organizations handle this data faster.
•Variety: It denotes the structured and unstructured data obtained from different sources. Some examples of structured data are text, pictures, video, etc., and examples of unstructured data are emails, voice mails, audio recordings, etc.
•Veracity: It refers to the quality of the data. Before processing, the data should be checked to confirm that it is clean and accurate.
•Value: It assesses the value of the data and turns bulk data into business value.

Machine Learning
Machine learning is a recent technology that helps improve many industrial and professional processes in daily life. The main types of machine learning algorithms are:
•Supervised: In supervised learning, the algorithms are trained using labelled examples.
•Unsupervised: In unsupervised learning, the algorithm uses data that has no historical labels.
•Semi-supervised: It uses both labelled and unlabelled data for training.
•Reinforcement Learning: Used in robotics, gaming, etc.
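To make the supervised case concrete, the following is a minimal sketch of training on labelled examples with a nearest-centroid classifier. It is not one of the algorithms compared in this work; the function names and the two-feature toy data (imagined as scaled glucose and BMI values) are assumptions chosen only for illustration.

```python
def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each labelled class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Hypothetical labelled examples: (scaled glucose, scaled BMI) -> 0 or 1.
train_x = [(0.2, 0.3), (0.3, 0.2), (0.8, 0.7), (0.9, 0.8)]
train_y = [0, 0, 1, 1]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, (0.7, 0.9)))
print(predict(centroids, (0.1, 0.1)))
```

The labels in `train_y` are what make this supervised: an unsupervised method would receive only `train_x` and would have to discover the groups on its own.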

BDA In Healthcare
Health care faces many challenges in handling an enormous amount of data. BDA is used to handle such complex datasets and to support more efficient decisions. BDA retrieves patient information from Electronic Health Record data, Electronic Medical Record data, imaging data, or sensor data, and converts it into the information that doctors, researchers, or analysts need to make proper decisions. It helps the health care industry provide personalized medicine through prescriptive analytics, predictive analytics, etc. This work focuses on predicting chronic diseases. A single patient may have files such as electronic health records, doctors' prescriptions, lab results, insurance records, medical equipment data, etc. Analysing such a dataset manually is impractical; BDA makes this analysis possible.

Chronic Disease
A chronic disease is a disease that persists for a long time. Such diseases can be caused by tobacco use, lack of physical activity, poor eating habits, etc. Some examples are heart disease, diabetes, kidney failure, cancer, stroke, arthritis, asthma, ulcer, obesity, etc. This paper uses a dataset of diabetic patients for early prediction of the disease. Diabetes is a deadly disease that must be predicted early to reduce its severity. Such prediction may also help medical practitioners make better decisions before giving treatment.

Spark
Spark is the open-source framework used in this work. It can be extended with additional features and integrated with existing software, and it is used in large companies for handling huge volumes of data. Since it does not have its own file system, it integrates with different storage systems such as HDFS, MongoDB, and Amazon S3. Spark is a leading platform for large-scale SQL, batch processing, stream processing, and machine learning.

Features:
•It requires a large amount of memory to process huge amounts of data.
•It supports most programming languages.
•Scalability and fault tolerance.

Mujumdar et al. compared various machine learning algorithms for predicting diabetes and found that Logistic Regression (LR) achieved an accuracy of 96%. After applying a pipeline to those algorithms, the AdaBoost classifier performed better than LR and obtained 98.8% accuracy. Sisodia et al. compared performance metrics such as precision, accuracy, F-measure, and recall for algorithms such as DT, SVM, and NB. They found that NB performed well and obtained 76.3% accuracy. These studies show that various ML and DL algorithms have already been implemented to predict diabetes. Further studies should concentrate on combinations of algorithms to improve performance measures, and these should also be applied to unstructured data [4][5][6][7][8][9].

Methodology
This work applies the ECB-LDA and DSO-RBNN algorithms to the PIMA diabetes dataset.

ECB-LDA Pseudocode
1. ECB-LDA takes the input data from the PIMA diabetes dataset.
2. Pre-process the input data and split it into training and test data.
3. Apply min-max feature scaling to normalize the input variables so that each variable ranges between 0 and 1.

Min-max feature scaling is calculated as
$$x_1 = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ denotes the original value and $x_1$ denotes the normalized value. Min-max normalization to a range $[a, b]$ is calculated as
$$x_1 = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)}$$
where $a$ and $b$ are the min and max values. Mean normalization is calculated as
$$x_1 = \frac{x - \operatorname{mean}(x)}{\max(x) - \min(x)}$$
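The scaling step can be sketched in plain Python as follows. This is only an illustration of the three standard scaling formulas named above, not the authors' implementation; the function names and the sample `glucose` values are assumptions made for the example.

```python
from statistics import mean


def min_max_scale(values):
    """Min-max feature scaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def min_max_scale_to_range(values, a, b):
    """Min-max normalization to [a, b]: a + (x - min)(b - a) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [float(a) for _ in values]
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]


def mean_normalize(values):
    """Mean normalization: (x - mean) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    avg = mean(values)
    return [(x - avg) / (hi - lo) for x in values]


# Hypothetical values for one input variable (not taken from the PIMA dataset).
glucose = [85, 148, 183, 89, 137]

scaled = min_max_scale(glucose)
ranged = min_max_scale_to_range(glucose, -1, 1)
centered = mean_normalize(glucose)
print(scaled)
```

In practice the minimum and maximum are usually computed on the training split only and then reused for the test split, so that no information from the test data leaks into pre-processing.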

Fig.1: Comparison of performance metrics with ECB-LDA and DSO-RBNN algorithms
A comparison of the precision, recall and accuracy of ECB-LDA and DSO-RBNN shows that DSO-RBNN performs better on recall and accuracy.
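For reference, the three metrics used in this comparison can be computed from true and predicted labels as in the sketch below. The function names are illustrative, and the `y_true` and `y_pred` lists are toy labels invented for the example; they are not results from ECB-LDA or DSO-RBNN.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy); undefined ratios fall back to 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Toy labels (1 = diabetic, 0 = non-diabetic); purely illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print(precision, recall, accuracy)
```

Recall matters particularly in disease prediction because a false negative corresponds to a diabetic patient who goes undetected.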

Conclusion
Diabetes is a deadly chronic disease that must be predicted early and treated. Various ML and DL algorithms have already been applied to predict it, but those algorithms still face performance issues. This work compared two algorithms and discussed their performance metrics. The comparison showed that DSO-RBNN achieves better accuracy than ECB-LDA. In future work, the approach should be applied to datasets for other diseases such as heart disease, cancer, and COVID-19.