Performance Analysis of Feature Selection Techniques for Text Classification

The Internet is a convenient, highly available, and low-cost publishing medium, so a significant amount of data is hosted and published on websites. Some of this data is directly accessible to the general public, and some is not publicly distributed. Such data can be used by service providers and administrators for business intelligence and similar applications. In this work, web data mining is the key area of investigation and experimental study. Web data mining can be divided into three major classes: web content mining, web structure mining, and web usage mining. This work considers web content mining and web usage mining. Web content mining is explored first, and a system is developed for a comparative performance study of different content feature selection techniques. In the experiment, the GINI Index (GI), Information Gain (IG), DFS, and Odds Ratio are compared using a real-world collection of web pages. A Support Vector Machine (SVM) is applied to classify the features extracted from the web content. The comparative study indicates that IG and GI are the most suitable feature selection techniques for use with the SVM classifier.


Introduction
The web is a backbone of new-generation technology; research, education, medicine, engineering, and many other areas benefit from it. Web mining is the process of automatically extracting information from web documents and services using data mining techniques. Its main purpose is to discover useful data on the web through patterns. In this work, different formats of web data are explored and web mining techniques are investigated in order to identify effective, efficient, and accurate techniques. Web mining can be classified into three types: web content mining, web usage mining, and web structure mining [1]. Content mining is the web mining process that extracts useful data from web content. This content includes video, audio, text documents, structured records, and hyperlinks, and it is delivered to the user in the form of lists, images, text, tables, and videos. The number of web pages has grown to billions over the last few decades and is still increasing, so searching billions of documents for a query is a time-consuming task. By applying different mining techniques and narrowing down the search data, content mining extracts the queried data so that the user can easily find what is required [3]. Similarly, web usage mining explores the hidden knowledge in web access log files. Finally, structure mining helps to optimize the accessibility and structure of web pages [2].
While the volume of data from heterogeneous sources is growing considerably, foresight and its methods rarely benefit from such accessible data. This work focuses on textual data and considers its use in foresight to address new research questions and to involve different stakeholders. Textual data can be accessed and systematically analyzed through content mining, which structures and aggregates data in a largely automated way.
By exploiting new data sources (for example, Twitter and web mining), more actors and perspectives are incorporated, and more emphasis is placed on the analysis of social change. In this work, web content mining and web usage mining are the main areas of investigation, and the applicability of web mining techniques is demonstrated using a real-world application. The proposed work aims to motivate researchers to employ different data mining techniques for solving real-world problems.

Literature Survey
The authors of [4] explore the potential of text mining for foresight by considering different data sources, text mining approaches, and foresight methods. In [5], the authors extract patterns and reduce the data dimensions of BSS usage by exploring time series representation and clustering of BSS usage data. The work in [6] provides a systematic literature review spanning more than three decades (1983-2016) on clustering algorithms and their applicability and usability in the context of EDM. The goal of the review in [7] is to provide a comprehensive and semi-structured overview of web content mining (WCM) methods, problems, and proposed solutions; it covers 57 publications, including journals, conferences, and workshops, from the period 1999-2018. The authors of [8] give a brief overview of web mining with respect to its techniques, tools, and applications. In [9], two feature selection methods, Bag-of-Words and word counts, are investigated for spam review detection, and different machine learning algorithms are applied, such as Support Vector Machine, Decision Tree, Naïve Bayes, and Random Forest. The research in [10] attempts to address such uncertainty using a data set derived from a publicly available profiling data set; a conventional text feature extraction approach is applied to identify the most significant words in the data set. In [11], an improved global feature selection scheme (IGFSS) is proposed, in which the last step of a common feature selection scheme is modified in order to obtain a more representative feature set. Finally, [12] introduces a fuzzy term weighting approach that makes the most of the HTML structure for document clustering.

Proposed Methodology
The proposed investigation of web mining focuses on content mining and the relevant feature selection techniques. This includes the design of the data model used to accomplish the desired objective. The web mining model is shown in Figure 1, and its components are explained below.

1) Web Page Dataset:
We downloaded a significant number of web pages on different subjects and organized them into a dataset. The dataset is arranged so that each subdirectory name serves as a class label, and the web pages inside a subdirectory are treated as the data instances to be classified into the target subjects or domains.
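The directory layout described above could be read into labeled samples roughly as follows. This is a minimal sketch, not the authors' loader: the function name `load_labeled_pages` and the `*.html` file pattern are assumptions made for illustration.

```python
from pathlib import Path


def load_labeled_pages(root):
    """Return (html_text, label) pairs from a directory tree.

    Assumed layout (hypothetical): root/<class_label>/<page>.html,
    so the subdirectory name is used as the class label.
    """
    samples = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files at the top level
        for page in sorted(class_dir.glob("*.html")):
            text = page.read_text(encoding="utf-8", errors="ignore")
            samples.append((text, class_dir.name))
    return samples
```

Sorting the directories and files only makes the order reproducible; it has no effect on the labels.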

2) Data Preprocessing:
Web data preprocessing involves three main steps: a) removal of HTML tags, b) removal of special characters, and c) removal of stop words.
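The three preprocessing steps can be sketched with the Python standard library as shown below. The stop-word list is a tiny illustrative sample, and treating everything except the letters a-z as a "special character" is an assumption; the paper does not specify either choice.

```python
import re
from html import unescape

# Illustrative stop-word sample only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def preprocess(html):
    """Turn raw HTML into a list of lowercase content tokens."""
    # a) Remove script/style blocks, then every remaining HTML tag.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = unescape(text)  # decode entities such as &amp;
    # b) Remove special characters (assumption: keep letters only).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # c) Remove stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

For example, `preprocess("<h1>The Price of Gold!</h1>")` returns `["price", "gold"]`.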

3) Feature Selection:
Feature selection reduces the data dimension and limits the computational resources required, such as time and memory. In this work we use four popular feature selection techniques for web content mining.

a) GINI Index:
Let $S$ be a set of samples with $k$ classes $(c_1, c_2, \ldots, c_k)$. According to these classes, the data are divided into $k$ subsets $S_1, S_2, \ldots, S_k$. The GINI index of $S$ can then be defined as [16]:

$$Gini(S) = 1 - \sum_{i=1}^{k} p_i^2 \qquad (1)$$

where $p_i$ is the probability that a sample belongs to class $c_i$, estimated as the ratio of the number of samples in $S_i$ to the total number of samples in $S$. The minimum value of GINI is 0, which indicates maximum utility of the data. If the samples are uniformly distributed over the classes, GINI reaches its maximum value, $1 - 1/k$, which approaches 1 as the number of classes grows and indicates minimum utility of the data. For text classification, GINI is used as a measure of data impurity with respect to the class labels associated with the data. Accordingly, a lower GINI value indicates that the attribute is more useful for classification.

b) Information Gain:
In text analysis, IG is used to measure the relevance of an attribute $A$ to the class $C$. The higher the value of IG between class $C$ and attribute $A$, the higher the relevance between them [17]:

$$I(C, A) = H(C) - H(C|A)$$

where $H(C) = -\sum_{c \in C} p(c) \log p(c)$ is the entropy of the class and $H(C|A) = -\sum_{c \in C} p(c|A) \log p(c|A)$ is the conditional entropy of the class given the attribute. For a balanced two-class dataset such as the Cornell movie review dataset, the probability of class $C$ is 0.5 for both the positive and the negative class. As a result, the entropy of the classes $H(C)$ is equal to 1 (with base-2 logarithms), and the information gain can be formulated as:

$$I(C, A) = 1 - H(C|A)$$

The minimum value of $I(C, A)$ occurs when $H(C|A) = 1$, which means that attribute $A$ and the classes $C$ are not related at all. Conversely, we tend to choose an attribute $A$ that appears mostly in one class, either positive or negative. In other words, the best features are the attributes that appear in only one class. The maximum of $I(C, A)$ is reached when $A$ appears only in one class, say $C_1$: then $P(A|C_2) = 0$, so $P(C_1|A) = 1$, $P(C_2|A) = 0$, and $H(C|A) = 0$. The value of $I(C, A)$ therefore varies between 0 and 1.

c) DFS:
DFS is a probabilistic feature ranking metric. Its requirements emphasize that terms present in some of the classes should be ranked relatively high; terms that rarely occur in a single class and do not occur in the other classes are irrelevant and should be ranked lower; and terms that frequently occur in a single class but do not occur in the other classes are highly distinguishing and should be scored higher. The DFS metric assigns scores between 0.5 and 1.0:

$$DFS(t) = \sum_{j=1}^{M} \frac{P(C_j|t)}{P(\bar{t}|C_j) + P(t|\bar{C}_j) + 1}$$

where $M$ is the number of classes, $P(C_j|t)$ is the probability of the $j$-th class given that term $t$ is present, $P(\bar{t}|C_j)$ is the probability of the absence of term $t$ when class $C_j$ is given, and $P(t|\bar{C}_j)$ is the likelihood of the feature when classes other than $C_j$ are given.
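To make the three scores concrete, the sketch below computes them for a single term from per-class document counts. It is an illustration under stated assumptions, not the authors' implementation: the input format (`present[c]` = documents of class `c` that contain the term, `totals[c]` = documents in class `c`) and all function names are hypothetical; GINI is taken as 1 minus the sum of squared class probabilities; IG follows the simplified form H(C) - H(C|A) with base-2 logarithms; and DFS uses the sum over classes of P(C_j|t) / (P(not t|C_j) + P(t|not C_j) + 1).

```python
from math import log2


def class_given_term(present):
    """P(c | t): class distribution among documents containing the term.
    Assumes the term occurs in at least one document."""
    n = sum(present.values())
    return {c: k / n for c, k in present.items()}


def gini(probs):
    """GINI impurity of a class distribution: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def info_gain(present, totals):
    """I(C, A) = H(C) - H(C|A), with H(C|A) taken over P(c | term present)."""
    n_docs = sum(totals.values())
    h_c = entropy(t / n_docs for t in totals.values())
    h_c_given_a = entropy(class_given_term(present).values())
    return h_c - h_c_given_a


def dfs(present, totals):
    """DFS score of a term; assumes at least two non-empty classes."""
    p_c_t = class_given_term(present)
    score = 0.0
    for c in totals:
        absent_in_c = 1.0 - present[c] / totals[c]  # P(not t | C_j)
        others_with = sum(present[o] for o in totals if o != c)
        others_total = sum(totals[o] for o in totals if o != c)
        present_elsewhere = others_with / others_total  # P(t | not C_j)
        score += p_c_t[c] / (absent_in_c + present_elsewhere + 1.0)
    return score
```

With two balanced classes of 10 documents each, a term found in every document of one class and in none of the other gives GINI 0, IG 1, and DFS 1.0, while a term found in every document of both classes gives GINI 0.5, IG 0, and DFS 0.5, which matches the ranges stated in the text.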

d) Odds Ratio:
The odds ratio is a likelihood ratio. Its numerator is the product of $t_p$ and $t_n$, and its denominator is the product of $f_p$ and $f_n$. It expresses the likelihood of a feature occurring in a class. It prioritizes features that have a high occurrence rate in a particular class and ignores features that frequently occur in other classes. It also does not take irrelevant and redundant features into account. Its mathematical formulation is:

$$OR = \frac{t_p \cdot t_n}{f_p \cdot f_n}$$

The odds ratio performs well with a small number of features.
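The ratio described above, (t_p x t_n) / (f_p x f_n), can be sketched as a small function. The reading of the four counts is an assumption, since the text does not define them: here `tp` and `fn` count documents of the target class with and without the term, and `fp` and `tn` count documents of the other classes with and without it. The `smooth` parameter, which adds 0.5 to each count to avoid division by zero, is also not part of the paper.

```python
def odds_ratio(tp, fp, fn, tn, smooth=0.5):
    """Odds ratio of a term for one class: (tp * tn) / (fp * fn).

    smooth is added to every count so that a zero fp or fn does not
    cause a division by zero (an assumption, not from the paper).
    Pass smooth=0 for the unsmoothed ratio.
    """
    return ((tp + smooth) * (tn + smooth)) / ((fp + smooth) * (fn + smooth))
```

For instance, `odds_ratio(8, 2, 2, 8, smooth=0)` evaluates to 16.0, while a term that is equally common inside and outside the class scores 1.0.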

4) Data Splitting:
After feature selection, the system returns a feature vector, which is then used for experimentation and learning with the supervised learning algorithm.

5) Training Set:
Data splitting creates two subsets of the web content feature data. The first, consisting of 70% of randomly selected data instances, is used for classifier training.

6) Testing Set:
The remaining 30% of the randomly selected data is used for testing the trained model.
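A random 70/30 split of this kind can be sketched as follows. The function name `split_70_30` and the fixed seed, used only to make the split reproducible, are assumptions.

```python
import random


def split_70_30(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Every instance ends up in exactly one of the two subsets, so the test set contains only data the classifier did not see during training.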

7) SVM Training:
The SVM is a supervised learning model that is mostly used for binary classification. It uses the concept of a hyperplane to separate two classes.

8) Trained SVM:
The SVM algorithm is trained on the features extracted by the different feature selection techniques. After training on the input features, the algorithm can identify similar patterns.
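To illustrate the separating-hyperplane idea, the sketch below trains a linear soft-margin SVM by sub-gradient descent on the regularized hinge loss. This is a deliberately simple stand-in: the paper does not state which SVM solver, kernel, or hyperparameters were used, so the values of `lam`, `lr`, and `epochs` and the function names are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a hyperplane w.x + b = 0; labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Sample is misclassified or inside the margin:
                # step along the hinge-loss sub-gradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Sample is safely classified: apply only the regularizer.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

A multi-class problem such as web page subjects would need several such binary models (for example, one per class against the rest); that extension is not shown here.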

9) Classified Data and Performance:
Based on the classification of the test data using the trained SVM, the system measures performance in terms of accuracy and error rate. The system also computes its efficiency in terms of time consumed and memory usage.

Implementation
The functional aspects of the proposed framework are delivered through the developed user interface. The design is explained as follows. Figure 2 shows the selection of the HTML dataset, which is available in local storage. Figure 3 shows the feature selection technique implemented with the GINI Index.

Fig. 4. Information gain calculated
Figure 4 shows the implementation of Information Gain based feature selection. Figure 5 shows the calculation of the DFS based feature selection technique. Similarly, Figure 6 shows the odds ratio based feature selection approach.

Result Analysis
The aim of this experimental scenario is to identify an efficient feature selection technique for implementing web content mining based applications. For this purpose, a comparative analysis is conducted between the different feature selection techniques. Four parameters are used to compare the performance.

1) Accuracy - Accuracy is the ratio of correctly classified patterns to the total number of patterns to be classified, expressed as a percentage:

$$\text{Accuracy} = \frac{\text{correctly classified patterns}}{\text{total patterns}} \times 100 \qquad (9)$$

Chart 1. Accuracy (%)
The accuracy of the implemented feature selection techniques is given in Chart 1 and Table 1.

2) Error rate - The error rate is the ratio of misclassified test samples to the total number of samples to be classified, expressed as a percentage:

$$\text{Error rate} = \frac{\text{misclassified samples}}{\text{total samples}} \times 100 \qquad (10)$$

Chart 2. Error rate (%)
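The accuracy and error-rate measures defined above can be computed from the true and predicted labels as in the following sketch. The function name is an assumption; the error rate is obtained as 100 minus the accuracy, which equals the misclassified share because every sample is either correct or misclassified.

```python
def accuracy_and_error(y_true, y_pred):
    """Return (accuracy %, error rate %) for two equally long label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = 100.0 * correct / len(y_true)
    return accuracy, 100.0 - accuracy
```

For example, three correct predictions out of four give an accuracy of 75.0 and an error rate of 25.0.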
3) Time Consumption - The amount of time consumed for classification is calculated using the following formula:

$$\text{Time consumed} = \text{End time} - \text{Start time} \qquad (11)$$

The performance of the implemented feature selection algorithms in terms of time consumption is given in figure 3 and table 3.

4) Memory Usage - The total amount of memory utilized during the execution of an algorithm is measured here as the memory consumption or usage.
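Time consumption and memory usage can both be measured around a single call, as sketched below with the Python standard library. The paper does not say how the two quantities were measured, so this is only one possible approach: `measure` is a hypothetical helper, and `tracemalloc` reports only memory allocated through Python, not the total footprint of the process.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start      # end time minus start time
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping each feature selection run and the subsequent classification in `measure` would yield comparable time and memory figures for the four techniques.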

Conclusions and Future Work
In this work, web data mining is the main area of investigation. Web usage mining and web content mining are studied, and the relevant methods are demonstrated. For web content mining, web pages are used for experimentation because most web content is published as HTML pages. These web pages include different formatting tags as well as text content, which makes them complex to process and classify using a basic machine learning method. Therefore, several feature selection techniques are explored first, namely GINI Index, Information Gain, DFS, and odds ratio. All of these methods rank the text features so that the most appropriate ones can be selected according to their class labels. The experimental study offers different techniques and methods that are useful for the various data mining approaches used in web data mining. In future work, this experiment will be extended to find a suitable and efficient classifier for web content classification. To that end, the two selected feature selection techniques, GINI Index and Information Gain, will be used with three popular supervised learning algorithms, namely SVM, SVR, and k-NN.