Named Entity Recognition using Ensemble Learning

The transition from Industry 4.0 to Industry 5.0 provides numerous research opportunities for industrialists and researchers. This industrial revolution is pushing automation in the life science domain to its peak. In this digitalized world, big data plays a key role in providing valuable insights through various analytical methods. In life science, the huge amount of available textual data contains a wide range of valuable information. Natural language processing (NLP) plays a major role in extracting this hidden information from big data. Within NLP, named entity recognition (NER) is a key task and one of the biggest challenges for the research community. This paper presents the high-level architecture of an NER system based on an ensemble learning (EL) method. The EL model contains a dictionary-based entity identifier and a self-learning classifier. The proposed model performed well and produced high accuracy.

Life science is one of the prominent and growing domains in industry. The rise of Artificial Intelligence (AI) in the pharmaceutical industry has created many research and job opportunities. Many of the applications used in the life science industry are being migrated to automation. Most of the data are in textual format, which poses a major challenge for researchers and industrialists. For working with textual data, natural language processing (NLP) is one of the key techniques for extracting valuable insights.
Industries such as healthcare, education, finance and social media hold abundant information that is difficult to handle, and NLP is important for handling those sources. This paper stresses the role of NLP strategies in the biomedical field. According to Statista, a 2019 projection estimated that the healthcare industry would hold 2134 exabytes of data in 2020. This data deluge is commonly generated by electronic healthcare records, which are stored in regional languages. The way the stored data are structured makes it more difficult to retrieve the hidden information from the huge amount of text. The digitalized information in clinical records is frequently stored in formal language. Natural language is helpful for researchers and industrialists to communicate events and clinical concepts; however, it makes the information hard to search because suitable technologies and tools are lacking. To overcome these difficulties, the data must be properly processed with NLP techniques. Named Entity Recognition (NER) is a key NLP task that extracts the entities of interest (e.g., disease names, medication names and lab tests) from clinical narratives, thereby supporting clinical and translational research. The paper is organized as follows: the background study of NER in the biomedical domain is presented in Section 2. The architectural flow of the proposed Lex-NER model is described in Section 3 with workflow figures. Section 4 presents the results and discussion. Section 5 concludes with the limitations of the proposed work.

Background Study
Named Entity Recognition (NER) is a powerful technique in NLP [1]. It is a sub-field of information retrieval. It is the task of recognizing the expressions in a text that should be classified as denoting entities. Example entity tags in the clinical domain are diseases, drugs, treatments, genes, cancer, protein and RNA [2, 3, 4]. A large part of the research in life science informatics has focused on NER. According to [5], the majority of the methods are rule-based, although some hybrid approaches that combine machine learning with these rules have also been implemented.
The authors of [6] mention Conditional Random Fields (CRF), Support Vector Machines (SVM) and Hidden Markov Models (HMM) as common machine learning methods currently applied to NER tasks in the clinical domain. The most recent papers focus on deep learning approaches based on recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) [7] and Gated Recurrent Units (GRU) [8]. A common pattern is to combine the RNN with a statistical method placed on top of the recurrent layers, which ensures that the optimal sequence of tags over the entire sentence is obtained [9]. CRF is the most commonly used statistical method in this hybrid approach; the authors of [9] combined an RNN with a CRF. Because of the challenges listed below, clinical NER efforts achieve lower performance than comparable experiments on corpora from non-specialized domains: the best F1 score obtained by [9] is 91.32%, whereas the authors of [10] recently obtained an F1 score of 93% on the CoNLL 2003 corpus.
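To illustrate why a statistical layer on top of the recurrent layers yields the optimal tag sequence for the whole sentence, the following minimal sketch implements Viterbi decoding, the standard decoding step for a linear-chain CRF. It is not taken from [9]; the tag set, the emission scores (standing in for RNN outputs) and the transition scores are made-up values chosen only for illustration.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[i][t]   : score of tag t at token i (e.g. produced by an RNN)
    transitions[a][b] : score of moving from tag a to tag b
    """
    if not emissions:
        return []
    # best[i][t] = best total score of any path that ends in tag t at token i
    best = [{t: emissions[0][t] for t in tags}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[i][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Illustrative (assumed) scores: an "I" tag directly after "O" is penalized.
TAGS = ["O", "B", "I"]
TRANS = {
    "O": {"O": 0.0, "B": 0.0, "I": -10.0},
    "B": {"O": 0.0, "B": 0.0, "I": 0.0},
    "I": {"O": 0.0, "B": 0.0, "I": 0.0},
}
EMIS = [
    {"O": 2.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 1.0, "I": 1.5},
    {"O": 0.0, "B": 0.0, "I": 2.0},
]

# Picking the best tag per token independently gives an invalid sequence.
greedy = [max(TAGS, key=lambda t: e[t]) for e in EMIS]
print(greedy)                       # ['O', 'I', 'I']
print(viterbi(EMIS, TRANS, TAGS))   # ['O', 'B', 'I']
```

Choosing the best tag for each token independently starts an entity with an inside tag, while decoding over the whole sentence takes the transition scores into account and recovers a consistent sequence.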
Firstly, the data available to researchers in the biomedical field are limited, mostly because of patient privacy and confidentiality requirements. The available annotated databases are usually insufficient to train a model for the named entity recognition task [6]. Secondly, clinical texts are written in a specific manner that differs from ordinary language. They contain many incomplete sentences and informal grammar, and are littered with misspellings, non-standard shorthand, abbreviations and acronyms. Thirdly, medicine is a rapidly expanding field in which a large number of studies are conducted, adding to a constantly growing number of clinical concepts. This makes it extremely hard to stay up to date. Furthermore, concepts in medicine often carry a meaning that depends on their context, which requires NER models to retain word context information throughout the training process. Another typical feature is that clinical language is characterized by long phrases containing special characters and dashes.
The majority of studies are performed on corpora in English. The second most common language is Chinese. There is a clear lack of research in other languages. The term natural language is used to describe any language used by humans, to distinguish it from the programming languages and data representation languages used by computers, which are described as artificial [11, 12]. The term natural language processing (NLP) describes computational techniques that process spoken and written human language [13]. NLP includes data preprocessing techniques such as data cleaning, tokenization and normalization (stemming, lemmatization or other forms of standardization). Preparing the text requires choosing the right tools, but it helps to improve the accuracy of the subsequent NLP tasks. Other NLP tasks focus on extracting statistical features such as term frequency and inverse document frequency, or linguistic features such as Part of Speech (POS) tags. NLP techniques are tools for accomplishing a higher-level task. Information Extraction (IE), which involves searching for relevant information in documents, is among the most widely applied of these tasks.
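As a small illustration of the preprocessing and statistical features mentioned above, the sketch below tokenizes a few documents and computes term frequency and inverse document frequency weights. It uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency), which is an assumption rather than a formula taken from the cited works. The example sentences are invented, and normalization is limited to lowercasing; stemming and lemmatization are omitted.

```python
import math
import re
from collections import Counter
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Clean and tokenize: lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["Aspirin reduces fever.", "Aspirin treats pain."]
weights = tf_idf(docs)
print(tokenize(docs[0]))      # ['aspirin', 'reduces', 'fever']
print(weights[0]["aspirin"])  # 0.0, because the term occurs in every document
print(weights[0]["fever"] > 0)  # True, because the term is specific to one document
```

Terms that occur in every document receive a weight of zero, while terms specific to a single document receive a positive weight, which is why such features help to separate informative words from common ones.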
NER is a stage of Information Extraction. It is one of the key NLP tasks that help to convert unstructured text into computer-readable structured data [13]. NER refers to the task of recognizing the expressions that denote entities (i.e., named entities such as diseases, drugs or people's names) in free-text documents [14]. NER can be solved with many techniques, which can be divided into several groups [15]: dictionary-based approaches, rule-based approaches, statistical approaches, deep learning approaches and hybrid approaches. The authors performed the NER task on the GENIA corpus. GENIA is commonly used by researchers both as a dictionary and as a base corpus for the NER task. It is available in several versions and formats. The authors used a version of the corpus that consists of 1001 abstracts from the MEDLINE database, annotated with a taxonomy of 30 biologically relevant classes.

Materials and Methods
Lex-NER is a hybrid NER framework that is trained and tested on PubMed abstracts. The proposed work aims to build a hybrid model that combines a string matcher with a Machine Learning (ML) model to produce better accuracy. The ML model incorporates a phase known as human-in-the-loop, which is used to increase its accuracy: domain experts evaluate the results of the ML model and update the training dataset. The framework consists of four modules:
(1) Identifying the abstracts based on the keywords
(2) NLP module, which includes the text processing
(3) Data preparation for training the machine learning model
(4) Machine Learning module
The first module identifies the abstracts from the PubMed articles. For this experiment, the search has been limited to certain keywords, such as drug names, diseases and symptoms. The documents are classified based on these keywords. Around 531 documents are collected for the proposed work.
Figure 3 depicts the graphical representation of the comparative study.
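The idea of combining a string matcher with an ML model can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lex-NER implementation: the toy lexicon, the BIO tagging scheme, the feature names and the helper functions `select_abstracts`, `dictionary_tags` and `token_features` are all hypothetical. The sketch selects abstracts by keyword, annotates tokens with a greedy longest-match dictionary lookup, and turns the dictionary annotation into a feature that a sequence classifier could consume.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; the dictionary used in the paper is not reproduced here.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("aspirin",): "DRUG",
    ("type", "2", "diabetes"): "DISEASE",
    ("fever",): "SYMPTOM",
}
MAX_LEN = max(len(entry) for entry in LEXICON)

def tokenize(text: str) -> List[str]:
    return re.findall(r"[A-Za-z0-9]+", text)

def select_abstracts(abstracts: List[str], keywords: List[str]) -> List[str]:
    """Keep abstracts that mention at least one single-word keyword."""
    wanted = {k.lower() for k in keywords}
    return [a for a in abstracts if wanted & {t.lower() for t in tokenize(a)}]

def dictionary_tags(tokens: List[str]) -> List[str]:
    """Greedy longest-match lookup in the lexicon, emitting BIO tags."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = LEXICON.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return tags

def token_features(tokens: List[str], dict_tags: List[str], i: int) -> Dict[str, object]:
    """Features for token i; the dictionary tag is passed to the ML model."""
    return {
        "word": tokens[i].lower(),
        "is_title": tokens[i].istitle(),
        "dict_tag": dict_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

text = "Aspirin is prescribed for type 2 diabetes patients with fever"
tokens = tokenize(text)
tags = dictionary_tags(tokens)
features = [token_features(tokens, tags, i) for i in range(len(tokens))]
print(list(zip(tokens, tags)))
```

In a human-in-the-loop setting, the feature dictionaries and the labels corrected by domain experts would form the training data for the classifier, so that the model can learn entities the dictionary does not cover.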

Conclusions
The proposed hybrid NER model outperforms the existing LingPipe and ABNER systems. The state of the art examined in the background study provides a solid foundation for the NER task. The model comprises a string matcher and a CRF model. The UMLS database is used to identify the entities with the phrase matcher. The matched words are annotated with their entity type and passed to the model. The CRF model, which is based on a forward parsing method, produced good accuracy. The model is trained on the 531 abstracts that were extracted from the PubMed database. In future, the work will be extended by increasing the number of abstracts and the number of features.