Comparative Analysis of Segmentation and Recognition Techniques for Offline Handwritten Words

Pre-processing is the initial and vital phase of optical character recognition. Segmentation deals with the extraction of individual components from a document image. A number of techniques, such as projection profiles, connected components and gaps between characters or components, are reported in the literature for component extraction, followed by feature extraction and recognition of the individual components. These techniques give good results if the components are isolated but fail if the components are touching, shadowed or skewed. A novel technique is required to address such issues and enhance the recognition rate. The problem of segmentation for cursive handwriting in Roman script has been addressed by various authors, but it has not been addressed sufficiently for Indian scripts, especially Devanagari. This paper is a review confined to the offline handwritten script domain. It attempts to review various techniques for character segmentation, considering touching characters, for offline handwritten words in Devanagari and in scripts sharing similar characteristics (such as Bangla and Gurumukhi), together with the databases used and the accuracies reported in the literature.


Introduction
OCR (Optical Character Recognition) is a process that converts printed or handwritten data in image form, captured online or offline, into machine-encoded form. The purpose of converting data images into digital format is to edit and search the data electronically and to store the digitized data compactly. ICR (Intelligent Character Recognition) is more precise than OCR, as the computer system is made to learn different styles and fonts; its major application is automated form processing. It has major advantages in terms of speed, accuracy and cost. It reduces errors, since manual data entry is prone to typographical mistakes.
Devanagari script is widely used in the northern and western parts of India. The script has more than 300 million users and various applications. Segmentation-based and holistic approaches are used in the literature for the recognition of Devanagari script, and a number of languages have been recognized using these approaches. Both approaches have shortcomings. According to the literature survey, the holistic approach does not give good results (Shaw, Parui, and Shridhar 2008). The segmentation approach gives better results, but segmentation of Devanagari script is difficult because of its large character set, which includes vowels, consonants, compound characters and modifiers. Poor segmentation contributes to recognition error. In (Shaw 2008a), HMM has been used for recognizing handwritten words, but with only limited success, and that too with presegmented letters.
According to the literature survey, many research papers present techniques for offline handwritten character recognition for Latin and for other Asian languages, but only a few papers are available for Devanagari script (Hindi). One reason may be the non-availability of standard databases of handwritten text, words and characters. The large character set of the language poses another difficulty. Hindi, Marathi, Nepali, Konkani, Sindhi and Kashmiri are among the languages written in Devanagari script. Punjabi and Bengali are languages of other scripts that share characteristics with Devanagari script.
This paper is divided into 8 sections. Sections 2 and 3 deal with the need for segmentation and the various difficulties faced while segmenting a word into individual components. Section 4 presents the techniques used in the literature for segmentation in different scripts. The databases used by different authors in their respective work are discussed in Section 5. Section 6 covers validation and testing. Section 7 comprises a brief discussion of the techniques used. Section 8 gives the conclusion and future scope in the specified area of research. A comprehensive bibliography, which includes the most relevant papers on the segmentation of offline handwritten scripts, is added to provide an outline for development in the field.

Need of Segmentation:
The holistic approach gives lower accuracy than the segmentation approach (Shaw, Parui, and Shridhar 2008) [2] [4]. Segmentation reduces the complexity of recognition. If a word is properly segmented, the number of classes used in the recognition system equals the number of characters and no more. In a text image, line segmentation followed by word segmentation leads to character-level segmentation. The different levels of segmentation are discussed in (Mehul et al. 2014). Character-level segmentation is the lowest level of segmentation, and it presents fundamental challenges due to the variability of handwritten data.

Difficulties in Segmentation:
The horizontal headline (Shirorekha) used in scripts such as Devanagari (Hindi, Marathi, Nepali), Bangla and Gurumukhi (Punjabi) makes the segmentation problem more difficult. The spacing between characters in handwritten data varies, which also makes segmentation difficult. The large character set, which includes consonants, vowels, modifiers and compound characters, makes segmentation more complicated. Different shapes, writing styles and writing devices further complicate the process. The cursive nature of handwriting connects characters to each other, and some characters share similar contours. The contact point between touching characters may lie at any elevation, and the boundary between them may be non-linear (Lu and Shridhar 1996). Finally, a junction path must be found to segment touching components.
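As a minimal sketch of how the headline could be located, assume a deskewed binary word image stored as a list of rows (1 for ink, 0 for background). Under that assumption, the row with the highest ink count in the upper half of the image can be taken as the Shirorekha. The function names and the toy image below are illustrative assumptions and are not taken from any of the reviewed papers.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def find_headline_row(image):
    """Return the index of the densest row in the upper half of the image.

    Assumption: the word is deskewed, so the Shirorekha shows up as a
    single dominant peak in the horizontal projection profile.
    """
    profile = horizontal_profile(image)
    upper_half = profile[: max(1, len(profile) // 2)]
    return upper_half.index(max(upper_half))


# Hypothetical 6x8 binary word image; row 1 plays the role of the headline.
word = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
headline = find_headline_row(word)
```

On skewed or wavy handwriting the headline spreads over several rows, so a single peak of this kind is not reliable; this is one reason the headline makes segmentation harder in practice.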

English
The first survey that focuses on touching characters is given by Saba et al. (Saba, Sulong, and Rehman 2010a). The survey covers the approaches used for segmentation, the segmentation rates and the test data used for experiments up to 2010. Chen et al. (Chen 1994) used an HMM (Hidden Markov Model) stochastic network for unconstrained word recognition. A segmentation approach is followed, using morphology- and heuristics-based segmentation. The proposed algorithm uses a modified Viterbi algorithm to search for the best path. The results are obtained by applying the algorithm to 1583 images (1489 training images and 94 test images). The algorithm successfully segmented 95.6% of the training images.
A junction-based approach with fuzzy features is used for the segmentation of touching strings in (Jayarathna and Bandara 2006). The character skeleton is used to find junction points, i.e. pixels having three or more neighbouring points. The authors of (Saba, Sulong, and Rehman 2010b) proposed segmentation of touching characters in Roman cursive handwriting based on a genetic algorithm. Experiments are performed on 300 unconstrained cursive handwritten word images. The results are tested on the IAM benchmark database, and up to 89.76% accuracy is obtained. The contour-based approach of Ventzislav et al. (Alexandrov 2004) uses geometrical and structural information to find critical points on the contours. Another technique for segmentation of offline cursive handwritten words is given by F.
In (Kaur, Singh, and Rani 2015), the problem of broken and overlapped characters is discussed, and a projection profile combined with neighbouring pixels is applied to touching components (characters) in Gurumukhi script. A neighbouring-pixel approach is applied in (Mangla, n.d.) for segmentation that includes broken and touching words in Gurumukhi script. A database of 50 words each for isolated, touching and broken words is used, and accuracies of 97%, 95% and 95% respectively are reported. The authors of (Sharma and Lehal 2006) proposed an iterative technique to segment words. The presence of the headline, the aspect ratio of characters, and the vertical and horizontal projection profiles are used as characteristic features to segment the words.
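The junction-point criterion mentioned above for the skeleton-based approach of (Jayarathna and Bandara 2006) can be sketched as follows. This is only an illustration under stated assumptions: the skeleton is taken to be a one-pixel-wide binary image stored as a list of rows, neighbours are counted with 8-connectivity, and a threshold of three or more neighbours is assumed. The function name and the toy skeleton are hypothetical and do not reproduce the authors' implementation.

```python
def junction_points(skeleton):
    """Return (row, col) of skeleton pixels with three or more 8-neighbours."""
    rows, cols = len(skeleton), len(skeleton[0])
    junctions = []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            neighbours = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # skip the pixel itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and skeleton[rr][cc]:
                        neighbours += 1
            if neighbours >= 3:
                junctions.append((r, c))
    return junctions


# Hypothetical skeleton of two strokes crossing at the centre pixel.
cross = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
points = junction_points(cross)
```

A plain neighbour count can also flag pixels adjacent to the true junction when strokes meet at right angles, so candidate points usually need to be grouped or filtered before they are used as cut locations.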

Kannada
In (Mamatha and Srikantamurthy 2012), a segmentation scheme for unconstrained handwritten Kannada script is proposed. Segmentation of words and characters is accomplished using projection profiles and morphological operations. Accuracies of 82.35% for word segmentation and 73.08% for character segmentation are reported. The authors of (Venkatesh, Majjagi, and Vijayasenan 2014) proposed implicit segmentation for character segmentation along with recognition using HMM. In (Naveena and Manjunath Aradhya 2012), thinning, branch points and mean points are used to find segmentation points. The authors used expectation-maximization to learn a mixture of Gaussians.
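The projection-profile idea used for word and character segmentation can be illustrated with a short sketch. It assumes a binary image stored as a list of rows (1 for ink, 0 for background) and cuts wherever a column contains no ink. The function names and toy images are hypothetical, and the morphological operations used in the cited work are not modelled here.

```python
def vertical_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(col) for col in zip(*image)]


def segment_columns(image):
    """Return (start, end) column ranges, end exclusive, split at blank columns."""
    segments = []
    start = None
    for x, count in enumerate(vertical_profile(image)):
        if count > 0 and start is None:
            start = x  # a new component begins
        elif count == 0 and start is not None:
            segments.append((start, x))  # a blank column closes it
            start = None
    if start is not None:
        segments.append((start, len(image[0])))
    return segments


# Two hypothetical characters separated by blank columns.
isolated = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
# The same characters joined by a stroke in the middle row.
touching = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
isolated_segments = segment_columns(isolated)
touching_segments = segment_columns(touching)
```

The isolated image yields two column ranges, while the touching image yields a single range covering both characters, which shows why projection profiles alone fail on touching components.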

Oriya
Tripathy and U. Pal (Tripathy and Pal 2004) proposed a technique for segmenting unconstrained Oriya handwritten text into individual characters. A projection profile is used for line segmentation, and structural features are used for word segmentation. Segmentation of isolated and touching characters is proposed using water-reservoir, structural and topological features. A set of 1840 touching components is prepared, consisting of two or more characters touching each other in writing (two characters, three characters or more than three characters). Segmentation accuracies of 96.7%, 95.1% and 93.3% respectively are reported.
A further work on Hindi words covers various issues using a hierarchical segmentation approach, such as headline detection, separation of upper and lower modifiers, and identification of conjuncts. An accuracy of 78% is reported by applying structural features in hierarchical order. Morphological operations are used in (Ladwani and Malik 2010) and applied to 100 words, with 57% accuracy reported for segmentation of top modifiers, 55% for lower modifiers and 52% for middle-zone characters. A script-independent approach for character segmentation is given by Ram Sarkar et al. (Sarkar 2010

Database
The non-availability of a touching-character database forces authors to create their own databases for validating and testing their respective work.

Validation and Testing
Authors have used various techniques to segment lines, text or words into characters. Because no standard database is available, researchers need to create their own databases and apply their techniques to them. Validating or comparing the results obtained is therefore not possible, since there is no standard or benchmark database. According to the literature survey, the results obtained using the various segmentation techniques are verified manually.
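If ground-truth cut positions were available, the manual check could in principle be replaced by an automatic score. The sketch below is one possible formulation and not a measure used in the reviewed papers: it assumes that cuts are given as column indices and that a predicted cut counts as correct when it lies within a hypothetical tolerance of an unmatched ground-truth cut.

```python
def segmentation_accuracy(predicted, ground_truth, tolerance=2):
    """Percentage of ground-truth cut columns matched by a predicted cut.

    Each predicted cut can match at most one ground-truth cut, and a match
    requires the two columns to differ by no more than `tolerance` pixels.
    Matching is greedy from left to right, which is an assumption of this
    sketch rather than an optimal assignment.
    """
    if not ground_truth:
        return 0.0
    unused = sorted(predicted)
    matched = 0
    for truth in sorted(ground_truth):
        for cut in unused:
            if abs(cut - truth) <= tolerance:
                unused.remove(cut)  # each prediction is used only once
                matched += 1
                break
    return 100.0 * matched / len(ground_truth)


# Hypothetical cut columns for one word image.
score = segmentation_accuracy([10, 31, 60], [12, 30, 45, 61])
```

In this example three of the four ground-truth cuts are matched, so the score is 75.0. A shared benchmark with such annotations would make results from different papers directly comparable.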

Discussion
Projection profiles, connected components, structural properties, and combined recognition and segmentation using neural networks are among the techniques applied by authors for segmenting words into characters. In thinning-based methods, a thinning algorithm is applied to the word image and candidate segmentation points are found. Contour-based approaches are applied in many papers to find valley and crest points. The water-reservoir approach is used for many Indian scripts with similar touching patterns. It was conceived by U. Pal et al. (U. Pal, A. Behaid, C. Choisy 2003) for segmenting touching numerals and was later applied to Bangla, Oriya, Punjabi and Thai touching characters. A brief comparison is shown in Table 1. According to the literature survey, only a few papers consider segmentation of touching or fused characters in offline handwritten data, compared with online handwritten or printed data.
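Connected-component extraction, one of the techniques listed in this discussion, can be sketched with a breadth-first flood fill. The sketch assumes a binary image stored as a list of rows and 8-connectivity; the function name and toy images are hypothetical. It is meant only to show why touching characters defeat a purely component-based segmentation.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) ink pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


# Two hypothetical characters, first isolated and then joined by a stroke.
isolated = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
touching = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
isolated_count = len(connected_components(isolated))
touching_count = len(connected_components(touching))
```

The isolated image produces two components and the touching image only one, so an additional step such as junction, contour or water-reservoir analysis is needed to split the merged component.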

Conclusion and Future directions
A number of techniques have been proposed for segmenting text into its constituent components, and authors have used their own self-created databases to test the proposed techniques. The unavailability of a benchmark database is the major challenge faced by researchers in optical character recognition. For a number of languages, performance evaluation is done manually because no benchmark database exists. Contributing to such a database is one way to help the research community overcome this problem. An isolated character and word database for Devanagari script (Hindi and Marathi) is available with CEDAR. A word database is available online, and a word database is made available on request. The techniques given in the literature for segmentation and recognition of online handwritten or printed text and of offline printed text are applied and tested under some text constraints. Enhanced techniques are required to address the problem of segmentation for offline handwritten text or data. Further, a robust technique is required to segment and recognize script so that it can be applied to unconstrained handwritten text.