Exploring Resources in Word Sense Disambiguation for Marathi Language

Word Sense Disambiguation (WSD), the task of finding the correct sense of a word in a particular context, is one of the most challenging problems in natural language processing. Humans can usually determine the correct sense of a word in a sentence because of their knowledge of that particular natural language, but disambiguating a word is not an easy task for a machine. Developing any WSD system requires a sense repository and a sense dictionary, and these resources are costly and time-consuming to build. Such resources are available for many foreign languages, which is why a lot of work has been done for languages like English, German and Spanish. For Indian languages like Hindi, Marathi and Bengali, much less work has been done, and the reason is resource scarcity. In this paper, we focus mainly on Word Sense Disambiguation for Marathi, because much less work has been done for Marathi than for Hindi and other Indian languages. Our main objective is to provide information about the various resources available for the Marathi language, which will be helpful for researchers who want to work on Marathi WSD. This paper also reviews the work done on Marathi WSD and its challenges and problems.


Introduction
Word sense disambiguation is the process of automatically recognizing which of the multiple meanings of an ambiguous word is being used in a specific sentence. In other words, WSD identifies which meaning (i.e. sense) of a word is used in a sentence when the word has multiple senses. WSD is an open problem in computational linguistics and a major challenge in natural language processing. The human mind is quite good at word sense disambiguation: human language developed in such a way that people easily understand the meaning intended in a sentence. For computers, by contrast, natural language processing has been a long-standing challenge. In the supervised machine learning approach, a classifier is trained for each distinct word on a manually sense-annotated corpus. These methods assume that the context can provide sufficient evidence on its own to disambiguate the sense. They use an annotated corpus, and ambiguity is resolved by finding the nearest, most similar words.

WSD Applications and Techniques
WSD is required in various areas like Information Retrieval (IR), Sentiment Analysis, Knowledge Graph Construction, Text Mining and Information Extraction (IE), Lexicography and Machine Translation. Solutions to WSD are mostly categorized into knowledge-based, supervised and unsupervised approaches.
Each approach has many algorithms. The supervised approach includes Decision List, Decision Tree, Naïve Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest, k-Nearest Neighbours (KNN), Ensemble Methods, Neural Networks, etc. The unsupervised approach includes Word Clustering, Context Clustering, Co-occurrence Graph, K-means, the Apriori algorithm, etc. The knowledge-based approach includes the Genetic algorithm, Decision Support, the Lesk algorithm, Semantic Similarity and Selectional Preferences.
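To make the supervised family concrete, the following is a minimal sketch of a Naïve Bayes sense classifier for a single ambiguous word. It is only an illustration under stated assumptions: the training sentences, the sense labels ("financial", "river") and the function names train and predict are hypothetical, and English toy data is used in place of a sense-annotated Marathi corpus.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count senses and context words from (context, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for word in context.lower().split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab

def predict(context, model):
    """Return the sense with the highest log-probability for the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior of the sense
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-annotated contexts for the ambiguous word "bank".
EXAMPLES = [
    ("deposit money in the bank", "financial"),
    ("the bank approved the loan", "financial"),
    ("fishing on the river bank", "river"),
    ("the bank of the stream was muddy", "river"),
]

model = train(EXAMPLES)
print(predict("loan money", model))
print(predict("muddy river", model))
```

With real data, one such classifier would be trained per ambiguous word, which is why the approach depends so heavily on a sense-annotated corpus.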
The authors of [1] explored the status of WSD work in the Marathi language. Researchers have used different algorithms for disambiguating Marathi sentences, such as a graph-based algorithm that resolves ambiguity based on word sense and context domain. One group used a Genetic Algorithm technique to resolve the ambiguity of words based on their context domain and their senses. Other authors used an approach consisting of a modified Lesk algorithm with a Support Vector Machine. The accuracy of each algorithm depends on the text corpus and on the techniques applied to the data set.
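As an illustration of the knowledge-based family, the following is a minimal sketch of the simplified Lesk idea: pick the sense whose dictionary gloss shares the most words with the context. It is not the modified Lesk variant used by the cited authors. The sense inventory, glosses, stop-word list and function names are hypothetical, and English toy data stands in for glosses that would come from a resource such as a WordNet.

```python
# Hypothetical stop-word list; a real system would use a language-specific one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "that", "is"}

# Hypothetical sense inventory: word -> {sense id: gloss}.
SENSES = {
    "bank": {
        "bank.financial": "an institution that accepts deposits of money and lends money",
        "bank.river": "sloping land beside a river or lake",
    }
}

def content_words(text):
    """Lowercase, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, sentence, inventory):
    """Return the sense id whose gloss overlaps most with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, gloss in inventory[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense_id, overlap
    return best_sense

print(simplified_lesk("bank", "he sat on the bank of the river", SENSES))
print(simplified_lesk("bank", "she went to the bank to deposit money", SENSES))
```

The sketch matches exact word forms only, so "deposit" does not match "deposits". For a morphologically rich language like Marathi, stemming or lemmatization before the overlap count would likely matter.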

Marathi Language and its Word Categories
Marathi is an Indo-Aryan language of Sanskrit origin. It is the official language of Maharashtra, a state in western India, and it has one of the largest numbers of speakers worldwide. Approximately 90 million people in India speak this language. The dialects of Marathi include Varhadi, Gawdi of Goa, Nagpuri Marathi, Dangi, Malwani, Kudali, Kasargod, Kosti, Ahirani of Khandesh, etc. Marathi follows Subject-Object-Verb (SOV) word order, and nouns inflect for gender, number, etc. The Marathi language has eight main parts of speech (POS): Noun, Verb, Adjective, Adverb, Pronoun, Postposition, Conjunction and Interjection [3].

One of the challenges in researching Marathi WSD is a lack of resources. Still, some researchers have started working on the Marathi language and have developed some Marathi WSD systems. In this section, we give information about the resources available for the Marathi language, which will be useful for researchers who want to work on this problem.

Marathi WordNet
Marathi WordNet is a machine-readable dictionary based on the English WordNet. It is more than a traditional dictionary: it gives different relations between synsets (synonym sets), each of which represents a unique concept. It was developed by Dr. Pushpak Bhattacharyya and his team at IIT Bombay. Marathi WordNet is organized as a large electronic database in the form of a semantic network.
It is constructed using paradigmatic relations such as synonymy, hyponymy, antonymy and entailment, and it is a widely used lexical database for NLP research on Marathi today. Marathi WordNet lists the different senses, called synonym sets or synsets, of each open-class word, i.e. nouns, verbs, adjectives and adverbs. It has an index_txt file, which provides information about all words present in the WordNet; a data_txt file, which provides the details of every word in the index file; and an onto_txt file, which provides ontology details of the words in the data file [2].
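The idea of a WordNet as a semantic network of synsets can be sketched with a small in-memory structure. This is a hypothetical illustration only: the class names, field names and sample entries are assumptions, English toy entries are used, and nothing here reflects the actual layout of the index, data or ontology files. It only mirrors the general split between a word index and per-synset data.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    synset_id: int
    pos: str          # part of speech, e.g. "noun"
    words: list       # synonymous words that express this concept
    gloss: str        # short definition of the concept
    relations: dict = field(default_factory=dict)  # relation name -> synset ids

class ToyWordNet:
    def __init__(self, synsets):
        # "Data" part: full details of every synset, keyed by id.
        self.by_id = {s.synset_id: s for s in synsets}
        # "Index" part: every word mapped to the ids of its senses.
        self.index = {}
        for s in synsets:
            for word in s.words:
                self.index.setdefault(word, []).append(s.synset_id)

    def senses(self, word):
        """All synsets (candidate senses) that contain the word."""
        return [self.by_id[i] for i in self.index.get(word, [])]

    def related(self, synset_id, relation):
        """Synsets linked to the given synset by the named relation."""
        ids = self.by_id[synset_id].relations.get(relation, [])
        return [self.by_id[i] for i in ids]

wn = ToyWordNet([
    Synset(1, "noun", ["bank"], "an institution that handles money",
           {"hypernym": [3]}),
    Synset(2, "noun", ["bank", "riverbank"], "sloping land beside a river"),
    Synset(3, "noun", ["institution"], "an established organization"),
])

for sense in wn.senses("bank"):
    print(sense.synset_id, sense.gloss)
print([s.words for s in wn.related(1, "hypernym")])
```

A WSD system would call something like senses(word) to obtain the candidate senses and then use glosses or relations to choose among them.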

IndoWordNet
IndoWordNet was developed based on the EuroWordNet dictionary. It is a linked lexical knowledge base of eighteen scheduled languages of India, namely Marathi, Hindi, Malayalam, Telugu, Kashmiri, Bodo, Bangla, Gujarati, Kannada, Odia, Konkani, Manipuri, Assamese, Punjabi, Nepali, Tamil, Sanskrit and Urdu. It offers an online interface through which users in various organizations can get results according to their needs. The look and feel of IndoWordNet are the same as those of a conventional dictionary, which keeps it easy for users to adapt to. The IndoWordNet database structure is imported from the English WordNet, which is available on the Princeton University site [4].

NLP Libraries
Most NLP applications require preprocessing of the text. The libraries that are most useful for text processing in Python are the Indic NLP Library and the Natural Language Toolkit for Indic Languages (iNLTK). Both libraries support many Indian languages, including Marathi.
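To show what such preprocessing involves, the following is a minimal sketch of word tokenization for Devanagari text using only the Python standard library. It does not use the API of either library. The function name simple_tokenize and the chosen Unicode ranges are assumptions made for illustration, and a real application would more likely rely on the tokenizers those libraries provide.

```python
import re

# Assumption: Devanagari letters, vowel signs and digits lie in
# U+0900-U+0963 and U+0966-U+097F. The danda (U+0964) and double danda
# (U+0965) are left out so that they become separate punctuation tokens.
# Explicit ranges are used because \w in Python's re module does not
# match Devanagari vowel signs, which would split words apart.
TOKEN_PATTERN = re.compile(
    r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+"  # Devanagari word, with joiners
    r"|[A-Za-z0-9]+"                              # Latin-script word or number
    r"|\S"                                        # any other single symbol
)

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("मी शाळेत जातो."))
```

The sketch ignores normalization of visually identical character sequences, which the dedicated libraries handle and which matters when matching tokens against a WordNet.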

Indic NLP Library
The Indic NLP Library aims to provide Python-based libraries for common text processing and natural language processing in Indian languages.

Natural Language Toolkit for Indic Languages (iNLTK)
The iNLTK library is the Indic-language counterpart of the NLTK Python package. It provides the features that an NLP application developer requires: tokenization, generation of similar sentences from a given text input, language identification, text completion, word embeddings and text generation, in 13 Indic languages including Marathi. iNLTK is open source [6-10].

IndicCorp
IndicCorp is one of the largest publicly available corpora for Indian languages. It covers thirteen Indian languages, Marathi among them.
IndicCorp was built from thousands of web sources, primarily news, magazines and books.

Conclusion
Word Sense Disambiguation (WSD) is one of the most challenging problems in natural language processing. This paper explored the methods, algorithms and techniques of Word Sense Disambiguation. We elaborated on resources that will be helpful for work on Marathi Word Sense Disambiguation,
and we provided information about Indian-language libraries and tools.