Venue:
Institut d'administration des entreprises de Grenoble (IAE)
525 Avenue Centrale, 38400 Saint-Martin-d'Hères
Room 120
Jury:
This thesis concerns a study of Word Sense Disambiguation (WSD), a central task in natural language processing that can improve applications such as machine translation and information extraction. Research on WSD predominantly concerns English, because most other languages lack a standard lexical reference for annotating corpora, and also lack sense-annotated corpora for evaluating and, more importantly, for building WSD systems. In English, the lexical database WordNet is a long-standing de facto standard used in most sense-annotated corpora and in most WSD evaluation campaigns.
The contributions of this thesis cover several areas:
First, we present a method for automatically creating sense-annotated corpora for any language, which takes advantage of the large amount of English corpora sense-annotated with WordNet and relies on a machine translation system. This method is applied to Arabic and is evaluated on what is, to our knowledge, the only Arabic corpus manually sense-annotated with WordNet: the Arabic OntoNotes 5.0, which we have semi-automatically enriched.
The evaluation relies on an implementation of two supervised WSD systems trained on the corpora produced with our method. We thus propose a solid baseline for evaluating future Arabic WSD systems, in addition to the sense-annotated Arabic corpora, which we provide as a freely available resource.
Second, we propose an in vivo evaluation of our Arabic WSD system by measuring its contribution to the performance of a machine translation task.