This thesis addresses the problem of extracting meaning from texts and text streams, produced in our case during collaborative processes. More specifically, we are interested in work-related emails and collaborative textual documents, with a first application to educational documents. The motivation is to help users access useful information more quickly; we therefore seek to locate this information in the texts. In particular, we are interested in the tasks mentioned in emails, and in the fragments of educational documents that relate to the users' topics of interest. Two corpora, one of emails and one of educational documents, mainly in French, have been created. This was essential because there is virtually no previous work on this type of data in French.
We use a generic model of the structure of these data to specify the formal processing of documents, a prerequisite for semantic processing. We demonstrate the difficulty of segmenting, normalizing and structuring documents in different source formats, and present the SegNorm tool, which segments and normalizes documents (in plain or tagged text) recursively, into units of configurable size. In the case of emails, it splits messages that contain quotations of other messages into individual messages, while keeping the information about how the intertwined fragments are chained. It also analyzes the message metadata to reconstruct discussion threads, and recovers from the quotations those messages for which the source file is not available.
We then discuss the semantic processing of these documents. We propose a model of the notion of task, then describe the annotation of a corpus of several hundred messages originating from the professional context of Viseo and GETALP. We then present a tool for locating tasks and extracting their attributes (temporal constraints, assignees, etc.). This tool, based on a combination of an expert approach and machine learning, is evaluated according to the classic criteria of precision, recall and F-measure, as well as according to the quality of use.
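To make the reconstruction of discussion threads from message metadata more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how SegNorm actually works: it assumes that the relevant metadata are the standard `Message-ID` and `In-Reply-To` headers, and the function name `build_threads` is hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Group raw email messages into threads (hypothetical helper).

    Assumption: replies point to their parent through the standard
    In-Reply-To header. Returns {root message id: [message ids]}.
    """
    parent = {}  # message id -> id of the message it replies to, or None
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        reply_to = (msg.get("In-Reply-To") or "").strip() or None
        if msg_id:
            parent[msg_id] = reply_to

    def root_of(msg_id):
        # Follow reply links upwards; `seen` guards against cyclic headers.
        seen = set()
        while parent.get(msg_id) is not None and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    threads = defaultdict(list)
    for msg_id in parent:
        threads[root_of(msg_id)].append(msg_id)
    return dict(threads)


if __name__ == "__main__":
    messages = [
        "Message-ID: <1@x>\nSubject: Plan\n\nHello",
        "Message-ID: <2@x>\nIn-Reply-To: <1@x>\n\nReply",
        "Message-ID: <4@x>\nIn-Reply-To: <0@x>\n\nReply to a missing message",
    ]
    print(build_threads(messages))
```

In this sketch, a thread root may be a message whose source file is absent (here `<0@x>`): it is known only because another message replies to it, which is the situation in which the missing message would have to be recovered from quotations.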
Finally, we present our work on the MACAU-Chamilo platform, which aims to support learning through (1) the structuring of educational documents according to two ontologies (form and content), and (2) multilingual access to content that is initially monolingual. This is therefore again a matter of structuring along two axes, form and meaning.
(1) The ontology of forms makes it possible to annotate document fragments with concepts such as theorem, proof, example, with levels of difficulty and abstraction, and with relations such as elaboration_of, illustration_of… The domain ontology models the formal objects of computer science, and more precisely the notions of computational complexity. This makes it possible to suggest to users fragments that are useful for understanding notions of computer science perceived as abstract or difficult.
(2) The multilingual access aspect was motivated by the observation that our universities welcome a large number of foreign students, who often have difficulty understanding our courses because of the language barrier. We proposed an approach to multilingualize educational content with the help of foreign students, through online post-editing of automatic pre-translations and, if necessary, incremental improvement of these post-edits. Our experiments have shown that multilingual versions of documents can be produced quickly and at no cost. This work resulted in a corpus of more than 500 standard pages (250 words/page) of educational content post-edited into Chinese.