Jury :
The topic of my PhD is the compilation of web data query languages. More particularly, the analysis and the distributed evaluation of a such language: sparql. My main contributions concern the evaluation of web data queries especially for recursive queries or for distributed settings.
In this thesis, I introduce μ-algebra: it is a kind of relational algebra equipped with a fixpoint operator. I present its syntax, semantics, and a translation from sparql with Property Paths (a new feature of sparql allowing some form of recursion) to this μ-algebra.
I then present a type system and show how μ-algebra terms can be rewritten to terms with equivalent semantics using either classical rewrite rules of the relational world or new rules that are specific to this μ-algebra. We demonstrate the correctness of these new rules that are introduced to handle the rewriting of fixpoints: they allow to push filters, joins and projections inside fixpoints or to combine several fixpoints (when some condition holds).
I demonstrate how these terms could be evaluated both from a general perspective and in the specific case of a distributed evaluation. I devise a cost model for μ-algebra terms inspired by this evaluation. With this cost model and this evaluator, several terms that are semantically equivalent can be seen as various Query Execution Plans (QEP) for a given query. I show that the μ-algebra and its rewrite rules allow the reach of QEP that are more efficient than all QEP considered in other existing approaches and confirm this by an experimental comparison of several query evaluators on sparql queries with recursion.
I investigate the use of an efficient distributed framework (Spark) to build a fast sparql dis- tributed query evaluator. It is based on a fragment of μ-algebra, limited to operators that have a translation into fast Spark code. The result of this has been implemented in SparqlGX, a state of the art distributed sparql query evaluator.
Finally, my last contribution concerns the estimation of the cardinality of solutions to a μ-algebra term. Such estimators are key in the optimization. Indeed, most cost models for QEP rely on such estimators and are therefore necessary to determine the most efficient QEP. I specifically consider the conjunctive query fragment of μ-algebra (which corresponds to the well-known Basic Graph Pattern fragment of sparql). I propose a new cardinality estimation based on statistics about the data and implemented the method into SparqlGX. Experiments show that this method improves the performance of SparqlGX.