Managing Linked Data at Scale: Querying, Integrating, and Sharing

This project aims to develop an efficient system and software tools that make it easy for users to share datasets with each other and to integrate data available at geo-distributed engines. The RDF data model allows interlinking entities from different Web sources, where each dataset is independently maintained and accessed via a SPARQL endpoint. As a result, various applications have emerged in life sciences, government open data, decentralized social networks, and the Internet of Things that need to query RDF graphs across geo-distributed, independent endpoints. State-of-the-art federated RDF systems usually support a small number of data sources by utilizing schema information. They often cause unnecessary data retrieval and communication, leading to poor scalability and response time; consequently, these systems cannot meet the need of these emerging applications to access tens to hundreds of geo-distributed RDF datasets.

This project mainly addresses these limitations and presents Lusail, a scalable and efficient federated RDF engine for querying large-scale graphs across geo-distributed endpoints. Lusail achieves scalability and low query response time through optimizations at both compile time and run time. At compile time, we use a novel locality-aware query decomposition [1] that maximizes the number of query triple patterns sent together to a source, based on the actual location of the instances satisfying these triple patterns. At run time, we use selectivity-aware and parallel query execution techniques to reduce network latency and to increase parallelism by delaying the execution of subqueries expected to return large results. We evaluated Lusail using real data and synthetic benchmarks, with sizes up to billions of triples, on an in-house cluster and a geo-distributed public cloud. We show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time. Lusail was demonstrated at SIGMOD 2017 [2]. A full technical report about Lusail is available at [3].
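
To make the two ideas above concrete, the following is a minimal, simplified sketch and not Lusail's actual algorithm (see [1] and [3] for that). It assumes we already know which endpoints hold matching instances for each triple pattern and have a row estimate for each subquery. The names decompose_by_source, schedule, sources_for, estimated_rows, and delay_threshold are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

# A triple pattern as (subject, predicate, object); variables start with "?".
TriplePattern = Tuple[str, str, str]


def decompose_by_source(
    patterns: List[TriplePattern],
    sources_for: Dict[TriplePattern, Set[str]],
) -> Dict[FrozenSet[str], List[TriplePattern]]:
    """Group triple patterns that are answerable by the same set of endpoints.

    Simplifying assumption: patterns sharing an identical source set can be
    shipped together as one subquery. The real decomposition reasons about
    where the instances of join variables actually reside.
    """
    groups: Dict[FrozenSet[str], List[TriplePattern]] = defaultdict(list)
    for pattern in patterns:
        groups[frozenset(sources_for[pattern])].append(pattern)
    return dict(groups)


def schedule(
    estimated_rows: Dict[str, int],
    delay_threshold: int,
) -> Tuple[List[str], List[str]]:
    """Split subqueries into those run immediately and those delayed.

    Subqueries expected to return more than delay_threshold rows are delayed,
    so they can later be restricted by bindings from the selective ones.
    Both lists are ordered by ascending estimated result size.
    """
    ordered = sorted(estimated_rows, key=lambda name: estimated_rows[name])
    eager = [q for q in ordered if estimated_rows[q] <= delay_threshold]
    delayed = [q for q in ordered if estimated_rows[q] > delay_threshold]
    return eager, delayed


if __name__ == "__main__":
    p_treats = ("?drug", "ex:treats", "?disease")
    p_name = ("?drug", "ex:name", "?label")
    p_gene = ("?disease", "ex:gene", "?gene")

    # Hypothetical endpoint names and source assignments.
    sources = {p_treats: {"endpointA"}, p_name: {"endpointA"}, p_gene: {"endpointB"}}
    for source_set, group in decompose_by_source([p_treats, p_name, p_gene], sources).items():
        print(sorted(source_set), group)

    # Hypothetical cardinality estimates for the two resulting subqueries.
    print(schedule({"q_drug": 120, "q_gene": 50_000}, delay_threshold=1_000))
```

In this sketch, the two patterns about ?drug are sent to the same endpoint as a single subquery, and the subquery with the large estimate is deferred until the selective one has produced bindings.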

We also addressed the data sharing problem in a short collaboration between QCRI and MIT, led by Prof. Ashraf Aboulnaga of QCRI and Prof. Tim Berners-Lee of MIT. We developed a decentralized platform for the social Web called Solid (for Social Linked Data). Solid is based on RDF and Semantic Web technologies, and it achieves the goals of providing data independence and simple yet powerful data management mechanisms. I developed Meccano, an RDF-based system that implements the Solid protocols. More details about Meccano and the Solid protocols are available in [4]. The MIT team developed Solid.js, a JavaScript library implementing the Solid protocols. The Solid protocols guarantee efficient performance for social applications regardless of whether these applications use Solid.js. However, Solid.js aims to accelerate the development life cycle of Solid applications by letting developers write less code. We demonstrated the Solid platform and Meccano at WWW 2016 [5].