Project Description:
The vision of the Semantic Web, as proposed by the World Wide Web Consortium (W3C), is to "create a universal medium for the exchange of data". To realize this vision, large amounts of structured data are being published, forming a web of interlinked structured data. The W3C community project Linking Open Data plays a leading role in this effort by bringing massive amounts of structured data to the Web and linking them. Examples of the structured datasets published and interlinked by the project include Wikipedia, Wikibooks, YAGO, the DBLP bibliography, WordNet, GeoNames, MusicBrainz, Freebase and many more. A diagram presenting all the datasets brought in and linked by the project is shown in Fig. 1. Governments are also following the trend of publishing structured data on the Web: both the US and the UK governments are not only publishing their data but also encouraging people to reuse and benefit from it. Most of these datasets are published in RDF (Resource Description Framework, a W3C recommendation). By November 2009, the published datasets had grown to over 13.1 billion RDF triples, interlinked by around 142 million RDF links.
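To make the notions of an RDF triple and an RDF link concrete, the following is a minimal sketch in Python. It assumes two made-up datasets under example.org namespaces; the helpers `dataset_of` and `is_rdf_link` are hypothetical names introduced only for this illustration, and treating "a triple whose subject and object belong to different datasets" as an RDF link is a simplification.

```python
# An RDF triple is a (subject, predicate, object) statement.
# The two namespaces below are made up for illustration.
DATASET_A = "http://example.org/datasetA/"
DATASET_B = "http://example.org/datasetB/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

triples = [
    # A triple whose object is a literal value.
    (DATASET_A + "Nicosia", RDFS_LABEL, "Nicosia"),
    # A triple that connects a resource in dataset A to one in dataset B.
    (DATASET_A + "Nicosia", OWL_SAME_AS, DATASET_B + "city/42"),
]


def dataset_of(term):
    """Return the namespace a URI belongs to, or None for literals (hypothetical helper)."""
    for namespace in (DATASET_A, DATASET_B):
        if term.startswith(namespace):
            return namespace
    return None


def is_rdf_link(triple):
    """True when subject and object are resources from different datasets."""
    subject, _, obj = triple
    source, target = dataset_of(subject), dataset_of(obj)
    return source is not None and target is not None and source != target


rdf_links = [t for t in triples if is_rdf_link(t)]
print(len(triples), "triples,", len(rdf_links), "of which are RDF links")
```

Under these assumptions, only the `owl:sameAs` statement counts as a link between datasets; the label statement is an ordinary triple inside dataset A.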
Another direction for populating the Semantic Web is emerging, supported by companies such as Facebook, Yahoo! and Google. This direction is the use of RDFa (Resource Description Framework in attributes, a W3C recommendation), which adds a set of attributes to XHTML for embedding RDF triples in web pages. RDFa plays an important role in bridging the gap between the web of files (Web 2.0) and the web of data (Web 3.0): with the support of these large companies, HTML authors are starting to embed RDF triples in their XHTML pages, thereby contributing to the growth of the Semantic Web.
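As an illustration of how triples can be carried in XHTML attributes, here is a deliberately simplified sketch using Python's standard `html.parser` module. It handles only the `about`, `property` and `content` attributes with literal objects; it does not resolve prefixes and ignores `rel`, `rev`, `typeof` and the other RDFa processing rules, so it should not be read as a conforming RDFa processor. The class name `SimpleRDFaExtractor` and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class SimpleRDFaExtractor(HTMLParser):
    """Collect (subject, predicate, object) triples from a small subset of RDFa."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subjects = []    # subject in effect for each open element
        self._pending = None   # (subject, property, depth) waiting for element text
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        inherited = self._subjects[-1] if self._subjects else ""
        subject = attributes.get("about", inherited)
        self._subjects.append(subject)
        if "property" in attributes:
            if "content" in attributes:
                # The object is given explicitly in the `content` attribute.
                self.triples.append((subject, attributes["property"], attributes["content"]))
            else:
                # The object is the text of this element, known at its end tag.
                self._pending = (subject, attributes["property"], len(self._subjects))
                self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and len(self._subjects) == self._pending[2]:
            subject, prop, _ = self._pending
            self.triples.append((subject, prop, "".join(self._text).strip()))
            self._pending = None
        if self._subjects:
            self._subjects.pop()


# Hypothetical XHTML fragment annotated with RDFa attributes.
page = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="http://example.org/report">
  <h1 property="dc:title">An Example Page</h1>
  <span property="dc:creator">Jane Doe</span>
  <meta property="dc:language" content="en" />
</div>
"""

extractor = SimpleRDFaExtractor()
extractor.feed(page)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

The page remains ordinary XHTML for a browser, while a program reading the attributes recovers three statements about `http://example.org/report`.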
These trends of publishing structured data on the Web are shifting the focus of Web technologies towards new paradigms of structured-data retrieval. Traditional search engines serve such data poorly: the results of a keyword-based query are not precise or clean, because the query itself remains ambiguous even though the underlying data is structured. For this reason, SPARQL (SPARQL Protocol and RDF Query Language) was proposed as a standardized language for querying collections of RDF data, analogous to SQL for querying databases. However, SPARQL is oriented towards technical users and is of little use to people with limited IT skills. To exploit the massive amount of structured data on the Web to its full potential, people should be able to query this data easily and effectively; formulating queries should be fast and should not require programming skills. Thus, to allow non-technical people to query RDF data, MashQL was introduced as an intuitive language for querying the Data Web.
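To show why a structured query is more precise than a keyword search, the sketch below imitates the core of a SPARQL query, the matching of triple patterns with shared variables, in plain Python. It is an illustration under stated assumptions, not a SPARQL engine: the data, the `ex:` names and the functions `match_pattern` and `select` are all hypothetical, and only conjunctions of triple patterns are handled.

```python
# The SPARQL query being imitated (shown for comparison, not executed here):
#   PREFIX ex: <http://example.org/>
#   SELECT ?title WHERE { ?book ex:author ex:alice . ?book ex:title ?title . }

# Made-up data for illustration.
triples = [
    ("ex:book1", "ex:author", "ex:alice"),
    ("ex:book1", "ex:title", "Linked Data Basics"),
    ("ex:book2", "ex:author", "ex:bob"),
    ("ex:book2", "ex:title", "Query Languages"),
]


def is_variable(term):
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for pattern_term, value in zip(pattern, triple):
        if is_variable(pattern_term):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(pattern_term, value) != value:
                return None
        elif pattern_term != value:
            return None
    return extended


def select(patterns, data):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in data:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


results = select(
    [("?book", "ex:author", "ex:alice"), ("?book", "ex:title", "?title")],
    triples,
)
titles = [binding["?title"] for binding in results]
print(titles)
```

Because `?book` must take the same value in both patterns, only titles of books written by `ex:alice` are returned; a keyword search for "alice" has no such way to state which relationship is meant. Writing the patterns still requires knowing the vocabulary and the syntax, which is the barrier for non-technical users noted above.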
The MashQL project was started at the University of Cyprus and is being continued at Birzeit University. The query formulation language had already been developed within the project; we elaborate on it in Chapter 3 (MashQL and the system model). A Firefox add-on editor that implements some aspects of MashQL had also been developed. However, the editor only supported queries over small RDF files.
Our contribution to the MashQL project is twofold: (i) bringing RDFa support to MashQL, and (ii) developing the Graph-Signature Indexing query optimization solution and using it in MashQL to support queries over large RDF datasets. Graph-Signature Indexing can also be viewed as a contribution independent of MashQL, in that it provides a significant enhancement to Oracle’s solution for querying large RDF datasets, known as Oracle’s Semantic Technology.