The EGI DataHub and the Open Data Platform
Thursday, September 29, 2016 - 14:00
Until recently, Open Access to research has mostly been understood as free access to scientific publications such as books, journal papers and conference proceedings. However, as more research relies on access to high-quality large data sets, including data collected from physical experiments as well as data obtained through pure simulations, it is becoming increasingly apparent that open access should also extend to the data sets referenced in scientific publications.
Several efforts, such as DataCite and OpenAIRE, already address this challenge by providing means for indexing and cataloguing data sets. These services rely on established standards such as OAI-PMH, which enable them to integrate with existing platforms for publication metadata harvesting, and they identify data sets through globally unique handles such as DOI or PID.
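As an illustration of what OAI-PMH harvesting involves, the following minimal Python sketch builds a ListRecords request and extracts record identifiers from a response. The verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core standards; the endpoint URL, the sample record and its identifiers are placeholders invented for this example and do not refer to any real repository or data set.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core standards.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative response only: the record and its identifiers are placeholders.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example data set</dc:title>
          <dc:identifier>doi:10.1234/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would request to list all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_identifiers(response_xml):
    """Return (OAI identifier, [Dublin Core identifiers]) for each record."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        oai_id = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        dc_ids = [el.text for el in record.iter(f"{{{DC_NS}}}identifier")]
        records.append((oai_id, dc_ids))
    return records


if __name__ == "__main__":
    # "https://example.org/oai_pmh" is a hypothetical endpoint.
    print(list_records_url("https://example.org/oai_pmh"))
    print(extract_identifiers(SAMPLE_RESPONSE))
```

A real harvester would additionally follow resumption tokens and handle error responses, both of which are omitted here.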
However, while these services enable discovery and identification of open data sets, they do not directly address how end users access the underlying data. For scientific publications, the typical scenario was simply to download the text document or view it in the browser. This method is not feasible for the large data sets required in many fields, such as high energy physics or chemistry.
Moreover, publishing a data set often involves publishing a URL where the data set is available, along with the DOI or PID used to resolve it. However, this approach does not guarantee that the downloaded data set has not been modified or updated since its publication, which can make it difficult to reproduce results from research based on the original data set.
Furthermore, users are accustomed to accessing and managing their personal data through cloud-based services such as Dropbox or Google Drive, whereas to access and process Open Data they still have to use legacy protocols such as FTP and share the data by exchanging URLs or email attachments.
Within the scope of the EGI-Engage project, we are developing an Open Data Platform prototype, which aims to provide a novel solution for open data management. It gives researchers an experience and ease of use similar to commercial data management and file synchronization solutions, while providing means for seamless publication of and access to open data from any location, whether a personal laptop or a virtual machine running in the cloud. Within EGI-Engage we have used questionnaires to analyse the requirements of several user communities, with a focus on open data management, and compiled the results into a single report.
Among the most important requirements we identified were the ability to access large data sets without pre-staging, advanced metadata support, easy data sharing, and the provision of unique handles to data sets.
The Open Data Platform (ODP) will be based on the Onedata distributed data management solution, which provides the backend for efficient data access and sharing on a global scale. The typical scenario of using ODP can be summarized as follows. A user prepares a data set inside a Onedata space (a virtual volume). This data set can be of arbitrary size and internal structure in terms of subdirectories and files. The user shares this space with another user, a data curator, who is responsible for ensuring that the data set has appropriate metadata. The sharing functionality is provided internally by Onedata and requires only exchanging a single token, which the owner of the space generates through the user interface. Once the data curator has ensured that all relevant metadata is added, the owner creates a snapshot of the entire data set, which is calculated as a hash of the entire contents of the data set. This ensures that even when the data set is later updated or extended with new files, the handle reference to the data set points to the exact version that was used when publishing it. Once the data set is published, the Open Data Platform OAI-PMH Data Provider service will expose it to OAI-PMH Service Providers such as OpenAIRE on the next scheduled metadata harvest, and the data set will be available and discoverable online.
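The idea of identifying a snapshot by a hash of the data set contents can be sketched in a few lines of Python. This is only an illustration of the general technique under stated assumptions, not the algorithm used by Onedata or the Open Data Platform: the function name dataset_digest, the choice of SHA-256 and the way paths and file sizes are mixed into the hash are all choices made for this example.

```python
import hashlib
from pathlib import Path


def dataset_digest(root):
    """Return a hex digest covering every file path and file content under root.

    Hypothetical helper for illustration; it is not part of Onedata.
    """
    root = Path(root)
    digest = hashlib.sha256()
    # Sort by relative path so the result does not depend on listing order.
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for rel_path, path in files:
        name = rel_path.encode("utf-8")
        # Length prefixes keep path and content boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(path.stat().st_size.to_bytes(8, "big"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Computing the digest again after any file is added, removed, renamed or modified yields a different value, so a digest recorded at publication time can be used to check that a copy of the data set still matches the published version.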
When a user is interested in accessing the data set, ODP will provide several options. First of all, simply resolving the DOI in the browser will present the user with a preview webpage showing a summary of the contents of the data set, as is typical for most open access systems. For small data sets it might be feasible to download the contents directly from the web browser, but a much more interesting case is when the data sets are too large to download, or when the user wants to access them directly on computing nodes in the cloud or in an HPC data center. In such a case, assuming the user is registered in the Open Data Platform, the user simply selects the option on the preview page to add the data set to their spaces, and the data set can be accessed instantly on any machine where the Onedata client is available, either a laptop or a computing node, through the remote filesystem functionality of Onedata. This can be extended even further to link data sets with virtual machine or Docker images for automatic verification and reproducibility of results from open access publications.
Whereas the Open Data Platform comprises technologies, the DataHub will be the end-user service exposing these technologies, serving as the central point of access for the Open Data Platform. The DataHub makes existing large-scale open data collections easily discoverable and available both to EGI users and, for open data collections where a login is not required, to the general public.
Development of the Open Data Platform is progressing well, as are plans for the DataHub. A prototype of the Open Data Platform is due to be released in November 2016; however, we plan to make the prototype available to testers and early adopters before that.
5min Introduction (Matthew Viljoen)
25min Introduction and Demonstration on the Open Data Platform prototype (Lukasz Dutka)
15min Open Data from a Water Reservoir Platform (Fernando Aguilar Gómez)
15min iMarine Usecase (Pasquale Pagano)
15min EO Data Usecase (Emmanuel Mondon)
Any potential user of EGI data services and those looking to bring computing and data services together.
Benefits for Audience:
The aim of this session is to introduce the DataHub and the Open Data Platform and to provide a live demonstration of the prototype. It will also serve to invite expressions of interest from early adopters and other communities who believe that these solutions may help their use cases.
Topic 4: Working with data