Australian Data Lifecycle project: giving users a "data pump"
Thursday, September 29, 2016 - 11:30
The Australian eInfrastructure landscape is essentially a federated one; islands of capability (e.g., storage, compute, virtual labs, middleware; some per-campus, some central/governmental) exist and are loosely interlinked, often on an ad-hoc or per-project basis. This arrangement has worked well enough up to now, but we are nearing the tipping point where
a) the sheer size of data and the difficulty of moving it, and
b) growing requirements for Open Data provenance, traceability and persistence of metadata
mean that we need to provide a stronger-linked data lifecycle system. A number of the larger eScience capabilities (AAF, ANDS, RDS, NeCTAR and AARNet) have therefore come together in Q2 2016 to determine what part of the future solution we can begin building with existing building blocks.
Indeed, we are lucky to have Ian Duncan, director of the RDS capability, present at this conference; Ian will talk about the rationale and policy implementation of this very system immediately after our presentation.
In this presentation, we will focus on the data movement core of this system: the "data pump", capable of ingesting as well as egressing data through regular file transfers (FTP, Globus, Aspera, etc.) and through sync&share ("Dropbox-like") interfaces.
This sync&share capability is seen as absolutely vital; through earlier, smaller pilots using ownCloud as a frontend for "generic storage", we found out just how large the difference in end-user uptake is between cloud storage presented to researchers through "legacy tooling" and the same storage presented through a sync&share frontend -- orders of magnitude, without exaggeration.
The ability to ingest and egress data through M2M protocols is needed so we can take data directly from instruments, and also push data through intermediary compute and workflow platforms. We intend to allow users to define data routing triggers (e.g., completion of a file transfer, time of day, metadata values).
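As a sketch of how such routing triggers might be evaluated -- the rule format and field names below are purely hypothetical, not the planned system's actual API -- the three example conditions could be checked along these lines:

```python
from datetime import time

# Hypothetical sketch of data-routing trigger evaluation. The rule kinds
# and event fields are illustrative only, not CloudStor's real interface.

def trigger_fires(rule, event):
    """Return True when an ingest event satisfies a routing rule."""
    if rule["kind"] == "transfer_complete":
        # Fire when a file transfer for the watched dataset has finished
        return event.get("type") == "transfer_complete"
    if rule["kind"] == "time_of_day":
        # Fire only inside a configured window (e.g. overnight bulk moves)
        return rule["start"] <= event["now"] <= rule["end"]
    if rule["kind"] == "metadata":
        # Fire when a metadata field carries the expected value
        return event.get("metadata", {}).get(rule["field"]) == rule["value"]
    return False

# Example: route a dataset once its instrument marks it "final"
rule = {"kind": "metadata", "field": "status", "value": "final"}
event = {"type": "upload", "metadata": {"status": "final"}}
print(trigger_fires(rule, event))  # True
```

In practice such rules would be stored per user and evaluated by the data pump on every ingest event; the sketch only shows the matching logic.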
Ideally, a user will be able to rely on the planned system to automatically sync new data into the data pump, wait for a set condition to trigger, then move the new dataset to a workflow engine; wait for the engine to signal it is done, let the system retrieve the data, and sync it back to the user as well as issue a share invite to the user's collaborators, domestic or overseas. Also possible would be the automatic move of a finalised dataset, including attendant metadata, into the correct institutional repository, ready for citation as open data with correct provenance records and signalling to the funding body.
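The steps of that lifecycle can be sketched as a simple orchestration loop. Everything here is hypothetical -- the WorkflowEngine interface and log messages are stand-ins, not components of the real system:

```python
# Illustrative sketch of the automated lifecycle described above.
# The engine interface and step names are assumptions, not a real API.

class WorkflowEngine:
    """Stand-in for a remote compute/workflow platform."""
    def submit(self, dataset):
        self.dataset = dataset
    def is_done(self):
        return True  # a real engine would be polled or send a callback
    def results(self):
        return f"{self.dataset}-processed"

def run_lifecycle(dataset, engine, collaborators):
    log = [f"synced {dataset} into data pump"]
    engine.submit(dataset)               # push to the workflow engine
    log.append(f"submitted {dataset}")
    while not engine.is_done():          # wait for the completion signal
        pass
    result = engine.results()            # retrieve the processed data
    log.append(f"retrieved {result}")
    log.append(f"synced {result} back to user")
    for c in collaborators:              # domestic or overseas
        log.append(f"share invite sent to {c}")
    return log

print(run_lifecycle("scan-001", WorkflowEngine(), ["alice@uni.example"]))
```

The point of the sketch is the ordering: ingest, trigger, compute, retrieve, re-sync, share -- each step hand-off automated rather than manual.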
As far as implementation goes, the component we intend to use as the "central data pump" is already in operation at AARNet as a service offering called "CloudStor": a sync&share platform based on ownCloud and used by ~25,000 users. Fortuitously, a good proportion of the existing Australian capabilities in both storage and workflow compute are based on OpenStack, so for interlinking pilots we have so far focused on Swift as the M2M data movement protocol; lessons learned here can easily be generalised to S3 as the M2M storage interface.
Given we are targeting substantial sizes in both data (terabytes and above) and geography (continental), the solution must scale and retain performance over high latencies (~100 ms). End-user TCP stacks, alas, still prefer low latency, so our system must
a) put sync&share ("ownCloud") servers proximal to user concentrations, without being obliged to put significant amounts of storage hardware there; this is central to the already operational CloudStor system, and has indeed already been built; and
b) assume the science capabilities ("Swift nodes") are a sizable distance away from the users' nearest sync&share ("ownCloud") node. This permits science/discipline-specific capabilities to remain single-datacentre (typically "a campus"), which aligns with their established mode of operation.
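To make the latency constraint concrete, a standard bandwidth-delay product calculation shows why default end-user TCP stacks struggle at continental distances; the 10 Gbps link speed below is an assumed figure for illustration:

```python
# Bandwidth-delay product (BDP): the bytes that must be "in flight" to
# keep a link full. Large BDPs defeat default TCP window sizes.
link_gbps = 10    # assumed backbone speed (illustrative)
rtt_s = 0.100     # ~100 ms round trip, per the continental latencies above

bdp_bytes = link_gbps * 1e9 / 8 * rtt_s   # bits/s -> bytes/s, times RTT
print(f"BDP: {bdp_bytes / 1e6:.0f} MB of in-flight data needed")  # 125 MB
```

A window of that size is far beyond typical out-of-the-box TCP defaults, which is why server placement near user concentrations matters in point a).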
This talk will give a quick overview of the CloudStor system and the componentry we've used (and rejected!) to make it scale; the modifications envisioned to implement user triggers, metadata ingestion and retention in parallel to regular file sync; and how we plan to implement and automate interaction with other eScience/eInfra capabilities.
As said in the introduction above, the prime mover of the project, Ian Duncan of RDS, is presenting right after this talk (albeit in another stream) and will highlight higher-layer aspects of the proposed system: background drivers, science outcomes, government open data targets and policy harmonisation.
Target Audience:
users, data managers, librarians, portal operators, repository operators, policy makers
Benefits for Audience:
eInfrastructure, if done well, can scale like a true cloud service
sync&share is a truly excellent paradigm to engage users with data platforms
good incentives can make capability silos disappear
open source cloud technology has really grown up, provided you pick the right blocks. If you pick the wrong blocks, or even just mediocre ones, your solution will slowly grind to a halt right when it becomes successful (~5k users), and boy is it painful to have to retract a service just as users are starting their word-of-mouth promotion!
Topic 2: Services enabling research
Topic 4: Working with data