DIGITAL INFRASTRUCTURES for RESEARCH 2018 | Serving the user base

Dynamic creation of data pipelines in clouds

Thursday, September 29, 2016 - 11:30


Processing large sets of scientific data is becoming increasingly important in everyday science. Such processing can be organized as a data pipeline, a well-known approach. However, setting up a data pipeline usually requires special expertise, particularly when the pipeline is to be deployed in a cloud, because it requires knowledge of the underlying cloud system. Flowbster is a new cloud-oriented workflow system built on top of Occopus, a generic cloud orchestrator and deployment tool. As a result, the underlying cloud system can be hidden from the data pipeline developer, who can concentrate on the data processing business logic. To further speed up development, Flowbster provides a graphical editor for the pipeline graph. With this technology, data engineers can quickly construct and try out the required data pipeline in most of the major cloud systems generally used by scientists (OpenStack, OpenNebula, Amazon EC2, CloudSigma). Moreover, the data pipeline can exploit three types of parallelism to provide the fast data processing that very large data sets require: pipeline parallelism, parallelism between parallel workflow branches, and node scalability parallelism.
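
The three kinds of parallelism named in the abstract can be illustrated with a small local sketch. This is not Flowbster or Occopus code and does not use their APIs: the node functions (`parse`, `square`, `double`, `merge`), the `run_pipeline` helper and the `instances` parameter are hypothetical names chosen for illustration, and local thread pools merely stand in for nodes that a real deployment would run in a cloud.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical node functions of a four-node workflow graph:
# parse -> (square, double) -> merge
def parse(text):
    return int(text)

def square(n):
    return n * n

def double(n):
    return 2 * n

def merge(a, b):
    return a + b

def run_pipeline(items, instances=2):
    """Push every item through the graph; one pool per node.

    `instances` is the number of workers per node and models node
    scalability (an assumption of this sketch, not a Flowbster setting).
    """
    with ThreadPoolExecutor(instances) as p_parse, \
         ThreadPoolExecutor(instances) as p_square, \
         ThreadPoolExecutor(instances) as p_double, \
         ThreadPoolExecutor(instances) as p_merge:
        results = []
        for item in items:
            # Pipeline parallelism: a new item enters the first node
            # while earlier items are still being processed downstream.
            parsed = p_parse.submit(parse, item)
            # Branch parallelism: both branches consume the same parsed
            # value and run independently of each other.
            sq = p_square.submit(lambda f=parsed: square(f.result()))
            db = p_double.submit(lambda f=parsed: double(f.result()))
            # The merge node waits until both branches have delivered.
            results.append(
                p_merge.submit(lambda a=sq, b=db: merge(a.result(), b.result()))
            )
        # Collect the outputs in input order.
        return [r.result() for r in results]

if __name__ == "__main__":
    print(run_pipeline(["1", "2", "3"]))
```

Each node has its own pool and only waits on nodes upstream of it, so the waiting tasks cannot deadlock. Raising `instances` lets several items be processed by the same node at once, which is the local analogue of scaling a node.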


Target Audience:

This presentation is relevant to science communities that would like to build an efficient data pipeline in the cloud without having to learn cloud systems in depth. The concrete output of this presentation is a contribution to ongoing research on exploiting available cloud systems more easily and more effectively for large data processing applications.


Benefits for Audience:

Participants learn about an easy-to-use and efficient technology for organizing data pipelines, with which they can exploit various types of cloud systems (OpenStack, OpenNebula, Amazon EC2, CloudSigma) without having to learn these cloud systems in depth. Flowbster hides the details of the underlying cloud systems so that users can concentrate on their business logic. The resulting data pipeline is flexible, auto-scalable, and easy to extend, modify and adapt to future needs. It can be deployed easily and dynamically in the various clouds.


Topic 4: Working with data


Presenter: Peter Kacsuk
Organisation: SZTAKI