Jupyter notebook as a Service: enabling easy access to parallel computing
Thursday, September 29, 2016 - 11:30
In an age where the need for data processing is increasing dramatically, academic users often struggle with the transition from small prototype software, possibly running on a laptop, to deployment on large clusters. Creating scalable software that copes well with a growing data load is often still an afterthought. Academic projects that need to analyse large data sets often struggle to get a development environment up and running once they realize that the current setup does not scale with present or future needs. Ideally, this environment should facilitate sharing of code, be portable to other systems and scale effortlessly to very large data sets.
In recent years the use of notebooks has become increasingly popular. Notebooks allow users to write and execute code in an easy-to-use web interface made up of cells, which, besides code, may contain marked-up text and visualisations. For example, the Jupyter project supports over 40 programming languages, provides widgets for visualisations and makes sharing code over GitHub or even email very easy.
Recently Apache Spark integration was added to the Jupyter stack, which provides the Spark API to Scala and IPython notebooks. With the Spark framework, notebooks can run their calculations in parallel on various cluster architectures or multithreaded on a single host. Note that Jupyter also provides its own parallel computing extension, ipyparallel.
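As an illustration of running the same notebook code multithreaded on a single host or on a cluster, here is a minimal PySpark sketch. The helper names `choose_master` and `sum_of_squares` are hypothetical and chosen for this example only; the sketch assumes PySpark 2.x or later is installed in the notebook kernel, and only the master URL changes between the two settings.

```python
def choose_master(cluster_url=None, threads=None):
    """Return a Spark master URL (hypothetical helper for illustration).

    With no arguments Spark runs locally on all cores ("local[*]");
    with `threads` it runs locally on that many threads; with
    `cluster_url` it targets a cluster manager instead.
    """
    if cluster_url:
        return cluster_url
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return "local[%d]" % threads


def sum_of_squares(master, n):
    """Sum the squares of 0..n-1 in parallel on the given master."""
    # Imported lazily so choose_master() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master(master)
             .appName("notebook-sketch")
             .getOrCreate())
    try:
        # The data is split into partitions that are processed in parallel.
        return spark.sparkContext.parallelize(range(n)).map(lambda x: x * x).sum()
    finally:
        spark.stop()
```

In a notebook cell, `sum_of_squares(choose_master(threads=2), 1000)` would run on two local threads, while passing a cluster URL to `choose_master` would send the same computation to a cluster without changing the function.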
In the Netherlands, SURF and the Netherlands eScience Center collaborate on technologies that enable an "Integrated Federative e-Infrastructure". One of our objectives is to provide users with a dedicated notebook environment that scales dynamically with computational load and storage needs. As a first proof of concept (PoC) we implemented a service that offers users access to a dynamically deployed and scalable Spark cluster. For this we have used Apache Mesos.
Mesos is an orchestration platform for managing cluster resources. It uses container technology such as Linux Containers (LXC) and Docker to deploy software frameworks, and reschedules them in case of failures. In addition, it provides real-time APIs for interacting with the cluster. Mesos’ functionality can be seen as a cross between IaaS and PaaS (Greenberg, Building Applications for Mesos).
Containerized Jupyter environments are available (or can easily be extended) via the docker-stacks project. Spark integrates very easily with both the docker-stacks Jupyter environment and Mesos. Both Mesos and Spark provide functionality for dynamically scaling the running applications and frameworks. We use the scheduling capabilities of Mesos to create per-user clusters; combined with the dynamic resource allocation features of Spark, this will allow efficient use of the physical hardware.
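The combination of Mesos scheduling and Spark dynamic resource allocation can be sketched as a set of Spark properties. The property names below are standard Spark configuration keys; the helper names, the master URL and the Docker image used in the usage note are assumptions made for illustration and do not describe the actual PoC configuration. Note that dynamic allocation on Mesos also expects Spark's external shuffle service to be running on the agents.

```python
def mesos_spark_settings(master, docker_image, max_executors, min_executors=0):
    """Build Spark properties for a dynamically scaling cluster on Mesos.

    Hypothetical helper: all argument values are supplied by the caller.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.master": master,
        # Dynamic allocation on Mesos requires coarse-grained mode.
        "spark.mesos.coarse": "true",
        # Executors are started from this Docker image by Mesos.
        "spark.mesos.executor.docker.image": docker_image,
        # Let Spark grow and shrink the number of executors with load.
        "spark.dynamicAllocation.enabled": "true",
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_spark_defaults(settings):
    """Render the settings as lines for a spark-defaults.conf file."""
    return "\n".join("%s %s" % (key, settings[key]) for key in sorted(settings))
```

For example, `to_spark_defaults(mesos_spark_settings("mesos://mesos-master.example.org:5050", "example/spark-executor", 8))` yields a configuration in which a user's cluster may grow to at most eight executors and shrink back to none when idle.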
In short, we propose a system in which a private environment is created for each user from a collection of predefined Docker containers. In this environment the user can access the Jupyter notebook and run Spark applications. Multiple concurrent Spark and Jupyter clusters should be able to run on the same physical hardware via Mesos.
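One way to picture such a per-user environment is as a container description handed to a scheduler running on Mesos. The sketch below assumes Marathon as that scheduler, which is an assumption of this example and is not named as part of the PoC. The function name, the `/jupyter/<user>` identifier scheme and the resource defaults are likewise hypothetical; `jupyter/all-spark-notebook` is one of the docker-stacks images.

```python
import json


def jupyter_app_for(user, cpus=1.0, mem_mb=2048, image="jupyter/all-spark-notebook"):
    """Return a Marathon-style app definition for one user's notebook.

    Hypothetical helper: the id scheme and resource defaults are
    illustrative assumptions.
    """
    if not user.isalnum():
        raise ValueError("user name must be alphanumeric")
    return {
        "id": "/jupyter/%s" % user.lower(),
        "cpus": cpus,
        "mem": mem_mb,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # Jupyter listens on 8888 inside the container; host port 0
                # asks the scheduler to pick a free port, so several users'
                # notebooks can share the same physical machine.
                "portMappings": [
                    {"containerPort": 8888, "hostPort": 0, "protocol": "tcp"}
                ],
            },
        },
    }


def to_json(app):
    """Serialize the app definition for submission to the scheduler."""
    return json.dumps(app, indent=2, sort_keys=True)
```

Calling `jupyter_app_for("alice")` and `jupyter_app_for("bob")` gives two independent definitions that differ only in their identifier, which reflects the idea of several private environments sharing one cluster.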
With this PoC we aim to develop a user-friendly and powerful environment for parallel computing. We will present details of the PoC, together with our vision and plans for integrating it into an ecosystem of dynamic e-Infrastructure services such as (geographically distributed) storage and networking, accessible via federated identity management.
This presentation is aimed at researchers who need a user-friendly and scalable programming environment, and at representatives of institutions offering e-Infrastructure services such as HPC services, data archives and NRENs.
Benefits for Audience:
We will present a PoC that allows the rapid deployment of dedicated private Spark clusters for users. These environments will scale dynamically with load and come with a Jupyter notebook environment. Users can easily develop, run and share code and results. The environment supports various programming languages, and code can run on either a large cluster or a single laptop without adjustments.
Topic 2: Services enabling research