User Auth & Secure YARN in Spark notebook

We’re happy to announce two exciting new Spark Notebook features: user login and the ability to pass the logged-in user on to secured Hadoop/YARN clusters. This makes notebooks run with that user’s credentials, improving security and traceability (e.g. by knowing which files belong to whom).

Out of the box, the supported authentication methods include Kerberos and user/password, while many others (LDAP, OAuth, SAML, IP address, etc.) can be added with a change of only a few lines, thanks to the pac4j security library.

The logged-in user can then be passed on to the secure Hadoop cluster by means of proxy-user impersonation (available in Spark on YARN clusters with Kerberos).

Very simplified view: user login and passing it to Spark/Hadoop
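To make the mechanism a bit more concrete, here is a minimal Scala sketch of what impersonation boils down to, using Hadoop's standard `UserGroupInformation` API. It is illustrative only: the user name "alice" and the HDFS path are placeholders, and in practice Spark on YARN does this wiring for you (e.g. via the `--proxy-user` option of spark-submit), provided the cluster is configured to let the notebook's service account impersonate users.

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserSketch {
  def main(args: Array[String]): Unit = {
    // The notebook server is logged in with its own Kerberos credentials
    // (e.g. from a keytab); the end user was authenticated at the notebook UI.
    val realUser  = UserGroupInformation.getLoginUser
    val proxyUser = UserGroupInformation.createProxyUser("alice", realUser) // "alice" is a placeholder

    // Anything done inside doAs runs as the impersonated user, so the HDFS
    // directory created below belongs to "alice", not to the service account.
    proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        val fs = FileSystem.get(new Configuration())
        fs.mkdirs(new Path("/user/alice/notebook-output"))
      }
    })
  }
}
```

Note that the NameNode must explicitly allow the service account to act as a proxy (the `hadoop.proxyuser.*` settings); otherwise the impersonated calls are rejected.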

Continue reading “User Auth & Secure YARN in Spark notebook”

Spark Job Generation From a Notebook

We’re quite excited to open-source a Spark Notebook feature which converts a notebook into a template of a Spark job in the form of an SBT project. Just add a configuration to turn it into a production-ready job.

This is intended to bridge the gap between the interactive nature of notebooks and the more formal enterprise production processes.

Spark Notebook to SBT project to Apache Spark job
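To give a feel for what such a template amounts to, here is a rough, hypothetical Scala skeleton (not the generator's actual output): the build definition pulls in Spark, and the notebook cells become the body of a plain `main` method.

```scala
// Hypothetical skeleton of a notebook-turned-SBT-project; the generated
// template may differ. build.sbt would declare something along the lines of:
//   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
import org.apache.spark.{SparkConf, SparkContext}

object NotebookJob {                       // hypothetical name
  def main(args: Array[String]): Unit = {
    // In the notebook the SparkContext is provided for you; a generated job
    // has to build its own and read its settings from configuration.
    val conf = new SparkConf().setAppName("notebook-job")
    val sc   = new SparkContext(conf)
    try {
      // The notebook cells end up here, e.g.:
      val counts = sc.textFile(args(0))
        .flatMap(_.split("\\s+"))
        .map((_, 1))
        .reduceByKey(_ + _)
      counts.saveAsTextFile(args(1))
    } finally {
      sc.stop()
    }
  }
}
```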

Continue reading “Spark Job Generation From a Notebook”

Spark Notebook revision tracking with Git

We are proud to announce the open-sourcing of the long-awaited notebook revision tracking plugin for the spark-notebook!

The plugin is based on Git, so notebook modifications can either a) be stored offline in a local Git repository on the spark-notebook server, or b) have every checkpoint immediately sent to a remote Git server such as GitHub.

It supports manual and automatic checkpointing, and optionally a checkpoint message can be provided.
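Without going into the plugin's internals, a checkpoint conceptually amounts to committing the notebook file to a Git repository and optionally pushing it to a remote. Here is a rough sketch of that idea using JGit; the file path, remote name, and credentials below are placeholder assumptions, not the plugin's actual code.

```scala
import java.io.File
import org.eclipse.jgit.api.Git
import org.eclipse.jgit.transport.UsernamePasswordCredentialsProvider

object NotebookCheckpoint {
  // Commit the given notebook file as a "checkpoint", with an optional push.
  def checkpoint(repoDir: File, notebookPath: String, message: String,
                 pushToRemote: Boolean): Unit = {
    val git = Git.open(repoDir)
    try {
      git.add().addFilepattern(notebookPath).call()
      git.commit().setMessage(message).call()
      if (pushToRemote) {
        git.push()
          .setRemote("origin") // e.g. a GitHub repository
          .setCredentialsProvider(new UsernamePasswordCredentialsProvider("user", "token"))
          .call()
      }
    } finally {
      git.close()
    }
  }
}
```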

Continue reading “Spark Notebook revision tracking with Git”

Scalable Geospatial data analysis with Geotrellis, Spark, Sparkling-Water, and the Spark-Notebook

This blog shows how to perform scalable geospatial data analysis using Geotrellis, Apache Spark, Sparkling-Water and the Spark-Notebook.

As a benchmark for this blog, we use the 500-image (45 GB) dataset distributed by Kaggle/DSTL.

After reading this blog post, you will know how to:

  • Load GeoJSON and GeoTIFF files with Geotrellis (a minimal loading sketch follows this list),
  • Manipulate/resize/convert geospatial rasters using Geotrellis,
  • Distribute geospatial image analysis on a Spark cluster,
  • Display geospatial tiles in the Spark-Notebook,
  • Create multispectral histograms from a distributed image dataset,
  • Cluster image pixels based on multispectral intensity information,
  • Use H2O Sparkling-Water to train a machine learning model on a distributed geospatial dataset,
  • Use a trained model to identify objects in large geospatial images,
  • Vectorize object rasters into polygons and save them as Parquet files on a distributed file system.
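As a taste of the first bullet, here is a minimal Scala sketch of loading a GeoTIFF and parsing GeoJSON labels with Geotrellis. The file names are placeholders, and the exact imports can vary between Geotrellis versions.

```scala
import geotrellis.raster.io.geotiff.MultibandGeoTiff
import geotrellis.vector._
import geotrellis.vector.io._

object GeoTrellisLoadingSketch {
  def main(args: Array[String]): Unit = {
    // Read a single multiband GeoTIFF from the local file system
    // (the Kaggle/DSTL images are multiband TIFFs).
    val tiff = MultibandGeoTiff("data/sample_image.tif")
    println(s"bands=${tiff.tile.bandCount}, extent=${tiff.extent}, crs=${tiff.crs}")

    // Parse label polygons from a GeoJSON string; this assumes the file
    // contains a single MultiPolygon geometry.
    val json     = scala.io.Source.fromFile("data/labels.geojson").mkString
    val polygons = json.parseGeoJson[MultiPolygon]
    println(s"label area=${polygons.area}")

    // For the full 45 GB dataset the images would instead be read into an RDD
    // through Geotrellis's Spark module and processed on the cluster.
  }
}
```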

Continue reading “Scalable Geospatial data analysis with Geotrellis, Spark, Sparkling-Water, and the Spark-Notebook”

O’Reilly Online Training, 19, 21, 23 Sept. 2016: Building Distributed Pipelines for Data Science

Our next training session will take place on 19, 21, and 23 September with O’Reilly.

You will be coached by Xavier Tordoir and Andy Petrella.

The online training is titled “Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra”; the goal is to learn how to introduce a distributed data science pipeline into your organization.

During these 3 days you will discover SparkNotebook.io and learn how to connect it with Kafka, Cassandra, and Spark, along with some simple Scala examples for machine learning.

It will be fast-paced, intense, and awesome!

You can find more information on the official page http://www.oreilly.com/online-training/building-distributed-pipelines-for-data-science.html or contact us!

Run DC/OS on AWS with Terraform and Consul

Having a repeatable method of bringing an infrastructure up is a key requirement for rapid development of our agile toolkit. We use Mesosphere DC/OS a lot, and we find ourselves spinning it up quite often, be it for a training or purely for development and testing purposes.

Mesosphere provides a number of DC/OS installation options; the most flexible one is the manual installation. However, launching and provisioning machines manually every time a cluster is needed can be a tedious process. Automation is the gold standard today, and we have employed it to solve this particular problem: we have created dcos-up, a Terraform + Consul automation tool for starting DC/OS on AWS. Today, we would like to share it with you.

To get your hands on it, head over to this repository on GitHub.

Continue reading “Run DC/OS on AWS with Terraform and Consul”