Spark Notebook revision tracking with Git

We are proud to announce that we have open-sourced the long-awaited notebook revision tracking plugin for the Spark Notebook!

The plugin is based on Git, so notebook modifications can either a) be stored offline in a local Git repository on the Spark Notebook server, or b) be sent to a remote Git server such as GitHub immediately at every checkpoint.

It supports both manual and automatic checkpointing, and a checkpoint message can optionally be provided.
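To make the two storage modes concrete, here is a minimal sketch in Python of what a single checkpoint could look like in terms of plain Git commands. It is not the plugin's actual implementation: the helper name `checkpoint_commands`, its parameters, the default branch name, and the example notebook path are all assumptions made for illustration.

```python
from typing import List, Optional


def checkpoint_commands(
    notebook_path: str,
    message: Optional[str] = None,
    remote: Optional[str] = None,
    branch: str = "master",  # assumed default branch name
) -> List[List[str]]:
    """Return the git commands for one checkpoint (hypothetical helper)."""
    # The checkpoint message is optional, so fall back to a generic one.
    commit_message = message or f"Checkpoint {notebook_path}"
    commands = [
        ["git", "add", "--", notebook_path],
        ["git", "commit", "-m", commit_message],
    ]
    # Mode b): also send the checkpoint to a remote server such as GitHub.
    if remote is not None:
        commands.append(["git", "push", remote, branch])
    return commands


# Mode a): local repository only, no message provided.
local_only = checkpoint_commands("notebooks/demo.snb")

# Mode b): push to a remote, with an explicit checkpoint message.
with_remote = checkpoint_commands(
    "notebooks/demo.snb", "Tune model", remote="origin"
)
```

Each returned command could then be executed with `subprocess.run(cmd, cwd=repo_dir, check=True)`, where `repo_dir` is the directory holding the notebook repository. Building the commands separately from running them keeps the sketch easy to inspect without touching a real repository.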


Scalable Geospatial data analysis with Geotrellis, Spark, Sparkling-Water, and the Spark-Notebook

This blog post shows how to perform scalable geospatial data analysis using Geotrellis, Apache Spark, Sparkling-Water, and the Spark-Notebook.

As a benchmark for this post, we use the dataset of 500 images (45 GB) distributed by Kaggle/DSTL.

After reading this blog post, you will know how to:

  • Load GeoJSON and GeoTIFF files with Geotrellis,
  • Manipulate/resize/convert geospatial rasters using Geotrellis,
  • Distribute geospatial image analysis on a Spark cluster,
  • Display geospatial tiles in the Spark-Notebook,
  • Create a multispectral histogram from a distributed image dataset,
  • Cluster image pixels based on multi-spectral intensity information,
  • Use H2O Sparkling-Water to train a machine learning algorithm on a distributed geospatial dataset,
  • Use a trained model to identify objects on large geospatial images,
  • Vectorize object rasters into polygons and save them as Parquet files on a distributed file system.
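As a rough illustration of the multispectral histogram step, the sketch below computes one histogram per spectral band for each tile and then merges the partial results, which is the same compute-then-combine pattern a Spark aggregation follows. It uses plain Python lists rather than Geotrellis or Spark; the function names, the toy pixel values, and the 8-bit intensity range are assumptions for illustration only.

```python
from typing import List, Sequence

Pixel = Sequence[int]  # one intensity value per spectral band
Histograms = List[List[int]]  # histograms[band][bin] = pixel count


def band_histograms(
    pixels: Sequence[Pixel], n_bands: int, n_bins: int, max_value: int = 255
) -> Histograms:
    """Count pixel intensities per band into equal-width bins."""
    histograms = [[0] * n_bins for _ in range(n_bands)]
    for pixel in pixels:
        for band in range(n_bands):
            # Map an intensity in [0, max_value] to a bin in [0, n_bins - 1].
            bin_index = min(pixel[band] * n_bins // (max_value + 1), n_bins - 1)
            histograms[band][bin_index] += 1
    return histograms


def merge_histograms(a: Histograms, b: Histograms) -> Histograms:
    """Combine two partial results, as a cluster would across partitions."""
    return [[x + y for x, y in zip(ha, hb)] for ha, hb in zip(a, b)]


# Two toy "tiles" of 3-band pixels (assumed 8-bit intensities).
tile_a = [(0, 255, 128), (10, 200, 130)]
tile_b = [(250, 5, 127)]

merged = merge_histograms(
    band_histograms(tile_a, n_bands=3, n_bins=4),
    band_histograms(tile_b, n_bands=3, n_bins=4),
)
```

Because merging is a simple element-wise sum, the partial histograms can be combined in any order, which is what makes the computation easy to distribute.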


O’Reilly Online training, 19, 21, 23 Sept. 2016: Building Distributed Pipelines for Data Science

Our next training session will take place on 19, 21, and 23 September with O’Reilly.

You will be coached by Xavier Tordoir and Andy Petrella.

The online training is titled “Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra”. The goal is to learn how to introduce a distributed data science pipeline in your organization.

During these three days, you will discover how to connect a distributed data science pipeline with Kafka, Cassandra, and Spark, with some easy examples in Scala for learning machine learning.

It will be fast, heavy, and awesome!

You can find more information on the official page, or contact us!

Run DC/OS on AWS with Terraform and Consul

Having a repeatable method of bringing an infrastructure up is a key requirement for rapid development of our agile toolkit. We use Mesosphere DC/OS a lot and find ourselves spinning it up quite often, be it for a training session or purely for development and testing purposes.

Mesosphere provides a number of DC/OS installation options, the most flexible of which is the manual installation. However, launching and provisioning machines manually every time a cluster is needed can be a tedious process. Automation is The Gold today, and we have employed The Gold to solve that particular problem. We have created dcos-up, a Terraform + Consul automation tool for starting DC/OS on AWS. Today, we would like to share it with you.

To get your hands on it, head over to this repository on GitHub.
