We are proud to announce the open-sourcing of the long-awaited notebook revision tracking plugin for the spark-notebook!
The plugin is based on Git, so notebook modifications can either a) be stored offline in a local Git repo on the spark-notebook server, or b) be sent to a remote Git server such as GitHub immediately at every checkpoint.
It supports manual and automatic checkpointing, and a checkpoint message can optionally be provided.
Continue reading “Spark Notebook revision tracking with Git”
This blog post shows how to perform scalable geospatial data analysis using Geotrellis, Apache Spark, Sparkling-Water and the Spark-Notebook.
As a benchmark for this post, we use the 500-image (45 GB) dataset distributed by Kaggle/DSTL.
After reading this blog post, you will know how to:
- Load GeoJSON and GeoTIFF files with Geotrellis,
- Manipulate/resize/convert geospatial rasters using Geotrellis,
- Distribute geospatial image analysis on a Spark cluster,
- Display geospatial tiles in the Spark-Notebook,
- Create a multispectral histogram from a distributed image dataset,
- Cluster image pixels based on multispectral intensity information,
- Use H2O Sparkling-Water to train a machine learning algorithm on a distributed geospatial dataset,
- Use a trained model to identify objects in large geospatial images,
- Vectorize object rasters into polygons and save them to distributed (Parquet) file systems.
Continue reading “Scalable Geospatial data analysis with Geotrellis, Spark, Sparkling-Water and, the Spark-Notebook”
Our next training session, with O’Reilly, will take place on 19, 21 and 23 September.
You will be coached by Xavier Tordoir and Andy Petrella.
The online training is titled “Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra”. The goal is to learn how to introduce a distributed data science pipeline in your organization.
During these 3 days, you will discover SparkNotebook.io and how to connect it with Kafka, Cassandra and Spark, with some easy examples in Scala to learn machine learning.
It will be fast, heavy and awesome!
You can find more information on the official page, http://www.oreilly.com/online-training/building-distributed-pipelines-for-data-science.html, or contact us!
Having a repeatable method of bringing an infrastructure up is a key requirement for rapid development of our agile toolkit. We use Mesosphere DC/OS a lot and we find ourselves spinning it up quite often, be it for a training session or purely for development and testing purposes.
Mesosphere provides a number of DC/OS installation options; the most flexible one is the manual installation. However, launching and provisioning machines manually every time a cluster is needed can be a tedious process. Automation is The Gold today, and we have employed The Gold to solve that particular problem. We have created dcos-up, a Terraform + Consul automation tool for starting DC/OS on AWS. Today, we would like to share it with you.
To get your hands on it, head over to this repository on GitHub.
Continue reading “Run DC/OS on AWS with Terraform and Consul”
In this blog post, we’ll cover:
- what the next data science is
- what DC/OS is
- how to install a distributed data science environment
- why Data Fellas is thrilled to be a partner of Mesosphere
Continue reading “Enterprise Agile Data Science Toolkit on DC/OS”
Andy Petrella will be speaking at Devoxx France on Friday, the 22nd of April. He will talk about “New Data Science: Functional, Distributed, JVM… and Agile”.
Don’t hesitate to meet him!
Continue reading “Andy talks at Devoxx France”