We’re happy to announce two exciting new Spark Notebook features – user login and the ability to pass the logged-in user to secured Hadoop/YARN clusters. Notebooks then run with that user’s credentials, improving security and traceability (e.g. by knowing which files belong to whom).
Out of the box, the supported authentication methods include Kerberos and user/password, and many others (LDAP, OAuth, SAML, IP address, etc.) can be added with only a few lines of code, thanks to the pac4j security library.
The logged-in user can then be passed on to the secured Hadoop cluster via proxy-user impersonation (available in Spark on YARN clusters with Kerberos).
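For context, proxy-user impersonation is a standard Hadoop mechanism: the cluster must be told which service account is allowed to impersonate end users. A minimal sketch of that cluster-side configuration is below; the service principal name `notebook` and hostname are illustrative assumptions, not values from our setup.

```xml
<!-- core-site.xml on the Hadoop cluster: allow the (hypothetical) "notebook"
     service principal to impersonate users from the notebook server host -->
<property>
  <name>hadoop.proxyuser.notebook.hosts</name>
  <value>notebook-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.notebook.groups</name>
  <value>*</value>
</property>
```

On the Spark side, this is the same mechanism exposed by `spark-submit --proxy-user <user>`: the job is submitted by the service account but runs on YARN as the impersonated user.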
We’re quite excited to open-source a Spark Notebook feature that converts a notebook into a Spark job template in the form of an SBT project. Just add some configuration to turn it into a production-ready job.
This is intended to bridge the gap between the interactive nature of notebooks and the more formal enterprise production processes.
We are proud to announce the open-sourcing of the long-awaited notebook revision tracking plugin for the spark-notebook!
The plugin is based on Git, so notebook modifications can either be a) stored offline in a local Git repository on the spark-notebook server, or b) pushed, checkpoint by checkpoint, to a remote Git server such as GitHub.
It supports manual and automatic checkpointing, and a checkpoint message can optionally be provided.
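Conceptually, each checkpoint amounts to a Git commit of the notebook file. The sketch below shows the local mode (a) in plain Git terms; the directory, file name, and message are illustrative, and the plugin drives the equivalent operations internally rather than shelling out like this.

```shell
# Minimal sketch of local-mode checkpointing: one commit per checkpoint.
NB_DIR=$(mktemp -d)                  # stand-in for the spark-notebook notebooks dir
cd "$NB_DIR"
git init -q .
git config user.email "demo@example.com"
git config user.name  "demo"
echo '{"cells": []}' > analysis.snb  # a notebook file (.snb is spark-notebook's format)
git add analysis.snb
git commit -qm "checkpoint: initial version"   # a manual checkpoint with a message
git log --oneline                    # the revision history of the notebook
```

In remote mode (b), each such commit would additionally be pushed to the configured remote, e.g. on GitHub.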
Our next training session will take place on September 19, 21, and 23 with O’Reilly.
You will be coached by Xavier Tordoir and Andy Petrella.
The online training is titled “Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra”; the goal is to learn how to introduce a distributed data science pipeline into your organization.
During these three days you will discover SparkNotebook.io and how to connect it with Kafka, Cassandra, and Spark, with some easy Scala examples for learning machine learning.
It will be fast, intense, and awesome!
You can find more information on the official page at http://www.oreilly.com/online-training/building-distributed-pipelines-for-data-science.html, or contact us!
Having a repeatable method of bringing an infrastructure up is a key requirement for rapid development of our agile toolkit. We use Mesosphere DC/OS a lot and we find ourselves spinning it up quite often — be it for a training or purely for development and testing purposes.
Mesosphere provides a number of DC/OS installation options, the most flexible of which is the manual installation. However, launching and provisioning machines by hand every time a cluster is needed can be a tedious process. Automation is The Gold today, and we have employed it to solve that particular problem: we have created dcos-up, a Terraform + Consul automation tool for starting DC/OS on AWS. Today, we would like to share it with you.
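To give a flavour of the Terraform side, here is a deliberately minimal fragment in the spirit of dcos-up, not its actual code: it provisions a single AWS instance that could serve as a DC/OS bootstrap node. The region, AMI placeholder, and instance type are illustrative assumptions.

```hcl
# Illustrative sketch only – dcos-up's real configuration is more involved
# (masters, agents, security groups, Consul-based provisioning, etc.).
provider "aws" {
  region = "eu-west-1"
}

resource "aws_instance" "bootstrap" {
  ami           = "ami-xxxxxxxx"   # placeholder for a CoreOS AMI in your region
  instance_type = "m4.xlarge"
  tags {
    Name = "dcos-bootstrap"
  }
}
```

Running `terraform apply` against such a configuration is what makes cluster creation repeatable: the same description yields the same infrastructure every time.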