Run DC/OS on AWS with Terraform and Consul

Having a repeatable way to bring infrastructure up is a key requirement for rapid development of our agile toolkit. We use Mesosphere DC/OS a lot and find ourselves spinning it up quite often, be it for training or purely for development and testing.

Mesosphere provides a number of DC/OS installation options; the most flexible one is the manual installation. However, launching and provisioning machines manually every time a cluster is needed can be a tedious process. Automation is gold these days, and we have employed it to solve that particular problem. We have created dcos-up, a Terraform + Consul automation tool for starting DC/OS on AWS. Today, we would like to share it with you.

To get your hands on it, head over to the dcos-up repository on GitHub (https://github.com/data-fellas/dcos-up).

Terraform

Terraform provides a common configuration to launch infrastructure, from physical and virtual servers to email and DNS providers. Once the infrastructure is launched, Terraform safely and efficiently changes it as the configuration evolves.

In other words, a plain text configuration file describes what machines should be running on a cloud provider. Once the configuration file is ready, the terraform command line tool is used to make sure the machines are there. Once they are there, the tool will ensure the machines have all the required software installed.

Consul

Consul is another great product from HashiCorp. It provides distributed service discovery and key/value storage for your infrastructure. Under the hood, Consul uses the Raft consensus protocol to distribute the service catalog and K/V storage across the infrastructure. Unlike mesos-dns, it provides not only DNS based service discovery but also so-called watches, a reactive service discovery tool which utilises Consul events to trigger operations on a machine in response to other events. For example, it is possible to enable or disable certain services when some condition is met on another machine within the infrastructure.

Every machine runs a Consul agent, and all agents together form a peer-to-peer, gossip based overlay. The overlay serves as a presence mechanism, but it is also used for event and K/V store dissemination. An agent can register a watch (an arbitrary program) which gets triggered in response to a user defined event.

That’s Consul in a nutshell. If you have never used these two tools together, I highly recommend spending some time with them. They are an invaluable asset.
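To make the idea of a watch more concrete, here is a minimal sketch, in Python, that builds a Consul agent configuration fragment registering a single nodes watch. The handler path is hypothetical and is not taken from dcos-up. The `handler` key is the form used by Consul releases of this era; later releases use `args` instead, so check the documentation for your Consul version.

```python
import json


def nodes_watch_config(handler_path):
    """Build an agent config fragment with one watch of type `nodes`.

    Consul invokes the handler whenever the list of known nodes changes
    and passes the current node list to it as JSON on standard input.
    """
    return {"watches": [{"type": "nodes", "handler": handler_path}]}


# /opt/watches/on-nodes-change.py is a hypothetical handler location.
fragment = nodes_watch_config("/opt/watches/on-nodes-change.py")
print(json.dumps(fragment, indent=2))
```

The printed JSON would be saved as a file in the agent's configuration directory, so the watch is registered when the agent starts.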

Terraform + Consul → dcos-up

dcos-up combines these two tools to provision a DC/OS cluster in AWS VPC accounts. It is possible to use non-VPC accounts by making slight modifications to the configuration file.

One command in the terminal and you have a running DC/OS cluster 20 minutes later.

Great! Where do we start?

Install Terraform for your platform

Follow the link to the Terraform download page and download the release for your operating system. Put the contents of the archive somewhere on your local drive and make sure you add the directory to your PATH. On OSX this essentially boils down to:

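cd $HOME
TF_VERSION=0.6.15
OS_DIST=darwin
wget https://releases.hashicorp.com/terraform/${TF_VERSION}/terraform_${TF_VERSION}_${OS_DIST}_amd64.zip
# the archive holds the binaries at its top level, so extract them into a dedicated directory
unzip terraform_${TF_VERSION}_${OS_DIST}_amd64.zip -d $HOME/terraform_${TF_VERSION}_${OS_DIST}_amd64
# creating the link under /usr/local may require sudo
ln -s $HOME/terraform_${TF_VERSION}_${OS_DIST}_amd64 /usr/local/terraform
export PATH=/usr/local/terraform:$PATH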

Create a Terraform AWS credentials file

In order to create all the necessary instances and security groups, Terraform needs to know your AWS credentials. There are multiple ways to provide them; I prefer using environment variables. Create a ~/.aws/terraform file with the following contents:

#!/bin/bash
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=eu-west-1

And source it:

source ~/.aws/terraform

Obviously, use your real credentials; the dots will not work 😉 You may also want to change the region.

Clone the repository

cd ~
git clone https://github.com/data-fellas/dcos-up.git
cd dcos-up

Create a key pair

You have to create a key pair in the AWS account you use. By default, the expected key name is dcos-centos. You can change the key name using Terraform variables; this is explained further down. For now, name it dcos-centos and save the dcos-centos.pem file in the [your-dcos-up-dir]/keys directory.

Bring DC/OS up

terraform apply

DC/OS is installing, tell me what is happening

Depending on the instance types used, the installation may take a good 20 minutes. Let’s walk through the process to learn what exactly is happening here.

The manual installation involves a number of different machines: the bootstrap server, one or more DC/OS masters and one or more DC/OS agents. There are two types of agents possible: slave and slave_public.

Before any master or agent can be brought up, the bootstrap server needs to be running. This is because the DC/OS installer needs to know the address of the bootstrap server in order to download all the necessary components during the installation. However, dcos-up also uses Consul during the installation, so the Consul agents on the DC/OS master and agent machines need to know the address of the Consul server running on the bootstrap server.

On every machine, the prepare-dcos-machine.sh program is executed. It installs Docker, enables all the system level settings, installs all required dependencies and reboots the machine. When the machine is back online, Terraform will SSH back into it and execute the setup-consul.sh program. From here, a machine specific process kicks in.

The bootstrap server

If you look at the setup-consul.sh program, you will notice that right before the Consul service is enabled and started, a couple of watches are created. The one for the bootstrap machine (consul-watch-nodes.py) watches for any node joining the infrastructure. When terraform apply is issued, Terraform learns the number of masters and agents directly from the configuration, and dcos-up passes that number to the bootstrap machine during provisioning. The consul-watch-nodes.py watch is triggered every time a master or agent joins, but the bootstrap server provisioning process does not start until all required masters and agents have joined. Only when all of them are detected does the watch execute fully.

[Image: dcos-bootstrap-watch-info]
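The gating behaviour described above can be sketched roughly as follows. This is not the code of consul-watch-nodes.py, only an illustration of the idea under a few assumptions: the handler receives the node list from a Consul nodes watch as JSON (a list of objects with Node and Address fields), the bootstrap node is named "bootstrap", and the expected number of masters and agents is supplied by the caller. The function and parameter names are hypothetical.

```python
import json


def parse_nodes(raw_json):
    """Decode the JSON document a `nodes` watch hands to its handler."""
    return json.loads(raw_json)


def cluster_is_complete(nodes, expected_count, bootstrap_name="bootstrap"):
    """Return True once every expected master and agent has joined.

    The bootstrap node itself is part of the node list, so it is
    excluded before counting. `bootstrap_name` is an assumed node name.
    """
    joined = [n for n in nodes if n["Node"] != bootstrap_name]
    return len(joined) >= expected_count


# Example payload: the bootstrap server plus one master, with 1 master
# and 2 agents expected, so provisioning must not start yet.
payload = json.dumps([
    {"Node": "bootstrap", "Address": "10.0.0.5"},
    {"Node": "master-0", "Address": "10.0.0.10"},
])
print(cluster_is_complete(parse_nodes(payload), expected_count=3))
```

A real handler would read the payload from standard input on every invocation and simply exit without doing anything until the check passes.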

First, bootstrap-machine-init.sh runs: the nginx and zookeeper Docker containers are started, and dcos_generate_config.sh is downloaded and executed. Back in the Python watch, the config.yaml file is created. You can see it using the Consul node information to populate the master_list and agent_list settings. Once config.yaml is written to disk, the watch executes the second part of the bootstrap: bootstrap-machine-ready.sh. Here, the DC/OS bootstrap server is started and registered in Consul.
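As an illustration of how node information can be turned into those two settings, here is a minimal sketch. It assumes that node roles can be told apart by a name prefix ("master" and "agent"), which is a hypothetical convention and not necessarily what dcos-up does. It also renders only master_list and agent_list; a real config.yaml needs further settings that are omitted here.

```python
def render_node_lists(nodes):
    """Render the master_list and agent_list part of config.yaml.

    `nodes` is a list of dicts with "Node" and "Address" keys, as
    delivered by a Consul `nodes` watch. Roles are derived from an
    assumed node name prefix.
    """
    masters = sorted(n["Address"] for n in nodes if n["Node"].startswith("master"))
    agents = sorted(n["Address"] for n in nodes if n["Node"].startswith("agent"))

    lines = ["master_list:"]
    lines += ["- " + address for address in masters]
    lines.append("agent_list:")
    lines += ["- " + address for address in agents]
    return "\n".join(lines) + "\n"


example_nodes = [
    {"Node": "bootstrap", "Address": "10.0.0.5"},
    {"Node": "master-0", "Address": "10.0.0.10"},
    {"Node": "agent-0", "Address": "10.0.0.21"},
    {"Node": "agent-1", "Address": "10.0.0.20"},
]
print(render_node_lists(example_nodes))
```

The YAML is assembled by hand to keep the sketch free of third-party dependencies; a YAML library would be the more robust choice in a real program.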

The masters and agents

Have a look at the setup-consul.sh program once again. There is another Consul watch in there: the watch-service-bootstrap-server watch. This one is used by every machine other than the bootstrap server. The watch waits for the bootstrap service to appear in Consul and, once that happens, triggers the setup-dcos-node.sh program. Running setup-dcos-node.sh installs either a master or an agent, depending on the node role. Once all those programs have finished successfully, the DC/OS cluster becomes operational.
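The decision such a service watch has to make can be sketched as below. This is an assumption-laden illustration, not the actual handler from dcos-up: it expects the payload of a Consul service watch (a JSON list of entries, each with Node and Service objects) and returns the address of the first registered bootstrap instance, or None while the service is still missing. The function name is hypothetical.

```python
import json


def bootstrap_endpoint(entries):
    """Return "address:port" of the bootstrap service, or None if absent.

    `entries` is the decoded payload of a `service` watch: an empty
    list while nobody has registered the service, otherwise one entry
    per registered instance.
    """
    if not entries:
        return None
    first = entries[0]
    return "{}:{}".format(first["Node"]["Address"], first["Service"]["Port"])


# Before the bootstrap server registers itself, the watch delivers [].
print(bootstrap_endpoint(json.loads("[]")))

# Afterwards, the payload carries the node address and the service port.
registered = json.loads(
    '[{"Node": {"Node": "bootstrap", "Address": "10.0.0.5"},'
    ' "Service": {"Service": "bootstrap", "Port": 80}, "Checks": []}]'
)
print(bootstrap_endpoint(registered))
```

A handler built on this would do nothing while the result is None and start the node setup program once an endpoint is available.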

Observing DC/OS launch process

The terraform apply command will finish well before the DC/OS cluster is operational. As such, it is important to know how the installation process can be observed. Everything, on every machine, is logged to the /var/log/provisioning.log file. To see the status of the installation, SSH into the machine you wish to observe and execute:

tail -F /var/log/provisioning.log

When all provisioning logs indicate success, you can start looking for Exhibitor and the DC/OS UI. Terraform prints all the necessary addresses at the very end of the run:

[Image: dcos-terraform-finished]

Copy the Exhibitor UI address and open the page in the browser. Verify that the status is green and the Server ID is 1. It may take a little while until this happens.

[Image: dcos-exhibitor]

You can now access the DC/OS UI:

[Image: dcos-ready]

The installation process is finished. Simply set up your DC/OS CLI pointing to the new cluster and you can start launching applications.

Changing default settings

The default infrastructure name is test-infra. To change the infrastructure name, execute terraform apply in the following way:

terraform apply -var infra_name=myinfra

The default number of masters is 1; the default number of agents is 2 for slave and 1 for slave_public. To change these numbers, execute Terraform like this:

terraform apply -var infra_name=myinfra -var "instance_counts.slave_public=3" -var "instance_counts.slave=1"

To change the expected key name:

terraform apply -var "provisioner.key_name=different-key-name"

Word of caution

dcos-up brings the master up with an insecure security group, dcos_master_insecure. This security group is created for convenience only and opens ports 80, 5050, 8080 and 8181 to everyone. If you do not wish to have these opened by default, remove aws_security_group.dcos_master_insecure.id from the list of security groups assigned to the dcos_master machines. Mind you, observing Exhibitor and using the DC/OS CLI might be difficult without an SSH tunnel to the master.

dcos-up has not been designed to handle resizing of DC/OS. It should be used only for the initial installation process.

Let us know what you think!

If you have any ideas on how to improve dcos-up, would like to contribute, or have found a problem, do not hesitate to let us know!

Happy coding!

Andy Petrella
Andy is a mathematician turned distributed computing entrepreneur.
Besides being a Scala/Spark trainer, Andy has also participated in many projects built using Spark, Cassandra and other distributed technologies, in various fields including Geospatial, IoT, Automotive and Smart Cities projects.
He is the creator of the Spark Notebook (https://github.com/andypetrella/spark-notebook), the only reactive and fully Scala notebook for Apache Spark.
In 2015, Xavier Tordoir and Andy founded Data Fellas (http://data-fellas.guru) enabling Data-Driven Business using the Agile Data Science Toolkit product.
Andy is also a member of the program committees of the O’Reilly Strata, Scala eXchange, Data Science eXchange and Devoxx events.

One Reply to “Run DC/OS on AWS with Terraform and Consul”

  1. Hi

    I get the following error when running “terraform apply” after downloading dcos-up:

    ----
    bash-3.2$ terraform apply
    Error loading config: Error loading /Users/traianowelcome/GitHub/Terraform/dcos-terraform/dcos-up/dcos.tf: Error reading config for aws_instance[dcos_bootstrap]: Invalid dot index found: 'var.instance_types.bootstrap'. Values in maps and lists can be referenced using square bracket indexing, like: 'var.mymap["key"]' or 'var.mylist[1]'. in:
    ${var.instance_types.bootstrap}
    ----

    Could this be an issue with Terraform versions? I’m on Terraform v0.8.2.

    Thanks!
