How to Set Up a Multi-Node Spark Cluster on AWS

Apache Spark is an open-source cluster-computing framework that was originally developed at the University of California, Berkeley’s AMPLab. The Apache Spark codebase was later donated to the Apache Software Foundation. Apache Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

In this article, we explain why you should use Apache Spark and demonstrate how to set up a real multi-node cluster. The cluster will be hosted on AWS EC2 instances, and once we’re done, you can start playing around with Spark and processing data.

First Things First — Why Apache Spark?

As far as distributed storage goes, Spark can interface with the Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, and others. There are also third-party solutions built on top of existing cloud infrastructure that offer improved storage capabilities, such as Nimble predictive storage or PureStorage, or stronger backup capabilities, such as N2WS. These solutions offer more control over your storage containers and how you use and protect them.

Spark also supports a pseudo-distributed local mode that can be used for development or testing purposes where distributed storage is not needed and the local file storage system can be used to get to the same end result.

Apache Spark is a recent entrant in the Big Data suite of services. It allows users to execute Big Data processing workloads that do not fit the regular MapReduce paradigm. It accomplishes this by reusing major components of the Apache Hadoop framework. It also supports a variety of patterns that are similar in functionality to Java 8 Streams, while allowing users to run these jobs on a cluster.

Imagine a scenario where you have a data processing job working with Java 8 Streams. However, you need more horsepower and memory than a single machine can normally provide. In such an instance, Apache Spark is a natural fit.
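To make the Streams comparison concrete, here is a minimal single-machine sketch of the map, filter, and reduce style in plain Python. The sample lines and the helper names (word_counts, add_word) are made up for illustration. Spark's RDD API exposes operations of the same shape, such as flatMap, filter, and reduceByKey, but runs them across the nodes of a cluster instead of inside one process.

```python
from functools import reduce

def add_word(counts, word):
    """Reduce step: fold one word into the running tally."""
    counts[word] = counts.get(word, 0) + 1
    return counts

def word_counts(lines):
    """Count words with a map -> filter -> reduce pipeline on one machine."""
    # map: split every line into lowercase words
    words = (word.lower() for line in lines for word in line.split())
    # filter: drop tokens of one or two characters
    kept = (word for word in words if len(word) > 2)
    # reduce: combine all remaining words into a single dictionary
    return reduce(add_word, kept, {})

sample = ["Spark runs on a cluster", "a cluster runs Spark jobs"]
print(word_counts(sample))
```

Everything here lives in the memory of one process, which is exactly the limit the scenario above runs into.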

Set Up Your AWS Account

If your aim is to experiment and play around with Spark, a standard EC2 instance with an Ubuntu AMI should do. In this tutorial, we’re going to use an open-source tool known as Flintrock, which is one of the quickest ways to set up a Spark cluster on AWS.

Flintrock is a Python-based command-line tool that is in active development. The rest of this tutorial focuses on setting up clusters using Flintrock. To get started, you will need an AWS account and some basic knowledge of setting up EC2 instances in AWS.

Install Flintrock

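Flintrock is distributed as a Python package, so it can be installed with pip (pip3 install flintrock). If you don’t have Python, you can also try running their standalone installer.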

Create a new user in AWS

To create a new user:

  1. Head to the AWS Console and select IAM from the drop-down menu.
  2. Go to the Users tab and select Add new user.
  3. In the Add user window, fill in the username. Let’s name it spark.
  4. For the access type, select Programmatic access as well as AWS Console Management Access.
  5. Select Next: Permissions.
  6. On the next screen, select Add user to a group.
  7. You will see an option to create a new group.
  8. Add a group name and select the appropriate permissions for the user role.
  9. Press the Create Group button.
  10. Add the user to the newly created group. You will have a chance to review the settings. Click Next.

IAM will create a new user and return an Access Key ID, a Secret Access Key, and a password. Make sure that you download them and keep them secure, because these credentials won’t be available later. Set the two keys as environment variables so that Flintrock can use them later:

$ export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxx
$ export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
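Exported variables only live in the current shell session, so it can be worth confirming they are visible before launching anything. The following is a small sketch of such a check. The function name missing_aws_credentials is hypothetical, and the only assumption is that the two variable names match the exports above.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

The check never prints the values themselves, which keeps the secret key out of terminal history and logs.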

Next up, you will need to generate a key pair to access the EC2 instances from your terminal. The steps are straightforward, and the AWS documentation has a good guide on how to do that.

Configure Flintrock

Running flintrock configure will generate a config.yaml file and open it in your default text editor. This is what it looks like:

services:
  spark:
    version: 2.2.0
  hdfs:
    version: 2.7.5

provider: ec2

providers:
  ec2:
    key-name: spark_cluster
    identity-file: /path/to/spark_cluster.pem
    instance-type: m4.large
    region: us-east-1
    ami: ami-97785bed  # Amazon Linux, us-east-1
    user: ec2-user
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True

debug: false

Replace the key-name and identity-file values with the details of the key pair that you generated in the previous section. The num-slaves property determines how many slave (worker) nodes are created alongside the master.
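A common stumbling block at this point is the identity file itself: ssh refuses a private key that is readable by the group or by other users. Here is a hedged sketch of a pre-flight check. The function name identity_file_problems is hypothetical, and the path is the placeholder from the sample config, not a real location.

```python
import os
import stat

def identity_file_problems(path):
    """Return a list of human-readable problems with an SSH private key file."""
    expanded = os.path.expanduser(path)
    if not os.path.isfile(expanded):
        return [f"{path} does not exist or is not a regular file"]
    problems = []
    mode = stat.S_IMODE(os.stat(expanded).st_mode)
    # Any permission bit set for group or others makes ssh reject the key.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} has mode {oct(mode)}; try chmod 400")
    return problems

if __name__ == "__main__":
    # Placeholder path copied from the sample config.
    for problem in identity_file_problems("/path/to/spark_cluster.pem"):
        print(problem)
```

An empty list means the file exists and is private to its owner, which is what the identity-file setting needs.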

Start a Multi-Node Spark Cluster

$ flintrock launch multi-node-cluster-demo

It might take a couple of minutes for Flintrock to finish the setup. Once it is done, you can log in to your cluster by running the following command, using the same cluster name that you passed to launch.

$ flintrock login multi-node-cluster-demo

You should see something like this:

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2017.09-release-notes/

[ec2-user@ip]$
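If you would rather drive the cluster lifecycle from a script, the commands can be assembled in Python. This is only a sketch: the helper flintrock_command is hypothetical, and it assumes the launch, login, and destroy subcommands and the --num-slaves option behave as listed by flintrock --help on your installed version. It prints the commands instead of running them.

```python
import shlex

def flintrock_command(action, cluster_name, **options):
    """Build a Flintrock CLI invocation as an argument list."""
    args = ["flintrock", action, cluster_name]
    for name, value in options.items():
        # A keyword such as num_slaves=2 becomes "--num-slaves 2".
        args += ["--" + name.replace("_", "-"), str(value)]
    return args

launch = flintrock_command("launch", "multi-node-cluster-demo", num_slaves=2)
login = flintrock_command("login", "multi-node-cluster-demo")
destroy = flintrock_command("destroy", "multi-node-cluster-demo")

for command in (launch, login, destroy):
    print(shlex.join(command))
    # To actually run a command: subprocess.run(command, check=True)
```

Options passed on the command line are intended to override the matching values in the config file, so the same config can serve clusters of different sizes. Destroying the cluster when you are finished stops the EC2 charges.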

Running Jobs Using Pyspark/Spark-shell

cd ~/server
./spark-2.3.0-bin-hadoop2.7/bin/pyspark --master spark://ip-170-24-10-53.us-east-1.compute.internal:7077

Following a brief pause, allowing for startup, you should see the pyspark prompt ‘>>>’.

Alternatively, you can use spark-shell, the Scala equivalent, to run your Spark jobs.
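To confirm that the workers are actually doing the computing, you can try a small job at the pyspark prompt. The sketch below is the classic Monte Carlo estimate of pi. It assumes the shell has already created a SparkContext named sc, as pyspark normally does, and the function names are made up for this example.

```python
import random

def inside_unit_circle(_):
    """Draw a random point in the unit square; True if it lies in the quarter circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

def estimate_pi(sc, samples=100_000):
    """Estimate pi on the cluster; sc is the SparkContext provided by the shell."""
    # parallelize distributes the range, filter runs on the workers,
    # and count brings a single number back to the driver.
    hits = sc.parallelize(range(samples)).filter(inside_unit_circle).count()
    return 4.0 * hits / samples

# At the pyspark prompt, where sc already exists:
#   >>> estimate_pi(sc)
```

The result should land close to 3.14, and raising samples gives the cluster more work to spread across its nodes.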

Conclusion

Originally published at www.womenwhocode.com.

We are a 501(c)(3) non-profit organization dedicated to inspiring women to excel in technology careers. https://www.womenwhocode.com/