Pacuna's Blog

Apache Spark, Kubernetes and Azure

Apache Spark and Kubernetes are some of my favorite tools these days. And it turns out you can run Spark jobs using Kubernetes as a resource manager. It sounds cool, but due to implementation differences between cloud providers and a lack of documentation, it can get a bit tricky.

I’ve worked with Amazon AWS, Google GCE and Microsoft Azure in the past. And no matter what people say, they all work fine. Of course you will find differences, but they are all professional production-ready cloud environments.

I tested this integration on GCE and Azure. And I decided to share my experience with Azure, mainly because it wasn’t so straightforward and it required extra setup.

During the process I ran into some issues, mainly around networking when running Spark jobs in client mode. One important lesson is:

When using Apache Spark in client mode with Kubernetes, the pods should be able to communicate over the network with the host that is submitting the job.

Having the host that submits the Spark jobs in the same virtual network the cluster runs in, with inbound traffic from the pods allowed, should be enough. In this case with Azure, the solution was to create a custom Virtual Network and a Subnet. Once you have these resources, you can launch the AKS cluster and the VM that will submit the jobs in the same VNet and Subnet. Then the Kubernetes pods that act as executors can talk to the VM that runs the driver program, and the jobs can be executed with no issues.

This post contains a bunch of commands. In case you want to dig deeper, links with more details will be provided. Let’s get started.

Local environment

All the commands were tested using an Ubuntu bionic container. If you have your own Ubuntu machine, you can use it instead. This container is only used to create the initial Azure resources, before launching the VM that will submit the Spark jobs. If you don’t have an Ubuntu 18 machine, you can use Docker and follow along.

Let’s start a bash session in an Ubuntu bionic container:

docker run -it ubuntu:bionic bash

We need to install the Azure CLI tool to interact with our Azure account along with some dependencies:

apt update && apt install curl ssh -y
curl -sL https://aka.ms/InstallAzureCLIDeb | bash

You can sign in with:

az login --use-device-code

Follow the instructions from the login command output. You will need to authenticate with your Azure credentials using your browser.
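
To double-check that the CLI ended up on the subscription you expect, you can print the active account:

az account show --output table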

Create a Resource Group

A resource group will help us to encapsulate all the resources we need for this tutorial. Once we are done, we can delete everything by just deleting the resource group.

az group create --name sparkRG --location eastus

Create a Virtual Network and Subnet

As I mentioned in the introduction, we need a custom Virtual Network and a Subnet. When you create an AKS cluster without specifying a Vnet, a default one is created for you. I had some issues with this default Vnet when I later tried to create a VM in it, and we need that VM to live alongside the AKS cluster. That’s why we are creating a custom one here.

You can find more details about the following commands in this tutorial.

First, we create a Vnet in our resource group. We need to provide the network configuration for the IP ranges and the Subnet name and CIDR prefix. Take note of the Vnet and Subnet names. We are going to use them in later commands.

az network vnet create \
    --resource-group sparkRG \
    --name myAKSVnet \
    --address-prefixes 10.0.0.0/8 \
    --subnet-name myAKSSubnet \
    --subnet-prefix 10.240.0.0/16

In case you are wondering, you don’t need to use those exact values. They are just the defaults used in the tutorial I linked previously.

Assign permissions

The following commands will allow the AKS cluster to interact with other Azure resources. This method of managing permissions is very common when you need to integrate Azure services.

First we create a Service Principal:

az ad sp create-for-rbac --skip-assignment

Output:

{
  "appId": "81bba314-cc5a-4232-b619-285ddce5a199",
  "displayName": "azure-cli-2019-05-04-00-40-40",
  "name": "http://azure-cli-2019-05-04-00-40-40",
  "password": "cffc5a68-20ba-4a1d-b2db-41e75a10ffc3",
  "tenant": "47a8af16-7f6e-414a-8612-5930927bd83d"
}

Take note of the appId and password properties. We also need those for later use.
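
If you prefer not to paste these values into later commands by hand, you can also keep them in shell variables (shown here with the values from my output; replace them with yours). The rest of the post uses the literal values, so treat this as an optional convenience:

APP_ID=81bba314-cc5a-4232-b619-285ddce5a199
SP_PASSWORD=cffc5a68-20ba-4a1d-b2db-41e75a10ffc3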

To assign roles, we need the resource ids of the Vnet and Subnet created previously. We can use the az tool to run queries for these values:

VNET_ID=$(az network vnet show --resource-group sparkRG --name myAKSVnet --query id -o tsv)
SUBNET_ID=$(az network vnet subnet show --resource-group sparkRG --vnet-name myAKSVnet --name myAKSSubnet --query id -o tsv)

Now let’s create the role assignment that gives our cluster the right permissions to interact with the Vnet. Replace the assignee parameter with the appId from your Service Principal:

az role assignment create --assignee 81bba314-cc5a-4232-b619-285ddce5a199 --scope $VNET_ID --role Contributor

Creating the AKS cluster

Most of the parameters used to launch the cluster are network-related and are also copied from the tutorial linked before. Remember the cluster needs to be created in the right Vnet and Subnet. Replace service-principal with your appId and client-secret with the password from your Service Principal:

az aks create \
    --resource-group sparkRG \
    --name sparkAKSCluster \
    --node-count 3 \
    --network-plugin kubenet \
    --service-cidr 10.0.0.0/16 \
    --dns-service-ip 10.0.0.10 \
    --pod-cidr 192.168.0.0/16 \
    --docker-bridge-address 172.17.0.1/16 \
    --vnet-subnet-id $SUBNET_ID \
    --service-principal 81bba314-cc5a-4232-b619-285ddce5a199 \
    --client-secret cffc5a68-20ba-4a1d-b2db-41e75a10ffc3 \
    --generate-ssh-keys
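
Creating the cluster takes a few minutes. If you want to check on its progress, you can query the provisioning state, which should eventually report Succeeded:

az aks show --resource-group sparkRG --name sparkAKSCluster --query provisioningState -o tsv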

Creating a container registry

In order to submit Spark jobs to Kubernetes, we need to host Spark docker images somewhere. Let’s use the container registry service from Azure to create our own private registry:

az acr create --resource-group sparkRG --name sparkimages --sku Basic

Our AKS cluster needs to be able to pull images from this registry. We can also create a role assignment for this (more details here):

CLIENT_ID=$(az aks show --resource-group sparkRG --name sparkAKSCluster --query "servicePrincipalProfile.clientId" --output tsv)
ACR_ID=$(az acr show --name sparkimages --resource-group sparkRG --query "id" --output tsv)
az role assignment create --assignee $CLIENT_ID --role acrpull --scope $ACR_ID
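
If you want to confirm the acrpull assignment is in place, you can list the role assignments on the registry scope:

az role assignment list --scope $ACR_ID --output table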

Creating the VM

Now we can create the VM that will submit the Spark jobs using the same Vnet and Subnet the cluster was created in:

az vm create \
--name sparkVM \
--resource-group sparkRG \
--image ubuntults \
--vnet-name myAKSVnet \
--subnet myAKSSubnet \
--generate-ssh-keys \
--admin-username pacuna \
--admin-password Xah8aruV5ciel5Weeroh

The generated SSH keys will be stored in the ~/.ssh directory. Copy them in case you want to use them later. Remember that once you exit the container, they will be gone.
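
The az vm create output also prints the VM’s public IP address. If you lose track of it, you can query it again:

az vm show --show-details --resource-group sparkRG --name sparkVM --query publicIps -o tsv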

Setting up the VM

Log into the VM (from here on you can keep using the container, or you can use the generated SSH keys and use your regular local environment):

ssh pacuna@<your-vm-public-ip>

Install Java and Spark:

sudo apt update
sudo apt upgrade
sudo apt install openjdk-8-jre -y
wget http://mirrors.sonic.net/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
tar -xzvf spark-2.4.2-bin-hadoop2.7.tgz
sudo mv spark-2.4.2-bin-hadoop2.7 /opt/spark

Set the SPARK_HOME environment variable, and add the Spark binaries to the path:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
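
These exports only last for the current shell session, and we will log out and back in later to pick up the Docker group membership, so it’s worth persisting them, for example in ~/.profile:

echo 'export SPARK_HOME=/opt/spark' >> ~/.profile
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.profile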

Let’s also install the Azure CLI tool to connect to the Kubernetes cluster from this VM:

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

Log in once again by following the instructions:

az login --use-device-code

Docker will be necessary to build and push the Spark images to the container registry:

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER

Log out and log back in so Docker can be used without sudo.

Submitting Spark applications to a Kubernetes cluster also requires kubectl to be installed and configured to communicate with the cluster:

curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl

We can configure kubectl by pulling the AKS credentials:

az aks get-credentials --resource-group sparkRG --name sparkAKSCluster
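
A quick way to confirm kubectl can talk to the cluster is to list the nodes; you should see the 3 nodes we created earlier:

kubectl get nodes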

Build and push Spark docker images

Log in to the container registry so we can push our local Docker images:

az acr login --name sparkimages

Now we can build and push the Spark images using the docker-image-tool.sh script that comes with Spark. The following commands need to be run from the SPARK_HOME directory. After they finish, the remote container registry will have 3 images: spark:0.1, spark-py:0.1 and spark-r:0.1:

cd $SPARK_HOME
docker-image-tool.sh -r sparkimages.azurecr.io -t 0.1 build
docker-image-tool.sh -r sparkimages.azurecr.io -t 0.1 push
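
You can verify that the images made it to the registry by listing its repositories:

az acr repository list --name sparkimages --output table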

Add RBAC permissions for spark

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
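
This creates a spark service account in the default namespace and binds it to the built-in edit cluster role, which covers creating and deleting the executor pods. You can inspect the result with:

kubectl get serviceaccount spark
kubectl describe clusterrolebinding spark-role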

Running a Spark shell

There are two important parameters when using client mode: spark.driver.host and spark.driver.port. These parameters tell the executor pods where the driver is running. The host will be our VM’s private IP address:

hostname --ip-address

In my case that is 10.240.0.7. For the port we will use 7778.

We also need our Kubernetes cluster address. You can get it with:

kubectl cluster-info

In my case that is https://sparkakscl-sparkrg-e659ac-e06a364e.hcp.eastus.azmk8s.io:443.
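
If you would rather grab the API server address programmatically than copy it from the cluster-info output, one option is to read it from the kubeconfig that az aks get-credentials wrote:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'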

Now we can start a spark-shell with the following command:

spark-shell \
--master k8s://https://sparkakscl-sparkrg-e659ac-e06a364e.hcp.eastus.azmk8s.io:443 \
--deploy-mode client \
--conf spark.driver.host=10.240.0.7 \
--conf spark.driver.port=7778 \
--conf spark.kubernetes.container.image=sparkimages.azurecr.io/spark:0.1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

We are using our Kubernetes cluster endpoint, the driver parameters, the Spark image from our private container registry, and the service account created previously.

Run a simple test:

spark.range(10).toDF("number").show()

You should see something like the following:

19/05/04 02:06:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.240.0.7:4040
Spark context available as 'sc' (master = k8s://https://sparkakscl-sparkrg-e659ac-e06a364e.hcp.eastus.azmk8s.io:443, app id = spark-application-1556935623374).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(10).toDF("number").show()
+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
+------+

If you run the kubectl get pods command, you should see the Spark executors running as Kubernetes pods while you have this shell open.
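
Spark labels the pods it launches, so you can also filter for just the executors (this assumes the spark-role label that Spark on Kubernetes applies to its pods):

kubectl get pods -l spark-role=executor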

Delete everything

az group delete --name sparkRG --no-wait --yes


#kubernetes #azure #spark
