Share Protected Data with Jupyter Notebooks on Kubernetes with a Sidecar


Learn how to share protected data with user-shared Jupyter Notebooks with a BYOD approach

You have an application (for example, a Django/Python app or even a Jupyter notebook containing ML code) running inside a Kubernetes pod; that is, it has been containerized using Docker and launched within a Kubernetes cluster. (I am assuming you have some idea about the application containerization process.)

For this tutorial, I will be using Google Kubernetes Engine (GKE) on the Google Cloud Platform. You can get a new GCP account with your Gmail account. You will receive a $300 credit to use within the year and can follow along in this tutorial. I am assuming that you know how to launch a Kubernetes cluster using GCP.

Using the GCP developers console, launch the Cloud Shell. We will use it to create a cluster with the Google Kubernetes Engine.

Before we create the Kubernetes cluster, we have to create a Google Cloud Storage (GCS) bucket to store the protected data. The bucket needs a unique name so that the sidecar container (a Docker container with a GCS FUSE mount point) can interact with it.
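If you would rather create the bucket from the cloud shell, the sketch below shows one way to do it with gsutil. The bucket name sidecar-test is an assumption taken from the gcsfuse command used later in this post (bucket names are global, so you will most likely need your own), the region us-east1 is assumed from the cluster zone, and the function name is only illustrative.

```shell
# Assumed values: replace them with your own bucket name and region.
BUCKET="${BUCKET:-sidecar-test}"
REGION="${REGION:-us-east1}"
# The tool name is a variable so it can be swapped for "echo" to preview the commands.
GSUTIL="${GSUTIL:-gsutil}"

# Create the bucket and upload one sample file under the test/ prefix.
create_bucket_with_sample() {
  sample_file="$1"
  "$GSUTIL" mb -l "$REGION" "gs://${BUCKET}" &&
    "$GSUTIL" cp "$sample_file" "gs://${BUCKET}/test/"
}

# Usage (not run automatically):
#   create_bucket_with_sample ./sample.csv
```

Setting GSUTIL=echo before calling the function prints the two commands instead of running them, which is a cheap way to check the names first.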

You can directly use the GUI from the developers’ console to create a bucket and drag-and-drop some test data into it, then proceed to create the Kubernetes cluster using GKE:

## replace us-east1-b with your zone; sidecar-gcsfuse-test is the name of your cluster
gcloud container clusters create sidecar-gcsfuse-test \
  --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/cloud-platform \
  --machine-type n1-standard-2 \
  --num-nodes 2 \
  --preemptible \
  --zone us-east1-b \
  --cluster-version latest

Once the cluster is created, add relevant permissions –

kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user=your_email@gmail.com

Once the cluster is ready, we have to create a persistent volume (PV) and a persistent volume claim (PVC) for the sidecar container to use.

The sidecar is designed with a bidirectional mount point: it connects upstream to the Google bucket to get the privileged data and downstream to the application to share the data. The application then accesses the data in a non-privileged mode, which ensures that it cannot alter the data in any form.
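A quick way to confirm the non-privileged side of this design is to probe the shared directory from inside the application container. This is a minimal sketch: the path /data/sidecar-test is an assumption (it is the sidecar's mount path in this post, and your application may see the data elsewhere), and the function name is hypothetical.

```shell
# Return 0 if the current user can create a file in the directory, 1 otherwise.
mount_is_writable() {
  dir="$1"
  probe="$dir/.write-probe.$$"
  # The subshell keeps a failed redirection from aborting the calling shell.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}

# Assumed path of the shared data inside the application container.
DATA_DIR="${DATA_DIR:-/data/sidecar-test}"
if mount_is_writable "$DATA_DIR"; then
  echo "WARNING: $DATA_DIR is writable from this container"
else
  echo "OK: $DATA_DIR is not writable from this container"
fi
```

Note that a missing directory is also reported as not writable, so check that the path exists before trusting the result.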

The application also gets its own PV and PVC, where it stores and modifies the data it has accessed from the bucket without altering the original data in any way.
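The copy-then-modify pattern described here can be sketched as follows. Both paths in the usage comment are assumptions (the application's own volume is only set up in Part 2), and the function name is hypothetical.

```shell
# Copy everything from the read-only shared directory into a working
# directory on the application's own volume, then make the copy writable.
copy_to_workspace() {
  src="$1"
  dest="$2"
  mkdir -p "$dest" &&
    cp -R "$src"/. "$dest"/ &&
    chmod -R u+w "$dest"
}

# Usage (assumed paths, not run automatically):
#   copy_to_workspace /data/sidecar-test /home/jovyan/work/data
```

All later edits happen in the working copy, so the files in the bucket stay as they were.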

To deploy a PV and PVC for the sidecar, we can use an NFS mount or simply a standard GCE disk.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: sidecar-test
  labels:
    app: gcsfuse-test
spec:
  accessModes:
    - ReadOnlyMany
  capacity:
    storage: 12Gi
  persistentVolumeReclaimPolicy: Retain
  gcePersistentDisk:
    pdName: sidecar-test
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: sidecar-test
  labels:
    app: gcsfuse-test
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 9Gi

We use a deployment.yaml to deploy the sidecar. We have to make sure we link to the correct file path, in this case, '/test/' of the GCS bucket with secure data.

---
apiVersion: apps/v1  # apps/v1beta1 is no longer served as of Kubernetes 1.16
kind: Deployment
metadata:
  name: gcsfuse-test
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: gcsfuse-test
  template:
    metadata:
      labels:
        app: gcsfuse-test
    spec:
      containers:
        - name: gcsfuse-test
          image: ananyak8srenci/gcsfuse-sidecar-data:latest
          volumeMounts:
            - mountPath: /data/sidecar-test
              name: sidecar-test
              mountPropagation: Bidirectional
          securityContext:
            privileged: true
            capabilities:
              add:
                - SYS_ADMIN
          lifecycle:
            postStart:  # mount the bucket once the container has started
              exec:
                command: ["gcsfuse", "-o", "allow_other", "-o", "nonempty", "sidecar-test", "/data/sidecar-test"]
            preStop:  # unmount cleanly before the container stops
              exec:
                command: ["fusermount", "-u", "/data/sidecar-test"]
      volumes:
        - name: sidecar-test
          persistentVolumeClaim:
            claimName: sidecar-test

Create a persistent disk for the sidecar. Then deploy the PV, PVC, and the deployment.yaml.

Deploy the sidecar –

gcloud compute disks create --size=200Gi --zone=us-east1-b sidecar-test
kubectl apply -f sidecar-pv-pvc-fuse.yaml
kubectl apply -f sidecar-deployment.yaml

Once your sidecar container is deployed, test it by making sure you can see it in the list of deployments and pods.

kubectl get deployment
kubectl get pod
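If you want the check to wait until the sidecar is actually up instead of polling by hand, something like the following could work. It is a sketch that assumes the deployment name gcsfuse-test and the label app=gcsfuse-test from the manifest above. The 120 second timeout and the function name are arbitrary choices of mine.

```shell
# The tool name is a variable so it can be swapped for "echo" to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Block until the deployment has rolled out, then list its pods.
wait_for_sidecar() {
  "$KUBECTL" rollout status deployment/gcsfuse-test --timeout=120s &&
    "$KUBECTL" get pods -l app=gcsfuse-test -o wide
}

# Usage (not run automatically):
#   wait_for_sidecar
```

The pod name printed by the second command is the one to use when you shell into the container.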

Then shell into the sidecar container, using the pod name reported by kubectl get pod, and try writing and deleting files in the GCS bucket it is associated with.

kubectl exec -it your_gcsfuse-test-deployment -- /bin/bash
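Once inside the sidecar, a small write, read, and delete round trip is enough to confirm that the mount is live. The sketch below assumes the mount path /data/sidecar-test from the deployment above. The function name and file name are hypothetical.

```shell
# Write a file into the mounted bucket, read it back, then delete it.
# Returns 0 only if all three steps succeed.
bucket_roundtrip() {
  dir="$1"
  f="$dir/roundtrip-$$.txt"
  echo "hello from the sidecar" > "$f" || return 1
  [ "$(cat "$f")" = "hello from the sidecar" ] || return 1
  rm "$f" || return 1
  [ ! -e "$f" ]
}

# Usage inside the sidecar container (not run automatically):
#   bucket_roundtrip /data/sidecar-test && echo "mount is read-write"
```

While the file exists you should also be able to see it in the bucket from the developers' console.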

The following video shows the GCS FUSE mounted sidecar container writing and deleting secure data in a google bucket.

Thanks for reading! Please feel free to leave a response if you have any comments or feedback.

Next, we will take a look at deploying a Jupyter notebook and connecting it with the sidecar container to obtain bucket data in an unprivileged mode.

Part 2 — Deploy a jupyter notebook and connect it with the GCS-FUSE sidecar container.
