Multi-Writer File Storage on GKE

LH Fong
ESTL Lab Notes
Mar 13, 2018



Coursemology needs a shared file system to allow file downloads. Instructors can click a button to download all of their students’ answers. A background job is started and a Sidekiq worker zips everything up, writing the file to the public/downloads folder. When the job is done, the user’s browser is redirected to the file’s URL so it can download the file, served by Nginx.

On our current VM-based deployment, the app, worker and Nginx processes all run on different VMs. Each has access to what it thinks is a local public/downloads folder through SSHFS, which allows multiple writers and automatically keeps the folder synced across the different VMs.

Solution One

Rewrite the file download feature to upload the created zip file to Amazon S3, which we already use anyway. However, S3 currently stores the files we actually want to keep, while these zip downloads are very temporary. We would also incur additional file storage costs, network transfer costs, and would have to write a cleanup script at some point to clear these zip files.

Solution Two

Figure out how to configure Kubernetes Volumes. The documentation lists a few volume types which support multiple writers. In the summary table of access modes in the Persistent Volumes documentation, focus on the ReadWriteMany column. I am going to choose NFS as that is the least exotic one to me.

Trying the NFS Example

Let’s try to implement this by following the NFS example.

PersistentVolumes and PersistentVolumeClaims

The commands in the quickstart begin by creating a PersistentVolumeClaim. On GCP, this automatically provisions a GCE persistent disk, which you can see in the web console. This magic occurs because there is no existing PersistentVolume that can satisfy the claim, and clusters from GKE are set up to allow dynamic provisioning. Let’s check the cluster’s resources to verify that this is really happening.

Before creating the PersistentVolumeClaim:

$ kubectl get persistentvolumes
No resources found.
$ kubectl get persistentvolumeclaims
No resources found.

Now create the claim with kubectl create -f nfs-server-gce-pv.yaml.

$ kubectl get persistentvolumes
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-SOME_ID 1Gi RWO Delete Bound default/nfs-pv-provisioning-demo standard 1m
$ kubectl get persistentvolumeclaims
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs-pv-provisioning-demo Bound pvc-SOME_ID 1Gi RWO standard 1m

Go to the Disks listing in the GCP web console and you will also find a new disk with pvc-SOME_ID in its name.

This lays the groundwork for the NFS server.

NFS Server

The next command in the example creates an NFS server using a ReplicationController. I changed it to a Deployment for consistency with the rest of my setup; the main change was renaming the role label to app. For reference, the modified file is below:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-server
  template:
    metadata:
      labels:
        app: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: k8s.gcr.io/volume-nfs:0.8
        ports:
        - name: nfs
          containerPort: 2049
        - name: mountd
          containerPort: 20048
        - name: rpcbind
          containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /exports
          name: mypvc
      volumes:
      - name: mypvc
        persistentVolumeClaim:
          claimName: nfs-pv-provisioning-demo

The example uses an image in Google Container Registry. I had not found a reference to the Dockerfile to see how it is set up. Edit: my colleague found it deep in the Kubernetes repository.

The next step creates a Service so we can reach the NFS server later. After the Service is created, use kubectl describe services nfs-server to get the IP. There is a bit of manual work coming up.
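Such a Service might look like the following sketch, with the selector matching the renamed app label and the same ports the server container exposes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    app: nfs-server
```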

PersistentVolume and PersistentVolumeClaim for NFS Server

The next two files in the example create a PersistentVolume and a PersistentVolumeClaim for the NFS server. This is where the manual work comes in. The IP of the NFS server has to be manually configured in the PersistentVolume’s definition.
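The PersistentVolume definition should look roughly like this; the capacity and access mode match the kubectl output below, and the IP obtained from kubectl describe services nfs-server goes into the nfs.server field (the IP here is a placeholder):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  nfs:
    # Replace with the IP from `kubectl describe services nfs-server`
    server: 10.3.240.20
    path: "/"
```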

$ kubectl create -f nfs-pv.yaml
persistentvolume "nfs" created
$ kubectl get persistentvolume nfs
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 100Mi RWX Retain Available 12s
$ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim "nfs" created
$ kubectl get persistentvolumeclaim nfs
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs Bound nfs 100Mi RWX 21s
$ kubectl get persistentvolume nfs
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 100Mi RWX Retain Bound default/nfs 1m

Notice that the status of the nfs volume has gone from Available to Bound.

Running the Fake Backend

The example backend runs a shell script which writes the current datetime and the hostname to a file. Two replicas are configured. I think the idea is that the hostname in the file will change depending on which replica last managed to overwrite it. That did not work for me; I always got the same hostname. So I made some modifications to the file to verify that multiple writers really work as advertised:

# This mounts the nfs volume claim into /mnt and continuously
# overwrites /mnt/index-<hostname>.html with the time and hostname of the pod.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nfs-busybox
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nfs-busybox
  template:
    metadata:
      labels:
        app: nfs-busybox
    spec:
      containers:
      - image: busybox
        command:
        - sh
        - -c
        - 'while true; do date > /mnt/index-`hostname`.html; hostname >> /mnt/index-`hostname`.html; sleep $(($RANDOM % 5 + 5)); done'
        imagePullPolicy: IfNotPresent
        name: busybox
        volumeMounts:
        # name must match the volume name below
        - name: nfs
          mountPath: "/mnt"
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: nfs

The resource definition has been converted to a Deployment. The main change is the inclusion of the hostname in the output filename. If multiple writers work, I should see files from all the hosts regardless of which pod I am looking at.

Time to try it out:

$ kubectl create -f nfs-busybox-rc.yaml
deployment "nfs-busybox" created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-busybox-f4cc4cf5-tkjg6 1/1 Running 0 2m
nfs-busybox-f4cc4cf5-z7vmd 1/1 Running 0 2m
nfs-server-695df8b7f-t8c9z 1/1 Running 0 3h
$ kubectl exec nfs-busybox-f4cc4cf5-tkjg6 -- ls /mnt
index-nfs-busybox-f4cc4cf5-tkjg6.html
index-nfs-busybox-f4cc4cf5-z7vmd.html
index.html
lost+found

We can go wild!

$ kubectl scale --replicas 8 deployment nfs-busybox
deployment "nfs-busybox" scaled
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-busybox-f4cc4cf5-d8hgl 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-d9hkj 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-lpsnv 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-rdkjs 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-t8zvf 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-tkjg6 1/1 Running 0 4m
nfs-busybox-f4cc4cf5-xsl42 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-z7vmd 1/1 Running 0 4m
nfs-server-695df8b7f-t8c9z 1/1 Running 0 4h
$ kubectl exec nfs-busybox-f4cc4cf5-d8hgl -- ls /mnt
index-nfs-busybox-f4cc4cf5-d8hgl.html
index-nfs-busybox-f4cc4cf5-d9hkj.html
index-nfs-busybox-f4cc4cf5-lpsnv.html
index-nfs-busybox-f4cc4cf5-rdkjs.html
index-nfs-busybox-f4cc4cf5-t8zvf.html
index-nfs-busybox-f4cc4cf5-tkjg6.html
index-nfs-busybox-f4cc4cf5-xsl42.html
index-nfs-busybox-f4cc4cf5-z7vmd.html
index.html
lost+found

With the NFS example working, it is time to apply similar concepts to Coursemology.

Using NFS in Coursemology

The main adaptations from the example include mounting the NFS share into Coursemology’s containers, modifying folder permissions and changing the data storage.

Mounting the NFS Share

This is easy since it is exactly the same as mounting the file share into the busybox containers.

In the Deployment definition for Coursemology’s containers, add the Volume to the volumes key, then mount the volume using the volumeMounts key into the pod template, modifying the mountPath accordingly.
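As a sketch, the additions to the pod template look like this (the container name and mountPath are illustrative, not Coursemology's actual values):

```yaml
spec:
  template:
    spec:
      containers:
      - name: app
        # ...existing container configuration...
        volumeMounts:
        - name: nfs
          mountPath: /app/public/downloads
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: nfs
```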

With the Ingress and Coursemology’s app server running, I can view the files created by the busybox pods.

Folder Permissions

The Sidekiq worker process does not run as the root user in the container. It runs as a user named app. Since it will be creating the zip files, we have to ensure that it can write to the NFS share.

Let’s check what the current permissions are by doing a file listing:

$ kubectl exec worker-pod -- ls -l /mountpath
...
drwxr-xr-x 3 root root 4096 Mar 9 14:09 downloads
...

The mounted folder is currently writable only by root, so the worker process will not be able to write to it.

Perhaps this can be changed by modifying the permissions on the NFS server? The exports folder is the one holding the NFS file share.

$ kubectl exec -it nfs-server-pod /bin/sh
sh-4.2# ls -l
...
drwxr-xr-x 3 root root 4096 Mar 9 06:09 exports
...
sh-4.2# chmod -R 777 exports
sh-4.2# ls -l
...
drwxrwxrwx 3 root root 4096 Mar 9 06:09 exports
...

Now in the worker pod:

$ kubectl exec -it worker-pod /bin/sh
/mountpath $ ls -l
...
drwxrwxrwx 3 root root 4096 Mar 9 14:09 downloads
...

This works: the folder permissions have been changed and it is possible to add a file to the folder. However, it does not feel right to loosen the permissions so much.

Another possibility is to change the folder owner. An answer on ServerFault suggests that NFS handles file permissions by user ID and group ID, so changing the owner to the user and group IDs of the app user on the Coursemology containers should work.

$ kubectl exec -it nfs-server-pod /bin/sh

sh-4.2# chown -R 10000:65533 exports
sh-4.2# ls -l
...
drwxr-xr-x 3 10000 65533 4096 Mar 9 07:10 exports
...

The owner and group appear as raw numbers because there are no corresponding user and group entries on the NFS server container.

In the worker pod:

/mountpath $ ls -l
...
drwxr-xr-x 3 app nogroup 4096 Mar 9 15:10 downloads
...

This works too: the app user can add files to the folder. However, the chown command would have to be run on NFS container startup, which might be doable with init containers. A GitHub thread suggests that fsGroup could be another answer, and a StackOverflow question shows a similar solution.
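The init container approach mentioned above might be sketched like this (the container name and image choice are assumptions; the IDs are the app user's uid and gid from earlier):

```yaml
initContainers:
- name: chown-exports
  image: busybox
  # Fix ownership of the export directory before the NFS server starts
  command: ["sh", "-c", "chown -R 10000:65533 /exports"]
  volumeMounts:
  - name: mypvc
    mountPath: /exports
```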

In the Deployment definition for the NFS server, add a securityContext with an fsGroup key to the pod template. This was derived from the documentation for pod configuration. The example given there is for a Pod definition; in a Deployment, .spec.template is a nested Pod template, so putting the two together tells us where the fsGroup goes. Here, the group ID is 65533 because that is the group ID of the app user in the worker container, according to its /etc/passwd file. The full file for the NFS server now looks like this:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-server
  template:
    metadata:
      labels:
        app: nfs-server
    spec:
      securityContext:
        fsGroup: 65533
      containers:
      - name: nfs-server
        image: k8s.gcr.io/volume-nfs:0.8
        ports:
        - name: nfs
          containerPort: 2049
        - name: mountd
          containerPort: 20048
        - name: rpcbind
          containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /exports
          name: mypvc
      volumes:
      - name: mypvc
        persistentVolumeClaim:
          claimName: nfs-pv-provisioning-demo

Replace the Deployment, wait for the NFS server container to come up again, then check the worker pod:

$ kubectl exec -it worker-pod /bin/sh
/mountpath $ ls -l
...
drwxrwsr-x 3 root nogroup 4096 Mar 9 15:10 downloads
...

The group is the same as when chown was used manually to set the group. The app user can also write to the directory.
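What fsGroup does here can be reproduced on any Linux filesystem: the volume's group is changed to the given GID and the setgid bit is set on the directory, so newly created files inherit that group. A minimal local sketch of the resulting mode bits (the path is illustrative):

```shell
# Mimic what fsGroup does to the volume root: group-writable plus the
# setgid bit (the "s" in drwxrwsr-x), so new files inherit the group.
mkdir -p /tmp/exports-demo
chmod 2775 /tmp/exports-demo
stat -c '%A' /tmp/exports-demo   # → drwxrwsr-x
```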

Remounting the GCE Persistent Disk

I brought down the GKE cluster for the weekend but left the persistent disk alone to see if I could reconnect it to a new cluster. It is possible but took some digging and experimentation.

For the fake backends, we first created an NFS PersistentVolume through the NFS server, then used PersistentVolumeClaims to access it.

Going down one level, for the NFS server, the PersistentVolumeClaim dynamically provisioned a GCE persistent disk and created a PersistentVolume. So to reuse an existing disk, we have to create the PersistentVolume manually. Here’s a new resource definition file to do that:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-server-volume
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: gke-persistent-disk-name
    fsType: ext4

You should see an entry for nfs-server-volume when you view the cluster’s volumes with kubectl get pv.

Now, try to create the PersistentVolumeClaim for the NFS server. It ignores this PersistentVolume and creates a new disk and volume :( The answer can be found in this issue: the storage class must match. In the NFS example, the storage class is set with an annotation, volume.beta.kubernetes.io/storage-class. The documentation says this has been replaced by the storageClassName attribute and the annotation will eventually be deprecated. Modify the PersistentVolumeClaim so it specifies an empty string as the storage class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pv-provisioning-demo
  labels:
    demo: nfs-pv-provisioning
spec:
  storageClassName: ""
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 1Gi

The PersistentVolume is now bound to the claim:

$ kubectl get pv nfs-server-volume
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs-server-volume 1Gi RWO Retain Bound default/nfs-pv-provisioning-demo 41m

The rest of the example files can be loaded into the cluster in order. Obtain a shell session into one of the pods and view the files which were created last week.

However, this does not work perfectly. Even though the claim is bound to the correct volume, GCE still created an additional persistent disk. As far as I can tell, it just sits there costing money. Remember to clean it up with gcloud compute disks delete <disk-name>. Next, we will try a solution that does not require additional disks.

Changing Data Storage

The NFS example uses a dynamically provisioned persistent disk. The output zip files of the Sidekiq worker only need to be kept for as long as it takes the user to download them, so data persistence is not that important here and a separate disk for these files is unnecessary. The emptyDir volume type sounds perfect for this.

Configure the NFS server deployment to use the emptyDir volume by replacing the persistentVolumeClaim with emptyDir :

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nfs-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-server
  template:
    metadata:
      labels:
        app: nfs-server
    spec:
      securityContext:
        fsGroup: 65533
      containers:
      - name: nfs-server
        image: k8s.gcr.io/volume-nfs:0.8
        ports:
        - name: nfs
          containerPort: 2049
        - name: mountd
          containerPort: 20048
        - name: rpcbind
          containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /exports
          name: mypvc
      volumes:
      - name: mypvc
        emptyDir: {}

Create the NFS server Deployment, Service, PersistentVolume and PersistentVolumeClaim.

Wait for the app containers to go from pending (because they could not mount the volume) to running.

Get a shell into the containers and check that the mount is successful and that written files show up in the other containers.

Testing NFS Server Failure

Now try deleting the NFS server pod. When it comes up again, get a shell into one of the app pods. If the new NFS server pod was not scheduled onto the same node, the files created previously are gone. Unfortunately, the app containers have also lost the mount path, but since they still appear healthy, there is no restart. Trying to delete the pods manually now causes them to get stuck in the Terminating state. It is possible to force delete a pod, but the resulting warning is not very comforting:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.

So the app pods cannot terminate properly if the NFS server uses emptyDir and gets restarted. Worse, they silently lose the mount point but continue running. It could be possible to redefine the liveness command and probe, but that might not solve the stuck termination.
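A liveness probe along those lines might look like this (the command, path and timings are assumptions, not something that was actually deployed):

```yaml
# Hypothetical probe on the app containers: fail when the mount path
# is no longer a directory, so Kubernetes restarts the container.
livenessProbe:
  exec:
    command: ["sh", "-c", "test -d /mountpath"]
  initialDelaySeconds: 10
  periodSeconds: 10
```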

I tried hostPath next. On GKE, the / directory is not writable, even with sudo, so specify a directory in the /home folder instead. That seemed to work at first, until I forced the pod onto a different node with nodeSelector: when the new pod came up on the other node, the same disappearing mount path problem occurred.
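For reference, the hostPath variant of the volume looked roughly like this (the directory name under /home is illustrative):

```yaml
volumes:
- name: mypvc
  hostPath:
    # Must be under a writable path such as /home on GKE nodes
    path: /home/nfs-exports
```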

Back to Persistent Disks

All the problems above go away if a GCE persistent disk is used. Since it costs only $0.04 a month for 1 GB and there are other tasks to focus on, I shall just stick with using a persistent disk to back the NFS server’s PersistentVolume.

Coursemology Complications

After publishing the original post above, I got programming evaluations working and had a chance to test the file download. It failed! The Nginx container could not read the file.

Some investigation revealed that the Sidekiq worker writes its output with 0700 permissions. The files are generated inside a directory, so without execute permission for others on that directory, Nginx cannot read the file inside it. Here is what the permissions look like:

drwx--S---    2 10000    nogroup       4096 Mar 15 03:06 d20180315-1-1pzonya

The same thing happens on the existing VM-based deployment, but there SSHFS is used instead, and it allows specific users to be mapped to the shared filesystem: the worker sees the files as owned by its own user, while Nginx sees them as the www-data user.

NFS can be configured to map all uids and gids to the anonymous user with the all_squash option. Unfortunately, for the NFS server image being used here, the options are hardcoded in the run script. It is possible to pass in more folders to be added to /etc/exports by passing arguments in the pod spec, but there is no way to modify the options.

A whole afternoon was spent trying various things, such as mounting a ConfigMap at /etc/exports, but that did not work. The container would show up as running, but the NFS process had not started correctly, so clients could not mount the volume. Inspecting the file after the NFS server started suggested the ConfigMap had worked, but the logs showed problems with duplicate exports. There is probably a race between mounting the ConfigMap and the run script appending to the file.

Loading a ConfigMap as a file into a pod is also a little different from using ConfigMap data as a volume. There will be mounting errors if this is configured wrongly. The correct example is in the documentation.
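As a sketch of the distinction, projecting a single ConfigMap key as one file can be done with subPath (the ConfigMap and key names are illustrative); without subPath, the mount replaces the whole target directory:

```yaml
volumeMounts:
- name: exports-config
  # subPath mounts just this one file instead of shadowing the directory
  mountPath: /etc/exports
  subPath: exports
volumes:
- name: exports-config
  configMap:
    name: nfs-exports
```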

Eventually, the solution was to roll my own NFS server image. It is nearly identical to the one in the Kubernetes repository, but the export options have been replaced with the following snippet:

echo "$i *(rw,fsid=0,insecure,all_squash,anonuid=10000,anongid=65533)" >> /etc/exports

While the file permissions look the same as before from both the Nginx container and the worker pod, the download feature now works because clients are squashed to a user ID and group ID with the correct permissions. This also makes the securityContext.fsGroup setting described above unnecessary.
