Multi-Writer File Storage on GKE
Coursemology needs a shared file system to allow file downloads. Instructors can click a button to download all of their students’ answers. A background job is started and a Sidekiq worker zips everything up, writing the file to the public/downloads folder. When the job is done, the user’s browser is redirected to the file’s URL so it can download the file, served by Nginx.
On our current VM-based deployment, the app, worker and Nginx processes are all running on different VMs. They all have access to what they think is a local public/downloads folder through the use of SSHFS. This allows multiple writers and automatically keeps the folder synced across all the different VMs.
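The setup on each VM amounts to a single SSHFS mount; as a rough sketch (the host name and paths here are illustrative, not the real deployment values):

```shell
# Illustrative only: mount the app server's downloads folder locally over SSH.
# allow_other lets processes other than the mounting user read the share,
# and reconnect re-establishes the mount if the SSH connection drops.
sshfs app@app-server:/var/app/public/downloads /var/app/public/downloads \
  -o allow_other,reconnect
```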
Solution One
Rewrite the file download feature to upload the created zip file to Amazon S3, which we already use anyway. However, S3 currently stores the files we actually want to keep, while these zip downloads are very temporary. We would also incur additional file storage costs, network transfer costs, and would have to write a cleanup script at some point to clear these zip files.
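For what it is worth, the cleanup script could be avoided with an S3 lifecycle rule that expires objects under a prefix; a sketch, with a made-up bucket name and prefix:

```shell
# Hypothetical: automatically expire temporary zip downloads after one day,
# so no cleanup script is needed. Bucket name and prefix are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket coursemology-downloads \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-temp-zips",
      "Filter": {"Prefix": "downloads/"},
      "Status": "Enabled",
      "Expiration": {"Days": 1}
    }]
  }'
```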
Solution Two
Figure out how to configure Kubernetes Volumes. In the documentation, there are a few solutions which support multiple writers. In the summary table of access modes in the Persistent Volumes documentation, let’s focus on the ReadWriteMany column. I am going to choose NFS as that is the least exotic one to me.
Trying the NFS Example
Let’s try to implement this by following the NFS example.
PersistentVolumes and PersistentVolumeClaims
The commands in the quickstart begin by creating a PersistentVolumeClaim. On GCP, this automatically provisions a GCE persistent disk which you can see in the web console. This magic occurs because there is no existing PersistentVolume that can satisfy the claim, and the cluster from GKE has been set up to allow dynamic provisioning. Let’s check the cluster’s resources to verify that this is really happening.
Before creating the PersistentVolumeClaim:
$ kubectl get persistentvolumes
No resources found.
$ kubectl get persistentvolumeclaims
No resources found.
Now create the claim with kubectl create -f nfs-server-gce-pv.yaml.
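For reference, nfs-server-gce-pv.yaml from the example is roughly the following claim; the size and names match the output below, but the file in the Kubernetes examples repository is authoritative:

```yaml
# Claim that dynamically provisions a GCE persistent disk on GKE,
# since no existing PersistentVolume can satisfy it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pv-provisioning-demo
  labels:
    demo: nfs-pv-provisioning
spec:
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 1Gi
```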
$ kubectl get persistentvolumes
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-SOME_ID 1Gi RWO Delete Bound default/nfs-pv-provisioning-demo standard 1m
$ kubectl get persistentvolumeclaims
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs-pv-provisioning-demo Bound pvc-SOME_ID 1Gi RWO standard 1m
Go to the Disks listing in the GCP web console and you will also find a new disk with pvc-SOME_ID in its name.
This lays the groundwork for the NFS server.
NFS Server
The next command in the example creates an NFS server using a ReplicationController. I changed it to a Deployment for consistency with the rest of my setup. The main change was to rename role to app. For reference, the modified file is below:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nfs-server
spec:
replicas: 1
selector:
matchLabels:
app: nfs-server
template:
metadata:
labels:
app: nfs-server
spec:
containers:
- name: nfs-server
image: k8s.gcr.io/volume-nfs:0.8
ports:
- name: nfs
containerPort: 2049
- name: mountd
containerPort: 20048
- name: rpcbind
containerPort: 111
securityContext:
privileged: true
volumeMounts:
- mountPath: /exports
name: mypvc
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: nfs-pv-provisioning-demo
The example uses an image in Google Container Registry. I have not found a reference to the Dockerfile to see how it is set up. Edit: My colleague found it deep in the Kubernetes repository.
The next step creates a Service so we can reach the NFS server later. After the Service is created, use kubectl describe services nfs-server to get the IP. There is a bit of manual work coming up.
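For reference, the Service from the example looks roughly like this, with the selector adapted to the app label used above:

```yaml
# Exposes the NFS server pod on the three ports the NFS protocol needs:
# nfs itself, mountd, and rpcbind.
apiVersion: v1
kind: Service
metadata:
  name: nfs-server
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    app: nfs-server
```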
PersistentVolume and PersistentVolumeClaim for NFS Server
The next two files in the example create a PersistentVolume and a PersistentVolumeClaim for the NFS server. This is where the manual work comes in. The IP of the NFS server has to be manually configured in the PersistentVolume’s definition.
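The PersistentVolume definition is roughly the following; the server IP is a placeholder that must be replaced with the ClusterIP reported by kubectl describe services nfs-server:

```yaml
# PersistentVolume backed by the in-cluster NFS server. The capacity and
# RWX access mode match the example's output; the IP must be filled in.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 100Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.x.x.x   # placeholder: the nfs-server Service IP
    path: "/"
```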
$ kubectl create -f nfs-pv.yaml
persistentvolume "nfs" created
$ kubectl get persistentvolume nfs
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 100Mi RWX Retain Available 12s
$ kubectl create -f nfs-pvc.yaml
persistentvolumeclaim "nfs" created
$ kubectl get persistentvolumeclaim nfs
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nfs Bound nfs 100Mi RWX 21s
$ kubectl get persistentvolume nfs
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs 100Mi RWX Retain Bound default/nfs 1m
Notice that the status of the nfs volume has gone from Available to Bound.
Running the Fake Backend
The example backend runs a shell script which writes the current datetime and the hostname to a file. Two replicas are configured. I think the idea is that the hostname will change when you view the file, depending on which replica last managed to overwrite it. That did not work for me; I always got the same hostname. So I made some modifications to the file to verify that multiple writers really work as advertised:
# This mounts the nfs volume claim into /mnt and continuously
# overwrites /mnt/index.html with the time and hostname of the pod.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nfs-busybox
spec:
replicas: 2
selector:
matchLabels:
app: nfs-busybox
template:
metadata:
labels:
app: nfs-busybox
spec:
containers:
- image: busybox
command:
- sh
- -c
- 'while true; do date > /mnt/index-`hostname`.html; hostname >> /mnt/index-`hostname`.html; sleep $(($RANDOM % 5 + 5)); done'
imagePullPolicy: IfNotPresent
name: busybox
volumeMounts:
# name must match the volume name below
- name: nfs
mountPath: "/mnt"
volumes:
- name: nfs
persistentVolumeClaim:
claimName: nfs
The resource definition has been converted to a Deployment. The main change is the inclusion of the hostname in the output filename. If multiple writers work, I should see files from all the hosts regardless of which pod I am looking at.
Time to try it out:
$ kubectl create -f nfs-busybox-rc.yaml
deployment "nfs-busybox" created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-busybox-f4cc4cf5-tkjg6 1/1 Running 0 2m
nfs-busybox-f4cc4cf5-z7vmd 1/1 Running 0 2m
nfs-server-695df8b7f-t8c9z 1/1 Running 0 3h
$ kubectl exec nfs-busybox-f4cc4cf5-tkjg6 -- ls /mnt
index-nfs-busybox-f4cc4cf5-tkjg6.html
index-nfs-busybox-f4cc4cf5-z7vmd.html
index.html
lost+found
We can go wild!
$ kubectl scale --replicas 8 deployment nfs-busybox
deployment "nfs-busybox" scaled
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-busybox-f4cc4cf5-d8hgl 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-d9hkj 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-lpsnv 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-rdkjs 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-t8zvf 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-tkjg6 1/1 Running 0 4m
nfs-busybox-f4cc4cf5-xsl42 1/1 Running 0 12s
nfs-busybox-f4cc4cf5-z7vmd 1/1 Running 0 4m
nfs-server-695df8b7f-t8c9z 1/1 Running 0 4h
$ kubectl exec nfs-busybox-f4cc4cf5-d8hgl -- ls /mnt
index-nfs-busybox-f4cc4cf5-d8hgl.html
index-nfs-busybox-f4cc4cf5-d9hkj.html
index-nfs-busybox-f4cc4cf5-lpsnv.html
index-nfs-busybox-f4cc4cf5-rdkjs.html
index-nfs-busybox-f4cc4cf5-t8zvf.html
index-nfs-busybox-f4cc4cf5-tkjg6.html
index-nfs-busybox-f4cc4cf5-xsl42.html
index-nfs-busybox-f4cc4cf5-z7vmd.html
index.html
lost+found
With the NFS example working, it is time to apply similar concepts to Coursemology.
Using NFS in Coursemology
The main adaptations from the example include mounting the NFS share into Coursemology’s containers, modifying folder permissions and changing the data storage.
Mounting the NFS Share
This is easy since it is exactly the same as mounting the file share into the busybox containers.
In the Deployment definition for Coursemology’s containers, add the Volume to the volumes key, then mount the volume using the volumeMounts key into the pod template, modifying the mountPath accordingly.
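As a sketch, the relevant additions to the pod template look something like this; the container name and mount path are illustrative:

```yaml
# Excerpt of the pod template in Coursemology's Deployment: only the
# volume-related keys are shown. The claim name matches the NFS PVC above.
spec:
  containers:
    - name: app
      # ...existing container configuration...
      volumeMounts:
        - name: nfs
          mountPath: "/app/public/downloads"
  volumes:
    - name: nfs
      persistentVolumeClaim:
        claimName: nfs
```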
With the Ingress and Coursemology’s app server running, I can view the files created by the busybox pods.
Folder Permissions
The Sidekiq worker process does not run as the root user in the container. It runs as a user named app. Since it will be creating the zip files, we have to ensure that it can write to the NFS share.
Let’s check what the current permissions are by doing a file listing:
$ kubectl exec worker-pod -- ls -l /mountpath
...
drwxr-xr-x 3 root root 4096 Mar 9 14:09 downloads
...
The mounted folder is currently writeable only by root, so the worker process will not be able to write to it.
Perhaps this can be changed by modifying the permissions on the NFS server? The exports folder is the one holding the NFS file share.
$ kubectl exec -it nfs-server-pod /bin/sh
sh-4.2# ls -l
...
drwxr-xr-x 3 root root 4096 Mar 9 06:09 exports
...sh-4.2# chmod -R 777 exports
sh-4.2# ls -l
...
drwxrwxrwx 3 root root 4096 Mar 9 06:09 exports
...
Now in the worker pod:
$ kubectl exec -it worker-pod /bin/sh
/mountpath $ ls -l
...
drwxrwxrwx 3 root root 4096 Mar 9 14:09 downloads
...
This works. The folder permissions have been changed and it is possible to add a file to the folder. However, it does not feel right to loosen the permissions this much.
Another possibility is to change the folder owner. ServerFault suggests that NFS handles file permissions by user ID and group ID, so changing the owner to the user and group IDs of the app user on the Coursemology containers should work.
$ kubectl exec -it nfs-server-pod /bin/sh
sh-4.2# chown -R 10000:65533 exports
sh-4.2# ls -l
...
drwxr-xr-x 3 10000 65533 4096 Mar 9 07:10 exports
...
The owner and group appear as raw numbers because there is no corresponding user or group on the NFS server container.
In the worker pod:
/mountpath $ ls -l
...
drwxr-xr-x 3 app nogroup 4096 Mar 9 15:10 downloads
...
This works too. The app user can add files into the folder. This approach requires the chown command to be run on NFS container startup, which might be doable with init containers. The same GitHub thread suggests that fsGroup could be another answer. The StackOverflow question here shows a similar solution.
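The IDs to target can be read off the worker container’s /etc/passwd; for example (the pod name is a placeholder):

```shell
# Print the app user's entry; the third and fourth colon-separated fields
# are the uid and gid to use in chown or fsGroup.
kubectl exec worker-pod -- grep '^app:' /etc/passwd
```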
In the Deployment definition for the NFS server, add the securityContext and fsGroup keys to the pod template. This was derived from the documentation for pod configuration. The example given is for a Pod definition; in a Deployment, .spec.template is a nested Pod template. Putting these two things together, we can figure out where to put the fsGroup. Here, the group ID is 65533 as that is the group ID of the app user in the worker container according to the /etc/passwd file. The full file for the NFS server now looks like this:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nfs-server
spec:
replicas: 1
selector:
matchLabels:
app: nfs-server
template:
metadata:
labels:
app: nfs-server
spec:
securityContext:
fsGroup: 65533
containers:
- name: nfs-server
image: k8s.gcr.io/volume-nfs:0.8
ports:
- name: nfs
containerPort: 2049
- name: mountd
containerPort: 20048
- name: rpcbind
containerPort: 111
securityContext:
privileged: true
volumeMounts:
- mountPath: /exports
name: mypvc
volumes:
- name: mypvc
persistentVolumeClaim:
claimName: nfs-pv-provisioning-demo
Replace the Deployment, wait for the NFS server container to come up again, then check the worker pod:
$ kubectl exec -it worker-pod /bin/sh
/mountpath $ ls -l
...
drwxrwsr-x 3 root nogroup 4096 Mar 9 15:10 downloads
...
The group is the same as when chown was used manually to set it. The app user can also write to the directory.
Remounting the GCE Persistent Disk
I brought down the GKE cluster for the weekend but left the persistent disk alone to see if I could reconnect it to a new cluster. It is possible but took some digging and experimentation.
For the fake backends, we first created an NFS PersistentVolume through the NFS server, then used PersistentVolumeClaims to access it.
Going down one level, for the NFS server, the PersistentVolumeClaim dynamically provisioned a PersistentDisk and created a PersistentVolume. So to reuse an existing disk, we have to create the PersistentVolume manually. Here’s a new resource definition file to do that:
apiVersion: v1
kind: PersistentVolume
metadata:
name: nfs-server-volume
spec:
capacity:
storage: 1Gi
accessModes:
- ReadWriteOnce
gcePersistentDisk:
pdName: gke-persistent-disk-name
fsType: ext4
You should see an entry for nfs-server-volume when you view the cluster’s volumes with kubectl get pv.
Now, try to create the PersistentVolumeClaim for the NFS server. It will ignore this PersistentVolume and create a new disk and volume :( The answer can be found in this issue here. The storageClassName attribute must match. In the NFS example, storage-class is an annotation like volume.beta.kubernetes.io/storage-class. The documentation says this has been changed to the storageClassName attribute and the annotation will eventually be deprecated. Modify the PersistentVolumeClaim so it specifies an empty string as the storage class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-pv-provisioning-demo
labels:
demo: nfs-pv-provisioning
spec:
storageClassName: ""
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
The PersistentVolume is now bound to the claim:
$ kubectl get pv nfs-server-volume
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
nfs-server-volume 1Gi RWO Retain Bound default/nfs-pv-provisioning-demo 41m
The rest of the example files can be loaded into the cluster in order. Obtain a shell session into one of the pods and view the files which were created last week.
However, this does not work perfectly. Even though the claim is bound to the correct volume, GCE still created an additional PersistentDisk. As far as I can tell, it’s just sitting there costing money. Remember to clean it up with gcloud compute disks delete <disk-name>. Next, we will try a solution that does not require additional disks.
Changing Data Storage
The NFS example uses a dynamically provisioned persistent disk. The output zip files of the Sidekiq worker only need to be kept for as long as it takes the user to download them, so data persistence is not that important here and a separate disk for these files is unnecessary. The emptyDir volume type sounds perfect for this.
Configure the NFS server deployment to use the emptyDir volume by replacing the persistentVolumeClaim with emptyDir:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nfs-server
spec:
replicas: 1
selector:
matchLabels:
app: nfs-server
template:
metadata:
labels:
app: nfs-server
spec:
securityContext:
fsGroup: 65533
containers:
- name: nfs-server
image: k8s.gcr.io/volume-nfs:0.8
ports:
- name: nfs
containerPort: 2049
- name: mountd
containerPort: 20048
- name: rpcbind
containerPort: 111
securityContext:
privileged: true
volumeMounts:
- mountPath: /exports
name: mypvc
volumes:
- name: mypvc
emptyDir: {}
Create the NFS server Deployment, Service, PersistentVolume and PersistentVolumeClaim.
Wait for the app containers to go from pending (because they could not mount the volume) to running.
Get a shell into the containers and check that the mount is successful and that written files show up in the other containers.
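A quick way to check this (pod names and mount path are placeholders): write a file from one pod and read it from another.

```shell
# Multi-writer smoke test: a file created through the NFS mount in one
# app pod should be immediately visible from a different app pod.
kubectl exec app-pod-1 -- sh -c 'echo hello > /app/public/downloads/test.txt'
kubectl exec app-pod-2 -- cat /app/public/downloads/test.txt
```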
Testing NFS Server Failure
Now try deleting the NFS server pod. When it comes up again, get a shell into the pod. If the NFS server pod was not scheduled onto the same node, the files you created previously are gone. Unfortunately, the app containers have also lost the mount path. It has gone missing, but since the app containers are healthy, there is no restart. Trying to delete the pods manually now also causes them to get stuck in the Terminating state. It is possible to force delete the pod, but the resulting warning is not very comforting:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
So the app pods cannot terminate properly if the NFS server uses emptyDir and gets restarted. Worse, they silently lose the mount point but continue running. It could be possible to redefine the liveness command and probe, but that might not solve the stuck termination.
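A hedged sketch of such a probe, which I have not verified fixes the stuck termination: fail liveness if the mount path is no longer writeable, forcing a restart. The mount path is illustrative.

```yaml
# Hypothetical liveness probe for the app containers: if the NFS mount
# has silently disappeared, the touch fails and the container restarts.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - touch /app/public/downloads/.probe
  initialDelaySeconds: 30
  periodSeconds: 10
```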
I tried hostPath next. On GKE, the / directory is not writeable, even with sudo. Specify a directory in the /home folder instead. That seemed to work at first, until I tried forcing the pod onto a different node with nodeSelector. When the new pod came up on the other node, the same problem of a disappearing mount path occurred.
Back to PersistentDisks
All the problems above go away if a GCE persistent disk is used. Since it costs only $0.04 a month for 1 GB and there are other tasks to focus on, I shall just stick with using a persistent disk to back the NFS server’s PersistentVolume.
Coursemology Complications
After publishing the original post above, I got programming evaluations working and had a chance to test the file download. It failed! The Nginx container could not read the file.
Some investigation revealed that the Sidekiq worker writes files with 700 permissions. The files are generated in a directory, so without execute permission for others on the directory, Nginx cannot read the file inside it. Here is what the permissions look like:
drwx--S--- 2 10000 nogroup 4096 Mar 15 03:06 d20180315-1-1pzonya
The same thing happens on the existing VM-based deployment, but SSHFS is used instead, and that allows specific users to be mapped to the shared filesystem. On the VM deployment, the worker sees the files as owned by its own user, while Nginx sees them as the www-data user.
NFS can be configured to map all uids and gids to the anonymous user with the all_squash option. Unfortunately, for the NFS server image being used here, the options are hardcoded in the run script. It is possible to pass in more folders to be added to /etc/exports by passing arguments in the pod spec, but there is no way to modify the options.
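For reference, the extra folders are passed as container arguments; each one becomes another /etc/exports entry with the same hardcoded options (the second path here is illustrative):

```yaml
# The stock image's run script appends an export line per argument,
# but the export options attached to each entry cannot be changed.
containers:
  - name: nfs-server
    image: k8s.gcr.io/volume-nfs:0.8
    args:
      - /exports
      - /another-export
```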
A whole afternoon was spent trying various things such as mounting a ConfigMap to /etc/exports, but that did not work. The container would show up as running, but the NFS process had not started correctly, so the clients could not mount the volume. Inspecting the file after the NFS server started also implied that the ConfigMap worked, but the logs showed problems with duplicate exports. There is probably a race between loading the ConfigMap and the script appending to it.
Loading a ConfigMap as a file into a pod is also a little different from using ConfigMap data as a volume. There will be mounting errors if this is configured wrongly. The correct example is in the documentation.
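For completeness, this is roughly the shape of the attempted mount, projecting a single ConfigMap key as the file /etc/exports via subPath so the rest of /etc is not shadowed by the volume; the names are illustrative:

```yaml
# Illustrative: mount one ConfigMap key as a single file. Without subPath,
# the volume would replace the whole /etc directory.
containers:
  - name: nfs-server
    image: k8s.gcr.io/volume-nfs:0.8
    volumeMounts:
      - name: exports-config
        mountPath: /etc/exports
        subPath: exports
volumes:
  - name: exports-config
    configMap:
      name: nfs-exports
      items:
        - key: exports
          path: exports
```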
Eventually, the solution was to roll my own NFS server image. It is nearly identical to the one in the Kubernetes repository, but the export options have been replaced with the following snippet:
echo "$i *(rw,fsid=0,insecure,all_squash,anonuid=10000,anongid=65533)" >> /etc/exports
While the file permissions look the same as before from both the Nginx container and the worker pod, the download feature now works as clients are squashed to the user and group IDs with the correct permissions. This also makes the securityContext.fsGroup setting described above unnecessary.