Projects:Kubernetes: difference between revisions

From Hackerspace Nijmegen Wiki
m (Minnozz moved page HSNWiki:Kubernetes to Projects:Kubernetes over a redirect)

(83 intermediate revisions by 3 users not shown)
* '''Container''': Like with Docker, this is one 'guest environment' in which you can run anything. Usually, Kubernetes containers are, in fact, Docker containers.
* '''[https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/ Pod]''': The basic unit you actually schedule in Kubernetes. Usually, a Pod contains one Container, but a Pod can consist of multiple Containers which can be a very useful feature. More on that later. A Pod isn't durable: when anything causes a Pod to stop (e.g. the machine it runs on has a power outage), it won't be restarted automatically.
* '''[https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ Deployment]''': An indication of "desired state" of the cluster in terms of the pods you always expect to have. The Kubernetes system will always try to match these Deployments to what it's actually running. Basically, a Deployment is the way to start a Pod and ensure it stays running.
* '''[https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/ Job]''': An indication that you want to run some command to completion. When the cluster has a Job, it will keep re-creating its Pods until a given number of them complete successfully. Basically, a Job is the way to start a Pod and ensure it finishes once.
* '''[https://kubernetes.io/docs/concepts/storage/volumes/ Volume]''': As in Docker, changes to containers are temporary and will be gone when the container stops. If you want to keep those changes after a restart, like in Docker, you make a Volume. They are also useful to share data between containers. In Kubernetes, Volumes are kept over restarts of Containers, but not over restarts of Pods, unless they are Persistent. More on that later.
* '''[https://kubernetes.io/docs/concepts/services-networking/service/ Service]''': When your Pod contains some application, such as a webserver, you can make its TCP port available as a Service so that people (inside or outside the cluster) can connect to it. For an application you want to run redundantly, multiple Pods can be started; you'll configure them to share the same Service. This way, when you connect to the Service, you'll get one of the running Pods behind it. Instant redundancy!
<pre>
[...]
</pre>

The simplest thing we can do, now that we have a cluster, is run an arbitrary image (a Docker image, in this case <code>ubuntu:bionic</code>, pulled from Docker Hub) with an arbitrary command, like before:


<pre>
[...]
</pre>

In the command above, we gave <code>--restart=Never</code>. This option accepts three values:
 
* <code>Never</code>: When the pod exits, it will not be recreated. If this is given, <code>kubectl</code> will just create a Pod.
* <code>OnFailure</code>: When the pod exits with failure, it will be recreated, otherwise not. In other words, <code>kubectl</code> will start a Job to create the Pod. (If that doesn't make sense, quickly check up on Jobs in the Concepts section above!)
* <code>Always</code>: When the pod exits, it will be recreated. You guessed it: in this case, <code>kubectl</code> will create a Deployment. (If you didn't guess it, re-check the Concepts section!)
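
The mapping in the list above can be summarised as a small lookup table. This is only an illustrative Python sketch of the behaviour described here for the <code>kubectl</code> version used on this page; the function name <code>object_kind_for_restart</code> is made up for this example and is not part of any Kubernetes tool.

```python
# Illustrative only: which kind of object `kubectl run` creates for each
# --restart value, as described in the list above.
RESTART_TO_KIND = {
    "Never": "Pod",
    "OnFailure": "Job",
    "Always": "Deployment",
}


def object_kind_for_restart(restart="Always"):
    """Return the object kind created for a given --restart value.

    The default argument mirrors the default of --restart described in the text.
    """
    try:
        return RESTART_TO_KIND[restart]
    except KeyError:
        raise ValueError("unknown restart policy: %r" % restart)


print(object_kind_for_restart("Never"))  # Pod
print(object_kind_for_restart())         # Deployment
```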
 
The default is <code>--restart=Always</code> so you'll see the container is recreated like this:


<pre>
$ kubectl run -ti --image=ubuntu:bionic bash
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
root@bash-58654c7f4b-9bhcq:/# touch foobarbaz
[...]
</pre>

As you can see, the container restarted, as <code>/foobarbaz</code> did not exist anymore when re-attaching after the <code>exit</code>. Any state in the filesystem of the container/pod will be gone upon restart.


If you tried this, you can check and remove the deployment like this:
 
<pre>
$ kubectl get deployments     
NAME  READY  UP-TO-DATE  AVAILABLE  AGE
bash  1/1    1            1          3s
$ kubectl delete deployment bash
deployment.extensions "bash" deleted
</pre>
 
== Storage using Volumes ==
 
So let's try adding a Volume to our pod, to see if we can make some changes persistent. Kubernetes supports [https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes many types of volumes]; in this case we use <code>[https://kubernetes.io/docs/concepts/storage/volumes/#emptydir emptyDir]</code>, which is simply an initially empty directory stored on the node's local disk.
 
There is no command-line parameter to <code>kubectl run</code> to add volumes. Internally, <code>kubectl run</code> translates your command line into a JSON request to the Kubernetes API server; any additional settings would have to be added directly to that JSON. This can be done with the <code>--overrides</code> flag, but at this point it is probably easier to write those requests ourselves. We could use JSON for this, but many users use YAML instead, so we will too.
 
The command above, <code>kubectl run --restart=Never -ti --image=ubuntu:bionic bash</code>, translates to the following YAML:
 
<pre>
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
</pre>
 
As you can see, this is a Pod named "bash" with one container, also called "bash", running <code>ubuntu:bionic</code>.
 
To create this Pod and attach to it, we write the code above to <code>bash.yaml</code> and run these commands:
 
<pre>
$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# exit 0
exit
$ kubectl delete pod bash
</pre>
 
Now, we will recreate the pod with an <code>emptyDir</code> volume mounted at <code>/foo</code>.
 
<pre>
$ cat bash.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    emptyDir: {}
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
/dev/mapper/ubuntu--vg-root on /foo type ext4 (rw,relatime,errors=remount-ro,data=ordered)
root@bash:/# exit 0
exit
$ kubectl delete pod bash
</pre>
 
Of course, this volume isn't really persistent just yet; restarts of the pod will cause it to be recreated (the Volume has the "lifetime of the Pod") so it actually doesn't serve our purpose.
 
There are two ways to get around this:

* Instead of using an <code>emptyDir</code> volume, use a volume type that stores its contents outside the Pod, for example on an external service.
* Or, alternatively, make a PersistentVolume that exists outside our Pod, and then mount it.
 
== A volume type with persistency ==
 
As seen before, Kubernetes supports [https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes many volume types], some of which are naturally persistent because they store the data on an external service.
 
The quickest way to set up persistent storage is to set up an NFS server. (Remember that in production, you'll want to go for something redundant, such as Ceph or Gluster or clustered NFS.)
 
<pre>
root@kubetest4:~# apt-get install -y nfs-kernel-server
[...]
root@kubetest4:~# mkdir /persistent
root@kubetest4:~# chmod 0777 /persistent
root@kubetest4:~# cat <<EOF >>/etc/exports
/persistent *(rw,sync,no_subtree_check)
EOF
root@kubetest4:~# exportfs -ra
</pre>
 
Make sure the other nodes can access it, by checking that <code>mkdir /persistent && mount -t nfs <ipaddress>:/persistent /persistent</code> works. Also, create a file <code>/persistent/helloworld.txt</code> with some contents.
 
Now we can go ahead and create our pod with this volume (replacing <ipaddress> with the address of your NFS host again):
 
<pre>
$ cat bash.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    nfs:
      path: /persistent
      server: <ipaddress>
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
145.131.6.179:/persistent on /foo type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=145.131.6.179,local_lock=none,addr=145.131.6.179)
root@bash:/# cat /foo/helloworld.txt
Hello world!
</pre>
 
== Make a PersistentVolume with a claim ==
 
The Pod YAML file above has persistency, but with the downside that the definition describes where the files should be stored. Most of the time, definitions don't care about where the files are stored, but only what properties the storage has:
 
* How much storage can we use?
* Is it optimized for small files or for large files?
* Is it highly available?
* Is it backed up?
 
To implement this use case more accurately, Kubernetes has two object types called [https://kubernetes.io/docs/concepts/storage/persistent-volumes/ PersistentVolume and PersistentVolumeClaim]. The idea is that a cluster administrator creates PersistentVolumes (abbreviated <code>pv</code>) that know what kind of storage they represent and where to find it; then, users create PersistentVolumeClaims (abbreviated <code>pvc</code>) asking for storage with constraints like "at least 10 GB". When a PVC is created, it is matched to the closest fitting PV and the two are bound together. If no PV is available to fulfill a PVC, the PVC stays in "Pending" state. In a way, PersistentVolumes are like Nodes in the sense that they provide capacity, and PersistentVolumeClaims are like Pods in the sense that they use that capacity if it is available anywhere in the cluster.
 
In this section, we'll rewrite our Pod from before to use a PersistentVolumeClaim with its constraints, and then have that PersistentVolumeClaim automatically matched to a PersistentVolume that provides the same NFS share. The first thing we'll do is create the PersistentVolumeClaim:
 
<pre>
$ cat testing-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testing-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
$ kubectl apply -f testing-pvc.yaml
persistentvolumeclaim/testing-pvc created
$ kubectl get pvc
NAME            STATUS    VOLUME  CAPACITY  ACCESS MODES  STORAGECLASS  AGE
testing-pvc      Pending                                                    6s
</pre>
 
As you can see, the <code>testing-pvc</code> PVC is created, but it is Pending, because we haven't yet supplied the cluster with any PersistentVolume that can fulfill it. So, we create a PersistentVolume:
 
<pre>
$ cat nfs-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-storage
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /persistent
    server: <IP address>
$ kubectl apply -f nfs-storage.yaml
$ kubectl get pv       
NAME          CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS  CLAIM                    STORAGECLASS  REASON  AGE
nfs-storage  10Gi      RWO            Retain          Bound    default/testing-pvc                              18s
</pre>
 
It might take a moment for the PVC to be bound, but sure enough, after a bit, it is:
 
<pre>
$ kubectl get pvc
NAME            STATUS  VOLUME        CAPACITY  ACCESS MODES  STORAGECLASS  AGE
testing-pvc      Bound    nfs-storage  10Gi      RWO                          3m15s
</pre>
 
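Note that we asked for only 2 GiB, yet the claim now reports a capacity of 10 GiB. A PVC is bound to the smallest available volume that satisfies its request, and our 10 GiB volume is the only candidate, so the claim gets all of it.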
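
To make the matching rule a bit more concrete, here is a toy model in Python. It is a simplified sketch and not the real Kubernetes binder, which also takes storage classes, label selectors, volume modes and more into account. The helper names <code>parse_quantity</code> and <code>match_claim</code> and the dictionary layout are assumptions made for this example.

```python
# Toy model of PVC-to-PV matching; NOT the real Kubernetes binder.

# Binary suffixes used by Kubernetes quantities such as "2Gi".
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text):
    """Convert a quantity like '10Gi' to bytes (binary suffixes only)."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)  # plain number of bytes


def match_claim(claim, volumes):
    """Return the name of the smallest unbound volume satisfying the claim, or None."""
    wanted = parse_quantity(claim["storage"])
    candidates = [
        v for v in volumes
        if v["claim"] is None
        and parse_quantity(v["storage"]) >= wanted
        and set(claim["accessModes"]) <= set(v["accessModes"])
    ]
    if not candidates:
        return None  # the claim would stay Pending
    return min(candidates, key=lambda v: parse_quantity(v["storage"]))["name"]


volumes = [
    {"name": "nfs-storage", "storage": "10Gi",
     "accessModes": ["ReadWriteOnce"], "claim": None},
]
claim = {"name": "testing-pvc", "storage": "2Gi", "accessModes": ["ReadWriteOnce"]}

print(match_claim(claim, volumes))  # nfs-storage
print(match_claim(claim, []))       # None
```

With a second, smaller free volume of, say, 5Gi in the list, this model would pick that one instead.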
 
And let's create a Pod to use it:
 
<pre>
$ cat bash-pvc.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    persistentVolumeClaim:
      claimName: testing-pvc
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash-pvc.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
145.131.6.179:/persistent on /foo type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=145.131.6.179,local_lock=none,addr=145.131.6.179)
root@bash:/# cat /foo/helloworld.txt
Hello world!
</pre>
 
This might seem like an insignificant victory, since we're getting the same end result, but it's a crucial step forward: our Pods, Jobs and Deployments don't need to care anymore where their storage comes from, only that it is persistent.
 
There are some more benefits to using Persistent Volumes and Persistent Volume Claims:

* If a PVC is deleted while there is still a Pod using it, it will switch state to <code>Terminating</code> but will not disappear until the Pods are gone, too.
* PVs have a reclaim policy. It was <code>Retain</code> in the output above, but it can be set to <code>Delete</code> or <code>Recycle</code>, allowing the PV's contents to be deleted as well when its PVC is deleted.
* When StorageClasses are used, PVCs can only be bound to a PV with the same Storage Class. When no free PV exists within the Storage Class while a PVC is waiting to be bound, a PV can be automatically created. This is called [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#provisioning Dynamic provisioning].
* It is possible to make a PV with a "Node Affinity", causing any Pods using that PV to run on a specific node. This combines very well with the HostPath volume type, as this allows a bind-mount of some directory on a Node to be accessible within a Pod.
** But, it should also be obvious that this is a security risk if you allow untrusted users to create PVs and use them in Pods. TO DO: Add a section on protecting this.
 
== Running a Deployment using this volume ==
 
Now that we know how to persistently store data and access it from Pods, it's time to create an actual Deployment with a real-world application in it.
 
<pre>
$ cat nginx.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-html-storage
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /persistent/html
    server: <IP of nfs server>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-webfiles
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      volumes:
      - name: webfiles
        persistentVolumeClaim:
          claimName: nginx-webfiles
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: webfiles
</pre>
 
Now, before starting this deployment, make an example web page by creating <code>/persistent/html/index.html</code> with something like:
 
<pre>
<nowiki><h1>It's working!</h1></nowiki>
</pre>
 
Then, start the PV, PVC and deployment:
 
<pre>
$ kubectl apply -f nginx.yaml
</pre>
 
A note about <code>kubectl apply</code> ("declarative management") versus <code>kubectl create</code> ("imperative management"): in this case, <code>apply</code> and <code>create</code> would do the same thing, as the objects described in <code>nginx.yaml</code> don't exist yet. However, if you were to change <code>nginx.yaml</code> and run <code>kubectl create</code> again, you'd get an error. "Imperative management" (create, delete, replace) means you're telling kubectl what action is necessary, while "declarative management" means you're telling kubectl what the state of the cluster should be, and it will perform the correct action for you. Both are fine in a production context; from now on, this page will use <code>apply</code> where possible since that seems to be the community consensus in tutorials.
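
The difference between the two styles can be sketched with a toy in-memory "cluster" in Python. This only models the behaviour described in the note and says nothing about how kubectl or the API server work internally; the class <code>ToyCluster</code> and its return strings are assumptions made for this example.

```python
# Toy in-memory "cluster" contrasting imperative create with declarative apply.

class AlreadyExists(Exception):
    """Raised when create() is called for an object that is already stored."""


class ToyCluster:
    def __init__(self):
        self.objects = {}  # (kind, name) -> spec

    def create(self, kind, name, spec):
        """Imperative: perform exactly this action, fail if the object exists."""
        key = (kind, name)
        if key in self.objects:
            raise AlreadyExists("%s/%s already exists" % (kind, name))
        self.objects[key] = dict(spec)
        return "created"

    def apply(self, kind, name, spec):
        """Declarative: make the stored state equal to spec, whatever it was."""
        key = (kind, name)
        if key not in self.objects:
            self.objects[key] = dict(spec)
            return "created"
        if self.objects[key] == spec:
            return "unchanged"
        self.objects[key] = dict(spec)
        return "configured"


cluster = ToyCluster()
print(cluster.apply("Deployment", "nginx-deployment", {"replicas": 2}))  # created
print(cluster.apply("Deployment", "nginx-deployment", {"replicas": 3}))  # configured
try:
    cluster.create("Deployment", "nginx-deployment", {"replicas": 3})
except AlreadyExists as exc:
    print("error:", exc)  # error: Deployment/nginx-deployment already exists
```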
 
Let's check if the deployment has been created and the pods as well:
 
<pre>
$ kubectl get deployments
NAME              READY  UP-TO-DATE  AVAILABLE  AGE
nginx-deployment  1/2    2            1          11h
$ kubectl get pods                                     
NAME                                READY  STATUS              RESTARTS  AGE
nginx-deployment-58b6c946d5-fnqr6  1/1    Running            0          49s
nginx-deployment-58b6c946d5-p2nlm  0/1    ContainerCreating  0          49s
</pre>
 
One pod has been created, the other one is still in <code>ContainerCreating</code> state. Let's check why...
 
<pre>
$ kubectl describe pod nginx-deployment-58b6c946d5-p2nlm
[....]
  Warning  FailedMount  15s  kubelet, kubetest1  (combined from similar events): MountVolume.SetUp failed for volume "websource" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/fac670f9-47d7-11e9-a977-001dd8b7660c/volumes/kubernetes.io~nfs/websource --scope -- mount -t nfs 145.131.6.179:/persistent/html /var/lib/kubelet/pods/fac670f9-47d7-11e9-a977-001dd8b7660c/volumes/kubernetes.io~nfs/websource
Output: Running scope as unit run-r368fd7089b0a46139882e708a89f8926.scope.
mount: wrong fs type, bad option, bad superblock on 145.131.6.179:/persistent/html,
      missing codepage or helper program, or other error
      (for several filesystems (e.g. nfs, cifs) you might
      need a /sbin/mount.<type> helper program)
 
      In some cases useful info is found in syslog - try
      dmesg | tail or so.
</pre>
 
In other words, the pod is failing to start because the volume <code>websource</code> (our NFS mount) cannot be mounted. You can see from the output that this is running on the kubelet for <code>kubetest1</code>, and the error comes from <code>mount -t nfs 145.131.6.179:/persistent/html/...</code>. The error is correct: we need an NFS mount helper tool that isn't installed on kubetest1. I ran <code>apt-get install nfs-common</code> on it, and sure enough, the pod is soon running:
 
<pre>
$ kubectl get pods
NAME                                READY  STATUS    RESTARTS  AGE
nginx-deployment-58b6c946d5-fnqr6  1/1    Running  0          6m42s
nginx-deployment-58b6c946d5-p2nlm  1/1    Running  0          6m42s
</pre>
 
= Making services accessible over TCP =
 
== Accessing Pods and Deployment using port-forward ==
 
Now, we'd like to see nginx in action! The pod is listening on an internal port 80 (according to its configuration and the default nginx config). If we want to access this from outside the cluster, there are three ways:
 
* <code>kubectl port-forward</code>, which listens to a local port on the machine where you run <code>kubectl port-forward</code>, and then forwards all connections to a pod. (I've explained above how to configure kubectl so it can run on your own machine when the cluster runs elsewhere.)
* Creating a Service, in the next section
* Creating an Ingress, in the section after that
 
Let's try creating a port-forward to the pod first:
 
<pre>
$ kubectl port-forward pod/nginx-deployment-58b6c946d5-fnqr6 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
</pre>
 
Now, visit <code>http://127.0.0.1:8080</code> in your browser and presto!
 
You can also do a port-forward to a deployment:
 
<pre>
$ kubectl port-forward deployment/nginx-deployment 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
</pre>
 
When you start a port-forward to a deployment, the deployment is resolved to a random one of the <code>Running</code> pods in the deployment. So, in this case, you'll get a response from ''either'' of the two pods; if one of them is down, you'll get it from the one that's up. (If both are down, you're out of luck.) However, when that resolved pod goes down for whatever reason, the port-forward is '''not''' restarted to another pod in the deployment. You can try this out by running <code>kubectl delete pod nginx-deployment-...</code> on your pods: the deployment will cause them to be restarted, but the port-forward will cease to work once you've deleted the right one.
 
== Accessing a deployment using a Service ==
 
So, what should we do if we want the application to be reachable even if its Pods go down? The first method is to create a [https://kubernetes.io/docs/concepts/services-networking/service/ Service]. A Service describes how an application should be accessible. There are multiple types of Services, corresponding to multiple interpretations of "accessible":
 
* <code>ClusterIP</code> is a service type indicating that the application should be only internally accessible using a "virtual service IP" (as described above). This service IP will be allocated by Kubernetes and distributed to all nodes and pods, so that a connection to the virtual service IP on the correct port will automatically end up on one of its running Pods.
* <code>NodePort</code> is a service type indicating that the application should be externally accessible using a "service port" on all Nodes. The service port will be allocated by Kubernetes (you can choose it, but that's not recommended) and distributed to all nodes, so that a connection to any node on the service port will automatically end up on one of its running Pods. A NodePort service also automatically gets a ClusterIP, so you can use that, too.
* <code>LoadBalancer</code> is a service type indicating that the application should be externally accessible using a provided load balancer. By default, this works like <code>NodePort</code>, but on specific cloud providers you'll also get an allocated external IP address on which the requested port listens; connections to it end up on one of the running Pods. I'm running this on my own cluster, not one hosted by a cloud provider, so I won't create a <code>LoadBalancer</code> service. If you'd like to, [https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer this page] explains how they work.
** There is a controller that allows you to use <code>LoadBalancer</code> services on your own bare-metal cluster that doesn't run on a cloud provider. It's called [https://metallb.universe.tf/ MetalLB] and it works by having a pool of external routable IP addresses to choose from; when allocating from that pool, it starts announcing that IP address via either ARP or BGP onto an arbitrary Node, so that traffic to that IP ends up there. If the Node goes down, MetalLB elects a new leader node and re-announces the IP there, so that the service is moved.
* <code>ExternalName</code> doesn't actually set up any forwarding, but allows you to register an internal name that forwards to a given name in DNS elsewhere. This allows migration to/from Kubernetes.
* Not a service type, but if your service uses HTTP, you can use Ingress instead of Service to make your service externally accessible. More on that later.
 
Since we want our service to be externally accessible, we'll make a NodePort service:
 
<pre>
$ cat service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
  type: NodePort
$ kubectl apply -f service.yaml
$ kubectl get services
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP  PORT(S)        AGE
kubernetes      ClusterIP  10.16.0.1      <none>        443/TCP        8d
nginx-service  NodePort    10.16.247.178  <none>        80:31106/TCP  4s
</pre>
 
When we define the Service, we provide a Selector which defines the Pods supplying this Service. You can supply arbitrary labels to Pods; the <code>app=nginx</code> label was initially set in our Deployment (see the <code>metadata</code> section) and is inherited by all Pods created by it, so the Service will use them directly.
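
As a rough illustration of how such a selector picks Pods, here is a small Python sketch. It only models equality-based label matching as described above; the pod names and the extra <code>pod-template-hash</code> values are made up for this example.

```python
def selector_matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def pods_for_service(selector, pods):
    """Return the names of the pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items()
            if selector_matches(selector, labels)]


# Hypothetical pods; names and extra labels are invented for this example.
pods = {
    "nginx-deployment-aaaaa": {"app": "nginx", "pod-template-hash": "58b6c946d5"},
    "nginx-deployment-bbbbb": {"app": "nginx", "pod-template-hash": "58b6c946d5"},
    "bash": {},
}

print(pods_for_service({"app": "nginx"}, pods))
# ['nginx-deployment-aaaaa', 'nginx-deployment-bbbbb']
```

Extra labels on a Pod don't get in the way: only the keys named in the selector are compared.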
 
As you can see, our <code>nginx-service</code> service is now up and it got external port 31106. Indeed, when we go to <code>http://<IP>:31106/</code> (replacing <IP> with the IP of any of our nodes) we can see the page again! Also, when we <code>kubectl delete pod ...</code> arbitrary pods within the deployment, they are restarted automatically, and accesses to the external IP/port keep working as long as at least one pod is <code>Running</code>.
 
There is an alternative way to make a Service externally reachable that can be convenient: you can set an <b>externalIP</b> on a <code>ClusterIP</code> type Service and any Node with that IP will listen on the indicated Service Port -- but that comes with a big fat warning: it introduces a single point of failure into your Service, as it will be unreachable if that Node is down! Yet, it can be very convenient especially for bare-metal clusters, so I'll show you how to do it:
 
<pre>
$ cat service-externalip.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service-externalip
spec:
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
  type: ClusterIP
  externalIPs:
  - "145.131.6.177" # Replace this with one of your node's IPs, of course!
$ kubectl apply -f service-externalip.yaml
$ kubectl get services
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
kubernetes                ClusterIP  10.16.0.1      <none>          443/TCP        19d
nginx-service              NodePort    10.16.247.178  <none>          80:31106/TCP  11d
nginx-service-externalip  ClusterIP  10.16.165.143  145.131.6.177  80/TCP        3s
</pre>
 
Sure enough, if you visit your external IP on port 80, you should see the same page served by Nginx appear! As described before, you can have a similar approach without having a fixed Node to connect to; the controller [https://metallb.universe.tf/ MetalLB] chooses a node randomly then uses ARP or BGP to announce an IP address on it. But, this sort of setup only works in a controlled subnet so I can't try it on this cluster.
 
== A summary so far ==
 
We've talked about:
 
* Nodes, the machines (usually physical or VMs) that together form the Kubernetes cluster
** Master nodes are nothing special, except they (also) run Pods that together form the Kubernetes Control Plane
* Pods, the basic unit of scheduling; they run on Nodes and consist of at least one Container running an actual Docker Image
** Pods have an IP address within the Pod networking range
* Deployments, which are a way to tell Kubernetes to always have some type of Pod running
* Jobs, which are a way to tell Kubernetes to keep running some type of Pod until it finishes successfully
* Services, which are a way to make some application in Pods accessible over TCP (inside and/or outside the cluster)
** Services can have a 'virtual' IP address within the Service networking range, they can have a NodePort all Nodes listen on, and/or they can have an external IP statically or dynamically provided by a LoadBalancer.
* Volumes, which provide various kinds of storage to Pods
** Persistent Volumes are provided by the cluster administrator to allow storage
** Persistent Volume Claims claim such volumes for some user
** Pods can have a Persistent Volume Claim attached to them, making the contents of the volume actually usable
 
== Accessing a Deployment using an Ingress ==
 
[https://kubernetes.io/docs/concepts/services-networking/ingress/ Ingresses] are like Services, but for HTTP only. This specialisation allows adding a number of additional features, such as having multiple applications behind one URL or hostname (e.g. micro-services), SSL termination and splitting load between different versions of the same service (canarying).
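
To give an idea of what "multiple applications behind one hostname" means, here is a conceptual Python sketch of host and path based routing. It is a naive model using plain string prefixes, not how any particular Ingress Controller matches requests; the hostname <code>example.org</code> and the backend <code>api-service</code> are hypothetical.

```python
# Hypothetical routing table: (host, path prefix, backend Service name).
RULES = [
    ("example.org", "/", "nginx-service"),
    ("example.org", "/api", "api-service"),
]


def route(host, path, rules=RULES):
    """Return the backend for a request, preferring the longest matching prefix."""
    matches = [
        (prefix, backend)
        for rule_host, prefix, backend in rules
        if rule_host == host and path.startswith(prefix)
    ]
    if not matches:
        return None  # no rule matches this request
    return max(matches, key=lambda m: len(m[0]))[1]


print(route("example.org", "/index.html"))  # nginx-service
print(route("example.org", "/api/users"))   # api-service
print(route("other.example", "/"))          # None
```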
 
At the time of writing, Ingress is in beta (v1beta1), meaning that the feature is well-tested and will continue to exist, but details may change. Consider this before using it in production.
 
Like LoadBalancer Services, creating an Ingress does not immediately change anything in the cluster. You need to have an Ingress Controller for anything to change in the cluster after you create an Ingress. There are many [https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/ Ingress Controller plugins] to choose from; I will try [https://github.com/containous/traefik Traefik] since it supports Let's Encrypt out of the box. (Some cloud providers may provide an Ingress Controller out of the box.)
 
First of all, we set up Traefik. For this, we'll need to create some object types we haven't seen before: service accounts, cluster roles, cluster role bindings and config maps. Bear with me for a bit while we set up Traefik:
 
<pre>
$ cat traefik-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: traefik-ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - secrets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: traefik-ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: traefik-ingress-controller
subjects:
- kind: ServiceAccount
  name: traefik-ingress-controller
  namespace: kube-system
$ kubectl apply -f traefik-account.yaml
serviceaccount/traefik-ingress-controller created
clusterrole.rbac.authorization.k8s.io/traefik-ingress-controller created
clusterrolebinding.rbac.authorization.k8s.io/traefik-ingress-controller created
</pre>
 
Now we create a ConfigMap for Traefik's configuration:
 
<pre>
$ cat traefik-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-configmap
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http", "https"]
    insecureSkipVerify = true
 
    [entryPoints]
      [entryPoints.http]
        address = ":80"
      [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]
      [entryPoints.admin]
        address = ":8080"
 
    [kubernetes]
      [kubernetes.ingressEndpoint]
        publishedService = "kube-system/traefik-ingress-service-external"
 
    [api]
    entryPoint = "admin"
$ kubectl apply -f traefik-configmap.yaml
configmap/traefik-configmap created
</pre>
 
That being done, we now start the Traefik deployment:
 
<pre>
$ cat traefik.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
    name: traefik-ingress
    namespace: kube-system
    labels:
        k8s-app: traefik-ingress-lb
spec:
    replicas: 1
    selector:
        matchLabels:
            k8s-app: traefik-ingress-lb
    template:
        metadata:
            labels:
                k8s-app: traefik-ingress-lb
                name: traefik-ingress-lb
        spec:
            volumes:
            - name: traefik-configmap
              configMap:
                name: traefik-configmap
            serviceAccountName: traefik-ingress-controller
            terminationGracePeriodSeconds: 60
            containers:
            - image: traefik
              name: traefik-ingress-lb
              ports:
              - name: web
                containerPort: 80
              - name: https
                containerPort: 443
              - name: admin
                containerPort: 8080
              volumeMounts:
              - mountPath: "/config"
                name: "traefik-configmap"
              args:
              - --loglevel=INFO
              - --configfile=/config/traefik.toml
$ kubectl apply -f traefik.yaml
deployment.extensions/traefik-ingress created
</pre>
 
What did this do?
* We created the service account and privileges Traefik needs to find Ingresses, Services and Endpoints.
* We created a ConfigMap, a hard-coded type of Volume that is commonly used to supply configuration inside Pods. This ConfigMap causes Traefik to listen on ports 80, 443 and 8080.
* Then, we created a Deployment that runs the Traefik image with the given configmap and service account.
* Note that you won't find these deployments and pods using the normal <code>kubectl get pods</code> (etc) commands unless you give <code>-n kube-system</code> to select the kube-system namespace.
 
You should see a <code>traefik-ingress-...</code> pod with status <code>Running</code> when you run <code>kubectl get pods -n kube-system</code>; if that's not the case, you should stop here and investigate what's wrong.


To use Traefik, we'll configure two things:
* External connections end up at it
* It reads the hostname and path of requests, and sends them onwards to the correct Service


The first part we've discussed before: on a cloud provider, you'd set up a LoadBalancer Service; if you're not on one, like me, you can set up a ClusterIP Service with an ExternalIP. The earlier side-note about this being a single point of failure applies here as well. (Note that we expose only ports 80 and 443, not 8080, which is the administrator port of Traefik.)
 
<pre>
$ cat traefik-service-external.yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik-ingress-service-external
  namespace: kube-system
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 443
      name: https
  externalIPs:
  - "145.131.8.75"
</pre>
 
The <code>externalIPs</code> mentioned here should be the external IP of one of your Nodes. At this point you can also create a record in DNS to point to this IP address if you want; I created <code>kubetest.sjorsgielen.nl IN A 145.131.8.75</code>.
 
Having this set up should cause <code>http://kubetest.sjorsgielen.nl/</code> to end up within Traefik. It will give a "404 page not found" result, as Traefik doesn't know about any Ingresses yet to forward your request to.
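
For comparison, on a cloud provider the same Service would presumably be declared with <code>type: LoadBalancer</code> instead of an <code>externalIPs</code> entry. The following is an untested sketch, assuming your provider implements LoadBalancer Services and assigns the external address itself; the name is kept identical so that the <code>publishedService</code> setting in the ConfigMap still matches.

```yaml
# Sketch only: cloud-provider variant of traefik-service-external.yaml.
# Assumes the provider implements LoadBalancer Services and assigns the IP.
apiVersion: v1
kind: Service
metadata:
  name: traefik-ingress-service-external
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 443
      name: https
```

Once the provider has assigned an address, it should show up in the EXTERNAL-IP column of <code>kubectl get service -n kube-system</code>, and that is the address your DNS record would point to.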
 
You can check the Traefik dashboard to see that it's up. Currently, we'll need a port-forward for that:
 
<pre>
$ kubectl port-forward -n kube-system deployment/traefik-ingress 8080:8080
Forwarding from [::1]:8080 -> 8080
Forwarding from 127.0.0.1:8080 -> 8080
</pre>
 
Now, visit <code>http://localhost:8080/</code> and you should see the Traefik dashboard. It will show no frontends and no backends, as we haven't created any Ingresses yet for it to route. So let's create one for our Nginx service:
 
<pre>
$ cat ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
    name: nginx-ingress
    annotations:
        traefik.frontend.rule.type: PathPrefixStrip
spec:
    rules:
    - host: kubetest.sjorsgielen.nl
      http:
        paths:
        - path: /nginx
          backend:
            serviceName: nginx-service
            servicePort: 80
$ kubectl apply -f ingress.yaml
ingress.extensions/nginx-ingress created
</pre>
 
So what does this mean?
 
* It's an Ingress type, meaning it's a message to the cluster/Traefik that we want to have a Service externally accessible over HTTP.
* The service will be reachable on the Host <code>kubetest.sjorsgielen.nl</code> -- this acts like a sort of virtual server in Apache, where different hosts can serve different content.
* The request Path must begin with <code>/nginx</code>; the <code>traefik.frontend.rule.type: PathPrefixStrip</code> annotation will cause the <code>/nginx</code> prefix to be stripped off before the request is forwarded.
* The requests will be forwarded to the <code>nginx-service</code> service on port 80.
 
In other words, http://kubetest.sjorsgielen.nl/nginx/index.html will be forwarded to http://nginx-service/index.html. And indeed, it shows the same Nginx page again! If you go to the Traefik dashboard again, you'll see that the frontend and backend have appeared, and the Health tab shows the average response time.
 
Now, you could replace your port-forward to the Traefik dashboard with a Service and an Ingress so you can make it externally accessible on your hostname (or a different one) as well. I'll leave that as an exercise to you!
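
If you want to compare notes afterwards, here is one possible shape of a solution. It is a sketch, not something taken from my cluster: the names <code>traefik-dashboard-service</code> and <code>traefik-dashboard-ingress</code> and the hostname <code>traefik.example.org</code> are made up, and I'm assuming the dashboard gets its own hostname rather than a path prefix.

```yaml
# Sketch only: expose the Traefik dashboard (admin entry point, port 8080).
# traefik-dashboard-service, traefik-dashboard-ingress and the hostname
# traefik.example.org are made-up names; substitute your own.
apiVersion: v1
kind: Service
metadata:
  name: traefik-dashboard-service
  namespace: kube-system
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 8080
      name: admin
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: traefik-dashboard-ingress
  namespace: kube-system
spec:
  rules:
  - host: traefik.example.org
    http:
      paths:
      - path: /
        backend:
          serviceName: traefik-dashboard-service
          servicePort: 8080
```

The Ingress lives in <code>kube-system</code> because an Ingress can only refer to Services in its own namespace. Keep in mind that with the configuration shown earlier, the dashboard has no authentication, so anyone who can reach the hostname can see it.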
 
== Let's encrypt this ==
 
There's one very nice feature of Traefik I didn't want you to miss out on. It of course supports TLS, and it can automatically get your certificates through any ACME provider such as Let's Encrypt.
 
For this, we change our ConfigMap to include a <code>[acme]</code> section and also to auto-forward all HTTP requests to HTTPS:
 
<pre>
$ cat traefik-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-configmap
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http", "https"]
    insecureSkipVerify = true
 
    [entryPoints]
      [entryPoints.http]
        address = ":80"
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]
      [entryPoints.admin]
        address = ":8080"
 
    [acme]
    email = 'your e-mail address'
    storage = "acme.json"
    caServer = "https://acme-v01.api.letsencrypt.org/directory"
    entryPoint = "https"
    onDemand = true
      [acme.httpChallenge]
      entryPoint = "http"
 
    [kubernetes]
      [kubernetes.ingressEndpoint]
        publishedService = "kube-system/traefik-ingress-service-external"
 
    [api]
    entryPoint = "admin"
$ kubectl apply -f traefik-configmap.yaml
configmap/traefik-configmap configured
</pre>
 
Now, unfortunately, changing a ConfigMap doesn't automatically restart the Pods that use it, so Traefik keeps running with the configuration it read at startup. So, we can destroy our Pod and the Deployment will recreate it with the new configuration:
 
<pre>
$ kubectl get pods -n kube-system | grep traefik
traefik-ingress-6dcd896c78-7w2k6      1/1    Running  0          8d
$ kubectl delete pod traefik-ingress-6dcd896c78-7w2k6 -n kube-system
$ kubectl get pods -n kube-system | grep traefik
traefik-ingress-6dcd896c78-8gl9t      1/1    Running  0          15s
</pre>
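
Deleting the Pod by hand works, but it's easy to forget. A common alternative is to change something in the Deployment's Pod template whenever the ConfigMap changes, since any change to the template makes the Deployment roll out new Pods. Below is a sketch of the relevant part of <code>traefik.yaml</code>; the annotation name <code>configmap-revision</code> is something I made up for illustration, and the fragment is meant to be merged into the full file, not applied on its own.

```yaml
# Fragment of traefik.yaml (sketch; only the changed part is shown).
# "configmap-revision" is a hypothetical annotation name; Kubernetes attaches
# no meaning to it, the changed value just alters the Pod template.
spec:
    template:
        metadata:
            labels:
                k8s-app: traefik-ingress-lb
                name: traefik-ingress-lb
            annotations:
                configmap-revision: "2"
```

Bumping the value and re-running <code>kubectl apply -f traefik.yaml</code> should then replace the Pod without a manual delete.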
 
Traefik will start requesting a TLS certificate when the first TLS request comes in. It may take a minute for the Let's Encrypt challenge to complete, but after this, you should be able to access your hostname via HTTPS and it should present a valid certificate. In my case, https://kubetest.sjorsgielen.nl/nginx gives the same working page! Also, since we've configured the HTTP-to-HTTPS redirect, http://kubetest.sjorsgielen.nl/nginx just redirects there. Hassle-free TLS, done!
 
= Creating your own images =
 
So far, we've usually set up the standard container <code>ubuntu:bionic</code>. It's pulled from the Docker Hub at https://hub.docker.com/_/ubuntu. Docker Hub is a central registry for images. In the same way you can pull many images from there, such as the minimal Linux image <code>alpine</code> or the image running in our Traefik pod, <code>traefik</code>.
 
But, if we want to run our own Docker images inside Kubernetes, it will need to be able to pull them as well. This can be done by uploading our images to Docker Hub, but for our own experimentation, let's set up our own registry and plug Kubernetes into it.
 
To begin with, the registry will need storage for its images. True to our earlier experiments, we start by creating a persistent volume claim. (I'll assume there's a persistent volume to fulfill it; if not, check above how to create one yourself.)
 
<pre>
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-files
spec:
  storageClassName: default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
</pre>
 
The registry deployment:
 
<pre>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry
spec:
  selector:
    matchLabels:
      app: registry
  replicas: 1
  template:
    metadata:
      labels:
        app: registry
    spec:
      volumes:
      - name: registrystorage
        persistentVolumeClaim:
          claimName: registry-files
      containers:
      - name: registry
        image: registry:2
        ports:
        - containerPort: 5000
        volumeMounts:
        - mountPath: /var/lib/registry
          name: registrystorage
</pre>
 
And a Service + Ingress to make it accessible on a new hostname. I found that Docker doesn't support accessing a registry with a path prefix, so we have to give it its own hostname. Luckily, with Traefik, it's easy to route; you'll only have to add a record in DNS.
 
<pre>
apiVersion: v1
kind: Service
metadata:
  name: registry-service
spec:
  selector:
    app: registry
  ports:
    - name: registry
      port: 5000
      protocol: TCP
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: registry-ingress
spec:
  rules:
  - host: kuberegistry.sjorsgielen.nl
    http:
      paths:
      - path: /
        backend:
          serviceName: registry-service
          servicePort: 5000
</pre>
 
After a minute, as before, https://kuberegistry.sjorsgielen.nl/v2/ (replace with your own hostname) should return 200 OK with a page content of "{}".
 
To test whether it's working, let's take the Ubuntu Docker image and push it onto our registry, as per [https://docs.docker.com/registry/ more or less these instructions]. Here, it's important that the registry is well-reachable over HTTPS, as Docker will only allow non-SSL HTTP communication over localhost! (Although you could get around this with a <code>kubectl port-forward</code>.)
 
<pre>
$ docker pull ubuntu
$ docker image tag ubuntu kuberegistry.sjorsgielen.nl/myubuntu
$ docker push kuberegistry.sjorsgielen.nl/myubuntu
[...] Retrying in 10 seconds
</pre>
 
That seems to fail. As before, we can figure out the root cause by getting the logs of the Registry pod:
 
<pre>
$ kubectl logs registry-6bf4dbcfb-9csf5
[...]
time="2019-03-28T21:44:04.465658668Z" level=error msg="response completed with error" err.code=unknown err.detail="filesystem: mkdir /var/lib/registry/docker: permission denied" err.message="unknown error" go.version=go1.11.2 http.request.host=kuberegistry.sjorsgielen.nl http.request.id=c00f2785-30b0-469d-bcff-70a12c0f604b http.request.method=POST http.request.remoteaddr=10.107.160.0 http.request.uri="/v2/myubuntu/blobs/uploads/" http.request.useragent="docker/18.06.1-ce go/go1.10.4 git-commit/e68fc7a kernel/4.4.0-112-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.1-ce \(linux\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=125.482304ms http.response.status=500 http.response.written=164 vars.name=myubuntu
</pre>
 
A "permission denied" error in "mkdir /var/lib/registry/docker". Now, we may not know the PersistentVolume behind whatever is mounted in the registry, but we can quickly find out by checking <code>kubectl describe deployment registry</code>, <code>kubectl get pvc</code> and <code>kubectl describe pv registry-storage</code>. In my case, it's because root squashing is enabled on my NFS mount and the directory is being accessed by root, therefore by an anonymous uid/gid, which doesn't have rights in the directory. It's easily fixed and now the push works:
 
<pre>
$ docker push kuberegistry.sjorsgielen.nl/myubuntu
The push refers to repository [kuberegistry.sjorsgielen.nl/myubuntu]
b57c79f4a9f3: Pushed
d60e01b37e74: Pushed
e45cfbc98a50: Pushed
762d8e1a6054: Pushed
latest: digest: sha256:f2557f94cac1cc4509d0483cb6e302da841ecd6f82eb2e91dc7ba6cfd0c580ab size: 1150
</pre>
 
Now, let's make our own Docker image, push it, and start it in a Pod!
 
Here's an example Dockerfile that runs a tiny Perl-based webserver that always responds with "Hello World!":
 
<pre>
$ cat Dockerfile
FROM ubuntu:bionic
 
RUN apt-get update \
&& apt-get install -y libmojolicious-perl \
&& rm -rf /var/lib/apt/lists/*
 
# Normally, you'd use COPY here, but I wanted to keep this in one file
RUN echo "#!/usr/bin/env perl"                      >>/app.pl \
&& echo "use Mojolicious::Lite;"                    >>/app.pl \
&& echo "get '/' => sub {"                          >>/app.pl \
&& echo "  shift->render(text => 'Hello World!'); " >>/app.pl \
&& echo "};"                                        >>/app.pl \
&& echo "app->start;"                              >>/app.pl \
&& chmod +x /app.pl
 
EXPOSE 3000
CMD ["/app.pl", "daemon", "-l", "http://*:3000"]
$ docker build -t kuberegistry.sjorsgielen.nl/helloworld:latest .
$ docker push kuberegistry.sjorsgielen.nl/helloworld:latest
</pre>
 
At this point, you should be able to write a Deployment, Service and Ingress for this application, using the examples above. <code>kubectl apply</code> should then start the Pod, Traefik should route the service and whatever host/path you configured should quickly be reachable and respond with "Hello World!". We've created our own image and run it on our cluster!
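
In case you get stuck, the three objects could look roughly like this. Treat it as a sketch under assumptions: the <code>helloworld</code> names and the <code>/hello</code> path are made up, the API versions simply follow the earlier examples, and I'm assuming the nodes can pull from the registry without credentials.

```yaml
# Sketch only: the helloworld-* names and the /hello path are made up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld
spec:
  selector:
    matchLabels:
      app: helloworld
  replicas: 1
  template:
    metadata:
      labels:
        app: helloworld
    spec:
      containers:
      - name: helloworld
        image: kuberegistry.sjorsgielen.nl/helloworld:latest
        ports:
        - containerPort: 3000   # the port EXPOSEd in the Dockerfile
---
apiVersion: v1
kind: Service
metadata:
  name: helloworld-service
spec:
  selector:
    app: helloworld
  ports:
    - name: http
      port: 3000
      protocol: TCP
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: helloworld-ingress
  annotations:
    traefik.frontend.rule.type: PathPrefixStrip
spec:
  rules:
  - host: kubetest.sjorsgielen.nl
    http:
      paths:
      - path: /hello
        backend:
          serviceName: helloworld-service
          servicePort: 3000
```

The <code>PathPrefixStrip</code> annotation matters here because the application only answers on <code>/</code>; without it, a request for <code>/hello</code> would reach the Pod unchanged and return a 404.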


= To do =


* Play with deployments.
* Play with deployments with redundancy.
* Play with volumes, including persistent volumes.
* Play with services.
* TCP ports ingressing.
* Play with kubectl port-forward.
* Kubectl set image for rolling release?
* Kubernetes Dashboard
* Attempt Kubernetes upgrade from 1.13 to 1.14
** https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade-1-14/
** First, do a normal apt upgrade (the kubernetes packages are held and will not be modified)
** Then, unhold the kubeadm package on the master, upgrade it to the right version, then re-hold it
*** This only worked for me after unholding the kubelet and upgrading it as well.
** On the master, "kubeadm upgrade plan", then "kubeadm upgrade apply v1.14.x"
** Upgrade CNI controller by re-running the same <code>kubectl apply</code> as earlier
** Unhold the kubelet and kubectl packages on the master, upgrade them and re-hold them, then restart the kubelet
** For each worker, unhold the kubeadm package, upgrade it, rehold it; cordon (drain) the node; upgrade the node config; install the new kubelet version and restart it; uncordon the node.
*** Here too, this only worked for me after unholding the kubelet and upgrading it as well.
* Try getting information on a pod from inside it using the Kubernetes API
** https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#accessing-the-api-from-a-pod
** <code>wget --ca-certificate=/run/secrets/kubernetes.io/serviceaccount/ca.crt -qO- https://kubernetes.default.svc.cluster.local/api/</code>
** Doesn't need using the Kubernetes API, can be done using env vars: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
* Play with native cronjobs
* Play with Statefulset / Daemonset
* Security Contexts
** Refuse pods with host networking
** Refuse PVs with hostpath mounts
** Allow K8s API communication from a pod, but only to receive information about itself
** Basically: Make it impossible to root a node even with "broad" privileges on the Kubernetes API server
** https://kubernetes.io/docs/concepts/policy/pod-security-policy/
* Limiting pods in memory, CPU, I/O
* Limiting pods in network communication
[[Categorie:Projects]]

Current version as of 5 Feb 2020, 16:15

According to Wikipedia, Kubernetes (commonly stylized as k8s) is an open-source container orchestration system for automating application deployment, scaling, and management. I've heard a lot about it, and it seems to solve some problems I'm encountering sometimes, so I'd like to get to know it better. While doing so, I wanted to make a write-up of what I found out.

This page is not intended as a tutorial, but it will link to various tutorials I found useful. In this way, I hope that it can be used as a reference for someone setting up Kubernetes to be able to learn about it more quickly.

Disclaimer: I'm a beginner in the Kubernetes area, so this tutorial will contain errors. Please correct errors where you find them, and if possible, write some explanation! The "I" in this article is Sjors, but parts are possibly written by other people.

Problem

We start with a Linux desktop, running some version of Arch Linux. Someone wants to run an application on it for which only binaries are available, but those binaries were compiled on Ubuntu. Now, system libraries are different between Arch Linux and Ubuntu, and while it's possible to create binaries that run independent of system libraries (called "static binaries"), in this particular situation let's assume the binaries aren't static, but you still want to run them on your Arch installation.

You can get a second machine and run Ubuntu on it, or similarly you could use a Virtual Machine. But, there's a simpler and more efficient solution: Docker allows you to install, within your Linux distribution (the "host"), another Linux distribution (let's call it "guest" for now). The host and the guest may be completely different – technically, only the kernel of the host is used, and of course the host and guest must have compatible processor architectures. You run the guest environment within Docker (a "Docker Container") and run that binary in it.

This is how easy that is:

sjors@somebox:~$ lsb_release -a
LSB Version: 1.4
Distributor ID: Arch
Description: Arch Linux
Release: rolling
Codename: n/a
sjors@somebox:~$ docker run -ti ubuntu:bionic bash
root@a5b210d251c2:/# apt-get update && apt-get -y install lsb-release
[....]
root@a5b210d251c2:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

The -ti flags make this an interactive container that can run bash; just as easily, you can run detached containers running webservers, and use the integrated port-forwarding features to make them accessible to the outside.

Now, Docker has some problems of its own:

  • You start Docker containers by accessing the Docker daemon; the daemon runs containers as root and allows you to start a container with a bind-mount of "/". Basically, having access to the Docker daemon means you have root on the system, but you need access to do anything. It's all-or-nothing.
  • When your Docker machine goes down, all containers stop. Unless you gave each container a restart policy, you'll have to either restart all containers manually or have boot-scripts that set them up, and nothing will move them to another machine.
  • When you want to run more Docker containers than fit on one machine, there's no horizontal scaling mechanism built-in.
  • When you want to run a service multiple times, e.g. for redundancy, you need to schedule them manually multiple times and roll your own method of load-balancing them.

Kubernetes provides a solution for these problems, while otherwise looking very much like Docker. In fact, when you're familiar with Docker some of the commands below will also be very familiar to you.

Concepts

In this section, I'll explain some of Kubernetes' concepts quickly, and add links if you want to know more.

  • Container: Like with Docker, this is one 'guest environment' in which you can run anything. Usually, Kubernetes containers are, in fact, Docker containers.
  • Pod: The basic unit you actually schedule in Kubernetes. Usually, a Pod contains one Container, but a Pod can consist of multiple Containers which can be a very useful feature. More on that later. A Pod isn't durable: when anything causes a Pod to stop (e.g. the machine it runs on has a power outage), it won't be restarted automatically.
  • Deployment: An indication of "desired state" of the cluster in terms of the pods you always expect to have. The Kubernetes system will always try to match these Deployments to what it's actually running. Basically, a Deployment is the way to start a Pod and ensure it stays running.
  • Job: An indication that you want to run some command to completion. When the cluster has a Job, it will keep re-creating its Pods until a given number of them complete successfully. Basically, a Job is the way to start a Pod and ensure it finishes once.
  • Volume: As in Docker, changes to containers are temporary and will be gone when the container stops. If you want to keep those changes after a restart, like in Docker, you make a Volume. They are also useful to share data between containers. In Kubernetes, Volumes are kept over restarts of Containers, but not over restarts of Pods, unless they are Persistent. More on that later.
  • Service: When your Pod contains some application, such as a webserver, you can make its TCP port available as a Service so that people (inside or outside the cluster) can connect to it. For an application you want to run redundantly, multiple Pods can be started; you'll configure them to share the same Service. This way, when you connect to the Service, you'll get one of the running Pods behind it. Instant redundancy!
  • Namespace: Kubernetes supports running multiple "virtual clusters" on the infrastructure of one "physical cluster". Those virtual clusters are called "namespaces", and you can restrict access to certain namespaces. Normally, you're only working with the "default" namespace.

Those are some concepts that allow you to use a Kubernetes cluster. In this guide, we'll also be setting up the infrastructure behind that:

  • Node: A machine that actually runs the Containers. Can be bare-metal or a virtual machine, or even an embedded IoT device. A Node runs a process called "Kubelet" which interacts with the Docker daemon (usually) to set everything up, but normally you never communicate directly with it.
  • Control Plane: A set of some applications (API server, etcd, controller manager, scheduler...) that make sure the cluster is "healthy". For example, it starts Pods when you request it to, but also when a Node goes down that was running Pods for some Deployment, it restarts those Pods elsewhere. These master applications themselves also run in Pods, in a separate namespace called "kube-system".
  • Master node: Otherwise a normal Node, but it runs the Control Plane applications. By default, a Master node will only run Pods for these applications, but you can configure it to allow normal Pods too. There can be multiple Master nodes, for redundancy of the cluster.

Understanding networking

Networking within a Kubernetes cluster isn't difficult, but requires some specific explanation. Applications within Kubernetes containers need to be able to access each other, regardless of whether they run on the same Node or not. To make it worse, Nodes may be running behind firewalls or may be running in different subnets. Luckily, Kubernetes has some mechanisms that make it very robust against this.

When setting up a Kubernetes cluster, there are two important internal IP ranges throughout the cluster:

  • The Pod network range. This internal range is automatically split over Nodes, and Pods get individual addresses from it.
    • For example, you can set this to 10.123.0.0/16; the master node will likely get 10.123.0.0/24 and the second Node you add after that gets 10.123.1.0/24 and so on. A Pod running on this second node may have 10.123.1.55 as an IP address. (If the Pod has multiple containers, all of them will have the same IP address.)
  • The service network range. When you register a Service, such as "my-fun-webserver", it automatically gets an IP address within this range. An application called the 'kube-proxy', running automatically on every Node, then takes care that any communication with this IP address is forwarded to one of the actual Pods providing that service (by configuring iptables). Fun fact: the Kubernetes API server registers itself as a service and is always available at the first host address of the range you chose.
    • For example, your service network range may be 10.96.0.0/16; the Kubernetes API service makes itself available at 10.96.0.1. When you communicate with this IP address, the communication is automatically translated (by iptables) to be sent to the Pod IP address of the Kubernetes API service, e.g. 10.123.1.55.

It's important that these ranges don't overlap, and they also both shouldn't overlap with any relevant IP ranges within your existing network! The Kubernetes folks suggest you use something within 10.0.0.0/8 if your local network range is within 192.168.0.0/16 and vice-versa.

Since there is no one-size-fits-all solution to networking between Nodes, Kubernetes allows that to be done by specific networking plugins, called CNI. There are a number of such plugins, but a friend of mine experienced with running Kubernetes clusters explained why you should probably go for Weave:

  • It uses the vxlan standard. vxlan allows tunneling traffic between nodes automatically. When a packet should travel from one Node to another, a small UDP/IP header is prepended to it, so it is sent to the right node, where the header is taken off and routing continues.
  • This method allows it to cross almost all difficult network setups as long as you can have one UDP port forwarded between the machines.
  • Weave is smart enough to figure out the most efficient way to use vxlan given your Linux kernel version.
  • It's also pretty simple: just a single Go binary.

Kubernetes takes care that the pod network range and service network range are usable not only within pods, but also on the nodes. So, using the example values above, `https://10.96.0.1/` will reach the Kubernetes API server both from within pods and from nodes. If you have multiple masters, that address is also highly available, which is pretty convenient.

Some more important features of Kubernetes networking:

  • A Kubernetes cluster automatically runs a "CoreDNS" pod, which provides DNS to all other pods. It forwards requests outside the cluster to an upstream DNS server, but most importantly, provides an internal `cluster.local` DNS zone that you can use to look up other pods or services. For example, `kubernetes.default.svc.cluster.local` resolves to 10.96.0.1, as above. (In that hostname, 'kubernetes' is the service name, 'default' is the namespace.)
  • When a pod is listening on some TCP port, you don't need a Service to reach it from outside the cluster: kubectl port-forward pod/foobarbaz 8080:80 forwards local port 8080 to port 80 of a pod called 'foobarbaz'. For this to work, your kubectl can run on any machine with credentials to access the API server; it doesn't need to be part of the cluster.

Setting it all up

With that all behind us, let's start setting up our first cluster!

For a useful cluster, you'll need at least one machine to be the master, but of course we'll use at least two so we can call it an actual cluster. The master node is strongly recommended to have at least two CPUs and 2 GB of RAM. The other nodes can be a bit smaller.

For my tests, I've used four machines:

  • Kubetest1 - 1 CPU, 1 GB RAM - 145.131.6.177 - Ubuntu 16.04
  • Kubetest2 - 1 CPU, 1 GB RAM - 145.131.8.75 - Ubuntu 16.04
  • Kubetest3 - 1 CPU, 1 GB RAM - 145.131.5.151 - Ubuntu 16.04
  • Kubetest4 - 2 CPU, 2 GB RAM - 145.131.6.179 - Ubuntu 16.04

(I initially wanted a three-machine cluster, and forced the master on a 1 CPU/1 GB RAM machine, but processes were being killed because of going out-of-memory. So, kubetest4 became a bit bigger and will be my master.)

Now, the machines in your cluster should be able to access each other on all ports used by Kubernetes (see this list here). Important note: the machines should also be able to access themselves on their external IP address! I was initially having problems because of this, where pods on a worker node could reach the master normally, but pods on the master node couldn't. This was because the external IP address of the master was used in some communications, which ironically was impossible only on the master itself.

So, once you have your machines up, and you know they can reach each other, we'll start installation.

  • First, we choose our Pod network range and Service range. See previous section. I used 10.107.0.0/16 for the Pod network range and 10.16.0.0/16 for the Service range.
  • Optionally, we disable swap on the machine, as Kubernetes prefers to run without it. To do this, check /proc/swaps for your swap locations and run swapoff on them, then remove the entries from /etc/fstab. Or, leave them enabled and later disable the check in Kubernetes.
  • We install the Docker daemon. On Ubuntu, sudo apt-get install docker.io does it.
  • We install the basic Kubernetes tools: kubeadm, kubelet and kubectl. Here's a nice guide for it.
  • We use kubeadm to create our cluster. Just follow this guide, until the "Installing a pod network add-on" step.
    • I used: kubeadm init --pod-network-cidr "10.107.0.0/16" --service-cidr "10.16.0.0/16"
    • After this, kubectl get nodes should already respond with your master node, and kubectl get pods --all-namespaces should mention the kube-system pods that make up the Control Plane as discussed above!
  • We install a pod network add-on. As discussed before, we use Weave, but YMMV. See this guide on Weave setup, with an important note: if you don't use Kubernetes' default Pod network CIDR, pay attention to the exact kubectl apply step.
    • I used: kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.IPALLOC_RANGE=10.107.0.0/16"
    • After this, kubectl get pods --all-namespaces should show some additional pods. Most importantly, it should now show that CoreDNS is running.
  • Now, we can join our other nodes. kubeadm init will show a kubeadm join command at the end of its output, which you can run on the other nodes. If you don't have that output anymore, see this page on how to recreate a valid join command.
    • I also copied the /etc/kubernetes/admin.conf from the master to the other nodes (root-readable only), and set KUBECONFIG=/etc/kubernetes/admin.conf, so I could run kubectl on them. But that's up to you!
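Before running kubeadm init, it can help to double-check the ranges chosen in the first step. The sketch below assumes the usual requirement that the Pod range, the Service range and the node addresses must not overlap; check_ranges is a hypothetical helper written for this page, not part of any Kubernetes tool.

```python
import ipaddress


def check_ranges(pod_cidr, service_cidr, node_ips):
    """Return a list of human-readable conflicts; an empty list means no overlap."""
    pod_net = ipaddress.ip_network(pod_cidr)
    svc_net = ipaddress.ip_network(service_cidr)
    problems = []
    if pod_net.overlaps(svc_net):
        problems.append(f"pod range {pod_net} overlaps service range {svc_net}")
    for ip in node_ips:
        addr = ipaddress.ip_address(ip)
        for label, net in (("pod", pod_net), ("service", svc_net)):
            if addr in net:
                problems.append(f"node {addr} lies inside the {label} range {net}")
    return problems


# The ranges and node addresses used on this page.
nodes = ["145.131.6.177", "145.131.8.75", "145.131.5.151", "145.131.6.179"]
print(check_ranges("10.107.0.0/16", "10.16.0.0/16", nodes))  # prints []
```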

Hopefully, after understanding all the concepts discussed earlier, this process was a matter of mere minutes!

Your node list should be complete now:

root@kubetest4:~# kubectl get nodes
NAME        STATUS   ROLES    AGE    VERSION
kubetest1   Ready    <none>   2d2h   v1.13.4
kubetest2   Ready    <none>   2d2h   v1.13.4
kubetest3   Ready    <none>   2d5h   v1.13.4
kubetest4   Ready    master   2d5h   v1.13.4

By the way, you can run kubectl commands from your own machine as well by copying /etc/kubernetes/admin.conf from the master to $HOME/.kube/config on your machine; you may need to adjust your firewall configuration to allow the communication.

TO DO: It should be possible to generate user credentials instead of taking admin credentials. But how?

Creating some basic pods

You'll see your cluster has no pods in the default namespace:

$ kubectl get pods
No resources found.

The simplest thing we can do, now that we have a cluster, is run an arbitrary image (a Docker image, in this case ubuntu:bionic, pulled from Docker Hub) with an arbitrary command, like before:

$ kubectl run --restart=Never -ti --image=ubuntu:bionic bash
If you don't see a command prompt, try pressing enter.
root@bash:/# apt-get update && apt-get -y install lsb-release
[....]
root@bash:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic

Indeed, your pod will be shown now:

$ kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
bash   1/1     Running   0          56s

In the command above, we gave --restart=Never. This option has three possible values:

  • Never: When the pod exits, it will not be recreated. If this is given, kubectl will just create a Pod.
  • OnFailure: When the pod exits with failure, it will be recreated, otherwise not. In other words, kubectl will start a Job to create the Pod. (If that doesn't make sense, quickly check up on Jobs in the Concepts section above!)
  • Always: When the pod exits, it will be recreated. You guessed it: in this case, kubectl will create a Deployment. (If you didn't guess it, re-check the Concepts section!)
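The list above boils down to a small lookup table. The sketch below only restates what this page describes for the kubectl version used here (v1.13); newer kubectl releases changed kubectl run so that it only creates Pods, so don't read it as a description of every version. The helper name is made up for illustration.

```python
# Which kind of object `kubectl run` creates for each --restart value,
# as described in the list above (kubectl v1.13 behaviour).
RESTART_TO_KIND = {
    "Always": "Deployment",
    "OnFailure": "Job",
    "Never": "Pod",
}


def kind_for_restart(restart="Always"):
    """Return the kind of object created for a given --restart value."""
    try:
        return RESTART_TO_KIND[restart]
    except KeyError:
        raise ValueError(f"unknown --restart value: {restart!r}") from None


print(kind_for_restart())         # Deployment (Always is the default)
print(kind_for_restart("Never"))  # Pod
```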

The default is --restart=Always so you'll see the container is recreated like this:

$ kubectl run -ti --image=ubuntu:bionic bash 
If you don't see a command prompt, try pressing enter.
root@bash-58654c7f4b-9bhcq:/# touch foobarbaz
root@bash-58654c7f4b-9bhcq:/# exit
Session ended, resume using 'kubectl attach bash-58654c7f4b-9bhcq -c bash -i -t' command when the pod is running

[...wait for a bit until the pod comes back up...]

$ kubectl attach bash-58654c7f4b-9bhcq -c bash -i -t
If you don't see a command prompt, try pressing enter.
root@bash-58654c7f4b-9bhcq:/# ls foobarbaz
ls: cannot access 'foobarbaz': No such file or directory

As you can see, the container restarted, as /foobarbaz did not exist anymore when re-attaching after the exit. Any state in the filesystem of the container/pod will be gone upon restart.

If you tried this, you can check and remove the deployment like this:

$ kubectl get deployments      
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
bash   1/1     1            1           3s
$ kubectl delete deployment bash
deployment.extensions "bash" deleted

Storage using Volumes

So let's try adding a Volume to our pod, to see if we can make some changes persistent. Kubernetes supports many types of volumes; in this case we use emptyDir which is just a locally stored disk (initially empty).

There is no command-line parameter to kubectl run to add volumes. Internally, kubectl run translates your command line to a JSON request to the Kubernetes API server; we'd have to add any additional fields directly into the JSON. This can be done with the --overrides flag, but at this point, it is probably easier to switch to writing those requests ourselves. We can use JSON for this too, but many users use YAML, so we will too.

The command above, kubectl run --restart=Never -ti --image=ubuntu:bionic bash, translates to the following YAML:

apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true

As you can see, this is a Pod named "bash" and with one container, also called "bash", running ubuntu:bionic.

To create this Pod and attach to it, we write the code above to bash.yaml and run these commands:

$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# exit 0
exit
$ kubectl delete pod bash
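Since JSON works as well as YAML, the same Pod definition can also be generated by a program. The following is a minimal sketch: bash_pod_manifest is a hypothetical helper, and its output simply mirrors the fields of the YAML file shown earlier.

```python
import json


def bash_pod_manifest(name="bash", image="ubuntu:bionic"):
    """Build the same Pod definition as bash.yaml, as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "stdin": True,
                    "stdinOnce": True,
                    "tty": True,
                }
            ]
        },
    }


# Print the manifest as JSON; redirect this to a file such as bash.json.
print(json.dumps(bash_pod_manifest(), indent=2))
```

Saving that output to a file (here assumed to be called bash.json) and running kubectl create -f bash.json should have the same effect as the YAML version.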

Now, we will recreate the pod with an emptyDir volume mounted at /foo.

$ cat bash.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    emptyDir: {}
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
/dev/mapper/ubuntu--vg-root on /foo type ext4 (rw,relatime,errors=remount-ro,data=ordered)
root@bash:/# exit 0
exit
$ kubectl delete pod bash

Of course, this volume isn't really persistent just yet: it survives a restart of the container, but when the Pod itself is deleted and recreated, the volume is recreated empty (the Volume has the "lifetime of the Pod"), so it doesn't actually serve our purpose.

There are two ways to get around this:

  • Instead of using an emptyDir volume, use a volume type that stores its contents somewhere
  • Or, alternatively, make a PersistentVolume that exists outside our Pod, and then mount it.

A volume type with persistency

As seen before, Kubernetes supports many volume types, some of which are naturally persistent because they store the data on an external service.

The quickest way to set up persistent storage is to set up an NFS server. (Remember that in production, you'll want to go for something redundant, such as Ceph or Gluster or clustered NFS.)

root@kubetest4:~# apt-get install -y nfs-kernel-server
[...]
root@kubetest4:~# mkdir /persistent
root@kubetest4:~# chmod 0777 /persistent
root@kubetest4:~# cat <<EOF >>/etc/exports
/persistent *(rw,sync,no_subtree_check)
EOF
root@kubetest4:~# exportfs -ra

Make sure the other nodes can access it, by checking that mkdir /persistent && mount -t nfs <ipaddress>:/persistent /persistent works. Also, create a file /persistent/helloworld.txt with some contents.

Now we can go ahead and create our pod with this volume (replacing <ipaddress> with the address of your NFS host again):

$ cat bash.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    nfs:
      path: /persistent
      server: <ipaddress>
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
145.131.6.179:/persistent on /foo type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=145.131.6.179,local_lock=none,addr=145.131.6.179)
root@bash:/# cat /foo/helloworld.txt
Hello world!

Make a PersistentVolume with a claim

The Pod YAML file above has persistency, but with the downside that the definition describes where the files should be stored. Most of the time, definitions don't care about where the files are stored, but only what properties the storage has:

  • How much storage can we use?
  • Is it optimized for small files or for large files?
  • Is it highly available?
  • Is it backed up?

To more accurately implement this use-case, Kubernetes has two object types called PersistentVolume and PersistentVolumeClaim. The idea is that a cluster administrator creates PersistentVolumes (abbreviated pv) that know what kind of storage they represent and where to find it; then, users create PersistentVolumeClaims (abbreviated pvc) asking for storage with constraints like "at least 10 GB". When a PVC is created, it is matched to its closest PV and a link is created. If no PV is available to fulfill a PVC, the PVC stays in "Pending" state. In a way, PersistentVolumes are like Nodes in the sense that they provide capacity, and PersistentVolumeClaims are like Pods in the sense that they use that capacity if it is available anywhere in the cluster.

In this section, we'll rewrite our Pod from before to use a PersistentVolumeClaim with its constraints, and then have that PersistentVolumeClaim automatically matched to a PersistentVolume that provides the same NFS share. The first thing we'll do is create the PersistentVolumeClaim:

$ cat testing-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testing-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
$ kubectl apply -f testing-pvc.yaml 
persistentvolumeclaim/testing-pvc created
$ kubectl get pvc
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
testing-pvc      Pending                                                     6s

As you see, the testing-pvc PVC is created, but is Pending, because we haven't supplied any PersistentVolumes to the cluster yet that can fulfill it. So, we create a PersistentVolume:

$ cat nfs-storage.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-storage
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /persistent
    server: <IP address>
$ kubectl apply -f nfs-storage.yaml
$ kubectl get pv        
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   REASON   AGE
nfs-storage   10Gi       RWO            Retain           Bound    default/testing-pvc                              18s

It might take a moment for the PVC to be bound, but sure enough, after a bit, it is:

$ kubectl get pvc
NAME             STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
testing-pvc      Bound    nfs-storage   10Gi       RWO                           3m15s

Note that we asked for only 2 GB but got a claim with a capacity of 10 GB: the PVC is bound to the lowest-capacity volume that can satisfy it, and here that is the whole 10 GB volume.
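The matching behaviour seen above can be modelled in a few lines. This is a deliberately simplified sketch, not Kubernetes' actual binding code: it only looks at capacity and access modes and ignores StorageClasses, selectors and everything else. The class and function names are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1024 ** 3


@dataclass
class Volume:
    name: str
    capacity: int            # in bytes
    access_modes: frozenset  # e.g. frozenset({"ReadWriteOnce"})
    bound_to: Optional[str] = None


@dataclass
class Claim:
    name: str
    request: int             # in bytes
    access_modes: frozenset


def bind(claim: Claim, volumes: List[Volume]) -> Optional[Volume]:
    """Bind the claim to the smallest free volume that satisfies it, if any."""
    candidates = [
        v for v in volumes
        if v.bound_to is None
        and v.capacity >= claim.request
        and claim.access_modes <= v.access_modes
    ]
    if not candidates:
        return None  # the claim would stay Pending
    best = min(candidates, key=lambda v: v.capacity)
    best.bound_to = claim.name
    return best


rwo = frozenset({"ReadWriteOnce"})
claim = Claim("testing-pvc", 2 * GIB, rwo)
print(bind(claim, []))  # None: no volume exists yet, so the claim stays Pending
volumes = [Volume("nfs-storage", 10 * GIB, rwo)]
print(bind(claim, volumes).name)  # nfs-storage
```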

And let's create a Pod to use it:

$ cat bash-pvc.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bash
spec:
  volumes:
  - name: testing-volume
    persistentVolumeClaim:
      claimName: testing-pvc
  containers:
  - name: bash
    image: ubuntu:bionic
    stdin: true
    stdinOnce: true
    tty: true
    volumeMounts:
    - mountPath: /foo
      name: testing-volume
$ kubectl create -f bash-pvc.yaml
pod/bash created
$ kubectl attach -ti bash -c bash
If you don't see a command prompt, try pressing enter.
root@bash:/# mount | grep foo
145.131.6.179:/persistent on /foo type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=145.131.6.179,local_lock=none,addr=145.131.6.179)
root@bash:/# cat /foo/helloworld.txt
Hello world!

This might seem like an insignificant victory, since we're getting the same end result, but it's a crucial step forward: our Pods, Jobs and Deployments don't need to care anymore where their storage comes from, only that it is persistent.

There are some more benefits of Persistent Volume Claims:

  • If a PVC is deleted while there is still a Pod using it, it will switch state to Terminating but will not disappear until the Pods are gone, too.
  • PVs have a reclaim policy (Retain in the output above) that can be set to Delete or Recycle instead, causing the PV's contents to be deleted as well once its claim is released.
  • When StorageClasses are used, PVCs can only be bound to a PV with the same Storage Class. When no free PV exists within the Storage Class while a PVC is waiting to be bound, a PV can be automatically created. This is called Dynamic provisioning.
  • It is possible to make a PV with a "Node Affinity", causing any Pods using that PV to run on a specific node. This combines very well with the HostPath volume type, as this allows a bind-mount of some directory on a Node to be accessible within a Pod.
    • But, it should also be obvious that this is a security risk if you allow untrusted users to create PVs and use them in Pods. TO DO: Add a section on protecting this.

Running a Deployment using this volume

Now that we know how to persistently store data and access it from Pods, it's time to create an actual Deployment with a real-world application in it.

$ cat nginx.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-html-storage
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /persistent/html
    server: <IP of nfs server>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-webfiles
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      volumes:
      - name: webfiles
        persistentVolumeClaim:
          claimName: nginx-webfiles
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /usr/share/nginx/html
          name: webfiles

Now, before starting this deployment, make an example web page by creating /persistent/html/index.html with something like:

<h1>It's working!</h1>

Then, start the PV, PVC and deployment:

$ kubectl apply -f nginx.yaml

A note about kubectl apply ("declarative management") versus kubectl create ("imperative management"): in this case, apply and create would do the same thing, as the objects described in nginx.yaml don't exist yet. However, if you were to change nginx.yaml and run kubectl create again, you'd get an error because the objects already exist. "Imperative management" (create, delete, replace) means you're telling kubectl what action is necessary, while "declarative management" means you're telling kubectl what the state of the cluster should be, and it will perform the correct action for you. Both are fine in a production context; from now on, this page will be using apply where possible since that seems to be the community consensus in tutorials.
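The difference between the two styles can be illustrated with a toy model. This sketch is only an analogy: the Cluster class below is invented for this page, stores objects in a dictionary, and replaces the whole object on apply, whereas the real kubectl apply merges changes into the existing object.

```python
class AlreadyExists(Exception):
    """Raised when an imperative create hits an object that already exists."""


class Cluster:
    def __init__(self):
        self.objects = {}

    def create(self, name, spec):
        """Imperative: perform exactly this action, fail if it is not possible."""
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = dict(spec)
        return "created"

    def apply(self, name, spec):
        """Declarative: make the stored state equal to spec, whatever that takes."""
        if name not in self.objects:
            self.objects[name] = dict(spec)
            return "created"
        if self.objects[name] == spec:
            return "unchanged"
        self.objects[name] = dict(spec)
        return "configured"


cluster = Cluster()
print(cluster.apply("nginx-deployment", {"replicas": 2}))  # created
print(cluster.apply("nginx-deployment", {"replicas": 2}))  # unchanged
print(cluster.apply("nginx-deployment", {"replicas": 3}))  # configured
try:
    cluster.create("nginx-deployment", {"replicas": 3})
except AlreadyExists as exc:
    print("create failed, object already exists:", exc)
```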

Let's check if the deployment has been created and the pods as well:

$ kubectl get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   1/2     2            1           11h
$ kubectl get pods                                      
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-58b6c946d5-fnqr6   1/1     Running             0          49s
nginx-deployment-58b6c946d5-p2nlm   0/1     ContainerCreating   0          49s

One pod has been created, the other one is still in ContainerCreating state. Let's check why...

$ kubectl describe pod nginx-deployment-58b6c946d5-p2nlm
[....]
  Warning  FailedMount  15s  kubelet, kubetest1  (combined from similar events): MountVolume.SetUp failed for volume "websource" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/fac670f9-47d7-11e9-a977-001dd8b7660c/volumes/kubernetes.io~nfs/websource --scope -- mount -t nfs 145.131.6.179:/persistent/html /var/lib/kubelet/pods/fac670f9-47d7-11e9-a977-001dd8b7660c/volumes/kubernetes.io~nfs/websource
Output: Running scope as unit run-r368fd7089b0a46139882e708a89f8926.scope.
mount: wrong fs type, bad option, bad superblock on 145.131.6.179:/persistent/html,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

In other words, it is failing to start because the volume websource (our NFS mount) cannot be mounted. You can see from the output that this is running on the kubelet for kubetest1, and the error comes from mount -t nfs 145.131.6.179:/persistent/html/.... The error is correct: we need an NFS mount helper tool that isn't installed on kubetest1. I ran apt-get install nfs-common on it, and sure enough, the pod is soon running:

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-58b6c946d5-fnqr6   1/1     Running   0          6m42s
nginx-deployment-58b6c946d5-p2nlm   1/1     Running   0          6m42s
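To catch this problem before a Pod gets scheduled, you could check every node for the NFS mount helper in advance. The snippet below is a small sketch of such a check: nfs_helper_path is a hypothetical helper, and the directories it searches are an assumption based on where Ubuntu's nfs-common package puts mount.nfs.

```python
import os
import shutil


def nfs_helper_path(search_dirs=("/sbin", "/usr/sbin")):
    """Return the path of the mount.nfs helper, or None if it is not installed."""
    # Assumption: the helper lives in one of the usual system binary directories.
    return shutil.which("mount.nfs", path=os.pathsep.join(search_dirs))


helper = nfs_helper_path()
if helper:
    print(f"NFS mount helper found at {helper}")
else:
    print("mount.nfs not found; on Ubuntu, install it with: apt-get install nfs-common")
```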

Making services accessible over TCP

Accessing Pods and Deployment using port-forward

Now, we'd like to see nginx in action! The pod is listening on an internal port 80 (according to its configuration and the default nginx config). If we want to access this from outside the cluster, there are three ways:

  • kubectl port-forward, which listens to a local port on the machine where you run kubectl port-forward, and then forwards all connections to a pod. (I've explained above how to configure kubectl so it can run on your own machine when the cluster runs elsewhere.)
  • Creating a Service, in the next section
  • Creating an Ingress, in the section after that

Let's try creating a port-forward to the pod first:

$ kubectl port-forward pod/nginx-deployment-58b6c946d5-fnqr6 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80

Now, visit http://127.0.0.1:8080 in your browser and presto!

You can also do a port-forward to a deployment:

$ kubectl port-forward deployment/nginx-deployment 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80

When you start a port-forward to a deployment, the deployment is resolved to a random one of the Running pods in the deployment. So, in this case, you'll get a response from either of the two pods; if one of them is down, you'll get it from the one that's up. (If both are down, you're out of luck.) However, when that resolved pod goes down for whatever reason, the port-forward is not restarted to another pod in the deployment. You can try this out by running kubectl delete pod nginx-deployment-... on your pods: the deployment will cause them to be restarted, but the port-forward will cease to work once you've deleted the right one.

Accessing a deployment using a Service

So, what should we do if we want the application to be reachable even if its Pods go down? The first method is to create a Service. A Service describes how an application should be accessible. There are multiple types of Services, corresponding to multiple interpretations of "accessible":

  • ClusterIP is a service type indicating that the application should be only internally accessible using a "virtual service IP" (as described above). This service IP will be allocated by Kubernetes and distributed to all nodes and pods, so that a connection to the virtual service IP on the correct port will automatically end up on one of its running Pods.
  • NodePort is a service type indicating that the application should be externally accessible using a "service port" on all Nodes. The service port will be allocated by Kubernetes (you can choose it, but that's not recommended) and distributed to all nodes, so that a connection to any node on the service port will automatically end up on one of its running Pods. A NodePort service also automatically gets a ClusterIP, so you can use that, too.
  • LoadBalancer is a service type indicating that the application should be externally accessible using a provided load balancer. By default, this works like NodePort, but on specific cloud providers you'll also get an allocated external IP address; connections to the desired port on that address end up on one of the running Pods. I'm running this on my own cluster, not one hosted by a cloud provider, so I won't create a LoadBalancer service. If you'd like to, this page explains how they work.
    • There is a controller that allows you to use LoadBalancer services on your own bare-metal cluster that doesn't run on a cloud provider. It's called MetalLB and it works by having a pool of external routable IP addresses to choose from; when allocating from that pool, it starts announcing that IP address via either ARP or BGP onto an arbitrary Node, so that traffic to that IP ends up there. If the Node goes down, MetalLB elects a new leader node and re-announces the IP there, so that the service is moved.
  • ExternalName doesn't actually set up any forwarding, but allows you to register an internal name that forwards to a given name in DNS elsewhere. This allows migration to/from Kubernetes.
  • Not a service type, but if your service uses HTTP, you can use Ingress instead of Service to make your service externally accessible. More on that later.

Since we want our service to be externally accessible, we'll make a NodePort service:

$ cat service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
  type: NodePort
$ kubectl apply -f service.yaml
$ kubectl get services
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.16.0.1       <none>        443/TCP        8d
nginx-service   NodePort    10.16.247.178   <none>        80:31106/TCP   4s

When we define the Service, we provide a Selector which defines the Pods supplying this Service. You can supply arbitrary labels to Pods; the app=nginx label was initially set in our Deployment (see the metadata section) and is inherited by all Pods created by it, so the Service will use them directly.

As you can see, our nginx-service service is now up and it got external port 31106. Indeed, when we go to http://<IP>:31106/ (replacing <IP> with the IP of any of our nodes) we can see the page again! Also, when we kubectl delete pod ... arbitrary pods within the deployment, they are restarted automatically, and accesses to the external IP/port keep working as long as at least one pod is Running.
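The way a Selector picks the Pods behind a Service can be sketched as a plain label comparison. The code below is a simplified illustration with made-up function names; the pod names are taken from the output earlier on this page, and the labels of the bash pod are assumed to be empty for the sake of the example.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select_pods(selector, pods):
    """Return the names of all pods whose labels satisfy the selector."""
    return [name for name, labels in pods.items() if matches(selector, labels)]


# Pod names from this page; the label sets are illustrative.
pods = {
    "nginx-deployment-58b6c946d5-fnqr6": {"app": "nginx"},
    "nginx-deployment-58b6c946d5-p2nlm": {"app": "nginx"},
    "bash": {},
}
print(select_pods({"app": "nginx"}, pods))
```

Note that this sketch treats an empty selector as matching every pod; Kubernetes handles a Service without a selector differently, which is not modelled here.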

There is an alternative way to make a Service externally reachable that can be convenient: you can set an externalIP on a ClusterIP type Service and any Node with that IP will listen on the indicated Service Port -- but that comes with a big fat warning: it introduces a single point of failure into your Service, as it will be unreachable if that Node is down! Yet, it can be very convenient especially for bare-metal clusters, so I'll show you how to do it:

$ cat service-externalip.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-service-externalip
spec:
  selector:
    app: nginx
  ports:
    - name: nginx
      port: 80
      protocol: TCP
  type: ClusterIP
  externalIPs:
  - "145.131.6.177" # Replace this with one of your node's IPs, of course!
$ kubectl apply -f service-externalip.yaml
$ kubectl get services
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
kubernetes                 ClusterIP   10.16.0.1       <none>          443/TCP        19d
nginx-service              NodePort    10.16.247.178   <none>          80:31106/TCP   11d
nginx-service-externalip   ClusterIP   10.16.165.143   145.131.6.177   80/TCP         3s

Sure enough, if you visit your external IP on port 80, you should see the same page served by Nginx appear! As described before, you can have a similar approach without having a fixed Node to connect to; the controller MetalLB chooses a node randomly then uses ARP or BGP to announce an IP address on it. But, this sort of setup only works in a controlled subnet so I can't try it on this cluster.

A summary so far

We've talked about:

  • Nodes, the machines (usually physical or VMs) that together form the Kubernetes cluster
    • Master nodes are nothing special, except they (also) run Pods that together form the Kubernetes Control Plane
  • Pods, the basic unit of scheduling; they run on Nodes and consist of at least one Container running an actual Docker Image
    • Pods have an IP address within the Pod networking range
  • Deployments, which are a way to tell Kubernetes to always have some type of Pod running
  • Jobs, which are a way to tell Kubernetes to keep running some type of Pod until it finishes successfully
  • Services, which are a way to make some application in Pods accessible over TCP (inside and/or outside the cluster)
    • Services can have a 'virtual' IP address within the Service networking range, they can have a NodePort all Nodes listen on, and/or they can have an external IP statically or dynamically provided by a LoadBalancer.
  • Volumes, which provide various kinds of storage to Pods
    • Persistent Volumes are provided by the cluster administrator to allow storage
    • Persistent Volume Claims claim such volumes for some user
    • Pods can have a Persistent Volume Claim attached to them, making the contents of the volume actually usable

Accessing a Deployment using an Ingress

Ingresses are like Services, but for HTTP only. This specialisation allows adding a number of additional features, such as having multiple applications behind one URL or hostname (e.g. micro-services), SSL termination and splitting load between different versions of the same service (canarying).

Ingress is currently in beta (v1beta1), meaning that the feature is well-tested and will continue to exist, but details may change. Consider this before using it in production.

As with LoadBalancer Services, creating an Ingress does not immediately change anything in the cluster. You need to have an Ingress Controller for anything to change in the cluster after you create an Ingress. There are many Ingress Controller plugins to choose from; I will try Traefik since it supports Let's Encrypt out of the box. (Some cloud providers may provide an Ingress Controller out of the box.)

First of all, we set up Traefik. For this, we'll need to create some object types we haven't seen before: service accounts, cluster roles, cluster role bindings and config maps. Bear with me for a bit while we set up Traefik:

$ cat traefik-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: traefik-ingress-controller
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
      - secrets
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: traefik-ingress-controller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: traefik-ingress-controller
subjects:
- kind: ServiceAccount
  name: traefik-ingress-controller
  namespace: kube-system
$ kubectl apply -f traefik-account.yaml
serviceaccount/traefik-ingress-controller created
clusterrole.rbac.authorization.k8s.io/traefik-ingress-controller created
clusterrolebinding.rbac.authorization.k8s.io/traefik-ingress-controller created

Now we create a ConfigMap for Traefik's configuration:

$ cat traefik-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-configmap
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http", "https"]
    insecureSkipVerify = true

    [entryPoints]
      [entryPoints.http]
        address = ":80"
      [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]
      [entryPoints.admin]
        address = ":8080"

    [kubernetes]
      [kubernetes.ingressEndpoint]
        publishedService = "kube-system/traefik-ingress-service-external"

    [api]
    entryPoint = "admin"
$ kubectl apply -f traefik-configmap.yaml
configmap/traefik-configmap created

That being done, we now start the Traefik deployment:

$ cat traefik.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
    name: traefik-ingress
    namespace: kube-system
    labels:
        k8s-app: traefik-ingress-lb
spec:
    replicas: 1
    selector:
        matchLabels:
            k8s-app: traefik-ingress-lb
    template:
        metadata:
            labels:
                k8s-app: traefik-ingress-lb
                name: traefik-ingress-lb
        spec:
            volumes:
            - name: traefik-configmap
              configMap:
                name: traefik-configmap
            serviceAccountName: traefik-ingress-controller
            terminationGracePeriodSeconds: 60
            containers:
            - image: traefik
              name: traefik-ingress-lb
              ports:
              - name: web
                containerPort: 80
              - name: https
                containerPort: 443
              - name: admin
                containerPort: 8080
              volumeMounts:
              - mountPath: "/config"
                name: "traefik-configmap"
              args:
              - --loglevel=INFO
              - --configfile=/config/traefik.toml
$ kubectl apply -f traefik.yaml
deployment.extensions/traefik-ingress created

What did this do?

  • We created the service account and privileges Traefik needs to find Ingresses, Services and Endpoints.
  • We created a ConfigMap, a hard-coded type of Volume that is commonly used to supply configuration inside Pods. This ConfigMap causes Traefik to listen on ports 80, 443 and 8080.
  • Then, we created a Deployment that runs the Traefik image with the given configmap and service account.
  • Note that you won't find these deployments and pods using the normal kubectl get pods (etc) commands unless you give -n kube-system to select the kube-system namespace.

You should see a traefik-ingress-... pod with status Running when you run kubectl get pods -n kube-system; if that's not the case, you should stop here and investigate what's wrong.

To use Traefik, we'll configure two things:

  • External connections end up at it
  • It reads the hostname and path of requests, and sends them onwards to the correct Service

The first thing we've already discussed before: it requires setting up a LoadBalancer Service if you're running on a cloud provider; if you're not, like me, you can set up a ClusterIP Service with an ExternalIP, and the same caveat about a single point of failure applies here as well. (Note that we expose only ports 80 and 443, not 8080; the latter is the administrator port of Traefik.)

$ cat traefik-service-external.yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik-ingress-service-external
  namespace: kube-system
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 443
      name: https
  externalIPs:
  - "145.131.8.75"

The externalIPs mentioned here should be the external IP of one of your Nodes. At this point you can also create a record in DNS to point to this IP address if you want; I created kubetest.sjorsgielen.nl IN A 145.131.8.75.

Having this set up should cause http://kubetest.sjorsgielen.nl/ to end up within Traefik. It will give a "404 page not found" result, as Traefik doesn't know about any Ingresses yet to forward your request to.
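
If you are on a cloud provider instead, the LoadBalancer variant might look roughly like the sketch below. It reuses the selector and ports from the Service above; the name traefik-ingress-service-lb is made up for illustration, and how the external address gets assigned depends on your provider.

```yaml
# Sketch only: a LoadBalancer variant of the external Service above.
# The Service name is a hypothetical example.
apiVersion: v1
kind: Service
metadata:
  name: traefik-ingress-service-lb
  namespace: kube-system
spec:
  type: LoadBalancer
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 443
      name: https
```

Once the provider has assigned an address, kubectl get service -n kube-system should list it in the EXTERNAL-IP column, and that is the address to point your DNS record at.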

You can check the Traefik dashboard to see that it's up. Currently, we'll need a port-forward for that:

$ kubectl port-forward -n kube-system deployment/traefik-ingress 8080:8080
Forwarding from [::1]:8080 -> 8080
Forwarding from 127.0.0.1:8080 -> 8080

Now, visit http://localhost:8080/ and you should see the Traefik dashboard. It will show no frontends and no backends, as we haven't created any Ingresses yet for it to route. So let's create one for our Nginx service:

$ cat ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
    name: nginx-ingress
    annotations:
        traefik.frontend.rule.type: PathPrefixStrip
spec:
    rules:
    - host: kubetest.sjorsgielen.nl
      http:
        paths:
        - path: /nginx
          backend:
            serviceName: nginx-service
            servicePort: 80
$ kubectl apply -f ingress.yaml
ingress.extensions/nginx-ingress created

So what does this mean?

  • It's an Ingress type, meaning it's a message to the cluster/Traefik that we want to have a Service externally accessible over HTTP.
  • The service will be reachable on the Host kubetest.sjorsgielen.nl -- this acts like a sort of virtual server in Apache, where different hosts can serve different content.
  • The request Path must begin with /nginx; the traefik.frontend.rule.type: PathPrefixStrip annotation will cause the /nginx prefix to be stripped off before the request is forwarded.
  • The requests will be forwarded to the nginx-service service on port 80.

In other words, http://kubetest.sjorsgielen.nl/nginx/index.html will be forwarded to http://nginx-service/index.html. And indeed, it shows the same Nginx page again! If you go back to the Traefik dashboard, you'll see that the frontend and backend have appeared, and the Health tab now shows the average response time.

Now, you could replace your port-forward to the Traefik dashboard with a Service and an Ingress so you can make it externally accessible on your hostname (or a different one) as well. I'll leave that as an exercise to you!
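
As a starting point for that exercise, here is a minimal sketch. It assumes the Traefik Pods carry the k8s-app: traefik-ingress-lb label that the external Service above selects on. The resource names and the hostname traefik.example.org are placeholders, not part of this setup. Keep in mind that, unless you have configured authentication for it, anyone who can reach the dashboard can view it, so think twice before exposing it on a public hostname.

```yaml
# Sketch: expose the Traefik admin port (8080) through Traefik itself.
# All names and the hostname below are hypothetical placeholders.
apiVersion: v1
kind: Service
metadata:
  name: traefik-dashboard-service
  namespace: kube-system
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - name: admin
      protocol: TCP
      port: 8080
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: traefik-dashboard-ingress
  namespace: kube-system
spec:
  rules:
  - host: traefik.example.org
    http:
      paths:
      - path: /
        backend:
          serviceName: traefik-dashboard-service
          servicePort: 8080
```

The sketch gives the dashboard its own hostname rather than a path prefix, on the assumption that the dashboard expects to be served from the root of a host.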

Let's encrypt this

There's one very nice feature of Traefik I didn't want you to miss out on. It of course supports TLS, and it can automatically get your certificates through any ACME provider such as Let's Encrypt.

For this, we change our ConfigMap to include an [acme] section and to redirect all HTTP requests to HTTPS:

$ cat traefik-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-configmap
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http", "https"]
    insecureSkipVerify = true

    [entryPoints]
      [entryPoints.http]
        address = ":80"
        [entryPoints.http.redirect]
          entryPoint = "https"
      [entryPoints.https]
        address = ":443"
        [entryPoints.https.tls]
      [entryPoints.admin]
        address = ":8080"

    [acme]
    email = 'your e-mail address'
    storage = "acme.json"
    caServer = "https://acme-v02.api.letsencrypt.org/directory"
    entryPoint = "https"
    onDemand = true
      [acme.httpChallenge]
      entryPoint = "http"

    [kubernetes]
      [kubernetes.ingressEndpoint]
        publishedService = "kube-system/traefik-ingress-service-external"

    [api]
    entryPoint = "admin"
$ kubectl apply -f traefik-configmap.yaml
configmap/traefik-configmap configured

Unfortunately, changing a ConfigMap doesn't automatically update the Pods that use it. So we delete our Pod, and the Deployment recreates it with the new configuration:

$ kubectl get pods -n kube-system | grep traefik
traefik-ingress-6dcd896c78-7w2k6       1/1     Running   0          8d
$ kubectl delete pod traefik-ingress-6dcd896c78-7w2k6 -n kube-system
$ kubectl get pods -n kube-system | grep traefik
traefik-ingress-6dcd896c78-8gl9t       1/1     Running   0          15s
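
As an alternative to deleting the Pod by hand: any change to a Deployment's Pod template triggers a rollout of new Pods. A common trick, sketched here, is to bump an annotation on the template in traefik.yaml whenever the ConfigMap changes, and then re-apply the file. The annotation name configmap-revision is made up; any key works.

```yaml
# Fragment of the Traefik Deployment in traefik.yaml; merge it into the
# existing spec rather than applying it on its own.
spec:
  template:
    metadata:
      annotations:
        # Hypothetical annotation: change the value after editing the ConfigMap
        configmap-revision: "2"
```

The upside of this approach is that the restart is recorded in the manifest itself, so re-applying the same files on another cluster gives the same result.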

Traefik will request a TLS certificate when the first TLS request comes in. It may take a minute for the Let's Encrypt challenge to complete, but after that you should be able to access your hostname via HTTPS and get a valid certificate. In my case, https://kubetest.sjorsgielen.nl/nginx gives the same working page! And since we've also configured the HTTP redirect, http://kubetest.sjorsgielen.nl/nginx simply redirects there. Hassle-free TLS, done!

Creating your own images

So far, we've mostly run the standard image ubuntu:bionic. It's pulled from Docker Hub, a central registry for images, at https://hub.docker.com/_/ubuntu. You can pull many other images from there in the same way, such as the minimal Linux image alpine or traefik, the image running in our Traefik pod.

But, if we want to run our own Docker images inside Kubernetes, it will need to be able to pull them as well. This can be done by uploading our images to Docker Hub, but for our own experimentation, let's set up our own registry and plug Kubernetes into it.

To begin with, the registry will need storage for its images. True to our earlier experiments, we start by creating a persistent volume claim. (I'll assume there's a persistent volume to fulfill it; if not, check above how to create one yourself.)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-files
spec:
  storageClassName: default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

The registry deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry
spec:
  selector:
    matchLabels:
      app: registry
  replicas: 1
  template:
    metadata:
      labels:
        app: registry
    spec:
      volumes:
      - name: registrystorage
        persistentVolumeClaim:
          claimName: registry-files
      containers:
      - name: registry
        image: registry:2
        ports:
        - containerPort: 5000
        volumeMounts:
        - mountPath: /var/lib/registry
          name: registrystorage

And a Service + Ingress to make it accessible on a new hostname. I found that Docker doesn't support accessing a registry with a path prefix, so we have to give it its own hostname. Luckily, with Traefik, it's easy to route; you'll only have to add a record in DNS.

apiVersion: v1
kind: Service
metadata:
  name: registry-service
spec:
  selector:
    app: registry
  ports:
    - name: registry
      port: 5000
      protocol: TCP
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: registry-ingress
spec:
  rules:
  - host: kuberegistry.sjorsgielen.nl
    http:
      paths:
      - path: /
        backend:
          serviceName: registry-service
          servicePort: 5000

After a minute, as before, https://kuberegistry.sjorsgielen.nl/v2/ (replace with your own hostname) should return 200 OK with a page content of "{}".

To test whether it's working, let's take the Ubuntu Docker image and push it onto our registry, as per more or less these instructions. Here, it's important that the registry is reachable over HTTPS, as by default Docker only allows unencrypted HTTP communication with a registry on localhost! (Although you could get around this with a kubectl port-forward.)

$ docker pull ubuntu
$ docker image tag ubuntu kuberegistry.sjorsgielen.nl/myubuntu
$ docker push kuberegistry.sjorsgielen.nl/myubuntu
[...] Retrying in 10 seconds

That seems to fail. As before, we can figure out the root cause by getting the logs of the Registry pod:

$ kubectl logs registry-6bf4dbcfb-9csf5
[...]
time="2019-03-28T21:44:04.465658668Z" level=error msg="response completed with error" err.code=unknown err.detail="filesystem: mkdir /var/lib/registry/docker: permission denied" err.message="unknown error" go.version=go1.11.2 http.request.host=kuberegistry.sjorsgielen.nl http.request.id=c00f2785-30b0-469d-bcff-70a12c0f604b http.request.method=POST http.request.remoteaddr=10.107.160.0 http.request.uri="/v2/myubuntu/blobs/uploads/" http.request.useragent="docker/18.06.1-ce go/go1.10.4 git-commit/e68fc7a kernel/4.4.0-112-generic os/linux arch/amd64 UpstreamClient(Docker-Client/18.06.1-ce \(linux\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=125.482304ms http.response.status=500 http.response.written=164 vars.name=myubuntu 

A "permission denied" error in "mkdir /var/lib/registry/docker". We may not know offhand which PersistentVolume is mounted in the registry, but we can quickly find out with kubectl describe deployment registry, kubectl get pvc and kubectl describe pv registry-storage. In my case, the cause is that root squashing is enabled on my NFS mount: the registry accesses the directory as root, which NFS maps to an anonymous uid/gid that has no rights in the directory. It's easily fixed, and now the push works:

$ docker push kuberegistry.sjorsgielen.nl/myubuntu
The push refers to repository [kuberegistry.sjorsgielen.nl/myubuntu]
b57c79f4a9f3: Pushed 
d60e01b37e74: Pushed 
e45cfbc98a50: Pushed 
762d8e1a6054: Pushed 
latest: digest: sha256:f2557f94cac1cc4509d0483cb6e302da841ecd6f82eb2e91dc7ba6cfd0c580ab size: 1150

Now, let's make our own Docker image, push it, and start it in a Pod!

Here's an example Dockerfile that runs a tiny Perl-based webserver that always responds with "Hello World!":

$ cat Dockerfile
FROM ubuntu:bionic

RUN apt-get update \
 && apt-get install -y libmojolicious-perl \
 && rm -rf /var/lib/apt/lists/*

# Normally, you'd use COPY here, but I wanted to keep this in one file
RUN echo "#!/usr/bin/env perl"                       >>/app.pl \
 && echo "use Mojolicious::Lite;"                    >>/app.pl \
 && echo "get '/' => sub {"                          >>/app.pl \
 && echo "  shift->render(text => 'Hello World!'); " >>/app.pl \
 && echo "};"                                        >>/app.pl \
 && echo "app->start;"                               >>/app.pl \
 && chmod +x /app.pl

EXPOSE 3000
CMD ["/app.pl", "daemon", "-l", "http://*:3000"]
$ docker build -t kuberegistry.sjorsgielen.nl/helloworld:latest .
$ docker push kuberegistry.sjorsgielen.nl/helloworld:latest

At this point, you should be able to write a Deployment, Service and Ingress for this application, using the examples above. kubectl apply should then start the Pod, Traefik should route the service, and whatever host/path you configured should quickly become reachable and respond with "Hello World!". We've created our own image and run it on our cluster!
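
For reference, one way this could look is sketched below. The names helloworld, helloworld-service and helloworld-ingress, as well as the /hello path, are placeholders chosen for this sketch; the image, port 3000, hostname and Traefik annotation are taken from the examples above.

```yaml
# Sketch: Deployment, Service and Ingress for the helloworld image.
# Resource names and the /hello path are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld
spec:
  selector:
    matchLabels:
      app: helloworld
  replicas: 1
  template:
    metadata:
      labels:
        app: helloworld
    spec:
      containers:
      - name: helloworld
        # Pulled from our own registry instead of Docker Hub
        image: kuberegistry.sjorsgielen.nl/helloworld:latest
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: helloworld-service
spec:
  selector:
    app: helloworld
  ports:
    - name: web
      protocol: TCP
      port: 80          # port the Service listens on
      targetPort: 3000  # port the container listens on
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: helloworld-ingress
  annotations:
    traefik.frontend.rule.type: PathPrefixStrip
spec:
  rules:
  - host: kubetest.sjorsgielen.nl
    http:
      paths:
      - path: /hello
        backend:
          serviceName: helloworld-service
          servicePort: 80
```

With these applied, a request to /hello on the configured host should have its prefix stripped by Traefik and arrive at the application's / route.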

To do