State/Persistence - Volumes/PV/PVC/SC/STS

YAML-vs-JSON

1. Volumes

https://kubernetes.io/docs/concepts/storage/volumes/
Essentially this is just mounting a volume directly into a pod.
The parts you will most likely need are:

#config map mount example
---
apiVersion: v1
data:
  conf: abrakadabra
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: tcm
---
apiVersion: v1
kind: Pod
metadata:
  name: cm-mount
  namespace: default
spec:
  containers:
  - name: cm-mount
    image: busybox
    command: ["sh", "-c", "sleep 120"]
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: confmapvol
      mountPath: /data
  volumes:
  - name: confmapvol
    configMap:
      name: tcm
      items:
        - key: conf
          path: conf
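
Another volume type worth knowing is emptyDir, a scratch directory that lives and dies with the pod and is shared by all of its containers. A minimal sketch (pod and volume names are made up):

# emptyDir scratch volume example (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo
  namespace: default
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /cache/hello && sleep 120"]
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: scratch
      mountPath: /cache
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 5; cat /cache/hello; sleep 120"]
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: scratch
      mountPath: /cache
  volumes:
  - name: scratch
    emptyDir: {}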

A small digression here:

docker-vs-kubernetes

2. Persistent Volumes

https://kubernetes.io/docs/concepts/storage/persistent-volumes/
A quick summary: volumes are defined inline within the deployment and Kubernetes does not manage their lifecycle at all. A PersistentVolume, on the other hand, is a separate API layer that isolates the definition and creation of storage from its consumption by the deployment.
We will come back to this.

Let's highlight the short definitions of PersistentVolume and PersistentVolumeClaim:

A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.

A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany or ReadWriteMany, see AccessModes).

➤ kb api-resources|awk 'NR==1 || /PersistentVolume/'
NAME                              SHORTNAMES   APIVERSION                             NAMESPACED   KIND
persistentvolumeclaims            pvc          v1                                     true         PersistentVolumeClaim
persistentvolumes                 pv           v1                                     false        PersistentVolume

Persistent volume provisioning comes in two flavours, STATIC and DYNAMIC; let's start with the static one.

2.1 Static Persistent Volume

Let's create a static PV of type hostPath. In an ideal world an admin would probably do this, hand it to the average user and say: here it is, may God's will be done.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-pv01
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "" # empty string = no dynamic provisioning
  capacity:
    storage: 10Mi
  hostPath:
    path: /tmp/static-pv01
    type: DirectoryOrCreate # create the directory if it does not exist
  nodeAffinity: # we want the volume to live on node k3d-deadless-agent-0
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k3d-deadless-agent-0
  persistentVolumeReclaimPolicy: Delete # delete the PV once its bound PVC is gone
  claimRef: # allow only the PVC static-pvc01 from the default namespace to claim this PV
     namespace: default
     name: static-pvc01

- ReadWriteOnce – the volume can be mounted as READ-WRITE by a SINGLE NODE
- ReadOnlyMany – the volume can be mounted READ-ONLY by MANY NODES
- ReadWriteMany – the volume can be mounted as READ-WRITE by MANY NODES

accessModes overview

A PVC to PV binding is a one-to-one mapping, using a ClaimRef which is a bi-directional binding between the PersistentVolume and the PersistentVolumeClaim.

➤ kb get pv static-pv01
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                  STORAGECLASS   REASON   AGE
static-pv01   10Mi       RWO            Delete           Available   default/static-pvc01                           2m27s

OK, let's create a PVC which we will later reference from the DEPLOYMENT.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-pvc01
  namespace: default
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
➤ kb get pvc static-pvc01 && kb get pv static-pv01
NAME           STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
static-pvc01   Bound    static-pv01   10Mi       RWO                           72s
NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS   REASON   AGE
static-pv01   10Mi       RWO            Delete           Bound    default/static-pvc01                           10m

OK, and a deployment that references the created PVC:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: static-busy
  name: static-busy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-busy
  template:
    metadata:
      labels:
        app: static-busy
    spec:
      containers:
      - image: busybox
        command: ["sh", "-c", "sleep 120"]
        name: busybox-static01
        volumeMounts:
          - name: static
            mountPath: /data
      volumes:
      - name: static
        persistentVolumeClaim:
          claimName: static-pvc01

Notice that the pod runs on the same node on which the PV was created.

➤ kb get pods static-busy-579b7645d7-4fhqn -o=json|jq .spec.nodeName
"k3d-deadless-agent-0"
#and the state on the node itself
➤ /usr/bin/docker ps |grep agent-0|awk '{print $1}'|xargs -I'{}'  /usr/bin/docker  exec -t {} ls -la /tmp

We will also test scaling and the affinity of the scaled-out pods, then delete the deployment; after that we delete the PVC and see what happens to the PV.
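
A possible sequence (the pod hash and exact output will differ in your cluster):

# scale out - the new replicas must land on k3d-deadless-agent-0 too, because the PV is pinned to that node
➤ kb scale deployment static-busy --replicas 3
➤ kb get pods -l app=static-busy -o wide

# clean up, then check what the Delete reclaim policy did to the PV
➤ kb delete deployment static-busy
➤ kb delete pvc static-pvc01
➤ kb get pv static-pv01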

hostPath can also be used directly as a volume; in that case, of course, no PV/PVC is needed and none gets created.
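
For completeness, a minimal sketch of hostPath used directly as a pod volume (path and names are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: hostpath-direct
  namespace: default
spec:
  containers:
  - name: hostpath-direct
    image: busybox
    command: ["sh", "-c", "sleep 120"]
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: hostvol
      mountPath: /data
  volumes:
  - name: hostvol
    hostPath:
      path: /tmp/hostpath-direct
      type: DirectoryOrCreate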

2.2 Dynamic Persistent Volume

https://kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic
While with static provisioning we create the PV ourselves, dynamic provisioning brings in the STORAGECLASS, which is essentially a vending machine for PVs. In other words, if we want dynamic provisioning, at least one StorageClass must be defined.

In its default configuration, k3d ships with one StorageClass:

➤ kb get sc
NAME                   PROVISIONER                            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path                  Delete          WaitForFirstConsumer   false                  8h
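
For illustration, the manifest behind that class looks roughly like this (a sketch; the object shipped with k3s carries a few extra annotations):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
provisioner: rancher.io/local-path
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer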

The SC has volumeBindingMode set to WaitForFirstConsumer, which means a PVC using it stays Pending until the first pod that consumes it is scheduled.
So let's build a workload similar to the previous example, just using the SC.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc01
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Mi
➤ kb get pvc
NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dynamic-pvc01   Pending                                      local-path     64s
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dynamic-ngx
  name: dynamic-ngx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dynamic-ngx
  template:
    metadata:
      labels:
        app: dynamic-ngx
    spec:
      containers:
      - name: dynamic-ngx
        image: nginx:stable-alpine
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: dyn
          mountPath: /data
        ports:
        - containerPort: 80
      - image: bash
        command: ["bash", "-c", "x=0;while true;do ((x=x+1));sleep 1;echo $x > /data/ticktock;done"]
        name: dynamic-bash
        volumeMounts:
          - name: dyn
            mountPath: /data
      volumes:
      - name: dyn
        persistentVolumeClaim:
          claimName: dynamic-pvc01
➤  kb get pvc dynamic-pvc01 && kb get pv pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dynamic-pvc01   Bound    pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb   10Mi       RWO            local-path     4m10s
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS   REASON   AGE
pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb   10Mi       RWO            Delete           Bound    default/dynamic-pvc01   local-path              2m9s

And the resulting PV:

➤ kb neat get pv pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rancher.io/local-path
  name: pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Mi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: dynamic-pvc01
    namespace: default
    resourceVersion: "35571"
    uid: b9a9caff-ab3e-471d-94ca-41cb4644fecb
  hostPath:
    path: /var/lib/rancher/k3s/storage/pvc-b9a9caff-ab3e-471d-94ca-41cb4644fecb_default_dynamic-pvc01
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k3d-deadless-server-0
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-path
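
To verify that both containers really share the volume, you can read the file written by the bash sidecar from the nginx container, e.g. (a sketch):

➤ kb exec deploy/dynamic-ngx -c dynamic-ngx -- cat /data/ticktock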

That's it for static and dynamic provisioning for now; definitely try it out yourselves. There are a few more aspects to it that we will go through next time. Try to think about what this brings you.

3. Defining a new SC and ReadWriteMany

Unfortunately I did not manage to get the Longhorn block-device provisioner running on k3d.

VOLUNTEER WANTED: if anyone would like to pick this up and try to get it running, I would appreciate it; I would really like to see LONGHORN working and someone digging deeper into it. The reason it does not work is 99% on the k3d side (iSCSI).

I want you!!

So I went with a simpler solution: an NFS provisioner. It is essentially an NFS server running on the master node.
The NFS server uses the default SC to create the volume it serves its exports from. For simulating ReadWriteMany and a StatefulSet it is good enough.

origin git repo
helm git repo

We will get to Helm and other templating tools later.

➤ git clone https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner

#template the resources into the render/ directory for inspection (optional, but worth a look); run from the chart directory in the cloned repo
➤ helm template nfs-server-provisioner .  \
  --namespace nfs-server-provisioner --create-namespace \
  --set persistence.storageClass="local-path" \
  --set persistence.size="5Gi" \
  --set persistence.enabled=true \
  --set "mountOptions={tcp,nfsvers=4.1}" \
  --output-dir render

#install
➤ helm install nfs-server-provisioner .  \
  --namespace nfs-server-provisioner --create-namespace \
  --set persistence.storageClass="local-path" \
  --set persistence.size="5Gi" \
  --set persistence.enabled=true \
  --set "mountOptions={tcp,nfsvers=4.1}"
➤ kb get sc
NAME                   PROVISIONER                            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path                  Delete          WaitForFirstConsumer   false                  10h
nfs                    cluster.local/nfs-server-provisioner   Delete          Immediate              true                   12s

and as you can see, it grabbed a 5Gi volume for itself

➤ oc get pv|awk 'NR==1 ||/nfs/'
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                  STORAGECLASS   REASON   AGE
pvc-013000fa-bb3c-49f7-9293-04d36ba715f8   5Gi        RWO            Delete           Bound    nfs-server-provisioner/data-nfs-server-provisioner-0   local-path              84s

So, we have the SC; now let's try to dynamically create a PV as ReadWriteMany.

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-pvc01
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nfs-bash
  name: nfs-bash
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nfs-bash
  template:
    metadata:
      labels:
        app: nfs-bash
    spec:
      containers:
      - image: bash
        command: ["sh", "-c", "sleep 120"]
        name: nfs-bash
        volumeMounts:
          - name: nfs
            mountPath: /data
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: nfs-pvc01

The PV created by the SC:

➤ kb neat get pv pvc-e7c48e08-c610-454c-ae22-464e1cb6628e -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    EXPORT_block: "\nEXPORT\n{\n\tExport_Id = 1;\n\tPath = /export/pvc-e7c48e08-c610-454c-ae22-464e1cb6628e;\n\tPseudo
      = /export/pvc-e7c48e08-c610-454c-ae22-464e1cb6628e;\n\tAccess_Type = RW;\n\tSquash
      = no_root_squash;\n\tSecType = sys;\n\tFilesystem_id = 1.1;\n\tFSAL {\n\t\tName
      = VFS;\n\t}\n}\n"
    Export_Id: "1"
    Project_Id: "0"
    Project_block: ""
    Provisioner_Id: 0eee7408-12ab-45a9-a28e-c76eab64144d
    kubernetes.io/createdby: nfs-dynamic-provisioner
    pv.kubernetes.io/provisioned-by: cluster.local/nfs-server-provisioner
  name: pvc-e7c48e08-c610-454c-ae22-464e1cb6628e
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 100Mi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: nfs-pvc01
    namespace: default
    resourceVersion: "37860"
    uid: e7c48e08-c610-454c-ae22-464e1cb6628e
  mountOptions:
  - tcp
  - nfsvers=4.1
  nfs:
    path: /export/pvc-e7c48e08-c610-454c-ae22-464e1cb6628e
    server: 10.43.138.238
  persistentVolumeReclaimPolicy: Delete
  storageClassName: nfs

Let's scale out; since there is no affinity and the volume is ReadWriteMany, the pods should spread across multiple nodes.

➤ kb scale deployment --replicas 3 nfs-bash

# and bingo
➤ kb get pods -l app=nfs-bash -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}'
nfs-bash-86f8d6494-mxlqv        k3d-deadless-agent-1
nfs-bash-86f8d6494-br2vm        k3d-deadless-agent-0
nfs-bash-86f8d6494-zj8wh        k3d-deadless-agent-0
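
Since all three pods mount the same NFS export read-write, a quick sanity check is to write a file from a pod on one node and read it back from a pod on another (pod names taken from the listing above, yours will differ):

➤ kb exec nfs-bash-86f8d6494-mxlqv -- sh -c 'echo hello-from-agent-1 > /data/shared'
➤ kb exec nfs-bash-86f8d6494-br2vm -- cat /data/shared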

4. StatefulSet

So far we have only dealt with the Deployment, which, as it turns out, is not entirely suitable for state; there are plenty of cases where using it is fine, but also plenty where it is not.
For the latter case, Kubernetes offers a special object: the StatefulSet (STS).

The difference between a StatefulSet and a Deployment (Pod), as explained on Stack Overflow. I probably could not put it better myself, so here you go.

Yes, a regular pod can use a persistent volume. However, sometimes you have multiple pods that logically form a “group”. Examples of this would be database replicas, ZooKeeper hosts, Kafka nodes, etc. In all of these cases there’s a bunch of servers and they work together and talk to each other. What’s special about them is that each individual in the group has an identity. For example, for a database cluster one is the master and two are followers and each of the followers communicates with the master letting it know what it has and has not synced. So the followers know that “db-x-0” is the master and the master knows that “db-x-2” is a follower and has all the data up to a certain point but still needs data beyond that.

In such situations you need a few things you can’t easily get from a regular pod:

  • A predictable name: you want to start your pods telling them where to find each other so they can form a cluster, elect a leader, etc. but you need to know their names in advance to do that. Normal pod names are random so you can’t know them in advance.
  • A stable address/DNS name: you want whatever names were available in step (1) to stay the same. If a normal pod restarts (you redeploy, the host where it was running dies, etc.) on another host it’ll get a new name and a new IP address.
  • A persistent link between an individual in the group and their persistent volume: if the host where one of your database master was running dies it’ll get moved to a new host but should connect to the same persistent volume as there’s one and only 1 volume that contains the right data for that “individual”. So, for example, if you redeploy your group of 3 database hosts you want the same individual (by DNS name and IP address) to get the same persistent volume so the master is still the master and still has the same data, replica1 gets it’s data, etc.

StatefulSets solve these issues because they provide all of the above:

- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, graceful deletion and termination.  

I didn’t really talk about (3) and (4) but that can also help with clusters as you can tell the first one to deploy to become the master and the next one find the first and treat it as master, etc.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: default
  name: exsts
  labels:
    app: exsts
spec:
  selector:
    matchLabels:
      app: exsts # has to match .spec.template.metadata.labels
  serviceName: exsts-svc # a headless Service to control the network domain --> the Service is NOT created automatically
  updateStrategy:
    type: RollingUpdate
  replicas: 1 # defaults to 1
  template:
    metadata:
      labels:
        app: exsts # has to match .spec.selector.matchLabels
    spec:
      # terminationGracePeriodSeconds: 10
      containers:
      - name: exsts
        image: docker.io/library/busybox:latest
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
        command: ["/bin/sh", "-c", "--"]
        args: ["while true; do sleep 30;done;"]
        volumeMounts:
        - name: sts-novct
          mountPath: /dadada
      volumes:
        - name: sts-novct
          persistentVolumeClaim:
            claimName: sts-pvc01
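
After applying the STS, note that the pod names are predictable, ordinal-suffixed instead of a random hash (a sketch of the expected listing):

➤ kb get pods -l app=exsts
NAME      READY   STATUS    RESTARTS   AGE
exsts-0   1/1     Running   0          1m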

OK, but the example above mounts a single pre-created PVC (sts-pvc01), so if we scale out we would need a ReadWriteMany persistent volume. The STS deals with this through volumeClaimTemplates:

  volumeClaimTemplates:
  - metadata:
      name: exsts-vct
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-path"
      resources:
        requests:
          storage: 1Gi
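
With the template in place, the volumes: section goes away, the container's volumeMounts must reference the template name (exsts-vct), and each replica gets its own PVC named <claim-template>-<statefulset>-<ordinal>. After scaling to 3 replicas you would see something like this (a sketch; volume names and ages will differ):

➤ kb get pvc
NAME                STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
exsts-vct-exsts-0   Bound    pvc-...   1Gi        RWO            local-path
exsts-vct-exsts-1   Bound    pvc-...   1Gi        RWO            local-path
exsts-vct-exsts-2   Bound    pvc-...   1Gi        RWO            local-path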

Headless Service aka spec.serviceName

apiVersion: v1
kind: Service
metadata:
  name: exsts-svc
  labels:
    app: exsts
spec:
  clusterIP: None # important
  selector:
    app: exsts
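
One way to run the lookups below is from a throwaway pod inside the cluster, e.g. (a sketch; any image that ships nslookup will do):

➤ kb run -it --rm dns-test --image=busybox:1.28 --restart=Never -- sh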
#looking up the headless service name returns one A record per backing pod
Server:         10.43.0.10
Address:        10.43.0.10:53

Name:   exsts-svc.default.svc.cluster.local
Address: 10.42.2.14
Name:   exsts-svc.default.svc.cluster.local
Address: 10.42.0.42
Name:   exsts-svc.default.svc.cluster.local
Address: 10.42.2.16

#looking up an individual pod via its stable DNS name
nslookup exsts-0.exsts-svc.default.svc.cluster.local
Server:         10.43.0.10
Address:        10.43.0.10:53

As some have noted, you can indeed get some of the same benefits by using regular pods and services, but it's much more work. For example, if you wanted 3 database instances you could manually create 3 deployments and 3 services. Note that you must manually create 3 deployments as you can’t have a service point to a single pod in a deployment. Then, to scale up you’d manually create another deployment and another service. This does work and was somewhat common practice before PetSet/StatefulSet came along. Note that it is missing some of the benefits listed above (persistent volume mapping & fixed start order, for example).

Each connection to the service is forwarded to one randomly selected backing pod. But what if the client needs to connect to all of those pods? What if the backing pods themselves each need to connect to all the other backing pods? Connecting through the service clearly isn’t the way to do this. What is?

For a client to connect to all pods, it needs to figure out the IP of each individual pod. One option is to have the client call the Kubernetes API server and get the list of pods and their IP addresses through an API call, but because you should always strive to keep your apps Kubernetes-agnostic, using the API server isn’t ideal.

Luckily, Kubernetes allows clients to discover pod IPs through DNS lookups. Usually, when you perform a DNS lookup for a service, the DNS server returns a single IP — the service’s cluster IP. But if you tell Kubernetes you don’t need a cluster IP for your service (you do this by setting the clusterIP field to None in the service specification ), the DNS server will return the pod IPs instead of the single service IP. Instead of returning a single DNS A record, the DNS server will return multiple A records for the service, each pointing to the IP of an individual pod backing the service at that moment. Clients can therefore do a simple DNS A record lookup and get the IPs of all the pods that are part of the service. The client can then use that information to connect to one, many, or all of them.

Setting the clusterIP field in a service spec to None makes the service headless, as Kubernetes won’t assign it a cluster IP through which clients could connect to the pods backing it.

Addon

the Raft protocol in etcd

HOMEWORK

go_johny_go

  1. create a cluster with 3 worker nodes (a hint follows after this list)
  2. create a default storage class for a provider of your choice (hostPath, NFS, ...)
  3. create a Redis cluster as an STS with 6 pods. The pods should be distributed "evenly" across the worker nodes (i.e. each node runs 2 pods) REDIS CLUSTER
    Each pod gets its own PV and uses the storage class you defined.
  4. check the readiness and liveness probes of the individual pods; if there are none, define them
  5. expose the database to the outside world (i.e. connect to it from your local host via the CLI)
  6. find out which pod is the leader
  7. delete one worker node of the cluster and check what happened to the individual Redis replicas
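
Hint for step 1: a k3d invocation along these lines should do (the cluster name is up to you):

➤ k3d cluster create workshop --agents 3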