Pod Detailed Explanation #2
https://medium.com/aws-in-plain-english/pod-detailed-explanation-dfca4eac18e5
Creation and Termination
Pod Creation Process
- A user submits the definition of the Pod to be created to the apiServer through kubectl or another API client.
- The apiServer generates the Pod object, stores its information in etcd, and returns a confirmation to the client.
- The apiServer reflects changes to the Pod object in etcd, and other components track those changes on the apiServer through the watch mechanism.
- The scheduler discovers that a new Pod needs to be placed, allocates a node for the Pod, and updates the result on the apiServer.
- The kubelet on the target node detects the scheduled Pod, calls docker to start the containers, and reports the result back to the apiServer.
- The apiServer stores the received Pod status information in etcd.
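These steps can be observed on a running cluster; a minimal sketch, assuming a dev namespace and an illustrative pod name (not part of the original walkthrough):
# Create a pod and watch the events that correspond to the steps above
kubectl run demo-nginx --image=nginx:1.17.1 -n dev
kubectl get events -n dev --watch           # Scheduled comes from the scheduler; Pulled/Created/Started come from the kubelet
kubectl get pod demo-nginx -n dev -o yaml   # the status section reflects what the kubelet reported back to the apiServer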

Pod Termination Process
- Users send a command to apiServer to delete the pod object.
- The Pod object in the apiServer is updated with a deletion grace period (30s by default); after this period elapses, the pod is considered dead.
- The pod is marked as being in the terminating state.
- The kubelet starts the pod shutdown process as soon as it detects that the pod object is in the terminating state.
- The endpoints controller notices the pod's shutdown and removes it from the endpoints lists of all Services that match the pod.
- If the current pod object defines a preStop hook handler, it will start executing synchronously after being marked as terminating.
- The container process in the pod object receives the stop signal.
- After the grace period, if there are still running processes in the pod, the pod object will receive an immediate termination signal.
- kubelet requests apiServer to set the grace period of this pod resource to 0 to complete the deletion operation. At this time, the pod is not visible to users.
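The grace period in the process above can be adjusted or skipped on the command line; a minimal sketch (the pod name is illustrative):
kubectl delete pod pod-example -n dev                            # uses the default 30s grace period
kubectl delete pod pod-example -n dev --grace-period=60          # extend the grace period to 60 seconds
kubectl delete pod pod-example -n dev --grace-period=0 --force   # skip the grace period and delete immediately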
Init Containers
Init Containers are containers that run before the main container in a pod, and they are used to perform certain setup or initialization tasks, such as initializing a database or waiting for a service to start. They have two main characteristics:
- Init Containers must complete successfully before the main container starts, and if an init container fails, Kubernetes will restart it until it succeeds.
- Init Containers are executed in a specific order, and each init container will only run after the previous one has completed successfully.
Init Containers have many use cases, but the most common ones are:
- Providing tools or custom code that is not included in the main container’s image.
- Delaying the startup of the main container until certain dependencies are met.
To demonstrate the use of Init Containers, we can create a pod that requires MySQL and Redis services to be available before the main container (nginx) starts. For testing purposes, we can assume that the MySQL server is at 192.168.109.201 and the Redis server is at 192.168.109.202.
We can create a YAML file named pod-initcontainer.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: pod-initcontainer
  namespace: dev
spec:
  containers:
  - name: main-container
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
  initContainers:
  - name: test-mysql
    image: busybox:1.30
    command: ['sh', '-c', 'until ping 192.168.109.201 -c 1 ; do echo waiting for mysql...; sleep 2; done;']
  - name: test-redis
    image: busybox:1.30
    command: ['sh', '-c', 'until ping 192.168.109.202 -c 1 ; do echo waiting for redis...; sleep 2; done;']
We can then create the pod with the following command:
# Create pod
[root@master ~]# kubectl create -f pod-initcontainer.yaml
pod/pod-initcontainer created

# Check pod status
# It was found that the pod was stuck in the process of starting the first init container, and the subsequent containers would not run
[root@master ~]# kubectl describe pod pod-initcontainer -n dev
........
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  49s   default-scheduler  Successfully assigned dev/pod-initcontainer to node1
  Normal  Pulled     48s   kubelet, node1     Container image "busybox:1.30" already present on machine
  Normal  Created    48s   kubelet, node1     Created container test-mysql
  Normal  Started    48s   kubelet, node1     Started container test-mysql

# View the pod dynamically
[root@master ~]# kubectl get pods pod-initcontainer -n dev -w
NAME                READY   STATUS            RESTARTS   AGE
pod-initcontainer   0/1     Init:0/2          0          15s
pod-initcontainer   0/1     Init:1/2          0          52s
pod-initcontainer   0/1     Init:1/2          0          53s
pod-initcontainer   0/1     PodInitializing   0          89s
pod-initcontainer   1/1     Running           0          90s

# Then open a new shell to add two IPs to the current server and observe the changes in the pod
[root@master ~]# ifconfig ens33:1 192.168.109.201 netmask 255.255.255.0 up
[root@master ~]# ifconfig ens33:2 192.168.109.202 netmask 255.255.255.0 up
Hook Functions
Hook functions can sense events in their own lifecycle and run user-specified program code when the corresponding event occurs.
Kubernetes provides two hooks, one after the main container starts and one before it stops:
- postStart: executed right after the container is created; if it fails, the container is restarted.
- preStop: executed before the container is terminated; the deletion of the container is blocked until the handler completes.
Hook handlers support defining actions in the following three ways:
- Exec command: execute a command once in the container
……
  lifecycle:
    postStart:
      exec:
        command:
        - cat
        - /tmp/healthy
……
- TCPSocket: attempt to access the specified socket in the current container
……
  lifecycle:
    postStart:
      tcpSocket:
        port: 8080
……
- HTTPGet: send an HTTP request to a specific URL in the current container
……
  lifecycle:
    postStart:
      httpGet:
        path: / # URI address
        port: 80 # port number
        host: 192.168.109.100 # host address
        scheme: HTTP # supported protocol, http or https
……
Next, let’s take the example of using the exec method to demonstrate the use of hooks. Create a pod-hook-exec.yaml file with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: pod-hook-exec
  namespace: dev
spec:
  containers:
  - name: main-container
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    lifecycle:
      postStart:
        exec: # Execute a command to modify the default homepage content of nginx when the container starts
          command: ["/bin/sh", "-c", "echo postStart... > /usr/share/nginx/html/index.html"]
      preStop:
        exec: # Stop the nginx service before the container stops
          command: ["/usr/sbin/nginx","-s","quit"]
# Create the pod
[root@master ~]# kubectl create -f pod-hook-exec.yaml
pod/pod-hook-exec created
# Check the pod
[root@master ~]# kubectl get pods pod-hook-exec -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE
pod-hook-exec 1/1 Running 0 29s 10.244.2.48 node2
# Access the pod
[root@master ~]# curl 10.244.2.48
postStart...
Container Probes
Container probes are used to detect whether the application instance in a container is working properly, and they are a traditional mechanism for ensuring service availability. If the instance's status does not meet expectations after probing, Kubernetes "removes" the problematic instance by restarting it or by no longer routing traffic to it. Kubernetes provides two types of probes for container probing:
- Liveness probes: used to check whether the application instance is currently in normal running state. If not, Kubernetes will restart the container.
- Readiness probes: used to check whether the application instance is ready to accept requests. If not, Kubernetes will not forward traffic.
The livenessProbe determines whether to restart the container, while the readinessProbe determines whether to forward requests to the container.
Both probes currently support three detection methods:
- Exec command: execute a command once in the container. If the exit code of the command is 0, the program is considered normal; otherwise, it is not.
- TCPSocket: attempt to access a user container port. If the connection can be established, the program is considered normal; otherwise, it is not.
- HTTPGet: call the URL of the web application in the container. If the returned status code is between 200 and 399, the program is considered normal; otherwise, it is not.
Here we will take liveness probes as an example to do some demonstrations:
Method 1: Exec
Create a file named pod-liveness-exec.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: pod-liveness-exec
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    livenessProbe:
      exec:
        command: ["/bin/cat","/tmp/hello11.txt"] # Execute a command to view a (non-existent) file
Create a Pod and observe the effect:
# Create the Pod
[root@master ~]# kubectl create -f pod-liveness-exec.yaml
pod/pod-liveness-exec created

# Check the Pod details
[root@master ~]# kubectl describe pods pod-liveness-exec -n dev
......
  Normal   Created    20s (x2 over 50s)  kubelet, node1  Created container nginx
  Normal   Started    20s (x2 over 50s)  kubelet, node1  Started container nginx
  Normal   Killing    20s                kubelet, node1  Container nginx failed liveness probe, will be restarted
  Warning  Unhealthy  0s (x5 over 40s)   kubelet, node1  Liveness probe failed: cat: can't open '/tmp/hello11.txt': No such file or directory

# Observe that the nginx container performs health checks after starting
# After the check fails, the container is killed and then restarted (this is the effect of the restart strategy, which will be explained later)
# After waiting for a while, check the pod information again, and you can see that RESTARTS is no longer 0 but keeps increasing
[root@master ~]# kubectl get pods pod-liveness-exec -n dev
NAME                READY   STATUS             RESTARTS   AGE
pod-liveness-exec   0/1     CrashLoopBackOff   2          3m19s

# Of course, you can change it to an existing file, such as /tmp/hello.txt, and try again, and the result will be normal...
Method 2: TCPSocket
Create a file named pod-liveness-tcpsocket.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: pod-liveness-tcpsocket
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 8080 # Attempt to access port 8080
Create the pod and observe the result:
# Create the pod
[root@master ~]# kubectl create -f pod-liveness-tcpsocket.yaml
pod/pod-liveness-tcpsocket created

# View the pod details
[root@master ~]# kubectl describe pods pod-liveness-tcpsocket -n dev
......
  Normal   Scheduled  31s                            default-scheduler  Successfully assigned dev/pod-liveness-tcpsocket to node2
  Normal   Pulled     <invalid>                      kubelet, node2     Container image "nginx:1.17.1" already present on machine
  Normal   Created    <invalid>                      kubelet, node2     Created container nginx
  Normal   Started    <invalid>                      kubelet, node2     Started container nginx
  Warning  Unhealthy  <invalid> (x2 over <invalid>)  kubelet, node2     Liveness probe failed: dial tcp 10.244.2.44:8080: connect: connection refused

# Observe that the probe attempted to access port 8080 but failed
# After waiting for a while, view the pod information again to see that RESTARTS has increased from 0 to a higher number
[root@master ~]# kubectl get pods pod-liveness-tcpsocket -n dev
NAME                     READY   STATUS             RESTARTS   AGE
pod-liveness-tcpsocket   0/1     CrashLoopBackOff   2          3m19s

# Of course, you can change the port to a valid one, such as 80, and try again to see that the result is normal...
Method 3: HTTPGet
Create the pod-liveness-httpget.yaml file and apply it:
apiVersion: v1
kind: Pod
metadata:
  name: pod-liveness-httpget
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    livenessProbe:
      httpGet:
        scheme: HTTP
        port: 80
        path: /hello
Create the pod and observe the results:
# Create the pod
[root@master ~]# kubectl create -f pod-liveness-httpget.yaml
pod/pod-liveness-httpget created

# Check the pod details
[root@master ~]# kubectl describe pod pod-liveness-httpget -n dev
.......
  Normal   Pulled     6s (x3 over 64s)  kubelet, node1  Container image "nginx:1.17.1" already present on machine
  Normal   Created    6s (x3 over 64s)  kubelet, node1  Created container nginx
  Normal   Started    6s (x3 over 63s)  kubelet, node1  Started container nginx
  Warning  Unhealthy  6s (x6 over 56s)  kubelet, node1  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    6s (x2 over 36s)  kubelet, node1  Container nginx failed liveness probe, will be restarted

# The probe tries to access the specified path but fails due to a 404 error.
# After a while, the RESTARTS value keeps increasing instead of being 0
[root@master ~]# kubectl get pod pod-liveness-httpget -n dev
NAME                   READY   STATUS    RESTARTS   AGE
pod-liveness-httpget   1/1     Running   5          3m17s

# The pod can be fixed by changing the path to a valid one such as '/'
By exploring the livenessProbe sub-properties, we can see that there are additional configurations available beyond the three types of probes demonstrated above. These properties are described below:
FIELDS:
  exec                  <Object>
  tcpSocket             <Object>
  httpGet               <Object>
  initialDelaySeconds   <integer>  # The number of seconds after the container has started before the first probe is executed.
  timeoutSeconds        <integer>  # The probe timeout in seconds. Defaults to 1 second with a minimum of 1 second.
  periodSeconds         <integer>  # How often in seconds to perform the probe. Defaults to 10 seconds with a minimum of 1 second.
  failureThreshold      <integer>  # The number of times in a row that the probe must fail for the container to be considered failed. Defaults to 3 with a minimum of 1.
  successThreshold      <integer>  # The number of times in a row that the probe must succeed for the container to be considered successful. Defaults to 1.
Next, we will configure two more properties to demonstrate their effects:
[root@master ~]# more pod-liveness-httpget.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-liveness-httpget
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    livenessProbe:
      httpGet:
        scheme: HTTP
        port: 80
        path: /
      initialDelaySeconds: 30 # Delay time in seconds after the container starts before the first probe is executed.
      timeoutSeconds: 5 # Time in seconds after which the probe times out.
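The demonstrations above only cover livenessProbe. A readinessProbe is declared with the same structure and detection methods; the difference is that a failed check stops traffic from being forwarded to the pod instead of restarting the container. A minimal sketch (the pod name and timing values are illustrative, not from the original examples):
apiVersion: v1
kind: Pod
metadata:
  name: pod-readiness-httpget
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    readinessProbe:
      httpGet:
        scheme: HTTP
        port: 80
        path: / # traffic is forwarded only while this request succeeds
      initialDelaySeconds: 5
      periodSeconds: 10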
Restart Policy
In the previous section, once the liveness probe detected a problem, Kubernetes restarted the container, and this behavior is determined by the Pod's restart policy. There are three restart policies for Pods, as follows:
- Always: Restarts the container whenever it exits, regardless of exit code. This is the default value.
- OnFailure: Restarts the container only when it terminates with a non-zero exit code.
- Never: Never restarts the container regardless of its state.
The restart policy applies to all containers in the Pod. The first restart of a container happens immediately when needed; subsequent restarts are delayed by the kubelet with an increasing back-off of 10s, 20s, 40s, 80s, 160s, and 300s, where 300s is the maximum delay.
Create a pod-restartpolicy.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-restartpolicy
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    ports:
    - name: nginx-port
      containerPort: 80
    livenessProbe:
      httpGet:
        scheme: HTTP
        port: 80
        path: /hello
  restartPolicy: Never # set the restart policy to Never
Run the Pod for testing:
# Create the Pod
[root@master ~]# kubectl create -f pod-restartpolicy.yaml
pod/pod-restartpolicy created

# Check the Pod details and find that the nginx container has failed
[root@master ~]# kubectl describe pods pod-restartpolicy -n dev
......
  Warning  Unhealthy  15s (x3 over 35s)  kubelet, node1  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    15s                kubelet, node1  Container nginx failed liveness probe

# Wait for a while and observe the number of Pod restarts, which is always 0, and the container is never restarted
[root@master ~]# kubectl get pods pod-restartpolicy -n dev
NAME                READY   STATUS    RESTARTS   AGE
pod-restartpolicy   0/1     Running   0          5min42s
Pod scheduling
By default, Kubernetes determines on which node a pod should run using a set of algorithms executed by the scheduler component. This process is not under human control. However, in practice, this is not always sufficient because there are many cases where we want to control which pod should run on which node. This requires an understanding of Kubernetes scheduling rules for pods. Kubernetes provides four types of scheduling:
- Automatic scheduling: The node on which the pod runs is entirely calculated by the scheduler using a series of algorithms.
- Directed scheduling: NodeName, NodeSelector.
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity.
- Taint (tolerance) scheduling: Taints, Tolerations.
Directed scheduling
Directed scheduling refers to using the nodeName or nodeSelector declared on the pod to place the pod on the desired node. Note that this scheduling is forced: the pod is assigned to the target even if that node does not exist, in which case the pod simply fails to run.
NodeName
NodeName forcibly binds the pod to the node with the specified name. This skips the scheduler's scheduling logic entirely and assigns the pod directly to the named node.
Next, let’s do an experiment: Create a file named pod-nodename.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1 # Schedule to node1
# Create the pod
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
# Check which node the pod is scheduled to run on
[root@master ~]# kubectl get pods pod-nodename -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE ......
pod-nodename 1/1 Running 0 56s 10.244.1.87 node1 ......
# Next, delete the pod and modify nodeName value to node3 (which doesn't exist)
[root@master ~]# kubectl delete -f pod-nodename.yaml
pod "pod-nodename" deleted
[root@master ~]# vim pod-nodename.yaml
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
# Check again and see that it is scheduled towards node3, but as there is no node3, the pod is unable to run
[root@master ~]# kubectl get pods pod-nodename -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE ......
pod-nodename 0/1 Pending 0 6s <none> node3 ......
NodeSelector
NodeSelector is used to schedule pods to nodes that have specific labels. It is implemented through Kubernetes’ label-selector mechanism, which means that before a pod is created, the scheduler uses the MatchNodeSelector scheduling policy to match labels and find the target node. The pod is then scheduled to the target node, and this matching rule is enforced.
Next, let’s do an experiment:
1. First, add a label to each of the nodes.
[root@master ~]# kubectl label nodes node1 nodeenv=pro
node/node1 labeled
[root@master ~]# kubectl label nodes node2 nodeenv=test
node/node2 labeled
2. Create a YAML file pod-nodeselector.yaml and use it to create a pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: pro # Schedule the pod on a node with the label 'nodeenv=pro'
# Create the pod
[root@master ~]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
# Check the node that the pod is scheduled on
[root@master ~]# kubectl get pods pod-nodeselector -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE ......
pod-nodeselector 1/1 Running 0 47s 10.244.1.87 node1 ......
3. Next, delete the pod and modify the nodeSelector value to a non-existent label, such as nodeenv: xxxx.
# Delete the pod and modify the YAML file
[root@master ~]# kubectl delete -f pod-nodeselector.yaml
pod "pod-nodeselector" deleted
[root@master ~]# vim pod-nodeselector.yaml

# Create the pod again
[root@master ~]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created

# Check the pod status, and find that it cannot be scheduled on any node
[root@master ~]# kubectl get pods -n dev -o wide
NAME               READY   STATUS    RESTARTS   AGE     IP       NODE
pod-nodeselector   0/1     Pending   0          2m20s   <none>   <none>

# Check the details and find that the node selector failed
[root@master ~]# kubectl describe pods pod-nodeselector -n dev
.......
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Affinity Scheduling
In the previous section, we introduced two directed scheduling methods. They are convenient to use, but they have a limitation: if no node meets the conditions, the Pod will not be scheduled even though other usable nodes exist in the cluster. This limits their use cases.
To address this problem, Kubernetes provides another type of scheduling called Affinity. It extends NodeSelector: with preferred rules, the scheduler favors nodes that satisfy the conditions but can still fall back to other nodes if none match, which makes scheduling more flexible.
Affinity can be divided into three categories:
- NodeAffinity: targets nodes and determines which nodes a pod can be scheduled to
- PodAffinity: targets pods; schedules a new pod into the same topology domain as specified running pods
- PodAntiAffinity: targets pods; keeps a new pod out of the topology domain of specified running pods
Regarding the usage scenarios of Affinity/Anti-Affinity:
Affinity: If two applications frequently interact with each other, it is necessary to use Affinity to make the two applications as close as possible to reduce performance loss due to network communication.
Anti-Affinity: When an application is deployed with multiple replicas, it is necessary to use Anti-Affinity to distribute each application instance to different nodes, thereby improving the high availability of the service.
NodeAffinity
First, let’s take a look at the configurable options for NodeAffinity:
pod.spec.affinity.nodeAffinity
  requiredDuringSchedulingIgnoredDuringExecution   The node must meet all specified rules to be scheduled (hard limit)
    nodeSelectorTerms        Node selection list
      matchFields            A list of node selector requirements based on node fields
      matchExpressions       A list of node selector requirements based on node labels (recommended)
        key                  Key
        values               Values
        operator             Operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  preferredDuringSchedulingIgnoredDuringExecution  Prefer to schedule to nodes that meet the specified rules (soft limit / preference)
    preference               A node selector term associated with the corresponding weight
      matchFields            A list of node selector requirements based on node fields
      matchExpressions       A list of node selector requirements based on node labels (recommended)
        key                  Key
        values               Values
        operator             Operator; supports In, NotIn, Exists, DoesNotExist, Gt, Lt
    weight                   Preference weight, in the range of 1-100
Explanation of operators:
- matchExpressions:
  - key: nodeenv # Match nodes with a label key of 'nodeenv'
    operator: Exists
  - key: nodeenv # Match nodes with a label key of 'nodeenv' and a value of 'xxx' or 'yyy'
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv # Match nodes with a label key of 'nodeenv' and a value greater than 'xxx'
    operator: Gt
    values: "xxx"
Next, let’s demonstrate requiredDuringSchedulingIgnoredDuringExecution.
Create pod-nodeaffinity-required.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
        nodeSelectorTerms:
        - matchExpressions: # match nodes whose "nodeenv" value is in ["xxx","yyy"]
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
# create the pod
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
# check the pod status (should be pending)
[root@master ~]# kubectl get pods pod-nodeaffinity-required -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE ......
pod-nodeaffinity-required 0/1 Pending 0 16s <none> <none> ......
# check the details of the pod
# it failed to schedule because node selection failed
[root@master ~]# kubectl describe pod pod-nodeaffinity-required -n dev
......
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
# stop the pod
[root@master ~]# kubectl delete -f pod-nodeaffinity-required.yaml
pod "pod-nodeaffinity-required" deleted
# modify the file, changing values: ["xxx","yyy"] to ["pro","yyy"]
[root@master ~]# vim pod-nodeaffinity-required.yaml
# start the pod again
[root@master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
# check again and see that the pod has been scheduled to node1
[root@master ~]# kubectl get pods pod-nodeaffinity-required -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE ......
pod-nodeaffinity-required 1/1 Running 0 11s 10.244.1.89 node1 ......
Next, let’s demonstrate preferredDuringSchedulingIgnoredDuringExecution for NodeAffinity.
Create a file called pod-nodeaffinity-preferred.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # Affinity settings
    nodeAffinity: # Node affinity settings
      preferredDuringSchedulingIgnoredDuringExecution: # Soft requirement
      - weight: 1
        preference:
          matchExpressions: # Match nodes with "nodeenv" value in ["xxx","yyy"] (which do not exist in the current environment)
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
# Create the Pod
[root@master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml
pod/pod-nodeaffinity-preferred created
# Check the status of the Pod (running successfully)
[root@master ~]# kubectl get pod pod-nodeaffinity-preferred -n dev
NAME READY STATUS RESTARTS AGE
pod-nodeaffinity-preferred 1/1 Running 0 40s
Some notes on setting NodeAffinity rules:
1. If both nodeSelector and nodeAffinity are defined, both conditions must be met for the Pod to run on the specified Node.
2. If nodeAffinity specifies multiple nodeSelectorTerms, only one of them needs to be matched successfully.
3. If a nodeSelectorTerm has multiple matchExpressions, a node must satisfy all of them to be successfully matched (see the sketch after this list).
4. If a Node on which a Pod is running no longer meets the node affinity requirements of the Pod during the Pod’s runtime due to a change in its labels, the system will not act upon this change.
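A minimal sketch illustrating points 2 and 3 above (the disktype label is illustrative and not part of the environment used in this article):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms: # terms are ORed: matching either term is enough
      - matchExpressions: # expressions within one term are ANDed: both must match
        - key: nodeenv
          operator: In
          values: ["pro"]
        - key: disktype # illustrative label
          operator: Exists
      - matchExpressions:
        - key: nodeenv
          operator: In
          values: ["test"]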
PodAffinity
PodAffinity is mainly used to schedule a new pod into the same topology domain as reference pods that are already running.
Here are the configurable items for PodAffinity:
pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution   Hard constraint
    namespaces             Specifies the namespace of the reference pods
    topologyKey            Specifies the scope of scheduling
    labelSelector          Label selector
      matchExpressions     A list of selector requirements based on labels (recommended)
        key                Key
        values             Values
        operator           Relationship operator; supports In, NotIn, Exists, and DoesNotExist
      matchLabels          Content mapped by multiple matchExpressions
  preferredDuringSchedulingIgnoredDuringExecution  Soft constraint
    podAffinityTerm        Affinity term options
      namespaces
      topologyKey
      labelSelector
        matchExpressions
          key              Key
          values           Values
          operator         Relationship operator; supports In, NotIn, Exists, and DoesNotExist
        matchLabels
    weight                 Preference weight, ranging from 1-100
The topologyKey is used to specify the scope of scheduling, for example:
- If it is specified as kubernetes.io/hostname, the scope is each individual node
- If it is specified as beta.kubernetes.io/os, the scope is nodes grouped by operating system type
Next, let’s demonstrate requiredDuringSchedulingIgnoredDuringExecution for PodAffinity:
1. First, create a target Pod, pod-podaffinity-target.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: pro # set label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1 # specify the target pod to run on node1
# Start the target pod
[root@master ~]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
# Check the pod status
[root@master ~]# kubectl get pods pod-podaffinity-target -n dev
NAME READY STATUS RESTARTS AGE
pod-podaffinity-target 1/1 Running 0 4s
2. Create pod-podaffinity-required.yaml, with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # set pod affinity
    podAffinity: # set pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard restriction
      - labelSelector:
          matchExpressions: # match pods with podenv=xxx or podenv=yyy labels
          - key: podenv
            operator: In
            values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname
The above configuration means that the new Pod must be on the same Node as a pod that has the label podenv=xxx or podenv=yyy. Obviously, there is no such pod right now. Let's run the test next.
# Start pod
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml
pod/pod-podaffinity-required created

# Check pod status and find that it is not running
[root@master ~]# kubectl get pods pod-podaffinity-required -n dev
NAME                       READY   STATUS    RESTARTS   AGE
pod-podaffinity-required   0/1     Pending   0          9s

# Check the detailed information
[root@master ~]# kubectl describe pods pod-podaffinity-required -n dev
......
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 2 node(s) didn't match pod affinity rules, 1 node(s) had taints that the pod didn't tolerate.

# Then modify values: ["xxx","yyy"] -----> values: ["pro","yyy"]
# Meaning: the new Pod must be on the same node as a pod with the label podenv=pro or podenv=yyy
[root@master ~]# vim pod-podaffinity-required.yaml

# Recreate the pod and check the results
[root@master ~]# kubectl delete -f pod-podaffinity-required.yaml
pod "pod-podaffinity-required" deleted
[root@master ~]# kubectl create -f pod-podaffinity-required.yaml
pod/pod-podaffinity-required created

# The pod is running normally now
[root@master ~]# kubectl get pods pod-podaffinity-required -n dev
NAME                       READY   STATUS    RESTARTS   AGE   LABELS
pod-podaffinity-required   1/1     Running   0          6s    <none>
We won’t demonstrate preferredDuringSchedulingIgnoredDuringExecution for PodAffinity.
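For reference, a soft pod-affinity rule would look roughly like the following (a sketch only, reusing the podenv=pro label from the example above; not run in this environment):
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
    - weight: 1
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname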
PodAntiAffinity
PodAntiAffinity is mainly used to ensure that a new Pod is not scheduled into the same topology domain as reference Pods that are already running.
The configuration options for PodAntiAffinity are the same as PodAffinity. Here is an example to demonstrate how it works:
1. We will continue to use the target Pod from the previous example.
[root@master ~]# kubectl get pods -n dev -o wide --show-labels
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE    LABELS
pod-podaffinity-required   1/1     Running   0          3m29s   10.244.1.38   node1   <none>
pod-podaffinity-target     1/1     Running   0          9m25s   10.244.1.37   node1   podenv=pro
2. Create a YAML file named pod-podantiaffinity-required.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # Set the affinity
    podAntiAffinity: # Set the PodAntiAffinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard constraint
      - labelSelector:
          matchExpressions: # Match the label podenv with value "pro"
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname
The above configuration means that the new Pod must not be scheduled on the same node as a Pod with the label podenv=pro. Let's run a test:
# Create the Pod
[root@master ~]# kubectl create -f pod-podantiaffinity-required.yaml
pod/pod-podantiaffinity-required created

# Check the Pod status
# We can see that the Pod is scheduled to node2
[root@master ~]# kubectl get pods pod-podantiaffinity-required -n dev -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE   ..
pod-podantiaffinity-required   1/1     Running   0          30s   10.244.1.96   node2  ..
Taints and Tolerations
Taints
The previous scheduling methods are from the perspective of Pods. By adding properties to the Pods, we can determine whether the Pods should be scheduled to a specific node. However, we can also approach the scheduling from the perspective of the nodes, by adding taints to the nodes to determine whether Pods are allowed to be scheduled to them.
Once a node is tainted, there is a repelling relationship between the node and the Pod, which prevents the Pod from being scheduled on the node, or even evicts already existing Pods.
The format of a taint is key=value:effect, where key and value are the labels of the taint, and effect describes the effect of the taint, which supports the following options:
- PreferNoSchedule: Kubernetes will try to avoid scheduling Pods to nodes with this taint, unless there is no other node available for scheduling.
- NoSchedule: Kubernetes will not schedule Pods to nodes with this taint, but it will not affect already existing Pods on the node.
- NoExecute: Kubernetes will not schedule Pods to nodes with this taint, and will also evict already existing Pods on the node.

Commands to set and remove taints using kubectl are shown below:
# Set a taint
kubectl taint nodes node1 key=value:effect

# Remove a taint
kubectl taint nodes node1 key:effect-

# Remove all taints with the given key
kubectl taint nodes node1 key-
Next, let’s demonstrate the effect of taints:
- Prepare node1 (to make the effect more obvious, temporarily stop the node2 node).
- Set a taint tag=heima:PreferNoSchedule for node1, then create pod1 (pod1 can be scheduled).
- Change node1's taint to tag=heima:NoSchedule, then create pod2 (pod1 keeps running normally, pod2 fails to be scheduled).
- Change node1's taint to tag=heima:NoExecute, then create pod3 (all three pods fail to run).
# Set taint for node1 (PreferNoSchedule)
[root@master ~]# kubectl taint nodes node1 tag=heima:PreferNoSchedule

# Create pod1
[root@master ~]# kubectl run taint1 --image=nginx:1.17.1 -n dev
[root@master ~]# kubectl get pods -n dev -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE
taint1-7665f7fd85-574h4   1/1     Running   0          2m24s   10.244.1.59   node1

# Modify taint for node1 (cancel PreferNoSchedule, set NoSchedule)
[root@master ~]# kubectl taint nodes node1 tag:PreferNoSchedule-
[root@master ~]# kubectl taint nodes node1 tag=heima:NoSchedule

# Create pod2
[root@master ~]# kubectl run taint2 --image=nginx:1.17.1 -n dev
[root@master ~]# kubectl get pods taint2 -n dev -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE
taint1-7665f7fd85-574h4   1/1     Running   0          2m24s   10.244.1.59   node1
taint2-544694789-6zmlf    0/1     Pending   0          21s     <none>        <none>

# Modify taint for node1 (cancel NoSchedule, set NoExecute)
[root@master ~]# kubectl taint nodes node1 tag:NoSchedule-
[root@master ~]# kubectl taint nodes node1 tag=heima:NoExecute

# Create pod3
[root@master ~]# kubectl run taint3 --image=nginx:1.17.1 -n dev
[root@master ~]# kubectl get pods -n dev -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED
taint1-7665f7fd85-htkmp   0/1     Pending   0          35s   <none>   <none>   <none>
taint2-544694789-bn7wb    0/1     Pending   0          35s   <none>   <none>   <none>
taint3-6d78dbd749-tktkq   0/1     Pending   0          6s    <none>   <none>   <none>

Quick tip: When using kubeadm to set up a cluster, the master node will be automatically marked with a taint, so pods will not be scheduled on the master node.
Toleration
We have learned about how Taints can be used to reject Pods from scheduling onto Nodes. However, sometimes we may want to schedule Pods onto Nodes with specific Taints. To achieve this, we can use Tolerations.

“Taints mean rejection and tolerations mean ignoring. Nodes reject pod scheduling with taints, while pods ignore rejection with tolerations.”
The following example illustrates the effects:
- In the previous section, a NoExecute taint was applied to the node1 node, so the pod cannot be scheduled on it.
- In this section, the pod can be scheduled by adding a toleration to it.
Create a file named pod-toleration.yaml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations: # Add toleration
  - key: "tag" # Key of the taint to tolerate
    operator: "Equal" # Operator
    value: "heima" # Value of the taint to tolerate
    effect: "NoExecute" # Tolerance rule, which must be the same as the taint rule
# Before adding toleration
[root@master ~]# kubectl get pods -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED
pod-toleration 0/1 Pending 0 3s <none> <none> <none>
# After adding toleration
[root@master ~]# kubectl get pods -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED
pod-toleration 1/1 Running 0 3s 10.244.1.62 node1 <none>
Here is a detailed configuration of toleration:
[root@master ~]# kubectl explain pod.spec.tolerations
......
FIELDS:
  key                 # The key of the taint to tolerate; an empty key matches all keys
  value               # The value of the taint to tolerate
  operator            # The operator between key and value; supports Equal (default) and Exists
  effect              # The effect of the taint to tolerate; an empty value matches all effects
  tolerationSeconds   # Tolerance time; only takes effect when the effect is NoExecute, indicating how long the pod may remain on the node after the taint is added before it is evicted
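For example, a toleration that uses the Exists operator together with tolerationSeconds might look like this (a sketch; the key and the duration are illustrative):
tolerations:
- key: "tag"
  operator: "Exists"        # tolerate any taint whose key is "tag", regardless of its value
  effect: "NoExecute"
  tolerationSeconds: 3600   # after a matching NoExecute taint appears, stay on the node for at most 3600s before eviction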