Kubernetes Developer Workshop

โš™๏ธ Nodes & Scheduling

In this section we'll take a look at the nodes that run our workloads. This isn't strictly necessary knowledge for deploying and running applications, but understanding the fundamentals of how Kubernetes works under the hood will give you a better picture of the cluster and help you troubleshoot when things go wrong.

This section is a little more Azure & AKS specific, as we'll be talking about node pools and some specifics of how AKS manages nodes. However, the concepts of nodes, labels, selectors, taints, and tolerations are all fundamental Kubernetes concepts that apply to any cluster, regardless of where it's running.

In Kubernetes, the term "node" refers to a machine in the cluster; you might also see them referred to as "worker nodes" or "agent nodes".

Note: we won't be going into cluster-level networking, i.e. how nodes and pods communicate with each other, VNets, or how services route traffic to pods, otherwise this would be a two week deep dive! If you are really interested in that, check out Kubernetes networking concepts. For AKS specific networking, check out the AKS CNI networking overview.

๐Ÿ—๏ธ Cluster Architecture Overview

Every Kubernetes cluster consists of two main parts:

- The control plane: the management layer of the cluster, running core components such as the API server, scheduler, and the etcd data store.
- The nodes: the worker machines that run your actual application workloads as pods.

When you created your AKS cluster back in section 1, you specified --node-count 2, which created a single node pool with two worker nodes. The control plane was provisioned for you transparently by Azure.

The following diagram shows a high-level view of a Kubernetes cluster architecture, with the control plane, system node pool, and user node pools. Don't worry about understanding every component in the diagram; just get a sense of how the control plane manages the nodes, and how the system node pool runs critical cluster infrastructure while user node pools run your application workloads.

Kubernetes cluster architecture diagram showing the control plane, system node pool, and user node pool

📚 Kubernetes Docs: Cluster Architecture

๐Ÿ” Exploring Nodes

Let's start by listing the nodes in the cluster:

kubectl get nodes -o wide

This will show you the nodes with additional detail including the OS image, kernel version, container runtime, and internal IP addresses. You should see two nodes, both with a status of Ready.

To get much more detailed information about a specific node, use describe:

kubectl describe node <node-name>

This command outputs a wealth of information. Some key sections to look at:

- Conditions: the health of the node, e.g. memory pressure, disk pressure, and the overall Ready status.
- Capacity & Allocatable: the total resources on the node, and what's left for pods after system reservations.
- Non-terminated Pods & Allocated resources: which pods are running on the node, and how much CPU & memory their requests and limits add up to.
- Labels & Taints: we'll dig into both of these shortly.
- Events: recent node-level events, handy when troubleshooting.

🧪 Experiment: Run kubectl describe node on one of your nodes and look at the "Allocated resources" section. How much of the node's CPU and memory is being used by requests? If you scaled up a deployment to many replicas, what would happen when the node runs out of allocatable resources?
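
If you'd rather not scroll through the whole output, you can filter out just that section (this is nothing more than a grep trick; adjust the number of lines shown as needed):

# Show only the "Allocated resources" section of the node description
kubectl describe node <node-name> | grep -A 10 "Allocated resources"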

📦 Node Components

Each worker node runs a few essential components that keep it functioning as part of the cluster. The most important is the kubelet — the primary agent on each node that communicates with the control plane and ensures containers are running. It runs as a system service directly on the node (not as a pod), so you won't see it in kubectl output. The container runtime (containerd on AKS) is the software that actually runs the containers.
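
You won't see the kubelet or containerd in any pod listing, but every node reports their versions as part of its status. As a rough illustration, a custom-columns query pulls them out (the column names here are just ones picked for this example):

# List each node with its kubelet and container runtime versions
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion'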

Beyond those invisible node-level services, AKS deploys a number of system pods into the kube-system namespace. Let's take a look:

kubectl get pods -n kube-system -o wide

You'll see quite a few pods here; some of the notable ones include:

- coredns: the cluster DNS service, resolving service names to IP addresses.
- kube-proxy: runs on every node and maintains the network rules that route service traffic to pods.
- metrics-server: collects resource usage metrics, used by kubectl top and the autoscalers.
- csi-azuredisk-node & csi-azurefile-node: storage drivers that attach Azure disks and file shares to nodes.
- cloud-node-manager: links each node to its underlying Azure VM.

Don't worry about understanding all of these — the key point is that a lot of infrastructure runs in the background to keep your cluster operational, and much of it is visible in the kube-system namespace.

๐ŸŠ Adding A Second Node Pool

To really explore node scheduling and placement, it helps to have more than one node pool. Let's add a second small pool called extra with a single node. This will give us a concrete target for node selectors, taints, and other scheduling features we'll explore in this section.

az aks nodepool add \
  --resource-group $RES_GROUP \
  --cluster-name $AKS_NAME \
  --name extra \
  --node-count 1 \
  --node-vm-size Standard_B2ms \
  --labels workload=extra

This will take a couple of minutes. Once it completes, verify the new node has joined the cluster:

kubectl get nodes -o wide

You should now see three nodes — two from your original nodepool1 and one from the new extra pool. Note down the name of the node in the extra pool; you'll need it later. You can easily identify it with:

kubectl get nodes -l agentpool=extra
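
If you'd rather not copy and paste node names around, you can capture the name in a shell variable instead (EXTRA_NODE is just a name chosen for convenience here, it isn't used elsewhere in the workshop):

# Grab the name of the single node in the extra pool
EXTRA_NODE=$(kubectl get nodes -l agentpool=extra -o jsonpath='{.items[0].metadata.name}')
echo $EXTRA_NODE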

Adding a node pool will increase your Azure costs. Remember to remove it when you're done with this section using: az aks nodepool delete --resource-group $RES_GROUP --cluster-name $AKS_NAME --name extra

๐Ÿท๏ธ Labels & Selectors

Until now we've been deploying our workloads without any control over which nodes they run on — the Kubernetes scheduler has been placing them wherever it sees fit based on resource availability. This is fine for many workloads, but sometimes you want more control over where your pods run. This is where labels and selectors come in.

Nodes use labels extensively, and understanding them is key to controlling where your workloads run. Let's see what labels are on your nodes:

kubectl get nodes --show-labels

The output will be quite verbose! Some important labels that AKS sets automatically include:

- kubernetes.io/hostname, kubernetes.io/os & kubernetes.io/arch: the basic identity of the node.
- agentpool (and kubernetes.azure.com/agentpool): which node pool the node belongs to.
- node.kubernetes.io/instance-type: the Azure VM size, e.g. Standard_B2ms.
- topology.kubernetes.io/region & topology.kubernetes.io/zone: where in Azure the node is running.

Notice the agentpool label โ€” your original nodes will show agentpool=nodepool1 while the new node shows agentpool=extra. You should also see the custom label workload=extra on the new node, which we set when creating the pool.

These labels become very powerful when combined with node selectors or node affinity rules in your pod specs, which let you control which nodes a pod can be scheduled on.
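
A handy trick when working with node labels is the -L (label columns) flag, which shows the values of chosen labels as extra columns instead of dumping every label:

# Show the agentpool and workload labels as columns in the node listing
kubectl get nodes -L agentpool,workload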

🎯 Node Selectors & Affinity

The simplest way to influence pod scheduling is with a nodeSelector. This is added to your pod template spec and tells the scheduler to only place the pod on nodes matching specific labels.

Let's try this with the extra node pool we just created. We can target it using the agentpool label that AKS automatically sets, or the custom workload label we added. Let's use our custom label:

Edit the deployment manifest for your API and add a nodeSelector:

spec:
  # Other Deployment fields omitted for brevity
  template:
    spec:
      # Place this just above the containers: section
      nodeSelector:
        workload: extra

Now apply the updated manifest with kubectl apply -f and watch what happens to the pods:

kubectl get pods -l app=nanomon-api -o wide

You should see all the API pods running on the extra node. Remove the nodeSelector and reapply to restore normal scheduling.

For more sophisticated scheduling, Kubernetes offers node affinity, which provides richer matching expressions including "preferred" (soft) and "required" (hard) rules.

📚 Kubernetes Docs: Assigning Pods to Nodes

Here's an example of a preferred node affinity that tries to schedule pods on the extra pool, but doesn't fail if the node is unavailable. You don't need to update your manifest to test this โ€” just read through the example to understand how it works:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - extra
  containers:
    - name: nanomon-api
      image: __ACR_NAME__.azurecr.io/nanomon-api:latest

The difference between "preferred" and "required" is important — a required rule (requiredDuringSchedulingIgnoredDuringExecution) will cause pods to remain in a Pending state if no matching node is available, while a preferred rule will fall back to any available node.
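
For comparison, here's a sketch of the same affinity written as a hard requirement. Again there's no need to apply it; it's only here to show the slightly different structure (nodeSelectorTerms rather than a weighted preference):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: agentpool
                operator: In
                values:
                  - extra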

🪣 Taints & Tolerations

Taints and tolerations work alongside node selectors, but in the opposite direction. While node selectors attract pods to certain nodes, taints are used to repel pods from nodes unless they explicitly tolerate the taint.

A taint is applied to a node and has three parts: a key, a value, and an effect. The effect can be:

- NoSchedule: new pods without a matching toleration will not be scheduled onto the node.
- PreferNoSchedule: the scheduler will try to avoid the node, but it's not guaranteed.
- NoExecute: new pods are not scheduled, and any existing pods without a toleration are evicted from the node.

Let's use our extra node pool to see this in action. First, taint the extra node so that normal pods are repelled from it:

kubectl taint nodes -l agentpool=extra dedicated=special:NoSchedule

Here we use -l agentpool=extra to target the node by label rather than by name, which is often more convenient.
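
To confirm the taint was applied, you can inspect the node's spec; a jsonpath query is one quick way to do it:

# Print the taints on the extra node
kubectl get nodes -l agentpool=extra -o jsonpath='{.items[0].spec.taints}'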

Now scale up a deployment and see what happens:

kubectl scale deployment nanomon-api --replicas 6
kubectl get pods -l app=nanomon-api -o wide

Wait, why aren't any pods starting? They're all Pending! Well, we still have the nodeSelector in the manifest forcing all the pods onto the extra node, but now the taint on that node prevents any pods from being scheduled there. So we have a scheduling conflict — the node selector says "schedule here" but the taint says "don't schedule here". The result is that the pods remain in a Pending state indefinitely.
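
You can see the scheduler complaining about exactly this if you describe one of the pending pods; the Events section at the bottom of the output will contain FailedScheduling messages mentioning the untolerated taint and the unmatched node selector:

# The Events section at the end of the output explains why the pods can't be scheduled
kubectl describe pod -l app=nanomon-api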

You could remove the node selector to allow the pods to be scheduled on the other nodes, but let's instead add a toleration to allow the pods to be scheduled on the tainted node:

spec:
  # Other Deployment fields omitted for brevity
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "special"
          effect: "NoSchedule"
      # Lines below remain unchanged
      nodeSelector:
        workload: extra

Note that we combine the toleration with a nodeSelector — the toleration allows the pod to run on the tainted node, but doesn't force it there. The nodeSelector handles the placement.

📚 Kubernetes Docs: Taints and Tolerations

When you're done experimenting, remove the taint and scale back down:

kubectl taint nodes -l agentpool=extra dedicated=special:NoSchedule-
kubectl scale deployment nanomon-api --replicas 2

The trailing - on the taint command removes it, which is easy to mistake for a typo, so be careful!

📊 Resource Monitoring

Back in section 7 we set resource requests and limits on our pods. But how do we see actual resource usage on the nodes? The kubectl top command gives us a quick view:

# Show resource usage per node
kubectl top nodes

# Show resource usage per pod
kubectl top pods

This shows the real-time CPU and memory consumption. Comparing these values with the node's allocatable resources (from kubectl describe node) gives you a good sense of how much headroom you have.
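
kubectl top also supports sorting, which is handy for spotting the hungriest workloads across the whole cluster:

# Show all pods in all namespaces, biggest memory consumers first
kubectl top pods -A --sort-by=memory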

If kubectl top returns an error, it means the metrics server isn't installed. In AKS, this is typically enabled by default, but you can install it with kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

🛠️ Node Maintenance & Cordoning

Sometimes you need to take a node out of service for maintenance, upgrades, or troubleshooting. This sort of cluster-level operation is not common for application developers, but an awareness of how it works is useful for understanding cluster operations.

There are three main activities for managing node availability:

- Cordon: mark a node as unschedulable so no new pods are placed on it; existing pods keep running.
- Drain: evict the pods running on a node (cordoning it first) so it can be safely taken offline.
- Uncordon: mark the node as schedulable again once maintenance is complete.
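
Here's roughly what those commands look like. The node name is a placeholder, and there's no need to actually drain a node in your cluster unless you fancy watching pods get evicted and rescheduled:

# Mark the node as unschedulable, so no new pods land on it
kubectl cordon <node-name>

# Evict the pods currently running on the node (DaemonSet pods are skipped)
kubectl drain <node-name> --ignore-daemonsets

# Bring the node back into service once maintenance is complete
kubectl uncordon <node-name>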

๐Ÿ” DaemonSets

You may have noticed the --ignore-daemonsets flag in the drain command above, and wondered what a DaemonSet is. If you looked at the system pods in kube-system earlier, you might also have noticed that some pods (like kube-proxy and the CSI drivers) have one instance running on every node. That's because they are managed by a DaemonSet.

A DaemonSet is a workload resource (like a Deployment) but instead of running a set number of replicas, it ensures that a copy of a pod runs on every node in the cluster. When a new node is added, the DaemonSet automatically schedules a pod onto it. When a node is removed, the pod is cleaned up. You don't specify a replicas count — the number of pods is determined by the number of nodes.

This makes DaemonSets ideal for node-level infrastructure concerns such as:

- Log collection agents that gather container logs from every node.
- Monitoring & metrics agents reporting node-level stats.
- Networking components like kube-proxy and CNI plugins.
- Storage drivers, such as the CSI node plugins we saw in kube-system.
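
Just to make the shape concrete, here's a minimal sketch of a DaemonSet manifest for a hypothetical log collector. The name and image are purely illustrative and it's not something you need to deploy, but notice there's no replicas field anywhere:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector          # Hypothetical example, not part of the workshop app
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: collector
          image: busybox:1.36
          # A real agent would ship these logs somewhere, this stand-in just counts them
          command: ["sh", "-c", "while true; do ls /var/log/containers | wc -l; sleep 60; done"]
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log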

You can see the DaemonSets running in your cluster with:

kubectl get daemonsets -A

Notice how the DESIRED and CURRENT columns match for each DaemonSet — that's telling you every node has its required pod running. DaemonSets can also use node selectors to target only a subset of nodes if needed.

📚 Kubernetes Docs: DaemonSets

๐ŸŠ Node Pools In Practice

We've already been using node pools throughout this section — the extra pool we created earlier is a great example. In AKS, nodes are organized into node pools, groups of nodes with the same VM size and configuration.

In production, it's common to have several node pools with different characteristics:

- A system pool dedicated to running cluster infrastructure in kube-system, isolated from user workloads.
- General purpose pools for most application workloads.
- Specialised pools, e.g. GPU-enabled VMs for machine learning, or memory-optimised VMs for data-heavy workloads.
- Spot (low-priority) pools for batch or interruptible jobs, at much lower cost.
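
Taints can also be set at the pool level when a pool is created, so every node in it is tainted automatically. The example below is purely illustrative, don't run it (GPU VMs are expensive and usually need extra quota); it just shows how the --node-taints option fits into az aks nodepool add:

# Illustration only: a dedicated GPU pool that repels normal workloads
az aks nodepool add \
  --resource-group $RES_GROUP \
  --cluster-name $AKS_NAME \
  --name gpupool \
  --node-vm-size Standard_NC6s_v3 \
  --node-count 1 \
  --node-taints sku=gpu:NoSchedule \
  --labels workload=gpu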

You can manage your node pools with the Azure CLI, for example listing the pools in your cluster:

az aks nodepool list --resource-group $RES_GROUP --cluster-name $AKS_NAME -o table

You should see both nodepool1 and extra listed. The combination of node pools with the labels, selectors, taints, and tolerations we explored above gives you fine-grained control over workload placement.

📚 AKS Docs: Node Pools

🧹 Cleanup

If you added the extra node pool during this section, now is a good time to remove it to avoid unnecessary Azure costs:

az aks nodepool delete --resource-group $RES_GROUP --cluster-name $AKS_NAME --name extra --no-wait

The --no-wait flag returns immediately while the deletion happens in the background. Your pods will be rescheduled onto the remaining nodes automatically.

If you still have your API pods pinned to the extra node with a nodeSelector, you'll notice they're stuck in a Pending state once the node pool is deleted. Remove the nodeSelector from your manifest and reapply to restore normal scheduling.

🧠 Key Takeaways

Understanding nodes and cluster architecture might seem like "infrastructure plumbing", but it's knowledge that pays off when things go wrong. Here's a quick summary of what we covered:

- A cluster consists of the control plane and worker nodes; in AKS the control plane is managed for you by Azure.
- Each node runs the kubelet and a container runtime, plus system pods in kube-system, many of which are managed by DaemonSets.
- Labels combined with nodeSelector or node affinity attract pods to specific nodes, while taints repel pods unless they carry a matching toleration.
- kubectl top, kubectl describe node, cordon, drain, and uncordon are the core tools for observing and maintaining nodes.
- AKS groups nodes into node pools, and combining pools with the scheduling features above gives you fine-grained control over workload placement.