Kubernetes Cluster Architecture

This document provides an overview of the current state of the Kubernetes cluster on which we deploy our core and player services. Where applicable, it also outlines what can be improved (marked as improve).

The cluster

hcloud

The cluster itself is hosted at Hetzner Cloud. This lets us scale the cluster quickly by simply adding new hcloud VMs as worker or control plane nodes. Another big advantage of running the cluster in the cloud instead of on bare metal is that we can simply use a Hetzner Load Balancer to forward traffic to our Ingress and only have to deal with networking at a very high level.
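
As an illustration, this is roughly the kind of Service that makes the hcloud cloud controller manager provision such a Hetzner Load Balancer (a minimal sketch; names and annotation values are assumptions, in our cluster this Service belongs to the nginx Ingress controller and is managed by the kube-hetzner module):

```yaml
# Sketch: a Service of type LoadBalancer triggers the hcloud cloud controller
# manager to create a Hetzner Load Balancer. Values below are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    load-balancer.hetzner.cloud/name: "cluster-ingress"   # assumed LB name
    load-balancer.hetzner.cloud/location: "fsn1"          # assumed location
    load-balancer.hetzner.cloud/use-private-ip: "true"    # route via the private network
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https
```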

Kubernetes distribution (K3s)

We use K3s as our Kubernetes distribution. K3s is a highly available, certified Kubernetes distribution whose main advantage is that it is packaged as a single binary. All Kubernetes control plane and worker node components are set up by executing this binary and are configured by passing arguments. To build the cluster we use the terraform-hcloud-kube-hetzner Terraform module (https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner). It abstracts away a large part of the cluster configuration: with only a few arguments it sets up a complete Kubernetes cluster, including an Ingress controller (nginx in our case) and a storage provider (Longhorn in our case).
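
For illustration, here is a minimal sketch of how a K3s server is configured by arguments (K3s also accepts its flags via /etc/rancher/k3s/config.yaml; the values here are assumptions, our actual configuration is generated by the kube-hetzner module):

```yaml
# /etc/rancher/k3s/config.yaml - sketch of a K3s server configuration.
# Every key corresponds to a k3s CLI flag; the values are examples only.
write-kubeconfig-mode: "0644"
tls-san:
  - "kube.example.org"                    # assumed API server hostname
disable:
  - "traefik"                             # we use nginx as Ingress controller instead
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"   # example: keep workloads off control plane nodes
```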

Our configuration of the module can be found here.

kube-cluster-config

The repository kube-cluster-config is one of our “single” sources of truth, describing our whole infrastructure as code. It contains (as you have seen in the previous section) our cluster configuration as well as all components of the cluster that have to be deployed before the services themselves are deployed to the cluster.

improve: Set up a shared Terraform state. Currently the state files live on the machines of individual DevOps team members. The state should be shared so that everybody can see what is currently applied and can make changes to the infrastructure.

Namespaces

After applying the kube-hetzner module, all namespaces need to be created. To set up all namespaces at once we’ve got a really simple Terraform module here.

improve: Let the module provide an output of all namespaces, which other modules could use as a dependency instead of maintaining their own namespace lists.

Authentication and Authorization

We want to give every player team exclusive access to their own Kubernetes namespace. The guided project team should also have access to their namespaces. Let's call our player teams and guided project members “Developers”. Our Developers should have read-only access to their namespace plus the ability to delete pods (to restart their applications).

AuthN

To secure our Kubernetes API server, we have to make sure that only members of the Microservice Dungeon have access to it. Therefore we use GitLab as OpenID Connect (OIDC) identity provider. OpenID Connect is a layer on top of the OAuth2 protocol which allows a relying party (the Kubernetes API server in our case) to verify a user's identity based on the authentication performed at the identity provider (GitLab in our case). We chose GitLab because every member of the Microservice Dungeon needs a GitLab account anyway.

The configuration of the API server to use GitLab OIDC can be found here. The GitLab application that provides the client-id and client-secret is created by our technical user. Our developers can therefore use one shared kubectl config; they only need to install the kubelogin kubectl plugin as described in the Deployment Guide and can then authenticate to the cluster.
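
On the server side this boils down to the standard kube-apiserver OIDC flags, passed through K3s like any other argument; a minimal sketch (the client-id, claim and prefix values are assumptions, the authoritative values live in the linked configuration):

```yaml
# Sketch: OIDC flags passed through to the kube-apiserver (K3s config syntax).
# The client-id, claim and prefix values below are assumptions.
kube-apiserver-arg:
  - "oidc-issuer-url=https://gitlab.com"
  - "oidc-client-id=<client-id-of-the-gitlab-application>"
  - "oidc-username-claim=sub"        # which token claim becomes the Kubernetes username
  - "oidc-username-prefix=gitlab:"   # avoids clashes with other user sources
```

On the client side, the shared kubectl config uses kubelogin as an exec credential plugin, roughly like this (a sketch; the real config is distributed as described in the Deployment Guide):

```yaml
# Fragment of the shared kubeconfig: the user entry delegates token retrieval
# to the kubelogin plugin, which opens the GitLab login in the browser.
users:
- name: gitlab-oidc
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
        - oidc-login
        - get-token
        - --oidc-issuer-url=https://gitlab.com
        - --oidc-client-id=<client-id-of-the-gitlab-application>
```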

AuthZ

After a developer has authenticated, we need to authorize him or her to access resources in a certain namespace. This happens by adding Kubernetes RBAC Role objects to the namespace and permitting a user to use such a Role via a RoleBinding. To easily add more Developers and modify their Roles, we've also got a Terraform module for RBAC here.
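
The objects the module creates boil down to something like the following (a sketch; the names, resource lists and the username format are assumptions, the authoritative definition is the Terraform module):

```yaml
# Sketch of the per-namespace RBAC objects: read-only access plus the
# permission to delete pods (to restart applications).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-example             # assumed team namespace
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]     # read-only access within the namespace
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete"]                   # allows restarting applications
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-jane
  namespace: team-example
subjects:
- kind: User
  name: "gitlab:jane.doe"             # assumed OIDC username (prefix + GitLab user)
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```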

Open Policy Agent (OPA) - Gatekeeper

The OPA Gatekeeper enables us to enforce policies on Kubernetes objects as they pass through the API server's admission control. It is also possible to mutate Kubernetes objects, e.g. to add labels or annotations to them. We deploy the Gatekeeper and some policies/mutations with this Terraform module.

Use cases in the Microservice Dungeon Cluster?

  1. LoadBalancer Services: Developers used LoadBalancer Services on Minikube to expose their services on their host. When they don't disable them in their Helm chart while deploying to the Hetzner cluster (the common case), the K8s cloud controller manager automatically creates a Hetzner Cloud LoadBalancer, which costs an additional ~7€/month per LoadBalancer. So we enforce a policy which denies API requests that create a LoadBalancer Service (see the sketch after this list).
  2. Ingress Annotations: We want to attach the custom.nginx.org/allowed-ips annotation to every Ingress object, so that only IPs from the TH VPN can reach the services exposed via the Ingress.
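
To make these two cases concrete, here is a hedged sketch of what such Gatekeeper resources can look like (the names, the Rego rule and the IP range are assumptions; the actual policies live in the Terraform module linked above):

```yaml
# 1) Deny LoadBalancer Services: a ConstraintTemplate with a small Rego rule
#    plus a Constraint that applies it to all Services. Names are assumptions.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockloadbalancer
spec:
  crd:
    spec:
      names:
        kind: K8sBlockLoadBalancer
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sblockloadbalancer
      violation[{"msg": msg}] {
        input.review.object.spec.type == "LoadBalancer"
        msg := "LoadBalancer Services are not allowed; expose your service via the Ingress"
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockLoadBalancer
metadata:
  name: block-loadbalancer-services
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Service"]
---
# 2) Mutate Ingress objects: assign the allowed-ips annotation so that only
#    the TH VPN range can reach exposed services. The CIDR is a placeholder.
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: ingress-allowed-ips
spec:
  applyTo:
  - groups: ["networking.k8s.io"]
    kinds: ["Ingress"]
    versions: ["v1"]
  location: 'metadata.annotations."custom.nginx.org/allowed-ips"'
  parameters:
    assign:
      value: "10.0.0.0/8"   # placeholder for the TH VPN range
```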

Flux

To deploy applications / Helm charts to our cluster which are not part of our core (cluster) infrastructure, we use Flux. Flux is deployed to our cluster via this shell script (improve: deploy Flux with Terraform). After the whole Flux infrastructure is deployed (Helm controller, Kustomization controller, etc.), Flux resources like HelmReleases or Kustomizations can be added to our FluxResources repository. As soon as a resource is merged to the default branch, Flux automatically deploys or updates it.
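
As an example, a HelmRelease in that repository can look roughly like this (a sketch with assumed names, versions and values; the real resources live in the FluxResources repository):

```yaml
# Sketch of a Flux HelmRelease as it could appear in the FluxResources
# repository. Chart name, source and values are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: game
  namespace: game
spec:
  interval: 5m
  chart:
    spec:
      chart: game
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: game-charts
        namespace: game
  values:
    image:
      tag: "abc1234"   # e.g. the commit short SHA (cf. the Flux Deployment Client below)
```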

improve: Introduce code ownership + a namespace/path-checking Rego policy, so that people can't deploy to arbitrary namespaces {unfortunately Code Owners is a GitLab paid feature :(}

Flux Deployment Client

To automatically deploy a certain image tag (e.g. your commit short SHA) to the FluxResources repo, you can use the Flux Deployment Client. For a working example of how this is implemented, look at the Game service: https://gitlab.com/the-microservice-dungeon/core-services/game

improve: Because this is a “(really) nice to have” feature, the priority of refactoring the code & writing complete documentation was low. So this tool is a bit hacky at the moment and should be refactored, tested & documented.

Common includes

To push your Helm chart to the GitLab registry & use the Flux Deployment Client, we provide common GitLab CI/CD includes here: https://gitlab.com/the-microservice-dungeon/devops-team/common-ci-cd
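
Usage is a normal GitLab CI include in your project's pipeline, roughly like this (a sketch; the template file name and ref are placeholders, check the repository for the actual files):

```yaml
# .gitlab-ci.yml - sketch of including the common CI/CD templates.
# The file name and ref below are placeholders.
include:
  - project: "the-microservice-dungeon/devops-team/common-ci-cd"
    ref: main
    file: "<template>.yml"
```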

Logging

To access all our application and cluster component logs, we built a simple ELK stack. It is definitely not production ready at the moment; there is a lot of room for improvement, but it does its job. All self-written Helm charts can be found here: https://gitlab.com/the-microservice-dungeon/devops-team/deployments/logging. Our ELK stack is currently deployed via Flux, but it should already be in place before our “business” applications are deployed, so we could also apply it via Terraform. As said, a lot of time could be invested here to improve the use of the ELK stack.
