Kubernetes Volumes and Multiple Zones

While working on a recent patch for Kubernetes, I put down a few notes on how Kubernetes handles multiple zones from the storage perspective.

This is not an exhaustive description of multi-zone support in Kubernetes, but rather a specific explanation of how PersistentVolumes (PVs) are created in the correct zones.

The labels

It all starts with the kubelet adding labels to Node objects with information about the zone and region the node is placed in.

This can be quickly verified with the command:

$ kubectl get node/ip-172-18-12-140.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1d
  (...)

In addition to that, the PersistentVolumeLabel admission controller automatically adds the same zone and region labels to PVs as soon as they are created.
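
For an EBS-backed volume in the cluster above, the PV then carries the same label keys as the node. A trimmed-down example of how that could look (the values here simply mirror the node shown earlier):

  labels:
    failure-domain.beta.kubernetes.io/region: us-east-1   # added by the admission controller
    failure-domain.beta.kubernetes.io/zone: us-east-1d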

The scheduler (via the VolumeZonePredicate predicate) then ensures that pods claiming a given PV are only placed in the same zone as that volume, since volumes cannot be attached across zones.

This approach sounds interesting, but it comes with some problems. For instance, it only prevents pods from being scheduled in the wrong zones; it cannot tell the storage provisioner to provision the PV in a certain zone in the first place.

There’s a better solution to address this shortcoming: topology-aware volume provisioning.

Topology-aware volume provisioning

With topology-aware volume provisioning, the PV is only provisioned once a pod that uses the claim is scheduled. When that happens, the volume is provisioned in the same zone as that pod.
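
This behaviour is enabled on the StorageClass by delaying volume binding. A minimal sketch could look like this (the class name is made up; the provisioner shown is just the in-tree AWS EBS one as an example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-topology-aware                  # made-up name
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer     # wait until a pod using the claim is scheduled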

The PV’s nodeAffinity is set by the storage plugin (or by the external provisioner, in the CSI case). Then there’s another scheduler predicate that restricts which nodes a pod can be placed on: VolumeBindingChecker. This predicate looks at the pv.spec.nodeAffinity field rather than at the PV labels.

This is how the field looks in the PV object:

In-tree storage plugin:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east-1d
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - us-east-1

CSI driver:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1d

Building a LaTeX resume and not worrying about compiling it

A long time ago I came across Adrien Friggeri’s resume built with LaTeX. I couldn’t find the original template, but I did find this page with a customization.

I found that template amazing and decided I wanted my resume to look like that. I was looking for a new job, so why not? That said, I spent hours and hours on it: I installed countless dependencies, made some Helvetica fonts work on my Linux box, wrote the perfect descriptions for my previous positions, and adjusted the alignments with great care. Phew, I finally had my fancy resume!

Eventually I got a new job and forgot about it, until I decided to update it. At that point I had a clean OS installation, so I’d have to install all those dependencies again. Instead, I decided to ditch that resume and build a simpler one, based on the well-known moderncv template. This one was simpler to compile because it required fewer dependencies, so I thought my problem was solved.

However, recently I decided to update it again, and guess what? For some reason it didn’t compile. That’s when I realized that the build process for my resume was a great candidate for containerization. To make things better, I found out that there is a moderncv package available in Fedora 27. So I quickly put together this simple Dockerfile in my resume directory:

FROM fedora-minimal:27

VOLUME /data

WORKDIR /data

RUN microdnf update && \
    microdnf install texlive texlive-xetex texlive-moderncv && \
    microdnf clean all

Voilà! Build the image, push it to Docker Hub, and you will never have to install texlive and its dependencies just to compile your resume again.
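
For reference, compiling then looks roughly like this (the image tag is arbitrary, and resume.tex stands in for whatever your LaTeX source file is called):

$ docker build -t resume .
$ docker run --rm -v "$PWD":/data resume xelatex resume.tex   # resume.tex is a placeholder name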

If you are interested you can check my GitHub repository for a complete example.

Fork-like behaviour in Go

Recently, I have been looking into how to work with Linux namespaces in Go.

In C, the way we isolate a process in certain namespaces is by specifying them in the flags parameter of clone(2). For instance, user_namespaces(7) provides the following example to illustrate this.

(...)

while ((opt = getopt(argc, argv, "+imnpuUM:G:zv")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'v': verbose = 1;                  break;
        case 'z': map_zero = 1;                 break;
        case 'M': uid_map = optarg;             break;
        case 'G': gid_map = optarg;             break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

(...)

/* Create the child in new namespace(s) */

child_pid = clone(childFunc, child_stack + STACK_SIZE,
                 flags | SIGCHLD, &args);
if (child_pid == -1)
   errExit("clone");

(...)

The problem

Everything looks well and good, so I wanted to do the same in Go. However, I soon found out that fork(2) (i.e., clone(2)) isn’t the right way to do so. The problem lies in the fact that fork(2) creates the child by copying only the calling thread of execution. Even if a Go program looks single-threaded, the Go runtime may be running many other threads under the hood. As a result, fork(2) isn’t really a nice way of creating a child process in Go.

Possible workarounds

The first solution that comes to mind is to use the os/exec package. Moreover, with this package it’s possible to choose the namespaces of the child process by setting attributes on the command:

c := exec.Command("ls", "-l")
c.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWIPC, // namespaces for the child process
}
if err := c.Run(); err != nil {
    (...)
}

Everything looks great, but if a process calls itself, wouldn’t the resulting process call itself again? Wouldn’t I get into an infinite loop?

When I came across this problem I promptly split my program into two: a parent and a child. The former would simply call the latter with the appropriate flags set. However, for practical reasons, I wanted to have all my code in a single program, so I started looking for alternatives.

Another possible solution would have been to use different command arguments to alternate between parent and child. For example, when executed, program would start in parent mode and then call itself by running program child. That sounded a bit sloppy to me, though. For example, what if the user calls program child directly?
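
Just to make the idea concrete, here is a self-contained sketch of that rejected approach, with the real work replaced by prints (the messages and behaviour are made up):

package main

import (
    "fmt"
    "os"
)

func main() {
    // Hypothetical sketch: dispatch on an explicit "child" argument.
    if len(os.Args) > 1 && os.Args[1] == "child" {
        fmt.Println("child mode") // the namespaced work would go here
        return
    }
    fmt.Println("parent mode") // the parent would re-execute itself as "program child"
}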

My choice

I wondered how Moby addresses this issue, so I looked at its codebase and found the reexec package, which does exactly what I wanted.

The solution used by this package is very interesting. First, to call itself, a process can simply execute /proc/self/exe, which, according to proc(5), is a symbolic link to the actual executable being run. Then, we can overwrite the command’s argument os.Args[0] in order to signal to the resulting process that it is a child.

With that in mind, it’s possible to re-execute a Go program by doing:

c := exec.Cmd{
    Path: "/proc/self/exe",
    Args: []string{"child"},
    (...)
}
if err := c.Run(); err != nil {
    (...)
}

And in main() we can check whether the running program was re-executed by doing something like this:

func main() {
    switch os.Args[0] {
    case "child":
        // we were re-executed via /proc/self/exe with os.Args[0] set to "child"
        (...)
    default:
        // started normally: act as the parent
        (...)
    }
}

Yes, that’s right! We overwrote os.Args[0]!

This still smells like a hack, but I found it to be a nice way to re-execute a process in Go.

Reflections on 2017

On reading

I managed to complete my 2017 Reading Challenge and read 12 books this year. This was an improvement over last year, when I read only 5.

To be quite frank, I strategically picked a few short books so I could finish the challenge on time. I believe this is OK, but I would rather measure my reading habit by the days on which I actually read something. Since that would be more difficult to track, I am fine with counting books for now.

Also, 2017 was the year I got interested in fiction books. Non-fiction books are great for getting new insights, but there is nothing more enjoyable than reading a good story in the early morning over a cup of coffee.

On writing software

On the software engineering side, I broke some new ground in 2017.

I learned a little bit of Rust in the beginning of the year.

I read the first edition of The Rust Programming Language book, and after a few days I managed to overcome the borrow checker and compile some small programs. Following that, I implemented a simple Pomodoro timer that I called prod. The highlight of the journey, though, was my implementation of the uniq command for Redox OS.

I also got a bit more experience with Go. I developed a prototype for an NTP server called ohrad, and I implemented a simple HTTP daemon at my job. The highlight was a buildah PR. This one was very rewarding for me, especially because buildah is part of the Atomic Project.

The ugly duckling was C. Hoping to get more experience with the language, I started a simple in-memory cache that I called hashd. I never really finished it, but I did learn a few new things along the way.

On changing jobs

2017 was the year that I left Scrapinghub and joined Red Hat.

I left Scrapinghub after 3 years, and I can’t even describe how much I grew there. I was very happy to be joining Red Hat, but at the same time I was very sad to leave Scrapinghub.

The process of getting my Czech visa approved took around 6 months, but fortunately everything worked out well.

On writing

In 2017 I wrote many journal entries, but I failed to make it a habit. I found it a tedious task, but I confess it’s pretty rewarding to read something I wrote a long time ago. It’s impressive to see how much I have evolved and, at the same time, how I still struggle with some recurring issues.

When it comes to blogging, I completely failed in 2017. I put down some random notes on different topics that were supposed to become blog posts, but they never really reached their final destination.

On playing the guitar

I started to play the guitar, but I had to leave it in Brazil. I made some progress and managed to play a few simple songs and fingerstyle pieces. Nothing that anyone would enjoy listening to, though.

On lifting

I was deadlifting 140 kilos and squatting 136 kilos (not counting the bar) right before moving to the Czech Republic. I had to stop lifting for a while until I settled down there, and my weights went down considerably. Hopefully I will get back to those numbers in 2018!

On traveling

The only long trip of 2017 happened when I moved from Brazil to the Czech Republic. However, living in the middle of Europe brings great advantages! In 2017 I managed to visit 4 countries for the first time: the Czech Republic, Germany, Austria, and Hungary.

Hello

… world!