How to avoid configuration drift

Configuration drift is real and happens to almost all systems. We try to avoid it with Infrastructure as Code, but often that makes things worse. Tools don’t help us avoid configuration drift until we understand the root causes at a deeper level.

Life means that things constantly change. What could prove this better than Information Technology? We apply patches, we optimize for performance, availability or security, and we have to deal with a never-ending flow of new or changed requirements. If everything were static, why would anyone need DevOps Engineers, SREs or Developers?

Configuration drift takes different forms: diverging nodes in a cluster, outdated infrastructure code, snowflake systems, neglected staging environments. There are different strategies for dealing with it. But in order to apply those strategies, we first have to understand why configuration drift happens at all.

Avoiding configuration drift is the easiest thing out there. Let’s say we make some configuration changes to server1. If server2 is meant to be identical, we know that we have to make the exact same changes there. This is not a difficult task. We could agree within our team to keep track of all changes to our environment in some document. All we have to do is take a few minutes and update that documentation. That is not difficult either. We know exactly what we have to do to prevent configuration drift. But as we all know, reality is different. Even though those steps are some of the easiest things on earth, they are often not done properly.

The reason we procrastinate on those simple tasks is that we do not have enough discipline to do them. It seems like we just have to learn how to be disciplined. I have got to know many colleagues who accused others of not having enough discipline, but when it came to their own tasks it looked like they did not have that discipline either. Blaming others for configuration drift while neglecting our own responsibility will not take us anywhere.

Discipline is an extremely limited resource. We can think of it as a specific amount of money in our wallets. Every time we spend some of it our reserve decreases, until eventually our wallet is completely empty and we need to refill it. Have you ever tried to do some unpleasant task in the evening that needed discipline to complete? Getting up at 6:00 in the morning and heading to a workplace where we expect some rebukes from our boss takes a lot of discipline. This alone can eat up our entire supply of discipline for the whole day.

When it comes to things like applying a quick fix to our production environment, our discipline reserves may already be used up. We procrastinate on the subsequent tasks, like applying the same change to staging or updating our Ansible playbook.

Thinking about our priorities and using our discipline for what is important is one thing we can do to avoid that. But we should take that one step further. What if we turn the whole thing around and make sure that we need discipline every time we do things manually?

We cannot avoid making ad hoc changes to some systems occasionally. If a new release causes a malfunction in our production environment, we have to deal with it right away. That means we have to take shortcuts from time to time. But that should be the exception, not our modus operandi.

Having some Ansible playbooks that run automatically every night should be the default. If we make some ad hoc changes, we have to make sure to update our playbook afterwards. Otherwise our quick fix will get reverted during the night. If we want to procrastinate on that task, we will have to disable the cron job that runs at night. That needs discipline.
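To make the effect of such a nightly run concrete, here is a minimal sketch in Python. It is not how Ansible works internally; it only models the idea that a scheduled run overwrites whatever differs from the declared state. The function name reconcile and the settings max_connections and log_level are made up for illustration.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, tuple]:
    """Force `actual` to match `desired` and report every reverted setting.

    `desired` stands in for what the playbook declares, `actual` for what
    is really configured on the server. Both are hypothetical stand-ins.
    """
    reverted = {}
    for key in set(desired) | set(actual):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            reverted[key] = (have, want)
            if want is None:
                # Setting exists on the server but not in the playbook: remove it.
                del actual[key]
            else:
                # Setting is missing or different on the server: overwrite it.
                actual[key] = want
    return reverted


# What the playbook declares (hypothetical settings).
playbook = {"max_connections": "100", "log_level": "info"}

# The server starts out in sync, then receives an ad hoc fix during the day.
server = dict(playbook)
server["max_connections"] = "500"

# The nightly run reverts the fix, because the playbook was never updated.
print(reconcile(playbook, server))
print(server == playbook)
```

If the same change had been made in the playbook as well, the nightly run would find nothing to revert and the fix would survive. Keeping the fix without updating the playbook is only possible by actively switching the scheduled run off.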

Why would we need root access to a server if we use IaC? Disabling root access and having a playbook that enables it for 1 hour means we need discipline to do ad hoc changes but automation remains our default.
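The one-hour window can be sketched as a small piece of Python. This is only a model of the idea, not a real access control mechanism: the class TemporaryAccess and its methods are hypothetical, and on a real server the playbook would have to grant and revoke the access itself.

```python
from datetime import datetime, timedelta


class TemporaryAccess:
    """Models root access that is off by default and expires on its own."""

    def __init__(self, duration: timedelta = timedelta(hours=1)):
        self.duration = duration
        self.expires_at = None  # None means access was never granted

    def grant(self, now: datetime) -> None:
        # The deliberate extra step: someone has to ask for access explicitly.
        self.expires_at = now + self.duration

    def is_allowed(self, now: datetime) -> bool:
        # Access is valid only inside the granted window.
        return self.expires_at is not None and now < self.expires_at


start = datetime(2024, 1, 1, 12, 0)
access = TemporaryAccess()

print(access.is_allowed(start))                          # no access by default
access.grant(start)
print(access.is_allowed(start + timedelta(minutes=30)))  # inside the window
print(access.is_allowed(start + timedelta(hours=2)))     # window has expired
```

The design choice is that doing nothing leaves access closed. The manual path stays available for emergencies, but it is the path that costs effort.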

Use pipelines rather than applying changes manually. Using a pipeline means that we do not need discipline to apply our changes. We are forced to do things properly.

Preventing configuration drift is not hard at all. What can make it hard, though, is thinking that we have enough discipline to do it manually. This approach will most likely go wrong. However, once we understand that we can use our lack of discipline to our advantage, things become very easy.
