How to deal with eventual consistency in AWS IAM

Have you ever experienced that you set some IAM permissions and your provisioning fails even though you set permissions correctly? If that happens it may have to do with IAMs eventual consistency.

IAM is a global service. Everything we set in IAM applies to all regions. Nevertheless we have to send our API calls to a single endpoint. And we receive a confirmation when everything is OK.

That confirmation can be misleading if we do not think about what it actually means. It only confirms IAM has received our request and it is valid. It does not confirm that the changes are applied everywhere in all regions. The change still needs to be replicated to all regions and that incurrs a propagation delay. If in the meantime we try to use that permissions to create a resource it may fail. We may be faster than that propagation delay.

%3 Terraform Terraform IAM IAM Terraform->IAM API Resource EC2 Instance Terraform->Resource API IAM->Terraform SUCCESS Region1 Region1 IAM->Region1 Region2 Region2 IAM->Region2 Region3 Region3 IAM->Region3 IAM Role Region3->Resource Access Denied Resource->Terraform FAILED Resource->Region3

One might ask why wouldn’t AWS wait to send that confirmation until everything is replicated. They could make IAM provide strong consistency easily.

If IAM was strongly consistent what would happen if one of the regions was unreachable? That would block everything worldwide. No one could change their IAM roles anymore due to a network outage somewhere in the world.

So there are good reasons that IAM is only eventually consistent. There are several approaches how we can deal with that.

  1. We can accept that our provisioning may fail and we could simply restart it. Accepting failure might be hard psychologically but the good thing is that we don’t have to add complexity to our code in order to deal with it.

  2. We can add a time delay that is longer than the propagation delay we expect. There is a recommendation from AWS that says that 45 seconds should be fine. In Ansible we can use the pause module

     - name: Wait for IAM propagation
       pause:
         seconds: 45 
    

    With Terraform it is a little bit more complicated. We can add a delay with the time_sleep resource but we will have to place it in our dependency tree with depends_on (see http://www.it-automation.com/2021/05/22/terraform-depends_on.html)

     resource "time_sleep" "iam_propagation" {
      depends_on = [aws_iam_role.foo]
      create_duration = "45s"
     }
    
     resource "aws_instance" "baz" {
      depends_on [time_sleep.iam_propagation]
      (...)
     }
    
  3. We could do retries.

    Ansible allows us to retry a task several times if it fails

     - name: Set up EC2 instance
     (...)
       retries: 3
       delay: 20 
       register: result
       until: result.rc == 0
    

    With Terraform retries are part of the Provider. As of today there is no way to specify retries at the resource level.

Actually I believe there is no silver bullet. Every workaround has its price and we have to decide which one is best for us in our environment.

Contact