Automatically Remediate AWS Control Tower Drift with the Async Multi-Account Factory Module

Background: The Problem of Drift in AWS Control Tower

When managing AWS accounts via Control Tower and Service Catalog, you may encounter an issue where OpenTofu/Terraform detects drift in your infrastructure state. This is particularly common when:

A new version of the Account Factory Provisioning Artifact is published
You move an account between Organizational Units (OUs)
Manual changes are made in the AWS Console or via API

In all of these cases, the provisioned_product_id changes behind the scenes, but OpenTofu/Terraform isn’t aware of it. When you next apply your infrastructure code, it attempts to reconcile this drift by updating every affected provisioned product, even if nothing else has changed.

This becomes a major problem at scale:

The update process is slow, especially for large organizations
AWS imposes a hard limit of 5 concurrent updates, so you're throttled quickly
OpenTofu/Terraform updates can take hours to complete
You risk timeouts, failed updates, and broken infrastructure state
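
A rough estimate shows how the concurrency limit turns into hours. This sketch assumes updates run in batches of at most 5 and that each update takes a fixed amount of time. The 200 accounts and 10 minutes per update are hypothetical figures for illustration, not measurements.

```python
import math


def estimated_update_minutes(accounts, minutes_per_update, max_concurrent=5):
    """Rough wall-clock estimate when updates run in fixed-size batches."""
    batches = math.ceil(accounts / max_concurrent)
    return batches * minutes_per_update


# Hypothetical numbers: 200 accounts, 10 minutes per update.
total = estimated_update_minutes(accounts=200, minutes_per_update=10)
print(f"{total} minutes (~{total / 60:.1f} hours)")
```

Under these assumptions the apply would take 400 minutes, or just under 7 hours, during which any timeout or failure can leave state partially updated.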

The Fix: Introducing the Async Multi-Account Factory Module

To solve this, we’ve introduced a new module: control-tower-multi-account-factory-async

Instead of managing provisioned_product_id drift directly via OpenTofu/Terraform, this module uses an asynchronous workflow built from AWS native services:

Component	Role
EventBridge Rule	Listens for Service Catalog API calls like UpdateProvisioningArtifact and UpdateProvisionedProduct
Ingest Lambda	Finds outdated provisioned products and queues them for update
SQS FIFO Queue	Stores update jobs with strict ordering and deduplication
Worker Lambda	Applies the update and launches Step Functions
AWS Step Functions state machine	Monitors the update process and confirms success or failure
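
To make the EventBridge step more concrete, here is a minimal sketch of what a rule matching these API calls could look like, assuming the calls arrive as CloudTrail events. The pattern and the simplified `matches` helper are illustrative assumptions, not the module's actual rule definition.

```python
# Hypothetical event pattern; the module's real rule may differ.
EVENT_PATTERN = {
    "source": ["aws.servicecatalog"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["servicecatalog.amazonaws.com"],
        "eventName": ["UpdateProvisioningArtifact", "UpdateProvisionedProduct"],
    },
}


def matches(pattern, event):
    """Simplified EventBridge-style matching.

    Every key in the pattern must exist in the event. Nested dicts are
    matched recursively; leaf values must equal one of the listed values.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# A trimmed-down sample event, for illustration only.
sample_event = {
    "source": "aws.servicecatalog",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "servicecatalog.amazonaws.com",
        "eventName": "UpdateProvisioningArtifact",
    },
}

print(matches(EVENT_PATTERN, sample_event))
```

Only events whose eventName is one of the two listed API calls would reach the ingest Lambda. Everything else is ignored by the rule.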

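In this async approach, a matching Service Catalog API call triggers the EventBridge rule, the ingest Lambda queues each outdated provisioned product on the SQS FIFO queue, the worker Lambda applies the update, and the Step Functions state machine monitors it through to success or failure.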

Why is this better?

Drift is resolved outside OpenTofu/Terraform
Updates happen automatically, with no user action
Concurrency is controlled to avoid throttling
Your OpenTofu/Terraform applies stay fast and clean

Step-by-Step: Switching to the Async Module

Update your terragrunt.hcl to use the new module

Replace this:

terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory?ref=VERSION"
}

With this:

terraform {
  source = "git@github.com:gruntwork-io/terraform-aws-control-tower.git//modules/landingzone/control-tower-multi-account-factory-async?ref=VERSION"
}

Note: No state migration is needed — this is a drop-in replacement.

Update IAM Permissions

The infrastructure created by the async module requires additional permissions on the root-pipelines-apply-role and root-pipelines-plan-role roles. The necessary IAM changes are shown below and are also included in v3.1.2 (or later) of terraform-aws-architecture-catalog.

For _envcommon/landingzone/root-pipelines-apply-role.hcl, ensure that you have at least the following permissions:

    "EventBridgeAccess" = {
      resources = ["*"]
      actions   = ["events:*"]
      effect    = "Allow"
    }
    "LambdaDeployAccess" = {
      resources = ["*"]
      actions   = ["lambda:*"]
      effect    = "Allow"
    }
    "SQSDeployAccess" = {
      resources = ["*"]
      actions   = ["sqs:*"]
      effect    = "Allow"
    }
    "StatesDeployAccess" = {
      resources = ["*"]
      actions   = ["states:*"]
      effect    = "Allow"
    }

For _envcommon/landingzone/root-pipelines-plan-role.hcl, ensure that you have at least the following permissions:

    "CloudWatchEventsReadOnlyAccess" = {
      effect    = "Allow"
      actions   = ["events:Describe*", "events:List*"]
      resources = ["*"]
    }
    "CloudWatchLogsReadOnlyAccess" = {
      effect = "Allow"
      actions = [
        "logs:Get*",
        "logs:Describe*",
        "logs:List*",
        "logs:Filter*",
        "logs:ListTagsLogGroup"
      ]
      resources = ["*"]
    }
    "LambdaReadOnlyAccess" = {
      effect = "Allow"
      actions = [
        "lambda:Get*",
        "lambda:List*",
        "lambda:InvokeFunction"
      ]
      resources = ["*"]
    }
    "SQSReadOnlyAccess" = {
      effect = "Allow"
      actions = [
        "sqs:Get*",
        "sqs:List*",
      ]
      resources = ["*"]
    }
    "StatesReadOnlyAccess" = {
      effect = "Allow"
      actions = [
        "states:List*",
        "states:Describe*",
        "states:GetExecutionHistory",
        "states:ValidateStateMachineDefinition"
      ]
      resources = ["*"]
    }

Apply your changes

Next, run terragrunt apply either directly or through GitHub Actions. This will deploy:

The new Lambda functions
SQS FIFO queue + DLQ
EventBridge rules for Service Catalog API monitoring
AWS Step Functions state machine

Once applied, drifted provisioned_product_id values will be remediated when UpdateProvisioningArtifact or UpdateProvisionedProduct API calls occur.

Note: If your environment is already in a drifted state, you may need to manually trigger one of these API calls. The simplest way to do this is to deactivate and reactivate the current provisioning artifact version.
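
If you prefer to script the deactivate/reactivate step instead of using the console, the sketch below shows one possible approach. It assumes a Service Catalog client object exposing `update_provisioning_artifact` (as the boto3 `servicecatalog` client does). The helper name and the ID arguments are hypothetical placeholders that you would replace with your own values.

```python
def retrigger_artifact_update(client, product_id, artifact_id):
    """Deactivate, then reactivate, a provisioning artifact.

    Each call is an UpdateProvisioningArtifact API call, which is one of
    the events the async module listens for.
    """
    for active in (False, True):
        client.update_provisioning_artifact(
            ProductId=product_id,
            ProvisioningArtifactId=artifact_id,
            Active=active,
        )
```

With boto3 installed and credentials for the management account, this could be called as `retrigger_artifact_update(boto3.client("servicecatalog"), product_id, artifact_id)`, where both IDs come from your Account Factory product.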

Optional: Control Concurrency with lambda_worker_max_concurrent_operations

AWS Service Catalog currently enforces a hard limit of 5 concurrent account-related operations, which include provisioning, updating, and enrolling accounts. Exceeding this limit may result in throttling errors or failed updates.

To avoid hitting that limit (and prevent failed updates), you can configure the number of concurrent updates with the lambda_worker_max_concurrent_operations variable. Example:

inputs = {
  lambda_worker_max_concurrent_operations = 3 # example value; the module default is 4
}

This variable tells the worker Lambda never to initiate more than the configured number of updates at a time. Setting it below 5 leaves headroom for other processes (like provisioning new accounts) to succeed.

Value	Behavior
`5`	Max concurrency allowed by AWS (use with caution)
`4`	The default set by the async module
`<5`	Safe concurrency with headroom for other ops
`1`	Serialized updates, safest but slowest
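
The effect of the cap can be illustrated with a small sketch. This is not the worker Lambda's actual code. It is a simplified model, with a hypothetical function name, that assumes the worker knows how many updates are already in flight.

```python
def updates_to_start(queued, in_flight, max_concurrent=4):
    """Return how many queued updates may start without exceeding the cap."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    # Slots left under the cap; never negative, even if in_flight exceeds it.
    free_slots = max(0, max_concurrent - in_flight)
    return min(queued, free_slots)


# Ten updates queued, one already running, default cap of 4.
print(updates_to_start(queued=10, in_flight=1))
```

With the default of 4, at least one of the 5 slots AWS allows stays free for operations started elsewhere, such as provisioning a new account.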
