No one wants unfettered, widespread access to production all the time, but the pager does have an awful habit of going off and, if your tooling fails you, you might have to pop into a production account to have a look around. I’ve recently been playing around with using AWS Step Functions to orchestrate this access. Our journey starts in Slack, with a notification about an alert or deployment that may require us to elevate our access in an emergency:
When we press the “Break Glass Prod” button, Slack’s interactivity feature sends a webhook to API Gateway, which then starts up our Step Function and reports back with an ephemeral message:
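To make the hand-off from Slack a little more concrete, here is a minimal sketch of what the code behind API Gateway might do before starting the Step Function: check Slack's request signature, then pull the interesting fields out of the interactivity payload. The function names, the `duration_seconds` field and the shape of the returned dictionary are my own assumptions for illustration, not taken from the app's repository.

```python
import hashlib
import hmac
import json
import time
from urllib.parse import parse_qs


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_age=300):
    """Return True if the request body was signed with our Slack signing secret.

    `timestamp` and `signature` come from the X-Slack-Request-Timestamp and
    X-Slack-Signature headers; `body` is the raw, undecoded request body.
    """
    now = time.time() if now is None else now
    try:
        # Reject stale requests to limit replay attacks.
        if abs(now - int(timestamp)) > max_age:
            return False
    except (TypeError, ValueError):
        return False
    base = f"v0:{timestamp}:{body}".encode()
    expected = "v0=" + hmac.new(signing_secret.encode(), base,
                                hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def build_execution_input(body, duration_seconds=3600):
    """Turn Slack's form-encoded interactivity body into Step Function input.

    The keys of the returned dict are hypothetical; they only need to match
    whatever the state machine expects.
    """
    payload = json.loads(parse_qs(body)["payload"][0])
    return {
        "slack_user_id": payload["user"]["id"],
        "action_id": payload["actions"][0]["action_id"],
        "response_url": payload["response_url"],
        "duration_seconds": duration_seconds,
    }
```

The resulting dictionary would be serialised with `json.dumps` and passed as the `input` to the Step Functions `StartExecution` call, and the `response_url` gives us somewhere to post the ephemeral reply.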
The Step Function has a few stages. First, we grant temporary access with AWS SSO (since renamed IAM Identity Center). Setting up a permission set ahead of time lets us easily assign emergency access when required, while keeping the ability to develop and test the policy outside the scope of this application. We then wait an hour (or another configurable amount of time) and revoke the access. Once the access is gone, we wait for CloudTrail to finish recording everything the user did, then e-mail a report to a user or distribution list. Here’s what our Step Function execution will look like:
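As a rough sketch of how those stages could be wired together (this is not the definition from the repository), the flow might be written in Amazon States Language along the following lines, built here as a Python dictionary. The state names, the `$.duration_seconds` input field and the 15-minute CloudTrail delay are all assumptions of mine, and the three Lambda ARNs are supplied by the caller.

```python
import json


def build_state_machine(grant_arn, revoke_arn, report_arn,
                        cloudtrail_delay_seconds=900):
    """Return an Amazon States Language definition for the break-glass flow."""
    return {
        "Comment": "Grant access, wait, revoke, wait for CloudTrail, report",
        "StartAt": "GrantAccess",
        "States": {
            "GrantAccess": {
                "Type": "Task",
                "Resource": grant_arn,
                "ResultPath": "$.grant",  # keep the original input alongside
                "Next": "WaitForAccessWindow",
            },
            "WaitForAccessWindow": {
                "Type": "Wait",
                # Read the wait time from the execution input (assumed field).
                "SecondsPath": "$.duration_seconds",
                "Next": "RevokeAccess",
            },
            "RevokeAccess": {
                "Type": "Task",
                "Resource": revoke_arn,
                "ResultPath": "$.revoke",
                "Next": "WaitForCloudTrail",
            },
            "WaitForCloudTrail": {
                "Type": "Wait",
                # Assumed delay to let CloudTrail catch up before reporting.
                "Seconds": cloudtrail_delay_seconds,
                "Next": "SendReport",
            },
            "SendReport": {
                "Type": "Task",
                "Resource": report_arn,
                "End": True,
            },
        },
    }


def execution_order(definition):
    """Follow the Next pointers of a purely linear definition, as a sanity check."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order
```

`json.dumps(build_state_machine(...))` gives the string that a state machine definition expects. Keeping the access window in the execution input rather than hard-coding it is what would make the wait configurable per request.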
Here’s what the access report e-mail looks like:
Each of the activities in the Step Function is performed by a Lambda function. Step Functions can make native AWS API calls themselves, but we’re sprinkling in some other logic and formatting, which makes me lean towards some proper code. The combination of API Gateway, Step Functions, DynamoDB and Lambda gives us a neat, serverless solution, perfect for a low-traffic use case like this.
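For a flavour of what one of those Lambda functions might contain, here is a minimal sketch of the grant and revoke steps using the `sso-admin` API's `create_account_assignment` and `delete_account_assignment` calls. The environment variable names, the `principal_id` event field (assumed to be an Identity Store user ID resolved earlier in the flow) and the return shape are hypothetical; the real functions in the repository may look quite different.

```python
import os


def load_config(env=os.environ):
    """Read the pre-provisioned SSO details from (hypothetical) env vars."""
    return {
        "instance_arn": env["SSO_INSTANCE_ARN"],
        "permission_set_arn": env["PERMISSION_SET_ARN"],
        "account_id": env["TARGET_ACCOUNT_ID"],
    }


def assignment_params(event, config):
    """Build the arguments shared by the create and delete assignment calls."""
    return {
        "InstanceArn": config["instance_arn"],
        "TargetId": config["account_id"],
        "TargetType": "AWS_ACCOUNT",
        "PermissionSetArn": config["permission_set_arn"],
        "PrincipalType": "USER",
        "PrincipalId": event["principal_id"],  # assumed input field
    }


def grant_access(event, config, sso_admin):
    """Assign the emergency permission set to the user."""
    response = sso_admin.create_account_assignment(
        **assignment_params(event, config))
    status = response["AccountAssignmentCreationStatus"]
    return {"request_id": status["RequestId"], "status": status["Status"]}


def revoke_access(event, config, sso_admin):
    """Remove the emergency permission set from the user."""
    response = sso_admin.delete_account_assignment(
        **assignment_params(event, config))
    status = response["AccountAssignmentDeletionStatus"]
    return {"request_id": status["RequestId"], "status": status["Status"]}


def grant_handler(event, context):
    """Lambda entry point for the grant step."""
    import boto3  # imported lazily so the helpers above run without AWS
    return grant_access(event, load_config(), boto3.client("sso-admin"))
```

Both calls are asynchronous on the AWS side and initially report a status such as `IN_PROGRESS`, so a fuller version would probably poll for completion before moving on. Passing the client in as an argument keeps the logic easy to exercise with a stub.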
View the app code and Terraform infrastructure on GitHub.