April 19, 2024

We’re heavy users of Amazon Elastic Compute Cloud (EC2) at Slack: we run approximately 60,000 EC2 instances across 17 AWS regions while operating hundreds of AWS accounts. A multitude of teams own and manage our various instances.

The Instance Metadata Service (IMDS) is an on-instance component that can be used to gain insight into the instance’s current state. Since it first launched over 10 years ago, AWS customers have used this service to gather useful information about their instances. At Slack, IMDS is used heavily for instance provisioning, and also by tools that need to understand their running environments.

Information exposed by IMDS includes IAM credentials, metrics about the instance, security group IDs, and a whole lot more. This information can be highly sensitive: if an instance is compromised, an attacker may be able to use instance metadata to gain access to other Slack services on the network.

In 2019, AWS released a new version of IMDS (IMDSv2) where every request is protected by session authentication. As part of our commitment to high security standards, Slack moved the entire fleet and tools to IMDSv2. In this article, we’re going to discuss the pitfalls of using IMDSv1 and our journey towards fully migrating to IMDSv2.

The v2 difference

IMDSv1 uses a simple request-and-response pattern that can magnify the impact of Server Side Request Forgery (SSRF) vulnerabilities: if an application deployed on an instance is vulnerable to SSRF, an attacker can exploit the application to make requests on their behalf. Since IMDSv1 supports simple GET requests, they can extract credentials using its API.

IMDSv2 eliminates this attack vector by using session-oriented requests. IMDSv2 works by requiring these two steps:

  1. Make a PUT request with the header X-aws-ec2-metadata-token-ttl-seconds, and receive a token that’s valid for the TTL provided in the request
  2. Use that token in an HTTP GET request with the header named X-aws-ec2-metadata-token to make any follow-up IMDS calls

With IMDSv2, rather than simply making HTTP GET requests, an attacker needs to exploit vulnerabilities to make PUT requests with headers. Then they have to use the obtained token to make follow-up GET requests with headers to access IMDS data. This makes it much more challenging for attackers to access IMDS via vulnerabilities such as SSRF.
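The two steps can be sketched in Python using only the standard library. This is an illustrative sketch, not code from Slack's tooling: the function names are hypothetical, and `fetch_metadata` only succeeds when run on an EC2 instance, where the link-local IMDS endpoint is reachable.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"  # link-local IMDS endpoint


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT request asking IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2: a GET request that presents the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Run both steps in order; requires network access to IMDS."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    request = build_metadata_request(path, token)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance, `fetch_metadata("instance-id")` would return that instance's ID. An SSRF bug that can only trigger plain GET requests cannot complete step 1, which is the point of the session-oriented design.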

Our journey towards IMDSv2

At Slack there are multiple instance provisioning mechanisms at play, such as Terraform, CloudFormation and various in-house tools that call the AWS EC2 API. As an organization, we rely heavily on IMDS to get insights into our instances during provisioning and the lifecycle of these instances.

We create AWS accounts per environment (Sandbox, Dev and Prod), per service team, and sometimes even per application, so we have hundreds of AWS accounts.

We have a single root AWS organization account. All our child accounts are members of this organization. When we create an AWS account, the account creation process writes information about the account (such as the account ID, owner details, and account tags) to a DynamoDB table. Information in this table is accessible via an internal API called Archipelago for account discovery.

Figuring out the scale of the problem

Before migrating, we first needed to understand how many instances in our fleet used IMDSv1. For this we used the EC2 CloudWatch metric called MetadataNoToken, which counts how often the IMDSv1 API was used for a given instance.
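As a rough illustration of reading this metric, the sketch below builds the arguments for CloudWatch's GetMetricStatistics call and sums the datapoints. The function names are hypothetical, the 24-hour window and hourly period are arbitrary choices, and `cloudwatch` is assumed to be a boto3 CloudWatch client created elsewhere.

```python
from datetime import datetime, timedelta, timezone


def metadata_no_token_query(instance_id: str, end: datetime, hours: int = 24) -> dict:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "MetadataNoToken",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 3600,         # one datapoint per hour
        "Statistics": ["Sum"],  # total IMDSv1 calls in each period
    }


def imdsv1_call_count(cloudwatch, instance_id: str, end: datetime) -> float:
    """Sum IMDSv1 calls over the window; `cloudwatch` is a boto3 client."""
    query = metadata_no_token_query(instance_id, end)
    response = cloudwatch.get_metric_statistics(**query)
    return sum(point["Sum"] for point in response["Datapoints"])
```

A non-zero result for an instance means something on it is still calling IMDS without a session token.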

We created an application called imds-cw-metric-collector to map these metrics and the instance IDs we collected to the various service teams and applications, so that we could alert them. The application used our internal Archipelago API to get a list of our AWS accounts, collected the aforementioned MetadataNoToken metric, and talked to our instance provisioning services to collect data like owner IDs and Chef roles (for instances that are configured using Chef). Our custom app sent all these metrics to our Prometheus monitoring system.

A dashboard aggregated these metrics to track all instances that made IMDSv1 calls. This information was then used to connect with service teams, and work with them to update their services to use IMDSv2.

IMDSv1 usage dashboard

However, the list of EC2 instance IDs and their owners was only a part of the equation. We also needed to understand which processes on these instances were making these calls to the IMDSv1 API.

At Slack, for the most part, we use Ubuntu and Amazon Linux on our EC2 instances. For IMDSv1 call detection, AWS provides a tool called AWS ImdsPacketAnalyzer. We decided to build the tool and package it up as a Debian package (*.deb) in our APT repository. This allowed the service teams to install this tool on demand and investigate IMDSv1 calls.

This worked perfectly for our Ubuntu 22.04 (Jammy Jellyfish) and Amazon Linux instances. However, the ImdsPacketAnalyzer doesn’t work on our legacy Ubuntu 18.04 (Bionic Beaver) instances, so we had to resort to using tools such as lsof and netlogs in some cases.

As a last resort, on some of our dev instances we just turned off IMDSv1 and listed the things that broke.

Stop calling IMDSv1

Once we had a list of instances and the processes on those instances that were making the IMDSv1 calls, it was time for us to get to work and update each one to use IMDSv2 instead.

Updating our bash scripts was the easy part, as AWS provides very clear steps on switching from IMDSv1 to IMDSv2 for these. We also upgraded our AWS CLI to the latest version to get IMDSv2 support. However, doing this for services that are written in other languages was a bit more complicated. Luckily AWS has a comprehensive list of libraries that we should be using to implement IMDSv2 for various languages. We worked with service teams to upgrade their applications to IMDSv2-supported versions of libraries and roll these out across our fleet.

Once we had rolled out these changes, the number of instances using IMDSv1 dropped precipitously.

Turning off IMDSv1 for new instances

Stopping our services from using the IMDSv1 API only solved part of the problem. We also needed to turn off IMDSv1 on all future instances. To solve this problem, we turned to our provisioning tools.

First we looked at our most commonly used provisioning tool, Terraform. Our team provides a set of standard Terraform modules for service teams to use to create things such as AutoScaling groups, S3 buckets, and RDS instances. These common modules enable us to make a change in a single place and roll it out to many teams. Service teams that just want to build an AutoScaling group do not need to know the nitty-gritty configurations of Terraform to use one of these modules.

However, we didn’t want to roll out this change to all our AWS child accounts at the same time, as there were service teams that were still actively working on switching to IMDSv2 at this time. Therefore we needed a way to exclude these teams and their child accounts. We came up with a custom Terraform module called accounts_using_imdsv1 as the solution. Then we were able to use this module in our shared Terraform modules to keep or terminate IMDSv1 as per the example below:

module "accounts_using_imdsv1" {
  source = "../slack/accounts_using_imdsv"
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.amzn-linux-2023-ami.id
  instance_type = "c6a.2xlarge"
  subnet_id     = aws_subnet.example.id

  metadata_options {
    http_endpoint = "enabled"
    # Accounts still on the exemption list keep IMDSv1; everyone else gets IMDSv2 only
    http_tokens   = module.accounts_using_imdsv1.is_my_account_using_imdsv1 ? "optional" : "required"
  }
}

We started with a large list of accounts marked as using IMDSv1 in the accounts_using_imdsv1 module, but we were slowly able to remove them as service teams migrated to IMDSv2.

Blocking instances with IMDSv1 from launching

The next step for us was to block launching instances with IMDSv1 enabled. For this we turned to AWS Service control policies (SCPs). We updated our SCPs to block launching IMDSv1-supported instances across all our child accounts. However, similar to the AutoScaling group changes we discussed earlier, we wanted to exclude some accounts at the beginning while the service owners were working to switch to IMDSv2. Our accounts_using_imdsv1 Terraform module came to the rescue here too. We were able to use this module in our SCPs as below. We blocked the ability to launch instances with IMDSv1 support and also blocked the ability to turn on IMDSv1 on existing instances.

  # Block launching instances with IMDSv1 enabled
  statement {
    effect = "Deny"

    actions = [
      "ec2:RunInstances",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:MetadataHttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }

  # Block turning on IMDSv1 if it is already turned off
  statement {
    effect = "Deny"

    actions = [
      "ec2:ModifyInstanceMetadataOptions",
    ]

    resources = [
      "arn:aws:ec2:*:*:instance/*",
    ]

    condition {
      test     = "StringNotEquals"
      variable = "ec2:Attribute/HttpTokens"
      values   = ["required"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = module.accounts_using_imdsv1.accounts_list_using_imdsv1
    }
  }
}
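The way the two conditions in each statement combine can be easy to misread, so here is a small Python model of the first statement's logic. It is only a sketch of the intent: `scp_denies_launch` is a hypothetical name, and real IAM policy evaluation has more cases (for example, keys missing from the request context) than this function models.

```python
def scp_denies_launch(http_tokens: str, account_id: str, exempt_accounts: set[str]) -> bool:
    """Model the Deny statement: both StringNotEquals conditions must hold."""
    # Condition 1: the launch does not require IMDSv2 session tokens.
    tokens_not_required = http_tokens != "required"
    # Condition 2: the calling account matches none of the exempt account IDs.
    account_not_exempt = account_id not in exempt_accounts
    # Multiple conditions in one statement are ANDed together.
    return tokens_not_required and account_not_exempt
```

With this model, removing an account from the exemption list is what switches enforcement on for that account, which matches the gradual rollout described above.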

How effective are these SCPs?

SCPs are effective when it comes to blocking most IMDSv1 usage. However, there are some places where they don’t work.

SCPs don’t apply to the AWS root organization’s account, and only apply to child accounts that are members of the organization. Therefore, SCPs don’t prevent launching instances with IMDSv1 enabled, nor turning on IMDSv1 on an existing instance, in the root AWS account.

SCPs also don’t apply to service-linked roles. For example, if an AutoScaling group launches an instance in response to a scaling event, under the hood the AutoScaling service is using a service-linked IAM role managed by AWS, and those instance launches are not impacted by the above SCPs.

We looked at preventing teams from creating AWS Launch Templates that don’t enforce IMDSv2, but AWS Launch Template policy condition keys currently do not provide support for ec2:Attribute/HttpTokens.

What other safety mechanisms are in place?

As there is no 100%-foolproof way to stop someone from launching an IMDSv1-enabled EC2 instance, we put in a notification system using AWS EventBridge and Lambda.

We created two EventBridge rules in each of our child accounts using CloudTrail events for EC2 events. One rule captures requests to the EC2 API and the second captures responses from the EC2 API, telling us when someone is making an ec2:RunInstances call with IMDSv1 enabled.

Rule 1: Capturing the requests

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "requestParameters": {
      "metadataOptions": {
        "httpTokens": ["optional"]
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}

Rule 2: Capturing the responses

{
  "detail": {
    "eventName": ["RunInstances"],
    "eventSource": ["ec2.amazonaws.com"],
    "responseElements": {
      "instancesSet": {
        "items": {
          "metadataOptions": {
            "httpTokens": ["optional"]
          }
        }
      }
    }
  },
  "detail-type": ["AWS API Call via CloudTrail"],
  "source": ["aws.ec2"]
}
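To make the matching behaviour of these patterns concrete, here is a deliberately simplified matcher in Python. It is not how EventBridge is implemented and it ignores advanced operators such as prefix or numeric matching; it only models the basics used above: every key in the pattern must be present in the event, leaf arrays list the accepted values, and a list in the event matches if any of its elements matches. The `matches` function and the sample pattern constant are illustrative.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified event-pattern match: every pattern key must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        # Treat a scalar in the event as a one-element list.
        candidates = actual if isinstance(actual, list) else [actual]
        if isinstance(expected, dict):
            # Nested pattern: at least one candidate object must match it.
            if not any(isinstance(c, dict) and matches(expected, c) for c in candidates):
                return False
        else:
            # Leaf: `expected` is the list of accepted values.
            if not any(c in expected for c in candidates):
                return False
    return True


REQUEST_RULE = {
    "detail": {
        "eventName": ["RunInstances"],
        "eventSource": ["ec2.amazonaws.com"],
        "requestParameters": {"metadataOptions": {"httpTokens": ["optional"]}},
    },
    "detail-type": ["AWS API Call via CloudTrail"],
    "source": ["aws.ec2"],
}
```

Under this model, a RunInstances event whose request sets httpTokens to "required", or omits metadataOptions from the request altogether, does not match the first rule. That gap would be one reason to also match on the response, where the instance's effective metadata options are reported.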

These event rules have a target set up to point them at a central event bus living in an account managed by our team.

AWS Eventbridge Targets

Events matching these rules are sent to the central event bus. The central event bus captures these events via a similar set of rules. Next it sends them through an Input Transformer to format the event similar to the following:

Input path:

{
  "account": "$.account",
  "instanceid": "$.detail.responseElements.instancesSet.items[0].instanceId",
  "region": "$.region",
  "time": "$.time"
}

Input template:

{
  "source": "slack",
  "detail-type": "slack.api.postMessage",
  "version": 1,
  "account_id": "<account>",
  "channel_tag": "event_alerts_channel_imdsv1",
  "detail": {
    "text": ":importantred: :provisioning: instance `<instanceid> (<region>)` in the AWS account `<account>` was launched with `IMDSv1` support"
  }
}

Finally, the transformed events get sent to a Lambda function in our account.

AWS Eventbridge Targets

This Lambda function uses the account ID from the event and our internal Archipelago API to determine the Slack channel, then sends this event to Slack.

IMDSv1 Slack Alerts

This flow looks like the following:

IMDSv1 Slack Alert Flow

We also have a similar alert in place for when IMDSv1 is turned on for an existing instance.

IMDSv1 Enabled Slack Alert

What about the instances with IMDSv1 enabled?

Launching new instances with IMDSv2 is cool and all, but what about our thousands of existing instances? We needed a way to enforce IMDSv2 on them as well. As we saw above, SCPs don’t block launching instances with IMDSv1 entirely.

This is why we created a service called IMDSv1 Terminator. It’s deployed on EKS and uses an IAM OIDC provider to obtain IAM credentials. These credentials have access to assume a highly restricted role in all our child accounts created for this very purpose.

The policy attached to the role assumed by IMDSv1 Terminator in child accounts is as below:

{
    "Statement": [
        {
            "Action": "ec2:ModifyInstanceMetadataOptions",
            "Condition": {
                "StringEquals": {
                    "ec2:Attribute/HttpTokens": "required"
                }
            },
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Sid": ""
        },
        {
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": ""
        }
    ],
    "Version": "2012-10-17"
}

Similar to our earlier metric collector application, this also uses the internal Archipelago API to get a list of our AWS accounts, lists our EC2 instances in batches, analyzes each one, and checks if IMDSv1 is enabled. If it is, the service will enforce IMDSv2 on the instance.
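The core of such a remediation loop can be sketched as follows. This is not the IMDSv1 Terminator source: the function names are hypothetical, `ec2` is assumed to be a boto3 EC2 client already scoped to one account and region, and a real service would also need error handling, instance state filtering, and the Slack notification step.

```python
def instances_with_imdsv1(reservations: list[dict]) -> list[str]:
    """Return IDs of instances whose metadata options still allow IMDSv1."""
    return [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation.get("Instances", [])
        if instance.get("MetadataOptions", {}).get("HttpTokens") == "optional"
    ]


def enforce_imdsv2(ec2) -> list[str]:
    """Require session tokens on every IMDSv1-enabled instance visible to `ec2`."""
    remediated = []
    # Page through DescribeInstances so large accounts are handled in batches.
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for instance_id in instances_with_imdsv1(page["Reservations"]):
            ec2.modify_instance_metadata_options(
                InstanceId=instance_id,
                HttpTokens="required",
            )
            remediated.append(instance_id)
    return remediated
```

The only write call it makes sets HttpTokens to "required", which is all the policy above permits.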

When the service remediates an instance, we get notified in Slack.

IMDSv1 Terminator Slack Alert

Initially we saw hundreds of these messages for existing instances, but as they were remediated and only new instances were launched with IMDSv2, we stopped seeing these messages. Now if an instance gets launched with IMDSv1 support enabled, we have the comfort of knowing that it will get remediated and we’ll get notified.

This service also sends metrics to our Prometheus monitoring system about the IMDS status of our instances. We can easily visualize which AWS accounts and regions are still running IMDSv1-enabled instances, if there are any.

IMDSv1 Usage Dashboard

Some final words

Being able to enforce IMDSv2 across Slack’s vast network was a challenging but rewarding experience for the Cloud Foundations team. We worked with our large number of service teams to accomplish this goal, in particular our SecOps team, who went above and beyond to help us complete the migration.
