Turning AzureAD OAuth tokens into SAML tokens for AWS

A little background

At my current workplace, we needed to be able to call AWS from the CLI. We try to use AzureAD as our main IdP (rather than our legacy ADFS deployment).

This works great for accessing the AWS console using the Enterprise App already available on the store, but it doesn't work at all for getting API creds for use with tools like the AWS CLI.

Our AWS account team pointed us to a neat tool, the AWS CLI Credentials Provider, which is a sample showing how one could integrate other providers and SAML.

The main reason you'd want to use something like AzureAD is to ensure things like MFA are applied consistently and conditional access policies still apply, without us needing to add that to AWS.

The secret sauce that makes this work is in the V1 on-behalf-of flow available in AzureAD. It's documented well, but it's a long read. The most interesting section is the Service-to-service access token request because that's what allows us to hand in an OAuth token for one service, and request a SAML token for another.

Update

Since deploying this internally, we ended up deploying AWS Directory Services, so AWS SSO now makes sense. I've not decided yet if that should kill this solution.

Also, knowing you can "switch" between token types in AAD is neat.

The architecture required

We're going to need a few components here to make this work:

An Azure AD Tenant
The Amazon Web Services (AWS) Enterprise Application deployed to that tenant
An Application Registration for the CLI component - to identify our user
An Application Registration for the Middleware component - to transform the OAuth token into a SAML token, using the on-behalf-of flow
Code that implements our credentials provider app

A "rough" sequence diagram of the events that we need to happen is below:

sequenceDiagram
    participant User
    participant CLI as CLI App
    participant AAD as AzureAD
    participant MID as Middleware App
    participant AWS as AWS
    User->>CLI: Accesses AWS resource
    activate CLI
    CLI->>AAD: Request new device code for a session
    activate AAD
    AAD->>CLI: returns an authorization code, waits (polls) for sign in
    deactivate AAD
    CLI-->>User: User opens device code in browser
    loop Wait for login
        CLI->>AAD: Poll for sign in action
        activate AAD
        AAD->>CLI: User is signed in, returns oauth bearer token (A)
        deactivate AAD
    end
    CLI->>MID: Exchange token (A) for SAML Token for application (B)
    activate MID
    MID->>AAD: OBO User, swap (A) for (B)
    activate AAD
    AAD->>MID: New SAML token
    deactivate AAD
    MID->>CLI: (B) SAML token for AWS
    deactivate MID
    CLI->>AWS: Claim SAML token for AWS Credentials
    activate AWS
    AWS->>CLI: AWS Credential Set (STS)
    deactivate AWS
    CLI->>User: Access AWS resource
    deactivate CLI

Setting up the App Registrations

There's a lot to do there, and the instructions are better documented here: AzureAD Setup Instructions

The gist of it though, is you need to set your two new applications up as Public client applications. The CLI app will need to be able to claim ID Tokens and the Middleware app will need both ID Tokens and Access Tokens. Turning on public clients allows you to use the device code flow amongst other things.

The CLI Application doesn't need any special settings, really. Users are only going to use it for the first authentication hop, to get a token.

The Middleware Application on the other hand will need to:

Create a scope allowing user_impersonation
Be delegated permissions by the user (or admins) to call upstream to the AWS app with user_impersonation
Allow the CLI app to call it, with the aforementioned impersonation scope.

The CLI App

I've written these components in Python, although ADAL (Microsoft's auth library, adal) isn't that hot IMHO. This is probably easier in MSAL, the newer library. That being said, at the time I looked at this, MSAL couldn't do device code flows.

The snippets here are rough. I originally wrote them for Python 3.6 and haven't tested them since. I've got more production-ready code linked at the bottom of this article.

The following will connect to Azure AD and ask for a new device code. With this, we can then log in using a browser on any device, which will authenticate this session.

import adal

# Create an ADAL Auth Context against my tenant, ask for a device code
context = adal.AuthenticationContext(authority_url)
device_code = context.acquire_user_code(RESOURCE, client_id)
print("")
print("Logging in user using device token...")
print(device_code['message'])

From there, we can then claim the device code for a token once the user signs in on the browser. Simple right?

token = context.acquire_token_with_device_code(RESOURCE, device_code, client_id)

Yeah, no. It's not that easy. What happens if the token is never claimed? How do we deal with being able to cancel this blocking request?

We probably want it to be more sensible and secure. That means polling in the background while we wait, with a timeout. This almost certainly isn't the best way to do this, but I'll use ThreadPoolExecutor with a single thread. If you have a saner way to do this, please do let me know. The TPE takes our function, context.acquire_token_with_device_code, plus all the arguments to pass into it, and runs it to get a real OAuth token back from AzureAD.

If we don't get a result (i.e. timeout or whatever), we'll specifically tell Azure AD to bin that device_code session, so it can't be stolen.

This seems easy on the face of it, but the code ends up being way more complex.

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

# Block the thread, until we get confirmation that the token has
# been claimed or we time out
tpe = ThreadPoolExecutor(max_workers=1)
futures = []

# Add our poll job promise to the queue
futures.append(tpe.submit(
    context.acquire_token_with_device_code, RESOURCE, device_code, client_id
))

# Store the result if we get one inside 10 seconds, otherwise bail
# Blocking starts here...
result = concurrent.futures.wait(futures, timeout=10)
if len(result.done) == 1:
    # result.done is a set (not indexable), so read from our own future
    token = futures[0].result()
else:
    context.cancel_request_to_get_token_with_device_code(device_code)
    raise Exception("Device token flow has timed out")
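The same wait-or-cancel pattern can be pulled out into a small helper. This is a minimal sketch rather than anything from the original code: call_with_timeout and its parameters are hypothetical names, and it only uses the standard library.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout, on_timeout, *args):
    """Run fn(*args) in a worker thread; call on_timeout() if it overruns.

    Hypothetical helper: on timeout it runs the cleanup callback and then
    re-raises concurrent.futures.TimeoutError to the caller.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        on_timeout()
        raise
    finally:
        # Don't block here; the worker thread ends once fn returns.
        executor.shutdown(wait=False)
```

With ADAL, fn would be context.acquire_token_with_device_code and on_timeout would be a lambda that calls context.cancel_request_to_get_token_with_device_code(device_code).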

OK, now we are authenticated. Yey. 🌈

Using the OBO flow to swap to SAML

At this point, our CLI app holds a perfectly valid OAuth token that's scoped to access our middleware app.

We'll draw up a request to fire off at Azure AD, like the sample below:

import base64
import requests

# We'll do this in requests because ADAL won't
oauth_url = f"{authority_url}/oauth2/token"
obo_payload = {
    "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
    "assertion": token["accessToken"],
    "client_id": middleware_client_id,
    "client_secret": middleware_client_secret,
    "resource": appid_of_aws_application,
    "requested_token_use": "on_behalf_of",
    "requested_token_type": "urn:ietf:params:oauth:token-type:saml2"
}
response = requests.post(oauth_url, data=obo_payload)
response.raise_for_status()  # fail loudly if Azure AD rejects the request
r = response.json()

# Pull the saml token out of the response.
# urlsafe_b64decode always returns bytes, so decode straight to text
saml_token = base64.urlsafe_b64decode(r['access_token']).decode('utf-8')

This is kind of neat, and there are a few things happening here.

We take our existing OAuth token issued to our CLI app, in the name of our user
We use the middleware's client_id and client_secret to allow us to use the on_behalf_of grant
We tell Azure AD the resource we want to access is the AWS application
We ask for the token to be exchanged for a saml assertion
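One detail worth guarding against when unpacking the response: base64url strings are sometimes sent with the trailing "=" padding stripped, and Python's decoder rejects those. Whether Azure AD strips the padding here is an assumption on my part, not something the original code relies on. A minimal sketch of a padding-tolerant decoder (decode_saml_token is a hypothetical name):

```python
import base64


def decode_saml_token(encoded):
    """Decode a base64url string to text, restoring any stripped '=' padding."""
    # Base64 input must be a multiple of 4 characters long.
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")
```

Input that is already padded passes through unchanged, so the helper is safe to use either way.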

Err, wait. Isn't that needlessly complex? I'm afraid not. What we are doing here is service to service. We can't call the Enterprise App directly from the CLI app because we don't own the Enterprise app and we can't make it support that kind of flow.

Getting deeper...

If all you wanted to know was how to get from one token to the other, you should stop reading now. I'm continuing down the rabbit hole with the AWS side of things, plus some nice user-facing bits for the script, because I can.

Taking the token to AWS

OK - so now we have a SAML Assertion. That's important to note because an assertion on its own isn't actually useful. We need to turn that into a request AWS understands.

To do that we'll need to parse out a bunch of information.

import datetime
import uuid
import xml.etree.ElementTree as ET

# Use ElementTree to parse our token, because it's XML
root = ET.fromstring(saml_token)

# Read the assertion, and parse out all the claims for AWS roles
aws_roles = []
SAML_NS = '{urn:oasis:names:tc:SAML:2.0:assertion}'
for attr in root.iter(f"{SAML_NS}Attribute"):
    if (attr.get('Name') == 'https://aws.amazon.com/SAML/Attributes/Role'):
        for val in attr.iter(f"{SAML_NS}AttributeValue"):
            aws_roles.append(val.text)

# We also need to get the issuer for later
saml2_issuer = root.find(f"{SAML_NS}Issuer")

# Format a valid SAML response (as if it came from something like ADFS)
saml_response_tpl = """
<samlp:Response ID="_{response_id}"
    Version="2.0" IssueInstant="{authn_instant}"
    Destination="https://signin.aws.amazon.com/saml"
    xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">
    <Issuer xmlns="urn:oasis:names:tc:SAML:2.0:assertion">{issuer}</Issuer>
    <samlp:Status>
        <samlp:StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
    </samlp:Status>
    {saml_assertion}
</samlp:Response>
"""
saml_response = saml_response_tpl.format(
    response_id=uuid.uuid4(),
    authn_instant=datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    issuer=saml2_issuer.text,
    saml_assertion=saml_token
)
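To see the role and issuer parsing in isolation, here is a minimal sketch that runs against a made-up assertion. The sample XML, the tenant and account values, and the extract_roles_and_issuer name are all hypothetical; a real Azure AD assertion carries far more (signature, conditions, subject).

```python
import xml.etree.ElementTree as ET

SAML_NS = "{urn:oasis:names:tc:SAML:2.0:assertion}"
ROLE_ATTR = "https://aws.amazon.com/SAML/Attributes/Role"

# Made-up assertion, trimmed to the parts the parsing code reads.
SAMPLE_ASSERTION = """
<Assertion xmlns="urn:oasis:names:tc:SAML:2.0:assertion">
  <Issuer>https://sts.windows.net/example-tenant/</Issuer>
  <AttributeStatement>
    <Attribute Name="https://aws.amazon.com/SAML/Attributes/Role">
      <AttributeValue>arn:aws:iam::111122223333:role/Admin,arn:aws:iam::111122223333:saml-provider/AAD</AttributeValue>
    </Attribute>
  </AttributeStatement>
</Assertion>
"""


def extract_roles_and_issuer(assertion_xml):
    """Return (list of role claim strings, issuer text or None)."""
    root = ET.fromstring(assertion_xml.strip())
    roles = [
        val.text
        for attr in root.iter(f"{SAML_NS}Attribute")
        if attr.get("Name") == ROLE_ATTR
        for val in attr.iter(f"{SAML_NS}AttributeValue")
    ]
    issuer = root.find(f"{SAML_NS}Issuer")
    return roles, (issuer.text if issuer is not None else None)


roles, issuer = extract_roles_and_issuer(SAMPLE_ASSERTION)
```

Each entry in roles is a single comma-separated pair of ARNs, which is why the later code splits on ','.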

At this point we could fire that at the AWS SAML endpoint and Bob's your aunty. If this was the console, it'd pop up with the role selection dialog. We're in a CLI, so we need to do that ourselves...

# This is a helper because many, many blogs get the SAML ARN
# claim in the wrong order. The code below expects ROLE_ARN,IDP_ARN
normalised_roles = []
for role in aws_roles:
    chunks = role.split(',')
    if 'saml-provider' in chunks[0]:
        # The IdP ARN came first, so swap the pair around
        role = f"{chunks[1]},{chunks[0]}"
    normalised_roles.append(role)
aws_roles = normalised_roles

# Allow the user to pick the role they want, or if there's
# only one use that one
if len(aws_roles) > 1:
    i = 0
    print("Please choose the role you would like to use:")
    for role in aws_roles:
        role_arn = role.split(',')[0]
        print(f"[{i}]: {role_arn}")
        i+=1

    print("Selection: ")
    selected_role_index = input()

    if not 0 <= int(selected_role_index) < len(aws_roles):
        print("You selected an invalid role number, please try again...")
        sys.exit(1)

    role_arn, principal_arn = aws_roles[int(selected_role_index)].split(',')
else:
    role_arn, principal_arn = aws_roles[0].split(',')

import boto3

# Finally, call STS and get some creds
conn = boto3.client('sts', region_name='us-east-1')
token = conn.assume_role_with_saml(
    RoleArn=role_arn,
    PrincipalArn=principal_arn,
    SAMLAssertion=base64.b64encode(saml_response.encode()).decode('ascii')
)
credentials = token['Credentials']
print("STS Token:")
print("Expiration:          {}".format(credentials['Expiration']))
print("Access Key:          {}".format(credentials['AccessKeyId']))
print("Secret Access Key:   {}".format(credentials['SecretAccessKey']))
print("Session Token:       {}".format(credentials['SessionToken']))
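The role handling above is easier to test if the two fiddly parts (ordering the ARN pair, and validating what the user typed) live in small pure functions. A minimal sketch, with normalise_role_pair and parse_selection as hypothetical names that aren't in the original script:

```python
def normalise_role_pair(pair):
    """Return 'ROLE_ARN,IDP_ARN' regardless of the order in the claim."""
    first, second = (part.strip() for part in pair.split(","))
    if ":saml-provider/" in first:
        # The IdP ARN came first, so swap the pair around.
        first, second = second, first
    return f"{first},{second}"


def parse_selection(raw, count):
    """Return a valid index into a list of `count` roles, or None."""
    try:
        index = int(raw)
    except ValueError:
        return None
    return index if 0 <= index < count else None
```

Returning None for bad input (non-numeric, negative, or too large) lets the caller decide whether to exit or prompt again.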

What a saga.

Again, thanks to the AWS team in Perth and Sydney, who pointed me in the right direction on the AWS side.

The GitHub repo I've created for this (with slightly more sensible code for production) can be found at github.com/elliotsegler/aws-aad-creds.