CICD Pipelines with long running tasks are less than ideal

CICD pipelines automate the building, testing, packaging, and deployment of your digital projects as they change and take shape. Typically, your CICD pipeline will run every time changes are pushed to a repository. This setup helps accelerate the development to delivery cycle.

Ideally, your CICD pipelines should run as quickly as possible. However, there may be situations where they need to perform a long-running task. Your CICD build server may not be suitable or even capable of running the long-running task. As a result, you may need to run this task in an external system.

Problem: long running external task makes inefficient use of your CICD build server (GitLab Runner)

The sequence diagram below shows an example scenario where a GitLab CI/CD pipeline is used to perform a long-running external task. GitLab CI/CD uses what they call a “GitLab Runner” which is essentially a build server to perform the tasks in the CICD pipelines.

The GitLab CI/CD Pipeline can launch a GitLab Runner instance on-demand to start a task, in this case start the external long running task. The problem with this scenario is that the long running external task takes a long time (hours) to complete. Furthermore, the GitLab runner needs to regularly poll the external task to check if it has completed. This results in the GitLab runner instance remaining up for the life of the external task and costing you GitLab compute minutes or dollars from running your own self-managed GitLab runner instance.

A more efficient solution could be to make the external task an asynchronous task that is able to “call back” the GitLab CI/CD pipeline when it is complete, rather than having the Runner poll it. This would require the external task system to have knowledge of the GitLab CI/CD pipeline execution and use the GitLab REST API to notify it of its completion. Putting this together on the external task side may not be so easy.

Solution: AWS Step Functions and Lambda for serverless monitoring of external task

If you are in a restrictive situation where your only option is to have a monitoring component that regularly polls the status of an external task, then you can use AWS Step Functions with AWS Lambda to do this for you.

Using AWS Step Functions with AWS Lambda, you can build a serverless workflow that starts the external task and performs periodic status checks against it. The workflow then notifies your GitLab CI/CD pipeline when the external task is complete. The GitLab Runner only needs to stay up long enough to make an asynchronous call that starts the Step Function execution. Below is a sequence diagram for this solution.

This solution is more efficient because it only uses serverless compute resources when an action is required, such as starting the external task or checking its status. Most of the time spent is waiting for the next “check external task status” action in which no compute resources are used.
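To put rough numbers on this, the sketch below compares runner uptime against the number of short status checks for a hypothetical 3-hour task polled every 30 seconds. The durations and function names are illustrative assumptions, not measurements from the sample project.

```python
import math


def poll_count(task_seconds: int, wait_seconds: int) -> int:
    """Number of status checks needed before a task of this length is seen as finished."""
    return math.ceil(task_seconds / wait_seconds)


def runner_minutes_when_runner_polls(task_seconds: int) -> float:
    """Runner uptime, in minutes, if the runner itself waits for the task to finish."""
    return task_seconds / 60


task_seconds = 3 * 60 * 60  # hypothetical 3-hour external task
wait_seconds = 30           # hypothetical poll interval

print(runner_minutes_when_runner_polls(task_seconds))  # runner minutes when the runner polls
print(poll_count(task_seconds, wait_seconds))          # short Lambda invocations instead
```

With these assumed figures the runner would stay up for 180 minutes, whereas the Step Functions approach makes 360 brief Lambda invocations and consumes no compute while in the Wait state.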

Below is an image of the Step Function state machine to achieve the solution:

The steps of a successful execution are described below:

The GitLab CI/CD pipeline (runner) makes a Step Function start execution call on the state machine (see here).
The state machine first uses a Lambda function to start the external task (Mock start async external task) (see here).
The state machine then enters a loop to poll the external task for its status (Wait X Seconds → Poll external task status → External task COMPLETE? → repeat).
1. The time to wait between each poll iteration is configurable and provided by the execution input to the Wait state (Wait X Seconds).
2. A Lambda function is used to check the external task status (Poll external task status), the output of which is then passed to the Choice state (External task COMPLETE?) (see here).
3. If the Choice state (External task COMPLETE?) receives an input of “COMPLETE”, the loop is exited. Otherwise, the default route is to re-enter the Wait state (Wait X Seconds) and repeat the loop.
On successful completion of the external task, a Lambda function calls back the GitLab CI/CD pipeline with a success message (Callback GitLab Success) (see here).

If the Choice state (External task COMPLETE?) receives a “FAILED” input due to the external task returning a failure status, then a Lambda function is used to call back the GitLab CI/CD pipeline with a fail message (Callback GitLab Fail).

Any errors occurring in the Mock start async external task and Poll external task status Lambda functions will also be directed to call back the GitLab CI/CD pipeline with a fail message (Callback GitLab Fail).
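The flow described above can be sketched as an Amazon States Language definition built from a Python dictionary. This is a minimal sketch and not the sample project's actual .asl.json file: the Lambda ARNs, the `last_result` and `error` result paths, the `lambda_task` helper, and the "FAILED" callback message are all assumptions made for illustration. Only the state names and the `execution_input` payload key come from this post.

```python
import json

# Hypothetical Lambda ARNs, for illustration only.
START_FN = "arn:aws:lambda:ap-southeast-2:123456789012:function:start-external-task"
POLL_FN = "arn:aws:lambda:ap-southeast-2:123456789012:function:poll-external-task"
CALLBACK_FN = "arn:aws:lambda:ap-southeast-2:123456789012:function:callback-gitlab"

FAIL_STATE = "Callback GitLab Fail"
STATUS_PATH = "$.last_result.Payload.status"  # assumed location of the poll result


def lambda_task(function_arn, next_state=None, extra_payload=None, catch_errors=True):
    """Build a Lambda invoke Task state that passes the execution input to the function."""
    payload = {"execution_input.$": "$$.Execution.Input"}
    payload.update(extra_payload or {})
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_arn, "Payload": payload},
        # Keep the original state input and store the Lambda result beside it.
        "ResultPath": "$.last_result",
    }
    if catch_errors:
        # Any Lambda error is routed to the fail callback.
        state["Catch"] = [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": FAIL_STATE}
        ]
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "StartAt": "Mock start async external task",
    "States": {
        "Mock start async external task": lambda_task(START_FN, "Wait X Seconds"),
        "Wait X Seconds": {
            "Type": "Wait",
            # Poll interval comes from the execution input.
            "SecondsPath": "$.external_task_params.status_check_poll_wait_seconds",
            "Next": "Poll external task status",
        },
        "Poll external task status": lambda_task(POLL_FN, "External task COMPLETE?"),
        "External task COMPLETE?": {
            "Type": "Choice",
            "Choices": [
                {"Variable": STATUS_PATH, "StringEquals": "COMPLETE",
                 "Next": "Callback GitLab Success"},
                {"Variable": STATUS_PATH, "StringEquals": "FAILED", "Next": FAIL_STATE},
            ],
            # Any other status loops back to the Wait state.
            "Default": "Wait X Seconds",
        },
        "Callback GitLab Success": lambda_task(
            CALLBACK_FN, extra_payload={"callback_message": "SUCCESS"}, catch_errors=False
        ),
        FAIL_STATE: lambda_task(
            CALLBACK_FN, extra_payload={"callback_message": "FAILED"}, catch_errors=False
        ),
    },
}

print(json.dumps(definition, indent=2))
```

The callback states deliberately have no Catch rule in this sketch, so a failing fail-callback cannot loop back into itself.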

You can find the Step Function state machine .asl.json file here.

GitLab CI/CD pipeline async callback with AWS Step Functions

To gain the cost savings benefit of reducing GitLab runner uptime, the GitLab CI/CD pipeline definition (.gitlab-ci.yml file) needs to be configured to:

Start the Step Function state machine execution.
Wait for a call back from the Step Function state machine with the external task result.

This can be achieved with two jobs in the GitLab CI/CD pipeline.

An example .gitlab-ci.yml file is available here. The example uses OpenID Connect to provide access to AWS (see: https://docs.gitlab.com/ci/cloud_services/aws/).

The diagram below shows the example GitLab CI/CD pipeline visualization against the AWS Step Functions state machine execution diagram.

The example GitLab CI/CD pipeline contains two jobs, which are explained as follows.

CICD job: sfn-start-execution

This job’s main purpose is to run the AWS CLI command aws stepfunctions start-execution to start the Step Function state machine execution. This is run from a bash script sfn_start_execution.bash.

After this job is complete the runner instance is stopped. The next job “finish-async-external-task” does not run immediately because it is set up as a manual job (further explained below).

The bash script sfn_start_execution.bash helps put together the arguments for running the AWS CLI command aws stepfunctions start-execution, including the execution inputs loaded from the file external_task_params.json.

The Step Function state machine execution input is provided in the JSON file external_task_params.json. An example execution input is below:

{
    "external_task_params": {
        "status_check_url": "http://mock-external-task-status-0wp4znps.s3-website-ap-southeast-2.amazonaws.com/status.json",
        "status_check_poll_wait_seconds": 30,
        "example_param_a": "example_value_a",
        "example_param_b": "example_value_b",
        "example_param_c": "example_value_c"
    },
    "gitlab_cicd": {
        "note": "This object contains parameters for interacting with GitLab CICD. The values will be updated by GitLab CICD pipeline itself.",
        "callback_job_name": "REPLACE_WITH_GITLAB_CALLBACK_JOB_NAME",
        "project_id": "REPLACE_WITH_CI_PROJECT_ID",
        "pipeline_id": "REPLACE_WITH_CI_PIPELINE_ID"
    }
}

The execution input fields are described as follows:

Field path	Description
`external_task_params.status_check_url`	A website URL that can be used to check the status of an external task. The above example uses an S3 bucket static website as a mock external task status page.
`external_task_params.status_check_poll_wait_seconds`	The time to wait, in seconds, between checks of the external task status.
`external_task_params.example_param_X`	This is an example parameter available for the state machine to provide to the external task when it is started.
`gitlab_cicd.callback_job_name`	The GitLab CI/CD job name for the state machine to call back once the external task is completed. In this example it will resolve to the job sfn-finish-callback. This field will be set by the GitLab CI/CD pipeline itself.
`gitlab_cicd.project_id`	The GitLab CI/CD project ID for the state machine to call back once the external task is completed. This field will be set by the GitLab CI/CD pipeline itself.
`gitlab_cicd.pipeline_id`	The GitLab CI/CD pipeline ID for the state machine to call back once the external task is completed. This field will be set by the GitLab CI/CD pipeline itself.
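The sample project performs the placeholder substitution and the start call in a bash script. As a hedged illustration of the same idea in Python, the sketch below fills the `gitlab_cicd` fields from GitLab's predefined CI/CD variables `CI_PROJECT_ID` and `CI_PIPELINE_ID` and then starts the execution with boto3. The function names are hypothetical, and this is not the code from sfn_start_execution.bash.

```python
import json

CALLBACK_JOB_NAME = "sfn-finish-callback"  # the manual job the state machine will run


def build_execution_input(params: dict, env: dict) -> dict:
    """Return a copy of the params with the gitlab_cicd placeholders replaced."""
    execution_input = json.loads(json.dumps(params))  # simple deep copy of JSON data
    execution_input["gitlab_cicd"].update(
        callback_job_name=CALLBACK_JOB_NAME,
        project_id=env["CI_PROJECT_ID"],
        pipeline_id=env["CI_PIPELINE_ID"],
    )
    return execution_input


def start_execution(state_machine_arn: str, execution_input: dict) -> str:
    """Start the state machine asynchronously and return the execution ARN."""
    import boto3  # imported here so the builder above works without boto3 installed

    client = boto3.client("stepfunctions")
    response = client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(execution_input),
    )
    return response["executionArn"]
```

In a CI job, you would load external_task_params.json, pass `os.environ` as `env`, and call `start_execution` with the state machine ARN. The call returns as soon as the execution is started, so the runner can stop straight away.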

CICD job: sfn-finish-callback

This job is used to wait for the call back from the Step Function execution after the external task has ended. The external task result is provided to the CICD pipeline to set the completion status (Passed/Failed).

This job is set as a manual run job. A manual job does not start automatically after prior jobs have been completed. Instead it requires a manual call to start. We can use this behavior to 'pause' the CI/CD pipeline, ensuring that no GitLab runner remains active.

The manual job is to be started by the Step Function execution using a GitLab REST API call (further explained below).

After the external task has ended with a success or failure, the Step Function execution can make the manual call to start this job and provide the external task result in the job variable EXTERNAL_TASK_RESULT. A value of “SUCCESS” will make the job succeed and the CICD pipeline complete with a “Passed” status. Any other task result will make the job fail and the CICD pipeline end with a “Failed” status.
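The pass/fail rule for this job is small enough to sketch directly. The snippet below is an illustrative Python version of that rule, assuming the job decides its outcome through its exit code; the function names are hypothetical and the sample .gitlab-ci.yml may implement the check differently.

```python
import os


def job_exit_code(external_task_result) -> int:
    """Exit code 0 only for "SUCCESS"; anything else (including unset) fails the job."""
    return 0 if external_task_result == "SUCCESS" else 1


def main() -> int:
    # EXTERNAL_TASK_RESULT is supplied as a job variable by the run job API call.
    result = os.environ.get("EXTERNAL_TASK_RESULT")
    print(f"External task result: {result}")
    return job_exit_code(result)
```

A job script would finish with `sys.exit(main())`, so a non-zero exit code marks the job, and therefore the pipeline, as failed.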

Starting the external task using AWS Lambda

The Step Function state machine (state name “Mock start async external task”) uses a Lambda function to start the external task. The method of starting an external task will depend on your own particular external task setup, but generally this is likely to be an HTTP request to an API endpoint.

For this example, a Python Lambda function is created to start a mock external task (which only logs 'task starting' messages to CloudWatch Logs). You can find the example Python code here.

External task parameters that are defined in the external_task_params.json file are passed to the Lambda function and accessible in the invoke event:

external_task_params = event["execution_input"]["external_task_params"]
logger.info("external_task_params: " + json.dumps(external_task_params))
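If your external task is started with an HTTP request, the handler could look something like the sketch below. It uses only the Python standard library. The `start_task_url` field is a hypothetical addition to `external_task_params` (the sample input does not define it), and forwarding only the `example_param_*` fields is likewise an assumption.

```python
import json
import urllib.request


def build_start_request(event: dict) -> urllib.request.Request:
    """Build a POST request that forwards the example parameters to the external task."""
    params = event["execution_input"]["external_task_params"]
    # Forward only the example_param_* fields as the request body (an assumption).
    body = {k: v for k, v in params.items() if k.startswith("example_param_")}
    return urllib.request.Request(
        params["start_task_url"],  # hypothetical field, not part of the sample input
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def lambda_handler(event: dict, _context):
    """Start the external task; HTTP errors raise and fail the state."""
    request = build_start_request(event)
    with urllib.request.urlopen(request, timeout=30) as response:
        return {"http_status": response.status}
```

Because `urlopen` raises on 4xx and 5xx responses, a failed start surfaces as a Lambda error, which the state machine can route to the fail callback.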

Other notes:

AWS Lambda Powertools logger is used to help structure the log messages. AWS Lambda Powertools is available as a Lambda Layer (see this).

Checking the external task status using AWS Lambda

The Step Function state machine (state name “Poll external task status”) uses a Lambda function to check the status of an external task. Similar to starting an external task, the method to check the status will depend on your own particular external task setup. The task status is likely available through an HTTP request.

For this example, a static website hosted on Amazon S3 is used as a mock external task status page. The status page URL is set in the external_task_params.json file, for example:

{
    "external_task_params": {
        "status_check_url": "http://mock-external-task-status-0wp4znps.s3-website-ap-southeast-2.amazonaws.com/status.json"
    }
}

A Python runtime Lambda function makes an HTTP request to the status page URL which returns a JSON response such as below:

{
    "taskName": "mock-external-task",
    "status": "COMPLETE"
}

Note: Using the Amazon S3 static website for mocking purposes, we can easily change the status page response by uploading the desired response content file to the S3 object status.json.
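As one way to change the mock response, the sketch below uploads a new status.json with boto3. The helper names are hypothetical, and it assumes your credentials are allowed to write to the bucket backing the static website.

```python
import json


def status_object(status: str, task_name: str = "mock-external-task") -> dict:
    """Build the put_object arguments for the mock status page."""
    return {
        "Key": "status.json",
        "Body": json.dumps({"taskName": task_name, "status": status}).encode("utf-8"),
        "ContentType": "application/json",
    }


def set_mock_status(bucket: str, status: str) -> None:
    """Overwrite status.json in the given bucket with the desired status."""
    import boto3  # imported here so status_object can be used without boto3 installed

    boto3.client("s3").put_object(Bucket=bucket, **status_object(status))
```

For example, `set_mock_status("your-status-bucket", "IN_PROGRESS")` before starting the pipeline, then `set_mock_status("your-status-bucket", "COMPLETE")` to simulate completion.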

Below is a snippet of the Lambda function Python code. You can find the full code example here.

import requests
from aws_lambda_powertools import Logger

logger = Logger()


def lambda_handler(event: dict, _context):
    """Check the external task status page and return its JSON response."""
    status_check_url = event["execution_input"]["external_task_params"]["status_check_url"]
    logger.info(f"Check external task status at: {status_check_url}")

    # Raise on any 4xx/5xx response so the error fails this state
    response = requests.get(status_check_url, timeout=30)
    response.raise_for_status()
    logger.info(f"Check external task status response: {response.status_code}")

    # The Choice state reads the "status" field, so fail fast if it is missing
    task_status = response.json()
    if "status" not in task_status:
        raise ValueError("Response is missing status")

    return task_status

The Lambda function outputs the status page response which is used as input by the next state (Choice state “External task COMPLETE?”) in the Step Function state machine. This in turn controls the loop back for polling the external task status which can result in invoking the Lambda function multiple times.
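To make the loop behaviour concrete, here is a small local simulation of the Poll, Choice and Wait cycle. It is only a sketch of the routing logic described in this post: the `poll` callable stands in for the Lambda function, no real waiting happens, and the `max_polls` guard is an addition for the simulation that the state machine itself does not have.

```python
from typing import Callable, Tuple


def next_state(status: str) -> str:
    """Mirror the Choice state: COMPLETE and FAILED exit the loop, anything else waits."""
    if status == "COMPLETE":
        return "Callback GitLab Success"
    if status == "FAILED":
        return "Callback GitLab Fail"
    return "Wait X Seconds"


def run_poll_loop(poll: Callable[[], dict], max_polls: int = 1000) -> Tuple[str, int]:
    """Poll until the loop exits; return the exit state and the number of polls made."""
    for polls in range(1, max_polls + 1):
        state = next_state(poll()["status"])
        if state != "Wait X Seconds":
            return state, polls
    raise TimeoutError(f"No final status after {max_polls} polls")


# Simulated status page responses: two in-progress checks, then completion.
responses = iter([{"status": "IN_PROGRESS"}, {"status": "IN_PROGRESS"}, {"status": "COMPLETE"}])
print(run_poll_loop(lambda: next(responses)))  # → ('Callback GitLab Success', 3)
```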

Other notes:

AWS Lambda Powertools logger is used to help structure the log messages. AWS Lambda Powertools is available as a Lambda Layer (see this).

Callback GitLab CI/CD pipeline with the external task result using AWS Lambda

The Step Function state machine (state names “Callback GitLab Success” and “Callback GitLab Fail”) uses a Lambda function to call back the GitLab CI/CD pipeline with the result of the external task (SUCCESS/FAIL).

The call back to GitLab CI/CD pipeline action is a manual run of the job sfn-finish-callback. This example uses a Python runtime Lambda function that calls GitLab REST API to achieve this. You can find the example code here.

The Lambda function first lists the pipeline jobs using the GitLab REST API resource below (refer to the GitLab API documentation for details on listing pipeline jobs):

GET /projects/:id/pipelines/:pipeline_id/jobs

This call allows us to determine the job ID (callback_job_id) for a given callback_job_name, project_id and pipeline_id in the Lambda invoke event.

def lambda_handler(event: dict, _context):
    """Main Lambda handler"""
    callback_message = event.get("callback_message", "SUCCESS")
    callback_job_name = event["execution_input"]["gitlab_cicd"]["callback_job_name"]
    project_id = event["execution_input"]["gitlab_cicd"]["project_id"]
    pipeline_id = event["execution_input"]["gitlab_cicd"]["pipeline_id"]

    # Get the list of pipeline job details
    url = f"{GITLAB_API_BASE_URL}/projects/{project_id}/pipelines/{pipeline_id}/jobs"
    headers = {"PRIVATE-TOKEN": GITLAB_TOKEN}
    logger.info(f"Listing pipeline jobs: GET {url}")
    response = requests.get(url, headers=headers, timeout=GITLAB_API_TIMEOUT)
    if response.status_code != 200:
        response.raise_for_status()
    pipeline_jobs = response.json()

    # Get the callback job ID
    callback_job_id = None
    for job in pipeline_jobs:
        if job["name"] == callback_job_name:
            callback_job_id = job["id"]
            logger.info(f"Found callback job '{callback_job_name}' ID: {callback_job_id}")
            break
    if callback_job_id is None:
        raise ValueError(f"Callback job '{callback_job_name}' not found")
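One caveat with the lookup above: GitLab list endpoints are paginated (20 items per page by default), so in a pipeline with many jobs the callback job may not be on the first page. The sketch below shows one way to search across pages. `fetch_page` is a hypothetical callable that returns the list of jobs for a given page number, and this helper is not part of the sample project.

```python
from typing import Callable, List, Optional


def find_job_id(fetch_page: Callable[[int], List[dict]], job_name: str) -> Optional[int]:
    """Search each page of pipeline jobs until the named job is found or pages run out."""
    page = 1
    while True:
        jobs = fetch_page(page)
        if not jobs:
            return None  # an empty page means there are no more jobs to check
        for job in jobs:
            if job["name"] == job_name:
                return job["id"]
        page += 1
```

With requests, `fetch_page` could be `lambda page: requests.get(url, headers=headers, params={"page": page, "per_page": 100}, timeout=GITLAB_API_TIMEOUT).json()`, reusing the `url` and `headers` from the listing call.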

We can then use the job ID in the next GitLab REST API call to start/run the manual job:

See: https://docs.gitlab.com/api/jobs/#run-a-job

POST /projects/:id/jobs/:job_id/play

Below is the snippet of code the Lambda function uses to perform the run job API call:

def lambda_handler(event: dict, _context):
    # ...code omitted for brevity...

    # Start the callback job
    url = f"{GITLAB_API_BASE_URL}/projects/{project_id}/jobs/{callback_job_id}/play"
    headers = {"PRIVATE-TOKEN": GITLAB_TOKEN, "Content-Type": "application/json"}
    data = {
        "job_variables_attributes": [{"key": "EXTERNAL_TASK_RESULT", "value": callback_message}]
    }
    response = requests.post(url, headers=headers, json=data, timeout=GITLAB_API_TIMEOUT)
    if response.status_code != 200:
        response.raise_for_status()
    logger.info(
        f"Successfully started callback job '{callback_job_name}' ID '{callback_job_id}' "
        f"with EXTERNAL_TASK_RESULT '{callback_message}'"
    )
    return

The job variable EXTERNAL_TASK_RESULT is provided in the run job API call. The job sfn-finish-callback will use this variable to set the “Passed” or “Failed” outcome of the CICD pipeline.

Other notes:

A GitLab API token in the header PRIVATE-TOKEN is used to perform authenticated API calls (see: https://docs.gitlab.com/api/rest/authentication/#personalprojectgroup-access-tokens)
- The GitLab API token is stored in and retrieved from an SSM Parameter secure string with the help of the AWS Lambda Powertools Parameters utility.
AWS Lambda Powertools logger is used to help structure the log messages. AWS Lambda Powertools is available as a Lambda Layer (see this).

A simpler solution without polling

So far we have covered a solution using AWS Step Functions with a polling workflow. This solution is for scenarios where you have limited capability to modify the external task to call back the CICD pipeline and report its task completion.

If you are capable of modifying your external task, then we can look at creating a simpler AWS Step Functions workflow which could look something like below:

This Step Function workflow is essentially the same as before (previously described here) with the polling component removed and one other key difference.

The other key difference is in the “Mock start async external task” state, which uses the AWS Step Functions feature of waiting for a callback with a task token. This allows the state to invoke the Lambda function to start the external task, then pause and remain on this state. The external task then needs to call back using either the SendTaskSuccess or SendTaskFailure AWS API call for the state machine execution to continue.
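The two halves of this pattern can be sketched as follows: a Task state that waits for a task token, and the call the external task makes to report its result. This is a minimal sketch, not the sample's .asl.json. The Lambda ARN, the `report_result` helper and the "ExternalTaskFailed" error name are assumptions, while the `task_token` payload key matches the Payload.task_token field mentioned later in this post.

```python
import json

# Task state that invokes the Lambda function and then pauses until a callback arrives.
WAIT_FOR_CALLBACK_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        # Hypothetical Lambda ARN, for illustration only.
        "FunctionName": "arn:aws:lambda:ap-southeast-2:123456789012:function:start-external-task",
        "Payload": {
            "execution_input.$": "$$.Execution.Input",
            "task_token.$": "$$.Task.Token",  # token the external task must send back
        },
    },
    "Next": "Callback GitLab Success",
}


def report_result(sfn_client, task_token: str, succeeded: bool, detail: str) -> None:
    """Resume the paused state machine from the external task side."""
    if succeeded:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"message": detail}),  # output must be a JSON string
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ExternalTaskFailed",  # assumed error name
            cause=detail,
        )
```

Here `sfn_client` would be `boto3.client("stepfunctions")`. Passing the client in keeps the helper easy to exercise with a stub.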

An example of this AWS Step Functions workflow .asl.json file is here.

At a high level, what we have here is a solution that asynchronously starts an external task with a callback to AWS Step Functions, which in turn calls back the GitLab CI/CD pipeline. Some advantages of this over having your external task directly call back the GitLab CI/CD pipeline are:

GitLab access credentials for the callback (i.e. access token for the GitLab REST API) do not need to be provided to the external task. Instead they are securely provided to the AWS Lambda functions using SSM Parameter store secure strings (alternatively, you could also use AWS Secrets Manager).
This Step Function state machine workflow can be expanded to orchestrate other tasks and activities with outputs that can be fed into your original external task.
The Step Function, along with its Lambda functions and other AWS resources, can be put into Infrastructure as Code (such as Terraform, covered below) for a repeatable deployment pattern.

Try it out: Terraform sample deployment

I have provided a Terraform-deployable sample of the AWS Step Functions and Lambda solution, as well as a sample .gitlab-ci.yml file to use with it. This section provides instructions on how to deploy it and try the solution yourself.

You can find the source code here: https://gitlab.com/freddyclhblog/gitlab-cicd-aws-step-functions-long-running-tasks

Prerequisites

Privileged/administrator access to an AWS account.
Create an OpenID Connect identity provider for GitLab in AWS IAM (see: https://docs.gitlab.com/ci/cloud_services/aws/#add-the-identity-provider). Note: the Terraform will reference this IAM OIDC provider to create an IAM role for the GitLab CI/CD pipeline.
A GitLab project to be set up with a CICD pipeline for the long-running external task (see: https://docs.gitlab.com/user/project/).
A GitLab personal access token (or project access token) with “api” scope (see: https://docs.gitlab.com/user/profile/personal_access_tokens/#create-a-personal-access-token).
A public status page for your external task. This can be a mock status page which could be hosted using Amazon S3 static website hosting. For an example status response see here.
Terraform installed (see: https://developer.hashicorp.com/terraform/install).
Familiarity using Terraform to deploy to AWS.
AWS CLI installed (see: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)

1 - Deploy with Terraform

Clone the sample project:

git clone https://gitlab.com/freddyclh/gitlab-cicd-aws-step-functions-long-running-tasks.git

Prepare a terraform variables file for deployment:

# Navigate to the project terraform directory
cd gitlab-cicd-aws-step-functions-long-running-tasks/terraform

# Copy the example terraform variables file
cp example.terraform.tfvars terraform.tfvars

In your created terraform.tfvars file, change the following variable values:

name - Provide a name for the resources that Terraform will deploy.
region - Provide the AWS region you wish to deploy AWS Step Functions and other resources to.
gitlab_iam_openid_connect_provider_arn - Provide the ARN of your GitLab IAM OIDC identity provider.
gitlab_project_paths - Add to the list your GitLab project and default branch (e.g. main) that will trigger a run of the CICD pipeline.
ssm_param_name_gitlab_cicd_access_token - Provide an SSM Parameter name which will be used to store the GitLab access token as a secure string.

Use Terraform to deploy the resources to your AWS account:

# A good idea to check your AWS credentials first
aws sts get-caller-identity

# Initialise terraform and deploy
terraform init
terraform apply

Terraform should return the following example outputs:

gitlab_cicd_role_arn = "arn:aws:iam::123456789012:role/gitlab-cicd-async-external-tasks-cicd-role"
sfn_state_machine_arns = {
  "callback_task_token" = "arn:aws:states:ap-southeast-2:123456789012:stateMachine:gitlab-cicd-async-external-tasks-callback-task-token"
  "polled" = "arn:aws:states:ap-southeast-2:123456789012:stateMachine:gitlab-cicd-async-external-tasks-polled"
}

2 - Store the GitLab access token in SSM parameter store

The terraform deployment should have created a new secure string parameter in SSM Parameter store with the name that you have provided in the variable ssm_param_name_gitlab_cicd_access_token.

Update the value of the parameter to be your GitLab access token.

You can update the parameter through the AWS Systems Manager console here: https://console.aws.amazon.com/systems-manager/parameters

For more details on how to do this, refer to AWS documentation here: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-paramstore-versions.html#sysman-paramstore-version-console

3 - Create the GitLab CI/CD pipeline

Copy the sample .gitlab-ci.yml and external_task_params.json to your GitLab project that will have the CICD pipeline.

# Assume working in a common directory containing your GitLab project directory
# and gitlab-cicd-aws-step-functions-long-running-tasks.

# File copy example (replace "your-gitlab-project")
cp -r gitlab-cicd-aws-step-functions-long-running-tasks/gitlab_ci/ your-gitlab-project

In your GitLab project directory, update the variables in .gitlab-ci.yml

AWS_ROLE_ARN - Set this to the value from Terraform output gitlab_cicd_role_arn
AWS_REGION - Set this to the AWS region you are using.
STATE_MACHINE_ARN:
- For the polling status page approach described here, set this to the value from Terraform output sfn_state_machine_arns.polled.
- For the callback task token approach described here, set this to the value from Terraform output sfn_state_machine_arns.callback_task_token

In your GitLab project directory, update the fields in external_task_params.json:

external_task_params.status_check_url - Set this to your own external task status page.
external_task_params.status_check_poll_wait_seconds - Set this to an interval time (in seconds) between checks to your external task status page.
You can add any additional fields you require under external_task_params.

Add, commit and push the file changes in your GitLab project.

4 - Running the CICD pipeline

Any pushes to the default branch should trigger the CICD pipeline to run and in turn AWS Step Functions to start the external task. You can also start the CICD pipeline in GitLab manually.

If you are trying the polling example (sfn_state_machine_arns.polled)…

To simulate an external task that is taking a while to complete, before you start the pipeline you should set your status page (in external_task_params.status_check_url) to an “in progress” response. For example:

{
    "taskName": "mock-external-task",
    "status": "IN_PROGRESS"
}

While the pipeline is running, to simulate the external task has completed, set your status page to a “complete” response. For example:

{
    "taskName": "mock-external-task",
    "status": "COMPLETE"
}

While the pipeline is running, to simulate the external task has failed, set your status page to a “failed” response. For example:

{
    "taskName": "mock-external-task",
    "status": "FAILED"
}

If you are trying the callback task token example (sfn_state_machine_arns.callback_task_token)…

The CICD pipeline will not end until the SendTaskSuccess or SendTaskFailure call is made. The task token can be found in the Mock start async external task state Task input, under Payload.task_token.

You can use the AWS CLI with the task token as argument to perform the call. For example:

TASK_TOKEN="<PUT TASK TOKEN HERE>"

# Send success
aws stepfunctions send-task-success \
    --task-token "$TASK_TOKEN" \
    --task-output '{"message": "Mock external task completed successfully"}'

# Or send fail
aws stepfunctions send-task-failure \
    --task-token "$TASK_TOKEN" \
    --cause 'External task failed'

Offboard GitLab CI/CD long running tasks with AWS Step Functions

Run external tasks asynchronously in your GitLab CI/CD pipeline and save on GitLab runner costs
