AWS Batch Example Project

Keywords: AWS Batch Example Project

Summary

This post documents the first experimental project I built after learning the basic concepts and features of AWS Batch; it also serves as a reference for my future AWS Batch projects. In this project we deliberately keep the business logic minimal while still making it representative of real work.

In this project we create a Container App that, given a Source S3 folder and a Target S3 folder as parameters, copies every file under Source to Target.

First, let's plan what we need to do:

  1. Create an ECR Repo, then package the app code into a container image.

  2. Create a Compute Environment.

  3. Create a Job Queue.

  4. Create a Job Definition that specifies our container image.

  5. Submit a Job to the Job Queue using that Job Definition; the queue then automatically finds an available Compute Environment to run the Job.


Prepare Container Image

First, we prepare the business code and the container image.

App Code

This app is very simple. It is implemented in Python, and requirements.txt defines its dependencies:

fire==0.4.0
pathlib_mate>=1.2.1,<3.0.0
s3pathlib>=2.0.1,<3.0.0
boto_session_manager>=1.5.3,<2.0.0

The app's source code, the main.py file:

# -*- coding: utf-8 -*-

from boto_session_manager import BotoSesManager
from s3pathlib import S3Path, context


def copy_s3_folder(
    bsm: BotoSesManager,
    s3dir_source: S3Path,
    s3dir_target: S3Path,
):
    """
    Core logic.
    """
    context.attach_boto_session(bsm.boto_ses)
    print(f"copy files from {s3dir_source.uri} to {s3dir_target.uri}")
    for s3path_source in s3dir_source.iter_objects():
        relpath = s3path_source.relative_to(s3dir_source)
        s3path_target = s3dir_target.joinpath(relpath)
        print(f"copy: {relpath.key}")
        s3path_source.copy_to(s3path_target, overwrite=True)


def main(
    region: str,
    s3uri_source: str,
    s3uri_target: str,
):
    """
    Wrapper around the core logic that exposes the parameters to the CLI.
    """
    print(f"received: region = {region!r}, s3uri_source = {s3uri_source!r}, s3uri_target = {s3uri_target!r}")
    copy_s3_folder(
        bsm=BotoSesManager(region_name=region),
        s3dir_source=S3Path(s3uri_source).to_dir(),
        s3dir_target=S3Path(s3uri_target).to_dir(),
    )


# convert the app to a CLI app.
if __name__ == "__main__":
    import fire

    fire.Fire(main)

The contents of the Dockerfile; the base image we use is the official Python image:

# this is public and open source
FROM public.ecr.aws/docker/library/python:3.9-alpine
# set working directory
WORKDIR /usr/src/app
# package application
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py ./
ENTRYPOINT ["python", "./main.py"]

If you want to run the app locally before building the container image:

# CD to where the main.py is
# create a virtualenv at .venv folder
virtualenv -p python3.9 .venv

# activate virtualenv
source .venv/bin/activate

# install dependencies
pip install -r requirements.txt

# try to run the CLI
python main.py --region ${aws_region} --s3uri_source s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/ --s3uri_target s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/

Create ECR Repository

  • Go to the AWS ECR Console

  • Click Create Repository

  • Visibility settings: choose Private

  • Repository name: aws-batch-example (this has to match the repo_name used by the cli script below)

  • Tag immutability: choose disabled, so that we can keep overwriting a specific tag (such as latest)

  • Leave everything else at the defaults (a boto3 equivalent is sketched after this list)
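
If you prefer scripting to console clicks, the same repository can be created with boto3. A minimal sketch; the profile name is an assumption borrowed from the cli script later in this post:

# -*- coding: utf-8 -*-

import boto3

# assumption: reuse the AWS profile configured for the ``cli`` script below
boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
ecr_client = boto_ses.client("ecr")

# a private repo with mutable tags, so the "latest" tag can be overwritten
res = ecr_client.create_repository(
    repositoryName="aws-batch-example",
    imageTagMutability="MUTABLE",
)
print(res["repository"]["repositoryUri"])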

Build and Publish Container Image

  • CD to the directory where the Dockerfile lives.

  • Run the ./ecr_login script to log Docker in to AWS ECR (its contents are shown below for reference). Remember to run chmod +x ecr_login first to make it executable. The script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This script automates docker login to AWS ECR.

Requirements:

- Python3.7+
- `fire>=0.1.3,<1.0.0 <https://pypi.org/project/fire/>`_
- make sure you have run ``chmod +x ecr_login`` to make this script executable

Usage:

.. code-block:: bash

    # show help info
    $ ./ecr_login -h

    # on local laptop use AWS cli profile
    $ ./ecr_login --aws-profile ${your_aws_profile}

    # on EC2, Cloud9, CloudShell
    $ ./ecr_login --aws-region ${your_aws_region}

    # if your boto session doesn't have sts:GetCallerIdentity permission
    # you have to explicitly provide AWS account ID
    $ ./ecr_login --aws-region ${your_aws_region} --aws-account-id ${your_aws_account_id}
"""

import typing as T
import base64
import subprocess

import boto3
import fire


def get_ecr_auth_token_v1(
    ecr_client,
    aws_account_id: str,
) -> str:
    """
    Get ECR auth token using boto3 SDK.
    """
    res = ecr_client.get_authorization_token(
        registryIds=[
            aws_account_id,
        ],
    )
    b64_token = res["authorizationData"][0]["authorizationToken"]
    # the decoded token has the form "AWS:${password}"
    user_pass = base64.b64decode(b64_token.encode("utf-8")).decode("utf-8")
    auth_token = user_pass.split(":")[1]
    return auth_token


def get_ecr_auth_token_v2(
    aws_region: str,
    aws_profile: T.Optional[str] = None,
) -> str:
    """
    Get ECR auth token using the AWS CLI.

    Note: ``aws ecr get-login`` only exists in AWS CLI v1; on AWS CLI v2
    use ``aws ecr get-login-password`` instead.
    """
    args = ["aws", "ecr", "get-login", "--region", aws_region, "--no-include-email"]
    if aws_profile is not None:
        args.extend(["--profile", aws_profile])
    response = subprocess.run(args, check=True, capture_output=True)
    text = response.stdout.decode("utf-8")
    auth_token = text.split(" ")[5]
    return auth_token


def docker_login(
    auth_token: str,
    registry_url: str,
) -> bool:
    """
    Login docker cli to AWS ECR.

    :return: a boolean flag to indicate if the login is successful.
    """
    # pipe the token into ``docker login`` via stdin to avoid exposing
    # it in the process arguments
    pipe = subprocess.Popen(["echo", auth_token], stdout=subprocess.PIPE)
    response = subprocess.run(
        ["docker", "login", "-u", "AWS", registry_url, "--password-stdin"],
        stdin=pipe.stdout,
        capture_output=True,
    )
    text = response.stdout.decode("utf-8")
    return "Login Succeeded" in text


def main(
    aws_profile: T.Optional[str] = None,
    aws_account_id: T.Optional[str] = None,
    aws_region: T.Optional[str] = None,
):
    """
    Login docker cli to AWS ECR using boto3 SDK and AWS CLI.

    :param aws_profile: specify the AWS profile you want to use to login.
        usually this parameter is used on a local laptop that has awscli
        installed and configured.
    :param aws_account_id: explicitly specify the AWS account id. if it is not
        given, it will use sts.get_caller_identity() to get the account id.
        you can use this to get the auth token for cross account access.
    :param aws_region: explicitly specify the AWS region for the boto3 session
        and ecr repo. usually you need to set this on EC2, ECS, Cloud9,
        CloudShell, Lambda, etc ...
    """
    boto_ses = boto3.session.Session(
        region_name=aws_region,
        profile_name=aws_profile,
    )
    ecr_client = boto_ses.client("ecr")
    if aws_account_id is None:
        sts_client = boto_ses.client("sts")
        res = sts_client.get_caller_identity()
        aws_account_id = res["Account"]

    print("get ecr auth token ...")
    auth_token = get_ecr_auth_token_v1(
        ecr_client=ecr_client,
        aws_account_id=aws_account_id,
    )
    if aws_region is None:
        aws_region = boto_ses.region_name
    print("docker login ...")
    flag = docker_login(
        auth_token=auth_token,
        registry_url=f"https://{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com",
    )
    if flag:
        print("login succeeded!")
    else:
        print("login failed!")


def run():
    fire.Fire(main)


if __name__ == "__main__":
    run()
  • Run ./cli build-image, ./cli test-image, and ./cli push-image to build, test, and publish the image respectively. Remember to run chmod +x cli first to make it executable. The script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This script can:

- build container image for AWS Batch
- push container image to AWS ECR
- test image locally

Requirements:

- update the "Your project configuration here" part at beginning of this script
- Python3.7+
- `fire>=0.1.3,<1.0.0 <https://pypi.org/project/fire/>`_
- `s3pathlib>=2.0.1,<3.0.0 <https://pypi.org/project/s3pathlib/>`_
- `boto_session_manager>=1.5.3,<2.0.0 <https://pypi.org/project/boto-session-manager/>`_
- make sure you have run ``chmod +x cli`` to make this script executable

Usage:

.. code-block:: bash

    # show help info
    $ ./cli -h

    # build image
    $ ./cli build-image

    # push image
    $ ./cli push-image

    # test image
    $ ./cli test-image
"""

import typing as T
import os
import subprocess
import contextlib
import dataclasses
from pathlib import Path

from s3pathlib import S3Path, context
from boto_session_manager import BotoSesManager

# ------------------------------------------------------------------------------
# Your project configuration here
aws_profile = "bmt_app_dev_us_east_1"
aws_region = "us-east-1"
repo_name = "aws-batch-example"
repo_tag = "latest"


# ------------------------------------------------------------------------------


@contextlib.contextmanager
def temp_cwd(path: T.Union[str, Path]):
    """
    Temporarily set the current working directory (CWD) and automatically
    switch back when it's done.

    Example:

    .. code-block:: python

        with temp_cwd(Path("/path/to/target/working/directory")):
            # do something
    """
    path = Path(path).absolute()
    if not path.is_dir():
        raise NotADirectoryError(f"{path} is not a dir!")
    cwd = os.getcwd()
    os.chdir(str(path))
    try:
        yield path
    finally:
        os.chdir(cwd)


@dataclasses.dataclass
class EcrContext:
    aws_account_id: str
    aws_region: str
    repo_name: str
    repo_tag: str
    path_dockerfile: Path

    @property
    def dir_dockerfile(self) -> Path:
        return self.path_dockerfile.parent

    @property
    def image_uri(self) -> str:
        return f"{self.aws_account_id}.dkr.ecr.{self.aws_region}.amazonaws.com/{self.repo_name}:{self.repo_tag}"

    def build_image(self):
        with temp_cwd(self.dir_dockerfile):
            args = ["docker", "build", "-t", self.image_uri, "."]
            subprocess.run(args, check=True)

    def push_image(self):
        with temp_cwd(self.dir_dockerfile):
            args = [
                "docker",
                "push",
                self.image_uri,
            ]
            subprocess.run(args, check=True)

    def test_image(self):
        # note: the container needs AWS credentials to talk to S3; on a
        # local laptop you may have to mount or inject credentials yourself
        with temp_cwd(dir_here):
            s3bucket = f"{bsm.aws_account_id}-{bsm.aws_region}-data"
            s3dir_source = S3Path(f"s3://{s3bucket}/projects/aws_batch_example/source/")
            s3dir_target = S3Path(f"s3://{s3bucket}/projects/aws_batch_example/target/")
            # reset the test folders and seed one file in the source
            s3dir_source.delete()
            s3dir_target.delete()
            s3dir_source.joinpath("test.txt").write_text("hello-world")
            print(f"preview source: {s3dir_source.console_url}")
            print(f"preview target: {s3dir_target.console_url}")

            args = [
                "docker",
                "run",
                "--rm",
                self.image_uri,
                "--region",
                self.aws_region,
                "--s3uri_source",
                s3dir_source.uri,
                "--s3uri_target",
                s3dir_target.uri,
            ]
            subprocess.run(args, check=True)


dir_here = Path(__file__).absolute().parent
path_dockerfile = dir_here.joinpath("Dockerfile")

# detect the runtime: CI, Cloud9, or local laptop
IS_LOCAL = False
IS_CI = False
IS_C9 = False
if "CI" in os.environ or "CODEBUILD_CI" in os.environ:
    IS_CI = True
elif "C9_USER" in os.environ:
    IS_C9 = True
else:
    IS_LOCAL = True

if IS_LOCAL:
    bsm = BotoSesManager(profile_name=aws_profile)
elif IS_CI:
    bsm = BotoSesManager(region_name=aws_region)
elif IS_C9:
    bsm = BotoSesManager(region_name=aws_region)
else:  # unreachable
    raise RuntimeError

context.attach_boto_session(bsm.boto_ses)

ecr_context = EcrContext(
    aws_account_id=bsm.aws_account_id,
    aws_region=aws_region,
    repo_name=repo_name,
    repo_tag=repo_tag,
    path_dockerfile=path_dockerfile,
)


class Main:
    def build_image(self):
        """
        Build the docker image.
        """
        ecr_context.build_image()

    def push_image(self):
        """
        Push the docker image to ECR.
        """
        ecr_context.push_image()

    def test_image(self):
        """
        Test the docker image.
        """
        ecr_context.test_image()


if __name__ == "__main__":
    import fire

    fire.Fire(Main)

Now that our container image is ready, we can start configuring our Batch Job.

Configuration

In this section we configure the Compute Environment, the Job Queue, and the Job Definition.

Compute Environment

First, let's configure the compute environment in the console; a boto3 equivalent is sketched after this list.

  • Step 1: Compute environment configuration
    • Compute environment configuration:
      • Platform: Fargate

      • Name: aws_batch_example

      • Service role: use the default AWSServiceRoleForBatch

  • Step 2: Instance configuration
    • Use Fargate Spot capacity: turn it on (to save cost)

    • Maximum vCPUs: 4 (make it small to save cost)

  • Step 3: Network configuration
    • VPC and Subnet and Security Group: use your default VPC, public subnet, default security group
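
For reference, here is a boto3 sketch of the same compute environment; the subnet and security group IDs are placeholders you have to replace with the values from your default VPC:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.create_compute_environment(
    computeEnvironmentName="aws_batch_example",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "FARGATE_SPOT",  # spot capacity to save cost
        "maxvCpus": 4,  # keep it small to save cost
        "subnets": ["subnet-11111111"],  # placeholder: your public subnet(s)
        "securityGroupIds": ["sg-11111111"],  # placeholder: default security group
    },
    # serviceRole is omitted so Batch falls back to the
    # AWSServiceRoleForBatch service-linked role, as in the console
)
print(res["computeEnvironmentArn"])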

Job Queue

Next, configure the Job Queue; again, a boto3 sketch follows the list.

  • Orchestration type: Fargate

  • Name: aws_batch_example

  • Scheduling policy Amazon Resource Name (optional): leave it empty

  • Connected compute environments: use the aws_batch_example you just created
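
The boto3 equivalent is short. This sketch assumes the aws_batch_example compute environment above already exists:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.create_job_queue(
    jobQueueName="aws_batch_example",
    state="ENABLED",
    priority=1,  # the API requires a priority; 1 is fine for a single queue
    computeEnvironmentOrder=[
        # name (or ARN) of the compute environment created above
        {"order": 1, "computeEnvironment": "aws_batch_example"},
    ],
)
print(res["jobQueueArn"])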

Job Definition

Finally, configure the Job Definition; a boto3 sketch follows the list.

  • Step 1: Job definition configuration
    • Orchestration type: Fargate

    • General configuration:
      • Name: aws_batch_example

      • Execution timeout: 60 (seconds)

      • Scheduling priority: leave it empty, this is for advanced scheduling

    • Fargate platform configuration
      • Fargate platform version: LATEST (default)

      • (IMPORTANT) Assign public IP: turn it on.

        If it is on, your task has outbound network access to the internet, so it can reach the ECR service endpoint to pull your image. If it is off, you have to ensure that your VPC has a NAT Gateway to route traffic to the internet (but that is expensive). If it is off and you don't have a NAT Gateway, then you cannot pull the container image from ECR. Alternatively, you can use an ECR VPC Endpoint to create a private connection between your VPC and the ECR service. See this discussion: https://repost.aws/knowledge-center/ecs-pull-container-api-error-ecr

      • Ephemeral storage: leave it at the default

      • Execution role: use the default; the execution role is what allows Fargate to pull the image from ECR and write logs to CloudWatch

      • Job attempts: 1

      • Retry strategy conditions: leave it empty

  • Step 2: Container configuration
    • Image: ${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com/aws-batch-example:latest

    • Command syntax:
      • JSON: ["--region","us-east-1","--s3uri_source","Ref::s3uri_source","--s3uri_target","Ref::s3uri_target"].

    • Parameters: add two parameters named s3uri_source and s3uri_target.

    • Environment configuration:
      • Job role configuration: pick an IAM role that grants the container read/write access to your S3 bucket; the app needs it to copy the objects

      • vCPUs: 0.25

      • Memory: 0.5 (GB)

  • Step 3 (optional): Linux and logging settings
    • leave everything empty
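
For reference, here is a boto3 sketch of the same job definition. The two role ARNs are placeholders, not values from this walkthrough: substitute an execution role that can pull from ECR and write logs, and a job role that can read and write your S3 bucket:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")
aws_account_id = boto_ses.client("sts").get_caller_identity()["Account"]
aws_region = "us-east-1"

res = batch_client.register_job_definition(
    jobDefinitionName="aws_batch_example",
    type="container",
    platformCapabilities=["FARGATE"],
    timeout={"attemptDurationSeconds": 60},
    # declare the two parameters referenced by Ref:: in the command;
    # the empty defaults are overridden at submit time
    parameters={"s3uri_source": "", "s3uri_target": ""},
    containerProperties={
        "image": f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/aws-batch-example:latest",
        "command": [
            "--region", "us-east-1",
            "--s3uri_source", "Ref::s3uri_source",
            "--s3uri_target", "Ref::s3uri_target",
        ],
        "resourceRequirements": [
            {"type": "VCPU", "value": "0.25"},
            {"type": "MEMORY", "value": "512"},  # 0.5 GB, expressed in MiB
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
        # placeholders: replace with your own roles
        "executionRoleArn": f"arn:aws:iam::{aws_account_id}:role/your-task-execution-role",
        "jobRoleArn": f"arn:aws:iam::{aws_account_id}:role/your-batch-job-role",
    },
)
print(res["jobDefinitionArn"])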

Test by Submitting a Job

Finally, we can submit a Job and run it.

  • Step 1: Job configuration
    • Name: aws_batch_example

    • Job definition: aws_batch_example:1

    • Job queue: aws_batch_example

  • Step 2 (optional): Overrides
    • Use the defaults for everything except:

    • Additional configuration -> Parameters: because we declared two parameters in the job definition, we have to give them values here.
      • s3uri_source: s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/

      • s3uri_target: s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/

Wait a few seconds and you will see the Job move from Submitted through Runnable and Starting to Running, and finally Succeeded. You can then see the output data in the target S3 folder.
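
The console submission above boils down to a single submit_job API call. A sketch, keeping the same ${...} placeholder convention used elsewhere in this post:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.submit_job(
    jobName="aws_batch_example",
    jobQueue="aws_batch_example",
    jobDefinition="aws_batch_example:1",
    parameters={
        # substituted into Ref::s3uri_source / Ref::s3uri_target in the command
        "s3uri_source": "s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/",
        "s3uri_target": "s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/",
    },
)
job_id = res["jobId"]
print(f"submitted job: {job_id}")

# check the status; it moves through SUBMITTED / RUNNABLE / STARTING /
# RUNNING and ends in SUCCEEDED or FAILED
status = batch_client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
print(f"status: {status}")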

Recap

To wrap up: on an AWS Batch project, most of the time goes into writing the business logic and building and testing the container image. That is exactly where the value of the Batch service lies: it lets you focus on the business logic. The remaining steps are mostly clicking through the Console and do not take much time. For an experimental project like this, doing the configuration by hand in the Console is fine; in a production project, however, we would manage this configuration with a tool like CloudFormation instead of clicking by hand.

As a next step, you may want to turn this experimental project into a reusable, enterprise-grade application with automated build, test, and deployment. That is where CI/CD tools come in; we will cover that architecture in the next post.