AWS Batch Example Project

Keywords: AWS Batch Example Project

Summary

This post documents the first experimental project I built after learning the basic concepts and features of AWS Batch; it also serves as a reference for my future AWS Batch projects. In this project we deliberately keep the business logic minimal while still making it representative of real work.

In this project we create a Container App that, given a Source S3 folder and a Target S3 folder as parameters, copies every file under Source to Target.

First, let's plan what we need to do:

  1. Create an ECR Repo, then package the app code into a container image.

  2. Create a Compute Environment.

  3. Create a Job Queue.

  4. Create a Job Definition that specifies our container image.

  5. Submit a Job to the Job Queue using that Job Definition; the queue then automatically finds an available Compute Environment to run the Job.


Prepare Container Image

First, we prepare the business code and the container image.

App Code

This app is very simple. It is implemented in Python, and requirements.txt defines its dependencies:

fire==0.4.0
pathlib_mate>=1.2.1,<3.0.0
s3pathlib>=2.0.1,<3.0.0
boto_session_manager>=1.5.3,<2.0.0

The app's source code, the main.py file:

# -*- coding: utf-8 -*-

from boto_session_manager import BotoSesManager
from s3pathlib import S3Path, context


def copy_s3_folder(
    bsm: BotoSesManager,
    s3dir_source: S3Path,
    s3dir_target: S3Path,
):
    """
    Core logic.
    """
    context.attach_boto_session(bsm.boto_ses)
    print(f"copy files from {s3dir_source.uri} to {s3dir_target.uri}")
    for s3path_source in s3dir_source.iter_objects():
        relpath = s3path_source.relative_to(s3dir_source)
        s3path_target = s3dir_target.joinpath(relpath)
        print(f"copy: {relpath.key}")
        s3path_source.copy_to(s3path_target, overwrite=True)


def main(
    region: str,
    s3uri_source: str,
    s3uri_target: str,
):
    """
    Wrapper around the core logic that exposes the parameters to the CLI.
    """
    print(f"received: region = {region!r}, s3uri_source = {s3uri_source!r}, s3uri_target = {s3uri_target!r}")
    copy_s3_folder(
        bsm=BotoSesManager(region_name=region),
        s3dir_source=S3Path(s3uri_source).to_dir(),
        s3dir_target=S3Path(s3uri_target).to_dir(),
    )


# convert the app to a CLI app.
if __name__ == "__main__":
    import fire

    fire.Fire(main)

The contents of the Dockerfile; the base image we use is the official Python image:

# this is public and open source
FROM public.ecr.aws/docker/library/python:3.9-alpine
# set working directory
WORKDIR /usr/src/app
# package application
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py ./
ENTRYPOINT ["python", "./main.py"]

If you want to run the app locally before building the container image:

# CD to where the main.py is
# create a virtualenv at .venv folder
virtualenv -p python3.9 .venv

# activate virtualenv
source .venv/bin/activate

# install dependencies
pip install -r requirements.txt

# try to run the CLI
python main.py --region ${aws_region} --s3uri_source s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/ --s3uri_target s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/

Create ECR Repository

  • Go to the AWS ECR Console

  • Click Create Repository

  • Visibility settings: choose Private

  • Repository name: aws-batch-example (this has to match the repo_name used by the cli script below)

  • Tag immutability: choose disabled, so that we can keep overwriting a specific tag (such as latest)

  • Leave everything else at the defaults (a boto3 equivalent is sketched after this list)
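
If you prefer scripting to console clicks, the same repository can be created with boto3. A minimal sketch; the profile name is an assumption borrowed from the cli script later in this post:

# -*- coding: utf-8 -*-

import boto3

# assumption: reuse the AWS profile configured for the ``cli`` script below
boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
ecr_client = boto_ses.client("ecr")

# a private repo with mutable tags, so the "latest" tag can be overwritten
res = ecr_client.create_repository(
    repositoryName="aws-batch-example",
    imageTagMutability="MUTABLE",
)
print(res["repository"]["repositoryUri"])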

Build and Publish Container Image

  • CD to the directory where the Dockerfile lives.

  • Run the ./ecr_login script to log Docker in to AWS ECR (its contents are shown below for reference). Remember to run chmod +x ecr_login first to make it executable. The script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This script automates docker login to AWS ECR.

Requirements:

- Python3.7+
- `fire>=0.1.3,<1.0.0 <https://pypi.org/project/fire/>`_
- make sure you have run ``chmod +x ecr_login`` to make this script executable

Usage:

.. code-block:: bash

    # show help info
    $ ./ecr_login -h

    # on local laptop use AWS cli profile
    $ ./ecr_login --aws-profile ${your_aws_profile}

    # on EC2, Cloud9, CloudShell
    $ ./ecr_login --aws-region ${your_aws_region}

    # if your boto session doesn't have sts:GetCallerIdentity permission
    # you have to explicitly provide AWS account ID
    $ ./ecr_login --aws-region ${your_aws_region} --aws-account-id ${your_aws_account_id}
"""

import typing as T
import base64
import subprocess

import boto3
import fire


def get_ecr_auth_token_v1(
    ecr_client,
    aws_account_id: str,
) -> str:
    """
    Get ECR auth token using boto3 SDK.
    """
    res = ecr_client.get_authorization_token(
        registryIds=[
            aws_account_id,
        ],
    )
    b64_token = res["authorizationData"][0]["authorizationToken"]
    # the decoded token has the form "AWS:${password}"
    user_pass = base64.b64decode(b64_token.encode("utf-8")).decode("utf-8")
    auth_token = user_pass.split(":")[1]
    return auth_token


def get_ecr_auth_token_v2(
    aws_region: str,
    aws_profile: T.Optional[str] = None,
) -> str:
    """
    Get ECR auth token using the AWS CLI.

    Note: ``aws ecr get-login`` only exists in AWS CLI v1; on AWS CLI v2
    use ``aws ecr get-login-password`` instead.
    """
    args = ["aws", "ecr", "get-login", "--region", aws_region, "--no-include-email"]
    if aws_profile is not None:
        args.extend(["--profile", aws_profile])
    response = subprocess.run(args, check=True, capture_output=True)
    text = response.stdout.decode("utf-8")
    auth_token = text.split(" ")[5]
    return auth_token


def docker_login(
    auth_token: str,
    registry_url: str,
) -> bool:
    """
    Login docker cli to AWS ECR.

    :return: a boolean flag to indicate if the login is successful.
    """
    # pipe the token into ``docker login`` via stdin to avoid exposing
    # it in the process arguments
    pipe = subprocess.Popen(["echo", auth_token], stdout=subprocess.PIPE)
    response = subprocess.run(
        ["docker", "login", "-u", "AWS", registry_url, "--password-stdin"],
        stdin=pipe.stdout,
        capture_output=True,
    )
    text = response.stdout.decode("utf-8")
    return "Login Succeeded" in text


def main(
    aws_profile: T.Optional[str] = None,
    aws_account_id: T.Optional[str] = None,
    aws_region: T.Optional[str] = None,
):
    """
    Login docker cli to AWS ECR using boto3 SDK and AWS CLI.

    :param aws_profile: specify the AWS profile you want to use to login.
        usually this parameter is used on a local laptop that has awscli
        installed and configured.
    :param aws_account_id: explicitly specify the AWS account id. if it is not
        given, it will use sts.get_caller_identity() to get the account id.
        you can use this to get the auth token for cross account access.
    :param aws_region: explicitly specify the AWS region for the boto3 session
        and ecr repo. usually you need to set this on EC2, ECS, Cloud9,
        CloudShell, Lambda, etc ...
    """
    boto_ses = boto3.session.Session(
        region_name=aws_region,
        profile_name=aws_profile,
    )
    ecr_client = boto_ses.client("ecr")
    if aws_account_id is None:
        sts_client = boto_ses.client("sts")
        res = sts_client.get_caller_identity()
        aws_account_id = res["Account"]

    print("get ecr auth token ...")
    auth_token = get_ecr_auth_token_v1(
        ecr_client=ecr_client,
        aws_account_id=aws_account_id,
    )
    if aws_region is None:
        aws_region = boto_ses.region_name
    print("docker login ...")
    flag = docker_login(
        auth_token=auth_token,
        registry_url=f"https://{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com",
    )
    if flag:
        print("login succeeded!")
    else:
        print("login failed!")


def run():
    fire.Fire(main)


if __name__ == "__main__":
    run()
  • Run ./cli build-image, ./cli test-image, and ./cli push-image to build, test, and publish the image respectively. Remember to run chmod +x cli first to make it executable. The script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This script can:

- build container image for AWS Batch
- push container image to AWS ECR
- test image locally

Requirements:

- update the "Your project configuration here" part at beginning of this script
- Python3.7+
- `fire>=0.1.3,<1.0.0 <https://pypi.org/project/fire/>`_
- `s3pathlib>=2.0.1,<3.0.0 <https://pypi.org/project/s3pathlib/>`_
- `boto_session_manager>=1.5.3,<2.0.0 <https://pypi.org/project/boto-session-manager/>`_
- make sure you have run ``chmod +x cli`` to make this script executable

Usage:

.. code-block:: bash

    # show help info
    $ ./cli -h

    # build image
    $ ./cli build-image

    # push image
    $ ./cli push-image

    # test image
    $ ./cli test-image
"""

import typing as T
import os
import subprocess
import contextlib
import dataclasses
from pathlib import Path

from s3pathlib import S3Path, context
from boto_session_manager import BotoSesManager

# ------------------------------------------------------------------------------
# Your project configuration here
aws_profile = "bmt_app_dev_us_east_1"
aws_region = "us-east-1"
repo_name = "aws-batch-example"
repo_tag = "latest"


# ------------------------------------------------------------------------------


@contextlib.contextmanager
def temp_cwd(path: T.Union[str, Path]):
    """
    Temporarily set the current working directory (CWD) and automatically
    switch back when it's done.

    Example:

    .. code-block:: python

        with temp_cwd(Path("/path/to/target/working/directory")):
            # do something
    """
    path = Path(path).absolute()
    if not path.is_dir():
        raise NotADirectoryError(f"{path} is not a dir!")
    cwd = os.getcwd()
    os.chdir(str(path))
    try:
        yield path
    finally:
        os.chdir(cwd)


@dataclasses.dataclass
class EcrContext:
    aws_account_id: str
    aws_region: str
    repo_name: str
    repo_tag: str
    path_dockerfile: Path

    @property
    def dir_dockerfile(self) -> Path:
        return self.path_dockerfile.parent

    @property
    def image_uri(self) -> str:
        return f"{self.aws_account_id}.dkr.ecr.{self.aws_region}.amazonaws.com/{self.repo_name}:{self.repo_tag}"

    def build_image(self):
        with temp_cwd(self.dir_dockerfile):
            args = ["docker", "build", "-t", self.image_uri, "."]
            subprocess.run(args, check=True)

    def push_image(self):
        with temp_cwd(self.dir_dockerfile):
            args = [
                "docker",
                "push",
                self.image_uri,
            ]
            subprocess.run(args, check=True)

    def test_image(self):
        # note: the container needs AWS credentials to talk to S3; on a
        # local laptop you may have to mount or inject credentials yourself
        with temp_cwd(dir_here):
            s3bucket = f"{bsm.aws_account_id}-{bsm.aws_region}-data"
            s3dir_source = S3Path(f"s3://{s3bucket}/projects/aws_batch_example/source/")
            s3dir_target = S3Path(f"s3://{s3bucket}/projects/aws_batch_example/target/")
            # reset the test folders and seed one file in the source
            s3dir_source.delete()
            s3dir_target.delete()
            s3dir_source.joinpath("test.txt").write_text("hello-world")
            print(f"preview source: {s3dir_source.console_url}")
            print(f"preview target: {s3dir_target.console_url}")

            args = [
                "docker",
                "run",
                "--rm",
                self.image_uri,
                "--region",
                self.aws_region,
                "--s3uri_source",
                s3dir_source.uri,
                "--s3uri_target",
                s3dir_target.uri,
            ]
            subprocess.run(args, check=True)


dir_here = Path(__file__).absolute().parent
path_dockerfile = dir_here.joinpath("Dockerfile")

# detect the runtime: CI, Cloud9, or local laptop
IS_LOCAL = False
IS_CI = False
IS_C9 = False
if "CI" in os.environ or "CODEBUILD_CI" in os.environ:
    IS_CI = True
elif "C9_USER" in os.environ:
    IS_C9 = True
else:
    IS_LOCAL = True

if IS_LOCAL:
    bsm = BotoSesManager(profile_name=aws_profile)
elif IS_CI:
    bsm = BotoSesManager(region_name=aws_region)
elif IS_C9:
    bsm = BotoSesManager(region_name=aws_region)
else:  # unreachable
    raise RuntimeError

context.attach_boto_session(bsm.boto_ses)

ecr_context = EcrContext(
    aws_account_id=bsm.aws_account_id,
    aws_region=aws_region,
    repo_name=repo_name,
    repo_tag=repo_tag,
    path_dockerfile=path_dockerfile,
)


class Main:
    def build_image(self):
        """
        Build the docker image.
        """
        ecr_context.build_image()

    def push_image(self):
        """
        Push the docker image to ECR.
        """
        ecr_context.push_image()

    def test_image(self):
        """
        Test the docker image.
        """
        ecr_context.test_image()


if __name__ == "__main__":
    import fire

    fire.Fire(Main)

Now that our container image is ready, we can start configuring our Batch Job.

Configuration

In this section we configure the Compute Environment, the Job Queue, and the Job Definition.

Compute Environment

First, let's configure the compute environment in the console; a boto3 equivalent is sketched after this list.

  • Step 1: Compute environment configuration
    • Compute environment configuration:
      • Platform: Fargate

      • Name: aws_batch_example

      • Service role: use the default AWSServiceRoleForBatch

  • Step 2: Instance configuration
    • Use Fargate Spot capacity: turn it on (to save cost)

    • Maximum vCPUs: 4 (make it small to save cost)

  • Step 3: Network configuration
    • VPC and Subnet and Security Group: use your default VPC, public subnet, default security group
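
For reference, here is a boto3 sketch of the same compute environment; the subnet and security group IDs are placeholders you have to replace with the values from your default VPC:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.create_compute_environment(
    computeEnvironmentName="aws_batch_example",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "FARGATE_SPOT",  # spot capacity to save cost
        "maxvCpus": 4,  # keep it small to save cost
        "subnets": ["subnet-11111111"],  # placeholder: your public subnet(s)
        "securityGroupIds": ["sg-11111111"],  # placeholder: default security group
    },
    # serviceRole is omitted so Batch falls back to the
    # AWSServiceRoleForBatch service-linked role, as in the console
)
print(res["computeEnvironmentArn"])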

Job Queue

Next, configure the Job Queue; again, a boto3 sketch follows the list.

  • Orchestration type: Fargate

  • Name: aws_batch_example

  • Scheduling policy Amazon Resource Name (optional): leave it empty

  • Connected compute environments: use the aws_batch_example you just created
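
The boto3 equivalent is short. This sketch assumes the aws_batch_example compute environment above already exists:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.create_job_queue(
    jobQueueName="aws_batch_example",
    state="ENABLED",
    priority=1,  # the API requires a priority; 1 is fine for a single queue
    computeEnvironmentOrder=[
        # name (or ARN) of the compute environment created above
        {"order": 1, "computeEnvironment": "aws_batch_example"},
    ],
)
print(res["jobQueueArn"])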

Job Definition

Finally, configure the Job Definition; a boto3 sketch follows the list.

  • Step 1: Job definition configuration
    • Orchestration type: Fargate

    • General configuration:
      • Name: aws_batch_example

      • Execution timeout: 60 (seconds)

      • Scheduling priority: leave it empty, this is for advanced scheduling

    • Fargate platform configuration
      • Fargate platform version: LATEST (default)

      • (IMPORTANT) Assign public IP: turn it on.

        If it is on, your task has outbound network access to the internet, so it can reach the ECR service endpoint to pull your image. If it is off, you have to ensure that your VPC has a NAT Gateway to route traffic to the internet (but that is expensive). If it is off and you don't have a NAT Gateway, then you cannot pull the container image from ECR. Alternatively, you can use an ECR VPC Endpoint to create a private connection between your VPC and the ECR service. See this discussion: https://repost.aws/knowledge-center/ecs-pull-container-api-error-ecr

      • Ephemeral storage: leave it at the default

      • Execution role: use the default; the execution role is what allows Fargate to pull the image from ECR and write logs to CloudWatch

      • Job attempts: 1

      • Retry strategy conditions: leave it empty

  • Step 2: Container configuration
    • Image: ${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com/aws-batch-example:latest

    • Command syntax:
      • JSON: ["--region","us-east-1","--s3uri_source","Ref::s3uri_source","--s3uri_target","Ref::s3uri_target"].

    • Parameters: add two parameters named s3uri_source and s3uri_target.

    • Environment configuration:
      • Job role configuration: pick an IAM role that grants the container read/write access to your S3 bucket; the app needs it to copy the objects

      • vCPUs: 0.25

      • Memory: 0.5 (GB)

  • Step 3 (optional): Linux and logging settings
    • leave everything empty
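
For reference, here is a boto3 sketch of the same job definition. The two role ARNs are placeholders, not values from this walkthrough: substitute an execution role that can pull from ECR and write logs, and a job role that can read and write your S3 bucket:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")
aws_account_id = boto_ses.client("sts").get_caller_identity()["Account"]
aws_region = "us-east-1"

res = batch_client.register_job_definition(
    jobDefinitionName="aws_batch_example",
    type="container",
    platformCapabilities=["FARGATE"],
    timeout={"attemptDurationSeconds": 60},
    # declare the two parameters referenced by Ref:: in the command;
    # the empty defaults are overridden at submit time
    parameters={"s3uri_source": "", "s3uri_target": ""},
    containerProperties={
        "image": f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/aws-batch-example:latest",
        "command": [
            "--region", "us-east-1",
            "--s3uri_source", "Ref::s3uri_source",
            "--s3uri_target", "Ref::s3uri_target",
        ],
        "resourceRequirements": [
            {"type": "VCPU", "value": "0.25"},
            {"type": "MEMORY", "value": "512"},  # 0.5 GB, expressed in MiB
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
        # placeholders: replace with your own roles
        "executionRoleArn": f"arn:aws:iam::{aws_account_id}:role/your-task-execution-role",
        "jobRoleArn": f"arn:aws:iam::{aws_account_id}:role/your-batch-job-role",
    },
)
print(res["jobDefinitionArn"])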

Test by Submitting a Job

Finally, we can submit a Job and run it.

  • Step 1: Job configuration
    • Name: aws_batch_example

    • Job definition: aws_batch_example:1

    • Job queue: aws_batch_example

  • Step 2 (optional): Overrides
    • Use the defaults for everything except:

    • Additional configuration -> Parameters: because we declared two parameters in the job definition, we have to give them values here.
      • s3uri_source: s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/

      • s3uri_target: s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/

Wait a few seconds and you will see the Job move from Submitted through Runnable and Starting to Running, and finally Succeeded. You can then see the output data in the target S3 folder.
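
The console submission above boils down to a single submit_job API call. A sketch, keeping the same ${...} placeholder convention used elsewhere in this post:

# -*- coding: utf-8 -*-

import boto3

boto_ses = boto3.session.Session(profile_name="bmt_app_dev_us_east_1")
batch_client = boto_ses.client("batch")

res = batch_client.submit_job(
    jobName="aws_batch_example",
    jobQueue="aws_batch_example",
    jobDefinition="aws_batch_example:1",
    parameters={
        # substituted into Ref::s3uri_source / Ref::s3uri_target in the command
        "s3uri_source": "s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/source/",
        "s3uri_target": "s3://${aws_account_id}-${aws_region}-data/projects/aws_batch_example/target/",
    },
)
job_id = res["jobId"]
print(f"submitted job: {job_id}")

# check the status; it moves through SUBMITTED / RUNNABLE / STARTING /
# RUNNING and ends in SUCCEEDED or FAILED
status = batch_client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
print(f"status: {status}")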

Recap

To wrap up: on an AWS Batch project, most of the time goes into writing the business logic and building and testing the container image. That is exactly where the value of the Batch service lies: it lets you focus on the business logic. The remaining steps are mostly clicking through the Console and do not take much time. For an experimental project like this, doing the configuration by hand in the Console is fine; in a production project, however, we would manage this configuration with a tool like CloudFormation instead of clicking by hand.

As a next step, you may want to turn this experimental project into a reusable, enterprise-grade application with automated build, test, and deployment. That is where CI/CD tools come in; we will cover that architecture in the next post.