A DoD client requested support with automated file transfers. The client has files placed in a common folder that can be accessed by the standard File Transfer Protocol (FTP). Given the FTP server's connection information, the client asked that the files be moved to an Amazon Web Services (AWS) S3 bucket that their analysis tools are configured to use.

Automating the download and upload process saves users time by allowing data files to be transferred on a schedule. This can be achieved with a combination of AWS Lambda and EC2 services. AWS Lambda provides a plethora of triggering and scheduling options and the power to create EC2 instances. By creating an EC2 instance, a program or script can avoid Lambda's limitations and perform programmatic tasks such as downloading and uploading. Additionally, this can all be done with Terraform to allow for deployment in any AWS environment.

Writing a Script to Do the Work

Before turning to Terraform or the AWS Console, create a script that can log in to the FTP server, fetch/download files, and copy them to an S3 bucket. This can be done effectively with Python's built-in ftplib and the AWS boto3 API library. There are various libraries and examples online showing how to set up a Python script to download files from an FTP server and use boto3 to copy them to S3.
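
As a starting point, a minimal sketch might look like the following. The function and variable names (transfer_file, the environment variables, the /tmp staging path) are illustrative assumptions, not the client's actual script.

import os
from ftplib import FTP

import boto3

# Connection details are assumed to come from environment variables
FTP_HOST = os.environ["FTP_HOST"]
FTP_USERNAME = os.environ["FTP_USERNAME"]
FTP_PASSWORD = os.environ["FTP_PASSWORD"]
S3_BUCKET_NAME = os.environ["S3_BUCKET_NAME"]


def transfer_file(filename, ftp_path, s3_path):
    """Download one file from the FTP server, then upload it to the S3 bucket."""
    local_path = f"/tmp/{filename}"  # stage the file on local disk
    with FTP(FTP_HOST) as ftp:
        ftp.login(FTP_USERNAME, FTP_PASSWORD)
        ftp.cwd(ftp_path)
        with open(local_path, "wb") as fh:
            ftp.retrbinary(f"RETR {filename}", fh.write)
    boto3.client("s3").upload_file(local_path, S3_BUCKET_NAME, f"{s3_path}/{filename}")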

When writing the script, consider that file size plays a significant role in how ftplib and boto3's copy functions behave. Anything over 5 GB cannot be sent to S3 in a single PUT; it needs to be pulled from the FTP server in chunks and uploaded with the AWS API's multipart upload methods.
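
One way to handle this, sketched below, is to let boto3's transfer configuration decide when to switch to multipart uploads; the helper name and the threshold and chunk sizes are illustrative values, not ones taken from the client's script.

from ftplib import FTP

import boto3
from boto3.s3.transfer import TransferConfig

# Use multipart uploads for anything over 100 MB, in 100 MB parts (illustrative values)
transfer_config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
)


def transfer_large_file(ftp: FTP, filename: str, bucket: str, key: str):
    local_path = f"/tmp/{filename}"
    with open(local_path, "wb") as fh:
        # Pull the file from the FTP server in 1 MB blocks
        ftp.retrbinary(f"RETR {filename}", fh.write, blocksize=1024 * 1024)
    # upload_file switches to the multipart API automatically above the threshold
    boto3.client("s3").upload_file(local_path, bucket, key, Config=transfer_config)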

Creating an Instance with Our Script Loaded

Amazon provides Amazon Machine Images (AMIs) to start up a basic instance. The provided Amazon Linux x86 AMI is a perfect starting place for creating a custom instance and, eventually, a custom AMI.

With Terraform, creating an instance is like creating any other module: it requires Identity and Access Management (IAM) permissions, security group settings, and other configuration. The following shows the items needed to make an EC2 instance with a key-pair, permissions to write to S3, Python 3.8 and its libraries installed, and the file-transfer script copied into the ec2-user directory.

First, generate a key-pair: a private key and a public key used to prove identity when connecting to an instance. The benefit of creating the key-pair in the AWS Console is access to the generated .pem file. Having a local copy allows connecting to the instance via the command line, which is great for debugging but not great for deployment. Instead, Terraform can generate a key-pair and store it in its state to avoid passing sensitive information around.

# Generate a ssh key that lives in terraform
# https://registry.terraform.io/providers/hashicorp/tls/latest/docs/resources/private_key
resource "tls_private_key" "instance_private_key" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "aws_key_pair" "instance_key_pair" {
  key_name   = "${var.key_name}"
  public_key = "${tls_private_key.instance_private_key.public_key_openssh}"

}

Next, set up the security group the instance will run in, opening the port for Secure Shell (SSH), which is also used by Secure Copy Protocol (SCP) to copy the script file(s) to the instance. A security group acts as a virtual firewall for your EC2 instances, controlling incoming and outgoing traffic. Open other ports for ingress and egress as needed, e.g., 443 for HTTPS traffic. The security group requires the vpc_id for your project; this is the Virtual Private Cloud (VPC) the instance will run in, so the security group should match your VPC settings.

resource "aws_security_group" "instance_sg" {
  name   = "allow-all-sg"
  vpc_id = "${var.vpc_id}"
…
  ingress {
    description = "ssh/scp port"
    cidr_blocks = ["0.0.0.0/0"]
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
  }
…
}

The IAM policy for the instance will require PutObject access to the S3 bucket. The Terraform module will need the S3 bucket name as a variable, and an instance profile must be created. When creating the IAM role in the AWS Console, an instance profile is created automatically, but in Terraform it has to be defined explicitly.

#iam instance profile setup
resource "aws_iam_role" "instance_s3_access_iam_role" {
  name               = "instance_s3_access_iam_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}
resource "aws_iam_policy" "iam_policy_for_ftp_to_s3_instance" {
  name = "ftp_to_s3_access_policy"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
          "s3:PutObject",
          "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::${var.s3_bucket}"
  },
}
EOF
}

resource "aws_iam_role_policy_attachment" "ftp_to_s3" {
  role       = aws_iam_role.instance_s3_access_iam_role.name
  policy_arn = aws_iam_policy.iam_policy_for_ftp_to_s3_instance.arn
}

resource "aws_iam_instance_profile" "ftp_to_s3_instance_profile" {
  name = "ftp_to_s3_instance_profile"
  role = "instance_s3_access_iam_role"
}

Defining the instance to start from, and to later build the custom AMI from, requires the following Terraform variables:

  • ami – the ID of the Amazon Linux x86 AMI
  • instance_type – the type of instance, i.e., t2.micro
  • subnet_id – the ID of the subnet, within your VPC, that the instance will run on
  • key_name – the name of the key; it should match the key-pair generated above or the one from the AWS Console (a variable reference works here too)

Define the connection and provisioner attributes to copy the Python transfer script to the ec2-user home folder. The connection uses the default ec2-user with the secure key and then copies over the Python file. If using the key downloaded from the AWS Console, point to it with private_key = "${file("path/to/key-pair-file.pem")}".

Complete the instance setup with the correct Python version and libraries. The user_data attribute sends a bash script to install whatever is needed: in this case, updating Python to 3.8 and installing the boto3 and paramiko libraries.

# Instance that we want to build out
resource "aws_instance" "ftp-to-s3-instance" {
  ami           = var.ami
  instance_type = var.instance_type
  subnet_id     = var.subnet_id
  key_name      = "${var.key_name}" # use your own key for testing
  vpc_security_group_ids = ["${aws_security_group.instance_sg.id}"]
  iam_instance_profile   = "${aws_iam_instance_profile.ftp_to_s3_instance_profile.id}"

  # Copies the python file to /home/ec2-user
  # depending on how the install of python works we may need to change this location
  connection {
    type        = "ssh"
    user        = "ec2-user"
    host        = self.public_ip
    private_key = "${tls_private_key.instance_private_key.private_key_pem}"
  }

  provisioner "file" {
    source      = "${path.module}/ftp_to_s3.py"
    destination = "/home/ec2-user/ftp_to_s3.py"
  }

  # Install Python 3.8 and the libraries the transfer script needs
  user_data = <<EOF
#!/bin/sh
sudo amazon-linux-extras install python3.8
python3.8 -m pip install -U pip
pip3.8 --version
pip3.8 install boto3
pip3.8 install paramiko
EOF
}

The last step is to create the custom AMI. This will allow our Lambda to duplicate the setup and make as many of these instances as needed.

resource "aws_ami_from_instance" "ftp-to-s3-ami" {
  name               = "ftp-to-s3_ami"
  description        = "ftp transfer to s3 bucket python 3.8 script"
  source_instance_id = "${aws_instance.ftp-to-s3-instance.id}"

  depends_on = [aws_instance.ftp-to-s3-instance]

  tags = {
    Name = "ftp-to-s3-ami"
  }
}

Creating Instances on the Fly in Lambda

Using a Lambda function that can be triggered in various ways is a straightforward way to invoke EC2 instances. The following Python code shows how to pass variables to an EC2 instance, both as environment variables inside the instance and as arguments to the Python script. The variables needed by the Python script for this example are as follows:

  • FTP_HOST – the URL of the FTP server
  • FTP_PATH – the path to the files on the FTP server
  • FTP_USERNAME, FTP_PASSWORD, FTP_AUTH_KEY – used for any authentication with the FTP server
  • S3_BUCKET_NAME – the name of the bucket for the files
  • S3_PATH – the folder or path in the S3 bucket the files should be copied to
  • files_to_download – for this purpose, a Python list of dictionary objects with the filename and size of each file to download

For this example, the logic for checking for duplicate files is done before the Lambda invokes the instance that does the transferring. This lets the script on the instance remain singularly focused on downloading and uploading. It is important to note that the files_to_download variable is converted to a string and the quotes are made into double quotes; otherwise, the single quotes are stripped when the value is passed to the EC2 instance.

The init_script variable uses the passed-in event variables to set up the environment variables and the Python script arguments. Just as when creating the instance, the user_data script is run by the instance's root user. The root user needs to use the ec2-user's Python to run our script, with the following bash command: PYTHONUSERBASE=/home/ec2-user/.local python3.8 /home/ec2-user/ftp_to_s3.py {S3_PATH} {files_to_download}.

    # convert to string with double quotes so it knows it's a string
    files_to_download = ",".join(map('"{0}"'.format, files_to_download))
    vars = {
        "FTP_HOST": event["ftp_url"],
        "FTP_PATH": event["ftp_path"],
        "FTP_USERNAME": event["username"],
        "FTP_PASSWORD": event["password"],
        "FTP_AUTH_KEY": event["auth_key"],
        "S3_BUCKET_NAME": event["s3_bucket"],
        "files_to_download": files_to_download,
        "S3_PATH": event["s3_path"],
    }
    print(vars)

    init_script = """#!/bin/bash
                /bin/echo "**************************"
                /bin/echo "* Running FTP to S3.     *"
                /bin/echo "**************************"
                export S3_BUCKET_NAME={S3_BUCKET_NAME}
                export FTP_HOST={FTP_HOST}
                export FTP_PATH={FTP_PATH}
                export FTP_USERNAME={FTP_USERNAME}
                export FTP_PASSWORD={FTP_PASSWORD}
                PYTHONUSERBASE=/home/ec2-user/.local python3.8 /home/ec2-user/ftp_to_s3.py {S3_PATH} {files_to_download}
                shutdown now -h""".format(
        **vars
    )

Invoke the instance with the boto3 library, providing the parameters for the custom image AMI, instance type, key-pair, subnet, and instance profile, all passed in from Terraform as environment variables. Optionally, raise the volume size from the default 8 GB to 50 GB for larger files.

# AMI, INSTANCE_TYPE, KEY_NAME, SUBNET_ID, and INSTANCE_PROFILE come from the
# Lambda's environment variables; ec2 is a boto3 EC2 client, e.g. boto3.client("ec2")
instance = ec2.run_instances(
    ImageId=AMI,
    InstanceType=INSTANCE_TYPE,
    KeyName=KEY_NAME,
    SubnetId=SUBNET_ID,
    MaxCount=1,
    MinCount=1,
    InstanceInitiatedShutdownBehavior="terminate",  # instance terminates after the script calls shutdown
    UserData=init_script,
    IamInstanceProfile={"Arn": INSTANCE_PROFILE},
    BlockDeviceMappings=[{"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 50}}],
)

Conclusion

After deploying to AWS, Terraform will have created a Lambda that invokes an EC2 instance running the script passed to it during its creation. Triggering the Lambda function to invoke the custom instance can be done from a DynamoDB Stream update, a scheduled timer, or even another Lambda function, which provides flexibility in how and when the instance is called.
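
For example, another Lambda or script could kick off a transfer with a direct asynchronous invocation. The function name and payload values below are illustrative, but the payload fields match the event keys used earlier.

import json

import boto3

payload = {
    "ftp_url": "ftp.example.com",
    "ftp_path": "/outgoing",
    "username": "ftp_user",
    "password": "ftp_password",
    "auth_key": "",
    "s3_bucket": "my-data-bucket",
    "s3_path": "incoming",
    "files_to_download": [{"filename": "data_001.csv", "size": 1048576}],
}

boto3.client("lambda").invoke(
    FunctionName="ftp-to-s3-lambda",  # whatever name your Terraform assigns the Lambda
    InvocationType="Event",           # asynchronous, fire-and-forget
    Payload=json.dumps(payload).encode("utf-8"),
)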

Ultimately, this solution provides a flexible means of downloading files from an FTP server. Changes to the Lambda invoking the instance could include splitting the file list across several smaller instances running simultaneously, moving files to the AWS S3 bucket faster. Whether that is worthwhile depends on the client's needs and the cost of operating the AWS services.

Changes can also be made to the script downloading the files. One option would be to use a more robust FTP library than Python's built-in ftplib. Larger files may require more effort, as FTP servers can time out when network latency and file sizes come into play, and ftplib does not auto-reconnect or keep track of incomplete file downloads.
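
If sticking with ftplib, one mitigation is a small retry wrapper that resumes a partial download with the FTP REST command. The sketch below is an assumption about how that could look; the helper name and retry count are illustrative, and the FTP server must support resuming binary transfers.

import os
from ftplib import FTP, error_temp


def download_with_resume(host, user, password, filename, local_path, attempts=3):
    """Reconnect on failure and resume from however many bytes are already on disk."""
    for _ in range(attempts):
        try:
            with FTP(host, timeout=60) as ftp:
                ftp.login(user, password)
                offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
                with open(local_path, "ab") as fh:
                    # rest= issues a REST command so the server skips the bytes we already have
                    ftp.retrbinary(f"RETR {filename}", fh.write, rest=offset or None)
            return True
        except (error_temp, OSError):
            continue  # reconnect and pick up where the last attempt stopped
    return False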