As part of one of my projects, I was asked to research methods of transferring large amounts of data (> 1 Terabyte) between client-owned S3 buckets. Several suitable techniques are available. They include:

  • Running parallel uploads using the AWS command-line interface (CLI)
  • Using an AWS SDK (Software Development Kit)
  • Using cross-region or same-region replication
  • Using S3 batch operations
  • Using S3DistCp with Amazon EMR
  • Using AWS DataSync

Some of these methods, such as copying files and directories using the AWS CLI or an AWS SDK, also work well for datasets smaller than one terabyte (1TB). The content of this blog post was largely derived from this official AWS article on the subject.

Also, note that you will likely run into some issues, particularly around permissions on the source or destination bucket, when using the AWS CLI directly or an SDK such as Boto3 for Python. This article from AWS covers many potential failure states.

Running Parallel Uploads Using the AWS Command-line Interface (AWS CLI)

The simplest method for direct S3-to-S3 copying (apart from using the AWS Management Console for one-time manual transfers) is the AWS command-line interface. It works for small, individual files as well as much larger sets of files. The official AWS CLI reference index is here, specifically for AWS CLI S3 commands.

If I wanted to copy all files with the .jpg extension in a folder at the root of my source S3 bucket named “foo_imgs” to a folder named “foo_destination” in my destination S3 bucket, the command would be:

Root Source for S3 Bucket
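A sketch of that command, with source-bucket and destination-bucket as placeholder bucket names (the folder names come from the example):

    aws s3 cp s3://source-bucket/foo_imgs/ s3://destination-bucket/foo_destination/ \
        --recursive --exclude "*" --include "*.jpg"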

The basic format of the command is "aws s3 cp", followed by the source bucket and directory and then the destination bucket and directory. You include the "--recursive" flag whenever you have more than one file to copy; otherwise, the "cp" command only copies a single file.

If your project requires filtering for or against specific items, you can specify which folders and items to include by using the "--include" and "--exclude" parameters with each command. In the example above, "--exclude" first excludes all items in the source bucket, and then, crucially, "--include" re-includes all items with the .jpg extension. It may seem counter-intuitive, but using only the "--include" flag will not work because, by default, all files are already included; you would be including the files you are filtering for along with everything else. "--include" only re-includes files that "--exclude" has excluded, so if you never specify "--exclude", no filtering happens. You have to use "--include" and "--exclude" in tandem. Also, note that these parameters are processed on the client side, so your local system's resources will affect how well this method works.

To run parallel copy operations, you can open multiple terminal instances of the AWS CLI and run the "aws s3 cp" command in each, changing the "--include" and "--exclude" flags for each instance. Suppose you have a set of files in a directory whose names begin with 0 through 9. The first terminal instance would copy filenames that start with 0 through 4, the second terminal instance would copy those beginning with 5 through 9, and so on. Each copy operation uses a single thread in its terminal instance to copy the range of objects specified for it.

First terminal instance:

First terminal to copy the range of objects
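A sketch of the first command, using the same placeholder bucket and folder names as earlier:

    aws s3 cp s3://source-bucket/foo_imgs/ s3://destination-bucket/foo_destination/ \
        --recursive --exclude "*" \
        --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"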

Second terminal instance:

Terminal instance to copy filenames
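And a sketch of the second command, covering the remaining range:

    aws s3 cp s3://source-bucket/foo_imgs/ s3://destination-bucket/foo_destination/ \
        --recursive --exclude "*" \
        --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"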

Using an AWS SDK (Software Development Kit)

You can build a custom application using a software development kit to perform the data transfer. The idea is that a custom application may be more efficient, or easier to automate, when transferring at the scale of hundreds of millions of objects. For example, on our project, we used the AWS SDK for Python, Boto3.

Here is a sample Boto3 copy operation:

Boto3 Copy Operation
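A minimal sketch of such a copy, with placeholder bucket and key names:

    import boto3

    s3 = boto3.client("s3")

    # Managed copy: Boto3 automatically switches to multipart copy for large objects.
    copy_source = {"Bucket": "source-bucket", "Key": "foo_imgs/photo-001.jpg"}
    s3.copy(copy_source, "destination-bucket", "foo_destination/photo-001.jpg")

To copy many objects, you would list the source keys (for example, with a paginator over list_objects_v2) and call copy for each key, optionally from multiple threads or processes.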

Using Cross-Region Replication or Same-Region Replication

Setting up either Cross-Region replication or Same-Region replication on a source S3 bucket will automatically replicate new uploads from the source bucket to a destination bucket. You can filter which objects will be replicated using a prefix or a tag. One limitation of this method is that it only replicates new objects, so you would have to empty both buckets, set up Cross-Region or Same-Region replication, and then upload objects into your source S3 bucket to test this method out.

Using S3 batch operations

You can also use Amazon S3 Batch Operations to copy multiple objects with a single request. You should particularly consider this method over something like the "aws s3 cp" operation if your bucket contains more than 10,000,000 objects, although there are caveats to batch copying as well. You create a batch operations job in which you specify which objects to operate on using either an Amazon S3 inventory report or a CSV manifest file. Amazon S3 Batch Operations then calls the specified API operation on every object in the manifest. When the job is complete, you receive a notification, and you can choose to receive a completion report about the job.
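The CSV manifest itself is just a list of bucket,key pairs (optionally with a version ID). A hedged example with placeholder names:

    source-bucket,foo_imgs/photo-001.jpg
    source-bucket,foo_imgs/photo-002.jpg
    source-bucket,foo_imgs/photo-003.jpg

You upload the manifest to S3 and reference it when you create the copy job.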

Using S3DistCp with Amazon EMR

Here is an existing definition that is already perfectly succinct: “Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.”

There are occasions when you may want to copy S3 data to a Hadoop Distributed File System (HDFS) housed on an Amazon EMR cluster, or to move a large amount of data between S3 buckets or regions where the "aws s3 cp" operation is not feasible. Amazon EMR includes a utility called S3DistCp, an extension of DistCp, which is often used to move data within the Hadoop ecosystem. S3DistCp is optimized to work with Amazon S3 and adds several features.

S3DistCp performs parallel copying of large volumes of objects across Amazon S3 buckets. It first copies the files from the source bucket to worker nodes in an Amazon EMR cluster. S3DistCp supports several use cases, such as copying or moving files without transformation, changing file compression during the copy, and copying files incrementally. View a complete list here.
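Run as a step on an EMR cluster (or from the primary node), the earlier .jpg example might look like the following; the bucket names, and the existence of the cluster itself, are assumptions:

    s3-dist-cp --src s3://source-bucket/foo_imgs/ \
               --dest s3://destination-bucket/foo_destination/ \
               --srcPattern '.*\.jpg'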

Be aware, however, that an additional cost is incurred because you are using Amazon EMR. Here is a link to Amazon EMR’s pricing page, which includes a comparison table of Amazon EC2 on-demand price rates alongside Amazon EMR price rates. You pay a per-second rate for every second you use, with a minimum of one minute of usage.

Using AWS DataSync

Finally, you can also use AWS DataSync to move large file sets between buckets. View a numbered list of steps to do so here. As noted at the bottom of that documentation, you will be charged an additional fee for using AWS DataSync, at $0.0125 per gigabyte (GB). You can find more information and cost breakdowns for sample use cases here. Additionally, here is a link listing AWS DataSync quotas and limits.

Conclusion

The solutions listed in this article encompass many different use cases. My project used the first two methods: the AWS CLI and the AWS SDK for Python (Boto3). Although we didn't end up opting for the other four solutions, you may find that one of them fits your project's needs.

Can The Power Automate Process Advisor Be Used as a Time Study Tool?

Microsoft has given us many tools over the years with multiple applications. The Process Advisor, now available in commercial, GCC, GCC High, and DoD environments, is simple enough to understand and use but has many potential applications.

What is the Process Advisor?

The Process Advisor is a tool included within Power Automate. It records mouse clicks and keyboard keystrokes during a process. After you record a repetitive task, the recordings are uploaded and analyzed by the Process Advisor, which makes recommendations about how Power Automate can streamline the workflow.

Why is using a Process Advisor in Power Automate Important?

Automating processes saves time, and Process Advisor shows you the way. After recording a process, it will break down the steps and recommend, on a per-step basis, what could be automated with cloud flows, desktop flows, RPA, and even AI models.

What is a Time Study?

A time study is precisely what it sounds like: studying the time it takes to complete a task. The goal of a time study is to document the steps necessary to complete a task and how long those steps take, on average. This information is then used to make the process more efficient and reduce how long the task takes. There are many ways to conduct a time study, ranging from a stopwatch and paper to expensive time study software.

Once the information has been gathered, you must consolidate, clean, and analyze it to identify potential improvements. The good news is that the Process Advisor uses AI to do all of this for you.

Challenges and Considerations

One of the major considerations whenever using the Process Advisor is scope; in other words, which steps are going to be included when you record the task. For example, if you have a process to receive requests from customers through your website, email, and physical mailings, you may wish to scope your recordings to include all three or prioritize one.

Benefits of using a Time Study with Process Advisor

There are two tangible benefits of using the Process Advisor to conduct a time study. The first is the time saved by receiving your results in real time versus doing all the work to manually consolidate the data and turn it into actionable information. The second is the immediate recommendations that Process Advisor gives you about how to leverage individual pieces of the Power Automate offerings to automate the process.

How Do I Get Started?

Install Power Automate Desktop here: https://docs.microsoft.com/en-us/power-automate/desktop-flows/install and follow the installation setup instructions.

After installation, open Power Automate Desktop. Create a new flow, name it descriptively, and open the Desktop Recorder or Web Recorder, depending on where the process resides.

Power Automate Desktop

Click Record, complete the task, then click Finish. Do this as many times as you would like (a minimum of two recordings is required for the next step). Having multiple people perform the same actions is recommended so the tool has more information to learn from.

After recording several examples of the same task being completed, have everyone upload their individual recordings to a shared area inside the Process Advisor, which can be found in the Power Automate portal, here:

Shared Area inside Process Advisor

Here you can click Create to add a new process and share it with your team. This gives them a single place to upload their recordings for analysis and keeps things organized.

After you have enough recordings uploaded to analyze (the minimum requirement is two recordings), you can select Processes to begin the analysis. Here you can see everything that has been submitted for analysis, who submitted it, and what the status is. Select Analyze at the top and all recordings will be analyzed at once; this may take a few minutes.

Analyze recordings

Once it successfully analyzes your recorded processes, select the Analytics button.

Analyze and record

Remember all those steps after completing your data gathering? The Process Advisor has already completed them. A flow chart has been created showing you the variance in the process. Times have been assigned to steps to show how long actions take. Actions that take longer have been color-coded with a red highlight to draw attention. Also, filters can be applied down the right side of the screen to dive into your data and really focus on what the major culprits are for increasing the length of time a task takes.

Apply Filters to Dive into Data

This works wonders for saving you the time of cleaning and organizing data after the initial gathering phase of your time and motion study. But Process Advisor doesn't stop there. At the top of the screen, there's another helpful button.

Preview Automated activities

This will launch another window with context from your process recordings. It will then take you to a new instance of Power Automate, where it recommends parts of these recordings that Power Automate can handle for you. In short, the Power Automate Process Advisor drastically reduces the time required to collect the data, submit it, analyze it, and turn it into recommendations for process improvements.

Outcomes

The results you receive are nearly instant and take the guesswork out of how to automate your current process. Using Power Automate to put computing power behind repetitive tasks will save you time and money. Process Advisor will get you there faster, and using it as a time study tool has the added benefit of prioritizing the biggest wins in terms of time savings.

RPA-in-a-Day

Microsoft has built a training that demonstrates exactly how to use the Process Advisor, coupled with AI, to its fullest potential. Within this training, an invoice process is analyzed. The bottleneck is identified when invoices arrive via email, and the tool then recommends an RPA process to automate that task.

Contact us to schedule your own RPA-in-a-day!

We know that most software applications, whether desktop, mobile, or web, require a database on the backend to store data. Current software applications are highly complex and have a high frequency of data transactions, so we need to test the data stored in and retrieved from the database to make sure it has proper data integrity. Any database operation performed by the application is always one of four: Create, Retrieve, Update, or Delete.

We were required to do database unit testing in one of our recent client application implementations because the application is mainly data-centric. The decisive reason to push for database unit testing was to ensure the application maintains data integrity, so we developed database test cases that include checks for data integrity and confirm business rules.

For database unit testing, we have the following frameworks that can be used:

  • Microsoft SQL Server Data Tools (Using Visual Studio)
  • tSQLt Unit Testing Framework

This blog post gives a high-level look at implementing database unit testing and demonstrates how to create test cases that use tSQLt features against the AdventureWorks sample database. We chose the tSQLt database unit testing framework for Azure SQL Database. tSQLt allows us to create isolated test cases defined with the data we need; each test case runs in its own transaction.

A tSQLt unit test case is based on the AAA rule, which consists of three steps: Arrange (set up the schema, fake tables, and test data the case needs), Act (run the code under test), and Assert (verify that the actual result matches the expected result).

Step One: tSQLt Environment Setup

The first step is to install tSQLt in your database.

  • Setup includes a set of objects (tables, stored procedures, functions, and more) that you add to the database you want to test. Download “tSQLt_V1.0.5873.27393.zip” from the tSQLt site, and unzip the file.
  • Run the “tSQLt.class.sql” script against the database. You can find the script in the zip file that you downloaded from the tSQLt site.
    Once the run is successful, you will see a tSQLt schema containing tables, views, stored procedures, and user-defined functions. These are the components that do all the processing when creating and running test cases against your database. As a best practice, we keep the objects created by "tSQLt.class.sql", along with our unit test cases, in the development database only and do not move them to other environments.

Object Explorer

Before you create a test case, you need to create a test class where the test case will be located. A test class is a schema configured with an extended property that tells tSQLt that it is a test class. To create the test class, you use the NewTestClass stored procedure, which is part of the tSQLt schema.

NewTestClass Stored Procedure
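In T-SQL, the call is a single statement:

    EXEC tSQLt.NewTestClass 'TestPerson';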

The EXEC statement creates a test class named TestPerson. Once we’ve completed this step, we can add one or more test cases to the test class, so let’s get started doing that.

Step Two: Testing the [Person].[GetAddressByCity] Stored Procedure

In tSQLt, a test case is a stored procedure that’s part of a test class and uses tSQLt elements to perform the testing. We can develop and test stored procedures and functions in a database.

As part of developing the tSQLt test case, we'll create a test case for a stored procedure, [Person].[GetAddressByCity], which returns address details for a given city.

Creating a Test Procedure by City
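The procedure's exact definition isn't reproduced in this post, but a minimal sketch, assuming it simply filters Person.Address by city, could look like this:

    CREATE OR ALTER PROCEDURE [Person].[GetAddressByCity]
        @City NVARCHAR(30)
    AS
    BEGIN
        SELECT AddressLine1, City, PostalCode
        FROM Person.Address
        WHERE City = @City;
    END;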

Use the CREATE PROCEDURE statement to create a test case. The procedure name must start with the word “test” and be created in an existing test class; otherwise, making the test case is much like creating any other procedure. The following T-SQL script creates a test case named “TestGetAddressByCitySuccess” in the “TestPerson” test class:

Follow tSQL Script to Create Test Case
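The script isn't shown here either, so the following is a hedged reconstruction based on the steps described below; the column list and test data values are illustrative:

    CREATE OR ALTER PROCEDURE [TestPerson].[TestGetAddressByCitySuccess]
    AS
    BEGIN
        -- Arrange: replace Person.Address with an empty, constraint-free fake
        EXEC tSQLt.FakeTable 'Person.Address';

        INSERT INTO Person.Address (AddressLine1, City, PostalCode)
        VALUES ('100 Main St', 'Bothell', '98011'),
               ('200 Oak Ave', 'Seattle', '98101');

        -- Expected result for the city under test
        SELECT AddressLine1, City, PostalCode
        INTO #Expected
        FROM Person.Address
        WHERE 1 = 0;            -- copy the schema only

        INSERT INTO #Expected (AddressLine1, City, PostalCode)
        VALUES ('100 Main St', 'Bothell', '98011');

        -- Act: run the procedure and capture its output
        SELECT AddressLine1, City, PostalCode
        INTO #Actual
        FROM #Expected
        WHERE 1 = 0;

        INSERT INTO #Actual (AddressLine1, City, PostalCode)
        EXEC [Person].[GetAddressByCity] @City = 'Bothell';

        -- Assert: the actual data must match the expected data
        EXEC tSQLt.AssertEqualsTable '#Expected', '#Actual';
    END;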

As you can see, we are using the CREATE OR ALTER PROCEDURE statement to create the test case. The critical part of the test case is the main body of the procedure definition, between BEGIN and END. First, we use tSQLt's FakeTable stored procedure, which creates an empty, unique temporary table with the same name as the table referenced within the stored procedure we are testing. This ensures that the data in the actual database table is not updated or deleted: any subsequent references we make in our test case to that table point to the fake table, not the actual table in the database. Then we populate the fake table with test data.


In the test case, we use two temporary tables: #Expected, which stores the expected data, and #Actual, which stores the data returned when the stored procedure runs. Both temp tables have the same schema.

Finally, we use the tSQLt AssertEqualsTable stored procedure to compare the data in the #Actual table to the data in the #Expected table.

Run the test case using the tSQLt.Run stored procedure.
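For example, to run this single test case:

    EXEC tSQLt.Run 'TestPerson.TestGetAddressByCitySuccess';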

When we run the test case, it should evaluate to true and return the following results:

Final result for unit testing

Now you have the basics, which will allow you to go a long way with unit testing using tSQLt for SQL Server. You can refer to documentation by visiting the tSQLt User Guide.

While recently working on moving an FTP server (using passive FTP) from on-premises to Azure, I needed to expose the FTP server to the internet via an Azure External Load Balancer. A few things were not well documented, so I thought it would be good to put together a step-by-step guide on setting up the load balancer, configuring the FTP server on the backend, and setting the Network Security Group rules associated with the FTP server. In this example, the FTP server runs on IIS, and it is assumed that there is no Azure Firewall/NVA associated with the VNET hosting the FTP server. If your environment does have an Azure Firewall/NVA, you will need an additional Azure resource: an Azure route table with a 0.0.0.0/0 route out to the internet, associated with the subnet where the FTP server is hosted. This is required to route the outbound traffic back out to the internet; if it is not done, the outbound FTP traffic will die on the Azure Firewall/NVA.

Windows IIS FTP Server Configuration

I won’t be going into the details of the FTP server configuration here; I will just cover the FTP firewall settings and how to set the data channel ports (passive FTP ports). Below is how the FTP server is configured to support the external load balancer. At the IIS server level, FTP Firewall Support is where you define the data channel port range.

FTP Firewall Support

I have defined ports 5000-5002.

Define Ports

At the site level is where you configure the public IP address of the external load balancer.

FTP Test

As you can see in the screenshot below, the public IP of the external load balancer is configured.

FTP Firewall Support

Azure External Load Balancer

Configuring an Azure External Load Balancer for FTP is straightforward. One thing to note with passive FTP is that every data channel port must have its own load balancing rule. If you define ports 5000-5100, that means roughly 100 load balancing rules, one for each data channel port. I recommend keeping the number of passive ports to a minimum, since Azure External Load Balancers do not support a port range in a load balancing rule.

Requirements:

  • Azure Public IP – this will be configured as the load balancer’s front-end IP
  • Azure Load Balancer
  • FTP Port requirements for inbound traffic and public IP address(es) of the client(s) that will be accessing the FTP server

Deploy the Load Balancer

Search for Azure Load Balancer in the search bar in the Azure Portal

Search for Azure Load Balancer
Select Create

Creating a Load Balancer in Azure

Define the following parameters, then select Next: Frontend IP configuration

  • Resource Group
  • Name
  • Region
  • Leave SKU: Standard
  • Leave Type: Public
  • Tier: Regional

Project Details Load Balancing

Select Add a frontend IP configuration and define the following parameters:

  • Frontend IP Name
  • IP version: leave IPv4
  • IP type: leave IP address
  • Public IP address
  • Gateway Load balancer: leave None

Select Add

Name Load Balancer and Public IP Address

Select Review + create – we will go over the configuration of the remaining items of the Load Balancer after it has been deployed.
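The same deployment can also be scripted. A hedged Azure CLI sketch, where the resource group and resource names (ftp-rg, ftp-pip, ftp-lb) are placeholders and FEIP and BEP-01 match the names used later in this post:

    az network public-ip create --resource-group ftp-rg --name ftp-pip \
        --sku Standard --allocation-method Static

    az network lb create --resource-group ftp-rg --name ftp-lb --sku Standard \
        --public-ip-address ftp-pip --frontend-ip-name FEIP --backend-pool-name BEP-01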

Add Backend Pool

Configuration of the Azure External Load Balancer

I will now go over the configuration of the Azure Load Balancer. This will detail how the Backend pool, Health probes, Load balancing rules, and Outbound rules are configured.

Configuration in the Azure External Load Balancer

Frontend IP Configuration

As you can see in the screenshot below, the frontend IP is defined, along with the load balancing rules associated with it. After additional load balancing rules are added, I will review the frontend configuration again.

Frontend IP configuration

Backend Pools

The backend pool configuration is how you associate a virtual machine with the load balancer. In the screenshot below, I use the following configuration:

  • Virtual network – the VNET that the target virtual machine is associated with
  • Backend Pool Configuration: NIC
  • IP Version: IPv4
  • Add the virtual machine by selecting the +Add button
  • Select Save

Add virtual machines to backend pools

Add Virtual Network

Health Probes

I have created an FTP Health probe for port 21. I will also be using this health probe for my FTP data channel ports. You can make a health probe for each data channel port.

  • Name: FTP
  • Protocol: TCP
  • Port: 21
  • Interval: 5
  • Unhealthy threshold:2

Use Health Probe for FTP Data
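For reference, a hedged Azure CLI equivalent of this probe, using the same placeholder resource names as above:

    az network lb probe create --resource-group ftp-rg --lb-name ftp-lb \
        --name ftp --protocol Tcp --port 21 --interval 5 --threshold 2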

Load Balancing Rules

I have the port 21 load balancing rule already configured but need to add the FTP data channel ports that I have defined in FTP Firewall (5000-5002).

Define the following parameters:

  • Name: I like to use the port number as the name; in this example, I will use 5000
  • IP Version: IPv4
  • Frontend IP address: Select your Frontend IP Config – FEIP
  • Backend Pool: Select your backend pool – BEP-01
  • Protocol: TCP
  • Port: 5000
  • Backend Port: 5000
  • Health Probe: ftp (TCP:21)
  • Session persistence: None
  • Idle timeout (minutes): 4
  • TCP reset: Disabled
  • Floating IP: Disabled
  • Outbound source network address translation (SNAT): (Recommended) Use outbound rules to provide backend pool members access to the internet
  • Select Add
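Because every data channel port needs its own rule, scripting the rules is convenient. A hedged Azure CLI sketch, using the same placeholder names as above:

    for port in 5000 5001 5002; do
        az network lb rule create --resource-group ftp-rg --lb-name ftp-lb \
            --name "$port" --protocol Tcp \
            --frontend-port "$port" --backend-port "$port" \
            --frontend-ip-name FEIP --backend-pool-name BEP-01 \
            --probe-name ftp --idle-timeout 4 --disable-outbound-snat true
    done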

Here is the full set of Load Balancing rules:

Load Balancing Rules

How to Add a Load Balancing Rule

Outbound Rules

You must create an outbound rule for the backend resource to be able to reach out to the internet.

Create an outbound rule for back-end resource

Network Security Group Rules

The final step in configuring this solution is configuring the inbound rules for the NSG. The rule should allow TCP port 21 and the data channel ports defined in your FTP firewall settings (in my case, 5000-5002).

External IP client accessing the FTP server
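A hedged Azure CLI equivalent of that inbound rule, with the NSG name and client IP as placeholders:

    az network nsg rule create --resource-group ftp-rg --nsg-name ftp-nsg \
        --name Allow-FTP-Inbound --priority 100 --direction Inbound --access Allow \
        --protocol Tcp --source-address-prefixes 203.0.113.10 \
        --destination-port-ranges 21 5000-5002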

Conclusion

Passive FTP via an Azure Load Balancer combined with a Network Security Group is a great solution for securely exposing your FTP server to the internet. Understanding the maximum number of simultaneous connections will help you determine the minimum number of passive data channel ports required on the Azure External Load Balancer.