Four Years Remaining

The File Download Problem

Posted by Konstantin 02.05.2017
I happen to use the Amazon cloud machines from time to time for various personal and work-related projects. Over the years I've accumulated a terabyte or so of data files there. Those are mostly useless intermediate results or expired backups, which should be deleted and forgotten, but I could not gather the strength for that. "What if those data files happen to be of some archaeological interest 30 years from now?", I thought. Keeping them just lying there on an Amazon machine is, however, a waste of money - it would be cheaper to download them all onto a local hard drive and tuck it away somewhere dark and dry.

But what would be the fastest way to download a terabyte of data from the cloud? Obviously, a large downstream bandwidth is important here, but so is a smart choice of transfer technology. To my great surprise, googling did not provide me with a simple and convincing answer. A question posted to StackOverflow did not receive any informative replies and even got downvoted for reasons beyond my understanding. It is 2017, but downloading a file is still not an obvious matter, apparently.

Unhappy with this state of affairs, I decided to compare some of the standard ways of downloading a file from a cloud machine. Although the resulting measurements are very configuration-specific, I believe the overall results might still generalize to other setups.

Experimental Setup

Consider the following situation:
- An m4.xlarge AWS machine (which is claimed to have "High" network bandwidth) located in the EU (Ireland) region, with an SSD storage volume (400 Provisioned IOPS) attached to it.
- A 1GB file with random data, generated on that machine using the following command:
  $ dd if=/dev/urandom of=file.dat bs=1M count=1024
- The file needs to be transferred to a university server located in Tartu (Estonia). The server has a decently high network bandwidth and uses a mirrored-striped RAID for its storage backend.
Our goal is to get the file from the AWS machine onto the university server in the shortest time possible. We will now try eight different methods for that, measuring the mean transfer time over 5 attempts for each method.
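
Averaging over several attempts is easy to script. The following is a minimal sketch, not the script used for the measurements in this post: `mean_time` is a hypothetical helper name, and it assumes GNU `date` (for the `%N` nanosecond format) and `awk` are available on the client.

```shell
# mean_time RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND the given number of times and prints the
# mean wall-clock time in seconds. Assumes GNU date (for %N) and awk.
mean_time() {
    local runs="$1" total=0 i=0 start end
    shift
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s.%N)
        # Accumulate elapsed time; awk handles the floating-point arithmetic.
        total=$(awk -v t="$total" -v s="$start" -v e="$end" \
                    'BEGIN { printf "%.6f", t + (e - s) }')
        i=$((i + 1))
    done
    awk -v t="$total" -v n="$runs" 'BEGIN { printf "%.2f\n", t / n }'
}
```

For example, `mean_time 5 scp -i ~/.ssh/id_rsa.amazon ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .` would time the SCP method described below. For the RSync methods the local copy has to be deleted between runs, otherwise RSync finds the file already up to date and skips the transfer.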

File Download Methods

One can probably come up with hundreds of ways of transferring a file. The following eight are among the most common and are reasonably easy to arrange.

1. SCP (a.k.a. SFTP)
- Server setup: None (the SSH daemon is usually installed on a cloud machine anyway).
- Client setup: None (if you can access a cloud server, you have the SSH client installed already).
- Download command:
```
scp -i ~/.ssh/id_rsa.amazon \
         ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .
```
2. RSync over SSH
- Server setup: sudo apt install rsync (usually installed by default).
- Client setup: sudo apt install rsync (usually installed by default).
- Download command:
```
rsync -havzP --stats \
      -e "ssh -i $HOME/.ssh/id_rsa.amazon" \
      ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .
```
3. Pure RSync
- Server setup:
  Install RSync (usually already installed):
```
sudo apt install rsync
```
  Create /etc/rsyncd.conf with the following contents:
```
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[files]
path = /home/ubuntu
```
  Run the RSync daemon:
```
sudo rsync --daemon
```
- Client setup: sudo apt install rsync (usually installed by default).
- Download command:
```
rsync -havzP --stats \
      rsync://$REMOTE_IP/files/file.dat .
```
4. FTP (VSFTPD+WGet)
- Server setup:
  Install VSFTPD:
```
sudo apt install vsftpd
```
  Edit /etc/vsftpd.conf:
```
listen=YES
listen_ipv6=NO
pasv_address=52.51.172.88   # The public IP of the AWS machine
```
  Create password for the ubuntu user:
```
sudo passwd ubuntu
```
  Restart vsftpd:
```
sudo service vsftpd restart
```
- Client setup: sudo apt install wget (usually installed by default).
- Download command:
```
wget ftp://ubuntu:somePassword@$REMOTE_IP/file.dat
```
5. FTP (VSFTPD+Axel)

Axel is a command-line tool that can download a file through multiple connections, thus increasing throughput.
- Server setup: See 4.
- Client setup: sudo apt install axel
- Download command:
```
axel -a ftp://ubuntu:somePassword@$REMOTE_IP/home/ubuntu/file.dat
```
6. HTTP (NginX+WGet)
- Server setup:
  Install NginX:
```
sudo apt install nginx
```
  Edit /etc/nginx/sites-enabled/default, add into the main server block:
```
location /downloadme {
    alias /home/ubuntu;
    gzip on;
}
```
  Restart nginx:
```
sudo service nginx restart
```
- Client setup: sudo apt install wget (usually installed by default).
- Download command:
```
wget http://$REMOTE_IP/downloadme/file.dat
```
7. HTTP (NginX+Axel)
- Server setup: See 6.
- Client setup: sudo apt install axel
- Download command:
```
axel -a http://$REMOTE_IP/downloadme/file.dat
```
8. AWS S3

The last option we try is first transferring the files onto an AWS S3 bucket, and then downloading from there using S3 command-line tools.
- Server setup:
  Install and configure AWS command-line tools:
```
sudo apt install awscli
aws configure
```
  Create an S3 bucket:
```
aws --region us-east-1 s3api create-bucket \
    --acl public-read-write --bucket test-bucket-12345
```
  We create the bucket in the us-east-1 region because the S3 tool seems to have a bug at the moment which prevents using it in the EU regions.
  
  Next, we transfer the file to the S3 bucket:
```
aws --region us-east-1 s3 cp file.dat s3://test-bucket-12345
```
- Client setup:
  Install and configure AWS command-line tools:
```
sudo apt install awscli
aws configure
```
- Download command:
```
aws --region us-east-1 s3 cp s3://test-bucket-12345/file.dat .
```
Results

Here are the measurement results. In the case of the S3 method, we report the total time needed to upload from the server to S3 and then download from S3 to the local machine. Note that I did not bother to fine-tune any of the settings - it may very well be possible to speed up some of the methods significantly by configuring the servers appropriately. Consider the results below to indicate the "out of the box" performance of the corresponding approaches.

Although S3 comes up as the fastest method (and might be even faster if it worked out of the box with the European datacenter), RSync is only marginally slower, yet it is easier to use, usually requires no additional setup and handles incremental downloads very gracefully. I would thus summarize the results as follows:

Whenever you need to download large files from the cloud, consider RSync over SSH as the default choice.
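
The graceful handling of incremental downloads matters most for terabyte-sized transfers, where a dropped connection is likely. With `--partial` (implied by `-P`) RSync keeps the partially transferred file and reuses it on the next attempt, so a simple retry loop is often enough. The sketch below is one way to do it: `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary choices.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example (same download command as in method 2; -P implies --partial):
# retry 10 30 rsync -havzP --stats \
#       -e "ssh -i $HOME/.ssh/id_rsa.amazon" \
#       ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .
```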
Posted by Konstantin @ 7:50 pm

Tags: Cloud, Experiment, Internet, Procrastination, Project
2 Comments
1. Ricardo on 24.05.2017 at 19:03
  
  I'm doing almost the same thing. I run a weather model on a c3.8xlarge for 5 hours, storing the resulting data files on an attached EBS. Then I download a 42 Gb tar file to my local machine via scp. It takes between 1 and 1-1/2 hours for the download. Then I do it all over and over again, in a big loop that takes about a month to complete. It seems like a big waste of money to do it this way, and shouldn't take 1.5 hours to download 42 gig anyway. Ugh.
  1. Konstantin on 24.05.2017 at 21:31
    
    42 GB in 60-90 minutes means 1 GB takes around 85-128 s, which nicely matches my measured average of 114 s per GB on SCP (actual range over 5 attempts: 98-128 s).
    
    Extrapolating from my measurements, if you switch to Rsync, your transfer time will go down by 33%, i.e. it will be 40-60m now.
    
    However, unless you really need to have the files on a local machine, you could warehouse them on S3 - this is much faster (my measurements say it takes on average about 30 s to upload a GB from EBS to S3, which is 3x faster than SCP - 20m instead of an hour in your case). If one day you decide to get all your data down from S3 (via aws cli), your *total time* spent on all transfers (including upload to S3 and download from it) will still be optimal (this is the first bar in my graph above).
    
    One thing I did not mention is the cost. Note that, at the moment:
    - Downloading from EBS costs around $90 per TB.
    - Transfers between EBS and S3 are free (within the same region).
    - Downloading from S3 costs about 10% more than from EBS (because there's a "request fee" added to the otherwise equal "data transfer fee").
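
The back-of-the-envelope numbers in this thread can be reproduced with a few lines of shell. This is only a sketch of the arithmetic: the function names are made up for illustration, the 0.67 factor encodes the 33% reduction quoted above, and the cost estimate assumes 1 TB = 1000 GB at the $90 per TB figure quoted above, which may no longer be current.

```shell
# Seconds needed per GB when SIZE_GB gigabytes take MINUTES minutes.
secs_per_gb() {
    awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", min * 60 / gb }'
}

# Estimated minutes after a 33% reduction in transfer time.
rsync_minutes() {
    awk -v min="$1" 'BEGIN { printf "%.0f\n", min * 0.67 }'
}

# Download cost in dollars for SIZE_GB at PRICE_PER_TB (assuming 1 TB = 1000 GB).
download_cost() {
    awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb / 1000 * price }'
}

secs_per_gb 42 60      # 85.7
secs_per_gb 42 90      # 128.6
rsync_minutes 60       # 40
rsync_minutes 90       # 60
download_cost 42 90    # 3.78
```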