Four Years Remaining

The File Download Problem

Posted by Konstantin 02.05.2017 2 Comments
I happen to use the Amazon cloud machines from time to time for various personal and work-related projects. Over the years I've accumulated a terabyte or so of data files there. Those are mostly useless intermediate results or expired backups, which should be deleted and forgotten, but I could not gather the strength for that. "What if those data files happen to be of some archaeological interest 30 years from now?", I thought. Keeping them just lying there on an Amazon machine is, however, a waste of money - it would be cheaper to download them all onto a local hard drive and tuck it away somewhere dark and dry.

But what would be the fastest way to download a terabyte of data from the cloud? Obviously, large downstream bandwidth is important here, but so is a smart choice of the transfer technology. To my great surprise, googling did not provide me with a simple and convincing answer. A question posted to StackOverflow did not receive any informative replies and even got downvoted for reasons beyond my understanding. It is the year 2017, but downloading a file is still not an obvious matter, apparently.

Unhappy with this state of affairs, I decided to compare some of the standard ways of downloading a file from a cloud machine. Although the resulting measurements are very configuration-specific, I believe the overall results might still generalize to a wider scope.

Experimental Setup

Consider the following situation:
- An m4.xlarge AWS machine (which is claimed to have "High" network bandwidth) located in the EU (Ireland) region, with an SSD storage volume (400 Provisioned IOPS) attached to it.
- A 1GB file with random data, generated on that machine using the following command:
  $ dd if=/dev/urandom of=file.dat bs=1M count=1024
- The file needs to be transferred to a university server located in Tartu (Estonia). The server has a decently high network bandwidth and uses a mirrored-striped RAID for its storage backend.
Our goal is to get the file from the AWS machine onto the university server in the shortest time possible. We will now try eight different methods for that, measuring the mean transfer time over 5 attempts for each method.
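The measurement procedure can be sketched roughly as follows. This is a minimal sketch, not the script used for the measurements reported here: `mean_seconds` is a hypothetical helper name, and it assumes GNU `date` (for the `%N` nanoseconds format) and `awk` are available.

```shell
# Hypothetical helper (not from the original measurements): run a command
# several times and print the mean wall-clock time in seconds.
# Assumes GNU date, which supports the %N (nanoseconds) format.
# Usage: mean_seconds RUNS COMMAND [ARGS...]
mean_seconds() (
    runs=$1
    shift
    total=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1 || exit 1   # give up if an attempt fails
        end=$(date +%s.%N)
        total=$(awk -v t="$total" -v s="$start" -v e="$end" \
            'BEGIN { printf "%.6f", t + e - s }')
        i=$((i + 1))
    done
    awk -v t="$total" -v n="$runs" 'BEGIN { printf "%.3f\n", t / n }'
)
```

It could then be wrapped around any of the download commands below, e.g. `mean_seconds 5 scp ...`. For the RSync-based methods the local copy would have to be removed between attempts, since RSync skips a file that is already up to date.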

File Download Methods

One can probably come up with hundreds of ways for transferring a file. The following eight are probably the most common and reasonably easy to arrange.

1. SCP (over SSH)
- Server setup: None (the SSH daemon is usually installed on a cloud machine anyway).
- Client setup: None (if you can access a cloud server, you have the SSH client installed already).
- Download command:
```
scp -i ~/.ssh/id_rsa.amazon \
         ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .
```
2. RSync over SSH
- Server setup: sudo apt install rsync (usually installed by default).
- Client setup: sudo apt install rsync (usually installed by default).
- Download command:
```
rsync -havzP --stats \
      -e "ssh -i $HOME/.ssh/id_rsa.amazon" \
      ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .
```
3. Pure RSync
- Server setup:
  Install RSync (usually already installed):
```
sudo apt install rsync
```
  Create /etc/rsyncd.conf with the following contents:
```
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[files]
path = /home/ubuntu
```
  Run the RSync daemon:
```
sudo rsync --daemon
```
- Client setup: sudo apt install rsync (usually installed by default).
- Download command:
```
rsync -havzP --stats \
      rsync://$REMOTE_IP/files/file.dat .
```
4. FTP (VSFTPD+WGet)
- Server setup:
  Install VSFTPD:
```
sudo apt install vsftpd
```
  Edit /etc/vsftpd.conf:
```
listen=YES
listen_ipv6=NO
pasv_address=52.51.172.88   # The public IP of the AWS machine
```
  Create password for the ubuntu user:
```
sudo passwd ubuntu
```
  Restart vsftpd:
```
sudo service vsftpd restart
```
- Client setup: sudo apt install wget (usually installed by default).
- Download command:
```
wget ftp://ubuntu:somePassword@$REMOTE_IP/file.dat
```
5. FTP (VSFTPD+Axel)

Axel is a command-line tool which can download a file through multiple connections, thus increasing throughput.
- Server setup: See 4.
- Client setup: sudo apt install axel
- Download command:
```
axel -a ftp://ubuntu:somePassword@$REMOTE_IP/home/ubuntu/file.dat
```
6. HTTP (NginX+WGet)
- Server setup:
  Install NginX:
```
sudo apt install nginx
```
  Edit /etc/nginx/sites-enabled/default, add into the main server block:
```
location /downloadme {
    alias /home/ubuntu;
    gzip on;
}
```
  Restart nginx:
```
sudo service nginx restart
```
- Client setup: sudo apt install wget (usually installed by default).
- Download command:
```
wget http://$REMOTE_IP/downloadme/file.dat
```
7. HTTP (NginX+Axel)
- Server setup: See 6.
- Client setup: sudo apt install axel
- Download command:
```
axel -a http://$REMOTE_IP/downloadme/file.dat
```
8. AWS S3

The last option we try is first transferring the files onto an AWS S3 bucket, and then downloading from there using S3 command-line tools.
- Server setup:
  Install and configure AWS command-line tools:
```
sudo apt install awscli
aws configure
```
  Create an S3 bucket:
```
aws s3api create-bucket \
    --acl public-read-write --bucket test-bucket-12345 \
    --region us-east-1
```
  We create the bucket in the us-east-1 region because the S3 tool seems to have a bug at the moment which prevents it from being used in the EU regions.
  
  Next, we transfer the file to the S3 bucket:
```
aws --region us-east-1 s3 cp file.dat s3://test-bucket-12345
```
- Client setup:
  Install and configure AWS command-line tools:
```
sudo apt install awscli
aws configure
```
- Download command:
```
aws --region us-east-1 s3 cp s3://test-bucket-12345/file.dat .
```
Results

Here are the measurement results. In the case of the S3 method we report the total time needed to upload from the server to S3 and download from S3 to the local machine. Note that I did not bother to fine-tune any of the settings - it may very well be possible that some of the methods can be sped up significantly by configuring the servers appropriately. Consider the results below to indicate the "out of the box" performance of the corresponding approaches.
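To relate a single-file measurement back to the original terabyte question, the time can be extrapolated linearly. Below is a small sketch, assuming the throughput stays constant over the whole transfer (which a real network does not guarantee). `estimate_hours` is a hypothetical helper, and the numbers in the usage line are made up for illustration, not measured values.

```shell
# Hypothetical helper: extrapolate a per-gigabyte transfer time to a larger
# data set, assuming constant throughput.
# Usage: estimate_hours SECONDS_PER_GB TOTAL_GB
estimate_hours() {
    awk -v s="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", s * gb / 3600 }'
}

# Illustrative input only (not a measured value): 36 s per GB, 1000 GB.
estimate_hours 36 1000   # prints 10.0
```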

Although S3 comes up as the fastest method (and might be even faster if it worked out of the box with the European datacenter), RSync is only marginally slower, yet it is easier to use, usually requires no additional set-up and handles incremental downloads very gracefully. I would thus summarize the results as follows:

Whenever you need to download large files from the cloud, consider RSync over SSH as the default choice.
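The graceful handling of incremental downloads pays off mostly when a long transfer gets interrupted: rerunning the same `rsync -P` command can reuse the partially transferred file instead of starting over. A rough sketch of automating that is given below. `retry` is a hypothetical helper, and the attempt count and pause are arbitrary choices.

```shell
# Hypothetical helper: rerun a command until it succeeds or the attempts run out.
# Usage: retry MAX_ATTEMPTS PAUSE_SECONDS COMMAND [ARGS...]
retry() (
    max=$1
    pause=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            exit 1
        fi
        attempt=$((attempt + 1))
        sleep "$pause"
    done
)
```

For example, `retry 10 30 rsync -havzP -e "ssh -i $HOME/.ssh/id_rsa.amazon" ubuntu@$REMOTE_IP:/home/ubuntu/file.dat .` would make up to ten attempts, waiting 30 seconds between them.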
Tags: Cloud, Experiment, Internet, Procrastination, Project
The Great Swinging Bucket Conspiracy

Posted by Konstantin 12.02.2009 3 Comments

Most of us probably remember this experiment from high school physics lessons: you take a bucket on a string filled with water, spin it around your head and the water does not spill. "But how?" - you would ask in amazement. And the teacher would then explain:
"You see, the bucket is spinning, and this creates the so-called centrifugal force acting on the water, which cancels out gravity and thus keeps the water in the bucket". And you will have a hard time finding any other explanation. At least I failed to find one, no matter how hard I googled.

Unfortunately, this explanation misses an essential point of the experiment, and I have seen people irreparably brain-damaged by the blind belief that it is only due to rotation and the resulting virtual centrifugal force that the water does not spill.

However, this is not quite the case. Let us imagine that the bucket has accidentally stopped right over your head and, as a result, all centrifugal force has been immediately lost. Would the water spill? It will certainly fall down on your head, but it will do so together with the bucket. Thus, technically, the water will stay inside the bucket.

In fact, the proper way to enjoy the true magic of the experiment is not to swing the bucket in full circles, but rather to let it swing back and forth as a pendulum (if you have a string and a beverage bottle nearby, you can do the experiment right now). One will then observe that even at the highest points of the swing, where the bottom of the bucket is at its steepest angle and the centrifugal force is nonexistent, the water surface stays strictly parallel to the bottom of the bucket, as if no gravity acted upon it. Why doesn't it spill? Clearly, the argument of centrifugal force cancelling gravity is inappropriate.

The proper explanation is actually quite simple and much more generic. We have two objects here: the bucket and the water in it. There is one (real) force acting on the water: gravity G. There are two (real) forces acting on the bucket: gravity G and the tension S of the string pulling the bucket perpendicularly to its bottom. (Note that the centrifugal force is not "real" and I do not consider it here, but if you wish, you may. Just remember then that it acts both on the water and the bucket.)
Now the question of interest is, how does the water behave with respect to the bucket? That is, what force "pulls" the water towards the bucket and vice versa? This can be easily computed by subtracting all forces acting on the bucket from all forces acting on the water. And the result is, of course, G - (S+G) = -S, i.e. a force pulling the water directly towards the bottom of the bucket.
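Strictly speaking, it is accelerations rather than forces that should be compared, since the water and the bucket have different masses. Here is a sketch of the same argument per unit mass, under the same idealization that the water is momentarily treated as a free body. The symbols $m_b$ (mass of the bucket), $\mathbf{g}$ (gravitational acceleration) and $\mathbf{a}_w$, $\mathbf{a}_b$ (accelerations of the water and the bucket) are introduced here for illustration only:

```latex
\begin{aligned}
\mathbf{a}_w &= \mathbf{g}, \\
\mathbf{a}_b &= \mathbf{g} + \frac{\mathbf{S}}{m_b}, \\
\mathbf{a}_w - \mathbf{a}_b &= -\frac{\mathbf{S}}{m_b}.
\end{aligned}
```

Gravity drops out of the relative acceleration, which points opposite to the pull of the string, that is, towards the bottom of the bucket.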

A magical consequence of this argument is that gravity does not matter inside the bucket, as long as it can act on the bucket freely in the same way as on anything inside it. Nothing special about rotation here, really. It takes a while to realize.

Tags: Conspiracy, Experiment, Fun, Physics
