Data Download - Software & Structure


Introduction

Downloading hundreds of TBs of data requires more than just a list of URLs; we also need to make sure the download can run at maximum speed over the long term.

This section includes three parts: Speed Testing & Initial Attempt, Structure, and Deployment.

Speed Testing & Initial Attempt

Initial Attempt

We already had a list of FTP download links, so the first step was to try downloading them directly. The first tool I tried was Internet Download Manager (IDM), a common multi-threaded download manager for Windows.

Speed Testing

IDM does accept a list of FTP links as its source, and the download initially ran at 500 Mbps. However, after a night of downloading, the speed dropped rapidly to zero. After I paused the download for a day, the speed returned to the full 500 Mbps. This pattern points to a restriction imposed by the remote firewall.

After two weeks of testing and waiting, I found that the firewall was not triggered if the download speed was limited to around 100 Mbps. Also, Aspera and FTP do not seem to share the same firewall policy, so it is better to implement both download options to avoid potential service interruptions.
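
The rate cap described above can be sketched as a throttled copy loop. This is only an illustration of the idea, not the tool actually used: the function name `throttled_copy`, its parameters, and the injected `clock` and `sleep` callables are assumptions made for this example, and the 100 Mbps figure is converted to bytes per second.

```python
import io
import time

# Roughly 100 Mbps expressed in bytes per second (the threshold found above).
CAP_BYTES_PER_SEC = 100_000_000 // 8


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, sleeping so the average rate stays at or below the cap.

    `clock` and `sleep` are injectable (hypothetical parameters) so the
    pacing logic can be exercised without real waiting.
    """
    start = clock()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        # Earliest moment at which `total` bytes may have been transferred.
        earliest = start + total / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return total


if __name__ == "__main__":
    # Tiny in-memory demo: 1 MiB copied under the cap finishes almost at once.
    source = io.BytesIO(b"\0" * (1024 * 1024))
    sink = io.BytesIO()
    print(throttled_copy(source, sink, CAP_BYTES_PER_SEC), "bytes copied")
```

In practice `src` would be the stream of a remote file and `dst` a local file. Many download clients also expose a built-in speed limit that serves the same purpose.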

To work around this restriction and still fully utilize the link, we need multiple IP addresses.
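
A quick back-of-the-envelope calculation shows why a single capped connection is not enough. The 500 Mbps link and the roughly 100 Mbps per-IP cap come from the tests above; the 100 TB dataset size is an assumed round number standing in for "hundreds of TBs", and the function names are made up for this sketch.

```python
import math


def nodes_needed(link_mbps, per_ip_cap_mbps):
    """Number of source IPs required to fill the link at the per-IP cap."""
    return math.ceil(link_mbps / per_ip_cap_mbps)


def transfer_days(total_tb, rate_mbps):
    """Days needed to move total_tb (decimal terabytes) at rate_mbps."""
    bits = total_tb * 1e12 * 8
    seconds = bits / (rate_mbps * 1e6)
    return seconds / 86400


if __name__ == "__main__":
    print(nodes_needed(500, 100))            # 5
    print(round(transfer_days(100, 100), 1))  # 92.6 days on one capped IP
    print(round(transfer_days(100, 500), 1))  # 18.5 days at the full link rate
```

Under these assumptions, each additional IP address adds about 100 Mbps of usable throughput until the local link is saturated.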

Structure

The ISP of our server does not allocate multiple IP addresses, so we need some remote virtual servers to act as proxies. In other words, each remote file is first downloaded to a remote node at a lower speed so that it does not trigger the firewall. The file is then retrieved from the node to our local server over FTP, without any restrictions.
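
One simple way to spread the link list over the proxy nodes is round-robin assignment, sketched below. The original does not say how files were actually distributed, so this is an assumption, and the node names and example URLs are hypothetical placeholders.

```python
from itertools import cycle


def assign_round_robin(urls, nodes):
    """Return a dict mapping each node to the list of URLs it should fetch."""
    if not nodes:
        raise ValueError("at least one proxy node is required")
    plan = {node: [] for node in nodes}
    # cycle() repeats the node list, so URLs are dealt out like cards.
    for url, node in zip(urls, cycle(nodes)):
        plan[node].append(url)
    return plan


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    example_urls = [f"ftp://example.org/data/file_{i}.bin" for i in range(6)]
    example_nodes = ["node-a", "node-b", "node-c", "node-d"]
    for node, assigned in assign_round_robin(example_urls, example_nodes).items():
        print(node, assigned)
```

Each node then downloads only its own share at the capped speed, and the local server pulls completed files from all nodes in parallel.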

The Structure

Figure: download structure

Deployment

Once the structure was set, I obtained four remote servers. These servers, located in the US, have a 10 Gbps unmetered link, which is more than enough. All servers were configured to run Debian 10, each with at least 256 GB of buffer space.

Three scripts were developed for the download pipeline: one for downloading from the remote server, one for retrieving files from the proxy nodes, and one for validating and moving downloaded files.
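
As an illustration of the third step, the sketch below validates a file against a checksum and moves it only if the checksum matches. The original does not describe how validation was done, so the use of MD5, the function names, and the directory layout are all assumptions for this example.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_and_move(path, expected_md5, dest_dir):
    """Move `path` into `dest_dir` if its MD5 matches; return the new path.

    Returns None and leaves the file in place when the checksum differs,
    so a later pass can re-download it.
    """
    path = Path(path)
    if file_md5(path) != expected_md5.lower():
        return None
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / path.name
    shutil.move(str(path), str(target))
    return target
```

Reading in chunks keeps memory use flat even for very large files, and leaving failed files untouched makes it easy to retry them.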

Figure: disk space of the virtual servers
