Data Download - Testing & Firewall Circumvention

rough pipeline of data processing.

Introduction

Using proxy nodes does not solve every problem. The proxy nodes are still located outside China, so the Great Firewall still interferes with downloads from them.

This section has two parts: the issue and its circumvention.

The Issue

Soon after the proxy servers were in place, I noticed a strange pattern: each file requested from the proxy nodes either failed outright or downloaded at full speed, and most of them failed.

Since the proxy servers are under my control, I double-checked that no firewall on the servers was blocking the connection. I also downloaded files from the same server over SFTP without any problem. The problem is now clear: the unencrypted FTP requests were somehow being intercepted by the firewall at the border. When a request did occasionally get past the firewall, the download continued without a problem.
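The all-or-nothing pattern described above can be measured with a small probe. The following is only a minimal sketch in Python, not the script actually used for this project: `probe_downloads` and `make_ftp_fetch` are hypothetical names, and the host, credentials, and paths would have to be supplied by the reader.

```python
import ftplib
from collections import Counter
from typing import Callable, Iterable


def make_ftp_fetch(host: str, user: str, password: str) -> Callable[[str], bytes]:
    """Return a function that downloads one file over plain FTP."""
    def fetch(path: str) -> bytes:
        chunks = []
        # A fresh connection per file, so every request crosses the border anew.
        with ftplib.FTP(host, user, password, timeout=30) as ftp:
            ftp.retrbinary(f"RETR {path}", chunks.append)
        return b"".join(chunks)
    return fetch


def probe_downloads(fetch: Callable[[str], bytes], paths: Iterable[str]) -> Counter:
    """Try each path once and count successes and failures."""
    outcome = Counter()
    for path in paths:
        try:
            fetch(path)
            outcome["ok"] += 1
        except ftplib.all_errors:
            # Covers FTP protocol errors, timeouts, resets, and dropped connections.
            outcome["failed"] += 1
    return outcome
```

Running `probe_downloads(make_ftp_fetch(host, user, password), paths)` once from inside China and once from outside would show whether the failures depend on the network path rather than on the server.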

Circumvention

The Protocol

At this point, FTP is confirmed not to work reliably, although the exact reason is unknown. FTP is not a rare protocol, but it is definitely not one that novice users use every day. For the replacement protocol, there are two options for now: SFTP and HTTP/HTTPS. The advantage of SFTP is that it provides a layer of security. However, SFTP is less common than HTTP/HTTPS, so I can't guarantee that it won't fail in the future for the same reason as FTP.

SSH failed

Therefore, HTTP/HTTPS was selected. Because of network censorship, encrypted traffic such as HTTPS is still likely to be interfered with even when it is not blocked outright, so plain HTTP was the final choice.
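A download over plain HTTP can be wrapped in a simple retry loop, since some interference may remain even with a common protocol. This is a minimal sketch using only the Python standard library. The function name `download_with_retry`, the retry count, and the delay are assumptions for illustration, not values from this project.

```python
import time
import urllib.request


def download_with_retry(url, dest, attempts=5, delay=2.0,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Download url to dest, retrying on network errors.

    Returns the number of the attempt that succeeded.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Opening dest in "wb" mode discards any partial file from a failed attempt.
            with opener(url, timeout=30) as response, open(dest, "wb") as out:
                while chunk := response.read(64 * 1024):
                    out.write(chunk)
            return attempt
        except OSError as err:
            # URLError, HTTPError, timeouts, and connection resets are all OSError subclasses.
            last_error = err
            if attempt < attempts:
                sleep(delay)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

The `opener` and `sleep` parameters exist only so the function can be exercised without a real network. In normal use the defaults are enough, for example `download_with_retry("http://proxy.example/data/file.bin", "file.bin")`, where the URL is a placeholder.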

Structure

Since plain HTTP is used and I don't want the file list to be publicly viewable, I need another way to obtain the file list. SSH is the best choice in this case.
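The split between a private file list and public file transfer could look roughly like this: the list is fetched over SSH, and each name is then turned into an HTTP URL. This is a sketch under stated assumptions. It assumes the `ssh` command-line client with key-based login, a remote `ls`, and a directory path without spaces or shell metacharacters. The names `list_remote_files` and `build_urls`, the host, and the paths are hypothetical.

```python
import subprocess
from urllib.parse import quote


def list_remote_files(host, directory, run=subprocess.run):
    """Return the file names in a remote directory, one per line of `ls -1`."""
    result = run(
        ["ssh", host, "ls", "-1", directory],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def build_urls(base_url, names):
    """Map file names to download URLs under base_url, percent-encoding each name."""
    return [f"{base_url.rstrip('/')}/{quote(name)}" for name in names]
```

With this layout the HTTP server can keep directory listing disabled: only a client that can log in over SSH learns which file names to request.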

Firewall Bypassing Structure

Improve Accessibility

To keep costs down, the proxy server uses a general network link, which occasionally loses packets and is not always reliable. To improve service reliability, I chose to use a priority CN2 link solely for obtaining the file list via SSH.

Firewall Bypassing Structure improved
