Pipeline - Overview & Software
Introduction
Once the hardware is in place, the next step is to design a detailed pipeline. The pipeline needs to specify the script or software used at each step, and how errors and invalid files are handled.
This section has three parts: download & preprocess, analysis & result collection, and compression & storage.
This section only outlines a rough pipeline. As the scripts are developed, more requirements will arise and more scripts will be needed. These may be discussed later in their corresponding sections.
Download & Preprocess
Download Batches
There are about 277,592 entries to download in total, and each entry has between 4 and 20,000 files. To make things easier, I chose to split the entries into batches of 50 entries each. A script called "bulkexport.py" is then used to sanitize each entry, extract the remaining download links, and collect all download links into one file. This file is then fed to the download pipeline.
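The batching step can be sketched as below. This is only a minimal illustration, not the actual "bulkexport.py": the function name make_batches and the placeholder entry IDs are assumptions. With 277,592 entries and 50 entries per batch, it gives 5,552 batches, the last one holding 42 entries.

```python
from itertools import islice
from typing import Iterable, Iterator, List

BATCH_SIZE = 50  # entries per batch, as described above


def make_batches(entries: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` entries (hypothetical helper)."""
    it = iter(entries)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return  # input exhausted
        yield batch


# Placeholder IDs standing in for the real entries.
entry_ids = [f"entry{i}" for i in range(277_592)]
batches = list(make_batches(entry_ids))
print(len(batches), len(batches[-1]))
```

Using a generator keeps memory use flat even if the entry list is read lazily from a file.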
Download Software
Aria2 will be used to download the files from remote locations via FTP.
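As a rough sketch, a batch's link file could be handed to aria2 from Python like this. The aria2c options used here (--input-file, --dir, --max-concurrent-downloads, --continue, --max-tries, --retry-wait) are real, but the values, the file names, and the helper names are my assumptions and not the final settings.

```python
import subprocess
from pathlib import Path
from typing import List


def build_aria2_command(link_file: Path, out_dir: Path, parallel: int = 4) -> List[str]:
    """Build an aria2c command line for one batch (values are assumptions)."""
    return [
        "aria2c",
        f"--input-file={link_file}",               # one URL per line
        f"--dir={out_dir}",                        # where downloaded files go
        f"--max-concurrent-downloads={parallel}",  # files fetched in parallel
        "--continue=true",                         # resume partial downloads
        "--max-tries=5",                           # retry a failing URL
        "--retry-wait=30",                         # seconds between retries
    ]


def download_batch(link_file: Path, out_dir: Path) -> int:
    """Run aria2c and return its exit code (0 means every download succeeded)."""
    return subprocess.run(build_aria2_command(link_file, out_dir)).returncode


cmd = build_aria2_command(Path("links.txt"), Path("downloads"))
print(" ".join(cmd))
```

A non-zero exit code from download_batch is the signal to send the batch to the invalid-file handling.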
Invalid Files
Once a file is downloaded, it will be validated. If the file is valid, it will enter the next step. If the file is invalid, the script will try to fix the error. Some common errors, such as an incorrectly formatted URL, might be fixed automatically; other errors will be discussed case by case.
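To give an idea of what an automatic fix for a badly formatted URL might look like, here is a small sketch. The repair rules (strip stray whitespace, add a missing ftp:// scheme) and the function name normalize_ftp_url are assumptions for illustration; the real set of fixes will depend on the errors that actually show up.

```python
from urllib.parse import urlsplit


def normalize_ftp_url(raw: str) -> str:
    """Try to repair a download link; raise ValueError if it needs manual review."""
    url = raw.strip()  # drop stray spaces and newlines
    if "://" not in url:
        # Assumed common case: the scheme is missing entirely.
        url = "ftp://" + url.lstrip("/")
    parts = urlsplit(url)
    if parts.scheme.lower() != "ftp" or not parts.netloc:
        # Anything else is left for the case-by-case handling.
        raise ValueError(f"cannot repair URL automatically: {raw!r}")
    return url


# ftp.example.org is a placeholder host, not a real data source.
fixed = normalize_ftp_url("  ftp.example.org/pub/a.gz\n")
print(fixed)
```

Raising an exception for anything the function cannot repair keeps the automatic path narrow, so unexpected errors are not silently "fixed" into something wrong.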
Valid Files
Valid files are handled by the script "movef.py", which moves them to their next location for analysis.
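The move step could look roughly like the sketch below. This is not the actual "movef.py"; the function name move_valid_file, the directory layout, and the choice to refuse overwriting an existing file are all assumptions.

```python
import shutil
from pathlib import Path


def move_valid_file(src: Path, analysis_dir: Path) -> Path:
    """Move one validated file into the analysis directory and return its new path."""
    analysis_dir.mkdir(parents=True, exist_ok=True)
    target = analysis_dir / src.name
    if target.exists():
        # Assumed policy: never overwrite silently, flag the clash instead.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(src), str(target))
    return target
```

For example, move_valid_file(Path("downloads/a.gz"), Path("analysis/incoming")) would return the path analysis/incoming/a.gz.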
Analysis & Result Collection
Task Assignment
The validated data will be assigned evenly to the three analysis servers for further processing. Once a result is ready, the analysis server will return it. If the result is valid, the file will be sent to storage. If the result is invalid, it will be sent back to the download stage and processed through the error pipeline.
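One simple way to get an even split across the three servers is round-robin assignment, sketched below. The server names and the function assign_round_robin are placeholders; round-robin is my assumption for what "evenly" means here, and it balances the number of items, not their size.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Sequence

# Placeholder names for the three analysis servers.
ANALYSIS_SERVERS = ("analysis-1", "analysis-2", "analysis-3")


def assign_round_robin(
    items: Iterable[str], servers: Sequence[str] = ANALYSIS_SERVERS
) -> Dict[str, List[str]]:
    """Deal items out to the servers one at a time, like cards."""
    plan: Dict[str, List[str]] = {server: [] for server in servers}
    for item, server in zip(items, cycle(servers)):
        plan[server].append(item)
    return plan


plan = assign_round_robin([f"batch{i}" for i in range(10)])
print({server: len(jobs) for server, jobs in plan.items()})
```

If batches differ a lot in size, assigning each new batch to the server with the least queued data would balance the load better than a plain round-robin.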
Compression & Storage
The goal is to process 500 - 700 GB each day. Since a server takes 36 hours to finish its compression tasks, the two compression servers will be used alternately. Once processing is finished, the final archive will be moved to the tape library or the HDD storage server.
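The reason two servers are enough can be checked with a small scheduling sketch. It assumes one compression job starts every 24 hours and that the servers simply take turns; the server names and function names are placeholders. Under that assumption each server gets a new job every 48 hours, so a 36-hour job leaves 12 hours of slack, while a single server would fall behind.

```python
from typing import Dict, List, Sequence, Tuple

COMPRESSION_HOURS = 36  # time one server needs, as stated above
Job = Tuple[int, str, int, int]  # (day, server, start hour, end hour)


def schedule(days: int, servers: Sequence[str] = ("compress-a", "compress-b")) -> List[Job]:
    """Assume one job per day; servers take turns in order."""
    jobs = []
    for day in range(days):
        start = day * 24
        jobs.append((day, servers[day % len(servers)], start, start + COMPRESSION_HOURS))
    return jobs


def has_overlap(jobs: List[Job]) -> bool:
    """True if any server is asked to start a job before finishing its previous one."""
    last_end: Dict[str, int] = {}
    for _, server, start, end in jobs:
        if start < last_end.get(server, 0):
            return True
        last_end[server] = end
    return False


print(has_overlap(schedule(6)))                     # two servers alternating
print(has_overlap(schedule(6, servers=("only",))))  # a single server
```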