Other - The Next Steps
Introduction
As with any project, there is always room for improvement.
This section has two parts: increasing the data size and dealing with errors.
Increasing the Data Size
Soon after the project started running, I found a mistake made by the researchers: the filter they applied was too strict, and the data we actually need totals about 2000 TB (2 PB).
Storage
Clearly, 2 PB is beyond our current storage capacity, even with the tape library. Therefore, the portion of storage reserved for analyzed source data was dropped. The tape system, on the other hand, is really helpful for storing cold data. The logical volume and the database that were developed are also very helpful for storing and tracking other kinds of data.
Network
Having 2 PB of data increases the daily processing amount from 700 GB to over 2 TB, and places higher demands on our network. Luckily, with the establishment of my mini datacenter, the network was upgraded to two gigabit links, which is sufficient. To reach the optimal download speed, the number of proxy nodes was increased to 9 (B to J). With 10 download nodes (including the localhost), we can sometimes reach a download speed of 2.4 Gbps. It is amazing!
After the upgrade, we can actually download 3 to 4 TB of data per day.
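To put the two figures above side by side, the following is a minimal sketch that converts between a link rate in Gbps and a daily volume in TB. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second); the function names are made up for this illustration and are not part of the project's scripts.

```python
SECONDS_PER_DAY = 86_400


def gbps_to_tb_per_day(gbps: float) -> float:
    """Convert a link rate in gigabits per second to terabytes per day."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def tb_per_day_to_gbps(tb_per_day: float) -> float:
    """Convert a daily volume in terabytes to an average rate in Gbps."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def days_to_download(total_tb: float, tb_per_day: float) -> float:
    """Days needed to download total_tb at a constant daily rate."""
    return total_tb / tb_per_day


if __name__ == "__main__":
    # Peak rate mentioned above, held for a full day.
    print(f"2.4 Gbps sustained = {gbps_to_tb_per_day(2.4):.2f} TB/day")
    # Average rate implied by the observed 3 to 4 TB per day.
    for daily in (3, 4):
        print(f"{daily} TB/day = {tb_per_day_to_gbps(daily):.2f} Gbps on average")
    # Time to fetch the full ~2000 TB at the observed daily rates.
    for daily in (3, 4):
        print(f"2000 TB at {daily} TB/day = {days_to_download(2000, daily):.0f} days")
```

Under these assumptions, the observed 3 to 4 TB per day corresponds to an average of roughly 0.28 to 0.37 Gbps, well below the 2.4 Gbps peak, and downloading the full 2000 TB at that pace would take about 500 to 667 days.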
Analysis Server
Taking advantage of an automated analysis pipeline, each analysis server can ingest about 2 TB of data per day. Even in the worst case, we only need two analysis servers to keep up with the download speed. As a result, not only are the existing servers sufficient, but we can also reassign one to serve as an additional high-performance hypervisor.
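The server count above follows from a simple ceiling division, sketched below. The rates are the ones quoted in this section (3 to 4 TB downloaded per day, about 2 TB ingested per server per day); the function name is hypothetical.

```python
import math


def servers_needed(download_tb_per_day: float, ingest_tb_per_server: float) -> int:
    """Smallest number of analysis servers that keeps up with the download rate."""
    if ingest_tb_per_server <= 0:
        raise ValueError("ingest rate must be positive")
    return math.ceil(download_tb_per_day / ingest_tb_per_server)


if __name__ == "__main__":
    # Check both ends of the observed download range.
    for daily in (3, 4):
        print(f"{daily} TB/day needs {servers_needed(daily, 2)} server(s)")
```

Both ends of the download range give two servers, which is why the worst case still leaves one existing server free for reassignment.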
Dealing with Errors
Currently, the script can only fix a small number of errors. A total of 20 TB of accessions still have an unknown error. Dealing with these errors manually would take a while, so it makes sense to develop a script to facilitate the process.
Other features, such as further improving the human interaction capabilities, are not worth the time, since the project is scheduled to be completed in 2024.
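One possible starting point for such a script is to group the failed accessions by error message and rank the groups by data volume, so that the largest classes can be investigated first. The sketch below is only an illustration: the record layout (accession, size in TB, error message), the function name, and the sample values are all assumptions, not the project's actual log format.

```python
from collections import defaultdict


def summarize_errors(records, max_examples=3):
    """Group failed accessions by normalized error message.

    records: iterable of (accession, size_tb, error_message) tuples
             (a hypothetical layout chosen for this sketch).
    Returns a list of (message, stats) pairs, largest total size first.
    """
    groups = defaultdict(lambda: {"count": 0, "size_tb": 0.0, "examples": []})
    for accession, size_tb, message in records:
        key = message.strip().lower()  # treat case/whitespace variants as one error
        stats = groups[key]
        stats["count"] += 1
        stats["size_tb"] += size_tb
        if len(stats["examples"]) < max_examples:
            stats["examples"].append(accession)
    return sorted(groups.items(), key=lambda item: item[1]["size_tb"], reverse=True)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("ACC0001", 0.5, "Checksum mismatch"),
        ("ACC0002", 0.25, "checksum mismatch "),
        ("ACC0003", 0.1, "Timeout"),
    ]
    for message, stats in summarize_errors(sample):
        print(f"{message}: {stats['count']} accession(s), "
              f"{stats['size_tb']:.2f} TB, e.g. {stats['examples']}")
```

Sorting by total size rather than by count reflects that the backlog is measured in terabytes; sorting by count would be an equally reasonable choice if the number of accessions matters more.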