Script Development - Human Interaction Improvement and Auxiliary Scripts
Introduction
I am not the only user of the processing scripts, and most of the other users have little computing background, so the human interaction process needs to be user-friendly.
This page has three parts: the idea, sample run, and auxiliary scripts.
The Idea
To minimize code-level modifications, most scripts take their parameters from the command line at run time. Since most users are not computing professionals, they are unlikely to supply the right parameter(s) on the command line themselves. I therefore created a master wrapper, "start_operation.py", that guides them through each operation. When an operation is selected, the wrapper invokes the necessary script with the proper parameters.
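The dispatch step described above could look roughly like the sketch below. It is not the actual content of "start_operation.py": the menu keys, the `OPERATIONS` table, the `build_command` helper, and the assumption that each script takes a single batch directory argument are all hypothetical and only illustrate the idea of mapping a menu choice to a script call.

```python
import sys

# Hypothetical menu: choice -> (description, script, ordered parameter names).
# The script names come from the auxiliary script list; their real
# command-line signatures are an assumption.
OPERATIONS = {
    "1": ("Decompress all batches", "bulkdecompress.py", ["batch_dir"]),
    "2": ("Get the download report", "gdreport.py", ["batch_dir"]),
}


def build_command(choice, values):
    """Return the argv list the wrapper would run for the chosen operation."""
    if choice not in OPERATIONS:
        raise ValueError(f"unknown operation: {choice!r}")
    _, script, param_names = OPERATIONS[choice]
    missing = [name for name in param_names if name not in values]
    if missing:
        raise ValueError(f"missing parameter(s): {', '.join(missing)}")
    # Run the script with the same interpreter that runs the wrapper.
    return [sys.executable, script] + [values[name] for name in param_names]
```

A wrapper built this way would pass the returned list to `subprocess.run(command, check=True)`, so the user never has to type the script's parameters directly.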
Reducing Parameters
Because we usually process files one batch of accessions at a time, I went through the process myself and minimized the information the user has to supply. For example, switching the download list on a remote server requires the location of the extracted download links ("_url") and the remote server address. Each proxy node has a code from B to E (A is reserved for localhost), so the user only needs to provide the server code; the wrapper fills in the remaining parameters, such as server address, port, username, password, and remote location, from predefined values. Likewise, because most operations work on a whole batch of accessions, the user only needs to provide the path to the batch folder, and the script automatically locates the required file in that directory.
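A minimal sketch of this parameter reduction is shown below. The host names, ports, and remote directories are placeholders, not the real predefined values, and the rule that the link file is the one whose name ends in "_url" is an assumption based on the description above. `SERVERS`, `resolve_server`, and `find_url_file` are hypothetical names.

```python
from pathlib import Path

# Placeholder values only; the real addresses and credentials are not shown here.
SERVERS = {
    "A": {"host": "localhost", "port": 22, "remote_dir": "/tmp/downloads"},
    "B": {"host": "node-b.example.org", "port": 22, "remote_dir": "/data/downloads"},
    "C": {"host": "node-c.example.org", "port": 22, "remote_dir": "/data/downloads"},
}


def resolve_server(code):
    """Expand a one-letter server code into the full set of connection parameters."""
    key = code.strip().upper()
    if key not in SERVERS:
        raise ValueError(f"unknown server code: {code!r}")
    return dict(SERVERS[key])  # copy, so callers can override values safely


def find_url_file(batch_dir):
    """Locate the extracted download link file inside a batch folder.

    Assumes the file name ends with "_url" and that exactly one such file exists.
    """
    matches = sorted(
        p for p in Path(batch_dir).iterdir()
        if p.is_file() and p.name.endswith("_url")
    )
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one *_url file in {batch_dir}, found {len(matches)}"
        )
    return matches[0]
```

With helpers like these, the user supplies only a server code and a batch folder, and every other parameter is derived.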
Fast yet Flexible
Once the server code is provided, the wrapper pre-fills each value accordingly and prompts the user to confirm it. If the value is correct (as in most cases), the user only needs to press Enter. If a value is incorrect, or a location has temporarily changed, the user can simply type the new value at the confirmation prompt.
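The confirm-or-override prompt can be sketched in a few lines. `prompt_with_default` is a hypothetical helper name; the `input_fn` parameter exists only so the function can be exercised without a terminal.

```python
def prompt_with_default(label, default, input_fn=input):
    """Show a pre-filled value; Enter keeps it, any other input replaces it."""
    answer = input_fn(f"{label} [{default}]: ").strip()
    return answer if answer else default
```

For example, `prompt_with_default("Remote location", "/data/downloads")` would return "/data/downloads" when the user just presses Enter, and the typed text otherwise.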
Error Tolerance
Entering a directory correctly can be difficult for a general user, so every value is sanitized and validated before being passed to the next step. On Windows, a path copied directly from the address bar uses backslashes instead of slashes, which can cause serious problems when the path is stored as a string. The script therefore sanitizes the entered value first, replacing every backslash with a slash.
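The sanitizing step might look like the following sketch. Replacing backslashes is what the text describes; trimming surrounding whitespace and quotes is an extra assumption added here, and `sanitize_path` is a hypothetical name.

```python
def sanitize_path(raw):
    """Normalize a user-entered path so later steps see forward slashes only."""
    # Assumption: also drop stray whitespace and quotes left over from copy and paste.
    cleaned = raw.strip().strip('"').strip("'")
    # Replace every backslash with a forward slash.
    return cleaned.replace("\\", "/")
```

For instance, `sanitize_path(r"C:\data\batch_01")` yields `"C:/data/batch_01"`.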
Users may also provide an incorrect value or choose the wrong option. When a value is provided, the script first checks that it exists before continuing. In addition, for sensitive operations such as replacing the download list, the script checks for leftover files and warns the user, because a file left in the folder usually indicates that the previous download has not finished. If the user does want to continue, the script requires them to type "confirm" before proceeding.
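The existence check and the typed confirmation could be sketched as follows. This version inspects a local folder for simplicity, whereas the real check may run against the remote server; `require_existing_dir`, `leftover_files`, and `confirm_replace` are hypothetical names.

```python
import os


def require_existing_dir(path):
    """Return the path unchanged if it is an existing directory, else raise."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"directory does not exist: {path}")
    return path


def leftover_files(folder):
    """List regular files still present in the folder."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def confirm_replace(folder, input_fn=input):
    """Allow a sensitive operation only if the folder is clean or the user types 'confirm'."""
    leftovers = leftover_files(require_existing_dir(folder))
    if not leftovers:
        return True
    print(f"Warning: {len(leftovers)} file(s) left in {folder}; "
          "the previous download may not be complete.")
    return input_fn('Type "confirm" to continue: ').strip() == "confirm"
```

Requiring the full word instead of a single keypress makes it harder to overwrite an unfinished download list by accident.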
Sample Run
Auxiliary Scripts
List of auxiliary (non-major) scripts and their purpose.
Script Name | Purpose | Size |
---|---|---|
bulkdecompress.py | Decompress all batches in a given directory and keep the file structure. | 8 KB |
bulkexport_old.py | Extract all download links, single-thread version. | 15 KB |
bulkpurge.py | Clear any non-whitelisted files. | 3 KB |
checkgz.py | Check if a gz archive is valid. | 5 KB |
createcont.py | Create "_content" metadata file. | 4 KB |
decompressgz.py | Decompress gz archives for a single accession; helper script called by bulkdecompress.py. | 8 KB |
filter_size | Find and move files whose size is greater than a threshold, used to find extra large file(s). | 2 KB |
findmissing.py | Based on metadata, find missing and excess files. | 4 KB |
findnonhomo.py | Find and remove entries that are not part of three pre-set species. | 3 KB |
findnonsra.py | Find and remove entries that are not RNA-seq. | 3 KB |
gdreport.py | Get the download report. | 2 KB |
genfcont.py | Generate a content file that is stored with each tape. | 5 KB |
get_total_size.py | Get the total size of all files waiting to be downloaded. | 1 KB |
get_xml.py | Get XML metadata from the remote site. | 3 KB |
move_same.py | Based on the database, move already-downloaded (duplicate) accessions to the given directory before downloading. | 3 KB |
movestored.py | Find and move the stored file to the destination (trash). | 5 KB |
purge_folder.py | Monitor a list of directories, and remove any un-whitelisted file(s) every hour. | 2 KB |
purgefiles.py | Remove any extra (unspecified) files in a given accession folder and prepare it for storage. | 4 KB |
remove_url_config.py | Find unpaired or missing files, then fix and regenerate the metadata. | 13 KB |
rmlines.py | Remove unwanted entries while extracting downloading links. | 3 KB |
simdown.py | Create a fake downloaded batch of accessions for testing other scripts. | 1 KB |
split_download.py | Split downloaded accessions into batches of 50. | 1 KB |
valicont.py | Validate the content and mark it as validated. Files that pass this test will continue to analysis. | 9 KB |
secret.py | Contains credentials for FTP/SSH login. | 1 KB |
remove_empty.py | Remove empty accessions after filtering. | 2 KB |
bulk_remove_empty.py | Remove empty accessions for all batches. | 2 KB |
add_to_database.py | Prompt and add identifier & volume information to the database. | 11 KB |
bulk_import_initial.py | Import all initial values after database initialization. | 1 KB |
bulk_move_dup.py | Move duplicates for all accessions. | 1 KB |
dbconfig.py | Contains configurations (columns, credentials, tables, etc.) of the database. | 4 KB |
dbhelper.py | Shared library and functions for database operations. | 3 KB |
import_data_initial.py | Import all initial information into the database. | 9 KB |
initializedb.py | Initialize the database after creation. | 4 KB |
update_information.py | Update the status of accessions and files to the provided status code, using one of 3 scanning modes. | 9 KB |