Data Analysis - Dealing with Errors
Errors and exceptions are inevitable at every stage of the pipeline. This section describes how I deal with the different kinds of errors.
This section has three parts: strategy, common errors, and error fixing.
Strategy
It is not practical to deal with every error the moment it occurs at each stage of the pipeline. Therefore, the strategy I chose is to record errors as they happen and deal with them together in a later pass.
During analysis, if an error occurs, a new metadata file ".error" is created to indicate that the error occurred. If the error is one of the known common errors, the reason is included in the metadata file as well. The pipeline then continues with the other files in the same accession. Once the accession is finished, the script responsible for final processing and result combining detects the error and reports it by moving the file to the error directory. Any results generated during analysis are moved along with it to make further investigation easier.
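The marker mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the naming scheme (the sample's file name with ".error" appended), the `KNOWN_ERRORS` table, and the function names `mark_error` and `has_error` are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical table of known errors; the real list is not shown in this document.
KNOWN_ERRORS = {
    "unknown_phred": "Quality encoding is neither phred33 nor phred64",
    "remote_missing": "File could not be retrieved from the remote file system",
}


def mark_error(sample: Path, error_key: Optional[str] = None) -> Path:
    """Create '<sample file name>.error' next to the sample and return its path.

    The marker stays empty unless error_key names a known error, in which case
    the reason is written into it.
    """
    marker = sample.with_name(sample.name + ".error")
    reason = KNOWN_ERRORS.get(error_key, "") if error_key else ""
    marker.write_text(reason + "\n" if reason else "")
    return marker


def has_error(sample: Path) -> bool:
    """Tell the final processing step whether this sample was flagged."""
    return sample.with_name(sample.name + ".error").exists()
```

Because the marker is only a small file, writing it never blocks the pipeline, and the final processing script can find every flagged sample with a single existence check.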
Common Errors
Mismatched & Unpaired Files
Description
If the actual files in the directory do not match those recorded in the _content file, or if a paired-end file is present without its pair, this error will be reported.
Cause
This error occurs when the upstream step did not validate the files properly. It usually happens after manual actions were carried out incorrectly.
Action
To avoid a worse result, the accession won't enter the analysis pipeline and will be moved to the error directory immediately.
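A check of this kind could look like the sketch below. It assumes that the _content file has already been read into a set of file names and that paired files follow the common `_1` / `_2` suffix convention; both are assumptions for illustration, as are the names `PAIR_RE` and `find_problems`.

```python
import re
from typing import Dict, List, Set

# Assumed naming convention: <stem>_1.fastq.gz / <stem>_2.fastq.gz (also .fq, .fastq).
PAIR_RE = re.compile(r"^(?P<stem>.+)_(?P<mate>[12])(?P<ext>\.f(?:ast)?q(?:\.gz)?)$")


def find_problems(recorded: Set[str], actual: Set[str]) -> Dict[str, List[str]]:
    """Compare recorded file names with the files actually present."""
    problems: Dict[str, List[str]] = {
        "missing": sorted(recorded - actual),     # recorded but not on disk
        "unrecorded": sorted(actual - recorded),  # on disk but not recorded
        "unpaired": [],                           # _1 without _2, or _2 without _1
    }
    for name in sorted(actual):
        match = PAIR_RE.match(name)
        if match is None:
            continue  # single-end file, nothing to pair
        other = "2" if match["mate"] == "1" else "1"
        mate_name = f"{match['stem']}_{other}{match['ext']}"
        if mate_name not in actual:
            problems["unpaired"].append(name)
    return problems
```

An accession would be sent to the error directory as soon as any of the three lists is non-empty.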
Unknown Phred Score
Description
Our pipeline can only handle the phred33 and phred64 quality formats. If the supplied file uses another format, such as Solexa, or if the file is not a gene sequence at all, this error will occur.
Cause
This can be caused by a less common sequencing platform, a file that was damaged during upload, or the wrong file being uploaded.
Action
Skip the current file, mark it as an error, and continue to the next file.
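For illustration, the quality format can be guessed from the range of ASCII codes in the quality lines. The sketch below uses a widely known heuristic and is not necessarily how the pipeline decides; the thresholds (59, 64, 74) and the function names are assumptions, and the result is only a guess for files with an unusually narrow quality range.

```python
from itertools import islice
from typing import Iterable, Iterator


def quality_lines(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield every 4th line of a FASTQ stream, i.e. the quality strings."""
    return islice(fastq_lines, 3, None, 4)


def guess_quality_encoding(lines: Iterable[str]) -> str:
    """Return 'phred33', 'phred64', 'solexa64', or 'unknown'."""
    codes = [ord(ch) for line in lines for ch in line.rstrip("\r\n")]
    if not codes:
        return "unknown"
    lo, hi = min(codes), max(codes)
    if lo < 33 or hi > 126:
        return "unknown"   # not printable quality characters, likely not FASTQ
    if lo < 59:
        return "phred33"   # offset-64 formats cannot produce characters below ';'
    if hi <= 74:
        return "phred33"   # assumption: uniformly high-quality offset-33 reads
    if lo < 64:
        return "solexa64"  # Solexa scores may go below zero (';' to '?')
    return "phred64"
```

Under this sketch, a file would be skipped and marked as an error whenever the result is anything other than `phred33` or `phred64`.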
Failed to Retrieve Remote File
Description
This error occurs when the pipeline has passed the file retrieval stage but cannot find the file for analysis.
Cause
This usually indicates that the remote file system is unavailable, for example because of a network problem or a server reboot.
Action
The script will retry 3 times and then move on. Usually, this will be enough for the remote server to be back online.
Exceptions During Other Stages
Description
There is no fixed description for this kind of exception. It could occur during any stage when fastqc, trim_galore, or salmon is running.
Cause
Usually a system glitch, a memory problem (unlikely), a disk problem, a remote file system problem, or a damaged file.
Action
Retry 3 times. If the stage still fails after 3 retries, the file will be marked as an error and the pipeline will move on to the next file.
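The retry behaviour can be sketched as a small wrapper. This is only an illustration under stated assumptions: "retry 3 times" is read as one initial attempt plus up to three retries, and the names `run_with_retries` and `run_command` as well as the optional delay are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(task: Callable[[], object], retries: int = 3, delay: float = 0.0) -> bool:
    """Run task once, then retry up to `retries` times; return True on success."""
    for attempt in range(1 + retries):
        try:
            task()
            return True
        except Exception as exc:  # any failure counts: tool error, I/O error, ...
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)  # optional pause so a remote server can recover
    return False


def run_command(cmd: Sequence[str]) -> None:
    """Run an external tool; a non-zero exit status raises CalledProcessError."""
    subprocess.run(cmd, check=True)


# Example usage (the path is a placeholder):
# ok = run_with_retries(lambda: run_command(["fastqc", "sample.fastq.gz"]))
# if not ok: mark the file as an error and continue with the next file
```

Returning a boolean instead of re-raising keeps the caller simple: a `False` result means the file should be marked as an error and skipped.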
Error Fixing
When There is an Error Result
The existence of an error result indicates that the accession passed the initial check and entered the analysis pipeline. In this situation, we scan the generated error report, remove every file that is marked as an error, and update the metadata file. If the accession is empty after the removal, the entire accession is deleted from the system.
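A rough sketch of this cleanup step is shown below. It assumes that each failed file has a marker named after it with ".error" appended, and that the metadata file is a plain list of file names called `_content`; the layout and the function name `clean_accession` are assumptions for the example rather than the actual implementation.

```python
import shutil
from pathlib import Path


def clean_accession(accession_dir: Path, content_name: str = "_content") -> bool:
    """Remove files flagged as errors and refresh the metadata file.

    Returns True if the accession still contains files, or False if it became
    empty and was deleted.
    """
    for marker in list(accession_dir.glob("*.error")):
        failed = marker.with_name(marker.name[: -len(".error")])
        if failed.exists():
            failed.unlink()   # drop the file that failed
        marker.unlink()       # drop its marker as well

    remaining = sorted(
        p.name for p in accession_dir.iterdir()
        if p.is_file() and p.name != content_name
    )
    if not remaining:
        shutil.rmtree(accession_dir)  # nothing left: delete the whole accession
        return False

    # Rewrite the metadata so it lists only the surviving files.
    (accession_dir / content_name).write_text("\n".join(remaining) + "\n")
    return True
```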
When There is no Error Result
This indicates that the accession did not pass the initial check. In this case, the script will scan the file list and fix any inconsistency between the actual files and the metadata. Once fixed, the accession will be moved back to the analysis queue. If no problem is detected, the accession will be moved to a separate directory to wait for further investigation.