Data Analysis - Dealing with Errors

Errors and exceptions are inevitable at some point in the pipeline. This section describes how I deal with the different kinds of errors.

This section has three parts: strategy, common errors, and error fixing.

[Image: sample image of errors]

Strategy

It is not practical to deal with every error immediately after it occurs at each stage of the pipeline. Therefore, the strategy I chose is to report errors as they occur and deal with them together later.

During the analysis pipeline, if an error occurs, a new metadata file ".error" is created to indicate that an error occurred. If the error is one of the common errors on file, the reason is recorded in the metadata file as well. The pipeline then continues with the other files in the same accession. Once the accession is complete, the script responsible for final processing and result combining detects the error and reports it by moving the file to the error directory. The results generated during analysis are attached as well to facilitate further investigation.
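The report-now, fix-later strategy above can be sketched roughly as follows. This is only an illustration under assumptions: the marker is written as "<file name>.error" with JSON content, and KNOWN_ERRORS, mark_error, and process_accession are hypothetical names, not the pipeline's actual code.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical table of common errors; the real list is not shown in this section.
KNOWN_ERRORS = {
    "unknown_phred": "Unknown Phred score encoding",
    "remote_file_missing": "Failed to retrieve remote file",
}


def mark_error(data_file: Path, error_key: Optional[str] = None) -> Path:
    """Write '<data_file>.error' next to the data file and return its path."""
    marker = data_file.with_name(data_file.name + ".error")
    record = {"file": data_file.name}
    # A reason is attached only when the error is one of the common errors on file.
    if error_key in KNOWN_ERRORS:
        record["reason"] = KNOWN_ERRORS[error_key]
    marker.write_text(json.dumps(record))
    return marker


def process_accession(files, analyze):
    """Run analyze() on each file; a failure marks that file and moves on."""
    failed = []
    for data_file in files:
        try:
            analyze(data_file)
        except Exception as exc:
            # An exception may carry an optional 'error_key' attribute (assumption).
            mark_error(data_file, getattr(exc, "error_key", None))
            failed.append(data_file)
    return failed
```

The list returned by process_accession is what a final processing step could use to decide which files to move to the error directory.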

Common Errors

Mismatched & Unpaired Files

Description

If the actual files in the directory are not the same as those recorded in the _content file, or if a paired-end file is present without its pair, this error will be reported.

Cause

This error occurs when the upstream step did not validate the files properly. It usually happens when manual actions are taken improperly.

Action

To avoid a worse result, the accession won't enter the analysis pipeline and will be moved to the error directory immediately.
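A minimal sketch of such an initial check is shown below. It assumes that the _content file lists one file name per line and that paired files are named with "_1" and "_2" before the first dot; both conventions, and the function names, are assumptions made for illustration.

```python
from pathlib import Path
from typing import List, Optional


def mate_name(name: str) -> Optional[str]:
    """Return the expected mate of a paired file, or None for a single-end file."""
    stem, dot, ext = name.partition(".")
    for tag, other in (("_1", "_2"), ("_2", "_1")):
        if stem.endswith(tag):
            return stem[: -len(tag)] + other + dot + ext
    return None


def check_accession(directory: Path, content_file: Path) -> List[str]:
    """Return a list of problems; an empty list means the accession may proceed."""
    lines = content_file.read_text().splitlines()
    recorded = {line.strip() for line in lines if line.strip()}
    actual = {
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.name != content_file.name
    }
    problems = []
    for name in sorted(recorded - actual):
        problems.append(f"recorded but missing: {name}")
    for name in sorted(actual - recorded):
        problems.append(f"present but not recorded: {name}")
    for name in sorted(actual):
        mate = mate_name(name)
        if mate is not None and mate not in actual:
            problems.append(f"unpaired: {name}")
    return problems
```

If the returned list is not empty, the accession would be moved to the error directory instead of entering the analysis pipeline.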

Unknown Phred Score

Description

Our pipeline can only handle the phred33 and phred64 formats. If the supplied file is in another format such as Solexa, or if the file is not a gene sequence file at all, this error will occur.

Cause

A less common sequencing platform, a file damaged during upload, or the wrong file being uploaded can all cause this issue.

Action

Skip the current file, mark it as an error, and continue to the next file.
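One common way to tell the encodings apart is to look at the range of ASCII codes on the quality lines. The sketch below uses that heuristic with assumed thresholds (codes below 59 imply phred33; a minimum of at least 64 together with a maximum above 74 implies phred64; anything else is treated as unknown). It is not necessarily how the pipeline itself detects the format, and UnknownPhredError is a hypothetical name.

```python
from itertools import islice


class UnknownPhredError(ValueError):
    """Raised when the quality encoding is neither phred33 nor phred64."""


def fastq_quality_lines(handle, max_reads=1000):
    """Yield the quality line (every 4th line) of the first max_reads records."""
    return islice(handle, 3, 4 * max_reads, 4)


def guess_phred(quality_lines):
    """Guess 'phred33' or 'phred64' from the ASCII range of the quality lines."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise UnknownPhredError("no quality characters found")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        # Not printable ASCII, so probably not a FASTQ file at all.
        raise UnknownPhredError("quality line contains non-printable characters")
    if lowest < 59:
        return "phred33"
    if lowest >= 64 and highest > 74:
        return "phred64"
    # Minimum in 59..63 suggests Solexa; a narrow high range is ambiguous.
    raise UnknownPhredError(f"unrecognized quality range {lowest}-{highest}")
```

A caller would catch UnknownPhredError, mark the file as an error, and continue with the next file.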

Failed to Retrieve Remote File

Description

This error is reported when the pipeline has passed the file retrieval stage but cannot find the file for analysis.

Cause

This usually indicates that the remote file system is unavailable, for example because of a network problem or a server reboot.

Action

The script will retry 3 times and then move on. Usually, this gives the remote server enough time to come back online.
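The retry behaviour can be sketched as a small polling helper. The delay between attempts and the name wait_for_file are assumptions; the section only states that the script retries 3 times.

```python
import time
from pathlib import Path


def wait_for_file(path, retries=3, delay=30.0, sleep=time.sleep):
    """Check for the file once, then retry up to `retries` more times.

    Returns True as soon as the file exists, False once all retries are used.
    The `sleep` argument can be replaced with a stub for testing.
    """
    target = Path(path)
    for attempt in range(retries + 1):
        if target.exists():
            return True
        if attempt < retries:
            sleep(delay)  # give the remote file system time to come back
    return False
```

When the helper returns False, the pipeline would report the file as an error and move on.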

Exceptions During Other Stages

Description

There is no fixed description for this kind of exception. It could occur during any stage when fastqc, trim_galore, or salmon is running.

Cause

Usually a system glitch, a memory problem (unlikely), a disk problem, a remote file system problem, or a damaged file.

Action

Retry 3 times. If it still fails after 3 retries, the file will be marked as an error and the pipeline will move on to the next file.
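For the external tools, the same idea can be wrapped around a subprocess call. The following is a rough sketch, assuming that a non-zero exit code signals failure; run_with_retries is a hypothetical helper, and the actual command lines used for fastqc, trim_galore, and salmon are not shown in this section.

```python
import subprocess


def run_with_retries(cmd, retries=3):
    """Run cmd once, then retry up to `retries` times on failure.

    Returns (True, "") on success, or (False, last_error_text) when every
    attempt failed, so the caller can attach the text to the error report.
    """
    last_error = ""
    for _ in range(retries + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError as exc:
            # For example, the executable is missing or not runnable.
            last_error = str(exc)
            continue
        if result.returncode == 0:
            return True, ""
        last_error = result.stderr
    return False, last_error
```

A caller would pass the tool invocation as a list of arguments and mark the file as an error when the first element of the returned pair is False.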

Error Fixing

When There is an Error Result

The existence of an error result indicates that the accession passed the initial check and entered the analysis pipeline. In this situation, we scan the generated error report, remove every file that is marked as an error, and update the metadata file. If the accession is empty after the removal, the entire accession is deleted from the system.
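The cleanup described above could look roughly like this. The sketch assumes that the metadata file is a plain list of file names, one per line, stored inside the accession directory, and that the error report has already been reduced to a list of failed file names; remove_failed_files is a hypothetical name.

```python
import shutil
from pathlib import Path


def remove_failed_files(accession_dir, content_file, failed_names):
    """Delete failed files, rewrite the file list, and drop empty accessions.

    Returns the list of file names that remain in the accession.
    """
    accession_dir = Path(accession_dir)
    content_file = Path(content_file)
    failed = set(failed_names)
    for name in failed:
        (accession_dir / name).unlink(missing_ok=True)  # Python 3.8+
    recorded = [line.strip() for line in content_file.read_text().splitlines()]
    remaining = [name for name in recorded if name and name not in failed]
    if not remaining:
        # Nothing usable is left, so the whole accession is removed.
        shutil.rmtree(accession_dir)
        return []
    content_file.write_text("\n".join(remaining) + "\n")
    return remaining
```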

When There is no Error Result

This indicates that the accession did not pass the initial check. In this case, the script scans the file list and fixes any inconsistency between the actual files and the metadata. Once fixed, the accession is moved back to the analysis queue. If no problem is detected, the accession is moved to another directory to wait for further investigation.

[Image: error directory]
