Data Analysis - Result Collection

The last step of the analysis pipeline is to collect the results.

This section includes two components: Filtering and Finalizing.

Figure: Sample result output.

Filtering

Before the results can be moved to their final destination, unwanted entries have to be filtered out.

Entries with Errors

First, the directory is scanned for error markers (.error). If an error is detected for any file, the entire accession on server A is moved to the error directory, the accession on server B (if it exists) is deleted, and the result is copied to server A to await further processing.
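The scan-and-move step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes a marker is any file whose name ends in ".error", it only handles the local move into an error directory, and the function names find_error_markers and quarantine_if_failed are hypothetical. The handling of server B and the copy back to server A are not modeled.

```python
import os
import shutil
from pathlib import Path


def find_error_markers(accession_dir):
    """Return every file under accession_dir whose name ends with '.error'.

    Assumption: error markers are files named '.error' or '<something>.error'.
    """
    markers = []
    for dirpath, _dirnames, filenames in os.walk(accession_dir):
        for name in filenames:
            if name.endswith(".error"):
                markers.append(Path(dirpath) / name)
    return sorted(markers)


def quarantine_if_failed(accession_dir, error_dir):
    """Move the whole accession into error_dir if any marker is found.

    Returns the new location, or None when the accession has no markers.
    """
    accession_dir = Path(accession_dir)
    if not find_error_markers(accession_dir):
        return None
    error_dir = Path(error_dir)
    error_dir.mkdir(parents=True, exist_ok=True)
    target = error_dir / accession_dir.name
    shutil.move(str(accession_dir), str(target))
    return target
```

A single marker anywhere in the accession is enough to move the whole accession, which matches the all-or-nothing rule described above.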

Mapping Rate

The mapping rate is extracted from Salmon's log during the run, and we only want the files that have a good mapping rate. After discussing with the researchers, we decided to keep only the entries whose mapping rate is greater than or equal to 75%. However, since the files are relatively small, I decided to also keep a copy of the full record.
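The threshold rule can be illustrated with a short sketch. It assumes the mapping rate has already been extracted as text such as "82.5" or "82.5%"; that format, and the names parse_mapping_rate and split_by_mapping_rate, are assumptions made for illustration only.

```python
THRESHOLD = 75.0  # percent; the cutoff agreed with the researchers


def parse_mapping_rate(text):
    """Parse a rate such as '82.5' or '82.5%' into a float percentage.

    Assumption: the stored value is a plain number with an optional '%'.
    """
    return float(text.strip().rstrip("%"))


def split_by_mapping_rate(rates, threshold=THRESHOLD):
    """Split {file name: rate} into (kept, full).

    'kept' holds entries with rate >= threshold; 'full' is a copy of
    every entry, mirroring the decision to keep the full record too.
    """
    kept = {name: rate for name, rate in rates.items() if rate >= threshold}
    return kept, dict(rates)
```

The comparison is inclusive, so a file with a mapping rate of exactly 75% is kept.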

Figure: Result processing pipeline.

Finalizing

Once we have the desired results, they need to be combined into one file per accession. We are particularly interested in two parameters: TPM and NumReads. For each parameter, two .tsv files are generated: one contains the desired results and the other contains the full records. Each .tsv file holds a 2D array: the x-axis is the names of the different files without suffixes, and the y-axis is the complete gene names. For entries that do not exist, I use 0.0 as the placeholder.
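The 2D layout described above can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: it relies on the standard Salmon quant columns (Name, Length, EffectiveLength, TPM, NumReads), and the function names and the "gene" label in the header row are hypothetical.

```python
import csv


def read_quant(handle, column):
    """Map gene name -> value of `column` ('TPM' or 'NumReads')
    from a tab-separated Salmon quant file opened as `handle`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row[column]) for row in reader}


def build_matrix(per_file):
    """Combine {file name: {gene: value}} into (header, rows).

    Columns are file names, rows are gene names, and a gene missing
    from a file gets the placeholder 0.0.
    """
    samples = sorted(per_file)
    genes = sorted({g for values in per_file.values() for g in values})
    header = ["gene"] + samples  # "gene" is a hypothetical corner label
    rows = [[g] + [per_file[s].get(g, 0.0) for s in samples] for g in genes]
    return header, rows


def write_tsv(handle, header, rows):
    """Write the combined table as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
```

Running build_matrix once on the filtered files and once on all files, for each of TPM and NumReads, would give the four .tsv files per accession described above.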

Once the desired entries have been extracted from each file's result, they are moved to final storage along with the individual results of each file. Since the size is small, the raw result of each file is stored as well.

Filename            Purpose
.analyze_complete   Marks the current file as complete.
.filename           The original filename.
mapping_rate        Contains the mapping rate of the current file.
quant.genes.sf      The final output from Salmon.
scientific_name     Contains the scientific name of the genes processed.
Final result of each file.

Final combined result.
