wf-ani: Output

March 22, 2024

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and is used to calculate average nucleotide identity (ANI) between genomes supplied as FastA and GenBank files.

Input quality control
- Initial input file check ensures input FastA and GenBank files meet a minimum file size
Average nucleotide identity (ANI)
- BLAST
- SKANI
- fastANI
Summaries of the output files generated during the pipeline process
Pipeline information - Report metrics generated during the workflow execution

Note

[sample] is a unique identifier that is parsed from input filenames and excludes everything after the first period (.).
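
The parsing rule above can be sketched in Python as follows. This is only an illustration of the rule as stated, not the pipeline's own code; the function name `sample_id` and the example filename are hypothetical.

```python
from pathlib import PurePath


def sample_id(filename: str) -> str:
    """Return the part of the file's base name before the first period."""
    # Drop any directory components, then keep the text before the first ".".
    return PurePath(filename).name.split(".", 1)[0]


print(sample_id("data/GCF_000005845.2_ASM584v2_genomic.fna.gz"))  # GCF_000005845
```

Under this rule, a filename that contains a period early on (for example an accession with a version suffix) is truncated at that point, so it may be worth confirming that the resulting identifiers are still unique across the input set.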

Tip

All tab-separated value (TSV) files can be converted to Excel spreadsheets (XLSX) by using the parameter --create_excel_outputs when running the pipeline.

When using this parameter, a summary workbook is created to allow for all summary files to be added to separate worksheets within the workbook.

Input quality control

Input files must meet a minimum file size to be processed within this pipeline. Files that pass this check are used to generate pairings, and ANI is calculated for each pairing.

Initial input file check

QC Steps

Each input file is checked against a minimum file size before it is processed within this pipeline [Default: 45k]. This prevents unusually small input files from wasting compute time on data that will not yield usable results.
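
A check of this kind might look like the sketch below. It assumes that "45k" means 45 × 1024 bytes, which the pipeline may interpret differently; the constant `MIN_SIZE_BYTES` and the function `passes_size_check` are hypothetical names used only for illustration.

```python
import os

# Assumption: "45k" is read here as 45 * 1024 bytes; the pipeline's exact
# interpretation of the unit may differ.
MIN_SIZE_BYTES = 45 * 1024


def passes_size_check(path: str, min_bytes: int = MIN_SIZE_BYTES) -> bool:
    """Return True if the file at `path` is at least `min_bytes` in size."""
    return os.path.getsize(path) >= min_bytes
```

Running something like this on the input files before launching the pipeline can give an early indication of which samples are likely to fail the initial input file check.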

Average nucleotide identity (ANI)

Pairings of input files that pass the quality control check are identified and ANI is performed on each pairing.
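
The two pairing modes mentioned in this document (all vs all, and query vs reference panel) can be illustrated with a short Python sketch. It assumes that all vs all means every unordered pair of distinct genomes with no self-comparisons, which may not match the pipeline's exact behavior; the function names are hypothetical.

```python
from itertools import combinations, product


def all_vs_all(genomes):
    """Every unordered pair of distinct genomes (assumed; no self-pairs)."""
    return list(combinations(genomes, 2))


def query_vs_refs(queries, references):
    """Each query paired with each reference genome."""
    return list(product(queries, references))


print(all_vs_all(["A", "B", "C"]))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(query_vs_refs(["Q1"], ["R1", "R2"]))
# [('Q1', 'R1'), ('Q1', 'R2')]
```

Under this assumption, an all vs all run on n genomes produces n(n-1)/2 pairings, so the number of comparisons grows quickly as input files are added.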

Important

Outputs generated by BLAST, SKANI, and fastANI are not directly comparable with one another, even when the same input files are used.

Tip

When processing many input files, fastANI may help decrease runtime.

BLAST

Output files

ANI/BLAST
- pairs.[tsv,xlsx]: List of all pairings to be performed.
- queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
- references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
- genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
ANI/BLAST/[pair1]-[pair2]
- ani.[pair2],[pair1].stats.tab: ANI summary of each pair, including the combined bidirectional ANI.
- blast.[pair2],[pair1].tab: BLAST output of each fragment of [pair2] vs reference [pair1].
- blast.[pair1],[pair2].tab: BLAST output of each fragment of [pair1] vs reference [pair2].
- blast.[pair2],[pair1].filt.tab: Filtered BLAST output.
- blast.[pair2],[pair1].filt.two-way.tab: Filtered bidirectional BLAST output.

SKANI

Output files

ANI/SKANI
- pairs.[tsv,xlsx]: List of all pairings to be performed.
- queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
- references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
- genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
ANI/SKANI/[pair1]-[pair2]
- skani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using SKANI to perform ANI.

fastANI

Output files

ANI/fastANI
- pairs.[tsv,xlsx]: List of all pairings to be performed.
- queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
- references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
- genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
ANI/fastANI/[pair1]-[pair2]
- fastani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using fastANI to perform ANI.

Summaries

Concatenation of output metrics for all samples.

Note

The Summary-Report Excel file is only created when the parameter --create_excel_outputs is used.

The Summary-Report Excel file has the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).

Output files

Summaries/
- Summary.ANI.[tsv,xlsx]: Summary of ANI results for all samples.
- Summary.QC_File_Checks.[tsv,xlsx]: Summary of all QC file checks detailing whether a sample passes or fails each process.
- Summary-Report_yyyy-MM-dd_HH-mm-ss.xlsx: Excel workbook where each file in the Summaries directory is added to a separate worksheet within the workbook.
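
As a rough illustration of the timestamp notation, the Python sketch below builds a filename in the same pattern. The function name `summary_report_name` and the example date are hypothetical, and the pipeline itself may generate the name differently.

```python
from datetime import datetime


def summary_report_name(when: datetime) -> str:
    """Build a filename following the yyyy-MM-dd_HH-mm-ss pattern."""
    # %Y=year, %m=month, %d=day, %H=hour (24 h), %M=minute, %S=second
    return f"Summary-Report_{when.strftime('%Y-%m-%d_%H-%M-%S')}.xlsx"


print(summary_report_name(datetime(2024, 3, 22, 14, 5, 9)))
# Summary-Report_2024-03-22_14-05-09.xlsx
```

Because every field is zero-padded and ordered from year to second, sorting these filenames alphabetically also sorts them chronologically.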

Pipeline information

Information about the pipeline execution, output logs, error logs, and QC file checks for each sample is stored here.

Note

Pipeline execution files have the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).

Pipeline information

pipeline_info/
- software_versions.yml: Summary of the software packages used in each process and their version information.
- nextflow_log.[job_id].txt: Execution log file produced by Nextflow.
- ANI_[num_of_samples].o[job_id]: Output log file produced by the job scheduler.
- ANI_[num_of_samples].e[job_id]: Error log file produced by the job scheduler.
- pipeline_dag_yyyy-MM-dd_HH-mm-ss.html: Directed acyclic graph (DAG) image of the workflow that gives a visual representation of how the processes connect to one another.
- execution_trace_yyyy-MM-dd_HH-mm-ss.txt: Text-based summary report detailing the work directory hash, runtime, CPU usage, memory usage, etc. for each process.
- execution_report_yyyy-MM-dd_HH-mm-ss.html: Summary report of all processes, including processes that passed/failed, resource usage, etc.
- execution_timeline_yyyy-MM-dd_HH-mm-ss.html: Summary report detailing the runtime and memory usage of each process.

Process log information

pipeline_info/process_logs/
- [sample].[process].command.out: Output log file for each sample in each process.
- [sample].[process].command.err: Error log file for each sample in each process.
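
When troubleshooting a run, it can help to list which error logs actually contain output. The sketch below is one possible way to do that in Python, assuming the directory layout described above; the function name `nonempty_error_logs` is hypothetical.

```python
from pathlib import Path


def nonempty_error_logs(results_dir: str) -> list[str]:
    """List *.command.err files in pipeline_info/process_logs that contain data."""
    log_dir = Path(results_dir) / "pipeline_info" / "process_logs"
    # Keep only error logs whose size is greater than zero bytes.
    return sorted(
        p.name for p in log_dir.glob("*.command.err") if p.stat().st_size > 0
    )
```

A non-empty error log does not necessarily mean the process failed, since many tools write progress messages to standard error, but it narrows down which files to read first.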

QC file checks

pipeline_info/qc_file_checks/
- [sample].Initial_Input_File.[tsv,xlsx]: Details whether the input FastA or GenBank file meets the minimum file size criteria for the pipeline [Default: 45k].