wf-ani: Output

March 22, 2024

Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built with Nextflow and calculates Average Nucleotide Identity (ANI) from FASTA and GenBank files.

Note

[sample] is a unique identifier that is parsed from input filenames and excludes everything after the first period (.).
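The rule above can be illustrated with a short sketch. The function name `sample_id` is hypothetical and is not part of the pipeline; it only mirrors the stated behavior of dropping everything after the first period.

```python
from pathlib import Path


def sample_id(filename: str) -> str:
    """Return the part of the file's base name that precedes the first period."""
    # Path(...).name drops any leading directories;
    # split(".", 1) cuts at the first period only.
    return Path(filename).name.split(".", 1)[0]


print(sample_id("GCA_000005845.2_ASM584v2.fna.gz"))  # prints GCA_000005845
```

Note that under this rule any text after the first period, such as an accession version suffix, is not part of the identifier, so two files that differ only after that period would share the same [sample] value.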

Tip

All tab-separated value (TSV) files can be converted to Excel spreadsheets (XLSX) by using the parameter --create_excel_outputs when running the pipeline.

When this parameter is used, a summary workbook is also created, and each summary file is added to a separate worksheet within the workbook.

Input quality control

Each input file must meet a minimum file size to be processed by this pipeline. Files that pass this check are used to generate pairings, and ANI is performed on each pairing.

Initial input file check

QC Steps
  • Input files are checked to ensure that they meet a minimum file size to be processed by this pipeline [Default: 45k]. This prevents unusually small inputs from wasting compute time on data that will not yield usable results.
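A check of this kind could look like the minimal sketch below. It is not the pipeline's own implementation. It assumes that "45k" means 45,000 bytes (the pipeline may instead mean 45 KiB) and that a file exactly at the threshold passes. The names `MIN_SIZE_BYTES` and `passes_size_check` are hypothetical.

```python
import os

# Assumption: "45k" is read here as 45,000 bytes; the pipeline may use 45 KiB (46,080 bytes).
MIN_SIZE_BYTES = 45_000


def passes_size_check(path: str, min_bytes: int = MIN_SIZE_BYTES) -> bool:
    """Return True when the file at `path` is at least `min_bytes` in size."""
    # Assumption: the comparison is inclusive (a file of exactly min_bytes passes).
    return os.path.getsize(path) >= min_bytes
```

Running such a check on input files before launching the pipeline can help predict which samples will be reported as failing in the QC file check summary.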

Average nucleotide identity (ANI)

Pairings of input files that pass the quality control check are identified and ANI is performed on each pairing.
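To make the two pairing modes concrete, here is a sketch of how pairings might be enumerated. It assumes that each unordered pair is listed once and that self-comparisons are skipped; the pipeline's actual pairing logic may differ. The function names are hypothetical.

```python
from itertools import combinations


def all_vs_all(genomes):
    """All vs all: every unordered pair of distinct genomes, listed once."""
    return list(combinations(sorted(genomes), 2))


def query_vs_references(queries, references):
    """Query vs reference panel: each query paired with each reference."""
    # Assumption: a genome is never paired with itself.
    return [(q, r) for q in queries for r in references if q != r]


print(all_vs_all(["genomeB", "genomeA", "genomeC"]))
print(query_vs_references(["query1"], ["ref1", "ref2"]))
```

Under these assumptions, n genomes in the all vs all mode give n(n-1)/2 pairings, which is why runtime grows quickly as more input files are added.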

Important

Outputs generated by BLAST, SKANI, and fastANI cannot be compared with one another, even when the same input files are used.

Tip

When there are many input files, fastANI may help decrease runtime.

BLAST

Output files
  • ANI/BLAST
    • pairs.[tsv,xlsx]: List of all pairings to be performed.
    • queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
    • references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
    • genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
  • ANI/BLAST/[pair1]-[pair2]
    • ani.[pair2],[pair1].stats.tab: ANI summary of each pair, including the combined bidirectional ANI.
    • blast.[pair2],[pair1].tab: BLAST output of each fragment of [pair2] vs reference [pair1].
    • blast.[pair1],[pair2].tab: BLAST output of each fragment of [pair1] vs reference [pair2].
    • blast.[pair2],[pair1].filt.tab: Filtered BLAST output.
    • blast.[pair2],[pair1].filt.two-way.tab: Filtered bidirectional BLAST output.
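The naming pattern above can be expanded for a given pair with a short sketch. The helper `blast_outputs` and the `results_dir` default are hypothetical; only the file name templates are taken from the list above.

```python
from pathlib import PurePosixPath


def blast_outputs(pair1: str, pair2: str, results_dir: str = "results"):
    """Build the expected BLAST output paths for one pairing."""
    # `results_dir` is a placeholder for the top-level results directory.
    pair_dir = PurePosixPath(results_dir) / "ANI" / "BLAST" / f"{pair1}-{pair2}"
    names = [
        f"ani.{pair2},{pair1}.stats.tab",
        f"blast.{pair2},{pair1}.tab",
        f"blast.{pair1},{pair2}.tab",
        f"blast.{pair2},{pair1}.filt.tab",
        f"blast.{pair2},{pair1}.filt.two-way.tab",
    ]
    return [str(pair_dir / name) for name in names]


for path in blast_outputs("sampleA", "sampleB"):
    print(path)
```

Note that the directory name is ordered [pair1]-[pair2], while most file names inside it are ordered [pair2],[pair1].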

SKANI

Output files
  • ANI/SKANI
    • pairs.[tsv,xlsx]: List of all pairings to be performed.
    • queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
    • references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
    • genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
  • ANI/SKANI/[pair1]-[pair2]
    • skani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using SKANI to perform ANI.

fastANI

Output files
  • ANI/fastANI
    • pairs.[tsv,xlsx]: List of all pairings to be performed.
    • queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
    • references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
    • genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
  • ANI/fastANI/[pair1]-[pair2]
    • fastani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using fastANI to perform ANI.

Summaries

Concatenation of output metrics for all samples.

Note

The Summary-Report Excel file is only created when the parameter --create_excel_outputs is used.

The Summary-Report Excel file has the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).

Output files
  • Summaries/
    • Summary.ANI.[tsv,xlsx]: Summary of ANI results for all samples.
    • Summary.QC_File_Checks.[tsv,xlsx]: Summary of all QC file checks detailing if a sample passes or fails each process.
    • Summary-Report_yyyy-MM-dd_HH-mm-ss.xlsx: Excel workbook where each file in the Summaries directory is added to a separate worksheet within the workbook.
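For scripts that need to build or parse the timestamped Summary-Report filename, the yyyy-MM-dd_HH-mm-ss notation corresponds to the Python format string "%Y-%m-%d_%H-%M-%S". The sketch below uses hypothetical helper names and assumes zero-padded fields with a 24-hour clock.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd_HH-mm-ss shorthand.
STAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"
PREFIX = "Summary-Report_"
SUFFIX = ".xlsx"


def summary_report_name(when: datetime) -> str:
    """Build a Summary-Report filename for the given date and time."""
    return f"{PREFIX}{when.strftime(STAMP_FORMAT)}{SUFFIX}"


def parse_report_stamp(filename: str) -> datetime:
    """Recover the date and time from a Summary-Report filename."""
    stamp = filename[len(PREFIX):-len(SUFFIX)]
    return datetime.strptime(stamp, STAMP_FORMAT)


print(summary_report_name(datetime(2024, 3, 22, 14, 5, 9)))
# prints Summary-Report_2024-03-22_14-05-09.xlsx
```

Because the fields run from largest to smallest unit, sorting these filenames alphabetically also sorts them chronologically.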

Pipeline information

Information about the pipeline execution, along with output logs, error logs, and QC file checks for each sample, is stored here.

Note

Pipeline execution files have the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).

Pipeline information
  • pipeline_info/
    • software_versions.yml: Summary of the software packages used in each process and their version information.
    • nextflow_log.[job_id].txt: Execution log file produced by Nextflow.
    • ANI_[num_of_samples].o[job_id]: Output log file produced by the job scheduler.
    • ANI_[num_of_samples].e[job_id]: Error log file produced by the job scheduler.
    • pipeline_dag_yyyy-MM-dd_HH-mm-ss.html: Directed acyclic graph (DAG) image of the workflow that gives a visual representation of how the processes connect to one another.
    • execution_trace_yyyy-MM-dd_HH-mm-ss.txt: Text-based summary report detailing the work directory hash, runtime, CPU usage, memory usage, etc. for each process.
    • execution_report_yyyy-MM-dd_HH-mm-ss.html: Summary report of all processes, including processes that passed/failed, resource usage, etc.
    • execution_timeline_yyyy-MM-dd_HH-mm-ss.html: Summary report detailing the runtime and memory usage of each process.
Process log information
  • pipeline_info/process_logs/
    • [sample].[process].command.out: Output log file for each sample in each process.
    • [sample].[process].command.err: Error log file for each sample in each process.
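When troubleshooting a run, it can help to list only the error logs that contain output. The sketch below assumes the [sample].[process].command.err naming shown above and relies on [sample] containing no periods. The function name and the `results_dir` argument are hypothetical.

```python
from pathlib import Path

ERR_SUFFIX = ".command.err"


def nonempty_error_logs(results_dir):
    """Return (sample, process, path) for each error log that is not empty."""
    log_dir = Path(results_dir) / "pipeline_info" / "process_logs"
    hits = []
    for err in sorted(log_dir.glob(f"*{ERR_SUFFIX}")):
        if err.stat().st_size > 0:
            stem = err.name[: -len(ERR_SUFFIX)]
            # [sample] has no periods, so the first period separates it from [process].
            sample, _, process = stem.partition(".")
            hits.append((sample, process, err))
    return hits
```

A non-empty error log does not necessarily mean that a process failed, since many tools write warnings and progress messages to standard error.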
QC file checks
  • pipeline_info/qc_file_checks/
    • [sample].Initial_Input_File.[tsv,xlsx]: Details whether the input FASTA or GenBank file meets the minimum file size criterion for the pipeline [Default: 45k].