wf-ani: Output
March 22, 2024
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and calculates Average Nucleotide Identity (ANI) between FastA and Genbank input files.
- Input quality control
- Initial input file check ensures input FastA and Genbank files meet a minimum file size
- Average Nucleotide Identity (ANI)
- Summaries of the output files generated during the pipeline process
- Pipeline information - Report metrics generated during the workflow execution
Note
[sample] is a unique identifier that is parsed from input filenames and excludes everything after the first period (.).
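As a minimal sketch of the naming rule in the note above, the identifier can be derived by cutting the file's basename at the first period. The function name `sample_id` and the example filename are hypothetical and not taken from the pipeline's code.

```python
from pathlib import Path


def sample_id(filename: str) -> str:
    """Return the basename of `filename` up to (not including) the first period."""
    # Path(...).name drops any leading directories; split(".", 1) cuts at the first period only.
    return Path(filename).name.split(".", 1)[0]


print(sample_id("genomes/Ecoli_K12.v2.fasta.gz"))  # prints Ecoli_K12
```

One consequence of this rule is that two inputs such as `strainA.v1.fasta` and `strainA.v2.fasta` would both map to the identifier `strainA`.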
Tip
All tab-separated value (TSV) files can be converted to Excel spreadsheets (XLSX) by using the parameter --create_excel_outputs when running the pipeline.
When this parameter is used, a summary workbook is also created in which each summary file is added as a separate worksheet.
Input quality control
Input files must meet a minimum file size to be processed within this pipeline. If this check passes, the input files will be used to generate pairings and have ANI performed on them.
Initial input file check
QC Steps
- Input files are checked to ensure that they meet a minimum file size to be processed within this pipeline [Default: 45k]. This is to prevent unusually small input sets from wasting compute time processing data that will not yield usable results.
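The check described above can be approximated with a simple size comparison. This is only a sketch: it assumes that "45k" means 45,000 bytes and that a file exactly at the threshold passes, neither of which is stated in this document. The names `MIN_SIZE_BYTES` and `passes_size_check` are hypothetical.

```python
import os

# Assumption: the default "45k" is read here as 45,000 bytes.
MIN_SIZE_BYTES = 45_000


def passes_size_check(path: str, min_size: int = MIN_SIZE_BYTES) -> bool:
    """Return True if the file at `path` is at least `min_size` bytes long."""
    # Assumption: a file exactly at the threshold counts as passing.
    return os.path.getsize(path) >= min_size
```

Running a check like this before launching the pipeline can flag truncated or empty inputs early.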
Average nucleotide identity (ANI)
Pairings of input files that pass the quality control check are identified and ANI is performed on each pairing.
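To illustrate how pairings might be enumerated for the two workflows mentioned in this document (all vs all, and query vs reference panel), here is a hedged sketch. It assumes that all vs all pairs are unordered and that a file is never paired with itself; the pipeline's actual pairing logic is not described here, and the function names are hypothetical.

```python
from itertools import combinations, product


def all_vs_all(genomes):
    """Every unordered pair of distinct genomes (assumption: A-B and B-A are one pairing)."""
    return list(combinations(sorted(genomes), 2))


def query_vs_reference(queries, references):
    """Every query paired with every reference, skipping identical names."""
    return [(q, r) for q, r in product(queries, references) if q != r]


print(all_vs_all(["c", "a", "b"]))
# prints [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Under these assumptions, n genomes in all vs all mode yield n(n-1)/2 pairings, so the number of comparisons grows quadratically with the number of inputs.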
Important
Outputs generated by BLAST, SKANI, and fastANI are not directly comparable with one another, even when the same input files are used.
Tip
When there are many input files, fastANI may be useful in decreasing runtime.
BLAST
Output files
- ANI/BLAST
  - pairs.[tsv,xlsx]: List of all pairings to be performed.
  - queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
  - references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
  - genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
- ANI/BLAST/[pair1]-[pair2]
  - ani.[pair2],[pair1].stats.tab: ANI summary of each pair, including the combined bidirectional ANI.
  - blast.[pair2],[pair1].tab: BLAST output of each fragment of [pair2] vs reference [pair1].
  - blast.[pair1],[pair2].tab: BLAST output of each fragment of [pair1] vs reference [pair2].
  - blast.[pair2],[pair1].filt.tab: Filtered BLAST output.
  - blast.[pair2],[pair1].filt.two-way.tab: Filtered bidirectional BLAST output.
SKANI
Output files
- ANI/SKANI
  - pairs.[tsv,xlsx]: List of all pairings to be performed.
  - queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
  - references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
  - genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
- ANI/SKANI/[pair1]-[pair2]
  - skani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using SKANI to perform ANI.
fastANI
Output files
- ANI/fastANI
  - pairs.[tsv,xlsx]: List of all pairings to be performed.
  - queries.[tsv,xlsx]: List of query genome(s) (query vs reference panel workflow).
  - references.[tsv,xlsx]: List of reference genomes (query vs reference panel workflow).
  - genomes.[tsv,xlsx]: List of input genomes that passed quality control checks (all vs all workflow).
- ANI/fastANI/[pair1]-[pair2]
  - fastani.[pair1]-[pair2].[tsv,xlsx]: Output of [pair1] vs reference [pair2] when using fastANI to perform ANI.
Summaries
Concatenation of output metrics for all samples.
Note
The Summary-Report Excel file is only created when the parameter --create_excel_outputs is used.
The Summary-Report Excel file has the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).
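For readers scripting around these filenames, the yyyy-MM-dd_HH-mm-ss notation corresponds to the Python format string `%Y-%m-%d_%H-%M-%S`. The sketch below builds and parses such a timestamp; the helper names are hypothetical and the example date is arbitrary.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd_HH-mm-ss shorthand.
STAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def summary_report_name(when: datetime) -> str:
    """Build a Summary-Report filename for the given date and time."""
    return f"Summary-Report_{when.strftime(STAMP_FORMAT)}.xlsx"


def parse_timestamp(stamp: str) -> datetime:
    """Turn a timestamp such as 2024-03-22_14-05-09 back into a datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT)


print(summary_report_name(datetime(2024, 3, 22, 14, 5, 9)))
# prints Summary-Report_2024-03-22_14-05-09.xlsx
```

Because every field is zero-padded and ordered from year to second, sorting these filenames alphabetically also sorts them chronologically.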
Output files
- Summaries
  - Summary.ANI.[tsv,xlsx]: Summary of ANI results for all samples.
  - Summary.QC_File_Checks.[tsv,xlsx]: Summary of all QC file checks detailing if a sample passes or fails each process.
  - Summary-Report_yyyy-MM-dd_HH-mm-ss.xlsx: Excel workbook where each file in the Summaries directory is added to a separate worksheet within the workbook.
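The TSV summaries can be loaded with Python's standard `csv` module. The sketch below assumes that the first line of each file is a header row, which this document does not state, and it makes no assumption about the column names. The helper name `read_tsv` is hypothetical.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    # Assumption: the first line of the file holds the column names.
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Only runs when executed from a results directory that contains the summary.
summary = Path("Summaries/Summary.ANI.tsv")
if summary.exists():
    for row in read_tsv(summary):
        print(row)
```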
Pipeline information
Information about the pipeline execution, output logs, error logs, and QC file checks for each sample are stored here.
Note
Pipeline execution files have the date and time appended to the filename using the following shorthand notation: year (yyyy), month (MM), day (dd), hour (HH), minute (mm), second (ss).
Pipeline information
- pipeline_info
  - software_versions.yml: Summary of the software packages used in each process and their version information.
  - nextflow_log.[job_id].txt: Execution log file produced by Nextflow.
  - ANI_[num_of_samples].o[job_id]: Output log file produced by the job scheduler.
  - ANI_[num_of_samples].e[job_id]: Error log file produced by the job scheduler.
  - pipeline_dag_yyyy-MM-dd_HH-mm-ss.html: Directed acyclic graph (DAG) image of the workflow that gives a visual representation of how the processes connect to one another.
  - execution_trace_yyyy-MM-dd_HH-mm-ss.txt: Text-based summary report detailing the work directory hash, runtime, CPU usage, memory usage, etc. for each process.
  - execution_report_yyyy-MM-dd_HH-mm-ss.html: Summary report of all processes, including processes that passed/failed, resource usage, etc.
  - execution_timeline_yyyy-MM-dd_HH-mm-ss.html: Summary report detailing the runtime and memory usage of each process.
Process log information
- pipeline_info/process_logs
  - [sample].[process].command.out: Output log file for each sample in each process.
  - [sample].[process].command.err: Error log file for each sample in each process.
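When troubleshooting a run, it can help to list which error logs actually contain text. The sketch below relies only on the [sample].[process].command.err naming pattern given above and on the fact that [sample] contains no period. The function name is hypothetical, and a non-empty error log does not necessarily mean the process failed, since many tools write routine messages to standard error.

```python
from pathlib import Path

ERR_SUFFIX = ".command.err"


def nonempty_error_logs(log_dir):
    """Return (sample, process, path) for every non-empty *.command.err file in log_dir."""
    found = []
    for path in sorted(Path(log_dir).glob("*" + ERR_SUFFIX)):
        if path.stat().st_size > 0:
            # Strip the fixed suffix, then split at the first period:
            # [sample] never contains one, so the remainder is [process].
            stem = path.name[: -len(ERR_SUFFIX)]
            sample, _, process = stem.partition(".")
            found.append((sample, process, path))
    return found


# Only runs when executed from a results directory that contains the logs.
logs = Path("pipeline_info/process_logs")
if logs.is_dir():
    for sample, process, path in nonempty_error_logs(logs):
        print(sample, process, path)
```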
QC file checks
- pipeline_info/qc_file_checks
  - [sample].Initial_Input_File.[tsv,xlsx]: Details whether the input FastA or Genbank file meets the minimum file size criterion for the pipeline [Default: 45k].