TSV file for sample
February 12, 2019
Input files for Sarek can be specified using a TSV file given to the --sample parameter.
The TSV file is a Tab Separated Value file with one of the following column layouts:
- subject gender status sample lane fastq1 fastq2
- subject gender status sample lane bam
- subject gender status sample bam bai
The content of these columns should be fairly straightforward:
- subject designates the subject. It should be the ID of the Patient or, if you don't have one, the ID of the Normal Sample.
- gender is the gender of the Patient (XX or XY).
- status is the status of the Patient (0 for Normal or 1 for Tumor).
- sample designates the Sample. It should be the ID of the Sample (it is possible to have more than one tumor sample for each patient).
- lane is used when the sample is multiplexed on several lanes.
- fastq1 is the path to the first file of the FASTQ pair.
- fastq2 is the path to the second file of the FASTQ pair.
- bam is the path to the BAM file.
- bai is the path to the BAM index file.
All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it is a single normal sample instead of a normal/tumor pair.
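The column rules above can be checked before launching a run. The following is a minimal sketch, not Sarek's own parser: the helper names `read_sample_rows` and `has_tumor` are hypothetical, and the check assumes only what is stated above (6 or 7 tab-separated columns, gender XX or XY, status 0 or 1).

```python
import csv
import io

VALID_GENDERS = {"XX", "XY"}
VALID_STATUS = {"0", "1"}  # 0 = Normal, 1 = Tumor


def read_sample_rows(handle):
    """Yield one dict per TSV row after checking the four shared leading columns.

    Hypothetical helper for illustration; it does not inspect the file paths.
    """
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) not in (6, 7):
            raise ValueError(f"line {line_no}: expected 6 or 7 columns, got {len(row)}")
        subject, gender, status, sample = row[:4]
        if gender not in VALID_GENDERS:
            raise ValueError(f"line {line_no}: gender must be XX or XY, got {gender!r}")
        if status not in VALID_STATUS:
            raise ValueError(f"line {line_no}: status must be 0 or 1, got {status!r}")
        yield {
            "subject": subject,
            "gender": gender,
            "status": int(status),
            "sample": sample,
            "rest": row[4:],  # lane/fastq1/fastq2, lane/bam, or bam/bai
        }


def has_tumor(rows):
    """Return True if at least one row is marked as Tumor (status 1)."""
    return any(r["status"] == 1 for r in rows)


example = (
    "G15511\tXX\t0\tC09DFN\tC09DF_1\t"
    "pathToFiles/C09DFACXX111207.1_1.fastq.gz\tpathToFiles/C09DFACXX111207.1_2.fastq.gz\n"
    "G15511\tXX\t1\tD0ENMT\tD0ENM_1\t"
    "pathToFiles/D0ENMACXX111207.1_1.fastq.gz\tpathToFiles/D0ENMACXX111207.1_2.fastq.gz\n"
)
rows = list(read_sample_rows(io.StringIO(example)))
print(len(rows), has_tumor(rows))  # prints "2 True"
```

If `has_tumor` returns False for a file, that file describes the single-normal-sample case mentioned above.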
Example TSV file for a normal/tumor pair with FASTQ files
In this example there are 3 read groups for the normal sample and 2 for the tumor. Absolute paths to the paired FASTQ files are recommended, but relative paths should also work. Note that the delimiter is the tab (\t) character:
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFACXX111207.2_1.fastq.gz pathToFiles/C09DFACXX111207.2_2.fastq.gz
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFACXX111207.3_1.fastq.gz pathToFiles/C09DFACXX111207.3_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMACXX111207.1_1.fastq.gz pathToFiles/D0ENMACXX111207.1_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.gz pathToFiles/D0ENMACXX111207.2_2.fastq.gz
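Because editors often convert tabs to spaces silently, it can be safer to generate the file programmatically. The sketch below uses Python's standard `csv` module with a tab delimiter; the function name `write_tsv` is hypothetical, and the rows reuse two entries from the example above.

```python
import csv
import io

# Columns: subject, gender, status, sample, lane, fastq1, fastq2
rows = [
    ("G15511", "XX", 0, "C09DFN", "C09DF_1",
     "pathToFiles/C09DFACXX111207.1_1.fastq.gz",
     "pathToFiles/C09DFACXX111207.1_2.fastq.gz"),
    ("G15511", "XX", 1, "D0ENMT", "D0ENM_1",
     "pathToFiles/D0ENMACXX111207.1_1.fastq.gz",
     "pathToFiles/D0ENMACXX111207.1_2.fastq.gz"),
]


def write_tsv(rows, handle):
    """Write rows as tab-separated lines without a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Write to an in-memory buffer here; pass a file opened with newline="" instead
# to produce an actual TSV file on disk.
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

No header line is written, matching the examples in this document, which contain data rows only.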
Example TSV file for a normal/tumor pair with BAM files
In this example there are 3 read groups for the normal sample and 2 for the tumor. Absolute paths to the BAM files are recommended, but relative paths should also work. Note that the delimiter is the tab (\t) character:
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFAC_1.bam
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFAC_2.bam
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFAC_3.bam
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
Example TSV file for a normal/tumor pair with recalibrated BAM files
In the same way, if you have recalibrated BAMs and their indexes, use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.real.bam pathToFiles/G15511.C09DFN.md.real.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.real.bam pathToFiles/G15511.D0ENMT.md.real.bai
All the files will be in the Preprocessing/Recalibrated/ directory, and by default a corresponding TSV file will also be deposited there. Generally, getting MuTect2 and Strelka calls on the recalibrated files should be done by:
nextflow run SciLifeLab/Sarek/somaticVC.nf --sample Preprocessing/Recalibrated/mysample.tsv --tools Mutect2,Strelka
Input FASTQ file name best practices
The input folder, containing the FASTQ files for one individual (ID), should be organized into one subfolder per sample. All FASTQ files for a sample should be collected in that sample's subfolder.
ID
+--sample1
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
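With this layout, the R1 and R2 files in a sample subfolder can be matched by name alone. The following is a rough sketch of that pairing; `pair_fastqs` is a hypothetical helper, and it assumes every R1 file has a mate whose name differs only in the `_R1_` / `_R2_` token.

```python
def pair_fastqs(filenames):
    """Return sorted (R1, R2) filename pairs from the files of one sample folder.

    Assumption: mates differ only in the last "_R1_" / "_R2_" token of the name.
    """
    names = set(filenames)
    pairs = []
    for r1 in sorted(n for n in names if "_R1_" in n):
        # Swap only the last "_R1_" so an earlier match in the sample ID is untouched.
        head, _, tail = r1.rpartition("_R1_")
        r2 = head + "_R2_" + tail
        if r2 not in names:
            raise ValueError(f"no R2 mate found for {r1}")
        pairs.append((r1, r2))
    return pairs
```

The file list could come from `os.listdir` on a sample subfolder; each returned pair then corresponds to the fastq1 and fastq2 columns of one TSV row.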
Fastq filename structure:
sample_lib_flowcell-index_lane_R1_1000.fastq.gz and sample_lib_flowcell-index_lane_R2_1000.fastq.gz
Where:
- sample = sample ID
- lib = identifier of the library preparation
- flowcell = identifier of the flow cell for the sequencing run
- lane = identifier of the lane of the sequencing run
Read group information will be parsed from fastq file names according to this:
- RGID = "sample_lib_flowcell_index_lane"
- RGPL = "Illumina"
- PU = sample
- RGLB = lib
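To make the mapping concrete, here is a minimal sketch of how the read group fields could be derived from a file name. It is an illustration under stated assumptions, not the code Sarek runs: `read_group_from_fastq` is a hypothetical name, the sample and lib fields are assumed to contain no underscores, and the flowcell and index are assumed to be separated by the first hyphen.

```python
import os


def read_group_from_fastq(path):
    """Derive read group fields from sample_lib_flowcell-index_lane_R1_1000.fastq.gz."""
    name = os.path.basename(path)
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"unexpected FASTQ file name: {name}")
    sample, lib, flowcell_index, lane, read, _suffix = parts
    if read not in ("R1", "R2"):
        raise ValueError(f"expected R1 or R2 in file name, got {read!r}")
    # Assumption: the first hyphen separates the flowcell from the index.
    flowcell, sep, index = flowcell_index.partition("-")
    if not sep:
        raise ValueError(f"expected flowcell-index in file name, got {flowcell_index!r}")
    return {
        "RGID": "_".join([sample, lib, flowcell, index, lane]),
        "RGPL": "Illumina",
        "PU": sample,
        "RGLB": lib,
    }


# The file name below is made up for illustration.
rg = read_group_from_fastq("ID/sample1/sample1_libA_FC123-ACGT_1_R1_1000.fastq.gz")
print(rg["RGID"])  # prints "sample1_libA_FC123_ACGT_1"
```

Both files of a pair yield the same fields, since the R1/R2 token is not part of the read group ID.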