Data Setup
March 16, 2020
To help you get started quickly, we have pre-processed some public domain data (specifically, the data and truth sets shared during the precisionFDA Truth Challenge) and made them available for immediate use. The files can be downloaded from the public S3 bucket links below.
Common
- hs37d5 reference
https://dl4vc.s3.us-east-2.amazonaws.com/hs37d5.fa
https://dl4vc.s3.us-east-2.amazonaws.com/hs37d5.fa.fai
Training
- HG001 50x BAM (generated from precisionFDA HG001 FASTQ file)
https://dl4vc.s3.us-east-2.amazonaws.com/HG001-NA12878-50x.sort.bam
https://dl4vc.s3.us-east-2.amazonaws.com/HG001-NA12878-50x.sort.bam.bai
- Truth set with multi-allelic sites split and normalized
- High confidence region
Evaluation
- HG002 50x BAM (generated from precisionFDA HG002 FASTQ file)
https://dl4vc.s3.us-east-2.amazonaws.com/HG002-NA24385-50x.sort.bam
https://dl4vc.s3.us-east-2.amazonaws.com/HG002-NA24385-50x.sort.bam.bai
- Truth set with multi-allelic sites split and normalized
- High confidence region
- High recall variant candidates in high confidence region
https://dl4vc.s3.us-east-2.amazonaws.com/HG002-NA24385-50x-candidates.vcf.gz
https://dl4vc.s3.us-east-2.amazonaws.com/HG002-NA24385-50x-candidates.vcf.gz.csi
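The links above can be fetched in one pass with a short script. The following is a minimal sketch using only the Python standard library, not part of the project itself. The names `FILES`, `build_urls`, and `download_all` are hypothetical helpers introduced here for illustration, and the file names are copied from the URLs listed above. The truth set and high confidence region entries are left out because no links are listed for them.

```python
import os
import urllib.request

# Bucket prefix shared by every link listed in this document.
BASE_URL = "https://dl4vc.s3.us-east-2.amazonaws.com"

# File names copied from the links above, grouped the same way as the sections.
# Each data file is followed by its index file.
FILES = {
    "common": ["hs37d5.fa", "hs37d5.fa.fai"],
    "training": [
        "HG001-NA12878-50x.sort.bam",
        "HG001-NA12878-50x.sort.bam.bai",
    ],
    "evaluation": [
        "HG002-NA24385-50x.sort.bam",
        "HG002-NA24385-50x.sort.bam.bai",
        "HG002-NA24385-50x-candidates.vcf.gz",
        "HG002-NA24385-50x-candidates.vcf.gz.csi",
    ],
}


def build_urls(groups=("common", "training", "evaluation")):
    """Return the full download URL for every file in the requested groups."""
    return [f"{BASE_URL}/{name}" for group in groups for name in FILES[group]]


def download_all(dest_dir, groups=("common", "training", "evaluation")):
    """Download each file into dest_dir and return the local paths.

    Files that already exist locally are skipped. This check is by name only,
    so a partially downloaded file should be deleted before retrying.
    """
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in build_urls(groups):
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```

For example, `download_all("data", groups=["common", "evaluation"])` would fetch only the reference and the HG002 evaluation files into a local `data` directory (the directory name is an assumption). The BAM files are likely to be large, so check available disk space before downloading everything.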