General scripts
August 15, 2022
Table of Contents generated with DocToc
- Scripts required to prepare for a data release
- Generating analysis files for the manuscript
- Other scripts
Scripts required to prepare for a data release
The overall steps for preparing a data release are as follows:
- Start a release (termed release-vX-YYYYMMDD below) that includes all of the PBTA data files (i.e., upstream files).
- Run scripts/generate-analysis-files-for-release.sh using the PBTA data files in release-vX-YYYYMMDD and commit any changes to files tracked in the repository.
- Add the analysis files in scratch/analysis_files_for_release to release-vX-YYYYMMDD.
- Run scripts/run-for-subtyping.sh using the PBTA data files and analysis files in release-vX-YYYYMMDD and commit any changes to files tracked in the repository.
- Add pbta-histologies.tsv to release-vX-YYYYMMDD.
For definitions of the kinds of files in data releases, please see this documentation.
Analysis file generation
Running the following from this directory will generate all analysis files that are included in data releases and compile them in scratch/analysis_files_for_release for convenience:
bash generate-analysis-files-for-release.sh
This script also generates a file that contains the MD5 checksums for the analysis files (scratch/analysis_files_for_release/analysis_files_md5sum.txt).
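As a minimal sketch, the checksum file can be used to confirm that the compiled analysis files were not altered or truncated before they are added to a release. The function name `verify_analysis_checksums` is hypothetical, and the sketch assumes the manifest uses the standard `md5sum` output format with file names relative to the manifest's own directory.

```shell
#!/bin/bash
# Sketch: verify analysis files against the MD5 manifest.
# ASSUMPTIONS (not confirmed by this README):
#   - the manifest is in standard `md5sum` output format
#   - file names in the manifest are relative to the manifest's directory
#   - `verify_analysis_checksums` is a hypothetical helper name

verify_analysis_checksums() {
  local dir="${1:-scratch/analysis_files_for_release}"
  local manifest="analysis_files_md5sum.txt"

  if [[ ! -f "${dir}/${manifest}" ]]; then
    echo "No manifest found at ${dir}/${manifest}" >&2
    return 2
  fi

  # Run in a subshell so the caller's working directory is unchanged.
  # `md5sum -c` exits non-zero if any listed file is missing or differs.
  (cd "$dir" && md5sum -c "$manifest")
}
```

Calling `verify_analysis_checksums` with no argument checks the default directory named above; a non-zero exit status indicates a missing manifest, a missing file, or a mismatch.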
Notes
- Modules run via this script must have options to use the base (pre-subtyping) histologies file pbta-histologies-base.tsv; these options are used in generate-analysis-files-for-release.sh.
- :warning: This requires 100 GB of disk space to run and may require more than 32 GB of RAM. To test locally, you can use the following:
RUN_LOCAL=1 bash generate-analysis-files-for-release.sh
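A switch like RUN_LOCAL is typically read inside a script with a default value, so that unset means "run everything". The following is only a sketch of that pattern: the step names and the `select_steps` function are hypothetical placeholders and do not reflect which modules the real script skips.

```shell
#!/bin/bash
# Sketch of an environment-variable switch such as RUN_LOCAL.
# ASSUMPTION: the step names and `select_steps` are hypothetical;
# they are not the modules the real script runs or skips.

# Default to 0 (run everything) when RUN_LOCAL is unset or empty.
RUN_LOCAL=${RUN_LOCAL:-0}

select_steps() {
  local run_local="$1"
  # Steps that fit on a typical local machine always run.
  echo "lightweight-step"
  # Resource-heavy steps run only when RUN_LOCAL is not 1.
  if [[ "$run_local" -ne 1 ]]; then
    echo "memory-intensive-step"
  fi
}

# Example: list the steps that would run under the current setting.
# select_steps "$RUN_LOCAL"
```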
Molecular subtyping
Molecular subtyping as part of data release can be run with the following from this directory:
bash run-for-subtyping.sh
This will re-run subtyping for the following broad histologies:
- molecular-subtyping-EWS
- molecular-subtyping-HGG
- molecular-subtyping-LGAT
- molecular-subtyping-embryonal
- molecular-subtyping-CRANIO
- molecular-subtyping-EPN
- molecular-subtyping-MB
- molecular-subtyping-neurocytoma
It will also run any analysis steps used for subtyping that do not generate files included in a release, as well as the molecular-subtyping-pathology and molecular-subtyping-integrate modules to generate the compiled_molecular_subtypes_with_clinical_pathology_feedback.tsv file, which contains the molecular_subtype column.
Adding summary analyses to run-for-subtyping.sh
For an analysis to be run for subtyping, it must use pbta-histologies-base.tsv as input, and it should not depend on the molecular_subtype or integrated_diagnosis columns for molecular-subtyping-* modules.
Please set OPENPBTA_BASE_SUBTYPING=1 as a condition to run code with pbta-histologies-base.tsv.
Here is an example from the TP53 classifier module (assumes root of repo):
OPENPBTA_BASE_SUBTYPING=1 bash analyses/tp53_nf1_score/run_classifier.sh
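Inside a module's run script, the condition might look roughly like the sketch below, which selects the histologies file based on OPENPBTA_BASE_SUBTYPING. The `data/` directory and the `choose_histologies_file` function name are assumptions made for illustration; the actual modules may structure this differently.

```shell
#!/bin/bash
# Sketch: choose the histologies file based on OPENPBTA_BASE_SUBTYPING.
# ASSUMPTIONS: the `data/` location and the function name
# `choose_histologies_file` are illustrative, not taken from the repo.

choose_histologies_file() {
  # Treat an unset or empty variable as 0 (use the released file).
  local base_subtyping="${OPENPBTA_BASE_SUBTYPING:-0}"

  if [[ "$base_subtyping" -eq 1 ]]; then
    # Pre-subtyping file, used while a release is being prepared.
    echo "data/pbta-histologies-base.tsv"
  else
    # Released file that includes subtyping results.
    echo "data/pbta-histologies.tsv"
  fi
}

# Example: pass the selected file on to the analysis.
# HISTOLOGIES_FILE=$(choose_histologies_file)
```

Defaulting to the released file means the module behaves as usual when the variable is not set, and only the release-preparation scripts need to opt in.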
Generating analysis files for the manuscript
Once a new data release has been cut, analysis modules should be run with the new data release.
Specifically, non-deprecated analyses that appear in the manuscript should be run, as well as certain analyses that were run in generate-analysis-files-for-release.sh which export output files in scratch/ that are needed for figure generation or which require disease label information in the released histologies file.
Note that subtyping modules do not need to be re-run, since subtyping was performed to create the data release itself.
The script run-manuscript-analyses.sh can be used for this purpose as:
bash run-manuscript-analyses.sh
By default, this script will run all relevant analyses as described.
However, some of those analyses have significant memory requirements which are generally not available on local machines.
Therefore, to run only analyses that can be run locally, set RUN_LOCAL=1:
RUN_LOCAL=1 bash run-manuscript-analyses.sh
Other scripts
- download-ci-files.sh allows you to download the CI files locally, e.g., for debugging. See these docs.
- install_bioc.R is used to install R packages on the project Docker image. See these docs.
- check-python.sh is used in CI to ensure all Python packages on the project Docker image match what is in the requirements.txt file in the root of the repository. See these docs.