Logan-analysis
October 15, 2024
System to analyze Logan unitigs/contigs at scale with AWS Batch.
Adapted from https://github.com/ababaian/logan
Warning / Costs
Running this system costs real money on your AWS bill. Spot instances with local disk are 0.022 total. This corresponds roughly to a job capable of processing Logan compressed contigs at 1 MB per second per core. Do a pilot run and check AWS Cost Explorer 24 hours later to see the real costs.
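As a rough planning aid, the throughput figure above can be turned into a core-hour estimate. The sketch below is only back-of-the-envelope arithmetic: the function names are hypothetical, the default rate of 1 MB per second per core is taken from the paragraph above, and the price per core-hour is an input you must supply from your own AWS pricing. A pilot run remains the only reliable estimate.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope estimate. Function names are hypothetical and every
# input is an assumption to be replaced with your own numbers.

# Core-hours needed to process SIZE_MB megabytes at RATE MB/s per core
# (RATE defaults to 1, the rough figure quoted in the text).
estimate_core_hours() {
  local size_mb=$1
  local rate=${2:-1}
  awk -v s="$size_mb" -v r="$rate" 'BEGIN { printf "%.2f\n", s / r / 3600 }'
}

# Dollar cost of CORE_HOURS at a user-supplied PRICE per core-hour.
estimate_cost() {
  local core_hours=$1
  local price=$2
  awk -v h="$core_hours" -v p="$price" 'BEGIN { printf "%.2f\n", h * p }'
}
```

For example, `estimate_core_hours 1000000` (about 1 TB of compressed contigs) prints `277.78`.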
How to prepare a new task
To create a new task to run at scale, follow the steps below:
- Fork this Repository
- Add a New Script:
  - Navigate to the batch/tasks/ directory in your forked repository.
  - Create a new shell script for your analysis task. This script should contain the specific commands for your analysis.
  - Use an existing script as a reference, such as analysis_aug26.sh. If there is any newer analysis file in the tasks/ folder, take that one instead as a reference.
- Modify the Dockerfile:
  - If your analysis requires additional software, modify the Dockerfile to include the necessary installations. This ensures that all dependencies are available in the container.
  - The Dockerfile can be found here: Dockerfile.
- Testing Locally:
  - Before deploying at scale, test your task locally to ensure it runs as expected.
  - Use the test_docker.sh script in the batch/ folder to test your task within the Docker container.
- Deploying the Task:
  - Once tested, commit your script to the tasks/ directory and push it to your forked repository.
  - Notify Rayan (or the current maintainer) to run your task at scale. The maintainer will pull your changes and execute the task on AWS Batch.
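Before running the local Docker test, it can be useful to confirm that a new task script at least parses. The sketch below is not part of this repository: `check_task_script` is a hypothetical helper and the script path in the usage line is a placeholder. It relies only on `bash -n`, which parses a script without executing any of its commands.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check, not part of the repository: verify that a task
# script exists and is syntactically valid before building the container.
check_task_script() {
  local script=$1
  if [ ! -f "$script" ]; then
    echo "missing: $script" >&2
    return 1
  fi
  # bash -n reads and parses the script but runs nothing in it.
  if ! bash -n "$script"; then
    echo "syntax error in $script" >&2
    return 1
  fi
  echo "ok: $script"
}
```

Usage, with a placeholder file name: `check_task_script batch/tasks/my_analysis.sh`. A passing check says nothing about whether the analysis itself is correct, so the container test is still needed.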
Setup to run in production
You can skip this section if all you do is test the container locally. Read on if you're going to run analyses in the cloud yourself (unlikely).
So far this setup has only been tested on c5d instances, because tasks rely on a local disk to download contig files.
- Ask Rayan to share ami-09f62d2604cc5b8fe with you, or make your own AMI with awscliv2, or just include AWS CLI in the Dockerfile and use no AMI.
- Run spinupd.sh to deploy the CloudFormation stack and check your CloudFormation web interface to make sure the stack is CREATE_COMPLETE.
- If you need to make adjustments to the stack, make them, run spinupd.sh --update, and check CloudFormation to make sure the stack is UPDATE_COMPLETE.
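The stack status can also be checked from the command line instead of the web interface. The sketch below assumes AWS CLI is installed and configured. The function names are hypothetical, and the stack name must be supplied by you, since the name that spinupd.sh gives the stack is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helpers around `aws cloudformation describe-stacks`.

# Print the current status string of the given stack.
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Succeed only if the stack is CREATE_COMPLETE or UPDATE_COMPLETE.
stack_is_ready() {
  local status
  status=$(stack_status "$1") || return 1
  case "$status" in
    CREATE_COMPLETE|UPDATE_COMPLETE)
      echo "$1: $status"
      ;;
    *)
      echo "$1: not ready ($status)" >&2
      return 1
      ;;
  esac
}
```

Usage: `stack_is_ready <your-stack-name>` after running spinupd.sh or spinupd.sh --update.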
Checklist for running in production
This is a rehash of the two previous sections. No need to do this section if you're testing locally.
Prepare your data in the analyses/ folder, see previous runs for an example of file organization.
In the batch/ folder:
- Modify the beginning of logan-analysis.sh so that it does the task you want.
- Modify the task itself in the tasks/ folder.
- Modify Dockerfile to upload the desired reference indexes.
- Test the container with test_docker.sh.
- Run deploy-docker.sh to upload the container.
Go to the root folder of this repository. Pay attention to the output bucket name hard-coded in run_*.sh (serratus-rayan).
- Modify the vcpus variable in process_array.sh to select the correct jobdef. 2 vCPUs, i.e. jobdef=logan-analysis-2c-job, should be fine: jobs will have 2 cores and 3.5 GB RAM, and DIAMOND needs at least that.
- Run run_test.sh to see that it works at all.
- Run run_pilot.sh to get an estimate of the costs.
- Run run_many.sh for the big run on all Logan contigs.
Behind the scenes, these scripts call process_array.sh [dest_bucket] [nb_jobs], where dest_bucket is the name of the destination bucket and nb_jobs is the number of jobs to submit (can't exceed 10000). The more jobs, the faster the run. The file structure of the destination bucket is decided by the task.
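To get a feel for how nb_jobs affects the size of each job, the arithmetic can be sketched as follows. This is only an illustration: how process_array.sh actually distributes work across jobs is not described here, so the even split is an assumption, and `items_per_job` is a hypothetical helper. The 10000 cap is the limit mentioned above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assuming work items are split evenly across jobs,
# print how many items each job would handle (rounded up).
MAX_JOBS=10000   # limit on nb_jobs stated in the text

items_per_job() {
  local total=$1
  local nb_jobs=$2
  if [ "$nb_jobs" -lt 1 ] || [ "$nb_jobs" -gt "$MAX_JOBS" ]; then
    echo "nb_jobs must be between 1 and $MAX_JOBS" >&2
    return 1
  fi
  # Integer ceiling division.
  echo $(( (total + nb_jobs - 1) / nb_jobs ))
}
```

For example, `items_per_job 27000000 10000` prints `2700`.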
Running tests
Run test_docker.sh for a local test.
Modify and run run_test.sh for a Batch test job, then run_pilot.sh for an estimation of costs. Those scripts have a hardcoded output bucket name that needs to be changed.
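One way to locate and change the hardcoded bucket name is sketched below. The helper names are hypothetical and the replacement bucket in the usage line is a placeholder. The sketch uses `sed -i.bak`, which edits in place and keeps a `.bak` backup of each file. Review the matches before replacing anything.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for finding and replacing a hardcoded bucket name.

# Show every line (with file name and line number) that mentions BUCKET.
find_bucket_refs() {
  local bucket=$1
  shift
  grep -n -H -- "$bucket" "$@"
}

# Replace OLD with NEW in the given files, keeping a .bak copy of each.
# Note: OLD is treated as a regular expression, so a dot matches any character.
replace_bucket() {
  local old=$1
  local new=$2
  shift 2
  local f
  for f in "$@"; do
    sed -i.bak "s|$old|$new|g" "$f"
  done
}
```

Usage, with a placeholder bucket name: `find_bucket_refs serratus-rayan run_*.sh`, then `replace_bucket serratus-rayan my-output-bucket run_test.sh run_pilot.sh`.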
Dealing with 27 million files
A typical result of a Logan analysis run is 27 million DIAMOND output files. Handling this many files can be a challenge. Some lessons learned:
- For batch download, always use --recursive in aws s3 cp or /* in s5cmd. It is too slow to execute 27 million commands, even in parallel on a single machine. The best solution I could find is to modify and use utils/parallel_download.sh.
- For aggregating results into a smaller number of files, use utils/package_diamond.sh on the results of the parallel download above. It is best to end up with ~100 files instead of just 1, which allows for further multithreading.
- Check out the logan-aggregate repo for more industrial-grade aggregation scripts.
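The two ideas above, one bulk command for download and about 100 output files for aggregation, can be illustrated with a minimal sketch. It is not a copy of utils/parallel_download.sh or utils/package_diamond.sh, whose contents are not shown here. The function names are hypothetical, the S3 path is yours to supply, and the packaging step assumes the result files are plain text that can be concatenated safely and that file names contain no newlines.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the repository's utils/ scripts.

# Download a whole S3 prefix with a single bulk command.
download_prefix() {
  local src=$1    # e.g. s3://<your-bucket>/<your-prefix>/
  local dest=$2
  if command -v s5cmd >/dev/null 2>&1; then
    # The wildcard is quoted so that s5cmd, not the shell, expands it.
    s5cmd cp "${src%/}/*" "$dest/"
  else
    aws s3 cp "$src" "$dest" --recursive
  fi
}

# Concatenate all files under SRC_DIR into N_SHARDS files in OUT_DIR.
package_into_shards() {
  local src_dir=$1
  local out_dir=$2
  local n_shards=${3:-100}
  local lists list
  lists=$(mktemp -d)
  mkdir -p "$out_dir"
  # Deal file paths round-robin into one list file per shard.
  find "$src_dir" -type f | sort |
    awk -v n="$n_shards" -v d="$lists" \
      '{ out = d "/list_" ((NR - 1) % n); print > out }'
  # xargs passes many files to each cat call, avoiding one process per file.
  for list in "$lists"/list_*; do
    [ -e "$list" ] || continue
    tr '\n' '\0' < "$list" | xargs -0 cat > "$out_dir/shard_${list##*_}.txt"
  done
  rm -rf "$lists"
}
```

Keeping about 100 shards rather than a single file means later steps can still process shards in parallel. OUT_DIR should be outside SRC_DIR so that shards are not picked up as inputs.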
Cleanup
Manually delete the CloudFormation stack. Also delete the ECR image.
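The same cleanup can be done from the command line. The sketch below assumes AWS CLI is installed and configured. `cleanup_stack_and_image` is a hypothetical helper, and the stack name, ECR repository name, and image tag are placeholders you must supply, since they are not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete a CloudFormation stack, wait for the deletion
# to finish, then delete one tagged image from an ECR repository.
cleanup_stack_and_image() {
  local stack=$1
  local repo=$2
  local tag=${3:-latest}   # assumed default tag
  aws cloudformation delete-stack --stack-name "$stack" &&
    aws cloudformation wait stack-delete-complete --stack-name "$stack" &&
    aws ecr batch-delete-image \
      --repository-name "$repo" \
      --image-ids "imageTag=$tag"
}
```

Usage: `cleanup_stack_and_image <your-stack-name> <your-ecr-repository>`. Both deletions are irreversible, so double-check the names first.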