Using AlphaFold on Berzelius

Introduction

AlphaFold is a deep learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold, released in 2018, was considered a breakthrough in the field of protein structure prediction. In 2020, AlphaFold2 won CASP14, a biennial competition that evaluates the state of the art in protein structure prediction, by predicting protein structures with remarkable accuracy. This accuracy has implications for drug discovery and for understanding diseases at the molecular level.

Preparations

Setting the Paths

We have a copy of the AlphaFold database available on Berzelius at /proj/common-datasets for public use.

We specify the paths for the AlphaFold database, the AlphaFold installation, and the results.

# The database is the shared read-only copy; the installation and results
# directories live in your own project storage.
export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input

Downloading Genetic Databases

If you prefer to download your own copy of the AlphaFold database, you can do so as follows. Note that the full database occupies roughly 2.6 TB once extracted.

export ALPHAFOLD_DB=/proj/nsc_testing/xuan/AlphaFold
# download_all_data.sh uses aria2c for the downloads
module load aria2/1.36.0-gcc-8.5.0
# Fetch the AlphaFold source tree to get the download scripts
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
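
Once the download finishes, you can sanity-check the result (a minimal sketch; the exact directory names depend on the database versions downloaded):

du -sh ${ALPHAFOLD_DB}/*   # the extracted database totals roughly 2.6 TB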

Downloading Test Data

The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
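
If you prefer the command line, something like the following works (a sketch; replace <URL> with the actual link on the page above):

wget -O ${ALPHAFOLD_RESULTS}/input/T1050.fasta "<URL>"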

Patch

Our patch adds two new input arguments to run_alphafold.py.

  • n_parallel_msa
    --n_parallel_msa=1: the MSA searches run sequentially.
    --n_parallel_msa=3: all three MSA searches run in parallel.
    This flag is exposed as -P in the wrapper.

  • run_feature_only
    --run_feature_only=true: run only the MSA and template searches (the CPU part).
    This flag is exposed as -F in the wrapper.

The patch also provides the flexibility to choose the number of threads used for the MSA searches; see the Optimization section for details.

Running AlphaFold Using the Module

On Berzelius, AlphaFold is available as a module.

Loading the Module

On a compute node we load the AlphaFold module.

module load AlphaFold/2.3.1-hpc1

Running an Example

We run an example.

run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please run run_alphafold.sh -h to check the usage.

Running AlphaFold Using Conda

Creating a Conda Env

We first load the Mambaforge module.

module load Mambaforge/23.3.1-1-hpc1-bdist

We create a conda env from a yml file.

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
mamba env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
mamba activate alphafold_2.3.1

Installing AlphaFold

To download AlphaFold

wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1

To apply the OpenMM patch

# ${CONDA_PREFIX} points at the active env (set by mamba activate), so this
# works regardless of where your conda envs are stored
cd ${CONDA_PREFIX}/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch

To download the stereo-chemical properties file

wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

To install the patch

# The repo was already cloned to /tmp when creating the conda env above
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}

Running an Example

cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please check the input arguments in run_alphafold.py. A complete list of input arguments is attached here for reference.

Running AlphaFold Using Apptainer

The Container Image

There is a prebuilt Apptainer image of AlphaFold 2.3.1 at /software/sse/containers.

Running an Example

apptainer exec --nv /software/sse/containers/alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"
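
If the database or results directories are not visible inside the container, you can bind-mount them explicitly (a hedged variant of the command above; Berzelius may already bind the /proj paths by default):

apptainer exec --nv \
  --bind /proj/common-datasets --bind ${ALPHAFOLD_RESULTS} \
  /software/sse/containers/alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 -g true -P 3 -F false"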

Optimization

MSA Searches in Parallel

The three independent MSA searches run sequentially by default but can be arranged to run in parallel to accelerate the job. You can enable parallelization by setting the flag -P 3. The three parallel groups are:

  • jackhmmer(uniref90) + template_searcher(pdb)
  • jackhmmer(mgnify)
  • hhblits(bfd) or jackhmmer(small_bfd)

Ref 1: AlphaFold PR 399, Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.

Multithreading for MSA Searches

AlphaFold 2.3.1 uses a default choice of 8, 8, and 4 threads for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually allocate more threads to it. You can set the number of threads for the three searches in alphafold/data/pipeline.py at lines 131 to 134.

For multimer models, the jackhmmer (uniprot) search starts when the first three searches finish. You can set the number of threads in alphafold/data/pipeline_multimer.py at line 179.

We recommend n_cpu = 8, 8, 16, and 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd), and jackhmmer (uniprot), respectively.
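
Before editing, you can locate these settings with a quick grep (a minimal sketch; n_cpu is exposed at these spots by our patch, and the line numbers may shift between releases):

grep -n "n_cpu" ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py \
    ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py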

Separation of the CPU Part (MSA and Template Searches) and the GPU Part (Predictions)

A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius, as sketched below.
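
Schematically, the two phases use the wrapper flags as follows (a sketch; ${OUT} and ${FASTA} are shorthand for your output directory and input file):

# Phase 1 (CPU, e.g. on Tetralith): MSA and template searches only
run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${OUT} -f ${FASTA} -t 2021-11-01 -g false -F true

# Phase 2 (GPU, on Berzelius): reuse the search results and run the predictions
run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${OUT} -f ${FASTA} -t 2021-11-01 -g true -F false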

I/O Optimization

Each compute node has a local scratch file system for temporary storage while a job is running; the data is deleted once the job finishes. On Berzelius, this disk at /scratch/local provides 15 TB of NVMe SSD storage. While it is possible to copy the AlphaFold database to /scratch/local at the beginning of a job, our experiments on Berzelius show that doing so does not significantly reduce the job running time.

Best Practice of Running AlphaFold on Tetralith

On Tetralith, the GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage, so you can copy the BFD subset (1.8 TB) there at the beginning of a job to improve I/O performance. The AlphaFold database on Tetralith can be found at /proj/common_datasets/AlphaFold.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local

# Copy the I/O-heavy BFD subset to the node-local disk and symlink the rest,
# so that ALPHAFOLD_DB_LOCAL mirrors the expected database layout.
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
ln -s ${ALPHAFOLD_DB}/mgnify_2022_05 ${ALPHAFOLD_DB_LOCAL}/mgnify
ln -s ${ALPHAFOLD_DB}/params_2022_12_06 ${ALPHAFOLD_DB_LOCAL}/params
ln -s ${ALPHAFOLD_DB}/pdb70_200401 ${ALPHAFOLD_DB_LOCAL}/pdb70
ln -s ${ALPHAFOLD_DB}/pdb_mmcif ${ALPHAFOLD_DB_LOCAL}/pdb_mmcif
ln -s ${ALPHAFOLD_DB}/pdb_seqres ${ALPHAFOLD_DB_LOCAL}/pdb_seqres
ln -s ${ALPHAFOLD_DB}/uniprot_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniprot
ln -s ${ALPHAFOLD_DB}/uniref30_2021_03 ${ALPHAFOLD_DB_LOCAL}/uniref30
ln -s ${ALPHAFOLD_DB}/uniref90_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniref90

An sbatch script example has been prepared for you here.

export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB_LOCAL} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
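
For reference, a minimal sbatch wrapper around the commands above could look like this (a sketch only; the #SBATCH directives, project ID, and resource requests are assumptions to adapt to your allocation):

#!/bin/bash
#SBATCH -A <your-project>   # hypothetical project ID
#SBATCH -t 24:00:00         # wall time; depends on sequence length
#SBATCH -n 1
#SBATCH -c 32               # CPU cores for the MSA searches
#SBATCH --gpus-per-task=1   # one GPU for the predictions (assumption)

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1

cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
# ... symlink the remaining databases as shown above ...

module load AlphaFold/2.3.1-hpc1
run_alphafold.sh -d ${ALPHAFOLD_DB_LOCAL} -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta -t 2021-11-01 -g true -P 3 -F false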

Best Practice of Running AlphaFold on Berzelius

To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs. You should run the CPU part on Tetralith or your local computer, and then run the GPU part on Berzelius.

  1. Run the CPU part of the job on Tetralith.

You need to set -F true in the command to run only the MSA and template searches.

As in the Tetralith section above: the GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage, so copy the BFD subset (1.8 TB) there at the beginning of the job to improve I/O performance. The AlphaFold database on Tetralith can be found at /proj/common_datasets/AlphaFold.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local

# Copy the I/O-heavy BFD subset to the node-local disk and symlink the rest,
# so that ALPHAFOLD_DB_LOCAL mirrors the expected database layout.
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
ln -s ${ALPHAFOLD_DB}/mgnify_2022_05 ${ALPHAFOLD_DB_LOCAL}/mgnify
ln -s ${ALPHAFOLD_DB}/params_2022_12_06 ${ALPHAFOLD_DB_LOCAL}/params
ln -s ${ALPHAFOLD_DB}/pdb70_200401 ${ALPHAFOLD_DB_LOCAL}/pdb70
ln -s ${ALPHAFOLD_DB}/pdb_mmcif ${ALPHAFOLD_DB_LOCAL}/pdb_mmcif
ln -s ${ALPHAFOLD_DB}/pdb_seqres ${ALPHAFOLD_DB_LOCAL}/pdb_seqres
ln -s ${ALPHAFOLD_DB}/uniprot_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniprot
ln -s ${ALPHAFOLD_DB}/uniref30_2021_03 ${ALPHAFOLD_DB_LOCAL}/uniref30
ln -s ${ALPHAFOLD_DB}/uniref90_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniref90

An sbatch script example has been prepared for you here.

export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB_LOCAL} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 3 \
  -F true

  2. Transfer the CPU part results from Tetralith to Berzelius via your local computer, for example as sketched below.
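
A hedged example of the transfer (hostnames and paths are illustrative; adapt them to your accounts):

# On your local computer: pull the results from Tetralith, then push them to Berzelius
rsync -av tetralith.nsc.liu.se:/proj/nsc/users/xuan/alphafold_results_2.3.1/output/ ./output/
rsync -av ./output/ berzelius.nsc.liu.se:/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/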

  3. Run the GPU part of the job on Berzelius.

You need to set -F false in the command. Since the MSA and template search results from step 1 are already in the output directory, the job skips the searches and proceeds directly to the predictions.

An sbatch script example has been prepared for you here.

export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 1 \
  -F false

  4. To achieve better GPU utilization, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script, which demonstrates how to execute 5 GPU part jobs concurrently; a minimal sketch of the idea follows.
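
The sketch below conveys the idea (target names are hypothetical; the CPU part results for each target must already be in the output directory):

# Run the GPU part for several targets concurrently on one node
for target in T1050 T1051 T1052; do
  run_alphafold.sh \
    -d ${ALPHAFOLD_DB} \
    -o ${ALPHAFOLD_RESULTS}/output \
    -f ${ALPHAFOLD_RESULTS}/input/${target}.fasta \
    -t 2021-11-01 -g true -P 1 -F false &
done
wait   # block until all background runs have finished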

AlphaFold Alternatives

LocalColabFold

LocalColabFold is a local version of ColabFold. ColabFold allows researchers to use AlphaFold’s powerful capabilities without requiring large computational resources, making it accessible via Google Colab.

LocalColabFold was developed to enable users to run AlphaFold predictions on their own local systems (like workstations or local servers) rather than relying on Google Colab. It integrates with various tools like MMseqs2 and makes the AlphaFold structure prediction pipeline more accessible by reducing the dependency on external services. This setup is particularly useful for labs or institutions that have access to GPUs or other high-performance computing resources.

Loading the Module

On a compute node we load the LocalColabFold module.

module load LocalColabFold/1.5.5-hpc1

Running an Example

We run an example.

colabfold_batch --data /proj/common-datasets/AlphaFold input/ output/

OpenFold

OpenFold is an open-source reimplementation of AlphaFold. OpenFold aims to reproduce the key functionalities of AlphaFold with an open-source license, allowing more flexibility for researchers to modify, improve, and integrate the model into their workflows.

Loading the Module

On a compute node we load the OpenFold module.

module load OpenFold/2.1.0-hpc1

Running an Example: Pre-compute alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments.py $INPUT_FASTA_DIR ${OUTPUT_DIR}/alignments \
    --uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
    --pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus_per_task 16 \
    --jackhmmer_binary_path ${CONDA_PREFIX}/bin/jackhmmer \
    --hhblits_binary_path ${CONDA_PREFIX}/bin/hhblits \
    --hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
    --kalign_binary_path ${CONDA_PREFIX}/bin/kalign

Running an Example: Pre-compute alignments with the ColabFold pipeline

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments_mmseqs.py $INPUT_FASTA_DIR/T1050.fasta \
    $BASE_DATA_DIR/mmseqs_dbs \
    uniref30_2103_db \
    ${OUTPUT_DIR}/alignments \
    ${CONDA_PREFIX}/bin/mmseqs \
    --hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 $BASE_DATA_DIR/pdb70/pdb70

Running an Example: Model inference with pre-computed alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
export PRECOMPUTED_ALIGNMENTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/alignments

cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py ${INPUT_FASTA_DIR} \
  $TEMPLATE_MMCIF_DIR \
  --output_dir $OUTPUT_DIR \
  --use_precomputed_alignments $PRECOMPUTED_ALIGNMENTS \
  --config_preset model_1_ptm \
  --model_device "cuda:0" 

Running an Example: Model inference without pre-computed alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py \
    $INPUT_FASTA_DIR \
    $TEMPLATE_MMCIF_DIR \
    --output_dir $OUTPUT_DIR \
    --config_preset model_1_ptm \
    --uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
    --pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" 

Running an Example: Training a new OpenFold model

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 train_openfold.py $BASE_DATA_DIR/pdb_data/mmcif_files $BASE_DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    2021-10-10 \
    --train_chain_data_cache_path $BASE_DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $BASE_DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $BASE_DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 2

For more information on using OpenFold, please refer to the OpenFold documentation.

