Using AlphaFold on Berzelius

Introduction

AlphaFold is a deep learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold, released in 2018, was considered a breakthrough in the field of protein structure prediction. In 2020, AlphaFold2 won CASP14, a biennial competition that evaluates the state of the art in protein structure prediction, by predicting protein structures with remarkable accuracy. This accuracy has implications for drug discovery and for understanding diseases at the molecular level.

Preparations

Setting the Paths

We have a copy of the AlphaFold database available on Berzelius at /proj/common-datasets for public use.

We specify the paths for the AlphaFold database, the AlphaFold installation, and the results.

# The database is the shared read-only copy; the installation and results
# directories live in your own project storage.
export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input

Downloading Genetic Databases

If you prefer to download your own copy of the AlphaFold database, you can do so as follows. Note that the full database occupies roughly 2.6 TB once extracted.

export ALPHAFOLD_DB=/proj/nsc_testing/xuan/AlphaFold
# download_all_data.sh uses aria2c for the downloads
module load aria2/1.36.0-gcc-8.5.0
# Fetch the AlphaFold source tree to get the download scripts
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
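
Once the download finishes, you can sanity-check the result (a minimal sketch; the exact directory names depend on the database versions downloaded):

du -sh ${ALPHAFOLD_DB}/*   # the extracted database totals roughly 2.6 TB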

Downloading Test Data

The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
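
If you prefer the command line, something like the following works (a sketch; replace <URL> with the actual link on the page above):

wget -O ${ALPHAFOLD_RESULTS}/input/T1050.fasta "<URL>"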

Patch

Our patch adds two new input arguments to run_alphafold.py.

  • n_parallel_msa
    --n_parallel_msa=1: the MSA searches run sequentially.
    --n_parallel_msa=3: all three MSA searches run in parallel.
    This flag is exposed as -P in the wrapper.

  • run_feature_only
    --run_feature_only=true: run only the MSA and template searches (the CPU part).
    This flag is exposed as -F in the wrapper.

The patch also provides the flexibility to choose the number of threads used for the MSA searches; see the Optimization section for details.

Running AlphaFold Using the Module

On Berzelius, AlphaFold is available as a module.

Loading the Module

On a compute node we load the AlphaFold module.

module load AlphaFold/2.3.1-hpc1

Running an Example

We run an example.

run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please run run_alphafold.sh -h to check the usage.

Running AlphaFold Using Conda

Creating a Conda Env

We first load the Mambaforge module.

module load Mambaforge/23.3.1-1-hpc1-bdist

We create a conda env from a yml file.

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
mamba env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
mamba activate alphafold_2.3.1

Installing AlphaFold

To download AlphaFold

wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1

To apply the OpenMM patch

# ${CONDA_PREFIX} points at the active env (set by mamba activate), so this
# works regardless of where your conda envs are stored
cd ${CONDA_PREFIX}/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch

To download the stereo-chemical properties file

wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

To install the patch

# The repo was already cloned to /tmp when creating the conda env above
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}

Running an Example

cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please check the input arguments in run_alphafold.py. A complete list of input arguments is attached here for reference.

Running AlphaFold Using Apptainer

The Container Image

There is a prebuilt Apptainer image of AlphaFold 2.3.1 at /software/sse/containers.

Running an Example

apptainer exec --nv /software/sse/containers/alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"
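
If the database or results directories are not visible inside the container, you can bind-mount them explicitly (a hedged variant of the command above; Berzelius may already bind the /proj paths by default):

apptainer exec --nv \
  --bind /proj/common-datasets --bind ${ALPHAFOLD_RESULTS} \
  /software/sse/containers/alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 -g true -P 3 -F false"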

Optimization

MSA Searches in Parallel

The three independent MSA searches run sequentially by default but can be arranged to run in parallel to accelerate the job. You can enable parallelization by setting the flag -P 3. The three parallel groups are:

  • jackhmmer(uniref90) + template_searcher(pdb)
  • jackhmmer(mgnify)
  • hhblits(bfd) or jackhmmer(small_bfd)

Ref 1: AlphaFold PR 399, Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.

Multithreading for MSA Searches

AlphaFold 2.3.1 uses a default choice of 8, 8, and 4 threads for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually allocate more threads to it. You can set the number of threads for the three searches in alphafold/data/pipeline.py at lines 131 to 134.

For multimer models, the jackhmmer (uniprot) search starts when the first three searches finish. You can set the number of threads in alphafold/data/pipeline_multimer.py at line 179.

We recommend n_cpu = 8, 8, 16, and 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd), and jackhmmer (uniprot), respectively.
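
Before editing, you can locate these settings with a quick grep (a minimal sketch; n_cpu is exposed at these spots by our patch, and the line numbers may shift between releases):

grep -n "n_cpu" ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py \
    ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py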

Separation of the CPU Part (MSA and Template Searches) and the GPU Part (Predictions)

A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius, as sketched below.
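
Schematically, the two phases use the wrapper flags as follows (a sketch; ${OUT} and ${FASTA} are shorthand for your output directory and input file):

# Phase 1 (CPU, e.g. on Tetralith): MSA and template searches only
run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${OUT} -f ${FASTA} -t 2021-11-01 -g false -F true

# Phase 2 (GPU, on Berzelius): reuse the search results and run the predictions
run_alphafold.sh -d ${ALPHAFOLD_DB} -o ${OUT} -f ${FASTA} -t 2021-11-01 -g true -F false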

I/O Optimization

Each compute node has a local scratch file system for temporary storage while a job is running; the data is deleted once the job finishes. On Berzelius, this disk at /scratch/local provides 15 TB of NVMe SSD storage. While it is possible to copy the AlphaFold database to /scratch/local at the beginning of a job, our experiments on Berzelius show that doing so does not significantly reduce the job running time.

Best Practice of Running AlphaFold on Tetralith

On Tetralith, the GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage, so you can copy the BFD subset (1.8 TB) there at the beginning of a job to improve I/O performance. The AlphaFold database on Tetralith can be found at /proj/common_datasets/AlphaFold.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local

# Copy the I/O-heavy BFD subset to the node-local disk and symlink the rest,
# so that ALPHAFOLD_DB_LOCAL mirrors the expected database layout.
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
ln -s ${ALPHAFOLD_DB}/mgnify_2022_05 ${ALPHAFOLD_DB_LOCAL}/mgnify
ln -s ${ALPHAFOLD_DB}/params_2022_12_06 ${ALPHAFOLD_DB_LOCAL}/params
ln -s ${ALPHAFOLD_DB}/pdb70_200401 ${ALPHAFOLD_DB_LOCAL}/pdb70
ln -s ${ALPHAFOLD_DB}/pdb_mmcif ${ALPHAFOLD_DB_LOCAL}/pdb_mmcif
ln -s ${ALPHAFOLD_DB}/pdb_seqres ${ALPHAFOLD_DB_LOCAL}/pdb_seqres
ln -s ${ALPHAFOLD_DB}/uniprot_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniprot
ln -s ${ALPHAFOLD_DB}/uniref30_2021_03 ${ALPHAFOLD_DB_LOCAL}/uniref30
ln -s ${ALPHAFOLD_DB}/uniref90_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniref90

An sbatch script example has been prepared for you here.

export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB_LOCAL} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false
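
For reference, a minimal sbatch wrapper around the commands above could look like this (a sketch only; the #SBATCH directives, project ID, and resource requests are assumptions to adapt to your allocation):

#!/bin/bash
#SBATCH -A <your-project>   # hypothetical project ID
#SBATCH -t 24:00:00         # wall time; depends on sequence length
#SBATCH -n 1
#SBATCH -c 32               # CPU cores for the MSA searches
#SBATCH --gpus-per-task=1   # one GPU for the predictions (assumption)

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1

cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
# ... symlink the remaining databases as shown above ...

module load AlphaFold/2.3.1-hpc1
run_alphafold.sh -d ${ALPHAFOLD_DB_LOCAL} -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta -t 2021-11-01 -g true -P 3 -F false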

Best Practice of Running AlphaFold on Berzelius

To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs. You should run the CPU part on Tetralith or your local computer, and then run the GPU part on Berzelius.

  1. Run the CPU part of the job on Tetralith.

You need to set -F true in the command to run only the MSA and template searches.

As in the Tetralith section above: the GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage, so copy the BFD subset (1.8 TB) there at the beginning of the job to improve I/O performance. The AlphaFold database on Tetralith can be found at /proj/common_datasets/AlphaFold.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local

# Copy the I/O-heavy BFD subset to the node-local disk and symlink the rest,
# so that ALPHAFOLD_DB_LOCAL mirrors the expected database layout.
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
ln -s ${ALPHAFOLD_DB}/mgnify_2022_05 ${ALPHAFOLD_DB_LOCAL}/mgnify
ln -s ${ALPHAFOLD_DB}/params_2022_12_06 ${ALPHAFOLD_DB_LOCAL}/params
ln -s ${ALPHAFOLD_DB}/pdb70_200401 ${ALPHAFOLD_DB_LOCAL}/pdb70
ln -s ${ALPHAFOLD_DB}/pdb_mmcif ${ALPHAFOLD_DB_LOCAL}/pdb_mmcif
ln -s ${ALPHAFOLD_DB}/pdb_seqres ${ALPHAFOLD_DB_LOCAL}/pdb_seqres
ln -s ${ALPHAFOLD_DB}/uniprot_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniprot
ln -s ${ALPHAFOLD_DB}/uniref30_2021_03 ${ALPHAFOLD_DB_LOCAL}/uniref30
ln -s ${ALPHAFOLD_DB}/uniref90_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniref90

An sbatch script example has been prepared for you here.

export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB_LOCAL} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 3 \
  -F true

  2. Transfer the CPU part results from Tetralith to Berzelius via your local computer, for example as sketched below.
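
A hedged example of the transfer (hostnames and paths are illustrative; adapt them to your accounts):

# On your local computer: pull the results from Tetralith, then push them to Berzelius
rsync -av tetralith.nsc.liu.se:/proj/nsc/users/xuan/alphafold_results_2.3.1/output/ ./output/
rsync -av ./output/ berzelius.nsc.liu.se:/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/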

  3. Run the GPU part of the job on Berzelius.

You need to set -F false in the command. Since the MSA and template search results from step 1 are already in the output directory, the job skips the searches and proceeds directly to the predictions.

An sbatch script example has been prepared for you here.

export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 1 \
  -F false

  4. To achieve better GPU utilization, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script, which demonstrates how to execute 5 GPU part jobs concurrently; a minimal sketch of the idea follows.
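
The sketch below conveys the idea (target names are hypothetical; the CPU part results for each target must already be in the output directory):

# Run the GPU part for several targets concurrently on one node
for target in T1050 T1051 T1052; do
  run_alphafold.sh \
    -d ${ALPHAFOLD_DB} \
    -o ${ALPHAFOLD_RESULTS}/output \
    -f ${ALPHAFOLD_RESULTS}/input/${target}.fasta \
    -t 2021-11-01 -g true -P 1 -F false &
done
wait   # block until all background runs have finished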

AlphaFold Alternatives

LocalColabFold

LocalColabFold is a local version of ColabFold. ColabFold allows researchers to use AlphaFold’s powerful capabilities without requiring large computational resources, making it accessible via Google Colab.

LocalColabFold was developed to enable users to run AlphaFold predictions on their own local systems (like workstations or local servers) rather than relying on Google Colab. It integrates with various tools like MMseqs2 and makes the AlphaFold structure prediction pipeline more accessible by reducing the dependency on external services. This setup is particularly useful for labs or institutions that have access to GPUs or other high-performance computing resources.

Loading the Module

On a compute node we load the LocalColabFold module.

module load LocalColabFold/1.5.5-hpc1

Running an Example

We run an example.

colabfold_batch --data /proj/common-datasets/AlphaFold input/ output/

OpenFold

OpenFold is an open-source reimplementation of AlphaFold. OpenFold aims to reproduce the key functionalities of AlphaFold with an open-source license, allowing more flexibility for researchers to modify, improve, and integrate the model into their workflows.

Loading the Module

On a compute node we load the OpenFold module.

module load OpenFold/2.1.0-hpc1

Running an Example: Pre-compute alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments.py $INPUT_FASTA_DIR ${OUTPUT_DIR}/alignments \
    --uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
    --pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --cpus_per_task 16 \
    --jackhmmer_binary_path ${CONDA_PREFIX}/bin/jackhmmer \
    --hhblits_binary_path ${CONDA_PREFIX}/bin/hhblits \
    --hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
    --kalign_binary_path ${CONDA_PREFIX}/bin/kalign

Running an Example: Pre-compute alignments with the ColabFold pipeline

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments_mmseqs.py $INPUT_FASTA_DIR/T1050.fasta \
    $BASE_DATA_DIR/mmseqs_dbs \
    uniref30_2103_db \
    ${OUTPUT_DIR}/alignments \
    ${CONDA_PREFIX}/bin/mmseqs \
    --hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 $BASE_DATA_DIR/pdb70/pdb70

Running an Example: Model inference with pre-computed alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
export PRECOMPUTED_ALIGNMENTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/alignments

cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py ${INPUT_FASTA_DIR} \
  $TEMPLATE_MMCIF_DIR \
  --output_dir $OUTPUT_DIR \
  --use_precomputed_alignments $PRECOMPUTED_ALIGNMENTS \
  --config_preset model_1_ptm \
  --model_device "cuda:0" 

Running an Example: Model inference without pre-computed alignments

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py \
    $INPUT_FASTA_DIR \
    $TEMPLATE_MMCIF_DIR \
    --output_dir $OUTPUT_DIR \
    --config_preset model_1_ptm \
    --uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
    --pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" 

Running an Example: Training a new OpenFold model

export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 train_openfold.py $BASE_DATA_DIR/pdb_data/mmcif_files $BASE_DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
    2021-10-10 \
    --train_chain_data_cache_path $BASE_DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
    --template_release_dates_cache_path $BASE_DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
    --config_preset initial_training \
    --seed 42 \
    --obsolete_pdbs_file_path $BASE_DATA_DIR/pdb_data/obsolete.dat \
    --num_nodes 1 \
    --gpus 2

For more information on using OpenFold, please refer to the OpenFold documentation.

