Using AlphaFold 3 on Berzelius

Introduction

AlphaFold is a deep-learning-based protein structure prediction program developed by Google DeepMind. The software uses a neural network to predict the 3D structure of a protein from its amino acid sequence. Building on the success of AlphaFold 2, which revolutionized the field by predicting protein structures with near-experimental accuracy, AlphaFold 3 introduces several new capabilities and enhancements aimed at expanding its applicability to complex biological problems.

Preparations

Setting the Paths

A copy of the AlphaFold 3 database is available on Berzelius at /proj/common-datasets for public use.

We specify the paths for the AlphaFold database, the AlphaFold model parameters, and the results. Due to Terms of Use limitations, you will need to obtain the model parameters yourself.

export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold3
export ALPHAFOLD_MODEL=${ALPHAFOLD_DB}/model_parameters
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_3.0.0
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_MODEL} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input

Downloading Test Data

The test input alphafold_input.json can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
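
If you prefer to write an input file by hand, the snippet below sketches a minimal AlphaFold 3 input JSON. The name, chain id and sequence are placeholders, not the contents of the test input from this page; the field layout (name, modelSeeds, sequences, dialect, version) follows the AlphaFold 3 input format.

```shell
# ${ALPHAFOLD_RESULTS} may not be set yet in a fresh shell; default to a
# scratch location for this sketch.
export ALPHAFOLD_RESULTS=${ALPHAFOLD_RESULTS:-/tmp/alphafold_results}
mkdir -p ${ALPHAFOLD_RESULTS}/input

# Hypothetical minimal input: "test_protein", chain "A" and the short
# sequence are placeholders only.
cat > ${ALPHAFOLD_RESULTS}/input/alphafold_input.json << 'EOF'
{
  "name": "test_protein",
  "modelSeeds": [1],
  "sequences": [
    { "protein": { "id": "A", "sequence": "MVKVGVNG" } }
  ],
  "dialect": "alphafold3",
  "version": 1
}
EOF

# Sanity-check that the file is valid JSON before submitting a job.
python3 -m json.tool ${ALPHAFOLD_RESULTS}/input/alphafold_input.json > /dev/null \
  && echo "input JSON OK"
```

Validating the JSON up front avoids burning queue time on a job that fails immediately at parse.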

Running AlphaFold Using the Module

On Berzelius, AlphaFold 3 is available as a module.

Loading the Module

On a compute node we load the AlphaFold module, which sets ${ALPHAFOLD_PREFIX} to the installation directory:

module load AlphaFold/3.0.0-hpc1

Running an Example

We run an example:

python ${ALPHAFOLD_PREFIX}/run_alphafold.py \
    --db_dir=${ALPHAFOLD_DB} \
    --json_path=${ALPHAFOLD_RESULTS}/input/alphafold_input.json \
    --model_dir=${ALPHAFOLD_MODEL} \
    --output_dir=${ALPHAFOLD_RESULTS}/output \
    --run_inference=True

Please run python ${ALPHAFOLD_PREFIX}/run_alphafold.py --help to check the usage.

Best Practice for Running AlphaFold on Tetralith

On Tetralith, the AlphaFold database can be found at /proj/common_datasets/AlphaFold3. The GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage; copying the database (0.6 TB) to /scratch/local at the beginning of a job improves I/O performance.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold3
export ALPHAFOLD_DB_LOCAL=/scratch/local
cp -a ${ALPHAFOLD_DB}/* ${ALPHAFOLD_DB_LOCAL}
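
The stage-to-local-disk pattern can be tried in isolation. The sketch below uses throwaway directories standing in for the real database and /scratch/local (the file name pdb_seqres.txt is just an illustrative stand-in); only the copy-then-point pattern itself is demonstrated.

```shell
# Throwaway directories stand in for ${ALPHAFOLD_DB} and /scratch/local.
SRC=$(mktemp -d)
DST=/tmp/af3_copy_demo
mkdir -p ${DST}
echo "dummy database file" > ${SRC}/pdb_seqres.txt

# cp -a preserves permissions and timestamps of the database files.
cp -a ${SRC}/* ${DST}

# The job then points --db_dir at the local copy instead of the
# project directory.
diff ${SRC}/pdb_seqres.txt ${DST}/pdb_seqres.txt && echo "copy verified"
```

Note that /scratch/local is wiped when the job ends, so the copy must be repeated in every job script.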

We run an example:

export ALPHAFOLD_MODEL=${ALPHAFOLD_DB}/model_parameters
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_3.0.0
module load AlphaFold/3.0.0-hpc1

python ${ALPHAFOLD_PREFIX}/run_alphafold.py \
    --db_dir=${ALPHAFOLD_DB_LOCAL} \
    --json_path=${ALPHAFOLD_RESULTS}/input/alphafold_input.json \
    --model_dir=${ALPHAFOLD_MODEL} \
    --output_dir=${ALPHAFOLD_RESULTS}/output \
    --flash_attention_implementation=xla \
    --run_inference=True

Best Practice for Running AlphaFold on Berzelius

To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs. You should run the CPU part on Tetralith or your local computer, and then run the GPU part on Berzelius.

  1. Run the CPU part of the job on Tetralith.

You need to set --norun_inference in the command to run MSA and template searches only.

On Tetralith, the AlphaFold database can be found at /proj/common_datasets/AlphaFold3. The GPU node's local disk at /scratch/local provides 2 TB of NVMe SSD storage; copying the database (0.6 TB) to /scratch/local at the beginning of a job improves I/O performance.

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold3
export ALPHAFOLD_DB_LOCAL=/scratch/local
cp -a ${ALPHAFOLD_DB}/* ${ALPHAFOLD_DB_LOCAL} 

We run an example:

export ALPHAFOLD_MODEL=${ALPHAFOLD_DB}/model_parameters
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_3.0.0
module load AlphaFold/3.0.0-hpc1

python ${ALPHAFOLD_PREFIX}/run_alphafold.py \
    --db_dir=${ALPHAFOLD_DB_LOCAL} \
    --json_path=${ALPHAFOLD_RESULTS}/input/alphafold_input.json \
    --model_dir=${ALPHAFOLD_MODEL} \
    --output_dir=${ALPHAFOLD_RESULTS}/output \
    --flash_attention_implementation=xla \
    --norun_inference

  2. Transfer the CPU part results from Tetralith to Berzelius via your local computer.

  3. Run the GPU part of the job on Berzelius.

You need to set --norun_data_pipeline in the command. This will skip the MSA and template searches and proceed directly to the predictions. This stage requires the input JSON file to contain pre-computed MSAs and templates.

export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold3
export ALPHAFOLD_MODEL=${ALPHAFOLD_DB}/model_parameters
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_3.0.0
module load AlphaFold/3.0.0-hpc1

python ${ALPHAFOLD_PREFIX}/run_alphafold.py \
    --db_dir=${ALPHAFOLD_DB} \
    --json_path=${ALPHAFOLD_RESULTS}/output/2pv7/2pv7_data.json \
    --model_dir=${ALPHAFOLD_MODEL} \
    --output_dir=${ALPHAFOLD_RESULTS}/output \
    --norun_data_pipeline
  4. To achieve better GPU utilization, you can run several AlphaFold GPU part jobs concurrently; see the example sbatch script, which executes 5 such jobs at once.
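
The example sbatch script is not reproduced here, but its core pattern is to launch each per-input command in the background and then wait for all of them. The sketch below shows that pattern with hypothetical input names and a placeholder command standing in for the real run_alphafold.py invocation.

```shell
#!/bin/bash
# Hypothetical input names; in a real job each would have its own
# *_data.json produced by the CPU stage.
INPUTS="2pv7 1abc 2def 3ghi 4jkl"

for NAME in ${INPUTS}; do
    # Placeholder for the real command, e.g.:
    # python ${ALPHAFOLD_PREFIX}/run_alphafold.py \
    #     --json_path=${ALPHAFOLD_RESULTS}/output/${NAME}/${NAME}_data.json \
    #     --model_dir=${ALPHAFOLD_MODEL} \
    #     --output_dir=${ALPHAFOLD_RESULTS}/output \
    #     --norun_data_pipeline &
    ( sleep 1; touch /tmp/af3_demo_${NAME}.done ) &
done

wait    # block until every background job has finished
echo "all jobs finished"
```

Because the inference stages of several inputs rarely saturate one GPU on their own, overlapping them this way raises overall utilization; `wait` keeps the allocation alive until the slowest job completes.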

AlphaFold 3 Alternatives

HelixFold3

HelixFold3 was developed to replicate the advanced capabilities of AlphaFold3. Its accuracy in predicting the structures of small-molecule ligands, nucleic acids (including DNA and RNA), and proteins is comparable to that of AlphaFold3.

Loading the Module

On a compute node we load the HelixFold3 module.

module load HelixFold3/73cd80b-hpc1

Running an Example

You can use the flag --run_feature_only to separate the CPU and GPU parts of the job.

export INPUT_JSON_PATH=/proj/nsc_testing/xuan/helixfold3_results/input/demo_protein_ligand.json
export OUTPUT_DIR=/proj/nsc_testing/xuan/helixfold3_results/output

run_infer.sh --input_json ${INPUT_JSON_PATH} \
--output_dir ${OUTPUT_DIR} \
--run_feature_only False \
--infer_times 5 \
--diff_batch_size 1 \
--precision "fp32"
