AlphaFold is a deep learning-based protein structure prediction program developed by DeepMind. The software uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold was released in 2018 and was considered a breakthrough in the field of protein structure prediction. In 2020, AlphaFold2 won the CASP14 competition, a biennial assessment of the state of the art in protein structure prediction. AlphaFold2 predicted protein structures with remarkable accuracy, which has implications for drug discovery and for understanding diseases at the molecular level.
A copy of the AlphaFold database is available on Berzelius at /proj/common-datasets for public use.
We first specify the paths for the AlphaFold database, the AlphaFold installation, and the results.
export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input
If you prefer to download your own copy of the AlphaFold database, you can do so as follows.
export ALPHAFOLD_DB=/proj/nsc_testing/xuan/AlphaFold
mkdir -p ${ALPHAFOLD_DB}
module load aria2/1.36.0-gcc-8.5.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
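Once the download completes, the database directory should contain roughly the following subdirectories, totalling on the order of 2.6 TB unpacked (layout as described in the upstream AlphaFold 2.3 README; exact names can vary with database versions):
ls ${ALPHAFOLD_DB}
# Expected (approximate) layout:
#   bfd/  mgnify/  params/  pdb70/  pdb_mmcif/  pdb_seqres/
#   small_bfd/  uniprot/  uniref30/  uniref90/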
The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
Our patch adds two new input arguments to run_alphafold.py.
n_parallel_msa
--n_parallel_msa=1: the MSA searches are not parallelized.
--n_parallel_msa=3: the MSA searches are all run in parallel.
This flag has been wrapped as -P in the wrapper.
run_feature_only
--run_feature_only=true: only run the MSA and template searches.
This flag has been wrapped as -F in the wrapper.
The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.
On Berzelius, AlphaFold is available as a module. On a compute node we load the AlphaFold module.
module load AlphaFold/2.3.1-hpc1
We run an example.
run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false
Please run run_alphafold.sh -h to check the usage.
We first load the Mambaforge module.
module load Mambaforge/23.3.1-1-hpc1-bdist
We create a conda environment from the provided yml file.
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
mamba env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
mamba activate alphafold_2.3.1
To download AlphaFold:
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
To apply the OpenMM patch (run from the site-packages directory of the active conda environment):
cd ${CONDA_PREFIX}/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch
To download the chemical properties file:
wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
To install the patch:
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}
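To confirm that the patch applied cleanly, one quick sanity check (assuming the alphafold_2.3.1 environment is still active) is to look for the new flags in the script's help output:
cd ${ALPHAFOLD_DIR}
python3 run_alphafold.py --helpfull 2>&1 | grep -E "n_parallel_msa|run_feature_only"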
cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false
Please check the input arguments in run_alphafold.py. A complete list of input arguments is attached here for reference.
There is an Apptainer image of AlphaFold 2.3.1 at /software/sse/containers.
apptainer exec --nv /software/sse/containers/alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false"
The three independent sequential MSA searches can be arranged in parallel to accelerate the job. You can enable parallelization by setting the flag -P 3.
Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.
AlphaFold 2.3.1 uses a default choice of 8, 8, and 4 threads for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually allocate more threads to it. You can set the number of threads for the three searches in alphafold/data/pipeline.py at lines 131 to 134.
For multimer models, the jackhmmer (uniprot) search starts when the first three searches finish. You can set its number of threads in alphafold/data/pipeline_multimer.py at line 179.
We recommend using n_cpu = 8, 8, 16, and 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd), and jackhmmer (uniprot), respectively.
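To locate and adjust these settings, you can inspect the referenced lines and then edit the n_cpu values in place (a sketch; the line numbers refer to the patched 2.3.1 sources, so verify them in your copy first):
# Show the thread settings for the three MSA searches:
sed -n '131,134p' ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py
# Show the thread setting for the multimer jackhmmer (uniprot) search:
sed -n '179p' ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py
# Then edit the n_cpu values, e.g.:
nano +131 ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py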
A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius.
Each compute node has a local scratch file system used for temporary storage while the job is running. The data is deleted once the job finishes. On Berzelius, this disk at /scratch/local provides 15 TB of NVMe SSD storage. While it is possible to copy the AlphaFold database to /scratch/local at the beginning of a job to improve I/O performance, our experiments on Berzelius have shown that doing so does not significantly reduce the job running time.
On Tetralith, the GPU node's local disk at /scratch/local is 2 TB of NVMe SSD storage. You can copy the BFD subset (1.8 TB) to /scratch/local at the beginning of a job to improve I/O performance. On Tetralith, the AlphaFold database can be found at /proj/common_datasets/AlphaFold.
export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
ln -s ${ALPHAFOLD_DB}/mgnify_2022_05 ${ALPHAFOLD_DB_LOCAL}/mgnify
ln -s ${ALPHAFOLD_DB}/params_2022_12_06 ${ALPHAFOLD_DB_LOCAL}/params
ln -s ${ALPHAFOLD_DB}/pdb70_200401 ${ALPHAFOLD_DB_LOCAL}/pdb70
ln -s ${ALPHAFOLD_DB}/pdb_mmcif ${ALPHAFOLD_DB_LOCAL}/pdb_mmcif
ln -s ${ALPHAFOLD_DB}/pdb_seqres ${ALPHAFOLD_DB_LOCAL}/pdb_seqres
ln -s ${ALPHAFOLD_DB}/uniprot_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniprot
ln -s ${ALPHAFOLD_DB}/uniref30_2021_03 ${ALPHAFOLD_DB_LOCAL}/uniref30
ln -s ${ALPHAFOLD_DB}/uniref90_2022_05 ${ALPHAFOLD_DB_LOCAL}/uniref90
An sbatch script example has been prepared for you here.
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
-d ${ALPHAFOLD_DB_LOCAL} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false
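If the linked script is unavailable, a minimal Slurm wrapper for the job above might look like the following (a sketch only; the job name, time limit, and resource requests are assumptions to adapt to your allocation and to Tetralith's GPU request conventions):
#!/bin/bash
#SBATCH -J alphafold_T1050
#SBATCH -t 24:00:00
#SBATCH -N 1
# Adjust the GPU/CPU request below to your cluster's conventions.
#SBATCH --exclusive

export ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
export ALPHAFOLD_DB_LOCAL=/scratch/local
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1

# Stage the BFD subset on node-local NVMe and symlink the rest.
cp -a ${ALPHAFOLD_DB}/bfd ${ALPHAFOLD_DB_LOCAL}/
for db in mgnify_2022_05:mgnify params_2022_12_06:params pdb70_200401:pdb70 \
          pdb_mmcif:pdb_mmcif pdb_seqres:pdb_seqres uniprot_2022_05:uniprot \
          uniref30_2021_03:uniref30 uniref90_2022_05:uniref90; do
    ln -s ${ALPHAFOLD_DB}/${db%%:*} ${ALPHAFOLD_DB_LOCAL}/${db##*:}
done

module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
    -d ${ALPHAFOLD_DB_LOCAL} \
    -o ${ALPHAFOLD_RESULTS}/output \
    -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
    -t 2021-11-01 \
    -g true \
    -P 3 \
    -F false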
To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs. You should run the CPU part on Tetralith or your local computer, and then run the GPU part on Berzelius.
You need to set -F true in the command to run only the MSA and template searches.
Stage the database on the GPU node's local disk exactly as described in the Optimization section above: set ALPHAFOLD_DB and ALPHAFOLD_DB_LOCAL, copy the BFD subset (1.8 TB) to /scratch/local, and symlink the remaining databases.
An sbatch script example has been prepared for you here.
export ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
-d ${ALPHAFOLD_DB_LOCAL} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g false \
-P 3 \
-F true
Transfer the CPU part results from Tetralith to Berzelius via your local computer.
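For example, from your local computer (a sketch; replace <user> and adjust login node names and paths to your own setup):
# Pull the feature results from Tetralith:
rsync -av <user>@tetralith.nsc.liu.se:/proj/nsc/users/xuan/alphafold_results_2.3.1/output/ alphafold_output/
# Push them to Berzelius:
rsync -av alphafold_output/ <user>@berzelius1.nsc.liu.se:/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/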
Run the GPU part of the job on Berzelius.
You need to set -F false in the command. This will skip the MSA and template searches and proceed directly to the predictions.
An sbatch script example has been prepared for you here.
export ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 1 \
-F false
LocalColabFold is a local version of ColabFold. ColabFold allows researchers to use AlphaFold’s powerful capabilities without requiring large computational resources, making it accessible via Google Colab.
LocalColabFold was developed to enable users to run AlphaFold predictions on their own local systems (like workstations or local servers) rather than relying on Google Colab. It integrates with various tools like MMseqs2 and makes the AlphaFold structure prediction pipeline more accessible by reducing the dependency on external services. This setup is particularly useful for labs or institutions that have access to GPUs or other high-performance computing resources.
On a compute node we load the LocalColabFold module.
module load LocalColabFold/1.5.5-hpc1
We run an example.
colabfold_batch --data /proj/common-datasets/AlphaFold input/ output/
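colabfold_batch accepts further options to control the prediction. A sketch with some commonly used flags (see colabfold_batch --help for the authoritative list):
# Run all five models with three recycles, template search and AMBER relaxation.
colabfold_batch --data /proj/common-datasets/AlphaFold \
    --num-models 5 \
    --num-recycle 3 \
    --templates \
    --amber \
    input/ output/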
OpenFold is an open-source reimplementation of AlphaFold. OpenFold aims to reproduce the key functionalities of AlphaFold with an open-source license, allowing more flexibility for researchers to modify, improve, and integrate the model into their workflows.
On a compute node we load the OpenFold module.
module load OpenFold/2.1.0-hpc1
To precompute alignments with jackhmmer and hhblits:
export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments.py $INPUT_FASTA_DIR ${OUTPUT_DIR}/alignments \
--uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
--pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
--uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--cpus_per_task 16 \
--jackhmmer_binary_path ${CONDA_PREFIX}/bin/jackhmmer \
--hhblits_binary_path ${CONDA_PREFIX}/bin/hhblits \
--hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
--kalign_binary_path ${CONDA_PREFIX}/bin/kalign
To precompute alignments with MMseqs2:
export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
cd ${OPENFOLD_PREFIX}
python3 scripts/precompute_alignments_mmseqs.py $INPUT_FASTA_DIR/T1050.fasta \
$BASE_DATA_DIR/mmseqs_dbs \
uniref30_2103_db \
${OUTPUT_DIR}/alignments \
${CONDA_PREFIX}/bin/mmseqs \
--hhsearch_binary_path ${CONDA_PREFIX}/bin/hhsearch \
--env_db colabfold_envdb_202108_db \
--pdb70 $BASE_DATA_DIR/pdb70/pdb70
To run inference using the precomputed alignments:
export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
export PRECOMPUTED_ALIGNMENTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output/alignments
cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py ${INPUT_FASTA_DIR} \
$TEMPLATE_MMCIF_DIR \
--output_dir $OUTPUT_DIR \
--use_precomputed_alignments $PRECOMPUTED_ALIGNMENTS \
--config_preset model_1_ptm \
--model_device "cuda:0"
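When the inference finishes, the predicted structures should appear under the output directory (the predictions subdirectory name is an assumption based on OpenFold's default layout; verify on your system):
ls ${OUTPUT_DIR}/predictions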
To run inference with alignments computed on the fly:
export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py \
$INPUT_FASTA_DIR \
$TEMPLATE_MMCIF_DIR \
--output_dir $OUTPUT_DIR \
--config_preset model_1_ptm \
--uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
--mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
--pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
--uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--model_device "cuda:0"
To train OpenFold:
export BASE_DATA_DIR=/proj/common-datasets/OpenFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_data/mmcif_files
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output
cd ${OPENFOLD_PREFIX}
python3 train_openfold.py $BASE_DATA_DIR/pdb_data/mmcif_files $BASE_DATA_DIR/alignment_data/alignments $TEMPLATE_MMCIF_DIR $OUTPUT_DIR \
2021-10-10 \
--train_chain_data_cache_path $BASE_DATA_DIR/pdb_data/data_caches/chain_data_cache.json \
--template_release_dates_cache_path $BASE_DATA_DIR/pdb_data/data_caches/mmcif_cache.json \
--config_preset initial_training \
--seed 42 \
--obsolete_pdbs_file_path $BASE_DATA_DIR/pdb_data/obsolete.dat \
--num_nodes 1 \
--gpus 2
For more information on using OpenFold, please refer to the OpenFold documentation.