Berzelius Common Datasets

To avoid data duplication and save hard drive space, we provide access to a selection of public datasets frequently used in AI/ML research. The datasets are available read-only under COMMON_DATASETS=/proj/common-datasets.
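As a minimal sketch of how a script might locate these directories, the following Python snippet reads the COMMON_DATASETS environment variable and falls back to the path given above. The helper names `common_datasets_root` and `dataset_path` are hypothetical, and whether the variable is already set in your session is an assumption.

```python
import os
from pathlib import Path

# Default taken from the text above. The environment variable may or may not
# be set in your session (assumption), so fall back to the documented path.
DEFAULT_ROOT = "/proj/common-datasets"


def common_datasets_root() -> Path:
    """Return the common-datasets root, preferring $COMMON_DATASETS if set."""
    return Path(os.environ.get("COMMON_DATASETS", DEFAULT_ROOT))


def dataset_path(name: str) -> Path:
    """Build the path of one dataset directory, e.g. dataset_path("MNIST")."""
    return common_datasets_root() / name


if __name__ == "__main__":
    root = common_datasets_root()
    if root.is_dir():
        # List the top-level dataset directories that are visible to you.
        for entry in sorted(p.name for p in root.iterdir() if p.is_dir()):
            print(entry)
    else:
        print(f"{root} is not available on this machine")
```

Because the datasets are read-only, any preprocessing output should be written to your own project or scratch storage rather than under this root.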

Please refer to the List of Common Datasets on Berzelius for the version and license information of each dataset.

Users are encouraged to contact us to request corrections, updates, or the addition of new datasets.

AlphaFold Genetic Databases
AlphaFold 3 Genetic Databases
Argoverse
CIFAR-10 and CIFAR-100
COCO
DomainNet
Fashion-MNIST
ImageNet
Imagenette
KITTI
KITTI-360
MAN TruckScenes
MNIST
nuImages
nuPlan
nuScenes
OpenFold
Places365
SMHI IFCB plankton
SYKE-plankton_IFCB_2022
SYKE-plankton_IFCB_Utö_2021
Waymo Open Dataset
WHOI-Plankton
Zenseact Open Dataset
List of Common Datasets on Berzelius

AlphaFold Genetic Databases

AlphaFold needs multiple genetic (sequence) databases to run:

BFD,
MGnify,
PDB70,
PDB (structures in the mmCIF format),
PDB seqres – only for AlphaFold-Multimer,
UniRef30 (formerly known as UniClust30),
UniProt – only for AlphaFold-Multimer,
UniRef90.

The dataset is available at $COMMON_DATASETS/AlphaFold.
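Before submitting a long job, it can help to confirm that every database directory is visible. The sketch below is one way to do that. The subdirectory names in `ASSUMED_SUBDIRS` are an assumption based on the names used by the upstream AlphaFold download scripts; the actual layout under $COMMON_DATASETS/AlphaFold may differ, so adjust the list after inspecting the directory. The function name `missing_databases` is hypothetical.

```python
from pathlib import Path

# ASSUMPTION: these names follow the upstream AlphaFold download scripts.
# The layout on Berzelius is not stated in this document and may differ.
ASSUMED_SUBDIRS = [
    "bfd",
    "mgnify",
    "pdb70",
    "pdb_mmcif",
    "pdb_seqres",   # only needed for AlphaFold-Multimer
    "uniref30",
    "uniprot",      # only needed for AlphaFold-Multimer
    "uniref90",
]


def missing_databases(root, subdirs=ASSUMED_SUBDIRS):
    """Return the names in `subdirs` that are not directories under `root`."""
    root = Path(root)
    return [name for name in subdirs if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_databases("/proj/common-datasets/AlphaFold")
    if missing:
        print("Not found (check the assumed names):", ", ".join(missing))
    else:
        print("All assumed database directories are present.")
```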

AlphaFold 3 Genetic Databases

AlphaFold 3 needs multiple genetic (sequence) databases to run:

BFD small
MGnify
PDB (structures in the mmCIF format)
PDB seqres
UniProt
UniRef90
NT
RFam
RNACentral

The dataset is available at $COMMON_DATASETS/AlphaFold3.

Due to Terms of Use limitations, you will need to obtain the model parameters yourself.

Argoverse

Argoverse is a publicly available dataset for autonomous driving research and development. It is widely used for tasks such as perception, prediction, motion forecasting, 3D object detection, and other aspects of self-driving car development.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/Argoverse.

CIFAR-10 and CIFAR-100

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The CIFAR-100 dataset has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.

The dataset is available at $COMMON_DATASETS/CIFAR.
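The split sizes quoted above are consistent with each other, which the short check below makes explicit. It uses only the figures stated in this section and can serve as a sanity check after loading the data; the variable names are illustrative.

```python
# Figures quoted in the CIFAR description above.
CIFAR10_CLASSES = 10
CIFAR10_IMAGES_PER_CLASS = 6000
CIFAR10_TRAIN = 50000
CIFAR10_TEST = 10000

CIFAR100_CLASSES = 100
CIFAR100_TRAIN_PER_CLASS = 500
CIFAR100_TEST_PER_CLASS = 100

# CIFAR-10: 10 classes x 6000 images = 60000 = 50000 train + 10000 test.
cifar10_total = CIFAR10_CLASSES * CIFAR10_IMAGES_PER_CLASS
assert cifar10_total == CIFAR10_TRAIN + CIFAR10_TEST == 60000

# CIFAR-100: 100 classes x (500 train + 100 test) = 600 images per class.
cifar100_train = CIFAR100_CLASSES * CIFAR100_TRAIN_PER_CLASS
cifar100_test = CIFAR100_CLASSES * CIFAR100_TEST_PER_CLASS
cifar100_total = cifar100_train + cifar100_test

print(cifar10_total, cifar100_train, cifar100_test, cifar100_total)
```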

COCO

The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset.

The dataset is available at $COMMON_DATASETS/COCO.

DomainNet

DomainNet is a large, multi-domain dataset used for domain adaptation research in machine learning and computer vision. It is specifically designed to help researchers train models that can generalize across different visual domains.

The dataset is available at $COMMON_DATASETS/DomainNet.

Fashion-MNIST

Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

The dataset is available at $COMMON_DATASETS/Fashion-MNIST.

ImageNet

ImageNet is a large, widely used computer vision dataset, particularly for image classification, object detection, and other visual recognition tasks. We provide the data for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, including:

train: 1,281,167 training images in 1,000 categories
val: 50,000 validation images

We also provide the training and validation images in both LMDB and TFRecord formats.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/ImageNet.

Imagenette

Imagenette is a subset of 10 easily classified classes from Imagenet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/Imagenette.

KITTI

The KITTI dataset is one of the most widely used datasets in autonomous driving research. It was created by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTI-C) for developing and evaluating autonomous vehicle algorithms, particularly for tasks such as 3D object detection, tracking, stereo vision, optical flow, and visual odometry.

The dataset is available at $COMMON_DATASETS/KITTI.

KITTI-360

KITTI-360 is a large-scale autonomous driving dataset created as an extension of the original KITTI Vision Benchmark Suite. It was released by the Autonomous Vision Group at the University of Tübingen (the same group behind KITTI).

The dataset is available at $COMMON_DATASETS/KITTI-360.

MAN TruckScenes

MAN TruckScenes is a public, large-scale, multimodal dataset released by MAN Truck & Bus specifically designed for advancing autonomous truck perception and driving research.

The dataset is available at $COMMON_DATASETS/MAN-TruckScenes.

MNIST

MNIST is a handwritten digit database used for image processing and machine learning algorithms.

Four files are available:

train-images-idx3-ubyte: training set images
train-labels-idx1-ubyte: training set labels
t10k-images-idx3-ubyte: test set images
t10k-labels-idx1-ubyte: test set labels

The dataset is available at $COMMON_DATASETS/MNIST.
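The four files use the IDX binary format: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions), followed by one 4-byte big-endian size per dimension, followed by the raw data. The sketch below reads such a file with the Python standard library only. The function name `read_idx` is hypothetical, it handles only the unsigned-byte type used by MNIST, and it assumes the files are stored uncompressed.

```python
import struct
from pathlib import Path


def read_idx(path):
    """Read an unsigned-byte IDX file and return (shape, payload_bytes)."""
    data = Path(path).read_bytes()
    # Magic number: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    shape = struct.unpack(f">{ndim}I", data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    # The payload must hold exactly one byte per element.
    expected = 1
    for dim in shape:
        expected *= dim
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return shape, payload
```

For example, `shape, pixels = read_idx(mnist_dir / "train-images-idx3-ubyte")` would return the image array dimensions and the pixel bytes, where `mnist_dir` is a placeholder for wherever the file sits under $COMMON_DATASETS/MNIST.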

nuImages

nuImages is a large-scale dataset for autonomous driving, developed by the team at Motional. It contains images from multiple cameras mounted on a vehicle, along with annotations for various objects in the scene.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/nuImages.

nuPlan

nuPlan is the world’s first large-scale planning benchmark for autonomous driving.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/nuPlan.

nuScenes

The nuScenes dataset is a public large-scale dataset for autonomous driving developed by the team at Motional.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/nuScenes.

OpenFold

In addition to AlphaFold’s genetic databases, OpenFold requires the following datasets to run:

OpenFold trained parameters: openfold_params,
SoloSeq trained parameters: openfold_soloseq_params,
ColabFold’s environmental database: mmseqs_dbs,
Alignments for training: alignment_data,
Alignment DBs: alignment_data/alignment_dbs,
Data caches for training: pdb_data/data_caches.

The dataset is available at $COMMON_DATASETS/OpenFold.

Places365

Places365-Standard contains 1.8 million training images from 365 scene categories, which are used to train the Places365 CNNs. There are 50 images per category in the validation set and 900 images per category in the test set. We provide the high-resolution images.

Please send us an email with a screenshot of your completed form and confirm your agreement to the following Terms of use. Once we receive your confirmation, we will grant you access to the dataset on Berzelius.

Terms of use: by downloading the image data you agree to the following terms:

1. You will use the data only for non-commercial research and educational purposes.
2. You will NOT distribute the above images.
3. Massachusetts Institute of Technology makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
4. You accept full responsibility for your use of the data and shall defend and indemnify Massachusetts Institute of Technology, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.

The dataset is available at $COMMON_DATASETS/Places.

SMHI IFCB plankton

SMHI IFCB plankton includes three datasets of plankton images manually annotated by phytoplankton experts at the Swedish Meteorological and Hydrological Institute (SMHI).

The dataset is available at $COMMON_DATASETS/SMHI-IFCB-Plankton.

SYKE-plankton_IFCB_2022

The SYKE-plankton_IFCB_2022 dataset consists of approximately 63,000 images representing 50 different classes of phytoplankton, collected using the Imaging FlowCytobot (IFCB) from various locations in the Baltic Sea. These images were manually annotated by expert taxonomists and are used to develop and evaluate classification methods for phytoplankton recognition.

The dataset is available at $COMMON_DATASETS/SYKE-plankton_IFCB_2022.

SYKE-plankton_IFCB_Utö_2021

The SYKE-plankton_IFCB_Utö_2021 dataset is a collection of approximately 150,000 images of phytoplankton, classified into 50 distinct categories, with an additional set of about 94,000 unclassifiable images. The dataset was collected using an Imaging FlowCytobot (IFCB) at the Utö Atmospheric and Marine Research Station in the Baltic Sea during 2021.

The dataset is available at $COMMON_DATASETS/SYKE-plankton_IFCB_Utö_2021.

Waymo Open Dataset

The Waymo Open Dataset is a publicly available dataset provided by Waymo, focused on autonomous driving technology. This dataset is designed to advance research and development in the field of autonomous driving by providing high-quality, diverse, and large-scale data collected from Waymo’s fleet of autonomous vehicles.

Please send us an email with a screenshot of your registration at waymo.com/open and confirm your agreement to the Waymo License. Once we receive your confirmation, we will grant you access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/Waymo.

WHOI-Plankton

WHOI-Plankton is a comprehensive dataset of annotated plankton images developed by researchers at the Woods Hole Oceanographic Institution (WHOI). The dataset contains over 3.5 million images of microscopic marine plankton, categorized into 103 classes. These images are used primarily for developing and evaluating visual recognition models in plankton classification.

The dataset is available at $COMMON_DATASETS/WHOI-Plankton.

Zenseact Open Dataset

The Zenseact Open Dataset (ZOD) is a large multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.

Please accept the Terms of Access to get access to the dataset on Berzelius.

The dataset is available at $COMMON_DATASETS/Zenseact-Open-Dataset.

List of Common Datasets on Berzelius

Dataset	Version Control	License
AlphaFold - BFD	No	CC BY 4.0 Deed
AlphaFold - MGnify	2022_05, 2018_12	CC0
AlphaFold - PDB70	from_mmcif_200401	CC BY 4.0 Deed
AlphaFold - PDB	No	CC0
AlphaFold - PDB seqres	No	CC0
AlphaFold - UniRef30	2021_03	CC BY-SA 4.0 Deed
AlphaFold - UniProt	2022_05	CC BY 4.0
AlphaFold - UniRef90	2022_05	CC BY 4.0
AlphaFold - Parameters	2022-12-06	Apache License 2.0
AlphaFold 3 - BFD small	No	CC BY 4.0 Deed
AlphaFold 3 - MGnify	2022_05	CC0
AlphaFold 3 - PDB	2022_09_28	CC0
AlphaFold 3 - PDB seqres	2022_09_28	CC0
AlphaFold 3 - UniProt	2021_04	CC BY 4.0
AlphaFold 3 - UniRef90	2022_05	CC BY 4.0
AlphaFold 3 - NT	2023_02_23	None
AlphaFold 3 - RFam	14_9	CC
AlphaFold 3 - RNACentral	No	CC0
Argoverse	v1.1, v2.0	Terms of Use
CIFAR-10 and CIFAR-100	No	None
COCO	No	CC BY 4.0
DomainNet	No	Fair Use Notice
Fashion-MNIST	No	MIT License
ImageNet	No	Terms of access
Imagenette	No	Terms of access
KITTI	No	None
KITTI-360	No	None
MAN-TruckScenes	v1.0	CC BY-NC-SA 4.0
MNIST	No	CC BY-SA 3.0 Deed
nuImages	v1.0	Terms of Use
nuPlan	v1.1	Terms of Use
nuScenes - panoptic	v1.0	Terms of Use
nuScenes - lidarseg	v1.0	Terms of Use
nuScenes - CAN bus expansion	v1.0	Terms of Use
nuScenes - Map expansion	v1.3	Terms of Use
nuScenes - Full dataset	v1.0	Terms of Use
OpenFold - Trained parameters	No	Apache License 2.0
OpenFold - SoloSeq trained parameters	No	Apache License 2.0
OpenFold - ColabFold's environmental database	202108	MIT License
OpenFold - Alignments	No	CC0
OpenFold - Alignment DBs	No	CC0
OpenFold - Data caches	No	Apache License 2.0
Places365	No	Terms of use
SMHI IFCB Plankton	version 2	CC BY 4.0
SYKE-plankton_IFCB_2022	20220201	CC BY 4.0
SYKE-plankton_IFCB_Utö_2021	20220428	CC BY 4.0
Waymo Open Dataset - Motion Dataset	1.2.1, 1.3.0	License Agreement
Waymo Open Dataset - Perception Dataset	1.4.3, 2.0.1	License Agreement
WHOI-Plankton	No	MIT License
Zenseact Open Dataset	No	License
