GPFS prefetch control

DO NOT use the gpfsnoprefetch feature for your jobs unless NSC staff have explicitly told you to do so, OR you have tested with and without it and found that your application performs better with it enabled.

The storage/file system software (GPFS/Storage Scale) used on NSC Centre Storage (i.e. /proj and /home on Tetralith and Sigma) will sometimes prefetch data. If your application reads a small amount of data from a file, GPFS might read more than that from our storage servers, since it assumes that you will continue reading more data from the file.

This is not always a good idea, and GPFS tries to be smart about when it should prefetch data and when it should not. But sometimes it gets it wrong. In extreme cases a compute node might read many times more data from the storage system than the application actually uses.

GPFS will typically get the easy cases right. An application doing e.g. seek(random location)-read(4MB)-seek(random location)-read(4MB)-… will not trigger prefetch. An application reading a file sequentially from beginning to end will trigger prefetch.

An example of an I/O pattern that GPFS does not handle well is seek(random location)-read(4MB)-read(4MB)-seek(random location)-read(4MB)-read(4MB)-… GPFS will do a large prefetch read as soon as it sees the second 4MB read from the application, but all of that data except the first 4MB will never be used.
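The problematic pattern can be sketched with dd standing in for the application (the file name and sizes here are made up for illustration). With bs=4M and count=2, each dd call performs one seek followed by two sequential 4 MiB reads, which is enough to make GPFS start prefetching data that is then never used:

```shell
# Create a 100 MiB scratch file to read from (illustration only).
dd if=/dev/zero of=testfile.bin bs=1M count=100 2>/dev/null

for i in 1 2 3; do
  # Pick a random 4 MiB-aligned block within the 25-block file.
  blk=$((RANDOM % 24))
  # One seek plus two sequential 4 MiB reads: the seek-read-read pattern.
  dd if=testfile.bin of=/dev/null bs=4M skip=$blk count=2 2>/dev/null
done

rm -f testfile.bin
echo "pattern complete"
```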

Excessive prefetch will usually give slightly worse performance for your application, but more importantly it puts a lot of extra load on the storage system, making disk I/O slower for everyone.

When we detect unusually high storage load we track down the source, so you might be contacted by us and asked to modify your jobs or run fewer jobs in parallel.

But we also have another option: disabling prefetch completely on the compute node(s) that run your job.

To do this, submit your job with the option -C gpfsnoprefetch.

This feature can only be used for jobs that request one or more full compute nodes, e.g. "-N1 --exclusive". If you use the option for a job requesting less than a full compute node (e.g. "-n2"), the job will fail on startup.
This feature is currently (2024-09-24) only available on Tetralith and Sigma. If you want to use it on another system, please contact NSC Support.
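As a sketch, a batch script using the feature might look like the following. Only the -C gpfsnoprefetch constraint and the full-node request come from the text above; the script name and the application it runs are hypothetical:

```shell
#!/bin/bash
#SBATCH -N1                  # one full compute node...
#SBATCH --exclusive          # ...requested exclusively (required for the feature)
#SBATCH -C gpfsnoprefetch    # disable GPFS prefetch on the allocated node(s)

# Hypothetical application run:
./my_io_heavy_application
```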

If you believe you have an application that could benefit from turning off prefetch, we suggest that you

  • Contact NSC Support and discuss the application. We might be able to help you measure the I/O and determine if excessive prefetch is happening.

OR

  • Run a few test jobs with and without -C gpfsnoprefetch. If you see a clear difference in performance, feel free to use it. If performance drops or is unchanged, do not use it.

There is also a script, check_read_amplification, that you can use to measure the approximate “read amplification” on a node, i.e. how much more data is read from the storage system than is read by the application.
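The ratio itself is simply the data read over the network divided by the data read by applications. A minimal sketch of the calculation, using made-up sample values of the same magnitude as the report output shown further down:

```shell
# Read amplification = network read rate / application read rate.
# Made-up sample values, in MiB/s:
network=182
applications=102

# Scale by 10 to get one decimal place using integer arithmetic.
ratio=$(( network * 10 / applications ))
echo "read amplification is $((ratio / 10)).$((ratio % 10)) X"
```

With these numbers the script prints "read amplification is 1.7 X", i.e. roughly 1.8 MiB is read from storage for every 1 MiB the application consumes.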

If the script shows a high read amplification and a lot of network bandwidth used, that job type is a good candidate for trying the “gpfsnoprefetch” feature on.

In order to get meaningful data from the script, the following must be true:

  • Only your job must run on the compute node (e.g. by using -N1 --exclusive)
  • There must be no significant amount of other network traffic on the network used for disk storage. On Tetralith and Sigma this is the “ib0” network interface, which is rarely used by applications. MPI traffic will not cause false readings, but e.g. copying data to the node with scp or rsync might.

Example of how to use the script:

  • Start your job normally
  • Login to the compute node using jobsh
  • Run the script. By default it runs until you stop it (Ctrl-C) and prints a report every 60 seconds. Other options are available, run check_read_amplification --help to see them.

Here is an example where we measure a job that reads the first block of a file, then seeks to a random position in the file, reads 10 blocks, and repeats.

[kronberg@tetralith1 ~]$ jobsh n1428
[kronberg@n1428 ~]$ check_read_amplification --interval=10
2024-09-24 16:38:32.309197: read amplification is 1.8 X network: 181.93 MiB/s, applications: 101.60 MiB/s
2024-09-24 16:38:42.349928: read amplification is 1.8 X network: 203.84 MiB/s, applications: 115.20 MiB/s
[...]
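The measured workload above could be approximated like this with dd (the file name and sizes are hypothetical; here one “block” is taken to be 1 MiB):

```shell
# Create a 64 MiB scratch file (illustration only).
dd if=/dev/zero of=testfile.bin bs=1M count=64 2>/dev/null

for i in 1 2 3; do
  # Read the first block of the file.
  dd if=testfile.bin of=/dev/null bs=1M count=1 2>/dev/null
  # Seek to a random position and read 10 blocks.
  off=$((RANDOM % 54))
  dd if=testfile.bin of=/dev/null bs=1M skip=$off count=10 2>/dev/null
done

rm -f testfile.bin
echo "workload complete"
```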
