Ask.Cyberinfrastructure

How do I get the list of features and resources of each node in Slurm?

hpc-getting-started
slurm
scheduler

#1

I want to be able to list the nodes in a Slurm-managed cluster along with their specific attributes: how many cores, which processor, how much memory, whether they have GPUs, and which features are available. How do I do that?

CURATOR: Kristina Plazonic KrisP


#2

ANSWER: The short answer is:

sinfo -o "%20N  %10c  %10m  %25f  %10G "

You can see the options of sinfo by running sinfo --help. In particular, sinfo -o specifies the format of the output, and the field specifiers above stand for

  • N = node name
  • c = number of cores
  • m = memory
  • f = features, often the CPU architecture or the type of attached GPU
  • G = GRES (generic resource) type and count, e.g. gpu:2

The %20 means the field is 20 characters wide. For example, for easy import into a Confluence page you would want to separate the fields with |, so your command would be
sinfo -o "|%20N | %10c | %10m | %25f | %10G|"
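As a side note, once you have that tabular output you can post-process it with standard tools such as awk, for example to list only GPU-equipped nodes. A minimal sketch, using a few made-up sample rows in place of live sinfo output (on a real cluster you would pipe sinfo -o "%N %c %m %f %G" into the awk command instead):

```shell
# Hypothetical sample of sinfo output: node, cores, memory, features, GRES
cat <<'EOF' > nodes.txt
sh-09-01 16 256000 CPU_GEN:IVB gpu:8
sh-18-11 28 128000 CPU_GEN:HSW (null)
sh-112-05 20 256000 CPU_GEN:BDW gpu:4
EOF

# Keep only rows whose GRES column (field 5) is not "(null)",
# i.e. nodes that actually have GPUs
awk '$5 != "(null)"' nodes.txt
```

The same pattern works for filtering on features (field 4) or memory (field 3).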


#3

+1! And here is an example of the output above:

$ sinfo -o "%20N  %10c  %10m  %25f  %10G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES       
sh-01-[01-36],sh-02-  16          64000+      CPU_GEN:IVB,CPU_SKU:E5-26  (null)     
sh-03-01,sh-07-[25-3  16          64000+      CPU_GEN:HSW,CPU_SKU:E5-26  (null)     
sh-101-[01-58,61-72]  20          128000+     CPU_GEN:BDW,CPU_SKU:E5-26  (null)     
sh-112-01             56          3072000     CPU_GEN:BDW,CPU_SKU:E5-46  (null)     
sh-02-[13-14]         32          1536000     CPU_GEN:SNB,CPU_SKU:E5-46  (null)     
sh-112-[02-03],sh-11  32          512000+     CPU_GEN:BDW,CPU_SKU:E5-26  (null)     
sh-09-[03-05],sh-13-  16          64000+      CPU_GEN:IVB,CPU_SKU:E5-26  gpu:8      
sh-112-[04,06-07],sh  20          256000+     CPU_GEN:BDW,CPU_SKU:E5-26  gpu:4      
sh-113-[12-14],sh-11  20          256000      CPU_GEN:BDW,CPU_SKU:E5-26  gpu:4      
sh-09-[01-02]         16          256000      CPU_GEN:IVB,CPU_SKU:E5-26  gpu:8      
sh-112-05,sh-113-08   20          256000      CPU_GEN:BDW,CPU_SKU:E5-26  gpu:4      
sh-17-29,sh-27-[21,3  16          128000+     CPU_GEN:HSW,CPU_SKU:E5-26  gpu:8      
sh-103-[25-36],sh-10  24          191000+     CPU_GEN:SKX,CPU_SKU:5118,  (null)     
sh-09-[07-10]         16          64000       CPU_GEN:IVB,CPU_SKU:E5-26  gpu:4      
sh-15-[01-08],sh-16-  16          128000+     CPU_GEN:IVB,CPU_SKU:E5-26  gpu:8      
sh-17-12,sh-25-23     48          1536000     CPU_GEN:HSW,CPU_SKU:E7-48  (null)     
sh-17-[23-28]         32          256000      CPU_GEN:HSW,CPU_SKU:E5-26  (null)     
sh-28-10              20          256000      CPU_GEN:HSW,CPU_SKU:E5-26  (null)     
sh-112-[08-12],sh-11  20          128000+     CPU_GEN:BDW,CPU_SKU:E5-26  gpu:4      
sh-112-[13-17],sh-11  20          256000+     CPU_GEN:BDW,CPU_SKU:E5-26  gpu:8      
sh-114-[01-04]        20          256000      CPU_GEN:BDW,CPU_SKU:E5-26  gpu:4      
sh-15-[09-10],sh-19-  16          128000      CPU_GEN:HSW,CPU_SKU:E5-26  gpu:8      
sh-18-[01-10]         24          96000       CPU_GEN:HSW,CPU_SKU:E5-26  (null)     
sh-18-11              28          128000      CPU_GEN:HSW,CPU_SKU:E5-26  (null) 

If you want to do a custom format, take a look at the formatting examples and options in the sinfo documentation (man sinfo).

One thing I always had a hard time with (and I don’t have a good answer beyond looking at cluster-specific documentation) is “What in the heck do the features actually mean?” The best I could do was poke around in the Slurm configuration file at /etc/slurm/slurm.conf. I think this level of commenting is nonstandard (and not found across clusters) because our cluster admin is a badass, but it helped explain the features listed by sinfo.

# -- Nodes --------------------------------------------------------------------

# >> Nodes features
#
# -- CPU features
#   * CPU_GEN: CPU generation: SNB|IVB|HSW|BDW|SKX|CNL
#   * CPU_SKU: CPU model     : E5-2640v2
#   * CPU_FRQ: CPU frequency : 2.60GHz
#
# -- GPU features
#   * GPU_GEN: GPU generation: KPL|MXW|PSC|VLT
#   * GPU_BRD: GPU brand     : GEFORCE|TESLA
#   * GPU_SKU: GPU model     : TITAN_{BLACK,X,Xp}|TESLA_{K{20,80},P40,P100}
#   * GPU_MEM: GPU memory    : 8GB
#   * GPU_CC:  GPU Compute Capability : 3.5|3.7|6.1

# >> Weights
#
# * Nodes with lower weight will be selected first.
# * Nodes with more memory or GRES are given a higher weight, so they're
#   selected last and saved for jobs that really need it
# * Nodes with more recent CPU generation will be selected first to give the
#   best performance
#
#   Weight mask: 1 | #GRES | Memory | #Cores | CPUgen | 1
#       prefix is to avoid octal conversion
#       suffix is to avoid having null weights
#
#   Values:
#       #GRES   none: 0   Memory   64 GB: 0   #Cores  16: 0  CPUgen  ???: 3
#              1 GPU: 1            96 GB: 1           20: 1          CNL: 4
#              2 GPU: 2           128 GB: 2           24: 2          SKX: 5
#              3 GPU: 3           192 GB: 3           28: 3          BDW: 6
#              4 GPU: 4           256 GB: 4           32: 4          HSW: 7
#              6 GPU: 5           384 GB: 5           48: 5          IVB: 8
#              8 GPU: 6           512 GB: 6           56: 6          SNB: 9
#             10 GPU: 7          1024 GB: 7
#             16 GPU: 8          1534 GB: 8
#                                3072 GB: 9
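Putting the weight mask above into practice: a hypothetical 28-core Haswell node with 128 GB of memory and no GRES would get the digits 1 | 0 (no GRES) | 2 (128 GB) | 3 (28 cores) | 7 (HSW) | 1, i.e. Weight=102371. Features themselves are declared on each node's NodeName line in slurm.conf; a sketch of such an entry (the SKU and frequency here are assumptions, not from our actual config) might look like:

```
NodeName=sh-18-11 CPUs=28 RealMemory=128000 Feature=CPU_GEN:HSW,CPU_SKU:E5-2697v3,CPU_FRQ:2.60GHz Weight=102371
```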

All that said, knowing that “HSW” is a CPU generation (Haswell) doesn’t really help me much. I would never know how or when it is appropriate to ask for these features. Does anyone else have thoughts on this?
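One concrete use: a feature string from AVAIL_FEATURES can be requested as a job constraint via --constraint (-C), which is handy when your code is built for a specific CPU generation or needs a particular GPU. A minimal sketch (the script name is made up, and CPU_GEN:HSW is taken from the listing above; feature strings must match exactly):

```shell
# Write a hypothetical batch script that pins the job to Haswell nodes.
cat <<'EOF' > hsw_job.sbatch
#!/bin/bash
#SBATCH --constraint=CPU_GEN:HSW   # only schedule on nodes with this feature
#SBATCH --gres=gpu:1               # also request one GPU (GRES, not a feature)
hostname
EOF

# You would submit it with: sbatch hsw_job.sbatch
# Show the constraint line we just wrote:
grep -- --constraint hsw_job.sbatch
```

Note the distinction: features are matched with --constraint, while GPUs and other countable resources are requested with --gres.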