[Uc-slurm] Fairshare/cgroups

Paul Weakliem weakliem at cnsi.ucsb.edu
Wed Feb 24 10:59:22 PST 2021


Hi All,

I know a number of the UC campuses are converting to Slurm, and here at 
UCSB we are too (after many years of being happily Torqued).  We have 
Slurm running on a newer cluster (Pod), and while we still have a lot to 
learn about fine-tuning it, it mostly works how we want (in part thanks 
to help from some of you at the other UCs!).  However, we've run into 
an issue with what should be a pretty simple configuration on another 
cluster, so I'll restart this list by throwing our questions out 
there!  And if people have recipes, notes, etc. for building new Slurm 
clusters, maybe we can assemble a collection of them (we can decide 
whether it should be public or not)?

We're updating an older cluster (Knot) from CentOS 6 to 7 and switching 
it to Slurm.  On this particular cluster all users are treated the same 
and have equal usage rights, so we want to base priorities mostly on 
fairshare.  This is simpler than our one system already running Slurm 
(Pod), or our yet-to-be-configured new condo cluster.  Should be 
simple, right?  (Note: I've dropped the .conf files for both clusters 
into this Google Drive folder: 
<https://drive.google.com/drive/folders/1S1zuEKVK28UQ5SCk5dn0sIHlPhZKeB4j?usp=sharing>.)

Question 1 (Fairshare)

The Knot cluster doesn't seem to be using fairshare when setting 
priorities; e.g., 'sprio -a' only shows the age factor contributing to 
a job's priority:

[root@node1 ~]# sprio -j 4472
           JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE
            4472 batch         265691          0     265691          0
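
(My rough understanding - and I may well be misreading the man pages - 
is that you can ask the controller what weights it actually thinks it 
has, and whether any fairshare usage data is being recorded, with 
something like:

    sprio -w                                  # configured weight of each priority factor
    scontrol show config | grep -i '^Priority'  # PriorityType, PriorityWeight*, decay settings
    sshare -a                                 # per-account/user fairshare usage (needs accounting)

but I'd welcome better diagnostics if people have them.)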

I certainly don't claim to understand Slurm very well yet, and I think 
we had some dumb luck (and smart help) in getting Pod configured, but 
this rebuilt Knot cluster seems like a pretty vanilla case, so it's 
hard to understand why it's not behaving.  Any hints or recipes for 
setting up a simple configuration like this, or has anyone run into 
this before?
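
For context, here's a rough sketch of what I *think* the 
priority-related pieces of slurm.conf should look like for a 
fairshare-dominated setup.  The weight values below are just 
illustrative (not copied from our actual configs), and my understanding 
is that fairshare also needs accounting running through slurmdbd so 
there is usage data to compare against:

    # sketch only - illustrative values
    PriorityType=priority/multifactor
    PriorityDecayHalfLife=14-0           # how quickly past usage decays
    PriorityWeightFairshare=100000       # make fairshare dominate
    PriorityWeightAge=1000
    PriorityWeightPartition=0
    PriorityWeightQOS=0
    PriorityWeightJobSize=0

    # fairshare needs usage history, which (as I understand it) means:
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=<your slurmdbd host>

If that's roughly right, a FAIRSHARE column stuck at 0 would suggest 
either the fairshare weight is 0 or no usage data is being collected - 
but corrections welcome!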

Question 2 (cgroups)

Cgroups work fine on our newer cluster (Pod) to constrain people to the 
cores they asked for (e.g., ask for 4 cores and then try to use 8, and 
your 8 processes will be restricted to 4 cores).  On the reconfigured 
older cluster (Knot), though, if cgroups are enabled (via ProctrackType 
and TaskPlugin), jobs are always constrained to 1 core: ask for 4 and 
you get only 1 core - even though 'squeue' shows you having 4 - with 
each process getting about 25% of the CPU time.  If we turn cgroups 
off, everything works normally, but then a user can overrun their 
allocation (e.g., ask for 4 cores and use 12).  This isn't a big 
issue - Torque worked that way for us, most users behave, and we have 
scripts that patrol for it - but again, it's annoying that it works on 
the one cluster and not the older, simpler one!  It's not urgent, but 
I'm wondering whether others have a set of known cgroup 'gotchas'.
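
For comparison, my reading of the docs is that constraining a job to 
exactly the cores it requested wants something along these lines 
(again, just a sketch of how I understand the man pages, not a copy of 
our actual files):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup   # affinity for binding, cgroup for containment

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes        # limit the job to its allocated cores
    ConstrainRAMSpace=yes     # (optional) also fence off memory

so one thing I wonder about is whether the task/affinity piece behaves 
differently on the older OS - but that's a guess.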

Thanks all, Paul



-- 
California NanoSystems Institute
Center for Scientific Computing
Elings Hall, Room 3231
University of California
Santa Barbara, CA 93106-6105
http://www.cnsi.ucsb.edu     http://csc.cnsi.ucsb.edu
(805)-893-4205
