[Uc-slurm] Fairshare/cgroups
Paul Weakliem
weakliem at cnsi.ucsb.edu
Wed Feb 24 10:59:22 PST 2021
Hi All,
I know a number of the UC campuses are converting to Slurm, and here at
UCSB we are too (after many years of being happily Torqued). We have
Slurm running on a newer cluster (Pod), and still have a lot to learn
about fine-tuning it, but it mostly works how we want (in part because
of help from some of you at the other UCs!). However, we've run into
an issue with what should be a pretty simple configuration on another
cluster, so I'll restart this list by throwing our questions out
there! And if people have recipes, notes, etc. for building new Slurm
clusters, maybe we can assemble a list of them (we can decide whether it
should be public or not)?
We're updating an older cluster (knot) from CentOS 6 to 7, and switching
it to Slurm. In this particular cluster, all users are the same, and
have equal usage rights, so we want to base the priorities mostly on
fairshare. This is simpler than our one system that is already running
slurm (pod), or our yet-to-be-configured new condo cluster. Should be
simple, right? (Note: I've dropped the .conf files for both clusters
into this Google Drive folder)
<https://drive.google.com/drive/folders/1S1zuEKVK28UQ5SCk5dn0sIHlPhZKeB4j?usp=sharing>.
Question 1 (Fairshare)
The knot cluster doesn't seem to be using fairshare when setting
priorities; e.g. 'sprio -a' only shows the job's age contributing:
[root at node1 ~]# sprio -j 4472
  JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE
   4472 batch         265691          0     265691          0
I certainly don't claim to understand Slurm very well yet, and I think
we had some dumb luck (and smart help) in getting Pod configured, but
this rebuilt Knot cluster seems like a pretty vanilla case, so it's hard
to understand why it's not behaving. Any hints/recipes for setting up a
simple configuration like this, or has anyone run into this before?
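For reference, here's a minimal sketch of the slurm.conf knobs I understand
fairshare depends on (the parameter names are standard Slurm options, but the
weight values below are just placeholders, not what we actually run):

# Fairshare only contributes if the multifactor priority plugin is active
PriorityType=priority/multifactor
# A zero weight here would explain a 0 FAIRSHARE column in sprio
PriorityWeightFairshare=100000
PriorityWeightAge=1000
# Fairshare also needs usage history, so accounting has to be collecting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost

Running 'sshare -a' should also show whether per-user usage is actually being
recorded at all.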
Question 2 (cgroups)
cgroups works fine on our newer cluster (Pod) to constrain people to the
cores they asked for (e.g. if you ask for 4 cores and then attempt to
use 8, it will restrict your 8 processes to 4 cores). On the
reconfigured older cluster (knot), if cgroup is enabled (via
ProctrackType and TaskPlugin), it will always constrain the job to 1
core (e.g. if you ask for 4, you get only 1 core - although 'squeue'
shows you having 4 - and each process gets 25% of the CPU time). With it
turned off, everything works normally, but the user can then overrun
their allocation (e.g. ask for 4 and use 12). This isn't a big issue, as
Torque worked that way for us, and most users behave - and we have
scripts that patrol for it - but again, it's annoying that it works on
the one cluster but not the older, simpler one! It isn't urgent, but I'm
wondering if others have a set of known 'gotchas' with cgroups.
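Just as a sketch of the knobs involved (not a known fix), the combination
we'd expect for core confinement, with slurm.conf and cgroup.conf needing to
agree, is roughly:

slurm.conf:
ProctrackType=proctrack/cgroup
# task/affinity alongside task/cgroup is the commonly recommended pairing
TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core

cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes

One guess: if the node definitions in slurm.conf
(Sockets/CoresPerSocket/ThreadsPerCore) don't match what CentOS 7 reports for
the hardware, the cpuset cgroup can end up smaller than the allocation, which
would look a lot like everything being squeezed onto one core.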
Thanks all, Paul
--
California NanoSystems Institute
Center for Scientific Computing
Elings Hall, Room 3231
University of California
Santa Barbara, CA 93106-6105
http://www.cnsi.ucsb.edu http://csc.cnsi.ucsb.edu
(805)-893-4205