<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi All,</p>
<p>I know a number of the UC campuses are converting to Slurm, and
here at UCSB we are too (after many years of being happily
Torqued). We have Slurm running on a newer cluster (Pod), and
still have a lot to learn about fine-tuning it, but it mostly
works how we want (in part because of help from some of you at
the other UCs!). However, we've run into an issue with what
should be a pretty simple configuration on another cluster, so
I'll restart this list by throwing our questions out there! And
if people have recipes, notes, etc. for building new Slurm
clusters, maybe we can assemble a list of them (we can decide
whether it should be public or not)?<br>
</p>
<p>We're updating an older cluster (Knot) from CentOS 6 to 7 and
switching it to Slurm. On this particular cluster all users are
the same and have equal usage rights, so we want to base the
priorities mostly on fairshare. This is simpler than our one
system that is already running Slurm (Pod), or our
yet-to-be-configured new condo cluster. Should be simple,
right? (Note: I've dropped the .conf files for both clusters into
<a href="https://drive.google.com/drive/folders/1S1zuEKVK28UQ5SCk5dn0sIHlPhZKeB4j?usp=sharing">this Google Drive folder</a>.)<br>
</p>
<p>Question 1 (Fairshare)<br>
</p>
<p>The Knot cluster doesn't seem to be using fairshare when setting
priorities; e.g., 'sprio -a' shows only the age factor contributing
to a job's priority:<br>
</p>
<pre>
[root@node1 ~]# sprio -j 4472
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE
           4472 batch         265691          0     265691          0
</pre>
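<p>(For what it's worth, these are the sorts of checks I've been poking
at to see whether accounting and associations are actually in place.
The commands below are generic examples, not output from Knot:)</p>
<pre>
# Is the multifactor priority plugin actually active, and what weights is it using?
scontrol show config | grep -i ^Priority

# Do associations exist for the users?  Fairshare is computed from these,
# so if slurmdbd/associations aren't set up the factor stays at 0.
sacctmgr show associations format=Cluster,Account,User,Fairshare

# What does the fairshare tree look like?
sshare -a
</pre>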
<p>I certainly don't claim to understand Slurm very well yet, and I
think we had some dumb luck (and smart help) in getting Pod
configured, but this rebuilt Knot cluster seems like a pretty
vanilla case, so it's hard to understand why it isn't behaving.
Any hints or recipes for setting up a simple configuration like
this, or has anyone run into this before?<br>
</p>
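<p>For reference, this is roughly the set of slurm.conf knobs I
understand the fairshare priority factor to depend on. The weights
below are made-up placeholders for illustration, not our actual
values (the real files are in the linked Drive folder):</p>
<pre>
# slurm.conf (illustrative excerpt, not our production values)
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000

# Fairshare also needs the accounting database and associations;
# without them the fairshare factor stays at 0.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations
</pre>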
<p>Question 2 (cgroups)</p>
<p>cgroups works fine on our newer cluster (Pod) to constrain people
to the cores they asked for (e.g., ask for 4 cores and then try to
use 8, and your 8 processes will be restricted to those 4 cores).
On the reconfigured older cluster (Knot), if cgroup is enabled (via
ProctrackType and TaskPlugin), it always constrains the job to 1
core (ask for 4 and you get only 1 core - although 'squeue' shows
you holding 4 - with each process getting about 25% of the CPU time).
Turn it off and everything works normally, but then a user can
overrun their allocation (e.g., ask for 4 and use 12). This isn't a
big issue, since Torque worked that way for us and most users behave
- and we have scripts that patrol for it - but again, it's annoying
that this works on the one cluster but not on the older, simpler
one! It isn't urgent, but I'm wondering if others have a set of
known 'gotchas' with cgroups.<br>
</p>
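<p>In case it helps the discussion, here's roughly the combination of
settings I believe is the usual recommendation for core confinement.
It's an illustrative sketch, not a claim about what's actually in our
files (those are in the Drive folder):</p>
<pre>
# slurm.conf (illustrative excerpt)
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf (illustrative excerpt)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>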
<p>Thanks all, Paul<br>
</p>
<pre class="moz-signature" cols="72">--
California NanoSystems Institute
Center for Scientific Computing
Elings Hall, Room 3231
University of California
Santa Barbara, CA 93106-6105
<a class="moz-txt-link-freetext" href="http://www.cnsi.ucsb.edu">http://www.cnsi.ucsb.edu</a> <a class="moz-txt-link-freetext" href="http://csc.cnsi.ucsb.edu">http://csc.cnsi.ucsb.edu</a>
(805)-893-4205</pre>
</body>
</html>