<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi All,</p>
<p>I know a number of the UC campuses are converting to Slurm, and
here at UCSB we are too (after many years of being happily
Torqued). We have Slurm running on a newer cluster (Pod), and
still have a lot to learn about fine-tuning it, but it mostly
works how we want (in part because of help from some of you at
the other UCs!). However, we've run into an issue with what
should be a pretty simple configuration on another cluster, so
I'll restart this list by throwing our questions out there! And
if people have recipes, notes, etc. for building new Slurm
clusters, maybe we can assemble a list of them (we can decide
whether it should be public or not)?<br>
</p>
<p>We're updating an older cluster (Knot) from CentOS 6 to 7 and
switching it to Slurm. On this particular cluster all users are
the same and have equal usage rights, so we want to base the
priorities mostly on fairshare. This is simpler than our one
system that is already running Slurm (Pod), or our
yet-to-be-configured new condo cluster. Should be simple,
right? (Note: I've dropped the .conf files for both clusters into
<a href="https://drive.google.com/drive/folders/1S1zuEKVK28UQ5SCk5dn0sIHlPhZKeB4j?usp=sharing">this Google Drive folder</a>.)<br>
</p>
<p>Question 1 (Fairshare)<br>
</p>
<p>The Knot cluster doesn't seem to be using fairshare when setting
priorities; e.g., 'sprio -a' shows only the age factor contributing
to a job's priority:<br>
</p>
<pre>
[root@node1 ~]# sprio -j 4472
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE
           4472 batch         265691          0     265691          0
</pre>
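<p>(For what it's worth, these are the sorts of checks I've been poking
at to see whether accounting and associations are actually in place.
The commands below are generic examples, not output from Knot:)</p>
<pre>
# Is the multifactor priority plugin actually active, and what weights is it using?
scontrol show config | grep -i ^Priority

# Do associations exist for the users?  Fairshare is computed from these,
# so if slurmdbd/associations aren't set up the factor stays at 0.
sacctmgr show associations format=Cluster,Account,User,Fairshare

# What does the fairshare tree look like?
sshare -a
</pre>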
<p>I certainly don't claim to understand Slurm very well yet, and I
think we had some dumb luck (and smart help) in getting Pod
configured, but this rebuilt Knot cluster seems like a pretty
vanilla case, so it's hard to understand why it isn't behaving.
Any hints or recipes for setting up a simple configuration like
this, or has anyone run into this before?<br>
</p>
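<p>For reference, this is roughly the set of slurm.conf knobs I
understand the fairshare priority factor to depend on. The weights
below are made-up placeholders for illustration, not our actual
values (the real files are in the linked Drive folder):</p>
<pre>
# slurm.conf (illustrative excerpt, not our production values)
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000

# Fairshare also needs the accounting database and associations;
# without them the fairshare factor stays at 0.
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations
</pre>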
<p>Question 2 (cgroups)</p>
<p>cgroups works fine on our newer cluster (Pod) to constrain people
to the cores they asked for (e.g., ask for 4 cores and then try to
use 8, and your 8 processes will be restricted to those 4 cores).
On the reconfigured older cluster (Knot), if cgroup is enabled (via
ProctrackType and TaskPlugin), it always constrains the job to 1
core (ask for 4 and you get only 1 core - although 'squeue' shows
you holding 4 - with each process getting about 25% of the CPU time).
Turn it off and everything works normally, but then a user can
overrun their allocation (e.g., ask for 4 and use 12). This isn't a
big issue, since Torque worked that way for us and most users behave
- and we have scripts that patrol for it - but again, it's annoying
that this works on the one cluster but not on the older, simpler
one! It isn't urgent, but I'm wondering if others have a set of
known 'gotchas' with cgroups.<br>
</p>
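<p>In case it helps the discussion, here's roughly the combination of
settings I believe is the usual recommendation for core confinement.
It's an illustrative sketch, not a claim about what's actually in our
files (those are in the Drive folder):</p>
<pre>
# slurm.conf (illustrative excerpt)
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf (illustrative excerpt)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
</pre>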
<p>Thanks all, Paul<br>
</p>
<pre class="moz-signature" cols="72">--
California NanoSystems Institute
Center for Scientific Computing
Elings Hall, Room 3231
University of California
Santa Barbara, CA 93106-6105
<a class="moz-txt-link-freetext" href="http://www.cnsi.ucsb.edu">http://www.cnsi.ucsb.edu</a> <a class="moz-txt-link-freetext" href="http://csc.cnsi.ucsb.edu">http://csc.cnsi.ucsb.edu</a>
(805)-893-4205</pre>
</body>
</html>