A high-performance computer (HPC system) is a tool used by computational scientists and engineers to tackle problems that require more computing resources or time than they can obtain on the personal computers available to them.
* Computers connected by some type of network (Ethernet, InfiniBand, etc.).
* Each of these computers is often referred to as a node.
* There are several different types of nodes, specialized for different purposes.
* Head nodes (front-end and/or login): where you log in to interact with the HPC system.
* Compute nodes (CPU, GPU): where the real computing is done. Access to these resources is controlled by a scheduler or batch system.
In order to share these large systems among many users, it is common to allocate subsets of the compute nodes to tasks (or jobs), based on requests from users. These jobs may take a long time to complete, so they come and go in time. To manage the sharing of the compute nodes among all of the jobs, HPC systems use a batch system or scheduler.
The batch system usually has commands for submitting jobs, inquiring about their status, and modifying them. The HPC center defines the priorities of different jobs for execution on the compute nodes, while ensuring that the compute nodes are not overloaded.
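With Slurm, the scheduler installed later in this guide, those operations map to commands like the following. This is only an illustration: the script name and the job ID 123 are placeholders, and the commands require a working Slurm installation.

```shell
sbatch myjob.sh                                # submit a batch job script
squeue -u $USER                                # check the status of your jobs
scontrol update JobId=123 TimeLimit=02:00:00   # modify a queued or running job
scancel 123                                    # cancel a job
```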
A typical HPC workflow could look something like this:
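As an illustrative sketch, a minimal batch job script might look like this; the job name, resource requests, and the program `./my_program` are assumptions chosen for the example.

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # name shown in the queue
#SBATCH --nodes=1               # number of compute nodes
#SBATCH --ntasks=4              # number of tasks (e.g. MPI ranks)
#SBATCH --time=01:00:00         # wall-clock time limit
#SBATCH --output=myjob-%j.out   # %j expands to the job ID

srun ./my_program               # launch the program on the allocation
```

You would submit this with `sbatch myjob.sh`; the scheduler queues the job until the requested resources are free, then writes its output to `myjob-<jobid>.out`.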
On all nodes, before you install Slurm or Munge, you need to create the users and groups using the same UID and GID:
export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
On every node we need to install a few dependencies:
yum install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad perl-ExtUtils-MakeMaker gcc -y
On every node we also need to get the latest EPEL repository:
yum install epel-release
yum update
For CentOS 8, we need to edit the file /etc/yum.repos.d/CentOS-PowerTools.repo and enable the repository.
Change enabled=0 to enabled=1.
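Alternatively, the same repository can be enabled from the command line. Note that the repository id is PowerTools on early CentOS 8 releases and powertools from 8.3 onward; verify it with `dnf repolist all`.

```shell
dnf config-manager --set-enabled PowerTools
```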
Update the repository database:
yum update
After that, we can install Munge:
yum install munge munge-libs munge-devel -y
On the master server, we need to create the Munge key and copy it to all the other servers. First, install rng-tools to properly seed the random number generator:
yum install rng-tools -y
rngd -r /dev/urandom
Create the Munge key:
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
Copy the Munge key to the other servers:
scp /etc/munge/munge.key root@node01:/etc/munge
scp /etc/munge/munge.key root@node02:/etc/munge
.
.
.
scp /etc/munge/munge.key root@nodeN:/etc/munge
On every node we need to correct the permissions, then enable and start the Munge service:
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
systemctl enable munge
systemctl start munge
To test Munge, we can try to access another node with Munge from our server node.
munge -n
munge -n | unmunge
munge -n | ssh node01.cluster.test unmunge
remunge
On the server, download the latest version of Slurm. At the time of writing, the latest version is 19.05.5.
cd /tmp
wget https://download.schedmd.com/slurm/slurm-19.05.5.tar.bz2
yum install rpm-build
rpmbuild -ta slurm-19.05.5.tar.bz2
Copy the Slurm RPM files from the master to the other servers, or to a shared folder, for installation:
cd ~/rpmbuild/RPMS/x86_64
cp slurm*.rpm /fns/shared_folder
The slurm-torque package could perhaps be omitted, but it does contain a useful /usr/bin/mpiexec wrapper script.
Before installing Slurm, we need to disable SELinux:
nano /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled.
cd ~/rpmbuild/RPMS/x86_64
export VER=19.05.5-1
yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-perlapi-$VER*rpm slurm-torque-$VER*rpm slurm-example-configs-$VER*rpm
Explicitly enable the service on the master:
systemctl enable slurmctld
Only if the database service will run on the master node, install the database service RPM:
cd ~/rpmbuild/RPMS/x86_64
export VER=19.05.5-1
yum install slurm-slurmdbd-$VER*rpm
If you have a dedicated database server, install on that server:
export VER=19.05.5-1
yum install slurm-$VER*rpm slurm-devel-$VER*rpm slurm-slurmdbd-$VER*rpm
Explicitly enable the service:
systemctl enable slurmdbd
On compute nodes, you may additionally install the slurm-pam_slurm RPM package to prevent rogue users from logging in:
export VER=19.05.5-1
yum install slurm-pam_slurm-$VER*rpm
systemctl enable slurmd
Study the configuration information in the Quick Start Administrator Guide.
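As a hedged starting point, a minimal /etc/slurm/slurm.conf could look like the sketch below. The cluster name, hostnames, CPU counts, and partition name are all assumptions that must be adapted to your hardware; the slurm-example-configs package and the online configurator generate a fuller template. The same file must be identical on the master and on all compute nodes.

```ini
# /etc/slurm/slurm.conf -- minimal sketch; every value here is an assumption
ClusterName=cluster
SlurmctldHost=master
AuthType=auth/munge
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# Compute nodes and a default partition
NodeName=node[01-02] CPUs=4 State=UNKNOWN
PartitionName=normal Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
```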