
PRACE User Support

1,847 bytes removed, 29 October 2019, 15:56 (minor edit; section: Acknowledgement in publications)

== User Guide to obtain a digital certificate ==

<code>

gsissh -p 2222 prace-login.sc.niif.hu

</code>

<code>

globus-url-copy file://task/myfile.c gsiftp://prace-login.sc.niif.hu/home/prace/pr1hrocz/myfile.c

</code>

* <code>-stripe</code> Use this parameter to initiate a “striped” GridFTP transfer that uses more than one node at the source and destination. Because multiple nodes contribute to the transfer, each using its own network interface, more of the network bandwidth can be used than with a single system. For large files (> 100 MB), striping can therefore improve performance considerably.

== Usage of the SLURM scheduler ==

Website: http://slurm.schedmd.com

Basically, SGE is a scheduler that divides the resources (computers) into resource partitions called queues. A queue cannot be larger than a physical resource; it cannot expand its borders. SGE maintains a waiting list for the resources it manages, to which the submitted computing tasks are directed. The scheduler searches for the resource defined by the task description and starts the task there. The task-resource coupling depends on the capabilities of the resources and the parameters of the tasks. If the resources are overloaded, the tasks have to wait until the requested processors and memory become available.

In the host listing, the first two columns give the names and types of the computers in the cluster. The NCPU column shows the number of available processor cores. LOAD shows the current load of the computer (the same value that the <code>uptime</code> UNIX command reports). The remaining columns are the total physical memory, the memory currently in use, the available swap space, and the swap in use. The global line summarizes this information for the whole cluster.

The queue status listing shows which jobs run in which queues, together with detailed information about the queues themselves (state, environment). The command can also be used without the <code>-f</code> switch, but it is then less informative, because only the states of the jobs appear.

Scheduling on the HPCs is based on CPU hours: the available core hours are divided between users on a monthly basis. Every UNIX user is connected to one or more accounts. A scheduler account is connected to an HPC project and a UNIX group. HPC jobs can only be submitted through one of these accounts. Core hours are calculated by multiplying the wall time (the time spent running the job) by the number of CPU cores requested. For example, reserving 2 nodes (48 CPU cores in total) at the NIIFI SC for 30 minutes gives 48 * 30 = 1440 core minutes = 24 core hours. Core hours are measured between the start and the end of the job.
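The core-hour arithmetic can be sketched in a few lines of shell. This is only an illustration: the function name <code>core_hours</code> is made up here, the 24 cores per node are inferred from the example (2 nodes = 48 cores), and rounding partial hours up is an assumption, not documented scheduler behaviour.

```shell
#!/bin/sh
# core_hours NODES CORES_PER_NODE MINUTES
# Core hours = nodes * cores per node * wall time in minutes / 60.
core_hours() {
    core_minutes=$(( $1 * $2 * $3 ))
    # Integer arithmetic; a partial hour is rounded up (assumption).
    echo $(( (core_minutes + 59) / 60 ))
}

# 2 nodes with 24 cores each (assumed), reserved for 30 minutes
core_hours 2 24 30
```

For the figures in the text this gives 1440 core minutes, that is 24 core hours.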

'''It is very important to make sure that the application makes full use of the allocated resources. An idle or non-optimal job consumes the allocated core time very fast. If the account runs out of allocated time, no new jobs can be submitted until the beginning of the next accounting period. Account limits are reset at the beginning of each month.'''

Information about an account can be listed with the following command:

<code>

sbalance

</code>

==== Example ====

After executing the command, the following table is shown for Bob. He can access and run jobs through two different accounts (foobar, barfoo). His name is marked with * in the table. He shares both accounts with alice (Account column). The core hours consumed by each user are displayed in the second column (Usage), and the total consumption of the jobs run under the account is displayed in the fourth column. The last two columns show the maximum allocated time (Account Limit) and the time still available (Available).

<pre>
Scheduler Account Balance
---------- ----------- + ---------------- ----------- + ------------- -----------
User             Usage |          Account       Usage | Account Limit   Available (CPU hrs)
---------- ----------- + ---------------- ----------- + ------------- -----------
alice                0 |           foobar           0 |             0           0
bob *                0 |           foobar           0 |             0           0
bob *                7 |           barfoo           7 |         1,000         993
alice                0 |           barfoo           7 |         1,000         993
</pre>

Each task is given an identifier, which is a number (a job ID, or j_id). It is followed by the job's priority (0 in both cases), the job's name, and the user who submitted the job; the qw mark shows that the job is waiting in the queue. The last field is the date when the job entered the waiting queue.

=== Estimating core time ===

Before production runs, it is advised to obtain a core time estimate. The following command can be used to get an estimate:

<code>

sestimate -N NODES -t WALLTIME

</code>

where <code>NODES</code> is the number of nodes to be reserved and <code>WALLTIME</code> is the maximum time the job may run.

'''It is important to specify the core time to be reserved as precisely as possible, because the scheduler orders the jobs in the queue based on this value. Generally, a job with a shorter core time runs sooner. After completion, it is advised to check the time the job actually used with the <code>sacct</code> command.'''

==== Example ====

Alice wants to reserve 2 nodes for 2 days and 10 hours, so she checks whether she has enough time on her account.

<pre>
sestimate -N 2 -t 2-10:00:00

Estimated CPU hours: 2784
</pre>

Unfortunately, she cannot afford to run this job.
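The estimate in this example can be checked by hand. The sketch below assumes 24 cores per node (as in the other examples on this page) and uses the 993 available hours of the barfoo account from the <code>sbalance</code> example purely as an illustrative balance; it does not call <code>sestimate</code> itself.

```shell
#!/bin/sh
nodes=2
cores_per_node=24          # assumption: 24 cores per node
days=2
hours=10

wall_hours=$(( days * 24 + hours ))               # 2-10:00:00 is 58 hours
estimate=$(( nodes * cores_per_node * wall_hours ))
echo "Estimated CPU hours: $estimate"

available=993              # illustrative balance, not read from the scheduler
if [ "$estimate" -gt "$available" ]; then
    echo "Not enough core hours: $estimate needed, $available available"
fi
```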

=== Status information ===

Jobs in the queue can be listed with the <code>squeue</code> command, and the status of the cluster can be retrieved with the <code>sinfo</code> command. Every submitted job gets a JOBID. The properties of a job can be retrieved by using this ID. Status of a running or waiting job:

<code>

scontrol show job JOBID

</code>

All jobs are recorded in an accounting database. The properties of completed jobs can be retrieved from this database. Detailed statistics can be viewed with this command:

<code>

sacct -l -j JOBID

</code>

The memory used can be retrieved with:

<code>

smemory JOBID

</code>

Disk usage can be retrieved with this command:

<code>

sdisk JOBID

</code>

==== Example ====

There are 3 jobs in the queue. The first is an array job which is waiting for resources (PENDING). The second is an MPI job that has been running on 4 nodes for 25 minutes. The third is an OMP run on one node that has just started. The NAME of a job can be chosen freely; it is advised to use short, informative names.

<pre>
squeue -l
Wed Oct 16 08:30:07 2013
     JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
591_[1-96]    normal    array    alice  PENDING       0:00     30:00      1 (None)
       589    normal      mpi      bob  RUNNING      25:55   2:00:00      4 cn[05-08]
       590    normal      omp    alice  RUNNING       0:25   1:00:00      1 cn09
</pre>

This two-node batch job typically used 10 GB of virtual memory and 6.5 GB of RSS memory per node.

<pre>
smemory 430

 MaxVMSize  MaxVMSizeNode  AveVMSize     MaxRSS MaxRSSNode     AveRSS
---------- -------------- ---------- ---------- ---------- ----------
 10271792K           cn06  10271792K   6544524K       cn06   6544524K
 10085152K           cn07  10085152K   6538492K       cn07   6534876K
</pre>

==== Checking jobs ====

It is important to make sure that the application fully uses the reserved core time. A running application can be monitored with the following command:

<code>

sjobcheck JOBID

</code>

===== Example =====

This job runs on 4 nodes. The LOAD group provides information about the general load of the machine; this is more or less equal to the number of cores. The CPU group gives information about the exact usage. Ideally, the values of the <code>User</code> column are over 90. If the value is below that, there is a problem with the application, or it is not optimal, and the run should be ended. This example job is fully using ("maxing out") the available resources.

<pre>
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]
cn08    24 (   25/  529) [ 24.83, 24.84, 20.98] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
cn07    24 (   25/  529) [ 24.93, 24.88, 20.98] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
cn06    24 (   25/  529) [ 25.00, 24.90, 20.97] [  99.9,   0.0,   0.1,   0.0,   0.0] OFF
cn05    24 (   25/  544) [ 25.11, 24.96, 20.97] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
</pre>
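The "User over 90" rule can be checked automatically. The following is a rough sketch that assumes the output layout shown in the example (two bracketed groups per node line, the second starting with the User percentage); the function name <code>low_usage_nodes</code> and the <code>cn99</code> sample line are made up for illustration.

```shell
#!/bin/sh
# Reads sjobcheck-style lines on stdin and prints the nodes whose
# User CPU percentage is below the threshold (default 90).
low_usage_nodes() {
    awk -F'[][]' -v min="${1:-90}" '
        NF >= 4 {
            split($4, cpu, ",")                  # cpu[1] is the User percentage
            if (cpu[1] ~ /^ *[0-9.]+ *$/ && cpu[1] + 0 < min) {
                split($1, host, " ")             # host[1] is the node name
                print host[1]
            }
        }'
}

# Sample input: cn08 is taken from the example, cn99 is an invented idle node.
low_usage_nodes 90 <<'EOF'
cn08    24 (   25/  529) [ 24.83, 24.84, 20.98] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
cn99    24 (    3/  529) [  1.20,  1.10,  1.00] [  42.0,   0.0,   0.5,  57.5,   0.0] OFF
EOF
```

In practice the input would come from <code>sjobcheck JOBID</code>; header lines are skipped because their User field is not numeric.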

==== Checking licenses ====

The used and available licenses can be retrieved with this command:

<code>

slicenses

</code>

The priority of the tasks only determines their order in the list of pending jobs. The scheduler analyzes the tasks in this order. Because scheduling also requires resources to be reserved, the tasks are not guaranteed to run in exactly the same order.

==== Checking downtime ====

In downtime periods, the scheduler does not start new jobs, but jobs can still be submitted. The periods can be retrieved with the following command:

<code>

sreservations

</code>

=== Running jobs ===

Applications can be run on the HPC in batch mode. This means that every run must have a job script containing the resources and commands needed. The parameters of the scheduler (resource definitions) can be given with the <code>#SBATCH</code> directive. A comparison of the schedulers and of the directives available in Slurm can be found in this [http://slurm.schedmd.com/rosetta.pdf table].

==== Obligatory parameters ====

The following parameters are obligatory:

<pre>
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --job-name=NAME
#SBATCH --time=TIME
</pre>

where <code>ACCOUNT</code> is the name of the account to use (available accounts can be retrieved with the <code>sbalance</code> command), <code>NAME</code> is the short name of the job, and <code>TIME</code> is the maximum wall time in <code>DD-HH:MM:SS</code> syntax. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
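To make the six accepted time formats concrete, the sketch below converts a time limit string into seconds. It is not part of Slurm; the helper names <code>dec</code> and <code>slurm_time_to_seconds</code> are invented for this illustration, and the input is assumed to be well formed.

```shell
#!/bin/sh
# Strip leading zeros so that values such as "08" are not read as octal.
dec() {
    n=$(printf '%s' "$1" | sed 's/^0*//')
    echo "${n:-0}"
}

# slurm_time_to_seconds SPEC
slurm_time_to_seconds() {
    spec=$1; d=0; h=0; m=0; s=0
    case $spec in
        *-*)                                  # days-hours[:minutes[:seconds]]
            d=${spec%%-*}; rest=${spec#*-}
            case $rest in
                *:*:*) h=${rest%%:*}; rest=${rest#*:}; m=${rest%%:*}; s=${rest#*:} ;;
                *:*)   h=${rest%%:*}; m=${rest#*:} ;;
                *)     h=$rest ;;
            esac ;;
        *:*:*)                                # hours:minutes:seconds
            h=${spec%%:*}; rest=${spec#*:}; m=${rest%%:*}; s=${rest#*:} ;;
        *:*)                                  # minutes:seconds
            m=${spec%%:*}; s=${spec#*:} ;;
        *)                                    # minutes
            m=$spec ;;
    esac
    echo $(( $(dec "$d") * 86400 + $(dec "$h") * 3600 + $(dec "$m") * 60 + $(dec "$s") ))
}

slurm_time_to_seconds 2-10:00:00
```

Note that a bare number is interpreted as minutes, so <code>--time=30</code> requests half an hour.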

The following command submits a job:

<code>

sbatch jobscript.sh

</code>

If the submission was successful, the following is printed:

<pre>
Submitted batch job JOBID
</pre>

where <code>JOBID</code> is the unique ID of the job.
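In scripts it is often handy to capture the JOBID for later use with commands such as <code>scancel</code> or <code>sacct</code>. A minimal sketch, using a sample output string instead of a real submission (the job ID 12345 is made up):

```shell
#!/bin/sh
# In a real session the text would come from: output=$(sbatch jobscript.sh)
output="Submitted batch job 12345"   # sample output with a made-up job ID

jobid=${output##* }                  # keep everything after the last space
echo "$jobid"
```

Slurm's <code>sbatch</code> also has a <code>--parsable</code> option that prints the job ID without the surrounding text; whether it is usable on a given site depends on the installed Slurm version.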

The following command cancels the job:

<code>

scancel JOBID

</code>

==== Job queues ====

There are two separate queues (partitions) available on the HPC: the <code>test</code> queue and the <code>prod</code> queue. The latter is for production runs, the former is for testing purposes. In the test queue, 1 node can be allocated for a maximum of half an hour. The default queue is <code>prod</code>. The test partition can be chosen with the following directive:

<pre>
#SBATCH --partition=test
</pre>

==== Quality of Service (QoS) ====

There is an option for submitting low-priority jobs. These jobs can be interrupted by any normal-priority job at any time, but only half of the time is billed to the account. Interrupted jobs are automatically queued again. Therefore, it is important to run only jobs that can be interrupted at any time, periodically save their state (checkpoint), and can restart quickly. The default QoS is <code>normal</code>, which is non-interruptible.

The following directive chooses low priority:

<pre>

#SBATCH --qos=lowpri

</pre>

==== Memory settings ====

By default, 1000 MB of memory is allocated per CPU core; more can be allocated with the following directive:

<pre>
#SBATCH --mem-per-cpu=MEMORY
</pre>

where <code>MEMORY</code> is given in MB. The maximum memory per core at NIIFI SC is 2600 MB.
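When job scripts are generated automatically, the per-core limit can be checked before submission. The helper below is a hypothetical sketch (the name <code>mem_directive</code> is invented); it only prints the directive and does not talk to the scheduler.

```shell
#!/bin/sh
MAX_MEM_PER_CPU=2600   # MB, the NIIFI SC limit quoted in the text

# mem_directive MB: print the directive, or fail if the request is too large.
mem_directive() {
    if [ "$1" -gt "$MAX_MEM_PER_CPU" ]; then
        echo "requested $1 MB exceeds the $MAX_MEM_PER_CPU MB per-core limit" >&2
        return 1
    fi
    echo "#SBATCH --mem-per-cpu=$1"
}

mem_directive 2000
```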

==== Email notification ====

To send mail when the status of the job changes (start, stop, error):

<pre>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=EMAIL
</pre>

where <code>EMAIL</code> is the e-mail address to notify.

==== Array jobs ====

Array jobs are needed when multiple single-threaded (serial) jobs are to be submitted (with different data). Slurm stores the unique ID of each instance in the <code>SLURM_ARRAY_TASK_ID</code> environment variable. The instances of the array job can be told apart by reading this ID. The output of each instance is written to a <code>slurm-SLURM_ARRAY_JOB_ID-SLURM_ARRAY_TASK_ID.out</code> file. The scheduler packs the instances tightly. It is useful to use multiple threads per CPU core. [http://slurm.schedmd.com/job_array.html More on this topic]

===== Example =====

User Alice submits 96 serial jobs with a maximum run time of 24 hours, charged to the 'foobar' account. The <code>#SBATCH --array=1-96</code> directive indicates that it is an array job. The application can be run with the <code>srun</code> command. In this example it is a shell script.

<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --time=24:00:00
#SBATCH --job-name=array
#SBATCH --array=1-96
srun envtest.sh
</pre>
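The page does not show the contents of <code>envtest.sh</code>. A possible sketch of how such a script could use <code>SLURM_ARRAY_TASK_ID</code> to select its own input is given below; the <code>input_NNN.dat</code> naming scheme and the function name are purely hypothetical.

```shell
#!/bin/sh
# Map an array task ID to an input file name (hypothetical naming scheme).
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Outside Slurm the variable is unset, so default to 1 for local testing.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "instance $task_id processes $(input_for_task "$task_id")"
```

With <code>--array=1-96</code>, the 96 instances would then process <code>input_001.dat</code> to <code>input_096.dat</code>.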

==== MPI jobs ====

For MPI jobs, the number of MPI processes running on a node has to be given (<code>#SBATCH --ntasks-per-node=</code>). In the most frequent case this is the number of CPU cores. Parallel programs should be started with the <code>mpirun</code> command.

===== Example =====

User Bob allocates 2 nodes for 12 hours for an MPI job, billing the 'barfoo' account. 24 MPI processes are started on each node. The stdout output is redirected to the <code>slurm.out</code> file (<code>#SBATCH -o</code>).

<pre>
#!/bin/bash
#SBATCH -A barfoo
#SBATCH --job-name=mpi
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH -o slurm.out
mpirun ./a.out
</pre>

==== CPU binding ====

Generally, the performance of an MPI application can be optimized with CPU core binding. In this case, the threads of the parallel program are not moved between CPU cores by the OS, and memory locality can be improved (fewer cache misses). It is advised to use memory binding. Tests can be run to determine which binding strategy gives the best performance for a given application. The following settings are valid for the OpenMPI environment. Further information on binding can be retrieved with the <code>--report-bindings</code> MPI option. Along with the run commands, a few lines of the detailed binding information are shown. It is important not to use the task_binding of the scheduler!

===== Binding per CPU core =====

In this case, MPI fills CPU cores by the order of threads (rank).

<pre>
Command to run: mpirun --bind-to-core --bycore

[cn05:05493] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 3 bound to socket 0[core 3]: [. . . B . . . . . . . .][. . . . . . . . . . . .]
</pre>

===== Binding based on CPU socket =====

In this case, MPI threads fill the CPU sockets alternately.

<pre>
Command to run: mpirun --bind-to-core --bysocket

[cn05:05659] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 1 bound to socket 1[core 0]: [. . . . . . . . . . . .][B . . . . . . . . . . .]
[cn05:05659] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 3 bound to socket 1[core 1]: [. . . . . . . . . . . .][. B . . . . . . . . . .]
</pre>

===== Binding by nodes =====

In this case, MPI threads fill the nodes alternately. At least 2 nodes need to be allocated.

<pre>
Command to run: mpirun --bind-to-core --bynode

[cn05:05904] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05904] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 1 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 3 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
</pre>

==== OpenMP (OMP) jobs ====

For OpenMP parallel applications, 1 node needs to be allocated, and the number of OMP threads needs to be provided in the <code>OMP_NUM_THREADS</code> environment variable. The variable either has to be set in front of the application command (see the example) or exported before executing the command:

<code>

export OMP_NUM_THREADS=24

</code>

===== Example =====

User Alice starts a 24-thread OMP application for a maximum of 6 hours, charged to the foobar account.

<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=omp
#SBATCH --time=06:00:00
#SBATCH -N 1
OMP_NUM_THREADS=24 ./a.out
</pre>
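Instead of hard-coding the thread count, a job script can derive it from the environment Slurm provides. The sketch below relies on the <code>SLURM_CPUS_ON_NODE</code> variable that Slurm sets inside a job; the fallback of 24 is an assumption based on the node size used in the examples.

```shell
#!/bin/sh
# Use the core count Slurm reports for the node; fall back to 24 (assumed
# node size) when the script is run outside a Slurm job.
export OMP_NUM_THREADS=${SLURM_CPUS_ON_NODE:-24}
echo "running with $OMP_NUM_THREADS OpenMP threads"
```

The application would then be started as <code>./a.out</code> without the inline variable assignment.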

==== Hybrid MPI-OMP jobs ====

When an application uses both MPI and OMP, it runs in hybrid MPI-OMP mode. Note that the MKL calls of applications linked with Intel MKL are OpenMP capable. Generally, the following distribution is suggested: the number of MPI processes is from 1 to the number of CPU sockets, and the number of OMP threads is the number of CPU cores in a node, or half or a quarter of that (it depends on the code). In the job script, the parameters of these two modes need to be combined.

===== Example =====

User Alice submits a hybrid job for 8 hours on 2 nodes, charged to the 'foobar' account. 1 MPI process runs on each node, using 24 OMP threads per node. On the 2 nodes, 2 MPI processes run, with 2x24 OMP threads.

<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=mpiomp
#SBATCH -N 2
#SBATCH --time=08:00:00
#SBATCH --ntasks-per-node=1
#SBATCH -o slurm.out
export OMP_NUM_THREADS=24
mpirun ./a.out
</pre>

==== Maple Grid jobs ====

Maple can be run on one node, similarly to OMP jobs. The Maple module needs to be loaded to use it. Because Maple works in client-server mode, a grid server needs to be started (<code>${MAPLE}/toolbox/Grid/bin/startserver</code>). The application requires a license, which has to be requested in the job script (<code>#SBATCH --licenses=maplegrid:1</code>). A Maple job is started with the <code>${MAPLE}/toolbox/Grid/bin/joblauncher</code> command.

===== Example =====

User Alice runs a Maple Grid application for 6 hours, charged to the 'foobar' account:

<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=maple
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --time=06:00:00
#SBATCH -o slurm.out
#SBATCH --licenses=maplegrid:1

module load maple

${MAPLE}/toolbox/Grid/bin/startserver
${MAPLE}/toolbox/Grid/bin/joblauncher ${MAPLE}/toolbox/Grid/samples/Simple.mpl
</pre>

==== GPU compute nodes ====

The Szeged site accommodates 2 GPU-enabled compute nodes. Each GPU node has 6 Nvidia Tesla M2070 cards. The GPU nodes reside in a separate job queue (<code>--partition gpu</code>). To specify the number of GPUs, set the <code>--gres gpu:#</code> directive.

===== Example =====

User Alice submits a 6-hour job using 4 GPUs to the foobar account.

<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=GPU
#SBATCH --partition gpu
#SBATCH --gres gpu:4
#SBATCH --time=06:00:00

$PWD/gpu_burnout 3600
</pre>

You should use the mpirun binary directly if you are using the SHF3 environment or would like to use a more complex MPI run. In this case, however, you need to parse SGE's PE_HOSTFILE environment variable.


== Extensions ==

Extensions should be requested from the execution site (NIIF) at prace-support@niif.hu. All requests will be carefully reviewed, and a decision will be made on their eligibility.

== Reporting after finishing project ==

A report must be created after using PRACE resources. Please contact prace-support@niif.hu for further details.

== Acknowledgement in publications ==

PRACE

'''We acknowledge [PRACE/KIFÜ] for awarding us access to resource based in Hungary at [Budapest/Debrecen/Pécs/Szeged].'''

KIFÜ

'''We acknowledge KIFÜ for awarding us access to resource based in Hungary at [Budapest/Debrecen/Pécs/Szeged].'''

Where technical support has been received the following additional text should also be used:

'''The support of [name of person/people] from KIFÜ, Hungary to the technical work is gratefully acknowledged.'''

[[Category: HPC]]

Kzoli(AT)niif.hu
