Changes

PRACE User Support

1,198 bytes removed, 4 November 2013, 17:14

changed sge to slurm, translation in progress ;)

* -stripe Use this parameter to initiate a “striped” GridFTP transfer that uses more than one node at the source and destination. As multiple nodes contribute to the transfer, each using its own network interface, a larger amount of the network bandwidth can be consumed than with a single system. Thus, at least for “big” (> 100 MB) files, striping can considerably improve performance.

== Usage of the SLURM scheduler ==

Scheduling on the HPC systems is based on CPU hours. This means that the available core hours are divided between users on a monthly basis. Every UNIX user is connected to one or more accounts. Each scheduler account is connected to an HPC project and a UNIX group. HPC jobs can only be submitted using one of these accounts. Core hours are calculated by multiplying the wall time (the time spent running the job) by the number of CPU cores requested.

For example, reserving 2 nodes (48 CPU cores) at the NIIFI SC for 30 minutes costs 48 * 30 = 1440 core minutes = 24 core hours. Core hours are measured between the start and the end of the job.
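The arithmetic above can be sketched as a small shell helper. The function name <code>core_hours</code> is hypothetical and not part of the scheduler tools; 24 cores per node is taken from the example (2 nodes = 48 cores).

```shell
#!/bin/bash
# Sketch: core hours = wall time (in hours) x number of requested CPU cores.
# core_hours NODES CORES_PER_NODE MINUTES   (hypothetical helper)
core_hours() {
    # awk does the division so that fractional core hours are not truncated
    awk -v n="$1" -v c="$2" -v m="$3" 'BEGIN { printf "%g\n", n * c * m / 60 }'
}

# The example above: 2 nodes with 24 cores each, reserved for 30 minutes
core_hours 2 24 30
```

For the example reservation this prints 24, the same value as the hand calculation.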

'''It is very important to make sure that the application makes full use of the allocated resources. An empty or non-optimal job will consume the allocated core time very quickly. If the account runs out of its allocated time, no new jobs can be submitted until the beginning of the next accounting period. Account limits are regenerated at the beginning of each month.'''

Information about an account can be listed with the following command:

<code>

sbalance

</code>

==== Example ====
After executing the command, the following table shows up for Bob. The user can access and run jobs using two different accounts (foobar, barfoo). He can see his name marked with * in the table. He shares both accounts with alice (Account column). The core hours consumed by each user are displayed in the second column (Usage), and the consumption of the jobs run under the account is displayed in the 4th column. The last two columns show the maximum allocated time (Account Limit) and the time still available on the machine (Available).
<pre>
Scheduler Account Balance
---------- ----------- + ---------------- ----------- + ------------- -----------
User             Usage |          Account       Usage | Account Limit   Available (CPU hrs)
---------- ----------- + ---------------- ----------- + ------------- -----------
alice                0 |           foobar           0 |             0           0
bob *                0 |           foobar           0 |             0           0

bob *                7 |           barfoo           7 |         1,000         993
alice                0 |           barfoo           7 |         1,000         993
</pre>
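If the remaining time is needed inside a script, the table can be filtered with <code>awk</code>. This is only a sketch: <code>available_hours</code> is a hypothetical helper, and it assumes that the real <code>sbalance</code> output separates the three column groups with <code>|</code> exactly as in the table above.

```shell
#!/bin/bash
# available_hours ACCOUNT   (hypothetical helper)
# Reads sbalance-style text on stdin and prints the "Available" value of ACCOUNT.
available_hours() {
    awk -F'|' -v acct="$1" '
        NF == 3 {
            split($2, mid, " ")            # mid[1] is the account name
            n = split($3, right, " ")      # right[n] is the available time
            if (mid[1] == acct) {
                gsub(/,/, "", right[n])    # drop thousands separators
                print right[n]
                exit
            }
        }'
}

# Sample rows copied from the table above.
# On the cluster the intended use would be: sbalance | available_hours barfoo
available_hours barfoo <<'EOF'
bob *                7 |           barfoo           7 |         1,000         993
alice                0 |           barfoo           7 |         1,000         993
EOF
```

Only the first matching row is used, because every row of an account carries the same limit and available time.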

=== Estimating CPU time ===

Before production runs it is worth estimating the CPU time required. The following command can be used:

<code>

sestimate -N NODES -t WALLTIME

</code>

where <code>NODES</code> is the number of nodes to reserve and <code>WALLTIME</code> is the maximum run time.

'''It is important to specify the requested time as precisely as possible, because the scheduler also uses it to rank the jobs waiting to run. In general, shorter jobs start sooner. It is worth checking the actual run time of every job afterwards with the <code>sacct</code> command.'''

==== Example ====
Alice wants to reserve 2 nodes for 2 days and 10 hours, and checks whether there is enough CPU time on her account:
<pre>
sestimate -N 2 -t 2-10:00:00

Estimated CPU hours: 2784
</pre>
Unfortunately, she cannot afford this in the current month.
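The estimate can be reproduced by hand: 2 days and 10 hours are 58 hours, and 2 nodes with 24 cores each give 48 cores, so 48 * 58 = 2784. The following sketch does the same calculation. It is not the implementation of <code>sestimate</code>; the function name <code>estimate_cpu_hours</code> and the default of 24 cores per node are assumptions.

```shell
#!/bin/bash
# estimate_cpu_hours NODES WALLTIME [CORES_PER_NODE]   (hypothetical helper)
# WALLTIME is [D-]HH:MM:SS; CORES_PER_NODE defaults to 24 (an assumption).
estimate_cpu_hours() {
    awk -v n="$1" -v w="$2" -v c="${3:-24}" 'BEGIN {
        days = 0
        if (index(w, "-") > 0) {       # optional "D-" prefix
            split(w, d, "-")
            days = d[1]
            w = d[2]
        }
        split(w, t, ":")               # t[1]=HH, t[2]=MM, t[3]=SS
        hours = days * 24 + t[1] + t[2] / 60 + t[3] / 3600
        printf "%g\n", n * c * hours
    }'
}

# The request from the example: 2 nodes for 2 days and 10 hours
estimate_cpu_hours 2 2-10:00:00
```

Comparing this number with the Available column of <code>sbalance</code> shows in advance whether the account can cover the run.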

=== Status information ===

The <code>squeue</code> command reports on the jobs in the scheduler, and the <code>sinfo</code> command on the general state of the cluster. Every submitted job is assigned a unique identifier (JOBID). With this identifier, further information can be requested. The properties of a submitted or already running job:

<code>

scontrol show job JOBID

</code>

Every job is also recorded in an accounting database. The properties and resource usage statistics of completed jobs can be retrieved from this database. The detailed statistics can be viewed with the following command:

<code>

sacct -l -j JOBID

</code>

The following command reports the memory used:

<code>

smemory JOBID

</code>

And the following command reports the disk usage:

<code>

sdisk JOBID

</code>

==== Example ====
There are 3 jobs in the scheduler. The first is an array job that is waiting for resources (PENDING). The second is an MPI job that has been running on 4 nodes for 25 minutes (TIME). The third is a single-node OMP run that has just started. The job names (NAME) can be chosen freely; short, informative names are recommended.
<pre>
squeue -l

Wed Oct 16 08:30:07 2013
     JOBID PARTITION   NAME   USER    STATE   TIME TIMELIMIT NODES NODELIST(REASON)
591_[1-96]    normal  array  alice  PENDING   0:00     30:00     1 (None)
       589    normal    mpi    bob  RUNNING  25:55   2:00:00     4 cn[05-08]
       590    normal    omp  alice  RUNNING   0:25   1:00:00     1 cn09
</pre>

The typical memory load of this 2-node batch job was as follows: it used about 10 GB of virtual memory and 6.5 GB of RSS memory per node.
<pre>
smemory 430

MaxVMSize  MaxVMSizeNode  AveVMSize  MaxRSS     MaxRSSNode  AveRSS
----------------------------------------------------------------
10271792K  cn06           10271792K  6544524K   cn06        6544524K
10085152K  cn07           10085152K  6538492K   cn07        6534876K
</pre>

==== Checking jobs ====

It is very important to make sure that the application actually uses the allocated CPU time. A running application can be monitored with the following command:

<code>

sjobcheck JOBID

</code>

===== Example =====
This job runs on 4 nodes. The LOAD group gives information about the general load of the machine and is roughly equal to the number of cores. The CPU group shows whether the machine is being used properly. Ideally, the values in the <code>User</code> column are above 90. Below that, something has gone wrong and the run should be stopped. The example job uses the machine extremely well (it maxes it out).
<pre>
Hostname     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]
cn08    24 (   25/  529) [ 24.83, 24.84, 20.98] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
cn07    24 (   25/  529) [ 24.93, 24.88, 20.98] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
cn06    24 (   25/  529) [ 25.00, 24.90, 20.97] [  99.9,   0.0,   0.1,   0.0,   0.0] OFF
cn05    24 (   25/  544) [ 25.11, 24.96, 20.97] [  99.8,   0.0,   0.2,   0.0,   0.0] OFF
</pre>

==== Checking licenses ====

The following command reports the available and currently used licenses:

<code>

slicenses

</code>

==== Checking maintenance ====
During a maintenance window the scheduler does not start new jobs, but jobs can still be submitted. The following command reports the maintenance dates:

<code>

sreservations

</code>

=== Running jobs ===
Applications can be run on the supercomputers in batch mode. This means that a job script must be prepared for every run, containing the description of the requested resources and the commands needed for the run. The scheduler parameters (resource requirements) must be given with the <code>#SBATCH</code> directive. For a comparison of schedulers and the directives available in Slurm, see the following [http://slurm.schedmd.com/rosetta.pdf table].

==== Mandatory parameters ====
The following parameters must always be given:
<pre>
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --job-name=NAME
#SBATCH --time=TIME
</pre>
where <code>ACCOUNT</code> is the name of the account to be charged (the <code>sbalance</code> command lists the available accounts), <code>NAME</code> is the short name of the job, and <code>TIME</code> is the maximum wall time (<code>DD-HH:MM:SS</code>).
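When many similar jobs are prepared, the three mandatory directives can be generated by a small helper instead of being typed each time. The sketch below is only an illustration: <code>make_jobscript</code> is a hypothetical function, and the account, job name and command are example values.

```shell
#!/bin/bash
# make_jobscript ACCOUNT NAME TIME COMMAND...   (hypothetical helper)
# Prints a job script with the three mandatory #SBATCH directives to stdout.
make_jobscript() {
    js_account=$1
    js_name=$2
    js_walltime=$3
    shift 3                      # the remaining arguments are the command to run
    cat <<EOF
#!/bin/bash
#SBATCH -A $js_account
#SBATCH --job-name=$js_name
#SBATCH --time=$js_walltime
$*
EOF
}

# Example values only: account foobar, job name test, half an hour of wall time
make_jobscript foobar test 00:30:00 ./a.out
```

Redirecting the output to a file (for example <code>make_jobscript foobar test 00:30:00 ./a.out > jobscript.sh</code>) gives a script that already contains every mandatory parameter.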

Jobs are submitted with the following command:

<code>

sbatch jobscript.sh

</code>

On successful submission we get the following output:
<pre>
Submitted batch job JOBID
</pre>
where <code>JOBID</code> is the unique identifier of the job.
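In scripts it is often useful to capture the JOBID right after submission, so that it can be passed on to <code>scontrol</code>, <code>sacct</code> or <code>scancel</code>. A minimal sketch, assuming the output has exactly the form shown above; <code>parse_jobid</code> is a hypothetical helper name.

```shell
#!/bin/bash
# parse_jobid   (hypothetical helper)
# Reads the output of sbatch on stdin and prints the JOBID, if there is one.
parse_jobid() {
    awk '/^Submitted batch job/ { print $4; exit }'
}

# Simulated sbatch output; on the cluster the intended use would be:
#   JOBID=$(sbatch jobscript.sh | parse_jobid)
echo "Submitted batch job 591" | parse_jobid
```

If the submission fails, nothing is printed, so an empty result can be treated as an error.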

A job is stopped with the following command:

<code>

scancel JOBID

</code>

==== Job queues ====
Two non-overlapping queues (partitions) are available on the supercomputers: the <code>test</code> queue and the <code>prod</code> queue. The latter is for production calculations, the former for development and testing. In the test queue a total of 1 node can be reserved, for at most half an hour. The default queue is <code>prod</code>. The test partition can be selected with the following directive:
<pre>
#SBATCH --partition=test
</pre>

==== Quality of service (QOS) ====
Low-priority jobs can also be submitted. Such jobs can be interrupted at any time by any normal-priority job; in return, only half of the CPU time used is billed. Interrupted jobs are rescheduled automatically. It is important to run only those jobs at low priority that can tolerate random interruptions, save their state regularly (checkpoint) and can restart quickly from it. The default quality of service is <code>normal</code>, which means that the run cannot be interrupted.

Low priority can be selected with the following directive:

<pre>

#SBATCH --qos=lowpri

</pre>

==== Memory settings ====
By default, 1000 MB of memory is assigned to each CPU core. More can be requested with the following directive:
<pre>
#SBATCH --mem-per-cpu=MEMORY
</pre>
where <code>MEMORY</code> is given in MB. In Budapest the maximum memory per core is 2600 MB.

==== Email notification ====
Send mail when the state of the job changes (start, stop, error):
<pre>
#SBATCH --mail-type=ALL
#SBATCH --mail-user=EMAIL
</pre>
where <code>EMAIL</code> is the email address to be notified.

==== Array jobs ====
Array jobs are needed when we want to run a single-threaded (serial) application in many instances at once, each with different data. For each instance the scheduler stores a unique identifier in the <code>SLURM_ARRAY_TASK_ID</code> environment variable. The instances of the array job can be told apart by querying this variable. The output of each instance is written to a <code>slurm-SLURM_ARRAY_JOB_ID-SLURM_ARRAY_TASK_ID.out</code> file. The scheduler fills the nodes using tight packing. In this case, too, it is worth choosing the number of instances as a multiple of the number of processors. [http://slurm.schedmd.com/job_array.html More details]

===== Example =====
User Alice submits 96 serial jobs for at most 24 hours, charged to the foobar account. The <code>#SBATCH --array=1-96</code> directive indicates that this is an array job. The application must be started with the <code>srun</code> command. In this case it is a shell script.
<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --time=24:00:00
#SBATCH --job-name=array
#SBATCH --array=1-96
srun envtest.sh
</pre>
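The contents of <code>envtest.sh</code> are not shown here. As an illustration only, a script of this kind might use <code>SLURM_ARRAY_TASK_ID</code> to pick a different input file for every instance. The file naming scheme and the <code>input_for_task</code> helper below are assumptions, not part of the original example.

```shell
#!/bin/bash
# Hypothetical envtest.sh: every array instance works on its own input file.

# input_for_task ID -> name of the input file for that instance
# (the input_NNN.dat naming scheme is an assumption)
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Outside Slurm the variable is unset, so fall back to 1 for local testing.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "instance $task_id processes $(input_for_task "$task_id")"
```

With <code>--array=1-96</code> the instances would then process <code>input_001.dat</code> to <code>input_096.dat</code>.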

==== MPI jobs ====
For MPI jobs, the number of MPI processes started on each node must also be given (<code>#SBATCH --ntasks-per-node=</code>). In the most common case this is the number of CPU cores in a node. The parallel program must be started with the <code>mpirun</code> command.

===== Example =====
User Bob reserves 2 nodes for 12 hours for an MPI job, charged to the barfoo account. 24 MPI processes will be started on each node. The stdout output of the program is redirected to the <code>slurm.out</code> file (<code>#SBATCH -o</code>).
<pre>
#!/bin/bash
#SBATCH -A barfoo
#SBATCH --job-name=mpi
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH -o slurm.out
mpirun ./a.out
</pre>

==== CPU binding ====
The performance of MPI programs can usually be improved by binding the processes to CPU cores. In this case the operating system does not move the threads of the parallel program between CPU cores, so memory locality can improve (fewer cache misses). Using binding is recommended. Tests are needed to find out which binding strategy gives the best result for a given application. The following settings apply to the OpenMPI environment. Detailed information about the bindings can be obtained with the <code>--report-bindings</code> MPI option. A few lines of the detailed binding information are shown alongside the launch commands. Note that the task binding of the scheduler must not be used!

===== Binding by CPU core =====
In this case the MPI processes (ranks) fill the CPU cores in order.
<pre>
Launch command:
mpirun --bind-to-core --bycore

[cn05:05493] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 3 bound to socket 0[core 3]: [. . . B . . . . . . . .][. . . . . . . . . . . .]
</pre>

===== Binding by CPU socket =====
In this case the MPI processes fill the CPUs alternately.
<pre>
Launch command:
mpirun --bind-to-core --bysocket

[cn05:05659] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 1 bound to socket 1[core 0]: [. . . . . . . . . . . .][B . . . . . . . . . . .]
[cn05:05659] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 3 bound to socket 1[core 1]: [. . . . . . . . . . . .][. B . . . . . . . . . .]
</pre>

===== Binding by node =====
In this case the MPI processes fill the nodes alternately. At least 2 nodes must be reserved.
<pre>
Launch command:
mpirun --bind-to-core --bynode

[cn05:05904] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05904] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 1 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 3 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
</pre>

==== OpenMP (OMP) jobs ====

For OpenMP parallel applications 1 node must be reserved, and the number of OMP threads must be given with the <code>OMP_NUM_THREADS</code> environment variable. The variable must either be written in front of the application (see the example) or exported before the command that starts the application:

<code>

export OMP_NUM_THREADS=24

</code>

===== Example =====
User Alice starts a 24-thread OMP application for at most 6 hours, charged to the foobar account.
<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=omp
#SBATCH --time=06:00:00
#SBATCH -N 1
OMP_NUM_THREADS=24 ./a.out
</pre>

==== Hybrid MPI-OMP jobs ====

We speak of hybrid MPI-OMP mode when the parallel application uses both MPI and OMP. It is worth knowing that the MKL calls of programs linked with Intel MKL are OpenMP-capable. In general the following distribution is recommended: the number of MPI processes per node ranges from 1 to the number of CPU sockets in a node, and the number of OMP threads is accordingly the total number of CPU cores in a node, or a half or a quarter of it (as appropriate). The job script must combine the parameters of the two modes above.
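The recommended split can be computed from the number of cores per node and the number of MPI processes per node. The following is a sketch under stated assumptions: <code>omp_threads</code> is a hypothetical helper, and 24 cores and 2 sockets per node are assumptions about the hardware.

```shell
#!/bin/bash
# omp_threads CORES_PER_NODE MPI_PER_NODE   (hypothetical helper)
# Prints the number of OMP threads per MPI process so that every core is used once.
omp_threads() {
    if [ "$1" -le 0 ] || [ "$2" -le 0 ] || [ $(( $1 % $2 )) -ne 0 ]; then
        echo "cores per node must be a positive multiple of MPI processes per node" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# 24 cores and 2 sockets per node are assumptions about the hardware.
for mpi_per_node in 1 2; do
    echo "--ntasks-per-node=$mpi_per_node OMP_NUM_THREADS=$(omp_threads 24 "$mpi_per_node")"
done
```

With one MPI process per node this gives 24 OMP threads, with one process per socket it gives 12.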

===== Example =====
User Alice submitted a hybrid job for 8 hours on 2 nodes, charged to the foobar account. Only 1 MPI process runs on each node, and it uses 24 OMP threads per node. In total, 2 MPI processes and 2 x 24 OMP threads run on the 2 machines.
<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=mpiomp
#SBATCH -N 2
#SBATCH --time=08:00:00
#SBATCH --ntasks-per-node=1
#SBATCH -o slurm.out
export OMP_NUM_THREADS=24
mpirun ./a.out
</pre>

==== Maple Grid jobs ====

Like OMP jobs, Maple can be run on 1 node. The maple module must also be loaded to use it. Maple works in client-server mode, so the grid server must be started before running the Maple job (<code>${MAPLE}/toolbox/Grid/bin/startserver</code>). This application requires a license, which must be specified in the job script (<code>#SBATCH --licenses=maplegrid:1</code>). The Maple job must be started with the <code>${MAPLE}/toolbox/Grid/bin/joblauncher</code> command.

===== Example =====
User Alice starts the Maple Grid application for 6 hours, charged to the foobar account:
<pre>
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=maple
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --time=06:00:00
#SBATCH -o slurm.out
#SBATCH --licenses=maplegrid:1
${MAPLE}/toolbox/Grid/bin/startserver
${MAPLE}/toolbox/Grid/bin/joblauncher ${MAPLE}/toolbox/Grid/samples/Simple.mpl
</pre>

Kzoli(AT)niif.hu

bureaucrats, administrators

142

edits
