== Usage of the Sun Grid Engine scheduler ==
=== The most simple commands ===

The most simple SGE command is the display of the cluster data:

<code>
qhost
</code>

A possible outcome of this command can be:

{| class="wikitable" border="1"
|-
| HOSTNAME || ARCH || NCPU || LOAD || MEMTOT || MEMUSE || SWAPTO || SWAPUS
|-
| global || <nowiki>-</nowiki> || <nowiki>-</nowiki> || <nowiki>-</nowiki> || <nowiki>-</nowiki> || <nowiki>-</nowiki> || <nowiki>-</nowiki> || <nowiki>-</nowiki>
|-
| uv || linux-x64 || 1152 || 900.56 || 6057.9G || 132.4G || 0.0 || 0.0
|}

The first two columns give the names and types of the computers in the cluster. The NCPU column shows the number of available processor cores. LOAD shows the computer's current load (this value equals the value reported by the uptime UNIX command). The remaining cells are: total physical memory, the memory currently in use, the available swap memory, and the used swap. The global line summarizes this information for the whole cluster.

We can have a look at the available queues with the following command:

<code>
qconf -sql
</code>

One probable outcome of the command:

<code>
test.q
uv.q
</code>

To get more information about the state of the system, use

<code>
qstat -f
</code>

It shows which jobs run in which queues, and you can also get detailed information about the queues themselves (state, environment). The command can be used without the -f switch too, but it is less informative, since in this case only the jobs' states will appear. The command's outcome:

<code>
queuename                  qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
test.q@uv                  BIP   0/1/30         800.15   linux-x64
    905 1.00000 PI_SEQ_TES stefan       r     06/04/2011 09:12:14     1
---------------------------------------------------------------------------------
uv.q@uv                    BIP   0/802/1110     800.15   linux-x64
</code>

The first column of this table shows the name of the queue, the second column marks its type (B - batch, I - interactive, C - checkpointing, P - parallel environment, E - error state). The third part of the resv/used/tot. column shows how many jobs can run at the same time in the queue; in total, these values correspond to the number of processor cores in the system. The second part shows the slots in use at the moment. If a running (scheduled) job is found in a queue, it is listed directly under the name of its queue, like "PI_SEQ_TES" above, which runs in the test.q queue. Tasks waiting for resources, because the queue is overloaded or their preliminary conditions are not yet fulfilled, appear behind the summary rows, listed as pending jobs. For example:

<code>
queuename                  qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
test.q@uv                  BIP   0/0/30         600.42   linux-x64
---------------------------------------------------------------------------------
uv.q@uv                    BIP   0/598/1110     600.42   linux-x64

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    905 0.00000 PI_SEQ_TES stefan       qw    06/04/2011 09:12:04     1
</code>

Each task is given a numeric identifier (a job ID, or j_id); this is followed by the job's priority (0 in both cases), then the job's name, the user who submitted the job, and the qw mark, showing that the job is waiting for a queue. Finally the date of registration into the waiting queue is shown. When a job finishes running, a file named jobname.ojobnumber is created in our current directory, containing the output and error messages produced by the program.
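If only your own submissions are of interest, the qstat listing can be narrowed to a single user with the standard -u switch; a minimal example, taking the user name from the environment:

<code>
qstat -u $USER
</code>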
=== Job submission ===

The SGE scheduler was originally designed to be able to operate on different types of architectures. That is why you cannot submit binary files directly, only scripts, as with the

<code>
qsub script.sh
</code>

command. The script describes the task, its main parameters, and how it runs. For example, the following ''hostname.sh'' script describes a task:

<code>
#!/bin/sh
#$ -N HOSTNAME
/bin/hostname
</code>

It can be submitted with the following command:

<code>
qsub hostname.sh
</code>

Scripts can also be used for separating the different binaries:

<code>
#!/bin/sh
case `uname` in
SunOS) ./pi_sun
FreeBSD) ./pi_bsd
esac
</code>

With the following command, we can define the queue where the scheduler puts the job:

<code>
qsub -q serial.q range.sh
</code>

The qsub command can be issued with a number of different switches, which are gathered in the following table:

{| class="wikitable" border="1"
|-
| Parameter || Possible example || Result
|-
| -N name || -N Flow || The job will appear under this name in the queue.
|-
| -cwd || -cwd || The output and the error files will appear in the current directory.
|-
| -S shell || -S /bin/tcsh || The shell in which the scripts run.
|-
| -j {y,n} || -j y || Joins the error and the output into one file.
|-
| -r {y,n} || -r y || After a restart, the job should restart too (from the beginning).
|-
| -M e-mail || -M stefan@niif.hu || Scheduler information about the job will be sent to this address.
|-
| -l || -l h_cpu=0:15:0 || Chooses a queue for the job where 15 minutes of CPU time can be ensured. (hour:minute:second)
|-
| -l || -l h_vmem=1G || Chooses a computer for the job where 1 GB memory is available. In the case of parallel jobs its value is multiplied by the requested number of slots. If this parameter is not given, the default is the maximum memory per core configured on the computers.
|-
| -l || -l in || Consuming resources, complex request. (Defined in the documentation written for the system administrators.)
|-
| -binding || -binding linear:4 || Chooses 4 CPU cores on the worker node and assigns them in a fixed way. Further information: [http://docs.oracle.com/cd/E24901_01/doc.62/e21976/chapter2.htm#autoId75 here].
|-
| -l || -l exclusive=true || Demands exclusive task execution (another job will not be scheduled on the chosen computers). It can be used on the following sites: Szeged, Budapest and Debrecen.
|-
| -P || -P niifi || Chooses a HPC project. This command lists the available HPC projects: ''qconf -sprjl''
|-
| -R || -R y || Resource reservation. This causes bigger parallel jobs to get higher priority.
|}

qsub command arguments can be added to the ~/.sge_request file. If this file exists, its content will be added to the qsub argument list.
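For illustration, a possible ~/.sge_request could set a few defaults from the table above (the values, including the mail address, are example assumptions):

<code>
# default qsub switches, applied to every submission
-cwd
-j y
-M stefan@niif.hu
</code>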
Sometimes we want to delete a job before it runs. For this you can use the

<code>
qdel job_id
</code>

command.

<code>
qdel 903
</code>

The example deletes job number 903.

<code>
qdel -f 903
</code>

deletes a running job immediately. For suspending and later continuing jobs, use qmod {-s,-us}:

<code>
qmod -s 903
qmod -us 903
</code>

The former suspends the running of job number 903 (SIGSTOP), while the latter lets it continue (SIGCONT).

If there is a need to change the features (resource requirements) of a job already placed in the waiting list, it can be done with the ''qalter'' command:

<code>
qalter -l h_cpu=0:12:0 903
</code>

The previous command alters the hard CPU requirement (h_cpu) of job number 903 and changes it to 12 minutes. The switches of the qalter command mostly overlap those of the qsub command.
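Most switches from the table above can also be fixed in the job script itself as #$ directives, so they do not have to be repeated on every qsub or qalter invocation. A minimal sketch with illustrative values:

<code>
#!/bin/sh
#$ -N SLEEP_TEST
#$ -cwd
#$ -j y
#$ -l h_cpu=0:15:0
# stand-in payload; replace with the real program
sleep 60
</code>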
In a special case, we have to execute the same task, but on different data. These tasks are array jobs. With SGE we can submit several such jobs to the waiting queue at once. For example, the pi task shown in the previous chapter can be submitted multiple times, with different parameters, using the following ''array.sh'' script:

<code>
#!/bin/sh
#$ -N PI_ARRAY_TEST
./pi_gcc `expr $SGE_TASK_ID \* 100000`
</code>

SGE_TASK_ID is an internal variable used by SGE, which gets a separate value for each running task. The interval can be set when submitting the block:

<code>
qsub -t 1-7 array.sh
</code>

meaning that the array.sh program will run in seven instances, and SGE_TASK_ID will have the value 1, 2, ..., 7 in the respective instances. qstat -f shows how the block tasks are split:

<code>
---------------------------------------------------------------------------------
test.q@uv                  BIP   0/0/30         8.15     linux-x64
---------------------------------------------------------------------------------
uv.q@uv                    BIP   0/7/1110       8.15     linux-x64
    907 1.00000 PI_ARRAY_T stefan       r     06/04/2011 10:34:14     1 1
    907 0.50000 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 2
    907 0.33333 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 3
    907 0.25000 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 4
    907 0.20000 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 5
    907 0.16667 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 6
    907 0.14286 PI_ARRAY_T stefan       t     06/04/2011 10:34:14     1 7
</code>

It is clear that each task carries its array index, with which we can refer to individual components of the job. For example, in the case of block tasks, there is a possibility to delete particular parts of the block. If we want to delete subtasks 5-7 of the previous job, the command

<code>
qdel -f 907.5-7
</code>

will delete the chosen components, but leave tasks 907.1-4 intact. The result of the run is seven individual files, with seven different running solutions.
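Standard SGE names the per-subtask output files jobname.o<job_id>.<task_id>, so the seven results of the example above (assuming job number 907, as in the listing) could be collected like this:

<code>
# concatenate the outputs of all subtasks of array job 907
cat PI_ARRAY_TEST.o907.* > pi_results.txt
</code>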
It can happen that a task placed in the queue does not start. In this case the

<code>
qstat -j job_id
</code>

command shows the detailed scheduling information, including which running parameters of the task are unfulfilled. The priority of the tasks only determines the order of the pending jobs list; the scheduler analyzes the tasks in this order. Since resources have to be reserved, it is not certain that the tasks will actually run in exactly this order.

If we wonder why a certain job does not start, here is how to get information:

<code>
qalter -w v job_id
</code>

One possible outcome:

<code>
Job 53505 cannot run in queue "szeged.q" because it is not contained in its hard queue list (-q)
Job 53505 (-l NONE) cannot run in queue "cn46.szeged.hpc.niif.hu" because exclusive resource (exclusive) is already in use
Job 53505 (-l NONE) cannot run in queue "cn48.szeged.hpc.niif.hu" because exclusive resource (exclusive) is already in use
Job 53505 cannot run in PE "mpi" because it only offers 0 slots
verification: no suitable queues
</code>

You can check with this command where the jobs are running:

<code>
qhost -j -q
</code>

<code>
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
cn01                    linux-x64      48 41.43  126.0G    3.0G     0.0     0.0
   serial.q             BI    0/42/48
   120087 0.15501 run.sh     roczei       r     09/23/2012 14:25:51 MASTER 22
   120087 0.15501 run.sh     roczei       r     09/23/2012 15:02:21 MASTER 78
   120087 0.15501 run.sh     roczei       r     10/01/2012 07:58:21 MASTER 143
   120087 0.15501 run.sh     roczei       r     10/01/2012 08:28:51 MASTER 144
   120087 0.15501 run.sh     roczei       r     10/04/2012 17:41:51 MASTER 158
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 3
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 5
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 19
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 23
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 31
   120340 0.13970 pwhg.sh    roczei       r     09/24/2012 23:24:51 MASTER 33
   120340 0.13970 pwhg.sh    roczei       r     09/26/2012 13:42:51 MASTER 113
   120340 0.13970 pwhg.sh    roczei       r     10/01/2012 07:43:06 MASTER 186
   120340 0.13970 pwhg.sh    roczei       r     10/01/2012 07:58:36 MASTER 187
   ...
</code>

=== Queue types ===

''parallel.q'' - for parallel jobs (jobs can run for a maximum of 31 days)

''serial.q'' - for serial jobs (jobs can run for a maximum of 31 days)

''test.q'' - test queue; the job will be killed after 2 hours

Getting information on the status of the queues:

<code>
qstat -g c
</code>

<code>
CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
parallel.q        0.91    460      0     44    504      0      0
serial.q          0.84    200      0     40    240      0      0
test.q            0.00      0      0     24     24      0      0
</code>

=== Running PVM job ===

To run the previously shown and compiled gexample application, we need the following task-describing ''gexample.sh'' script:

<code>
#!/bin/sh
#$ -N GEXAMPLE
./gexample << EOL
30
5
EOL
</code>

We can submit it with the following command:

<code>
qsub -pe pvm 5 gexample.sh
</code>

The -pe pvm 5 switch tells SGE to create a PVM parallel virtual machine with 5 virtual processors and run the application in it.

<code>
uv.q@uv                    BIP   0/5/1110       5.15     linux-x64
    908 1.00000 GEXAMPLE   stefan       r     06/04/2011 13:05:14     5
</code>

Also note that after the run two output files are created: one containing the joined standard error and standard output (GEXAMPLE.o908), another describing the operation of the PVM environment (GEXAMPLE.po908). The latter is mainly useful for finding errors.
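The parallel environments that can be requested with the -pe switch (pvm above, mpi and openmp below) are configured per site; the list of available environments can be queried with a standard SGE command:

<code>
qconf -spl
</code>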
=== Running MPI jobs ===

All computers are set up with several installations of the MPI system: vendor-specific MPI implementations and the MPICH system too. The default setup is the vendor-specific MPI. Running in the MPI environment is similar to the PVM environment. Let's have a look at the connectivity example shown in the previous chapter, a very simple task which tests the internal communication of the MPI tasks. Use the following ''connectivity.sh'' script to run it:

<code>
#!/bin/sh
#$ -N CONNECTIVITY
mpirun -np $NSLOTS ./connectivity
</code>

Here the $NSLOTS variable indicates how many processors should be used in the MPI environment. This equals the number we have requested for the parallel environment.

The job can be submitted with the following command:

<code>
qsub -pe mpi 20 connectivity.sh
</code>

With this command we instruct the scheduler to create a parallel MPI environment containing 20 processors and to reserve space for it in one of the queues. Once the space is available, the job starts:

<code>
uv.q@uv                    BIP   0/20/1110      20.30    linux-x64
    910 1.00000 CONNECTOVI stefan       r     06/04/2011 14:03:14    20
</code>

Running the program results in two files: the first one (CONNECTIVITY.o910) contains the program's joined standard output and standard error, while the second one (CONNECTIVITY.po910) is for following up the operation of the parallel environment. If the run is successful, this file is empty. The switch -pe mpi 20 can also be given in the script with the directive #$ -pe mpi 20.

'''Important note: you should use mpirun.sge provided by SGI MPT on the Debrecen supercomputer''' when you run a job under SGE. It can automatically parse which machines have been selected by SGE. This way you can check whether you are using SGI MPT or not:

<code>
DEBRECEN[service0] ~ (1)$ type mpirun
mpirun is hashed (/opt/nce/packages/global/sgi/mpt/2.04/bin/mpirun)
DEBRECEN[service0] ~ (0)$ type mpirun.sge
mpirun.sge is hashed (/opt/nce/packages/global/sgi/mpt/2.04/bin/mpirun.sge)
DEBRECEN[service0] ~ (0)$
</code>

You should use the mpirun binary directly if you are using the SHF3 environment or if you would like to perform a more complex MPI run. However, in this case you need to parse SGE's PE_HOSTFILE environment variable yourself.
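A minimal sketch of such parsing, assuming the usual PE_HOSTFILE line format (host name, slot count, queue, processor range); it expands the file into a flat machine list with one line per granted slot:

<code>
#!/bin/sh
# build a machine file from SGE's PE_HOSTFILE: one host name per slot
while read host slots queue rest; do
    i=0
    while [ $i -lt $slots ]; do
        echo $host
        i=`expr $i + 1`
    done
done < $PE_HOSTFILE > machines.txt
</code>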
=== Running OpenMP jobs ===

There are applications that either use the facilities of the operating system for multi-threaded program execution, or use a special library designed for this, like OpenMP. These applications have to be instructed how many threads they can use. The matrix multiplication algorithm presented in the previous chapter can be described with the following ''omp_mm.sh'' script:

<code>
#!/bin/sh
#$ -N OPENMP_MM
./omp_mm
</code>

It can be submitted with this command, which will use 6 threads:

<code>
qsub -pe openmp 6 omp_mm.sh
</code>
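The omp_mm.sh script above assumes that the application determines its thread count by itself. Many OpenMP programs instead read the standard OMP_NUM_THREADS environment variable, which under SGE can be derived from the granted slot count; a sketch of such a script:

<code>
#!/bin/sh
#$ -N OPENMP_MM
# pass the slot count granted by SGE to the OpenMP runtime
OMP_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS
./omp_mm
</code>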
=== Checkpointing support ===

At the moment the system does not support any automatic checkpointing/restarting mechanism. If it is needed, the application itself has to take care of it.