PRACE User Support
Table of Contents
- 1 User Guide to obtain a digital certificate
- 2 Access with GSI-SSH
- 3 GridFTP file transfer
- 4 Usage of the SLURM scheduler
- 4.1 Example
- 4.2 Estimating core time
- 4.3 Status information
- 4.4 Running jobs
User Guide to obtain a digital certificate
This document gives a short overview of how to request a digital certificate from the NIIF CA, provided that the pre-registration form has already been filled in.
This guide is valid only for Hungarian users.
If you are from another country and would like to obtain a certificate, you can find your country's certification authority here.
Installing NIIF CA root certificate
- The first step is to download the "root certificate" ("NIIF CA Root Certificate" part) in the format known to the browser or other SSL-using program. The browser asks whether to install/accept the certificate or not - accept or install the certificate in any case. In addition, activate or allow the option which permits the browser to use the certificate to authenticate websites. Without that, it is not possible to reach the CA's web interface over the secure protocol (https). The downloaded/installed certificate can be found in the certificate management module of the browser.
Request a certificate
Request a certificate with openssl
- Sign in to the certificate registration website of the NIIF CA with your email address and the password stored in the directory.
- This site uses the secure protocol (https), which the browser often indicates with a warning window - these warnings can simply be acknowledged.
- On the opening page - the public web interface of the CMS certificate management software - choose the "OpenSSL kliens kérelem benyújtása (PKCS#10)" (submit an OpenSSL client request) option. This leads to the datasheet, which must be filled in accordance with the printed datasheet. First, the field corresponding to the purpose of the request must be chosen (CSIRT, GRID, NIIF felhasználó, Független kutató, HBONE).
- Copy the public part of your certificate request into the "PKCS#10" field. A user guide on how to create a PKCS#10 certificate request with openssl that meets the NIIF CA requirements can be found below.
- A Challenge password and a Request password must be given - both must be at least 8 characters long. Note them down, because they are needed to revoke the certificate or for personal authentication.
- Fill in the other fields (name, email address, phone, organisation), and if there is anything the CA operator should know, put it in the last field. When everything is done, after a final check, click the Elküld ("send") button at the bottom of the page.
- In case of a successful PKCS#10 key upload, a page opens confirming the successful certificate request.
User Guide to create a PKCS#10 digital certification request with openssl
This paragraph gives a short overview of how to create a PKCS#10 format certificate request for the NIIF CA using openssl. The latest version of the openssl program can be downloaded from: Windows, Linux.
- 1. Download the openssl configuration file
- To generate the CSR, there is a prewritten niif_ca_user_openssl.cnf file on the NIIF CA website.
- The following modifications must be made in the config file:
# purpose of the certificate
1.organizationalUnitName         = Organizational Unit Name
1.organizationalUnitName_default = GRID # For example: GRID, HBONE, General Purpose
2.organizationalUnitName         = Second Organizational Unit Name
2.organizationalUnitName_default = NIIF # For example: BME, ELTE, SZFKI, SZTAKI, NIIF, ...
commonName                       = Common Name (YOUR name) # User Name
commonName_max                   = 64
- 2. Create the PKCS#10 request
- If you have no existing private key:
Run the
openssl req -newkey rsa:1024 -config ./niif_ca_user_openssl.cnf -out new_csr.pem
command, and answer the questions appearing at the prompt. The institute (NIIF CA) and country (HU) data must not be changed, otherwise the request will be invalid. The certificate request and the corresponding private key are saved in the new_csr.pem and privkey.pem files. To gain access to the private key, the "pass phrase" password given during generation must be used. If the password is forgotten, the certificate becomes unusable.
- Existing private key (extension)
If there is an existing, previously generated private key (it must be at least a 1024-bit RSA key) stored in the old_key.pem file, then the following command creates the CSR:
openssl req -new -key ./old_key.pem -config ./niif_ca_user_openssl.cnf -out new_csr.pem
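Before submitting the request, it may be worth checking that the generated CSR contains the expected fields. A minimal check with openssl (assuming the CSR was written to new_csr.pem as above) could look like this:
openssl req -in new_csr.pem -noout -text -verify
The -verify option checks the self-signature on the request, while -text prints the subject fields so the organizational units and the common name can be reviewed before uploading.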
Personal Authentication
After the successful registration on the website, please visit the NIIF CA Registration Office personally with a copy of the pre-registration datasheet, the Request password and an ID document (ID card or passport).
Address:
- NIIF Iroda
- (RA Administrator)
- Victor Hugo Str. 18-22.
- H-1132 Budapest, HUNGARY
- email: ca (at) niif (dot) hu
- RA opening hours: Monday, 14:00 - 16:30 (CET)
During the authentication, the colleagues of the Registration Office verify the data of the certificate and the user, and after successful identification they take the next steps to create the certificate (there is no need to wait for it on site).
Downloading the certificate
An email will arrive (to the email address given during the request) once the valid certificate has been issued; the certificate can be downloaded by clicking on the URL in the email. The saved certificate does not contain the private key.
If the certificate is installed into the browser, it is advised to export it together with the private key in PKCS#12 format, so there is a common backup of the private key and the certificate. Handle this backup carefully! If the private key is lost or falls into unauthorized hands, immediately request revocation of the certificate at the registration interface "Tanúsítvány visszavonása" (certificate revocation) or at the Registration Office, and inform the people concerned!
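If the certificate and the key are available as files rather than in the browser, a PKCS#12 backup can also be created with openssl. The sketch below assumes the downloaded certificate was saved as usercert.pem and the private key as privkey.pem (these file names are illustrative):
# bundle the certificate and the private key into a password-protected PKCS#12 file
openssl pkcs12 -export -in usercert.pem -inkey privkey.pem -out cert_backup.p12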
Access with GSI-SSH
Users can access the supercomputers using the GSI-SSH protocol.
It requires a machine with a Globus installation that provides the gsissh client.
The needed proxy credentials (derived from the private and public keys) must be created before entering the machine with the
grid-proxy-init
or
arcproxy
commands.
By default, the proxies are valid for 12 hours. It is possible to modify this default value with the following commands:
arcproxy -c validityPeriod=86400
or
grid-proxy-init -hours 24
Both of the previous commands set the validity of the proxy to 24 hours. When using arcproxy, the validity time must be given in seconds.
To enter the site, the
gsissh -p 2222 prace-login.sc.niif.hu
command has to be used.
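A typical login session therefore consists of creating the proxy, optionally checking its remaining lifetime, and connecting. The sketch below uses the standard Globus grid-proxy-info command to inspect the proxy; the host name is the PRACE login node mentioned above:
grid-proxy-init -hours 24      # create a proxy valid for 24 hours
grid-proxy-info -timeleft      # show the remaining lifetime in seconds
gsissh -p 2222 prace-login.sc.niif.hu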
GridFTP file transfer
In order to use GridFTP for file transfer, one needs a GridFTP client program that provides the interface between the user and a remote GridFTP server. There are several clients available for GridFTP, one of which is globus-url-copy, a command line tool which can transfer files using the GridFTP protocol as well as other protocols such as http and ftp. globus-url-copy is distributed with the Globus Toolkit and usually available on machines that have the Globus Toolkit installed.
Syntax
globus-url-copy [options] sourceURL destinationURL
- [options] The optional command line switches as described later.
- sourceURL The URL of the file(s) to be copied. If it is a directory, it must end with a slash (/), and all files within that directory will be copied.
- destinationURL The URL to which to copy the file(s). To copy several files to one destination URL, destinationURL must be a directory and be terminated with a slash (/).
Globus-url-copy supports multiple protocols, so the format of the source and destination URLs can be either
file://path
when you refer to a local file or directory or
protocol://host[:port]/path
when you refer to a remote file or directory.
globus-url-copy also supports other protocols such as http, https, ftp and gsiftp.
- Example:
globus-url-copy file://task/myfile.c gsiftp://prace-login.sc.niif.hu/home/prace/pr1hrocz/myfile.c
This command uploads the myfile.c file from the local task folder to the home folder of the pr1hrocz user on the remote machine.
Command line options for globus-url-copy [options]
- -help Prints usage information for the globus-url-copy program.
- -version Prints the version of the globus-url-copy program.
- -vb During the transfer, displays: (1) number of bytes transferred (2) performance since the last update (every 5 seconds) (3) average performance for the whole transfer
The following table lists parameters which you can set to optimize the performance of your data transfer:
- -tcp-bs <size> Specifies the size (in bytes) of the TCP buffer to be used by the underlying GridFTP data channels.
- -p <number of parallel streams> Specifies the number of parallel streams to be used in the GridFTP transfer.
- -stripe Use this parameter to initiate a “striped” GridFTP transfer that uses more than one node at the source and destination. As multiple nodes contribute to the transfer, each using its own network interface, a larger amount of the network bandwidth can be consumed than with a single system. Thus, at least for “big” (> 100 MB) files, striping can considerably improve performance.
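As an illustration of these options, the upload example above could be combined with parallel streams, a larger TCP buffer and progress reporting; the values below (4 streams, 4 MB buffer) are only illustrative and should be tuned for the actual network:
globus-url-copy -vb -p 4 -tcp-bs 4194304 \
    file://task/myfile.c gsiftp://prace-login.sc.niif.hu/home/prace/pr1hrocz/myfile.c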
Usage of the SLURM scheduler
Scheduling on the HPC machines is CPU-hour based. This means that the available core hours are divided between users on a monthly basis. All UNIX users are connected to one or more accounts. Each scheduler account is connected to an HPC project and a UNIX group. HPC jobs can only be submitted by using one of the accounts. The core hours are calculated by multiplying the wall time (time spent running the job) by the number of CPU cores requested. For example, reserving 2 nodes (48 CPU cores) at the NIIFI SC for 30 minutes gives 48 * 30 = 1440 core minutes = 24 core hours. Core hours are measured between the start and the end of the jobs.
It is very important to make sure the application maximally uses the allocated resources. An empty or non-optimal job consumes the allocated core time very fast. If the account runs out of the allocated time, no new jobs can be submitted until the beginning of the next accounting period. Account limits are regenerated at the beginning of each month.
Information about an account can be listed with the following command:
sbalance
Example
After executing the command, the following table shows up for Bob. The user can access and run jobs using two different accounts (foobar, barfoo). He can see his name marked with * in the table. He shares both accounts with alice (Account column). The core hours consumed by each user are displayed in the second column (Usage), and the consumption of all jobs run under the account is displayed in the fourth column. The last two columns show the allocated limit (Account Limit) and the time still available on the machine (Available).
Scheduler Account Balance
---------- ----------- + ---------------- ----------- + ------------- -----------
User             Usage |          Account       Usage | Account Limit   Available (CPU hrs)
---------- ----------- + ---------------- ----------- + ------------- -----------
alice                0 |           foobar           0 |             0           0
bob *                0 |           foobar           0 |             0           0
bob *                7 |           barfoo           7 |         1,000         993
alice                0 |           barfoo           7 |         1,000         993
Estimating core time
Before production runs, it is advised to have a core time estimate. The following command can be used to get an estimate:
sestimate -N NODES -t WALLTIME
where NODES is the number of nodes to be reserved, and WALLTIME is the maximal time spent running the job.
It is important to provide the core time to be reserved as precisely as possible, because the scheduler queues the jobs based on this value. Generally, a job with a shorter requested time will start sooner. It is advised to check the time actually used after completion with the sacct command.
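For instance, the elapsed wall time and consumed CPU time of a finished job can be queried with standard sacct format fields (JOBID stands for the job's identifier):
sacct -j JOBID --format=JobID,JobName,NNodes,NCPUS,Elapsed,CPUTime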
Example
Alice wants to reserve 2 nodes for 2 days and 10 hours, so she checks whether she has enough time on her account.
sestimate -N 2 -t 2-10:00:00
Estimated CPU hours: 2784
Unfortunately, she cannot afford to run this job.
Status information
Jobs in the queue can be listed with the squeue command; the status of the cluster can be retrieved with the sinfo command. Every submitted job gets a JOBID. The properties of a job can be retrieved by using this ID. Status of a running or waiting job:
scontrol show job JOBID
Every job is inserted into an accounting database. The properties of completed jobs can be retrieved from this database. Detailed statistics can be viewed with the following command:
sacct -l -j JOBID
Memory used can be retrieved by using
smemory JOBID
Disk usage can be retrieved by this command:
sdisk JOBID
Example
There are 3 jobs in the queue. The first is an array job waiting for resources (PENDING). The second is an MPI job that has been running on 4 nodes for 25 minutes. The third is an OMP run on one node that has just started. The NAME of a job can be chosen freely; it is advised to use short, informative names.
squeue -l
Wed Oct 16 08:30:07 2013
     JOBID PARTITION  NAME   USER    STATE    TIME  TIMELIMIT  NODES NODELIST(REASON)
591_[1-96]    normal array  alice  PENDING    0:00      30:00      1 (None)
       589    normal   mpi    bob  RUNNING   25:55    2:00:00      4 cn[05-08]
       590    normal   omp  alice  RUNNING    0:25    1:00:00      1 cn09
This two-node batch job had a typical load of 10 GB virtual and 6.5 GB RSS memory per node.
smemory 430

 MaxVMSize  MaxVMSizeNode  AveVMSize     MaxRSS MaxRSSNode     AveRSS
---------- -------------- ---------- ---------- ---------- ----------
 10271792K           cn06  10271792K   6544524K       cn06   6544524K
 10085152K           cn07  10085152K   6538492K       cn07   6534876K
Checking jobs
It is important to be sure the application fully uses the core time reserved. A running application can be monitored with the following command:
sjobcheck JOBID
Example
This job runs on 4 nodes. The LOAD group provides information about the general load of the machine, which is more or less equal to the number of cores. The CPU group gives information about the exact usage. Ideally, the values in the User column are over 90. If the value is below that, there is a problem with the application, or it is not optimal, and the run should be ended. This example job is fully using ("maxing out") the available resources.
Hostname                  LOAD                            CPU                       Gexec
 CPUs (Procs/Total)  [     1,     5, 15min]  [  User,  Nice, System,  Idle,   Wio]
cn08  24 (   25/  529)  [ 24.83, 24.84, 20.98]  [  99.8,   0.0,    0.2,   0.0,   0.0] OFF
cn07  24 (   25/  529)  [ 24.93, 24.88, 20.98]  [  99.8,   0.0,    0.2,   0.0,   0.0] OFF
cn06  24 (   25/  529)  [ 25.00, 24.90, 20.97]  [  99.9,   0.0,    0.1,   0.0,   0.0] OFF
cn05  24 (   25/  544)  [ 25.11, 24.96, 20.97]  [  99.8,   0.0,    0.2,   0.0,   0.0] OFF
Checking licenses
The licenses used and available can be retrieved with this command:
slicenses
Checking downtime
During downtime periods, the scheduler does not start new jobs, but jobs can still be submitted. The periods can be retrieved with the following command:
sreservations
Running jobs
Applications on the HPC machines can only be run in batch mode. This means every run must have a job script containing the resources and commands needed. The parameters of the scheduler (resource definitions) can be given with the #SBATCH directive. A comparison of the schedulers and the directives available in SLURM can be found in this table.
Obligatory parameters
The following parameters are obligatory to provide:
#!/bin/bash
#SBATCH -A ACCOUNT
#SBATCH --job-name=NAME
#SBATCH --time=TIME
where ACCOUNT is the name of the account to use (available accounts can be retrieved with the sbalance command), NAME is the short name of the job, and TIME is the maximum walltime in DD-HH:MM:SS syntax.
The following command submits jobs:
sbatch jobscript.sh
If the submission is successful, the following output appears:
Submitted batch job JOBID
where JOBID is the unique ID of the job.
The following command cancels the job:
scancel JOBID
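Putting these together, a typical workflow is to submit the script, check the job's state while it waits or runs, and cancel it if needed. A short sketch (JOBID stands for the ID returned by sbatch):
sbatch jobscript.sh          # returns: Submitted batch job JOBID
squeue -j JOBID              # show the state of this job in the queue
scancel JOBID                # cancel the job if necessary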
Queues
Two non-overlapping queues (partitions) are available on the supercomputers: the test queue and the prod queue. The latter is for production runs, while the former can be used for development and testing. In the test queue at most 1 node can be reserved, for a maximum of half an hour. The default queue is prod. The test partition can be selected with the following directive:
#SBATCH --partition=test
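For example, a short test run could be submitted with a script like the one below (the account name foobar and the half-hour limit follow the examples used elsewhere in this guide; the a.out binary is a placeholder):
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=test
#SBATCH --partition=test
#SBATCH -N 1
#SBATCH --time=00:30:00
./a.out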
Quality of service (QOS)
It is also possible to submit low-priority jobs. Such jobs can be interrupted by any normal-priority job at any time; in exchange, only half of the consumed core time is charged. Interrupted jobs are automatically rescheduled. It is important to submit with low priority only jobs that can survive random interruptions, regularly save their state (checkpoint) and can restart quickly from it. The default quality of service is normal, i.e. the run cannot be interrupted.
The low priority can be selected with the following directive:
#SBATCH --qos=lowpri
Memory settings
By default, 1000 MB of memory is allocated to each CPU core; more can be requested with the following directive:
#SBATCH --mem-per-cpu=MEMORY
where MEMORY is given in MB. In Budapest the maximum memory per core is 2600 MB.
Email notification
To have mail sent when the state of the job changes (start, stop, error):
#SBATCH --mail-type=ALL
#SBATCH --mail-user=EMAIL
where EMAIL is the email address to notify.
Array jobs (arrayjob)
Array jobs are needed when a single-threaded (serial) application has to be run in many instances at once (with different data). For each instance the scheduler stores a unique identifier in the SLURM_ARRAY_TASK_ID environment variable. By querying this variable the tasks of the array job can be distinguished. The output of each task is written to the slurm-SLURM_ARRAY_JOB_ID-SLURM_ARRAY_TASK_ID.out file. The scheduler fills the nodes using tight packing. In this case, too, it is advisable to choose the number of tasks as a multiple of the number of processors. Further information
Example
User alice submits 96 serial jobs charged to the foobar account, for a maximum of 24 hours. The #SBATCH --array=1-96 directive indicates that this is an array job. The application must be started with the srun command; in this case it is a shell script.
#!/bin/bash
#SBATCH -A foobar
#SBATCH --time=24:00:00
#SBATCH --job-name=array
#SBATCH --array=1-96
srun envtest.sh
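The script started by srun can use SLURM_ARRAY_TASK_ID to select its own input. The envtest.sh below is only an illustrative sketch (the input file naming and the serial_app binary are hypothetical, not part of the original example):
#!/bin/bash
# hypothetical envtest.sh: each array task processes its own input file
INPUT="input_${SLURM_ARRAY_TASK_ID}.dat"
echo "task ${SLURM_ARRAY_TASK_ID} running on $(hostname), input: ${INPUT}"
./serial_app "${INPUT}"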
MPI jobs
For MPI jobs the number of MPI processes started per node must also be given (#SBATCH --ntasks-per-node=). In the most common case this is the number of CPU cores in a node. The parallel program must be started with the mpirun command.
Example
User bob reserves 2 nodes for 12 hours for an MPI job, charged to the barfoo account. 24 MPI processes will be started on each node. The stdout output of the program is redirected to the slurm.out file (#SBATCH -o).
#!/bin/bash
#SBATCH -A barfoo
#SBATCH --job-name=mpi
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH -o slurm.out
mpirun ./a.out
CPU binding
The performance of MPI programs can usually be improved by binding the processes to CPU cores. In that case the operating system does not migrate the threads of the parallel program between CPU cores, so memory locality may improve (fewer cache misses). Using binding is recommended. Tests should be run to determine which binding strategy gives the best result for a given application. The following settings apply to the OpenMPI environment. Detailed information about the bindings can be obtained with the --report-bindings MPI option. Next to each launch command, a few lines of the detailed binding information are also shown. It is important not to use the task binding of the scheduler!
Binding per CPU core
In this case the MPI ranks fill up the CPU cores in order.
Launch command: mpirun --bind-to-core --bycore
[cn05:05493] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 1 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 2 bound to socket 0[core 2]: [. . B . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05493] MCW rank 3 bound to socket 0[core 3]: [. . . B . . . . . . . .][. . . . . . . . . . . .]
Binding per CPU socket
In this case the MPI ranks fill up the CPU sockets alternately.
Launch command: mpirun --bind-to-core --bysocket
[cn05:05659] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 1 bound to socket 1[core 0]: [. . . . . . . . . . . .][B . . . . . . . . . . .]
[cn05:05659] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05659] MCW rank 3 bound to socket 1[core 1]: [. . . . . . . . . . . .][. B . . . . . . . . . .]
Binding per node
In this case the MPI ranks fill up the nodes alternately. At least 2 nodes must be reserved.
Launch command: mpirun --bind-to-core --bynode
[cn05:05904] MCW rank 0 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn05:05904] MCW rank 2 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 1 bound to socket 0[core 0]: [B . . . . . . . . . . .][. . . . . . . . . . . .]
[cn06:05969] MCW rank 3 bound to socket 0[core 1]: [. B . . . . . . . . . .][. . . . . . . . . . . .]
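In a job script the chosen binding flags are simply added to the mpirun line. The sketch below reuses the MPI example from above with socket binding; which of the three strategies performs best must be checked case by case:
#!/bin/bash
#SBATCH -A barfoo
#SBATCH --job-name=mpi
#SBATCH -N 2
#SBATCH --ntasks-per-node=24
#SBATCH --time=12:00:00
#SBATCH -o slurm.out
mpirun --bind-to-core --bysocket --report-bindings ./a.out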
OpenMP (OMP) jobs
For OpenMP parallel applications, 1 node must be reserved and the number of OMP threads must be given in the OMP_NUM_THREADS environment variable. The variable must either be written in front of the application (see the example) or exported before the command that starts the application:
export OMP_NUM_THREADS=24
Example
User alice starts a 24-thread OMP application charged to the foobar account, for a maximum of 6 hours.
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=omp
#SBATCH --time=06:00:00
#SBATCH -N 1
OMP_NUM_THREADS=24 ./a.out
Hybrid MPI-OMP jobs
We speak of hybrid MPI-OMP mode when the parallel application uses both MPI and OMP. It is worth knowing that the MKL calls of programs linked with Intel MKL are OpenMP capable. In general the following distribution is recommended: the number of MPI processes per node should be between 1 and the number of CPU sockets in a node, and the number of OMP threads accordingly the total number of CPU cores in a node, or half or a quarter of it (as appropriate). The parameters of the two modes above must be combined in the job script.
Example
User alice submits a hybrid job for 8 hours on 2 nodes, charged to the foobar account. Only 1 MPI process runs on each node at a time, using 24 OMP threads per node. On the 2 machines a total of 2 MPI processes and 2 x 24 OMP threads run.
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=mpiomp
#SBATCH -N 2
#SBATCH --time=08:00:00
#SBATCH --ntasks-per-node=1
#SBATCH -o slurm.out
export OMP_NUM_THREADS=24
mpirun ./a.out
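To illustrate the recommended distribution, an alternative split would be one MPI process per CPU socket with half as many OMP threads each. The sketch below assumes 2 sockets and 24 cores per node (these counts are assumptions based on the examples above):
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=mpiomp
#SBATCH -N 2
#SBATCH --time=08:00:00
#SBATCH --ntasks-per-node=2
#SBATCH -o slurm.out
# 2 MPI processes per node (one per socket), 12 OMP threads each
export OMP_NUM_THREADS=12
mpirun ./a.out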
Maple Grid jobs
Maple can be run on 1 node, similarly to OMP jobs. To use it, the maple module must also be loaded. Maple works in client-server mode, therefore the grid server must be started before running the Maple job (${MAPLE}/toolbox/Grid/bin/startserver). This application requires a license, which must be specified in the job script (#SBATCH --licenses=maplegrid:1). The Maple job must be launched with the ${MAPLE}/toolbox/Grid/bin/joblauncher command.
Example
User alice starts the Maple Grid application for 6 hours, charged to the foobar account:
#!/bin/bash
#SBATCH -A foobar
#SBATCH --job-name=maple
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --time=06:00:00
#SBATCH -o slurm.out
#SBATCH --licenses=maplegrid:1
${MAPLE}/toolbox/Grid/bin/startserver
${MAPLE}/toolbox/Grid/bin/joblauncher ${MAPLE}/toolbox/Grid/samples/Simple.mpl