Requesting GPUs

To request a GPU slot for a job, simply add the request_gpus attribute (along with request_cpus and request_memory) to the submit file.

Currently, a job can only use one GPU at a time.

request_gpus = 1
+RequiresGPU=1
request_cpus = 1
request_memory = 2 GB
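
For context, here is a minimal sketch of how these attributes fit into a complete submit file. The executable name and the output/error/log file names are placeholders for illustration; substitute your own:

universe = vanilla
# Placeholder wrapper script; replace with your own executable
executable = run_gpu_job.sh
request_gpus = 1
+RequiresGPU=1
request_cpus = 1
request_memory = 2 GB
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = job.out
error = job.err
log = job.log
queue 1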

Note that the number of GPU resources in CMS is still limited at present, so matching can take longer than for regular (CPU) jobs.

It is currently not possible to specify exactly which type of GPU you want, but you can match on, for example, CUDA compute capability. To do so, use a requirements expression like the following in your job:

requirements = CUDACapability > 3
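
If you want to see which slots in the pool currently advertise a matching capability, you can query them with condor_status. The GPU attribute names used below (CUDACapability, CUDADeviceName) are the ones HTCondor's GPU discovery typically advertises, but they can vary by site, so treat this as a sketch:

# List machines advertising a GPU with compute capability greater than 3
$ condor_status -constraint 'CUDACapability > 3' -autoformat Machine CUDACapability CUDADeviceName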

You can find an example of how to use GPUs with TensorFlow workflows here.

Using GPU Resources Interactively

The following example submits a sleep job that requests a GPU and transfers a Singularity wrapper script to set up TensorFlow interactively (after the user accesses the node via SSH).

The current example uses GPUs from UCSD and Nebraska.

To download the GPUs tutorial, type:

$ tutorial interactivegpus
Installing interactivegpus (master)...
Tutorial files installed in ./tutorial-interactivegpus.
Running setup in ./tutorial-interactivegpus...
[login ~]$ cd tutorial-interactivegpus/
[login tutorial-interactivegpus]$
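
The tutorial directory contains the submit file and wrapper scripts used in the steps below. The listing here is inferred from those steps rather than copied from a live session, so the exact contents may differ slightly:

# Illustrative listing; actual contents may differ slightly
[login tutorial-interactivegpus]$ ls
connect_wrapper.sh  request_gpus.jdl  singularity_wrapper.sh  sleep.sh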

The job will execute for about 8 hours by default. The maximum walltime allowed is 48 hours.

To change how long the resource remains available, edit the following parameter in request_gpus.jdl:

# Walltime for job in minutes
# Default ~8 hours
+MaxWallTimeMins = 500
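
For example, to request the maximum allowed walltime of 48 hours, set the value to 48 * 60 = 2880 minutes:

# Maximum allowed walltime (48 hours = 2880 minutes)
+MaxWallTimeMins = 2880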

To get SSH access to the job, follow the instructions below:

Using condor_ssh_to_job

# 1. Submit request
$ condor_submit request_gpus.jdl
Submitting job(s).
1 job(s) submitted to cluster 4967783.

# 2. Wait until the job starts running (R state); this can take a few minutes
#    depending on the availability of resources. Monitor via condor_q.
$ condor_q 4967783.0

-- Schedd: login.uscms.org : <192.170.227.118:9618?... @ 11/16/18 08:47:57
ID          OWNER      SUBMITTED     RUN_TIME   ST PRI SIZE CMD
4967783.0   khurtado   11/16 08:46   0+00:00:09 R  0   0.0  connect_wrapper.sh sleep.sh

# 3. Now, access it via ssh:
$ condor_ssh_to_job 4967783.0
Welcome to slot1_1@glidein_36930_830805504@cgpu-1.t2.ucsd.edu!
Your condor job is running with pid(s) 29916.
[0849] cuser6@cgpu-1 /data1/condor_local/execute/dir_36926/glide_hYQHlu/execute/dir_29832$

# 4. Now, enter the container shell
./singularity_wrapper.sh bash

# 5. Start using tensorflow!

Below is a quick example, downloaded from a user's public stash area, that uses TensorFlow with the GPU resource:

Executing tensorflow

# 1. Enter singularity container shell
[0851] cuser6@cgpu-1 /data1/condor_local/execute/dir_36926/glide_hYQHlu/execute/dir_29832$ ./singularity_wrapper.sh bash
cuser6@cgpu-1:~$ cat /etc/issue
Ubuntu 16.04.5 LTS \n \l
 
# 2. Download an example from stash
cuser6@cgpu-1:~$ wget http://stash.osgconnect.net/~khurtado/tensorflow/tf_matmul.py
--2018-11-16 16:53:52--  http://stash.osgconnect.net/~khurtado/tensorflow/tf_matmul.py
Resolving stash.osgconnect.net (stash.osgconnect.net)... 192.170.227.197
Connecting to stash.osgconnect.net (stash.osgconnect.net)|192.170.227.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 578 [application/octet-stream]
Saving to: 'tf_matmul.py'
 
tf_matmul.py                              100%[=====================================================================================>]     578  --.-KB/s    in 0s
 
2018-11-16 16:53:52 (60.5 MB/s) - 'tf_matmul.py' saved [578/578]
 
# 3. Execute example
cuser6@cgpu-1:~$ python3 tf_matmul.py
2018-11-16 16:56:37.914379: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-16 16:56:41.677599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:85:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-16 16:56:41.678089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-16 16:57:35.059743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 16:57:35.062858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-11-16 16:57:35.062898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-11-16 16:57:35.063932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1
2018-11-16 16:57:35.291222: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1
 
2018-11-16 16:57:35.298476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-11-16 16:57:35.298533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-16 16:57:35.298552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-11-16 16:57:35.298571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-11-16 16:57:35.298746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10407 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1
2018-11-16 16:57:35.298929: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:85:00.0, compute capability: 6.1
 
MatrixInverse: (MatrixInverse): /job:localhost/replica:0/task:0/device:GPU:0
2018-11-16 16:57:35.301525: I tensorflow/core/common_runtime/placer.cc:935] MatrixInverse: (MatrixInverse)/job:localhost/replica:0/task:0/device:GPU:0
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-11-16 16:57:35.301561: I tensorflow/core/common_runtime/placer.cc:935] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-11-16 16:57:35.301604: I tensorflow/core/common_runtime/placer.cc:935] Const: (Const)/job:localhost/replica:0/task:0/device:GPU:0
result of matrix multiplication
===============================
[[ 1.0000000e+00  0.0000000e+00]
 [-4.7683716e-07  1.0000002e+00]]
===============================
cuser6@cgpu-1:~$
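
When you are finished, leave the container shell and the SSH session, and remove the sleep job so the GPU slot is released; the cluster ID is the one reported by condor_submit above:

# Leave the container shell, then the condor_ssh_to_job session
$ exit
$ exit
# Back on the login node, remove the sleep job to free the GPU slot
$ condor_rm 4967783.0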

Using Stashcp and XRootD from OSG TensorFlow Containers

OSG-built TensorFlow containers for CPUs/GPUs are based on Ubuntu 16.04. You can follow the example below to use xrdcp and stashcp from inside these containers:

# Load the software
$ export LD_LIBRARY_PATH=/opt/xrootd/lib:$LD_LIBRARY_PATH
$ export PATH=/opt/xrootd/bin:/opt/StashCache/bin:$PATH
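
A quick way to check that both tools are now picked up from the /opt locations above (assuming the container layout described here) is:

# Verify both tools resolve to the paths added above
$ which xrdcp stashcp
/opt/xrootd/bin/xrdcp
/opt/StashCache/bin/stashcp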

Now, to copy a file located at /stash/user/khurtado/work/test.txt with either of these tools, you can do:

$ xrdcp root://stash.osgconnect.net:1094//user/khurtado/work/test.txt .
$ stashcp /user/khurtado/work/test.txt .
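
In general, replace khurtado and the work/test.txt path with your own username and file. For instance, if your stash username matches $USER on the login node (an assumption; check your own area), the stashcp form becomes:

# Hypothetical path: substitute your own username and file
$ stashcp /user/$USER/work/test.txt .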

You can also copy files (e.g., tarballs) via wget by following this guide.

-- CarlLundstedt - 2022-12-01
