CudaService: Using GPUs to accelerate tasks in CMSSW

Github repository

Until merging into the main CMSSW repository: https://github.com/Oblynx/cmssw/tree/cudaService

All references to files in this page correspond to this repository/branch.

Goal of this Page

This is a tutorial on how to use GPUs in the CMS offline software. For a more comprehensive example, check the CudaService integration test in the CMSSW code: FWCore/Services/test/integrationTest_cudaService_cfg.py. For particular use cases, consult the unit tests: FWCore/Services/test/test_cudaService_gcc.cppunit.cc and test_cudaService.cu.

Introduction

CUDA

GPUs are massively parallel hardware and a good target for executing data-parallel algorithms. Their programming model differs from that of CPUs, however: NVidia GPUs are programmed in a C language extension called CUDA. Using CUDA involves writing the algorithm that will be executed on the GPU in a specially annotated C function, called a kernel, and then launching that kernel through a special API.

Kernels and some native CUDA syntax cannot be parsed by standard C++ compilers and instead require NVidia's compiler, NVCC. This tool is provided with CMSSW and is automatically invoked to compile any source file with the extension .cu (instead of .cc or .cpp).

This tutorial will cover how to run GPU code in CMSSW; writing CUDA kernels is out of this tutorial's scope and it is assumed that the reader knows the basics of CUDA (like grid/block sizes for kernel launch configurations).

CudaService

The CudaService is a CMSSW Service that allows running CUDA code from EDM modules. It provides a task-based interface: the user submits a task to the service and receives an std::future handle through which the result can be retrieved asynchronously. The task should be a data-parallel algorithm implemented in a CUDA kernel, ideally accompanied by a CPU fallback version in case the executing machine doesn't have a GPU, along with its data.

Summary of Use

Using the service requires the following actions (in summary):

  • Declaring the CudaService in the job's python configuration file.
  • In the package's BuildFile.xml, declaring a dependency on the "FWCore/Services" and "cuda" packages.
  • Writing the relevant kernels (and kernel wrappers) in a .cu file in the same package's library (the src/ folder).
  • In the actual C++ code of the module where the GPU code is going to be called:
    • Include the "FWCore/Services/interface/cuda_service.h" header.
    • Declare the kernel wrappers that have been defined in the .cu file (or create a local header file for them).
    • Declare a service smart pointer to CudaService: edm::Service<CudaService> cudaServiceHandle;
    • Use CudaService's cudaLaunch() API to launch tasks.
More details are given in the following sections; a minimal sketch of how these pieces fit together is shown below.
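The following is a hypothetical, minimal sketch of such a producer method. All names here (MyProducer, simpleKernel, simpleWrapper, the problem size) and the cudaPointer allocation syntax are illustrative assumptions, not part of the service; the integration test referenced above is the authoritative example.

// Declaration of the kernel wrapper defined in this package's .cu file (illustrative name):
void simpleWrapper(bool GPUpresent, cuda::ExecutionPolicy execPol, const float* in, float* out);

void MyProducer::produce(edm::Event& event, const edm::EventSetup& setup) {
  edm::Service<CudaService> cudaService;   // handle to the framework-owned service instance
  unsigned int n = 1000;                   // problem size (illustrative)
  cudaPointer<float[]> in(n), out(n);      // GPU-managed data; allocation syntax assumed (see the cudaPointer section)
  // ... fill `in` from the event ...
  // Launch configuration for n threads; in practice this call may belong in the .cu file,
  // since the kernel symbol is only visible to NVCC:
  auto execPol = cuda::AutoConfig()(n, (void*)simpleKernel);
  auto result = cudaService->cudaLaunch(execPol, simpleWrapper, in, out);   // returns an std::future
  result.get();                            // block until the GPU task (or its CPU fallback) has finished
  // ... read `out` and put the product into the event ...
}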

Using CudaService

The user of the CudaService system needs to understand the following concepts:

  • Kernels and Kernel Wrappers
  • The cudaLaunch() method
  • The cudaPointer class, a smart pointer for managing GPU data

Kernels, Wrappers and CPU fallbacks

The first thing the user has to do is, of course, to write the GPU kernels that will be launched with the service. The kernels have to be written in a separate .cu file (automatically compiled with NVCC), in a library that will be linked to the code that calls them. However, CudaService, being native C++ code, cannot express the actual kernel launch syntax -- it is NVCC-specific. For that reason, and to make supplying the (optional but recommended) CPU fallback easier, the user has to write a wrapper for each kernel which simply calls the kernel. The "wrapper recipe" is always the same, no matter how complex the kernel is:

__global__ void simpleKernel(const float* in, float* out) {...}
void simpleCPUfallback(const float* in, float* out) {...}
void simpleWrapper(bool GPUpresent, cuda::ExecutionPolicy execPol, const float* in, float* out){
   if(GPUpresent) simpleKernel<<<execPol.getGridSize(), execPol.getBlockSize()>>>(in, out);
   else           simpleCPUfallback(in, out);
}
// Even if the kernel is very complex:
__global__ void complexKernel(int param1, float param2, const double* in1, const double* in2, double* out1, int* out2) {...}
void complexCPUfallback(int param1, float param2, const double* in1, const double* in2, double* out1, int* out2) {...}
void complexWrapper(bool GPUpresent, cuda::ExecutionPolicy execPol, int param1, float param2,
                    const double* in1, const double* in2, double* out1, int* out2){
   if(GPUpresent) complexKernel<<<execPol.getGridSize(), execPol.getBlockSize()>>>(param1, param2, in1, in2, out1, out2);
   else           complexCPUfallback(param1, param2, in1, in2, out1, out2);
}

So, the wrappers accept:

  1. A bool argument passed by CudaService, which determines whether the kernel or the fallback should be used.
  2. An object that determines the kernel launch configuration.
  3. The kernel's arguments.
Then they launch the kernel and forward the arguments. The above recipe should be used for all kernels.

cudaLaunch()

The CudaService class provides a simple API to launch GPU tasks:

 CudaService::cudaLaunch(launchType, kernelWrapper, args...)

The parameters are:

  1. launchType:
    • An explicit cuda::ExecutionPolicy object: for providing a manual launch configuration. This is necessary in case of a multidimensional Grid or Block.
    • unsigned int: The size of the problem (or the total threads the kernel should be run with), in case of automatic launch configuration. To perform an automatic launch configuration, a different flavour of the wrapper should be used. For examples, see the unit tests in the repository.
  2. kernelWrapper: A user-provided function with the syntax that was previously demonstrated.
  3. args...: The arguments that should be passed to the kernel.
Currently the kernels can accept 3 kinds of C++ types as arguments:
  • Arithmetic types: int, float, ...
  • Pointers to arithmetic types
  • Pointers to user-defined structs that comprise only arithmetic types.
    • [FUTURE]: structs whose members are pointers or cudaPointers to arithmetic types (see next section). Currently, structures that include pointers or cudaPointers don't work (Array-of-Structs is possible, Structure-of-Arrays is not).
It should be stressed that the kernel can only receive fundamental, arithmetic types by value. Structs/classes that contain only arithmetic members need to be passed by pointer. STL data structures, like std::vector or std::map, are currently not supported by CUDA. A workaround using the Thrust library is a future goal of this system.
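The following fragment illustrates these restrictions; the struct and kernel names are hypothetical and only serve as examples of what is and is not accepted:

struct Hit    { float x, y, z; int id; };   // only arithmetic members: may be passed to a kernel by pointer
struct HitSoA { float* x; float* y; };      // contains pointers: currently NOT supported as a kernel argument

// OK: arithmetic types by value, pointers to arithmetic types, pointer to an all-arithmetic struct:
__global__ void processHits(int n, float threshold, const Hit* hits, float* out) {...}

// NOT OK: a struct containing pointers (Structure-of-Arrays) or STL containers as arguments:
// __global__ void processHitsSoA(HitSoA hits, std::vector<float> data) {...}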

cudaPointer

This is a smart pointer class that implicitly manages the GPU memory allocation and migration of data between the CPU and the GPU, wherever they are needed. It acts like an std::unique_ptr, in that it contains a pointer of type T* (where T is an arithmetic type), that is allocated at the object's construction and freed at the object's destruction (RAII). Thus, it maintains ownership of its data throughout its lifetime. It is also portable, as it will allocate and deallocate memory using the GPU unified memory system if a GPU is available, or using the standard C++ new and delete otherwise.

Due to the way the GPU Unified Memory system works, it is illegal to access from the CPU cudaPointers that have been submitted to a kernel before getting the result of the computation. This is also the case for read-only input data, unfortunately (until unified memory is improved by NVidia).
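As a sketch of the required ordering (the names are taken from the surrounding examples and the element access on cudaPointer is an assumption, modelled on std::unique_ptr<T[]>):

auto task = cudaServicePtr->cudaLaunch(execPol, kernelWrapper, data);
// float x = data[0];   // ILLEGAL here: `data` is attached to the GPU stream while the task runs
task.get();             // wait for the result
float y = data[0];      // legal again: the data is visible to the CPU once the task has completed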

There are 2 different versions of cudaPointers, modelled after std::unique_ptr:

  1. cudaPointer<T>: pointers to single data elements
  2. cudaPointer<T[]>: pointers to arrays of type T elements
For every kernel argument that is a pointer to an arithmetic type, a cudaPointer object needs to be given to the cudaLaunch() call in its place. For example:
__global__ void kernel(int param, float* array) {...}

would be called with:

int param;
cudaPointer<float> array;
// <initialize param, array>...
cudaServicePtr->cudaLaunch(execPol, kernelWrapper, param, array);

Internally, the service takes care to pass to the kernel wrapper only the float* contained in this example's cudaPointer<float>. When passed to the kernel and when released, cudaPointer objects change their "CUDA stream attachment", that is, their visibility scope between the host and the current stream. This also provides a mechanism to check for illegal attempts to access the cudaPointer's data from the CPU before the GPU task returns, without causing a bus error.

cuda::ExecutionPolicy

These are simple structs that store the kernel launch parameters: grid size, block size and shared memory size. They can be created explicitly by the user and passed to CudaService at the kernel launch, or implicitly by the service itself from the total number of threads the kernel should be executed with (the problem size), provided the kernel expects a single-dimensional configuration. For multidimensional launches, the ExecutionPolicy has to be created explicitly. There is a factory functor, cuda::AutoConfig()(<problem size>, <kernel>), which uses a heuristic that attempts to minimize idling hardware resources during the kernel's execution. Example:

auto execPol = cuda::AutoConfig()(n, (void*)kernel);

ExecutionPolicy and automatic configuration have been taken from Mark Harris' HEMI (open source on Github).

CudaService Architecture

CudaService is a thread pool that fulfills a GPU management role.

Lifecycle

As a CMSSW service, CudaService is meant to be constructed only implicitly by the Framework and never explicitly by the physics code. In fact, it assumes that there is never more than one instance and provides only a default constructor, with deleted copy and move semantics. It is meant to be shared by all running threads and is thus thread-safe.

The constructor checks the number of GPUs present on the system and registers callbacks with the CMSSW ActivityRegistry to start and stop the pool's worker threads when the Framework signals the postBeginJob and postEndJob transitions, respectively. If the constructor is called a second time, a runtime exception is thrown [release version only].
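A simplified sketch of what such a constructor looks like; the member names (gpuCount_, startWorkers, stopWorkers) are assumptions for illustration, not the actual implementation:

CudaService::CudaService(const edm::ParameterSet& conf, edm::ActivityRegistry& registry) {
  cudaGetDeviceCount(&gpuCount_);                                // how many GPUs does this machine expose?
  registry.watchPostBeginJob(this, &CudaService::startWorkers);  // spawn the worker threads after beginJob
  registry.watchPostEndJob(this, &CudaService::stopWorkers);     // join them after endJob
}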

State

CudaService implements a producer/consumer model:

  • Incoming tasks are pushed to a tbb::concurrent_bounded_queue
  • Waiting threads consume tasks from the queue
The threads are std::threads and are independent of the TBB scheduler. When the queue is empty, they wait idly on the blocking "pop" method of tbb::concurrent_bounded_queue.
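In essence this is the classic blocking-queue thread pool, roughly as in the simplified sketch below (an illustration of the pattern, not the service's actual code):

#include <functional>
#include "tbb/concurrent_queue.h"

tbb::concurrent_bounded_queue<std::function<void()>> taskQueue;

void workerLoop() {
  std::function<void()> task;
  while (true) {
    taskQueue.pop(task);   // blocks while the queue is empty
    if (!task) break;      // an empty task is used here as a shutdown signal
    task();                // run the GPU-launch (or CPU fallback) task
  }
}

// cudaLaunch() essentially wraps the kernel-wrapper call in a packaged task, pushes it to
// the queue and returns the associated std::future to the caller.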

Examples

Examples for the service's use can be found in the source code's repository: there are unit tests that exemplify particular ways of using the service and there is also an integration test, which is a complete example of writing a piece of physics code that uses the service. The unit tests are all together in 2 files: FWCore/Services/test/test_cudaService_gcc.cppunit.cc and test_cudaService.cu. The integration test, however, encompasses multiple files. You should examine the relevant python configuration script, FWCore/Services/test/integrationTest_cudaService_cfg.py, which points to the other elements of the integration test.

Note about tests in the release version

The unit tests might not compile or run, because they use CudaService in a non-standard way. The current integration test requires altering some physics code (which, of course, has to be undone in the release version). It will therefore either remain only as a reference, or a new integration test will be written that lives entirely in the FWCore/Services/test/ directory.

See also

-- KonstantinosSamarasTsakiris - 2015-09-07

Topic attachments
  • GPU_CMSSW_Service_FINAL.pdf (473.6 K, 2015-09-10): presentation of CudaService to the PH::SFT group