Dr. Alban de Vaucorbeil

Simulations and Modelling of Materials and Structures

How to Easily Port a Material Point Method code to GPU

It is possible to easily translate CPU code to GPU without using CUDA or OpenCL. We did this for our Material Point Method (MPM) code and obtained a near 100x increase in performance. This is how we did it. This article is a summary of our paper “Easily Porting Material Point Method codes to GPU”.

As part of my research, I created an open-source code for the Material Point Method (MPM) called Karamelo. Originally, I wrote Karamelo for multi-CPU using MPI. It was good for small problems, but was slow for large simulations. At that time, a new massive supercomputer was being built at Monash University, where I worked. To my surprise, GPUs, not CPUs, were providing most of its computing power. As scientists, we had to adapt: for our codes to run fast, we had to port them to GPU.

Simple GPU coding with an API

Where to start when programming for GPU? CUDA? The learning curve is very steep. I did not have the time nor the resources to learn it. Of course, I tried to learn a little bit, but I found simple problems to be extremely difficult to code. Ten years ago, I would have had to bite the bullet and learn it if I really wanted to port my code to GPU. But now, there are ways to harness the power and speed of GPUs without having to learn CUDA: by using a GPU API.

Which GPU APIs exist?

Depending on the language that you use, you will have different choices. For a C++ code, you have two choices:

  • OpenACC, which uses directives to tell the compiler where to parallelize loops, just like OpenMP.
  • Kokkos, which is a set of C++ libraries and tools that wrap other CPU and GPU APIs, notably including OpenMP on CPU, and CUDA and OpenMPTarget on GPU.

You could also decide to use another language. One that I really like is Julia, through the JuliaGPU API. It is extremely concise: in most cases, it only requires decorating existing serial code, in a spirit similar to OpenACC. This makes it extremely easy for beginners to use.

So, which one of these options would give the best performance to ease of use ratio?

What is the best GPU API for MPM?

The performance of GPU APIs can vary dramatically from application to application. So, to have an accurate comparison, we used a benchmark: we ported taichi-mpm to OpenACC, Kokkos, and JuliaGPU.

Code length

A short code does not always guarantee a simple code, but the length of a code gives a pretty good idea of its complexity. The following table shows the length of the benchmark code written for each GPU API.

API        Non-empty lines    Characters
OpenACC    214                7292
Kokkos     218                8331
JuliaGPU   222                7125

In all cases, the code is substantially longer than the original 88 lines. This is because Taichi already provides a singular value decomposition (SVD) function that we had to re-implement here. The OpenACC and Kokkos codes are roughly the same length. The Julia version, however, requires the most lines. This is due to two things:

  • Julia requires boilerplate code to configure and launch GPU kernels
  • Julia requires parallelised functions to explicitly take all parameters as arguments.

This can quickly lead to increasingly complex code, which is not desirable. In OpenACC and Kokkos, by contrast, variables can simply be captured from outside the loops.

OpenACC

OpenACC is a programming standard for parallel computing. Like Kokkos, it was designed to simplify parallel programming of heterogeneous CPU/GPU systems. Syntactically, OpenACC is like OpenMP: it uses directives to tell the compiler which loops to parallelize. This makes OpenACC much simpler than Kokkos and JuliaGPU. Here is an example of a standard for loop parallelized using OpenACC.

std::vector<double> x(n_particles);
#pragma acc parallel loop independent present(x, ...)
for (int p = 0; p < n_particles; p++) {
  ...
}
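
The present(...) clause above assumes the data has already been moved to the device. Here is a minimal sketch of how that could be done with a structured data region; the raw pointer and the loop body are placeholders, not taken from our benchmark:

std::vector<double> x(n_particles);
double *xp = x.data();

// Copy xp to the device on entry and back to the host on exit
#pragma acc data copy(xp[0:n_particles])
{
  #pragma acc parallel loop independent present(xp)
  for (int p = 0; p < n_particles; p++) {
    xp[p] += 1.0; // placeholder body
  }
}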

However, OpenACC is quite frustrating. Here, simplicity comes at the expense of control. The problem is that most low-level decisions are offloaded to the compiler, and it is often very difficult to know what the outcome of those decisions is. If we do not understand what the compiler does, how can we efficiently tune the code? Sometimes, OpenACC decides not to run decorated loops on the GPU at all, and the cause is often buried under a pile of cryptic compiler logs. Moreover, many errors are not caught at compile time, but only at runtime. Finally, different compilers require different compile flags.

JuliaGPU

Julia is quick to learn and write. It is similar to Python in many ways and promises to be as fast as C++. With JuliaGPU, one can quickly write and run GPU codes. Here is an example of a parallel for loop written with JuliaGPU:

x = CUDA.zeros(Float64, n_particles)

function P2G(x, ...)
    p = (blockIdx().x - 1)*blockDim().x + threadIdx().x
    ...
end

@krun n_particles P2G(x, ...)

The problem is that JuliaGPU is extremely difficult to debug. This is because:

  • Julia is dynamically typed, so type information can be difficult to assess.
  • Julia uses lazy compilation at runtime, so compilation errors and execution errors can be hard to differentiate.
  • The Julia compiler error messages are often meaningless with GPU code.
  • Debugging tools are simply less advanced and far fewer compared to those of C++.

JuliaGPU is efficient for small codes. But, for large-scale software, the four problems mentioned above are troublesome. This is especially the case for our MPM code Karamelo.

Kokkos

Kokkos is a programming model designed for parallel programming on heterogeneous CPU/GPU systems. It provides an abstraction of both the computation and the allocation and layout of application data. Moreover, it was designed to be hardware-agnostic and multi-platform. Here is an example of a parallel for loop written with Kokkos:

Kokkos::View<double*> x("x", n_particles);
Kokkos::parallel_for(n_particles, KOKKOS_LAMBDA(int p) {
  ...
});

Compared to OpenACC and JuliaGPU, Kokkos has a much steeper initial learning curve. It can be hard to know where to start. However, the Kokkos team has published a great series of video tutorials for all user levels. If you are interested in learning to code with Kokkos, I highly recommend following them.

Coding with Kokkos is much smoother than with either OpenACC or Julia. The compiler warnings are very informative. Moreover, C++ being strongly typed, compilers can catch many errors at compile time. Kokkos works well with all the usual GPU debugging tools such as CUDA-MEMCHECK and NVIDIA Nsight Systems. Standard C++ debuggers can also be used by changing the execution space to the CPU. Note that switching between execution spaces is done at compile time and does not require any code change.
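
As a minimal sketch of what this means in practice, the following program simply reports which execution space a given build uses; nothing in the source has to change when you recompile with a different backend enabled:

#include <Kokkos_Core.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Chosen at compile time from the enabled backends (Cuda, OpenMP, Serial, ...)
    std::cout << "Default execution space: "
              << Kokkos::DefaultExecutionSpace::name() << std::endl;

    // Prints the full Kokkos configuration of this build
    Kokkos::print_configuration(std::cout);
  }
  Kokkos::finalize();
  return 0;
}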

What we like best about Kokkos is the ability to specify almost everything explicitly. Although all low-level decisions can be abstracted away, we retain a large degree of control over how loops are parallelized.

Performance

One of the principal criteria when choosing a GPU API is performance. Therefore, we ran and timed 5000 iterations of taichi-mpm (re-written using these APIs) with varying numbers of particles. Here are the results:

Comparison between OpenACC, JuliaGPU and Kokkos

Surprisingly, Kokkos is as fast as JuliaGPU, while OpenACC is around twice as slow. Again, simplicity has a cost: OpenACC has a significant initial overhead. It is even slower than serial code for small numbers of particles.

Why choose Kokkos?

Compared to OpenACC and JuliaGPU, Kokkos is one of the fastest GPU APIs. It is also flexible and offers a pleasant programming experience. Nearly everything can be abstracted away, but we also have the option of specifying almost everything explicitly. Debugging is not yet on par with serial CPU code debugging, but Kokkos is compatible with good tools such as NVIDIA Nsight Systems. Moreover, standard CPU tools such as GDB can be used too (if the execution space is serial CPU).

Although the Kokkos learning curve is steep at first, it is fairly linear afterwards. In addition, seeing a few good examples helps at the start. Our code is open-source, and I encourage you to use it as a test case to learn how we parallelized our code.

The last reason to choose Kokkos is that it is under heavy development. It is a large project from Sandia National Laboratories in the USA, and it is part of the massive US Exascale Computing Project. So it can only get better and support more platforms in the future.

How to port a CPU code to GPU using Kokkos?

Here are a couple of easy examples to show you how serial CPU code can be parallelized using Kokkos.

Basic for loop

For loops are by far the most common type of loops that are parallelized, and also the simplest to write using Kokkos. One of the simplest steps of the MPM algorithm is the update of particle velocities, which is a for loop over all the particles. In the CPU version of Karamelo it was written as:

void Solid::update_particle_velocities(double alpha) {
  for (int ip = 0; ip < np_local; ip++) {
    v[ip] = (1 - alpha) * v_update[ip] + alpha * (v[ip] + update->dt * a[ip]);
  }
}

Note that variables v, v_update, and a are vectors of vectors and are members of the class Solid.

In the new version of the code that uses Kokkos, this function became:

void SolidMPM::advance_particles()
{
  krml_float PIC_FLIP = update->PIC_FLIP;
  krml_float dt = update->dt;

  Kokkos::View<Vector3d*> sa = this->a;
  Kokkos::View<Vector3d*> sv = this->v;
  Kokkos::View<Vector3d*> sv_update = this->v_update;

  Kokkos::parallel_for("advance_particles", np_local,
  KOKKOS_LAMBDA (const int &ip)
  {
    sv[ip] = (1 - PIC_FLIP)*sv_update[ip] + PIC_FLIP*(sv[ip] + dt*sa[ip]);
  });
}

The first thing to notice is how the for loop became a Kokkos::parallel_for(). This function takes three arguments: the name of the loop, the iteration range, and a functor. The name of the loop is used only for debugging purposes. The iteration range is here given by an int, meaning it is a 1D loop ranging from 0 to np_local. Finally, the functor is a lambda expression whose argument varies over the iteration range.

The second thing to notice is the local declaration of all the variables used inside the lambda. This is necessary because the lambda captures by value: referring to class members inside it would capture the this pointer, which points to host memory and cannot be dereferenced on the GPU. The values therefore have to be copied into local variables (and the Views into local handles) before the loop, as illustrated in the sketch below.

Lastly, v, a, and v_update have been changed from vectors to Kokkos Views.
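
To make the reasoning concrete, here is a minimal sketch with a hypothetical Integrator class (not Karamelo code): only local copies are used inside the lambda, so the this pointer is never captured.

struct Integrator {
  double dt;
  Kokkos::View<double*> v, a;

  void advance(int n) {
    double dt_local = dt;               // copy the value, not a pointer to the member
    Kokkos::View<double*> v_local = v;  // Views are cheap, shared handles to the data
    Kokkos::View<double*> a_local = a;

    Kokkos::parallel_for("advance", n, KOKKOS_LAMBDA(const int i) {
      v_local(i) += dt_local * a_local(i); // safe: 'this' is never dereferenced on the device
    });
  }
};

Referring to dt or to the member Views directly inside the lambda would compile, but it would capture this and fail (or silently misbehave) on the GPU.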

Basic reduction loop

In post-processing, we often need to use reductions, for instance to calculate the kinetic energy of a solid. In the serial version of the code, this was simply done like this:

Ek = 0;
for (int in = 0; in < solid.np_local; in++)
  Ek += 0.5 * solid.mass[in] * square(solid.v[in].norm());

In the new version of the code that uses Kokkos, it became:

krml_float Ek = 0;

Kokkos::View<Vector3d*> v = solid.v;
Kokkos::View<krml_float*> mass = solid.mass;

Kokkos::parallel_reduce("ComputeKineticEnergy::compute_value", solid.np_local,
			KOKKOS_LAMBDA(const int &ip, krml_float &lEk) {
			  lEk += 0.5 * mass[ip] * (v[ip](0)*v[ip](0)
						   + v[ip](1)*v[ip](1)
						   + v[ip](2)*v[ip](2));
			},Ek);

The Kokkos::parallel_reduce function takes four arguments (one more than parallel_for): the name of the loop (used for debugging purposes), the iteration range, a functor, and the result. Here, the iteration range is an int, which means it is a 1D loop ranging from 0 to solid.np_local. The functor is a lambda expression which here takes two arguments: the index that varies over the iteration range, and a local reduction variable.

Note that you can actually reduce multiple variables at the same time, for instance when we need to calculate the total linear momentum of a solid:

krml_float Lx, Ly, Lz;
Lx = Ly = Lz = 0;

Kokkos::View<Vector3d*> v = solid.v;
Kokkos::View<krml_float*> mass = solid.mass;

Kokkos::parallel_reduce("ComputeLinearMomentum::compute_value", solid.np_local,
  KOKKOS_LAMBDA(const int &ip, krml_float &lLx, krml_float &lLy, krml_float &lLz)
  {
    const Vector3d &Lp = mass[ip] * v[ip];
    lLx += Lp(0);
    lLy += Lp(1);
    lLz += Lp(2);
  }, Lx, Ly, Lz);

L += Vector3d(Lx, Ly, Lz);

Kokkos Views

Kokkos Views are multidimensional arrays. Here they are defined as:

Kokkos::View<Vector3d*> v, a, v_update;

Here, we did not define the memory space. It is automatically assigned at compile time, controlled by the flags passed to the compiler. Therefore, it could be either the GPU or the CPU.

Memory space

You can also define the memory space explicitly. For CUDA space (NVIDIA GPUs), it is done like this:

Kokkos::View<Vector3d*,  Kokkos::CudaSpace> v, a, v_update;

If you want the View to be created in the host space (i.e. on the CPU), do it like this:

Kokkos::View<Vector3d*,  Kokkos::HostSpace> v, a, v_update;

One of the reasons to use a GPU API is to avoid having to worry about memory problems. Unfortunately, there is one aspect we cannot ignore: the CPU and the GPU do not share memory. Therefore, Views located on the device (the GPU) are usually not accessible from the host (the CPU).

Deep copies

Often, we have to initialize a View on the CPU because that is where the data resides, for instance when reading from a file. Conversely, we will probably have to process data residing on the GPU back on the CPU at some point, for instance to write it to disk. In both cases, we need to make deep copies.

Here is an example of a View that is initialized on the CPU and then deep-copied onto the GPU:

Kokkos::View<Vector3d*> x("x", n_particles);
Kokkos::View<Vector3d*>::HostMirror x_mirror = create_mirror(x);

for (int i = 0; i < x.size(); i++)
  x_mirror[i] = f(i); // f is a function that cannot be executed on the device for some reason

deep_copy(x, x_mirror);

We first have to create the View that resides on the device. Note that I did not specify CudaSpace. You could, but it is actually better not to, for debugging purposes: if you want to change the execution space, you won't have to change the code.

Here is an example where data that is on the device is copied back to the CPU:

Kokkos::View<Vector3d*> x;
...
Kokkos::View<Vector3d*>::HostMirror x_mirror = create_mirror(x);
deep_copy(x_mirror, x);

CudaUVMSpace

Sometimes, it can be necessary to have exactly the same data available from both the CPU and the GPU. Luckily, Nvidia GPUs offer this possibility with their Unified Virtual Memory space (UVM).

Creating a Kokkos View that uses UVM is as simple as:

Kokkos::View<Vector3d*,  Kokkos::CudaUVMSpace> v;

Don’t be fooled! CudaUVMSpace is slow because the CPU and the GPU do not share physical memory: behind the scenes, data has to be copied back and forth at regular intervals to keep the CPU and GPU views of the data consistent, and these hidden transfers cost time.

But CudaUVMSpace is extremely practical when translating a code from serial to Kokkos: it allows you to translate it section by section without breaking it. We used this extensively when translating Karamelo, and it saved us a lot of time and debugging hassle.
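
Here is a minimal sketch (assuming a CUDA-enabled build; the variable names are placeholders) of the same View being touched from host code and from a kernel without any explicit deep copy:

#include <Kokkos_Core.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1000;
    Kokkos::View<double*, Kokkos::CudaUVMSpace> x("x", n);

    // Host-side initialization (e.g. data read from a file)
    for (int i = 0; i < n; i++) x(i) = 1.0;

    // Device-side update
    Kokkos::parallel_for("scale", n, KOKKOS_LAMBDA(const int i) { x(i) *= 2.0; });
    Kokkos::fence(); // make sure the kernel is done before the host reads x

    std::cout << "x(0) = " << x(0) << std::endl; // host read, no deep_copy needed
  }
  Kokkos::finalize();
  return 0;
}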

Atomics

In MPM, the number of nodes that a given particle can interact with is geometrically limited by the support of the shape functions. For instance, for a 2D problem, if we use simple linear shape functions, a particle can interact with only 4 nodes. However, the number of particles interacting with a given node is not bounded. This is a problem when interpolating from the particles to the grid (the Particle to Grid step, aka P2G).

In the serial version of Karamelo, the portion of the Particle to Grid step where the mass of the nodes is calculated was written as:

void Solid::compute_mass_nodes(bool reset)
{
  int ip;
  int nn = grid->nnodes_local + grid->nnodes_ghost;

  for (int in = 0; in < nn; in++)
  {
    if (reset) grid->mass[in] = 0;

    for (int j = 0; j < numneigh_np[in]; j++)
    {
      ip = neigh_np[in][j];
      grid->mass[in] += wf_np[in][j] * mass[ip];
    }
  }
  return;
}

We have an outer for loop over the nodes, and an inner for loop over the neighbouring particles of each node. In that case, a vector of vectors, neigh_np, tracks the particles interacting with each node. As I explained before, since the number of particles interacting with a given node is not bounded, we do not know in advance the size of the inner vectors, so they cannot be translated into a Kokkos View.

We want to avoid keeping track of the particles interacting with each node. Therefore, we decided not to loop over the nodes, but rather over the particles. However, since different particles might update the same grid->mass[in], this creates a potential race condition. To avoid it, we have to use atomics. The resulting code is:

void SolidMPM::compute_mass_nodes()
{
  Kokkos::View<krml_float*> smass = this->mass;
  Kokkos::View<krml_float**> swf = this->wf;
  Kokkos::View<int**> neigh_n = this->neigh_n;

  Kokkos::View<krml_float*> gmass = grid->mass;

  Kokkos::parallel_for("compute_mass_nodes", neigh_policy,
  KOKKOS_LAMBDA (int ip, int i)
  {
    if (krml_float wf = swf(ip, i))
    {
      int in = neigh_n(ip, i);
      Kokkos::atomic_add(&gmass(in), wf*smass(ip));
    }
  });
}

Here, to add wf*smass(ip) to gmass(in), we use the atomic_add function. If we had needed to subtract instead, we would have used atomic_sub. For more options, please read the Kokkos documentation.
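
The neigh_policy used above is defined elsewhere in Karamelo. A 2D iteration range like that can be expressed with a Kokkos::MDRangePolicy; here is a hypothetical sketch (np_local and max_neighbours are assumed to be known), not Karamelo's actual definition:

// Hypothetical 2D range over (particle, neighbour) pairs
auto neigh_policy = Kokkos::MDRangePolicy<Kokkos::Rank<2>>(
    {0, 0}, {np_local, max_neighbours});

Kokkos::parallel_for("compute_mass_nodes", neigh_policy,
  KOKKOS_LAMBDA(const int ip, const int i) {
    // the lambda receives one index per rank
  });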

Inheritance

Kokkos does not support calling virtual functions inside GPU kernels. This is one of the biggest problems when translating serial code to GPU. The problem is that the virtual function pointers (the vtable) point to CPU memory. Creating GPU-compatible instances of polymorphic classes is possible, but extremely complicated and fragile. Consider the following serial pattern:

class Base {
public:
  virtual void virtual_iteration(int i) = 0;

  void static_loop() {
    for (int i = 0; i < n; i++)
      virtual_iteration(i);
  }
};

class DerivedA: public Base {
public:
  void virtual_iteration(int i) override;
};

Therefore, to translate the above code to GPU, we have two options:

  1. Moving the loops into the virtual functions
  2. Using the Curiously Recurring Template Pattern

Moving the loops into the virtual functions

We can avoid using virtual functions on the GPU by moving the loop directly into the virtual functions:

class Base {
public:
  virtual void virtual_loop() = 0;
};

class DerivedA: public Base {
public:
  void virtual_loop() override {
    Kokkos::parallel_for(n, KOKKOS_LAMBDA(int i) {
      static_iteration(i);
    });
  }
};

This is a valid option, and it works well if every derived class uses a different version of the loop. Otherwise, it requires a lot of code duplication, which is extremely poor practice: if we need to change the code in one of these loops, we have to find all the duplicates and change each of them manually. Not a good idea!

Curiously Recurring Template Pattern (CRTP)

The CRTP consists of introducing a templated intermediate class which takes the derived class as a template argument. This class overrides the virtual function with the shared control structure and injects the derived class's kernel statically, using early binding.

class Base {
public:
  virtual void virtual_loop() = 0;
};

template<typename T>
class CRTP: public Base {
public:
  void virtual_loop() override {
    T derived(*static_cast<T *>(this));

    Kokkos::parallel_for(n, KOKKOS_LAMBDA(int i) {
      derived.static_iteration(i);
    });
  }
};

class DerivedA: public CRTP<DerivedA> {
public:
  void static_iteration(int i);
};
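
A hypothetical usage sketch (not Karamelo code): runtime polymorphism is preserved on the host side, while the per-iteration work is bound statically inside the GPU kernel.

DerivedA solver;
Base &base = solver;
base.virtual_loop(); // dispatches at runtime to CRTP<DerivedA>::virtual_loop()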

Speedup

How fast is the code resulting from the translation to GPU? We measured the speedup on two test cases.

Test case 1: Two bouncing balls

The two bouncing balls problem is one of the most basic 2D MPM problems. It uses MPM's built-in contact algorithm. Two identical balls are launched towards each other with the same speed. Here are the problem details:

Material Point Method test case: two bouncing balls

We ran this simulation on 1 CPU, 9 CPUs, and 1 Tesla GPU. The results for 1 CPU are our baseline, from which the speedup factors are calculated. With one GPU, the largest speedup factor obtained here was around 18, and it was still increasing with the number of particles. With 9 CPUs, the speedup plateaued at around 3.4. So the GPU is a massive improvement!

Running time and GPU speedup of the two bouncing balls test using Karamelo

Test case 2: A twisted column

Material Point Method test case: a twisted column

We ran this simulation on 1 CPU, 9 CPUs, and 1 Tesla GPU. The results for 1 CPU are again our baseline, from which the speedup factors are calculated. With one GPU, the largest speedup factor obtained here was around 86. With 9 CPUs, the speedup plateaued at around 7. In both cases, the speedup was larger than for the two bouncing balls.

Speedup for the twisted column test using GPU

The obtained speedup is really massive. And this time, the GPU seemed to be saturated.

Conclusion

Porting any serial code to GPU can significantly enhance its performance, and thanks to GPU APIs, it is within everyone's reach. For C++ codes, Kokkos emerges as the most powerful and efficient tool. Our comparison of GPU APIs (OpenACC, JuliaGPU, and Kokkos) revealed that while JuliaGPU is quick to learn and write, Kokkos offers a better balance of performance and ease of use, making it the preferred choice for translating our MPM code to GPU.

Kokkos not only provides a smooth programming experience but also delivers competitive performance comparable to JuliaGPU, with the added benefit of robust debugging tools and support for multiple platforms. The steep initial learning curve of Kokkos is offset by its linear learning trajectory and extensive documentation, including helpful tutorials.

Furthermore, our benchmarks and real-world test cases demonstrated substantial speedups when using Kokkos: up to 86 times faster than the serial counterpart.

In light of these findings, we recommend Kokkos as the GPU API of choice for researchers and developers looking to leverage the power of GPUs for scientific simulations. Its ongoing development and compatibility with leading GPU debugging tools position Kokkos as a versatile and future-proof solution for high-performance computing tasks like MPM simulations.
