"In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic C program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n",
"\n",
"Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)"
"Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives (or 'pragmas') to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. \n",
"\n",
"If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n",
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost."
"You can accelerate your applications using OpenACC directives and achieve *at least* a 2X speed-up, using 4 straightforward steps:\n",
"\n",
"1. Characterize your application\n",
"2. Add compute directives\n",
"3. Minimize data movement\n",
"4. Optimize kernel scheduling\n",
"\n",
"The content of these steps and their order will be familiar if you have ever done parallel programming on other platforms. Parallel programmers deal with the same issues whenever they tackle a new set of code, no matter what platform they are parallelizing an application for. These issues include:\n",
"\n",
"+ optimizing and benchmarking the serial version of an application\n",
"+ profiling to identify the compute-intensive portions of the program that can be executed concurrently\n",
"+ expressing concurrency using a parallel programming notation (e.g., OpenACC directives)\n",
"+ compiling and benchmarking each new/parallel version of the application\n",
"+ locating problem areas and making improvements iteratively until the target level of performance is reached\n",
"\n",
"The programming manual for some other parallel platform you've used may have suggested five steps, or fifteen. Whether you are an expert or new to parallel programming, we recommend that you walk through the four steps here as a good way to begin accelerating applications by at least 2X using OpenACC directives. We believe *being more knowledgeable about the four steps* will make the process of programming for an accelerator more understandable *and* more manageable. The 2X in 4 Steps process will help you use OpenACC on your own codes more productively, and get significantly better speed-ups in less time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Characterize Your Application"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocesor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming then is to *characterize the application*. This includes:\n",
"\n",
"+ benchmarking the single-thread, CPU-only version of the application\n",
"+ understanding the program structure and how data is passed through the call tree\n",
"+ profiling the application and identifying computationally-intense \"hot spots\"\n",
" + which loop nests dominate the runtime?\n",
" + what are the minimum/average/maximum tripcounts through these loop nests?\n",
" + are the loop nests suitable for an accelerator?\n",
"+ insuring that the algorithms you are considering for acceleration are *safely* parallel\n",
"\n",
"Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two directives at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.\n",
"\n",
"We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n",
"In this code, the outer 'while' loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, *tol*. The first of two sets of inner nested loops applies a 2D Laplace operator at each element of a 2D grid, while the second set copies the output back to the input for the next iteration."
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.c`](/4vwkFv7K/edit/C/task1/task1.c) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program."
"This is a good time to briefly talk about having a quality check in your code before starting to offload computation to an accelerator (or do any optimizations, for that matter). It doesn't do you any good to make an application run faster if it does not return the correct results. It is thus very important to have a quality check built into your application before you start accelerating or optimizing. This can be a simple value print out (one you can compare to a non-accelerated version of the algorithm) or something else.\n",
"\n",
"In our case, on every 100th iteration of the outer `while` loop, we print the current max error. (You just saw an example when we executed *task1_pre_out*.) As we add directives to accelerate our code later in this lab, you can look back at these values to verify that we're getting the correct answer. These print-outs also help us verify that we are converging on a solution -- which means that we should see, as we proceed, that the values are approaching zero.\n",
"\n",
"**Note:** NVIDIA GPUs implement IEEE-754 compliant floating point arithmetic just like most modern CPUs. However, because floating point arithmetic is not associative, the order of operations can affect the rounding error inherent with floating-point operations: you may not get exactly the same answer when you move to a different processor. Therefore, you'll want to make sure to verify your answer within an acceptable error bound. Please read [this](https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus) article at a later time, if you would like more details.\n",
"\n",
"*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: \n",
"\n",
"|Step| Execution | ExecutionTime (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time |\n",
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the pgprof interface"
"Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, then selecting `task_simple_out` `from OpenACC/labs/lab1/C`.\n",
"Clicking `Next` will bring up a screen with a list profiling settings for this session. We can leave those at their default settings for now. Clicking `Finish` will cause `pgprof` to launch your executable for profiling. Since we are profiling a regular CPU application (no acceleration added yet) we should refer to the `CPU Details` tab along the bottom of the window for a summary of what functions in our program take the most compute time on the CPU. If you do not have a `CPU Details` tab, click `View` -> `Show CPU Details View`.\n",
"Double-clicking on the most time-consuming function in the table, `main` in this case, will bring up another file browser. This time click on `Recently Used` and then `C` and press `OK`. This will open the source file for the `main` function. \n",
"In our Jacobi code sample, the compute-intensive part of our code is the two for-loops nested inside the while loop in the function *main*. This is where we'll concentrate our effort in adding OpenACC to the code.\n",
"\n",
"Let's see what it takes to accelerate those loops."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Add Compute Directives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In C, an OpenACC directive is indicated in the code by `'#pragma acc *your directive*'`. This is very similar to OpenMP programming and gives hints to the compiler on how to handle the compilation of your source. If you are using a compiler which does not support OpenACC directives, it will simply ignore the `'#pragma acc'` directives and move on with the compilation.\n",
"\n",
"In Step 2, you will add compute regions around your expensive parallel loop(s). The first OpenACC directive you're going to learn about is the *kernels* directive. The kernels directive gives the compiler a lot of freedom in how it tries to accelerate your code - it basically says, \"Compiler, I believe the code in the following region is parallelizable, so I want you to try and accelerate it as best you can.\"\n",
"\n",
"Like most OpenACC directives in C/C++, the kernels directive applies to the structured code block immediately following the `#pragma acc *directive*`. For example, each of the following code samples instructs the compiler to generate a kernel -- from suitable loops -- for execution on an accelerator:\n",
"\n",
" #pragma acc kernels\n",
" {\n",
" // accelerate suitable loops here \n",
" }\n",
" // but not these loops\n",
"\n",
"or\n",
"\n",
" #pragma acc kernels\n",
" for ( int i = 0; i < n; ++i ) \n",
" { // body of for-loop\n",
" ... // The for-loop is a structured block, so this code will be accelerated\n",
" }\n",
" ... // Any code here will not be accelerated since it is outside of the for-loop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One, two or several loops may be inside the structured block, the kernels directive will try to parallelize it, telling you what it found and generating as many kernels as it thinks it safely can. At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of and embracing *both* the two for-loop codeblocks that follow the while loop using curly braces. The kernels directive is designed to find the parallel acceleration opportunities implicit in the for-loops in the Jacobi Iteration code. \n",
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.c](/4vwkFv7K/edit/C/task2/task2_solution.c) to see the answer."
"Let's now compile our [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file by executing the command below with Ctrl-Enter (or press the play button in the toolbar above). Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
"If you do not get similar output, please check your work and try re-compiling. If you're stuck, you can compare what you have to task2_solution.c in the editor above.\n",
"\n",
"*The output provided by the compiler is extremely useful, and should not be ignored when accelerating your own code with OpenACC.* Let's break it down a bit and see what it's telling us.\n",
"\n",
"1. First since we used the `-Minfo` command-line option, we will see all output from the compiler. If we were to use `-Minfo=accel` we would only see the output corresponding to the accelerator, in this case an NVIDIA GPU.\n",
"2. The first line of the output, *main*, tells us which function the following information is in reference to.\n",
"3. The line starting with `41, Loop is parallelizable` of the output tells us that on line `41` in our source, an accelerated kernel was generated. This is the the loop just after where we put our `#pragma acc kernels`.\n",
"4. The following lines provide more details on the accelerator kernel on line 42. It shows we created a parallel OpenACC `loop`. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang.\n",
"5. At line 54, the compiler tells us it found another loop to accelerate.\n",
"6. The rest of the information concerns data movement which we'll get into later in this lab.\n",
"\n",
"So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefuly inspect this information to make sure the compiler is doing what you've asked of it.\n",
"Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"\n",
"Now, if your solution is similar to the one in task2_solution.c, you have probably noticed that we're executing **slower** than the non-accelerated, CPU-only version we started with. What gives?! Let's see what `pgprof` can tell us about the performance of the code. Return to your `PGPROF` window from earlier, start another new session, but this time loading `task2_out` as your executable (it's in the same directory as before). This time we'll find a colorful graph of what our program is doing, this is the GPU timeline. We can't tell much from the default view, but we can zoom in by using the `+` magnifying glass at the top of the window. If you zoom in far enough, you'll begin to see a pattern like the one in the screenshot below. The teal and purple boxes are the compute kernels that go with the two loops in our kernels region. Each of these groupings of kernels is surrounded by tan coloer boxes representing data movement. What this graph is showing us is that for every step of our `while` loop, we're copying data to the GPU and then back out. Let's try to figure out why.\n",
"The compiler feedback we collected earlier tells you quite a bit about data movement. If we look again at the compiler feedback from above, we see the following.\n",
"This is telling us that the compiler has inserted data movement around our `kernels` region at line 36 which copies the `A` array *in* and *out* of GPU memory and also copies `Anew` out. This problem of copying back and forth on every iteration of a loop is sometimes called \"data sloshing\".\n",
"\n",
"The OpenACC compiler can only work with the information we have given it. It knows we need the `A` and `Anew` arrays on the GPU for each of our two accelerated sections, but we didn't tell it anything about what happens to the data outside of those sections. Without this knowledge, it has to copy the full arrays *to the GPU and back to the CPU* for each accelerated section, *every time* it went through the while loop. That is a LOT of wasted data transfers.\n",
"\n",
"Ideally, we would just transfer `A` to the GPU at the beginning of the Jacobi Iteration, and then only transfer `A` back to the CPU at the end. As for `Anew`, it's only used within this region, so we don't need to copy any data back and forth, we only need to `create` space on the device for this array.\n",
"\n",
"Because overall accelerator performance is detetermined largely by how well memory transfers are optimized, the OpenACC specification defines the `data` directive and several modifying clauses to manage all the various forms of data movement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Step 3 - Manage Data Movement\n",
"\n",
"We need to give the compiler more information about how to reduce unnecessary data movement for the Jacobi Iteration. We are going to do this with the OpenACC `data` directive and some modifying clauses defined in the OpenACC specification. \n",
"\n",
"In C, the `data` directive applies to the next structured code block. The compiler will manage data according to the provided clauses. It does this at the beginning of the `data` directive code block, and then again at the end. Some of the clauses available for use with the `data` directive are:\n",
"\n",
"* `copy( list )` - Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.\n",
"* `copyin( list )` - Allocates memory on GPU and copies data from host to GPU when entering region.\n",
"* `copyout( list )` - Allocates memory on GPU and copies data to the host when exiting region.\n",
"* `create( list )` - Allocates memory on GPU but does not copy.\n",
"* `present( list )` - Data is already present on GPU from another containing data region.\n",
"\n",
"As an example, the following directive copies array A to the GPU at the beginning of the code block, and back to the CPU at the end. It also copies arrays B and C *to the CPU* at the *end* of the code block, but does **not** copy them both to the GPU at the beginning:\n",
"\n",
"<pre><code>#pragma acc data copy( A ), copyout( B, C )\n",
"{\n",
" ....\n",
"}</code></pre>\n",
"\n",
"For detailed information on the `data` directive clauses, you can refer to the [OpenACC 2.5](http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf) specification.\n",
"In the [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [`task3_solution.c`](/4vwkFv7K/edit/C/task3/task3_solution.c) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
"Once you think you have [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) saved with a directive to manage data transfer, compile it with the below command and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
"How are we doing on our timings? Let's execute our step 3 program and see if we have indeed accelerated the application versus the execution time we recorded after Step #2."
"After making these changes, our accelerator code is much faster -- with just a few lines of OpenACC directives we have made our code more than twice as fast by running it on an accelerator, as shown in this table.\n",
"\n",
"|Step| Execution | Time (s) | Speedup vs. 1 CPU thread | Correct? | Programming Time |\n",
"|3| Manage data movement | 0.45 | 6.56X | Yes | ||\n",
"\n",
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"We are making good progress, but we can still improve performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Optimize Kernel Scheduling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step in our tuning process is to tune the OpenACC compute region schedules using the *gang* and *vector* clauses. These clauses let us use OpenACC directives to take more explicit control over how the compiler parallelizes our code for the accelerator we will be using. \n",
"\n",
"Kernel scheduling optimizations *may* give you significantly higher speedup, but be aware that these particular optimizations can significantly reduce performance portability. The vast majority of the time, the default kernel schedules chosen by the OpenACC compilers are quite good, but other times the compiler doesn't do as well. Let's spend a little time examining how we could do better, if we were in a situation where we felt we needed to. \n",
"\n",
"First, we need to get some additional insight into how our Jacobi Iteration code with the data optimizations is running on the accelerator. Let's run the C code with all your data movement optimizations on the accelerator again. We could use `pgprof` again, but this time setting just the environment variable PGI_ACC_TIME, which will print some high level timers for us without leaving our command shell."
"There is a lot of information here about how the compiler mapped the computational kernels in our program to our particular accelerator (in this case, an NVIDIA GPU). We can see three regions. The first one is the memcopy loop nest starting on line 34, which takes only a tiny fraction of the total system time. The second region is the nested computation loop starting on line 44, which takes about 0.25 seconds. The copyback (*copyout*) loop then executes beginning with line 68. We can see that region takes very little time -- which tells us there is no other part of the program that takes significant time. If we look at the main loop nests, we can see these lines: "
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"grid: [32x256] block[32x4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The terms *grid* and *block* come from the CUDA programming model. A GPU executes groups of threads called *thread blocks*. To execute a kernel, the application launches a *grid* of these thread blocks. Each block runs on one of the GPUs *multiprocessors* and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 32x4, 128 threads each. The grid the compiler has constructed is also 2D, 32 blocks wide and 256 blocks tall. This is just enough to cover our 1024x1024 grid. But we don't really need that many blocks -- if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.\n",
"\n",
"*Note: You can let the compiler do the hard work of mapping loop nests, unless you are certain you can do it better (and portability is not a concern.) When you decide to intervene, think about different parallelization strategies (loop schedules): in nested loops, distributing the work of the outer loops to the GPU multiprocessors (on OpenACC = gangs) in 1D grids. Similarly, think about mapping the work of the inner loops to the cores of the multiprocessors (CUDA threads, vectors) in 1D blocks. The grids (gangs) and block (vector) sizes can be viewed by setting the environment variable ACC_NOTIFY. To get you started, here are some experiments we conducted for these computational kernels and this accelerator:*\n",
"Try to modify the [`task4.c`](/4vwkFv7K/edit/C/task4/task4.c) code for the main computational loop nests in the window below. You'll be using the openacc loop constructs `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
"Looking at [task4_solution.c](/4vwkFv7K/edit/C/task4/task4_solution.c), the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
"*Note: we usually want the inner loop to be vectorized, because it allows coalesced loading of data from global memory. This is almost guaranteed to give a big performance increase. Other optimizations are often trial and error. When selecting grid sizes, the most obvious mapping is to have*\n",
"\n",
" the number of gangs * the number of workers * the number of vectors = the total problem size. \n",
" \n",
"*We may choose to manipulate this number, as we are doing here, so that each thread does multiple pieces of work -- this helps amortize the cost of setup for simple kernels.*\n",
"\n",
"*Note: Low-level languages like CUDA C/C++ offer more direct control of the hardware. You can consider optimizing your most critical loops in CUDA C/C++ if you need to extract every last bit of performance from your application, while recognizing that doing so may impact the portability of your code. OpenACC and CUDA C/C++ are fully interoperable.*\n",
"\n",
"A similar change to the copy loop nest benefits performance by a small amount. After you've made all your changes (look at task4_solution.c to be sure) compile your code below:"
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at [task1_omp.c](/4vwkFv7K/edit/C/task4/task1_omp.c) in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Running this using 8-OpenMP threads on an Intel Xeon E5-2670 , our Kepler GK520 about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-OpenMP thread version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
"Now run the OpenMP code you just created, and record your results in the new table for the larger matrix.\n",
"\n",
"*Note: because our dataset has now grown by 16-fold your CPU may not seem as responsive. We're using `-Minfo` in the compile so you can see that something is indeed happening, but you may need to be patient*."
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n"
"Remember that in C, an OpenACC directive applies to the next structured code block. So for example, the following applies the <code>kernels</code> directive to the outer <code>for</code> loop and everything inside of it:\n",
"\n",
"```c\n",
" #pragma acc kernels\n",
" for ( int i = 0; i < n-1; i++ )\n",
" {\n",
" for ( int j = 0; j < n-1; j++)\n",
" ...\n",
" }\n",
"```\n",
"\n",
"## Step #2 - Hint #2\n",
"If you choose to use only one `#pragma acc kernels` region -- which we recommend, because it demonstrates the power of the *kernels* directive -- you will need to add some additional `{ }` brackets so it applies to the correct region of code.\n",
"\n",
"[Return to Step #2](#Step-2---Add-Compute-Directives)\n",
"\n",
"## Step #3 - Hint #1\n",
"You should only have to worry about managing the transfer of data in arrays `A` and `Anew`.\n",
"\n",
"## Step #3 - Hint #2\n",
"You want to put the data directive just above the outer `while` loop.\n",
"\n",
"## Step #3 - Hint #3\n",
"You'll want to `copy( A )` so it is transferred to the GPU and back again after the final iterations through the `data` region. But you only need to `create( Anew )` as it is just used for temporary storage on the GPU, so there is no need to ever transfer it back and forth.\n",
"\n",
"[Return to step #3](#Step-3---Manage-Data-Movement)\n",
"\n",
"## Step #4 - Hint #1\n",
"You'll want a gang() and vector() clause on the inner loops, but you may want to let the compiler decide the dimensions of the outer loops.In that case, you can use a loop directive without any modifying clauses.\n",
"\n",
"[Return to step #4](#Step-4---Optimize-Kernel-Scheduling)"