OpenACC-course/lab1/C/OpenACC C.ipynb

1072 lines
52 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# OpenACC: 2X in 4 Steps (for C)\n",
"\n",
"In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic C program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n",
"\n",
"Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets begin by getting information about the GPUs on the server by running the command below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Jun 20 13:10:41 2017 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 375.66 Driver Version: 375.66 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"|===============================+======================+======================|\n",
"| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\n",
"| 23% 59C P0 27W / 99W | 690MiB / 1996MiB | 1% Default |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: GPU Memory |\n",
"| GPU PID Type Process name Usage |\n",
"|=============================================================================|\n",
"| 0 1982 G /usr/lib/xorg/Xorg 357MiB |\n",
"| 0 2997 G compiz 166MiB |\n",
"| 0 3233 G /usr/lib/firefox/firefox 1MiB |\n",
"| 0 3449 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 25MiB |\n",
"| 0 11015 G ...el-token=53D41F0E8A4B8A669C123908959A0849 137MiB |\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"%%bash\n",
"nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to OpenACC\n",
"\n",
"Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives (or 'pragmas') to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. \n",
"\n",
"If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n",
"\n",
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Value of 2X in 4 Steps\n",
"\n",
"You can accelerate your applications using OpenACC directives and achieve *at least* a 2X speed-up, using 4 straightforward steps:\n",
"\n",
"1. Characterize your application\n",
"2. Add compute directives\n",
"3. Minimize data movement\n",
"4. Optimize kernel scheduling\n",
"\n",
"The content of these steps and their order will be familiar if you have ever done parallel programming on other platforms. Parallel programmers deal with the same issues whenever they tackle a new set of code, no matter what platform they are parallelizing an application for. These issues include:\n",
"\n",
"+ optimizing and benchmarking the serial version of an application\n",
"+ profiling to identify the compute-intensive portions of the program that can be executed concurrently\n",
"+ expressing concurrency using a parallel programming notation (e.g., OpenACC directives)\n",
"+ compiling and benchmarking each new/parallel version of the application\n",
"+ locating problem areas and making improvements iteratively until the target level of performance is reached\n",
"\n",
"The programming manual for some other parallel platform you've used may have suggested five steps, or fifteen. Whether you are an expert or new to parallel programming, we recommend that you walk through the four steps here as a good way to begin accelerating applications by at least 2X using OpenACC directives. We believe *being more knowledgeable about the four steps* will make the process of programming for an accelerator more understandable *and* more manageable. The 2X in 4 Steps process will help you use OpenACC on your own codes more productively, and get significantly better speed-ups in less time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1 - Characterize Your Application"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocesor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming then is to *characterize the application*. This includes:\n",
"\n",
"+ benchmarking the single-thread, CPU-only version of the application\n",
"+ understanding the program structure and how data is passed through the call tree\n",
"+ profiling the application and identifying computationally-intense \"hot spots\"\n",
" + which loop nests dominate the runtime?\n",
" + what are the minimum/average/maximum tripcounts through these loop nests?\n",
" + are the loop nests suitable for an accelerator?\n",
"+ insuring that the algorithms you are considering for acceleration are *safely* parallel\n",
"\n",
"Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two directives at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.\n",
"\n",
"We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n",
"\n",
"http://www.youtube.com/embed/UOSYi3oLlRs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the serial C code for our Jacobi Iteration:\n",
"\n",
" #include <math.h>\n",
" #include <string.h>\n",
" #include <openacc.h>\n",
" #include \"timer.h\"\n",
" #include <stdio.h>\n",
"\n",
" #define NN 1024\n",
" #define NM 1024\n",
"\n",
" float A[NN][NM];\n",
" float Anew[NN][NM];\n",
"\n",
" int main(int argc, char** argv)\n",
" {\n",
" int i,j;\n",
" const int n = NN;\n",
" const int m = NM;\n",
" const int iter_max = 1000;\n",
" const double tol = 1.0e-6;\n",
" double error = 1.0;\n",
" \n",
" memset(A, 0, n * m * sizeof(float));\n",
" memset(Anew, 0, n * m * sizeof(float));\n",
" \n",
" for (j = 0; j < n; j++)\n",
" {\n",
" A[j][0] = 1.0;\n",
" Anew[j][0] = 1.0;\n",
" }\n",
" \n",
" printf(\"Jacobi relaxation Calculation: %d x %d mesh\\n\", n, m);\n",
" \n",
" StartTimer();\n",
" int iter = 0;\n",
" \n",
" while ( error > tol && iter < iter_max )\n",
" {\n",
" error = 0.0;\n",
"\n",
" for( j = 1; j < n-1; j++)\n",
" {\n",
" for( i = 1; i < m-1; i++ )\n",
" {\n",
" Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]\n",
" + A[j-1][i] + A[j+1][i]);\n",
" error = fmax( error, fabs(Anew[j][i] - A[j][i]));\n",
" }\n",
" }\n",
" \n",
" for( j = 1; j < n-1; j++)\n",
" {\n",
" for( i = 1; i < m-1; i++ )\n",
" {\n",
" A[j][i] = Anew[j][i]; \n",
" }\n",
" }\n",
"\n",
" if(iter % 100 == 0) printf(\"%5d, %0.6f\\n\", iter, error);\n",
" \n",
" iter++;\n",
" }\n",
"\n",
" double runtime = GetTimer();\n",
" \n",
" printf(\" total: %f s\\n\", runtime / 1000);\n",
" \n",
" return 0;\n",
" }\n",
"\n",
"In this code, the outer 'while' loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, *tol*. The first of two sets of inner nested loops applies a 2D Laplace operator at each element of a 2D grid, while the second set copies the output back to the input for the next iteration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Benchmarking\n",
"\n",
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `task1.c` file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Compiled Successfully!\n"
]
}
],
"source": [
"%%bash\n",
"# To be sure we see some output from the compiler, we'll echo out \"Compiled Successfully!\" \n",
"#(if the compile does not return an error)\n",
"pgcc -fast -o task1_pre_out task1/task1.c && echo 'Compiled Successfully!'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0, 0.250000\n",
" 100, 0.002397\n",
" 200, 0.001204\n",
" 300, 0.000804\n",
" 400, 0.000603\n",
" 500, 0.000483\n",
" 600, 0.000403\n",
" 700, 0.000345\n",
" 800, 0.000302\n",
" 900, 0.000269\n",
" total: 2.815460 s\n"
]
}
],
"source": [
"%%bash\n",
"# Execute our single-thread CPU-only Jacobi Iteration to get timing information. \n",
"# Make sure you compiled successfully in the \n",
"# above command first.\n",
"./task1_pre_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quality Checking/Keeping a Record\n",
"\n",
"This is a good time to briefly talk about having a quality check in your code before starting to offload computation to an accelerator (or do any optimizations, for that matter). It doesn't do you any good to make an application run faster if it does not return the correct results. It is thus very important to have a quality check built into your application before you start accelerating or optimizing. This can be a simple value print out (one you can compare to a non-accelerated version of the algorithm) or something else.\n",
"\n",
"In our case, on every 100th iteration of the outer `while` loop, we print the current max error. (You just saw an example when we executed *task1_pre_out*.) As we add directives to accelerate our code later in this lab, you can look back at these values to verify that we're getting the correct answer. These print-outs also help us verify that we are converging on a solution -- which means that we should see, as we proceed, that the values are approaching zero.\n",
"\n",
"**Note:** NVIDIA GPUs implement IEEE-754 compliant floating point arithmetic just like most modern CPUs. However, because floating point arithmetic is not associative, the order of operations can affect the rounding error inherent with floating-point operations: you may not get exactly the same answer when you move to a different processor. Therefore, you'll want to make sure to verify your answer within an acceptable error bound. Please read [this](https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus) article at a later time, if you would like more details.\n",
"\n",
"*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: \n",
"\n",
"|Step| Execution | ExecutionTime (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time |\n",
"|:--:| --------------- | ---------------------:| ------------------------------:|:--------:| -----------------|\n",
"|1 | CPU 1 thread | 2.95 | | Yes | | |\n",
"\n",
"*Note: Problem Size: 1024 x 1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"(The execution times quoted will be times we got running on our GK520 -- your times throughout the lab may vary for one reason or another.)\n",
"\n",
"You may also want to track how much time you spend porting your application, step by step, so a column has been included for recording time spent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Profiling\n",
"\n",
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify `task2.c` in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the pgprof interface"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Compiled Successfully!\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"GetTimer:\n",
" 3, include \"timer.h\"\n",
" 62, FMA (fused multiply-add) instruction(s) generated\n",
"main:\n",
" 25, Loop not fused: function call before adjacent loop\n",
" Loop not vectorized: may not be beneficial\n",
" Unrolled inner loop 8 times\n",
" Generated 7 prefetches in scalar loop\n",
" 42, Generated vector simd code for the loop containing reductions\n",
" Generated 3 prefetch instructions for the loop\n",
" Residual loop unrolled 2 times (completely unrolled)\n",
" 52, Memory copy idiom, loop replaced by call to __c_mcopy4\n"
]
}
],
"source": [
"%%bash\n",
"pgcc -Minfo=all,ccff -fast -o task1/task1_simple_out task1/task1_simple.c && echo 'Compiled Successfully!'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lab, to open the PGI profiler run the following command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgprof"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, then selecting `task_simple_out` `from OpenACC/labs/lab1/C`.\n",
"\n",
"<div align=\"center\"><img src=\"files/pgprof17_create_new_session.png\" width=\"60%\"></div>\n",
"\n",
"Clicking `Next` will bring up a screen with a list profiling settings for this session. We can leave those at their default settings for now. Clicking `Finish` will cause `pgprof` to launch your executable for profiling. Since we are profiling a regular CPU application (no acceleration added yet) we should refer to the `CPU Details` tab along the bottom of the window for a summary of what functions in our program take the most compute time on the CPU. If you do not have a `CPU Details` tab, click `View` -> `Show CPU Details View`.\n",
"\n",
"<div align=\"center\"><img src=\"files/pgprof17_cpu_details.png\" width=\"60%\"></div>\n",
"\n",
"Double-clicking on the most time-consuming function in the table, `main` in this case, will bring up another file browser. This time click on `Recently Used` and then `C` and press `OK`. This will open the source file for the `main` function. \n",
"\n",
"<div align=\"center\"><img src=\"files/pgprof17_see_main_c.png\" width=\"60%\"></div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our Jacobi code sample, the compute-intensive part of our code is the two for-loops nested inside the while loop in the function *main*. This is where we'll concentrate our effort in adding OpenACC to the code.\n",
"\n",
"Let's see what it takes to accelerate those loops."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2 - Add Compute Directives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In C, an OpenACC directive is indicated in the code by `'#pragma acc *your directive*'`. This is very similar to OpenMP programming and gives hints to the compiler on how to handle the compilation of your source. If you are using a compiler which does not support OpenACC directives, it will simply ignore the `'#pragma acc'` directives and move on with the compilation.\n",
"\n",
"In Step 2, you will add compute regions around your expensive parallel loop(s). The first OpenACC directive you're going to learn about is the *kernels* directive. The kernels directive gives the compiler a lot of freedom in how it tries to accelerate your code - it basically says, \"Compiler, I believe the code in the following region is parallelizable, so I want you to try and accelerate it as best you can.\"\n",
"\n",
"Like most OpenACC directives in C/C++, the kernels directive applies to the structured code block immediately following the `#pragma acc *directive*`. For example, each of the following code samples instructs the compiler to generate a kernel -- from suitable loops -- for execution on an accelerator:\n",
"\n",
" #pragma acc kernels\n",
" {\n",
" // accelerate suitable loops here \n",
" }\n",
" // but not these loops\n",
"\n",
"or\n",
"\n",
" #pragma acc kernels\n",
" for ( int i = 0; i < n; ++i ) \n",
" { // body of for-loop\n",
" ... // The for-loop is a structured block, so this code will be accelerated\n",
" }\n",
" ... // Any code here will not be accelerated since it is outside of the for-loop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One, two or several loops may be inside the structured block, the kernels directive will try to parallelize it, telling you what it found and generating as many kernels as it thinks it safely can. At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of and embracing *both* the two for-loop codeblocks that follow the while loop using curly braces. The kernels directive is designed to find the parallel acceleration opportunities implicit in the for-loops in the Jacobi Iteration code. \n",
"\n",
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the `task2.c` file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at `task2_solution.c` to see the answer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Hint #1](#Step-#2---Hint-#1) \n",
"[Hint #2](#Step-#2---Hint-#2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now compile our `task2.c` file by executing the command below. Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"# Compile the task2.c file with the pgcc compiler\n",
"# -acc tells the compiler to process the source recognizing #pragma acc directives\n",
"# -Minfo tells the compiler to share information about the compilation process\n",
"pgcc -acc -Minfo -fast -ta=tesla -o task2_out task2/task2.c && echo 'Compiled Successfully'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you successfully added `#pragma acc kernels` in the proper spots, you should see the following in the output of the compiler:\n",
"\n",
"main:\n",
"\n",
" 23, Loop not fused: function call before adjacent loop\n",
" Loop not vectorized: may not be beneficial\n",
" Unrolled inner loop 8 times\n",
" Generated 7 prefetches in scalar loop\n",
" 34, Loop not vectorized/parallelized: potential early exits\n",
" 36, Generating copyout(Anew[1:1022][1:1022])\n",
" Generating copyin(A[:][:])\n",
" Generating copyout(A[1:1022][1:1022])\n",
" 41, Loop is parallelizable\n",
" 43, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
" 47, Max reduction generated for error\n",
" 52, Loop is parallelizable\n",
" 54, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 52, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 54, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
" \n",
"If you do not get similar output, please check your work and try re-compiling. If you're stuck, you can compare what you have to task2_solution.c in the editor above.\n",
"\n",
"*The output provided by the compiler is extremely useful, and should not be ignored when accelerating your own code with OpenACC.* Let's break it down a bit and see what it's telling us.\n",
"\n",
"1. First since we used the `-Minfo` command-line option, we will see all output from the compiler. If we were to use `-Minfo=accel` we would only see the output corresponding to the accelerator, in this case an NVIDIA GPU.\n",
"2. The first line of the output, *main*, tells us which function the following information is in reference to.\n",
"3. The line starting with `41, Loop is parallelizable` of the output tells us that on line `41` in our source, an accelerated kernel was generated. This is the the loop just after where we put our `#pragma acc kernels`.\n",
"4. The following lines provide more details on the accelerator kernel on line 42. It shows we created a parallel OpenACC `loop`. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang.\n",
"5. At line 54, the compiler tells us it found another loop to accelerate.\n",
"6. The rest of the information concerns data movement which we'll get into later in this lab.\n",
"\n",
"So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefuly inspect this information to make sure the compiler is doing what you've asked of it.\n",
"\n",
"Finally, let's execute this program to verify we are getting the correct answer (execute the command below). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"./task2_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's record our results in the table:\n",
"\n",
"|Step| Execution | Time(s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time |\n",
"| -- || ------------ | ----------- | ------------------------- | -------- | ---------------- |\n",
"|1| CPU 1 thread |2.95 | | | |\n",
"|2| Add kernels directive |4.93 | 0.60X | Yes | ||\n",
"\n",
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"\n",
"Now, if your solution is similar to the one in task2_solution.c, you have probably noticed that we're executing **slower** than the non-accelerated, CPU-only version we started with. What gives?! Let's see what `pgprof` can tell us about the performance of the code. Return to your `PGPROF` window from earlier, start another new session, but this time loading `task2_out` as your executable (it's in the same directory as before). This time we'll find a colorful graph of what our program is doing, this is the GPU timeline. We can't tell much from the default view, but we can zoom in by using the `+` magnifying glass at the top of the window. If you zoom in far enough, you'll begin to see a pattern like the one in the screenshot below. The teal and purple boxes are the compute kernels that go with the two loops in our kernels region. Each of these groupings of kernels is surrounded by tan coloer boxes representing data movement. What this graph is showing us is that for every step of our `while` loop, we're copying data to the GPU and then back out. Let's try to figure out why.\n",
"\n",
"<div align=\"center\"><img src=\"files/pgprof17_step2_timeline.png\" width=\"60%\"></div>\n",
"\n",
"The compiler feedback we collected earlier tells you quite a bit about data movement. If we look again at the compiler feedback from above, we see the following.\n",
"\n",
" 36, Generating copyout(Anew[1:1022][1:1022])\n",
" Generating copyin(A[:][:])\n",
" Generating copyout(A[1:1022][1:1022])\n",
"\n",
"This is telling us that the compiler has inserted data movement around our `kernels` region at line 36 which copies the `A` array *in* and *out* of GPU memory and also copies `Anew` out. This problem of copying back and forth on every iteration of a loop is sometimes called \"data sloshing\".\n",
"\n",
"The OpenACC compiler can only work with the information we have given it. It knows we need the `A` and `Anew` arrays on the GPU for each of our two accelerated sections, but we didn't tell it anything about what happens to the data outside of those sections. Without this knowledge, it has to copy the full arrays *to the GPU and back to the CPU* for each accelerated section, *every time* it went through the while loop. That is a LOT of wasted data transfers.\n",
"\n",
"Ideally, we would just transfer `A` to the GPU at the beginning of the Jacobi Iteration, and then only transfer `A` back to the CPU at the end. As for `Anew`, it's only used within this region, so we don't need to copy any data back and forth, we only need to `create` space on the device for this array.\n",
"\n",
"Because overall accelerator performance is detetermined largely by how well memory transfers are optimized, the OpenACC specification defines the `data` directive and several modifying clauses to manage all the various forms of data movement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Step 3 - Manage Data Movement\n",
"\n",
"We need to give the compiler more information about how to reduce unnecessary data movement for the Jacobi Iteration. We are going to do this with the OpenACC `data` directive and some modifying clauses defined in the OpenACC specification. \n",
"\n",
"In C, the `data` directive applies to the next structured code block. The compiler will manage data according to the provided clauses. It does this at the beginning of the `data` directive code block, and then again at the end. Some of the clauses available for use with the `data` directive are:\n",
"\n",
"* `copy( list )` - Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.\n",
"* `copyin( list )` - Allocates memory on GPU and copies data from host to GPU when entering region.\n",
"* `copyout( list )` - Allocates memory on GPU and copies data to the host when exiting region.\n",
"* `create( list )` - Allocates memory on GPU but does not copy.\n",
"* `present( list )` - Data is already present on GPU from another containing data region.\n",
"\n",
"As an example, the following directive copies array A to the GPU at the beginning of the code block, and back to the CPU at the end. It also copies arrays B and C *to the CPU* at the *end* of the code block, but does **not** copy them both to the GPU at the beginning:\n",
"\n",
"<pre><code>#pragma acc data copy( A ), copyout( B, C )\n",
"{\n",
" ....\n",
"}</code></pre>\n",
"\n",
"For detailed information on the `data` directive clauses, you can refer to the [OpenACC 2.5](http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf) specification.\n",
"\n",
"In the `task3.c` file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at `task3_solution.c` to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Hint #1](#Step-#3---Hint-#1) \n",
"[Hint #2](#Step-#3---Hint-#2) \n",
"[Hint #3](#Step-#3---Hint-#3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you think you have `task3.c` saved with a directive to manage data transfer, compile it with the below command and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgcc -fast -acc -Minfo=accel -ta=tesla -o task3_out task3/task3.c && echo 'Compiled Successfully'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How are we doing on our timings? Let's execute our step 3 program and see if we have indeed accelerated the application versus the execution time we recorded after Step #2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"./task3_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After making these changes, our accelerator code is much faster -- with just a few lines of OpenACC directives we have made our code more than twice as fast by running it on an accelerator, as shown in this table.\n",
"\n",
"|Step| Execution | Time (s) | Speedup vs. 1 CPU thread | Correct? | Programming Time |\n",
"| -- | ------------------------- | ---------------------- | ------------------------ | -------- | ---------------- |\n",
"|1| CPU 1 thread | 2.95 | | | |\n",
"|2| Add kernels directive | 4.93 | 0.60X | Yes | |\n",
"|3| Manage data movement | 0.45 | 6.56X | Yes | ||\n",
"\n",
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"We are making good progress, but we can still improve performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4 - Optimize Kernel Scheduling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The final step in our tuning process is to tune the OpenACC compute region schedules using the *gang* and *vector* clauses. These clauses let us use OpenACC directives to take more explicit control over how the compiler parallelizes our code for the accelerator we will be using. \n",
"\n",
"Kernel scheduling optimizations *may* give you significantly higher speedup, but be aware that these particular optimizations can significantly reduce performance portability. The vast majority of the time, the default kernel schedules chosen by the OpenACC compilers are quite good, but other times the compiler doesn't do as well. Let's spend a little time examining how we could do better, if we were in a situation where we felt we needed to. \n",
"\n",
"First, we need to get some additional insight into how our Jacobi Iteration code with the data optimizations is running on the accelerator. Let's run the C code with all your data movement optimizations on the accelerator again. We could use `pgprof` again, but this time setting just the environment variable PGI_ACC_TIME, which will print some high level timers for us without leaving our command shell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"export PGI_ACC_TIME=1\n",
"pgcc -acc -fast -ta=tesla -Minfo=accel -o accel_timing_out task3/task3.c\n",
"./accel_timing_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This generates some information we haven't seen previously from the PGI compiler:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"````\n",
"Accelerator Kernel Timing data\n",
"/notebooks/C/task3/task3.c\n",
" main NVIDIA devicenum=0\n",
" time(us): 379,107\n",
" 34: data region reached 2 times\n",
" 34: data copyin transfers: 1\n",
" device time(us): total=474 max=474 min=474 avg=474\n",
" 68: data copyout transfers: 1\n",
" device time(us): total=473 max=473 min=473 avg=473\n",
" 37: compute region reached 1000 times\n",
" 37: data copyin transfers: 1000\n",
" device time(us): total=8,775 max=24 min=2 avg=8\n",
" 44: kernel launched 1000 times\n",
" grid: [32x256] block: [32x4]\n",
" device time(us): total=234,338 max=245 min=233 avg=234\n",
" elapsed time(us): total=255,302 max=542 min=252 avg=255\n",
" 44: reduction kernel launched 1000 times\n",
" grid: [1] block: [256]\n",
" device time(us): total=20,969 max=28 min=20 avg=20\n",
" elapsed time(us): total=41,157 max=61 min=38 avg=41\n",
" 44: data copyout transfers: 1000\n",
" device time(us): total=18,007 max=29 min=13 avg=18\n",
" 55: kernel launched 1000 times\n",
" grid: [32x256] block: [32x4]\n",
" device time(us): total=96,071 max=105 min=95 avg=96\n",
" elapsed time(us): total=117,348 max=191 min=115 avg=117\n",
"````"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is a lot of information here about how the compiler mapped the computational kernels in our program to our particular accelerator (in this case, an NVIDIA GPU). We can see three regions. The first one is the memcopy loop nest starting on line 34, which takes only a tiny fraction of the total system time. The second region is the nested computation loop starting on line 44, which takes about 0.25 seconds. The copyback (*copyout*) loop then executes beginning with line 68. We can see that region takes very little time -- which tells us there is no other part of the program that takes significant time. If we look at the main loop nests, we can see these lines: "
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"grid: [32x256] block[32x4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The terms *grid* and *block* come from the CUDA programming model. A GPU executes groups of threads called *thread blocks*. To execute a kernel, the application launches a *grid* of these thread blocks. Each block runs on one of the GPUs *multiprocessors* and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 32x4, 128 threads each. The grid the compiler has constructed is also 2D, 32 blocks wide and 256 blocks tall. This is just enough to cover our 1024x1024 grid. But we don't really need that many blocks -- if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.\n",
"\n",
"*Note: You can let the compiler do the hard work of mapping loop nests, unless you are certain you can do it better (and portability is not a concern.) When you decide to intervene, think about different parallelization strategies (loop schedules): in nested loops, distributing the work of the outer loops to the GPU multiprocessors (on OpenACC = gangs) in 1D grids. Similarly, think about mapping the work of the inner loops to the cores of the multiprocessors (CUDA threads, vectors) in 1D blocks. The grids (gangs) and block (vector) sizes can be viewed by setting the environment variable ACC_NOTIFY. To get you started, here are some experiments we conducted for these computational kernels and this accelerator:*\n",
"\n",
"| Accelerator | Grid |Outer Loop Gang | Outer Loop Vector | Inner Loop Gang | Inner Loop Vector | Seconds |\n",
"| ----------- | ------------- |------------- | --------------- | ------------- | --------------- | ------- |\n",
"| GK520 | 1024x1024 | | 8 | | 32 | 0.508 |\n",
"| | | | 4 | | 64 | 0.510 |\n",
"| | | | | 8 | 32 | 0.379 |\n",
"| | | | | 16 | 32 | 0.410 |\n",
"| | | | | 4 | 64 | 0.379 |\n",
"\n",
"Try to modify the `task4.c` code for the main computational loop nests in the window below. You'll be using the openacc loop constructs `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Hint #1](#Step-#4---Hint-#1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After you've made some changes, save your work, then compile and run in the boxes below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgcc -acc -Minfo=accel -fast -ta=tesla -o task4_out task4/task4.c && echo 'Compiled Successfully'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"./task4_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at `task4_solution.c`, the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"export PGI_ACC_TIME=1\n",
"./task4_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note: we usually want the inner loop to be vectorized, because it allows coalesced loading of data from global memory. This is almost guaranteed to give a big performance increase. Other optimizations are often trial and error. When selecting grid sizes, the most obvious mapping is to have*\n",
"\n",
" the number of gangs * the number of workers * the number of vectors = the total problem size. \n",
" \n",
"*We may choose to manipulate this number, as we are doing here, so that each thread does multiple pieces of work -- this helps amortize the cost of setup for simple kernels.*\n",
"\n",
"*Note: Low-level languages like CUDA C/C++ offer more direct control of the hardware. You can consider optimizing your most critical loops in CUDA C/C++ if you need to extract every last bit of performance from your application, while recognizing that doing so may impact the portability of your code. OpenACC and CUDA C/C++ are fully interoperable.*\n",
"\n",
"A similar change to the copy loop nest benefits performance by a small amount. After you've made all your changes (look at task4_solution.c to be sure) compile your code below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_out task4/task4.c && echo 'Compiled Successfully'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then run it and record the run time of the optimized code in the table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"./task4_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the perfomance after these final optimizations:\n",
"\n",
"|Step| Execution | Time (s) | Speedup vs. 1 CPU thread | Correct? | Programming Time |\n",
"| -- | ------------------------- | ---------------------- | ------------------------ | -------- | ---------------- |\n",
"|1| CPU 1 thread | 2.95 | | | |\n",
"|2| Add kernels directive | 4.93 | 0.60X | | |\n",
"|3| Manage data movement | 0.45 | 6.56X | Yes | ||\n",
"|4| Optimize kernel scheduling | 0.33 | 8.9X | Yes | ||\n",
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at `task1_omp.c` in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Running this using 8-OpenMP threads on an Intel Xeon E5-2670 , our Kepler GK520 about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-OpenMP thread version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
"\n",
"First, compile the OpenMP version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgcc -fast -mp -Minfo -o task4_4096_omp task4/task4_4096_omp.c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now run the OpenMP code you just created, and record your results in the new table for the larger matrix.\n",
"\n",
"*Note: because our dataset has now grown by 16-fold your CPU may not seem as responsive. We're using `-Minfo` in the compile so you can see that something is indeed happening, but you may need to be patient*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"OMP_NUM_THREADS=8 ./task4_4096_omp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, compile and run the OpenACC solution for the larger 4096x4096 matrix using the next two boxes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_4096_out task4/task4_4096_solution.c && echo 'Compiled Successfully'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"./task4_4096_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's our comparison with the larger matrix size:\n",
"\n",
"| Execution | matrix size | Time (s) | Speedup vs. 8 CPU threads | Correct? | Programming Time |\n",
"| -------------------- | ----------- | -------- | ------------------------- | | |\n",
"| CPU 8 threads | 4096x4096 | 15.03 | | YES | |\n",
"| GPU optimized kernel | 4096x4096 | 3.44 | 4.37X | Yes | ||\n",
"\n",
"*Note: System Information: GK520; Compiler: PGI Community Edition 17.4*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Learn More\n",
"\n",
"If you are interested in learning more about OpenACC, you can use the following resources:\n",
"\n",
"* [openacc.org](http://openacc.org/)\n",
"* [OpenACC on CUDA Zone](https://developer.nvidia.com/openacc)\n",
"* Search or ask questions on [Stackoverflow](http://stackoverflow.com/questions/tagged/openacc) using the openacc tag\n",
"* Get the free [PGI Comunity Edition](https://www.pgroup.com/products/community.htm) compiler.\n",
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hints\n",
"\n",
"## Step #2 - Hint #1\n",
"Remember that in C, an OpenACC directive applies to the next structured code block. So for example, the following applies the <code>kernels</code> directive to the outer <code>for</code> loop and everything inside of it:\n",
"\n",
"```c\n",
" #pragma acc kernels\n",
" for ( int i = 0; i < n-1; i++ )\n",
" {\n",
" for ( int j = 0; j < n-1; j++)\n",
" ...\n",
" }\n",
"```\n",
"\n",
"## Step #2 - Hint #2\n",
"If you choose to use only one `#pragma acc kernels` region -- which we recommend, because it demonstrates the power of the *kernels* directive -- you will need to add some additional `{ }` brackets so it applies to the correct region of code.\n",
"\n",
"[Return to Step #2](#Step-2---Add-Compute-Directives)\n",
"\n",
"## Step #3 - Hint #1\n",
"You should only have to worry about managing the transfer of data in arrays `A` and `Anew`.\n",
"\n",
"## Step #3 - Hint #2\n",
"You want to put the data directive just above the outer `while` loop.\n",
"\n",
"## Step #3 - Hint #3\n",
"You'll want to `copy( A )` so it is transferred to the GPU and back again after the final iterations through the `data` region. But you only need to `create( Anew )` as it is just used for temporary storage on the GPU, so there is no need to ever transfer it back and forth.\n",
"\n",
"[Return to step #3](#Step-3---Manage-Data-Movement)\n",
"\n",
"## Step #4 - Hint #1\n",
"You'll want a gang() and vector() clause on the inner loops, but you may want to let the compiler decide the dimensions of the outer loops.In that case, you can use a loop directive without any modifying clauses.\n",
"\n",
"[Return to step #4](#Step-4---Optimize-Kernel-Scheduling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<style>\n",
"p.hint_trigger{\n",
" margin-bottom:7px;\n",
" margin-top:-5px;\n",
" background:#64E84D;\n",
"}\n",
".toggle_container{\n",
" margin-bottom:0px;\n",
"}\n",
".toggle_container p{\n",
" margin:2px;\n",
"}\n",
".toggle_container{\n",
" background:#f0f0f0;\n",
" clear: both;\n",
" font-size:100%;\n",
"}\n",
"</style>\n",
"<script>\n",
"$(\"p.hint_trigger\").click(function(){\n",
" $(this).toggleClass(\"active\").next().slideToggle(\"normal\");\n",
"});\n",
" \n",
"$(\".toggle_container\").hide();\n",
"</script>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 1
}