Compare commits
4 Commits
nvidia
...
9e3dbd99b7
Author | SHA1 | Date | |
---|---|---|---|
9e3dbd99b7 | |||
b961b06bb6 | |||
53ea1d338c | |||
cce7f726c0 |
@ -6,8 +6,6 @@
|
||||
"source": [
|
||||
"# OpenACC: 2X in 4 Steps (for C)\n",
|
||||
"\n",
|
||||
"([Fortran version](../FORTRAN/OpenACC%20Fortran.ipynb))\n",
|
||||
"\n",
|
||||
"In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic C program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n",
|
||||
"\n",
|
||||
"Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)"
|
||||
@ -17,46 +15,47 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"---\n",
|
||||
"Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system. To do this, execute the cell block *below* by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or by pressing the play button in the toolbar *above*. If all goes well, you should see some output returned below the grey cell. If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue."
|
||||
"Let's begin by getting information about the GPUs on the server by running the command below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Tue Jun 20 13:10:41 2017 \n",
|
||||
"+-----------------------------------------------------------------------------+\n",
|
||||
"| NVIDIA-SMI 375.66 Driver Version: 375.66 |\n",
|
||||
"|-------------------------------+----------------------+----------------------+\n",
|
||||
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
|
||||
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
|
||||
"|===============================+======================+======================|\n",
|
||||
"| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\n",
|
||||
"| 23% 59C P0 27W / 99W | 690MiB / 1996MiB | 1% Default |\n",
|
||||
"+-------------------------------+----------------------+----------------------+\n",
|
||||
" \n",
|
||||
"+-----------------------------------------------------------------------------+\n",
|
||||
"| Processes: GPU Memory |\n",
|
||||
"| GPU PID Type Process name Usage |\n",
|
||||
"|=============================================================================|\n",
|
||||
"| 0 1982 G /usr/lib/xorg/Xorg 357MiB |\n",
|
||||
"| 0 2997 G compiz 166MiB |\n",
|
||||
"| 0 3233 G /usr/lib/firefox/firefox 1MiB |\n",
|
||||
"| 0 3449 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 25MiB |\n",
|
||||
"| 0 11015 G ...el-token=53D41F0E8A4B8A669C123908959A0849 137MiB |\n",
|
||||
"+-----------------------------------------------------------------------------+\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print \"The answer should be three: \" + str(1+2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next let's get information about the GPUs on the server by executing the cell below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"---\n",
|
||||
"The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage. If you've never taken a lab on this system before, we highly encourage you to watch this short video first.<br><br>\n",
|
||||
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/ZMrDaLSFqpY\" frameborder=\"0\" allowfullscreen>"
|
||||
"%%bash\n",
|
||||
"nvidia-smi"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -69,13 +68,7 @@
|
||||
"\n",
|
||||
"If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n",
|
||||
"\n",
|
||||
"Watch the following short video introduction to OpenACC:\n",
|
||||
"\n",
|
||||
"<div align=\"center\"><iframe width=\"640\" height=\"390\" style=\"margin: 0 auto;\" src=\"http://www.youtube.com/embed/c9WYCFEt_Uo\" frameborder=\"0\" allowfullscreen></iframe></div>\n",
|
||||
"\n",
|
||||
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost.\n",
|
||||
"\n",
|
||||
"If you are confused now, or at any point in this lab, you can consult the <a href=\"#FAQ\">FAQ</a> located at the bottom of this page."
|
||||
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -127,7 +120,7 @@
|
||||
"\n",
|
||||
"We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n",
|
||||
"\n",
|
||||
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/UOSYi3oLlRs\" frameborder=\"0\" allowfullscreen></iframe></div>"
|
||||
"http://www.youtube.com/embed/UOSYi3oLlRs"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -214,40 +207,64 @@
|
||||
"source": [
|
||||
"### Benchmarking\n",
|
||||
"\n",
|
||||
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.c`](/rpWFwS8c/edit/C/task1/task1.c) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two cells to compile and run the program."
|
||||
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.c`](/4vwkFv7K/edit/C/task1/task1.c) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Compiled Successfully!\r\n"
|
||||
"Compiled Successfully!\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%%bash\n",
|
||||
"# To be sure we see some output from the compiler, we'll echo out \"Compiled Successfully!\" \n",
|
||||
"#(if the compile does not return an error)\n",
|
||||
"!pgcc -fast -o task1_pre_out task1/task1.c && echo \"Compiled Successfully!\""
|
||||
"pgcc -fast -o task1_pre_out task1/task1.c && echo 'Compiled Successfully!'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"collapsed": true,
|
||||
"collapsed": false,
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
|
||||
" 0, 0.250000\n",
|
||||
" 100, 0.002397\n",
|
||||
" 200, 0.001204\n",
|
||||
" 300, 0.000804\n",
|
||||
" 400, 0.000603\n",
|
||||
" 500, 0.000483\n",
|
||||
" 600, 0.000403\n",
|
||||
" 700, 0.000345\n",
|
||||
" 800, 0.000302\n",
|
||||
" 900, 0.000269\n",
|
||||
" total: 2.815460 s\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Execute our single-thread CPU-only Jacobi Iteration to get timing information. Make sure you compiled successfully in the \n",
|
||||
"# above cell first.\n",
|
||||
"!./task1_pre_out"
|
||||
"%%bash\n",
|
||||
"# Execute our single-thread CPU-only Jacobi Iteration to get timing information. \n",
|
||||
"# Make sure you compiled successfully in the \n",
|
||||
"# above command first.\n",
|
||||
"./task1_pre_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -281,34 +298,72 @@
|
||||
"source": [
|
||||
"### Profiling\n",
|
||||
"\n",
|
||||
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following cell, and then click on the link below it to see the pgprof interface"
|
||||
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the pgprof interface"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Compiled Successfully!\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"GetTimer:\n",
|
||||
" 3, include \"timer.h\"\n",
|
||||
" 62, FMA (fused multiply-add) instruction(s) generated\n",
|
||||
"main:\n",
|
||||
" 25, Loop not fused: function call before adjacent loop\n",
|
||||
" Loop not vectorized: may not be beneficial\n",
|
||||
" Unrolled inner loop 8 times\n",
|
||||
" Generated 7 prefetches in scalar loop\n",
|
||||
" 42, Generated vector simd code for the loop containing reductions\n",
|
||||
" Generated 3 prefetch instructions for the loop\n",
|
||||
" Residual loop unrolled 2 times (completely unrolled)\n",
|
||||
" 52, Memory copy idiom, loop replaced by call to __c_mcopy4\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%%bash\n",
|
||||
"pgcc -Minfo=all,ccff -fast -o task1/task1_simple_out task1/task1_simple.c && echo 'Compiled Successfully!'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this lab, to open the PGI profiler run the following command."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%bash\n",
|
||||
"pgcc -Minfo=all,ccff -fast -o task1/task1_simple_out task1/task1_simple.c && echo \"Compiled Successfully!\""
|
||||
"pgprof"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this lab, to open the PGI profiler in a new window <a href=\"/vnc\" onclick=\"window.open(this.href, 'Profiler',\n",
|
||||
"'left=20,top=20,width=1280,height=724,toolbar=1,resizable=0'); return false;\">click here</a>."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, clicking `ubuntu` on the left side of the file selector, then selecting `notebook` and then `C`, and finally selecting `task_simple_out`.\n",
|
||||
"Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, then selecting `task_simple_out` from `OpenACC/labs/lab1/C`.\n",
|
||||
"\n",
|
||||
"<div align=\"center\"><img src=\"files/pgprof17_create_new_session.png\" width=\"60%\"></div>\n",
|
||||
"\n",
|
||||
@ -369,7 +424,7 @@
|
||||
"source": [
|
||||
"One, two or several loops may be inside the structured block. The kernels directive will try to parallelize them, telling you what it found and generating as many kernels as it thinks it safely can. At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of and embracing *both* the two for-loop codeblocks that follow the while loop using curly braces. The kernels directive is designed to find the parallel acceleration opportunities implicit in the for-loops in the Jacobi Iteration code. \n",
|
||||
"\n",
|
||||
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.c](/rpWFwS8c/edit/C/task2/task2_solution.c) to see the answer."
|
||||
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.c](/4vwkFv7K/edit/C/task2/task2_solution.c) to see the answer."
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -384,21 +439,22 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's now compile our [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) file by executing the cell below with Ctrl-Enter (or press the play button in the toolbar above). Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
|
||||
"Let's now compile our [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file by executing the command below with Ctrl-Enter (or press the play button in the toolbar above). Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%bash\n",
|
||||
"# Compile the task2.c file with the pgcc compiler\n",
|
||||
"# -acc tells the compiler to process the source recognizing #pragma acc directives\n",
|
||||
"# -Minfo tells the compiler to share information about the compilation process\n",
|
||||
"!pgcc -acc -Minfo -fast -ta=tesla -o task2_out task2/task2.c && echo \"Compiled Successfully\""
|
||||
"pgcc -acc -Minfo -fast -ta=tesla -o task2_out task2/task2.c && echo 'Compiled Successfully'"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -444,25 +500,26 @@
|
||||
"\n",
|
||||
"So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefully inspect this information to make sure the compiler is doing what you've asked of it.\n",
|
||||
"\n",
|
||||
"Finally, let's execute this program to verify we are getting the correct answer (execute the cell below). "
|
||||
"Finally, let's execute this program to verify we are getting the correct answer (execute the command below). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you feel your code is correct, try running it by executing the cell block below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
|
||||
"Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!./task2_out"
|
||||
"%%bash\n",
|
||||
"./task2_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -524,7 +581,7 @@
|
||||
"\n",
|
||||
"For detailed information on the `data` directive clauses, you can refer to the [OpenACC 2.5](http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf) specification.\n",
|
||||
"\n",
|
||||
"In the [`task3.c`](/rpWFwS8c/edit/C/task3/task3.c) file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [`task3_solution.c`](/rpWFwS8c/edit/C/task3/task3_solution.c) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
|
||||
"In the [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [`task3_solution.c`](/4vwkFv7K/edit/C/task3/task3_solution.c) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -540,18 +597,19 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once you think you have [`task3.c`](/rpWFwS8c/edit/C/task3/task3.c) saved with a directive to manage data transfer, compile it with the below cell and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
|
||||
"Once you think you have [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) saved with a directive to manage data transfer, compile it with the below command and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pgcc -fast -acc -Minfo=accel -ta=tesla -o task3_out task3/task3.c && echo \"Compiled Successfully\""
|
||||
"%%bash\n",
|
||||
"pgcc -fast -acc -Minfo=accel -ta=tesla -o task3_out task3/task3.c && echo 'Compiled Successfully'"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -565,11 +623,12 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!./task3_out"
|
||||
"%%bash\n",
|
||||
"./task3_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -611,7 +670,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -692,7 +751,7 @@
|
||||
"| | | | | 16 | 32 | 0.410 |\n",
|
||||
"| | | | | 4 | 64 | 0.379 |\n",
|
||||
"\n",
|
||||
"Try to modify the [`task4.c`](/rpWFwS8c/edit/C/task4/task4.c) code for the main computational loop nests in the window below. You'll be using the OpenACC loop clauses `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
|
||||
"Try to modify the [`task4.c`](/4vwkFv7K/edit/C/task4/task4.c) code for the main computational loop nests in the window below. You'll be using the OpenACC loop clauses `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -713,36 +772,38 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pgcc -acc -Minfo=accel -fast -ta=tesla -o task4_out task4/task4.c && echo \"Compiled Successfully\""
|
||||
"%%bash\n",
|
||||
"pgcc -acc -Minfo=accel -fast -ta=tesla -o task4_out task4/task4.c && echo 'Compiled Successfully'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!./task4_out"
|
||||
"%%bash\n",
|
||||
"./task4_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Looking at [task4_solution.c](/rpWFwS8c/edit/C/task4/task4_solution.c), the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of a clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
|
||||
"Looking at [task4_solution.c](/4vwkFv7K/edit/C/task4/task4_solution.c), the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of a clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -770,11 +831,12 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_out task4/task4.c && echo \"Compiled Successfully\""
|
||||
"%%bash\n",
|
||||
"pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_out task4/task4.c && echo 'Compiled Successfully'"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -788,11 +850,12 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!./task4_out"
|
||||
"%%bash\n",
|
||||
"./task4_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -814,7 +877,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at [task1_omp.c](/rpWFwS8c/edit/C/task4/task1_omp.c) in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Compared with running this using 8 OpenMP threads on an Intel Xeon E5-2670, our Kepler GK520 is about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-thread OpenMP version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
|
||||
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at [task1_omp.c](/4vwkFv7K/edit/C/task4/task1_omp.c) in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Compared with running this using 8 OpenMP threads on an Intel Xeon E5-2670, our Kepler GK520 is about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-thread OpenMP version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
|
||||
"\n",
|
||||
"First, compile the OpenMP version:"
|
||||
]
|
||||
@ -823,7 +886,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@ -844,11 +907,12 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!OMP_NUM_THREADS=8 ./task4_4096_omp"
|
||||
"%%bash\n",
|
||||
"OMP_NUM_THREADS=8 ./task4_4096_omp"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -862,22 +926,24 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_4096_out task4/task4_4096_solution.c && echo \"Compiled Successfully\""
|
||||
"%%bash\n",
|
||||
"pgcc -acc -fast -ta=tesla -Minfo=accel -o task4_4096_out task4/task4_4096_solution.c && echo 'Compiled Successfully'"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!./task4_4096_out"
|
||||
"%%bash\n",
|
||||
"./task4_4096_out"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -906,49 +972,7 @@
|
||||
"* [OpenACC on CUDA Zone](https://developer.nvidia.com/openacc)\n",
|
||||
"* Search or ask questions on [Stackoverflow](http://stackoverflow.com/questions/tagged/openacc) using the openacc tag\n",
|
||||
"* Get the free [PGI Community Edition](https://www.pgroup.com/products/community.htm) compiler.\n",
|
||||
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n",
|
||||
"\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
"<a id=\"post-lab\"></a>\n",
|
||||
"## Post-Lab\n",
|
||||
"\n",
|
||||
"Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!!\n",
|
||||
"\n",
|
||||
"1. Save this IPython Notebook by going to `File -> Download as -> IPython (.ipynb)` at the top of this window\n",
|
||||
"2. You can execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%bash\n",
|
||||
"rm -f openacc_files.zip\n",
|
||||
"zip -r openacc_files.zip task*/*.c task*/*.h"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**After** executing the above zip command, you should be able to download the zip file [here](files/openacc_files.zip)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<a id=\"FAQ\"></a>\n",
|
||||
"---\n",
|
||||
"# Lab FAQ\n",
|
||||
"\n",
|
||||
"Q: I'm encountering issues executing the cells, or other technical problems?<br>\n",
|
||||
"A: Please see [this](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) infrastructure FAQ."
|
||||
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@ -1039,7 +1063,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython2",
|
||||
"version": "2.7.6"
|
||||
"version": "2.7.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
|
BIN
lab1/C/OpenACC C.pdf
Normal file
Binary file not shown.
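
Note on the notebook text changed above: it describes wrapping both Jacobi loop nests in one `kernels` region and adding a `data` directive with a `create` clause for `Anew`. The lab sources (`task2.c`, `task3.c`) are not part of this diff, so the following is only a minimal sketch of that pattern, not the lab's actual solution. The mesh size `N`, the function names `init_mesh` and `jacobi`, and the left-edge boundary value are assumptions chosen for illustration. The max-error update is written with plain comparisons to avoid a math-library dependency; whether a particular compiler treats it as a reduction should be checked in its `-Minfo` output.

```c
#include <assert.h>

#define N 64  /* hypothetical mesh size; the lab output above reports 1024 x 1024 */

static float A[N][N];
static float Anew[N][N];

/* Zero the mesh and hold the left boundary column at 1.0
 * (assumed boundary condition for this sketch). */
void init_mesh(void)
{
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            A[j][i] = 0.0f;
            Anew[j][i] = 0.0f;
        }
        A[j][0] = 1.0f;
        Anew[j][0] = 1.0f;
    }
}

/* Run at most iter_max sweeps, stopping early once the largest
 * per-cell change drops to tol or below. Returns the largest change
 * seen in the final sweep. */
float jacobi(int iter_max, float tol)
{
    assert(iter_max > 0);

    float error = tol + 1.0f;  /* force at least one sweep */
    int iter = 0;

    /* Keep A resident on the accelerator for the whole loop;
     * Anew is scratch space that never needs to be copied. */
    #pragma acc data copy(A) create(Anew)
    while (error > tol && iter < iter_max) {
        error = 0.0f;

        /* One kernels region embracing both loop nests. */
        #pragma acc kernels
        {
            for (int j = 1; j < N - 1; j++) {
                for (int i = 1; i < N - 1; i++) {
                    Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1]
                                        + A[j - 1][i] + A[j + 1][i]);
                    float diff = Anew[j][i] - A[j][i];
                    if (diff < 0.0f) diff = -diff;
                    if (diff > error) error = diff;
                }
            }

            for (int j = 1; j < N - 1; j++) {
                for (int i = 1; i < N - 1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
        }

        iter++;
    }

    return error;
}
```

A caller would run `init_mesh()` and then something like `jacobi(1000, 1.0e-4f)`. Built without OpenACC support, the pragmas are ignored and the same code runs serially, which gives a convenient baseline for comparing results.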