Adjusted the Fortran version of the lab to work without Jupyter.

This commit is contained in:
F. Dijkstra 2017-06-27 14:48:23 +02:00
parent b95b8958d6
commit 32de9f0684
2 changed files with 116 additions and 148 deletions


@@ -15,46 +15,47 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system. To do this, execute the cell block *below* by giving it focus (clicking on it with your mouse) and hitting Ctrl-Enter, or by pressing the play button in the toolbar *above*. If all goes well, you should see some output returned below the grey cell. If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue."
"Next let's get information about the GPUs on the server by executing the command below."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tue Jun 27 14:46:47 2017 \n",
"+-----------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 375.66 Driver Version: 375.66 |\n",
"|-------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n",
"|===============================+======================+======================|\n",
"| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\n",
"| 18% 56C P8 10W / 99W | 741MiB / 1996MiB | 0% Default |\n",
"+-------------------------------+----------------------+----------------------+\n",
" \n",
"+-----------------------------------------------------------------------------+\n",
"| Processes: GPU Memory |\n",
"| GPU PID Type Process name Usage |\n",
"|=============================================================================|\n",
"| 0 1942 G /usr/lib/xorg/Xorg 423MiB |\n",
"| 0 3184 G compiz 136MiB |\n",
"| 0 3392 G /usr/lib/firefox/firefox 1MiB |\n",
"| 0 3526 G ...el-token=6C5C01D5B0057C12B571711999D42376 145MiB |\n",
"| 0 3636 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 31MiB |\n",
"+-----------------------------------------------------------------------------+\n"
]
}
],
"source": [
"print \"The answer should be three: \" + str(1+2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let's get information about the GPUs on the server by executing the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on its usage. If you've never taken a lab on this system before, it's highly encouraged that you watch this short video first.<br><br>\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/ZMrDaLSFqpY\" frameborder=\"0\" allowfullscreen></iframe></div>"
"%%bash\n",
"nvidia-smi"
]
},
{
@@ -67,10 +68,6 @@
"\n",
"If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran or C source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n",
"\n",
"Watch the following short video introduction to OpenACC:\n",
"\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" style=\"margin: 0 auto;\" src=\"http://www.youtube.com/embed/c9WYCFEt_Uo\" frameborder=\"0\" allowfullscreen></iframe></div>\n",
"\n",
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple Fortran program, then compile and execute the newly enhanced code in each step. Along the way, hints and solutions are provided, so you can check your work, or take a peek if you get lost.\n",
"\n",
"If you are confused now, or at any point in this lab, you can consult the <a href=\"#FAQ\">FAQ</a> located at the bottom of this page."
@@ -125,7 +122,7 @@
"\n",
"We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n",
"\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/UOSYi3oLlRs\" frameborder=\"0\" allowfullscreen></iframe></div>"
"http://www.youtube.com/embed/UOSYi3oLlRs"
]
},
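The update at the heart of the Jacobi iteration described above can be sketched in plain Fortran. This is a minimal serial sketch, not the lab's actual source: the array names `A` and `Anew`, the 1024x1024 size, and the boundary condition are illustrative assumptions.

```fortran
program jacobi_sketch
  implicit none
  integer, parameter :: n = 1024, m = 1024   ! illustrative mesh size
  real, dimension(n,m) :: A, Anew
  real :: error
  integer :: i, j

  A = 0.0
  A(1,:) = 1.0        ! a simple boundary condition for illustration
  error = 0.0

  ! One Jacobi sweep: each interior point becomes the average of its
  ! four neighbours; the largest change drives the convergence test.
  do j = 2, m-1
    do i = 2, n-1
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
      error = max( error, abs(Anew(i,j) - A(i,j)) )
    end do
  end do
  print *, 'max change after one sweep:', error
end program jacobi_sketch
```

In the real program this sweep sits inside a `do while` loop that repeats until `error` falls below a tolerance.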
{
@@ -191,33 +188,62 @@
"source": [
"### Benchmarking\n",
"\n",
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.f90`](/rpWFwS8c/edit/FORTRAN/task1/task1.f90) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two cells to compile and run the program."
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `task1.f90` file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Compiled Successfully!\n"
]
}
],
"source": [
"%%bash\n",
"# To be sure we see some output from the compiler, we'll echo out \"Compiled Successfully!\" \n",
"#(if the compile does not return an error)\n",
"!pgfortran -fast -o task1_pre_out task1/task1.f90 && echo \"Compiled Successfully!\""
"pgfortran -fast -o task1_pre_out task1/task1.f90 && echo \"Compiled Successfully!\""
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0 0.250000\n",
" 100 0.002397\n",
" 200 0.001204\n",
" 300 0.000804\n",
" 400 0.000603\n",
" 500 0.000483\n",
" 600 0.000403\n",
" 700 0.000345\n",
" 800 0.000302\n",
" 900 0.000269\n",
"total: 1.507478 s\n"
]
}
],
"source": [
"%%bash\n",
"# Execute our single-thread CPU-only Jacobi Iteration to get timing information. Make sure you compiled\n",
"# successfully in the above cell first.\n",
"!./task1_pre_out"
"# successfully in the above command first.\n",
"./task1_pre_out"
]
},
{
@@ -251,7 +277,7 @@
"source": [
"### Profiling\n",
"\n",
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.f90`](/rpWFwS8c/edit/FORTRAN/task2/task2.f90) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's PGPROF or open-source `gprof`) to determine which functions are consuming the largest amounts of compute time. To profile a program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following cell, and then click on the link below it to see the PGPROF interface."
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify `task2.f90` in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's PGPROF or open-source `gprof`) to determine which functions are consuming the largest amounts of compute time. To profile a program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the PGPROF interface."
]
},
{
@@ -332,7 +358,7 @@
"source": [
"At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of the two do-loop codeblocks that follow the do while loop. The kernels directive is designed to find the parallel acceleration opportunities implicit in the do-loops in the Jacobi Iteration code. \n",
"\n",
"To get some hints about how and where to place your kernels directives, click on the green boxes below. When you feel you are done, **make sure to save the [task2.f90](/rpWFwS8c/edit/FORTRAN/task2/task2.f90) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.f90](/rpWFwS8c/edit/FORTRAN/task2/task2_solution.f90) to see the answer."
"To get some hints about how and where to place your kernels directives, click on the green boxes below. When you feel you are done, **make sure to save the `task2.f90` file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at `task2_solution.f90` to see the answer."
]
},
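One possible placement of the `kernels` directive looks roughly like the sketch below (loop bounds `n` and `m` are assumed, and a single region spanning both loop nests is just one option; `task2_solution.f90` is the authoritative answer):

```fortran
! Sketch of Step 2: ask the compiler to find the parallelism implicit
! in the two loop nests inside the do while loop.
!$acc kernels
do j = 2, m-1
  do i = 2, n-1
    Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
    error = max( error, abs(Anew(i,j) - A(i,j)) )
  end do
end do
do j = 2, m-1
  do i = 2, n-1
    A(i,j) = Anew(i,j)    ! copy the new estimate back for the next sweep
  end do
end do
!$acc end kernels
```

With `kernels`, the compiler decides whether each loop nest is safe to parallelize, which is why inspecting the `-Minfo` output afterwards matters.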
{
@@ -356,7 +382,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now compile our [task2.f90](/rpWFwS8c/edit/FORTRAN/task2/task2.f90) file by executing the cell below with Ctrl-Enter (or press the play button in the toolbar above)."
"Let's now compile our `task2.f90` file by executing the command below with Ctrl-Enter (or press the play button in the toolbar above)."
]
},
{
@@ -367,12 +393,13 @@
},
"outputs": [],
"source": [
"%%bash\n",
"# Compile the task2.f90 file with the pgfortran compiler\n",
"# -fast is the standard optimization flag\n",
"# -acc tells the compiler to process the source recognizing !$acc directives\n",
"# -ta=tesla tells the compiler to target an NVIDIA Tesla accelerator\n",
"# -Minfo tells the compiler to share information about the compilation process\n",
"!pgfortran -fast -acc -ta=tesla -Minfo -o task2_out task2/task2.f90"
"pgfortran -fast -acc -ta=tesla -Minfo -o task2_out task2/task2.f90"
]
},
{
@@ -411,7 +438,7 @@
" 35, !$acc loop gang, vector(32) ! blockidx%x threadidx%x\n",
"````\n",
" \n",
"If you do not get similar output, please check your work and try re-compiling. If you're stuck, you can compare what you have to [task2_solution.f90](/rpWFwS8c/edit/FORTRAN/task2/task2_solution.f90) in the editor above.\n",
"If you do not get similar output, please check your work and try re-compiling. If you're stuck, you can compare what you have to `task2_solution.f90` in the editor above.\n",
"\n",
"*The output provided by the compiler is extremely useful, and should not be ignored when accelerating your own code with OpenACC.* Let's break it down a bit and see what it's telling us.\n",
"\n",
@@ -424,14 +451,14 @@
"\n",
"So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefully inspect this information to make sure the compiler is doing what you've asked of it.\n",
"\n",
"Finally, let's execute this program to verify we are getting the correct answer (execute the cell below). "
"Finally, let's execute this program to verify we are getting the correct answer (execute the command below). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you feel your code is correct, try running it by executing the cell block below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
"Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
]
},
{
@@ -442,7 +469,8 @@
},
"outputs": [],
"source": [
"!./task2_out"
"%%bash\n",
"./task2_out"
]
},
{
@@ -459,7 +487,7 @@
"*Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n",
"\n",
"\n",
"Now, if your solution is similar to the one in [task2_solution.f90](/rpWFwS8c/edit/FORTRAN/task2/task2_solution.f90), you have probably noticed that we're executing **slower** than the non-accelerated, CPU-only version we started with. What gives?! Let's see what pgprof can tell us about the performance of the code. Return to your PGPROF window from earlier, start another new session, but this time loading task2_out as your executable (it's in the same directory as before). This time we'll find a colorful graph of what our program is doing: this is the GPU timeline. We can't tell much from the default view, but we can zoom in by using the + magnifying glass at the top of the window. If you zoom in far enough, you'll begin to see a pattern like the one in the screenshot below. The teal and purple boxes are the compute kernels that go with the two loops in our kernels region. Each of these groupings of kernels is surrounded by tan-colored boxes representing data movement. What this graph is showing us is that for every step of our while loop, we're copying data to the GPU and then back out. Let's try to figure out why.\n",
"Now, if your solution is similar to the one in `task2_solution.f90`, you have probably noticed that we're executing **slower** than the non-accelerated, CPU-only version we started with. What gives?! Let's see what pgprof can tell us about the performance of the code. Return to your PGPROF window from earlier, start another new session, but this time loading task2_out as your executable (it's in the same directory as before). This time we'll find a colorful graph of what our program is doing: this is the GPU timeline. We can't tell much from the default view, but we can zoom in by using the + magnifying glass at the top of the window. If you zoom in far enough, you'll begin to see a pattern like the one in the screenshot below. The teal and purple boxes are the compute kernels that go with the two loops in our kernels region. Each of these groupings of kernels is surrounded by tan-colored boxes representing data movement. What this graph is showing us is that for every step of our while loop, we're copying data to the GPU and then back out. Let's try to figure out why.\n",
"\n",
"<div align=\"center\"><img src=\"files/pgprof17_excessive_data_movement.png\" width=\"60%\"></div>\n",
"\n",
@@ -508,7 +536,7 @@
"\n",
"For detailed information on the `data` directive clauses, you can refer to the [OpenACC 2.5](http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf) specification.\n",
"\n",
"In the [task3.f90](/rpWFwS8c/edit/FORTRAN/task3/task3.f90) editor, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [task3_solution.f90](/rpWFwS8c/edit/FORTRAN/task3/task3_solution.f90) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
"In the `task3.f90` editor, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at `task3_solution.f90` to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
]
},
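One way the `data` region can wrap the convergence loop is sketched below. This is a hedged outline, not the task solution: the clause spelling follows the general `data` directive description above, and `tol`, `iter`, and `iter_max` are assumed names from the surrounding program.

```fortran
! Sketch: copy A to the device once at entry (and back once at exit),
! and allocate Anew on the device only, so the while loop no longer
! triggers a transfer on every iteration.
!$acc data copy(A) create(Anew)
do while ( error > tol .and. iter < iter_max )
  error = 0.0
  !$acc kernels
  do j = 2, m-1
    do i = 2, n-1
      Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
      error = max( error, abs(Anew(i,j) - A(i,j)) )
    end do
  end do
  do j = 2, m-1
    do i = 2, n-1
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels
  iter = iter + 1
end do
!$acc end data
```

Only the scalar `error` still moves between host and device each iteration, which is cheap compared with the full arrays.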
{
@@ -527,7 +555,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you think you have [task3.f90](/rpWFwS8c/edit/FORTRAN/task3/task3.f90) saved with a directive to manage data transfer, compile it with the cell below and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
"Once you think you have `task3.f90` saved with a directive to manage data transfer, compile it with the command below and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
]
},
{
@@ -538,7 +566,8 @@
},
"outputs": [],
"source": [
"!pgfortran -fast -ta=tesla -acc -Minfo=accel -o task3_out task3/task3.f90 && echo 'Compiled Successfully!'"
"%%bash\n",
"pgfortran -fast -ta=tesla -acc -Minfo=accel -o task3_out task3/task3.f90 && echo 'Compiled Successfully!'"
]
},
{
@@ -556,7 +585,8 @@
},
"outputs": [],
"source": [
"!./task3_out"
"%%bash\n",
"./task3_out"
]
},
{
@@ -678,7 +708,7 @@
"| | | | | 16 | 32 | 0.380 |\n",
"| | | | | 4 | 64 | 0.355 |\n",
"\n",
"Try to modify the [task4.f90](/rpWFwS8c/edit/FORTRAN/task4/task4.f90) code for the main computational loop nests in the window below. You'll be using the OpenACC loop constructs `gang()` and `vector()`. Look at [task4_solution.f90](/rpWFwS8c/edit/FORTRAN/task4/task4_solution.f90) if you get stuck:\n"
"Try to modify the `task4.f90` code for the main computational loop nests in the window below. You'll be using the OpenACC loop constructs `gang()` and `vector()`. Look at `task4_solution.f90` if you get stuck:\n"
]
},
{
@@ -699,26 +729,32 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90 && echo \"Compiled Successfully\""
"%%bash\n",
"pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90 && echo \"Compiled Successfully\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!./task4_out"
"%%bash\n",
"./task4_out"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at [task4_solution.f90](/rpWFwS8c/edit/FORTRAN/task4/task4_solution.f90), the gang(8) clause on the inner loop tells the compiler to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of a clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
"Looking at `task4_solution.f90`, the gang(8) clause on the inner loop tells the compiler to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of a clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
]
},
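The clause placement just described can be sketched like this (loop bounds are assumed; the stencil body is abbreviated to the main nest, and `task4_solution.f90` remains the reference):

```fortran
!$acc kernels
do j = 2, m-1            ! no clause: compiler picks the Y-direction shape
  !$acc loop gang(8) vector(32)
  do i = 2, n-1          ! 8 gangs of 32-wide vectors across the columns
    Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
  end do
end do
!$acc end kernels
```

Because a vector width of 32 matches one warp on NVIDIA hardware, this shape tends to keep memory accesses along the fastest-varying index coalesced.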
{
@@ -746,7 +782,7 @@
"\n",
"*Note: Low-level languages like CUDA Fortran offer more direct control of the hardware. You can consider optimizing your most critical loops in CUDA Fortran if you need to extract every last bit of performance from your application, while recognizing that doing so may impact the portability of your code. OpenACC and CUDA Fortran are fully interoperable.*\n",
"\n",
"A similar change to the copy loop nest benefits performance by a small amount. After you've made all your changes (look at [task4_solution.f90](/rpWFwS8c/edit/FORTRAN/task4/task4_solution.f90) to be sure) compile your code below:"
"A similar change to the copy loop nest benefits performance by a small amount. After you've made all your changes (look at `task4_solution.f90` to be sure) compile your code below:"
]
},
{
@@ -757,7 +793,8 @@
},
"outputs": [],
"source": [
"!pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90"
"%%bash\n",
"pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90"
]
},
{
@@ -775,7 +812,8 @@
},
"outputs": [],
"source": [
"!./task4_out"
"%%bash\n",
"./task4_out"
]
},
{
@@ -834,7 +872,8 @@
},
"outputs": [],
"source": [
"!./task4_4096_omp"
"%%bash\n",
"./task4_4096_omp"
]
},
{
@@ -852,7 +891,8 @@
},
"outputs": [],
"source": [
"!pgfortran -acc -Minfo=accel -o task4_4096_out task4/task4_4096_solution.f90"
"%%bash\n",
"pgfortran -acc -Minfo=accel -o task4_4096_out task4/task4_4096_solution.f90"
]
},
{
@@ -863,7 +903,8 @@
},
"outputs": [],
"source": [
"!./task4_4096_out"
"%%bash\n",
"./task4_4096_out"
]
},
{
@@ -892,80 +933,7 @@
"* [OpenACC on CUDA Zone](https://developer.nvidia.com/openacc)\n",
"* Search or ask questions on [Stackoverflow](http://stackoverflow.com/questions/tagged/openacc) using the openacc tag\n",
"* Get the free [PGI Community Edition](https://www.pgroup.com/products/community.htm) compiler.\n",
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'classes' tab at OpenACC.org)\n",
"\n",
"---\n",
"\n",
"<a id=\"post-lab\"></a>\n",
"## Post-Lab\n",
"\n",
"Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!!\n",
"\n",
"1. Save this IPython Notebook by going to `File -> Download as -> IPython (.ipynb)` at the top of this window\n",
"2. You can execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%bash\n",
"rm -f openacc_files.zip\n",
"zip -r openacc_files.zip task*/*.f90"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**After** executing the above zip command, you should be able to download the zip file [here](files/openacc_files.zip)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"FAQ\"></a>\n",
"---\n",
"# Lab FAQ\n",
"\n",
"Q: What if I encounter issues executing the cells, or other technical problems?<br>\n",
"A: Please see [this](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) infrastructure FAQ."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<style>\n",
"p.hint_trigger{\n",
" margin-bottom:7px;\n",
" margin-top:-5px;\n",
" background:#64E84D;\n",
"}\n",
".toggle_container{\n",
" margin-bottom:0px;\n",
"}\n",
".toggle_container p{\n",
" margin:2px;\n",
"}\n",
".toggle_container{\n",
" background:#f0f0f0;\n",
" clear: both;\n",
" font-size:100%;\n",
"}\n",
"</style>\n",
"<script>\n",
"$(\"p.hint_trigger\").click(function(){\n",
" $(this).toggleClass(\"active\").next().slideToggle(\"normal\");\n",
"});\n",
" \n",
"$(\".toggle_container\").hide();\n",
"</script>"
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'classes' tab at OpenACC.org)"
]
}
],
@@ -985,7 +953,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
"version": "2.7.13"
}
},
"nbformat": 4,

Binary file not shown.