{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenACC: 2X in 4 Steps (for Fortran)\n", "\n", "In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic scientific Fortran program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n", "\n", "Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next let's get information about the GPUs on the server by executing the command below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Jun 27 14:46:47 2017 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 375.66 Driver Version: 375.66 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "|===============================+======================+======================|\n", "| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\n", "| 18% 56C P8 10W / 99W | 741MiB / 1996MiB | 0% Default |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: GPU Memory |\n", "| GPU PID Type Process name Usage |\n", "|=============================================================================|\n", "| 0 1942 G /usr/lib/xorg/Xorg 423MiB |\n", "| 0 3184 G compiz 136MiB |\n", "| 0 3392 G /usr/lib/firefox/firefox 1MiB |\n", "| 0 3526 G ...el-token=6C5C01D5B0057C12B571711999D42376 145MiB |\n", "| 0 3636 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 31MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "%%bash\n", "nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to OpenACC\n", "\n", "Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide *hints* via compiler directives to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. \n", "\n", "If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran or C source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n", "\n", "This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple Fortran program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost.\n", "\n", "If you are confused now, or at any point in this lab, you can consult the FAQ located at the bottom of this page." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Value of 2X in 4 Steps\n", "\n", "You can accelerate your applications using OpenACC directives and achieve *at least* a 2X speed-up, using 4 straightforward steps:\n", "\n", "1. Characterize your application\n", "2. Add compute directives\n", "3. Minimize data movement\n", "4. Optimize kernel scheduling\n", "\n", "The content of these steps and their order will be familiar if you have ever done parallel programming on other platforms. Parallel programmers deal with the same issues whenever they tackle a new set of code, no matter what platform they are parallelizing an application for. These issues include:\n", "\n", "+ optimizing and benchmarking the serial version of an application\n", "+ profiling to identify the compute-intensive portions of the program that can be executed concurrently\n", "+ expressing concurrency using a parallel programming notation (e.g., OpenACC directives)\n", "+ compiling and benchmarking each new/parallel version of the application\n", "+ locating problem areas and making improvements iteratively until the target level of performance is reached\n", "\n", "The programming manual for some other parallel platform you've used may have suggested five steps, or fifteen. Whether you are an expert or new to parallel programming, we recommend that you walk through the four steps here as a good way to begin accelerating applications by at least 2X using OpenACC directives. We believe *being more knowledgeable about the four steps* will make the process of programming for an accelerator more understandable *and* more manageable. The 2X in 4 Steps process will help you use OpenACC on your own codes more productively, and get significantly better speed-ups in less time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Characterize Your Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocessor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming, then, is to *characterize the application*. This includes:\n", "\n", "+ benchmarking the single-thread, CPU-only version of the application\n", "+ understanding the program structure and how data is passed through the call tree\n", "+ profiling the application and identifying computationally-intense \"hot spots\"\n", "    + which loop nests dominate the runtime?\n", "    + what are the minimum/average/maximum tripcounts through these loop nests?\n", "    + are the loop nests suitable for an accelerator?\n", "+ ensuring that the algorithms you are considering for acceleration are *safely* parallel\n", "\n", "Note: this may sound a little scary, but as parallel programming methods go, OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.\n", "\n", "We will be accelerating a 2D stencil computation called the Jacobi Iteration. 
Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n", "\n", "http://www.youtube.com/embed/UOSYi3oLlRs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the serial Fortran code for our Jacobi Iteration:\n", "\n", " program main\n", " use openacc\n", " implicit real(4) (A-H,O-Z)\n", " integer, parameter :: NN = 1024\n", " integer, parameter :: NM = 1024\n", "\n", " real(4) A(NN,NM), Anew(NN,NM)\n", " iter_max = 1000\n", " tol = 1.0e-6\n", " error = 1.0\n", "\n", " A(1,:) = 1.0\n", " A(2:NN,:) = 0.0\n", " Anew(1,:) = 1.0\n", " Anew(2:NN,:) = 0.0\n", " \n", " print 100,NN,NM\n", " \n", " call cpu_time(t1)\n", " iter = 0\n", " do while ( (error > tol) .and. (iter < iter_max) )\n", " error = 0.0\n", " do j = 2, NM-1\n", " do i = 2, NN-1\n", " Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &\n", " A(i,j-1) + A(i,j+1) )\n", " error = max( error, abs(Anew(i,j) - A(i,j)) )\n", " end do\n", " end do\n", " \n", " do j = 2, NM-1\n", " do i = 2, NN-1\n", " A(i,j) = Anew(i,j)\n", " end do\n", " end do\n", "\n", " if(mod(iter,100) == 0) print 101,iter,error\n", " iter = iter + 1\n", " end do\n", " call cpu_time(t2)\n", " print 102,t2-t1\n", "\n", " 100 format(\"Jacobi relaxation Calculation: \",i4,\" x \",i4,\" mesh\")\n", " 101 format(2x,i4,2x,f9.6)\n", " 102 format(\"total: \",f9.6,\" s\")\n", " end program\n", " \n", "\n", "In this code, the outer 'while' loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, *tol*. The first of two sets of inner nested loops applies a 2D Laplace operator at each element of a 2D grid, while the second set copies the output back to the input for the next iteration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Benchmarking\n", "\n", "Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `task1.f90` file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Compiled Successfully!\n" ] } ], "source": [ "%%bash\n", "# To be sure we see some output from the compiler, we'll echo out \"Compiled Successfully!\" \n", "#(if the compile does not return an error)\n", "pgfortran -fast -o task1_pre_out task1/task1.f90 && echo \"Compiled Successfully!\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jacobi relaxation Calculation: 1024 x 1024 mesh\n", " 0 0.250000\n", " 100 0.002397\n", " 200 0.001204\n", " 300 0.000804\n", " 400 0.000603\n", " 500 0.000483\n", " 600 0.000403\n", " 700 0.000345\n", " 800 0.000302\n", " 900 0.000269\n", "total: 1.507478 s\n" ] } ], "source": [ "%%bash\n", "# Execute our single-thread CPU-only Jacobi Iteration to get timing information. 
Make sure you compiled\n", "# successfully in the above command first.\n", "./task1_pre_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quality Checking/Keeping a Record\n", "\n", "This is a good time to briefly talk about having a quality check in your code before starting to offload computation to an accelerator (or do any optimizations, for that matter). It doesn't do you any good to make an application run faster if it does not return the correct results. It is thus very important to have a quality check built into your application before you start accelerating or optimizing. This can be a simple value printout (one you can compare to a non-accelerated version of the algorithm) or something else.\n", "\n", "In our case, on every 100th iteration of the outer `do while` loop, we print the current max error. (You just saw an example when we executed *task1_pre_out*.) As we add directives to accelerate our code later in this lab, you can look back at these values to verify that we're getting the correct answer. These printouts also help us verify that we are converging on a solution -- as we proceed, the printed values should approach zero.\n", "\n", "**Note:** NVIDIA GPUs implement IEEE-754 compliant floating point arithmetic just like most modern CPUs. However, because floating point arithmetic is not associative, the order of operations can affect the rounding error inherent with floating-point operations: you may not get exactly the same answer when you move to a different processor. Therefore, you'll want to make sure to verify your answer within an acceptable error bound. Please read [this](https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus) article at a later time, if you would like more details.\n", "\n", "*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: \n", "\n", "|Step| Execution | Execution Time (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time |\n", "|:--:| --------------- | ------------------:| ------------------------:|:--------:| -----------------|\n", "|1 | CPU 1 thread | 1.624 | | Yes | |\n", "\n", "*Note: Problem Size: 1024 x 1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n", "\n", "(The execution times quoted will be times we got running on our GK520 -- your times throughout the lab may vary for one reason or another.)\n", "\n", "You may also want to track how much time you spend porting your application, step by step, so a column has been included for recording time spent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling\n", "\n", "Back to our lab. Your objective in the step after this one (Step 2) will be to modify `task2.f90` in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's PGPROF or open-source `gprof`) to determine which functions are consuming the largest amounts of compute time. 
To profile a program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the PGPROF interface." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "pgfortran -fast -Minfo=all,ccff -o task1/task1_simple_out task1/task1.f90 && echo \"Compiled Successfully!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, to open the PGI profiler in a new window, click here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, clicking `ubuntu` on the left side of the file selector, then selecting `notebook`, then `FORTRAN`, and finally selecting `task1_simple_out`.\n", "\n",
\n", "\n", "Clicking `Next` will bring up a screen with a list profiling settings for this session. We can leave those at their default settings for now. Clicking `Finish` will cause `pgprof` to launch your executable for profiling. Since we are profiling a regular CPU application (no acceleration added yet) we should refer to the `CPU Details` tab along the bottom of the window for a summary of what functions in our program take the most compute time on the CPU. If you do not have a `CPU Details` tab, click `View` -> `Show CPU Details View`.\n", "\n", "\n", "\n", "Double-clicking on the most time-consuming function in the table, `MAIN_` in this case, will bring up another file browser. This time click on `Recently Used` and then `FORTRAN` and press `OK`. This will open the source file for the `main` function. \n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our Jacobi code sample, the two nests of `do` loops within the `do while` loop account for the majority of the runtime. \n", "\n", "Let's see what it takes to accelerate those loops." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Add Compute Directives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Fortran, an OpenACC directive is indicated in the code by `!$acc *your directive`* . This is very similar to OpenMP programming and gives hints to the compiler on how to handle the compilation of your source. If you are using a compiler which does not support OpenACC directives, it will simply ignore the `!$acc` directives and move on with the compilation.\n", "\n", "In Step 2, you will add compute regions around your expensive parallel loop(s). The first OpenACC directive you're going to learn about is the *kernels* directive. The kernels directive gives the compiler a lot of freedom in how it tries to accelerate your code - it basically says, \"Compiler, I believe the following code block is parallelizable, so I want you to try and accelerate it as best you can.\"\n", "\n", "Like most OpenACC directives in Fortran, the kernels directive applies to the structured code block defined by the opening and closing !$acc *directives*. For example, each of the following code samples instructs the compiler to generate a kernel -- from suitable loops -- for execution on an accelerator:\n", "\n", " !$acc kernels\n", " \n", " ! accelerate suitable loops here \n", " \n", " !$acc end kernels\n", " \n", " ! but not loops outside of the region\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of the two do-loop codeblocks that follow the do while loop. The kernels directive is designed to find the parallel acceleration opportunities implicit in the do-loops in the Jacobi Iteration code. \n", "\n", "To get some hints about how and where to place your kernels directives, click on the green boxes below. When you feel you are done, **make sure to save the `task2.f90` file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at `task2_solution.f90` to see the answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hint #1\n", "
"Add a `kernels` directive to the outer `do` loop and everything inside of it:\n", "\n",
" !$ acc kernels\n",
" do j=2, NM-1\n",
" do i=2, NN-1\n",
" A(i,j) = Anew(i,j)\n",
" ...\n",
" end do\n",
" end do\n",
" !$ end kernels\n",
"