{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenACC: 2X in 4 Steps (for Fortran)\n", "\n", "In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic scientific Fortran program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n", "\n", "Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next let's get information about the GPUs on the server by executing the command below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Jun 27 14:46:47 2017 \n", "+-----------------------------------------------------------------------------+\n", "| NVIDIA-SMI 375.66 Driver Version: 375.66 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "|===============================+======================+======================|\n", "| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\n", "| 18% 56C P8 10W / 99W | 741MiB / 1996MiB | 0% Default |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", "+-----------------------------------------------------------------------------+\n", "| Processes: GPU Memory |\n", "| GPU PID Type Process name Usage |\n", "|=============================================================================|\n", "| 0 1942 G /usr/lib/xorg/Xorg 423MiB |\n", "| 0 3184 G compiz 136MiB |\n", "| 0 3392 G /usr/lib/firefox/firefox 1MiB |\n", "| 0 3526 G ...el-token=6C5C01D5B0057C12B571711999D42376 145MiB |\n", "| 0 3636 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 31MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } ], "source": [ "%%bash\n", "nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to OpenACC\n", "\n", "Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide *hints* via compiler directives to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator. \n", "\n", "If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran or C source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n", "\n", "This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple Fortran program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost.\n", "\n", "If you are confused now, or at any point in this lab, you can consult the FAQ located at the bottom of this page." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Value of 2X in 4 Steps\n", "\n", "You can accelerate your applications using OpenACC directives and achieve *at least* a 2X speed-up, using 4 straightforward steps:\n", "\n", "1. Characterize your application\n", "2. Add compute directives\n", "3. Minimize data movement\n", "4. Optimize kernel scheduling\n", "\n", "The content of these steps and their order will be familiar if you have ever done parallel programming on other platforms. Parallel programmers deal with the same issues whenever they tackle a new set of code, no matter what platform they are parallelizing an application for. These issues include:\n", "\n", "+ optimizing and benchmarking the serial version of an application\n", "+ profiling to identify the compute-intensive portions of the program that can be executed concurrently\n", "+ expressing concurrency using a parallel programming notation (e.g., OpenACC directives)\n", "+ compiling and benchmarking each new/parallel version of the application\n", "+ locating problem areas and making improvements iteratively until the target level of performance is reached\n", "\n", "The programming manual for some other parallel platform you've used may have suggested five steps, or fifteen. Whether you are an expert or new to parallel programming, we recommend that you walk through the four steps here as a good way to begin accelerating applications by at least 2X using OpenACC directives. We believe *being more knowledgeable about the four steps* will make the process of programming for an accelerator more understandable *and* more manageable. The 2X in 4 Steps process will help you use OpenACC on your own codes more productively, and get significantly better speed-ups in less time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Characterize Your Application" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocessor won't be much use. Understanding the code structure is crucial if you are going to *identify opportunities* and *successfully* parallelize a piece of code. The first step in OpenACC programming, then, is to *characterize the application*. This includes:\n", "\n", "+ benchmarking the single-thread, CPU-only version of the application\n", "+ understanding the program structure and how data is passed through the call tree\n", "+ profiling the application and identifying computationally-intense \"hot spots\"\n", "    + which loop nests dominate the runtime?\n", "    + what are the minimum/average/maximum tripcounts through these loop nests?\n", "    + are the loop nests suitable for an accelerator?\n", "+ ensuring that the algorithms you are considering for acceleration are *safely* parallel\n", "\n", "Note: this may sound a little scary, but as parallel programming methods go, OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two at a time and see how things work: the compiler provides a *lot* of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.\n", "\n", "We will be accelerating a 2D stencil computation called the Jacobi Iteration. 
Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n", "\n", "http://www.youtube.com/embed/UOSYi3oLlRs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the serial Fortran code for our Jacobi Iteration:\n", "\n", " program main\n", " use openacc\n", " implicit real(4) (A-H,O-Z)\n", " integer, parameter :: NN = 1024\n", " integer, parameter :: NM = 1024\n", "\n", " real(4) A(NN,NM), Anew(NN,NM)\n", " iter_max = 1000\n", " tol = 1.0e-6\n", " error = 1.0\n", "\n", " A(1,:) = 1.0\n", " A(2:NN,:) = 0.0\n", " Anew(1,:) = 1.0\n", " Anew(2:NN,:) = 0.0\n", " \n", " print 100,NN,NM\n", " \n", " call cpu_time(t1)\n", " iter = 0\n", " do while ( (error > tol) .and. (iter < iter_max) )\n", " error = 0.0\n", " do j = 2, NM-1\n", " do i = 2, NN-1\n", " Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &\n", " A(i,j-1) + A(i,j+1) )\n", " error = max( error, abs(Anew(i,j) - A(i,j)) )\n", " end do\n", " end do\n", " \n", " do j = 2, NM-1\n", " do i = 2, NN-1\n", " A(i,j) = Anew(i,j)\n", " end do\n", " end do\n", "\n", " if(mod(iter,100) == 0) print 101,iter,error\n", " iter = iter + 1\n", " end do\n", " call cpu_time(t2)\n", " print 102,t2-t1\n", "\n", " 100 format(\"Jacobi relaxation Calculation: \",i4,\" x \",i4,\" mesh\")\n", " 101 format(2x,i4,2x,f9.6)\n", " 102 format(\"total: \",f9.6,\" s\")\n", " end program\n", " \n", "\n", "In this code, the outer 'while' loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, *tol*. The first of two sets of inner nested loops applies a 2D Laplace operator at each element of a 2D grid, while the second set copies the output back to the input for the next iteration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Benchmarking\n", "\n", "Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the `task1.f90` file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Compiled Successfully!\n" ] } ], "source": [ "%%bash\n", "# To be sure we see some output from the compiler, we'll echo out \"Compiled Successfully!\" \n", "#(if the compile does not return an error)\n", "pgfortran -fast -o task1_pre_out task1/task1.f90 && echo \"Compiled Successfully!\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jacobi relaxation Calculation: 1024 x 1024 mesh\n", " 0 0.250000\n", " 100 0.002397\n", " 200 0.001204\n", " 300 0.000804\n", " 400 0.000603\n", " 500 0.000483\n", " 600 0.000403\n", " 700 0.000345\n", " 800 0.000302\n", " 900 0.000269\n", "total: 1.507478 s\n" ] } ], "source": [ "%%bash\n", "# Execute our single-thread CPU-only Jacobi Iteration to get timing information. 
Make sure you compiled\n", "# successfully in the above command first.\n", "./task1_pre_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quality Checking/Keeping a Record\n", "\n", "This is a good time to briefly talk about having a quality check in your code before starting to offload computation to an accelerator (or do any optimizations, for that matter). It doesn't do you any good to make an application run faster if it does not return the correct results. It is thus very important to have a quality check built into your application before you start accelerating or optimizing. This can be a simple value printout (one you can compare to a non-accelerated version of the algorithm) or something else.\n", "\n", "In our case, on every 100th iteration of the outer `do while` loop, we print the current max error. (You just saw an example when we executed *task1_pre_out*.) As we add directives to accelerate our code later in this lab, you can look back at these values to verify that we're getting the correct answer. These printouts also help us verify that we are converging on a solution -- as we proceed, the printed values should approach zero.\n", "\n", "**Note:** NVIDIA GPUs implement IEEE-754 compliant floating point arithmetic just like most modern CPUs. However, because floating point arithmetic is not associative, the order of operations can affect the rounding error inherent with floating-point operations: you may not get exactly the same answer when you move to a different processor. Therefore, you'll want to make sure to verify your answer within an acceptable error bound. Please read [this](https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus) article at a later time, if you would like more details.\n", "\n", "*After each step*, we will record the results from our benchmarking and correctness tests in a table like this one: \n", "\n", "|Step| Execution | Execution Time (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time |\n", "|:--:| --------------- | ------------------:| ------------------------:|:--------:| -----------------|\n", "|1 | CPU 1 thread | 1.624 | | Yes | |\n", "\n", "*Note: Problem Size: 1024 x 1024; System Information: GK520; Compiler: PGI Community Edition 17.4*\n", "\n", "(The execution times quoted will be times we got running on our GK520 -- your times throughout the lab may vary for one reason or another.)\n", "\n", "You may also want to track how much time you spend porting your application, step by step, so a column has been included for recording time spent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling\n", "\n", "Back to our lab. Your objective in the step after this one (Step 2) will be to modify `task2.f90` in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's PGPROF or open-source `gprof`) to determine which functions are consuming the largest amounts of compute time. 
To profile a program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the PGPROF interface." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "pgfortran -fast -Minfo=all,ccff -o task1/task1_simple_out task1/task1.f90 && echo \"Compiled Successfully!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, to open the PGI profiler in a new window, click here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click on `File > New Session` to start a new profiling session. Select the executable to profile by pressing the `Browse` button, clicking `ubuntu` on the left side of the file selector, then selecting `notebook`, then `FORTRAN`, and finally selecting `task1_simple_out`.\n", "\n",
\n", "\n", "Clicking `Next` will bring up a screen with a list profiling settings for this session. We can leave those at their default settings for now. Clicking `Finish` will cause `pgprof` to launch your executable for profiling. Since we are profiling a regular CPU application (no acceleration added yet) we should refer to the `CPU Details` tab along the bottom of the window for a summary of what functions in our program take the most compute time on the CPU. If you do not have a `CPU Details` tab, click `View` -> `Show CPU Details View`.\n", "\n", "\n", "\n", "Double-clicking on the most time-consuming function in the table, `MAIN_` in this case, will bring up another file browser. This time click on `Recently Used` and then `FORTRAN` and press `OK`. This will open the source file for the `main` function. \n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our Jacobi code sample, the two nests of `do` loops within the `do while` loop account for the majority of the runtime. \n", "\n", "Let's see what it takes to accelerate those loops." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Add Compute Directives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Fortran, an OpenACC directive is indicated in the code by `!$acc *your directive`* . This is very similar to OpenMP programming and gives hints to the compiler on how to handle the compilation of your source. If you are using a compiler which does not support OpenACC directives, it will simply ignore the `!$acc` directives and move on with the compilation.\n", "\n", "In Step 2, you will add compute regions around your expensive parallel loop(s). The first OpenACC directive you're going to learn about is the *kernels* directive. The kernels directive gives the compiler a lot of freedom in how it tries to accelerate your code - it basically says, \"Compiler, I believe the following code block is parallelizable, so I want you to try and accelerate it as best you can.\"\n", "\n", "Like most OpenACC directives in Fortran, the kernels directive applies to the structured code block defined by the opening and closing !$acc *directives*. For example, each of the following code samples instructs the compiler to generate a kernel -- from suitable loops -- for execution on an accelerator:\n", "\n", " !$acc kernels\n", " \n", " ! accelerate suitable loops here \n", " \n", " !$acc end kernels\n", " \n", " ! but not loops outside of the region\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of the two do-loop codeblocks that follow the do while loop. The kernels directive is designed to find the parallel acceleration opportunities implicit in the do-loops in the Jacobi Iteration code. \n", "\n", "To get some hints about how and where to place your kernels directives, click on the green boxes below. When you feel you are done, **make sure to save the `task2.f90` file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at `task2_solution.f90` to see the answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hint #1\n", "
"Add a `kernels` directive to the outer `do` loop and everything inside of it:\n", "\n",
" !$ acc kernels\n",
" do j=2, NM-1\n",
" do i=2, NN-1\n",
" A(i,j) = Anew(i,j)\n",
" ...\n",
" end do\n",
" end do\n",
" !$ end kernels\n",
"