First version of Jupyter notebook, adjusted for giving the course
on local hardware. This notebook is meant to be converted into
PDF, so that it can be used without Jupyter.
This commit is contained in:
F. Dijkstra 2017-06-20 12:48:25 +02:00
parent 751fc9a755
commit cce7f726c0
1 changed file with 296 additions and 123 deletions

@@ -6,8 +6,6 @@
"source": [
"# OpenACC: 2X in 4 Steps (for C)\n",
"\n",
"([Fortran version](../FORTRAN/OpenACC%20Fortran.ipynb))\n",
"\n",
"In this self-paced, hands-on lab, we will use [OpenACC](http://openacc.org/) directives to port a basic C program to an accelerator in four simple steps, achieving *at least* a two-fold speed-up.\n",
"\n",
"Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow [@CUDAHamster](https://twitter.com/@cudahamster) on Twitter)"
@@ -17,48 +15,50 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Before we begin, let's verify [WebSockets](http://en.wikipedia.org/wiki/WebSocket) are working on your system. To do this, execute the cell block *below* by giving it focus (clicking on it with your mouse), and hitting Ctrl-Enter, or by pressing the play button in the toolbar *above*. If all goes well, you should see get some output returned below the grey cell. If not, please consult the [Self-paced Lab Troubleshooting FAQ](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) to debug the issue."
"Lets begin by getting information about the GPUs on the server by running the command below."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"collapsed": true
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"print \"The answer should be three: \" + str(1+2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let's get information about the GPUs on the server by executing the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wed Jun 7 13:36:24 2017 \r\n",
"+-----------------------------------------------------------------------------+\r\n",
"| NVIDIA-SMI 375.66 Driver Version: 375.66 |\r\n",
"|-------------------------------+----------------------+----------------------+\r\n",
"| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n",
"| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n",
"|===============================+======================+======================|\r\n",
"| 0 GeForce GTX 950 Off | 0000:01:00.0 On | N/A |\r\n",
"| 1% 54C P5 11W / 99W | 932MiB / 1996MiB | 0% Default |\r\n",
"+-------------------------------+----------------------+----------------------+\r\n",
" \r\n",
"+-----------------------------------------------------------------------------+\r\n",
"| Processes: GPU Memory |\r\n",
"| GPU PID Type Process name Usage |\r\n",
"|=============================================================================|\r\n",
"| 0 1974 G /usr/lib/xorg/Xorg 624MiB |\r\n",
"| 0 3776 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 32MiB |\r\n",
"| 0 3875 G compiz 106MiB |\r\n",
"| 0 4275 G ...el-token=884290AA53D676228DE3F70F025B1D21 133MiB |\r\n",
"| 0 4324 G ...s-passed-by-fd --v8-snapshot-passed-by-fd 32MiB |\r\n",
"| 0 28457 G /usr/lib/firefox/firefox 1MiB |\r\n",
"+-----------------------------------------------------------------------------+\r\n"
]
}
],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"The following video will explain the infrastructure we are using for this self-paced lab, as well as give some tips on it's usage. If you've never taken a lab on this system before, it's highly encourage you watch this short video first.<br><br>\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/ZMrDaLSFqpY\" frameborder=\"0\" allowfullscreen>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -69,13 +69,7 @@
"\n",
"If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized *incrementally*, with little or no change to the Fortran, C or C++ source. Debugging and code maintenance are easier. OpenACC directives are designed for *portability* across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, *all in the same program.*\n",
"\n",
"Watch the following short video introduction to OpenACC:\n",
"\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" style=\"margin: 0 auto;\" src=\"http://www.youtube.com/embed/c9WYCFEt_Uo\" frameborder=\"0\" allowfullscreen></iframe></div>\n",
"\n",
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost.\n",
"\n",
"If you are confused now, or at any point in this lab, you can consult the <a href=\"#FAQ\">FAQ</a> located at the bottom of this page."
"This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple C program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost."
]
},
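To make the OpenMP/OpenACC similarity concrete, here is a minimal hypothetical sketch (not taken from the lab files; the function names and arguments are invented for illustration) of the same loop parallelized with one directive in each model:

```c
/* Hypothetical example (not from the lab code): the same loop,
 * parallelized incrementally with a single directive in each model. */
void scale_omp(float *restrict x, float a, int n)
{
    #pragma omp parallel for        /* spread iterations across CPU threads */
    for (int i = 0; i < n; i++)
        x[i] *= a;
}

void scale_acc(float *restrict x, float a, int n)
{
    #pragma acc kernels loop        /* let the compiler offload this loop */
    for (int i = 0; i < n; i++)
        x[i] *= a;
}
```

In both cases the serial source is unchanged apart from the directive, which is what makes the incremental porting approach possible.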
{
@@ -127,7 +121,7 @@
"\n",
"We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:\n",
"\n",
"<div align=\"center\"><iframe width=\"640\" height=\"390\" src=\"http://www.youtube.com/embed/UOSYi3oLlRs\" frameborder=\"0\" allowfullscreen></iframe></div>"
"http://www.youtube.com/embed/UOSYi3oLlRs"
]
},
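For reference, the core of the computation has roughly the shape sketched below (a simplification; the grid layout and variable names are assumed from task1.c, which holds the real code):

```c
#include <math.h>

#define N 1024
#define M 1024

static float A[N][M], Anew[N][M];

/* Sketch of the serial Jacobi core, assuming the structure of task1.c. */
void jacobi(int iter_max, float tol)
{
    float error = 1.0f;
    int iter = 0;
    while (error > tol && iter < iter_max) {
        error = 0.0f;
        for (int j = 1; j < N - 1; j++)
            for (int i = 1; i < M - 1; i++) {
                /* each interior point becomes the average of its 4 neighbours */
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }
        for (int j = 1; j < N - 1; j++)     /* copy Anew back into A */
            for (int i = 1; i < M - 1; i++)
                A[j][i] = Anew[j][i];
        iter++;
    }
}
```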
{
@@ -214,13 +208,15 @@
"source": [
"### Benchmarking\n",
"\n",
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.c`](/rpWFwS8c/edit/C/task1/task1.c) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two cells to compile and run the program."
"Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the [`task1.c`](/4vwkFv7K/edit/C/task1/task1.c) file without making any changes -- using the *-fast* set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program."
]
},
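The lab's timer.h supplies the timing helpers (its GetTimer routine shows up in the compiler output further down); as a rough standard-C equivalent of the pattern, assuming a POSIX system:

```c
#include <stdio.h>
#include <time.h>

/* Sketch of the benchmarking pattern: time the whole iteration loop
 * and print one number that can be compared across the four steps. */
int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... run the Jacobi iteration here ... */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf(" total: %f s\n", seconds);
    return 0;
}
```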
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
@@ -238,15 +234,34 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"collapsed": true,
"collapsed": false,
"scrolled": true
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0, 0.250000\n",
" 100, 0.002397\n",
" 200, 0.001204\n",
" 300, 0.000804\n",
" 400, 0.000603\n",
" 500, 0.000483\n",
" 600, 0.000403\n",
" 700, 0.000345\n",
" 800, 0.000302\n",
" 900, 0.000269\n",
" total: 2.884395 s\n"
]
}
],
"source": [
"# Execute our single-thread CPU-only Jacobi Iteration to get timing information. Make sure you compiled successfully in the \n",
"# above cell first.\n",
"# above command first.\n",
"!./task1_pre_out"
]
},
@@ -281,16 +296,43 @@
"source": [
"### Profiling\n",
"\n",
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following cell, and then click on the link below it to see the pgprof interface"
"Back to our lab. Your objective in the step after this one (Step 2) will be to modify [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's pgprof, NVIDIA's nvprof or open-source *gprof*) to determine which functions are consuming the largest amounts of compute time. To profile a C program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the pgprof interface"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {
"collapsed": true
"collapsed": false,
"scrolled": true
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Compiled Successfully!\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"GetTimer:\n",
" 3, include \"timer.h\"\n",
" 62, FMA (fused multiply-add) instruction(s) generated\n",
"main:\n",
" 25, Loop not fused: function call before adjacent loop\n",
" Loop not vectorized: may not be beneficial\n",
" Unrolled inner loop 8 times\n",
" Generated 7 prefetches in scalar loop\n",
" 42, Generated vector simd code for the loop containing reductions\n",
" Generated 3 prefetch instructions for the loop\n",
" Residual loop unrolled 2 times (completely unrolled)\n",
" 52, Memory copy idiom, loop replaced by call to __c_mcopy4\n"
]
}
],
"source": [
"%%bash\n",
"pgcc -Minfo=all,ccff -fast -o task1/task1_simple_out task1/task1_simple.c && echo \"Compiled Successfully!\""
@@ -300,8 +342,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lab, to open the PGI profiler in a new window <a href=\"/vnc\" onclick=\"window.open(this.href, 'Profiler',\n",
"'left=20,top=20,width=1280,height=724,toolbar=1,resizable=0'); return false;\">click here</a>."
"In this lab, to open the PGI profiler run the following command."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!pgprof"
]
},
{
@@ -369,7 +421,7 @@
"source": [
"One, two or several loops may be inside the structured block, the kernels directive will try to parallelize it, telling you what it found and generating as many kernels as it thinks it safely can. At some point, you will encounter the OpenACC *parallel* directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC `kernels` directive in front of and embracing *both* the two for-loop codeblocks that follow the while loop using curly braces. The kernels directive is designed to find the parallel acceleration opportunities implicit in the for-loops in the Jacobi Iteration code. \n",
"\n",
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.c](/rpWFwS8c/edit/C/task2/task2_solution.c) to see the answer."
"To get some hints about how and where to place your kernels directives, click on the links below. When you feel you are done, **make sure to save the [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file you've modified with File -> Save, and continue on.** If you get completely stuck, you can look at [task2_solution.c](/4vwkFv7K/edit/C/task2/task2_solution.c) to see the answer."
]
},
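In sketch form (task2_solution.c is the authoritative version; the loop bounds and variable names are assumed from the serial code), the placement looks like this:

```c
/* Fragment: one kernels region, opened with a curly brace, embracing
 * both loop nests inside the while loop. */
while (error > tol && iter < iter_max) {
    error = 0.0f;
    #pragma acc kernels
    {
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j][i] = Anew[j][i];
    }
    iter++;
}
```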
{
@@ -384,16 +436,49 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now compile our [`task2.c`](/rpWFwS8c/edit/C/task2/task2.c) file by executing the cell below with Ctrl-Enter (or press the play button in the toolbar above). Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
"Let's now compile our [`task2.c`](/4vwkFv7K/edit/C/task2/task2.c) file by executing the command below with Ctrl-Enter (or press the play button in the toolbar above). Note that we've now added a new compiler option `-ta` to specify the type of accelerator to use. We've set it to `tesla` as we're using NVIDIA GPUs in this lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GetTimer:\n",
" 3, include \"timer.h\"\n",
" 62, FMA (fused multiply-add) instruction(s) generated\n",
"main:\n",
" 23, Loop not fused: function call before adjacent loop\n",
" Loop not vectorized: may not be beneficial\n",
" Unrolled inner loop 8 times\n",
" Generated 7 prefetches in scalar loop\n",
" 34, Loop not vectorized/parallelized: potential early exits\n",
" 38, Generating implicit copyout(Anew[1:1022][1:1022])\n",
" Generating implicit copyin(A[:][:])\n",
" Generating implicit copyout(A[1:1022][1:1022])\n",
" 41, Loop is parallelizable\n",
" 43, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 41, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 43, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
" 47, Generating implicit reduction(max:error)\n",
" 52, Loop is parallelizable\n",
" 54, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 52, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 54, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
"Compiled Successfully\n"
]
}
],
"source": [
"# Compile the task2.c file with the pgcc compiler\n",
"# -acc tells the compiler to process the source recognizing #pragma acc directives\n",
@@ -444,23 +529,42 @@
"\n",
"So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefuly inspect this information to make sure the compiler is doing what you've asked of it.\n",
"\n",
"Finally, let's execute this program to verify we are getting the correct answer (execute the cell below). "
"Finally, let's execute this program to verify we are getting the correct answer (execute the command below). "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you feel your code is correct, try running it by executing the cell block below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
"Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0, 0.250000\n",
" 100, 0.002397\n",
" 200, 0.001204\n",
" 300, 0.000804\n",
" 400, 0.000603\n",
" 500, 0.000483\n",
" 600, 0.000403\n",
" 700, 0.000345\n",
" 800, 0.000302\n",
" 900, 0.000269\n",
" total: 3.403485 s\n"
]
}
],
"source": [
"!./task2_out"
]
@@ -524,7 +628,7 @@
"\n",
"For detailed information on the `data` directive clauses, you can refer to the [OpenACC 2.5](http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf) specification.\n",
"\n",
"In the [`task3.c`](/rpWFwS8c/edit/C/task3/task3.c) file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [`task3_solution.c`](/rpWFwS8c/edit/C/task3/task3_solution.c) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
"In the [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) file, see if you can add in a `data` directive to minimize data transfers in the Jacobi Iteration. There's a place for the `create` clause in this exercise too. As usual, there are some hints provided, and you can look at [`task3_solution.c`](/4vwkFv7K/edit/C/task3/task3_solution.c) to see the answer if you get stuck or want to check your work. **Don't forget to save with File -> Save in the editor below before moving on.**"
]
},
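The shape this takes is sketched below (task3_solution.c is the reference; variable names are assumed from the earlier steps): A is copied to the device once at entry and back once at exit, while Anew only ever exists on the device.

```c
/* Fragment: a single data region spanning the whole while loop. */
#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max) {
    error = 0.0f;
    #pragma acc kernels
    {
        /* ... the two loop nests from Step 2, unchanged ... */
    }
    iter++;
}
```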
{
@@ -540,16 +644,40 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you think you have [`task3.c`](/rpWFwS8c/edit/C/task3/task3.c) saved with a directive to manage data transfer, compile it with the below cell and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
"Once you think you have [`task3.c`](/4vwkFv7K/edit/C/task3/task3.c) saved with a directive to manage data transfer, compile it with the below command and note the changes in the compiler output in the areas discussing data movement (lines starting with `Generating ...`). Then modify Anew using the `create` clause, if you haven't yet, and compile again."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"main:\n",
" 34, Generating create(Anew[:][:])\n",
" Generating copy(A[:][:])\n",
" 42, Loop is parallelizable\n",
" 44, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 42, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 44, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
" 48, Generating implicit reduction(max:error)\n",
" 53, Loop is parallelizable\n",
" 55, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 53, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 55, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
"Compiled Successfully\n"
]
}
],
"source": [
"!pgcc -fast -acc -Minfo=accel -ta=tesla -o task3_out task3/task3.c && echo \"Compiled Successfully\""
]
@@ -563,11 +691,30 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0, 0.250000\n",
" 100, 0.002397\n",
" 200, 0.001204\n",
" 300, 0.000804\n",
" 400, 0.000603\n",
" 500, 0.000483\n",
" 600, 0.000403\n",
" 700, 0.000345\n",
" 800, 0.000302\n",
" 900, 0.000269\n",
" total: 0.601428 s\n"
]
}
],
"source": [
"!./task3_out"
]
@@ -609,11 +756,79 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Jacobi relaxation Calculation: 1024 x 1024 mesh\n",
" 0, 0.250000\n",
" 100, 0.002397\n",
" 200, 0.001204\n",
" 300, 0.000804\n",
" 400, 0.000603\n",
" 500, 0.000483\n",
" 600, 0.000403\n",
" 700, 0.000345\n",
" 800, 0.000302\n",
" 900, 0.000269\n",
" total: 0.581272 s\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"main:\n",
" 34, Generating create(Anew[:][:])\n",
" Generating copy(A[:][:])\n",
" 42, Loop is parallelizable\n",
" 44, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 42, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 44, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
" 48, Generating implicit reduction(max:error)\n",
" 53, Loop is parallelizable\n",
" 55, Loop is parallelizable\n",
" Accelerator kernel generated\n",
" Generating Tesla code\n",
" 53, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */\n",
" 55, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */\n",
"\n",
"Accelerator Kernel Timing data\n",
"/home/fokke/OpenACC/labs/lab1/notebook/C/task3/task3.c\n",
" main NVIDIA devicenum=0\n",
" time(us): 425,590\n",
" 34: data region reached 2 times\n",
" 34: data copyin transfers: 1\n",
" device time(us): total=352 max=352 min=352 avg=352\n",
" 68: data copyout transfers: 1\n",
" device time(us): total=336 max=336 min=336 avg=336\n",
" 37: compute region reached 1000 times\n",
" 37: data copyin transfers: 1000\n",
" device time(us): total=2,452 max=13 min=2 avg=2\n",
" 44: kernel launched 1000 times\n",
" grid: [32x256] block: [32x4]\n",
" device time(us): total=307,190 max=311 min=305 avg=307\n",
" elapsed time(us): total=318,460 max=341 min=316 avg=318\n",
" 44: reduction kernel launched 1000 times\n",
" grid: [1] block: [256]\n",
" device time(us): total=13,053 max=19 min=13 avg=13\n",
" elapsed time(us): total=24,253 max=47 min=23 avg=24\n",
" 44: data copyout transfers: 1000\n",
" device time(us): total=7,380 max=20 min=7 avg=7\n",
" 55: kernel launched 1000 times\n",
" grid: [32x256] block: [32x4]\n",
" device time(us): total=94,827 max=118 min=92 avg=94\n",
" elapsed time(us): total=108,023 max=1,019 min=104 avg=108\n"
]
}
],
"source": [
"%%bash\n",
"export PGI_ACC_TIME=1\n",
@@ -692,7 +907,7 @@
"| | | | | 16 | 32 | 0.410 |\n",
"| | | | | 4 | 64 | 0.379 |\n",
"\n",
"Try to modify the [`task4.c`](/rpWFwS8c/edit/C/task4/task4.c) code for the main computational loop nests in the window below. You'll be using the openacc loop constructs `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
"Try to modify the [`task4.c`](/4vwkFv7K/edit/C/task4/task4.c) code for the main computational loop nests in the window below. You'll be using the openacc loop constructs `gang()` and `vector()`. Look at task4_solution.c if you get stuck:\n"
]
},
{
@ -735,7 +950,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at [task4_solution.c](/rpWFwS8c/edit/C/task4/task4_solution.c), the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
"Looking at [task4_solution.c](/4vwkFv7K/edit/C/task4/task4_solution.c), the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:"
]
},
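In sketch form (the exact clauses are in task4_solution.c; loop bounds and variable names are assumed from the earlier steps), the tuned first loop nest reads:

```c
/* Fragment: no clause on the outer loop, gang(8) vector(32) on the
 * inner loop, i.e. blocks 32 threads wide with 8 gangs along x. */
#pragma acc kernels loop
for (int j = 1; j < n - 1; j++) {
    #pragma acc loop gang(8) vector(32)
    for (int i = 1; i < m - 1; i++) {
        Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
    }
}
```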
{
@ -814,7 +1029,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at [task1_omp.c](/rpWFwS8c/edit/C/task4/task1_omp.c) in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Running this using 8-OpenMP threads on an Intel Xeon E5-2670 , our Kepler GK520 about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-OpenMP thread version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
"At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at [task1_omp.c](/4vwkFv7K/edit/C/task4/task1_omp.c) in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Running this using 8-OpenMP threads on an Intel Xeon E5-2670 , our Kepler GK520 about 2X faster. If we scale the matrix up to an even larger 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-OpenMP thread version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.\n",
"\n",
"First, compile the OpenMP version:"
]
@@ -906,49 +1121,7 @@
"* [OpenACC on CUDA Zone](https://developer.nvidia.com/openacc)\n",
"* Search or ask questions on [Stackoverflow](http://stackoverflow.com/questions/tagged/openacc) using the openacc tag\n",
"* Get the free [PGI Comunity Edition](https://www.pgroup.com/products/community.htm) compiler.\n",
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n",
"\n",
"---\n",
"\n",
"<a id=\"post-lab\"></a>\n",
"## Post-Lab\n",
"\n",
"Finally, don't forget to save your work from this lab before time runs out and the instance shuts down!!\n",
"\n",
"1. Save this IPython Notebook by going to `File -> Download as -> IPython (.ipynb)` at the top of this window\n",
"2. You can execute the following cell block to create a zip-file of the files you've been working on, and download it with the link below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%%bash\n",
"rm -f openacc_files.zip\n",
"zip -r openacc_files.zip task*/*.c task*/*.h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**After** executing the above zip command, you should be able to download the zip file [here](files/openacc_files.zip)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"FAQ\"></a>\n",
"---\n",
"# Lab FAQ\n",
"\n",
"Q: I'm encountering issues executing the cells, or other technical problems?<br>\n",
"A: Please see [this](https://developer.nvidia.com/self-paced-labs-faq#Troubleshooting) infrastructure FAQ."
"* Attend an in-depth workshop offered by XSEDE (https://portal.xsede.org/overview) or a commercial provider (see the 'education' page at OpenACC.org)\n"
]
},
{
@@ -1039,7 +1212,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
"version": "2.7.13"
}
},
"nbformat": 4,