OpenACC-course/lab1/FORTRAN/OpenACC Fortran.ipynb

50 KiB

OpenACC: 2X in 4 Steps (for Fortran)

In this self-paced, hands-on lab, we will use OpenACC directives to port a basic scientific Fortran program to an accelerator in four simple steps, achieving at least a two-fold speed-up.

Lab created by John Coombs, Mark Harris, and Mark Ebersole (Follow @CUDAHamster on Twitter)

Next let's get information about the GPUs on the server by executing the command below.

In [1]:
%%bash
nvidia-smi
Tue Jun 27 14:46:47 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 0000:01:00.0      On |                  N/A |
| 18%   56C    P8    10W /  99W |    741MiB /  1996MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1942    G   /usr/lib/xorg/Xorg                             423MiB |
|    0      3184    G   compiz                                         136MiB |
|    0      3392    G   /usr/lib/firefox/firefox                         1MiB |
|    0      3526    G   ...el-token=6C5C01D5B0057C12B571711999D42376   145MiB |
|    0      3636    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    31MiB |
+-----------------------------------------------------------------------------+

Introduction to OpenACC

Open-specification OpenACC directives are a straightforward way to accelerate existing Fortran and C applications. With OpenACC directives, you provide hints via compiler directives to tell the compiler where -- and how -- it should parallelize compute-intensive code for execution on an accelerator.

If you've done parallel programming using OpenMP, OpenACC is very similar: using directives, applications can be parallelized incrementally, with little or no change to the Fortran or C source. Debugging and code maintenance are easier. OpenACC directives are designed for portability across operating systems, host CPUs, and accelerators. You can use OpenACC directives with GPU accelerated libraries, explicit parallel programming languages (e.g., CUDA), MPI, and OpenMP, all in the same program.

This hands-on lab walks you through a short sample of a scientific code, and demonstrates how you can employ OpenACC directives using a four-step process. You will make modifications to a simple Fortran program, then compile and execute the newly enhanced code in each step. Along the way, hints and solution are provided, so you can check your work, or take a peek if you get lost.

If you are confused now, or at any point in this lab, you can consult the FAQ located at the bottom of this page.

The Value of 2X in 4 Steps

You can accelerate your applications using OpenACC directives and achieve at least a 2X speed-up, using 4 straightforward steps:

  1. Characterize your application
  2. Add compute directives
  3. Minimize data movement
  4. Optimize kernel scheduling

The content of these steps and their order will be familiar if you have ever done parallel programming on other platforms. Parallel programmers deal with the same issues whenever they tackle a new set of code, no matter what platform they are parallelizing an application for. These issues include:

  • optimizing and benchmarking the serial version of an application
  • profiling to identify the compute-intensive portions of the program that can be executed concurrently
  • expressing concurrency using a parallel programming notation (e.g., OpenACC directives)
  • compiling and benchmarking each new/parallel version of the application
  • locating problem areas and making improvements iteratively until the target level of performance is reached

The programming manual for some other parallel platform you've used may have suggested five steps, or fifteen. Whether you are an expert or new to parallel programming, we recommend that you walk through the four steps here as a good way to begin accelerating applications by at least 2X using OpenACC directives. We believe being more knowledgeable about the four steps will make the process of programming for an accelerator more understandable and more manageable. The 2X in 4 Steps process will help you use OpenACC on your own codes more productively, and get significantly better speed-ups in less time.

Step 1 - Characterize Your Application

The most difficult part of accelerator programming begins before the first line of code is written. If your program is not highly parallel, an accelerator or coprocesor won't be much use. Understanding the code structure is crucial if you are going to identify opportunities and successfully parallelize a piece of code. The first step in OpenACC programming then is to characterize the application. This includes:

  • benchmarking the single-thread, CPU-only version of the application
  • understanding the program structure and how data is passed through the call tree
  • profiling the application and identifying computationally-intense "hot spots"
    • which loop nests dominate the runtime?
    • what are the minimum/average/maximum tripcounts through these loop nests?
    • are the loop nests suitable for an accelerator?
  • insuring that the algorithms you are considering for acceleration are safely parallel

Note: what we've just said may sound a little scary, so please note that as parallel programming methods go OpenACC is really pretty friendly: think of it as a sandbox you can play in. Because OpenACC directives are incremental, you can add one or two in at a time and see how things work: the compiler provides a lot of feedback. The right software plus good tools plus educational experiences like this one should put you on the path to successfully accelerating your programs.

We will be accelerating a 2D-stencil called the Jacobi Iteration. Jacobi Iteration is a standard method for finding solutions to a system of linear equations. The basic concepts behind a Jacobi Iteration are described in the following video:

http://www.youtube.com/embed/UOSYi3oLlRs

Here is the serial Fortran code for our Jacobi Iteration:

program main
use openacc
implicit real(4) (A-H,O-Z)
integer, parameter :: NN = 1024
integer, parameter :: NM = 1024

real(4) A(NN,NM), Anew(NN,NM)
iter_max = 1000
tol   = 1.0e-6
error = 1.0

A(1,:) = 1.0
A(2:NN,:) = 0.0
Anew(1,:) = 1.0
Anew(2:NN,:) = 0.0

print 100,NN,NM

call cpu_time(t1)
iter = 0
do while ( (error > tol) .and. (iter < iter_max) )
error = 0.0
do j = 2, NM-1
  do i = 2, NN-1
    Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + &
                         A(i,j-1) + A(i,j+1) )
    error = max( error, abs(Anew(i,j) - A(i,j)) )
  end do
end do
    
do j = 2, NM-1
  do i = 2, NN-1
    A(i,j) = Anew(i,j)
  end do
end do

if(mod(iter,100) == 0) print 101,iter,error
iter = iter + 1
end do
call cpu_time(t2)
print 102,t2-t1

100 format("Jacobi relaxation Calculation: ",i4," x ",i4," mesh")
101 format(2x,i4,2x,f9.6)
102 format("total: ",f9.6," s")
end program

In this code, the outer 'while' loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, tol. The first of two sets of inner nested loops applies a 2D Laplace operator at each element of a 2D grid, while the second set copies the output back to the input for the next iteration.

Benchmarking

Before you start modifying code and adding OpenACC directives, you should benchmark the serial version of the program. To facilitate benchmarking after this and every other step in our parallel porting effort, we have built a timing routine around the main structure of our program -- a process we recommend you follow in your own efforts. Let's run the task1.f90 file without making any changes -- using the -fast set of compiler options on the serial version of the Jacobi Iteration program -- and see how fast the serial program executes. This will establish a baseline for future comparisons. Execute the following two commands to compile and run the program.

In [2]:
%%bash
# To be sure we see some output from the compiler, we'll echo out "Compiled Successfully!" 
#(if the compile does not return an error)
pgfortran -fast -o task1_pre_out task1/task1.f90 && echo "Compiled Successfully!"
Compiled Successfully!
In [3]:
%%bash
# Execute our single-thread CPU-only Jacobi Iteration to get timing information.  Make sure you compiled
# successfully in the above command first.
./task1_pre_out
Jacobi relaxation Calculation: 1024 x 1024 mesh
     0   0.250000
   100   0.002397
   200   0.001204
   300   0.000804
   400   0.000603
   500   0.000483
   600   0.000403
   700   0.000345
   800   0.000302
   900   0.000269
total:  1.507478 s

Quality Checking/Keeping a Record

This is a good time to briefly talk about having a quality check in your code before starting to offload computation to an accelerator (or do any optimizations, for that matter). It doesn't do you any good to make an application run faster if it does not return the correct results. It is thus very important to have a quality check built into your application before you start accelerating or optimizing. This can be a simple value print out (one you can compare to a non-accelerated version of the algorithm) or something else.

In our case, on every 100th iteration of the outer do while loop, we print the current max error. (You just saw an example when we executed task1_pre_out.) As we add directives to accelerate our code later in this lab, you can look back at these values to verify that we're getting the correct answer. These print-outs also help us verify that we are converging on a solution -- which means that we should see that, as we proceed, the values are approaching zero.

Note: NVIDIA GPUs implement IEEE-754 compliant floating point arithmetic just like most modern CPUs. However, because floating point arithmetic is not associative, the order of operations can affect the rounding error inherent with floating-point operations: you may not get exactly the same answer when you move to a different processor. Therefore, you'll want to make sure to verify your answer within an acceptable error bound. Please read this article at a later time, if you would like more details.

After each step, we will record the results from our benchmarking and correctness tests in a table like this one:

|Step| Execution | ExecutionTime (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time | |:--:| --------------- | ---------------------:| ------------------------------:|:--------:| -----------------| |1 | CPU 1 thread | 1.624 | | Yes | | |

Note: Problem Size: 1024 x 1024; System Information: GK520; Compiler: PGI Community Edition 17.4

(The execution times quoted will be times we got running on our GK520 -- your times throughout the lab may vary for one reason or another.)

You may also want to track how much time you spend porting your application, step by step, so a column has been included for recording time spent.

Profiling

Back to our lab. Your objective in the step after this one (Step 2) will be to modify task2.f90 in a way that moves the most computationally intensive, independent loops to the accelerator. With a simple code, you can identify which loops are candidates for acceleration with a little bit of code inspection. On more complex codes, a great way to find these computationally intense areas is to use a profiler (such as PGI's PGPROF or open-source gprof) to determine which functions are consuming the largest amounts of compute time. To profile a program on your own workstation, you'd type the lines below on the command line, but in this workshop, you just need to execute the following command, and then click on the link below it to see the PGPROF interface

In [ ]:
%%bash
pgfortran -fast -Minfo=all,ccff -o task1/task1_simple_out task1/task1.f90 && echo "Compiled Successfully!"

In this lab, to open the PGI profiler in a new window click here.

Click on File > New Session to start a new profiling session. Select the executable to profile by pressing the Browse button, clicking ubuntu from the file left side of the file selector, the selecting notebook, then FORTRAN, and finally selecting task1_simple_out.

No description has been provided for this image

Clicking Next will bring up a screen with a list profiling settings for this session. We can leave those at their default settings for now. Clicking Finish will cause pgprof to launch your executable for profiling. Since we are profiling a regular CPU application (no acceleration added yet) we should refer to the CPU Details tab along the bottom of the window for a summary of what functions in our program take the most compute time on the CPU. If you do not have a CPU Details tab, click View -> Show CPU Details View.

No description has been provided for this image

Double-clicking on the most time-consuming function in the table, MAIN_ in this case, will bring up another file browser. This time click on Recently Used and then FORTRAN and press OK. This will open the source file for the main function.

No description has been provided for this image

In our Jacobi code sample, the two nests of do loops within the do while loop account for the majority of the runtime.

Let's see what it takes to accelerate those loops.

Step 2 - Add Compute Directives

In Fortran, an OpenACC directive is indicated in the code by !$acc *your directive* . This is very similar to OpenMP programming and gives hints to the compiler on how to handle the compilation of your source. If you are using a compiler which does not support OpenACC directives, it will simply ignore the !$acc directives and move on with the compilation.

In Step 2, you will add compute regions around your expensive parallel loop(s). The first OpenACC directive you're going to learn about is the kernels directive. The kernels directive gives the compiler a lot of freedom in how it tries to accelerate your code - it basically says, "Compiler, I believe the following code block is parallelizable, so I want you to try and accelerate it as best you can."

Like most OpenACC directives in Fortran, the kernels directive applies to the structured code block defined by the opening and closing !$acc directives. For example, each of the following code samples instructs the compiler to generate a kernel -- from suitable loops -- for execution on an accelerator:

!$acc kernels

    ! accelerate suitable loops here 

!$acc end kernels

! but not loops outside of the region

At some point, you will encounter the OpenACC parallel directive, which provides another method for defining compute regions in OpenACC. For now, let's drop in a simple OpenACC kernels directive in front of the two do-loop codeblocks that follow the do while loop. The kernels directive is designed to find the parallel acceleration opportunities implicit in the do-loops in the Jacobi Iteration code.

To get some hints about how and where to place your kernels directives, click on the green boxes below. When you feel you are done, make sure to save the task2.f90 file you've modified with File -> Save, and continue on. If you get completely stuck, you can look at task2_solution.f90 to see the answer.

Hint #1

Remember that in Fortran, an OpenACC directive applies to the next structured code block. So for example, the following applies the kernels directive to the outer do loop and everything inside of it:
  
    !$ acc kernels
    do j=2, NM-1
      do i=2, NN-1
        A(i,j) = Anew(i,j)
            ...
      end do
    end do
    !$ end kernels
    

Let's now compile our task2.f90 file by executing the command below with Ctrl-Enter (or press the play button in the toolbar above).

In [ ]:
%%bash
# Compile the task2.f90 file with the pgfortran compiler
# -fast is the standard optimization flag
# -acc tells the compiler to process the source recognizing !$acc directives
# -ta=tesla tells the compiler to target an NVIDIA Tesla accelerator
# -Minfo tells the compiler to share information about the compilation process
pgfortran -fast -acc -ta=tesla -Minfo -o task2_out task2/task2.f90

If you successfully added !$acc kernels in the proper spots, you should see the following in the output of the compiler:

main:
     12, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Generated 7 prefetches in scalar loop
     13, Memory zero idiom, loop replaced by call to __c_mzero4
     14, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Generated 7 prefetches in scalar loop
     15, Memory zero idiom, loop replaced by call to __c_mzero4
     21, Loop not vectorized/parallelized: potential early exits
     23, Generating implicit copyin(a(:,:))
         Generating implicit copyout(anew(2:1023,2:1023))
     24, Loop is parallelizable
     25, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         24, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         25, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         28, Generating implicit reduction(max:error)
     33, Generating implicit copyout(a(2:1023,2:1023))
         Generating implicit copyin(anew(2:1023,2:1023))
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         34, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         35, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         

If you do not get similar output, please check your work and try re-compiling. If you're stuck, you can compare what you have to task2_solution.f90 in the editor above.

The output provided by the compiler is extremely useful, and should not be ignored when accelerating your own code with OpenACC. Let's break it down a bit and see what it's telling us.

  1. First since we used the -Minfo command-line option, we will see all output from the compiler. If we were to use -Minfo=accel we would only see the output corresponding to the accelerator, in this case an NVIDIA GPU.
  2. The first line of the output, main, tells us which function the following information is in reference to.
  3. The line starting with 24, Loop is parallelizable of the output tells us that on line 24 in our source, an accelerated kernel was generated. This is the the loop just after where we put our first !$acc kernels.
  4. The following lines provide more details on the accelerator kernel on line 24. It shows we created a parallel OpenACC loop. This loop is made up of gangs (a grid of blocks in CUDA language) and vector parallelism (threads in CUDA language) with the vector size being 128 per gang.
  5. At line 34, the compiler tells us it found another loop to accelerate.
  6. The rest of the information concerns data movement which we'll get into later in this lab.

So as you can see, lots of useful information is provided by the compiler, and it's very important that you carefuly inspect this information to make sure the compiler is doing what you've asked of it.

Finally, let's execute this program to verify we are getting the correct answer (execute the command below).

Once you feel your code is correct, try running it by executing the command below. You'll want to review our quality check from the beginning of task2 to make sure you didn't break the functionality of your application.

In [ ]:
%%bash
./task2_out

Let's record our results in the table:

|Step| Execution | Time(s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time | | -- || ------------ | ----------- | ------------------------- | -------- | ---------------- | |1| CPU 1 thread |1.624 | | Yes | | |2| Add kernels directive |5.794 | 0.28X | Yes | ||

Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4

Now, if your solution is similar to the one in task2_solution.f90, you have probably noticed that we're executing slower than the non-accelerated, CPU-only version we started with. What gives?! Let's see what pgprof can tell us about the performance of the code. Return to your PGPROF window from earlier, start another new session, but this time loading task2_out as your executable (it's in the same directory as before). This time we'll find a colorful graph of what our program is doing, this is the GPU timeline. We can't tell much from the default view, but we can zoom in by using the + magnifying glass at the top of the window. If you zoom in far enough, you'll begin to see a pattern like the one in the screenshot below. The teal and purple boxes are the compute kernels that go with the two loops in our kernels region. Each of these groupings of kernels is surrounded by tan coloer boxes representing data movement. What this graph is showing us is that for every step of our while loop, we're copying data to the GPU and then back out. Let's try to figure out why.

No description has been provided for this image

The compiler feedback we collected earlier tells you quite a bit about data movement. If we look again at the compiler feedback from above, we see the following.

  23, Generating implicit copyin(a(:,:))
     Generating implicit copyout(anew(2:1023,2:1023))
     <output snipped>
  33, Generating implicit copyout(a(2:1023,2:1023))
     Generating implicit copyin(anew(2:1023,2:1023))

This is telling us that the compiler has inserted data movement around our kernels region from line 23 to 33 which copies the A array in and out of GPU memory and also copies Anew out. This problem of copying back and forth on every iteration of a loop is sometimes called "data sloshing".

The OpenACC compiler can only work with the information we have given it. It knows we need the A and Anew arrays on the GPU for each of our two accelerated sections, but we didn't tell it anything about what happens to the data outside of those sections. Without this knowledge, it has to copy the full arrays to the GPU and back to the CPU for each accelerated section, every time it went through the while loop. That is a LOT of wasted data transfers.

Ideally, we would just transfer A to the GPU and create space for Anew at the beginning of the Jacobi Iteration, and then only transfer A back to the CPU at the end.

Because overall accelerator performance is detetermined largely by how well memory transfers are optimized, the OpenACC specification defines the data directive and several modifying clauses to manage all the various forms of data movement.

Step 3 - Manage Data Movement

We need to give the compiler more information about how to reduce unnecessary data movement for the Jacobi Iteration. We are going to do this with the OpenACC data directive and some modifying clauses defined in the OpenACC specification.

In Fortran, the data directive applies to the next structured code block. The compiler will manage data according to the provided clauses. It does this at the beginning of the data directive code block, and then again at the end. Some of the clauses available for use with the data directive are:

  • copy( list ) - Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
  • copyin( list ) - Allocates memory on GPU and copies data from host to GPU when entering region.
  • copyout( list ) - Allocates memory on GPU and copies data to the host when exiting region.
  • create( list ) - Allocates memory on GPU but does not copy.
  • present( list ) - Data is already present on GPU from another containing data region.

As an example, the following directive copies array A to the GPU at the beginning of the code block, and back to the CPU at the end. It also copies arrays B and C to the CPU at the end of the code block, but does not copy them both to the GPU at the beginning:


   !$acc data copy( A ), copyout( B, C )

   ....
   !$acc end data

For detailed information on the data directive clauses, you can refer to the OpenACC 2.5 specification.

In the task3.f90 editor, see if you can add in a data directive to minimize data transfers in the Jacobi Iteration. There's a place for the create clause in this exercise too. As usual, there are some hints provided, and you can look at task3_solution.f90 to see the answer if you get stuck or want to check your work. Don't forget to save with File -> Save in the editor below before moving on.

Hint #1

You should only have to worry about managing the transfer of data in arrays A and Anew.

Hint #2

You want to put the data directive just above the outer while loop. Don't forget to insert a `!$acc end data` directive in the appropriate spot.

Hint #3

You'll want to `copy( A )` so it is transferred to the GPU and back again after the final iterations through the `data` region. But you only need to `create( Anew )` as it is just used for temporary storage on the GPU, so there is no need to ever transfer it back and forth.

Once you think you have task3.f90 saved with a directive to manage data transfer, compile it with the below command and note the changes in the compiler output in the areas discussing data movement (lines starting with Generating ...). Then modify Anew using the create clause, if you haven't yet, and compile again.

In [ ]:
%%bash
pgfortran -fast -ta=tesla -acc -Minfo=accel -o task3_out task3/task3.f90 && echo 'Compiled Successfully!'

How are we doing on our timings? Let's execute our step 3 program and see if we have indeed accelerated the application versus the execution time we recorded after Step #2.

In [ ]:
%%bash
./task3_out

After making these changes, our accelerator code is much faster -- with just a few lines of OpenACC directives we have made our code more than twice as fast by running it on an accelerator, as shown in this table.

|Step| Execution | Time (s) | Speedup vs. 1 CPU thread | Correct? | Programming Time | | -- | ------------------------- | ---------------------- | ------------------------ | -------- | ---------------- | |1| CPU 1 thread | 1.624 | | Yes | | |2| Add kernels directive | 5.794 | 0.28X | Yes | | |3| Manage data movement | 1.778 | 0.91X | Yes | ||

Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4

We are making good progress, but we can still improve performance.

Step 4 - Optimize Kernel Scheduling

The final step in our tuning process is to tune the OpenACC compute region schedules using the gang and vector clauses. These clauses let us use OpenACC directives to take more explicit control over how the compiler parallelizes our code for the accelerator we will be using.

Kernel scheduling optimizations may give you significantly higher speedup, but be aware that these particular optimizations can significantly reduce performance portability. The vast majority of the time, the default kernel schedules chosen by the OpenACC compilers are quite good, but other times the compiler doesn't do as well. Let's spend a little time examining how we could do better, if we were in a situation where we felt we needed to.

First, we need to get some additional insight into how our Jacobi Iteration code with the data optimizations is running on the accelerator. Let's run the Fortran code with all your data movement optimizations on the accelerator again, this time setting the environment variable PGI_ACC_TIME, which will tell the program to print some high-level timers after program execution.

In [ ]:
%%bash
export PGI_ACC_TIME=1
pgfortran -fast -ta=tesla -acc -Minfo=accel -o accel_timing_out task3/task3_solution.f90
./accel_timing_out

This generates some information we haven't seen previously from the PGI compiler:

Accelerator Kernel Timing data /home/ubuntu/notebook/FORTRAN/task3/task3.f90 main NVIDIA devicenum=0 time(us): 348,997 21: data region reached 2 times 21: data copyin transfers: 1 device time(us): total=369 max=369 min=369 avg=369 45: data copyout transfers: 1 device time(us): total=346 max=346 min=346 avg=346 24: compute region reached 1000 times 24: data copyin transfers: 1000 device time(us): total=8,614 max=20 min=8 avg=8 26: kernel launched 1000 times grid: [32x256] block: [32x4] device time(us): total=212,547 max=223 min=208 avg=212 elapsed time(us): total=232,595 max=483 min=227 avg=232 26: reduction kernel launched 1000 times grid: [1] block: [256] device time(us): total=17,388 max=25 min=17 avg=17 elapsed time(us): total=37,193 max=58 min=35 avg=37 26: data copyout transfers: 1000 device time(us): total=17,712 max=28 min=13 avg=17 34: compute region reached 1000 times 36: kernel launched 1000 times grid: [32x256] block: [32x4] device time(us): total=92,021 max=100 min=91 avg=92 elapsed time(us): total=112,922 max=143 min=111 avg=112

There is a lot of information here about how the compiler mapped the computational kernels in our program to our particular accelerator (in this case, an NVIDIA GPU). We can see three regions. The first one is the memcopy loop nest starting on line 21, which takes only a tiny fraction of the total system time. The second region is the nested computation loop starting on line 24, and finally the copyback loop then executes beginning with line 34. If we look at the main loop nests, we can see these lines:

grid: [32x256] block[32x4]

The terms grid and block come from the CUDA programming model. A GPU executes groups of threads called thread blocks. To execute a kernel, the application launches a grid of these thread blocks. Each block runs on one of the GPUs multiprocessors and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 128 threads each. The grid the compiler has constructed is also 2D, 8 blocks wide and 1022 blocks tall. This is just enough to cover our 1024x1024 grid. But we don't really need that many blocks -- if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.

Note: You can let the compiler do the hard work of mapping loop nests, unless you are certain you can do it better (and portability is not a concern.) When you decide to intervene, think about different parallelization strategies (loop schedules): in nested loops, distributing the work of the outer loops to the GPU multiprocessors (on PGI = gangs) in 1D grids. Similarly, think about mapping the work of the inner loops to the cores of the multiprocessors (CUDA threads, vectors) in 1D blocks. The grids (gangs) and block (vector) sizes can be viewed by setting the environment variable ACC_NOTIFY. To get you started, here are some experiments we conducted for these computational kernels and this accelerator:

Accelerator Grid Outer Loop Gang Outer Loop Vector Inner Loop Gang Inner Loop Vector Seconds
GK520 1024x1024 8 32 1.721
4 64 0.479
8 32 0.354
16 32 0.380
4 64 0.355

Try to modify the task4.f90 code for the main computational loop nests in the window below. You'll be using the openacc loop constructs gang() and vector(). Look at task4_solution.f90 if you get stuck:

Hint #1

You'll want a gang() and vector() clause on the inner loops, but you may want to let the compiler decide the dimensions of the outer loopa.In that case, you can use a loop directive without any modifying clauses.

After you've made some changes, save your work, then compile and run in the boxes below:

In [ ]:
%%bash
pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90 && echo "Compiled Successfully"
In [ ]:
%%bash
./task4_out

Looking at task4_solution.f90, the gang(8) clause on the inner loop tells it to launch 8 blocks in the X(column) direction. The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads (one warp) wide. The absence of clause on the outer loop lets the compiler decide how many rows of threads and how many blocks to use in the Y(row) direction. We can see what it says, again, with:

In [ ]:
%%bash
export PGI_ACC_TIME=1
./task4_out

*Note: we usually want the inner loop to be vectorized, because it allows coalesced loading of data from global memory. This is almost guaranteed to give a big performance increase. Other optimizations are often trial and error. When selecting grid sizes, the most obvious mapping is to have

the number of gangs * the number of workers * the number of vectors = the total problem size. 

We may choose to manipulate this number, as we are doing here, so that each thread does multiple pieces of work -- this helps amortize the cost of setup for simple kernels.

Note: Low-level languages like CUDA Fortran offer more direct control of the hardware. You can consider optimizing your most critical loops in CUDA Fortran if you need to extract every last bit of performance from your application, while recognizing that doing so may impact the portability of your code. OpenACC and CUDA Fortran are fully interoperable.

A similar change to the copy loop nest benefits performance by a small amount. After you've made all your changes (look at task4_solution.f90 to be sure) compile your code below:

In [ ]:
%%bash
pgfortran -acc -Minfo=accel -o task4_out task4/task4.f90

Then run it and record the run time of the optimized code in the table:

In [ ]:
%%bash
./task4_out

Here is the perfomance after these final optimizations:

|Step| Execution | Time (s) | Speedup vs. 1 CPU Thread | Correct? | Programming Time | | -- | -------------------- | ------------- | ------------------------------- | -------- | ---------------- | |1| CPU 1 thread | 1.624 | | Yes | | |2| Add kernel directive | 5.794 | 0.28X | Yes | | |3| Manage data movement | 1.778 | 0.91X | Yes | | |4| Optimize kernel scheduling | 0.354 | 4.59X | Yes | ||

Note: Problem Size: 1024x1024; System Information: GK520; Compiler: PGI Community Edition 17.4

At this point, some of you may be wondering what kind of speed-up we get against the OpenMP version of this code. If you look at task3_omp.f90 in the text editor above, you can see a simple OpenMP version of the Jacobi Iteration code. Running this using 8-OpenMP threads on an Intel Xeon E5-2670 , our Kepler GK520 is just slightly slower. The reason for this is that in this lab so far we've been using a rather small matrix in our example: 1024x1024. This size was chosen in the interest of keeping the time you are spending here reasonable, as larger matrices take longer to run, especially before we optimize for data movement.

If we scale the matrix up to a more realistic size, say 4096x4096, our Kepler GK520 GPU becomes significantly faster than the 8-OpenMP thread version. If you have some time remaining in this lab, feel free to compile & run the OpenMP and OpenACC versions below with the larger matrices.

First, compile the OpenMP version, with two threads on each of the 4 cores of the host CPUs:

In [ ]:
%%bash
pgfortran -fast -mp=allcores -Minfo -o task4_4096_omp task4/task4_4096_omp.f90

Now run the OpenMP code you just created, and record your results in the new table for the larger matrix.

Note: because our dataset has now grown by 16-fold your CPU may not seem as responsive. We're using -Minfo in the compile so you can see that something is indeed happening, but you may need to be patient.

In [ ]:
%%bash
./task4_4096_omp

Now, compile and run the OpenACC solution for the larger 4096x4096 matrix using the next two boxes:

In [ ]:
%%bash
pgfortran -acc -Minfo=accel -o task4_4096_out task4/task4_4096_solution.f90
In [ ]:
%%bash
./task4_4096_out

Here's our comparison with the larger matrix size:

| Execution | matrix size | Time (s) | Speedup vs. 8 CPU threads | Correct? | Programming Time | | -------------------- | ----------- | -------- | ------------------------- | | | | CPU 8 threads | 4096x4096 | 12.336 | | | | | GPU optimized kernel | 4096x4096 | 3.538 | 2.890X | Yes | ||

Note: System Information: GK520; Compiler: PGI Community Edition 17.4

Learn More

If you are interested in learning more about OpenACC, you can use the following resources: