# Parallel Applications 2

## Public Resources

No login is required.
Where provided, read the accompanying texts carefully. They may contain useful information on the origin of the files, licence agreements, etc.

## Internal Resources

Login is required to view all content and to download files in this section.
Do not enter your LDAP credentials. A common user name and password were set for all students at the beginning of the semester.

### Lesson 1

Prerequisites

• download CUDA 10 project template with all additional libraries for further usage
• knowledge of C++

• try to compile the project template
• explore the project structure

### Lesson 2

Prerequisites

• download the project template with all additional libraries for further usage
• knowledge of C++

• Allocate the HOST memory that will represent two M-dimensional vectors (A, B) and fill them with some values.
• Allocate the DEVICE memory to be able to copy data from HOST.
• Allocate the DEVICE memory to store an output M-dimensional vector C.
• Create a kernel that adds the vectors element by element such that C[i] = A[i] + B[i].
• Allocate the HOST memory that will represent N pairs of M-dimensional vectors (A_0, ..., A_N-1, B_0, ..., B_N-1) and fill them with some values.
• Allocate the DEVICE memory to be able to copy data from HOST.
• Allocate the DEVICE memory to store output M-dimensional vectors C_0, ..., C_N-1.
• Create a kernel that sums all vector pairs such that C_0[i] = A_0[i] + B_0[i], ..., C_N-1[i] = A_N-1[i] + B_N-1[i].
• THINK ABOUT THE VARIANTS OF YOUR SOLUTION, CONSIDER THE PROS AND CONS.
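A host-side reference can help when checking the kernel output. The following is only a minimal sketch in plain C++: the function names `addVectors` and `addVectorBatch` are hypothetical, and the flat layout of the batch (all N vectors stored back to back, element i of vector k at index k * M + i) is an assumption, not a requirement of the task.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for the single-pair kernel: one loop iteration plays the
// role of one CUDA thread with global index i.
std::vector<float> addVectors(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Batched variant. ASSUMPTION: the n vectors of dimension m are stored
// back to back in one flat buffer, so element i of vector k is at k * m + i.
std::vector<float> addVectorBatch(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t n, std::size_t m) {
    assert(a.size() == n * m && b.size() == n * m);
    std::vector<float> c(n * m);
    for (std::size_t k = 0; k < n; ++k) {        // vector index
        for (std::size_t i = 0; i < m; ++i) {    // element index
            const std::size_t idx = k * m + i;
            c[idx] = a[idx] + b[idx];
        }
    }
    return c;
}
```

With this layout the batched sum is the same operation as one long vector sum of N * M elements, which is one of the variants worth comparing against launching a separate kernel per vector pair.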

### Lesson 3

Prerequisites

• download the project template with all additional libraries for further usage
• CUDA - memory allocation, page-locked memory

• Create a column matrix m[mRows,mCols] containing the numbers 0 1 2 3 ...
• The data should be well aligned in the page-locked memory.
• The matrix should be filled in a CUDA kernel.
• You must use pitched CUDA memory with appropriate alignment. Moreover, you must use a 2D grid of 2D blocks of size 8x8.
• Increment the values of the matrix.
• Finally, copy the matrix to HOST using the cudaMemcpy2D function.
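The addressing rule behind pitched memory can be tried out on the host first. The sketch below is plain C++ and only imitates the layout: all function names are hypothetical, the alignment passed to `paddedPitch` is an arbitrary illustrative value (on the GPU the pitch is chosen by the runtime), and the row-by-row numbering 0 1 2 3 ... is an assumption about the intended fill order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Round the row width in bytes up to a multiple of `alignment`.
// ASSUMPTION: the alignment value is illustrative only.
std::size_t paddedPitch(std::size_t rowBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Element (row, col) starts at byte offset row * pitch + col * sizeof(int).
void storeInt(unsigned char* base, std::size_t pitch,
              std::size_t row, std::size_t col, int value) {
    std::memcpy(base + row * pitch + col * sizeof(int), &value, sizeof(int));
}

int loadInt(const unsigned char* base, std::size_t pitch,
            std::size_t row, std::size_t col) {
    int value = 0;
    std::memcpy(&value, base + row * pitch + col * sizeof(int), sizeof(int));
    return value;
}

// Fill a rows x cols matrix with 0, 1, 2, ... (row by row) inside a buffer
// whose rows are `pitch` bytes apart; the padding bytes stay zero.
std::vector<unsigned char> fillPitched(std::size_t rows, std::size_t cols,
                                       std::size_t pitch) {
    assert(pitch >= cols * sizeof(int));
    std::vector<unsigned char> buffer(rows * pitch, 0);
    for (std::size_t row = 0; row < rows; ++row) {
        for (std::size_t col = 0; col < cols; ++col) {
            storeInt(buffer.data(), pitch, row, col,
                     static_cast<int>(row * cols + col));
        }
    }
    return buffer;
}
```

The key point is that consecutive rows are `pitch` bytes apart, not `cols * sizeof(int)` bytes apart, so every index computation in the kernel and the final 2D copy must use the pitch.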

Help for students

### Lesson 4

Prerequisites

• CUDA - shared memory

• Consider a simple particle system representing a set of positions of N rain drops in 3D space, where N >= 1M.
• Create a suitable data representation of this set of rain drops.
• Consider a field of 256 wind power plants that produce 256 movement vectors. The movement vectors change the positions of all rain drops each second.
• Create a kernel that simulates the falling of rain drops.
• For simplicity, assume that a single kernel call simulates one second in the simulated world.
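One possible data representation and update step is sketched below in plain C++ as a host-side reference. It rests on two loudly stated assumptions: the drops are stored as a structure of arrays (separate x, y, z arrays), and the effect of the 256 movement vectors on a drop is taken to be their sum. The task text does not fix either choice, and the names `Vec3`, `RainDrops`, `netMovement` and `simulateSecond` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Structure-of-arrays layout: neighbouring threads would read neighbouring
// floats, which suits coalesced access on the GPU.
struct RainDrops {
    std::vector<float> x, y, z;
};

// ASSUMPTION: the combined effect of all wind vectors is their plain sum.
Vec3 netMovement(const std::vector<Vec3>& wind) {
    Vec3 sum{0.0f, 0.0f, 0.0f};
    for (const Vec3& w : wind) {
        sum.x += w.x;
        sum.y += w.y;
        sum.z += w.z;
    }
    return sum;
}

// One call corresponds to one simulated second. Each loop iteration plays
// the role of one CUDA thread handling one drop.
void simulateSecond(RainDrops& drops, const std::vector<Vec3>& wind) {
    assert(drops.x.size() == drops.y.size());
    assert(drops.y.size() == drops.z.size());
    const Vec3 move = netMovement(wind);
    for (std::size_t i = 0; i < drops.x.size(); ++i) {
        drops.x[i] += move.x;
        drops.y[i] += move.y;
        drops.z[i] += move.z;
    }
}
```

In a kernel, the 256 vectors are small enough to be staged in shared memory once per block, which is where the prerequisite of this lesson comes in.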

### Lesson 5

This lesson focuses on a discussion of the students' projects. In the remaining time, the following tasks should be solved.

Prerequisites

• CUDA - constant memory

• Write simple code that allocates and sets a scalar value in the GPU constant memory.
• Copy the data back to HOST and check the value.
• Do the same with a custom structure and then with an array.

• On the host, create an array called Reference of 10 million random floating-point numbers from the interval [0, 1]. Create an array called Pattern of at least 16 random floating-point numbers from the interval [0, 1].
• Create a method that finds the best match of Pattern in Reference.
• Questions you have to answer in advance: How and where to store the data? What is the matching function? How to find the best match? Do I need some additional data structure? How to return a result?
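To make the questions concrete, here is one possible answer sketched on the CPU in plain C++. The matching function used here, the sum of squared differences, is only an assumption (other measures are equally valid), and the function name `bestMatchOffset` is hypothetical. The result is returned as the offset in Reference where Pattern fits best.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Slide the pattern over the reference and return the offset with the
// smallest sum of squared differences. ASSUMPTION: this is just one
// possible matching function. Ties go to the smallest offset.
std::size_t bestMatchOffset(const std::vector<float>& reference,
                            const std::vector<float>& pattern) {
    assert(!pattern.empty());
    assert(pattern.size() <= reference.size());
    std::size_t bestOffset = 0;
    float bestScore = std::numeric_limits<float>::max();
    for (std::size_t offset = 0;
         offset + pattern.size() <= reference.size(); ++offset) {
        float score = 0.0f;
        for (std::size_t j = 0; j < pattern.size(); ++j) {
            const float d = reference[offset + j] - pattern[j];
            score += d * d;
        }
        if (score < bestScore) {
            bestScore = score;
            bestOffset = offset;
        }
    }
    return bestOffset;
}
```

On the GPU, each offset could be scored by its own thread, with the small read-only Pattern being a natural candidate for constant memory. The minimum over all scores then requires a reduction, which is the "additional data structure" question in practice.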

### Lesson 6

Prerequisites

• CUDA - texture memory
• DOWNLOAD the resource files and code snippets
• CUDA RUNTIME API methods/structures that we have not used yet:
• texture<...> for texture references
• cudaCreateChannelDesc()
• cudaBindTexture2D()
• cudaMallocArray(), cudaMemcpyToArray(), cudaBindTextureToArray(), cudaFreeArray()

All you need to do is load an image from a file (terrain10x10.tif) and prepare CUDA textures using A) linear pitch memory and B) cudaArray. The terrain represents a 10x10 heightmap, where each pixel is an 8-bit value. However, the goal is to convert it into a 10x10 texture of floating-point numbers, i.e. 32 bits per pixel/texel.

• See imageManager.h in /common/Utils/... It helps you load the *.TIF image into HOST memory.
• See the loadSourceImage() method in runner.cu. There, linear DEVICE memory is allocated and the source image data is copied into it. CUDA textures need at least linear pitch memory, and this is where your tasks start.
• Create a texture reference. The texture reference texRef must be declared in the file-scope (runner.cu)
• expected texture features: float, 2D, element type read mode (we want to read the texture as a simple table)
• Finish the method createTextureFromLinearPitchMemory().
• allocate linear pitch memory
• convert the image data into floats. Call colorToFloat() from imageKernel.cuh and beware of data alignment, image size and bits per pixel.
• Create the texture channel description using cudaCreateChannelDesc(). NOTICE: We want a floating-point texture. That is why the storage should be 32 bits in the RED channel and 0 bits in the green, blue and alpha channels, and the channel format kind must support floats (cudaChannelFormatKindFloat).
• Set all texRef properties (normalized, filterMode, addressMode[0], addressMode[1]).
• Call cudaBindTexture2D()
• Do the same again, but use cudaArray instead of linear pitch memory.
• createTextureFrom2DArray().
• See the following methods from CUDA RUNTIME API: cudaMallocArray, cudaMemcpyToArray, cudaBindTextureToArray

Click HERE to download an illustrative solution. This solution has been designed with respect to your existing knowledge.

### Lesson 7 - competition

As the current situation still does not allow you to attend the seminar in person, and the results of the previous task were quite similar, I have prepared a small competition. You can demonstrate what you have learned about different types of memory, data access strategies, etc. First, download and inspect the input data, which consists of two images. One is reference.tif, which represents some noise data, in which you must find a pattern represented by query.tif. You can use relevant parts of your previous code.

Prerequisites

• CUDA memory

The query and the corresponding part of the reference image must be compared somehow. Do not waste time: use a simple Euclidean distance implemented as a separate device function. The result will be the position (a ROW-COLUMN pair) of the top-left corner of the placed pattern. Think about the image coordinate system. Where is the origin?

• Find the query image data in the reference image as fast as you can and return the resulting position of the best placement.
• Add a time measurement that covers only the total computation time of your pattern matching function.
• Write another device function
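A slow but simple CPU reference is useful for verifying the position returned by a fast GPU solution. The sketch below is plain C++ under stated assumptions: images are stored row-major with the origin in the top-left corner (check this against the actual image loader), pixel values are already converted to floats, and the names `Image`, `Position`, `squaredDistance` and `findQuery` are hypothetical. It compares squared Euclidean distances, which gives the same best position as the distance itself.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Image {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> pixels;  // ASSUMPTION: row-major, origin top-left
    float at(std::size_t r, std::size_t c) const {
        return pixels[r * cols + c];
    }
};

struct Position {
    std::size_t row;
    std::size_t col;
};

// Squared Euclidean distance between the query and the window of the
// reference whose top-left corner is at (top, left).
float squaredDistance(const Image& reference, const Image& query,
                      std::size_t top, std::size_t left) {
    float sum = 0.0f;
    for (std::size_t r = 0; r < query.rows; ++r) {
        for (std::size_t c = 0; c < query.cols; ++c) {
            const float d = reference.at(top + r, left + c) - query.at(r, c);
            sum += d * d;
        }
    }
    return sum;
}

// Exhaustive search over all placements; ties go to the first placement
// in row-major scan order.
Position findQuery(const Image& reference, const Image& query) {
    assert(query.rows > 0 && query.cols > 0);
    assert(query.rows <= reference.rows && query.cols <= reference.cols);
    Position best{0, 0};
    float bestDistance = std::numeric_limits<float>::max();
    for (std::size_t top = 0; top + query.rows <= reference.rows; ++top) {
        for (std::size_t left = 0; left + query.cols <= reference.cols; ++left) {
            const float d = squaredDistance(reference, query, top, left);
            if (d < bestDistance) {
                bestDistance = d;
                best = Position{top, left};
            }
        }
    }
    return best;
}
```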

If two solutions have almost the same computation time, the winner will be determined by the time at which each solution arrived in my e-mail inbox.

### Lesson 7 - online seminar

Prerequisites

• CUDA - unified memory (Unified Virtual Addressing)
• "grid" data in memory - access patterns

• Create a normal map from the input image using a Sobel operator
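As a host-side reference, a normal map can be derived from Sobel gradients roughly as follows. This is a sketch in plain C++ with several assumptions: the input is a single-channel height image stored row-major as floats, borders are handled by clamping coordinates, and the normal is taken as normalize(-gx, -gy, 1). The sign convention and the constant z component vary between implementations, and the names `Normal`, `heightAt` and `sobelNormalMap` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Normal {
    float x, y, z;
};

// Read a height with coordinates clamped to the image (border handling
// is an ASSUMPTION; other strategies are possible).
float heightAt(const std::vector<float>& h, int rows, int cols, int r, int c) {
    r = std::min(std::max(r, 0), rows - 1);
    c = std::min(std::max(c, 0), cols - 1);
    return h[static_cast<std::size_t>(r) * static_cast<std::size_t>(cols) +
             static_cast<std::size_t>(c)];
}

std::vector<Normal> sobelNormalMap(const std::vector<float>& h,
                                   int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(h.size() == static_cast<std::size_t>(rows) *
                           static_cast<std::size_t>(cols));
    std::vector<Normal> normals(h.size());
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // Horizontal Sobel: right column minus left column.
            const float gx =
                (heightAt(h, rows, cols, r - 1, c + 1) +
                 2.0f * heightAt(h, rows, cols, r, c + 1) +
                 heightAt(h, rows, cols, r + 1, c + 1)) -
                (heightAt(h, rows, cols, r - 1, c - 1) +
                 2.0f * heightAt(h, rows, cols, r, c - 1) +
                 heightAt(h, rows, cols, r + 1, c - 1));
            // Vertical Sobel: bottom row minus top row.
            const float gy =
                (heightAt(h, rows, cols, r + 1, c - 1) +
                 2.0f * heightAt(h, rows, cols, r + 1, c) +
                 heightAt(h, rows, cols, r + 1, c + 1)) -
                (heightAt(h, rows, cols, r - 1, c - 1) +
                 2.0f * heightAt(h, rows, cols, r - 1, c) +
                 heightAt(h, rows, cols, r - 1, c + 1));
            const float len = std::sqrt(gx * gx + gy * gy + 1.0f);
            normals[static_cast<std::size_t>(r) * static_cast<std::size_t>(cols) +
                    static_cast<std::size_t>(c)] =
                Normal{-gx / len, -gy / len, 1.0f / len};
        }
    }
    return normals;
}
```

Each output pixel reads a 3x3 neighbourhood of the input, so neighbouring threads share most of their reads. That overlap is what makes the memory access pattern worth thinking about in the kernel version.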

### Lesson 8 - online seminar

Prerequisites

• Texture reference API vs. Object API
• OpenGL, freeglut, glew, textures, pixel buffer objects
• some help (calling pipeline) - DOWNLOAD
• CUDA - OpenGL interoperability: source file for online seminar - DOWNLOAD

• Follow these instructions from the on-line seminar.
• Virus spread simulation: The world map has three basic colors. Black is the general background, i.e. large water areas like oceans, seas, etc. Blue represents lakes and other small water areas. White represents the mainland.
• In the first iteration, create a single red pixel on the mainland that will represent the epicenter of the infection.
• Each subsequent iteration of the simulation will add some random red pixels on the mainland. Next, all pixels surrounding the red pixels from the previous iteration will turn red as well.
• The virus can spread only over the mainland, i.e. over the bright white color.
• Think about the shapes of the red areas that will be created during the simulation. What must be changed to obtain circular areas?
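The spreading rule for one iteration can be prototyped on the CPU before it is wired into the CUDA - OpenGL pipeline. The sketch below is plain C++ and covers only the neighbour rule, not the random seeding: the `Cell` enum is a hypothetical stand-in for the pixel colors, "surrounding pixels" is interpreted as the 8-neighbourhood (an assumption), and the step reads from the previous state while writing to a copy so that newly infected pixels do not spread within the same iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell states standing in for the map colors:
// black background, blue lake, white mainland, red infected.
enum class Cell : unsigned char { Background, Lake, Mainland, Infected };

// One iteration of the neighbour rule. ASSUMPTION: "surrounding pixels"
// means the 8-neighbourhood. Reads `prev`, writes a separate `next`.
std::vector<Cell> spreadStep(const std::vector<Cell>& prev, int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert(prev.size() == static_cast<std::size_t>(rows) *
                              static_cast<std::size_t>(cols));
    std::vector<Cell> next = prev;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const std::size_t idx = static_cast<std::size_t>(r) *
                                        static_cast<std::size_t>(cols) +
                                    static_cast<std::size_t>(c);
            if (prev[idx] != Cell::Mainland) {
                continue;  // only healthy mainland can become infected
            }
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int nr = r + dr;
                    const int nc = c + dc;
                    if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) {
                        continue;
                    }
                    const std::size_t nidx = static_cast<std::size_t>(nr) *
                                                 static_cast<std::size_t>(cols) +
                                             static_cast<std::size_t>(nc);
                    if (prev[nidx] == Cell::Infected) {
                        next[idx] = Cell::Infected;
                    }
                }
            }
        }
    }
    return next;
}
```

The double-buffered form maps directly to a kernel with one thread per pixel, because no thread ever reads a value written by another thread during the same step.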

### Lesson 9 - online seminar

Prerequisites

• Unified memory
• Atomic functions

• Load an RGB image and store its raw pixel data in the unified memory
• Create a kernel that computes the image histogram (for each color channel separately)
• Find the most exposed pixel
• Again, think about the optimization, kernel settings (grid, blocks), etc.
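A CPU reference for the per-channel histogram is handy for validating the kernel. The sketch below is plain C++ and assumes interleaved 8-bit RGB data (R, G, B, R, G, B, ...); the actual layout depends on the image loader. The names `Histogram`, `RgbHistogram` and `computeHistogram` are hypothetical. Finding the most exposed pixel is not covered here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Histogram = std::array<unsigned int, 256>;

struct RgbHistogram {
    Histogram r{}, g{}, b{};  // all bins start at zero
};

// ASSUMPTION: pixels are interleaved 8-bit RGB triples.
// Each increment below is where a kernel would need an atomic add,
// because many threads may hit the same bin at once.
RgbHistogram computeHistogram(const std::vector<unsigned char>& rgb) {
    assert(rgb.size() % 3 == 0);
    RgbHistogram h;
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
        ++h.r[rgb[i]];
        ++h.g[rgb[i + 1]];
        ++h.b[rgb[i + 2]];
    }
    return h;
}
```

One optimization to consider on the GPU is keeping a private histogram per block in shared memory and merging it into the global one at the end, which reduces contention on the global bins.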

### Lesson 10 - online seminar

Prerequisites

• Try to finish the given application. To do that, you have to implement all subtasks in the code. There are two vectors A and B (dim ~= 1M) that will be duplicated N times in a loop. A simple kernel computes the vector sum A+B=C. Everything will be done in streams with respect to the following tasks.
• TASK 1: Simple stream
• TASK 2: two streams - depth first approach
• TASK 3: two streams - breadth first approach
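The difference between the depth-first and breadth-first approaches lies purely in the order in which the host issues the operations. The following plain C++ sketch prints nothing and touches no GPU; it only generates that issue order as labels, assuming each stream performs the usual three stages (host-to-device copy, kernel, device-to-host copy). The function name `issueOrder` and the label format are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the order in which operations would be issued by the host.
// Depth-first: finish issuing all stages of one stream, then the next.
// Breadth-first: issue one stage for every stream, then the next stage.
std::vector<std::string> issueOrder(int streams, bool breadthFirst) {
    assert(streams > 0);
    const char* stages[] = {"H2D", "kernel", "D2H"};
    std::vector<std::string> order;
    if (breadthFirst) {
        for (const char* stage : stages) {
            for (int s = 0; s < streams; ++s) {
                order.push_back(std::string(stage) + "@s" + std::to_string(s));
            }
        }
    } else {
        for (int s = 0; s < streams; ++s) {
            for (const char* stage : stages) {
                order.push_back(std::string(stage) + "@s" + std::to_string(s));
            }
        }
    }
    return order;
}
```

In the real application each label corresponds to an asynchronous call bound to the given stream. Which ordering overlaps copies and kernels better depends on the device, so it is worth timing both.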

### Lesson 11 - online seminar

Prerequisites

• Try to finish a given application. To do that, you have to implement all subtasks in the code. You must create a distance matrix that will contain the distances between vectors in 3D.
• The distances can be computed in several different ways. Try to use BLAS3 functions. This means that you must express the computation in terms of matrix operations.
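One way to turn the distance computation into a matrix operation uses the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a·b: all dot products a·b form the matrix product X * X^T, which is exactly the kind of work a BLAS level 3 routine does. The plain C++ sketch below illustrates the idea with a naive triple loop standing in for the matrix multiplication; it is not the layout or call sequence of the given application, and the function name `distanceMatrix` and the row-major n x 3 point storage are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// points: n vectors in 3D, stored row-major as n x 3 (ASSUMPTION).
// Returns the n x n matrix of Euclidean distances, row-major.
std::vector<float> distanceMatrix(const std::vector<float>& points,
                                  std::size_t n) {
    assert(points.size() == n * 3);
    // Gram matrix G = X * X^T. This naive loop stands in for a
    // matrix-matrix multiplication from a BLAS level 3 library.
    std::vector<float> gram(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < 3; ++k) {
                gram[i * n + j] += points[i * 3 + k] * points[j * 3 + k];
            }
        }
    }
    // d(i, j)^2 = G[i][i] + G[j][j] - 2 * G[i][j]
    std::vector<float> dist(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const float d2 =
                gram[i * n + i] + gram[j * n + j] - 2.0f * gram[i * n + j];
            // Rounding can make d2 slightly negative; clamp before sqrt.
            dist[i * n + j] = std::sqrt(std::max(d2, 0.0f));
        }
    }
    return dist;
}
```

The squared norms sit on the diagonal of the Gram matrix, so after the multiplication only a cheap element-wise pass remains.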

You have to send the final runner.cu file to my e-mail address, and it must be received by 6 May 2020, 11:59 PM, for you to get credit for the seminar and activity points based on the quality of your source code!