 # The most simple Difference of Gaussian image pyramid with OpenCL

06 Aug 2023 - tsp
Last update: 06 Aug 2023 (approx. 22 min read)

The following blog article summarizes a quick implementation of a Difference of Gaussian (DoG) pyramid as an approximation of a Laplacian of Gaussian pyramid. It does not utilize any optimizations, so performance (around 10ms for a simple 6 octave pyramid) is rather low. Of course all timings taken here depend on the CPU and GPU that have been used. To generate the graphs and timings shown in this blog article an AMD Radeon R9 360 has been used with a Xeon E3 host CPU on FreeBSD 13.

## Theory

Image pyramids are often used in machine vision. They are usually used to create multi-scale image representations by repeated downsampling. To reduce the resolution of a picture one usually performs subsampling after applying a (Gaussian) blur. The blurring step is necessary to prevent total loss of high frequency information - for example think of a binary image where every second pixel in every second row is set to white and every other to black: when downsampling by a factor of two without blurring one would end up with a uniform single-color image, while after applying a blur it would be gray. The blur still acts as a kind of low pass since one loses high frequency information inside the images - and this is what forms the different scale spaces. Depending on the application one can also decide not to perform blurring but to apply the mean when building the downsampled version - how one does this depends entirely on the application (calculating the mean of 9 pixels for example would only require 8 additions and one multiplication, while applying a 3x3 Gaussian blur would require 9 multiplications and 8 additions, so considerably more operations).
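As a quick illustration of that aliasing argument, here is a minimal numpy/scipy sketch (not part of the OpenCL code below, all names are made up for illustration) that subsamples a checkerboard once directly and once after a Gaussian blur:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# 8x8 checkerboard: every second pixel white, the others black
checker = np.indices((8, 8)).sum(axis=0) % 2 * 255

plain = checker[::2, ::2]                                              # subsample only -> uniform image
blurred = gaussian_filter(checker.astype(np.float32), 1.0)[::2, ::2]   # blur first -> grayish values

print(plain)    # all zeros (or all 255, depending on the phase of the pattern)
print(blurred)  # values around 127
```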

Inside the scale spaces one often performs bandpass filtering or searches for maxima and minima. This is usually done by calculating derivatives of the image. These can be approximated by applying Gaussian blurs of different widths and calculating differences between the resulting images:

$$ f(\sigma_1; \sigma_2) = I * \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 + y^2}{2 \sigma_1^2}} - I * \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 + y^2}{2 \sigma_2^2}} $$

Since convolution is linear this is equivalent to convolving the image with the difference of the two Gaussians:

$$ f(\sigma_1; \sigma_2) = I * \left( \frac{1}{\sigma_1 \sqrt{2 \pi}} e^{-\frac{x^2 + y^2}{2 \sigma_1^2}} - \frac{1}{\sigma_2 \sqrt{2 \pi}} e^{-\frac{x^2 + y^2}{2 \sigma_2^2}} \right) $$

By building those differences one can approximate the Laplacian of a Gaussian well enough - this approach is often called Lowe's pyramid scheme:

• Build scale space octaves by repeatedly subsampling by a factor of two. The number of octaves is $n_o$.
• Inside each octave apply the same Gaussian again and again, which corresponds to calculating blurs with $\sigma_i = 2^{\frac{i}{n_s}} \sigma_0$ ($n_s$ being the number of images per octave).

Since one then only wants to search for scale space maxima one can show that it is sufficient to calculate $3$ difference levels inside each octave - this can be verified by checking how many stable keypoints one finds when increasing the number of scales per octave, which has been done extensively in the literature.
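To make the scheme more concrete, the following is a minimal host-side sketch of one octave: it computes the per-level sigmas and the three difference images from four blurred copies. The function name and default parameters are made up, and it uses scipy, which the OpenCL implementation below does not need - it is only meant as a reference for what the GPU kernels approximate.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reference_dog_octave(gray, sigma0=np.sqrt(2), n_s=3):
    """Blur one octave n_s + 1 times and return the n_s difference images."""
    sigmas = [sigma0 * 2 ** (i / n_s) for i in range(n_s + 1)]
    blurred = [gaussian_filter(gray.astype(np.float32), s, mode='nearest') for s in sigmas]
    return [blurred[i + 1] - blurred[i] for i in range(n_s)]
```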

## Implementing image and difference of Gaussian (DoG) pyramids in OpenCL

It has been shown previously how to implement convolution more efficiently on the GPU than on the CPU. That convolution kernel had been pretty simple. The next step is to implement the whole subsampling and blurring for the scale spaces as well as the difference of Gaussian pyramid on the GPU.

What’s necessary to build a DoG Pyramid?

• First the input image has to be grayscaled.
• Then one has to build the images for the scale spaces. One requires 4 different blurring levels (which can also be used for subsampling when done correctly by increasing the scale by $\sqrt{2}$ instead of $2$ each step). It's a good idea to keep all 4 images in the same buffer. Since a separable kernel is used, another temporary buffer is required so the second pass does not destroy its own input data.
• In the last step one has to build the difference images. Since one only requires 3 difference images for each octave one can put all levels into a single image buffer object - the next octave always fits into the remaining part of the 4th image when arranging them in a 2x2 array pattern (see the small size sketch below).
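As a small plausibility check of that layout (a sketch only, with made-up example dimensions): an octave buffer holding four grayscale images in a 2x2 arrangement has exactly the same byte size as the octave's RGBA input, and the single difference buffer covering all octaves is allocated with twice the input height and width.

```python
h, w = 512, 768                       # example input dimensions (made-up values)

rgba_bytes    = h * w * 4             # RGBA input, one byte per channel
octave0_bytes = (2 * h) * (2 * w)     # four grayscale images arranged in a 2x2 pattern
diff_bytes    = (2 * h) * (2 * w)     # single difference buffer covering all octaves

print(rgba_bytes == octave0_bytes)    # True - this is why the buffer size "stays the same"
```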

## OpenCL initialization

The first step is to initialize the OpenCL framework. Since I'm going to test the kernels in Python this code uses pyopencl. The very simple testbench simply fetches a list of all platforms and their devices and then selects a predefined one (usedPlatform and usedDevice). Since this code is only thought of as a testbench for the OpenCL code there is no external configuration option. A detailed description can be found in a previous blog article.

The initialization creates or initializes:

• The platform
• A single device on the selected platform
• A context for the given device
• A command queue for the device
```python
import pyopencl as cl
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image

from time import time

platforms = cl.get_platforms()
for idxPlatform, platform in enumerate(platforms):
    devices = platform.get_devices()
    print(f"Platform {platform.name}:")
    for idxDevice, dev in enumerate(devices):
        print(f"   {idxDevice}: {dev.name}")

usedPlatform = 0
usedDevice = 0

clPlatform = platforms[usedPlatform]
clDevice = clPlatform.get_devices()[usedDevice]
clContext = cl.Context([ clDevice ])
clQueue = cl.CommandQueue(clContext, clDevice)
```


## Defining the grayscale kernel

The first step for most image processing algorithms is grayscaling. This of course reduces the amount of data and especially the complexity - but it also reduces the dimensionality and the amount of information in the data. The grayscale kernel accepts an RGBA image with 4 bytes per pixel, one byte per channel - the alpha channel is ignored. It then grayscales with the factors specified in the ITU-R BT.709 standard, which is also used in HDTV, to compute the luma component. It clamps the output value to the 0 to 255 range and writes it into a simple output array. For this quick test it is assumed that all images are RGBA images - in a real world scenario one would have to support a variety of input formats (at least RGB, RGBA and YUV frames as well as already grayscaled images).

The ITU-R BT.709 formula defines the luma component as:

$Y = 0.2126 * R + 0.7152 * G + 0.0722 * B$

This tries to represent the sensitivity of the human eye to different colors in some way. Since all pixels are independent this is a very simple kernel:

```python
clkernel_Grayscale_Tpl = '''//CL//

#ifndef GRAYSCALE_FACTOR_R
    #define GRAYSCALE_FACTOR_R 0.2126
#endif
#ifndef GRAYSCALE_FACTOR_G
    #define GRAYSCALE_FACTOR_G 0.7152
#endif
#ifndef GRAYSCALE_FACTOR_B
    #define GRAYSCALE_FACTOR_B 0.0722
#endif

__kernel void grayscale_RGBA2Pyrabuffer(
    __global unsigned char* lpRGBAIn,
    __global unsigned char* lpPyraBufferOut
) {
    int y = get_global_id(0);
    int x = get_global_id(1);

    float gray_value = ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 0])) * GRAYSCALE_FACTOR_R
                     + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 1])) * GRAYSCALE_FACTOR_G
                     + ((float)(lpRGBAIn[(y + x * IMAGEHEIGHT) * 4 + 2])) * GRAYSCALE_FACTOR_B;

    if (gray_value > 255) {
        gray_value = 255.0;
    }
    if (gray_value < 1) {
        gray_value = 0;
    }

    lpPyraBufferOut[(y + x * 2 * IMAGEHEIGHT)] = (unsigned char)gray_value;
}
'''
```

```python
#
# clkernel_Grayscale = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[0], image_input.shape[1])  + clkernel_Grayscale_Tpl
#
# clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
```
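For a plausibility check of the kernel output one can compute the same luma on the host. This is just a small numpy sketch (the function name is made up and it is not used anywhere else in this article):

```python
import numpy as np

def grayscale_reference(rgba):
    """Hypothetical host-side reference: ITU-R BT.709 luma of an RGBA uint8 image."""
    weights = np.asarray([0.2126, 0.7152, 0.0722], dtype=np.float32)
    gray = rgba[..., :3].astype(np.float32) @ weights
    return np.clip(gray, 0, 255).astype(np.uint8)
```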


## Defining the (special) convolution kernel

This kernel performs the convolution with a Gaussian in horizontal or vertical direction only and simply repeats border pixels - other border modes would require a different implementation, for example selected via preprocessor definitions. It operates on the octave buffer and a temporary buffer (so the separable passes do not destroy their own input data) and accepts source and target indices in horizontal and vertical direction, which are used to apply the convolution only to a window inside the buffer so that all images can be stored side by side in the same buffer object. Whether the horizontal or the vertical pass is executed is selected at compile time through a preprocessor definition. The simple implementation of a convolution is described in a previous blog article.

In addition a short Python function is supplied that initializes the buffer containing the Gaussian kernel constants. Those are calculated on the host instead of being calculated on the GPU on the fly.

```python
def gaussiankernel(width, sigma):
    if width % 2 != 1:
        raise ValueError("Require odd kernel size")

    r = np.empty((width,))

    for k in range(int(width/2) + 1):
        r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
        r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

    # Apply norm ...
    r = r / np.sum(r)

    return r

#imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(5, np.sqrt(2))
imagekernel_GaussianBlurGrayscalePyramid = gaussiankernel(7, np.sqrt(2))
```

```python
clkernel_GaussianBlurGrayscalePyramid = '''//CL//
__kernel void convolveGrayscalePyramid(
    __global unsigned char* lpBuffer,
    __global unsigned char* lpTempBuffer,
    __constant float* lpKernel,

    int sourceX, int sourceY,
    int destX, int destY
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    float v = 0.0;

    /* Read the pixel values and apply Gaussian weights */

    for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
#if defined(CONVOLVE_HORIZONTAL)
        int realx = x + i;
        if(realx < 0) {
            realx = 0;
        } else if(realx >= IMAGEWIDTH) {
            realx = IMAGEWIDTH-1;
        }

        float kv = lpKernel[i + KERNELHALFWIDTH];

        realx = realx + sourceX * IMAGEWIDTH;
        int realy = y + sourceY * IMAGEHEIGHT;

        v = v + lpBuffer[realy + realx * 2 * IMAGEHEIGHT] * kv;
#else
        int realy = y + i;
        if(realy < 0) {
            realy = 0;
        } else if(realy >= IMAGEHEIGHT) {
            realy = IMAGEHEIGHT-1;
        }

        float kv = lpKernel[i + KERNELHALFWIDTH];

        v = v + lpTempBuffer[realy + x * IMAGEHEIGHT] * kv;
#endif
    }

    if(v > 255) {
        v = 255;
    } else if(v < 1) {
        v = 0;
    }

#if defined(CONVOLVE_HORIZONTAL)
    lpTempBuffer[y + (x * IMAGEHEIGHT)] = (unsigned char)v;
#else
    lpBuffer[(y + (destY * IMAGEHEIGHT)) + ((x + (destX * IMAGEWIDTH)) * 2 * IMAGEHEIGHT)] = (unsigned char)v;
#endif
}
'''
```

```python
# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# #define CONVOLVE_HORIZONTAL 1
# '''.format(int(len(imagekernel)/2), image_input.shape[0], image_input.shape[1])  + clkernel
#
# clProgH = cl.Program(clContext, clkernel).build()

# clkernel = '''
# #define KERNELHALFWIDTH {0}
# #define IMAGEHEIGHT {1}
# #define IMAGEWIDTH {2}
# '''.format(int(len(imagekernel)/2), image_input.shape[0], image_input.shape[1])  + clkernel

# clProgV = cl.Program(clContext, clkernel).build()
```
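As with the grayscale kernel, the separable blur can be cross-checked against a host-side implementation. The following is only a sketch using scipy.ndimage (the function name is made up; mode='nearest' mimics the border pixel repetition of the OpenCL kernel):

```python
import numpy as np
from scipy.ndimage import convolve1d

def separable_blur_reference(gray, kernel1d):
    """Hypothetical host-side reference for the separable Gaussian blur."""
    tmp = convolve1d(gray.astype(np.float32), kernel1d, axis=1, mode='nearest')  # horizontal pass
    out = convolve1d(tmp, kernel1d, axis=0, mode='nearest')                      # vertical pass
    return np.clip(out, 0, 255).astype(np.uint8)

# e.g. separable_blur_reference(gray_image, gaussiankernel(7, np.sqrt(2)))
```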


## Definition of the subsampling kernel

To decrease the resolution of an image it is first blurred using a Gaussian (this is done for the scale spaces anyways) - then one can perform simple subsampling. The subsampling kernel reduces the resolution of an image by a factor of two in each direction. It always reads from the third image in the source buffer and writes into the first image in the target buffer. It is simply launched for half the number of input pixels in each direction - again every pixel is independent of the others, so the kernel is very simple:

```python
clkernel_SubsampleGrayscalePyramid = '''//CL//
__kernel void subsampleGrayscalePyramid(
    __global unsigned char* lpPyramidLevelIn,
    __global unsigned char* lpPyramidLevelOut
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    // We source from the third image ...
    float v = lpPyramidLevelIn[(y*2) + IMAGEHEIGHT + ((x*2) * 2 * IMAGEHEIGHT)];

    // And write into the first in the output ... not using IMAGEHEIGHT >> 1 since requiring * 2 again
    lpPyramidLevelOut[y + (x * (IMAGEHEIGHT))] = v;
}
'''
```

```python
# clkernel = '''
# #define IMAGEHEIGHT {0}
# #define IMAGEWIDTH {1}
# '''.format(image_input.shape[0], image_input.shape[1])  + clkernel

# clProgSubsample = cl.Program(clContext, clkernel).build()
```
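The host-side equivalent of the subsampling is a single slicing operation; again this is only a hypothetical reference snippet, not part of the OpenCL code:

```python
def subsample_reference(blurred):
    """Hypothetical reference: keep every second pixel in each direction."""
    return blurred[::2, ::2]
```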


## Definition of the difference kernel

The last step for a Difference of Gaussian (DoG) pyramid is the difference step. It subtracts all scale images of a single octave at once and writes three difference images at the same time. This might not be the wisest implementation due to caching issues - it's just one of the simplest and enough for a first try. The kernel accepts the octave buffer as well as the single difference map output buffer. In addition the source width and height are specified together with the target offset that locates this octave's difference images in the remaining space next to the previous octave's images.

```python
# Definitions expected by this kernel (filled in via .format() when compiling):
#  IMAGEHEIGHT_DIFFPYRAMID
#  IMAGEWIDTH_DIFFPYRAMID
```

```python
clkernel_DifferencePyramidDifference = '''//CL//
__kernel void grayscalePyramidDifference(
    __global unsigned char* lpPyramidLevelIn,
    __global unsigned char* lpDifferenceMapOut,

    int targetOffset_X, int targetOffset_Y,
    int sourceWidth, int sourceHeight
) {
    int x = get_global_id(1);
    int y = get_global_id(0);

    float v1 = lpPyramidLevelIn[y + (x * sourceHeight)];
    float v2 = lpPyramidLevelIn[y + (sourceHeight / 2) + (x * sourceHeight)];
    float v3 = lpPyramidLevelIn[y + ((x + sourceWidth / 2) * sourceHeight)];
    float v4 = lpPyramidLevelIn[y + sourceHeight / 2 + ((x + sourceWidth / 2) * sourceHeight)];

    lpDifferenceMapOut[(y + targetOffset_Y) +  (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v3 - v1;
    lpDifferenceMapOut[(y + targetOffset_Y + (sourceHeight >> 1)) +  (x + targetOffset_X) * IMAGEHEIGHT_DIFFPYRAMID] = v2 - v3;
    lpDifferenceMapOut[(y + targetOffset_Y) +  (x + targetOffset_X + (sourceWidth >> 1)) * IMAGEHEIGHT_DIFFPYRAMID] = v4 - v2;
}
'''
```
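The corresponding host-side sketch of the difference step simply subtracts consecutive blur levels; this is only a hypothetical conceptual reference and does not reproduce the 2x2 buffer layout or the ordering used by the kernel above:

```python
import numpy as np

def dog_reference(b1, b2, b3, b4):
    """Hypothetical reference: three difference images from four blurred images."""
    d = [im.astype(np.float32) for im in (b1, b2, b3, b4)]
    return d[1] - d[0], d[2] - d[1], d[3] - d[2]
```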


## Quick tests

The following is just a quick run of all kernels after each other to check that they really work as expected before assembling a whole pyramid calculation function.

### Quick test of the grayscale kernel

```python
# Load image

image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)

h, w, c = image_input.shape
print(f"Source image has dimensions {h} x {w} x {c}")

# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[0], image_input.shape[1])  + clkernel_Grayscale_Tpl

clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()

# Image input buffer

clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
npBuffer_Scales = [
    np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8),
    np.empty(int(clBufferIn.get_info(cl.mem_info.SIZE) / 4), dtype=np.uint8),
]
clBuffer_Scales = [
    cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[0]),
    cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[1])
]

krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer

krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))

plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))

# Release buffers

#for buf in clBuffer_Scales:
#    buf.release()
```

### Quick test of Gaussian blur kernel

```python
clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[0], image_input.shape[1])  + clkernel_GaussianBlurGrayscalePyramid

clProgH = cl.Program(clContext, clkernel).build()

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), image_input.shape[0], image_input.shape[1])  + clkernel_GaussianBlurGrayscalePyramid

clProgV = cl.Program(clContext, clkernel).build()

krnlH = clProgH.convolveGrayscalePyramid
krnlV = clProgV.convolveGrayscalePyramid

npKernel = np.asarray(imagekernel_GaussianBlurGrayscalePyramid, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)

npTempBuffer = np.empty((h, w), dtype = np.uint8)
clTempBuffer = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer)

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))

krnlH(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
krnlV(clQueue, (w, h,), None , clBuffer_Scales[0], clTempBuffer, clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))

cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))

plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
```

### Quick test of subsampling

```python
clkernel = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[0], image_input.shape[1])  + clkernel_SubsampleGrayscalePyramid

clProgSubsample = cl.Program(clContext, clkernel).build()

krnlSubsample = clProgSubsample.subsampleGrayscalePyramid

krnlSubsample(clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[0], clBuffer_Scales[1])

cl.enqueue_copy(clQueue, npBuffer_Scales[1], clBuffer_Scales[1])

plt.imshow(npBuffer_Scales[1].reshape((h, w)))

#clBufferKernel.release()
for buf in clBuffer_Scales:
    buf.release()
```


## Building a whole Pyramid framework

Since the kernels seem to work - now let's build the whole framework for a pyramid and do some performance tests of this very simple implementation. It just loads the input image using Pillow, converts it into a native array using numpy and transfers it onto the GPU. Note that the data transfer to and from the GPU is not counted in the performance evaluation. This is done since it's assumed that the images are already available on the GPU when executing such a task and that there will be more processing done on the GPU before transferring the images back.

The test program is rather simple:

• First the image is loaded via Pillow and converted into a native array by numpy.
• A single grayscale kernel gets compiled and its grayscale_RGBA2Pyrabuffer kernel reference is fetched.
• The image buffer is allocated on the GPU as a read only buffer.
• The pyramid output buffer (containing the difference images, not the scale space octaves) is allocated on the GPU as a write only buffer so caches will not be poisoned by data from this buffer.
• The difference kernel gets compiled only once since the same kernel is used for all scales.
• For each and every octave:
  • A scale space buffer for the four images kept in the given octave is created.
  • The Gaussian blur convolution kernels with the given preprocessor constants get compiled, once horizontally and once vertically - for each octave.
  • The temporary buffer required to compute the separated convolution without destroying data is allocated.
  • The subsampling kernel is compiled for the given octave size.

Then comes the execution phase. For each repeated iteration - the repetitions are done to get a cleaner performance measurement:

• The grayscale kernel gets executed and transfers data into the first image of the first scale space.
• For each octave:
  • Horizontal and vertical convolution with the Gaussian blur is executed three times in sequence, since all runs depend on each other.
  • For debugging purposes the octave's data is transferred to the host buffer (in the background). This can of course be left out in a real world application.
  • The difference kernel writes the octave's difference data into the pyramid buffer.
  • The subsampling kernel subsamples one of the images in the current octave into the initial image of the next octave.

The execution phase is timed multiple times to get a rough feeling for the performance of the OpenCL implementation.

```python
nScales = 6
showDebugPlots = True

#image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
image_input = np.asarray(Image.open("butterfly_external_testimage.jpg").convert('RGBA'), dtype = np.uint8)
h, w, c = image_input.shape
print(f"Source image has dimensions {h} x {w} x {c}")

# Compile our kernel
clkernel_Grayscale = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(image_input.shape[0], image_input.shape[1])  + clkernel_Grayscale_Tpl

clProg_Grayscale = cl.Program(clContext, clkernel_Grayscale).build()
krnlGray = clProg_Grayscale.grayscale_RGBA2Pyrabuffer

# Image input buffer

clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

# Pyramid output buffer

npPyramidBuffer = np.empty((image_input.shape[0] * 2, image_input.shape[1] * 2), dtype = np.uint8)
clPyramidBuffer = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npPyramidBuffer)

# Difference Kernel

clkernel_Difference = '''
#define IMAGEHEIGHT_DIFFPYRAMID {0}
#define IMAGEWIDTH_DIFFPYRAMID {1}
'''.format(int(image_input.shape[0] * 2), int(image_input.shape[1] * 2)) + clkernel_DifferencePyramidDifference
clProg_Difference = cl.Program(clContext, clkernel_Difference).build()
krnlDifference = clProg_Difference.grayscalePyramidDifference

# Would require / 4 for the ARGB -> Grayscale and * 4 for the 4 images so the size stays the same ...
curSize = clBufferIn.get_info(cl.mem_info.SIZE)
shapeX = image_input.shape[0]
shapeY = image_input.shape[1]

clBuffer_Scales = []
npBuffer_Scales = []
clProgH = []
clProgV = []
clKrnlH = []
clKrnlV = []
clProgSubsample = []
clKrnlSubsample = []
npTempBuffer = []
clTempBuffer = []

for iScale in range(nScales):
    print(f"Scale {iScale}: {shapeX} x {shapeY} -> {curSize}")

    npBuffer_Scales.append(np.empty(curSize, dtype=np.uint8))
    clBuffer_Scales.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npBuffer_Scales[-1]))

    # Blurring kernels
    clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY)  + clkernel_GaussianBlurGrayscalePyramid
    clProgH.append(cl.Program(clContext, clkernel).build())

    clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel_GaussianBlurGrayscalePyramid)/2), shapeX, shapeY)  + clkernel_GaussianBlurGrayscalePyramid
    clProgV.append(cl.Program(clContext, clkernel).build())

    clKrnlH.append(clProgH[-1].convolveGrayscalePyramid)
    clKrnlV.append(clProgV[-1].convolveGrayscalePyramid)

    # Temporary buffers
    npTempBuffer.append(np.empty((shapeY, shapeX), dtype = np.uint8))
    clTempBuffer.append(cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npTempBuffer[-1]))

    # Subsampling kernels

    clkernel = '''
#define IMAGEHEIGHT {0}
#define IMAGEWIDTH {1}
'''.format(shapeX, shapeY)  + clkernel_SubsampleGrayscalePyramid
    clProgSubsample.append(cl.Program(clContext, clkernel).build())
    clKrnlSubsample.append(clProgSubsample[-1].subsampleGrayscalePyramid)

    if shapeX % 2 == 0:
        shapeX = shapeX + 1
    if shapeY % 2 == 0:
        shapeY = shapeY + 1
    shapeX = int(shapeX / 2)
    shapeY = int(shapeY / 2)
    # curSize = int(curSize / 4)
    curSize = (shapeX * shapeY) * 4
```

# Now execute the kernels on a single input image ...

```python
durations = []
showDebugPlots = False

for iIteration in range(500):
    print(f"{iIteration} ... ", end = "")
    stime = time()

    # Reset image dimensions and target offsets for this iteration
    h, w = image_input.shape[0], image_input.shape[1]
    targetOffset_X = 0
    targetOffset_Y = 0

    # Grayscale kernel is only required once. Also grayscaling only runs once ...
    krnlGray(clQueue, (w, h,), None , clBufferIn, clBuffer_Scales[0])
    if showDebugPlots:
        cl.enqueue_copy(clQueue, npBuffer_Scales[0], clBuffer_Scales[0]) #, origin=(0,0), region=(w,h))
        plt.imshow(npBuffer_Scales[0].reshape((h*2, w*2)))
        plt.show()

    for iScale in range(nScales):
        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(0), np.int32(0), np.int32(1))

        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(0), np.int32(1), np.int32(1), np.int32(0))

        clKrnlH[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))
        clKrnlV[iScale](clQueue, (w, h,), None , clBuffer_Scales[iScale], clTempBuffer[iScale], clBufferKernel, np.int32(1), np.int32(0), np.int32(1), np.int32(1))

        cl.enqueue_copy(clQueue, npBuffer_Scales[iScale], clBuffer_Scales[iScale]) #, origin=(0,0), region=(w,h))

        if showDebugPlots:
            plt.imshow(npBuffer_Scales[iScale].reshape((h*2, w*2)))
            plt.show()

        if iScale != (nScales - 1):
            clKrnlSubsample[iScale](clQueue, (int(w / 2), int(h / 2),), None, clBuffer_Scales[iScale], clBuffer_Scales[iScale + 1])
            #    if showDebugPlots:
            #        cl.enqueue_copy(clQueue, npBuffer_Scales[iScale + 1], clBuffer_Scales[iScale + 1])
            #        plt.imshow(npBuffer_Scales[iScale + 1].reshape((h, w)))

        # Perform difference calculation
        krnlDifference(clQueue, (w, h,), None, clBuffer_Scales[iScale], clPyramidBuffer, np.int32(targetOffset_X), np.int32(targetOffset_Y), np.int32(2*h), np.int32(2*w))

        targetOffset_X = targetOffset_X + h
        targetOffset_Y = targetOffset_Y + w

        w = int(w / 2)
        h = int(h / 2)

    etime = time()
    durations.append((etime - stime) * 1000)

# Fetch LAST Pyramid into host buffer ...
cl.enqueue_copy(clQueue, npPyramidBuffer, clPyramidBuffer)
fig, ax = plt.subplots(figsize = (6.4 * 4, 4.8 * 4))
npPyramidBuffer = (((npPyramidBuffer - np.min(npPyramidBuffer)) / np.max(npPyramidBuffer)) * 255.0).astype(np.uint8)
ax.imshow(npPyramidBuffer.reshape((image_input.shape[0] * 2, image_input.shape[1] * 2)))
#plt.savefig("imagepyramids2_initialpyramid.png")
plt.show()

fig, ax = plt.subplots()
ax.set_xlabel("Iteration")
ax.set_ylabel("Duration [ms]")
ax.plot(durations)
ax.grid()
ax.set_title(f"DoG pyramid, {nScales} scales (average {np.mean(durations)} ms)")
#plt.savefig("imagepyramids2_initialpyramid_timing.png")
#plt.savefig("imagepyramids2_initialpyramid_timing.svg")
plt.show()
```

In the end one has to clean up the allocated resources again:

```python
clBufferKernel.release()
clPyramidBuffer.release()
clBufferIn.release()

for i in range(nScales):
    clBuffer_Scales[i].release()
    clBuffer_Scales[i] = None
    clProgH[i] = None
    clProgV[i] = None
    clKrnlH[i] = None
    clKrnlV[i] = None
    clProgSubsample[i] = None
    clKrnlSubsample[i] = None
    clTempBuffer[i].release()
    clTempBuffer[i] = None
```


## Conclusion

This blog article has demonstrated that it's pretty simple to calculate Lowe's image pyramid scheme on the GPU using OpenCL. It is by far not the most performant implementation (I would argue maybe one of the worst) but it can provide a base for optimizing a working implementation by modifying one kernel after the other. The article shows a testbed in which one can play around with image pyramids as well as with the optimization of kernels in a very simple way. All kernels have been written to be as readable as possible.