 # Some simple image filters by convolution in Python and OpenCL

14 May 2023 - tsp
Last update 14 May 2023 71 mins

Upfront a note of caution: Of course all timings taken here are dependent on the CPU and GPU that has been used. To generate the graphs and timing shown in this blog article an AMD Radeon R9 360 has been used with a Xeon E3 host CPU on FreeBSD 13. This short article contains a summary about image filters and how one can realize them. It will contain a short reminder of how convolution works, what separable filters are and some example of image filters and their (most inefficient but direct) implementation in Python. In the end it will also show how one might naively implement convolution in OpenCL to increase performance in comparison to the naive Python implementations shown. This article provides one with a small playground to quickly try out separable image filter kernels and execute them on the GPU using pyopencl from a Python testing environment.

import numpy as np
import matplotlib.pyplot as plt
from time import time
from PIL import Image


## Definition

First - what is a convolution anyways? For a mathematician a convolution is defined as an integral operation:

[ (f \times g)(t) := \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau ]

In most mathematics lectures this is just dropped in as a definition of an operation that acts on two functions and yields another function. For a statistics person this immediately looks like a correlation function but with a flipped sign - so it’s also equal to the correlation function between $f(t)$ and $g(-t)$ or $f(-t)$ and $g(t)$. But what does this actually mean?

• In statistical interpretation the correlation function tells one how similar two functions are which can be used for signal analysis, target location, anomaly detection, etc.. This interpretation is usually not used for image processing
• In signal analysis there is another interpretation. When one applies the Fourier transform or Laplace transformation to two input functions and multiplies them in frequency space one gets the same structure - the convolution is just a multiplication in frequency space.
• In systems theory the convolution is used to simply calculate the response of an linear time invariant (LTI) system - this also arises from applying the Laplace transformation to the system function

None of those is really used in image processing. In image processing (and in neuronal networks) one sees that a convolution basically applies an operator (the kernel) to the neighborhood of a pixel. For example a simple convolution might take a 3x3 window around each and every pixel in an image and substitute it with the average of all those pixels and thus performing a smoothing operation (of course it turns out this also works with a correlation function but this doesn’t really matter since most filters used in image processing are symmetric anyways. This can also be used when modeling systems where one applies a noise model to ones system model (for example applying Gaussian blur to every point in ones dataset or modeling a system that has some given systems and response and Gaussian noise superimposed - for example when modeling the response of a particle detectors signal shaping network together with it’s electronic shot noise).

## Separable filters

So what’s the thing about separable filters? When for example applying a 3x3 convolution kernel to an image processing each pixel will take 9 multiplications and 8 additions (as well as 9 read accesses to memory and one write access). There are filters that can be separated into a horizontal and vertical kernel though. Often used examples for separable kernels are for example Gaussian blur and the Sobel kernel that’s often used in edge detection:

[ \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \times \begin{pmatrix} -1 & 0 & 1 \end{pmatrix} ]

When a kernel is separable one can perform the convolution with the image in two separate operations - once applying the horizontal and once applying the vertical convolution. Each of them only taking 3 multiplications and 2 additions for this simple example (and thus even for a small 3x3 kernel it reduced the processing time by over 70% even when looking only at arithmetic operations).

## Direct Python implementation of separable kernels

First as usual and for direct reference lets make a direct implementation of the convolution (or rather the correlation function) with a finite window size. In the following implementation it’s assumed that images (input and output) are kept in a wrapper structure (obj) that’s passed through a processing pipeline. The dictionary keys used to store the images are called keyin and keyout. In addition the function will allow one to address single channels of the image via the channels attribute for RGB images and perform clamping (if required) to the input and output values. So this function basically performs the following summation process:

[ H: I_{out}(x,y) = \sum_{i = 0}^{k_{size}} I(x - \frac{k_{size}}{2} + i, y) * K(i) \\ V: I_{out}(x,y) = \sum_{i = 0}^{k_{size}} I(x, y - \frac{k_{size}}{2} + i) * K(i) \\ ]

As one can see this is still a pretty expensive operation that scales linearly with the image width or height and the kernel size - but the structure shows promising structure for parallelization later on though we will not be able to perform the operation in place - so we have to use different buffers as source and destination. In addition one can see that the indices inside the sum will at all edge cases reach outside the defined boundaries between 0 and the images height or width by $\frac{k_{size}}{2}$. To handle this one has to decide on a strategy for handling the borders of the images. Some popular choices are:

• Simply clamping the image. This is often done in signal processing to only evaluate the convolution where one can really produce valid results. The resulting image thus is smaller in the desired dimension by $k_{size} - 1$ due to limitations on both low and high boundaries
• Repeating the border value multiple times. This will lead to distortions for gradients but work pretty well for noisy images. In this case an index smaller than 0 will simply be clamped to 0 and larger than width or height to the specific maximum index value. It’s also rather easy to implement.
• Mirroring - when one addresses pixel -2 one accesses pixel 2 and so on.
• Wrapping around addresses the pixels in a circular fashion. When leaving the image to the right access re-enters on the left and vice versa.
• Simply assuming all values outside are zero. This will usually lead to some kind of fading halo on the edges.

In addition we will allow to select the output datatype (for the RGB images we’re usually working with it will be np.uint8 but for other images it might even be floating point numbers or larger integer spaces (for example for HDR images). In addition a minimum and maximum value will be settable to allow clamping to an allowed data range. Note that most of the decisions will later be done at compile time of the kernel when implementing it in OpenCL so this flexibility will just hurt performance in this naive implementation that will be terribly slow anyways and just serves demonstration purposes.

def filter_convolve(obj, keyin, keyout, kernel, direction = 'h', channels = [ 0, 1, 2 ], edgehandling = 'repeat', minout = 0, maxout = 255, outdtype = np.uint8):
# The input image is called obj[keyin], the result will be stored in obj[keyout]
#
# Kernel is an (oddly sized) kernel in 1D
if len(kernel) % 2 != 1:
raise ValueError("The kernel has to contain an odd number of elements")
if (direction != 'h') and (direction != 'v'):
raise ValueError("One can only apply the kernel in horizontal or vertical direction")
if edgehandling not in [ "repeat", "clamp", "mirror", "wrap", "zero" ]:
raise ValueError("Edge handling not supported ...")

# Calculate output size
outsize = [ obj[keyin].shape, obj[keyin].shape, obj[keyin].shape ]

if (edgehandling == "clamp"):
if direction == 'h':
outsize = outsize - len(kernel) + 1
if direction == 'v':
outsize = outsize - len(kernel) + 1

# Create output buffer:
r = np.empty(outsize, dtype = outdtype)

dk = int(len(kernel) / 2)

if edgehandling == "clamp":
for x in range(outsize):
for y in range(outsize):
for c in channels:
v = 0
if direction == 'h':
for ik, k in enumerate(range(-dk, dk+1)):
v = v + obj[keyin][x + k, y, c] * kernel[ik]
if direction == 'v':
for ik, k in enumerate(range(-dk, dk+1)):
v = v + obj[keyin][x, y + k, c] * kernel[ik]
if v < minout:
v = minout
if v > maxout:
v = maxout
r[x,y,c] = v

if edgehandling == "repeat":
for x in range(outsize):
for y in range(outsize):
for c in channels:
v = 0
if direction == 'h':
for ik, k in enumerate(range(-dk, dk+1)):
rx = x + k
if rx < 0:
rx = 0
if rx >= outsize:
rx = outsize-1
v = v + obj[keyin][rx, y, c] * kernel[ik]
if direction == 'v':
for ik, k in enumerate(range(-dk, dk+1)):
ry = y + k
if ry < 0:
ry = 0
if ry >= outsize:
ry = outsize-1
v = v + obj[keyin][x, ry, c] * kernel[ik]
if v < minout:
v = minout
if v > maxout:
v = maxout
r[x,y,c] = v

if edgehandling == "mirror":
for x in range(outsize):
for y in range(outsize):
for c in channels:
v = 0
if direction == 'h':
for ik, k in enumerate(range(-dk, dk+1)):
rx = x + k
if rx < 0:
rx = int(-1 * rx)
if rx >= outsize:
rx = 2 * outsize - 1 - rx
v = v + obj[keyin][rx, y, c] * kernel[ik]
if direction == 'v':
for ik, k in enumerate(range(-dk, dk+1)):
ry = y + k
if ry < 0:
ry = int(-1 * ry)
if ry >= outsize:
ry = 2 * outsize - 1 - ry
v = v + obj[keyin][x, ry, c] * kernel[ik]
if v < minout:
v = minout
if v > maxout:
v = maxout
r[x,y,c] = v

if edgehandling == "wrap":
for x in range(outsize):
for y in range(outsize):
for c in channels:
v = 0
if direction == 'h':
for ik, k in enumerate(range(-dk, dk+1)):
rx = x + k
if rx < 0:
rx = rx + outsize
if rx >= outsize:
rx = rx - outsize
v = v + obj[keyin][rx, y, c] * kernel[ik]
if direction == 'v':
for ik, k in enumerate(range(-dk, dk+1)):
ry = y + k
if ry < 0:
ry = ry + outsize
if ry >= outsize:
ry = ry - outsize
v = v + obj[keyin][x, ry, c] * kernel[ik]
if v < minout:
v = minout
if v > maxout:
v = maxout
r[x,y,c] = v

obj[keyout] = r


As one can see a pretty simple direct route has been taken - just express the formula with loops having a special branch for each and every type of handling the borders of the image. In this naive implementation the border is handled by just modifying the indices when accessing the image (not this will not be possible when running on the GPU later on with massive impact. This is out of scope for this short blog article).

## Implementing separable filters

### Gaussian blur

Gaussian blur is one of the most prominent image filters. It’s one of the most natural choices when trying to reduce image noise and detail by reducing high frequency components. In contrast to moving average filters Gaussian blurring will do this continuously - the image will then look like viewed through some kind of sub optimal translucent surface since neighboring pixels are influencing each others and high frequency components are smeared out. This also resembles crosstalk between pixels pretty well. Gaussian blur is usually also used before subsampling images to include any high frequency components in the subsampled pictures (in case they would vanish due to aliasing effects). The implementation of the 2D Gaussian filter is very simple. The definition of a 1D Gaussian with an expectation value of $\mu = 0$ and a standard deviation of $\sigma$ is given by

[ G(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{1}{2} \frac{x^2}{\sigma ^2}} ]

The 2D Gaussian is a simple (renormalized) product of two 1D Gaussians:

[ \begin{aligned} G(x,y) &= G(x) * G(y) \\ &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \frac{x^2}{\sigma^2}} e^{-\frac{1}{2} \frac{y^2}{\sigma^2}} \\ &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \frac{x^2 + y^2}{\sigma^2}} \\ \end{aligned} ]

Thus one can separate the 2D Gaussian kernel into two 1D Gaussians. Lets apply the 1D Gaussian in horizontal and vertical direction with each of the border handling methods to very everything works as expected:

##### Clamping at the edge
def gaussiankernel(width, sigma):
if width % 2 != 1:
raise ValueError("Require odd kernel size")

r = np.empty((width,))

for k in range(int(width/2) + 1):
r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

# Apply norm ...
r = r / np.sum(r)

return r

kernel = gaussiankernel(51, 100)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))
filter_convolve(tstimage, "rawin", "blurred", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(tstimage, "blurred", "blurred2", kernel, direction = 'v', edgehandling = 'clamp')

fig, ax = plt.subplots(2, figsize=(6.4*4, 4.8*2))
ax.imshow(tstimage["blurred"])
ax.imshow(tstimage["blurred2"])

plt.savefig("convolve_clamp_edge.png")
plt.show() ##### Mirroring at the edge
def gaussiankernel(width, sigma):
if width % 2 != 1:
raise ValueError("Require odd kernel size")

r = np.empty((width,))

for k in range(int(width/2) + 1):
r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

# Apply norm ...
r = r / np.sum(r)

return r

kernel = gaussiankernel(51, 100)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))
filter_convolve(tstimage, "rawin", "blurred", kernel, direction = 'h', edgehandling = 'mirror')
filter_convolve(tstimage, "blurred", "blurred2", kernel, direction = 'v', edgehandling = 'mirror')

fig, ax = plt.subplots(2, figsize=(6.4*4, 4.8*2))
ax.imshow(tstimage["blurred"])
ax.imshow(tstimage["blurred2"])

plt.savefig("convolve_mirror_edge.png")
plt.show() ##### Wrapping around at the edge
def gaussiankernel(width, sigma):
if width % 2 != 1:
raise ValueError("Require odd kernel size")

r = np.empty((width,))

for k in range(int(width/2) + 1):
r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

# Apply norm ...
r = r / np.sum(r)

return r

kernel = gaussiankernel(51, 100)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))
filter_convolve(tstimage, "rawin", "blurred", kernel, direction = 'h', edgehandling = 'wrap')
filter_convolve(tstimage, "blurred", "blurred2", kernel, direction = 'v', edgehandling = 'wrap')

fig, ax = plt.subplots(2, figsize=(6.4*4, 4.8*2))
ax.imshow(tstimage["blurred"])
ax.imshow(tstimage["blurred2"])

plt.savefig("convolve_wrap_edge.png")
plt.show() ##### Repeating the outermost pixel at the edge
def gaussiankernel(width, sigma):
if width % 2 != 1:
raise ValueError("Require odd kernel size")

r = np.empty((width,))

for k in range(int(width/2) + 1):
r[int(width/2) + k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))
r[int(width/2) - k] = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(- (k*k) / (2 * sigma*sigma))

# Apply norm ...
r = r / np.sum(r)

return r

kernel = gaussiankernel(51, 100)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))
filter_convolve(tstimage, "rawin", "blurred", kernel, direction = 'h', edgehandling = 'repeat')
filter_convolve(tstimage, "blurred", "blurred2", kernel, direction = 'v', edgehandling = 'repeat')

fig, ax = plt.subplots(2, figsize=(6.4*4, 4.8*2))
ax.imshow(tstimage["blurred"])
ax.imshow(tstimage["blurred2"])

plt.savefig("convolve_repeat_edge.png")
plt.show() #### Measuring time taken for different sizes of the kernel

So how long does this processing effectively take? Let’s measure this for different kernel sizes and see if the behavior is really linear as it had to be expected by the structure of the formula above:

dta_sizes = []
dta_time = []
dta_resultims = []

for ksize in range(3, 25):
if ksize % 2 == 0:
continue

print(f"Kernel size {ksize} ... ", end = "")
kernel = gaussiankernel(ksize, 100)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))

dta_sizes.append(ksize)

stime = time()
filter_convolve(tstimage, "rawin", "blurred", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(tstimage, "blurred", "blurred2", kernel, direction = 'v', edgehandling = 'clamp')
etime = time()

print(f"{etime - stime}s")
dta_time.append(etime - stime)
dta_resultims.append(tstimage["blurred2"])

fig, ax = plt.subplots(len(dta_sizes) + 1, figsize = (6.4 * 4, 4.8 * 4 * (len(dta_sizes) + 1)))

ax.plot(dta_sizes, dta_time)
ax.set_xlabel("Kernel size")
ax.set_ylabel("Execution time")
ax.grid()

for iim, im in enumerate(dta_resultims):
ax[iim + 1].imshow(im)
ax[iim + 1].set_title(f"Kernel size {dta_sizes[iim]} took {dta_time[iim]} seconds")

plt.savefig("convolution_gaussiantime_separated_kernelsizes.png")
plt.savefig("convolution_gaussiantime_separated_kernelsizes.svg")
plt.show()

np.savez("convolution_gaussiantime_separated_kernelsizes.npz", sizes = dta_sizes, times = dta_time)

    Kernel size 3 ... 220.67070722579956s
Kernel size 5 ... 317.04671907424927s
Kernel size 7 ... 390.5717372894287s
Kernel size 9 ... 484.13199639320374s
Kernel size 11 ... 562.4936060905457s
Kernel size 13 ... 646.5467569828033s
Kernel size 15 ... 730.4715511798859s
Kernel size 17 ... 837.4274179935455s
Kernel size 19 ... 912.1897475719452s
Kernel size 21 ... 992.2135629653931s
Kernel size 23 ... 1075.629760503769s #### Using Gaussian blur for subsampling

Usually Gaussian blur is applied before subsampling of images. There are different approaches to preserve part of the information from discarded pixels - but usually some kind of blurring or averaging is applied to suppress aliasing and moire effects. Any low pass filter will do to prevent high frequency components to appear in the downsampled images. Typical choices are filters like averaging and Gaussians - the choice usually depends on the application. The upside of Gaussians is that they’re continuous both in spatial and frequency domain and both have an theoretical infinite carrier. In practice the choice does not really matter for many applications. It’s also often considered a downside that calculating an Gaussian kernel takes longer than an averaging kernel or applying some kind of triangle shape - but since one’s always convolving with a discrete filter matrix any separable kernel will do when selecting the same size of the support. For some applications (like building image pyramids) it’s an upside of the Gaussian that repeated application of a Gaussian is equivalent to applying a Gaussian with larger standard deviation $\sigma$ - at least in theory when using infinite support size.

kernel = gaussiankernel(3, np.sqrt(2))

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))

filter_convolve(tstimage, "rawin", "blurred_1_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(tstimage, "blurred_1_h", "blurred_1", kernel, direction = 'v', edgehandling = 'clamp')

filter_convolve(tstimage, "blurred_1", "blurred_2_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(tstimage, "blurred_2_h", "blurred_2", kernel, direction = 'v', edgehandling = 'clamp')

# Subsample ...
newim = {}
newim['rawin'] = np.empty((int(tstimage['blurred_2'].shape / 2), int(tstimage['blurred_2'].shape / 2), tstimage['blurred_2'].shape), dtype = np.uint8)
for x in range(tstimage['blurred_2'].shape):
if x % 2 != 0:
continue
for y in range(tstimage['blurred_2'].shape):
if y % 2 != 0:
continue
newim['rawin'][int(x/2), int(y/2)] = tstimage['blurred_2'][x,y]

filter_convolve(newim, "rawin", "blurred_1_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(newim, "blurred_1_h", "blurred_1", kernel, direction = 'v', edgehandling = 'clamp')

filter_convolve(newim, "blurred_1", "blurred_2_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(newim, "blurred_2_h", "blurred_2", kernel, direction = 'v', edgehandling = 'clamp')

# Subsample ...
newim2 = {}
newim2['rawin'] = np.empty((int(newim['blurred_2'].shape / 2), int(newim['blurred_2'].shape / 2), newim['blurred_2'].shape), dtype = np.uint8)
for x in range(newim['blurred_2'].shape):
if x % 2 != 0:
continue
for y in range(newim['blurred_2'].shape):
if y % 2 != 0:
continue
newim2['rawin'][int(x/2), int(y/2)] = newim['blurred_2'][x,y]

filter_convolve(newim2, "rawin", "blurred_1_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(newim2, "blurred_1_h", "blurred_1", kernel, direction = 'v', edgehandling = 'clamp')

filter_convolve(newim2, "blurred_1", "blurred_2_h", kernel, direction = 'h', edgehandling = 'clamp')
filter_convolve(newim2, "blurred_2_h", "blurred_2", kernel, direction = 'v', edgehandling = 'clamp')

fig, ax = plt.subplots(3, 3, figsize=(6.4*2*3, 4.8*2*3))
ax.imshow(tstimage['rawin'])
ax.imshow(tstimage['blurred_1'])
ax.imshow(tstimage['blurred_2'])
ax.imshow(newim['rawin'])
ax.imshow(newim['blurred_1'])
ax.imshow(newim['blurred_2'])
ax.imshow(newim2['rawin'])
ax.imshow(newim2['blurred_1'])
ax.imshow(newim2['blurred_2'])

plt.savefig("convolve_subsample.png")
plt.show() ### Sobel filter (edge detection)

Another often used separable filter is the so called Sobel filter. This filter calculated the derivative along one spatial domain while smoothing along the other. A single application of the Sobel filter will lead to large values at horizontal or vertical (depending on the kernel orientation) edges that can then be detected by applying a simple threshold operation. When wanting to do orientation independent edge detection one can simply calculate the horizontal and vertical Sobel filter and then calculate the absolute sum $I = \sqrt{I_x^2 + I_y^2}$ for each pixel.

The typical Sobel kernel can be presented as

[ \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \times \begin{pmatrix} 1 & 0 & -1 \end{pmatrix} ]
kernelA = np.asarray([ 1.0, 2.0, 1.0 ])
kernelB = np.asarray([ -1.0, 0.0, +1.0 ])

kernelA = kernelA / np.sum(kernelA)
#kernelB = kernelB / np.sum(kernelB)

#plt.plot(kernel)
#print(kernel)

tstimage = {}
tstimage['rawin'] = np.asarray(Image.open("imagetest01.jpg"))
filter_convolve(tstimage, "rawin", "filter1H", kernelA, direction = 'h', edgehandling = 'clamp')
filter_convolve(tstimage, "filter1H", "filter1V", kernelB, direction = 'v', edgehandling = 'clamp')

filter_convolve(tstimage, "rawin", "filter1H_2", kernelA, direction = 'v', edgehandling = 'clamp')
filter_convolve(tstimage, "filter1H_2", "filter1V_2", kernelB, direction = 'h', edgehandling = 'clamp')

fig, ax = plt.subplots(2, 3, figsize=(6.4*3*4, 4.8*2*4))
ax.imshow(tstimage["rawin"])
ax.imshow(tstimage["filter1H"])
ax.imshow(tstimage["filter1V"])

ax.imshow(tstimage["rawin"])
ax.imshow(tstimage["filter1H_2"])
ax.imshow(tstimage["filter1V_2"])

plt.savefig("convolve_sobel01.png")
plt.show() Now those images don’t look like very much on the first sight. This is due to the fact those images do not contain too many exactly horizontal and vertical edges with high contrast. Let’s first calculate the absolute sum of the horizontal and vertical edge detector:

fig, ax = plt.subplots(figsize=(6.4*4, 4.8*4))
ax.imshow((np.sqrt(tstimage["filter1V_2"] * tstimage["filter1V_2"] + tstimage["filter1V"] * tstimage["filter1V"])).astype(int))
plt.savefig("convolve_sobel02.png")
plt.show() Now let’s simply rescale the generated output to the range from 0 to 255 to see what’s really happening:

fig, ax = plt.subplots(figsize=(6.4*4, 4.8*4))
rescaled = (np.sqrt(tstimage["filter1V_2"] * tstimage["filter1V_2"] + tstimage["filter1V"] * tstimage["filter1V"])).astype(int)
rescaled = (((rescaled - np.min(rescaled)) / (np.max(rescaled)) * 255.0).astype(int))
ax.imshow(rescaled)
plt.savefig("convolve_sobel03.png")
plt.show() ## Speeding up the implementation using OpenCL

So as one could see above the performance of the filters had been very low - way off for any realtime applications. A typical photograph took more than 220 seconds for a single Gaussian blur - we usually want to achieve framerates above 30 frames per second with multiple Gaussian blurs and other algorithms being performed on the same image. There are two reasons for the low performance:

• We coded the loop performing the convolution in plain Python. Python for number crunching is very slow, basically it’s not suited to do any real work directly. One has to at least use one of the native support libraries out there (like numpy or scipy) to get any useful performance. The plain Python implementation had been done to illustrate how the algorithm works and serves as a very simple to read reference implementation to validate higher performance implementations against. It should never be used in real applications that way.
• We do the operation on the CPU. Even though CPUs support limited support for vectorization (which we don’t use) such a problem is a very good fit for a GPU. We do perform the same operation on a huge number of pixels with no dependence between the different pixels. For the convolution memory access patterns will be crucial though as we will see later.

To utilize the GPU one has to use one of the available GPU APIs - the most prominent being CUDA which only works for NVidia GPUs and on a limited set of operating systems or the open OpenCL that also targets heterogeneous systems as well as GPUs, CPUs and TPUs with the same runtime and code. Obviously the choice will fall on OpenCL. For testing it will be used with pyopencl that somehow allows one to use numpy arrays as RAM side of GPU buffers as well as Python support code to interact with kernels. This is done to provide a simple and easy experimental environment - note that usage of numpy arrays as backend prevents one from using the most performant memory access patterns on the GPUs due to limited control over alignment and memory mapping of buffers into PCIe space. But it’s sufficient to get started playing with OpenCL kernels anyways.

### A numpy based CPU implementation

As a reference on how fast convolution on the CPU usually really is without Pythons massive overhead one can use an optimized native implementation of convolution from the heavily optimized numpy package. The numpy.convolve method only performs 1D convolution though - which is sufficient for our current application with separable filters.

dta_sizes = []
dta_time = []
dta_resultims = []

for ksize in range(3, 25):
if ksize % 2 == 0:
continue

print(f"Kernel size {ksize} ... ", end = "")
kernel = gaussiankernel(ksize, 100)

imagein = np.asarray(Image.open("imagetest01.jpg"))
imagetemp = np.empty(imagein.shape, dtype = np.uint8)
imageout = np.empty(imagein.shape, dtype = np.uint8)

dta_sizes.append(ksize)
stime = time()
for c in range(imagein.shape):
for x in range(imagein.shape):
imagetemp[x,:,c] = np.convolve(imagein[x,:,c], kernel, 'same')
for y in range(imagein.shape):
imageout[:,y,c] = np.convolve(imagein[:,y,c], kernel, 'same')
etime = time()

print(f"{etime - stime}s")
dta_time.append(etime - stime)
dta_resultims.append(imageout)

fig, ax = plt.subplots(len(dta_sizes) + 1, figsize = (6.4 * 4, 4.8 * 4 * (len(dta_sizes) + 1)))

ax.plot(dta_sizes, dta_time)
ax.set_xlabel("Kernel size")
ax.set_ylabel("Execution time")
ax.grid()

for iim, im in enumerate(dta_resultims):
ax[iim + 1].imshow(im)
ax[iim + 1].set_title(f"Kernel size {dta_sizes[iim]} took {dta_time[iim]} seconds")

plt.savefig("convolution_gaussiantime_separated_kernelsizes_np.png")
plt.savefig("convolution_gaussiantime_separated_kernelsizes_np.svg")
plt.show()

np.savez("convolution_gaussiantime_separated_kernelsizes_np.npz", sizes = dta_sizes, times = dta_time)

    Kernel size 3 ... 1.2034239768981934s
Kernel size 5 ... 1.2064249515533447s
Kernel size 7 ... 1.2763926982879639s
Kernel size 9 ... 1.3243684768676758s
Kernel size 11 ... 1.4453110694885254s
Kernel size 13 ... 2.6087560653686523s
Kernel size 15 ... 2.700711965560913s
Kernel size 17 ... 2.855266571044922s
Kernel size 19 ... 3.021843910217285s
Kernel size 21 ... 3.186105251312256s
Kernel size 23 ... 3.2412893772125244s As one can see the speedup of using an optimized native implementation is of course massive - as expected. The 3x3 separated Gaussian kernel does not take 220 seconds but only 1.2 seconds using this method - a 182x speedup. For the 23x23 Kernel the speedup is even more massive - a more than 331 times increase in speed. This impressively shows: Python is suited to demonstrate how algorithms work and glue stuff together like a scripting language but the actual number crunching has to happen in optimized native implementations.

### The first try using OpenCL

To increase performance even more and to move processing onto the GPU which might be interesting even when applying convolution will not get a massive increase since transferring data to and from the GPU takes massive amounts of time (so when one uses other algorithms that have a much higher profit of running on the GPU one can keep the images on the GPU anyways) let’s now try a naive implementation on the GPU using OpenCL. Since the memory transfer takes some and and the actual computation will only be a very small part of the task - and the overlapping memory access pattern in the unoptimized implementation - a very huge gain is not the be expected.

import pyopencl as cl
import numpy as np

from PIL import Image

from time import time

platforms = cl.get_platforms()
for idxPlatform, platform in enumerate(platforms):
devices = platform.get_devices()
print(f"Platform {platform.name}:")
for idxDevice, dev in enumerate(devices):
print(f"   {idxDevice}: {dev.name}")

usedPlatform = 0
usedDevice = 0

clPlatform = platforms[usedPlatform]
clDevice = clPlatform.get_devices()[usedDevice]
clContext = cl.Context([ clDevice ])
clQueue = cl.CommandQueue(clContext, clDevice)

    Platform AMD Accelerated Parallel Processing:
0: Tonga

image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
w, h, c = image_input.shape, image_input.shape, image_input.shape
print(f"Source image has dimensions {h} x {w} x {c}")

    Source image has dimensions 4160 x 2340 x 4


Now lets define our convolution kernel.

imagekernel = gaussiankernel(23, 100)

clkernel = '''//CL//
__kernel void convolve(
__global unsigned char* lpImage,
__global unsigned char* lpImageOut,
__constant float* lpKernel

) {
int x = get_global_id(0);
int y = get_global_id(1);

float v_r;
float v_g;
float v_b;

for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
#if defined(CONVOLE_HORIZONTAL)
int realx = x + i;
if(realx < 0) {
realx = 0;
} else if(realx >= IMAGEWIDTH) {
realx = IMAGEWIDTH-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 2] * kv;
#else
int realy = y + i;
if(realy < 0) {
realy = 0;
} else if(realy >= IMAGEHEIGHT) {
realy = IMAGEHEIGHT-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 2] * kv;
#endif
}

if(v_r > 255) {
v_r = 255;
} else if(v_r < 0) {
v_r = 0;
}
if(v_g > 255) {
v_g = 255;
} else if(v_g < 0) {
v_g = 0;
}
if(v_b > 255) {
v_b = 255;
} else if(v_b < 0) {
v_b = 0;
}

lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 0] = (unsigned char)v_r;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 1] = (unsigned char)v_g;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 2] = (unsigned char)v_b;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 3] = 255;
}
'''

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgH = cl.Program(clContext, clkernel).build()

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgV = cl.Program(clContext, clkernel).build()

tstart = time()

# Allocate out input buffer (we do not need write access _in this case_. This might be interesting when applying multiple
# Convolutions without keeping the intermediate results when swapping two GPU buffers forth and back without
# having to do a copying round ...

#clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)
clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

npImageOut = np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8)
clBufferTemp = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)
clBufferOut = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)

npKernel = np.asarray(imagekernel, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)

krnlH = clProgH.convolve
krnlV = clProgV.convolve

krnlH(clQueue, (h, w,), None , clBufferIn, clBufferTemp, clBufferKernel)
krnlV(clQueue, (h, w,), None , clBufferTemp, clBufferOut, clBufferKernel)

cl.enqueue_copy(clQueue, npImageOut, clBufferOut) #, origin=(0,0), region=(w,h))
tend = time()

print(f"Total time: {tend - tstart}")

    Total time: 0.21088314056396484

plt.imshow(image_input.reshape(image_input.shape))
plt.savefig("convolve_ocl01.png")
plt.show() plt.imshow(npImageOut.reshape(image_input.shape))
plt.savefig("convolve_ocl02.png")
plt.show() #### Speedup measurement

Now the kernel seems to work in it’s simple form with the repeat border mode (which is very sub optimal for memory access along the edges which will massively hurt performance) let’s run a timing loop over the required operations (transferring the image, transferring the kernel, running the operation and transferring the result back to host memory) as before:

dta_sizes = []
dta_time = []
dta_resultims = []

for ksize in range(3, 25):
if ksize % 2 == 0:
continue

print(f"Kernel size {ksize} ... ", end = "")

image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
w, h, c = image_input.shape, image_input.shape, image_input.shape

imagekernel = gaussiankernel(ksize, 100)

clkernel = '''//CL//
__kernel void convolve(
__global unsigned char* lpImage,
__global unsigned char* lpImageOut,
__constant float* lpKernel

) {
int x = get_global_id(0);
int y = get_global_id(1);

float v_r;
float v_g;
float v_b;

for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
#if defined(CONVOLE_HORIZONTAL)
int realx = x + i;
if(realx < 0) {
realx = 0;
} else if(realx >= IMAGEWIDTH) {
realx = IMAGEWIDTH-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 2] * kv;
#else
int realy = y + i;
if(realy < 0) {
realy = 0;
} else if(realy >= IMAGEHEIGHT) {
realy = IMAGEHEIGHT-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 2] * kv;
#endif
}

if(v_r > 255) {
v_r = 255;
} else if(v_r < 0) {
v_r = 0;
}
if(v_g > 255) {
v_g = 255;
} else if(v_g < 0) {
v_g = 0;
}
if(v_b > 255) {
v_b = 255;
} else if(v_b < 0) {
v_b = 0;
}

lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 0] = (unsigned char)v_r;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 1] = (unsigned char)v_g;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 2] = (unsigned char)v_b;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 3] = 255;
}
'''

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgH = cl.Program(clContext, clkernel).build()

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgV = cl.Program(clContext, clkernel).build()

imagetemp = np.empty(imagein.shape, dtype = np.uint8)
imageout = np.empty(imagein.shape, dtype = np.uint8)

dta_sizes.append(ksize)
stime = time()

# Allocate out input buffer (we do not need write access _in this case_. This might be interesting when applying multiple
# Convolutions without keeping the intermediate results when swapping two GPU buffers forth and back without
# having to do a copying round ...

#clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)
clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

npImageOut = np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8)
clBufferTemp = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)
clBufferOut = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)

npKernel = np.asarray(imagekernel, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)

krnlH = clProgH.convolve
krnlV = clProgV.convolve

krnlH(clQueue, (h, w,), None , clBufferIn, clBufferTemp, clBufferKernel)
krnlV(clQueue, (h, w,), None , clBufferTemp, clBufferOut, clBufferKernel)

cl.enqueue_copy(clQueue, npImageOut, clBufferOut) #, origin=(0,0), region=(w,h))
etime = time()

imageout = npImageOut.reshape(image_input.shape)
print(f"{etime - stime}s")
dta_time.append(etime - stime)
dta_resultims.append(imageout)

fig, ax = plt.subplots(len(dta_sizes) + 1, figsize = (6.4 * 4, 4.8 * 4 * (len(dta_sizes) + 1)))

ax.plot(dta_sizes, dta_time)
ax.set_xlabel("Kernel size")
ax.set_ylabel("Execution time")
ax.grid()

for iim, im in enumerate(dta_resultims):
ax[iim + 1].imshow(im)
ax[iim + 1].set_title(f"Kernel size {dta_sizes[iim]} took {dta_time[iim]} seconds")

plt.savefig("convolution_gaussiantime_separated_kernelsizes_gpu_naive1.png")
plt.savefig("convolution_gaussiantime_separated_kernelsizes_gpu_naive1.svg")
plt.show()

np.savez("convolution_gaussiantime_separated_kernelsizes_gpu_naive1.npz", sizes = dta_sizes, times = dta_time)

    Kernel size 3 ... 0.10095095634460449s
Kernel size 5 ... 0.08750176429748535s
Kernel size 7 ... 0.09695291519165039s
Kernel size 9 ... 0.09610867500305176s
Kernel size 11 ... 0.09396982192993164s
Kernel size 13 ... 0.09184885025024414s
Kernel size 15 ... 0.0949549674987793s
Kernel size 17 ... 0.09395337104797363s
Kernel size 19 ... 0.0939486026763916s
Kernel size 21 ... 0.09995079040527344s
Kernel size 23 ... 0.09695005416870117s #### Measuring the speedup of the computation without memory transfer

Even for the naive implementation the speedup compared to the numpy based implementation - even when including memory transfer - is a factor of about 12 for the 3x3 kernel and 33 for the 23x23 kernel. As one can see the execution time is pretty independent of the kernel size. This is caused by the dominance of the memory transfers. So let’s take a look into the actual execution time of the kernel. To do this we have to use a command queue with profiling enabled. This will enable us to query the profile.start and profile.end timestamps of the events returned for the kernel executions. We’ll also add synchronization primitives after each kernel execution (event.wait()) which will hurt performance a little bit since we don’t fill the command queue for the GPU in an asynchronous fashion while the GPU is doing computational stuff - so we switch between CPU and GPU performing work which should not be done in real world applications - in best case all used platforms should be utilized as asynchronously as possible to exploit high utilization. But this example is not about how to optimize the usage of the GPU anyways.

platforms = cl.get_platforms()
for idxPlatform, platform in enumerate(platforms):
devices = platform.get_devices()
print(f"Platform {platform.name}:")
for idxDevice, dev in enumerate(devices):
print(f"   {idxDevice}: {dev.name}")

usedPlatform = 0
usedDevice = 0

clPlatform = platforms[usedPlatform]
clDevice = clPlatform.get_devices()[usedDevice]
clContext = cl.Context([ clDevice ])
clQueue = cl.CommandQueue(clContext, clDevice, properties=cl.command_queue_properties.PROFILING_ENABLE)

    Platform AMD Accelerated Parallel Processing:
0: Tonga

dta_sizes = []
dta_time = []
dta_krnltimesH = []
dta_krnltimesV = []
dta_resultims = []

for ksize in range(3, 25):
if ksize % 2 == 0:
continue

print(f"Kernel size {ksize} ... ", end = "")

image_input = np.asarray(Image.open("imagetest01.jpg").convert('RGBA'), dtype = np.uint8)
w, h, c = image_input.shape, image_input.shape, image_input.shape

imagekernel = gaussiankernel(ksize, 100)

clkernel = '''//CL//
__kernel void convolve(
__global unsigned char* lpImage,
__global unsigned char* lpImageOut,
__constant float* lpKernel

) {
int x = get_global_id(0);
int y = get_global_id(1);

float v_r;
float v_g;
float v_b;

for(int i = -KERNELHALFWIDTH; i <= KERNELHALFWIDTH; i=i+1) {
#if defined(CONVOLE_HORIZONTAL)
int realx = x + i;
if(realx < 0) {
realx = 0;
} else if(realx >= IMAGEWIDTH) {
realx = IMAGEWIDTH-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(realx + y * IMAGEHEIGHT) * 4 + 2] * kv;
#else
int realy = y + i;
if(realy < 0) {
realy = 0;
} else if(realy >= IMAGEHEIGHT) {
realy = IMAGEHEIGHT-1;
}

float kv =  lpKernel[i + KERNELHALFWIDTH];
v_r = v_r + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 0] * kv;
v_g = v_g + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 1] * kv;
v_b = v_b + lpImage[(x + realy * IMAGEHEIGHT) * 4 + 2] * kv;
#endif
}

if(v_r > 255) {
v_r = 255;
} else if(v_r < 0) {
v_r = 0;
}
if(v_g > 255) {
v_g = 255;
} else if(v_g < 0) {
v_g = 0;
}
if(v_b > 255) {
v_b = 255;
} else if(v_b < 0) {
v_b = 0;
}

lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 0] = (unsigned char)v_r;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 1] = (unsigned char)v_g;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 2] = (unsigned char)v_b;
lpImageOut[(x + y * IMAGEHEIGHT) * 4 + 3] = 255;
}
'''

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
#define CONVOLVE_HORIZONTAL 1
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgH = cl.Program(clContext, clkernel).build()

clkernel = '''
#define KERNELHALFWIDTH {0}
#define IMAGEHEIGHT {1}
#define IMAGEWIDTH {2}
'''.format(int(len(imagekernel)/2), image_input.shape, image_input.shape)  + clkernel

clProgV = cl.Program(clContext, clkernel).build()

imagetemp = np.empty(imagein.shape, dtype = np.uint8)
imageout = np.empty(imagein.shape, dtype = np.uint8)

dta_sizes.append(ksize)
stime = time()

# Allocate out input buffer (we do not need write access _in this case_. This might be interesting when applying multiple
# Convolutions without keeping the intermediate results when swapping two GPU buffers forth and back without
# having to do a copying round ...

#clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)
clBufferIn = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = image_input)

npImageOut = np.empty(clBufferIn.get_info(cl.mem_info.SIZE), dtype=np.uint8)
clBufferTemp = cl.Buffer(clContext, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)
clBufferOut = cl.Buffer(clContext, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npImageOut)

npKernel = np.asarray(imagekernel, dtype = np.float32)
clBufferKernel = cl.Buffer(clContext, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf = npKernel)

krnlH = clProgH.convolve
krnlV = clProgV.convolve

ev1 = krnlH(clQueue, (h, w,), None , clBufferIn, clBufferTemp, clBufferKernel)
ev1.wait()
ev2 = krnlV(clQueue, (h, w,), None , clBufferTemp, clBufferOut, clBufferKernel)
ev2.wait()

dta_krnltimesH.append(ev1.profile.end - ev1.profile.start)
dta_krnltimesV.append(ev2.profile.end - ev2.profile.start)

cl.enqueue_copy(clQueue, npImageOut, clBufferOut) #, origin=(0,0), region=(w,h))
etime = time()

imageout = npImageOut.reshape(image_input.shape)
print(f"{etime - stime}s (Horizontal: {dta_krnltimesH[-1]}, Vertical: {dta_krnltimesV[-1]}")
dta_time.append(etime - stime)
dta_resultims.append(imageout)

fig, ax = plt.subplots(len(dta_sizes) + 1, figsize = (6.4 * 4, 4.8 * 4 * (len(dta_sizes) + 1)))

#ax.plot(dta_sizes, dta_time)
ax.plot(dta_sizes, dta_krnltimesH, label="Horizontal convolution")
ax.plot(dta_sizes, dta_krnltimesV, label="Vertical convolution")
ax.set_xlabel("Kernel size")
ax.set_ylabel("Execution time [ns]")
ax.grid()
ax.legend()

for iim, im in enumerate(dta_resultims):
ax[iim + 1].imshow(im)
ax[iim + 1].set_title(f"Kernel size {dta_sizes[iim]} took {dta_time[iim]} seconds")

plt.savefig("convolution_gaussiantime_separated_kernelsizes_gpu_naive1_gputiming.png")
plt.savefig("convolution_gaussiantime_separated_kernelsizes_gpu_naive1_gputiming.svg")
plt.show()

np.savez("convolution_gaussiantime_separated_kernelsizes_gpu_naive1_gputiming.npz", sizes = dta_sizes, times = dta_time)

    Kernel size 3 ... 0.10130548477172852s (Horizontal: 1195259, Vertical: 1193185
Kernel size 5 ... 0.08895730972290039s (Horizontal: 1970074, Vertical: 1734518
Kernel size 7 ... 0.09129762649536133s (Horizontal: 2288296, Vertical: 2280000
Kernel size 9 ... 0.09295392036437988s (Horizontal: 2824888, Vertical: 2826666
Kernel size 11 ... 0.0969536304473877s (Horizontal: 3360592, Vertical: 3357629
Kernel size 13 ... 0.0929572582244873s (Horizontal: 3926815, Vertical: 3917037
Kernel size 15 ... 0.10594820976257324s (Horizontal: 4474074, Vertical: 4441185
Kernel size 17 ... 0.0972445011138916s (Horizontal: 4985778, Vertical: 5011259
Kernel size 19 ... 0.09595417976379395s (Horizontal: 5523704, Vertical: 5522074
Kernel size 21 ... 0.09695553779602051s (Horizontal: 6062666, Vertical: 6058666
Kernel size 23 ... 0.09807777404785156s (Horizontal: 6668741, Vertical: 6635111 Here one now can see very impressively how fast the GPU is actually doing the calculation in comparison with overall execution time including host to GPU memory transfer. The execution of a 3x3 convolution just takes around 1.2ms for each the horizontal and vertical kernel even though the whole operation including transfer takes around 101ms - so we spend only about 2% of our time performing actual calculations. For the 23x23 kernel the GPU kernels took around 6.6ms each so we spent about 13% time doing actual calculations and the remaining time transferring data back and forth. And this does not even take into account that we do not utilize memory coalescing with the naive kernels (so memory the computation units of the GPU are idling most of the time just waiting for data to arrive).

When looking at the execution time of the kernels alone we already see the linear behavior with kernel size that we’d expect from the algorithm itself though.

When looking only at the computational time we’ve achieved a more than 81439x increase of execution speed compared to the naive Python implementation with the naive OpenCL implementation, when also taking into account the transfer time the speedup is still more than 10969x.