25 Feb 2020 - tsp

*This article is work in progress*

- Introduction
- Basics
- Training methods
- Some Neuron types
- Different common network topologies
- Some layer types

Neuronal networks are often called *the* solution to every hard to tackle problem
today. They’re a powerful method to perform data analysis and data exploration,
but they should not be over-estimated in their importance. They mostly compete
with other algorithms like support vector machines (which are most of the
time more performant than neuronal networks) and systems like bag of words
classifiers. They also compete with basic Bayesian predictors, etc.

The basic idea of a neuronal network is to provide a general data structure that can modify it’s own mappings from inputs to outputs, some kind of feedback and push as much training data inside the neuronal network as possible to calculate weights. One tries to find a (complex but mathematical) model function that identifies some kind of pattern on the input data and does a prediction (either binary or continuous) on output data.

Note that neuronal networks provide a tradeoff between development effort, accuracy, required training data and comutational / memory requirements.

In supervised learning the neuronal network is first presented with a training
dataset. This training dataset normally consists of some sample data (it doesn’t
matter if these are images, text, datasamples from some data aquisition system,
etc.) as well as a set of output values (labels, weights, etc.). The data is
presented to the neuronal networks inputs and the output is compared to the
expected one. Then one calculates the squared error to determine deviation from
the expected output and uses the *backpropagation algorithm* that simply calculates
the influence of each weight of each neuron onto the output (in terms of a
gradient) to calculate the correction applied to the weights - which is modified
using a so called learning rate.

At the end the network is normally tested against a independent test dataset to verify the training worked.

Note that neuronal networks are similar to statistical analysis in this application but provide the advantage that one can use a generic model composed of multiple interconnected functions to describe the problem. The more complex the model gets the more input data is required (and no - this does of course not scale linear).

This is an approach often used in clustering or segmentation. The training also works by presenting the network a huge bunch of training data - but this time the data is not pre-labeled but the network is used to determine some kind of segmentation of the input data. The normal approach for unsupervised learning is to use either principal component analysis (PCA) or cluster analysis instead of neuronal networks. In the world of neuronal networks one might see self-organizing maps or usage of the adaptive resonance theory.

Normally the outcome of unsupervised learning is not known in advance and inspection of the generated neuronal network is required. In practice most of the time PCA or cluster analysis are more performant and more accurate than neuronal networks in this field.

Another method of learning is called *reinforcement learning*.
For reinforcement learning one doesn’t provide known input and output pairs
to train the network but (sporadically) provides feedback via a feedback
function - i.e. tells the network if it’s performance is good or bad. These
informations then are used to recalculate weights. The initial network might
do random action and check the effect judged by an external feedback function.
One might even introduce randomness into the network to provide additional
learning disturbance to leave local minima. Often a simulation is used to
provide feedback on initial networks - i.e. the network provides some
output, the simulation checks if the outcome is positive or negative (not in
mathematical terms but for example if it leads to catastrophic failure in
control systems, achieves higher scores in games, etc.) and provides positive
or negative feedback in this case.

This method is heavily inspired by the way the human brain learns.

The basic building block of neuronal networks is a *neuron*. So what’s a
neuron? It’s a mathematical object composed of:

- An weight $w_{ij}$ that maps the weight of the input of the previous layer or input data $i$ into the neuron $j$. This will end up inside a mapping matrix.
- A transmission function $z = \Sigma(x_1, \ldots, x_i; w_0j, \ldots, w_ij)$
that’s normally really just the sum of all weighted input values. When
built as sum this is normally called the
*activation potential*. - An activation function $o_j = \sigma(z)$. This is the main function used to decide how to react to a given input and it’s the main challenge of using backpropagation as well as performance tuning for neuronal networks.
- Sometimes a threashold value $\theta_j$ is added to the input as a constant value. This is called bias.

As one can see the input to a single neuron with index $j$ is calculated as

$ z_j = \Sigma_{i = 1}^{I} w_{ij} * x_i $

Then the activation function is applied to calculate the output:

$ o_j = \sigma(z_j + \theta_j) $

Evaluation of the network response for a given input is normally done on graphics processing units (GPU) or tensor processing units (TPU) these days but of course it’s entirely possible to be done on CPU based systems. There are essentially only two basic operations requires:

- Matrix multiplication
- Applying a function to each matrix element

For some layer types on might also be required to sum vectors - or is capable of reducing storage by applying the same matrix to a larger number of overlapping or non overlapping zones (for example for convolutional layers).

To evaluate a dense layer one requires:

- The vector $\vec{x}$ contains the input data with elements $x_i$. This vector is assumed to be a column vector with $I$ elements.
- A weight matrix $w$ that contains the weights $w_{ij}$. This matrix is assumed to contain all weights for the first neuron in it’s first row, all weigts for the second neuron in it’s second, etc. This encodes the weights that map all input elements into the corresponding activation potentials $z_j = w_{ij} * x_i + \theta_j$.
- This leads to an intermediate value that should be stored when one want’s to do training but one doesn’t have to store when doing evaluation, the vector $\vec{z}$ of activation potentials $z_j$.
- An output vector $\vec{o}$ that contains the result of network evaluation, i.e. the application of the activation function to each and every element of the activation vector $\vec{o} = \sigma(\vec{z})$
- One can further reduce the complexity with some additional computational and storage overhead by including the bias $\theta_j$ into the weight matrix. This could be realized by adding an additional $1$ element at the end of each input vector and adding a further column to the matrix containing the bias values for each neuron.

$\vec{z} = w * \vec{x}$

[ \left(\begin{matrix}z_{1} \\ z_{2} \\ \ldots \\ z_J \\ 1\end{matrix}\right) = \left(\begin{matrix}w_{11} & w_{21} & \ldots & w_{I1} & \theta_1 \\ w_{12} & w_{22} & \ldots & w_{I2} & \theta_2 \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ w_{1J} & w_{2J} & \ldots & w_{IJ} & \theta_J \\ 0 & 0 & \ldots & 0 & 1 \end{matrix}\right) * \left(\begin{matrix}x_{1} \\ x_{2} \\ \ldots \\ x_I \\ 1\end{matrix}\right) ] [ \left(\begin{matrix}o_{1} \\ o_{2} \\ \ldots \\ o_{J} \\ 1 \end{matrix}\right) = \left(\begin{matrix}\sigma(z_{1}) \\ \sigma(o_{2}) \\ \ldots \\ \sigma(o_{J}) \\ 1 \end{matrix}\right) ]One of the most prominent methods to determine weights for the Neurons is
the *backpropagation* algorithm. This algorithm is used during supervised
learning. In this case on has a huge number of input $\vec{x}$ and expected
target $\vec{t}$ pairs. One can then calculate an squared error value

$E = \sum_{m=0}^{M} (\vec{t_m} - \vec{o_m})^2$

For the following calculations we assume a neuronal network consisting of input data and two dense neuronal layers. The output data is directly taken from the output of the second layer.

Note that the following indices are used:

- $m$ is the index into the training set. If you train your network with 1000 different samples $m$ runs from $0$ to $999$. The total number of training samples is termed $M$
- $i$ is an index into the input data for the layer. If an layer has $500$ different possible inputs $i$ runs from $0$ to $499$. The total number of input values is called $I$. For the first layer this is the index into the input vector.
- $j$ is the index into the output values for the input layer. For the last (output) layer this is the result data. The total number of output values of a given layer is called $J$.
- $j$ is also the index into the input values for the second layer as both layers are chained at each other.
- $k$ is the index into neurons at the second layer as well as into the output layer. The number of output variables is called $K$

In this case $t_{k,m}$ and $o_{k,m}$ compose all input values and expected output values for the $m$-th training sample. The square function guarantees that every error contributes independent of the direction (positive or negative). An error of $0$ would mean that the network perfectly reproduces all presented samples (which normally is an indication for overfitting). This error will also be calculated during validation.

The forward equations for the input layer in this network can be expressed as:

[ z_j = \theta_j + \sum_{i=0}^{N-1} x_i * w_{ij} ] [ y_j = \sigma(z_j) ] [ y_j = \sigma(\theta_j + \sum_{i=0}^{N-1} x_i * w_{ij}) ]The second layer is built on top of the results of the first layer:

[ a_k = \psi_k + \sum_{j=0}^{J-1} y_j * w_{jk} ] [ o_k = \sigma(a_k) ] [ o_k = \sigma(\psi_k + \sum_{j=0}^{J-1} y_j * w_{jk}) ] [ o_k = \sigma(\psi_k + \sum_{j=0}^{J-1} \sigma(\theta_j + \sum_{i=0}^{N-1} x_i * w_{ij}) * w_{jk}) ]The idea of backpropagation is now to minimize the error function. First one has to determine the direction and slope (i.e. the gradient) of the error. One simply computes the derivative of the error $E$ in relation to the weights $w_{ij}$. First we’ll only look at the output layer where one can directly specify the error deviation:

[ E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(o_{k;m} - t_{k;m}\right)^2 ] [ E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(\sigma(\psi_k + \sum_{j=0}^{J-1} y_j * w_{jk}) - t_{k;m}\right)^2 ]To make one’s live easier we’ll introduce the bias as explained above inside the weight matrix. This eliminates $\psi$ and $\theta$ from the equations:

[ E = \frac{1}{2} \sum_{m} \sum_{k=0}^{K} \left(\sigma(\sum_{j=0}^{J-1} y_j * w_{jk}) - t_{k;m}\right)^2 ]Now the derivatives can easily be calculated:

[ \frac{\partial E}{\partial w_{jk}} = \sum_{m} \sum_{k=0}^{K} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * \frac{\partial a_k}{\partial w_{jk}} ] [ \frac{\partial E}{\partial w_{jk}} = \sum_{m} \sum_{k=0}^{K} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * y_j ]Using this derivative for a single neuron $E_k$ one can simply determine:

[ \frac{\partial E_k}{\partial w_{jk}} = \sum_{m} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * y_j ]For an intermediate node the approach is similar:

[ \frac{\partial E}{\partial w_{ij}} = \sum_{k} \left( \sigma(a_k) - t_k \right) * \frac{\partial \sigma}{\partial z}(a_k) * \left(\sum_j \frac{\partial \sigma}{\partial z}(z_j) w_{jk} * x_i \right) ] [ \frac{\partial E_j}{\partial w_{ij}} = \sum_{k} \left(\sigma(a_k) - t_{k;m}\right) * \frac{\partial \sigma}{\partial a_k}(a_k) * \frac{\partial \sigma}{\partial z}(z_j) * w_{jk} * x_i ]As one can see the structure of both formulas is quite similar:

[ \frac{\partial E_k}{\partial w_{jk}} = \underbrace{\left( \sigma(a_k) - t_k \right) * \frac{\partial \sigma}{\partial z}(a_k)}_{\delta_k} * y_j ] [ \frac{\partial E_j}{\partial w_{ik}} = \underbrace{\left( \sum_{k} \delta_k * w_{jk} \right) * \frac{\partial \sigma}{\partial z}(z_j)}_{\delta_j} * x_i ]From this expressions one can see why the algorithm is called backpropagation. It simply transfers the error from the output layer toward the input of the output layer, then uses the weights of all nodes that are attached to the next node upstream to transfer the error upwards and uses the same approach again. This is repeated for every layer inside the network.

With these derivates one can calculate the adjustment of all weights in all layers:

[ \Delta w_{jk} = - \eta * \frac{\partial E_k}{\partial w_{jk}} = -\eta * \delta_k * y_j ] [ \Delta w_{ij} = - \eta * \frac{\partial E_j}{\partial w_{ij}} = -\eta * \delta_j * x_i ]Here the learning rate $\eta$ was used to specify the speed with which the algorithm should step into the direction of the gradient. A large learning rate leads to a high convergence speed but the algorithm may also miss local minima in it’s current vicinity. A low learning rate would require a huge number of iterations but might converge more reliable. Note that one also has to take care of numerical instabilities (multiplying large numbers - for example from exploiding gradients of some activation functions - with small numbers is especially a huge problem with floating point arithmetic) - most of a time a hybrid approach is most useful where one also reduces the learning rate during a number of epochs.

The online backpropagation method does this calculation for each and every training sample specified. This might be problematic if there are training samples that would lead to large steps into a given direction wheras the majority would not - i.e. an outlier. The online method is rather simple to implement and straightforward. One presents the intput, propagates it to the tensors till one has calculated a predicted output, calculates the error and all weight differences for a given learning rate and applies the weight modifications. Then the next sample is used and so on.

The batch method is used to reduce the effect of outliers. One calculates the weight differences for a whole bunch of training samples (this step is called an epoch). Then one calculates the average error for all nodes and applies that average to the backpropagation algorithm instead of the error of one sample. In this case the effect of outliers is reduced. One might also eliminate outliers using different methods but one has to remember that removing outliers is just fighting symptoms of bad measurement, way to small sample size or a wrong assumption, a too high error rate for labeling the input data set, etc.

Of course the batch method can also be used as a simple parallelization method. Each node uses the same network weights and calculates it’s own averaged gradients. Then the average over all gradients of a single batch is calculated at a central node, the weights are adjusted and broadcasted to all nodes and then the process continues. Note that this may leak information about the data that the nodes are processing. There is a number of research papers being published about federated learning (like for example Towards Federated Learning at Scale: System Design) and also ideas on how to preserve privacy while aggregating information about gradients (see for example Practical Secure Aggregation for Privacy-Preserving Machine Learning)

The following is just a collection of some basic most commonly used neuron types:

This is one of the most simple functions available and used in many introduction classes to neuronal networks:

$\sigma(z) = 1 \forall z \geq 0$

$\sigma(z) = 0 \forall z < 0$

The main problem of this function is that it’s a non analytical function. It simply splits the input-space by a hyperplane and is used in Perceptrons or at the output stage of binary classifiers - it’s derivative would be the Dirac delta function $\delta(z)$ which takes the value $\infty$ at $z=0$ and zero elsewhere. There is no useable gradient for the backpropagation algorithm available when using this function.

The sigmoid function solves the problem of the diverging gradient by having a finite slope. It’s way more computationally intensive but it’s one of the most easiest functions to deal with during backpropagation.

$\sigma(z) = \frac{1}{1 + e^{-z}}$

The derivative is given by

$\frac{\partial \sigma(z)}{\partial z} = \frac{e^{-z}}{(1 + e^{-z})^2}$

$\sigma(z) = \frac{e^{2z}-1}{e^{2z}+1}$

The derivative is given by

$\frac{\partial \sigma(z)}{\partial z} = \frac{4*e^{2z}}{(e^{2z}+1)^2}$

The rectifier is a partially defined function. It’s linear for positive input values but zero for negative ones - it combines the simple evaluation of the function as well as the existence of the derivative. Note that this function is not continuous any more but normally this doesn’t pose any problem. Also the derivative is really easy to calculate because of two cosntant sloped areas:

$\sigma(z) = z \forall z \geq 0$

$\sigma(z) = 0 \forall z < 0$

An additional weight meight be used to specify the slope:

$\sigma_w(z) = z * \theta_j \forall z \geq 0$

$\sigma_w(z) = 0 \forall z < 0$

The derivative of the rectifier is rather easy to calculate:

$\frac{\partial \sigma(z)}{\partial z} = 1 \forall z \geq 0$

$\frac{\partial \sigma(z)}{\partial z} = 0 \forall z < 0$

The same with weights:

$\frac{\partial \sigma_w(z)}{\partial z} = \theta_j \forall z \geq 0$

$\frac{\partial \sigma_w(z)}{\partial z} = 0 \forall z < 0$

This easy definition makes the summation during backpropagation really simple which gives a huge speedup when exploited in code. Unfortunately the non continuous derivative makes a recursive usage of this node type challenging.

The softmax function is a weighted version of the exponential function.

$\sigma_j(Z) = \frac{e^{z_j}}{\sum_{p=0}^{J} e^{z_p}}$

The normalization used here is simply the sum of all exponentials $e^{z_j}$ inside the softmax layer without normalization. The softmax function is mostly used for categorical distributions - the results of a specifically constructed neuronal network can be interpreted as the probabilities of an object belonging to a given class. The same function is also often used in statistics (Bayesian methods, multi-class regression, etc.)

One might also be tempted to use a simple linear activation function - and might of course do so for some specific cases:

$\sigma(z) = a * z + b$

This might be done with some freely chooseable constants $a$ and $b$. In any case this will not be a good choice for a real neuronal network since the linear combination of all these linear functions will collaps into a multidimensional linear function (i.e. into a single layer) at the end. One cannot build a plain simple linear function based neuronal networks consisting of multiple layers.

There is a really huge number of possible network topologies. In the following section some of the most common network types are shortly mentioned.

This is one of the most basic network structures. The input data is presented to the first layer in the network as input, the output of the first layer as input to the second, etc. The last layer presents the output. Normally all layers are fully connected (i.e. each neuron on the first layer is connected to each neuron on the second, each neuron on the second layer is connected to each neuron on the third, etc.). There are no cyclical connection paths. Such a network is directly representable by a chain of tensor operations.

Note that there is no natural way to describe memory like in a recurring neuronal network - but recurring neuronal networks can be unrolled into a more complex feed forward network when cycles have a finite depth.

RNNs use cycles from layers at lower stages back to upper layers or from outputs of neurons back to their inputs. In that way they contain internal state and have to model a propagation delay for these internal loops. Normally the most basic RNNs are unrolled into feed forward networks for easier computation. They are capable of processing variable length inputs instead of only constant length inputs like the feed forward network (in this case input is shifted through the input areas instead of loaded at a fixed position)

To solve problems of exploding or vanishing gradients some networks introduce long short term memory (LSTM) that is capable of remembering events a longer time ago without having to expand the network to thousands of layers and without propagating an infinitely growing error gradient.

In recursive neuronal networks multiple layers are interconnected by the same weight matrix for every layer (i.e. every layer shares the same weight matrix). They are often used for natural language processing (word embeddings, etc.). Training is only really possible when using some function like logistic or hyperbolic tangent with non vanishing and continuous gradient.

A convolutional neuronal network includes layers that calculate the convolution of the input data into a hidden layer. Normally this means that a sliding window of fixed size is mapped into the next layer and summation is not done over all outputs of the nodes above. Convolutional neuronal networks are most often used to identify features in image classification networks to detect spatial features.

Networks in many machine learning frameworks are built using different types of layers. Some of them are listed below.

This is one of the most classic layers. All outputs from a given layer are connected to each input of the layer below using a weight matrix as described above.

A convolutional layer is convoluting the input data (for example one might imagine a fixed size sliding window to generate all the input data for a following layer at the next stage). This means that not all neurons are fully interconnected.

One might imagine the first 3x3 window of data to be mapped to the first neuron on the next layer, the same window slided by one element to the second neuron and so forth.

A convolutional layer can be either implemented by modifying the sum or forcing some weights to be zero.

The difference between a convolutional layer and a locally connected layer is the sharing of weights. In a convolutional layer each zone that gets convoluted shares the same weights with every other zone. In a locally connected layer connections are locally like in the convolutional layer but each connection in each zone shares it’s own weights.

A pooling layer is normally used below a convolutional layer to discard unnecessary information. It selects information from the zones of convolutional layers - for example using a function like the maximum or minimum inside the zone of a given window. This helps to speed up computation and to reduce overfitting.

A pooling layer has to be performed using a tensor operation that might operate on multiple elements of the given tensor.

More generally a merging layer can merge values of either the whole upper level network or only given zones of the input network. It might merge using a set of different functions (addition, subtraction, multiplication, averaging, maximum, minimum, concatenation, dot products, etc.).

A flattening layer just re-arranges the result of a layer above (mostly used after a pooling layer) to return a single feature matrix instead of the multi dimensional result of a pooling layer.

Noise layers allow the introduction of random white noise into the model. Normally they use gaussian noise model centered around zero. This introduction of noise allows one to leave local minima and prevent overfitting.

A normalization layer is just a layer inside the processing chain, not really a layer that gets evaluated every time the input is changed like the layers above. It’s used to re-normalize the outputs for every batch sequence in the sense that one tried to get the average zero with standard deviation 1.

Again this is not a traditional layer as the layers above but is present in some machine learning frameworks. It allows the user applciation to call one function that will be evaluated for every element inside the input tensor and yield the value in the output tensor (i.e. the function is applied to every element in the input data).

This article is tagged: Programming, Tutorial, Machine learning, Statistics, Data Mining

Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)

This webpage is also available via TOR at http://jugujbrirx3irwyx.onion/