Short Kalman filter summary

19 May 2021 - tsp
Last update 19 May 2021
Reading time 9 mins

What is a Kalman filter

Basically a Kalman filter allows the iterative prediction of a system state $x_k$ based on it’s previous estimated states $x_{k-1}$, some noisy measurement values, an evolution model as well as known deterministic external influences. In addition it’s a probabilistic model that incorporates non deterministic noise and perturbations in it’s estimations.

The Kalman filter is based around two concepts:

A hidden Markov model (which is a special dynamic Bayesian net)
The Bayes theorem

In a Markov chain one assumes that the evolution of one system state to the next one is independent of any states except the previous one. Using the notation of conditional probabilities $P(A \mid B)$ meaning the probability of $A$ under the previously fulfilled precondition $B$ and describing the system state as $x_k$ and the measured observables of a system as $z$ this simply means that:

[ p(x_k \mid x_{k-1}, x_{k-2}, \ldots, x_0) = p(x_k \mid x_{k-1}) \\ p(z_k \mid x_{k}, x_{k-1}, \ldots, x_0) = p(z_k \mid x_{k}) ] [ p(x_0, \ldots, x_k; z_1, \ldots, z_k) = p(x_0) * \prod_{i = 1}^{k} p(z_i \mid x_i) * p(x_i \mid x_{i-1}) ]

This basically models a chain of probability distribution transfer functions that incrementally generate the current state. The product $p(z_i \mid x_i) * p(x_i \mid x_{i-1})$ just separates two functions that describe conditional probabilities on different subspaces (measurement values versus state transitions).

If one writes out some steps this basically means:

[ p(x_0, \ldots, x_k) = p(x_k \mid x_{k-1}) * p(x_{k-1} \mid x_{k-2}) * \ldots * p(x_1 \mid x_0) * p(x_0) \\ p(x_0, \ldots, x_k; z_1, \ldots, z_k) = \underbrace{p(z_k \mid x_k) * p(x_k \mid x_{k-1})}_{k-1 \to k} * \underbrace{p(z_{k-1} \mid x_{k-1}) * p(x_{k-1} \mid x_{k-2})}_{k-1 \to k-2} * \ldots \underbrace{p(z_1 \mid x_1) * p(x_1 \mid x_0)}_{0 \to 1} * p(x_0) ]

This will also be the basis for the Kalman filter described below that will be separated into two different steps - prediction and an adjustment / correction / update step.

The prediction step basically predicts the current system state from all possible previous states $x_{k-1}$. In theory there is a probability for each previous state $x_{k-1}$ that’s allowed by the measurements $z_{1:k-1}$ - symbolically one can thus describe the probability of the current state predicted by the measurement values as the sum of all possible states allowed by the previous measurements:

[ p(x_k \mid z_{1 : k-1}) = \int p(x_k \mid x_{k-1}) * p(x_{k-1} \mid z_{1:k-1}) dx_{k-1} ]

In the next step the update tries to include the current measurement value to weight the probabilities using the Bayesian theorem:

[ p(x_k \mid z_{1,k}) = \frac{p(z_k \mid x_k) * p(x_k \mid z_{1 : k-1})}{\underbrace{p(z_k \mid z_{1 : k-1})}_{\alpha}} \\ p(x_k \mid z_{1,k}) \propto p(z_k \mid x_k) * p(x_k \mid z_{1 : k-1}) \\ \alpha = p(z_k \mid z_{1 : k-1}) = \int p(z_k \mid x_k) * p(x_k \mid z_{1 : k-1}) dx_k ]

Instead of calculating $p(z_k \mid z_{1 : k-1})$ one can simply describe the probability distribution of $p(z_k \mid z_{1 : k-1})$ and renormalize the integral over all states to 1.

Application to a time discrete filter

How is this idea applied to a linearized time discrete filtering application? In a practical application the problem is described in multiple spaces. The quantities used during the prediction step are:

The (estimated or current) system state $X_k$ ($n$ dimensional vector, probability distribution). This might include position, velocity, current acceleration, etc.
A set of known deterministic influences such as control surface positions, thruster settings, known wind speed, etc. $u_k$ ($l$ dimensional vector)
An linearized evolution model for a single time step $F_{k}$ that’s usually described as an $n \times n$ matrix.
A linearized model for deterministic perturbations and control $B_{k}$ (an $n \times l$ Matrix).

In addition the theoretical model includes stochastic nondeterministic white noise $w_{k}$ that’s distributed according to the normal distribution with zero expectation value and a covariance matrix of $Q_{k}$.

[ X_{k} = \underbrace{F_{k-1} * X_{k-1}}_{\text{System evolution}} + \underbrace{B_{k-1} * u_{k-1}}_{\text{Deterministic perturbations}} + \underbrace{w_{k-1}}_{\text{Nondeterministic noise}} \\ w_k \approx \mathfrak{No} (\vec{0}, Q_{k-1}) ]

Since a control system usually acts on a set of measurement values that are not directly the system state one has to model the measurement system. The distribution of predicted measurement values $Z_k$ can be calculated using the measurement model $H_{k}$ and the current state $X_k$. In addition measurement noise $v_k$ has to be modeled:

[ Z_k = H_k X_k + v_k \\ v_l \approx \mathfrak{No} (\vec{0}, R_k) ]

Usually the distribution of measurement values $Z_k$ is not directly accessible but only a number of realizations of the measurements $z_k$. The main filtering problem thus is estimating the linearized $X_k \approx \mathfrak{No}(x_k, P_k)$ from a series of measurement realizations $z_k$. $x_k$ is the mean value of the estimated system state, $P_k$ the covariance matrix of system status.

Kalman filter algorithm for time discrete applications

This leads to a rather simple algorithm in the time discrete case:

Initialization

First the system is initialized into a known state. In this blog it’s initialized to zero - but one might also initialize to a known boundary condition of course. The vector $x_0$ is the whole known (estimated) state of the system, $P_0$ the usually diagonal covariance matrix that describes the statistical deviations (in diagonal example these are the variances).

[ x_0 = 0 \\ P_0 = \sigma^2 * I ]

Prediction step

For each time step a prediction step is made. The only difference to the model presented above is that one can of course not include nondeterministic noise - it’s simply neglected. The unperturbed linearized process is again described by the matrix $F_k$, the influence of deterministic perturbations by $B_k$. Basically it’s just an addition of two random variables that are described by multivariate normal distributions - and as one recalls from normal distributions the addition of realizations is simply the addition of expectation values as well as their variances.

[ x_{k \mid k-1} = \underbrace{F_{k-1} * x_{k-1}}_{\text{System evolution}} + \underbrace{B_{k-1} u_{k-1}}_{\text{Deterministic perturbation}} \\ P_{k \mid k-1} = \underbrace{F_{k-1} * P_{k-1} * F_{k-1}^{T}}_{\text{Evolution of covariance matrix under system model}} + \underbrace{Q_{k-1}}_{\text{Process noise}} ]

In addition to the predicted system state $x_{k \mid k-1}$ - the notation means the state at time $k$ is predicted under the assumption of the previous state $k-1$ - one also predicts the evolution of the standard deviations. As one can see from the equations this works by a standard projection through the non time evolved space by using the transposed time evolution matrix (one can imagine the sequence $F * P * F^T$ from right to left as a projection from the space of the new state into the space of the old state in which one uses the P matrix that is then transferred back into the space of the new state). One additionally adds the process noise.

Most of the time the process noise is also assumed to be independent - i.e. an diagonal covariance matrix that contains the variances - squared standard deviations though this is not necessary.

Update step

Whenever measurements are available one can do an update or correction step to refine the predicted system state. In case no measurements are available one would simply assume the predicted state to be the refined state.

No measurements available

In case no measurements are available:

[ x_{k} = x_{k \mid k-1} \\ P_{k} = P_{k \mid k-1} ]

Measurements available

Whenever a measurement is available it gets a little bit more complicated. First one will calculate the so called Innovation. The innovation basically models the difference between the predicted measurements that are calculated based on the predicted state and the measurements taken from reality:

[ \bar{y_k} = \underbrace{z_k}_{\text{real measurements}} - \underbrace{H_k * x_{k \mid k-1}}_{\text{predicted measurement}} ]

The prediction of the measurement values is realized using a linearized inverse model of the measurement system $H_k$. This transfers the estimated system state $x_{k \mid k-1}$ into the measurement values space.

In addition one can calculate the residual covariance. This is basically the same thing as the difference of the measurement values - despite one does of course not calculate the difference of variances but the sum:

[ S_k = \underbrace{H_k P_{k \mid k-1} H_k^T}_{\text{predicted covariance}} + \underbrace{R_k}_{\text{measurement noise}} ]

As one can see the residual covariance is the covariance of the innovation - i.e. the residual measurement values.

Now comes the special idea behind Kalman filters - the Kalman gain:

[ K_k = P_{k \mid k-1} H_k^T S_k^{-1} ]

As one can see the Kalman gain basically transforms covariances (or reflects a projection first from the innovation space into the measurement space - then from measurement space into state space at $k-1$ and then then into the state space at $k$)- the larger the measurement noise the smaller will be the influence on the predicted state. The better the measurement is the more influence it will have on the correction. This will be used to weight the innovation that is then applied to the a priori predicted state to determine the a posteriori estimated system state.

[ x_k = \underbrace{x_{k \mid k-1}}_{\text{a priori predicted state}} + \underbrace{K_k * \bar{y_k}}_{\text{weighted innovation}} \\ P_k = P_{k \mid k-1} - K_k S_k K_k^T ]

Note that - when one’s having a hard time grasping the Kalman gain - one can also describe it in the following not practically usable way (since this expression would require the a posteriori $P_k$ instead of the a priori $P_{k \mid k-1}$).

[ K_k = P_k * H_k^T * R_k^{-1} ]

Note that the equations do not have to be the same for every update / correction step. In case a different set of sensors is available one might use separate matrices and state vectors for measurements and measurement covariances. Only the estimated system state has to be the same every iteration. This allows one to integrate slower measurement systems like GPS into a set of other data sources as accelerometers and gyroscopes.