So first of all: this is neither a purely mathematical article nor a tutorial. It's just
a summary of my understanding of Krylov subspaces and their role in solving
linear equations. Since I had to implement such a solver for one of my
FTDM simulations and as usual didn't want to use external dependencies (and always
wondered how those methods really work, since usually one only uses some magically
working and highly performant libraries), I thought it would be a good idea to
also summarize my view on this method - which is unsurprisingly again based primarily
on the idea of projections and taking a shortcut by solving a high dimensional
problem inside a projective subspace. It's a good idea to familiarize oneself
with power iterations, the
QR decomposition using Givens rotations and how to
apply this decomposition to solve the linear least squares problem
first.
Side note: every time I denote the norm of a vector $\mid \vec{v} \mid$ I
implicitly mean the Euclidean norm $\mid \vec{v} \mid_2$.
The problem
So what's the problem that I wanted to solve and what are Krylov subspace iterations
used for? They're mainly used to solve Eigenvector problems and to approximate
solutions to large scale linear equation systems (let's say a million variables in
a million equations, for example). The latter is a pretty common problem when
doing finite element and finite difference calculations in mechanics, physics,
aerodynamics, electrodynamics, statics, social network analysis, statistics and
many more fields - basically these are methods used to solve or
approximate the state of complex systems that are described by partial
differential equations. On the other hand Krylov subspace methods are also a
powerful way to calculate Eigenvectors - in fact their basic idea is
similar to power iterations.
The basic problem for $N$ variables in $N$ linear equations can be written
as a matrix-vector equation:
[
A * \vec{x} = \vec{b}
]
In this case the coefficients of the equations are contained in the matrix $A \in \mathfrak{R}^{N \times N}$,
the vector $\vec{x} \in \mathfrak{R}^N$ is formed by the unknowns and the
vector $\vec{b} \in \mathfrak{R}^N$ contains the constant solutions for the
equations. For example the system of equations
[
1 a + 2 b + 3 c = 4 \\
5 a + 6c = 7 \\
8 a - 9 b + 10 c = 11
]
would be expressed as
[
\begin{pmatrix} 1 & 2 & 3 \\ 5 & 0 & 6 \\ 8 & -9 & 10 \end{pmatrix} * \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 4 \\ 7 \\ 11 \end{pmatrix}
]
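Just to make the notation concrete, here is a quick sanity check of this tiny system using numpy's dense direct solver (not a Krylov method, and obviously not what one would do for a large system):

```python
import numpy as np

# Coefficient matrix and right hand side of the small example system above
A = np.array([
    [1.0,  2.0,  3.0],
    [5.0,  0.0,  6.0],
    [8.0, -9.0, 10.0],
])
b = np.array([4.0, 7.0, 11.0])

# Dense direct solve (internally an LU decomposition) - fine for tiny systems
x = np.linalg.solve(A, b)
print(x)                      # values of a, b and c
print(np.allclose(A @ x, b))  # True - the residual is (numerically) zero
```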
Of course such a small system could be solved by hand or by automating the
Gaussian elimination method - which works pretty well for small scale systems
but gets unusable when one approaches thousands of equations in thousands of
variables. Then iterative methods get more interesting. The basic idea behind
current software implementations is to use algorithms that are based on just a
few simple operations such as:
- The matrix-vector product (i.e. a projection or transformation operation)
- Calculating the norm of a vector
- Multiplying by a scalar value
- Adding and subtracting vectors of the same dimension
This is usually done by choosing an initial guess $\vec{x}_0$ and then computing a
sequence of iterates $\vec{x}_n$ that should minimize the residual $\vec{r_n} = \vec{b} - A * \vec{x_n}$.
For Krylov subspace methods this is done by moving inside an affine subspace $\vec{x_0} + \mathfrak{K} \subset \mathfrak{R}^N$.
Krylov subspace methods are as of today the most widely used methods to solve general and
symmetric positive definite systems of equations - other methods such as
successive overrelaxation do exist but are rarely used except
in some special cases.
What is a Krylov subspace?
So the first question that arises is - what is a Krylov subspace anyways and why
does one want to use this subspace?
Formally one could solve the linear equation system by calculating a matrix
inverse:
[
A * \vec{x} = \vec{b} \\
A^{-1} * A * \vec{x} = A^{-1} * \vec{b} \\
\vec{x} = A^{-1} * \vec{b}
]
This is of course not suited for any real world implementation - but there is
a theorem - the Cayley-Hamilton theorem - that tells one that every square
matrix over a commutative ring satisfies its own characteristic equation. This
also implies that for every $n \times n$ matrix that's invertible (i.e. has a
nonzero determinant) one can write its inverse $A^{-1}$ as a polynomial
expression of order $n-1$. This polynomial expression is built from the powers
of $A$ - so the inverse can be built as a linear combination of $I, A, A^2, A^3, \ldots, A^{n-1}$.
In contrast to a direct approach Krylov subspace iterations never calculate
the powers of the matrix themselves - but they do repeatedly perform a matrix-vector
multiplication and thus effectively multiply vectors with powers of $A$, which
leads to a huge performance gain (see power iterations).
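As a tiny illustration of the Cayley-Hamilton argument (for a generic invertible $2 \times 2$ matrix, not tied to any particular solver): the characteristic polynomial is $\lambda^2 - tr(A) \lambda + det(A)$, so
[
A^2 - tr(A) A + det(A) I = 0 \\
\to A^{-1} = \frac{1}{det(A)} \left( tr(A) I - A \right)
]
which is indeed a polynomial of degree $n - 1 = 1$ in $A$.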
Given any vector $\vec{v} \in \mathfrak{R}^N$ and $n \leq N$ a Krylov subspace is
just
[
K_n = K_n(A, \vec{v}) = span(\vec{v}, A \vec{v}, A^2 \vec{v}, \ldots, A^{n-1} \vec{v})
]
As one can see of course this is a nested set of subspaces:
[
K_1 \subseteq K_2 \subseteq K_3 \subseteq \ldots \subseteq K_{N}
]
The dimension of these subspaces grows by at most one per iteration
and can never exceed $N$ - in practice it's usually much
smaller. This is also the foundation for solving linear equations in Krylov
subspaces - searching for a representation of a solution in a low dimensional space
is usually much easier than searching inside a high dimensional space.
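A naive sketch of how such a subspace basis could be generated in code - only repeated matrix-vector products, never forming $A^2, A^3, \ldots$ explicitly (the function name and the tiny test matrix are just made up for illustration; the raw basis built this way is badly conditioned, which is exactly why the Lanczos and Arnoldi methods below orthogonalize on the fly):

```python
import numpy as np

def krylov_basis(A, v, n):
    """Return the (non-orthogonalized) vectors v, Av, A^2 v, ..., A^(n-1) v
    as columns of a matrix - only matrix-vector products are used."""
    vectors = [v]
    for _ in range(n - 1):
        vectors.append(A @ vectors[-1])   # one matvec per new direction
    return np.column_stack(vectors)

# tiny example
A = np.array([[4.0, 1.0], [1.0, 3.0]])
v = np.array([1.0, 0.0])
print(krylov_basis(A, v, 2))              # columns: v and A v
```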
Standard Krylov solvers
There is a compact description of the basic inner workings of a standard Krylov
subspace solver that I've found in a paper:
[
\vec{r}_n = \vec{b} - A \vec{x}_n \\
\vec{x}_n - \vec{x}_0 = q_{n-1}(A) \vec{r}_0 \in K_n(A,\vec{r}_0) \\
\vec{r}_n - \vec{r}_0 = \zeta_n(A) \vec{r}_0 \in A K_n(A, \vec{r}_0) \subseteq K_{n+1}(A, \vec{r}_0)
]
Building Krylov subspaces
So how does one build the basis of a Krylov subspace in practice? One could for
example use:
- The Lanczos algorithm that only works for symmetric positive definite matrices
- The Arnoldi iteration that requires a full Gram-Schmidt orthogonalization
  for each step but works for general matrices.
The Lanczos algorithm
This is one of the most basic Krylov subspace algorithms and it's only applicable
to symmetric positive definite matrices. The basic idea is to start at
a random unit vector $\vec{v}$ and iteratively construct an orthonormal
basis of $K_n(A,\vec{v})$ with $n \leq N$.
First one initializes the algorithm:
[
\vec{w}_0 = \vec{0} \\
\vec{w}_1 = \vec{v} \\
\delta_1 = 0
]
Then an iterative sequence follows for at most $n$ iterations:
[
\vec{h} = A * \vec{w}_j - \delta_j * \vec{w}_{j-1} \\
\gamma_j = \vec{h}^T * \vec{w}_j \\
\vec{k} = \vec{h} - \gamma_j \vec{w}_j \\
\delta_{j+1} = \mid \mid \vec{k} \mid \mid_2 \\
\vec{w}_{j+1} = \frac{1}{\delta_{j+1}} \vec{k}
]
As one can see the first step is as usual to push the current vector $\vec{w}_j$
through the projection matrix $A$. One then subtracts the component along the
basis vector of the previous iteration - the weight factor $\delta_j$ has been
calculated in the previous step and of course is zero for the first iteration.
The factor $\gamma_j$ basically describes the projection of the newly transformed
vector onto the current basis vector. The calculation of $\vec{k}$ thus
tries to remove any components of $\vec{w}_j$ from $\vec{h}$ by simple subtraction.
This is a plain Gram-Schmidt style orthogonalization - note that it is usually
numerically unstable because of cancellation errors.
Now one obtains the next basis vector simply by determining the length of the remaining
orthogonal component and re-normalizing it.
As mentioned above this algorithm works in theory but is often numerically
unstable in practice - one could use something like a Householder based orthogonalization instead.
The resulting orthonormal basis is:
[
W_n = \left( \vec{w_1}, \vec{w_2}, \ldots, \vec{w_n} \right) \in \mathfrak{R}^{N \times n}
]
Note that the Lanczos algorithm might terminate early - in case $\delta_{j+1} = 0$
for any $j < n$ one can assume that the degree of $\vec{v}$ is $j$ and $K_j$ is
invariant.
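A minimal Python sketch of the iteration above (my own naive transcription, assuming a symmetric positive definite `A`; it only orthogonalizes against the two previous vectors and thus inherits the instability mentioned above):

```python
import numpy as np

def lanczos(A, v, n):
    """Naive Lanczos iteration: build an orthonormal basis W of K_n(A, v).
    Returns W with the basis vectors as columns; may stop early if delta = 0."""
    N = A.shape[0]
    w_prev = np.zeros(N)
    w = v / np.linalg.norm(v)
    delta = 0.0
    W = [w]
    for _ in range(n - 1):
        h = A @ w - delta * w_prev          # matvec, remove previous direction
        gamma = h @ w                        # component along current vector
        k = h - gamma * w                    # orthogonalize against w_j
        delta = np.linalg.norm(k)
        if delta == 0.0:                     # early termination: K_j is invariant
            break
        w_prev, w = w, k / delta
        W.append(w)
    return np.column_stack(W)

# quick check on a small symmetric positive definite matrix
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
W = lanczos(A, np.array([1.0, 1.0, 1.0]), 3)
print(np.allclose(W.T @ W, np.eye(W.shape[1]), atol=1e-10))  # orthonormal basis?
```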
Arnoldi algorithm
The Arnoldi algorithm works somewhat like the Lanczos method above - in fact
when applied to Hermitian matrices the Arnoldi algorithm reduces to the
Lanczos algorithm. The idea is similar too: perform a Gram-Schmidt orthogonalization
against all previous basis vectors on each iteration step to build the Krylov subspace.
This of course requires storing all basis vectors in memory, which might be a problem under
some circumstances - one can for example implement restarts in case the subspace
grows too large.
- $\vec{w_j} = A * \vec{w_{j-1}}$
- $\forall l \in 1 \ldots j-1:$
- $ h_{l, j-1} = \vec{w^\star}_{l} * \vec{w}_{j} $
- $ \vec{w}_{j} = \vec{w}_{j} - h_{l, j-1} * \vec{w}_{l} $
- $h_{j, j-1} = \mid \vec{w}_{j} \mid$
- $\vec{w_j} = \frac{1}{h_{j, j-1}} * \vec{w_j}$
Again as one can see the algorithm forms an iterative process that starts with
a unit start vector $\vec{w}_0$. This vector is then pushed through the projection
matrix (and thus slowly moves towards the Eigenvector of the matrix with the
largest Eigenvalue, as with any power iteration).
Then one performs the Gram-Schmidt orthogonalization process. This basically
projects the newly generated vector $\vec{w}_j$ onto
all previously generated vectors $\vec{w}_1 \ldots \vec{w}_{j-1}$,
one after the other. As one should already know the inner product of two vectors
yields the length of the projection of one vector onto the other - in our
case one vector is of unit length so the projected length is only scaled by the
length of the second vector. One then scales the previous basis vector $\vec{w}_l$
by this projected length and subtracts
the scaled vector ($\vec{w}_j = \vec{w}_j - h_{l, j-1} * \vec{w}_l$).
This removes all components pointing in the direction of $\vec{w}_l$
from $\vec{w}_j$ - thus this is the typical Gram-Schmidt
orthogonalization process.
The last step - since we're interested in normalized basis vectors - is the
normalization of the vector $\vec{w}_j$ by dividing
it by its own length $h_{j, j-1}$.
Note that as for the Lanczos algorithm this algorithm may abort early with $\vec{w}_j = \vec{0}$
in case the minimal polynomial of the start vector with respect to $A$ only has degree $j$.
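A small Python sketch of the Arnoldi iteration as described above (again my own naive transcription; it returns the orthonormal basis $Q$ and the Hessenberg coefficients $h_{i,j}$ that will reappear in the GMRES section below):

```python
import numpy as np

def arnoldi(A, v, n):
    """Arnoldi iteration: orthonormal basis Q of K_n(A, v) and the
    (n+1) x n upper Hessenberg matrix H with A @ Q[:, :n] = Q @ H."""
    N = A.shape[0]
    Q = np.zeros((N, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(n):
        w = A @ Q[:, j]                      # push through the matrix
        for i in range(j + 1):               # Gram-Schmidt against all previous vectors
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:               # early termination: subspace is invariant
            return Q[:, : j + 1], H[: j + 1, : j + 1]
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H

# quick numerical check of the Arnoldi relation A Q_n = Q_{n+1} H
A = np.random.rand(6, 6)
Q, H = arnoldi(A, np.random.rand(6), 4)
print(np.allclose(A @ Q[:, :4], Q @ H))      # True (up to rounding)
```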
Linear equations
So what has all of this to do with linear equations? First let's take a look at
how to solve linear equations numerically. One could of course always apply the
Gaussian elimination procedure, which is the way to go for small scale systems,
but unfortunately it scales with $\mathfrak{O}(n^3)$ which makes it pretty
prohibitive for large scale applications. And when looking into big data directions
it's not really flexible enough to be updated and interpreted in an online fashion
without knowing the whole dataset.
Steepest descent method
So let's take a different approach (that's usually also not executed exactly that way
due to its numerical complexity, but it's guaranteed to converge in case a
solution exists). For a symmetric positive definite matrix $A$ one can show that the solution of the system of linear equations
[
A \vec{x} = \vec{b}
]
is exactly the solution of the minimization problem:
[
min \phi(\vec{x}) = \frac{1}{2} \vec{x}^T A \vec{x} - \vec{x}^T \vec{b}
]
The idea behind steepest descent thus is to just take some arbitrary start guess $\vec{x}_0$,
find the direction of steepest descent, calculate $\vec{x}_{1}$ and take that step -
and then iterate as long as one hasn't reached the correct solution. The first thing
to look at is the direction of the step, so one can do a Taylor expansion of the
function $\phi(\vec{x} + \alpha \vec{d})$ when taking a step into direction $\vec{d}$:
[
\phi(\vec{x}_k + \alpha \vec{d}) = \phi(\vec{x}_k) + \alpha \nabla \phi(\vec{x}_k)^T \vec{d} + \mathfrak{O}(\alpha^2 \mid \vec{d} \mid^2)
]
As one can see the (first order) approximation of the direction of steepest
descent is simply $- \nabla \phi(\vec{x}_k)$. Let's just define this as the
residual $\vec{r}_k$:
[
\vec{r_k} = - \nabla \phi(\vec{x}_k) \\
\vec{r_k} = \vec{b} - A \vec{x}_k
]
Now one just moves into the direction of $\vec{r}_k$ until one reaches
the optimal point along that line (exact line search). When one looks at $\phi(\vec{x_k} + \alpha \vec{r}_k)$
one can see that this is just a second order polynomial in $\alpha$, so the optimal
step length can be computed exactly:
[
\phi(\vec{x_k} + \alpha_k \vec{r}_k) = \frac{1}{2} (\vec{r}_k^T A \vec{r}_k) \alpha_k^2 + (\vec{r}_k^T A \vec{x}_k - \vec{r}_k^T \vec{b}) \alpha_k + \phi(\vec{x}_k) \\
\to \alpha_k = \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_k^T A \vec{r}_k}
]
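For completeness, the step from this quadratic to the optimal step length is just setting the derivative with respect to $\alpha_k$ to zero and using $\vec{r}_k = \vec{b} - A \vec{x}_k$:
[
\frac{\partial \phi}{\partial \alpha_k} = (\vec{r}_k^T A \vec{r}_k) \alpha_k + \vec{r}_k^T A \vec{x}_k - \vec{r}_k^T \vec{b} = 0 \\
\to \alpha_k = \frac{\vec{r}_k^T (\vec{b} - A \vec{x}_k)}{\vec{r}_k^T A \vec{r}_k} = \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_k^T A \vec{r}_k}
]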
This would already be enough to build an idealized algorithm with pretty slow
but guaranteed convergence:
- $\vec{r}_0 = \vec{b} - A \vec{x}_0$
- $k = 0$
- While $\vec{r}_k \neq 0$:
    - Calculate the distance to step into the direction of steepest descent,
      i.e. the exact line search: $\alpha_k = \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_k^T A \vec{r}_k}$
    - Perform the step: $\vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{r}_k$
    - Update the residual: $\vec{r}_{k+1} = \vec{b} - A \vec{x}_{k+1}$
    - $k = k + 1$
- Done: $\vec{x} = \vec{x}_k$
As one can easily see one can also cache an intermediate result to reduce the
number of matrix-vector multiplications:
[
\vec{r}_{k+1} = \vec{b} - A \vec{x}_{k+1} \\
\vec{r}_{k+1} = \vec{b} - A (\vec{x}_k + \alpha_k \vec{r}_k) \\
\vec{r}_{k+1} = \vec{r}_k - \alpha_k \underbrace{A \vec{r}_k}_{\text{already computed for } \alpha_k}
]
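A direct Python transcription of this idealized algorithm (the tolerance and the iteration cap are my additions, since the exact condition $\vec{r}_k = \vec{0}$ is never hit in floating point; the cached $A \vec{r}_k$ product from above is reused for both the step length and the residual update):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Idealized steepest descent for a symmetric positive definite A.
    Uses one matrix-vector product A @ r per iteration (cached in Ar)."""
    x = x0.copy()
    r = b - A @ x
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ar = A @ r                       # single matvec, reused below
        alpha = (r @ r) / (r @ Ar)       # exact line search
        x = x + alpha * r                # take the step
        r = r - alpha * Ar               # cheap residual update
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))   # approx. [0.0909..., 0.6363...]
```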
Projection method
Since this method - even though guaranteed to converge - is pretty slow, one can
extend the idea to a projective method. The idea is similar to the steepest
descent step - but one performs the minimization in a projective subspace
that is spanned by some vectors $span(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k)$.
This idea arises when one looks at the result updating step inside the steepest
descent algorithm:
[
\vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{r}_k \\
\vec{x}_{k+1} = \vec{x}_0 + \alpha_0 \vec{r}_0 + \alpha_1 \vec{r}_1 + \ldots + \alpha_k \vec{r}_k \\
\to \vec{x}_{k+1} \in \vec{x_0} + span(\vec{r}_0, \vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k)
]
This will allow one to reduce the complexity of the exact line search - for example
when using the conjugate gradient method or the generalized minimal residual (GMRES)
method.
Conjugate gradient method
The conjugate gradient (CG) method is the most used method for symmetric positive
definite matrices. The basic idea is, as mentioned above, that the exact line search
is faster in a subspace $V_k$ than in the overall solution space. In case one
chooses $V_k$ wisely one can even expect the exact solution to lie inside a low
dimensional subspace of the whole problem and thus expect fast convergence. The idea is to
combine the low computational cost of steepest descent compared to generalized
methods (which is what limits the conjugate gradient method to symmetric positive definite
matrices) with the convergence speed of projective methods.
The basic building blocks are:
- Choose some $\vec{p}_k$ such that $\vec{x}_{k+1} = \text{argmin}_{\vec{x} \in \vec{x}_0 + span(\vec{p}_1, \ldots, \vec{p}_k)} \phi(\vec{x})$
which also means that $\vec{x}_{k+1} = \text{argmin}_{\vec{x} \in \vec{x}_k + span(\vec{p}_k)} \phi(\vec{x})$
- Do a line search on each iteration, which again works using $\alpha_k = \frac{\vec{p}_k^T \vec{r}_k}{\vec{p}_k^T A \vec{p}_k}$
- Make sure that $\vec{x}_{k+1} \neq \vec{x}_k$. This works by requiring that
$\vec{p}_k$ contains components that have not been contained in the previous projective
subspace $V_{k-1}$
The third step can simply be realized symbolically by first projecting the
residuum vector $\vec{r}_k$ onto the space $V_{k-1}$:
[
\vec{z}_k = \mathfrak{P}(AK_{k-1}) * \vec{r}_k
]
Thus $\vec{z}_k$ only contains the elements of $\vec{r}_k$ that are contained in
the previous projective space. Subtracting this from $\vec{r}_k$ yields all
components that lie outside this given subspace:
[
\vec{p}_k = \frac{1}{\gamma_k} (\vec{r}_k - \vec{z}_k)
]
As one can see the subspace one is spanning that way is $\mathfrak{K}_k = span(\vec{r}_0, A \vec{r}_0, \ldots, A^k \vec{r}_0)$.
The dimension of this space increases by at most one each iteration
and thus $dim \mathfrak{K}_k \leq k+1$.
Explicitly inserting and compactifying yields a simple expression without
explicit projections into the subspace:
[
\vec{p}_k = \vec{r}_k - \frac{\vec{p}_{k-1}^T A \vec{r}_k}{p_{k-1}^T A \vec{p}_{k-1}} \vec{p}_{k-1}
]
This is the conjugate gradient algorithm - one can imagine that one is always
taking the steepest possible step with the optimal length along the direction
of each residual in a space orthogonal to the previous subspace. When looking at
the iteration $\mathfrak{K}_k = span(\vec{r}_0, A \vec{r}_0, \ldots, A^k \vec{r}_0)$
that spans the subspace one can again see a power iteration as when
calculating Eigenvectors - which also intuitively explains the high speed of projective
methods. First they minimize along the approximate direction of the first
principal vector, then along the second one, etc. - it's not exact though,
since they don't calculate the first Eigenvector but only use the
residual that is pushed heavily towards the direction of the first Eigenvector -
but the process is similar, especially when the Eigenvalues are
well separated.
- $\vec{r}_0 = \vec{b} - A \vec{x}_0$
- $k = 0$
- While $\vec{r}_k \neq 0$:
    - If $k = 0$: $\vec{p}_0 = \vec{r}_0$
    - Else:
        - $s_k = - \frac{\vec{p}_{k-1}^T A \vec{r}_k}{\vec{p}_{k-1}^T A \vec{p}_{k-1}}$
        - $\vec{p}_k = \vec{r}_k + s_k * \vec{p}_{k-1}$
    - $\alpha_k = \frac{\vec{p}_k^T \vec{r}_k}{\vec{p}_k^T A \vec{p}_k}$
    - $\vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{p}_k$
    - $\vec{r}_{k+1} = \vec{b} - A \vec{x}_{k+1}$
    - $k = k + 1$
- $\vec{x} = \vec{x}_k$
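The same listing as a small Python sketch (again with a tolerance instead of the exact $\vec{r}_k \neq \vec{0}$ test; this is a naive textbook transcription, not an optimized implementation):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient for a symmetric positive definite A,
    following the listing above."""
    n = b.shape[0]
    max_iter = max_iter or n
    x = x0.copy()
    r = b - A @ x
    p = None
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        if k == 0:
            p = r.copy()                              # first direction: the residual
        else:
            s = -(p @ (A @ r)) / (p @ (A @ p))        # remove the old search direction
            p = r + s * p
        alpha = (p @ r) / (p @ (A @ p))               # exact line search
        x = x + alpha * p
        r = b - A @ x                                 # recompute the residual
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))          # approx. [0.0909..., 0.6363...]
```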
Generalized minimal residual method (GMRES)
The generalized minimal residual method is one of the methods one can apply
to general indefinite, non-symmetric systems of linear equations. It approximates
a solution vector in a Krylov subspace by minimizing the Euclidean norm
of the residual (i.e. by choosing the Krylov subspace representation of the solution
with minimal residual norm) at each step:
[
\vec{r}_n = \vec{b} - A * \vec{x_n} \\
\mid \vec{r}_n \mid \to min.
]
The vector $\vec{x}_n$ is assumed to be an element of the affine space built from the
initial guess and the n-th Krylov subspace $K_n = K_n(A, \vec{r}_0)$:
[
\vec{x}_n \in \vec{x}_0 + K_n
]
Note that this is particularly efficient when the least squares problem in each
iteration is solved using a QR decomposition
implemented with Givens rotations. This is the
case since the Arnoldi iteration performs a full Gram-Schmidt orthogonalization for
each new vector of the iteratively created basis, and thus updating the decomposition
only costs $\mathfrak{O}(k)$ operations at the $k$-th step.
The first step is to build the Krylov subspace using not a random vector but
the initial residual for an arbitrary initial guess $\vec{x}_0$:
[
\vec{r}_0 = \vec{b} - A * \vec{x}_0
]
The initial vector for the Arnoldi iteration thus is the unit vector pointing
into the direction of the initial residual:
[
\vec{v}_1 = \frac{1}{\mid \vec{r}_0 \mid} \vec{r}_0
]
The approximated vector $\vec{x}_n$ after the n-th iteration can be written
simply as:
[
Q_n = \begin{pmatrix} \vec{v}_1 & \vec{v}_2 & \ldots & \vec{v}_n \end{pmatrix} \in \mathfrak{R}^{N \times n} \\
\vec{x}_n = \vec{x}_0 + Q_n * \vec{y}_n
]
If one looks closely one sees that one is basically building a linear combination
of the first $n$ basis vectors $\vec{v}_1, \ldots, \vec{v}_n$ that - as for the power iteration / Arnoldi
algorithm - span directions dominated by the Eigenvectors of the matrix corresponding to the
largest Eigenvalues - so one basically tries to isolate the components with
the biggest influence on the projective transformation and builds a linear combination
along these principal axes. This is the same as during the subspace iterations
of the conjugate gradient method.
The main difference is that one now does not do an exact line search in the
direction of the maximum gradient in orthogonal direction to the previous
subspace (as for the CG method above) but tries to minimize the residuum in a
general way in the current subspace.
In this case one has to choose $\vec{y}_n$ such that:
[
\mid \vec{r}_n \mid \to min \\
\mid \vec{b} - A * \vec{x}_n \mid \to min \\
\mid \vec{b} - A * \left( \vec{x}_0 + Q_n * \vec{y}_n \right) \mid \to min \\
\mid \underbrace{\vec{b} - A * \vec{x}_0}_{\beta * \vec{v}_1} - A * Q_n * \vec{y}_n \mid \to min \\
\beta = \mid \vec{r}_0 \mid
]
Thus on each iteration one minimizes the residual over one more basis vector
of the Krylov subspace - i.e. one approximates the action of the inverse
matrix by a component along one more of the approximated Eigenvector directions.
This next basis vector is orthogonal to the previous vectors though, because it has
been orthogonalized during the Arnoldi iteration.
The algorithm is finished whenever the minimization reduces
the residual as far as desired or when an exact solution is found (i.e. the residual
is zero - which will generally not be the case even for exactly solvable systems
due to rounding errors, as usual for floating point algorithms) - of course this heavily
depends on the problem that one tries to solve and on the matrix (preconditioning, etc.).
It's only guaranteed to converge in the last step - which then is equivalent to
directly solving the system of equations using least squares.
The main problem implementing GMRES is the least squares minimization step to
find an appropriate $\vec{y}_n$. One can easily show from the above equations that
the minimization of
[
\mid \vec{b} - A * \vec{x}_0 - A * Q_n * \vec{y}_n \mid \to min
]
is equivalent to the minimization
[
J(\vec{y}) = \mid \mid \vec{r}_0 \mid \vec{v}_1 - A * Q_n * \vec{y}_n \mid
]
This can then be rewritten a little bit further using the upper Hessenberg
matrix $H_k$. Recall that an upper Hessenberg matrix is a matrix with
zero elements below the first subdiagonal (here it is of size $(k+1) \times k$ rather
than square). This works since the Arnoldi iteration
yields
[
A \vec{q}_k = \sum_{i = 1}^{k+1} h_{i,k} \vec{q}_i \\
A Q_{k} = Q_{k+1} H_k \\
H_k \in \mathfrak{R}^{(k+1) \times k} \\
H_k = \begin{pmatrix} h_{1,1} & h_{1,2} & \ldots & h_{1,k} \\ h_{2,1} & h_{2,2} & \ldots & h_{2,k} \\ 0 & h_{3,2} & \ldots & h_{3,k} \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & 0 & h_{k+1, k} \end{pmatrix}
]
[
J(\vec{y}) = \mid \mid \vec{r}_0 \mid \vec{v}_1 - Q_{n+1} * H_{n} * \vec{y}_n \mid
]
The next step is just to factor out $Q_{n+1}$, using $\mid \vec{r}_0 \mid \vec{v}_1 = \beta \, Q_{n+1} \vec{e}_1$:
[
J(\vec{y}) = \mid Q_{n+1} \left( \beta \vec{e}_1 - H_n * \vec{y} \right) \mid \\
J(\vec{y}) = \mid \beta \vec{e}_1 - H_n \vec{y} \mid
]
The factor $Q_{n+1}$ can be dropped since its columns are orthonormal - its
application only rotates the vector $\beta \vec{e}_1 - H_n \vec{y}$ and thus does
not change its magnitude, which is then minimized using least squares via a QR factorization
with Givens rotations.
Since the QR factorization has to produce an upper triangular matrix and the
matrix $H_k$ only has at most $k$ non-zero elements below the diagonal (all on its
first subdiagonal), at most $O(k)$ Givens rotations are required - in this case
solving the least squares problem using Givens rotations and backsubstitution
results in similar or even better performance than Householder reflections
when applying the Arnoldi iteration as described above.
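A hedged sketch of the single building block discussed here - a Givens rotation acting on rows $i$ and $i+1$ that zeroes a subdiagonal entry of a Hessenberg matrix (the function names are my own for illustration, not taken from the GIST linked below):

```python
import numpy as np

def givens_rotation(a, b):
    """Return (c, s) with c = cos(theta), s = sin(theta) such that the rotation
    [[c, s], [-s, c]] maps (a, b) onto (r, 0)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def apply_givens(H, i, c, s):
    """Rotate rows i and i+1 of H in place - zeroes H[i+1, i] when (c, s)
    was computed from (H[i, i], H[i+1, i])."""
    H[i:i + 2, :] = np.array([[c, s], [-s, c]]) @ H[i:i + 2, :]

# zero out the subdiagonal of a tiny Hessenberg matrix column by column
H = np.array([[2.0, 1.0], [3.0, 4.0], [0.0, 5.0]])
for i in range(2):
    c, s = givens_rotation(H[i, i], H[i + 1, i])
    apply_givens(H, i, c, s)
print(np.round(H, 10))   # upper triangular (plus a zero last row)
```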
Summarized the algorithm is pretty simple:
- $\vec{q}_1 = \frac{\vec{b}}{\mid \vec{b} \mid}$
- $h_{1,0} = \mid \vec{b} \mid$
- $\vec{x}_0 = \vec{0}$
- $\vec{\lambda}_1 = 0$
- $\vec{b} = \begin{pmatrix} \mid \vec{b} \mid \\ 0 \\ \ldots \\ 0 \end{pmatrix}$
- Repeat for $k = 1 \ldots m$:
    - Check if the residual $h_{k+1, k} \leq threshold$. If so we're finished and
      return $\vec{x} = Q_{k} \vec{\lambda}_k$
    - $\vec{q}_{k+1} = A \vec{q}_k$
    - Repeat for $i = 1 \ldots k$ (this is the orthogonalization of Arnoldi):
        - $h_{i,k} = \vec{q}_i^T \vec{q}_{k+1}$ (this also builds the Hessenberg
          matrix for the least squares minimization later on, column by column - each iteration
          adds one more column on the right)
        - $\vec{q}_{k+1} = \vec{q}_{k+1} - h_{i,k} * \vec{q}_i$ (Gram-Schmidt orthogonalization)
    - $h_{k+1, k} = \mid \vec{q}_{k+1} \mid$
    - $\vec{q}_{k+1} = \frac{\vec{q}_{k+1}}{h_{k+1,k}}$
    - Solve the least squares problem using $k$ Givens rotations. Repeat for $i = 1 \ldots k-1$:
        - $\vec{h}_k = G(i, i+1, \theta_i) \vec{h}_k$ (note the $\theta_i$ are the rotations from the previous subspace iterations)
    - $\vec{b} = G(k,k + 1, \theta_k) \vec{b}$
    - $\vec{h}_k = G(k,k + 1, \theta_k) \vec{h}_k$
    - Perform backsubstitution to determine $\vec{\lambda}_k$ from $R_k * \vec{\lambda}_k = \vec{b}$
      with $R_k$ being the upper triangularized version of $H_k$, i.e. $R_k = \begin{pmatrix} R_{k-1} & \vec{h}_k \end{pmatrix}$
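To tie the pieces together, here is a compact GMRES sketch in Python. It follows the structure above but, for brevity, solves the small least squares problem $\mid \beta \vec{e}_1 - H \vec{y} \mid \to min$ with `np.linalg.lstsq` instead of the incremental Givens rotations and backsubstitution described above - so it is a simplification, not the Givens based variant:

```python
import numpy as np

def gmres(A, b, x0=None, max_iter=50, tol=1e-10):
    """Compact GMRES sketch: Arnoldi builds Q and the Hessenberg matrix H,
    the small least squares problem min || beta*e1 - H y || is solved with
    np.linalg.lstsq instead of the Givens/backsubstitution scheme above."""
    n = b.shape[0]
    x0 = np.zeros(n) if x0 is None else x0
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta <= tol:
        return x0
    Q = np.zeros((n, max_iter + 1))
    H = np.zeros((max_iter + 1, max_iter))
    Q[:, 0] = r0 / beta
    for k in range(max_iter):
        w = A @ Q[:, k]                          # new Krylov direction
        for i in range(k + 1):                   # Arnoldi / Gram-Schmidt step
            H[i, k] = Q[:, i] @ w
            w = w - H[i, k] * Q[:, i]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] > tol:
            Q[:, k + 1] = w / H[k + 1, k]
        # minimize || beta*e1 - H_k y || over the current subspace
        e1 = np.zeros(k + 2)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[: k + 2, : k + 1], e1, rcond=None)
        x = x0 + Q[:, : k + 1] @ y
        if np.linalg.norm(b - A @ x) <= tol or H[k + 1, k] <= tol:
            return x
    return x

A = np.array([[2.0, 1.0, 0.0], [0.5, 3.0, 1.0], [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = gmres(A, b)
print(x, np.linalg.norm(b - A @ x))              # solution and residual norm
```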
Short demonstration in Python
The following GitHub GIST
contains a simple and naive implementation of the described algorithms using the
QR decomposition using Givens rotations and
the Least squares solver using QR decomposition. The
notebook is also available as a PDF.
This article is tagged: Math, Programming, Physics, How stuff works, Machine learning, Computational linear algebra