Linear time dependent correlations using bivariate correlation and shifts

08 May 2022 - tsp
Last update 08 May 2022
Reading time 7 mins

Introduction

Sometimes the situation arises that one wants to determine whether two recorded time-series datasets (i.e. two datasets sampled over time) have some causal correspondence to each other. The first step is usually to look for correlations - this can also be applied to all time series in a dataset in a fully automated fashion to filter for candidates for such causation. Keep in mind that correlation does not imply causation (an example of this will be presented at the end).

A simple method to check whether two datasets are linearly correlated is to calculate the bivariate correlation coefficient - also called the Pearson correlation coefficient. This coefficient indicates whether there is a linear dependence between two variables; it is defined via covariances and spans the range from $-1$ to $+1$:

[ \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} ]

Since the covariance can be expressed in terms of expectation values:

[ cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)) ]

the bivariate correlation coefficient can also be expressed via expectation values:

[ \rho_{X,Y} = \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X \sigma_Y} ]

Note that the quantities $\mu$ and $\sigma$ have their usual meanings: $\mu$ denotes the mean and $\sigma$ the standard deviation of the respective variable.

Rearranging the above definition a little one arrives at a pretty simple formula that is also more numerically stable than a brute-force evaluation:

[ r_{XY} = \frac{\left(\sum_{i=1}^{N} x_i y_i\right) - N \mu_x \mu_y}{(N-1) \sigma_x \sigma_y} ]

To apply this correlation coefficient to time-series data the samples have to be aligned in time, i.e. $x_i$ and $y_i$ have to be sampled at the same instants. If this is not the case one might perform interpolation (or at least some kind of linearization) beforehand.
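The rearranged formula translates directly into a few lines of NumPy. The following is a minimal sketch; the function name pearson_r and the use of NumPy are my own choices and not taken from the notebook linked at the end:

import numpy as np

def pearson_r(x, y):
    # Bivariate (Pearson) correlation coefficient of two equally sampled series.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    # ddof=1 yields the sample standard deviation matching the (N-1) term
    sigma_x, sigma_y = x.std(ddof=1), y.std(ddof=1)
    # (sum x_i y_i - N mu_x mu_y) / ((N-1) sigma_x sigma_y)
    return (np.dot(x, y) - n * mu_x * mu_y) / ((n - 1) * sigma_x * sigma_y)

The result can be cross-checked against np.corrcoef(x, y)[0, 1], which computes the same coefficient.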

Example for the Pearson correlation coefficient

So as a first example let’s look at some evaluated Pearson correlation coefficients for three different cases:

First let’s look at the correlation coefficient for a linear function of the kind

[ y = k x + d, \quad k > 0 ]

To make things a little more statistical some random noise has been added:

Direct linear correlation

As one can see the correlation coefficient is - as expected - close to $1$. Now let's assume a negative slope ($k < 0$):

Direct linear correlation

Again as expected the correlation coefficient is close to $-1$. As a last try let’s apply this to a simple parabolic function:

[ y = a x^2 + b ]

Applying the correlation coefficient to a parabolic function
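For reference, the three test cases can be reproduced along the following lines. The slopes, offsets and noise amplitude used for the plots are not stated in the text, so the concrete numbers below are just illustrative assumptions; pearson_r is the helper defined above:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
noise = rng.normal(0.0, 1.0, x.shape)

y_pos = 3.0 * x + 2.0 + noise               # k > 0: r close to +1
y_neg = -3.0 * x + 2.0 + noise              # k < 0: r close to -1
y_par = 0.5 * (x - 5.0) ** 2 + 1.0 + noise  # symmetric parabola: r close to 0

for label, y in (("k > 0", y_pos), ("k < 0", y_neg), ("parabola", y_par)):
    print(label, pearson_r(x, y))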

Timeseries

Now let’s apply the correlation coefficient to time series. For the sake of simplicity the time series used in this example will be sine waves with different phase shifts. It’s assumed these sines are sampled at the same points in time:

The functions $f_A$ and $f_B$ will have zero phase shift, $\phi_A = \phi_B = 0$. The amplitudes will be $a_A = 10$ and $a_B = 2$, so there is a linear dependence between those two functions. All functions will use the same frequency $\omega = 0.05$ and the sine will be evaluated in radians. Again applying the bivariate correlation coefficient yields the expected result of $r \approx 1$:

Two sine waves without phase shift

When one introduces a phase shift (i.e. a shift in the time domain) of $\phi_C = -30$ this changes dramatically:

Two shifted sine waves
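Both cases can be reproduced with a short sketch. The number of samples and the interpretation of $\phi_C = -30$ as a shift of 30 samples in the time domain are my assumptions; pearson_r is the helper from above:

import numpy as np

t = np.arange(0, 1000)            # equally spaced sample indices
omega = 0.05                      # shared angular frequency (radians per sample)

f_a = 10.0 * np.sin(omega * t)           # a_A = 10, phi_A = 0
f_b = 2.0 * np.sin(omega * t)            # a_B = 2,  phi_B = 0
f_c = 2.0 * np.sin(omega * (t - 30.0))   # f_B delayed by 30 samples (phi_C = -30)

print(pearson_r(f_a, f_b))   # approximately +1
print(pearson_r(f_a, f_c))   # noticeably smaller due to the shift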

Even though the functions are merely shifted in the time domain the correlation coefficient no longer detects the linear dependence. This can happen with any delayed system. To get a feeling for why one would still expect a result that indicates linear dependence, one might look at a plot that shows how the data values of those functions relate to each other:

Two sine waves without phase shift

Two sine waves with phase shift

The relation for the second pair does not really look linear any more - that’s due to the time shift. But if one removed the phase shift one would again arrive at a plot similar to the first one.

To solve that problem one can simply evaluate the correlation coefficient for a given window of time shifts. This also allows one to determine the phase relation between the two datasets. Doing this for the two previously shown pairs of functions again:

Two functions without phase shift

Two functions with phase shift

As one can see, for the non-shifted, linearly dependent functions the correlation coefficient has its maximum of nearly one at a lag of $0$. This indicates the functions are linearly dependent (or at least one cannot rule out linear dependence) with zero phase shift. The second pair on the other hand shows the maximum of the correlation coefficient at a time shift corresponding to the phase shift introduced before. So one can use this method to detect potential delayed correlations in time-series data.

In case the data has not been sampled at the same positions in the time domain one can of course perform linearization or interpolation on the data as usual.
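The lag scan itself can be sketched as follows. The helper slides one series against the other and evaluates pearson_r for every lag in a window; function and variable names are again my own, and it reuses f_a and f_c from the sine-wave sketch above:

import numpy as np

def shifted_correlation(x, y, max_lag):
    # Pearson correlation of x[i] against y[i + lag] for every lag in [-max_lag, max_lag].
    # A peak at a positive lag means y is delayed by that many samples relative to x.
    lags = np.arange(-max_lag, max_lag + 1)
    coeffs = []
    for lag in lags:
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:lag]
        coeffs.append(pearson_r(a, b))
    return lags, np.array(coeffs)

lags, r = shifted_correlation(f_a, f_c, max_lag=60)
print("best lag:", lags[np.argmax(r)])   # close to the 30 sample shift introduced above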

A real world example

To a quantum physicist this is reminiscent of the $g^{(2)}$ correlation function, which is defined as

[ g^{(2)}(\tau) = \frac{E(I(t) I(t + \tau))}{E(I(t))^2} ]

In this context $I(t)$ would be the intensity or the number of counted photons during an experiment and $\tau$ would be the introduced lag. This is indeed also an autocorrelation function - but one that does not include information about the covariance of the data.
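For completeness, a minimal numerical sketch of estimating $g^{(2)}(\tau)$ from a sampled intensity trace might look as follows. It assumes equally spaced samples and uses the mean over the whole trace as an estimate of $E(I(t))$; the function name g2 is my own:

import numpy as np

def g2(intensity, max_lag):
    # Estimate g2(tau) = E(I(t) I(t + tau)) / E(I(t))^2 for integer lags tau >= 0.
    I = np.asarray(intensity, dtype=float)
    taus = np.arange(max_lag + 1)
    numerator = np.array([np.mean(I[:len(I) - tau] * I[tau:]) for tau in taus])
    return taus, numerator / I.mean() ** 2

In contrast to the shifted Pearson coefficient this estimator does not subtract the means, which is exactly the missing covariance information mentioned above.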

So what would be a good real world example of using such shifted correlation coefficients?

For example one might ask whether government measures during a pandemic triggered excess deaths or whether they were a reaction to them. Since I had this data at hand at the time of writing this article I'll use it as an example. The question might arise when one looks at the correlation between excess deaths and the stringency index of many countries worldwide (the data has been fetched from Our World In Data):

Correlation between excess deaths and government stringency

Intuitively one assumes to know the answer, but as usual it's better to take a look at the data than to rely on assumptions. To solve that “mystery” one can simply look at the peak of the shifted correlations. A shift to the right indicates that the increase in excess deaths could have been the cause of stringent rules but not the other way round; a peak before zero lag would not rule out a correlation in the other direction (though one would have to think about other possible causal connections - for example upfront planning for an expected situation. Remember: correlation does not imply causation):

Shifted correlations for Austria

As one can see the correlation coefficient rises towards 1 only shortly before zero lag. This indicates it's highly likely that the actions were caused by the effect (if there is a causal relation at all) and not the other way round. Comparing those graphs for different countries also shows how differently they react when it comes to relaxation after the effect has peaked:

Shifted correlations for Italy

Please keep in mind that this is really only showing correlations and not causal behavior.

The code

The Jupyter notebook that has been used for the first part of this blog post is available as a GitHub Gist.

This article is tagged: Math, Programming, How stuff works, Tutorial, Statistics

