Statistics 101

- tsp
Last update 27 Jun 2020

Descriptive statistics

The basic idea behind descriptive statistics is to describe a large amount of data by a few characteristic figures that allow some judgement about the data as a whole. Descriptive statistics only tries to represent existing measured data; it’s the complement to inferential statistics, which tries to predict unknown values, quantify unknown values or test hypotheses.

Frequency diagrams and histograms

An easy way to grasp some information about a dataset is to plot it in a graph. The two most basic diagrams are frequency diagrams and histograms. The frequency diagram is mostly useful for discrete values: one simply counts the absolute occurrence of each value and plots that count as the $y$ value of a diagram.

For a sample dataset of $1,1,1,2,2,3,3,4,4,4,4,4,4,5,5,5$ this would look like the following:

# Gnuplot data file:
1 3
2 2
3 2
4 6
5 3

# Gnuplot commands:
set title "Frequency diagram"
set xlabel "Values"
set ylabel "Count"
unset key
unset grid
unset border
set boxwidth 0.7
set style fill solid
set size 1,1
set yrange [0:6]
plot "data1.dat" using 1:2 with boxes
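
The counts for such a data file can be derived directly from the raw dataset. The following is a minimal Python sketch (the variable names are chosen for illustration only) that prints one "value count" pair per line in the same two-column layout:

```python
from collections import Counter

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

# Count the absolute number of occurrences of each distinct value.
frequencies = Counter(data)

# Print one "value count" pair per line, ordered by value.
for value in sorted(frequencies):
    print(value, frequencies[value])
```

The output could be redirected into a file and plotted with the gnuplot commands shown above.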


The frequency diagram provides a fast overview for discrete values but is unusable for continuous scales. This is because on a continuous scale there is an infinite number of possible values between any two ticks of the diagram. In the case of an ideal measurement the probability of exactly the same value ever appearing again goes to zero, so (almost) every value would be counted only once.

To circumvent this problem one can use histograms. Histograms use bins to accumulate all data that falls into the same range. For example, when measuring lengths one might get a set of data like the following:

0.34, 0.37, 0.341, 0.024, 0.15, 0.047, 0.968, 0.143, 0.85, 0.123, 0.43, 0.23, 0.5122, 0.4123, 0.123, 0.3241, 0.163

Sorting these values into four classes gives:

Class      Class width  Class center  Count
0.0 - 0.2  0.2          0.1           7
0.2 - 0.4  0.2          0.3           5
0.4 - 0.5  0.1          0.45          2
0.5 - 1.0  0.5          0.75          3

The catch in this case is that the bins have been chosen to have different widths. This is of course an arbitrary choice and may not make much sense. The problem with different widths is that, without some kind of rescaling, the plot gives a totally wrong impression of the distribution. Because of this one divides the height of each bar by its bin width, so that the area of each bar (and thus the total area) represents the count again:

# Gnuplot data file containing
# central position, bin width, count, label
0.1	0.2	7	"0.0-0.2"
0.3	0.2	5	"0.2-0.4"
0.45	0.1	2	"0.4-0.5"
0.75	0.5	3	"0.5-1.0"

# Wrong graph
set title "Wrong histogram"
set style fill solid noborder
set yrange [0:*]
plot "data2.dat" using 1:3:2:xtic(4) with boxes notitle

# Correct graph
set title "Histogram rescaled correctly"
set style fill solid noborder
set yrange [0:*]
plot "data2.dat" using 1:($3/$2):2:xtic(4) with boxes notitle


In this case the raw width of each histogram bin has been used as the scaling factor, so the second axis shows counts per unit width and its absolute values have no direct meaning on their own. One could of course rescale further, for example to probabilities, to get values that are easier to interpret.
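
The binning and the rescaling can be sketched in Python as follows. The bin boundaries are taken from the class table; the names `edges`, `counts`, `height` and `density` are illustrative, and the sketch assumes that every value lies inside the outermost boundaries:

```python
from bisect import bisect_right

lengths = [0.34, 0.37, 0.341, 0.024, 0.15, 0.047, 0.968, 0.143, 0.85,
           0.123, 0.43, 0.23, 0.5122, 0.4123, 0.123, 0.3241, 0.163]
edges = [0.0, 0.2, 0.4, 0.5, 1.0]  # boundaries of the four classes

counts = [0] * (len(edges) - 1)
for value in lengths:
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} lies outside of all bins")
    # bisect_right locates the half-open bin [lower, upper) holding the value.
    index = bisect_right(edges, value) - 1
    # A value equal to the last boundary is put into the last bin.
    index = min(index, len(counts) - 1)
    counts[index] += 1

for lower, upper, count in zip(edges, edges[1:], counts):
    width = upper - lower
    height = count / width                    # bar area equals the count
    density = count / (width * len(lengths))  # total area equals one
    print(f"{lower:.1f}-{upper:.1f} count={count} "
          f"height={height:.1f} density={density:.3f}")
```

The `density` column is one possible rescaling to probabilities: with it the areas of all bars sum up to one.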

Some basic figures

The following lists some figures commonly used to characterize data. Basically these describe the central position (or more generally the position) of the data, its spread and its skew or symmetry.

Central position / description of position

The central position tells one where the (proposed) center of the data lies. The two most commonly used figures are the arithmetic mean and the median.

A figure $l$ for central position has to fulfill the properties $l(a*x + b) = a * l(x) + b$ and $\text{min}(x) \leq l(x) \leq \text{max}(x)$.

Mode

The mode is the most common value contained inside a dataset. For the datalist $1,1,1,2,2,3,3,4,4,4,4,4,4,5,5,5$ one would determine the mode to be $4$ since it occurs 6 times.
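
A minimal Python sketch for determining the mode could look like the following. Since a dataset may contain several values that share the highest count, the sketch collects all of them (this handling of ties is an assumption, the text above only covers the case of a single mode):

```python
from collections import Counter

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

counts = Counter(data)
highest = max(counts.values())

# Collect every value that reaches the highest count.
modes = sorted(value for value, count in counts.items() if count == highest)
print(modes, highest)
```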

Median, Quartiles and 5-point summary (boxplot)

The median is a value that can be evaluated as soon as the order of values is determinable (i.e. it can be used for ordinal data). It’s defined to be the value that sits in the center of the ordered list.

$m = \begin{cases} x_{\frac{n+1}{2}} & n \text{ odd} \\ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & n \text{ even} \end{cases}$

As one can see from the definition of the median, in the even case the addition of values (and therefore the distance between them) also has to have a meaning.

The median minimizes $h(x) = \sum_{i=1}^{n} \mid x_i - x \mid$.

If one - for example - has measured the datalist $1,1,1,2,2,3,3,4,4,4,4,4,4,5,5,5$ (i.e. $n=16$ values, so $n$ is even and one has to take $m = \frac{x_8 + x_9}{2}$), one gets the median $m=\frac{4+4}{2}=4$.

Note that for odd counts the median has to be one of the data values but may not be contained in the datalist for even counts since in that case it’s calculated as an average.
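
The case distinction of the median definition translates directly into code. The following Python sketch uses a hypothetical helper `median`; note that Python lists are indexed from zero while the formula counts from one:

```python
def median(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: x_{(n+1)/2} in the 1-based notation of the formula.
        return ordered[middle]
    # Even count: average of x_{n/2} and x_{n/2+1}.
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
print(median(data))
```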

The median is one of the quantiles. On the ordered list these are roughly defined as $Q_\alpha = x_{\alpha * n}$. One can see that $\text{min}(x) = Q_{0}$, $\text{max}(x) = Q_{1}$ and the median $\text{med}(x) = Q_{0.5}$. Two further quantiles, the quartiles $Q_{0.25}$ and $Q_{0.75}$, are commonly used. On short datalists one can determine them by simply dividing the left half and the right half of the ordered values again, i.e. by taking the median of each half. Together these five values form the five-point summary of a dataset, which can be visualized in a boxplot:

# Gnuplot datafile
1
1
1
2
2
3
3
4
4
4
4
4
4
5
5
5

# Gnuplot commands
set style fill solid 0.25 border -1
set style boxplot outliers pointtype 7
set style data boxplot
set boxwidth 0.5
set pointsize 0.5
unset key
set yrange [0:6]
plot 'data3.dat' using (1):1


In this case one cannot read the median from the boxplot as a separate line, but one can easily recognize the minimum, the maximum, $Q_{0.25}$ and $Q_{0.75}$. One can also conclude that the median has to coincide with $Q_{0.25}$ or $Q_{0.75}$. The main power of interpreting a boxplot is based on the fact that each of the four segments between neighboring values of the five-point summary contains at least $25\%$ of the data points.
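
The five-point summary can be sketched in Python by splitting the ordered list into halves as described. The helper names are hypothetical, and for odd counts the sketch leaves the middle value out of both halves, which is only one of several common conventions:

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_point_summary(values):
    """Return (minimum, Q_0.25, median, Q_0.75, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]       # left half of the ordered list
    upper = ordered[n - half:]   # right half of the ordered list
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
print(five_point_summary(data))
```

For the sample datalist the median and $Q_{0.75}$ both evaluate to $4$, which matches the observation that the median coincides with one edge of the box.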

Arithmetic mean / average

The arithmetic mean is one of the most commonly found figures. It can be imagined as describing the center of mass of the datalist. It’s defined as

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

It’s of course not meaningful for ordinal scales because distance is a central element of meaningfully performing an addition of values.

It minimizes $h(x) = \sum_{i=1}^{n} (x_i - x)^2$.
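
The minimization properties stated for the median and for the arithmetic mean can be checked numerically. The following Python sketch simply evaluates both sums on a grid of candidate positions; the grid spacing of $0.01$ is an arbitrary choice made for this illustration:

```python
data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

def sum_of_absolute_deviations(x):
    return sum(abs(value - x) for value in data)

def sum_of_squared_deviations(x):
    return sum((value - x) ** 2 for value in data)

# Candidate positions 1.00, 1.01, ..., 5.00 covering the range of the data.
candidates = [i / 100 for i in range(100, 501)]

best_absolute = min(candidates, key=sum_of_absolute_deviations)
best_squared = min(candidates, key=sum_of_squared_deviations)
print(best_absolute, best_squared)
```

The first result is the position with the smallest sum of absolute deviations, the second one the position with the smallest sum of squared deviations.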

If one - for example - has measured the datalist $1,1,1,2,2,3,3,4,4,4,4,4,4,5,5,5$ the arithmetic mean calculates to

$\bar{x} = \frac{1+1+1+2+2+3+3+4+4+4+4+4+4+5+5+5}{16} = \frac{3*1+2*2+2*3+6*4+3*5}{16} = 3.25$

Note that the arithmetic mean doesn’t have to be contained inside the datalist. For a widely spread dataset it could even be the value inside the range of the data that lies farthest away from all data points.
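
As a short Python sketch, the arithmetic mean can be calculated either from the raw values or from the grouped counts used in the calculation above (the variable names are illustrative):

```python
from collections import Counter

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

# Direct form: sum of all values divided by their number.
mean = sum(data) / len(data)

# Grouped form: every distinct value weighted by its count.
grouped_mean = sum(value * count for value, count in Counter(data).items()) / len(data)

print(mean, grouped_mean)
print(mean in data)  # the mean does not have to be one of the data values
```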

Least median of squares (LMS)

This is a less commonly seen positional figure. It’s the center of the shortest interval that contains $\frac{n}{2} + 1$ datapoints. Since it minimizes the median of the squared differences it’s extremely robust against measurement errors.

Shorth

The shorth has a definition similar to the LMS above but uses the arithmetic mean of the data points inside the shortest interval instead of the center of that interval.
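
Both figures can be sketched in Python by sliding a window over the ordered data. The sketch assumes that $\frac{n}{2} + 1$ is evaluated with integer division for odd $n$ and that the first window wins in case of ties; both choices as well as the helper names are assumptions made for this illustration:

```python
def shortest_half(values):
    """Return the shortest run of n // 2 + 1 consecutive ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    h = n // 2 + 1  # number of points the interval has to contain
    # Compare the width of every window of h consecutive ordered values.
    start = min(range(n - h + 1),
                key=lambda i: ordered[i + h - 1] - ordered[i])
    return ordered[start:start + h]

def lms(values):
    """Center of the shortest interval."""
    window = shortest_half(values)
    return (window[0] + window[-1]) / 2

def shorth(values):
    """Arithmetic mean of the points inside the shortest interval."""
    window = shortest_half(values)
    return sum(window) / len(window)

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
print(lms(data), shorth(data))
```

A single far away value such as in the hypothetical list `[1, 2, 2, 3, 3, 3, 4, 100]` never enters the shortest interval, which illustrates the robustness of both figures.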

Spread

The positional figure alone doesn’t tell one much about the data: all data points may lie close to the positional measure or they may be far away from it. A measure of spread $\sigma$ has to fulfill

$\sigma(a*x + b) = \mid a \mid * \sigma(x)$ and $\sigma(x) \geq 0$

Standard deviation

The standard deviation is normally used together with the arithmetic mean. It measures the root mean square distance of the data values from the arithmetic mean:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$

The variance is defined to be the squared standard deviation.

For example - for the previous datalist $1,1,1,2,2,3,3,4,4,4,4,4,4,5,5,5$ with the previously calculated $\bar{x}=3.25$ - one can simply calculate the standard deviation as

$s = \sqrt{\frac{3(1-3.25)^2+2(2-3.25)^2+2(3-3.25)^2+6(4-3.25)^2+3(5-3.25)^2}{16}} \approx 1.39$

This is most commonly written as $3.25 \pm 1.39$.
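
The same calculation as a short Python sketch. It divides by $n$ exactly as the formula above does; the variant dividing by $n-1$ that many libraries offer for samples is not used here:

```python
from math import sqrt

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]

n = len(data)
mean = sum(data) / n

# Variance: mean of the squared distances from the arithmetic mean.
variance = sum((x - mean) ** 2 for x in data) / n
std = sqrt(variance)

print(f"{mean:.2f} +/- {std:.2f}")
```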

Interquartile range (IQR)

The IQR is defined as $IQR = Q_{0.75} - Q_{0.25}$. This is the length of the central interval that contains at least $50\%$ of the measurement data.

Span

The span could also be used instead of the IQR (i.e. calculating $\text{max}(x) - \text{min}(x)$). Of course one loses resistance against outliers.
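
The difference in robustness can be illustrated with a small Python sketch. The quartiles are determined as the medians of the left and right half of the ordered list (one of several conventions), and the outlier value of $50$ is a made-up example:

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def iqr_and_span(values):
    """Return the interquartile range and the span of the values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q25 = median(ordered[:half])      # median of the left half
    q75 = median(ordered[n - half:])  # median of the right half
    return q75 - q25, ordered[-1] - ordered[0]

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
print(iqr_and_span(data))
print(iqr_and_span(data + [50]))  # the same data with one hypothetical outlier
```

With the outlier added the span grows from $4$ to $49$ while the IQR only moves from $2$ to $2.5$.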

Skew

Skew has to fulfill $s(a*x + b) = \text{sgn}(a) * s(x)$ and has to vanish, $s(x) = 0$, for a symmetric distribution, i.e. when there is a center $c$ such that every data value $c + d$ is matched by a data value $c - d$.

Skew $\gamma$

$\gamma = \frac{\frac{1}{n} * \sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$

Skew coefficient $SK$

$SK = \frac{Q_{0.75} - 2*Q_{0.5} + Q_{0.25}}{Q_{0.75} - Q_{0.25}}$
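
Both skew figures can be evaluated for the sample datalist with the following Python sketch. It uses the standard deviation that divides by $n$ and determines the quartiles as medians of the two halves of the ordered list; the helper names are illustrative, and $SK$ is undefined whenever $Q_{0.75} = Q_{0.25}$:

```python
from math import sqrt

def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
n = len(data)
mean = sum(data) / n
s = sqrt(sum((x - mean) ** 2 for x in data) / n)

# Moment based skew: mean cubed deviation divided by s^3.
gamma = (sum((x - mean) ** 3 for x in data) / n) / s ** 3

# Quartile based skew coefficient.
ordered = sorted(data)
half = n // 2
q25 = median(ordered[:half])
q50 = median(ordered)
q75 = median(ordered[n - half:])
sk = (q75 - 2 * q50 + q25) / (q75 - q25)

print(f"gamma = {gamma:.3f}, SK = {sk:.3f}")
```

Both figures come out negative for this datalist, i.e. the longer tail lies on the side of the small values.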

Correlation

Correlation is defined for multivariate data (i.e. when one measures two or more properties of a system). Binary attributes can be visualized with a four-field table and described with the four-field correlation; bivariate continuous data can be visualized with a scatter plot and described with the empirical correlation.

Binary data

How four field analysis works will be described in the section about probability.

Continuous data

A scatter plot is simply a plot that displays one data coordinate on one axis and the other coordinate on the other axis. Of course both axes can be scaled to the required dimensions.

First one defines the standard scores for both dimensions separately:

$z_{x,i} = \frac{x_i - \bar{x}}{s_x}$

$z_{y,i} = \frac{y_i - \bar{y}}{s_y}$

$s_x$ and $s_y$ are the standard deviations and $\bar{x}$ and $\bar{y}$ the arithmetic means.

The empirical correlation coefficient can then be defined as

$r_{xy} = \frac{1}{n} \sum_{i=1}^{n} z_{x,i} * z_{y,i}$
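
The two steps (standard scores first, then the mean of their products) can be sketched in Python as follows. The sample coordinates are made up for this illustration, and the standard deviation divides by $n$ as in the definition given earlier:

```python
from math import sqrt

def standard_scores(values):
    """Return the standard scores (z values) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def correlation(xs, ys):
    """Empirical correlation coefficient of two equally long lists."""
    if len(xs) != len(ys):
        raise ValueError("both lists need the same length")
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical measurement pairs, not taken from any real dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(correlation(xs, ys))
```

For perfectly linear data such as $y_i = 2 * x_i$ the coefficient evaluates to $1$, for $y_i = -2 * x_i$ to $-1$.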

Regression

Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)

This webpage is also available via TOR at http://jugujbrirx3irwyx.onion/