Published: Sun 12 May 2024
By Stefan
In Statistics .
For detecting outliers in non-normally distributed data a common approach is computing the modified z-scores .
Instead of the standard deviation the median absolute deviation is used as a measure of variability. The metric is defined as the median of the the absolute deviations from the sample median.
$${\tilde {X}}=\operatorname {median} (X)$$
$$\operatorname {MAD} =\operatorname {median} (|X_{i}-{\tilde {X}}|)$$
The median absolute deviation can then be used to derive the modified z-scores. To make the scores comparable to the standard z-scores a normalization factor is added.
$$\operatorname{Modified\_Z} = \frac{0.6745\thinspace(X_{i}-{\tilde {X}})}{\operatorname {MAD}}$$
One might wonder where the normalization constant comes from and how it can be derived.
A naive empirical approach is to compute the median absolute deviation for a sufficiently large sample from the standard normal distribution.
$$Z \sim \mathcal{N}(0, 1)$$
$$\operatorname{MAD}(Z) \approx 0.6745$$
>>> Z = scipy . stats . norm . rvs ( size = 100000000 , random_state = 1 )
>>> round ( scipy . stats . median_abs_deviation ( Z ), 4 )
0.6745
Since the mean and the median of the standard normal distribution are both \(\mu = 0\) the median absolute deviation simplifies to the median of the absolute values of the sample.
$$\operatorname{MAD}(Z) = \operatorname {median} (|Z|) \approx 0.6745$$
>>> round ( numpy . median ( numpy . abs ( Z )), 4 )
0.6745
Applying the absolute function to a sample that follows the standard normal distribution results in a standard half-normal distribution where the quantile is defined as
$$Q(F)= {\sqrt {2}}\operatorname {erf} ^{-1}(F)$$
and therefore the median can be computed as
$$\operatorname {median} (|Z|) = {\sqrt {2}}\operatorname {erf} ^{-1}(1/2)\approx 0.6745$$
While the inverse error function has no closed-form solution it can be efficiently approximated, providing a much more convenient method for deriving the normalization factor.
SciPy provides various ways to compute the median of the standard half-normal distribution.
>>> scipy . stats . halfnorm . median ()
0.6744897501960817
>>> scipy . stats . halfnorm . ppf ( 0.5 )
0.6744897501960817
>>> scipy . stats . norm . ppf ( 0.75 )
0.6744897501960817
>>> math . sqrt ( 2 ) * scipy . special . erfinv ( 0.5 )
0.6744897501960818
>>> scipy . special . ndtri ( 0.75 )
0.6744897501960817
The SciPy implementation of the inverse error function uses different approximation approaches depending upon the input value. The original implementation can be found in ndtri.c
. The algorithm essentially performs a rational polynomial approximation. A minimal reimplementation of ndtri.c
yields the same approximation. The implementation may have originated from Methods and Programs for Mathematical Functions .