For detecting outliers in non-normally distributed data a common approach is computing the modified z-scores.
Instead of the standard deviation the median absolute deviation is used as a measure of variability. The metric is defined as the median of the the absolute deviations from the sample median.
The median absolute deviation can then be used to derive the modified z-scores. To make the scores comparable to the standard z-scores a normalization factor is added.
One might wonder where the normalization constant comes from and how it can be derived.
A naive empirical approach is to compute the median absolute deviation for a sufficiently large sample from the standard normal distribution.
>>> Z = scipy.stats.norm.rvs(size=100000000, random_state=1)
>>> round(scipy.stats.median_abs_deviation(Z), 4)
0.6745
Since the mean and the median of the standard normal distribution are both \(\mu = 0\) the median absolute deviation simplifies to the median of the absolute values of the sample.
>>> round(numpy.median(numpy.abs(Z)), 4)
0.6745
Applying the absolute function to a sample that follows the standard normal distribution results in a standard half-normal distribution where the quantile is defined as
and therefore the median can be computed as
While the inverse error function has no closed-form solution it can be efficiently approximated, providing a much more convenient method for deriving the normalization factor.
SciPy provides various ways to compute the median of the standard half-normal distribution.
>>> scipy.stats.halfnorm.median()
0.6744897501960817
>>> scipy.stats.halfnorm.ppf(0.5)
0.6744897501960817
>>> scipy.stats.norm.ppf(0.75)
0.6744897501960817
>>> math.sqrt(2)*scipy.special.erfinv(0.5)
0.6744897501960818
>>> scipy.special.ndtri(0.75)
0.6744897501960817
The SciPy implementation of the inverse error function uses different approximation approaches depending upon the input value. The original implementation can be found in ndtri.c
. The algorithm essentially performs a rational polynomial approximation. A minimal reimplementation of ndtri.c
yields the same approximation. The implementation may have originated from Methods and Programs for Mathematical Functions .