For detecting outliers in non-normally distributed data a common approach is computing the modified z-scores.
Instead of the standard deviation the median absolute deviation is used as a measure of variability. The metric is defined as the median of the the absolute deviations from the sample median.
$${\tilde {X}}=\operatorname {median} (X)$$
$$\operatorname {MAD} =\operatorname {median} (|X_{i}-{\tilde {X}}|)$$
The median absolute deviation can then be used to derive the modified z-scores. To make the scores comparable to the standard z-scores a normalization factor is added.
$$\operatorname{Modified\_Z} = \frac{0.6745\thinspace(X_{i}-{\tilde {X}})}{\operatorname {MAD}}$$
One might wonder where the normalization constant comes from and how it can be derived.
A naive empirical approach is to compute the median absolute deviation for a sufficiently large sample from the standard normal distribution.
$$Z \sim \mathcal{N}(0, 1)$$
$$\operatorname{MAD}(Z) \approx 0.6745$$
>>> Z = scipy.stats.norm.rvs(size=100000000, random_state=1)
>>> round(scipy.stats.median_abs_deviation(Z), 4)
0.6745
Since the mean and the median of the standard normal distribution are both \(\mu = 0\) the median absolute deviation simplifies to the median of the absolute values of the sample.
$$\operatorname{MAD}(Z) = \operatorname {median} (|Z|) \approx 0.6745$$
>>> round(numpy.median(numpy.abs(Z)), 4)
0.6745
Applying the absolute function to a sample that follows the standard normal distribution results in a standard half-normal distribution where the quantile is defined as
$$Q(F)= {\sqrt {2}}\operatorname {erf} ^{-1}(F)$$
and therefore the median can be computed as
$$\operatorname {median} (|Z|) = {\sqrt {2}}\operatorname {erf} ^{-1}(1/2)\approx 0.6745$$
While the inverse error function has no closed-form solution it can be efficiently approximated, providing a much more convenient method for deriving the normalization factor.
SciPy provides various ways to compute the median of the standard half-normal distribution.
>>> scipy.stats.halfnorm.median()
0.6744897501960817
>>> scipy.stats.halfnorm.ppf(0.5)
0.6744897501960817
>>> scipy.stats.norm.ppf(0.75)
0.6744897501960817
>>> math.sqrt(2)*scipy.special.erfinv(0.5)
0.6744897501960818
>>> scipy.special.ndtri(0.75)
0.6744897501960817
The SciPy implementation of the inverse error function uses different approximation approaches depending upon the input value. The original implementation can be found in ndtri.c
. The algorithm essentially performs a rational polynomial approximation. A minimal reimplementation of ndtri.c
yields the same approximation. The implementation may have originated from Methods and Programs for Mathematical Functions .