Normalized Median Absolute Deviation

For detecting outliers in non-normally distributed data a common approach is computing the modified z-scores.

Instead of the standard deviation the median absolute deviation is used as a measure of variability. The metric is defined as the median of the the absolute deviations from the sample median.

$${\tilde {X}}=\operatorname {median} (X)$$
$$\operatorname {MAD} =\operatorname {median} (|X_{i}-{\tilde {X}}|)$$

The median absolute deviation can then be used to derive the modified z-scores. To make the scores comparable to the standard z-scores a normalization factor is added.

$$\operatorname{Modified\_Z} = \frac{0.6745\thinspace(X_{i}-{\tilde {X}})}{\operatorname {MAD}}$$

One might wonder where the normalization constant comes from and how it can be derived.

A naive empirical approach is to compute the median absolute deviation for a sufficiently large sample from the standard normal distribution.

$$Z \sim \mathcal{N}(0, 1)$$
$$\operatorname{MAD}(Z) \approx 0.6745$$
>>> Z = scipy.stats.norm.rvs(size=100000000, random_state=1)
>>> round(scipy.stats.median_abs_deviation(Z), 4)
0.6745

Since the mean and the median of the standard normal distribution are both \(\mu = 0\) the median absolute deviation simplifies to the median of the absolute values of the sample.

$$\operatorname{MAD}(Z) = \operatorname {median} (|Z|) \approx 0.6745$$
>>> round(numpy.median(numpy.abs(Z)), 4)
0.6745

Applying the absolute function to a sample that follows the standard normal distribution results in a standard half-normal distribution where the quantile is defined as

$$Q(F)= {\sqrt {2}}\operatorname {erf} ^{-1}(F)$$

and therefore the median can be computed as

$$\operatorname {median} (|Z|) = {\sqrt {2}}\operatorname {erf} ^{-1}(1/2)\approx 0.6745$$

While the inverse error function has no closed-form solution it can be efficiently approximated, providing a much more convenient method for deriving the normalization factor.

SciPy provides various ways to compute the median of the standard half-normal distribution.

>>> scipy.stats.halfnorm.median()
0.6744897501960817
>>> scipy.stats.halfnorm.ppf(0.5)
0.6744897501960817
>>> scipy.stats.norm.ppf(0.75)
0.6744897501960817
>>> math.sqrt(2)*scipy.special.erfinv(0.5)
0.6744897501960818
>>> scipy.special.ndtri(0.75)
0.6744897501960817

The SciPy implementation of the inverse error function uses different approximation approaches depending upon the input value. The original implementation can be found in ndtri.c. The algorithm essentially performs a rational polynomial approximation. A minimal reimplementation of ndtri.c yields the same approximation. The implementation may have originated from Methods and Programs for Mathematical Functions .

Interesting Stuff

Me writing about Tech stuff