How to find outliers using the Median Absolute Deviation in PostgreSQL?

In PostgreSQL, you can use the Median Absolute Deviation (MAD) to identify outliers in a dataset. MAD is a robust statistic that measures the variability of a dataset, and it is less sensitive to outliers than traditional measures such as standard deviation. The MAD is defined as the median of the absolute deviations from the median of the dataset.

Here’s how you can use the MAD to identify outliers in PostgreSQL:

  1. Calculate the median of the dataset:

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table;

In this example, my_table is the name of the table containing the dataset, and my_column is the name of the column containing the data to be analyzed. The percentile_cont function calculates the median of the dataset.

2. Calculate the Median Absolute Deviation (MAD):

WITH median_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table
)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(my_column – median_value.median)) AS mad
FROM my_table, median_value;

In this example, the WITH clause calculates the median of the dataset and stores it in a temporary table called median_value. The main query then calculates the MAD by calculating the absolute deviation of each data point from the median, and then calculating the median of these absolute deviations.

  1. Identify outliers using the MAD:

WITH median_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table
), mad_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(my_column – median_value.median)) AS mad
FROM my_table, median_value
)
SELECT *
FROM my_table
WHERE abs(my_column – median_value.median) > 3 * mad_value.mad;

In this example, the WITH clause calculates the median and MAD of the dataset and stores them in temporary tables called median_value and mad_value, respectively. The main query then identifies outliers by selecting all data points where the absolute deviation from the median is greater than three times the MAD.

Real-life data examples of using the MAD to identify outliers in PostgreSQL can be found in various applications that handle data analysis or data validation. For example, a financial application might use the MAD to identify fraudulent transactions or unusual patterns in financial data. Similarly, a data analytics application might use the MAD to identify unusual patterns in customer behavior or sales data.

About Shiv Iyer 446 Articles
Open Source Database Systems Engineer with a deep understanding of Optimizer Internals, Performance Engineering, Scalability and Data SRE. Shiv currently is the Founder, Investor, Board Member and CEO of multiple Database Systems Infrastructure Operations companies in the Transaction Processing Computing and ColumnStores ecosystem. He is also a frequent speaker in open source software conferences globally.