
In PostgreSQL, you can actively use the Median Absolute Deviation (MAD) to identify outliers in a dataset. Since MAD is a robust statistic, it measures variability more effectively than traditional metrics. Moreover, it remains less sensitive to outliers compared to the standard deviation. You define MAD as the median of the absolute deviations from the dataset’s median.
Here’s how you can use the MAD to identify outliers in PostgreSQL:
-
Calculate the median of the dataset:
First, calculate the median value:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table;
In this example, my_table is the name of the table containing the dataset, and my_column is the name of the column containing the data to be analyzed. The percentile_cont function calculates the median of the dataset.
2. Calculate the Median Absolute Deviation (MAD):
Next, use a WITH
clause to compute the MAD:
WITH median_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table
)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(my_column – median_value.median)) AS mad
FROM my_table, median_value;
In this example, the WITH clause calculates the median of the dataset and stores it in a temporary table called median_value. The main query then calculates the MAD by calculating the absolute deviation of each data point from the median, and then calculating the median of these absolute deviations.
-
Identify outliers using the MAD:Finally, combine both the median and MAD to flag outliers:
WITH median_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY my_column) AS median
FROM my_table
), mad_value AS (
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(my_column – median_value.median)) AS mad
FROM my_table, median_value
)
SELECT *
FROM my_table
WHERE abs(my_column – median_value.median) > 3 * mad_value.mad;
In this example, the WITH
clause first calculates the median and MAD of the dataset. It then stores them in temporary tables named median_value
and mad_value
, respectively. After that, the main query identifies outliers by selecting all data points where the absolute deviation from the median exceeds three times the MAD.
Real-life data examples of using the MAD to identify outliers in PostgreSQL can be found in various applications that handle data analysis or data validation. For example, a financial application might use the MAD to identify fraudulent transactions or unusual patterns in financial data. Similarly, a data analytics application might use the MAD to identify unusual patterns in customer behavior or sales data.