Checking Postgres Statistics

Checking the histogram

Consider a table tab and a column col, we get its histogram using:

SELECT histogram_bounds FROM pg_stats
WHERE tablename='tab' AND attname='col';

We can check the histogram quality by counting the real number of rows between each bounds of the histogram:

WITH histogram AS 
(SELECT unnest(histogram_bounds::text::int[]) AS bound 
FROM pg_stats
WHERE tablename='tab' AND attname='col') ,

intervals AS 
(select (ROW_NUMBER() OVER (ORDER BY histogram.bound)) - 1 AS index,
        lag(bound) over (order by bound) AS min, 
        bound AS max
 FROM histogram OFFSET 1),

real_histogram AS (
select index, min, max, count(col) AS count
FROM tab AS tab, intervals AS i 
WHERE tab.col >= i.min AND tab.col < i.max 
GROUP BY index, min, max
ORDER BY index) 

select index AS bucket, min, max , count , repeat('■',
               (   count::float
                 / max(count) over()
                 * 30
               )::int
        ) as bar
from real_histogram;

We obtain a result of this form:

index	min	max	count	bar
1	3818	379263	1009	■■■■■■■■■■■■■■■■■■
2	379263	796818	1294	■■■■■■■■■■■■■■■■■■■■■■■
3	796818	1244509	1021	■■■■■■■■■■■■■■■■■■
4	1244509	1704733	1179	■■■■■■■■■■■■■■■■■■■■■
5	1704733	2097371	1111	■■■■■■■■■■■■■■■■■■■■
6	2097371	2371027	916	■■■■■■■■■■■■■■■■
7	2371027	2761796	1260	■■■■■■■■■■■■■■■■■■■■■■
8	2761796	3100446	1043	■■■■■■■■■■■■■■■■■■
9	3100446	3509108	1103	■■■■■■■■■■■■■■■■■■■
10	3509108	3808976	1034	■■■■■■■■■■■■■■■■■■

Checking the Most Common Values

Consider a table tab and a column col, we get the most common values with their frequencies using:

SELECT  most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename='tab' AND attname='col';

We can display the estimated rows and the actual with their normalized difference using bars (one square each 5%) as follow:

WITH n_rows AS (
SELECT count(*) AS n_rows FROM tab),

T AS 
(SELECT n_rows, 
        unnest(most_common_vals::text::int[]) AS mcv, 
        unnest(most_common_freqs::text::float[]) AS mcf
FROM pg_stats, n_rows
WHERE tablename='tab' AND attname='col'),

mcv_check AS (
SELECT mcv, (mcf * n_rows)::int AS estimated_n_rows, count(*) AS n_rows
FROM T, tab AS tab WHERE tab.col = mcv 
GROUP BY mcv, mcf, n_rows)

SELECT mcv, estimated_n_rows, n_rows,
repeat('■',
((abs(n_rows - estimated_n_rows)::float / n_rows::float)* 20)::int) as deviation
FROM mcv_check
ORDER BY mcv;

mcv	estimated_n_rows	n_rows	deviation
95759	10008	9018	■■■
142005	6255	6119	■
438087	12197	11849	■
799888	32087	31076	■
827510	18014	17767
1519231	8882	8746
2826276	17451	16572	■
3240017	3002	3076	■
3549370	3002	3571	■■■■
3646595	7631	6922	■■
3656433	3127	2150	■■■■■■■■■
3693069	2877	2482	■■■■
4525288	8444	6986	■■■■■
4540224	3628	4128	■■■
4868255	10758	9889	■■
5430681	4003	4855	■■■■■
5566825	5317	4729	■■■
6285203	7568	8474	■■■