Checking Postgres Statistics
Checking the histogram
Consider a table tab
and a column col
, we get its histogram using:
SELECT histogram_bounds FROM pg_stats WHERE tablename='tab' AND attname='col';
We can check the histogram quality by counting the real number of rows between each bounds of the histogram:
WITH histogram AS (SELECT unnest(histogram_bounds::text::int[]) AS bound FROM pg_stats WHERE tablename='tab' AND attname='col') , intervals AS (select (ROW_NUMBER() OVER (ORDER BY histogram.bound)) - 1 AS index, lag(bound) over (order by bound) AS min, bound AS max FROM histogram OFFSET 1), real_histogram AS ( select index, min, max, count(col) AS count FROM tab AS tab, intervals AS i WHERE tab.col >= i.min AND tab.col < i.max GROUP BY index, min, max ORDER BY index) select index AS bucket, min, max , count , repeat('■', ( count::float / max(count) over() * 30 )::int ) as bar from real_histogram;
We obtain a result of this form:
index | min | max | count | bar |
---|---|---|---|---|
1 | 3818 | 379263 | 1009 | ■■■■■■■■■■■■■■■■■■ |
2 | 379263 | 796818 | 1294 | ■■■■■■■■■■■■■■■■■■■■■■■ |
3 | 796818 | 1244509 | 1021 | ■■■■■■■■■■■■■■■■■■ |
4 | 1244509 | 1704733 | 1179 | ■■■■■■■■■■■■■■■■■■■■■ |
5 | 1704733 | 2097371 | 1111 | ■■■■■■■■■■■■■■■■■■■■ |
6 | 2097371 | 2371027 | 916 | ■■■■■■■■■■■■■■■■ |
7 | 2371027 | 2761796 | 1260 | ■■■■■■■■■■■■■■■■■■■■■■ |
8 | 2761796 | 3100446 | 1043 | ■■■■■■■■■■■■■■■■■■ |
9 | 3100446 | 3509108 | 1103 | ■■■■■■■■■■■■■■■■■■■ |
10 | 3509108 | 3808976 | 1034 | ■■■■■■■■■■■■■■■■■■ |
Checking the Most Common Values
Consider a table tab
and a column col
, we get the most common values with their frequencies using:
SELECT most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='tab' AND attname='col';
We can display the estimated rows and the actual with their normalized difference using bars (one square each 5%) as follow:
WITH n_rows AS ( SELECT count(*) AS n_rows FROM tab), T AS (SELECT n_rows, unnest(most_common_vals::text::int[]) AS mcv, unnest(most_common_freqs::text::float[]) AS mcf FROM pg_stats, n_rows WHERE tablename='tab' AND attname='col'), mcv_check AS ( SELECT mcv, (mcf * n_rows)::int AS estimated_n_rows, count(*) AS n_rows FROM T, tab AS tab WHERE tab.col = mcv GROUP BY mcv, mcf, n_rows) SELECT mcv, estimated_n_rows, n_rows, repeat('■', ((abs(n_rows - estimated_n_rows)::float / n_rows::float)* 20)::int) as deviation FROM mcv_check ORDER BY mcv;
mcv | estimatednrows | nrows | deviation |
---|---|---|---|
95759 | 10008 | 9018 | ■■■ |
142005 | 6255 | 6119 | ■ |
438087 | 12197 | 11849 | ■ |
799888 | 32087 | 31076 | ■ |
827510 | 18014 | 17767 | |
1519231 | 8882 | 8746 | |
2826276 | 17451 | 16572 | ■ |
3240017 | 3002 | 3076 | ■ |
3549370 | 3002 | 3571 | ■■■■ |
3646595 | 7631 | 6922 | ■■ |
3656433 | 3127 | 2150 | ■■■■■■■■■ |
3693069 | 2877 | 2482 | ■■■■ |
4525288 | 8444 | 6986 | ■■■■■ |
4540224 | 3628 | 4128 | ■■■ |
4868255 | 10758 | 9889 | ■■ |
5430681 | 4003 | 4855 | ■■■■■ |
5566825 | 5317 | 4729 | ■■■ |
6285203 | 7568 | 8474 | ■■■ |