Which is the most probable class ?

Theoretical result

We consider the case of a random variable \(X\) with a finite range \(\{v_{1}, \dots, v_{m}\}\) and there exists an minimal integer \(Z\) such that \(P(X=v_{i}) = \frac{u_{i}}{Z}\). So, we have \(\sum_{1\leq i \leq m} u_{i} = Z\).

We perform \(n\) independent draws of \(X\), the probability of obtaining \(k_{i}\) times the values \(v_{i}\) with \(\sum_{1\leq i \leq m} k_{i} = n\) is: \[\binom{n}{k_{1}} (\frac{u_{1}}{Z})^{k_{1}} \times \binom{n - k_{1}}{k_{2}} (\frac{u_{2}}{Z})^{k_{2}} \dots \times \binom{n - k_{1} \dots - k_{m-1}}{k_{m}} (\frac{u_{m}}{Z})^{k_{m}}\]

We can simply the formula to obtain: \[\frac{n!}{Z^{n}} \prod_{1\leq i \leq m} \frac{u_{i}^{k_{i}}}{k_{i}!}\]

Finding the set of values \(k_{i}\) that maximize the above formula is equivalent to find for each \(i\) the \(k_{i}\) maximizing \(\frac{u_{i}^{k_{i}}}{k_{i}!}\). According the following section, the maximum is reached when \(k_{i} = u_{i}\). However the additional constraint \(\sum_{1\leq i \leq m} k_{i} = n\) ensures that the choice \(k_{i} = u_{i}\) is possible iff \(n\) is a multiple of \(Z\).

Analyze of u^k/k!

In the following, we observe that the maximum of \(\frac{u^{k}}{k!}\) for fixed \(u\) seems to be reached when \(k=u\).

import matplotlib.pyplot as plt
import numpy as np
n = 50

k = np.arange(0, n)
k[0] = 1 # fact(0)
u = np.repeat(np.arange(0, n), n).reshape((n, n))
uoverk = np.divide(u, k)
uoverk[:,0] = 1 # u^0 =1
res = np.cumprod(uoverk, axis=1)
normalized_res = res/res.max(axis=1)[:,None]

fig, axis = plt.subplots() # il me semble que c'est une bonne habitude de faire supbplots
heatmap = axis.pcolor(normalized_res, cmap=plt.cm.Blues) # heatmap contient les valeurs
plt.colorbar(heatmap)
plt.xlabel("k")
plt.ylabel("u", rotation=0)
plt.title("u^k/k! normalized for fixed u")
plt.show()

We just have to write the following equation :

\[\frac{u^n}{n!} = \frac{u}{1} \times \frac{u}{2} \dots \times \frac{u}{n}\]

6k draws of a dice

We choose for a given \(k\), \(n=6k\), \(Z=6k\), \(m=6\), \(u_i=k\), so the \(k_i\) have to be equal to \(k\) to maximize the probability and the formula becomes :

\[\frac{(6k)!}{(6k)^{6k}} \prod_{1\leq i \leq 6} \frac{k^k}{k!} = \frac{(6k)!}{6^{6k} (k!)^6}\]

def most_prob_class(k):
    prob = 1
    for i in range(1, 6*k +1):
        d =  i % k if (i % k) != 0 else k 
        prob = prob * (i/(6*d))
    return prob

print([most_prob_class(1), most_prob_class(10), most_prob_class(100), most_prob_class(300)])

[0.015432098765432098, 7.456270054665195e-05, 2.4632858255234786e-07, 1.5853278892898133e-08]

TODO Comparison of expected value and best answer

In general, a PDB is a triplet \((\mathcal D, \mathcal W, P)\) where \(\mathcal D\) is the possibly infinite set of possible tuples, \(\mathcal W\) is a \(\sigma\) algebra on \(\mathcal D\), it represents the set of the possible database instances, so every member of \(\mathcal W\) is a finite set and \(P\) is a probability over \(\mathcal W\).

How to define the union or intersection of two instances in \(\mathcal W\) with the bag semantic ?

How to define an independent block PDB as a PDB from the probabilities of the values in each block ? It should be easy. Is the order of the tuples taken into account ?

Finally, how to relate the previous results with the probabilities of BIDPDB ?

For \(Q\) a given numerical query and a \((\mathcal D, \mathcal W, P)\) a PDB, the expected value of \(Q\) on \((\mathcal D, \mathcal W, P)\) is defined by: \[E(Q(D)) = \int_{D \in \mathcal W} Q(D) dP\]

Which is the most probable class ?

Theoretical result

Analyze of uk/k!

6k draws of a dice

TODO Comparison of expected value and best answer

Analyze of u^k/k!