Which is the most probable class ?
Theoretical result
We consider the case of a random variable \(X\) with a finite range \(\{v_{1}, \dots, v_{m}\}\) and there exists an minimal integer \(Z\) such that \(P(X=v_{i}) = \frac{u_{i}}{Z}\). So, we have \(\sum_{1\leq i \leq m} u_{i} = Z\).
We perform \(n\) independent draws of \(X\), the probability of obtaining \(k_{i}\) times the values \(v_{i}\) with \(\sum_{1\leq i \leq m} k_{i} = n\) is: \[\binom{n}{k_{1}} (\frac{u_{1}}{Z})^{k_{1}} \times \binom{n - k_{1}}{k_{2}} (\frac{u_{2}}{Z})^{k_{2}} \dots \times \binom{n - k_{1} \dots - k_{m-1}}{k_{m}} (\frac{u_{m}}{Z})^{k_{m}}\]
We can simply the formula to obtain: \[\frac{n!}{Z^{n}} \prod_{1\leq i \leq m} \frac{u_{i}^{k_{i}}}{k_{i}!}\]
Finding the set of values \(k_{i}\) that maximize the above formula is equivalent to find for each \(i\) the \(k_{i}\) maximizing \(\frac{u_{i}^{k_{i}}}{k_{i}!}\). According the following section, the maximum is reached when \(k_{i} = u_{i}\). However the additional constraint \(\sum_{1\leq i \leq m} k_{i} = n\) ensures that the choice \(k_{i} = u_{i}\) is possible iff \(n\) is a multiple of \(Z\).
Analyze of uk/k!
In the following, we observe that the maximum of \(\frac{u^{k}}{k!}\) for fixed \(u\) seems to be reached when \(k=u\).
import matplotlib.pyplot as plt import numpy as np n = 50 k = np.arange(0, n) k[0] = 1 # fact(0) u = np.repeat(np.arange(0, n), n).reshape((n, n)) uoverk = np.divide(u, k) uoverk[:,0] = 1 # u^0 =1 res = np.cumprod(uoverk, axis=1) normalized_res = res/res.max(axis=1)[:,None]
fig, axis = plt.subplots() # il me semble que c'est une bonne habitude de faire supbplots heatmap = axis.pcolor(normalized_res, cmap=plt.cm.Blues) # heatmap contient les valeurs plt.colorbar(heatmap) plt.xlabel("k") plt.ylabel("u", rotation=0) plt.title("u^k/k! normalized for fixed u") plt.show()
We just have to write the following equation :
\[\frac{u^n}{n!} = \frac{u}{1} \times \frac{u}{2} \dots \times \frac{u}{n}\]
6k draws of a dice
We choose for a given \(k\), \(n=6k\), \(Z=6k\), \(m=6\), \(u_i=k\), so the \(k_i\) have to be equal to \(k\) to maximize the probability and the formula becomes :
\[\frac{(6k)!}{(6k)^{6k}} \prod_{1\leq i \leq 6} \frac{k^k}{k!} = \frac{(6k)!}{6^{6k} (k!)^6}\]
def most_prob_class(k): prob = 1 for i in range(1, 6*k +1): d = i % k if (i % k) != 0 else k prob = prob * (i/(6*d)) return prob print([most_prob_class(1), most_prob_class(10), most_prob_class(100), most_prob_class(300)])
[0.015432098765432098, 7.456270054665195e-05, 2.4632858255234786e-07, 1.5853278892898133e-08]
TODO Comparison of expected value and best answer
In general, a PDB is a triplet \((\mathcal D, \mathcal W, P)\) where \(\mathcal D\) is the possibly infinite set of possible tuples, \(\mathcal W\) is a \(\sigma\) algebra on \(\mathcal D\), it represents the set of the possible database instances, so every member of \(\mathcal W\) is a finite set and \(P\) is a probability over \(\mathcal W\).
How to define the union or intersection of two instances in \(\mathcal W\) with the bag semantic ?
How to define an independent block PDB as a PDB from the probabilities of the values in each block ? It should be easy. Is the order of the tuples taken into account ?
Finally, how to relate the previous results with the probabilities of BIDPDB ?
For \(Q\) a given numerical query and a \((\mathcal D, \mathcal W, P)\) a PDB, the expected value of \(Q\) on \((\mathcal D, \mathcal W, P)\) is defined by: \[E(Q(D)) = \int_{D \in \mathcal W} Q(D) dP\]