may 2021
Notes while creating ~mqb/information-entropy.
Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes of a random trial.
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$
$H$ is Shannon's entropy, named after Boltzmann's H-theorem; the symbol is the Greek capital letter "Eta".
$X$ is a discrete random variable with possible values $\{ x_1, \cdots, x_n \}$. Its range is a countable set.
$\sum_{i=1}^{n}$ denotes the summation, or just sum, over the variable's possible values (outcomes). Greek capital letter "Sigma".
- $n$ is the upper bound of summation.
- $i=1$ is the starting index of summation.
$P(x_i)$ is a probability mass function (PMF): all probabilities are non-negative and sum to one.
- $x_i$ is a possible value in the range, $R_X$, of the discrete random variable.
$\log_b$ is the logarithm in base $b$.
- $b$ is the base of the logarithm: 2 for bits or shannons, Euler's number $e$ for nats, 10 for hartleys or bans.
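A minimal sketch of the formula in Python; the helper name shannon_entropy and the fair-coin demonstration are illustrative, not fixed notation.

```python
import math

def shannon_entropy(pmf, base=2):
    """H(X) = -sum of P(x) * log_b(P(x)) over the possible values of X."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin, P(H) = P(T) = 1/2, carries 1 bit (shannon) of entropy per flip.
print(shannon_entropy([0.5, 0.5]))  # 1.0
```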
An example: flipping a fair coin.
Our sample space, 𝛺, has two possible outcomes for each independent flip: heads (H) or tails (T).
𝛺 = { H, T }
// outcome scenarios
// 0 = false, 1 = true
H | T
0 | 1
1 | 0
The random variable 𝑿 counts the number of heads; its range is
𝑹ₓ = { 0, 1 }
𝑷 (𝒙) = 𝑷 (𝑿 = 𝒙) for 𝒙 ∈ 𝑹ₓ = { 0, 1 }
𝑷 (0) = 𝑷 (𝑿 = 0) = 𝑷 (T) = ¹⁄₂
𝑷 (1) = 𝑷 (𝑿 = 1) = 𝑷 (H) = ¹⁄₂
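As a check, plugging this PMF into the entropy formula with base 2:
$$H(X) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1 \text{ bit}$$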
Application: the probability of getting all heads by the rule of product.
The coin is flipped once per round and recorded for a total of four rounds.
𝑷 ({ H, H, H, H }) = ¹⁄₂ × ¹⁄₂ × ¹⁄₂ × ¹⁄₂ = 0.0625 or ¹⁄₁₆
The coin is flipped TWICE per round for a total of four rounds.
𝛺 = { HH, HT, TH, TT }
// outcome scenarios
// 1 = heads (H), 0 = tails (T)
flip 1 | flip 2
0 | 0
1 | 0
0 | 1
1 | 1
The random variable 𝑿 now counts the number of heads per round; its range is
𝑹ₓ = { 0, 1, 2 }
𝑷 (𝒙) = 𝑷 (𝑿 = 𝒙) for 𝒙 ∈ 𝑹ₓ = { 0, 1, 2 }
𝑷 (0) = 𝑷 (𝑿 = 0) = 𝑷 (TT) = ¹⁄₄
𝑷 (1) = 𝑷 (𝑿 = 1) = 𝑷 ({ HT, TH }) = ¹⁄₄ + ¹⁄₄ or ¹⁄₂
𝑷 (2) = 𝑷 (𝑿 = 2) = 𝑷 (HH) = ¹⁄₄
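As a check, plugging this PMF into the entropy formula with base 2:
$$H(X) = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{4} \log_2 \frac{1}{4} \right) = 1.5 \text{ bits}$$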
Application: the probability of getting all heads by the rule of product.
𝑷 ({ HH, HH, HH, HH }) = ¹⁄₄ × ¹⁄₄ × ¹⁄₄ × ¹⁄₄ = 0.00390625 or ¹⁄₂₅₆
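A small Python sketch checking both rule-of-product applications above, using the same probabilities worked out by hand.

```python
import math

# Four rounds of one fair flip each, all heads: (1/2)^4.
print(math.prod([0.5] * 4))   # 0.0625, i.e. 1/16

# Four rounds of two fair flips each, all heads (HH): (1/4)^4.
print(math.prod([0.25] * 4))  # 0.00390625, i.e. 1/256
```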
$$H(P, Q) = -\sum_{i=1}^{n} P(x_i) \log_b Q(x_i)$$
Cross-entropy quantifies the average number of total bits needed to identify an event drawn from one distribution (P) when using a code optimized for another (Q). It measures the similarity between two distributions.
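A minimal Python sketch of cross-entropy under the same discrete-PMF assumptions; the function name and the biased-coin distribution Q are illustrative.

```python
import math

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum of P(x) * log_b(Q(x))."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Fair coin P described with a code optimized for a biased coin Q.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(cross_entropy(p, q))  # ~1.74 bits, versus H(P) = 1 bit when Q = P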
$$D_{KL}(P \parallel Q) = \sum_{i=1}^{n} P(x_i) \log_b \frac{P(x_i)}{Q(x_i)}$$
Kullback-Leibler divergence may also go by relative entropy.
It quantifies the average number of extra bits needed to encode events from one distribution (P) with a code optimized for another (Q), which is the difference between cross-entropy and entropy: $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$. It measures the dissimilarity between two distributions.
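And a matching sketch of Kullback-Leibler divergence with the same illustrative P and Q; the roughly 0.74 bits it reports is the cross-entropy above minus the 1 bit of entropy of the fair coin.

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum of P(x) * log_b(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin
print(kl_divergence(p, q))  # ~0.74 extra bits: H(P, Q) - H(P) ~ 1.74 - 1.0
```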