may 2021
Notes while creating ~mqb/information-entropy.
Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes of a random trial.
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$
$H$ is Shannon's entropy, named after Boltzmann's H-theorem; the symbol is the Greek capital letter "Eta".
$X$ is a discrete random variable with possible values $\{ x_1, \cdots, x_n \}$. Its range is a countable set.
$\sum_{i=1}^{n}$ denotes the summation, or just sum, over the variable's possible values (outcomes). Greek capital letter "Sigma".
- $n$ is the upper bound of summation.
- $i=1$ is the starting index of summation.
$P(x_i)$ is a probability mass function (PMF): all probabilities are non-negative and sum to one.
- $x_i$ is a possible value in the range, $R_X$, of the discrete random variable.
$\log_b$ is the logarithm in base $b$.
- $b$ is the base of the logarithm: 2 for bits or shannons, Euler's number $e$ for nats, 10 for hartleys or bans.
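A minimal sketch of the formula in Python; the helper name shannon_entropy and the fair-coin demonstration are illustrative, not fixed notation.

```python
import math

def shannon_entropy(pmf, base=2):
    """H(X) = -sum of P(x) * log_b(P(x)) over the possible values of X."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin, P(H) = P(T) = 1/2, carries 1 bit (shannon) of entropy per flip.
print(shannon_entropy([0.5, 0.5]))  # 1.0
```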
An example: flipping a fair coin.
Our sample space, 𝛺, has two possible outcomes for each independent flip: heads (H) or tails (T).
𝛺 = { H, T }
// outcome scenarios
// 0 = false, 1 = true
H | T
0 | 1
1 | 0
The random variable 𝑿 counts the number of heads; its range is
𝑹ₓ = { 0, 1 }
𝑷 (𝒙) = 𝑷 (𝑿 = 𝒙) for 𝒙 ∈ 𝑹ₓ = { 0, 1 }
𝑷 (0) = 𝑷 (𝑿 = 0) = 𝑷 (T) = ¹⁄₂
𝑷 (1) = 𝑷 (𝑿 = 1) = 𝑷 (H) = ¹⁄₂
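As a check, plugging this PMF into the entropy formula with base 2:
$$H(X) = -\left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1 \text{ bit}$$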
Application: the probability of getting all heads by the rule of product.
The coin is flipped once per round and recorded for a total of four rounds.
𝑷 ({ H, H, H, H }) = ¹⁄₂ × ¹⁄₂ × ¹⁄₂ × ¹⁄₂ = 0.0625 or ¹⁄₁₆
The coin is flipped TWICE per round for a total of four rounds.
𝛺 = { HH, HT, TH, TT }
// outcome scenarios
// 1 = heads (H), 0 = tails (T)
flip 1 | flip 2
0 | 0
1 | 0
0 | 1
1 | 1
The random variable 𝑿 now counts the number of heads per round; its range is
𝑹ₓ = { 0, 1, 2 }
𝑷 (𝒙) = 𝑷 (𝑿 = 𝒙) for 𝒙 ∈ 𝑹ₓ = { 0, 1, 2 }
𝑷 (0) = 𝑷 (𝑿 = 0) = 𝑷 (TT) = ¹⁄₄
𝑷 (1) = 𝑷 (𝑿 = 1) = 𝑷 ({ HT, TH }) = ¹⁄₄ + ¹⁄₄ or ¹⁄₂
𝑷 (2) = 𝑷 (𝑿 = 2) = 𝑷 (HH) = ¹⁄₄
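As a check, plugging this PMF into the entropy formula with base 2:
$$H(X) = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{4} \log_2 \frac{1}{4} \right) = 1.5 \text{ bits}$$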
Application: the probability of getting all heads by the rule of product.
𝑷 ({ HH, HH, HH, HH }) = ¹⁄₄ × ¹⁄₄ × ¹⁄₄ × ¹⁄₄ = 0.00390625 or ¹⁄₂₅₆
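A small Python sketch checking both rule-of-product applications above, using the same probabilities worked out by hand.

```python
import math

# Four rounds of one fair flip each, all heads: (1/2)^4.
print(math.prod([0.5] * 4))   # 0.0625, i.e. 1/16

# Four rounds of two fair flips each, all heads (HH): (1/4)^4.
print(math.prod([0.25] * 4))  # 0.00390625, i.e. 1/256
```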
$$H(P, Q) = -\sum_{i=1}^{n} P(x_i) \log_b Q(x_i)$$
Cross-entropy quantifies the average number of total bits needed to identify an event drawn from one distribution (P) when using a code optimized for another (Q). It measures the similarity between two distributions.
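A minimal Python sketch of cross-entropy under the same discrete-PMF assumptions; the function name and the biased-coin distribution Q are illustrative.

```python
import math

def cross_entropy(p, q, base=2):
    """H(P, Q) = -sum of P(x) * log_b(Q(x))."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Fair coin P described with a code optimized for a biased coin Q.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(cross_entropy(p, q))  # ~1.74 bits, versus H(P) = 1 bit when Q = P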
$$D_{KL}(P \parallel Q) = \sum_{i=1}^{n} P(x_i) \log_b \frac{P(x_i)}{Q(x_i)}$$
Kullback-Leibler divergence may also go by relative entropy.
It quantifies the average number of extra bits needed to encode events from one distribution (P) with a code optimized for another (Q), which is the difference between cross-entropy and entropy: $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$. It measures the dissimilarity between two distributions.
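And a matching sketch of Kullback-Leibler divergence with the same illustrative P and Q; the roughly 0.74 bits it reports is the cross-entropy above minus the 1 bit of entropy of the fair coin.

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum of P(x) * log_b(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin
print(kl_divergence(p, q))  # ~0.74 extra bits: H(P, Q) - H(P) ~ 1.74 - 1.0
```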