~mqb

information entropy

may 2021

Notes while creating ~mqb/information-entropy.

Generally, information entropy is the average amount of information conveyed by an event, when considering all possible outcomes of a random trial.

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

$H$ is Shannon's entropy, written with the Greek capital letter "Eta" and named after Boltzmann's H-theorem.

$X$ is a discrete random variable with possible values $\{ x_1, \cdots, x_n \}$. Its range is a countable set.

$\sum_{i=1}^{n}$ denotes the summation (sum) over the variable's possible values (outcomes), written with the Greek capital letter "Sigma".

$P(x_i)$ is the probability mass function (PMF): every value is non-negative and the values sum to one.

$\log_b$ is the logarithm with base $b$; base 2 gives units of bits (shannons), base $e$ gives nats, and base 10 gives hartleys.
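
A minimal Python sketch of the formula with $b = 2$, so the result is in bits; shannon_entropy is just a convenience name and the example PMFs are made up here.

from math import log2

def shannon_entropy(pmf):
    # H(X) = -sum of P(x) * log2(P(x)); zero-probability outcomes contribute nothing
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

print(shannon_entropy({"H": 1/2, "T": 1/2}))   # 1.0 bit for a fair coin
print(shannon_entropy({"H": 0.9, "T": 0.1}))   # ~0.47 bits for a biased coin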

probability mass function

An example: flipping a fair coin.

Our sample space, 𝑺, has two equally likely outcomes: heads (H) or tails (T).

𝑺 = { H, T }

// outcome scenarios
// 0 = false, 1 = true
H | T
0   1
1   0

Let 𝑋 count the number of heads. Its range, the set of possible values, is

𝑹ₓ = { 0, 1 }

𝑷 (𝑘) = 𝑷 (𝑋 = 𝑘) for 𝑘 ∈ 𝑹ₓ = { 0, 1 }

𝑷 (0) = 𝑷 (𝑋 = 0) = 𝑷 (T) = ¹⁄₂

𝑷 (1) = 𝑷 (𝑋 = 1) = 𝑷 (H) = ¹⁄₂
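
A quick Python check of this PMF; the dict {0: 1/2, 1: 1/2} is just one convenient representation chosen here.

# PMF of X = number of heads in a single fair flip
pmf = {0: 1/2, 1: 1/2}   # P(X = 0) = P(T), P(X = 1) = P(H)

assert all(p >= 0 for p in pmf.values())    # no negative probabilities
assert abs(sum(pmf.values()) - 1) < 1e-12   # a valid PMF sums to one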

Application: the probability of getting all heads by the rule of product.

The coin is flipped once per round and recorded for a total of four rounds.

𝑷 ({ H, H, H, H }) = ¹⁄₂ × ¹⁄₂ × ¹⁄₂ × ¹⁄₂ = 0.0625 or ¹⁄₁₆
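
The same product as a tiny Python sketch; the variable names are arbitrary.

p_heads = 1/2
rounds = 4

# rule of product for independent flips: multiply the per-round probabilities
p_all_heads = p_heads ** rounds
print(p_all_heads)   # 0.0625, i.e. 1/16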

The coin is flipped TWICE per round for a total of four rounds.

𝑺 = { HH, HT, TH, TT }

// outcome scenarios: first flip | second flip
// 0 = tails (T), 1 = heads (H)
1st | 2nd
0     0     // TT
1     0     // HT
0     1     // TH
1     1     // HH

𝑋 counts the number of heads across the two flips. Its range is

𝑹ₓ = { 0, 1, 2 }

𝑷 (𝑘) = 𝑷 (𝑋 = 𝑘) for 𝑘 ∈ 𝑹ₓ = { 0, 1, 2 }

𝑷 (0) = 𝑷 (𝑋 = 0) = 𝑷 (TT) = ¹⁄₄

𝑷 (1) = 𝑷 (𝑋 = 1) = 𝑷 ({ HT, TH }) = ¹⁄₄ + ¹⁄₄ = ¹⁄₂

𝑷 (2) = 𝑷 (𝑋 = 2) = 𝑷 (HH) = ¹⁄₄
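
A small Python sketch that rebuilds this PMF by enumerating the sample space; the helper names are arbitrary.

from collections import Counter
from itertools import product

# sample space for two fair flips: HH, HT, TH, TT, each with probability 1/4
sample_space = ["".join(flips) for flips in product("HT", repeat=2)]

# X counts the heads in each outcome; tally the counts to get the PMF
counts = Counter(outcome.count("H") for outcome in sample_space)
pmf = {k: counts[k] / len(sample_space) for k in sorted(counts)}
print(pmf)   # {0: 0.25, 1: 0.5, 2: 0.25}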

Application: the probability of getting all heads by the rule of product.

𝑷 ({ HH, HH, HH, HH }) = ¹⁄₄ × ¹⁄₄ × ¹⁄₄ × ¹⁄₄ = 0.00390625 or ¹⁄₂₅₆
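
A brute-force cross-check of the same figure in Python: enumerate every possible four-round sequence and count the one that is all heads.

from itertools import product

# each round has four equally likely outcomes, so there are 4 ** 4 = 256 sequences
sequences = list(product(["HH", "HT", "TH", "TT"], repeat=4))

all_heads = [seq for seq in sequences if all(r == "HH" for r in seq)]
print(len(all_heads), "/", len(sequences))   # 1 / 256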

cross-entropy

$$H(P, Q) = -\sum_{i=1}^{n} P(x_i) \log_b Q(x_i)$$

Quantifies the average total number of bits (for $b = 2$) needed to encode events drawn from P when using a code optimized for Q. It equals the entropy $H(P)$ when Q = P and grows as Q drifts away from P, so it measures how closely one distribution matches the other.
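
A minimal Python sketch; p and q are made-up example distributions, and the code assumes Q is nonzero wherever P is.

from math import log2

def cross_entropy(p, q):
    # H(P, Q) = -sum of P(x) * log2(Q(x))
    return -sum(p[x] * log2(q[x]) for x in p if p[x] > 0)

p = {0: 0.25, 1: 0.50, 2: 0.25}   # "true" distribution
q = {0: 0.50, 1: 0.25, 2: 0.25}   # approximating distribution

print(cross_entropy(p, p))   # 1.5, equal to the entropy H(P) when Q = P
print(cross_entropy(p, q))   # 1.75, larger once Q drifts away from P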

kl-divergence

$$D_{KL}(P \parallel Q) = \sum_{i=1}^{n} P(x_i) \log_b \frac{P(x_i)}{Q(x_i)}$$

Kullback-Leibler Divergence may also go by relative entropy.

Quantifies the average number of extra bits (for $b = 2$) needed to encode events drawn from P when using a code optimized for Q instead of P, that is, the difference between cross-entropy and entropy: $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$. It measures the dissimilarity between the two distributions.
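
The same made-up p and q, sketched in Python to show that the KL divergence equals the gap between cross-entropy and entropy; again Q is assumed nonzero wherever P is.

from math import log2

def kl_divergence(p, q):
    # D_KL(P || Q) = sum of P(x) * log2(P(x) / Q(x))
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

def cross_entropy(p, q):
    return -sum(p[x] * log2(q[x]) for x in p if p[x] > 0)

p = {0: 0.25, 1: 0.50, 2: 0.25}
q = {0: 0.50, 1: 0.25, 2: 0.25}

print(kl_divergence(p, q))                        # 0.25
print(cross_entropy(p, q) - cross_entropy(p, p))  # 0.25, the same gap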

references