Theorem 1: $H(p_1, \dots, p_n) \le \log n$, with equality iff $p_i = 1/n$ for all $i$.
Interpretation.
We have the least information when we have no reason to prefer any one outcome over the others.
Proof.
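Since
$\log_a x = \log_b x / \log_b a$,
changing the base of the logarithm only rescales the entropy by a positive constant, so
a proof using the natural logarithm, as below, is equally valid when the base of the
logarithm is changed.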
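Let
$x_i = \frac{1}{n p_i}$ for each $i$ with $p_i > 0$ (terms with $p_i = 0$ contribute nothing to the sums below).
Then
$$H(p_1, \dots, p_n) - \ln n = -\sum_i p_i \ln p_i - \sum_i p_i \ln n = \sum_i p_i \ln x_i.$$
Since
$\ln x \le x - 1$, with equality iff $x = 1$,
we have
$$\sum_i p_i \ln x_i \le \sum_i p_i \left( \frac{1}{n p_i} - 1 \right) = \sum_i \frac{1}{n} - \sum_i p_i \le 1 - 1 = 0,$$
so $H(p_1, \dots, p_n) \le \ln n$, with equality iff every $x_i = 1$, that is, $p_i = 1/n$ for all $i$.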
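As a numerical sanity check of the bound, the following minimal sketch compares the entropy of a uniform and a skewed distribution against $\log n$. The helper name `entropy` and the example distributions are assumptions chosen for illustration, not part of the original notes.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (hypothetical helper).

    Terms with probability 0 are skipped, i.e. 0 log 0 is treated as 0.
    """
    return -sum(x * math.log(x, base) for x in p if x > 0)

n = 4
uniform = [1.0 / n] * n          # no outcome is preferred
skewed = [0.7, 0.1, 0.1, 0.1]    # one outcome is strongly preferred

# The uniform distribution attains the bound log n; the skewed one stays below it.
assert math.isclose(entropy(uniform), math.log(n, 2))
assert entropy(skewed) < math.log(n, 2)

# Changing the base only rescales the entropy by a constant factor.
assert math.isclose(entropy(skewed, math.e), entropy(skewed, 2) * math.log(2))
```

The last assertion illustrates why the choice of logarithm base does not affect the argument: entropy in nats is entropy in bits multiplied by $\ln 2$.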
Theorem 2: $H(p_1, \dots, p_n) \ge 0$,
with equality iff some $p_i = 1$.
Interpretation.
Information/uncertainty cannot be a negative quantity.
Proof.
Note that we conventionally define
$0 \log 0$
to be
$0$, since $x \log x \to 0$ as $x \to 0^+$.
Now the proof is simple.
$0 \le p_i \le 1$,
whence
$\log p_i \le 0$.
Consequently,
$-p_i \log p_i \ge 0$,
and so the
sum of many such quantities is also non-negative. Equality results
iff $p_i = 1$ for some $i$ and
$p_j = 0$ for all $j \ne i$.
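The term-by-term argument can be sketched in a few lines of code. The function names `plogp` and `entropy_bits` are hypothetical; the only convention assumed is the one stated in the proof, namely that $0 \log 0$ is taken to be $0$.

```python
import math

def plogp(p):
    """One entropy term -p log2 p, with 0 log 0 defined as 0 (hypothetical helper)."""
    if p == 0:
        return 0.0
    return -p * math.log2(p)

def entropy_bits(dist):
    """Entropy in bits as a sum of non-negative terms."""
    return sum(plogp(p) for p in dist)

# Every individual term is non-negative for probabilities in [0, 1].
assert all(plogp(k / 100) >= 0 for k in range(101))

# A certain outcome carries no uncertainty; any other distribution carries some.
assert entropy_bits([1.0, 0.0, 0.0]) == 0
assert entropy_bits([0.9, 0.1, 0.0]) > 0
```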
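Theorem 3: Jensen's inequality: for any two probability distributions $p$ and $q$ on the same $n$ outcomes,
$$-\sum_i p_i \log p_i \le -\sum_i p_i \log q_i,$$
with equality iff
$p_i = q_i$ for all $i$.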
Interpretation.
Misestimation of the distribution governing the symbols in a source results in increased uncertainty/information.
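To make this interpretation concrete, here is a minimal sketch that compares the average number of bits per symbol under the true distribution with the average under a misestimated one. The function name `cross_entropy` and both distributions are assumptions made for illustration; the sketch also assumes the estimate assigns non-zero probability to every symbol that can actually occur.

```python
import math

def cross_entropy(p, q):
    """Average bits per symbol when symbols follow p but are treated as following q.

    Hypothetical helper; assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]        # true source distribution (assumed for the example)
q = [1 / 3, 1 / 3, 1 / 3]    # misestimated distribution (assumed for the example)

true_uncertainty = cross_entropy(p, p)   # entropy of p itself
misestimated = cross_entropy(p, q)       # uncertainty under the wrong model
excess = misestimated - true_uncertainty

# Using the wrong distribution never reduces the uncertainty.
assert excess >= 0
```

The difference `excess` is zero only when the estimate matches the true distribution on every symbol.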
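Proof.
Let
$x_i = q_i / p_i$ for each $i$ with $p_i > 0$.
Again, we use the property that
$\ln x \le x - 1$.
Then,
$$\sum_i p_i \ln \frac{q_i}{p_i} \le \sum_i p_i \left( \frac{q_i}{p_i} - 1 \right) = \sum_i q_i - \sum_i p_i \le 0,$$
which rearranges to $-\sum_i p_i \ln p_i \le -\sum_i p_i \ln q_i$.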
The very useful quantity
$\sum_i p_i \log \frac{p_i}{q_i}$
is often
denoted by I(p:q) or D(p||q) and is called the
Kullback-Leibler distance, I-directed divergence,
relative entropy, or discrimination between the densities p
and q. It is also defined for the continuous case by: