Notes on Information Theory

01 July, 2022 - 97 min read

Notes on Information Theory from various sources



1 | Introduction to Coding Theory

  • Shannon’s seminal 1948 paper “A Mathematical Theory of Communication” measured the information content in the output of a random source in terms of its entropy
  • In it, he presented the Noiseless Coding Theorem: given n independent and identically distributed random variables, each with entropy H(x)...

    • they can be compressed into n(H(x) + ε) bits with negligible loss of information
    • or alternatively compressed into n(H(x) − ε) bits entailing certain loss
  • Given a noisy communication channel whose behavior is given by a stochastic channel law, and considering a message of k bits output by a source coder, the theorem states that every such channel has a precisely defined real-numbered capacity that quantifies the maximum rate of reliable communication
  • Given a noisy channel with capacity C, if information is transmitted at rate R (meaning k = nR message bits are transmitted in n uses of the channel), then if R < C there exist coding schemes (encoding and decoding pairs) that guarantee negligible probability of miscommunication, whereas if R > C then, regardless of the coding scheme, the probability of error at the receiver is bounded below by some constant (which increases with R)
  • Moreover, when R > C, the probability of miscommunication tends to 1 exponentially fast in k

1 - A Simple Code

Suppose we need to store 64-bit words in such a way that they can be recovered even if a single bit per word gets flipped.

  • We could duplicate each bit at least three times, storing roughly 21 information bits per 64-bit word and thus limiting ourselves to about 1/3 of the capacity. The benefit is that we can correct any single bit flip, since the majority value of the three copies gives the correct value of the bit if one of them is flipped.

In 1949, Hamming proposed his code to correct 1-bit errors using fewer redundant bits. Chunks of 4 information bits x_1, x_2, x_3, x_4 are mapped (encoded) to a codeword of 7 bits as

x_1,\ x_2,\ x_3,\ x_4,\ x_2 \oplus x_3 \oplus x_4,\ x_1 \oplus x_3 \oplus x_4,\ x_1 \oplus x_2 \oplus x_4

which can be equivalently represented as a mapping from x to Gx (operations done mod 2), where x is the column vector [x_1, x_2, x_3, x_4]^T and G is the matrix

\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ \end{bmatrix}

  • Two distinct 4-bit vectors x, y get mapped to codewords Gx, Gy which differ in at least 3 bits (define w = x − y, w ≠ 0, and check that, for each non-zero w, Gw has at least 3 bits which are 1)
  • This code can correct all single bit flips since, for any 7-bit vector, there is at most 1 codeword which can be obtained by a single bit flip
  • For any y ∈ {0,1}^7 which is not a codeword, there is always a codeword which can be obtained by a single bit flip
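As a concrete check of these claims, here is a short Python sketch (an illustration, not part of the original notes) that encodes every 4-bit message with G and verifies by brute force that distinct codewords differ in at least 3 positions:

```python
from itertools import product

# Generator matrix G of the [7,4] Hamming code (rows as in the notes)
G = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
]

def encode(x):
    """Map a 4-bit message x to its 7-bit codeword Gx (mod 2)."""
    return tuple(sum(g * xi for g, xi in zip(row, x)) % 2 for row in G)

codewords = [encode(x) for x in product([0, 1], repeat=4)]
dmin = min(sum(a != b for a, b in zip(c1, c2))
           for c1 in codewords for c2 in codewords if c1 != c2)
print(dmin)  # → 3: any two distinct codewords differ in at least 3 bits
```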

Basic Definitions

Hamming Distance

The Hamming distance between two strings x, y of the same length over the alphabet Σ is denoted Δ(x, y) and defined as the number of positions at which x and y differ:

\Delta(x,y) = |\{ i \mid x_i \neq y_i \}|

The fractional or relative Hamming distance between x, y ∈ Σ^n is given by

\delta(x,y) = \frac{\Delta(x,y)}{n}
  • The distance of an error correcting code is a measure of its error resilience, quantified in terms of how many errors need to be introduced to confuse one codeword with another
  • The minimum distance of a code C, denoted Δ(C), is the minimum Hamming distance between two distinct codewords of C:

\Delta(C) = \min\limits_{c_1, c_2 \in C,\ c_1 \neq c_2} \Delta(c_1, c_2)

  • For every pair of distinct codewords in C, the Hamming distance is at least Δ(C)
  • The relative distance of C, denoted δ(C), is the normalized quantity Δ(C)/n, where n is the block length of C. Thus, any two codewords of C differ in at least a fraction δ(C) of positions
  • For example, the parity check code, which maps k bits to k+1 bits by appending the parity of the message bits, is a distance-2 code with rate k/(k+1)
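These definitions translate directly into code; a minimal Python sketch:

```python
def hamming_distance(x, y):
    """Δ(x, y): number of positions where the equal-length strings differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def relative_distance(x, y):
    """δ(x, y) = Δ(x, y) / n."""
    return hamming_distance(x, y) / len(x)

def min_distance(code):
    """Δ(C): minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(c1, c2)
               for i, c1 in enumerate(code) for c2 in code[i + 1:])

print(hamming_distance("10110", "11010"))          # → 2
print(min_distance(["000", "011", "101", "110"]))  # → 2 (parity check code, k = 2)
```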

Hamming Weight

The Hamming weight is the number of non-zero symbols in a string x over alphabet Σ:

\begin{aligned} w(x) &= |\{ i \mid x_i \neq 0 \}| \\ w(x-y) &= \Delta(x, y) \end{aligned}


A Hamming ball of radius r around a string x ∈ Σ^n is given by the set

\{y \in \Sigma^n \mid \Delta(x,y) \leq r\}


An error correcting code C of length n over a finite alphabet Σ is a subset of Σ^n.

  • Elements of C are codewords in C, and the collection of all the codewords is sometimes called a codebook
  • The alphabet of C is Σ, and if |Σ| = q, we say that C is q-ary (binary, ternary, etc.)
  • The length n of codewords in C is called the block length of C
  • Associated with a code is also an encoding map E which maps the message set M, identified in some canonical way with {1, 2, ..., |C|}, to codewords in Σ^n

    • The code is then the image of the encoding map


The rate of a code measures the amount of redundancy it introduces:

R(C) = \frac{\log |C|}{n \log |\Sigma|}

which is the amount of non-redundant information per symbol in codewords of C.

  • The dimension of C is given by \frac{\log |C|}{\log |\Sigma|}
  • A q-ary code of dimension ℓ has q^ℓ codewords



Properties of Codes

  1. C has a minimum distance of 2t + 1
  2. C can be used to correct all t symbol errors
  3. C can be used to detect all 2t symbol errors
  4. C can be used to correct all 2t symbol erasures – that is, when up to 2t symbols are erased at known locations, they can be filled in using the unerased positions and the code’s redundancy
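To illustrate property 2 concretely, take the 5-fold repetition code, which has minimum distance 5 = 2t + 1 with t = 2, so majority vote corrects any 2 symbol errors (a sketch, not from the source text):

```python
def encode(bit):
    """5-fold repetition: a distance-5 code, so t = 2 errors are correctable."""
    return [bit] * 5

def decode(block):
    """Majority vote recovers the bit despite up to 2 flips."""
    return 1 if sum(block) >= 3 else 0

received = [1, 0, 1, 0, 1]  # codeword 11111 with two bit flips
print(decode(received))      # → 1
```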

An Aside about Galois Fields

  • A finite or Galois field is a finite set on which addition, subtraction, multiplication, and division are defined and behave as the corresponding operations on the rationals and reals do
  • A field is denoted F_q or GF(q), where q is the order or size of the field, given by the number of elements within

    • A finite field of order q exists if and only if q is a prime power p^k, where p is prime and k is a positive integer
    • In a field of order p^k, adding p copies of any element always results in zero, meaning that the characteristic of the field is p
    • All fields of order q = p^k are isomorphic, so we can unambiguously refer to them as F_q
  • F_2, then, is the field with the least number of elements, with additive and multiplicative identities denoted 0, 1 respectively
  • It should look familiar from bitwise logic, as addition in F_2 corresponds to modulo 2 arithmetic and the boolean XOR operation:

+ 0 1
0 0 1
1 1 0

  • Other properties of binary fields

    • Every element x of F_2 satisfies x + x = 0; −x = x
    • Every element x of F_2 satisfies x² = x

3 - Linear Codes

General codes may have no structure, but of particular interest to us is a subclass of codes with additional structure, called linear codes, defined over alphabets Σ which are finite fields F_q; when q is prime, F_q = {0, 1, ..., q−1} with addition and multiplication defined modulo q.

  • Formally, if Σ is a field and C ⊆ Σ^n is a subspace of Σ^n, then C is said to be a linear code
  • As C is a subspace, there exists a basis c_1, c_2, ..., c_k, where k is the dimension of the subspace. Any codeword can be expressed as a linear combination of these basis vectors. We can write these vectors in matrix form as the columns of an n × k matrix called a generator matrix

Generator Matrices and Encoding

  • Let C ⊆ F_q^n be a linear code of dimension k. A matrix G ∈ F_q^{n×k} is said to be a generator matrix for C if its k columns span C
  • It provides a way to encode a message x ∈ F_q^k (a column vector) as the codeword Gx, a linear transformation from x to Gx
  • Note that a linear code admits many different generator matrices, corresponding to the different choices of basis for the code as a vector space
  • A q-ary linear code of block length n and dimension k will be referred to as an [n, k]_q code

    • If the code has distance d, it may be expressed as [n, k, d]_q, and when the alphabet size is obvious or irrelevant, the subscript may be omitted

Aside about Parity-Bit Check Codes

A simple parity-bit check code uses an additional bit denoting the even or odd parity of a message’s Hamming weight, which is appended to the message to serve as a mechanism for error detection (but not correction, since it provides no means of determining the position of the error).

  • Such methods are inadequate for multi-bit errors of matching parity, e.g., 2 bit flips might still pass a parity check, despite a corrupted message

data    Hamming weight  even-parity bit  odd-parity bit
0000000 0               0                1
1010001 3               1                0
1101001 4               0                1
1111111 7               1                0


  • Suppose Alice wants to send a message m = 1001
  • She computes the parity of her message: p_even(m) = 1 + 0 + 0 + 1 (mod 2) = 0
  • And appends it to her message: m′ = m | p(m) = 10010
  • And transmits it to Bob: A \xrightarrow{m'} B
  • Bob receives m′ with questionable integrity and computes the parity for himself: 1 + 0 + 0 + 1 + 0 (mod 2) = 0 and observes the expected even-parity result
  • This same process works under an odd-parity scheme as well
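The exchange above can be sketched in a few lines of Python (names like `parity` are illustrative, not from the source):

```python
def parity(bits):
    """Even-parity bit: the Hamming weight mod 2."""
    return sum(bits) % 2

m = [1, 0, 0, 1]                  # Alice's message
sent = m + [parity(m)]            # append p(m): [1, 0, 0, 1, 0]
print(parity(sent) == 0)          # → True: Bob's check passes

# The weakness noted above: a 2-bit error preserves parity and goes undetected
corrupted = sent[:]
corrupted[0] ^= 1
corrupted[1] ^= 1
print(parity(corrupted) == 0)     # → True, despite the corrupted message
```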

Examples of Codes

  • [n, n−1, 2]_2 corresponds to the binary parity check code consisting of all vectors in F_2^n of even Hamming weight
  • [n, 1, n]_2 is a binary repetition code consisting of the two vectors 0^n, 1^n
  • [7, 4, 3]_2 is the linear Hamming code discussed above
  • If H is the parity check matrix of a linear code C, then Δ(C) is the minimum number of columns of H that are linearly dependent

Hamming Codes Revisited

  • Hamming codes can be understood via the structure of their parity check matrices, which allows us to generalize them to codes of larger lengths
  • C_Ham = [7, 4, 3]_2 is a Hamming code using the generator matrix G and a parity check matrix H whose column i is the binary representation of i, for i = 1, ..., 7:

G = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ \end{bmatrix}, \qquad H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \end{bmatrix}

and see that HG=0HG = \mathbf 0.

Correcting single errors with Hamming codes

  • Suppose that y is a corrupted version of some unknown codeword c ∈ C, with a single error (bit flip)
  • We know that, by the distance property of C_Ham, c is uniquely determined by y. We could determine c by flipping each bit of y and checking whether the resulting vector is in the null space of H

    • Recall that the null space of a matrix A is the set of all n-dimensional column vectors x such that Ax = 0
  • However, a more efficient correction can be performed. We know that y = c + e_i, where e_i is the column vector of all zeros except for a single 1 in the i-th position. Note that Hy = H(c + e_i) = Hc + He_i = the i-th column of H, which is the binary representation of i; this recovers the location i of the error
  • Syndrome: we say that Hy is the syndrome of y
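Syndrome decoding for the [7,4,3] code can be sketched as follows (illustrative Python, with H as given above; positions are 1-indexed):

```python
# Parity check matrix H: column i is the binary representation of i
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def syndrome(y):
    """Hy (mod 2): zero iff y is a codeword."""
    return [sum(h * yi for h, yi in zip(row, y)) % 2 for row in H]

y = [0] * 7          # the all-zeros codeword (C is linear, so 0 ∈ C)
y[5 - 1] ^= 1        # corrupt position 5
s = syndrome(y)
pos = s[0] * 4 + s[1] * 2 + s[2]  # read the syndrome as a binary number
y[pos - 1] ^= 1                   # flip the located bit back
print(pos, y == [0] * 7)          # → 5 True
```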

Generalized Hamming Codes

We define H_r to be the r × (2^r − 1) matrix where column i of H_r is the binary representation of i. This matrix must contain e_1 through e_r (the binary representations of the powers of 2 from 1 to 2^{r−1}) among its columns, and thus has full row rank.

  • The r-th generalized Hamming code is then given by

C^{(r)}_\text{Ham} = \{ c \in \mathbb F^{2^r-1}_2 \mid H_r c = \mathbf 0 \}

and is the binary code with parity check matrix H_r.

  • Note that C^{(r)}_Ham is a [2^r − 1, 2^r − 1 − r, 3]_2 code
  • Proof:

    • Since H_r has rank r, dim(C^{(r)}_Ham) = 2^r − 1 − r, and we must simply check that no two columns of H_r are linearly dependent, and that there are three linearly dependent columns in H_r
    • The former follows since all the columns are distinct, and the latter since the first three columns of H_r, the binary representations of 1, 2, 3, add up to 0

If C is a binary code of block length n with minimum distance 3, then

|C| \leq \frac{2^n}{n+1} \tag{1}

and it follows that the Hamming code C^{(r)}_Ham has the maximum possible number of codewords (and thus an optimal rate) amongst all binary codes of block length 2^r − 1 and minimum distance at least 3.

  • Proof:

    • For each c ∈ C, define its neighborhood N(c) = \{y \in \{0,1\}^n \mid \Delta(y,c) \leq 1 \}, the set of all strings that differ in at most one bit from c
    • Since C has distance 3, by the triangle inequality we have

N(c) \cap N(c') = \emptyset \; \text{for} \; c \neq c' \in C \quad\therefore\quad 2^n \geq \Big|\bigcup_{c \in C} N(c) \Big| = \sum_{c \in C} |N(c)| = |C| \cdot (n+1)

yielding the desired upper bound on |C| per this “Hamming volume” upper bound on the size of codes.

  • Note that C^{(r)}_Ham has dimension 2^r − 1 − r and therefore size 2^{2^r−1}/2^r, which meets the upper bound for block length n = 2^r − 1
  • The general Hamming or volume bound for binary codes C of block length n and distance d is given by

|C| \leq \frac{2^n}{\sum_{i=0}^{\lfloor \frac{d-1}{2} \rfloor}{n \choose i}} \tag{2}

For equality to hold in (2), Hamming balls of radius ⌊(d−1)/2⌋ around the codewords must cover each point in {0,1}^n exactly once, giving a perfect packing of non-overlapping Hamming spheres that cover the full space.

  • Codes which meet the bound in (2) are therefore called perfect codes
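The bound in (2) is simple to evaluate numerically; the sketch below (illustrative, not from the source) checks that the Hamming codes meet it with equality for small r, and that a [23, 12, 7] code would too:

```python
from math import comb

def volume_bound(n, d):
    """Hamming/volume bound: max |C| for a binary code of length n, distance d."""
    t = (d - 1) // 2
    return 2**n // sum(comb(n, i) for i in range(t + 1))

for r in range(2, 6):
    n = 2**r - 1
    assert 2**(n - r) == volume_bound(n, 3)  # Hamming codes are perfect

print(volume_bound(7, 3))   # → 16 = 2^4 codewords: the [7,4,3] code
print(volume_bound(23, 7))  # → 4096 = 2^12, met by the Golay code below
```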

Perfect Codes

There are few perfect binary codes:

  • The Hamming code C^{(r)}_Ham
  • The Golay code
  • The trivial codes consisting of just 1 codeword, or the whole space, or the repetition code {0^n, 1^n} for odd n

The Golay Code

Like the Hamming codes, the Golay code can be constructed in a number of ways. Gol_23 consists of those (c_0, c_1, ..., c_22) ∈ {0,1}^23 for which the polynomial c(X) = c_0 + c_1 X + ... + c_22 X^22 is divisible by

g(X) = 1 + X + X^5 + X^6 + X^7 + X^9 + X^{11}

in the ring F_2[X]/(X^{23} − 1).
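Divisibility in F_2[X] can be tested with carry-less (XOR) long division. Below is an illustrative sketch using Python integers as coefficient bit-vectors (bit i = coefficient of X^i):

```python
# g(X) = 1 + X + X^5 + X^6 + X^7 + X^9 + X^11, as a coefficient bitmask
g = (1 << 0) | (1 << 1) | (1 << 5) | (1 << 6) | (1 << 7) | (1 << 9) | (1 << 11)

def mod2_poly_rem(c, g):
    """Remainder of c(X) divided by g(X) over F_2 (XOR long division)."""
    dg = g.bit_length() - 1
    while c and c.bit_length() - 1 >= dg:
        c ^= g << (c.bit_length() - 1 - dg)
    return c

# X^5 · g(X) is divisible by g(X), so its coefficient vector is a codeword
print(mod2_poly_rem(g << 5, g))  # → 0
```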

3.3 Dual of a Code

Since a linear code is a subspace of F_q^n, we can define its dual in the orthogonal space – a co-code, if you will

  • Dual code: if C ⊆ F_q^n is a linear code, then its dual C^⊥ is a linear code over F_q given by

C^\perp = \{ z \in \mathbb F_q^n \mid \langle z, c \rangle = 0 \; \forall c \in C \}

where \langle z, c \rangle = \sum^n_{i=1} z_ic_i is the dot product over F_q of vectors z, c

  • Unlike vector spaces over ℝ, where the orthogonal complement W^⊥ of a subspace W ⊆ ℝ^n satisfies W ∩ W^⊥ = {0} and W + W^⊥ = ℝ^n, for subspaces of F_q^n, C and C^⊥ can intersect non-trivially
  • In fact, we can have C ⊆ C^⊥ (a self-orthogonal code) or even C = C^⊥ (a self-dual code)

3.4 Dual of the Hamming Code

The dual of the Hamming code C^{(r)}_Ham has a generator matrix G_r = (H_r)^T, a (2^r − 1) × r matrix whose rows are all the non-zero bit vectors of length r. This yields a [2^r − 1, r]_2 code called a simplex code.

The closely related Hadamard code is obtained by adding the all-zeros row to G_r:

Had_r is a [2^r, r]_2 linear code whose 2^r × r generator matrix has all r-bit vectors as its rows.

  • Its encoding map encodes x ∈ F_2^r as the string in F_2^{2^r} consisting of the dot products ⟨x, a⟩ for every a ∈ F_2^r
  • The Hadamard code can also be defined over F_q, by encoding a message in F_q^k with its dot product with every vector in F_q^k
  • The Hadamard code is the most redundant linear code in which no two codeword symbols are equal in every codeword, implying a robust distance property:

    • The Hadamard and simplex codes have a minimum distance of 2^{r−1}. The q-ary Hadamard code of dimension r has distance (1 − 1/q)q^r
    • Proof: for x ≠ 0, ⟨x, a⟩ ≠ 0 for exactly 2^{r−1} (i.e. half) of the elements a ∈ F_2^r
    • Assume for definiteness that x_1 = 1
    • Then, for every a, ⟨x, a⟩ + ⟨x, a + e_1⟩ = x_1 = 1, and therefore exactly one of ⟨x, a⟩ and ⟨x, a + e_1⟩ equals 1. The proof for the q-ary case is similar
  • Binary codes cannot have a relative distance δ(C) of more than 1/2, unless they only have a fixed number of codewords. Thus the relative distance of Hadamard codes is optimal, but their rate is necessarily poor
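The distance claim for the binary case is easy to verify exhaustively for small r; an illustrative sketch:

```python
from itertools import product

def hadamard_encode(x):
    """Encode x in F_2^r as the dot products <x, a> over all a in F_2^r."""
    return [sum(xi * ai for xi, ai in zip(x, a)) % 2
            for a in product([0, 1], repeat=len(x))]

r = 3
codewords = [hadamard_encode(x) for x in product([0, 1], repeat=r)]
weights = {sum(c) for c in codewords if any(c)}
print(weights)  # → {4}: every nonzero codeword has weight exactly 2^(r-1)
```

Since the code is linear, the minimum nonzero weight equals the minimum distance, so this confirms a distance of 2^{r−1} = 4 for r = 3.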

Code Families and Asymptotically Good Codes

  • The Hamming and Hadamard codes exhibit two extreme trade-offs between rate and distance

    • The Hamming codes’ rate approaches 1 (optimal), but their distance is only 3
    • Conversely, the Hadamard codes have an optimal relative distance of 1/2, but their rate approaches 0
  • The natural question that follows is: is there a class of codes with both good rate and good relative distance, such that neither approaches 0 for large block lengths?
  • To answer this question, we consider the asymptotic behavior of code families:

    • Define a code family C to be an infinite collection {C_i | i ∈ ℕ}, where C_i is a q_i-ary code of block length n_i, with n_i > n_{i−1}
    • Many constructions of codes belong to an infinite family of codes that share a general structure and properties whose asymptotic behavior guides their construction
    • We also need to extend the notions of rate and relative distance to these infinite families:

\mathcal R(\mathcal C) = \lim_{i \rightarrow \infty } \frac{k_i}{n_i}, \qquad \delta(\mathcal C) = \lim_{i \rightarrow \infty } \frac{\Delta(C_i)}{n_i}

  • A q-ary family of codes is said to be asymptotically good if its rate and relative distance are bounded away from zero: there exist constants R_0 > 0, δ_0 > 0 such that R(C) > R_0, δ(C) > δ_0

While a good deal is known about the existence, and even explicit construction, of good codes as well as their limitations, the best possible asymptotic trade-off between rate and relative distance for binary codes remains a fundamental open question, which has seen no improvement since the late 70s.

Error Correction Coding: Mathematical Methods and Algorithms - Todd Moon

1 | A Context for Error Correction Coding

1.2 - Where Are Codes

  • Error control coding comprises error correction and error detection

Da heck is an ISBN

  • For a valid ISBN, the cumulative sum of the cumulative sums of the digits must be equal to 0 mod 11

\underbrace{0}_{\text{country}} - \underbrace{201}_{\text{publisher}} - \underbrace{36186}_{\text{book \#}} - \underbrace{8}_{\text{check}}
digit cum. sum cum. sum of sums
0 0 0
2 2 2
0 2 4
1 3 7
3 6 13
6 12 25
1 13 38
8 21 59
6 27 86
8 35 121

The final sum is 121 \equiv 0 \mod 11
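The running-sums check can be written directly (an illustrative sketch; `isbn10_valid` is not a name from the book):

```python
def isbn10_valid(digits):
    """ISBN-10 check: the cumulative sum of cumulative sums must be 0 mod 11."""
    s = ss = 0
    for d in digits:
        s += d    # running sum of digits
        ss += s   # running sum of the running sums
    return ss % 11 == 0

print(isbn10_valid([0, 2, 0, 1, 3, 6, 1, 8, 6, 8]))  # → True (121 ≡ 0 mod 11)
```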

1.3 - The Communication System

Information theory treats information as a pseudo-physical quantity which can be measured, transformed, stored, and moved from place to place. A fundamental concept of information theory is that information is conveyed by the resolution of uncertainty.

  • Probabilities are used to describe uncertainty. For a discrete random variable X ∈ {0, 1}, the information conveyed by observing an outcome x is −log_2 P(X = x) bits

    • If P(X = 1) = 1, the outcome of 1 is certain, and observing X = 1 yields −log_2(1) = 0 bits of information
    • Conversely, observing X = 0 in this case yields −log_2(0) = ∞: a total surprise at observing an impossible outcome
  • Entropy is the average information. For our example above, with a discrete, binary random variable X, we have either H_2(X) (entropy of the source) or H_2(p) (a function of the outcome probabilities) given by

H_2(X) = H_2(p) = \mathbb E[- \log_2 P(X)] = -p \log_2(p) - (1-p) \log_2(1 -p) \text{ bits}

  • For a source X having N outcomes x_1, x_2, ..., x_N with probabilities P(X = x_i) = p_i, i = 1, 2, ..., N, the entropy is

H(X) = \mathbb E [-\log_2 P(X)] = -\sum_{i=1}^N p_i \log_2 p_i \text{ bits}

  • Source-Channel Separation Theorem: treating the problems of data compression and error correction separately, rather than seeking a jointly optimal source/channel coding solution, is asymptotically optimal as block sizes grow
  • Because of the redundancy introduced by the channel coder, there must be more symbols at the output of the coder than at the input. A channel coder operates by accepting a block of k input symbols and outputting a block of n > k symbols
  • The rate of a channel coder is R = k/n < 1
  • Channel Coding Theorem: provided that the rate R of transmission is less than the channel capacity C, there exists a code such that the probability of error can be made arbitrarily small
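The binary entropy function above can be evaluated directly; a small sketch:

```python
from math import log2

def H2(p):
    """Binary entropy in bits; H2(0) = H2(1) = 0, since certain outcomes carry no information."""
    if p in (0, 1):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(H2(0.5))  # → 1.0: a fair coin conveys one bit per outcome
print(H2(1.0))  # → 0.0: a certain outcome conveys none
```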

1.4 - Binary Phase-Shift Keying

  • The process of mapping a binary sequence to a waveform for transmission over a channel
  • Given a binary sequence {..., b_{−2}, b_{−1}, b_0, b_1, b_2, ...}, we map the set {0, 1} → {−1, 1} via

\tilde{b}_i = (2b_i - 1) \text{ or } \tilde{b}_i = -(2b_i - 1)

  • We also have a_i = \sqrt{E_b}(2b_i - 1) = -\sqrt{E_b}(-1)^{b_i} = \sqrt{E_b}\tilde{b}_i, a mapping of a bit b_i (or \tilde{b}_i) to a signal amplitude which multiplies a waveform φ_1(t), a unit-energy signal

\int^\infty_{-\infty} \varphi_1(t)^2 dt = 1

over [0, T). A bit b_i arriving at time iT can be represented as the signal a_i φ_1(t − iT), where the energy required to transmit the bit b_i is given by

\int^\infty_{-\infty} (a_i \varphi_1(t))^2 dt = E_b

  • The transmitted signals ±\sqrt{E_b} φ_1(t) are points ±\sqrt{E_b} in a (for our purposes, one-) dimensional signal space on the unit-energy φ_1 axis

such that a sequence of bits [1, 1, 0, 1, 0] can be expressed as a waveform.

We can expand this representation to two dimensions by introducing another signal φ_2(t) and modulating the two to establish a signal constellation. Let M = 2^m, for some integer m, be the number of points in a constellation. M-ary transmission is obtained by placing M points S = \{(a_{1k}, a_{2k}),\ k = 0, 1, ..., M-1\} in this signal space and assigning each point a unique pattern of m bits.

  • Note that the assignment of bits to constellation points is such that adjacent points differ by only 1 bit. Such an assignment is called Gray code order. Since it is most probable that errors will move a point to an adjacent point, this reduces the probability of bit error

Associated with each signal (carrying m bits) s_k(t) = a_{1k} φ_1(t) + a_{2k} φ_2(t) and signal constellation point (a_{1k}, a_{2k}) ∈ S is a signal energy

E_k = \int^\infty_{-\infty} (s_k(t))^2 dt = a_{1k}^2 + a_{2k}^2

The average signal energy E_s is the average of all signal energies, assuming that each point is used with equal probability:

E_s = \frac{1}{M} \sum_{k=0}^{M-1} \int^\infty_{-\infty} (s_k(t))^2 dt = \frac{1}{M} \sum_{k=0}^{M-1} (a_{1k}^2 + a_{2k}^2)

and can be related to the average energy per bit:

E_b = \frac{\text{energy per signal}}{\text{bits per signal}} = \frac{E_s}{m}
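The basic BPSK amplitude mapping described above, sketched in Python (with E_b normalized to 1):

```python
from math import sqrt

Eb = 1.0                # energy per bit (normalized)
bits = [1, 1, 0, 1, 0]
# map {0, 1} -> {-sqrt(Eb), +sqrt(Eb)} via a_i = sqrt(Eb) * (2*b_i - 1)
amplitudes = [sqrt(Eb) * (2 * b - 1) for b in bits]
print(amplitudes)  # → [1.0, 1.0, -1.0, 1.0, -1.0]
```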

2 | Groups and Vector Spaces

2.1 - Introduction

Linear block codes form groups and vector spaces, structures which provide useful tools for encoding and decoding

2.2 - Groups

A Group formalizes some basic rules of arithmetic necessary for cancellation and solution of simple equations

A binary operation * on a set is a rule that assigns to each ordered pair of elements a, b of the set some element of the set

  • Since the operation is defined to return an element of the set, it is closed
  • We assume from here on that all binary operations are closed
  • Examples include: \min(a, b), \; \text{fst}(a,b), \; a + b

A group ⟨G, *⟩ is a set G with a binary operation * on G such that

  • The operator is associative: for any a, b, c ∈ G, (a * b) * c = a * (b * c)
  • There exists an identity element e ∈ G such that a * e = e * a = a for all a ∈ G
  • For all a ∈ G there exists b ∈ G, the inverse of a, such that a * b = b * a = e. The inverse can be denoted a^{−1} or −a for multiplication-like and addition-like operations, respectively

When * is obvious from context, the group may be referred to as just the set G.

If G has a finite number of elements, it is said to be a finite group with order |G|, the number of elements.

A group G is commutative if a * b = b * a for all a, b ∈ G.

  • ⟨ℤ, +⟩, the integers under addition, forms a group with identity 0, since a + 0 = 0 + a = a for all a ∈ ℤ

    • The inverse of a ∈ ℤ is −a, and the group is commutative
    • Commutative groups are called Abelian
  • ⟨ℤ, ·⟩, the integers under multiplication, does not form a group, since there does not exist a multiplicative inverse for every element of the integers
  • ⟨ℚ \ {0}, ·⟩, the rationals without zero under multiplication, with identity 1 and inverse a^{−1} = 1/a, forms a group

These rules are strong enough to introduce the notion of cancellation. In a group G, if a * b = a * c, then b = c by left cancellation via

\begin{aligned} & a^{-1} * (a * b) = a^{-1} * (a * c) \\ \implies & (a^{-1} * a) * b = (a^{-1} * a) * c \\ \implies & e * b = e * c \\ \implies & b = c \end{aligned}

and similarly for right cancellation, via the properties of associativity and identity.

We can also show that solutions to linear equations of the form a * x = b are unique.

  • Immediately, we get x = a^{−1} * b. If x_1, x_2 are two solutions such that

a * x_1 = b = a * x_2

then by cancellation we get x_1 = x_2.

For example, let ⟨ℤ_5, +⟩ be addition mod 5 on the set {0, 1, 2, 3, 4} with identity 0.

+ 0 1 2 3 4
0 0 1 2 3 4
1 1 2 3 4 0
2 2 3 4 0 1
3 3 4 0 1 2
4 4 0 1 2 3

  • Since 0 appears in each row and column, every element has an inverse
  • By uniqueness of solutions, every element must appear in every row and column, which it does. So ⟨ℤ_5, +⟩ is an Abelian group
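The group axioms for ⟨ℤ_5, +⟩ can also be verified exhaustively; an illustrative sketch:

```python
n = 5
G = range(n)
op = lambda a, b: (a + b) % n  # addition mod 5

assert all(op(op(a, b), c) == op(a, op(b, c)) for a in G for b in G for c in G)  # associative
assert all(op(a, 0) == a and op(0, a) == a for a in G)                           # identity 0
assert all(any(op(a, b) == 0 for b in G) for a in G)                             # inverses
assert all(op(a, b) == op(b, a) for a in G for b in G)                           # commutative
print("Z_5 under addition mod 5 is an Abelian group")
```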

One way to form groups is via the Cartesian product. Given groups ⟨G_1, *⟩, ⟨G_2, *⟩, ..., ⟨G_r, *⟩, the direct product group G = G_1 × G_2 × ... × G_r has elements (a_1, a_2, ..., a_r) where a_i ∈ G_i. The operation is performed elementwise, such that if

(a_1, a_2, ..., a_r) \in G

and

(b_1, b_2, ..., b_r) \in G

then

(a_1, a_2, ..., a_r) * (b_1, b_2, ..., b_r) = (a_1 * b_1, a_2 * b_2, ..., a_r * b_r)


The group ⟨ℤ_2 × ℤ_2, +⟩ consists of pairs of bits under elementwise addition mod 2.

+ (0,0) (0,1) (1,0) (1,1)
(0,0) (0,0) (0,1) (1,0) (1,1)
(0,1) (0,1) (0,0) (1,1) (1,0)
(1,0) (1,0) (1,1) (0,0) (0,1)
(1,1) (1,1) (1,0) (0,1) (0,0)

This is called the Klein 4-group. We next introduce permutations as elements of a group, with function composition as the operation, as opposed to simple arithmetic.


A permutation of a set A is a bijection (one-to-one and onto) of A onto itself. Let

A = \{ 1,2,3,4 \}

then a permutation

p_1 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 4 & 2 & 1 \end{pmatrix}

may be read column-by-column as

p_1 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ \downarrow & \downarrow & \downarrow & \downarrow \\ 3 & 4 & 2 & 1 \end{pmatrix}

There are n! different permutations on n distinct elements. We can think of p_1 as a prefix operation: p_1 ∘ 1 = 3 or p_1 ∘ 4 = 1. If

p_2 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 3 & 1 & 2 \end{pmatrix}


p2p1=(12343421)(12344312)=(12341234)=ep_2 \circ p_1 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 4 & 2 & 1 \end{pmatrix} \circ \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 3 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \end{pmatrix} = e

demonstrates composition of permutations. The set of permutations is closed under composition, with identity e and inverse

p_1^{-1} = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 3 & 1 & 2 \end{pmatrix} = p_2
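These manipulations are easy to sketch in Python, representing a permutation as a dict (the `compose` and `inverse` helpers are illustrative, not from the text):

```python
# A permutation of {1, 2, 3, 4} as a dict: p[i] is the image of i.
p1 = {1: 3, 2: 4, 3: 2, 4: 1}
p2 = {1: 4, 2: 3, 3: 1, 4: 2}
identity = {1: 1, 2: 2, 3: 3, 4: 4}

def compose(p, q):
    """(p o q)(x) = p(q(x)): apply q first, then p."""
    return {x: p[q[x]] for x in q}

def inverse(p):
    return {v: k for k, v in p.items()}

print(compose(p2, p1) == identity)  # True
print(inverse(p1) == p2)            # True
```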

Composition of permutations is associative: for

p1,p2,p3;(p1p2)p3=p1(p2p3)p_1, p_2, p_3; \\ (p_1 \circ p_2) \circ p_3 = p_1 \circ (p_2 \circ p_3)

and thus the set of all n!n! permutations on nn elements forms a group under composition.

  • This is known as a symmetric group on nn letters SnS_n.
  • Note that composition is not commutative in general, since p \circ q \neq q \circ p for arbitrary permutations p, q, so S_n for n \geq 3 is a non-commutative group

2.2.1 Subgroups

A subgroup \langle H, * \rangle of G is a group formed from a subset of the elements of G, written H < G. When H \subset G is a proper subset, H is said to be a proper subgroup.

The subgroups H = \{ e \} and H = G are the trivial subgroups.


Let G = \langle \Z_6, + \rangle and H = \langle \{ 0, 2, 4 \}, + \rangle, both \mod 6. We can show that H < G is a subgroup, as is K = \langle \{ 0, 3 \}, + \rangle.

Similarly, Z,+<Q,+<R,+<C,+\langle \Z, + \rangle < \langle \mathbb Q, + \rangle < \langle \mathbb R, + \rangle < \langle \mathbb C, + \rangle


Consider the symmetric group S_4, which has a subgroup formed by the following closed set of compositions:

p0=(12341234),  p1=(12342341),p2=(12343412),  p3=(12344123),p4=(12342143),  p5=(12344321),p6=(12343214),  p7=(12341432)p_0 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \end{pmatrix}, \; p_1 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 4 & 1 \end{pmatrix}, \\ p_2 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 4 & 1 & 2 \end{pmatrix}, \; p_3 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 1 & 2 & 3 \end{pmatrix}, \\ p_4 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix}, \; p_5 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 4 & 3 & 2 & 1 \end{pmatrix}, \\ p_6 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 3 & 2 & 1 & 4 \end{pmatrix}, \; p_7 = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 4 & 3 & 2 \end{pmatrix} \\
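As a sanity check (a Python sketch; the tuple encoding and names are assumptions), one can verify that composing any two of these eight permutations lands back in the set:

```python
from itertools import product

# Bottom rows of p_0 .. p_7 from the text; perm[i - 1] is the image of i.
perms = [
    (1, 2, 3, 4), (2, 3, 4, 1), (3, 4, 1, 2), (4, 1, 2, 3),
    (2, 1, 4, 3), (4, 3, 2, 1), (3, 2, 1, 4), (1, 4, 3, 2),
]

def compose(p, q):
    # (p o q) sends i to p[q[i]], applying q first (tuples are 0-indexed).
    return tuple(p[q[i] - 1] for i in range(4))

# Closure: composing any two of the eight stays within the set.
closed = all(compose(p, q) in set(perms) for p, q in product(perms, perms))
print(closed)  # True
```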

2.2.2 Cyclic Groups and the Order of an Element

In a group GG with a multiplication-like operation, we use ana^n to indicate times\underbrace{a * a * ... * a}_{n \text{ times}}

with a0=ea^0=e being the identity element for GG.

For a group with an additive operation, na is used to indicate

a+a+...+an times\underbrace{a + a + ... + a}_{n \text{ times}}

If aGa \in G, any subgroup containing aa must also include a2,a3a^2, a^3, etc.

It must also adhere to e=aa1,an=(a1)ne = aa^{-1}, a^{-n} = (a^{-1})^n.

For any a \in G, the set \{a^n | n \in \Z\} generates a subgroup of G called the cyclic subgroup. a is said to be the generator of the subgroup, and the resulting group is denoted \langle a \rangle.

If every element of a group can be generated from a single element, the group is said to be cyclic.


\langle \Z_5, + \rangle is cyclic since every element can be generated by a = 2.

2=22+2=42+2+2=12+2+2+2=32+2+2+2+2=0\begin{aligned} 2 &= 2 \\ 2 + 2 &= 4 \\ 2 + 2 + 2 &= 1 \\ 2 + 2 + 2 + 2 &= 3 \\ 2 + 2 + 2 + 2 + 2 &= 0 \\ \end{aligned}

In a group G with a \in G, the smallest positive n such that a^n is equal to the identity in G is said to be the order of the element a.

  • If no such nn exists, aa is of infinite order.
  • In Z5\Z_5, 2 is of order 55, as are all non-zero elements of the set.


If G=Z6,+G = \langle \Z_6, + \rangle, then

2={0,2,4},3={0,3},5={0,1,2,3,4,5}\begin{aligned} \langle 2 \rangle &= \{ 0, 2, 4\}, \\ \langle 3 \rangle &= \{ 0, 3\}, \\ \langle 5 \rangle &= \{ 0, 1, 2, 3, 4, 5 \} \end{aligned}

Therefore aZ6a \in \Z_6 is a generator for the group if and only if aa and 66 are relatively prime
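A short Python sketch (the name `generated` is illustrative) confirms the cyclic subgroups of \langle \Z_6, + \rangle and the gcd criterion for generators:

```python
from math import gcd

# Cyclic subgroup <a> of <Z_n, +>: keep adding a until we cycle back to 0.
def generated(a, n):
    seen, x = set(), 0
    while True:
        x = (x + a) % n
        seen.add(x)
        if x == 0:
            return seen

print(sorted(generated(2, 6)))  # [0, 2, 4]
print(sorted(generated(5, 6)))  # [0, 1, 2, 3, 4, 5]

# a generates all of Z_6 exactly when gcd(a, 6) == 1.
print(all((len(generated(a, 6)) == 6) == (gcd(a, 6) == 1) for a in range(1, 6)))  # True
```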


2.2.3 - Cosets

Given a subgroup H < G (with G not necessarily commutative), the left coset a * H is the set

{ahhH}\{a * h | h \in H \}

and the right coset H * a is given by the symmetric relation \{h * a | h \in H \}. In a commutative group, the two are obviously equal.

Let G be a group, H a subgroup of G, and a * H \subset G a left coset. Then b \in a * H if and only if b = a * h for some h \in H. By cancellation, this is equivalent to a^{-1} * b \in H.

To determine if a,ba,b are in the same (left) coset, we simply check the above condition.


Let G=Z,+G = \langle \Z, + \rangle and S0=3Z={...,6,3,0,3,6,...}S_0 = 3 \Z = \{ ..., -6, -3, 0, 3, 6, ...\} such that S0<GS_0 < G. Now, form the cosets:

\begin{aligned} S_1 &= S_0 + 1 = \{ ..., -5, -2, 1, 4, 7, ...\} \\ S_2 &= S_0 + 2 = \{ ..., -4, -1, 2, 5, 8, ...\} \end{aligned}

and observe that neither S_1 nor S_2 is a subgroup, as neither contains the identity, but the union of all three covers the original group: G = S_0 \cup S_1 \cup S_2.

We can determine whether or not two numbers a = 4, b = 6 are in the same coset of S_0 by checking whether (-a) + b \in S_0. Here

-a + b = 2 \notin S_0

so 4 and 6 lie in different cosets.
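The same-coset check reduces to divisibility; a minimal Python sketch (the name `same_coset` is illustrative):

```python
# a and b lie in the same coset of mZ exactly when (-a) + b is in mZ,
# i.e. when b - a is divisible by m.
def same_coset(a, b, m=3):
    return (b - a) % m == 0

print(same_coset(4, 6))  # False: -4 + 6 = 2 is not a multiple of 3
print(same_coset(4, 7))  # True:  -4 + 7 = 3 is a multiple of 3
```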

2.2.4 - Lagrange's Theorem

Basically, it constrains the size of a subgroup relative to that of its group.

  • Every coset of HH in a group GG has the same number of elements
  • Let (a * h_1), (a * h_2) \in a * H be two elements in the coset a * H. If they are equal, then by cancellation h_1 = h_2, and therefore the elements of a coset are uniquely identified by the elements of H

It follows, then, that we have the following properties of cosets:

  • reflexivity: An element a is in the same coset as itself
  • symmetry: If a,ba,b are in the same coset, then b,ab,a are in the same coset
  • transitivity: If a,ba,b are in the same coset, and b,cb,c are in the same coset, then a,ca, c are in the same coset as well

Therefore, the relation of "being in the same coset" is an equivalence relation, and thus, every equivalence relation partitions its elements into disjoint sets.

  • It follows then that the distinct cosets of H in a group G are disjoint

With these properties and lemmas, Lagrange's Theorem states: if G is a group of finite order, and H a subgroup of G, then the order of H divides the order of G, i.e. |G| \mod |H| = 0. We write a | b for "a divides b" (without remainder).

  • If G<|G| < \infty and H<GH < G, then HG|H| \big| |G|

Lagrange's Theorem implies that every group of prime order is cyclic.

  • Let G be of prime order, with a \in G, a \neq e, where e = \mathbf{id}_G. Let H = \langle a \rangle, the cyclic subgroup generated by a. Then a, e \in H, so |H| > 1. By Lagrange's Theorem, the order of H must ("cleanly") divide the order of G. Since |G| is prime, we must have |H| = |G|; thus a generates G, so G must be cyclic.
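This argument can be checked by brute force on \langle \Z_p, + \rangle; a minimal Python sketch (the name `order_of` is illustrative):

```python
# Order of a in <Z_n, +>: smallest k >= 1 with k*a = 0 (mod n).
def order_of(a, n):
    x, k = a % n, 1
    while x != 0:
        x = (x + a) % n
        k += 1
    return k

# In a group of prime order, every non-identity element has full order,
# hence generates the whole group.
print(all(order_of(a, p) == p for p in (2, 3, 5, 7) for a in range(1, p)))  # True
```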

2.2.5 - Induced Operations; Isomorphism

Returning to the three cosets we defined earlier, we have S = \{S_0, S_1, S_2 \} and define addition on S as follows: for

A,B,CS;A+B=CA, B, C \in S; \\ A + B = C

if and only if a + b = c for representatives a \in A, b \in B, c \in C. That is, addition of the sets is defined via representatives, and the result is independent of which representatives are chosen. This is said to be the induced operation on the cosets: S_1 + S_2 = S_0.

Taking 1 \in S_1 and 2 \in S_2, we note that 1 + 2 = 3 \in S_0. Similarly, S_1 + S_1 = S_2 via 1 + 1 = 2, taking the representatives to be 1 \in S_1. Based on this induced operation, we obtain the following table

| + | S_0 | S_1 | S_2 |
| --- | --- | --- | --- |
| S_0 | S_0 | S_1 | S_2 |
| S_1 | S_1 | S_2 | S_0 |
| S_2 | S_2 | S_0 | S_1 |

which we say is \cong
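The induced operation can be reproduced by working with residue representatives 0, 1, 2 for S_0, S_1, S_2; a minimal Python sketch (the name `coset_add` is illustrative), which makes the correspondence with \langle \Z_3, + \rangle plain:

```python
# Cosets of 3Z in Z, represented by residues 0, 1, 2 (for S_0, S_1, S_2);
# the induced addition of cosets is addition of representatives mod 3.
def coset_add(i, j):
    return (i + j) % 3

table = [[coset_add(i, j) for j in range(3)] for i in range(3)]
for row in table:
    print(row)
```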