\(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\)

11. Vector Spaces of Random Variables

Basic Theory

Many of the concepts in this chapter have elegant interpretations if we think of real-valued random variables as vectors in a vector space. In particular, variance and higher moments are related to the concept of norm and distance, while covariance is related to inner product. These connections can help unify and illuminate some of the ideas in the chapter from a different point of view. Of course, real-valued random variables are simply measurable, real-valued functions defined on the sample space, so much of the discussion in this section is a special case of our discussion of function spaces in the chapter on Distributions, but recast in the notation of probability.

As usual, our starting point is a random experiment modeled by a probability space \( (\Omega, \mathscr{F}, \P) \). Thus, \( \Omega \) is the sample space, \( \mathscr{F} \) is the \( \sigma \)-algebra of events, and \( \P \) is the probability measure. Our basic vector space \(\mathscr{V}\) consists of all real-valued random variables defined on \((\Omega, \mathscr{F}, \P)\). Recall that random variables \( X_1 \) and \( X_2 \) are equivalent if \( \P(X_1 = X_2) = 1 \), in which case we write \( X_1 \equiv X_2 \). We consider two such random variables as the same vector, so that technically, our vector space consists of equivalence classes under this equivalence relation. The addition operator corresponds to the usual addition of two real-valued random variables, and the operation of scalar multiplication corresponds to the usual multiplication of a real-valued random variable by a real (non-random) number. These operations are compatible with the equivalence relation in the sense that if \( X_1 \equiv X_2 \) and \( Y_1 \equiv Y_2 \) then \( X_1 + Y_1 \equiv X_2 + Y_2 \) and \( c X_1 \equiv c X_2 \) for \( c \in \R \). In short, the vector space \( \mathscr{V} \) is well-defined.

Norm

Suppose that \( k \in [1, \infty) \). The \( k \) norm of \( X \in \mathscr{V} \) is defined by

\[ \|X\|_k = \left[\E\left(\left|X\right|^k\right)\right]^{1 / k} \]
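As a concrete illustration (a Monte Carlo sketch in Python, not part of the formal development), we can estimate \(\|X\|_k\) from simulated values of \(X\). For \(X\) uniformly distributed on \([0, 1]\), \(\E\left(\left|X\right|^k\right) = 1/(k+1)\), so the exact norm is \(1/(k+1)^{1/k}\):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.uniform(0.0, 1.0, size=1_000_000)  # simulated values of X ~ uniform on [0, 1]

def k_norm(x, k):
    """Monte Carlo estimate of ||X||_k = [E(|X|^k)]^(1/k)."""
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

for k in (1, 2, 3):
    exact = 1.0 / (k + 1) ** (1.0 / k)
    print(k, k_norm(sample, k), exact)
```

With a sample this large the estimates agree with the exact values to about three decimal places.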

Thus, \(\|X\|_k\) is a measure of the size of \(X\) in a certain sense, and of course it's possible that \( \|X\|_k = \infty \). The following theorems establish the fundamental properties. The first is the positive property.

Suppose again that \( k \in [1, \infty) \). For \( X \in \mathscr{V} \),

  1. \(\|X\|_k \ge 0\)
  2. \(\|X\|_k = 0\) if and only if \(\P(X = 0) = 1\) (so that \(X \equiv 0\)).
Proof:

These results follow from the basic inequality properties of expected value. First \( \left|X\right|^k \ge 0 \) with probability 1, so \( \E\left(\left|X\right|^k\right) \ge 0 \). In addition, \( \E\left(\left|X\right|^k\right) = 0 \) if and only if \( \P(X = 0) = 1 \).

The next result is the scaling property.

Suppose again that \( k \in [1, \infty) \). Then \(\|c X\|_k = \left|c\right| \, \|X\|_k\) for \( X \in \mathscr{V} \) and \(c \in \R\).

Proof: \[ \| c X \|_k = \left[\E\left(\left|c X\right|^k\right)\right]^{1 / k} = \left[\E\left(\left|c\right|^k \left|X\right|^k\right)\right]^{1/k} = \left[\left|c\right|^k \E\left(\left|X\right|^k\right)\right]^{1/k} = \left|c\right| \left[\E\left(\left|X\right|^k\right)\right]^{1/k} = \left|c\right| \|X\|_k \]

The next result is Minkowski's inequality, named for Hermann Minkowski, and also known as the triangle inequality.

Suppose again that \( k \in [1, \infty) \). Then \(\|X + Y\|_k \le \|X\|_k + \|Y\|_k\) for \( X, \, Y \in \mathscr{V} \).

Proof:

The first quadrant \(S = \left\{(x, y) \in \R^2: x \ge 0, \; y \ge 0\right\}\) is a convex set and \(g(x, y) = \left(x ^{1/k} + y^{1/k}\right)^k\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables, then \[ \E\left[(U^{1/k} + V^{1/k})^k\right] \le \left(\left[\E(U)\right]^{1/k} + \left[\E(V)\right]^{1/k}\right)^k \] Letting \(U = \left|X\right|^k\) and \(V = \left|Y\right|^k\) and simplifying gives the result. To show that \( g \) really is concave on \( S \), we can compute the second partial derivatives. Let \( h(x, y) = x^{1/k} + y^{1/k} \) so that \( g = h^k \). Then \begin{align} g_{xx} & = \frac{k-1}{k} h^{k-2} x^{1/k - 2}\left(x^{1/k} - h\right) \\ g_{yy} & = \frac{k-1}{k} h^{k-2} y^{1/k - 2}\left(y^{1/k} - h\right) \\ g_{xy} & = \frac{k-1}{k} h^{k-2} x^{1/k - 1} y^{1/k - 1} \end{align} Clearly \( h(x, y) \ge x^{1/k} \) and \( h(x, y) \ge y^{1/k} \) for \( x \ge 0 \) and \( y \ge 0 \), so \( g_{xx} \) and \( g_{yy} \), the diagonal entries of the second derivative matrix, are nonpositive on \( S \). A little algebra shows that the determinant of the second derivative matrix \( g_{xx} g_{yy} - g_{xy}^2 = 0\) on \( S \). Thus, the second derivative matrix of \( g \) is negative semi-definite.

It follows from the last three results that the set of random variables (again, modulo equivalence) with finite \(k\) norm forms a subspace of our parent vector space \(\mathscr{V}\), and that the \(k\) norm really is a norm on this vector space.

For \( k \in [1, \infty) \), \( \mathscr{L}_k \) denotes the vector space of \( X \in \mathscr{V} \) with \(\|X\|_k \lt \infty\), and with norm \( \| \cdot \|_k \).

In analysis, \( p \) is often used as the index rather than \( k \) as we have used here, but \( p \) seems too much like a probability, so we have broken with tradition on this point. The \( \mathscr{L} \) is in honor of Henri Lebesgue, who developed much of this theory. Sometimes, when we need to indicate the dependence on the underlying \( \sigma \)-algebra \( \mathscr{F} \), we write \( \mathscr{L}_k(\mathscr{F}) \). Our next result is Lyapunov's inequality, named for Aleksandr Lyapunov. This inequality shows that the \(k\)-norm of a random variable is increasing in \(k\).

Suppose that \( j, \, k \in [1, \infty) \) with \(j \le k\). Then \(\|X\|_j \le \|X\|_k\) for \(X \in \mathscr{V}\).

Proof:

Note that \(S = \{x \in \R: x \ge 0\}\) is convex and \(g(x) = x^{k/j}\) is convex on \(S\). From Jensen's inequality, if \(U\) is a nonnegative random variable then \(\left[\E(U)\right]^{k/j} \le \E\left(U^{k/j}\right)\). Letting \(U = \left|X\right|^j\) and simplifying gives the result.

Lyapunov's inequality shows that if \(1 \le j \le k\) and \( \|X\|_k \lt \infty \) then \( \|X\|_j \lt \infty \). Thus, \(\mathscr{L}_k\) is a subspace of \(\mathscr{L}_j\).
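Lyapunov's inequality holds for any probability measure, including the empirical distribution of a sample, so it can be checked numerically (a Python sketch; the exponential distribution here is just an arbitrary test case, with \(\E(X^k) = k!\) and hence \(\|X\|_k = (k!)^{1/k}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=500_000)  # X ~ exponential with rate 1

def k_norm(x, k):
    """Empirical k-norm: [mean(|x|^k)]^(1/k)."""
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

# By Lyapunov's inequality, the k-norm is nondecreasing in k
norms = [k_norm(sample, k) for k in (1, 1.5, 2, 3, 4)]
print(norms)
```

The printed norms increase with \(k\), approximating \(1, \ldots, \sqrt{2}, 6^{1/3}, 24^{1/4}\).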

Metric

The \(k\) norm, like any norm on a vector space, can be used to define a metric, or distance function; we simply compute the norm of the difference between two vectors.

For \( k \in [1, \infty) \), the \(k\) distance (or \(k\) metric) between \(X, \, Y \in \mathscr{V}\) is defined by \[ d_k(X, Y) = \|X - Y\|_k = \left[\E\left(\left|X - Y\right|^k\right)\right]^{1/k} \]

The following properties are analogous to the properties in norm properties (and thus very little additional work is required for the proofs). These properties show that the \(k\) metric really is a metric on \( \mathscr{L}_k \) (as always, modulo equivalence). The first is the positive property.

Suppose again that \( k \in [1, \infty) \) and that \(X, \; Y \in \mathscr{V}\). Then

  1. \(d_k(X, Y) \ge 0\)
  2. \(d_k(X, Y) = 0\) if and only if \(\P(X = Y) = 1\) (so that \(X \equiv Y\)).
Proof:

These results follow directly from the corresponding norm property.

Next is the obvious symmetry property:

\( d_k(X, Y) = d_k(Y, X) \) for \( X, \; Y \in \mathscr{V} \).

Next is the distance version of the triangle inequality.

\(d_k(X, Z) \le d_k(X, Y) + d_k(Y, Z)\) for \(X, \; Y, \; Z \in \mathscr{V}\)

Proof:

From Minkowski's inequality, \[ d_k(X, Z) = \|X - Z\|_k = \|(X - Y) + (Y - Z) \|_k \le \|X - Y\|_k + \|Y - Z\|_k = d_k(X, Y) + d_k(Y, Z) \]

The last three properties mean that \( d_k \) is indeed a metric on \( \mathscr{L}_k \) for \( k \ge 1 \). In particular, note that the standard deviation is simply the 2-distance from \(X\) to its mean \( \mu = \E(X) \): \[ \sd(X) = d_2(X, \mu) = \|X - \mu\|_2 = \sqrt{\E\left[(X - \mu)^2\right]} \] and the variance is the square of this. More generally, the \(k\)th moment of \(X\) about \(a\) is simply the \(k\)th power of the \(k\)-distance from \(X\) to \(a\). The 2-distance is especially important for reasons that will become clear below, in the discussion of inner product. This distance is also called the root mean square distance.

Center and Spread Revisited

Measures of center and measures of spread are best thought of together, in the context of a measure of distance. For a real-valued random variable \(X\), we first try to find the constants \(t \in \R\) that are closest to \(X\), as measured by the given distance; any such \(t\) is a measure of center relative to the distance. The minimum distance itself is the corresponding measure of spread.

Let us apply this procedure to the 2-distance.

For \( X \in \mathscr{L}_2 \), define the root mean square error function by \[ d_2(X, t) = \|X - t\|_2 = \sqrt{\E\left[(X - t)^2\right]}, \quad t \in \R \]

For \( X \in \mathscr{L}_2 \), \(d_2(X, t)\) is minimized when \(t = \E(X)\) and the minimum value is \(\sd(X)\).

Proof:

Note that the minimum value of \(d_2(X, t)\) occurs at the same points as the minimum value of \(d_2^2(X, t) = \E\left[(X - t)^2\right]\) (this is the mean square error function). Expanding and taking expected values term by term gives \[ \E\left[(X - t)^2\right] = \E\left(X^2\right) - 2 t \E(X) + t^2 \] This is a quadratic function of \( t \) and hence the graph is a parabola opening upward. The minimum occurs at \( t = \E(X) \), and the minimum value is \( \var(X) \). Hence the minimum value of \( t \mapsto d_2(X, t) \) also occurs at \( t = \E(X) \) and the minimum value is \( \sd(X) \).
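The result can be seen numerically with a simple grid search (a Python sketch; the particular distribution is an arbitrary choice): the grid minimizer of the empirical root mean square error function lands at the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=200_000)  # any distribution with finite variance

def rms_error(x, t):
    """d_2(X, t) = sqrt(E[(X - t)^2]) for the empirical distribution of the sample."""
    return np.sqrt(np.mean((x - t) ** 2))

ts = np.linspace(sample.mean() - 2.0, sample.mean() + 2.0, 401)
errors = [rms_error(sample, t) for t in ts]
t_star = ts[np.argmin(errors)]
print(t_star, sample.mean())  # the minimizer agrees with the mean, up to grid resolution
```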

We have seen this computation several times before. The best constant predictor of \( X \) is \( \E(X) \), with mean square error \( \var(X) \). The physical interpretation of this result is that the moment of inertia of the mass distribution of \(X\) about \(t\) is minimized when \(t = \mu\), the center of mass. Next, let us apply our procedure to the 1-distance.

For \( X \in \mathscr{L}_1 \), define the mean absolute error function by \[ d_1(X, t) = \|X - t\|_1 = \E\left[\left|X - t\right|\right], \quad t \in \R \]

We will show that \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\). (Recall that the set of medians of \( X \) forms a closed, bounded interval.) We start with a discrete case, because it's easier and has special interest.

Suppose that \(X \in \mathscr{L}_1\) has a discrete distribution with values in a finite set \(S \subseteq \R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).

Proof:

Note first that \(\E\left(\left|X - t\right|\right) = \E(t - X, \, X \le t) + \E(X - t, \, X \gt t)\). Hence \(\E\left(\left|X - t\right|\right) = a_t \, t + b_t\), where \(a_t = 2 \, \P(X \le t) - 1\) and where \(b_t = \E(X) - 2 \, \E(X, \, X \le t)\). Note that \(\E\left(\left|X - t\right|\right)\) is a continuous, piecewise linear function of \(t\), with corners at the values in \(S\). That is, the function is a linear spline. Let \(m\) be the smallest median of \(X\). If \(t \lt m\) and \(t \notin S\), then the slope of the linear piece at \(t\) is negative. Let \(M\) be the largest median of \(X\). If \(t \gt M\) and \(t \notin S\), then the slope of the linear piece at \(t\) is positive. If \(t \in (m, M)\) then the slope of the linear piece at \(t\) is 0. Thus \(\E\left(\left|X - t\right|\right)\) is minimized for every \(t\) in the median interval \([m, M]\).
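Here is a small numeric illustration of the discrete case (a Python sketch with an arbitrary four-point distribution whose median interval is \([1, 2]\)): the minimizers of the mean absolute error fill out the median interval.

```python
import numpy as np

# A discrete distribution on a finite set; the CDF equals 1/2 on [1, 2),
# so the median interval is [1, 2]
values = np.array([0.0, 1.0, 2.0, 5.0])
probs = np.array([0.2, 0.3, 0.3, 0.2])

def mae(t):
    """d_1(X, t) = E|X - t| for the distribution above."""
    return np.sum(probs * np.abs(values - t))

ts = np.linspace(-1.0, 6.0, 701)
errors = np.array([mae(t) for t in ts])
minimizers = ts[errors <= errors.min() + 1e-12]
print(minimizers.min(), minimizers.max())  # endpoints of the median interval
```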

The last result shows that mean absolute error has a couple of basic deficiencies as a measure of error: the error function is not differentiable at the values of \(X\), and the minimizing value of \(t\) need not be unique. Indeed, when \(X\) does not have a unique median, there is no compelling reason to choose one value in the median interval, as the measure of center, over any other value in the interval.

Suppose now that \(X \in \mathscr{L}_1 \) has a general distribution on \(\R\). Then \(d_1(X, t)\) is minimized when \(t\) is any median of \(X\).

Proof:

Suppose that \(s \lt t\). Computing the expected value over the events \(X \le s\), \(s \lt X \le t\), and \(X \ge t\), and simplifying gives \[ \E\left(\left|X - t\right|\right) = \E\left(\left|X - s\right|\right) + (t - s) \, \left[2 \, \P(X \le s) - 1\right] + 2 \, \E(t - X, \, s \lt X \le t) \] Suppose that \(t \lt s\). Using similar methods gives \[ \E\left(\left|X - t\right|\right) = \E\left(\left|X - s\right|\right) + (t - s) \, \left[2 \, \P(X \lt s) - 1\right] + 2 \, \E(X - t, \, t \le X \lt s) \] Note that the last terms on the right in these equations are nonnegative. If we take \(s\) to be a median of \(X\), then the middle terms on the right in the equations are also nonnegative. Hence if \(s\) is a median of \(X\) and \(t\) is any other number then \(\E\left(\left|X - t\right|\right) \ge \E\left(\left|X - s\right|\right)\).

Convergence

Whenever we have a measure of distance, we automatically have a criterion for convergence.

Suppose that \( X_n \in \mathscr{L}_k \) for \( n \in \N_+ \) and that \( X \in \mathscr{L}_k \), where \( k \in [1, \infty) \). Then \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean if \( X_n \to X \) as \( n \to \infty \) in the vector space \( \mathscr{L}_k \). That is, \[ d_k(X_n, X) = \|X_n - X\|_k \to 0 \text{ as } n \to \infty \] or equivalently \( \E\left(\left|X_n - X\right|^k\right) \to 0\) as \(n \to \infty \).

When \(k = 1\), we simply say that \(X_n \to X\) as \(n \to \infty\) in mean; when \(k = 2\), we say that \(X_n \to X\) as \(n \to \infty\) in mean square. These are the most important special cases.

Suppose that \(1 \le j \le k\). If \(X_n \to X\) as \(n \to \infty\) in \(k\)th mean then \(X_n \to X\) as \(n \to \infty\) in \(j\)th mean.

Proof:

This follows from Lyapunov's inequality: \( 0 \le d_j(X_n, X) \le d_k(X_n, X) \to 0 \) as \( n \to \infty \).

Convergence in \( k \)th mean implies that the \( k \) norms converge.

Suppose that \( X_n \in \mathscr{L}_k \) for \( n \in \N_+ \) and that \( X \in \mathscr{L}_k \), where \( k \in [1, \infty) \). If \( X_n \to X \) as \( n \to \infty \) in \( k \)th mean then \( \|X_n\|_k \to \|X\|_k \) as \( n \to \infty \). Equivalently, if \( \E(|X_n - X|^k) \to 0 \) as \( n \to \infty \) then \( \E(|X_n|^k) \to \E(|X|^k) \) as \( n \to \infty \).

Proof:

This is a simple consequence of the reverse triangle inequality, which holds in any normed vector space. The general result is that if a sequence of vectors in a normed vector space converges, then the norms converge. In our notation here, \[ \left|\|X_n\|_k - \|X\|_k\right| \le \|X_n - X\|_k \] so if the right side converges to 0 as \( n \to \infty \), then so does the left side.

The converse is not true. A counterexample is given below. Our next result shows that convergence in mean is stronger than convergence in probability.

Suppose that \( X_n \in \mathscr{L}_1 \) for \( n \in \N_+ \) and that \( X \in \mathscr{L}_1 \). If \(X_n \to X\) as \(n \to \infty\) in mean, then \(X_n \to X\) as \(n \to \infty\) in probability.

Proof:

This follows from Markov's inequality. For \( \epsilon \gt 0 \), \(0 \le \P\left(\left|X_n - X\right| \gt \epsilon\right) \le \E\left(\left|X_n - X\right|\right) \big/ \epsilon \to 0 \) as \( n \to \infty \).

The converse is not true. Moreover, convergence with probability 1 does not imply convergence in \(k\)th mean; a counterexample is given below. Also convergence in \(k\)th mean does not imply convergence with probability 1; another counterexample is given below. In summary, the implications in the various modes of convergence are shown below; no other implications hold in general.

However, the next section on uniformly integrable variables gives a condition under which convergence in probability implies convergence in mean.

Inner Product

The vector space \( \mathscr{L}_2 \) of real-valued random variables on \( (\Omega, \mathscr{F}, \P) \) (modulo equivalence of course) with finite second moment is special, because it's the only one in which the norm corresponds to an inner product.

The inner product of \( X, \, Y \in \mathscr{L}_2 \) is defined by \[ \langle X, Y \rangle = \E(X Y) \]

The following results are analogous to the basic properties of covariance, and show that this definition really does give an inner product on the vector space \( \mathscr{L}_2 \).

For \( X, \, Y, \, Z \in \mathscr{L}_2 \) and \( a \in \R \),

  1. \(\langle X, Y \rangle = \langle Y, X \rangle\).
  2. \(\langle X, X \rangle \ge 0\) and \(\langle X, X \rangle = 0\) if and only if \(\P(X = 0) = 1\) (so that \(X \equiv 0\)).
  3. \(\langle a X, Y \rangle = a \langle X, Y \rangle\).
  4. \(\langle X + Y, Z \rangle = \langle X, Z \rangle + \langle Y, Z \rangle\).
Proof:
  1. The symmetry property is trivial from the definition.
  2. Note that \( \E(X^2) \ge 0 \) and \( \E(X^2) = 0 \) if and only if \( \P(X = 0) = 1 \).
  3. This follows from the scaling property of expected value: \( \E(a X Y) = a \E(X Y) \)
  4. This follows from the additive property of expected value: \( \E[(X + Y) Z] = \E(X Z) + \E(Y Z) \).

Covariance and correlation can easily be expressed in terms of this inner product. The covariance of two random variables is the inner product of the corresponding centered variables. The correlation is the inner product of the corresponding standard scores.

For \( X, \, Y \in \mathscr{L}_2 \),

  1. \(\cov(X, Y) = \langle X - \E(X), Y - \E(Y) \rangle\)
  2. \(\cor(X, Y) = \left \langle [X - \E(X)] \big/ \sd(X), [Y - \E(Y)] \big/ \sd(Y) \right \rangle\)
Proof:
  1. This is simply a restatement of the definition of covariance.
  2. This is a restatement of the fact that the correlation of two variables is the covariance of their corresponding standard scores.

Thus, real-valued random variables \( X \) and \( Y \) are uncorrelated if and only if the centered variables \( X - \E(X) \) and \( Y - \E(Y) \) are perpendicular or orthogonal as elements of \( \mathscr{L}_2 \).

For \( X \in \mathscr{L}_2 \), \(\langle X, X \rangle = \|X\|_2^2 = \E\left(X^2\right)\).

Thus, the norm associated with the inner product is the 2-norm studied above, and corresponds to the root mean square operation on a random variable. This fact is a fundamental reason why the 2-norm plays such a special, honored role; of all the \(k\)-norms, only the 2-norm corresponds to an inner product. In turn, this is one of the reasons that root mean square difference is of fundamental importance in probability and statistics. Technically, the vector space \( \mathscr{L}_2 \) is a Hilbert space, named for David Hilbert.

The next result is Hölder's inequality, named for Otto Hölder.

Suppose that \(j, \, k \in (1, \infty)\) and \(\frac{1}{j} + \frac{1}{k} = 1\). For \( X \in \mathscr{L}_j \) and \( Y \in \mathscr{L}_k \), \[\langle \left|X\right|, \left|Y\right| \rangle \le \|X\|_j \|Y\|_k \]

Proof:

Note that \(S = \left\{(x, y) \in \R^2: x \ge 0, \; y \ge 0\right\}\) is a convex set and \(g(x, y) = x^{1/j} y^{1/k}\) is concave on \(S\). From Jensen's inequality, if \(U\) and \(V\) are nonnegative random variables then \(\E\left(U^{1/j} V^{1/k}\right) \le \left[\E(U)\right]^{1/j} \left[\E(V)\right]^{1/k}\). Substituting \(U = \left|X\right|^j\) and \(V = \left|Y\right|^k\) gives the result.

To show that \( g \) really is concave on \( S \), we compute the second derivative matrix:

\[ \left[ \begin{matrix} (1 / j)(1 / j - 1) x^{1 / j - 2} y^{1 / k} & (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} \\ (1 / j)(1 / k) x^{1 / j - 1} y^{1 / k - 1} & (1 / k)(1 / k - 1) x^{1 / j} y^{1 / k - 2} \end{matrix} \right] \]

Since \( 1 / j \lt 1 \) and \( 1 / k \lt 1 \), the diagonal entries are negative on \( S \). The determinant simplifies to

\[ (1 / j)(1 / k) x^{2 / j - 2} y^{2 / k - 2} [1 - (1 / j + 1 / k)] = 0 \]

In the context of the last theorem, \(j\) and \(k\) are called conjugate exponents. If we let \(j = k = 2\) in Hölder's inequality, then we get the Cauchy-Schwarz inequality, named for Augustin Cauchy and Hermann Schwarz: For \( X, \, Y \in \mathscr{L}_2 \), \[ \E\left(\left|X\right| \left|Y\right|\right) \le \sqrt{\E\left(X^2\right)} \sqrt{\E\left(Y^2\right)} \] In turn, the Cauchy-Schwarz inequality is equivalent to the basic inequalities for covariance and correlation: For \( X, \, Y \in \mathscr{L}_2 \), \[ \left| \cov(X, Y) \right| \le \sd(X) \sd(Y), \quad \left|\cor(X, Y)\right| \le 1 \]
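These inequalities hold for any probability measure, including the empirical distribution of a simulated sample, so they can be sanity-checked directly (a Python sketch with an arbitrary correlated pair):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # Y correlated with X

lhs = np.mean(np.abs(x) * np.abs(y))          # E(|X| |Y|)
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))  # ||X||_2 ||Y||_2
print(lhs, rhs)  # Cauchy-Schwarz: lhs <= rhs

cov = np.mean((x - x.mean()) * (y - y.mean()))
print(abs(cov), x.std() * y.std())  # |cov(X, Y)| <= sd(X) sd(Y)
```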

If \(j, \, k \in (1, \infty)\) are conjugate exponents then

  1. \(k = \frac{j}{j - 1}\).
  2. \(k \downarrow 1\) as \(j \uparrow \infty\).

The following result is equivalent to the identity \( \var(X + Y) + \var(X - Y) = 2\left[\var(X) + \var(Y)\right] \) that we studied in the section on covariance and correlation. In the context of vector spaces, the result is known as the parallelogram rule:

If \(X, \, Y \in \mathscr{L}_2\) then \[ \|X + Y\|_2^2 + \|X - Y\|_2^2 = 2 \|X\|_2^2 + 2 \|Y\|_2^2\]

Proof:

This result follows from the bi-linearity of inner product: \begin{align} \|X + Y\|_2^2 + \|X - Y\|_2^2 & = \langle X + Y, X + Y \rangle + \langle X - Y, X - Y\rangle \\ & = \left(\langle X, X \rangle + 2 \langle X, Y \rangle + \langle Y, Y \rangle\right) + \left(\langle X, X \rangle - 2 \langle X, Y \rangle + \langle Y, Y \rangle\right) = 2 \|X\|_2^2 + 2 \|Y\|_2^2 \end{align}
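Since the parallelogram rule is an algebraic identity in the inner product, it holds exactly (up to floating-point rounding) even for the empirical distribution of a simulated sample (a Python sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = rng.exponential(size=10_000)

def sq_norm(v):
    """||V||_2^2 = E(V^2) for the empirical distribution of the sample."""
    return np.mean(v ** 2)

lhs = sq_norm(x + y) + sq_norm(x - y)
rhs = 2 * sq_norm(x) + 2 * sq_norm(y)
print(lhs, rhs)  # equal up to floating-point rounding
```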

The following result is equivalent to the statement that the variance of the sum of uncorrelated variables is the sum of the variances, which again we proved in the section on covariance and correlation. In the context of vector spaces, the result is the famous Pythagorean theorem, named for Pythagoras of course.

If \((X_1, X_2, \ldots, X_n)\) is a sequence of random variables in \(\mathscr{L}_2\) with \(\langle X_i, X_j \rangle = 0\) for \(i \ne j\) then \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \sum_{i=1}^n \|X_i\|_2^2 \]

Proof:

Again, this follows from the bi-linearity of inner product: \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \left\langle \sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right\rangle = \sum_{i=1}^n \sum_{j=1}^n \langle X_i, X_j \rangle \] The terms with \( i \ne j \) are 0 by the orthogonality assumption, so \[ \left \| \sum_{i=1}^n X_i \right \|_2^2 = \sum_{i=1}^n \langle X_i, X_i \rangle = \sum_{i=1}^n \|X_i\|_2^2 \]
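The Pythagorean theorem can be seen concretely by orthogonalizing one sample vector against another, a single Gram-Schmidt step (a Python sketch, using the empirical version of the inner product \(\langle U, V \rangle = \E(U V)\)):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=10_000)
y = rng.normal(size=10_000)

def inner(u, v):
    """<U, V> = E(UV) for the empirical distribution."""
    return np.mean(u * v)

# One Gram-Schmidt step: y_perp is orthogonal to x by construction
y_perp = y - (inner(x, y) / inner(x, x)) * x
print(inner(x, y_perp))  # ~0, up to rounding

lhs = inner(x + y_perp, x + y_perp)        # ||X + Y_perp||_2^2
rhs = inner(x, x) + inner(y_perp, y_perp)  # ||X||_2^2 + ||Y_perp||_2^2
print(lhs, rhs)  # equal, by the Pythagorean theorem
```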

Projections

The best linear predictor studied in the section on covariance and correlation, and conditional expected value, both have nice interpretations in terms of projections onto subspaces of \( \mathscr{L}_2 \). First let's review the concepts. Recall that \( \mathscr{U} \) is a subspace of \( \mathscr{L}_2 \) if \( \mathscr{U} \subseteq \mathscr{L}_2 \) and \( \mathscr{U} \) is also a vector space (under the same operations of addition and scalar multiplication). To show that \( \mathscr{U} \subseteq \mathscr{L}_2 \) is a subspace, we just need to show the closure properties (the other axioms of a vector space are inherited).

Suppose now that \( \mathscr{U} \) is a subspace of \( \mathscr{L}_2 \) and that \( X \in \mathscr{L}_2 \). Then the projection of \( X \) onto \( \mathscr{U} \) (if it exists) is the vector \( V \in \mathscr{U} \) with the property that \( X - V \) is perpendicular to \( \mathscr{U} \): \[ \langle X - V, U \rangle = 0, \quad U \in \mathscr{U} \] The projection has two critical properties: It is unique (if it exists) and it is the vector in \( \mathscr{U} \) closest to \( X \). If you look at the proofs of these results, you will see that they are essentially the same as the ones used for the best predictors of \( X \) mentioned at the beginning of this subsection. Moreover, the proofs use only vector space concepts—the fact that our vectors are random variables on a probability space plays no special role.

The projection of \( X \) onto \( \mathscr{U} \) (if it exists) is unique.

Proof:

Suppose that \( V_1 \) and \( V_2 \) satisfy the definition. Then \[ \left\|V_1 - V_2\right\|_2^2 = \langle V_1 - V_2, V_1 - V_2 \rangle = \langle V_1 - X + X - V_2, V_1 - V_2 \rangle = \langle V_1 - X, V_1 - V_2 \rangle + \langle X - V_2, V_1 - V_2 \rangle = 0 \] The last equality holds by assumption and the fact that \( V_1 - V_2 \in \mathscr{U} \).

Suppose that \( V \) is the projection of \( X \) onto \( \mathscr{U} \). Then

  1. \( \left\|X - V\right\|_2^2 \le \left\|X - U\right\|_2^2\) for all \( U \in \mathscr{U} \).
  2. Equality holds in (a) if and only if \( U = V \)
Proof:
  1. If \( U \in \mathscr{U} \) then \[ \left\| X - U \right\|_2^2 = \left\| X - V + V - U \right\|_2^2 = \left\| X - V \right\|_2^2 + 2 \langle X - V, V - U \rangle + \left\| V - U \right\|_2^2\] But the middle terms is 0 so \[ \left\| X - U \right\|_2^2 = \left\| X - V \right\|_2^2 + \left\| V - U \right\|_2^2 \ge \left\| X - V \right\|_2^2\]
  2. Equality holds if and only if \( \left\| V - U \right\|_2^2 = 0\), if and only if \( U = V \).

Now let's return to our study of best predictors of a random variable.

If \( X \in \mathscr{L}_2 \) then the set \( \mathscr{W}_X = \{a + b X: a \in \R, \; b \in \R\} \) is a subspace of \(\mathscr{L}_2\). In fact, it is the subspace generated by \(X\) and 1.

Proof:

Note that \( \mathscr{W}_X \) is the set of all linear combinations of the vectors \( 1 \) and \( X \). If \( U, \; V \in \mathscr{W}_X \) then \( U + V \in \mathscr{W}_X \). If \( U \in \mathscr{W}_X \) and \( c \in \R \) then \( c U \in \mathscr{W}_X \).

Recall that for \( X, Y \in \mathscr{L}_2 \), the best linear predictor of \( Y \) based on \( X \) is \[ L(Y \mid X) = \E(Y) + \frac{\cov(X, Y)}{\var(X)} \left[X - \E(X)\right] \]

If \( X, \, Y \in \mathscr{L}_2 \) then \( L(Y \mid X) \) is the projection of \(Y\) onto \(\mathscr{W}_X\).

Proof:

Note first that \(L(Y \mid X) \in \mathscr{W}_X \). Thus, we just need to show that \( Y - L(Y \mid X) \) is perpendicular to \( \mathscr{W}_X \). For this, it suffices to show

  1. \(\left\langle Y - L(Y \mid X), X \right\rangle = 0\)
  2. \(\left\langle Y - L(Y \mid X), 1 \right\rangle = 0\)

We have already done this in the earlier sections, but for completeness, we do it again. Note that \( \E\left(X \left[X - \E(X)\right]\right) = \var(X) \). Hence \( \E\left[X L(Y \mid X)\right] = \E(X) \E(Y) + \cov(X, Y) = \E(X Y) \). This gives (a). By linearity, \( \E\left[L(Y \mid X)\right] = \E(Y) \) so (b) holds as well.

The previous result is actually just the random variable version of the standard formula for the projection of a vector onto a space spanned by two other vectors. Note that \( 1 \) is a unit vector and that \( X_0 = X - \E(X) = X - \langle X, 1 \rangle 1 \) is perpendicular to \( 1 \). Thus, \( L(Y \mid X) \) is just the sum of the projections of \( Y \) onto \( 1 \) and \( X_0 \): \[ L(Y \mid X) = \langle Y, 1 \rangle 1 + \frac{\langle Y, X_0 \rangle}{\langle X_0, X_0\rangle} X_0 \]
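The orthogonality relations that define this projection hold exactly for the empirical distribution of a sample as well (they are the normal equations of least squares), so they can be seen numerically (a Python sketch; the linear model for \(Y\) is just an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(size=200_000)
y = 2.0 * x + rng.normal(scale=0.5, size=200_000)  # arbitrary Y with finite variance

# Best linear predictor L(Y|X) = E(Y) + [cov(X, Y) / var(X)] (X - E(X)),
# computed for the empirical distribution
b = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
L = y.mean() + b * (x - x.mean())

residual = y - L
print(np.mean(residual))      # <Y - L(Y|X), 1> ~ 0
print(np.mean(residual * x))  # <Y - L(Y|X), X> ~ 0
```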

Suppose now that \( \mathscr{G} \) is a sub \( \sigma \)-algebra of \( \mathscr{F} \). Of course if \( X: \Omega \to \R \) is \( \mathscr{G} \)-measurable then \( X \) is \( \mathscr{F} \)-measurable, so \( \mathscr{L}_2(\mathscr{G}) \) is a subspace of \( \mathscr{L}_2(\mathscr{F}) \).

If \( X \in \mathscr{L}_2(\mathscr{F}) \) then \( \E(X \mid \mathscr{G}) \) is the projection of \( X \) onto \( \mathscr{L}_2(\mathscr{G}) \).

Proof:

This is essentially the definition of \( \E(X \mid \mathscr{G}) \) as the only (up to equivalence) random variable in \( \mathscr{L}_2(\mathscr{G}) \) with \( \E\left[\E(X \mid \mathscr{G}) U\right] = \E(X U) \) for every \( U \in \mathscr{L}_2(\mathscr{G}) \).

But remember that \( \E(X \mid \mathscr{G}) \) is defined more generally for \( X \in \mathscr{L}_1(\mathscr{F}) \). Our final result in this discussion concerns convergence.

Suppose that \( k \in [1, \infty) \) and that \( \mathscr{G} \) is a sub \( \sigma \)-algebra of \( \mathscr{F} \).

  1. If \( X \in \mathscr{L}_k(\mathscr{F}) \) then \( \E(X \mid \mathscr{G}) \in \mathscr{L}_k(\mathscr{G}) \)
  2. If \( X_n \in \mathscr{L}_k(\mathscr{F}) \) for \( n \in \N_+ \), \( X \in \mathscr{L}_k(\mathscr{F}) \), and \( X_n \to X \) as \( n \to \infty \) in \( \mathscr{L}_k(\mathscr{F}) \) then \( \E(X_n \mid \mathscr{G}) \to \E(X \mid \mathscr{G}) \) as \( n \to \infty \) in \( \mathscr{L}_k(\mathscr{G}) \)
Proof:
  1. Note that \( |\E(X \mid \mathscr{G})| \le \E(|X| \mid \mathscr{G}) \). Since \( t \mapsto t^k \) is increasing and convex on \( [0, \infty) \) we have \[ |\E(X \mid \mathscr{G})|^k \le [\E(|X| \mid \mathscr{G})]^k \le \E\left(|X|^k \mid \mathscr{G}\right) \] The last step uses Jensen's inequality. Taking expected values gives \[ \E[|\E(X \mid \mathscr{G})|^k] \le \E(|X|^k) \lt \infty \]
  2. Using the same ideas, \[ \E\left[\left|\E(X_n \mid \mathscr{G}) - \E(X \mid \mathscr{G})\right|^k\right] = \E\left[\left|\E(X_n - X \mid \mathscr{G})\right|^k\right] \le \E\left[|X_n - X|^k\right] \] By assumption, the right side converges to 0 as \( n \to \infty \) and hence so does the left side.

Examples and Applications

App Exercises

In the error function app, select the root mean square error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.

In the error function app, select the mean absolute error function. Click on the \( x \)-axis to generate an empirical distribution, and note the shape and location of the graph of the error function.

Computational Exercises

Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\).

  1. Find \(\|X\|_k\) for \( k \in [1, \infty) \).
  2. Graph \(\|X\|_k\) as a function of \(k \in [1, \infty)\).
  3. Find \(\lim_{k \to \infty} \|X\|_k\).
Answer:
  1. \(\frac{1}{(k + 1)^{1/k}}\)
  3. 1

Suppose that \(X\) has probability density function \(f(x) = \frac{a}{x^{a+1}}\) for \(1 \le x \lt \infty\), where \(a \gt 0\) is a parameter. Thus, \(X\) has the Pareto distribution with shape parameter \(a\).

  1. Find \(\|X\|_k\) for \( k \in [1, \infty) \).
  2. Graph \(\|X\|_k\) as a function of \(k \in (1, a)\).
  3. Find \(\lim_{k \uparrow a} \|X\|_k\).
Answer:
  1. \(\left(\frac{a}{a -k}\right)^{1/k}\) if \(k \lt a\), \(\infty\) if \(k \ge a\)
  3. \(\infty\)

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Minkowski's inequality.

Answer:
  1. \(\|X + Y\|_k = \left(\frac{2^{k+3} - 2}{(k + 2)(k + 3)}\right)^{1/k}\)
  2. \(\|X\|_k + \|Y\|_k = 2 \left(\frac{1}{k + 2} + \frac{1}{2(k + 1)}\right)^{1/k}\)
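The two sides of Minkowski's inequality can be checked numerically with simple midpoint-rule quadrature over the unit square (a Python sketch, shown for \(k = 2\)):

```python
import numpy as np

# Midpoint-rule quadrature on the unit square, for the density f(x, y) = x + y
n = 1000
pts = (np.arange(n) + 0.5) / n
X, Y = np.meshgrid(pts, pts)
w = 1.0 / n**2  # area of each grid cell
f = X + Y       # density values on the grid

k = 2.0
lhs = np.sum((X + Y) ** k * f * w) ** (1 / k)  # ||X + Y||_k
rhs = np.sum(X ** k * f * w) ** (1 / k) + np.sum(Y ** k * f * w) ** (1 / k)  # ||X||_k + ||Y||_k
print(lhs, rhs)  # Minkowski: lhs <= rhs
```

For \(k = 2\), \(\E\left[(X+Y)^2\right] = 3/2\), so the left side is about \(1.2247\) while the right side is about \(1.2910\).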

Let \(X\) be an indicator random variable with \(\P(X = 1) = p\), where \(0 \le p \le 1\). Graph \(\E\left(\left|X - t\right|\right)\) as a function of \(t \in \R\) in each of the cases below. In each case, find the minimum value of the function and the values of \(t\) where the minimum occurs.

  1. \(p \lt \frac{1}{2}\)
  2. \(p = \frac{1}{2}\)
  3. \(p \gt \frac{1}{2}\)
Answer:
  1. The minimum is \(p\) and occurs at \(t = 0\).
  2. The minimum is \(\frac{1}{2}\) and occurs for \(t \in [0, 1]\)
  3. The minimum is \(1 - p\) and occurs at \(t = 1\)

Suppose that \(X\) is uniformly distributed on the interval \([0, 1]\). Find \(d_1(X, t) = \E\left(\left|X - t\right|\right)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the value of \(t\) where the minimum occurs.

Suppose that \(X\) is uniformly distributed on the set \([0, 1] \cup [2, 3]\). Find \(d_1(X, t) = \E\left(\left|X - t\right|\right)\) as a function of \(t\) and sketch the graph. Find the minimum value of the function and the values of \(t\) where the minimum occurs.

Suppose that \((X, Y)\) has probability density function \(f(x, y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\). Verify Hölder's inequality in the following cases:

  1. \(j = k = 2\)
  2. \(j = 3\), \(k = \frac{3}{2}\)
Answer:
  1. \(\E\left(\left|X Y\right|\right) = \frac{1}{3}\), \(\|X\|_2 \, \|Y\|_2 = \frac{5}{12}\)
  2. \(\E\left(\left|X Y\right|\right) = \frac{1}{3}\), \(\|X\|_3 \, \|Y\|_{3/2} \approx 0.4248\)

Counterexamples

The following exercise shows that convergence with probability 1 does not imply convergence in mean.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent random variables with \[ \P\left(X_n = n^3\right) = \frac{1}{n^2}, \; \P(X_n = 0) = 1 - \frac{1}{n^2}; \quad n \in \N_+ \]

  1. \(X_n \to 0\) as \(n \to \infty\) with probability 1.
  2. \(X_n \to 0\) as \(n \to \infty\) in probability.
  3. \(\E(X_n) \to \infty\) as \(n \to \infty\).
Proof:
  1. This follows from the basic characterization of convergence with probability 1: \( \sum_{n=1}^\infty \P(X_n \gt \epsilon) = \sum_{n=1}^\infty 1 / n^2 \lt \infty \) for \( 0 \lt \epsilon \lt 1 \).
  2. This follows since convergence with probability 1 implies convergence in probability.
  3. Note that \( \E(X_n) = n^3 / n^2 = n \) for \( n \in \N_+ \).
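The two sides of this counterexample can be tabulated directly (a Python sketch): the probabilities \(\P(X_n \ne 0) = 1/n^2\) are summable, which by the Borel-Cantelli argument above forces \(X_n \to 0\) with probability 1, while the means \(\E(X_n) = n\) diverge.

```python
import numpy as np

n = np.arange(1, 10_001)
p_nonzero = 1.0 / n**2   # P(X_n = n^3)
mean_xn = n**3 / n**2    # E(X_n) = n^3 * P(X_n = n^3) = n

print(p_nonzero.sum())   # partial sum, bounded above by pi^2 / 6 ~ 1.6449
print(mean_xn[:5])       # E(X_n) = n, which grows without bound
```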

The following exercise shows that convergence in mean does not imply convergence with probability 1.

Suppose that \((X_1, X_2, \ldots)\) is a sequence of independent indicator random variables with \[ \P(X_n = 1) = \frac{1}{n}, \; \P(X_n = 0) = 1 - \frac{1}{n}; \quad n \in \N_+ \]

  1. \(\P(X_n = 0 \text{ for infinitely many } n) = 1\).
  2. \(\P(X_n = 1 \text{ for infinitely many } n) = 1\).
  3. \(\P(X_n \text{ does not converge as } n \to \infty) = 1\).
  4. \(X_n \to 0\) as \(n \to \infty\) in \(k\)th mean for every \(k \ge 1\).
Proof:
  1. This follows from the second Borel-Cantelli lemma since \( \sum_{n=1}^\infty \P(X_n = 1) = \sum_{n=1}^\infty 1 / n = \infty \)
  2. This also follows from the second Borel-Cantelli lemma since \( \sum_{n=1}^\infty \P(X_n = 0) = \sum_{n=1}^\infty (1 - 1 / n) = \infty \).
  3. This follows from parts (a) and (b).
  4. Note that \( \E(X_n) = 1 / n \to 0 \) as \( n \to \infty \).

The following exercise shows that convergence of the \( k \)th means does not imply convergence in \( k \)th mean.

Suppose that \( U \) has the Bernoulli distribution with parameter \( \frac{1}{2} \), so that \( \P(U = 1) = \P(U = 0) = \frac{1}{2} \). Let \( X_n = U \) for \( n \in \N_+ \) and let \( X = 1 - U \). Let \( k \in [1, \infty) \). Then

  1. \( \E(X_n^k) = \E(X^k) = \frac{1}{2} \) for \( n \in \N_+ \), so \( \E(X_n^k) \to \E(X^k) \) as \( n \to \infty \)
  2. \( \E(|X_n - X|^k) = 1 \) for \( n \in \N_+ \) so \( X_n \) does not converge to \( X \) as \( n \to \infty \) in \( \mathscr{L}_k \).
Proof:
  1. Note that \( X_n^k = U^k = U \) for \( n \in \N_+ \), since \( U \) just takes values 0 and 1. Also, \( U \) and \( 1 - U \) have the same distribution so \( \E(U) = \E(1 - U) = \frac{1}{2} \).
  2. Note that \( X_n - X = U - (1 - U) = 2 U - 1 \) for \( n \in \N_+ \). Again, \( U \) just takes values 0 and 1, so \( |2 U - 1| = 1 \).