\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\newcommand{\mse}{\text{MSE}}\)
\(\renewcommand{\P}{\mathbb{P}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\( \newcommand{\bs}{\boldsymbol} \)

Conditional expected value is much more important than one might at first think. Indeed, conditional expected value is at the core of modern probability theory because it provides the basic way of incorporating known information into a probability measure.

As usual, our starting point is a random experiment modeled by a probability space \((\Omega, \mathscr F, \P)\), so that \( \Omega \) is the set of outcomes, \( \mathscr F \) is the \( \sigma \)-algebra of events, and \( \P \) is the probability measure on the sample space \( (\Omega, \mathscr F) \). In our first elementary discussion, we studied the conditional expected value of a real-valued random variable \(X\) given a general random variable \(Y\). The more general approach is to condition on a sub \(\sigma\)-algebra \( \mathscr G \) of \( \mathscr F \). The sections on \( \sigma \)-algebras and measure theory are essential prerequisites for this section.

Before we get to the definition, we will assume that any expected values that we mention exist as real numbers. That is, if we write \( \E(X) \) then we are assuming that \( X \) is a real-valued random variable and that \( \E(|X|) \lt \infty \). Also, the notion of equivalence plays a fundamental role in this section. Recall that real-valued random variables \( X_1 \) and \( X_2 \) are equivalent if \( \P(X_1 = X_2) = 1 \). Equivalence really does define an equivalence relation on the collection of real-valued random variables for our experiment. Moreover, we often regard equivalent random variables as being essentially the same object. More precisely, from this point of view, the objects of our study are not individual random variables but rather equivalence classes of random variables under this equivalence relation. Finally, for \( A \in \mathscr F \), recall the notation for the expected value of \( X \) on the event \( A \): \[ \E(X; A) = \E(X \bs{1}_A)\]

Suppose that \(X\) is a random variable and that \(\mathscr G\) is a sub \(\sigma\)-algebra of \(\mathscr F\). The conditional expected value of \(X\) given \(\mathscr G\) is the random variable \(\E(X \mid \mathscr G)\) defined by the following properties:

- \(\E(X \mid \mathscr G)\) is measurable with respect to \(\mathscr G\).
- If \(A \in \mathscr G\) then \(\E[\E(X \mid \mathscr G); A] = \E(X; A)\).

The basic idea is that \( \E(X \mid \mathscr G) \) is the expected value of \( X \) given the information in the \( \sigma \)-algebra \( \mathscr G \). This idea will become clearer as we work through the properties and examples below. The conditions above uniquely define \( \E(X \mid \mathscr G) \) up to equivalence. The proof of this fact is a simple application of the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym.

Existence and uniqueness of \( \E(X \mid \mathscr G) \)

- There exists a random variable \( V \) satisfying the defining conditions.
- If \( V_1 \) and \( V_2 \) satisfy the defining conditions, then \( \P(V_1 = V_2) = 1 \) so that \( V_1 \) and \( V_2 \) are equivalent.

- Note that \( \nu(A) = \E(X; A) \) for \( A \in \mathscr G \) defines a (signed) measure on \( \mathscr G \). Moreover, if \( A \in \mathscr G \) and \( \P(A) = 0 \) then \( \nu(A) = 0 \). Hence \( \nu \) is absolutely continuous with respect to the restriction of \( \P \) to \( \mathscr G \). By the Radon-Nikodym theorem, there exists a random variable \( V \) that is measurable with respect to \( \mathscr G \) such that \( \nu(A) = \E(V; A) \) for \( A \in \mathscr G \). That is, \( V \) is the *density* or *derivative* of \( \nu \) with respect to \( \P \) on \( \mathscr G \).
- This follows from the uniqueness of the Radon-Nikodym derivative, up to equivalence.

The following characterization might seem stronger but is in fact equivalent to the definition.

\( \E(X \mid \mathscr G) \) is characterized by the following properties:

- \( \E(X \mid \mathscr G) \) is measurable with respect to \( \mathscr G \)
- If \( U \) is measurable with respect to \( \mathscr G \) then \( \E[U \E(X \mid \mathscr G)] = \E(U X) \).

We have to show that part (b) in the original definition is equivalent to part (b) here. First, (b) here implies (b) in the original definition since \( \bs{1}_A \) is \( \mathscr G \)-measurable if \( A \in \mathscr G \). Conversely, suppose that (b) in the original definition holds. We will show that (b) here holds by a classical *bootstrapping argument*. First, \( \E[U \E(X \mid \mathscr G)] = \E(U X) \) if \( U = \bs{1}_A \) for some \( A \in \mathscr G \). Next suppose that \( U \) is a simple random variable that is \( \mathscr G \)-measurable. That is, \( U = \sum_{i \in I} a_i \bs{1}_{A_i} \) where \( I \) is a finite index set, \( a_i \ge 0 \) for \( i \in I \), and \( A_i \in \mathscr G \) for \( i \in I \). Then
\[ \E[U \E(X \mid \mathscr G)] = \E[\sum_{i \in I} a_i \bs{1}_{A_i} \E(X \mid \mathscr G)] = \sum_{i \in I} a_i \E[\bs{1}_{A_i} \E(X \mid \mathscr G)] = \sum_{i \in I} a_i \E(\bs{1}_{A_i} X) = \E(\sum_{i \in I} a_i \bs{1}_{A_i} X) = \E(U X) \]
Next suppose that \( U \) is nonnegative and \( \mathscr G \)-measurable. Then there exists a sequence of simple \( \mathscr G \)-measurable random variables \( (U_1, U_2, \ldots) \) with \( U_n \uparrow U \) as \( n \to \infty \). Then by the previous step, \( \E[U_n \E(X \mid \mathscr G)] = \E(U_n X) \) for each \( n \). Letting \( n \to \infty \) and using the monotone convergence theorem we have \( \E[U \E(X \mid \mathscr G)] = \E(U X) \). Finally, suppose that \( U \) is a general \( \mathscr G \)-measurable random variable. Then \( U = U^+ - U^- \) where \( U^+ \) and \( U^- \) are the usual positive and negative parts of \( U \). These parts are nonnegative and \( \mathscr G \)-measurable, so by the previous step, \( \E[U^+ \E(X \mid \mathscr G)] = \E(U^+ X) \) and \( \E[U^- \E(X \mid \mathscr G)] = \E(U^- X) \). Hence
\[ \E[U \E(X \mid \mathscr G)] = \E[(U^+ - U^-) \E(X \mid \mathscr G)] = \E[U^+ \E(X \mid \mathscr G)] - \E[U^- \E(X \mid \mathscr G)] = \E(U^+ X) - \E(U^- X) = \E(U X) \]
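The second characterization can be checked by hand on a small finite sample space. The sketch below is illustrative only: the fair die, the partition, and the values of \( U \) are hypothetical choices, and the code assumes the block-average formula for \( \E(X \mid \mathscr G) \) when \( \mathscr G \) is generated by a finite partition (a fact derived later in this section).

```python
from fractions import Fraction

omega = range(1, 7)                          # outcomes of one fair die (assumed example)
prob = {w: Fraction(1, 6) for w in omega}
partition = [(1, 2), (3, 4), (5, 6)]         # blocks generating the sub sigma-algebra G

def expect(f):
    """E(f) on the finite sample space."""
    return sum(f(w) * prob[w] for w in omega)

def cond_exp(f):
    """E(f | G) as a dict: the probability-weighted average of f on each block."""
    out = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

def X(w):
    return w                                 # the score on the die

V = cond_exp(X)                              # E(X | G)

# A G-measurable U: constant on each block (arbitrary values, one of them negative)
u_values = {(1, 2): -2, (3, 4): 3, (5, 6): 5}
U = {w: val for block, val in u_values.items() for w in block}

lhs = expect(lambda w: U[w] * V[w])          # E[U E(X | G)]
rhs = expect(lambda w: U[w] * X(w))          # E(U X)
print(lhs, rhs)
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to rounding error.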

Recall again that random variables are assumed to be real-valued with finite expected value, and that \( \mathscr G \) is a sub \( \sigma \)-algebra of \( \mathscr F \). Also, all equalities and inequalities are understood to hold *modulo equivalence*, that is, with probability 1. Note also that many of the proofs work by showing that the right hand side satisfies the defining properties for the conditional expected value on the left side.

Our first property is a simple consequence of the definition: \( X \) and \( \E(X \mid \mathscr G) \) have the same mean.

\( \E[\E(X \mid \mathscr G)] = \E(X) \).

This follows immediately by letting \( A = \Omega \) in the definition.
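As a small illustration of computing a mean by conditioning, consider a hypothetical two-stage experiment (not from the text): a fair die shows \( N \), then \( N \) fair coins are tossed and \( X \) is the number of heads. The sketch compares the direct computation of \( \E(X) \) with the one that first finds \( \E(X \mid N) \) and then averages over \( N \).

```python
from fractions import Fraction
from math import comb

# Joint pmf of (N, X): N uniform on 1..6; given N = n, X is Binomial(n, 1/2)
joint = {(n, k): Fraction(1, 6) * comb(n, k) / 2**n
         for n in range(1, 7) for k in range(n + 1)}

# Direct computation of E(X) from the joint distribution
direct = sum(k * p for (n, k), p in joint.items())

# Conditioning on N: P(N = n), then E(X | N = n), then the average over N
p_n = {n: sum(p for (m, k), p in joint.items() if m == n) for n in range(1, 7)}
cond_mean = {n: sum(k * p for (m, k), p in joint.items() if m == n) / p_n[n]
             for n in range(1, 7)}
by_conditioning = sum(cond_mean[n] * p_n[n] for n in range(1, 7))

print(direct, by_conditioning)
```

Here \( \E(X \mid N) = N / 2 \), so the conditioning route reduces the problem to \( \E(N) / 2 \).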

The result above can often be used to compute \( \E(X) \), by choosing the \( \sigma \)-algebra \( \mathscr G \) in a clever way. We say that we are computing \( \E(X) \) by conditioning on \( \mathscr G \). Our next properties are fundamental: every version of expected value must satisfy the linearity properties. The first part is the additive property and the second part is the scaling property.

Suppose that \( X \) and \( Y \) are random variables and that \( c \) is a constant. Then

- \( \E(X + Y \mid \mathscr G) = \E(X \mid \mathscr G) + \E(Y \mid \mathscr G) \)
- \( \E(c X \mid \mathscr G) = c \E(X \mid \mathscr G) \)

- We show that \( \E(X \mid \mathscr G) + \E(Y \mid \mathscr G) \) satisfies the defining conditions for \( \E(X + Y \mid \mathscr G) \). Note first that \( \E(X \mid \mathscr G) + \E(Y \mid \mathscr G) \) is \( \mathscr G \)-measurable since both terms are. If \(A \in \mathscr G \) then \[ \E\{[\E(X \mid \mathscr G) + \E(Y \mid \mathscr G)]; A\} = \E[\E(X \mid \mathscr G); A] + \E[\E(Y \mid \mathscr G); A] = \E(X; A) + \E(Y; A) = \E[X + Y; A] \]
- We show that \( c \E(X \mid \mathscr G) \) satisfies the defining conditions for \( \E(c X \mid \mathscr G) \). Note first that \( c \E(X \mid \mathscr G) \) is \( \mathscr G \)-measurable since the second factor is. If \(A \in \mathscr G \) then \[ \E[c \E(X \mid \mathscr G); A] = c \E[\E(X \mid \mathscr G); A] = c \E(X; A) = \E(c X; A) \]

The next set of properties are also fundamental to every notion of expected value. The first part is the positive property and the second part is the increasing property.

Suppose that \( X \) and \( Y \) are random variables.

- If \( X \ge 0 \) then \( \E(X \mid \mathscr G) \ge 0 \)
- If \( X \le Y \) then \( \E(X \mid \mathscr G) \le \E(Y \mid \mathscr G) \)

- Let \( A = \{\E(X \mid \mathscr G) \lt 0\} \). Note that \(A \in \mathscr G \) and hence \( \E(X; A) = \E[\E(X \mid \mathscr G); A] \). Since \( X \ge 0 \) with probability 1 we have \( \E(X; A) \ge 0 \). On the other hand, if \( \P(A) \gt 0 \) then \( \E[\E(X \mid \mathscr G); A] \lt 0 \) which is a contradiction. Hence we must have \( \P(A) = 0 \).
- Note that if \( X \le Y \) then \( Y - X \ge 0 \). Hence by (a) and the linearity properties, \( \E(Y - X \mid \mathscr G) = \E(Y \mid \mathscr G) - \E(X \mid \mathscr G) \ge 0 \) so \( \E(Y \mid \mathscr G) \ge \E(X \mid \mathscr G) \).

The next few properties relate to the central idea that \( \E(X \mid \mathscr G) \) is the expected value of \( X \) given the information in the \( \sigma \)-algebra \( \mathscr G \).

Suppose that \( X \) and \( V \) are random variables and that \( V \) is measurable with respect to \( \mathscr G \), then \( \E(V X \mid \mathscr G) = V \E(X \mid \mathscr G) \).

We show that \( V \E(X \mid \mathscr G) \) satisfies the properties that characterize \( \E(V X \mid \mathscr G) \). First, \( V \E(X \mid \mathscr G) \) is \( \mathscr G \)-measurable since both factors are. If \( U \) is \( \mathscr G \)-measurable then so is \( U V \) and hence \[ \E[U V \E(X \mid \mathscr G)] = \E(U V X) = \E[U (V X)] \]

Compare this result with the scaling property. If \( V \) is measurable with respect to \( \mathscr G \) then \( V \) is like a constant in terms of the conditional expected value given \( \mathscr G \). On the other hand, note that this result implies the scaling property, since a constant can be viewed as a random variable, and as such, is measurable with respect to any \( \sigma \)-algebra. As a corollary to this result, note that if \( X \) itself is measurable with respect to \( \mathscr G \) then \( \E(X \mid \mathscr G) = X \). The following result gives the other extreme.

If \( X \) and \( \mathscr G \) are independent then \( \E(X \mid \mathscr G) = \E(X) \).

We show that \( \E(X) \) satisfies the properties that characterize \( \E(X \mid \mathscr G) \). First of course, \( \E(X) \) is \( \mathscr G \)-measurable as a constant random variable. If \( A \in \mathscr G \) then \( X \) and \( \bs{1}_A \) are independent and hence \[ \E(X; A) = \E(X) \P(A) = \E[\E(X); A] \]

Every random variable \( X \) is independent of the trivial \( \sigma \)-algebra \( \{\emptyset, \Omega\} \) so it follows that \( \E(X \mid \{\emptyset, \Omega\}) = \E(X) \).
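The independence property can be seen concretely with two fair dice, a hypothetical example chosen for illustration. Take \( X \) to be the first score and let \( \mathscr G \) be generated by the second score. The sketch again assumes the block-average formula for a \( \sigma \)-algebra generated by a finite partition, derived later in this section.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # outcomes of two fair dice
prob = {w: Fraction(1, 36) for w in omega}

def X(w):
    return w[0]                                  # score on the first die

# G = sigma(second die): the blocks are the level sets of the second coordinate
partition = [[w for w in omega if w[1] == j] for j in range(1, 7)]

def cond_exp(f, blocks):
    """Block averages of f, returned as a dict outcome -> value."""
    out = {}
    for block in blocks:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

mean_X = sum(X(w) * prob[w] for w in omega)      # E(X)
V = cond_exp(X, partition)                       # E(X | G), with X independent of G
V_trivial = cond_exp(X, [omega])                 # E(X | {emptyset, Omega})

print(mean_X, set(V.values()), set(V_trivial.values()))
```

Both conditional expected values are constant and equal to \( \E(X) \), as the result predicts.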

The next properties are consistency conditions, also known as the tower properties. When conditioning twice, with respect to nested \( \sigma \)-algebras, the smaller one (representing the smaller amount of information) always prevails.

Suppose that \( X \) is a random variable and that \( \mathscr H \) and \( \mathscr G \) are \( \sigma \)-algebras with \( \mathscr H \subseteq \mathscr G \subseteq \mathscr F\). Then

- \( \E[\E(X \mid \mathscr H) \mid \mathscr G] = \E(X \mid \mathscr H) \)
- \( \E[\E(X \mid \mathscr G) \mid \mathscr H] = \E(X \mid \mathscr H) \)

- Note first that \( \E(X \mid \mathscr H) \) is \( \mathscr H \)-measurable and hence also \( \mathscr G \)-measurable. Thus by the factoring property above, \( \E[\E(X \mid \mathscr H) \mid \mathscr G] = \E(X \mid \mathscr H) \).
- We show that \( \E(X \mid \mathscr H) \) satisfies the defining conditions for \( \E[\E(X \mid \mathscr G) \mid \mathscr H]\). Note again that \( \E(X \mid \mathscr H) \) is \( \mathscr H \)-measurable. If \(A \in \mathscr H \) then \( A \in \mathscr G \) and hence \[ \E[\E(X \mid \mathscr G); A] = \E(X; A) = \E[\E(X \mid \mathscr H); A] \]
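Both tower properties can be verified on a finite example. In the sketch below, the die, the two nested partitions, and the choice \( X(\omega) = \omega^2 \) are hypothetical; conditional expected values are computed as block averages, which is the partition formula derived later in this section.

```python
from fractions import Fraction

omega = range(1, 7)                                  # one fair die
prob = {w: Fraction(1, 6) for w in omega}
G_blocks = [(1, 2), (3, 4), (5, 6)]                  # finer partition, generates G
H_blocks = [(1, 2, 3, 4), (5, 6)]                    # coarser partition, generates H (H is contained in G)

def cond_exp(f, blocks):
    """Conditional expected value given the sigma-algebra generated by `blocks`,
    returned as a dict outcome -> value (block averages)."""
    out = {}
    for block in blocks:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

def X(w):
    return w * w

E_G = cond_exp(X, G_blocks)                          # E(X | G)
E_H = cond_exp(X, H_blocks)                          # E(X | H)
inner_H_outer_G = cond_exp(E_H.get, G_blocks)        # E[E(X | H) | G]
inner_G_outer_H = cond_exp(E_G.get, H_blocks)        # E[E(X | G) | H]

print(inner_H_outer_G == E_H, inner_G_outer_H == E_H)
```

In either order the result is \( \E(X \mid \mathscr H) \), the conditional expected value with respect to the smaller \( \sigma \)-algebra.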

The next result gives Jensen's inequality for conditional expected value, named for Johan Jensen.

Suppose that \( X \) takes values in an interval \( S \subseteq \R \) and that \( g: S \to \R \) is convex. Then \[ \E[g(X) \mid \mathscr G] \ge g[\E(X \mid \mathscr G)] \]

As with Jensen's inequality for ordinary expected value, the best proof uses the characterization of convex functions in terms of supporting lines: For each \( t \in S \) there exist numbers \( a \) and \( b \) (depending on \( t \)) such that

- \( a + b t = g(t) \)
- \( a + b x \le g(x) \) for \( x \in S \)

Random variables \( X \) and \( \E(X \mid \mathscr G) \) take values in \( S \). We can construct a *random* supporting line at \( \E(X \mid \mathscr G) \). That is, there exist random variables \( A \) and \( B \), measurable with respect to \( \mathscr G \), such that

- \( A + B \E(X \mid \mathscr G) = g[\E(X \mid \mathscr G)] \)
- \( A + B X \le g(X) \)

We take conditional expected value through the inequality in (b) and then use properties of conditional expected value and property (a): \[ \E[g(X) \mid \mathscr G] \ge \E(A + B X \mid \mathscr G) = A + B \E(X \mid \mathscr G) = g[\E(X \mid \mathscr G)] \] Note that the second step uses the fact that \( A \) and \( B \) are measurable with respect to \( \mathscr G \).
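A quick numerical check of the conditional Jensen inequality, under illustrative assumptions: a fair die, a two-block partition generating \( \mathscr G \), and the convex function \( g(x) = x^2 \). Conditional expected values are computed as block averages (the partition formula derived later in this section).

```python
from fractions import Fraction

omega = range(1, 7)                          # one fair die
prob = {w: Fraction(1, 6) for w in omega}
blocks = [(1, 2, 3), (4, 5, 6)]              # partition generating G

def cond_exp(f):
    """E(f | G) as a dict outcome -> block average."""
    out = {}
    for block in blocks:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

def g(x):
    return x * x                             # a convex function

lhs = cond_exp(lambda w: g(w))               # E[g(X) | G] with X(w) = w
inner = cond_exp(lambda w: w)                # E(X | G)
rhs = {w: g(inner[w]) for w in omega}        # g[E(X | G)]

holds = all(lhs[w] >= rhs[w] for w in omega)
print(holds)
```

The inequality is pointwise: it holds at every outcome, not just on average.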

The conditional probability of an event \(A\), given \(\mathscr G\), is a special case of the conditional expected value. As usual, let \(\bs{1}_A\) denote the indicator random variable of \(A\). We define \[ \P(A \mid \mathscr G) = \E(\bs{1}_A \mid \mathscr G) \] Thus, we have the following characterizations of conditional probability, which are special cases of the first characterization and the second characterization of \( \E(X \mid \mathscr G) \):

\( \P(A \mid \mathscr G) \) is characterized (up to equivalence) by the following properties

- \( \P(A \mid \mathscr G) \) is measurable with respect to \( \mathscr G \).
- If \( B \in \mathscr G \) then \( \E[\P(A \mid \mathscr G); B] = \P(A \cap B) \)

For part (b), note that \[ \E[\bs{1}_B \P(A \mid \mathscr G)] = \E[\bs{1}_B \E(\bs{1}_A \mid \mathscr G)] = \E(\bs{1}_A \bs{1}_B) = \E(\bs{1}_{A \cap B}) = \P(A \cap B) \]

\( \P(A \mid \mathscr G) \) is characterized (up to equivalence) by the following properties

- \( \P(A \mid \mathscr G) \) is measurable with respect to \( \mathscr G \).
- If \( U \) is measurable with respect to \( \mathscr G \) then \( \E[U \P(A \mid \mathscr G)] = \E(U; A) \)

The properties above for conditional expected value, of course, have special cases for conditional probability. In particular, we can compute the probability of an event by conditioning on a \( \sigma \)-algebra:

\(\P(A) = \E[\P(A \mid \mathscr G)]\).

This is a direct result of the conditioning rule for expected value, since \( \E(\bs{1}_A) = \P(A) \).

Again, the previous result is often a good way to compute \(\P(A)\) when we know the conditional probability of \(A\) given \(\mathscr G\). This is a very compact and elegant version of the law of total probability given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the chapter on Distributions. The following theorem gives the conditional version of the axioms of probability.

The following properties hold (as usual, modulo equivalence):

- \( \P(A \mid \mathscr G) \ge 0 \) for every event \( A \)
- \( \P(\Omega \mid \mathscr G) = 1 \)
- If \( \{A_i: i \in I\} \) is a countable collection of disjoint events then \( \P(\bigcup_{i \in I} A_i \bigm| \mathscr G) = \sum_{i \in I} \P(A_i \mid \mathscr G) \)

- This is a direct consequence of the positive property.
- This is trivial since \( \bs{1}_\Omega = 1 \).
- We show that the right side satisfies the defining conditions for the left side. Note that \( \sum_{i \in I} \P(A_i \mid \mathscr G) \) is \( \mathscr G \)-measurable since each term in the sum has this property. Let \( B \in \mathscr G \). Then \[ \E\left[\sum_{i \in I} \P(A_i \mid \mathscr G); B\right] = \sum_{i \in I} \E[\P(A_i \mid \mathscr G); B] = \sum_{i \in I} \P(A_i \cap B) = \P\left(B \cap \bigcup_{i \in I} A_i\right) \]

From the last result, it follows that other standard probability rules hold for conditional probability given \( \mathscr G \) (as always, modulo equivalence). These results include

- the complement rule
- the increasing property
- Boole's inequality
- Bonferroni's inequality
- the inclusion-exclusion laws

However, it is not correct to state that \( A \mapsto \P(A \mid \mathscr G) \) is a probability measure. In part (c), the left and right sides are random variables and the equation is an event that has probability 1. However this event depends on the collection \( \{A_i: i \in I\} \). In general, there will be uncountably many such collections in \( \mathscr F \), and the intersection of all of the corresponding events may well have probability less than 1 (if it's measurable at all). It turns out that if the underlying probability space \( (\Omega, \mathscr F, \P) \) is sufficiently nice (and most probability spaces that arise in applications are nice), then there does in fact exist a regular conditional probability. That is, for each \( A \in \mathscr F \), there exists a random variable \( \P(A \mid \mathscr G) \) satisfying the defining conditions and such that with probability 1, \( A \mapsto \P(A \mid \mathscr G) \) is a probability measure.

The following theorem gives a version of Bayes' theorem, named for the inimitable Thomas Bayes.

Suppose that \( A \in \mathscr G \) and \( B \in \mathscr F \) with \( \P(B) \gt 0 \). Then \[ \P(A \mid B) = \frac{\E[\P(B \mid \mathscr G); A]}{\E[\P(B \mid \mathscr G)]} \]

The proof is absolutely trivial. By definition of conditional probability given \( \mathscr G \), the numerator is \( \P(A \cap B) \) and the denominator is \( \P(B) \). Nonetheless, Bayes' theorem is useful in settings where the expected values in the numerator and denominator can be computed directly.

The purpose of this discussion is to tie the general notions of conditional expected value that we are studying here to the more elementary concepts that you have seen before. Suppose that \( A \) is an event (that is, a member of \( \mathscr F \)) with \( \P(A) \gt 0 \). If \( B \) is another event, then of course, the conditional probability of \( B \) given \( A \) is \[ \P(B \mid A) = \frac{\P(A \cap B)}{\P(A)} \] If \( X \) is a real-valued random variable (whose expected value exists in \( \R \)), then the conditional distribution of \( X \) given \( A \) is the probability measure on \( \R \) given by \[ R \mapsto \P(X \in R \mid A) = \frac{\P(\{X \in R\} \cap A)}{\P(A)} \text{ for measurable } R \subseteq \R \] and the conditional expected value of \( X \) given \( A \), denoted \( \E(X \mid A) \), is simply the mean of this conditional distribution.

Suppose now that \( \mathscr{A} = \{A_i: i \in I\} \) is a countable partition of the sample space \( \Omega \) into events with positive probability. To review the jargon, the index set \( I \) is countable; \( A_i \cap A_j = \emptyset \) for \( i, \; j \in I \) with \( i \ne j \); \( \bigcup_{i \in I} A_i = \Omega \); and \( \P(A_i) \gt 0 \) for \( i \in I \). Let \( \mathscr G = \sigma(\mathscr{A}) \), the \( \sigma \)-algebra generated by \( \mathscr{A} \). The elements of \( \mathscr G \) are of the form \( \bigcup_{j \in J} A_j \) for \( J \subseteq I \). Moreover, the random variables that are measurable with respect to \( \mathscr G \) are precisely the variables that are constant on \( A_i \) for each \( i \in I \).

If \( B \) is an event then \( \P(B \mid \mathscr G) \) is the random variable whose value on \( A_i \) is \( \P(B \mid A_i) \) for each \(i \in I \).

Let \( U \) denote the random variable that takes the value \( \P(B \mid A_i) \) on \( A_i \) for each \( i \in I \). First, \( U \) is measurable with respect to \( \mathscr G \) since \( U \) is constant on \( A_i \) for each \( i \in I \). So we just need to show that \( \E(U; A) = \P(A \cap B) \) for each \( A \in \mathscr G \). Thus, let \( A = \bigcup_{j \in J} A_j \) where \( J \subseteq I \). Then \[ \E(U; A) = \sum_{j \in J} \E(U; A_j) = \sum_{j \in J} \P(B \mid A_j) \P(A_j) = \P(A \cap B)\]

In this setting, Bayes' theorem above reduces to the usual elementary formulation: For \( i \in I \), \( \E[\P(B \mid \mathscr G); A_i] = \P(A_i) \P(B \mid A_i) \) and \( \E[\P(B \mid \mathscr G)] = \sum_{j \in I} \P(A_j) \P(B \mid A_j) \). Hence \[ \P(A_i \mid B) = \frac{\P(A_i) \P(B \mid A_i)}{\sum_{j \in I} \P(A_j) \P(B \mid A_j)} \]
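The elementary form of Bayes' theorem is easy to evaluate numerically. The priors \( \P(A_i) \) and likelihoods \( \P(B \mid A_i) \) in this sketch are made-up numbers for a three-set partition, used only to show how the numerator and denominator are assembled.

```python
from fractions import Fraction

# Hypothetical priors P(A_i) for a partition {A1, A2, A3} and likelihoods P(B | A_i)
prior = {"A1": Fraction(1, 2), "A2": Fraction(3, 10), "A3": Fraction(1, 5)}
likelihood = {"A1": Fraction(1, 100), "A2": Fraction(2, 100), "A3": Fraction(5, 100)}

# Numerator: E[P(B | G); A_i] = P(A_i) P(B | A_i), since P(B | G) equals P(B | A_i) on A_i
numerators = {i: prior[i] * likelihood[i] for i in prior}

# Denominator: E[P(B | G)] = sum over j of P(A_j) P(B | A_j), which is P(B)
p_B = sum(numerators.values())

posterior = {i: numerators[i] / p_B for i in prior}   # P(A_i | B)
print(p_B, posterior)
```

Even though \( A_3 \) has the smallest prior probability in this example, it has the largest posterior probability because its likelihood is the largest.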

If \( X \) is a random variable then \( \E(X \mid \mathscr G) \) is the random variable whose value on \( A_i \) is \( \E(X \mid A_i) \) for each \( i \in I \).

Let \( U \) denote the random variable that takes the value \( \E(X \mid A_i) \) on \( A_i \) for each \( i \in I \). First, \( U \) is measurable with respect to \( \mathscr G \) since \( U \) is constant on \( A_i \) for each \( i \in I \). So we just need to show that \( \E(U; A) = \E(X; A) \) for each \( A \in \mathscr G \). Thus, let \( A = \bigcup_{j \in J} A_j \) where \( J \subseteq I \). Then \[ \E(U; A) = \sum_{j \in J} \E(U; A_j) = \sum_{j \in J} \E(X \mid A_j) \P(A_j) = \E(X; A) \]
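The partition results for \( \P(B \mid \mathscr G) \) and \( \E(X \mid \mathscr G) \) can be checked exhaustively when the partition is finite, because \( \mathscr G \) then consists of the finitely many unions of blocks. The following sketch uses a hypothetical loaded die and a hypothetical event \( B \); it builds the block-valued random variable and tests the defining condition \( \E(U; A) = \E(X; A) \) for every \( A \in \mathscr G \).

```python
from fractions import Fraction
from itertools import combinations

# A loaded die (hypothetical probabilities) and a partition of its outcomes
prob = {1: Fraction(1, 10), 2: Fraction(2, 10), 3: Fraction(3, 10),
        4: Fraction(1, 10), 5: Fraction(2, 10), 6: Fraction(1, 10)}
partition = [(1, 2), (3, 4), (5, 6)]

def expect_on(f, event):
    """E(f; event) = E(f 1_event)."""
    return sum(f(w) * prob[w] for w in event)

def block_values(f):
    """Random variable equal to E(f | A_i) = E(f; A_i) / P(A_i) on each block A_i."""
    out = {}
    for block in partition:
        value = expect_on(f, block) / expect_on(lambda w: 1, block)
        out.update({w: value for w in block})
    return out

def satisfies_definition(U, f):
    """Check E(U; A) == E(f; A) for every A in G, that is, every union of blocks."""
    for r in range(len(partition) + 1):
        for chosen in combinations(partition, r):
            A = [w for block in chosen for w in block]
            if expect_on(U.get, A) != expect_on(f, A):
                return False
    return True

def X(w):
    return w

B = {2, 3, 5}

def indicator_B(w):
    return 1 if w in B else 0

E_X_given_G = block_values(X)               # value E(X | A_i) on A_i
P_B_given_G = block_values(indicator_B)     # value P(B | A_i) on A_i

print(satisfies_definition(E_X_given_G, X), satisfies_definition(P_B_given_G, indicator_B))
```

Since conditional probability is the conditional expected value of an indicator variable, the same helper handles both results.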

The previous examples would apply to \( \mathscr G = \sigma(Y) \) if \( Y \) is a discrete random variable taking values in a countable set \( T \). In this case, the partition is simply \( \mathscr{A} = \{ \{Y = y\}: y \in T\} \). On the other hand, suppose that \( Y \) is a random variable for our experiment taking values in a general set \( T \) with \( \sigma \)-algebra \( \mathscr{T} \). The real-valued random variables that are measurable with respect to \( \mathscr G = \sigma(Y) \) are (up to equivalence) the measurable, real-valued functions of \( Y \).

Specializing further, suppose that \( X \) takes values in \( S \subseteq \R \), \( Y \) takes values in \( T \subseteq \R^n \) and that \( (X, Y) \) has a joint continuous distribution with probability density function \( f \). Let \( h \) denote the probability density function of \( Y \), so that \[ h(y) = \int_S f(x, y) \, dx, \quad y \in T \] Then the conditional probability density function of \( X \) given \( Y = y \) is \[ g(x \mid y) =\frac{f(x, y)}{h(y)}, \quad x \in S \] This is precisely the setting of our elementary discussion of conditional expected value, and we usually write \( \E(X \mid Y) \) instead of the clunkier \( \E[X \mid \sigma(Y)] \).

In this setting, \[ \E(X \mid Y = y) = \int_S x g(x \mid y) \, dx = \int_S x \frac{f(x, y)}{h(y)} \, dx\]
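For a concrete instance of the density formula, take the hypothetical joint density \( f(x, y) = x + y \) on the unit square \( [0, 1]^2 \) (an assumption made for this example, not taken from the text). Then \( h(y) = y + \frac{1}{2} \) and a short calculation gives \( \E(X \mid Y = y) = \frac{2 + 3 y}{3 (1 + 2 y)} \). The sketch below evaluates the integrals with a simple midpoint rule and compares the result with this closed form.

```python
def f(x, y):
    """Joint density on the unit square (hypothetical example)."""
    return x + y

def integrate(func, a, b, n=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    step = (b - a) / n
    return step * sum(func(a + (i + 0.5) * step) for i in range(n))

def cond_mean(y):
    """Approximate E(X | Y = y) as the integral of x f(x, y) / h(y) over S = [0, 1]."""
    h_y = integrate(lambda x: f(x, y), 0.0, 1.0)            # marginal density h(y)
    return integrate(lambda x: x * f(x, y) / h_y, 0.0, 1.0)

def closed_form(y):
    """(2 + 3y) / (3 (1 + 2y)), obtained by integrating by hand."""
    return (2 + 3 * y) / (3 * (1 + 2 * y))

for y in (0.0, 0.25, 0.5, 1.0):
    print(y, cond_mean(y), closed_form(y))
```

The midpoint rule is only an approximation, so the two columns agree up to a small discretization error rather than exactly.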

In this subsection, we assume that the real-valued random variables mentioned have finite variance. In our elementary treatment of conditional expected value, we showed that the conditional expected value of a real-valued random variable \( X \) given a general random variable \( Y \) is the best predictor of \( X \), in the least squares sense, among all real-valued functions of \( Y \). A more careful statement is that \( \E(X \mid Y) \) is the best predictor of \( X \) among all real-valued random variables that are *measurable* with respect to \( \sigma(Y) \). Thus, it should come as no surprise that if \( \mathscr G \) is a sub \( \sigma \)-algebra of \( \mathscr F \), then \( \E(X \mid \mathscr G) \) is the best predictor of \( X \), in the least squares sense, among all real-valued random variables that are measurable with respect to \( \mathscr G \). We will show that this is indeed the case in this subsection. The proofs are very similar to the ones given in the elementary section.

Suppose that \( U \) is measurable with respect to \( \mathscr G \). Then \( X - \E(X \mid \mathscr G) \) and \( U \) are uncorrelated.

Note that \( X - \E(X \mid \mathscr G) \) has mean 0 by the conditioning rule. Using the defining conditions for \( \E(X \mid \mathscr G) \) we have \[ \cov[X - \E(X \mid \mathscr G), U] = \E(U [X - \E(X \mid \mathscr G)]) = \E(U X) - \E[U \E(X \mid \mathscr G)] = \E(U X) - \E(U X) = 0 \]

The next result is the main one: \( \E(X \mid \mathscr G) \) is closer to \( X \) in the mean square sense than any other random variable that is measurable with respect to \( \mathscr G \). Thus, if \( \mathscr G \) represents the information that we have, then \( \E(X \mid \mathscr G) \) is the best we can do in estimating \( X \).

Suppose that \( U \) is measurable with respect to \( \mathscr G \). Then

- \( \E([X - \E(X \mid \mathscr G)]^2) \le \E[(X - U)^2] \).
- Equality holds if and only if \(\P[U = \E(X \mid \mathscr G)] = 1 \), so \( U \) and \( \E(X \mid \mathscr G) \) are equivalent.

- Note that \begin{align} \E[(X - U)^2] & = \E([X - \E(X \mid \mathscr G) + \E(X \mid \mathscr G) - U]^2) \\ & = \E([X - \E(X \mid \mathscr G)]^2) + 2 \E([X - \E(X \mid \mathscr G)][\E(X \mid \mathscr G) - U]) + \E([\E(X \mid \mathscr G) - U]^2 ) \end{align} By the conditioning rule, \( X - \E(X \mid \mathscr G) \) has mean 0, so the middle term in the displayed equation is \( 2 \cov[X - \E(X \mid \mathscr G), \E(X \mid \mathscr G) - U] \). But \( \E(X \mid \mathscr G) - U \) is \( \mathscr G \)-measurable and hence this covariance is 0 by the uncorrelated property above. Therefore \[ \E[(X - U)^2] = \E([X - \E(X \mid \mathscr G)]^2) + \E([\E(X \mid \mathscr G) - U]^2 ) \ge \E([X - \E(X \mid \mathscr G)]^2) \]
- Equality holds if and only if \( \E([\E(X \mid \mathscr G) - U]^2 ) = 0 \) if and only if \(\P[U = \E(X \mid \mathscr G)] = 1 \)
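The decomposition used in the proof, \( \E[(X - U)^2] = \E([X - \E(X \mid \mathscr G)]^2) + \E([\E(X \mid \mathscr G) - U]^2) \), can be confirmed exactly on a finite example. The die, the partition, the variable \( X(\omega) = \omega^2 \), and the competing predictors below are all hypothetical choices; \( \E(X \mid \mathscr G) \) is computed as a block average.

```python
from fractions import Fraction

omega = range(1, 7)                          # one fair die
prob = {w: Fraction(1, 6) for w in omega}
partition = [(1, 2), (3, 4), (5, 6)]         # generates G

def expect(f):
    return sum(f(w) * prob[w] for w in omega)

def cond_exp(f):
    """E(f | G) as a dict outcome -> block average."""
    out = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

def X(w):
    return w * w

V = cond_exp(X)                              # E(X | G)

def mse(U):
    """Mean square error E[(X - U)^2] of the predictor U."""
    return expect(lambda w: (X(w) - U[w]) ** 2)

def sq_dist(U):
    """E[(E(X | G) - U)^2]."""
    return expect(lambda w: (V[w] - U[w]) ** 2)

def from_block_values(values):
    """G-measurable random variable taking values[i] on the i-th block."""
    return {w: values[i] for i, block in enumerate(partition) for w in block}

candidates = [from_block_values(v) for v in [(2, 12, 30), (0, 0, 0), (3, 13, 31)]]
best = mse(V)
gaps = [mse(U) - best for U in candidates]   # excess error of each competitor
dists = [sq_dist(U) for U in candidates]
print(best, gaps, dists)
```

Each competitor's excess error equals its mean square distance from \( \E(X \mid \mathscr G) \), which is why equality forces \( U \) to agree with \( \E(X \mid \mathscr G) \) with probability 1.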

Once again, we assume that the real-valued random variables in this subsection have finite variance. The conditional variance of \( X \) given \(\mathscr G\) is naturally defined as follows: \[ \var(X \mid \mathscr G) = \E([X - \E(X \mid \mathscr G)]^2 \biggm| \mathscr G ) \] Like all conditional expected values relative to \( \mathscr G \), \( \var(X \mid \mathscr G) \) is a random variable that is measurable with respect to \( \mathscr G \). Note that by the conditioning rule, \[ \E[\var(X \mid \mathscr G)] = \E([X - \E(X \mid \mathscr G)]^2) \] which is the mean square error when \( \E(X \mid \mathscr G) \) is used as a predictor of \( X \).

\(\var(X \mid \mathscr G) = \E(X^2 \mid \mathscr G) - [\E(X \mid \mathscr G)]^2\).

Expanding the square in the definition and using basic properties of conditional expectation, we have

\begin{align} \var(X \mid \mathscr G) & = \E(X^2 - 2 X \E(X \mid \mathscr G) + [\E(X \mid \mathscr G)]^2 \biggm| \mathscr G ) = \E(X^2 \mid \mathscr G) - 2 \E[X \E(X \mid \mathscr G) \mid \mathscr G] + \E([\E(X \mid \mathscr G)]^2 \mid \mathscr G) \\ & = \E(X^2 \mid \mathscr G) - 2 \E(X \mid \mathscr G) \E(X \mid \mathscr G) + [\E(X \mid \mathscr G)]^2 = \E(X^2 \mid \mathscr G) - [\E(X \mid \mathscr G)]^2 \end{align}

\(\var(X) = \E[\var(X \mid \mathscr G)] + \var[\E(X \mid \mathscr G)]\).

From the previous theorem and properties of conditional expected value we have \( \E[\var(X \mid \mathscr G)] = \E(X^2) - \E([\E(X \mid \mathscr G)]^2) \). But \( \E(X^2) = \var(X) + [\E(X)]^2 \) and similarly, \(\E([\E(X \mid \mathscr G)]^2) = \var[\E(X \mid \mathscr G)] + (\E[\E(X \mid \mathscr G)])^2 \). But also, \( \E[\E(X \mid \mathscr G)] = \E(X) \) so substituting we get \( \E[\var(X \mid \mathscr G)] = \var(X) - \var[\E(X \mid \mathscr G)] \).

Thus, the variance of \( X \) is the expected conditional variance plus the variance of the conditional expected value. This result is often a good way to compute \(\var(X)\) when we know the conditional distribution of \(X\) given \(\mathscr G\).
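The variance decomposition can be verified exactly on a small example. The sketch assumes a fair die with \( X(\omega) = \omega \) and a hypothetical three-block partition generating \( \mathscr G \); conditional quantities are computed as block averages.

```python
from fractions import Fraction

omega = range(1, 7)                                  # one fair die
prob = {w: Fraction(1, 6) for w in omega}
partition = [(1, 2), (3, 4), (5, 6)]                 # generates G

def expect(f):
    return sum(f(w) * prob[w] for w in omega)

def cond_exp(f):
    """E(f | G) as a dict outcome -> block average."""
    out = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        avg = sum(f(w) * prob[w] for w in block) / p_block
        out.update({w: avg for w in block})
    return out

def variance(f):
    m = expect(f)
    return expect(lambda w: (f(w) - m) ** 2)

def X(w):
    return w

V = cond_exp(X)                                      # E(X | G)
cond_var = cond_exp(lambda w: (X(w) - V[w]) ** 2)    # var(X | G), from the definition

# The computational formula var(X | G) = E(X^2 | G) - [E(X | G)]^2
second_moment = cond_exp(lambda w: X(w) ** 2)
shortcut = {w: second_moment[w] - V[w] ** 2 for w in omega}

total = variance(X)                                  # var(X)
within = expect(cond_var.get)                        # E[var(X | G)]
between = variance(V.get)                            # var[E(X | G)]
print(total, within, between)
```

Here the expected conditional variance is \( \frac{1}{4} \) and the variance of the conditional expected value is \( \frac{8}{3} \), which sum to \( \var(X) = \frac{35}{12} \).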

Let us return to the study of predictors of the real-valued random variable \(X\), and compare them in terms of mean square error.

Predictors of \( X \).

- The best constant predictor of \(X\) is \(\E(X)\) with mean square error \(\var(X)\).
- If \(Y\) is another real-valued random variable, then the best predictor of \(X\) among linear functions of \(Y\) is \[ L(X \mid Y) = \E(X) + \frac{\cov(X,Y)}{\var(Y)}[Y - \E(Y)] \] with mean square error \( \var(X)[1 - \cor^2(X,Y)]\).
- If \(Y\) is a general random variable, then the best predictor of \(X\) among all real-valued functions of \(Y\) is \(\E(X \mid Y)\) with mean square error \(\E[\var(X \mid Y)] = \var(X) - \var[\E(X \mid Y)]\).
- If \(\mathscr G\) is a sub \( \sigma \)-algebra of \( \mathscr F \), then the best predictor of \(X\) among all real-valued random variables that are measurable with respect to \(\mathscr G\) is \(\E(X \mid \mathscr G)\) with mean square error \(\E[\var(X \mid \mathscr G)] = \var(X) - \var[\E(X \mid \mathscr G)]\).

Of course, (a) is a special case of (d) with \( \mathscr G = \{\emptyset, \Omega\} \) and (c) is a special case of (d) with \( \mathscr G = \sigma(Y) \). Only (b), the linear case, cannot be interpreted in terms of conditioning with respect to a \( \sigma \)-algebra.
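To compare the predictors numerically, the sketch below uses a hypothetical experiment: \( Y \) is uniform on \( \{0, 1, 2\} \), \( Z \) is an independent fair bit, and \( X = Y^2 + Z \). It computes the mean square errors of the best constant predictor, the best linear predictor, and \( \E(X \mid Y) \) by direct enumeration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: outcome w = (y, z) with Y = y uniform on {0, 1, 2},
# Z = z an independent fair bit, and X = Y^2 + Z
omega = list(product(range(3), range(2)))
prob = {w: Fraction(1, 6) for w in omega}

def Y(w):
    return w[0]

def X(w):
    return w[0] ** 2 + w[1]

def expect(f):
    return sum(f(w) * prob[w] for w in omega)

def cov(f, g):
    return expect(lambda w: f(w) * g(w)) - expect(f) * expect(g)

# (a) Best constant predictor E(X): the mean square error is var(X)
mse_const = cov(X, X)

# (b) Best linear predictor L(X | Y) = E(X) + [cov(X, Y) / var(Y)] [Y - E(Y)]
slope = cov(X, Y) / cov(Y, Y)

def L(w):
    return expect(X) + slope * (Y(w) - expect(Y))

mse_linear = expect(lambda w: (X(w) - L(w)) ** 2)

# (c) E(X | Y): the average of X over each level set {Y = y}
def cond_mean(w):
    level = [v for v in omega if Y(v) == Y(w)]
    return sum(X(v) * prob[v] for v in level) / sum(prob[v] for v in level)

mse_cond = expect(lambda w: (X(w) - cond_mean(w)) ** 2)
print(mse_const, mse_linear, mse_cond)
```

The errors decrease from (a) to (b) to (c), as expected since each class of predictors contains the previous one. The remaining error in (c) is the variance of \( Z \), the part of \( X \) that \( Y \) carries no information about.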