\(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\newcommand{\mse}{\text{MSE}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \( \newcommand{\bs}{\boldsymbol} \)

7. Conditional Expected Value

As usual, our starting point is a random experiment with probability measure \(\P\) on a sample space \(\Omega\). Suppose that \(X\) is a random variable taking values in a set \(S\) and that \(Y\) is a random variable taking values in \(T \subseteq \R\). In this section, we will study the conditional expected value of \(Y\) given \(X\), a concept of fundamental importance in probability. As we will see, the expected value of \(Y\) given \(X\) is the function of \(X\) that best approximates \(Y\) in the mean square sense. Note that \(X\) is a general random variable, not necessarily real-valued. In this section, we will assume that all expected values that are mentioned exist (as real numbers).

Basic Theory

Definitions

Note that we can think of \((X, Y)\) as a random variable that takes values in a subset of \(S \times T\). Suppose first that \( S \subseteq \R^n \) and that \((X, Y)\) has a (joint) continuous distribution with probability density function \(f\). Recall that \(X\) has probability density function \(g\) given by \[ g(x) = \int_T f(x, y) \, dy, \quad x \in S \] We assume that \( S \) is a minimal support set for \( X \) so that \( g(x) \gt 0 \) for \( x \in S \). Then the conditional probability density function of \(Y\) given \(X = x \in S\) is given by \[ h(y \mid x) = \frac{f(x, y)}{g(x)}, \quad y \in T \] We are now ready for the basic definitions:

The conditional expected value of \(Y\) given \(X = x \in S\) is simply the mean computed relative to the conditional distribution: \[ \E(Y \mid X = x) = \int_T y h(y \mid x) \, dy \] The function \(v: S \to \R\) defined by \( v(x) = \E(Y \mid X = x)\) for \( x \in S \) is the regression function of \(Y\) based on \(X\). The random variable \(v(X)\) is called the conditional expected value of \(Y\) given \(X\) and is denoted \(\E(Y \mid X)\).

In the case that \( (X, Y) \) has a joint discrete distribution, the definitions are the same, but with sums replacing the integrals. In the case of a mixed distribution, the definitions are also similar, with partial discrete and continuous density functions and a mixture of sums and integrals. Intuitively, we treat \(X\) as known, and therefore not random, and we then average \(Y\) with respect to the probability distribution that remains. The advanced section on conditional expected value gives a much more general definition that unifies the definitions given here for the various distribution types.
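As a concrete illustration of the discrete case, the following minimal sketch computes the regression function \( v(x) = \E(Y \mid X = x) \) from a joint probability density function, using exact rational arithmetic. The joint distribution `joint` and the helper names `marginal_x` and `regression` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y): keys are (x, y), values are probabilities.
joint = {
    (0, 1): F(1, 8), (0, 2): F(1, 8),
    (1, 1): F(1, 4), (1, 3): F(1, 4),
    (2, 2): F(1, 8), (2, 4): F(1, 8),
}

def marginal_x(joint):
    """g(x) = sum over y of f(x, y)."""
    g = {}
    for (x, _), p in joint.items():
        g[x] = g.get(x, 0) + p
    return g

def regression(joint):
    """v(x) = E(Y | X = x) = sum over y of y * f(x, y) / g(x)."""
    g = marginal_x(joint)
    v = {}
    for (x, y), p in joint.items():
        v[x] = v.get(x, 0) + y * p / g[x]
    return v

v = regression(joint)
for x in sorted(v):
    print(x, v[x])
```

The random variable \( \E(Y \mid X) \) is then obtained by evaluating `v` at the observed value of \( X \).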

Properties

The most important property of the random variable \(\E(Y \mid X)\) is given in the following theorem. In a sense, this result states that \( \E(Y \mid X) \) behaves just like \( Y \) in terms of other functions of \( X \), and is essentially the only function of \( X \) with this property.

The fundamental property

  1. \( \E\left[r(X) \E(Y \mid X)\right] = \E\left[r(X) Y\right] \) for every function \( r: S \to \R \).
  2. If \(u: S \to \R\) and \(u(X)\) satisfy the condition in (a), then \( \P\left[u(X) = \E(Y \mid X)\right] = 1 \).
Proof:
  1. Consider the continuous case, with the notation established above. From the change of variables theorem for expected value, \begin{align} \E\left[r(X) \E(Y \mid X)\right] & = \int_S r(x) \E(Y \mid X = x) g(x) \, dx = \int_S r(x) \left(\int_T y h(y \mid x) \, dy \right) g(x) \, dx\\ & = \int_S \int_T r(x) y h(y \mid x) g(x) \, dy \, dx = \int_{S \times T} r(x) y f(x, y) \, d(x, y) = \E[r(X) Y] \end{align}
  2. Suppose that \( u_1 \) and \( u_2 \) are functions from \( S \) into \( \R \) and that \( u_1(X) \) and \( u_2(X) \) satisfy the condition in (a). Then the indicator random variable \(r(X) = \bs{1}\left[u_1(X) \gt u_2(X)\right] \) is a function of \( X \). Hence by assumption, \(\E\left[r(X) u_1(X)\right] = \E\left[r(X) Y\right] = \E\left[r(X) u_2(X)\right]\). But if \( \P\left[u_1(X) \gt u_2(X)\right] \gt 0 \) then \( \E\left[r(X) u_1(X)\right] \gt \E\left[r(X) u_2(X)\right] \), a contradiction. Hence we must have \( \P\left[u_1(X) \gt u_2(X)\right] = 0 \) and by a symmetric argument, \( \P[u_1(X) \lt u_2(X)] = 0 \). Therefore \( \P\left[u_1(X) = u_2(X)\right] = 1 \).

Two random variables that are equal with probability 1 are said to be equivalent. We often think of equivalent random variables as being essentially the same object, so the fundamental property above essentially characterizes \( \E(Y \mid X) \). That is, we can think of \( \E(Y \mid X) \) as any random variable that is a function of \( X \) and satisfies this property. Moreover the fundamental property can be used as a definition of conditional expected value, regardless of the type of the distribution of \((X, Y)\). If you are interested, read the more advanced treatment of conditional expected value.
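The fundamental property can be checked exactly for a small discrete distribution. The sketch below uses a hypothetical joint probability density function and a handful of arbitrarily chosen test functions \( r \); the helper names `expect` and `cond_mean` are illustrative only.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), chosen only for illustration.
joint = {
    (0, 1): F(1, 8), (0, 2): F(1, 8),
    (1, 1): F(1, 4), (1, 3): F(1, 4),
    (2, 2): F(1, 8), (2, 4): F(1, 8),
}

def expect(h):
    """E[h(X, Y)] under the joint pmf."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

def cond_mean(x0):
    """E(Y | X = x0), computed from the joint pmf."""
    gx = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(y * p for (x, y), p in joint.items() if x == x0) / gx

# A few arbitrary functions r of X.
test_functions = [lambda x: 1, lambda x: x, lambda x: x ** 2 - 3, lambda x: int(x == 1)]

checks = []
for r in test_functions:
    lhs = expect(lambda x, y: r(x) * cond_mean(x))  # E[r(X) E(Y | X)]
    rhs = expect(lambda x, y: r(x) * y)             # E[r(X) Y]
    checks.append((lhs, rhs))
    print(lhs, rhs)
```

In each row the two printed values agree, as part (a) of the fundamental property requires.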

Suppose that \( X \) is also real-valued. Recall that the best linear predictor of \( Y \) based on \( X \) was characterized by property (a), but with just two functions: \( r(x) = 1 \) and \( r(x) = x \). Thus the characterization in the fundamental property is certainly reasonable, since (as we show below) \( \E(Y \mid X) \) is the best predictor of \( Y \) among all functions of \( X \), not just linear functions.

The fundamental property above is also very useful for establishing other properties of conditional expected value. Our first consequence is the fact that \( Y \) and \( \E(Y \mid X) \) have the same mean.

\(\E\left[\E(Y \mid X)\right] = \E(Y)\).

Proof:

Let \(r\) be the constant function 1 in the fundamental property.

Aside from the theoretical interest, the previous theorem is often a good way to compute \(\E(Y)\) when we know the conditional distribution of \(Y\) given \(X\). We say that we are computing the expected value of \(Y\) by conditioning on \(X\).
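Here is a small sketch of computing a mean by conditioning. The experiment is hypothetical (it does not appear elsewhere in this section): a fair die is thrown, giving score \( X \), and then \( X \) fair coins are tossed, with \( Y \) the number of heads. Given \( X = x \), the distribution of \( Y \) is binomial with parameters \( x \) and \( \frac{1}{2} \).

```python
from fractions import Fraction as F
from math import comb

# Marginal pmf of the die score X.
g = {x: F(1, 6) for x in range(1, 7)}

def h(y, x):
    """Conditional pmf of Y given X = x: binomial with parameters x and 1/2."""
    return F(comb(x, y), 2 ** x)

def cond_mean(x):
    """E(Y | X = x)."""
    return sum(y * h(y, x) for y in range(x + 1))

# E(Y) by conditioning on X: the mean of E(Y | X).
mean_y = sum(cond_mean(x) * g[x] for x in g)

# E(Y) directly from the joint pmf f(x, y) = g(x) h(y | x).
mean_y_direct = sum(y * g[x] * h(y, x) for x in g for y in range(x + 1))

print(mean_y, mean_y_direct)
```

Both computations give \( \frac{7}{4} \), but the first needs only the regression function \( \E(Y \mid X = x) = x / 2 \) and the distribution of \( X \).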

For many basic properties of ordinary expected value, there are analogous results for conditional expected value. We start with the two most important, since every type of expected value must satisfy them: linearity and monotonicity. In the following two theorems, the random variables \( Y \) and \( Z \) are real-valued, and as before, \( X \) is a general random variable.

Linearity

  1. \(\E(Y + Z \mid X) = \E(Y \mid X) + \E(Z \mid X)\).
  2. \(\E(c \, Y \mid X) = c \, \E(Y \mid X)\) for every constant \( c \in \R \).
Proof:
  1. Note that \( \E(Y \mid X) + \E(Z \mid X) \) is a function of \( X \). If \( r: S \to \R \) then \[ \E\left(r(X) \left[\E(Y \mid X) + \E(Z \mid X)\right]\right) = \E\left[r(X) \E(Y \mid X)\right] + \E\left[r(X) \E(Z \mid X)\right] = \E\left[r(X) Y\right] + \E\left[r(X) Z\right] = \E\left[r(X) (Y + Z)\right] \] Hence the result follows from the fundamental property.
  2. Note that \( c \E(Y \mid X) \) is a function of \( X \). If \( r: S \to \R \) then \[ \E\left[r(X) c \E(Y \mid X)\right] = c \E\left[r(X) \E(Y \mid X)\right] = c \E\left[r(X) Y\right] = \E\left[r(X) (c Y)\right] \] Hence the result follows from the fundamental property.

Monotonicity

  1. If \(Y \ge 0\) then \(\E(Y \mid X) \ge 0\).
  2. If \(Y \le Z\) then \(\E(Y \mid X) \le \E(Z \mid X)\).
  3. \( \left|\E(Y \mid X)\right| \le \E\left(\left|Y\right| \mid X\right)\)
Proof:
  1. This follows directly from the definition.
  2. Note that if \( Y \le Z \) then \( Z - Y \ge 0 \) so by (a) and linearity, \[ \E(Z - Y \mid X) = \E(Z \mid X) - \E(Y \mid X) \ge 0 \]
  3. Note that \( -\left|Y\right| \le Y \le \left|Y\right| \) and hence by (b), and linearity, \(-\E\left(\left|Y\right| \mid X \right) \le \E(Y \mid X) \le \E\left(\left|Y\right| \mid X\right)\).

Our next few properties relate to the idea that \( \E(Y \mid X) \) is the expected value of \( Y \) given \( X \). The first property is essentially a restatement of the fundamental property.

If \(r: S \to \R\), then \(Y - \E(Y \mid X)\) and \(r(X)\) are uncorrelated.

Proof:

Note that \( Y - \E(Y \mid X) \) has mean 0 by the equal means property above. Hence, by the fundamental property, \[ \cov\left[Y - \E(Y \mid X), r(X)\right] = \E\left\{\left[Y - \E(Y \mid X)\right] r(X)\right\} = \E\left[Y r(X)\right] - \E\left[\E(Y \mid X) r(X)\right] = 0 \]

The next result states that any (deterministic) function of \(X\) acts like a constant in terms of the conditional expected value with respect to \(X\).

If \(s: S \to \R\) then \[ \E\left[s(X)\,Y \mid X\right] = s(X)\,\E(Y \mid X) \]

Proof:

Note that \( s(X) \E(Y \mid X) \) is a function of \( X \). If \( r: S \to \R \) then, applying the fundamental property to the product function \( r s \), \[ \E\left[r(X) s(X) \E(Y \mid X)\right] = \E\left[r(X) s(X) Y\right] \] So the result now follows from the fundamental property.

The following rule generalizes the previous result and is sometimes referred to as the substitution rule for conditional expected value.

If \(s: S \times T \to \R\) then \[ \E\left[s(X, Y) \mid X = x\right] = \E\left[s(x, Y) \mid X = x\right] \]

In particular, it follows from the substitution rule that \(\E[s(X) \mid X] = s(X)\). At the opposite extreme, we have the next result: If \(X\) and \(Y\) are independent, then knowledge of \(X\) gives no information about \(Y\) and so the conditional expected value with respect to \(X\) reduces to the ordinary (unconditional) expected value of \(Y\).

If \(X\) and \(Y\) are independent then \[ \E(Y \mid X) = \E(Y) \]

Proof:

Trivially, \( \E(Y) \) is a (constant) function of \( X \). If \( r: S \to \R \) then \( \E\left[\E(Y) r(X)\right] = \E(Y) \E\left[r(X)\right] = \E\left[Y r(X)\right] \), the last equality by independence. Hence the result follows from the fundamental property.

Suppose now that \(Z\) is real-valued and that \(X\) and \(Y\) are random variables (all defined on the same probability space, of course). The following theorem gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional expected value with respect to the minimum amount of information. For simplicity, we write \( \E(Z \mid X, Y) \) rather than \( \E\left[Z \mid (X, Y)\right] \).

Consistency

  1. \(\E\left[\E(Z \mid X, Y) \mid X\right] = \E(Z \mid X)\)
  2. \(\E\left[\E(Z \mid X) \mid X, Y\right] = \E(Z \mid X)\)
Proof:
  1. Suppose that \( X \) takes values in \( S \) and \( Y \) takes values in \( T \), so that \( (X, Y) \) takes values in \( S \times T \). By definition, \( \E(Z \mid X) \) is a function of \( X \). If \( r: S \to \R \) then trivially \( r \) can be thought of as a function on \( S \times T \) as well. Hence \[ \E\left[r(X) \E(Z \mid X)\right] = \E\left[r(X) Z\right] = \E\left[r(X) \E(Z \mid X, Y)\right] \] It follows from the fundamental property that \(\E\left[\E(Z \mid X, Y) \mid X\right] = \E(Z \mid X) \).
  2. Note that since \( \E(Z \mid X) \) is a function of \( X \), it is trivially a function of \( (X, Y) \). Hence from the factoring property, \( \E\left[\E(Z \mid X) \mid X, Y\right] = \E(Z \mid X) \).

Finally we show that \( \E(Y \mid X) \) has the same covariance with \( X \) as does \( Y \), not surprising since again, \( \E(Y \mid X) \) behaves just like \( Y \) in its relations with \( X \).

\(\cov\left[X, \E(Y \mid X)\right] = \cov(X, Y)\).

Proof:

\( \cov\left[X, \E(Y \mid X)\right] = \E\left[X \E(Y \mid X)\right] - \E(X) \E\left[\E(Y \mid X)\right] \). But \( \E\left[X \E(Y \mid X)\right] = \E(X Y) \) by the fundamental property, and \( \E\left[\E(Y \mid X)\right] = \E(Y) \) by the equal means property. Hence \( \cov\left[X, \E(Y \mid X)\right] = \E(X Y) - \E(X) \E(Y) = \cov(X, Y) \).

Conditional Probability

The conditional probability of an event \(A\), given random variable \(X\), is a special case of the conditional expected value. As usual, let \(\bs{1}_A\) denote the indicator random variable of \(A\). We define \[ \P(A \mid X) = \E\left(\bs{1}_A \mid X\right) \] The fundamental property for conditional probability is as follows:

The fundamental property

  1. \( \E\left[r(X) \P(A \mid X)\right] = \E\left[r(X) \bs{1}_A\right] \) for every function \( r: S \to \R \).
  2. If \( u: S \to \R \) and \( u(X) \) satisfies the condition in (a), then \( \P\left[u(X) = \P(A \mid X)\right] = 1 \).

For example, suppose that \( X \) has a discrete distribution on a countable set \( S \) with probability density function \( g \). Then (a) becomes \[ \sum_{x \in S} r(x) \P(A \mid X = x) g(x) = \sum_{x \in S} r(x) \P(A, X = x) \] But this is obvious since \( \P(A \mid X = x) = \P(A, X = x) \big/ \P(X = x) \) and \( g(x) = \P(X = x) \). Similarly, if \( X \) has a continuous distribution on \( S \subseteq \R^n \) then (a) states that \[ \E\left[r(X) \bs{1}_A\right] = \int_S r(x) \P(A \mid X = x) g(x) \, dx \]

The properties above for conditional expected value, of course, have special cases for conditional probability.

\(\P(A) = \E\left[\P(A \mid X)\right]\).

Proof:

This is a direct result of the general equal means property above, since \( \E(\bs{1}_A) = \P(A) \).

Again, the previous result is often a good way to compute \(\P(A)\) when we know the conditional probability of \(A\) given \(X\). We say that we are computing the probability of \(A\) by conditioning on \(X\). This is a very compact and elegant version of the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the chapter on Distributions.

The following result gives the conditional version of the axioms of probability.

Axioms of probability

  1. \( \P(A \mid X) \ge 0 \) for every event \( A \).
  2. \( \P(\Omega \mid X) = 1 \)
  3. If \( \{A_i: i \in I\} \) is a countable collection of disjoint events then \( \P\left(\bigcup_{i \in I} A_i \bigm| X\right) = \sum_{i \in I} \P(A_i \mid X)\).

From the last result, it follows that the other standard probability rules also hold for conditional probability given \( X \).

The Best Predictor

The next result shows that, of all functions of \(X\), \(\E(Y \mid X)\) is closest to \(Y\), in the sense of mean square error. This is fundamentally important in statistical problems where the predictor vector \(X\) can be observed but not the response variable \(Y\). In this subsection and the next, we assume that the real-valued random variables have finite variance.

If \(u: S \to \R\), then

  1. \(\E\left(\left[\E(Y \mid X) - Y\right]^2\right) \le \E\left(\left[u(X) - Y\right]^2\right)\)
  2. Equality holds in (a) if and only if \(u(X) = \E(Y \mid X)\) with probability 1.
Proof:
  1. Note that \begin{align} \E\left(\left[Y - u(X)\right]^2\right) & = \E\left(\left[Y - \E(Y \mid X) + \E(Y \mid X) - u(X)\right]^2\right) \\ & = \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) + 2 \E\left(\left[Y - \E(Y \mid X)\right] \left[\E(Y \mid X) - u(X)\right]\right) + \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) \end{align} But \( Y - \E(Y \mid X) \) has mean 0, so the middle term on the right is \( 2 \cov\left[Y - \E(Y \mid X), \E(Y \mid X) - u(X)\right] \). Moreover, \( \E(Y \mid X) - u(X) \) is a function of \( X \) and hence is uncorrelated with \( Y - \E(Y \mid X) \) by the general uncorrelated property. Hence the middle term is 0, so \[ \E\left(\left[Y - u(X)\right]^2\right) = \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) + \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) \] and therefore \( \E\left(\left[Y - \E(Y \mid X)\right]^2 \right) \le \E\left(\left[Y - u(X)\right]^2\right) \).
  2. Equality holds if and only if \( \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) = 0 \), if and only if \( \P\left[u(X) = \E(Y \mid X)\right] = 1 \).
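The decomposition used in the proof can be checked exactly in a small discrete setting. In the sketch below, the joint probability density function and the competing predictors in `candidates` are hypothetical, chosen only for illustration. For each predictor \( u(X) \), the code compares the mean square error of \( u(X) \) with the mean square error of \( \E(Y \mid X) \) plus the gap \( \E\left(\left[\E(Y \mid X) - u(X)\right]^2\right) \).

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), chosen only for illustration.
joint = {
    (0, 1): F(1, 8), (0, 2): F(1, 8),
    (1, 1): F(1, 4), (1, 3): F(1, 4),
    (2, 2): F(1, 8), (2, 4): F(1, 8),
}

def expect(h):
    """E[h(X, Y)] under the joint pmf."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

def cond_mean(x0):
    """E(Y | X = x0)."""
    gx = sum(p for (x, _), p in joint.items() if x == x0)
    return sum(y * p for (x, y), p in joint.items() if x == x0) / gx

def mse(u):
    """Mean square error E([Y - u(X)]^2) of the predictor u(X)."""
    return expect(lambda x, y: (y - u(x)) ** 2)

mean_y = expect(lambda x, y: y)
mse_best = mse(cond_mean)

# Some competing predictors u(X).
candidates = {
    "constant E(Y)": lambda x: mean_y,
    "u(x) = 1 + x": lambda x: 1 + x,
    "u(x) = x^2": lambda x: x ** 2,
}
for name, u in candidates.items():
    gap = expect(lambda x, y: (cond_mean(x) - u(x)) ** 2)
    print(name, mse(u), mse_best + gap)
```

In every row the two printed values agree, and none is smaller than the mean square error of \( \E(Y \mid X) \) itself.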

Suppose now that \(X\) is real-valued. In the section on covariance and correlation, we found that the best linear predictor of \(Y\) given \(X\) is

\[ L(Y \mid X) = \E(Y) + \frac{\cov(X,Y)}{\var(X)} \left[X - \E(X)\right] \]

On the other hand, \(\E(Y \mid X)\) is the best predictor of \(Y\) among all functions of \(X\). It follows that if \(\E(Y \mid X)\) happens to be a linear function of \(X\) then it must be the case that \(\E(Y \mid X) = L(Y \mid X)\). However, we will give a direct proof also:

If \(\E(Y \mid X) = a + b X\) for constants \(a\) and \(b\) then \( \E(Y \mid X) = L(Y \mid X) \); that is,

  1. \(b = \cov(X,Y) \big/ \var(X) \)
  2. \(a = \E(Y) - \E(X) \cov(X,Y) \big/ \var(X) \)
Proof:

First, \( \E(Y) = \E\left[\E(Y \mid X)\right] = a + b \E(X) \), so \( a = \E(Y) - b \E(X) \). Next, \( \cov(X, Y) = \cov\left[X, \E(Y \mid X)\right] = \cov(X, a + b X) = b \var(X) \) and therefore \( b = \cov(X, Y) \big/ \var(X) \).

Conditional Variance

The conditional variance of \( Y \) given \( X \) is defined like the ordinary variance, but with all expected values conditioned on \( X \).

The conditional variance of \(Y\) given \(X\) is defined as \[ \var(Y \mid X) = \E\left(\left[Y - \E(Y \mid X)\right]^2 \biggm| X \right) \]

Thus, \( \var(Y \mid X) \) is a function of \( X \), and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard variance—the variance is the mean of the square minus the square of the mean, but now with all expected values conditioned on \( X \):

\(\var(Y \mid X) = \E\left(Y^2 \mid X\right) - \left[\E(Y \mid X)\right]^2\).

Proof:

Expanding the square in the definition and using basic properties of conditional expectation, we have

\begin{align} \var(Y \mid X) & = \E\left(Y^2 - 2 Y \E(Y \mid X) + \left[\E(Y \mid X)\right]^2 \biggm| X \right) = \E(Y^2 \mid X) - 2 \E\left[Y \E(Y \mid X) \mid X\right] + \E\left(\left[\E(Y \mid X)\right]^2 \mid X\right) \\ & = \E\left(Y^2 \mid X\right) - 2 \E(Y \mid X) \E(Y \mid X) + \left[\E(Y \mid X)\right]^2 = \E\left(Y^2 \mid X\right) - \left[\E(Y \mid X)\right]^2 \end{align}

Our next result shows how to compute the ordinary variance of \( Y \) by conditioning on \( X \).

\(\var(Y) = \E\left[\var(Y \mid X)\right] + \var\left[\E(Y \mid X)\right]\).

Proof:

From the previous theorem and properties of conditional expected value we have \( \E\left[\var(Y \mid X)\right] = \E\left(Y^2\right) - \E\left(\left[\E(Y \mid X)\right]^2\right) \). But \( \E\left(Y^2\right) = \var(Y) + \left[\E(Y)\right]^2 \) and similarly, \(\E\left(\left[\E(Y \mid X)\right]^2\right) = \var\left[\E(Y \mid X)\right] + \left(\E\left[\E(Y \mid X)\right]\right)^2 \). But also, \( \E\left[\E(Y \mid X)\right] = \E(Y) \) so substituting we get \( \E\left[\var(Y \mid X)\right] = \var(Y) - \var\left[\E(Y \mid X)\right] \).

Thus, the variance of \( Y \) is the expected conditional variance plus the variance of the conditional expected value. This result is often a good way to compute \(\var(Y)\) when we know the conditional distribution of \(Y\) given \(X\).
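The following sketch verifies the variance decomposition exactly for a hypothetical experiment that is not part of this section: a fair die is thrown, giving score \( X \), and then \( X \) fair coins are tossed, with \( Y \) the number of heads, so that the conditional distribution of \( Y \) given \( X = x \) is binomial with parameters \( x \) and \( \frac{1}{2} \).

```python
from fractions import Fraction as F
from math import comb

# Marginal pmf of the die score X.
g = {x: F(1, 6) for x in range(1, 7)}

def h(y, x):
    """Conditional pmf of Y given X = x: binomial with parameters x and 1/2."""
    return F(comb(x, y), 2 ** x)

def cond_moment(x, k):
    """E(Y^k | X = x)."""
    return sum(y ** k * h(y, x) for y in range(x + 1))

cond_mean = {x: cond_moment(x, 1) for x in g}
cond_var = {x: cond_moment(x, 2) - cond_mean[x] ** 2 for x in g}

# The two terms on the right side of the decomposition.
mean_y = sum(cond_mean[x] * g[x] for x in g)
expected_cond_var = sum(cond_var[x] * g[x] for x in g)
var_cond_mean = sum(cond_mean[x] ** 2 * g[x] for x in g) - mean_y ** 2

# var(Y) computed directly from the joint pmf f(x, y) = g(x) h(y | x).
second_moment = sum(y ** 2 * g[x] * h(y, x) for x in g for y in range(x + 1))
var_y = second_moment - mean_y ** 2

print(expected_cond_var, var_cond_mean, var_y)
```

Here \( \E\left[\var(Y \mid X)\right] = \frac{7}{8} \) and \( \var\left[\E(Y \mid X)\right] = \frac{35}{48} \), which add up to \( \var(Y) = \frac{77}{48} \).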

From the definition of conditional variance and the basic property above, it follows that the mean square error when \( \E(Y \mid X) \) is used as a predictor of \( Y \) is \[ \E\left(\left[Y - \E(Y \mid X)\right]^2\right) = \E\left[\var(Y \mid X)\right] = \var(Y) - \var\left[\E(Y \mid X)\right] \] Now let us return to the study of predictors of the real-valued random variable \(Y\), and compare the three predictors we have studied in terms of mean square error.

Suppose that \( Y \) is a real-valued random variable.

  1. The best constant predictor of \(Y\) is \(\E(Y)\) with mean square error \(\var(Y)\).
  2. If \(X\) is another real-valued random variable, then the best linear predictor of \(Y\) given \(X\) is \[ L(Y \mid X) = \E(Y) + \frac{\cov(X,Y)}{\var(X)}\left[X - \E(X)\right] \] with mean square error \(\var(Y)\left[1 - \cor^2(X,Y)\right]\).
  3. If \(X\) is a general random variable, then the best overall predictor of \(Y\) given \(X\) is \(\E(Y \mid X)\) with mean square error \(\E\left[\var(Y \mid X)\right] = \var(Y) - \var\left[\E(Y \mid X)\right]\).

Conditional Covariance

Suppose that \( Y \) and \( Z \) are real-valued random variables, and that \( X \) is a general random variable, all defined on our underlying probability space. Analogous to variance, the conditional covariance of \( Y \) and \( Z \) given \( X \) is defined like the ordinary covariance, but with all expected values conditioned on \( X \).

The conditional covariance of \(Y\) and \( Z \) given \(X\) is defined as \[ \cov(Y, Z \mid X) = \E\left([Y - \E(Y \mid X)] [Z - \E(Z \mid X)] \biggm| X \right) \]

Thus, \( \cov(Y, Z \mid X) \) is a function of \( X \), and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard covariance—the covariance is the mean of the product minus the product of the means, but now with all expected values conditioned on \( X \):

\(\cov(Y, Z \mid X) = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X)\).

Proof:

Expanding the product in the definition and using basic properties of conditional expectation, we have

\begin{align} \cov(Y, Z \mid X) & = \E\left(Y Z - Y \E(Z \mid X) - Z \E(Y \mid X) + \E(Y \mid X) \E(Z \mid X) \biggm| X \right) = \E(Y Z \mid X) - \E\left[Y \E(Z \mid X) \mid X\right] - \E\left[Z \E(Y \mid X) \mid X\right] + \E\left[\E(Y \mid X) \E(Z \mid X) \mid X\right] \\ & = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X) - \E(Y \mid X) \E(Z \mid X) + \E(Y \mid X) \E(Z \mid X) = \E\left(Y Z \mid X\right) - \E(Y \mid X) \E(Z \mid X) \end{align}

Our next result shows how to compute the ordinary covariance of \( Y \) and \( Z \) by conditioning on \( X \).

\(\cov(Y, Z) = \E\left[\cov(Y, Z \mid X)\right] + \cov\left[\E(Y \mid X), \E(Z \mid X) \right]\).

Proof:

From the previous theorem and properties of conditional expected value we have \[ \E\left[\cov(Y, Z \mid X)\right] = \E(Y Z) - \E\left[\E(Y\mid X) \E(Z \mid X) \right] \] But \( \E(Y Z) = \cov(Y, Z) + \E(Y) \E(Z)\) and similarly, \[\E\left[\E(Y \mid X) \E(Z \mid X)\right] = \cov\left[\E(Y \mid X), \E(Z \mid X)\right] + \E\left[\E(Y\mid X)\right] \E\left[\E(Z \mid X)\right]\] But also, \( \E\left[\E(Y \mid X)\right] = \E(Y) \) and \( \E\left[\E(Z \mid X)\right] = \E(Z) \) so substituting we get \[ \E\left[\cov(Y, Z \mid X)\right] = \cov(Y, Z) - \cov\left[\E(Y \mid X), \E(Z \mid X)\right] \]

Thus, the covariance of \( Y \) and \( Z \) is the expected conditional covariance plus the covariance of the conditional expected values. This result is often a good way to compute \(\cov(Y, Z)\) when we know the conditional distribution of \((Y, Z)\) given \(X\).
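The covariance decomposition can be verified exactly in a hypothetical example (not taken from this section): a fair die is thrown, giving score \( X \), then \( X \) fair coins are tossed, with \( Y \) the number of heads and \( Z = X - Y \) the number of tails. The helper names in this sketch are illustrative only.

```python
from fractions import Fraction as F
from math import comb

# Joint pmf of (X, Y, Z), with Z = X - Y determined by X and Y.
joint = {(x, y, x - y): F(1, 6) * F(comb(x, y), 2 ** x)
         for x in range(1, 7) for y in range(x + 1)}

def expect(h, given=None):
    """E[h(X, Y, Z)], or E[h(X, Y, Z) | X = given] when given is not None."""
    items = [(k, p) for k, p in joint.items() if given is None or k[0] == given]
    total = sum(p for _, p in items)
    return sum(h(*k) * p for k, p in items) / total

def cov(a, b, given=None):
    """Covariance of a(X, Y, Z) and b(X, Y, Z), optionally conditioned on X = given."""
    product = expect(lambda x, y, z: a(x, y, z) * b(x, y, z), given)
    return product - expect(a, given) * expect(b, given)

def Y(x, y, z):
    return y

def Z(x, y, z):
    return z

# E[cov(Y, Z | X)]: average the conditional covariances over the die score.
expected_cond_cov = sum(cov(Y, Z, given=x) * F(1, 6) for x in range(1, 7))

# cov[E(Y | X), E(Z | X)]: covariance of the two regression functions of X.
cov_cond_means = cov(lambda x, y, z: expect(Y, given=x),
                     lambda x, y, z: expect(Z, given=x))

print(expected_cond_cov, cov_cond_means, cov(Y, Z))
```

The first two printed values, \( -\frac{7}{8} \) and \( \frac{35}{48} \), add up to the third, \( \cov(Y, Z) = -\frac{7}{48} \).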

Examples and Applications

As always, be sure to try the proofs and computations yourself before reading the ones in the text.

Simple Continuous Distributions

Suppose that \((X,Y)\) has probability density function \(f(x,y) = x + y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).

  1. Find \(L(Y \mid X)\).
  2. Find \(\E(Y \mid X)\).
  3. Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
  4. Find \(\var(Y)\).
  5. Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
  6. Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer:
  1. \(\frac{7}{11} - \frac{1}{11} X\)
  2. \(\frac{3 X + 2}{6 X + 3}\)
  4. \(\frac{11}{144} = 0.0764\)
  5. \(\frac{5}{66} = 0.0758\)
  6. \(\frac{1}{12} - \frac{1}{144} \ln(3) = 0.0757\)
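The three mean square errors in this exercise can be checked numerically. The sketch below approximates \( \E\left(\left[Y - u(X)\right]^2\right) = \int_0^1 \int_0^1 [y - u(x)]^2 (x + y) \, dy \, dx \) with a simple midpoint rule, for the constant predictor \( \E(Y) = \frac{7}{12} \), the linear predictor, and the regression function. The function names and the grid size `n` are illustrative choices, not part of the text.

```python
from math import log

def f(x, y):
    """Joint density of (X, Y) on the unit square."""
    return x + y

def integrate(h, n=200):
    """Midpoint-rule approximation of the integral of h over the unit square."""
    step = 1.0 / n
    pts = [(i + 0.5) * step for i in range(n)]
    return sum(h(x, y) for x in pts for y in pts) * step * step

def mse(u):
    """Approximate mean square error of the predictor u(X)."""
    return integrate(lambda x, y: (y - u(x)) ** 2 * f(x, y))

mse_const = mse(lambda x: 7 / 12)                     # best constant predictor E(Y)
mse_linear = mse(lambda x: 7 / 11 - x / 11)           # L(Y | X = x)
mse_best = mse(lambda x: (3 * x + 2) / (6 * x + 3))   # E(Y | X = x)

print(round(mse_const, 4), round(mse_linear, 4), round(mse_best, 4))
```

The approximations decrease from the constant predictor to the linear predictor to the regression function, in line with the values 0.0764, 0.0758, and 0.0757.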

Suppose that \((X,Y)\) has probability density function \(f(x,y) = 2 (x + y)\) for \(0 \le x \le y \le 1\).

  1. Find \(L(Y \mid X)\).
  2. Find \(\E(Y \mid X)\).
  3. Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
  4. Find \(\var(Y)\).
  5. Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
  6. Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer:
  1. \(\frac{26}{43} + \frac{15}{43} X\)
  2. \(\frac{5 X^2 + 5 X + 2}{9 X + 3}\)
  4. \(\frac{3}{80} = 0.0375\)
  5. \(\frac{13}{430} = 0.0302\)
  6. \(\frac{1837}{21\;870} - \frac{512}{6561} \ln(2) = 0.0299\)

Suppose that \((X,Y)\) has probability density function \(f(x,y) = 6 x^2 y\) for \(0 \le x \le 1\), \(0 \le y \le 1\).

  1. Find \(L(Y \mid X)\).
  2. Find \(\E(Y \mid X)\).
  3. Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
  4. Find \(\var(Y)\).
  5. Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
  6. Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer:

Note that \(X\) and \(Y\) are independent.

  1. \(\frac{2}{3}\)
  2. \(\frac{2}{3}\)
  4. \(\frac{1}{18}\)
  5. \(\frac{1}{18}\)
  6. \(\frac{1}{18}\)

Suppose that \((X,Y)\) has probability density function \(f(x,y) = 15 x^2 y\) for \(0 \le x \le y \le 1\).

  1. Find \(L(Y \mid X)\).
  2. Find \(\E(Y \mid X)\).
  3. Graph \(L(Y \mid X = x)\) and \(\E(Y \mid X = x)\) as functions of \(x\), on the same axes.
  4. Find \(\var(Y)\).
  5. Find \(\var(Y)\left[1 - \cor^2(X, Y)\right]\).
  6. Find \(\var(Y) - \var\left[\E(Y \mid X)\right]\).
Answer:
  1. \(\frac{30}{51} + \frac{20}{51}X\)
  2. \(\frac{2(X^2 + X + 1)}{3(X + 1)}\)
  4. \(\frac{5}{252} = 0.0198\)
  5. \(\frac{5}{357} = 0.0140\)
  6. \(\frac{292}{63} - \frac{20}{3} \ln(2) = 0.0139\)

Exercises on Basic Properties

Suppose that \(X\), \(Y\), and \(Z\) are real-valued random variables with \(\E(Y \mid X) = X^3\) and \(\E(Z \mid X) = \frac{1}{1 + X^2}\). Find \(\E\left[Y\,e^X - Z\,\sin(X) \mid X\right]\).

Answer:

\(X^3 e^X - \frac{\sin(X)}{1 + X^2}\)

Uniform Distributions

Suppose that \( S \subset \R^2 \) has positive, finite area \( \lambda_2(S) \). Recall that the uniform distribution on \( S \) is a continuous distribution with probability density function \( f \) given by \[ f(x, y) = \frac{1}{\lambda_2(S)}, \quad (x, y) \in S \] Recall also that the projection of \( S \) onto the \( x \)-axis is \( R = \{x \in \R: (x, y) \in S \text{ for some } y \in \R\} \), and the cross section of \( S \) at \( x \in R \) is \( S_x = \{y \in \R: (x, y) \in S\} \).

Suppose that \( (X, Y) \) is uniformly distributed on a region \( S \subseteq \R^2 \) with the property that \( S_x \) is a bounded interval with midpoint \( m(x) \) and length \( l(x) \) for each \( x \in R \). Then

  1. \( \E(Y \mid X) = m(X) \)
  2. \( \var(Y \mid X) = \frac{1}{12}l^2(X) \)
Proof:

This follows immediately from the fact that the conditional distribution of \( Y \) given \( X = x \) is uniformly distributed on \( S_x \) for each \( x \in R \).

In each case below, suppose that \( (X,Y) \) is uniformly distributed on the given region. Find \(\E(Y \mid X)\) and \( \var(Y \mid X) \).

  1. The rectangular region \(R = [a, b] \times [c, d]\) where \(a \lt b\) and \(c \lt d\).
  2. The triangular region \(T = \left\{(x,y) \in \R^2: -a \le x \le y \le a\right\}\) where \(a \gt 0\).
  3. The circular region \( C = \left\{(x, y) \in \R^2: x^2 + y^2 \le r^2\right\} \) where \( r \gt 0 \).
Answer:
  1. \(\E(Y \mid X) = \frac{1}{2}(c + d)\), \( \var(Y \mid X) = \frac{1}{12}(d - c)^2 \). Note that \( X \) and \( Y \) are independent.
  2. \(\E(Y \mid X) = \frac{1}{2}(a + X)\), \( \var(Y \mid X) = \frac{1}{12}(a - X)^2 \)
  3. \( \E(Y \mid X) = 0 \), \( \var(Y \mid X) = \frac{1}{3} (r^2 - X^2) \)
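Since the conditional distribution of \( Y \) given \( X = x \) is uniform on the cross section \( S_x \), the conditional mean and variance only require the endpoints of that interval. The following sketch applies the midpoint and \( \frac{1}{12} l^2 \) formulas to the triangle and to the disc of radius \( r \); the function names and the numerical inputs are hypothetical.

```python
from math import sqrt

def uniform_mean_var(lo, hi):
    """Mean and variance of the uniform distribution on the interval [lo, hi]."""
    midpoint = (lo + hi) / 2
    length = hi - lo
    return midpoint, length ** 2 / 12

def triangle(x, a):
    """Cross section of {-a <= x <= y <= a} at x is the interval [x, a]."""
    return uniform_mean_var(x, a)

def disc(x, r):
    """Cross section of the disc of radius r at x is [-h, h] with h = sqrt(r^2 - x^2)."""
    h = sqrt(r * r - x * x)
    return uniform_mean_var(-h, h)

print(triangle(1.0, 3.0))  # conditional mean (a + x) / 2 and variance (a - x)^2 / 12
print(disc(3.0, 5.0))      # conditional mean 0 and variance (r^2 - x^2) / 3
```

For the disc, the cross section has length \( 2 \sqrt{r^2 - x^2} \), so the conditional variance is \( \frac{1}{12} \cdot 4 (r^2 - x^2) = \frac{1}{3}(r^2 - x^2) \).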

In the bivariate uniform experiment, select each of the following regions. In each case, run the simulation 2000 times and note the relationship between the cloud of points and the graph of the regression function.

  1. square
  2. triangle
  3. circle

Suppose that \(X\) is uniformly distributed on the interval \((0, 1)\), and that given \(X\), random variable \(Y\) is uniformly distributed on \((0, X)\). Find each of the following:

  1. \(\E(Y \mid X)\)
  2. \(\E(Y)\)
  3. \(\var(Y \mid X)\)
  4. \(\var(Y)\)
Answer:
  1. \(\frac{1}{2} X\)
  2. \(\frac{1}{4}\)
  3. \(\frac{1}{12} X^2\)
  4. \(\frac{7}{144}\)
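A Monte Carlo simulation gives a quick sanity check on these answers. The sketch below is only an approximation: the seed, the sample size, and the window around \( x = 0.8 \) used to estimate \( \E(Y \mid X = 0.8) \) are arbitrary choices.

```python
import random

rng = random.Random(12345)  # fixed seed so the run is reproducible
n = 200_000

# Simulate X uniform on (0, 1), then Y uniform on (0, X) given X.
xs = [rng.random() for _ in range(n)]
ys = [rng.uniform(0, x) for x in xs]

mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

# Crude estimate of E(Y | X = 0.8): average Y over samples with X near 0.8.
near = [y for x, y in zip(xs, ys) if 0.75 < x < 0.85]
local_mean = sum(near) / len(near)

print(mean_y, var_y, local_mean)
```

The three estimates should be close to \( \frac{1}{4} \), \( \frac{7}{144} \approx 0.0486 \), and \( \frac{1}{2}(0.8) = 0.4 \), respectively.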

The Hypergeometric Distribution

Suppose that a population consists of \(m\) objects, and that each object is one of three types. There are \(a\) objects of type 1, \(b\) objects of type 2, and \(m - a - b\) objects of type 0. The parameters \(a\) and \(b\) are positive integers with \(a + b \lt m\). We sample \(n\) objects from the population at random, and without replacement, where \( n \in \{0, 1, \ldots, m\} \). Denote the number of type 1 and type 2 objects in the sample by \(X\) and \(Y\), respectively, so that the number of type 0 objects in the sample is \(n - X - Y\). In the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of \( X \) and \( Y \) are all hypergeometric; only the parameters change. Here is the relevant result for this section:

In the setting above,

  1. \( \E(Y \mid X) = \frac{b}{m - a}(n - X) \)
  2. \( \var(Y \mid X) = \frac{b (m - a - b)}{(m - a)^2 (m - a - 1)} (n - X) (m - a - n + X)\)
  3. \( \E\left([Y - \E(Y \mid X)]^2\right) = \frac{n(m - n)b(m - a - b)}{m (m - 1)(m - a)} \)
Proof:

Recall that \( (X, Y) \) has the (multivariate) hypergeometric distribution with parameters \( m \), \( a \), \( b \), and \( n \). Marginally, \( X \) has the hypergeometric distribution with parameters \( m \), \( a \), and \( n \), and \( Y \) has the hypergeometric distribution with parameters \( m \), \( b \), and \( n \). Given \( X = x \in \{0, 1, \ldots, n\} \), the remaining \( n - x \) objects are chosen at random from a population of \( m - a \) objects, of which \( b \) are type 2 and \( m - a - b \) are type 0. Hence, the conditional distribution of \( Y \) given \( X = x \) is hypergeometric with parameters \( m - a \), \( b \), and \( n - x \). Parts (a) and (b) then follow from the standard formulas for the mean and variance of the hypergeometric distribution, as functions of the parameters. Part (c) is the mean square error, and in this case can be computed most easily as \[ \var(Y) - \var[\E(Y \mid X)] = \var(Y) - \left(\frac{b}{m - a}\right)^2 \var(X) = n \frac{b}{m} \frac{m - b}{m} \frac{m - n}{m - 1} - \left(\frac{b}{m - a}\right)^2 n \frac{a}{m} \frac{m - a}{m} \frac{m - n}{m - 1} \] Simplifying gives the result.

Note that \( \E(Y \mid X) \) is a linear function of \( X \) and hence \( \E(Y \mid X) = L(Y \mid X) \).
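The formulas for the conditional mean and variance can be confirmed by brute force from the joint probability density function. The sketch below uses exact rational arithmetic; the parameter values \( m = 10 \), \( a = 3 \), \( b = 4 \), \( n = 5 \) and the function names are hypothetical.

```python
from fractions import Fraction as F
from math import comb

def joint_pmf(x, y, m, a, b, n):
    """P(X = x, Y = y) for the multivariate hypergeometric distribution."""
    z = n - x - y  # number of type 0 objects in the sample
    if min(x, y, z) < 0:
        return F(0)
    return F(comb(a, x) * comb(b, y) * comb(m - a - b, z), comb(m, n))

def cond_mean_var(x, m, a, b, n):
    """Mean and variance of Y given X = x, computed from the joint pmf."""
    ys = range(n - x + 1)
    gx = sum(joint_pmf(x, y, m, a, b, n) for y in ys)
    mean = sum(y * joint_pmf(x, y, m, a, b, n) for y in ys) / gx
    second = sum(y * y * joint_pmf(x, y, m, a, b, n) for y in ys) / gx
    return mean, second - mean ** 2

m, a, b, n = 10, 3, 4, 5  # hypothetical parameter values
for x in range(min(a, n) + 1):
    mean, var = cond_mean_var(x, m, a, b, n)
    mean_formula = F(b, m - a) * (n - x)
    var_formula = F(b * (m - a - b), (m - a) ** 2 * (m - a - 1)) * (n - x) * (m - a - n + x)
    print(x, mean, mean_formula, var, var_formula)
```

For each value of \( x \), the brute-force mean and variance agree with parts (a) and (b).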

In a collection of 120 objects, 50 are classified as good, 40 as fair and 30 as poor. A sample of 20 objects is selected at random and without replacement. Let \( X \) denote the number of good objects in the sample and \( Y \) the number of poor objects in the sample. Find each of the following:

  1. \( \E(Y \mid X) \)
  2. \( \var(Y \mid X) \)
  3. The predicted value of \( Y \) given \( X = 8 \)
Answer:
  1. \( \E(Y \mid X) = \frac{60}{7} - \frac{3}{7} X \)
  2. \( \var(Y \mid X) = \frac{4}{1127}(20 - X)(50 + X) \)
  3. \( \frac{36}{7} \)

The Multinomial Trials Model

Suppose that we have a sequence of \( n \) independent trials, and that each trial results in one of three outcomes, denoted 0, 1, and 2. On each trial, the probability of outcome 1 is \( p \), the probability of outcome 2 is \( q \), so that the probability of outcome 0 is \( 1 - p - q \). The parameters \( p, \, q \in (0, 1) \) with \( p + q \lt 1 \), and of course \( n \in \N_+ \). Let \( X \) denote the number of trials that resulted in outcome 1, \( Y \) the number of trials that resulted in outcome 2, so that \( n - X - Y \) is the number of trials that resulted in outcome 0. In the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of \( X \) and \( Y \) are all multinomial; only the parameters change. Here is the relevant result for this section:

In the setting above,

  1. \( \E(Y \mid X) = \frac{q}{1 - p}(n - X) \)
  2. \( \var(Y \mid X) = \frac{q (1 - p - q)}{(1 - p)^2}(n - X)\)
  3. \( \E\left([Y - \E(Y \mid X)]^2\right) = \frac{q (1 - p - q)}{1 - p} n \)
Proof:

Recall that \( (X, Y) \) has the multinomial distribution with parameters \( n \), \( p \), and \( q \). Marginally, \( X \) has the binomial distribution with parameters \( n \) and \( p \), and \( Y \) has the binomial distribution with parameters \( n \) and \( q \). Given \( X = x \in \{0, 1, \ldots, n\} \), the remaining \( n - x \) trials are independent, but with just two outcomes: outcome 2 occurs with probability \( q / (1 - p) \) and outcome 0 occurs with probability \( 1 - q / (1 - p) \). (These are the conditional probabilities of outcomes 2 and 0, respectively, given that outcome 1 did not occur.) Hence the conditional distribution of \( Y \) given \( X = x \) is binomial with parameters \( n - x \) and \( q / (1 - p) \). Parts (a) and (b) then follow from the standard formulas for the mean and variance of the binomial distribution, as functions of the parameters. Part (c) is the mean square error and in this case can be computed most easily from \[ \E[\var(Y \mid X)] = \frac{q (1 - p - q)}{(1 - p)^2} [n - \E(X)] = \frac{q (1 - p - q)}{(1 - p)^2} (n - n p) = \frac{q (1 - p - q)}{1 - p} n\]

Note again that \( \E(Y \mid X) \) is a linear function of \( X \) and hence \( \E(Y \mid X) = L(Y \mid X) \).
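
The three formulas can be checked exactly for small parameter values. The following Python sketch computes the conditional mean and variance of \( Y \) given \( X = x \) directly from the joint probability density function, in exact rational arithmetic, and compares them with parts (a), (b), and (c). The function names and the values \( n = 6 \), \( p = \frac{1}{4} \), \( q = \frac{1}{3} \) are illustrative choices, not part of the text.

```python
from fractions import Fraction
from math import comb

def trinomial_pdf(n, p, q, x, y):
    """P(X = x, Y = y) for n trials; outcome 1 has probability p, outcome 2 has probability q."""
    r = 1 - p - q  # probability of outcome 0
    return comb(n, x) * comb(n - x, y) * p**x * q**y * r**(n - x - y)

def conditional_moments(n, p, q, x):
    """Return (E(Y | X = x), var(Y | X = x)), computed from the joint PDF."""
    joint = [trinomial_pdf(n, p, q, x, y) for y in range(n - x + 1)]
    g = sum(joint)  # marginal probability P(X = x)
    mean = sum(y * f for y, f in enumerate(joint)) / g
    second = sum(y * y * f for y, f in enumerate(joint)) / g
    return mean, second - mean**2

# Illustrative parameter values (assumed for this check only)
n, p, q = 6, Fraction(1, 4), Fraction(1, 3)

mse = 0
for x in range(n + 1):
    mean, variance = conditional_moments(n, p, q, x)
    assert mean == q / (1 - p) * (n - x)                          # part (a)
    assert variance == q * (1 - p - q) / (1 - p)**2 * (n - x)     # part (b)
    # Weight the conditional variance by the binomial probability P(X = x)
    mse += comb(n, x) * p**x * (1 - p)**(n - x) * variance

assert mse == q * (1 - p - q) / (1 - p) * n                       # part (c)
print(mse)
```

Because the arithmetic uses `Fraction`, the comparisons are exact equalities rather than floating-point approximations.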

Suppose that a fair, 12-sided die is thrown 50 times. Let \( X \) denote the number of throws that resulted in a number from 1 to 5, and \( Y \) the number of throws that resulted in a number from 6 to 9. Find each of the following:

  1. \( \E(Y \mid X) \)
  2. \( \var(Y \mid X) \)
  3. The predicted value of \( Y \) given \( X = 20 \)
Answer:
  1. \( \E(Y \mid X) = \frac{4}{7}(50 - X) \)
  2. \( \var(Y \mid X) = \frac{12}{49}(50 - X) \)
  3. \( \frac{120}{7} \)

The Poisson Distribution

Recall that the Poisson distribution, named for Simeon Poisson, is widely used to model the number of random points in a region of time or space, under certain ideal conditions. The Poisson distribution is studied in more detail in the chapter on the Poisson Process. The Poisson distribution with parameter \( r \in (0, \infty) \) has probability density function \[ f(x) = e^{-r} \frac{r^x}{x!}, \quad x \in \N \] The parameter \( r \) is the mean and variance of the distribution.

Suppose that \( X \) and \( Y \) are independent random variables, and that \( X \) has the Poisson distribution with parameter \( a \in (0, \infty) \) and \( Y \) has the Poisson distribution with parameter \( b \in (0, \infty) \). Let \( N = X + Y \). Then

  1. \( \E(X \mid N) = \frac{a}{a + b}N\)
  2. \( \var(X \mid N) = \frac{a b}{(a + b)^2} N \)
  3. \( \E\left([X - \E(X \mid N)]^2\right) = \frac{a b}{a + b} \)
Proof:

We have shown before that the distribution of \( N \) is also Poisson, with parameter \( a + b \), and that the conditional distribution of \( X \) given \( N = n \in \N \) is binomial with parameters \( n \) and \( a / (a + b) \). Hence parts (a) and (b) follow from the standard formulas for the mean and variance of the binomial distribution, as functions of the parameters. Part (c) is the mean square error, and in this case can be computed most easily as \[ \E[\var(X \mid N)] = \frac{a b}{(a + b)^2} \E(N) = \frac{ab}{(a + b)^2} (a + b) = \frac{a b}{a + b} \]

Once again, \( \E(X \mid N) \) is a linear function of \( N \) and so \( \E(X \mid N) = L(X \mid N) \). If we reverse the roles of the variables, the conditional expected value is trivial from our basic properties: \[ \E(N \mid X) = \E(X + Y \mid X) = X + b \]
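
As a numerical check of the result above, the following Python sketch computes the conditional probability density function of \( X \) given \( N = n \) from the product of the two Poisson density functions, and compares it with the binomial density with parameters \( n \) and \( a / (a + b) \). The helper names and the values \( a = 2 \), \( b = 3 \), \( n = 7 \) are assumptions made only for illustration.

```python
from math import comb, exp, factorial, isclose

def poisson_pdf(r, k):
    """Poisson probability density function with parameter r, evaluated at k."""
    return exp(-r) * r**k / factorial(k)

def conditional_pdf_x_given_n(a, b, n):
    """P(X = x | N = n) for x = 0, ..., n, where X and Y are independent Poisson variables."""
    joint = [poisson_pdf(a, x) * poisson_pdf(b, n - x) for x in range(n + 1)]
    total = sum(joint)  # this sum is P(N = n)
    return [f / total for f in joint]

# Illustrative values (assumed)
a, b, n = 2.0, 3.0, 7
s = a / (a + b)

cond = conditional_pdf_x_given_n(a, b, n)
binom = [comb(n, x) * s**x * (1 - s)**(n - x) for x in range(n + 1)]

mean = sum(x * f for x, f in enumerate(cond))
variance = sum(x * x * f for x, f in enumerate(cond)) - mean**2

print(all(isclose(c, d) for c, d in zip(cond, binom)))   # conditional PDF is binomial
print(isclose(mean, s * n))                              # part (a) at N = n
print(isclose(variance, s * (1 - s) * n))                # part (b) at N = n
```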

Coins and Dice

A pair of fair dice are thrown, and the scores \((X_1, X_2)\) recorded. Let \(Y = X_1 + X_2\) denote the sum of the scores and \(U = \min\left\{X_1, X_2\right\}\) the minimum score. Find each of the following:

  1. \(\E\left(Y \mid X_1\right)\)
  2. \(\E\left(U \mid X_1\right)\)
  3. \(\E\left(Y \mid U\right)\)
  4. \(\E\left(X_2 \mid X_1\right)\)
Answer:
  1. \(\frac{7}{2} + X_1\)
  2. \[ \begin{array}{c|cccccc} x & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \E(U \mid X_1 = x) & 1 & \frac{11}{6} & \frac{5}{2} & 3 & \frac{10}{3} & \frac{7}{2} \end{array} \]
  3. \[ \begin{array}{c|cccccc} u & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \E(Y \mid U = u) & \frac{52}{11} & \frac{56}{9} & \frac{54}{7} & \frac{46}{5} & \frac{32}{3} & 12 \end{array} \]
  4. \(\frac{7}{2}\)
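
Since the sample space has only 36 equally likely outcomes, the conditional expected values in this exercise can be found by brute-force enumeration. The Python sketch below groups the outcomes by the value of the conditioning variable and averages the target variable within each group; the helper `cond_mean` is a hypothetical name introduced here.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes (x1, x2) for a pair of fair dice
outcomes = list(product(range(1, 7), repeat=2))

def cond_mean(target, given):
    """Return a dict mapping each value g of given(w) to E(target | given = g)."""
    groups = {}
    for w in outcomes:
        groups.setdefault(given(w), []).append(target(w))
    # Outcomes are equally likely, so the conditional mean is a plain average
    return {g: Fraction(sum(vals), len(vals)) for g, vals in sorted(groups.items())}

e_y_given_x1 = cond_mean(lambda w: w[0] + w[1], lambda w: w[0])   # E(Y | X1)
e_u_given_x1 = cond_mean(lambda w: min(w), lambda w: w[0])        # E(U | X1)
e_y_given_u = cond_mean(lambda w: w[0] + w[1], lambda w: min(w))  # E(Y | U)
e_x2_given_x1 = cond_mean(lambda w: w[1], lambda w: w[0])         # E(X2 | X1)

for name, table in [("E(Y | X1)", e_y_given_x1), ("E(U | X1)", e_u_given_x1),
                    ("E(Y | U)", e_y_given_u), ("E(X2 | X1)", e_x2_given_x1)]:
    print(name, {g: str(v) for g, v in table.items()})
```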

A box contains 10 coins, labeled 0 to 9. The probability of heads for coin \(i\) is \(\frac{i}{9}\). A coin is chosen at random from the box and tossed. Find the probability of heads.

Answer:

\(\frac{1}{2}\)

This problem is an example of Laplace's rule of succession, named for Pierre Simon Laplace.
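
The answer comes from conditioning on the coin chosen. A minimal Python sketch of that computation, with variable names chosen here for illustration:

```python
from fractions import Fraction

coins = range(10)                                         # coin labels 0, 1, ..., 9
p_coin = Fraction(1, 10)                                  # each coin is equally likely
p_heads_given_coin = {i: Fraction(i, 9) for i in coins}   # P(heads | coin i)

# Condition on the coin: P(heads) = sum over i of P(coin i) P(heads | coin i)
p_heads = sum(p_coin * p_heads_given_coin[i] for i in coins)
print(p_heads)  # 1/2
```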

Random Sums of Random Variables

Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of independent and identically distributed real-valued random variables. We will denote the common mean, variance, and moment generating function, respectively, by \(\mu = \E(X_i)\), \(\sigma^2 = \var(X_i)\), and \(G(t) = \E\left(e^{t\,X_i}\right)\). Let \[ Y_n = \sum_{i=1}^n X_i, \quad n \in \N \] so that \((Y_0, Y_1, \ldots)\) is the partial sum process associated with \(\bs{X}\). Suppose now that \(N\) is a random variable taking values in \(\N\), independent of \(\bs{X}\). Then \[ Y_N = \sum_{i=1}^N X_i \] is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, \(N\) might represent the number of customers who enter a store in a given period of time, and \(X_i\) the amount spent by customer \(i\), so that \( Y_N \) is the total revenue of the store during the period.

The conditional and ordinary expected value of \(Y_N\) are

  1. \(\E\left(Y_N \mid N\right) = N \mu\)
  2. \(\E\left(Y_N\right) = \E(N) \mu\)
Proof:
  1. Using the substitution rule and the independence of \( N \) and \( \bs{X} \) we have \[ \E\left(Y_N \mid N = n\right) = \E\left(Y_n \mid N = n\right) = \E(Y_n) = \sum_{i=1}^n \E(X_i) = n \mu \] so \( \E\left(Y_N \mid N\right) = N \mu \).
  2. From (a) and conditioning, \( \E\left(Y_N\right) = \E\left[\E\left(Y_N \mid N\right)\right] = \E(N \mu) = \E(N) \mu \).

Wald's equation, named for Abraham Wald, is a generalization of the previous result to the case where \(N\) is not necessarily independent of \(\bs{X}\), but rather is a stopping time for \(\bs{X}\). Roughly, this means that the event \( N = n \) depends only on \( (X_1, X_2, \ldots, X_n) \). Wald's equation is discussed in the chapter on Random Samples. An elegant proof of Wald's equation is given in the chapter on Martingales. The advanced section on stopping times is in the chapter on Probability Measures.

The conditional and ordinary variance of \(Y_N\) are

  1. \(\var\left(Y_N \mid N\right) = N \sigma^2\)
  2. \(\var\left(Y_N\right) = \E(N) \sigma^2 + \var(N) \mu^2\)
Proof:
  1. Using the substitution rule, the independence of \( N \) and \( \bs{X} \), and the fact that \( \bs{X} \) is an IID sequence, we have \[ \var\left(Y_N \mid N = n\right) = \var\left(Y_n \mid N = n\right) = \var\left(Y_n\right) = \sum_{i=1}^n \var(X_i) = n \sigma^2 \] so \( \var\left(Y_N \mid N\right) = N \sigma^2 \).
  2. From (a) and the previous result, \[ \var\left(Y_N\right) = \E\left[\var\left(Y_N \mid N\right)\right] + \var\left[\E(Y_N \mid N)\right] = \E(\sigma^2 N) + \var(\mu N) = \E(N) \sigma^2 + \mu^2 \var(N)\]
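
The mean and variance formulas for \( Y_N \) can be verified exactly on a small discrete example. The Python sketch below builds the distribution of \( Y_N \) by repeated convolution and compares its mean and variance with \( \E(N) \mu \) and \( \E(N) \sigma^2 + \var(N) \mu^2 \). The example distributions (\( N \) uniform on \( \{0, 1, 2, 3\} \), and \( X_i \) equal to 1 or 4 with probabilities \( \frac{2}{3} \) and \( \frac{1}{3} \)) and all function names are hypothetical, chosen only to keep the arithmetic small.

```python
from fractions import Fraction

def convolve(d1, d2):
    """Distribution of the sum of two independent variables, each given as {value: probability}."""
    out = {}
    for u, pu in d1.items():
        for v, pv in d2.items():
            out[u + v] = out.get(u + v, 0) + pu * pv
    return out

def random_sum_dist(n_dist, x_dist):
    """Distribution of Y_N = X_1 + ... + X_N, where N is independent of the IID terms X_i."""
    result = {}
    partial = {0: Fraction(1)}  # distribution of Y_0 = 0
    for n in range(max(n_dist) + 1):
        if n > 0:
            partial = convolve(partial, x_dist)  # distribution of Y_n
        for y, py in partial.items():
            # Condition on N = n and add P(N = n) P(Y_n = y)
            result[y] = result.get(y, 0) + n_dist.get(n, 0) * py
    return result

def mean(d):
    return sum(v * p for v, p in d.items())

def variance(d):
    return sum(v * v * p for v, p in d.items()) - mean(d) ** 2

# Hypothetical example distributions
n_dist = {n: Fraction(1, 4) for n in range(4)}
x_dist = {1: Fraction(2, 3), 4: Fraction(1, 3)}
y_dist = random_sum_dist(n_dist, x_dist)

mu, sigma2 = mean(x_dist), variance(x_dist)
print(mean(y_dist), mean(n_dist) * mu)
print(variance(y_dist), mean(n_dist) * sigma2 + variance(n_dist) * mu ** 2)
```

Each printed line shows the directly computed value next to the value given by the formula; the two agree in each case.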

Let \(H\) denote the probability generating function of \(N\). The conditional and ordinary moment generating function of \(Y_N\) are

  1. \(\E\left(e^{t Y_N} \mid N\right) = \left[G(t)\right]^N\)
  2. \(\E\left(e^{t Y_N}\right) = H\left(G(t)\right)\)
Proof:
  1. Using the substitution rule, the independence of \( N \) and \( \bs{X} \), and the fact that \( \bs{X} \) is an IID sequence, we have \[ \E\left(e^{t Y_N} \mid N = n\right) = \E\left(e^{t Y_n} \mid N = n\right) = \E\left(e^{t Y_n}\right) = \left[G(t)\right]^n \] (Recall that the MGF of the sum of independent variables is the product of the individual MGFs.) Hence \( \E\left(e^{t Y_N} \mid N\right) = \left[G(t)\right]^N \).
  2. From (a) and conditioning, \( \E\left(e^{t Y_N}\right) = \E\left[\E\left(e^{t Y_N} \mid N\right)\right] = \E\left(G(t)^N\right) = H(G(t)) \).

Thus the moment generating function of \( Y_N \) is \( H \circ G \), the composition of the probability generating function of \( N \) with the common moment generating function of \( \bs{X} \), a simple and elegant result.
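
As an illustration of the composition (an example chosen here, not one of the exercises), suppose that \( N \) has the Poisson distribution with parameter \( a \in (0, \infty) \) and that each \( X_i \) is an indicator variable with \( \P(X_i = 1) = p \in (0, 1) \). Then \( H(s) = e^{a (s - 1)} \) and \( G(t) = 1 - p + p e^t \), so \[ \E\left(e^{t Y_N}\right) = H(G(t)) = \exp\left[a \left(1 - p + p e^t - 1\right)\right] = \exp\left[a p \left(e^t - 1\right)\right] \] This is the moment generating function of the Poisson distribution with parameter \( a p \), so in this case \( Y_N \) again has a Poisson distribution.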

In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let \(N\) denote the die score and \(Y\) the number of heads. Find each of the following:

  1. The conditional distribution of \(Y\) given \(N\).
  2. \(\E\left(Y \mid N\right)\)
  3. \(\var\left(Y \mid N\right)\)
  4. \(\E(Y)\)
  5. \(\var(Y)\)
Answer:
  1. Binomial with parameters \(N\) and \(p = \frac{1}{2}\)
  2. \(\frac{1}{2} N\)
  3. \(\frac{1}{4} N\)
  4. \(\frac{7}{4}\)
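  5. \(\frac{77}{48}\), since \(\var(Y) = \E(N) \cdot \frac{1}{4} + \var(N) \cdot \frac{1}{4} = \frac{7}{8} + \frac{35}{48}\)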

Run the die-coin experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
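
For readers without access to the simulation app, the experiment can be approximated in a few lines of Python. The sketch below is only a stand-in for the app: it computes the distribution mean and standard deviation of \( Y \) exactly from the probability density function, then simulates 1000 runs with an arbitrarily chosen seed. The function names are hypothetical.

```python
import random
from fractions import Fraction
from math import comb, sqrt
from statistics import mean, stdev

def exact_pdf():
    """PDF of Y: average the binomial(n, 1/2) PDFs over the die scores n = 1, ..., 6."""
    pdf = {}
    for n in range(1, 7):
        for y in range(n + 1):
            pdf[y] = pdf.get(y, 0) + Fraction(1, 6) * Fraction(comb(n, y), 2 ** n)
    return pdf

def simulate(runs, rng):
    """Return the number of heads in each of `runs` independent die-coin experiments."""
    results = []
    for _ in range(runs):
        n = rng.randint(1, 6)                      # die score
        heads = sum(rng.random() < 0.5 for _ in range(n))
        results.append(heads)
    return results

pdf = exact_pdf()
dist_mean = sum(y * p for y, p in pdf.items())
dist_var = sum(y * y * p for y, p in pdf.items()) - dist_mean ** 2

rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
sample = simulate(1000, rng)

print("distribution mean, sd:", float(dist_mean), sqrt(float(dist_var)))
print("empirical mean, sd:   ", mean(sample), stdev(sample))
```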

The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the mean and standard deviation of the amount of money spent during the hour.

Answer:
  1. \($1000\)
  2. \($151.66\), since the variance is \(20 \cdot 5^2 + 3^2 \cdot 50^2 = 23000\)

A coin has a random probability of heads \(V\) and is tossed a random number of times \(N\). Suppose that \(V\) is uniformly distributed on \([0, 1]\); \(N\) has the Poisson distribution with parameter \(a \gt 0\); and \(V\) and \(N\) are independent. Let \(Y\) denote the number of heads. Compute the following:

  1. \(\E(Y \mid N, V)\)
  2. \(\E(Y \mid N)\)
  3. \(\E(Y \mid V)\)
  4. \(\E(Y)\)
  5. \(\var(Y \mid N, V)\)
  6. \(\var(Y)\)
Answer:
  1. \(N V\)
  2. \(\frac{1}{2} N\)
  3. \(a V\)
  4. \(\frac{1}{2} a\)
  5. \(N V (1 - V)\)
  6. \(\frac{1}{12} a^2 + \frac{1}{2} a\)
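
One way to obtain the last answer is to condition on \( (N, V) \) and use parts (a) and (e). The following is a sketch of the computation, using \( \E(N) = \var(N) = a \), \( \E(V) = \frac{1}{2} \), \( \E(V^2) = \frac{1}{3} \), and the independence of \( N \) and \( V \): \begin{align} \E[\var(Y \mid N, V)] & = \E(N) \E[V (1 - V)] = a \left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{6} a \\ \var[\E(Y \mid N, V)] & = \E(N^2) \E(V^2) - [\E(N) \E(V)]^2 = \frac{1}{3}(a + a^2) - \frac{1}{4} a^2 = \frac{1}{3} a + \frac{1}{12} a^2 \end{align} Adding the two terms gives \( \var(Y) = \frac{1}{12} a^2 + \frac{1}{2} a \).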

Mixtures of Distributions

Suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of real-valued random variables. Denote the mean, variance, and moment generating function of \( X_i \) by \(\mu_i = \E(X_i)\), \(\sigma_i^2 = \var(X_i)\), and \(M_i(t) = \E\left(e^{t\,X_i}\right)\), for \(i \in \N_+\). Suppose also that \(N\) is a random variable taking values in \(\N_+\), independent of \(\bs{X}\). Denote the probability density function of \(N\) by \(p_n = \P(N = n)\) for \(n \in \N_+\). The distribution of the random variable \(X_N\) is a mixture of the distributions of \(\bs{X} = (X_1, X_2, \ldots)\), with the distribution of \(N\) as the mixing distribution.

The conditional and ordinary expected value of \( X_N \) are

  1. \(\E(X_N \mid N) = \mu_N\)
  2. \(\E(X_N) = \sum_{n=1}^\infty p_n\,\mu_n\)
Proof:
  1. Using the substitution rule and the independence of \( N \) and \( \bs{X} \), we have \( \E(X_N \mid N = n) = \E(X_n \mid N = n) = \E(X_n) = \mu_n \)
  2. From (a) and the conditioning rule, \[ \E\left(X_N\right) = \E\left[\E\left(X_N \mid N\right)\right] = \E\left(\mu_N\right) = \sum_{n=1}^\infty p_n \mu_n\]

The conditional and ordinary variance of \( X_N \) are

  1. \(\var\left(X_N \mid N\right) = \sigma_N^2\)
  2. \(\var(X_N) = \sum_{n=1}^\infty p_n (\sigma_n^2 + \mu_n^2) - \left(\sum_{n=1}^\infty p_n\,\mu_n\right)^2\).
Proof:
  1. Using the substitution rule and the independence of \( N \) and \(\bs{X}\), we have \( \var\left(X_N \mid N = n\right) = \var\left(X_n \mid N = n\right) = \var\left(X_n\right) = \sigma_n^2 \)
  2. From (a) we have \begin{align} \var\left(X_N\right) & = \E\left[\var\left(X_N \mid N\right)\right] + \var\left[\E\left(X_N \mid N\right)\right] = \E\left(\sigma_N^2\right) + \var\left(\mu_N\right) = \E\left(\sigma_N^2\right) + \E\left(\mu_N^2\right) - \left[\E\left(\mu_N\right)\right]^2 \\ & = \sum_{n=1}^\infty p_n \sigma_n^2 + \sum_{n=1}^\infty p_n \mu_n^2 - \left(\sum_{n=1}^\infty p_n \mu_n\right)^2 \end{align}

The conditional and ordinary moment generating function of \( X_N \) are

  1. \( \E\left(e^{t X_N} \mid N\right) = M_N(t) \)
  2. \(\E\left(e^{tX_N}\right) = \sum_{i=1}^\infty p_i M_i(t)\).
Proof:
  1. Using the substitution rule and the independence of \( N \) and \( \bs{X} \), we have \( \E\left(e^{t X_N} \mid N = n\right) = \E\left(e^{t X_n} \mid N = n\right) = \E\left(e^{t X_n}\right) = M_n(t) \)
  2. From (a) and the conditioning rule, \( \E\left(e^{t X_N}\right) = \E\left[\E\left(e^{t X_N} \mid N\right)\right] = \E\left[M_N(t)\right] = \sum_{n=1}^\infty p_n M_n(t)\)

In the coin-die experiment, a biased coin is tossed with probability of heads \(\frac{1}{3}\). If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability \(\frac{1}{4}\) each, and faces 2, 3, 4, 5 have probability \(\frac{1}{8}\) each). Find the mean and standard deviation of the die score.

Answer:
  1. \(\frac{7}{2}\)
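  2. \(1.7873\), since the variance is \(\frac{2}{3} \cdot \frac{35}{12} + \frac{1}{3} \cdot \frac{15}{4} = \frac{115}{36}\)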

Run the coin-die experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation.
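
As a stand-in for the simulation app, the following Python sketch treats the coin-die experiment as a mixture of two die distributions. It evaluates the mixture formulas for the mean and variance in exact arithmetic, cross-checks them against the mixed probability density function, and then simulates 1000 runs with an arbitrary seed. All names in the code are hypothetical.

```python
import random
from fractions import Fraction
from math import sqrt
from statistics import mean, stdev

faces = list(range(1, 7))
fair = {k: Fraction(1, 6) for k in faces}
flat = {k: Fraction(1, 4) if k in (1, 6) else Fraction(1, 8) for k in faces}  # ace-six flat
# Mixing distribution: tails (probability 2/3) -> fair die, heads (probability 1/3) -> flat die
mixing = [(Fraction(2, 3), fair), (Fraction(1, 3), flat)]

def moments(pdf):
    """Return (mean, variance) of a distribution given as {value: probability}."""
    m = sum(k * p for k, p in pdf.items())
    return m, sum(k * k * p for k, p in pdf.items()) - m ** 2

# Mixture formulas: E(X_N) = sum p_n mu_n and
# var(X_N) = sum p_n (sigma_n^2 + mu_n^2) - (sum p_n mu_n)^2
mix_mean = sum(w * moments(d)[0] for w, d in mixing)
mix_var = sum(w * (moments(d)[1] + moments(d)[0] ** 2) for w, d in mixing) - mix_mean ** 2

# Cross-check against the PDF of the die score computed directly
mixed_pdf = {k: sum(w * d[k] for w, d in mixing) for k in faces}
assert moments(mixed_pdf) == (mix_mean, mix_var)

def simulate(runs, rng):
    """Return the die score from each of `runs` independent coin-die experiments."""
    scores = []
    for _ in range(runs):
        die = flat if rng.random() < 1 / 3 else fair   # heads with probability 1/3
        scores.append(rng.choices(faces, weights=[float(die[k]) for k in faces])[0])
    return scores

rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
sample = simulate(1000, rng)
print("distribution mean, sd:", float(mix_mean), sqrt(float(mix_var)))
print("empirical mean, sd:   ", mean(sample), stdev(sample))
```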