\(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\Z}{\mathbb{Z}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\P}{\mathbb{P}}\) \(\newcommand{\var}{\text{var}}\) \(\newcommand{\sd}{\text{sd}}\) \(\newcommand{\cov}{\text{cov}}\) \(\newcommand{\cor}{\text{cor}}\) \(\newcommand{\bias}{\text{bias}}\) \(\newcommand{\MSE}{\text{MSE}}\) \(\newcommand{\bs}{\boldsymbol}\)

6. Sufficient, Complete, and Ancillary Statistics

Basic Theory

The Basic Statistical Model

Consider again the basic statistical model, in which we have a random experiment with an observable random variable \(\bs{X}\) taking values in a set \(S\). Once again, the experiment is typically to sample \(n\) objects from a population and record one or more measurements for each item. In this case, the outcome variable has the form \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th item. In general, we suppose that the distribution of \(\bs{X}\) depends on a parameter \(\theta\) taking values in a parameter space \(\Theta\). The parameter \(\theta\) may also be vector-valued. We will sometimes use subscripts in probability density functions, expected values, etc. to denote the dependence on \(\theta\).

As usual, the most important special case is when \(\bs{X}\) is a sequence of independent, identically distributed random variables. In this case \(\bs{X}\) is a random sample from the common distribution.

Sufficient Statistics

Let \(U = u(\bs{X})\) be a statistic taking values in a set \(T\). Intuitively, \(U\) is sufficient for \(\theta\) if \(U\) contains all of the information about \(\theta\) that is available in the entire data variable \(\bs{X}\). Here is the formal definition:

A statistic \(U\) is sufficient for \(\theta\) if the conditional distribution of \(\bs{X}\) given \(U\) does not depend on \(\theta\).

Sufficiency is related to the concept of data reduction. Suppose that \(\bs{X}\) takes values in \(\R^n\). If we can find a sufficient statistic \(\bs{U}\) that takes values in \(\R^j\), then we can reduce the original data vector \(\bs{X}\) (whose dimension \(n\) is usually large) to the vector of statistics \(\bs{U}\) (whose dimension \(j\) is usually much smaller) with no loss of information about the parameter \(\theta\).

The following result gives a condition for sufficiency that is equivalent to this definition.

Let \(U = u(\bs{X})\) be a statistic taking values in \(T\), and let \(f_\theta\) and \(h_\theta\) denote the probability density functions of \(\bs{X}\) and \(U\) respectively. Then \(U\) is sufficient for \(\theta\) if and only if the function on \( S \) given below does not depend on \( \theta \in \Theta \): \[ \bs{x} \mapsto \frac{f_\theta(\bs{x})}{h_\theta[u(\bs{x})]} \]

Proof:

The joint distribution of \((\bs{X}, U)\) is concentrated on the set \(\{(\bs{x}, y): \bs{x} \in S, y = u(\bs{x})\} \subseteq S \times T\). The conditional PDF of \(\bs{X}\) given \(U = u(\bs{x})\) is \(f_\theta(\bs{x}) \big/ h_\theta[u(\bs{x})]\) on this set, and is 0 otherwise.

The definition precisely captures the intuitive notion of sufficiency given above, but can be difficult to apply. We must know in advance a candidate statistic \(U\), and then we must be able to compute the conditional distribution of \(\bs{X}\) given \(U\). The factorization theorem given in the next exercise frequently allows the identification of a sufficient statistic from the form of the probability density function of \(\bs{X}\).

Let \(f_\theta\) denote the probability density function of \(\bs{X}\) and suppose that \(U = u(\bs{X})\) is a statistic taking values in \(T\). Then \(U\) is sufficient for \(\theta\) if and only if there exists \(G: T \times \Theta \to [0, \infty)\) and \(r: S \to [0, \infty)\) such that \[ f_\theta(\bs{x}) = G[u(\bs{x}), \theta] r(\bs{x}); \quad \bs{x} \in S, \; \theta \in \Theta \]

Proof:

Let \( h_\theta \) denote the PDF of \( U \) for \( \theta \in \Theta \). If \( U \) is sufficient for \( \theta \), then from the previous theorem, the function \( r(\bs{x}) = f_\theta(\bs{x}) \big/ h_\theta[u(\bs{x})] \) for \( \bs{x} \in S \) does not depend on \( \theta \in \Theta \). Hence \( f_\theta(\bs{x}) = h_\theta[u(\bs{x})] r(\bs{x}) \) for \( (\bs{x}, \theta) \in S \times \Theta \), and so \( (\bs{x}, \theta) \mapsto f_\theta(\bs{x}) \) has the form given in the theorem. Conversely, suppose that \( (\bs{x}, \theta) \mapsto f_\theta(\bs{x}) \) has the form given in the theorem. Then there exists a function \( C: T \to [0, \infty) \), not depending on \( \theta \), such that \( h_\theta(y) = C(y) G(y, \theta) \) for \( \theta \in \Theta \) and \( y \in T \); in the discrete case, for example, \( C(y) = \sum_{\bs{x} \in S: \, u(\bs{x}) = y} r(\bs{x}) \). Hence \( f_\theta(\bs{x}) \big/ h_\theta[u(\bs{x})] = r(\bs{x}) \big/ C[u(\bs{x})] \) for \( \bs{x} \in S \), which does not depend on \( \theta \in \Theta \).

Note that \(r\) depends only on the data \(\bs{x}\) but not on the parameter \(\theta\). Less technically, \(u(\bs{X})\) is sufficient for \(\theta\) if the probability density function \(f_\theta(\bs{x})\) depends on the data vector \(\bs{x}\) and the parameter \(\theta\) only through \(u(\bs{x})\).

If \(U\) and \(V\) are equivalent statistics and \(U\) is sufficient for \(\theta\) then \(V\) is sufficient for \(\theta\).

Minimal Sufficient Statistics

The entire data variable \(\bs{X}\) is trivially sufficient for \(\theta\). However, as noted above, there usually exists a statistic \(U\) that is sufficient for \(\theta\) and has smaller dimension, so that we can achieve real data reduction. Naturally, we would like to find the statistic \(U\) that has the smallest dimension possible. In many cases, this smallest dimension \(j\) will be the same as the dimension \(k\) of the parameter vector \(\theta\). However, as we will see, this is not necessarily the case; \(j\) can be smaller or larger than \(k\). An example is given below.

Suppose that a statistic \(U\) is sufficient for \(\theta\). Then \(U\) is minimally sufficient if \(U\) is a function of any other statistic \(V\) that is sufficient for \(\theta\).

Once again, the definition precisely captures the notion of minimal sufficiency, but is hard to apply. The following exercise gives an equivalent condition.

Let \(f_\theta\) denote the probability density function of \(\bs{X}\) corresponding to the parameter value \(\theta\) and suppose that \(U = u(\bs{X})\) is a statistic taking values in \(T\). Then \(U\) is minimally sufficient for \(\theta\) if the following condition holds: for \(\bs{x} \in S\) and \(\bs{y} \in S\) \[ \frac{f_\theta(\bs{x})}{f_\theta(\bs{y})} \text{ is independent of } \theta \text{ if and only if } u(\bs{x}) = u(\bs{y}) \]

Proof:

Suppose that the condition in the theorem is satisfied. To see that \(U\) is sufficient, for each \( t \in T \) fix a point \( \bs{y}_t \in S \) with \( u(\bs{y}_t) = t \), and define \( r(\bs{x}) = f_\theta(\bs{x}) \big/ f_\theta\left(\bs{y}_{u(\bs{x})}\right) \), which does not depend on \( \theta \) by the condition. Then \( f_\theta(\bs{x}) = f_\theta\left(\bs{y}_{u(\bs{x})}\right) r(\bs{x}) \), which has the form in the factorization theorem with \( G(t, \theta) = f_\theta(\bs{y}_t) \), so \(U\) is sufficient for \(\theta\). Next, suppose that \(V = v(\bs{X})\) is another sufficient statistic for \( \theta \), taking values in a set \( R \). From the factorization theorem, there exists \( G: R \times \Theta \to [0, \infty) \) and \( r: S \to [0, \infty) \) such that \( f_\theta(\bs{x}) = G[v(\bs{x}), \theta] r(\bs{x}) \) for \( (\bs{x}, \theta) \in S \times \Theta \). Hence if \( \bs{x}, \bs{y} \in S \) and \( v(\bs{x}) = v(\bs{y}) \) then \[\frac{f_\theta(\bs{x})}{f_\theta(\bs{y})} = \frac{G[v(\bs{x}), \theta] r(\bs{x})}{G[v(\bs{y}), \theta] r(\bs{y})} = \frac{r(\bs{x})}{r(\bs{y})}\] does not depend on \( \theta \in \Theta \). Hence from the condition in the theorem, \( u(\bs{x}) = u(\bs{y}) \), and it follows that \( U \) is a function of \( V \).

If \(U\) and \(V\) are equivalent statistics and \(U\) is minimally sufficient for \(\theta\) then \(V\) is minimally sufficient for \(\theta\).

Properties of Sufficient Statistics

Sufficiency is related to several of the methods of constructing estimators that we have studied.

Suppose that \(U\) is sufficient for \(\theta\) and that there exists a maximum likelihood estimator of \(\theta\). Then there exists a maximum likelihood estimator \(V\) that is a function of \(U\).

Proof:

From the factorization theorem, the log likelihood function for \( \bs{x} \in S \) is \[\theta \mapsto \ln G[u(\bs{x}), \theta] + \ln r(\bs{x})\] Hence a value of \(\theta\) that maximizes this function, if it exists, must be a function of \(u(\bs{x})\).

In particular, suppose that \(V\) is the unique maximum likelihood estimator of \(\theta\) and that \(V\) is sufficient for \(\theta\). If \(U\) is sufficient for \(\theta\) then \(V\) is a function of \(U\) by the previous exercise. Hence it follows that \(V\) is minimally sufficient for \(\theta\).

Suppose that the statistic \(U\) is sufficient for the parameter \(\theta\) and that \(V\) is a Bayes' estimator of \(\theta\). Then \(V\) is a function of \(U\).

Proof:

Again this follows from the factorization theorem: the posterior distribution of \( \theta \) given \( \bs{X} = \bs{x} \) is proportional to the prior density times \( f_\theta(\bs{x}) = G[u(\bs{x}), \theta] r(\bs{x}) \), and the factor \( r(\bs{x}) \) cancels when the posterior is normalized. Hence the posterior distribution, and therefore the Bayes' estimator, depends on \( \bs{x} \) only through \( u(\bs{x}) \).

The next result is the Rao-Blackwell theorem, named for C. R. Rao and David Blackwell. The theorem shows how a sufficient statistic can be used to improve an unbiased estimator.

Suppose that \(U\) is sufficient for \(\theta\) and that \(V\) is an unbiased estimator of a real parameter \(\lambda = \lambda(\theta)\). Then \(\E_\theta(V \mid U)\) is also an unbiased estimator of \( \lambda \) and is uniformly better than \(V\).

Proof:

This follows from basic properties of conditional expected value and conditional variance. First, since \(V\) is a function of \(\bs{X}\) and \(U\) is sufficient for \(\theta\), \(\E_\theta(V \mid U)\) is a valid statistic; that is, it does not depend on \(\theta\), in spite of the formal dependence on \(\theta\) in the expected value. Next, \(\E_\theta(V \mid U)\) is a function of \(U\) and \(\E_\theta[\E_\theta(V \mid U)] = \E_\theta(V) = \lambda\) for \(\theta \in \Theta\). Thus \(\E_\theta(V \mid U)\) is an unbiased estimator of \(\lambda\). Finally \(\var_\theta[\E_\theta(V \mid U)] = \var_\theta(V) - \E_\theta[\var_\theta(V \mid U)] \le \var_\theta(V)\) for any \(\theta \in \Theta\).
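As a concrete illustration of the Rao-Blackwell construction, here is a minimal Python simulation sketch (not part of the text, using the Bernoulli model treated below, with arbitrarily chosen values of \( n \) and \( p \)): starting from the crude unbiased estimator \( V = X_1 \) of \( p \), conditioning on the sufficient statistic \( Y = \sum_{i=1}^n X_i \) gives \( \E(V \mid Y) = Y / n \), the sample mean, which has much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10, 0.3, 100_000

# Crude unbiased estimator of p: V = X_1 (the first observation).
# Rao-Blackwellized version: E(V | Y) = Y / n, the sample mean.
samples = rng.binomial(1, p, size=(reps, n))
V = samples[:, 0]          # unbiased, variance p(1 - p)
Y = samples.sum(axis=1)
V_rb = Y / n               # unbiased, variance p(1 - p) / n

print(V.mean(), V_rb.mean())   # both approximately p
print(V.var(), V_rb.var())     # the conditioned estimator has much smaller variance
```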

Complete Statistics

Suppose that \(U = u(\bs{X})\) is a statistic taking values in a set \(T\). Then \(U\) is a complete statistic for \(\theta\) if for any function \(r: T \to \R\) \[ \E_\theta\left[r(U)\right] = 0 \text{ for all } \theta \in \Theta \implies \P_\theta\left[r(U) = 0\right] = 1 \text{ for all } \theta \in \Theta \]

To understand this rather strange looking condition, suppose that \(r(U)\) is a statistic constructed from \(U\) that is being used as an estimator of 0 (thought of as a function of \(\theta\)). The completeness condition means that the only such unbiased estimator is the statistic that is 0 with probability 1.

If \(U\) and \(V\) are equivalent statistics and \(U\) is complete for \(\theta\) then \(V\) is complete for \(\theta\).

The next result shows the importance of statistics that are both complete and sufficient; it is known as the Lehmann-Scheffé theorem, named for Erich Lehmann and Henry Scheffé.

Suppose that \(U\) is sufficient and complete for \(\theta\) and that \(V = r(U)\) is an unbiased estimator of a real parameter \(\lambda = \lambda(\theta)\). Then \(V\) is a uniformly minimum variance unbiased estimator (UMVUE) of \(\lambda\).

Proof:

Suppose that \(W\) is an unbiased estimator of \(\lambda\). By the Rao-Blackwell theorem, \(\E(W \mid U)\) is also an unbiased estimator of \(\lambda\) and is uniformly better than \(W\). Since \(\E(W \mid U)\) is a function of \(U\), it follows from completeness that \(V = \E(W \mid U)\) with probability 1.

Ancillary Statistics

Suppose that \(V = v(\bs{X})\) is a statistic taking values in a set \(T\). If the distribution of \(V\) does not depend on \(\theta\), then \(V\) is called an ancillary statistic for \(\theta\).

Thus, the notion of an ancillary statistic is complementary to the notion of a sufficient statistic. A sufficient statistic contains all available information about the parameter; an ancillary statistic contains no information about the parameter. The following result, known as Basu's Theorem and named for Debabrata Basu, makes this point more precisely.

Suppose that \(U\) is complete and sufficient for a parameter \(\theta\) and that \(V\) is an ancillary statistic for \( \theta \). Then \(U\) and \(V\) are independent.

Proof:

Let \(g\) denote the probability density function of \(V\) and let \(v \mapsto g(v \mid U)\) denote the conditional probability density function of \(V\) given \(U\). From properties of conditional expected value, \(\E[g(v \mid U)] = g(v)\) for \(v \in T\). But then from completeness, \(g(v \mid U) = g(v)\) with probability 1.

If \(U\) and \(V\) are equivalent statistics and \(U\) is ancillary for \(\theta\) then \(V\) is ancillary for \(\theta\).

Applications and Special Distributions

In this subsection, we will explore sufficient, complete, and ancillary statistics for a number of special distributions. As always, be sure to try the problems yourself before looking at the solutions.

The Bernoulli Distribution

Recall that the Bernoulli distribution with parameter \(p \in (0, 1)\) has probability density function \[ g(x) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\} \] Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Bernoulli distribution with parameter \(p\). Equivalently, \(\bs{X}\) is a sequence of Bernoulli trials, so that in the usual language of reliability, \(X_i = 1\) if trial \(i\) is a success, and \(X_i = 0\) if trial \(i\) is a failure. The Bernoulli distribution is named for Jacob Bernoulli and is studied in more detail in the chapter on Bernoulli Trials.

Let \(Y = \sum_{i=1}^n X_i\) denote the number of successes. Recall that \(Y\) has the binomial distribution with parameters \(n\) and \(p\), and has probability density function \[ h(y) = \binom{n}{y} p^y (1 - p)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \]

\(Y\) is sufficient for \(p\). Specifically, for \( y \in \{0, 1, \ldots, n\} \), the conditional distribution of \(\bs{X}\) given \(Y = y\) is uniform on the set of points \[ D_y = \left\{(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n: x_1 + x_2 + \cdots + x_n = y\right\} \]

Proof:

The joint PDF of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = p^y (1 - p)^{n-y}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \] where \( y = \sum_{i=1}^n x_i \). Now let \( y \in \{0, 1, \ldots, n\} \). Given \( Y = y \), \( \bs{X} \) is concentrated on \( D_y \) and \[ \P(\bs{X} = \bs{x} \mid Y = y) = \frac{\P(\bs{X} = \bs{x})}{\P(Y = y)} = \frac{p^y (1 - p)^{n-y}}{\binom{n}{y} p^y (1 - p)^{n-y}} = \frac{1}{\binom{n}{y}}, \quad \bs{x} \in D_y \] Of course, \( \binom{n}{y} \) is the cardinality of \(D_y\).

This result is intuitively appealing: in a sequence of Bernoulli trials, all of the information about the probability of success \(p\) is contained in the number of successes \(Y\). The particular order of the successes and failures provides no additional information. Of course, the sufficiency of \(Y\) follows more easily from the factorization theorem, but the conditional distribution provides additional insight.
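The conditional uniformity can also be checked by brute-force enumeration for a small sample. The following Python sketch (illustrative only, with arbitrary choices of \( n \), \( p \), and \( y \)) computes \( \P(\bs{X} = \bs{x} \mid Y = y) \) for every \( \bs{x} \in D_y \).

```python
from itertools import product
from math import comb

n, p, y = 5, 0.3, 2
# All binary sequences of length n with exactly y successes
D_y = [x for x in product((0, 1), repeat=n) if sum(x) == y]
P_Y = comb(n, y) * p**y * (1 - p)**(n - y)       # P(Y = y)
cond = [p**sum(x) * (1 - p)**(n - sum(x)) / P_Y for x in D_y]
print(cond)   # every entry equals 1 / comb(n, y), independent of p
```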

\(Y\) is minimally sufficient for \(p\).

Proof:

The sample space of \( \bs{X} \) is \( \{0, 1\}^n \). Let \( u: \{0, 1\}^n \to \{0, 1, \ldots, n\} \) be given by \( u(\bs{x}) = \sum_{i=1}^n x_i \). Once again, the joint PDF of \( \bs{X} \) is given by \[ f(\bs{x}) = p^{u(\bs{x})} (1 - p)^{n - u(\bs{x})}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \] Hence \[ \frac{f(\bs{x})}{f(\bs{y})} = \left(\frac{p}{1 - p}\right)^{u(\bs{x}) - u(\bs{y})}, \quad \bs{x}, \, \bs{y} \in \{0, 1\}^n \] Clearly this is independent of \( p \in (0, 1) \) if and only if \( u(\bs{x}) = u(\bs{y}) \).

\(Y\) is complete for \(p\) on the parameter space \( (0, 1) \).

Proof:

If \(r: \{0, 1, \ldots, n\} \to \R\), then \[\E[r(Y)] = \sum_{y=0}^n r(y) \binom{n}{y} p^y (1 - p)^{n-y} = (1 - p)^n \sum_{y=0}^n r(y) \binom{n}{y} \left(\frac{p}{1 - p}\right)^y\] The last sum is a polynomial in the variable \(t = \frac{p}{1 - p} \in (0, \infty)\). If this polynomial is 0 for all \(t\), then all of the coefficients must be 0. Hence we must have \( r(y) = 0 \) for \( y \in \{0, 1, \ldots, n\} \).

The proof of the last result actually shows that if the parameter space \( P \subseteq (0, 1) \) is an interval of positive length, then \( Y \) is complete for \( p \in P \).

The sample mean \(M = Y / n\) (the sample proportion of successes) is clearly equivalent to \( Y \), and hence is also minimally sufficient for \( p \) and is complete for \(p \in (0, 1)\). Recall that the sample mean \( M \) is the method of moments estimator of \( p \), and is the maximum likelihood estimator of \( p \) on the parameter space \( (0, 1) \).

The standard sample variance \( S^2 \) is an UMVUE of the distribution variance \( p (1 - p) \) for \( p \in (0, 1) \), and can be written as \[ S^2 = \frac{Y}{n - 1} \left(1 - \frac{Y}{n}\right) \]

Proof:

Recall that the sample variance can be written as \[S^2 = \frac{1}{n - 1} \sum_{i=1}^n X_i^2 - \frac{n}{n - 1} M^2\] But \(X_i^2 = X_i\) since \(X_i\) is an indicator variable, and \(M = Y / n\). Substituting gives the representation above. In general, \(S^2\) is an unbiased estimator of the distribution variance \(\sigma^2\). But in this case, \(S^2\) is a function of the complete, sufficient statistic \(Y\), and hence by the Lehmann-Scheffé theorem, \(S^2\) is an UMVUE of \(\sigma^2 = p (1 - p)\).
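The algebraic identity for \( S^2 \) above is easy to verify numerically. Here is a minimal Python sketch (assuming NumPy is available, with arbitrary sample values).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.binomial(1, 0.4, size=n)     # a small Bernoulli sample
y = x.sum()
print(x.var(ddof=1))                 # the standard sample variance S^2
print(y / (n - 1) * (1 - y / n))     # the representation in terms of Y; the two agree
```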

Recall that the notion of completeness depends very much on the parameter space. So far in this subsection, we have assumed that the parameter space is the full interval \((0, 1)\). The following exercise considers the case where \(p\) has a finite set of values.

Suppose that the parameter space \( P \subset (0, 1) \) is a finite set with \( k \in \N_+ \) elements. If the sample size \(n \) is at least \( k \), then \(Y\) is not complete for \(p\).

Proof:

Suppose that \( r: \{0, 1, \ldots, n\} \to \R \) and that \( \E[r(Y)] = 0 \) for \( p \in P \). Then we have \[ \sum_{y=0}^n \binom{n}{y} p^y (1 - p)^{n-y} r(y) = 0, \quad p \in P \] This is a set of \( k \) linear, homogeneous equations in the \( n + 1 \) variables \( (r(0), r(1), \ldots, r(n)) \). Since \( n \ge k \), there are more variables than equations, so there are infinitely many non-trivial solutions. Any such solution gives a function \( r \) with \( \E[r(Y)] = 0 \) for all \( p \in P \) but \( \P[r(Y) = 0] < 1 \), so \( Y \) is not complete.

The Poisson Distribution

Recall that the Poisson distribution with parameter \(\theta \in (0, \infty)\) has probability density function \[ g(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \N \] The Poisson distribution is named for Simeon Poisson and is used to model the number of random points in a region of time or space, under certain ideal conditions. The parameter \(\theta\) is proportional to the size of the region, and is both the mean and the variance of the distribution. The Poisson distribution is studied in more detail in the chapter on the Poisson Process.

Suppose now that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the Poisson distribution with parameter \(\theta\). Recall that the sum of the scores \(Y = \sum_{i=1}^n X_i\) also has the Poisson distribution, but with parameter \(n \theta\).

The statistic \(Y\) is sufficient for \(\theta\). Specifically, for \( y \in \N \), the conditional distribution of \( \bs{X} \) given \( Y = y \) is the multinomial distribution with \( y \) trials, \( n \) trial values, and uniform trial probabilities.

Proof:

The joint PDF of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{e^{-n \theta} \theta^y}{x_1! x_2! \cdots x_n!}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \N^n \] where \( y = \sum_{i=1}^n x_i \). Given \( Y = y \in \N \), the random vector \( \bs{X} \) takes values in the set \(D_y = \left\{\bs{x} = (x_1, x_2, \ldots, x_n) \in \N^n: \sum_{i=1}^n x_i = y\right\}\). Moreover, \[\P(\bs{X} = \bs{x} \mid Y = y) = \frac{\P(\bs{X} = \bs{x})}{\P(Y = y)} = \frac{e^{-n \theta} \theta^y / (x_1! x_2! \cdots x_n!)}{e^{-n \theta} (n \theta)^y / y!} = \frac{y!}{x_1! x_2! \cdots x_n!} \frac{1}{n^y}, \quad \bs{x} \in D_y\] The last expression is the PDF of the multinomial distribution stated in the theorem. Of course, the important point is that the conditional distribution does not depend on \( \theta \).

As before, it's easier to use the factorization theorem to prove the sufficiency of \( Y \), but the conditional distribution gives some additional insight.

\(Y\) is complete for \(\theta \in (0, \infty)\).

Proof:

If \(r: \N \to \R\) then \[\E\left[r(Y)\right] = \sum_{y=0}^\infty e^{-n \theta} \frac{(n \theta)^y}{y!} r(y) = e^{-n \theta} \sum_{y=0}^\infty \frac{n^y}{y!} r(y) \theta^y\] The last sum is a power series in \(\theta\) with coefficients \( n^y r(y) / y! \) for \( y \in \N \). If this series is 0 for all \(\theta\) in an open interval, then the coefficients must be 0 and hence \( r(y) = 0 \) for \( y \in \N \).

As with our discussion of Bernoulli trials, the sample mean \( M = Y / n \) is clearly equivalent to \( Y \) and hence is also sufficient for \( \theta \) and complete for \( \theta \in (0, \infty) \). Recall that \( M \) is the method of moments estimator of \( \theta \) and is the maximum likelihood estimator on the parameter space \( (0, \infty) \).

An UMVUE of the parameter \(\P(X = 0) = e^{-\theta}\) for \( \theta \in (0, \infty) \) is \[ \left( \frac{n-1}{n} \right)^Y \]

Proof:

The probability generating function of \(Y\) is \[ P(t) = \E(t^Y) = e^{n \theta(t - 1)}, \quad t \in \R \] Hence \[ \E\left[\left(\frac{n - 1}{n}\right)^Y\right] = \exp \left[n \theta \left(\frac{n - 1}{n} - 1\right)\right] = e^{-\theta}, \quad \theta \in (0, \infty) \] So \( U = [(n - 1) / n]^Y \) is an unbiased estimator of \( e^{-\theta} \). Since \( U \) is a function of the complete, sufficient statistic \( Y \), it follows from the Lehmann-Scheffé theorem that \( U \) is an UMVUE of \( e^{-\theta} \).
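Here is a quick Monte Carlo sketch in Python (illustrative only, with arbitrary values of \( n \) and \( \theta \)) of the unbiasedness of \( [(n - 1)/n]^Y \) as an estimator of \( e^{-\theta} \).

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 5, 2.0, 200_000
# Simulate Y directly: the sum of the sample has the Poisson distribution
# with parameter n * theta.
Y = rng.poisson(n * theta, size=reps)
U = ((n - 1) / n) ** Y
print(U.mean(), np.exp(-theta))   # the two agree up to Monte Carlo error
```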

The Normal Distribution

Recall that the normal distribution with mean \(\mu \in \R\) and variance \(\sigma^2 \in (0, \infty)\) has probability density function \[ g(x) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R \] The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions. Because of the central limit theorem, the normal distribution is perhaps the most important distribution in statistics.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Then each of the following pairs of statistics is minimally sufficient for \( (\mu, \sigma^2) \):

  1. \((Y, V)\) where \(Y = \sum_{i=1}^n X_i\) and \(V = \sum_{i=1}^n X_i^2\).
  2. \(\left(M, S^2\right)\) where \(M = \frac{1}{n} \sum_{i=1}^n X_i\) is the sample mean and \(S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - M)^2\) is the standard sample variance.
  3. \( (M, T^2) \) where \( T^2 = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \) is the biased version of the sample variance.
Proof:
  1. The joint PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{(2 \pi)^{n/2} \sigma^n} \exp\left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2\right], \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \R^n \] After some algebra, this can be written as \[ f(\bs{x}) = \frac{1}{(2 \pi)^{n/2} \sigma^n} e^{-n \mu^2 / (2 \sigma^2)} \exp\left(-\frac{1}{2 \sigma^2} \sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2} \sum_{i=1}^n x_i \right), \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \R^n\] It follows from the factorization theorem that \( (Y, V) \) is sufficient for \(\left(\mu, \sigma^2\right)\). (Minimal sufficiency can be seen from the general result on exponential families below.)
  2. Note that \( M = \frac{1}{n} Y, \; S^2 = \frac{1}{n - 1} V - \frac{n}{n - 1} M^2\). Hence \(\left(M, S^2\right)\) is equivalent to \( (Y, V) \) and so \(\left(M, S^2\right)\) is also sufficient for \(\left(\mu, \sigma^2\right)\).
  3. Similarly, \( M = \frac{1}{n} Y \) and \( T^2 = \frac{1}{n} V - M^2 \). Hence \( (M, T^2) \) is equivalent to \( (Y, V) \) and so \( (M, T^2) \) is also sufficient for \( (\mu, \sigma^2) \).

Recall that \( M \) and \( T^2 \) are the method of moments estimators of \( \mu \) and \( \sigma^2 \), respectively, and are also the maximum likelihood estimators on the parameter space \( \R \times (0, \infty) \).
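The algebraic equivalences in the proof above, \( M = Y / n \), \( S^2 = \frac{1}{n-1} V - \frac{n}{n-1} M^2 \), and \( T^2 = \frac{1}{n} V - M^2 \), can be checked numerically. The following Python sketch (assuming NumPy, with arbitrary parameter values) does so for a small normal sample.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.5, scale=2.0, size=12)
n = x.size
Y, V = x.sum(), (x**2).sum()          # the sufficient pair (Y, V)
M = Y / n                             # sample mean
S2 = V / (n - 1) - n / (n - 1) * M**2 # standard sample variance
T2 = V / n - M**2                     # biased version of the sample variance
print(np.isclose(M, x.mean()),
      np.isclose(S2, x.var(ddof=1)),
      np.isclose(T2, x.var(ddof=0)))  # all True
```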

Run the normal estimation experiment 1000 times with various values of the parameters. Compare the estimates of the parameters in terms of bias and mean square error.

Sometimes the variance \( \sigma^2 \) of the normal distribution is known, but not the mean \( \mu \). It's rarely the case that \( \mu \) is known but not \( \sigma^2 \). Nonetheless we can give sufficient statistics in both cases.

Suppose again that \( \bs{X} = (X_1, X_2, \ldots, X_n) \) is a random sample from the normal distribution with mean \( \mu \in \R \) and variance \( \sigma^2 \in (0, \infty)\).

  1. If \( \sigma^2 \) is known then \( Y = \sum_{i=1}^n X_i \) is sufficient for \( \mu \).
  2. If \( \mu \) is known then \( U = \sum_{i=1}^n (X_i - \mu)^2 \) is sufficient for \( \sigma^2 \).
Proof:

The results follow from the two expressions given for the PDF \( f(\bs{x}) \) of \( \bs{X} \) in the previous theorem.

The Gamma Distribution

Recall that the gamma distribution with shape parameter \(k \in (0, \infty)\) and scale parameter \(b \in (0, \infty)\) has probability density function \[ g(x) = \frac{1}{\Gamma(k) b^k} x^{k-1} e^{-x / b}, \quad x \in (0, \infty) \] The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the gamma distribution with shape parameter \(k\) and scale parameter \(b\). The following pairs of statistics are sufficient for \((k, b)\):

  1. \((Y, V)\) where \(Y = \sum_{i=1}^n X_i\) is the sum of the scores and \(V = \prod_{i=1}^n X_i\) is the product of the scores.
  2. \((M, U)\) where \(M = Y / n\) is the sample (arithmetic) mean of \(\bs{X}\) and \(U = V^{1/n}\) is the sample geometric mean of \(\bs{X}\).
Proof:
  1. The joint PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{\Gamma^n(k) b^{nk}} (x_1 x_2 \ldots x_n)^{k-1} e^{-(x_1 + x_2 + \cdots + x_n) / b}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in (0, \infty)^n \] From the factorization theorem, \( (Y, V) \) is sufficient for \( (k, b) \).
  2. Clearly \( M = Y / n \) is equivalent to \( Y \) and \( U = V^{1/n} \) is equivalent to \( V \). Hence \( (M, U) \) is also sufficient for \( (k, b) \).

Recall that the method of moments estimators of \( k \) and \( b \) are \( M^2 / T^2 \) and \( T^2 / M \), respectively, where \( M = \frac{1}{n} \sum_{i=1}^n X_i \) is the sample mean and \( T^2 = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \) is the biased version of the sample variance. If the shape parameter \( k \) is known, \( \frac{1}{k} M \) is both the method of moments estimator of \( b \) and the maximum likelihood estimator on the parameter space \( (0, \infty) \). Note that \( T^2 \) is not a function of the sufficient statistics \( (Y, V) \), and hence estimators based on \( T^2 \) suffer from a loss of information.

Run the gamma estimation experiment 1000 times with various values of the parameters and the sample size \( n \). Compare the estimates of the parameters in terms of bias and mean square error.

The proof of the last theorem actually shows that \( Y \) is sufficient for \( b \) if \( k \) is known, and that \( V \) is sufficient for \( k \) if \( b \) is known.

Suppose again that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample of size \(n\) from the gamma distribution with shape parameter \( k \in (0, \infty) \) and scale parameter \(b \in (0, \infty)\). Then \(Y = \sum_{i=1}^n X_i\) is complete for \(b\).

Proof:

\( Y \) has the gamma distribution with shape parameter \( n k \) and scale parameter \( b \). Hence, if \(r: [0, \infty) \to \R\), then \[\E\left[r(Y)\right] = \int_0^\infty \frac{1}{\Gamma(n k) b^{n k}} y^{n k-1} e^{-y/b} r(y) \, dy = \frac{1}{\Gamma(n k) b^{n k}} \int_0^\infty y^{n k - 1} r(y) e^{-y / b} \, dy\] The last integral can be interpreted as the Laplace transform of the function \( y \mapsto y^{n k - 1} r(y) \) evaluated at \( 1 / b \). If this transform is 0 for all \(b\) in an open interval, then \( r(y) = 0 \) almost everywhere in \( (0, \infty) \).

Suppose again that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the gamma distribution on \( (0, \infty) \) with shape parameter \( k \in (0, \infty) \) and scale parameter \(b \in (0, \infty)\). Let \( M = \frac{1}{n} \sum_{i=1}^n X_i \) denote the sample mean and \( U = (X_1 X_2 \ldots X_n)^{1/n} \) the sample geometric mean, as before. Then

  1. \( M / U \) is ancillary for \( b \).
  2. \( M \) and \( M / U \) are independent.
Proof:
  1. We can take \( X_i = b Z_i \) for \( i \in \{1, 2, \ldots, n\} \) where \( \bs{Z} = (Z_1, Z_2, \ldots, Z_n) \) is a random sample of size \( n \) from the gamma distribution with shape parameter \( k \) and scale parameter 1 (the standard gamma distribution with shape parameter \( k \)). Then \[ \frac{M}{U} = \frac{1}{n} \sum_{i=1}^n \frac{X_i}{(X_1 X_2 \cdots X_n)^{1/n}} = \frac{1}{n} \sum_{i=1}^n \left(\frac{X_i^n}{X_1 X_2 \cdots X_n}\right)^{1/n} = \frac{1}{n} \sum_{i=1}^n \left(\prod_{j \ne i} \frac{X_i}{X_j}\right)^{1/n} \] But \( X_i / X_j = Z_i / Z_j\) for \( i \ne j \), and the distribution of \( \left\{Z_i / Z_j: i, j \in \{1, 2, \ldots, n\}, \; i \ne j\right\} \) does not depend on \( b \). Hence the distribution of \( M / U \) does not depend on \( b \).
  2. This follows from Basu's theorem above, since \( M \) is complete and sufficient for \( b \) and \( M / U \) is ancillary for \( b \).
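Basu's theorem predicts that \( M \) and \( M / U \) are independent in the gamma model. The following Python sketch (a rough Monte Carlo check, not a proof, with arbitrarily chosen values of \( n \), \( k \), and \( b \)) looks at the correlation of \( M \) and \( M / U \) and compares the distribution of \( M / U \) on the two halves of the range of \( M \).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, b, reps = 5, 2.0, 3.0, 100_000
x = rng.gamma(shape=k, scale=b, size=(reps, n))
M = x.mean(axis=1)                      # sample arithmetic mean
U = np.exp(np.log(x).mean(axis=1))      # sample geometric mean
R = M / U                               # ancillary for b
print(np.corrcoef(M, R)[0, 1])          # approximately 0
# A slightly stronger check: the distribution of R should look the same
# whether M is below or above its median.
med = np.median(M)
print(np.quantile(R[M < med], [0.25, 0.5, 0.75]))
print(np.quantile(R[M >= med], [0.25, 0.5, 0.75]))
```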

The Beta Distribution

Recall that the beta distribution with left parameter \(a \in (0, \infty)\) and right parameter \(b \in (0, \infty)\) has probability density function \( g \) given by \[ g(x) = \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1}, \quad x \in (0, 1)\] where \( B \) is the beta function. The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the beta distribution with left parameter \(a\) and right parameter \(b\). Then \((P, Q)\) is sufficient for \((a, b)\) where \(P = \prod_{i=1}^n X_i\) and \(Q = \prod_{i=1}^n (1 - X_i)\).

Proof:

The joint PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{B^n(a, b)} (x_1 x_2 \cdots x_n)^{a - 1} [(1 - x_1) (1 - x_2) \cdots (1 - x_n)]^{b-1}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in (0, 1)^n \] From the factorization theorem, it follows that \( (P, Q) \) is sufficient for \( (a, b) \).

The proof also shows that \( P \) is sufficient for \( a \) if \( b \) is known, and that \( Q \) is sufficient for \( b \) if \( a \) is known. Recall that the method of moments estimators of \( a \) and \( b \) are \[ U = \frac{M(M - M_2)}{M_2 - M^2}, \quad V = \frac{(1 - M)(M - M_2)}{M_2 - M^2} \] respectively, where \( M = \frac{1}{n} \sum_{i=1}^n X_i \) is the sample mean and \( M_2 = \frac{1}{n} \sum_{i=1}^n X_i^2 \). If \( b \) is known, the method of moments estimator of \( a \) is \( U_b = b M / (1 - M) \), while if \( a \) is known, the method of moments estimator of \( b \) is \( V_a = a (1 - M) / M \). None of these estimators is a function of the sufficient statistics \( (P, Q) \) and so all suffer from a loss of information. On the other hand, if \( b = 1 \), the maximum likelihood estimator of \( a \) on the interval \( (0, \infty) \) is \( W = -n / \sum_{i=1}^n \ln X_i \), which is a function of \( P \) (as it must be).
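The claim that the maximum likelihood estimator \( W \) in the case \( b = 1 \) is a function of \( P \) amounts to the identity \( W = -n / \ln P \). Here is a minimal Python check (assuming NumPy, with an arbitrary value of \( a \)).

```python
import numpy as np

rng = np.random.default_rng(5)
a, n = 2.5, 20
x = rng.beta(a, 1.0, size=n)              # beta sample with right parameter b = 1
P = np.prod(x)                            # product of the sample variables
W = -n / np.log(x).sum()                  # maximum likelihood estimator of a
print(np.isclose(W, -n / np.log(P)))      # True: W depends on the data only through P
```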

Run the beta estimation experiment 1000 times with various values of the parameters. Compare the estimates of the parameters.

The Pareto Distribution

Recall that the Pareto distribution with shape parameter \(a \in (0, \infty)\) and scale parameter \(b \in (0, \infty)\) has probability density function \[ g(x) = \frac{a b^a}{x^{a+1}}, \quad b \le x \lt \infty \] The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other types of random variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that \(\bs{X} = (X_1, X_2, \ldots, X_n)\) is a random sample from the Pareto distribution with shape parameter \(a\) and scale parameter \( b \). Then \( \left(P, X_{(1)}\right) \) is sufficient for \( (a, b) \) where \(P = \prod_{i=1}^n X_i\) is the product of the sample variables and where \( X_{(1)} = \min\{X_1, X_2, \ldots, X_n\} \) is the first order statistic.

Proof:

The joint PDF \( f \) of \( \bs{X} \) at \( \bs{x} = (x_1, x_2, \ldots, x_n) \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{a^n b^{n a}}{(x_1 x_2 \cdots x_n)^{a + 1}}, \quad x_1 \ge b, x_2 \ge b, \ldots, x_n \ge b \] which can be rewritten as \[ f(\bs{x}) = \frac{a^n b^{n a}}{(x_1 x_2 \cdots x_n)^{a + 1}} \bs{1}\left(x_{(1)} \ge b\right), \quad (x_1, x_2, \ldots, x_n) \in (0, \infty)^n \] So the result follows from the factorization theorem.

The proof also shows that \( P \) is sufficient for \( a \) if \( b \) is known (which is often the case), and that \( X_{(1)} \) is sufficient for \( b \) if \( a \) is known (much less likely). Recall that the method of moments estimators of \( a \) and \( b \) are \[U = 1 + \sqrt{\frac{M_2}{M_2 - M^2}}, \quad V = \frac{M_2}{M} \left( 1 - \sqrt{\frac{M_2 - M^2}{M_2}} \right)\] respectively, where as before \( M = \frac{1}{n} \sum_{i=1}^n X_i \) and \( M_2 = \frac{1}{n} \sum_{i=1}^n X_i^2 \). These estimators are not functions of the sufficient statistics and hence suffer from a loss of information. On the other hand, the maximum likelihood estimators of \( a \) and \( b \) on the interval \( (0, \infty) \) are \[W = \frac{n}{\sum_{i=1}^n \ln X_i - n \ln X_{(1)}}, \quad X_{(1)}\] respectively. These are functions of the sufficient statistics, as they must be.
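The following Python sketch (illustrative only; it uses NumPy's Lomax-based `pareto` sampler, shifted and scaled to obtain the classical Pareto distribution) computes the maximum likelihood estimates \( W \) and \( X_{(1)} \) from a simulated sample.

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, n = 3.0, 2.0, 500
# rng.pareto draws from the Lomax distribution; adding 1 and scaling by b
# gives the classical Pareto distribution with shape a and scale b.
x = b * (1 + rng.pareto(a, size=n))
b_hat = x.min()                                      # X_(1), the MLE of b
a_hat = n / (np.log(x).sum() - n * np.log(b_hat))    # W, the MLE of a
print(a_hat, b_hat)                                  # close to a = 3 and b = 2
```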

Run the Pareto estimation experiment 1000 times with various values of the parameters \( a \) and \( b \) and the sample size \( n \). Compare the method of moments estimates of the parameters with the maximum likelihood estimates in terms of the empirical bias and mean square error.

The Uniform Distribution

In this subsection, we consider continuous uniform distributions on an interval, in which the parameter of interest is related to one or both of the endpoints. These distributions lead to some interesting counterexamples. As usual, we denote the random sample by \(\bs{X} = (X_1, X_2, \ldots, X_n)\). Recall that \(X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}\) and \(X_{(n)} = \max\{X_1, X_2, \ldots, X_n\}\) are the 1st and \(n\)th order statistics, respectively.

Suppose that \(\bs{X}\) is a random sample from the uniform distribution on the interval \([0, a]\) where \(a \in (0, \infty)\) is the unknown parameter. Then \(X_{(n)}\) is sufficient for \(a\).

Proof:

The uniform distribution on \( [0, a] \) has PDF \( g \) given by \( g(x) = 1 / a \) for \( x \in [0, a] \). Hence the PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{a^n}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in [0, a]^n \] We can rewrite the PDF as \[ f(\bs{x}) = \frac{1}{a^n} \bs{1}[x_{(n)} \le a], \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in [0, \infty)^n \] It then follows from the factorization theorem that \( X_{(n)} \) is sufficient for \( a \).
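To see the data reduction concretely, note that two samples of the same size with the same maximum have identical likelihood functions of \( a \). The Python sketch below (illustrative only, with arbitrary sample values) makes this explicit.

```python
import numpy as np

def uniform_likelihood(x, a):
    """Likelihood of a sample x from the uniform distribution on [0, a]."""
    x = np.asarray(x)
    return (1 / a**x.size) * (x.max() <= a)

# Two samples with the same maximum give the same likelihood function of a,
# illustrating that the likelihood depends on the data only through X_(n).
a_grid = np.linspace(0.1, 5, 9)
x1 = [0.2, 1.1, 3.0]
x2 = [2.9, 0.7, 3.0]
print([uniform_likelihood(x1, a) for a in a_grid])
print([uniform_likelihood(x2, a) for a in a_grid])   # identical to the line above
```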

Run the uniform estimation experiment 1000 times with various values of the parameter. Compare the estimates of the parameter.

Suppose that \(\bs{X}\) is a random sample from the uniform distribution on the interval \([a, a + 1]\) where \(a \in (0, \infty)\) is the unknown parameter. Then \(\left(X_{(1)}, X_{(n)}\right)\) is minimally sufficient for \(a\).

Proof:

The uniform distribution on \( [a, a + 1] \) has PDF \( g \) given by \( g(x) = 1 \) for \( x \in [a, a + 1] \). Hence the PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = g(x_1) g(x_2) \cdots g(x_n) = 1, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in [a, a + 1]^n \] We can rewrite the PDF as \[ f(\bs{x}) = \bs{1}[x_{(1)} \ge a] \bs{1}[x_{(n)} \le a + 1], \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \R^n \] It then follows from the factorization theorem that \( \left(X_{(1)}, X_{(n)}\right) \) is sufficient for \( a \).

Note that we have a single parameter, but the minimally sufficient statistic is a vector of dimension 2.

The Hypergeometric Model

So far, in all of our examples, the basic variables have formed a random sample from a distribution. In this subsection, our basic variables will be dependent.

Recall that in the hypergeometric model, we have a population of \( N \) objects, and that \( r \) of the objects are type 1 and the remaining \( N - r \) are type 0. The population size \( N \) is a positive integer and the type 1 size \( r \) is a nonnegative integer with \( r \le N \). Typically one or both parameters are unknown. We select a random sample of \( n \) objects, without replacement from the population, and let \( X_i \) be the type of the \( i \)th object chosen. So our basic sequence of random variables is \( \bs{X} = (X_1, X_2, \ldots, X_n) \). The variables are identically distributed indicator variables with \( \P(X_i = 1) = r / N \) for \( i \in \{1, 2, \ldots, n\} \), but are dependent. Of course, the sample size \( n \) is a positive integer with \( n \le N \).

The variable \( Y = \sum_{i=1}^n X_i \) is the number of type 1 objects in the sample. This variable has the hypergeometric distribution with parameters \( N \), \( r \), and \( n \), and has probability density function \( h \) given by \[ h(y) = \frac{\binom{r}{y} \binom{N - r}{n - y}}{\binom{N}{n}} = \binom{n}{y} \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad y \in \{\max\{0, n + r - N\}, \ldots, \min\{n, r\}\} \] (Recall the falling power notation \( x^{(k)} = x (x - 1) \cdots (x - k + 1) \).) The hypergeometric distribution is studied in more detail in the chapter on Finite Sampling Models.
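The two forms of the hypergeometric PDF above are algebraically equivalent. Here is a minimal Python sketch (an illustrative check, with arbitrarily chosen values of \( N \), \( r \), and \( n \)) that evaluates both forms for every \( y \) in the support.

```python
from math import comb

def falling(x, k):
    """Falling power x^(k) = x (x - 1) ... (x - k + 1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

N, r, n = 20, 8, 6
for y in range(max(0, n + r - N), min(n, r) + 1):
    lhs = comb(r, y) * comb(N - r, n - y) / comb(N, n)
    rhs = comb(n, y) * falling(r, y) * falling(N - r, n - y) / falling(N, n)
    print(y, round(lhs, 6), round(rhs, 6))   # the two forms agree for every y
```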

\( Y \) is sufficient for \( (N, r) \). Specifically, for \( y \in \{\max\{0, n + r - N\}, \ldots, \min\{n, r\}\} \), the conditional distribution of \( \bs{X} \) given \( Y = y \) is uniform on the set of points \[ D_y = \left\{(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n: x_1 + x_2 + \cdots + x_n = y\right\} \]

Proof:

By a simple application of the multiplication rule of combinatorics, the PDF \( f \) of \( \bs{X} \) is given by \[ f(\bs{x}) = \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \] where \( y = \sum_{i=1}^n x_i \). If \( y \in \{\max\{0, n + r - N\}, \ldots, \min\{n, r\}\} \), the conditional distribution of \( \bs{X} \) given \( Y = y \) is concentrated on \( D_y \) and \[ \P(\bs{X} = \bs{x} \mid Y = y) = \frac{\P(\bs{X} = \bs{x})}{\P(Y = y)} = \frac{r^{(y)} (N - r)^{(n-y)}/N^{(n)}}{\binom{n}{y} r^{(y)} (N - r)^{(n - y)} / N^{(n)}} = \frac{1}{\binom{n}{y}}, \quad \bs{x} \in D_y \] Of course, \( \binom{n}{y} \) is the cardinality of \( D_y \).

There are clearly strong similarities between the hypergeometric model and the Bernoulli trials model above. Indeed if the sampling were with replacement, the Bernoulli trials model with \( p = r / N \) would apply rather than the hypergeometric model. It's also interesting to note that we have a single real-valued statistic that is sufficient for two real-valued parameters.

Once again, the sample mean \( M = Y / n \) is equivalent to \( Y \) and hence is also sufficient for \( (N, r) \). Recall that the method of moments estimator of \( r \) with \( N \) known is \( N M \) and the method of moments estimator of \( N \) with \( r \) known is \( r / M \). The estimator of \( N \) is the one that is used in the capture-recapture experiment.

Exponential Families

Suppose now that our data vector \(\bs{X}\) takes values in a set \(S\), and that the distribution of \(\bs{X}\) depends on a parameter vector \(\bs{\theta}\) taking values in a parameter space \(\Theta\). The distribution of \(\bs{X}\) is a \(k\)-parameter exponential family if \(S\) does not depend on \(\bs{\theta}\) and if the probability density function of \(\bs{X}\) can be written as

\[ f_\bs{\theta}(\bs{x}) = \alpha(\bs{\theta}) r(\bs{x}) \exp\left(\sum_{i=1}^k \beta_i(\bs{\theta}) u_i(\bs{x}) \right); \quad \bs{x} \in S, \; \bs{\theta} \in \Theta \]

where \(\alpha\) and \(\left(\beta_1, \beta_2, \ldots, \beta_k\right)\) are real-valued functions on \(\Theta\), and where \(r\) and \(\left(u_1, u_2, \ldots, u_k\right)\) are real-valued functions on \(S\). Moreover, \(k\) is assumed to be the smallest such integer. The parameter vector \(\bs{\beta} = \left(\beta_1(\bs{\theta}), \beta_2(\bs{\theta}), \ldots, \beta_k(\bs{\theta})\right)\) is sometimes called the natural parameter of the distribution, and the random vector \(\bs{U} = \left(u_1(\bs{X}), u_2(\bs{X}), \ldots, u_k(\bs{X})\right)\) is sometimes called the natural statistic of the distribution. Although the definition may look intimidating, exponential families are useful because they have many nice mathematical properties, and because many special parametric families are exponential families. In particular, the sampling distributions from the Bernoulli, Poisson, gamma, normal, beta, and Pareto models considered above are exponential families (in the Pareto case with the scale parameter \( b \) known, since otherwise the support depends on the parameter). Exponential families of distributions are studied in more detail in the chapter on special distributions.
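For example, the Bernoulli random sample above fits this form with \( k = 1 \): from the joint PDF in the Bernoulli section, \[ f_p(\bs{x}) = p^y (1 - p)^{n - y} = (1 - p)^n \exp\left[ y \ln\left(\frac{p}{1 - p}\right) \right], \quad y = \sum_{i=1}^n x_i \] so that \( \alpha(p) = (1 - p)^n \), \( r(\bs{x}) = 1 \), \( \beta_1(p) = \ln\left[p / (1 - p)\right] \), and \( u_1(\bs{x}) = \sum_{i=1}^n x_i \). The natural statistic is \( Y \), consistent with the results for the Bernoulli model above.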

\(\bs{U}\) is minimally sufficient for \(\bs{\theta}\).

Proof:

That \( \bs{U} \) is sufficient for \( \bs{\theta} \) follows immediately from the factorization theorem. That \( \bs{U} \) is minimally sufficient follows since \( k \) is the smallest integer in the exponential formulation.

It turns out that \(\bs{U}\) is complete for \(\bs{\theta}\) as well, although the proof is more difficult.