$$\newcommand{\R}{\mathbb{R}}$$ $$\newcommand{\N}{\mathbb{N}}$$ $$\newcommand{\Z}{\mathbb{Z}}$$ $$\newcommand{\E}{\mathbb{E}}$$ $$\newcommand{\P}{\mathbb{P}}$$ $$\newcommand{\var}{\text{var}}$$ $$\newcommand{\sd}{\text{sd}}$$ $$\newcommand{\cov}{\text{cov}}$$ $$\newcommand{\cor}{\text{cor}}$$ $$\newcommand{\bias}{\text{bias}}$$ $$\newcommand{\mse}{\text{mse}}$$ $$\newcommand{\bs}{\boldsymbol}$$

## 3. Maximum Likelihood

### Basic Theory

#### The Method

Suppose again that we have an observable random variable $$\bs{X}$$ for an experiment, that takes values in a set $$S$$. Suppose also that the distribution of $$\bs{X}$$ depends on an unknown parameter $$\theta$$, taking values in a parameter space $$\Theta$$. Of course, our data variable $$\bs{X}$$ will almost always be vector valued. The parameter $$\theta$$ may also be vector valued. We will denote the probability density function of $$\bs{X}$$ on $$S$$ by $$f_\theta$$ for $$\theta \in \Theta$$. The distribution of $$\bs{X}$$ could be discrete or continuous.

The likelihood function is the function obtained by reversing the roles of $$\bs{x}$$ and $$\theta$$ in the probability density function; that is, we view $$\theta$$ as the variable and $$\bs{x}$$ as the given information (which is precisely the point of view in estimation).

The likelihood function at $$\bs{x} \in S$$ is the function $$L_{\bs{x}}: \Theta \to [0, \infty)$$ given by $L_\bs{x}(\theta) = f_\theta(\bs{x}), \quad \theta \in \Theta$

In the method of maximum likelihood, we try to find the value of the parameter that maximizes the likelihood function for each value of the data vector.

Suppose that the maximum value of $$L_{\bs{x}}$$ occurs at $$u(\bs{x}) \in \Theta$$ for each $$\bs{x} \in S$$. Then the statistic $$u(\bs{X})$$ is a maximum likelihood estimator of $$\theta$$.

The method of maximum likelihood is intuitively appealing—we try to find the value of the parameter that would have most likely produced the data we in fact observed.

Since the natural logarithm function is strictly increasing on $$(0, \infty)$$, the maximum value of the likelihood function, if it exists, will occur at the same points as the maximum value of the logarithm of the likelihood function.

The log-likelihood function at $$\bs{x} \in S$$ is the function $$\ln L_{\bs{x}}$$: $\ln L_{\bs{x}}(\theta) = \ln f_\theta(\bs{x}), \quad \theta \in \Theta$ If the maximum value of $$\ln L_{\bs{x}}$$ occurs at $$u(\bs{x}) \in \Theta$$ for each $$\bs{x} \in S$$, then the statistic $$u(\bs{X})$$ is a maximum likelihood estimator of $$\theta$$.

The log-likelihood function is often easier to work with than the likelihood function (typically because the probability density function $$f_\theta(\bs{x})$$ has a product structure).

#### Vector of Parameters

An important special case is when $$\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$$ is a vector of $$k$$ real parameters, so that $$\Theta \subseteq \R^k$$. In this case, the maximum likelihood problem is to maximize a function of several variables. If $$\Theta$$ is a continuous set, the methods of calculus can be used. If the maximum value of $$L_\bs{x}$$ occurs at a point $$\bs{\theta}$$ in the interior of $$\Theta$$, then $$L_\bs{x}$$ has a local maximum at $$\bs{\theta}$$. Therefore, assuming that the likelihood function is differentiable, we can find this point by solving $\frac{\partial}{\partial \theta_i} L_\bs{x}(\bs{\theta}) = 0, \quad i \in \{1, 2, \ldots, k\}$ or equivalently $\frac{\partial}{\partial \theta_i} \ln L_\bs{x}(\bs{\theta}) = 0, \quad i \in \{1, 2, \ldots, k\}$ On the other hand, the maximum value may occur at a boundary point of $$\Theta$$, or may not exist at all.

#### Random Samples

The most important special case is when the data variables form a random sample from a distribution.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample of size $$n$$ from the distribution of a random variable $$X$$ taking values in $$R$$, with probability density function $$g_\theta$$ for $$\theta \in \Theta$$. Then $$\bs{X}$$ takes values in $$S = R^n$$, and the likelihood and log-likelihood functions for $$\bs{x} = (x_1, x_2, \ldots, x_n) \in S$$ are \begin{align*} L_\bs{x}(\theta) & = \prod_{i=1}^n g_\theta(x_i), \quad \theta \in \Theta \\ \ln L_\bs{x}(\theta) & = \sum_{i=1}^n \ln g_\theta(x_i), \quad \theta \in \Theta \end{align*}
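The sum structure of the log-likelihood translates directly into code. Here is a minimal Python sketch (the function names and the Bernoulli density used as the illustration are ours, not part of the text):

```python
import math

def log_likelihood(sample, log_pdf, theta):
    # ln L_x(theta) = sum of ln g_theta(x_i) over an i.i.d. sample
    return sum(log_pdf(x, theta) for x in sample)

def bernoulli_log_pdf(x, p):
    # ln g(x) = x ln p + (1 - x) ln(1 - p) for x in {0, 1}
    return x * math.log(p) + (1 - x) * math.log(1 - p)

sample = [1, 0, 1, 1, 0]
print(log_likelihood(sample, bernoulli_log_pdf, 0.6))
```

Any density with a `log_pdf(x, theta)` signature can be plugged in the same way.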

#### Extending the Method and the Invariance Property

Returning to the general setting, suppose now that $$h$$ is a one-to-one function from the parameter space $$\Theta$$ onto a set $$\Lambda$$. We can view $$\lambda = h(\theta)$$ as a new parameter taking values in the space $$\Lambda$$, and it is easy to re-parameterize the probability density function with the new parameter. Thus, let $$\hat{f}_\lambda(\bs{x}) = f_{h^{-1}(\lambda)}(\bs{x})$$ for $$\bs{x} \in S$$ and $$\lambda \in \Lambda$$. The corresponding likelihood function for $$\bs{x} \in S$$ is $\hat{L}_\bs{x}(\lambda) = L_\bs{x}\left[h^{-1}(\lambda)\right], \quad \lambda \in \Lambda$ Clearly if $$u(\bs{x}) \in \Theta$$ maximizes $$L_\bs{x}$$ for $$\bs{x} \in S$$, then $$h\left[u(\bs{x})\right] \in \Lambda$$ maximizes $$\hat{L}_\bs{x}$$ for $$\bs{x} \in S$$. It follows that if $$U$$ is a maximum likelihood estimator for $$\theta$$, then $$V = h(U)$$ is a maximum likelihood estimator for $$\lambda = h(\theta)$$.

If the function $$h$$ is not one-to-one, the likelihood function for the new parameter $$\lambda = h(\theta)$$ is not well defined, because we cannot parameterize the probability density function in terms of $$\lambda$$. However, there is a natural generalization of the method.

Suppose that $$h: \Theta \to \Lambda$$, and let $$\lambda = h(\theta)$$ denote the new parameter. Define the likelihood function for $$\lambda$$ at $$\bs{x} \in S$$ by $\hat{L}_\bs{x}(\lambda) = \max\left\{L_\bs{x}(\theta): \theta \in h^{-1}\{\lambda\} \right\}; \quad \lambda \in \Lambda$ If $$v(\bs{x}) \in \Lambda$$ maximizes $$\hat{L}_{\bs{x}}$$ for each $$\bs{x} \in S$$, then $$V = v(\bs{X})$$ is a maximum likelihood estimator of $$\lambda$$.

This definition extends the maximum likelihood method to cases where the probability density function is not completely parameterized by the parameter of interest. The following theorem is known as the invariance property: if we can solve the maximum likelihood problem for $$\theta$$ then we can solve the maximum likelihood problem for $$\lambda = h(\theta)$$.

In the setting of the previous theorem, if $$U$$ is a maximum likelihood estimator of $$\theta$$, then $$V = h(U)$$ is a maximum likelihood estimator of $$\lambda$$.

Proof:

As before, if $$u(\bs{x}) \in \Theta$$ maximizes $$L_\bs{x}$$ for $$\bs{x} \in S$$, then $$h\left[u(\bs{x})\right] \in \Lambda$$ maximizes $$\hat{L}_\bs{x}$$ for $$\bs{x} \in S$$.

### Examples and Special Cases

In the following subsections, we will study maximum likelihood estimation for a number of special parametric families of distributions. Recall that if $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from a distribution with mean $$\mu$$ and variance $$\sigma^2$$, then the method of moments estimators of $$\mu$$ and $$\sigma^2$$ are, respectively, \begin{align} M & = \frac{1}{n} \sum_{i=1}^n X_i \\ T^2 & = \frac{1}{n} \sum_{i=1}^n (X_i - M)^2 \end{align} Of course, $$M$$ is the sample mean, and $$T^2$$ is the biased version of the sample variance. These statistics will also sometimes occur as maximum likelihood estimators. Another statistic that will occur in some of the examples below is $M_2 = \frac{1}{n} \sum_{i=1}^n X_i^2$ the second-order sample mean. As always, be sure to try the derivations yourself before looking at the solutions.

#### The Bernoulli Distribution

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample of size $$n$$ from the Bernoulli distribution with success parameter $$p \in [0, 1]$$. Recall that the Bernoulli probability density function is $g(x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$ Thus, $$\bs{X}$$ is a sequence of independent indicator variables with $$\P(X_i = 1) = p$$ for each $$i$$. In the usual language of reliability, $$X_i$$ is the outcome of trial $$i$$, where 1 means success and 0 means failure. Let $$Y = \sum_{i=1}^n X_i$$ denote the number of successes, so that the proportion of successes (the sample mean) is $$M = Y / n$$. Recall that $$Y$$ has the binomial distribution with parameters $$n$$ and $$p$$.

The sample mean $$M$$ is the maximum likelihood estimator of $$p$$ on the parameter space $$(0, 1)$$.

Proof:

Note that $$\ln g(x) = x \ln p + (1 - x) \ln(1 - p)$$ for $$x \in \{0, 1\}$$. Hence the log-likelihood function at $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$$ is $\ln L_{\bs{x}}(p) = \sum_{i=1}^n [x_i \ln p + (1 - x_i) \ln(1 - p)], \quad p \in (0, 1)$ Differentiating with respect to $$p$$ and simplifying gives $\frac{d}{dp} \ln L_{\bs{x}}(p) = \frac{y}{p} - \frac{n - y}{1 - p}$ where $$y = \sum_{i=1}^n x_i$$. Thus, there is a single critical point at $$p = y / n = m$$. The second derivative is $\frac{d^2}{d p^2} \ln L_{\bs{x}}(p) = -\frac{y}{p^2} - \frac{n - y}{(1 - p)^2} \lt 0$ Hence the log-likelihood function is concave downward and so the maximum occurs at the unique critical point $$m$$.
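The calculus argument can be sanity-checked numerically: maximizing the Bernoulli log-likelihood over a fine grid should land on the sample proportion. A quick Python sketch (the data values are ours):

```python
import math

def bern_loglik(y, n, p):
    # log-likelihood for y successes in n Bernoulli trials
    return y * math.log(p) + (n - y) * math.log(1 - p)

n, y = 20, 13
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bern_loglik(y, n, p))
print(p_hat)  # the sample proportion y / n = 0.65
```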

Recall that $$M$$ is also the method of moments estimator of $$p$$. It's always nice when two different estimation procedures yield the same result. Next let's look at the same problem, but with a much restricted parameter space.

Suppose now that $$p$$ takes values in $$\left\{\frac{1}{2}, 1\right\}$$. Then the maximum likelihood estimator of $$p$$ is the statistic $U = \begin{cases} 1, & Y = n\\ \frac{1}{2}, & Y \lt n \end{cases}$

1. $$\E(U) = \begin{cases} 1, & p = 1 \\ \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1}, & p = \frac{1}{2} \end{cases}$$
2. $$U$$ is positively biased, but is asymptotically unbiased.
3. $$\mse(U) = \begin{cases} 0, & p = 1 \\ \left(\frac{1}{2}\right)^{n+2}, & p = \frac{1}{2} \end{cases}$$
4. $$U$$ is consistent.
Proof:

Note that the likelihood function at $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$$ is $$L_{\bs{x}}(p) = p^y (1 - p)^{n-y}$$ for $$p \in \left\{\frac{1}{2}, 1\right\}$$ where as usual, $$y = \sum_{i=1}^n x_i$$. Thus $$L_{\bs{x}}\left(\frac{1}{2}\right) = \left(\frac{1}{2}\right)^y$$. On the other hand, $$L_{\bs{x}}(1) = 0$$ if $$y \lt n$$ while $$L_{\bs{x}}(1) = 1$$ if $$y = n$$. Thus, if $$y = n$$ the maximum occurs when $$p = 1$$ while if $$y \lt n$$ the maximum occurs when $$p = \frac{1}{2}$$.

1. If $$p = 1$$ then $$\P(U = 1) = \P(Y = n) = 1$$, so trivially $$\E(U) = 1$$. If $$p = \frac{1}{2}$$, $\E(U) = 1 \P(Y = n) + \frac{1}{2} \P(Y \lt n) = 1 \left(\frac{1}{2}\right)^n + \frac{1}{2}\left[1 - \left(\frac{1}{2}\right)^n\right] = \frac{1}{2} + \left(\frac{1}{2}\right)^{n+1}$
2. Note that $$\E(U) \ge p$$ and $$\E(U) \to p$$ as $$n \to \infty$$ both in the case that $$p = 1$$ and $$p = \frac{1}{2}$$.
3. If $$p = 1$$ then $$U = 1$$ with probability 1, so trivially $$\mse(U) = 0$$. If $$p = \frac{1}{2}$$, $\mse(U) = \left(1 - \frac{1}{2}\right)^2 \P(Y = n) + \left(\frac{1}{2} - \frac{1}{2}\right)^2 \P(Y \lt n) = \left(\frac{1}{2}\right)^2 \left(\frac{1}{2}\right)^n = \left(\frac{1}{2}\right)^{n+2}$
4. From (c), $$\mse(U) \to 0$$ as $$n \to \infty$$.

Note that the Bernoulli distribution in the last exercise would model a coin that is either fair or two-headed. The last two exercises show that the maximum likelihood estimator of a parameter, like the solution to any maximization problem, depends critically on the domain.

$$U$$ is uniformly better than $$M$$ on the parameter space $$\left\{\frac{1}{2}, 1\right\}$$.

Proof:

Recall that $$\mse(M) = \var(M) = p (1 - p) / n$$. If $$p = 1$$ then $$\mse(M) = \mse(U) = 0$$ so that both estimators give the correct answer. If $$p = \frac{1}{2}$$, $$\mse(U) = \left(\frac{1}{2}\right)^{n+2} \lt \frac{1}{4 n} = \mse(M)$$.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample of size $$n$$ from the Bernoulli distribution with unknown success parameter $$p \in (0, 1)$$. Find the maximum likelihood estimator of $$p (1 - p)$$, which is the variance of the sampling distribution.

By the invariance principle, the estimator is $$M (1 - M)$$ where $$M$$ is the sample mean.

#### The Geometric Distribution

Recall that the geometric distribution on $$\N_+$$ with success parameter $$p \in (0, 1)$$ has probability density function $g(x) = p (1 - p)^{x-1}, \quad x \in \N_+$ The geometric distribution governs the trial number of the first success in a sequence of Bernoulli trials.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the geometric distribution with unknown parameter $$p \in (0, 1)$$. The maximum likelihood estimator of $$p$$ is $$U = 1 / M$$.

Proof:

Note that $$\ln g(x) = \ln p + (x - 1) \ln(1 - p)$$ for $$x \in \N_+$$. Hence the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \N_+^n$$ is $\ln L_\bs{x}(p) = n \ln p + (y - n) \ln(1 - p), \quad p \in (0, 1)$ where $$y = \sum_{i=1}^n x_i$$. So $\frac{d}{dp} \ln L_\bs{x}(p) = \frac{n}{p} - \frac{y - n}{1 - p}$ The derivative is 0 when $$p = n / y = 1 / m$$. Finally, $$\frac{d^2}{dp^2} \ln L_\bs{x}(p) = -n / p^2 - (y - n) / (1 - p)^2 \lt 0$$ so the maximum occurs at the critical point.
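As a numerical check of the result $$U = 1/M$$, a grid search over the geometric log-likelihood should agree with the reciprocal of the sample mean. A short Python sketch (the data values are hypothetical trial numbers of first success):

```python
import math

def geom_loglik(xs, p):
    # ln L = n ln p + (y - n) ln(1 - p), where y = sum of the x_i
    n, y = len(xs), sum(xs)
    return n * math.log(p) + (y - n) * math.log(1 - p)

xs = [3, 1, 7, 2, 2]
m = sum(xs) / len(xs)  # sample mean, here 3.0
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: geom_loglik(xs, p))
print(p_hat, 1 / m)
```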

Recall that $$U$$ is also the method of moments estimator of $$p$$. It's always reassuring when two different estimation procedures produce the same estimator.

#### The Negative Binomial Distribution

More generally, the negative binomial distribution on $$\N$$ with shape parameter $$k \in (0, \infty)$$ and success parameter $$p \in (0, 1)$$ has probability density function $g(x) = \binom{x + k - 1}{k - 1} p^k (1 - p)^x, \quad x \in \N$ If $$k$$ is a positive integer, then this distribution governs the number of failures before the $$k$$th success in a sequence of Bernoulli trials with success parameter $$p$$. However, the distribution makes sense for general $$k \in (0, \infty)$$. The negative binomial distribution is studied in more detail in the chapter on Bernoulli Trials.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample of size $$n$$ from the negative binomial distribution on $$\N$$ with known shape parameter $$k$$ and unknown success parameter $$p \in (0, 1)$$. The maximum likelihood estimator of $$p$$ is $U = \frac{k}{k + M}$

Proof:

Note that $$\ln g(x) = \ln \binom{x + k - 1}{k - 1} + k \ln p + x \ln(1 - p)$$ for $$x \in \N$$. Hence the log-likelihood function corresponding to $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \N^n$$ is $\ln L_\bs{x}(p) = n k \ln p + y \ln(1 - p) + C, \quad p \in (0, 1)$ where $$y = \sum_{i=1}^n x_i$$ and $$C = \sum_{i=1}^n \ln \binom{x_i + k - 1}{k - 1}$$. Hence $\frac{d}{dp} \ln L_\bs{x}(p) = \frac{n k}{p} - \frac{y}{1 - p}$ The derivative is 0 when $$p = n k / (n k + y) = k / (k + m)$$ where as usual, $$m = y / n$$. Finally, $$\frac{d^2}{dp^2} \ln L_\bs{x}(p) = - n k / p^2 - y / (1 - p)^2 \lt 0$$, so the maximum occurs at the critical point.

Once again, this is the same as the method of moments estimator of $$p$$ with $$k$$ known.

#### The Poisson Distribution

Recall that the Poisson distribution with parameter $$r \gt 0$$ has probability density function $g(x) = e^{-r} \frac{r^x}{x!}, \quad x \in \N$ The Poisson distribution is named for Simeon Poisson and is widely used to model the number of random points in a region of time or space. The parameter $$r$$ is proportional to the size of the region. The Poisson distribution is studied in more detail in the chapter on the Poisson process.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the Poisson distribution with unknown parameter $$r \in (0, \infty)$$. The maximum likelihood estimator of $$r$$ is the sample mean $$M$$.

Proof:

Note that $$\ln g(x) = -r + x \ln r - \ln(x!)$$ for $$x \in \N$$. Hence the log-likelihood function corresponding to $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \N^n$$ is $\ln L_\bs{x}(r) = -n r + y \ln r - C, \quad r \in (0, \infty)$ where $$y = \sum_{i=1}^n x_i$$ and $$C = \sum_{i=1}^n \ln(x_i!)$$. Hence $$\frac{d}{dr} \ln L_\bs{x}(r) = -n + y / r$$. The derivative is 0 when $$r = y / n = m$$. Finally, $$\frac{d^2}{dr^2} \ln L_\bs{x}(r) = -y / r^2 \lt 0$$, so the maximum occurs at the critical point.
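Dropping the constant $$C$$, the Poisson log-likelihood is easy to maximize numerically, and the maximizer should match the sample mean. A Python sketch with made-up counts:

```python
import math

def pois_loglik(xs, r):
    # ln L = -n r + y ln r + C; the additive constant C is dropped
    n, y = len(xs), sum(xs)
    return -n * r + y * math.log(r)

xs = [2, 0, 3, 1, 4]
m = sum(xs) / len(xs)  # sample mean, here 2.0
grid = [i / 100 for i in range(1, 1001)]
r_hat = max(grid, key=lambda r: pois_loglik(xs, r))
print(r_hat, m)
```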

Recall that for the Poisson distribution, the parameter $$r$$ is both the mean and the variance. Thus $$M$$ is also the method of moments estimator of $$r$$. We showed in the introductory section that $$M$$ has smaller mean square error than $$S^2$$, although both are unbiased.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the Poisson distribution with parameter $$r \in (0, \infty)$$, and let $$p = \P(X = 0) = e^{-r}$$. Find the maximum likelihood estimator of $$p$$ in two ways:

1. Directly, by finding the likelihood function corresponding to the parameter $$p$$.
2. By using the result of the last exercise and the invariance property.

$$e^{-M}$$ where $$M$$ is the sample mean.

#### The Normal Distribution

Recall that the normal distribution with mean $$\mu$$ and variance $$\sigma^2$$ has probability density function $g(x) = \frac{1}{\sqrt{2 \, \pi} \sigma} \exp \left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \R$ The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the normal distribution with unknown mean $$\mu \in \R$$ and variance $$\sigma^2 \in (0, \infty)$$. The maximum likelihood estimators of $$\mu$$ and $$\sigma^2$$ are $$M$$ and $$T^2$$, respectively.

Proof:

Note that $\ln g(x) = -\frac{1}{2} \ln(2 \pi) - \frac{1}{2} \ln(\sigma^2) - \frac{1}{2 \sigma^2} (x - \mu)^2, \quad x \in \R$ Hence the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \R^n$$ is $\ln L_\bs{x}(\mu, \sigma^2) = -\frac{n}{2} \ln(2 \pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2, \quad (\mu, \sigma^2) \in \R \times (0, \infty)$ Taking partial derivatives gives \begin{align*} \frac{\partial}{\partial \mu} \ln L_\bs{x}(\mu, \sigma^2) &= \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n \mu\right) \\ \frac{\partial}{\partial \sigma^2} \ln L_\bs{x}(\mu, \sigma^2) &= -\frac{n}{2 \sigma^2} + \frac{1}{2 \sigma^4} \sum_{i=1}^n (x_i - \mu)^2 \end{align*} The partial derivatives are 0 when $$\mu = \frac{1}{n} \sum_{i=1}^n x_i$$ and $$\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$. Hence the unique critical point is $$(m, t^2)$$. Finally, with a bit more calculus, the second partial derivatives evaluated at the critical point are $\frac{\partial^2}{\partial \mu^2} \ln L_\bs{x}(m, t^2) = -n / t^2, \; \frac{\partial^2}{\partial \mu \partial \sigma^2} \ln L_\bs{x}(m, t^2) = 0, \; \frac{\partial^2}{\partial (\sigma^2)^2} \ln L_\bs{x}(m, t^2) = -n / t^4$ Hence the second derivative matrix at the critical point is negative definite and so the maximum occurs at the critical point.
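The closed-form solution $$(m, t^2)$$ can be verified empirically: the log-likelihood evaluated there should dominate nearby parameter values. A small Python sketch (the data values are ours):

```python
import math

def normal_loglik(xs, mu, s2):
    # ln L = -(n/2) ln(2 pi s2) - sum (x - mu)^2 / (2 s2)
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * s2) \
           - sum((x - mu) ** 2 for x in xs) / (2 * s2)

xs = [1.2, 0.8, 2.5, 1.9, 1.1]
m = sum(xs) / len(xs)                              # sample mean
t2 = sum((x - m) ** 2 for x in xs) / len(xs)       # biased sample variance
# (m, t2) is the unique critical point and the global maximum
print(normal_loglik(xs, m, t2))
```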

Of course, $$M$$ and $$T^2$$ are also the method of moments estimators of $$\mu$$ and $$\sigma^2$$, respectively.

Run the Normal estimation experiment 1000 times for several values of the sample size $$n$$, the mean $$\mu$$, and the variance $$\sigma^2$$. For the parameter $$\sigma^2$$, compare the maximum likelihood estimator $$T^2$$ with the standard sample variance $$S^2$$. Which estimator seems to work better in terms of mean square error?

Suppose again that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the normal distribution with unknown mean $$\mu \in \R$$ and unknown variance $$\sigma^2 \in (0, \infty)$$. Find the maximum likelihood estimator of $$\mu^2 + \sigma^2$$, which is the second moment about 0 for the sampling distribution.

By the invariance principle, the estimator is $$M^2 + T^2$$ where $$M$$ is the sample mean and $$T^2$$ is the (biased version of the) sample variance.

#### The Gamma Distribution

Recall that the gamma distribution with shape parameter $$k \gt 0$$ and scale parameter $$b \gt 0$$ has probability density function $g(x) = \frac{1}{\Gamma(k) \, b^k} x^{k-1} e^{-x / b}, \quad 0 \lt x \lt \infty$ The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the gamma distribution with known shape parameter $$k$$ and unknown scale parameter $$b \in (0, \infty)$$. The maximum likelihood estimator of $$b$$ is $$V_k = \frac{1}{k} M$$.

Proof:

Note that for $$x \in (0, \infty)$$, $\ln g(x) = -\ln \Gamma(k) - k \ln b + (k - 1) \ln x - \frac{x}{b}$ and hence the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in (0, \infty)^n$$ is $\ln L_\bs{x}(b) = - n k \ln b - \frac{y}{b} + C, \quad b \in (0, \infty)$ where $$y = \sum_{i=1}^n x_i$$ and $$C = -n \ln \Gamma(k) + (k - 1) \sum_{i=1}^n \ln x_i$$. It follows that $\frac{d}{d b} \ln L_\bs{x}(b) = -\frac{n k}{b} + \frac{y}{b^2}$ The derivative is 0 when $$b = y / (n k) = m / k$$. Finally, $$\frac{d^2}{db^2} \ln L_\bs{x}(b) = n k / b^2 - 2 y / b^3$$. At the critical point $$b = y / (n k)$$, the second derivative is $$-(n k)^3 / y^2 \lt 0$$ so the maximum occurs at the critical point.
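With $$k$$ known, the constant-free log-likelihood above can be maximized over a grid of scale values and compared with $$m / k$$. A Python sketch (the shape parameter and data are our own choices):

```python
import math

def gamma_loglik(xs, k, b):
    # ln L = -n k ln b - y / b + C; the additive constant C is dropped
    n, y = len(xs), sum(xs)
    return -n * k * math.log(b) - y / b

k = 2.0
xs = [1.5, 3.2, 0.9, 2.4, 4.0]
m = sum(xs) / len(xs)  # sample mean, here 2.4
grid = [i / 1000 for i in range(1, 5001)]
b_hat = max(grid, key=lambda b: gamma_loglik(xs, k, b))
print(b_hat, m / k)
```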

Recall that $$V_k$$ is also the method of moments estimator of $$b$$ when $$k$$ is known. But when $$k$$ is unknown, the method of moments estimator of $$b$$ is $$V = \frac{T^2}{M}$$.

Run the gamma estimation experiment 1000 times for several values of the sample size $$n$$, shape parameter $$k$$, and scale parameter $$b$$. In each case, compare the method of moments estimator $$V$$ of $$b$$ when $$k$$ is unknown with the method of moments and maximum likelihood estimator $$V_k$$ of $$b$$ when $$k$$ is known. Which estimator seems to work better in terms of mean square error?

#### The Beta Distribution

Recall that the beta distribution with left parameter $$a \in (0, \infty)$$ and right parameter $$b = 1$$ has probability density function $g(x) = a x^{a-1}, \quad x \in (0, 1)$ The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the beta distribution with unknown left parameter $$a \in (0, \infty)$$ and right parameter $$b = 1$$. The maximum likelihood estimator of $$a$$ is $W = - \frac{n}{\sum_{i=1}^n \ln X_i}$

Proof:

Note that $$\ln g(x) = \ln a + (a - 1) \ln x$$ for $$x \in (0, 1)$$. Hence the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in (0, 1)^n$$ is $\ln L_\bs{x}(a) = n \ln a + (a - 1) \sum_{i=1}^n \ln x_i, \quad a \in (0, \infty)$ Therefore $$\frac{d}{da} \ln L_\bs{x}(a) = n / a + \sum_{i=1}^n \ln x_i$$. The derivative is 0 when $$a = -n \big/ \sum_{i=1}^n \ln x_i$$. Finally, $$\frac{d^2}{da^2} \ln L_\bs{x}(a) = -n / a^2 \lt 0$$, so the maximum occurs at the critical point.
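A grid search over the beta log-likelihood should agree with the closed-form estimator $$W$$. A Python sketch (the sample of proportions is ours):

```python
import math

def beta_loglik(xs, a):
    # ln L = n ln a + (a - 1) * sum ln x_i, for the beta(a, 1) density
    n = len(xs)
    s = sum(math.log(x) for x in xs)
    return n * math.log(a) + (a - 1) * s

xs = [0.9, 0.7, 0.95, 0.8, 0.6]
w = -len(xs) / sum(math.log(x) for x in xs)  # closed-form MLE
grid = [i / 1000 for i in range(1, 20001)]
a_hat = max(grid, key=lambda a: beta_loglik(xs, a))
print(a_hat, w)
```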

Recall that when $$b = 1$$, the method of moments estimator of $$a$$ is $$U_1 = M \big/ (1 - M)$$, but when $$b \in (0, \infty)$$ is also unknown, the method of moments estimator of $$a$$ is $$U = M (M - M_2) \big/ (M_2 - M^2)$$. When $$b = 1$$, which estimator is better, the method of moments estimator or the maximum likelihood estimator?

In the beta estimation experiment, set $$b = 1$$. Run the experiment 1000 times for several values of the sample size $$n$$ and the parameter $$a$$. In each case, compare the estimators $$U$$, $$U_1$$ and $$W$$. Which estimator seems to work better in terms of mean square error?

#### The Pareto Distribution

Recall that the Pareto distribution with shape parameter $$a \gt 0$$ and scale parameter $$b \gt 0$$ has probability density function $g(x) = \frac{a b^a}{x^{a+1}}, \quad b \le x \lt \infty$ The Pareto distribution, named for Vilfredo Pareto, is a heavy-tailed distribution often used to model income and certain other types of random variables. It is studied in more detail in the chapter on Special Distributions.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the Pareto distribution with unknown shape parameter $$a \in (0, \infty)$$ and scale parameter $$b \in (0, \infty)$$. The maximum likelihood estimator of $$b$$ is $$X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}$$, the first order statistic. The maximum likelihood estimator of $$a$$ is $U = \frac{n}{\sum_{i=1}^n \ln X_i - n \ln X_{(1)}} = \frac{n}{\sum_{i=1}^n \left(\ln X_i - \ln X_{(1)}\right)}$

Proof:

Note that $$\ln g(x) = \ln a + a \ln b - (a + 1) \ln x$$ for $$x \in [b, \infty)$$. Hence the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is $\ln L_\bs{x}(a, b) = n \ln a + n a \ln b - (a + 1) \sum_{i=1}^n \ln x_i; \quad 0 \lt a \lt \infty, \, 0 \lt b \le x_i \text{ for each } i \in \{1, 2, \ldots, n\}$ Equivalently, the domain is $$0 \lt a \lt \infty$$ and $$0 \lt b \le x_{(1)}$$. Note that $$\ln L_{\bs{x}}(a, b)$$ is increasing in $$b$$ for each $$a$$, and hence is maximized when $$b = x_{(1)}$$ for each $$a$$. Next, $\frac{d}{d a} \ln L_{\bs{x}}\left(a, x_{(1)}\right) = \frac{n}{a} + n \ln x_{(1)} - \sum_{i=1}^n \ln x_i$ The derivative is 0 when $$a = n \big/ \left(\sum_{i=1}^n \ln x_i - n \ln x_{(1)}\right)$$. Finally, $$\frac{d^2}{da^2} \ln L_\bs{x}\left(a, x_{(1)}\right) = -n / a^2 \lt 0$$, so the maximum occurs at the critical point.
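The two-step argument above (first set $$b = x_{(1)}$$, then maximize over $$a$$) can be checked numerically against the closed-form estimator. A Python sketch with made-up data:

```python
import math

def pareto_loglik(xs, a, b):
    # ln L = n ln a + n a ln b - (a + 1) * sum ln x_i, valid for b <= min(xs)
    n = len(xs)
    s = sum(math.log(x) for x in xs)
    return n * math.log(a) + n * a * math.log(b) - (a + 1) * s

xs = [1.4, 2.7, 1.1, 5.3, 1.8]
b_hat = min(xs)                                         # first order statistic
a_mle = len(xs) / sum(math.log(x / b_hat) for x in xs)  # closed form
grid = [i / 1000 for i in range(1, 10001)]
a_hat = max(grid, key=lambda a: pareto_loglik(xs, a, b_hat))
print(a_hat, a_mle)
```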

Recall that if $$a \gt 2$$, the method of moments estimators of $$a$$ and $$b$$ are $1 + \sqrt{\frac{M_2}{M_2 - M^2}}, \; \frac{M_2}{M} \left(1 - \sqrt{\frac{M_2 - M^2}{M_2}}\right)$

Open the Pareto estimation experiment. Run the experiment 1000 times for several values of the sample size $$n$$ and the parameters $$a$$ and $$b$$. Compare the method of moments and maximum likelihood estimators. Which estimators seem to work better in terms of bias and mean square error?

Often the scale parameter in the Pareto distribution is known.

Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the Pareto distribution with unknown shape parameter $$a \in (0, \infty)$$ and known scale parameter $$b \in (0, \infty)$$. The maximum likelihood estimator of $$a$$ is $U = \frac{n}{\sum_{i=1}^n \ln X_i - n \ln b} = \frac{n}{\sum_{i=1}^n \left(\ln X_i - \ln b \right)}$

Proof:

Modifying the previous proof, the log-likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is $\ln L_\bs{x}(a) = n \ln a + n a \ln b - (a + 1) \sum_{i=1}^n \ln x_i, \quad 0 \lt a \lt \infty$ The derivative is $\frac{d}{d a} \ln L_{\bs{x}}(a) = \frac{n}{a} + n \ln b - \sum_{i=1}^n \ln x_i$ The derivative is 0 when $$a = n \big/ \left(\sum_{i=1}^n \ln x_i - n \ln b\right)$$. Finally, $$\frac{d^2}{da^2} \ln L_\bs{x}(a) = -n / a^2 \lt 0$$, so the maximum occurs at the critical point.

#### Uniform Distributions

In this section we will study estimation problems related to the uniform distribution that are a good source of insight and counterexamples. In a sense, our first estimation problem is the continuous analogue of an estimation problem studied in the section on Order Statistics in the chapter Finite Sampling Models. Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the uniform distribution on the interval $$[0, h]$$, where $$h \in (0, \infty)$$ is an unknown parameter. Thus, the sampling distribution has probability density function $g(x) = \frac{1}{h}, \quad x \in [0, h]$ First let's review results from the last section.

The method of moments estimator of $$h$$ is $$U = 2 M$$. The estimator $$U$$ satisfies the following properties:

1. $$U$$ is unbiased.
2. $$\var(U) = \frac{h^2}{3 n}$$ so $$U$$ is consistent.

Now let's find the maximum likelihood estimator.

The maximum likelihood estimator of $$h$$ is $$X_{(n)} = \max\{X_1, X_2, \ldots, X_n\}$$, the $$n$$th order statistic. The estimator $$X_{(n)}$$ satisfies the following properties:

1. $$\E\left(X_{(n)}\right) = \frac{n}{n + 1} h$$
2. $$\bias\left(X_{(n)}\right) = -\frac{h}{n+1}$$ so that $$X_{(n)}$$ is negatively biased but asymptotically unbiased.
3. $$\var\left(X_{(n)}\right) = \frac{n}{(n+2)(n+1)^2} h^2$$
4. $$\mse\left(X_{(n)}\right) = \frac{2}{(n+1)(n+2)}h^2$$ so that $$X_{(n)}$$ is consistent.
Proof:

The likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is $$L_\bs{x}(h) = 1 / h^n$$ for $$h \ge x_i$$ for each $$i \in \{1, 2, \ldots, n\}$$. The domain is equivalent to $$h \ge x_{(n)}$$. The function $$h \mapsto 1 / h^n$$ is decreasing, and so the maximum occurs at the smallest value, namely $$x_{(n)}$$. Parts (a) and (c) are restatements of results from the section on order statistics. Parts (b) and (d) follow from (a) and (c).
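The negative bias of $$X_{(n)}$$ shows up clearly in simulation: the average of many sample maxima should fall near $$\frac{n}{n+1} h$$, below $$h$$. A Python sketch (our own choices of $$h$$, $$n$$, and replication count):

```python
import random

random.seed(1)
h, n, reps = 1.0, 10, 5000
# simulate the sample maximum many times
maxima = [max(random.uniform(0, h) for _ in range(n)) for _ in range(reps)]
avg = sum(maxima) / reps
print(avg, n / (n + 1) * h)  # empirical mean vs. theoretical n h / (n + 1)
```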

Since the expected value of $$X_{(n)}$$ is a known multiple of the parameter $$h$$, we can easily construct an unbiased estimator.

Let $$V = \frac{n+1}{n} X_{(n)}$$. The estimator $$V$$ satisfies the following properties:

1. $$V$$ is unbiased.
2. $$\var(V) = \frac{h^2}{n(n + 2)}$$ so that $$V$$ is consistent.
3. The asymptotic relative efficiency of $$V$$ to $$U$$ is infinite.
Proof:

Parts (a) and (b) follow from the previous result and basic properties of the expected value and variance. For part (c), $\frac{\var(U)}{\var(V)} = \frac{h^2 / 3 n}{h^2 / n (n + 2)} = \frac{n + 2}{3} \to \infty \text{ as } n \to \infty$
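The variance comparison can also be seen in a quick simulation: the empirical mean square errors of $$U = 2M$$ and $$V = \frac{n+1}{n} X_{(n)}$$ should be close to $$h^2 / 3n$$ and $$h^2 / n(n+2)$$, respectively. A Python sketch (parameter choices are ours):

```python
import random

random.seed(2)
h, n, reps = 1.0, 10, 5000
mse_u = mse_v = 0.0
for _ in range(reps):
    xs = [random.uniform(0, h) for _ in range(n)]
    u = 2 * sum(xs) / n            # method of moments estimator
    v = (n + 1) / n * max(xs)      # unbiased version of the MLE
    mse_u += (u - h) ** 2
    mse_v += (v - h) ** 2
mse_u /= reps
mse_v /= reps
print(mse_u, mse_v)  # roughly h^2/(3n) vs. h^2/(n(n+2))
```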

The last part shows that the unbiased version $$V$$ of the maximum likelihood estimator is a much better estimator than the method of moments estimator $$U$$. In fact, an estimator such as $$V$$, whose mean square error decreases on the order of $$\frac{1}{n^2}$$, is called super efficient. Now, having found a really good estimator, let's see if we can find a really bad one. A natural candidate is an estimator based on $$X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}$$, the first order statistic. The next result will make the computations very easy.

The sample $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ satisfies the following properties:

1. $$h - X_i$$ is uniformly distributed on $$[0, h]$$ for each $$i$$.
2. $$(h - X_1, h - X_2, \ldots, h - X_n)$$ is also a random sample from the uniform distribution on $$[0, h]$$.
3. $$X_{(1)}$$ has the same distribution as $$h - X_{(n)}$$.
Proof:
1. This is a simple consequence of the fact that uniform distributions are preserved under linear transformations on the random variable.
2. This follows from (a) and the fact that if $$\bs{X}$$ is a sequence of independent variables, then so is $$(h - X_1, h - X_2, \ldots, h - X_n)$$.
3. From part (b), $$X_{(1)} = \min\{X_1, X_2, \ldots, X_n\}$$ has the same distribution as $$\min\{h - X_1, h - X_2, \ldots, h - X_n\} = h - \max\{X_1, X_2, \ldots, X_n\} = h - X_{(n)}$$.

Now we can construct our really bad estimator.

Let $$W = (n + 1)X_{(1)}$$. Then

1. $$W$$ is an unbiased estimator of $$h$$.
2. $$\var(W) = \frac{n}{n+2} h^2$$, so $$W$$ is not even consistent.
Proof:

These results follow from the ones above:

1. $$\E(X_{(1)}) = h - \E(X_{(n)}) = h - \frac{n}{n + 1} h = \frac{1}{n + 1} h$$ and hence $$\E(W) = h$$.
2. $$\var(W) = (n + 1)^2 \var(X_{(1)}) = (n + 1)^2 \var(h - X_{(n)}) = (n + 1)^2 \frac{n}{(n + 1)^2 (n + 2)} h^2 = \frac{n}{n + 2} h^2$$.

Run the uniform estimation experiment 1000 times for several values of the sample size $$n$$ and the parameter $$h$$. In each case, compare the empirical bias and mean square error of the estimators with their theoretical values. Rank the estimators in terms of empirical mean square error.

Our next series of exercises will show that the maximum likelihood estimator is not necessarily unique. Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from the uniform distribution on the interval $$[a, a + 1]$$, where $$a \in \R$$ is an unknown parameter. Thus, the sampling distribution has probability density function $g(x) = 1, \quad a \le x \le a + 1$ As usual, let's first review the method of moments estimator.

The method of moments estimator of $$a$$ is $$U = M - \frac{1}{2}$$. The estimator $$U$$ satisfies the following properties:

1. $$U$$ is unbiased.
2. $$\var(U) = \frac{1}{12 n}$$ so $$U$$ is consistent.
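A quick Monte Carlo check of these two properties (a sketch, not part of the text; the choices of $$a$$, $$n$$, and the seed are arbitrary):

```python
# Check that U = M - 1/2 has mean a and variance 1 / (12 n) empirically,
# for the uniform distribution on [a, a + 1].
import random

random.seed(5)
a, n, reps = 4.0, 10, 20000
u = []
for _ in range(reps):
    x = [random.uniform(a, a + 1) for _ in range(n)]
    u.append(sum(x) / n - 0.5)

mean_u = sum(u) / reps
var_u = sum((v - mean_u) ** 2 for v in u) / reps
print(mean_u, var_u, 1 / (12 * n))
```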

However, as promised, there is not a unique maximum likelihood estimator.

Any statistic $$V \in \left[X_{(n)} - 1, X_{(1)}\right]$$ is a maximum likelihood estimator of $$a$$.

Proof:

The likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is $$L_\bs{x}(a) = 1$$ for $$a \le x_i \le a + 1$$ and each $$i \in \{1, 2, \ldots, n\}$$. This domain is equivalent to $$a \le x_{(1)}$$ and $$a \ge x_{(n)} - 1$$. Since the likelihood function is constant on this domain, the result follows.
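The flat likelihood is easy to see numerically. In this sketch (not from the text), `likelihood` is a hypothetical helper implementing $$L_\bs{x}$$, and the sample and candidate values of $$a$$ are arbitrary:

```python
# For the uniform [a, a + 1] model the likelihood is 1 exactly when
# x_(n) - 1 <= a <= x_(1), and 0 otherwise, so every point of that
# interval is a maximizer.
def likelihood(a, xs):
    return 1.0 if max(xs) - 1 <= a <= min(xs) else 0.0

xs = [0.3, 0.9, 0.7]                      # x_(1) = 0.3, x_(n) = 0.9
vals = [likelihood(a, xs) for a in (-0.05, 0.0, 0.2, 0.3, 0.5)]
print(vals)
```

The first four candidates lie in $$[x_{(n)} - 1, x_{(1)}]$$ and all attain the maximum value 1; the last does not.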

For completeness, let's consider the full estimation problem. Suppose that $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample of size $$n$$ from the uniform distribution on $$[a, a + h]$$ where $$a \in \R$$ and $$h \in (0, \infty)$$ are both unknown. Here's the result from the last section:

Let $$U$$ and $$V$$ denote the method of moments estimators of $$a$$ and $$h$$, respectively. Then $U = M - \sqrt{3} T, \quad V = 2 \sqrt{3} T$ where $$M = \frac{1}{n} \sum_{i=1}^n X_i$$ is the sample mean and $$T = \sqrt{\frac{1}{n} \sum_{i=1}^n (X_i - M)^2}$$ is the square root of the biased version of the sample variance. These follow from solving the moment equations $$M = a + h / 2$$ and $$T^2 = h^2 / 12$$.
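A simulation sketch (not part of the text; parameters and seed are arbitrary) of the estimators obtained by solving the moment equations $$M = a + h/2$$ and $$T^2 = h^2/12$$ directly:

```python
# Monte Carlo check of the method of moments estimators for the uniform
# [a, a + h] model: U = M - sqrt(3) T and V = 2 sqrt(3) T.
import math
import random

random.seed(2)
a, h, n, reps = 1.0, 2.0, 200, 3000
u_sum = v_sum = 0.0
for _ in range(reps):
    x = [random.uniform(a, a + h) for _ in range(n)]
    m = sum(x) / n
    t = math.sqrt(sum((xi - m) ** 2 for xi in x) / n)
    u_sum += m - math.sqrt(3) * t
    v_sum += 2 * math.sqrt(3) * t

print(u_sum / reps, v_sum / reps)   # approximately a and h
```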

It should come as no surprise at this point that the maximum likelihood estimators are functions of the largest and smallest order statistics.

The maximum likelihood estimators of $$a$$ and $$h$$ are $$U = X_{(1)}$$ and $$V = X_{(n)} - X_{(1)}$$, respectively.

1. $$\E(U) = a + \frac{h}{n + 1}$$ so $$U$$ is positively biased and asymptotically unbiased.
2. $$\E(V) = h \frac{n - 1}{n + 1}$$ so $$V$$ is negatively biased and asymptotically unbiased.
3. $$\var(U) = h^2 \frac{n}{(n + 1)^2 (n + 2)}$$ so $$U$$ is consistent.
4. $$\var(V) = h^2 \frac{2(n - 1)}{(n + 1)^2(n + 2)}$$ so $$V$$ is consistent.
Proof:

The likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n)$$ is $$L_\bs{x}(a, h) = \frac{1}{h^n}$$ for $$a \le x_i \le a + h$$ and $$i \in \{1, 2, \ldots, n\}$$. The domain is equivalent to $$a \le x_{(1)}$$ and $$a + h \ge x_{(n)}$$. Since the likelihood function depends only on $$h$$ in this domain and is decreasing, the maximum occurs when $$a = x_{(1)}$$ and $$h = x_{(n)} - x_{(1)}$$. Parts (a)–(d) follow from standard results for the order statistics from the uniform distribution.
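The bias formulas in parts (a) and (b) can be checked by simulation (a sketch, not part of the text; the parameter values and seed are arbitrary):

```python
# Empirical check that U = X_(1) and V = X_(n) - X_(1) have means
# a + h / (n + 1) and h (n - 1) / (n + 1) for the uniform [a, a + h] model.
import random

random.seed(3)
a, h, n, reps = 2.0, 3.0, 10, 20000
u_vals, v_vals = [], []
for _ in range(reps):
    x = [random.uniform(a, a + h) for _ in range(n)]
    u_vals.append(min(x))
    v_vals.append(max(x) - min(x))

mean_u = sum(u_vals) / reps
mean_v = sum(v_vals) / reps
print(mean_u, a + h / (n + 1))        # E(U) = a + h / (n + 1)
print(mean_v, h * (n - 1) / (n + 1))  # E(V) = h (n - 1) / (n + 1)
```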

#### The Hypergeometric Model

In all of our previous examples, the sequence of observed random variables $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ is a random sample from a distribution. However, maximum likelihood is a very general method that does not require the observation variables to be independent or identically distributed. In the hypergeometric model, we have a population of $$N$$ objects, with $$r$$ of the objects type 1 and the remaining $$N - r$$ objects type 0. The population size $$N$$ is a positive integer and the type 1 size $$r$$ is a nonnegative integer with $$r \le N$$. These are the basic parameters, and typically one or both are unknown. Here are some typical examples:

1. The objects are devices, classified as good or defective.
2. The objects are persons, classified as female or male.
3. The objects are voters, classified as for or against a particular candidate.
4. The objects are wildlife of a particular type, either tagged or untagged.

We sample $$n$$ objects from the population at random, without replacement. Let $$X_i$$ be the type of the $$i$$th object selected, so that our sequence of observed variables is $$\bs{X} = (X_1, X_2, \ldots, X_n)$$. The variables are identically distributed indicator variables, with $$P(X_i = 1) = r / N$$ for each $$i \in \{1, 2, \ldots, n\}$$, but are dependent since the sampling is without replacement. The number of type 1 objects in the sample is $$Y = \sum_{i=1}^n X_i$$. This statistic has the hypergeometric distribution with parameters $$N$$, $$r$$, and $$n$$, and has probability density function given by $P(Y = y) = \frac{\binom{r}{y} \binom{N - r}{n - y}}{\binom{N}{n}} = \binom{n}{y} \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad y \in \{\max\{0, n - N + r\}, \ldots, \min\{n, r\}\}$ Recall the falling power notation: $$x^{(k)} = x (x - 1) \cdots (x - k + 1)$$ for $$x \in \R$$ and $$k \in \N$$. The hypergeometric model is studied in more detail in the chapter on Finite Sampling Models.
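The equality of the two expressions for the probability density function, and the support, can be checked numerically (an illustrative sketch, not part of the text; the parameter values are arbitrary):

```python
# Verify that the binomial-coefficient and falling-power forms of the
# hypergeometric PDF agree, and that the PDF sums to 1 over its support.
from math import comb

def falling(x, k):
    """Falling power x^(k) = x (x - 1) ... (x - k + 1)."""
    out = 1
    for j in range(k):
        out *= x - j
    return out

N, r, n = 20, 8, 5
total = 0.0
for y in range(max(0, n + r - N), min(n, r) + 1):
    lhs = comb(r, y) * comb(N - r, n - y) / comb(N, n)
    rhs = comb(n, y) * falling(r, y) * falling(N - r, n - y) / falling(N, n)
    assert abs(lhs - rhs) < 1e-12
    total += lhs
print(total)
```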

As above, let $$\bs{X} = (X_1, X_2, \ldots, X_n)$$ be the observed variables in the hypergeometric model with parameters $$N$$ and $$r$$. Then

1. The maximum likelihood estimator of $$r$$ with $$N$$ known is $$U = \lfloor (N + 1) M \rfloor = \lfloor (N + 1) Y / n \rfloor$$ (truncated to $$N$$ in the extreme case $$Y = n$$).
2. The maximum likelihood estimator of $$N$$ with $$r$$ known is $$V = \lfloor r / M \rfloor = \lfloor r n / Y \rfloor$$ if $$Y \gt 0$$.
Proof:

By a simple application of the multiplication rule, the PDF $$f$$ of $$\bs{X}$$ is $f(\bs{x}) = \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad \bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ where $$y = \sum_{i=1}^n x_i$$.

1. With $$N$$ known, the likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$$ is $L_{\bs{x}}(r) = \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad r \in \{y, y + 1, \ldots, N - n + y\}$ After some algebra, $$L_{\bs{x}}(r - 1) \lt L_{\bs{x}}(r)$$ if and only if $$(r - y)(N - r + 1) \lt r (N - r - n + y + 1)$$, if and only if $$r \lt (N + 1) y / n$$. So the likelihood increases and then decreases, and the maximum of $$L_{\bs{x}}(r)$$ occurs at $$r = \lfloor (N + 1) y / n \rfloor$$ (or at the endpoint $$N$$ when $$y = n$$).
2. Similarly, with $$r$$ known, the likelihood function corresponding to the data $$\bs{x} = (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$$ is $L_{\bs{x}}(N) = \frac{r^{(y)} (N - r)^{(n - y)}}{N^{(n)}}, \quad N \in \{\max\{r, n\}, \ldots\}$ After some algebra, $$L_{\bs{x}}(N - 1) \lt L_{\bs{x}}(N)$$ if and only if $$(N - r - n + y) / (N - n) \lt (N - r) / N$$ if and only if $$N \lt r n / y$$ (assuming $$y \gt 0$$). So the maximum of $$L_{\bs{x}}(N)$$ occurs when $$N = \lfloor r n / y \rfloor$$.
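The floor formulas can be cross-checked by maximizing the likelihood by brute force (an illustrative sketch, not part of the text; the parameter values are arbitrary and chosen to avoid ties in the likelihood):

```python
# Maximize the hypergeometric likelihood directly over one parameter
# and compare with the closed-form maximizers.
def falling(x, k):
    out = 1
    for j in range(k):
        out *= x - j
    return out

def lik(N, r, n, y):
    return falling(r, y) * falling(N - r, n - y) / falling(N, n)

# r unknown, N known: maximize over the support {y, ..., N - n + y}.
N, n, y = 7, 3, 2
r_hat = max(range(y, N - n + y + 1), key=lambda r: lik(N, r, n, y))
print(r_hat, (N + 1) * y // n)  # brute force vs floor((N + 1) y / n)

# N unknown, r known: maximize over N >= max(r, n).
r2, n2, y2 = 10, 4, 3
N_hat = max(range(max(r2, n2), 200), key=lambda N: lik(N, r2, n2, y2))
print(N_hat, r2 * n2 // y2)     # brute force vs floor(r n / y)
```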

In the reliability example (1), we might typically know $$N$$ and would be interested in estimating $$r$$. In the wildlife example (4), we would typically know $$r$$ and would be interested in estimating $$N$$. This example is known as the capture-recapture model.

Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the sampling is with replacement, the Bernoulli trials model with $$p = r / N$$ would apply rather than the hypergeometric model. In addition, if the population size $$N$$ is large compared to the sample size $$n$$, the hypergeometric model is well approximated by the Bernoulli trials model, again with $$p = r / N$$.
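The quality of the approximation can be illustrated numerically (a sketch, not part of the text; the values of $$N$$, $$n$$, and $$p$$ are arbitrary):

```python
# Compare the hypergeometric PDF with the binomial PDF with p = r / N:
# the maximum pointwise gap shrinks as N grows with n fixed.
from math import comb

def hyper_pdf(N, r, n, y):
    return comb(r, y) * comb(N - r, n - y) / comb(N, n)

def binom_pdf(n, p, y):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n = 5
gaps = []
for N in (50, 500, 5000):
    r = N // 2                     # keep p = r / N = 1/2 fixed
    gap = max(abs(hyper_pdf(N, r, n, y) - binom_pdf(n, r / N, y))
              for y in range(n + 1))
    gaps.append(gap)
print(gaps)
```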