\(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\N}{\mathbb{N}}\) \(\newcommand{\bs}{\boldsymbol}\) \(\newcommand{\ms}{\mathscr}\)

1. Introduction

Basic Theory

This introductory section takes a measure-theoretic point of view that is very helpful for the study of joint distributions and for the sections that follow. However, the sections on discrete distributions, continuous distributions, and mixed distributions are largely self-contained, so if you are a new student of probability you can skip to those sections first.

Preliminaries

Recall that a random experiment is modeled by a probability space \((\Omega, \ms F, \P)\) where:

  1. \(\Omega\) is the set of outcomes
  2. \(\ms F\) is the \(\sigma\)-algebra of events
  3. \(\P\) is the probability measure on the sample space \((\Omega, \ms F)\)

So the sample space \((\Omega, \ms F)\) is a measurable space and the probability space \((\Omega, \ms F, \P)\) is a special type of measure space.

If \((S, \ms S)\) is another measurable space and \(X\) a random variable with values in \(S\), then the probability distribution of \(X\) is the probability measure \(P\) defined by \[P(A) = \P(X \in A), \quad A \in \ms S\]

Details:

So \(S\) is a set and \(\ms S\) is a \(\sigma\)-algebra of subsets of \(S\). The random variable \(X\) is a measurable function from \(\Omega\) to \(S\) so that \(\{X \in A\} \in \ms F\) for every \(A \in \ms S\). Thus the definition makes sense.

If we think of recording the value of \(X\) then we have a new random experiment with a new probability space \((S, \ms S, P)\), which has exactly the same mathematical structure as the original \((\Omega, \ms F, \P)\). Conversely, if we let \((S, \ms S) = (\Omega, \ms F)\) and let \(X\) be the identity function on \(\Omega\), so that \(X(\omega) = \omega\) for \(\omega \in \Omega\), then \(X\) is a random variable whose probability distribution is \(\P\). That is, \(\{X \in A\} = A\) and so \(\P(X \in A) = \P(A)\) for \(A \in \ms F\). The point of this is that there is no difference (in this text) between the terms probability measure and probability distribution of a random variable, so we will use whichever one is most appropriate for a given context. If \(X\) is a random variable with values in \(S\) and our interest is in the probability distribution of \(X\) on \((S, \ms S)\), the underlying sample space \((\Omega, \ms F)\) on which \(X\) is defined is often of little importance and not even referenced.
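
For a simple illustration, suppose that a fair coin is tossed twice, so that \(\Omega = \{hh, ht, th, tt\}\), \(\ms F\) is the collection of all subsets of \(\Omega\), and \(\P\) assigns probability \(\frac{1}{4}\) to each outcome. If \(X\) is the number of heads, then \(X\) takes values in \(S = \{0, 1, 2\}\) (with \(\ms S\) the collection of all subsets of \(S\)), and the distribution \(P\) of \(X\) is given by \[P(\{0\}) = \frac{1}{4}, \quad P(\{1\}) = \frac{1}{2}, \quad P(\{2\}) = \frac{1}{4}\]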

Our starting point in this chapter is a measurable space \((S, \ms S)\), and our interest is in probability distributions on this space. One of the goals is to study how probability measures are constructed from other mathematical objects. One of the main ways is by means of a type of function known as a probability density function, and this requires that we start with the more general setting below.

For the remainder of this section, we assume that \((S, \ms S, \mu)\) is a \(\sigma\)-finite, positive measure space.

Details:

So again, \(S\) is a set, \(\ms S\) is a \(\sigma\)-algebra of subsets of \(S\), and \(\mu\) is a positive measure on \((S, \ms S)\). Recall that \(\mu\) is nonnegative, countably additive, and satisfies \(\mu(\emptyset) = 0\), just like a probability measure. But unlike a probability measure, \(\mu(S)\) need not be 1 (and indeed may well be \(\infty\)). The statement that the measure space \((S, \ms S, \mu)\) is \(\sigma\)-finite means that there exists a countable collection \(\{A_i: i \in I\}\) of sets in \(\ms S\) with \(\bigcup_{i \in I} A_i = S\) and \(\mu(A_i) \lt \infty\) for \(i \in I\).
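
For example, counting measure on a countable set is \(\sigma\)-finite, since the singletons \(\{x\}\) for \(x \in S\) form such a collection, and Lebesgue measure \(\lambda\) on \(\R\) is \(\sigma\)-finite, since \(\R = \bigcup_{n \in \N_+} [-n, n]\) and \(\lambda([-n, n]) = 2 n \lt \infty\) for each \(n\). Both of these measures are discussed below.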

Why start with this more general setting? Because most measurable spaces come with natural positive measures on them. The following discussion gives the two most important special cases.

\((S, \ms S, \#)\) is a discrete measure space if \(S\) is countable, \(\ms S = \ms P(S)\) is the collection of all subsets of \(S\), and \(\#\) is counting measure.

Discrete measure spaces are the simplest kind and lead to discrete probability theory. Every subset of \(S\) is measurable, as is every function defined on \(S\), so measure theory is not really necessary.

\((\R^n, \ms R^n, \lambda^n)\) is the \(n\)-dimensional Euclidean measure space for \(n \in \N_+\), where \(\ms R^n\) is the \(\sigma\)-algebra of Borel measurable subsets of \(\R^n\) and \(\lambda^n\) is Lebesgue measure.

  1. \(\ms R^n\) is generated by sets of the form \(I_1 \times I_2 \times \cdots \times I_n\) where \(I_j \subseteq \R\) is an interval for each \(j \in \{1, 2, \ldots, n\}\).
  2. \(\lambda^n\) is the \(n\)-fold product measure associated with \(\lambda = \lambda^1\).
  3. \(\lambda\) is length on sets in \(\ms R^1\).
  4. \(\lambda^2\) is area on sets in \(\ms R^2\).
  5. \(\lambda^3\) is volume on sets in \(\ms R^3\).
  6. Generally, \(\lambda^n\) can be thought of as \(n\)-dimensional volume on sets in \(\ms R^n\).

So \(\ms R^n\) is generated by the collection of \(n\)-dimensional boxes. Often we are interested in a subspace \((S, \ms S, \lambda^n)\) of \((\R^n, \ms R^n, \lambda^n)\) with \(\lambda^n(S) \gt 0\).
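
For instance, for a two-dimensional box (a rectangle), Lebesgue measure is simply the familiar area: \[\lambda^2\left([a, b] \times [c, d]\right) = (b - a)(d - c)\] for \(a \le b\) and \(c \le d\).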

Associated with the measure space \((S, \ms S, \mu)\) above is an integral. The integral of a measurable function \(f: S \to \R\) over a set \(A \in \ms S\), if it exists, is denoted \[\int_A f(x) \, d\mu(x)\]

The main thing to keep in mind is that the integral satisfies the usual properties that you are familiar with from calculus. Here's what the integral looks like for our two special cases:

If \((S, \ms S, \#)\) is a discrete measure space, then integrals are just sums. That is, if \(f: S \to \R\) and \(A \in \ms S\) then \[\int_A f(x) \, d\#(x) = \sum_{x \in A} f(x)\] assuming that the sum makes sense, so that either the sum of the positive terms is finite or the sum of the negative terms is finite.
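
As a quick illustration, with a function chosen just for concreteness, suppose that \(S = \N\), \(f(x) = \left(\frac{1}{2}\right)^{x+1}\) for \(x \in \N\), and \(A = \{0, 1, 2\}\). Then \[\int_A f(x) \, d\#(x) = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = \frac{7}{8}\]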

For the \(n\)-dimensional Euclidean space, the integral is the Lebesgue integral, denoted \[\int_A f(x) \, dx\] The Lebesgue integral is an extension of the ordinary Riemann integral of calculus.

These two special cases alone justify the introduction of the abstract integral. Otherwise, we would have to state and prove many of the theorems in this chapter (at least) twice. The assumption that the measure space \((S, \ms S, \mu)\) is \(\sigma\)-finite is needed so that we can write double integrals as iterated integrals and interchange the order of integration. It is also needed for the existence of probability density functions, which we discuss next.

Probability Density Functions

Here is the definition that we have been building toward.

A measurable function \(f: S \to [0, \infty)\) that satisfies \(\int_S f(x) \, d\mu(x) = 1\) is a probability density function relative to \((S, \ms S, \mu)\). Then the following formula defines a probability distribution on \((S, \ms S)\): \[P(A) = \int_A f(x) \, d\mu(x), \quad A \in \ms S\]

Details:

The proof relies on basic properties of the integral.

  1. Since \(f\) is nonnegative, \(P(A) = \int_A f(x) \, d\mu(x) \ge 0\) for \(A \in \ms S\).
  2. \(P(S) = \int_S f(x) \, d\mu(x) = 1\) by assumption.
  3. If \(\{A_i: i \in I\}\) is a countable, disjoint collection of sets in \(\ms S\) and \(A = \bigcup_{i \in I} A_i\), then by the additivity of the integral over disjoint domains, \[P(A) = \int_A f(x) \, d\mu(x) = \sum_{i \in I} \int_{A_i} f(x) \, d\mu(x) = \sum_{i \in I} P(A_i)\]
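
For example, with a density chosen just for illustration, let \(S = [0, \infty)\) with Lebesgue measure, and let \(f(x) = e^{-x}\) for \(x \in S\). Then \(f\) is nonnegative and measurable and \(\int_0^\infty e^{-x} \, dx = 1\), so \(f\) is a probability density function. The corresponding distribution satisfies \[P([0, 1]) = \int_0^1 e^{-x} \, dx = 1 - e^{-1}\] This is the standard exponential distribution.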

The following theorem gives a converse. For details and proofs, see the section on absolute continuity and density functions.

Suppose that \(P\) is a probability measure on \((S, \ms S)\). Then \(P\) is absolutely continuous with respect to \(\mu\) if \(\mu(A) = 0\) implies \(P(A) = 0\) for \(A \in \ms S\). In this case (and only this case) \(P\) has a probability density function \(f: S \to [0, \infty)\) so that \[P(A) = \int_A f(x) \, d\mu(x), \quad A \in \ms S\]
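
For an example of a distribution without a density, consider the point mass at \(0\) on \((\R, \ms R^1)\), that is, the distribution \(P\) with \(P(A) = 1\) if \(0 \in A\) and \(P(A) = 0\) otherwise. This distribution is not absolutely continuous with respect to Lebesgue measure \(\lambda\), since \(\lambda(\{0\}) = 0\) but \(P(\{0\}) = 1\), and so it has no density function relative to \(\lambda\).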

So the absolute continuity of \(P\) means that a null set of \(\mu\) must also be a null set of \(P\). It's easy to see that absolute continuity is necessary for the existence of a density function, since the integral over a set of measure 0 is 0. The interesting fact is that the condition is sufficient as well. For a discrete measurable space \((S, \ms S)\), the only null set of counting measure \(\#\) is \(\emptyset\), which is trivially also a null set of \(P\). So a probability measure on a discrete space always has a density function, and it's easy to see how the density function is defined.

Suppose that \(P\) is a probability measure on a discrete measurable space \((S, \ms S)\). Then the density function \(f\) of \(P\) relative to \(\#\) is defined by \(f(x) = P(\{x\})\) for \(x \in S\) so that \[P(A) = \sum_{x \in A} f(x), \quad A \subseteq S\]

The displayed equation is an immediate consequence of the definition of \(f\) and the countable additivity of \(P\).
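
For example, suppose that \(S = \{1, 2, 3, 4, 5, 6\}\) and that \(P\) is the distribution of the score of a fair die, so that \(f(x) = P(\{x\}) = \frac{1}{6}\) for \(x \in S\). Then the probability of an even score is \[P(\{2, 4, 6\}) = \sum_{x \in \{2, 4, 6\}} f(x) = \frac{1}{2}\]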

The section on discrete distributions gives lots of examples. For a general measure space \((S, \ms S, \mu)\), a density function of a probability measure \(P\) relative to \(\mu\) (if it exists) is unique only up to a null set of \(\mu\). That is, if \(f\) and \(g\) are density functions of \(P\) relative to \(\mu\) then \[\mu\{x \in S: f(x) \ne g(x)\} = 0\] This is because only integrals of the density function are important, as you can see from the definition above.

Suppose that \(g: S \to [0, \infty)\) is measurable and let \(c = \int_S g \, d\mu\). If \(c \in (0, \infty)\) then the function \(f\) defined by \(f(x) = g(x) / c\) for \(x \in S\) is a probability density function.

Details:

Note that by the scaling property of the integral, \(\int_S f \, d\mu = 1\).

This result is often used to construct probability measures with desired properties. The constant \(c\) is often referred to as the normalizing constant.
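
For example, with a function chosen just for illustration, let \(g(x) = x (1 - x)\) for \(x \in [0, 1]\), with Lebesgue measure. Then \[c = \int_0^1 x (1 - x) \, dx = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}\] so \(f(x) = 6 x (1 - x)\) for \(x \in [0, 1]\) is a probability density function.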

Suppose that \(P\) is a probability distribution on \((S, \ms S)\) that has density function \(f\). A point \(x \in S\) that maximizes \(f\) is a mode of the distribution.

When a probability distribution has a unique mode (and especially in the discrete setting), the mode is sometimes used as a measure of center of the distribution. More important measures of center for a real-valued random variable will be studied later.
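
For example, the density \(f(x) = 6 x (1 - x)\) for \(x \in [0, 1]\) constructed above has a unique mode at \(x = \frac{1}{2}\), since \(f^\prime(x) = 6 - 12 x\) is positive for \(x \lt \frac{1}{2}\) and negative for \(x \gt \frac{1}{2}\).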

Conditional Distributions

Suppose again that \((S, \ms S, \mu)\) is a \(\sigma\)-finite, positive measure space as above, and that \(X\) is a random variable with values in \(S\), defined on an underlying probability space \((\Omega, \ms F, \P)\) as in the preliminaries. The conditional distribution of \(X\) given an event \(E\) is defined just as the (unconditional) distribution of \(X\), but using conditional probability. No new ideas are involved.

If \(E \in \ms F\) is an event with \(\P(E) \gt 0\), then the conditional distribution of \(X\) given \(E\) is the probability measure on \((S, \ms S)\) defined by \[\P(X \in A \mid E) = \frac{\P(X \in A, E)}{\P(E)}, \quad A \in \ms S\]

The comma in the numerator acts just like the intersection symbol: \(\P(X \in A, E)\) means \(\P\left(\{X \in A\} \cap E\right)\).

If the distribution of \(X\) has a density function \(f\) with respect to \(\mu\) then the conditional distribution of \(X\) given \(E\) also has a density, which we denote by \(f(\cdot \mid E)\).

Details:

Trivially, if the distribution of \(X\) is absolutely continuous with respect to \(\mu\) then so is the conditional distribution given \(E\): if \(\mu(A) = 0\) then \(\P(X \in A) = 0\), so \(\P(X \in A, E) = 0\) and hence \(\P(X \in A \mid E) = 0\) as well.

We can't give an explicit formula for \(f(\cdot \mid E)\) in terms of \(f\) without knowing more about the event \(E\). But we can when \(E\) is an event defined in terms of \(X\) itself.

Suppose again that \(X\) has density function \(f\). If \(B \in \ms S\) with \(\P(X \in B) \gt 0\) then the conditional distribution of \(X\) given \(X \in B\) has density function \[f(x \mid X \in B) = \frac{f(x)}{\P(X \in B)} = \frac{f(x)}{\int_B f(y) \, d\mu(y)}, \quad x \in B\]

Details:

Suppose that \(A \in \ms S\) with \(A \subseteq B\). Then \[\P(X \in A \mid X \in B) = \frac{\P(X \in A)}{\P(X \in B)} = \frac{1}{\P(X \in B)} \int_A f(x) \, d\mu(x) = \int_A \frac{f(x)}{\P(X \in B)} \, d\mu(x)\] So by definition, \(x \mapsto f(x) / \P(X \in B)\) is a density of the conditional distribution of \(X\) given \(X \in B\).
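
For example, with a density again chosen just for illustration, suppose that \(X\) has the standard exponential density \(f(x) = e^{-x}\) for \(x \in S = [0, \infty)\) (relative to Lebesgue measure), and let \(B = [1, \infty)\). Then \(\P(X \in B) = \int_1^\infty e^{-x} \, dx = e^{-1}\), so \[f(x \mid X \in B) = \frac{e^{-x}}{e^{-1}} = e^{-(x - 1)}, \quad x \in [1, \infty)\] That is, given \(X \ge 1\), the conditional distribution of \(X\) is simply the original distribution shifted to the right by 1, a manifestation of the memoryless property of the exponential distribution.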

The section on conditional distributions generalizes this idea.