\(\newcommand{\var}{\text{var}}\)
\(\newcommand{\sd}{\text{sd}}\)
\(\newcommand{\cov}{\text{cov}}\)
\(\newcommand{\cor}{\text{cor}}\)
\(\renewcommand{\P}{\mathbb{P}}\)
\(\newcommand{\E}{\mathbb{E}}\)
\(\newcommand{\R}{\mathbb{R}}\)
\(\newcommand{\N}{\mathbb{N}}\)
\( \newcommand{\bs}{\boldsymbol} \)

The goal of this section is to study a type of mathematical object that arises naturally in the context of conditional expected value and parametric distributions, and is of fundamental importance in the study of stochastic processes, particularly Markov processes. In a sense, the main object of study in this section is a generalization of a matrix, and the operations are generalizations of matrix operations. If you keep this in mind, this section may seem less abstract.

Recall that a measurable space \( (S, \mathscr{S}) \) consists of a set \( S \) and a \( \sigma \)-algebra \( \mathscr{S} \) of subsets of \( S \). If \( S \) is countable, then \( \mathscr{S} \) is usually the power set of \( S \), the collection of all subsets of \( S \). If \( S \) is uncountable, then typically \( S \) is a topological space (a subset of a Euclidean space, for example), and \( \mathscr{S} \) is the Borel \( \sigma \)-algebra, the \( \sigma \)-algebra generated by the open subsets of \( S \). A nice set of assumptions that we will use in this section is that \( S \) has a topology that is locally compact, Hausdorff, and has a countable base (LCCB), and that \( \mathscr{S} \) is the Borel \( \sigma \)-algebra. (See the section on topology to review the definitions.) These assumptions are general enough to encompass most measurable spaces that occur in probability and yet are restrictive enough to allow a nice mathematical theory. In particular, a countable set \( S \) with the discrete topology satisfies the assumptions and the corresponding Borel \( \sigma \)-algebra \( \mathscr{S} \) is the power set. In this case, every function from \( S \) to another measurable space is measurable, and every function from \( S \) to another topological space is continuous. For \( k \in \N_+ \), the Euclidean space \( \R^k \) also satisfies the assumptions, and in this case the Borel \( \sigma \)-algebra is the usual collection of measurable sets.

Let \( \mathscr{B}(S) \) denote the collection of bounded, measurable functions \( f: S \to \R \). Under the usual operations of pointwise addition and scalar multiplication, \( \mathscr{B}(S) \) is a vector space, and the natural norm on this space is the supremum norm \[ \| f \| = \sup\{\left|f(x)\right|: x \in S\}, \quad f \in \mathscr{B}(S) \] This vector space plays an important role.

In this section, it is sometimes more natural to write integrals with respect to a positive measure with the differential *before* the integrand, rather than after. However, rest assured that this is mere notation; the meaning of the integral is the same. Thus, if \( \mu \) is a positive measure on \( (S, \mathscr{S}) \) and \( f: S \to \R \) is a measurable, real-valued function, we may write the integral of \( f \) with respect to \( \mu \) in operator notation as
\[ \mu f = \int_S \mu(dx) f(x) \]
assuming, as usual, that the integral exists. If \( \mu \) is a probability measure on \( S \), then we can think of \( f \) as a real-valued random variable, in which case our new notation is not too far from our traditional \( \E(f) \). Our main definition comes next.

Suppose that \( (S, \mathscr{S}) \) and \( (T, \mathscr{T}) \) are measurable spaces. A kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \) is a function \( K: S \times \mathscr{T} \to [0, \infty] \) such that

- \( x \mapsto K(x, A) \) is measurable for each \( A \in \mathscr{T} \).
- \( A \mapsto K(x, A) \) is a positive measure on \( \mathscr{T} \) for each \( x \in S \).

If \( (T, \mathscr{T}) = (S, \mathscr{S}) \), then \( K \) is said to be a kernel on \( (S, \mathscr{S}) \).

There are several classes of kernels that deserve special names.

Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Then

- \( K \) is finite if \( K(x, T) \lt \infty \) for every \( x \in S \).
- \( K \) is bounded if \( K(x, T) \) is bounded in \( x \in S \), and in this case we define \( \|K\| = \sup\{K(x, T): x \in S\} \).
- \( K \) is a probability kernel if \( K(x, T) = 1 \) for every \( x \in S \).

So a probability kernel is bounded, and a bounded kernel is finite. The terms stochastic kernel and Markov kernel are also used for probability kernels, and for a probability kernel \( \|K\| = 1 \) of course. The terms are consistent with terms used for measures: \( K \) is a finite kernel if and only if \( K(x, \cdot) \) is a finite measure for each \( x \in S \), and \( K \) is a probability kernel if and only if \( K(x, \cdot) \) is a probability measure for each \( x \in S \). A kernel defines two natural integral operators, by operating on the *left* with measures, and by operating on the *right* with functions. As usual, we are often a bit casual with the question of existence. Basically in this section, we assume that any integrals mentioned exist.

Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \).

- If \( \mu \) is a positive measure on \( (S, \mathscr{S}) \), then \( \mu K \) defined as follows is a positive measure on \( (T, \mathscr{T}) \): \[ \mu K(A) = \int_S \mu(dx) K(x, A), \quad A \in \mathscr{T} \]
- If \( f: T \to \R \) is measurable, then \( K f: S \to \R \) defined as follows is measurable: \[K f(x) = \int_T K(x, dy) f(y), \quad x \in S\]

- Clearly \( \mu K(A) \ge 0 \) for \( A \in \mathscr{T} \). Suppose that \( \{A_j: j \in J\} \) is a countable collection of disjoint sets in \( \mathscr{T} \) and \( A = \bigcup_{j \in J} A_j \). Then \begin{align*} \mu K(A) & = \int_S \mu(dx) K(x, A) = \int_S \mu(dx) \left(\sum_{j \in J} K(x, A_j) \right) \\ & = \sum_{j \in J} \int_S \mu(dx) K(x, A_j) = \sum_{j \in J} \mu K(A_j) \end{align*} The interchange of sum and integral is justified since the terms are nonnegative.
- The measurability of \( K f \) follows from the measurability of \( f \) and of \( x \mapsto K(x, A) \) for \( A \in \mathscr{T} \), and from basic properties of the integral.

Thus, a kernel transforms measures on \( (S, \mathscr{S}) \) into measures on \( (T, \mathscr{T}) \), and transforms measurable functions from \( T \) to \( \R \) into measurable functions from \( S \) to \( \R \). Part (b) assumes of course that \( (K f)(x) \) exists for \( x \in S \). This will be the case if \( f \) is nonnegative (although the integral may be infinite) or if \( f \) is integrable with respect to the measure \( K(x, \cdot) \) for every \( x \in S \). In particular, the last statement will hold in the following important special case:

Suppose again that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \) and that \( f \in \mathscr{B}(T) \).

- If \( K \) is finite, then \( K f \) exists and \( \left|Kf(x)\right| \le \|f\| K(x, T) \) for \( x \in S \).
- If \( K \) is bounded, then \( K f \in \mathscr{B}(S) \) and \( \|K f\| \le \|K\| \|f\| \).

- Note that \( K \left|f\right|(x) = \int_T K(x, dy) \left|f(y)\right| \le \int_T K(x, dy) \|f\| = \|f\| K(x, T) \lt \infty \) for \( x \in S \). Hence \( f \) is integrable with respect to \( K(x, \cdot) \) for each \( x \in S \) and the inequality holds.
- From (a), we now have \( \left|K f(x)\right| \le \|K\| \|f\| \) for \( x \in S \), so \( \|K f\| \le \|K\| \|f\| \). Moreover equality holds when \( f = \bs{1}_T \), the constant function 1 on \( T \).
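For a concrete feel for the norm bound, here is a minimal sketch on finite sets, where a kernel is just a nonnegative matrix and \( K(x, T) \) is a row sum. The kernel, the function, and the helper names `apply_right`, `sup_norm`, and `kernel_norm` are illustrative assumptions, not part of the text.

```python
# A finite kernel from S = {0, 1} to T = {0, 1, 2}, stored so that
# K[x][y] = K(x, {y}). The numbers are arbitrary illustrative values.
K = [[0.5, 1.0, 0.5],
     [2.0, 0.0, 1.0]]

def apply_right(K, f):
    """(K f)(x) = sum over y in T of K(x, y) f(y)."""
    return [sum(k * fy for k, fy in zip(row, f)) for row in K]

def sup_norm(f):
    """Supremum norm of a function on a finite set."""
    return max(abs(v) for v in f)

def kernel_norm(K):
    """||K|| = sup over x of K(x, T), the largest row sum."""
    return max(sum(row) for row in K)

f = [1.0, -3.0, 2.0]          # a bounded function on T
Kf = apply_right(K, f)

# The bound ||K f|| <= ||K|| ||f||
print(sup_norm(Kf), "<=", kernel_norm(K) * sup_norm(f))

# Equality for the constant function 1 on T
ones = [1.0, 1.0, 1.0]
print(sup_norm(apply_right(K, ones)) == kernel_norm(K))
```

The bound is usually strict for a function that changes sign, as here, since cancellation inside the sum is ignored by the estimate.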

The identity kernel \( I \) on the measurable space \( (S, \mathscr{S}) \) is defined by \( I(x, A) = \bs{1}(x \in A) \) for \( x \in S \) and \( A \in \mathscr{S} \).

Thus, \( I(x, A) = 1 \) if \( x \in A \) and \( I(x, A) = 0 \) if \( x \notin A \). So \( x \mapsto I(x, A) \) is the indicator function of \( A \in \mathscr{S} \), while \( A \mapsto I(x, A) \) is point mass at \( x \in S \). Clearly the identity kernel is a probability kernel. If we need to indicate the dependence on the particular space, we will add a subscript and write \( I_S \). The following result justifies the name.

Let \( I \) denote the identity kernel on \( (S, \mathscr{S}) \).

- If \( \mu \) is a positive measure on \( (S, \mathscr{S}) \) then \( \mu I = \mu \).
- If \( f: S \to \R \) is measurable, then \( I f = f \).

We can create a new kernel from two given kernels, by the usual operations of addition and scalar multiplication.

Suppose that \( K \) and \( L\) are kernels from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), and that \( c \in [0, \infty) \). Then \( c K \) and \( K + L \) defined below are also kernels from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \).

- \((c K)(x, A) = c K(x, A)\) for \( x \in S \) and \( A \in \mathscr{T} \).
- \((K + L)(x, A) = K(x, A) + L(x, A)\) for \( x \in S \) and \( A \in \mathscr{T} \).

These results are simple.

- Since \( x \mapsto K(x, A) \) is measurable for \( A \in \mathscr{T} \), so is \( x \mapsto c K(x, A) \). Since \( A \mapsto K(x, A) \) is a positive measure on \( (T, \mathscr{T}) \) for \( x \in S \), so is \( A \mapsto c K(x, A) \) since \( c \ge 0 \).
- Since \( x \mapsto K(x, A) \) and \( x \mapsto L(x, A) \) are measurable for \( A \in \mathscr{T} \), so is \( x \mapsto K(x, A) + L(x, A) \). Since \( A \mapsto K(x, A) \) and \( A \mapsto L(x, A) \) are positive measures on \( (T, \mathscr{T}) \) for \( x \in S \), so is \( A \mapsto K(x, A) + L(x, A) \).

A more interesting and important way to form a new kernel from two given kernels is via a multiplication operation.

Suppose that \( K \) is a kernel from \( (R, \mathscr{R}) \) to \( (S, \mathscr{S}) \) and that \( L \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Then \( K L \) defined as follows is a kernel from \( (R, \mathscr{R}) \) to \( (T, \mathscr{T}) \): \[ K L(x, A) = \int_S K(x, dy) L(y, A), \quad x \in R, \, A \in \mathscr{T} \]

The measurability of \( x \mapsto (K L)(x, A) \) for \( A \in \mathscr{T} \) follows from basic properties of the integral. For the second property, fix \( x \in R \). Clearly \( K L(x, A) \ge 0 \) for \( A \in \mathscr{T} \). Suppose that \( \{A_j: j \in J\} \) is a countable collection of disjoint sets in \( \mathscr{T} \) and \( A = \bigcup_{j \in J} A_j \). Then \begin{align*} K L(x, A) & = \int_S K(x, dy) L(y, A) = \int_S K(x, dy) \left(\sum_{j \in J} L(y, A_j)\right) \\ & = \sum_{j \in J} \int_S K(x, dy) L(y, A_j) = \sum_{j \in J} K L(x, A_j) \end{align*} The interchange of sum and integral is justified since the terms are nonnegative.

Once again, the identity kernel lives up to its name:

Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Then

- \( I_S K = K \)
- \( K I_T = K \)
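On finite sets the identity kernel is the identity matrix, and the two identities can be checked directly. The sketch below assumes finite index sets \( S = \{0, 1\} \) and \( T = \{0, 1, 2\} \); the helper names `identity` and `compose` are hypothetical.

```python
def identity(n):
    """Identity kernel on {0, ..., n-1}: I(x, {y}) = 1 if x == y, else 0."""
    return [[1 if x == y else 0 for y in range(n)] for x in range(n)]

def compose(K, L):
    """(K L)(x, z) = sum over y of K(x, y) L(y, z)."""
    return [[sum(K[x][y] * L[y][z] for y in range(len(L)))
             for z in range(len(L[0]))]
            for x in range(len(K))]

# An arbitrary kernel from S = {0, 1} to T = {0, 1, 2}
K = [[2, 0, 1],
     [1, 3, 4]]

left = compose(identity(2), K)    # I_S K
right = compose(K, identity(3))   # K I_T
print(left == K, right == K)
```

Note that the two identity kernels have different sizes, matching the fact that \( I_S \) and \( I_T \) live on different spaces.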

The next several results show that the operations are associative whenever they make sense.

Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \) and that \( \mu \) is a positive measure on \( \mathscr{S} \), \( c \in [0, \infty) \), and \( f: T \to \R \) is measurable. Then

- \( c (\mu K) = (c \mu) K \)
- \( c (K f) = (c K) f \)
- \( (\mu K) f = \mu (K f)\)

These results follow easily from the definitions.

- The common measure on \( \mathscr{T} \) is \( c \mu K(A) = c \int_S \mu(dx) K(x, A) \) for \( A \in \mathscr{T} \).
- The common function from \( S \) to \( \R \) is \( c K f(x) = c \int_T K(x, dy) f(y) \) for \( x \in S \).
- The common real number is \( \mu K f = \int_S \mu(dx) \int_T K(x, dy) f(y) \).

Suppose that \( K \) is a kernel from \( (R, \mathscr{R}) \) to \( (S, \mathscr{S}) \) and \( L \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Suppose also that \( \mu \) is a positive measure on \( (R, \mathscr{R}) \), \( f: T \to \R \) is measurable, and \( c \in [0, \infty) \). Then

- \( (\mu K) L = \mu (K L) \)
- \( K ( L f) = (K L) f \)
- \( c (K L) = (c K) L \)

These results follow easily from the definitions.

- The common measure on \( (T, \mathscr{T}) \) is \( \mu K L(A) = \int_R \mu(dx) \int_S K(x, dy) L(y, A) \) for \( A \in \mathscr{T} \).
- The common measurable function from \( R \) to \( \R \) is \( K L f(x) = \int_S K(x, dy) \int_T L(y, dz) f(z) \) for \( x \in R \).
- The common kernel from \( (R, \mathscr{R}) \) to \( (T, \mathscr{T}) \) is \( c K L(x, A) = c \int_S K(x, dy) L(y, A) \) for \( x \in R \) and \( A \in \mathscr{T} \).

Suppose that \( K \) is a kernel from \( (R, \mathscr{R}) \) to \( (S, \mathscr{S}) \), \( L \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), and \( M \) is a kernel from \( (T, \mathscr{T}) \) to \( (U, \mathscr{U})\). Then \( (K L) M = K (L M) \).

This result follows easily from the definitions. The common kernel from \( (R, \mathscr{R}) \) to \( (U, \mathscr{U}) \) is \[K L M(x, A) = \int_S K(x, dy) \int_T L(y, dz) M(z, A), \quad x \in R, \, A \in \mathscr{U} \]
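The associativity results can be spot-checked on finite sets, where the integrals become sums. This is only a sketch with arbitrarily chosen integer entries (so the arithmetic is exact); the names `compose`, `left_apply`, and `right_apply` are assumptions made for the illustration.

```python
def compose(K, L):
    """(K L)(x, z) = sum over y of K(x, y) L(y, z)."""
    return [[sum(K[x][y] * L[y][z] for y in range(len(L)))
             for z in range(len(L[0]))]
            for x in range(len(K))]

def left_apply(mu, K):
    """(mu K)(y) = sum over x of mu(x) K(x, y)."""
    return [sum(mu[x] * K[x][y] for x in range(len(K)))
            for y in range(len(K[0]))]

def right_apply(K, f):
    """(K f)(x) = sum over y of K(x, y) f(y)."""
    return [sum(K[x][y] * f[y] for y in range(len(f)))
            for x in range(len(K))]

K = [[1, 2], [0, 3], [4, 1]]     # from R (3 points) to S (2 points)
L = [[2, 0, 1], [1, 1, 5]]       # from S to T (3 points)
M = [[1, 2], [3, 0], [0, 4]]     # from T to U (2 points)
mu = [1, 2, 3]                   # a measure on R
f = [1, -1, 2]                   # a function on T

assoc_kernels = compose(compose(K, L), M) == compose(K, compose(L, M))
assoc_measure = left_apply(left_apply(mu, K), L) == left_apply(mu, compose(K, L))
assoc_function = right_apply(K, right_apply(L, f)) == right_apply(compose(K, L), f)
print(assoc_kernels, assoc_measure, assoc_function)
```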

The next several results show that the distributive property holds whenever the operations make sense.

Suppose that \( K \) and \( L \) are kernels from \( (R, \mathscr{R}) \) to \( (S, \mathscr{S}) \) and that \( M \) and \( N \) are kernels from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Suppose also that \( \mu \) is a positive measure on \( (R, \mathscr{R}) \) and that \( f: S \to \R \) is measurable. Then

- \((K + L) M = K M + L M\)
- \( K (M + N) = K M + K N \)
- \( \mu (K + L) = \mu K + \mu L \)
- \( (K + L) f = K f + L f \)

Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), and that \( \mu \) and \( \nu \) are positive measures on \( (S, \mathscr{S}) \), and that \( f \) and \( g \) are measurable functions from \( T \) to \( \R \) in part (b), and from \( S \) to \( \R \) in parts (c) and (d). Then

- \( (\mu + \nu) K = \mu K + \nu K \)
- \( K(f + g) = K f + K g \)
- \( \mu(f + g) = \mu f + \mu g \)
- \( (\mu + \nu) f = \mu f + \nu f \)

In particular, note that if \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), then the transformation \( \mu \mapsto \mu K \) defined for positive measures on \( (S, \mathscr{S})\), and the transformation \( f \mapsto K f \) defined for measurable functions \( f: T \to \R \) (for which \( K f \) exists), are both *linear* operators. If \( \mu \) is a positive measure on \( (S, \mathscr{S}) \), then the integral operator \( f \mapsto \mu f \) defined for measurable \( f: S \to \R \) (for which \( \mu f \) exists) is also linear, but of course, we already knew that. Finally, note that the operator \( f \mapsto K f \) is positive: if \( f \ge 0 \) then \( K f \ge 0 \). Here is the important summary of our results when the kernel is bounded.

If \( K \) is a bounded kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), then \( f \mapsto K f \) is a bounded, linear transformation from \( \mathscr{B}(T) \) to \( \mathscr{B}(S) \) and \( \|K\| \) is the norm of the transformation.

The commutative property for the product of kernels does not hold in general. If \( K \) and \( L \) are kernels, then depending on the measurable spaces, \( K L \) may be well defined, but not \( L K \). Even if both products are defined, they may be kernels from-to different measurable spaces. Even if both are defined from-to the same measurable spaces, it may well happen that \( K L \neq L K \). Examples are given in the computational exercises below.
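As a quick illustration of the last case, here is a sketch with two kernels on the same two-point set whose products in the two orders differ. The matrices are arbitrary choices, and `compose` is a hypothetical helper for the matrix form of the kernel product.

```python
def compose(K, L):
    """(K L)(x, z) = sum over y of K(x, y) L(y, z)."""
    return [[sum(K[x][y] * L[y][z] for y in range(len(L)))
             for z in range(len(L[0]))]
            for x in range(len(K))]

# Two kernels on S = {0, 1}
K = [[1, 1],
     [0, 1]]
L = [[1, 0],
     [1, 1]]

KL = compose(K, L)
LK = compose(L, K)
print(KL)   # [[2, 1], [1, 1]]
print(LK)   # [[1, 1], [1, 2]]
```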

If \( K \) is a kernel on \( (S, \mathscr{S}) \) and \( n \in \N \), we let \( K^n = K K \cdots K \), the \( n \)-fold power of \( K \). By convention, \( K^0 = I \), the identity kernel on \( S \).

Fixed points of the operators associated with a kernel turn out to be very important.

Suppose that \( K \) is a kernel on \( (S, \mathscr{S}) \).

- A positive measure \( \mu \) on \( (S, \mathscr{S}) \) such that \( \mu K = \mu \) is said to be invariant for \( K \).
- A measurable function \( f: S \to \R \) such that \( K f = f \) is said to be invariant for \( K \).

So in the language of linear algebra (or functional analysis), an invariant measure is a left eigenvector of the kernel, while an invariant function is a right eigenvector of the kernel, both corresponding to the eigenvalue 1. By our results above, if \( \mu \) and \( \nu \) are invariant measures and \( c \in [0, \infty) \), then \( \mu + \nu \) and \( c \mu \) are also invariant. Similarly, if \( f \) and \( g \) are invariant functions and \( c \in \R \), then \( f + g \) and \( c f \) are also invariant.
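To make the eigenvector picture concrete, here is a small sketch on a two-point set, using exact rational arithmetic. The kernel `P`, the measure `mu`, and the helper names are illustrative assumptions; the invariant measure was found by solving \( \mu P = \mu \) by hand.

```python
from fractions import Fraction as F

def left_apply(mu, K):
    """(mu K)(y) = sum over x of mu(x) K(x, y)."""
    return [sum(mu[x] * K[x][y] for x in range(len(K)))
            for y in range(len(K[0]))]

def right_apply(K, f):
    """(K f)(x) = sum over y of K(x, y) f(y)."""
    return [sum(K[x][y] * f[y] for y in range(len(f)))
            for x in range(len(K))]

# A probability kernel on S = {0, 1}
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

mu = [F(1, 3), F(2, 3)]        # candidate invariant measure
one = [F(1), F(1)]             # the constant function 1

print(left_apply(mu, P) == mu)     # mu P = mu
print(right_apply(P, one) == one)  # P 1 = 1

# A nonnegative multiple of an invariant measure is again invariant
scaled = [5 * m for m in mu]
print(left_apply(scaled, P) == scaled)
```

The constant function 1 is invariant here because the row sums of a probability kernel are 1.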

Of course we are particularly interested in *probability* kernels.

Suppose that \( P \) is a probability kernel from \((R, \mathscr{R})\) to \( (S, \mathscr{S}) \) and that \( Q \) is a probability kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). Suppose also that \( \mu \) is a probability measure on \( (R, \mathscr{R}) \). Then

- \( P Q \) is a probability kernel from \( (R, \mathscr{R}) \) to \( (T, \mathscr{T}) \).
- \( \mu P \) is a probability measure on \( (S, \mathscr{S}) \).

- We know that \( P Q \) is a kernel from \( (R, \mathscr{R}) \) to \( (T, \mathscr{T}) \). So we just need to note that \[P Q(x, T) = \int_S P(x, dy) Q(y, T) = \int_S P(x, dy) = P(x, S) = 1, \quad x \in R \]
- We know that \( \mu P \) is a positive measure on \( (S, \mathscr{S}) \). So we just need to note that \[ \mu P(S) = \int_R \mu(dx) P(x, S) = \int_R \mu(dx) = \mu(R) = 1 \]

As a corollary, it follows that if \( P \) is a probability kernel on \( (S, \mathscr{S}) \), then so is \( P^n \) for \( n \in \N \).
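The corollary is easy to check numerically for a finite probability kernel. The sketch below uses an arbitrary three-state kernel with exact fractions; `compose` and `power` are hypothetical helper names, with `power(P, 0)` returning the identity kernel by the convention \( P^0 = I \).

```python
from fractions import Fraction as F

def compose(K, L):
    """(K L)(x, z) = sum over y of K(x, y) L(y, z)."""
    return [[sum(K[x][y] * L[y][z] for y in range(len(L)))
             for z in range(len(L[0]))]
            for x in range(len(K))]

def power(P, n):
    """n-fold power of a kernel on a finite set, with P^0 = I."""
    size = len(P)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = compose(result, P)
    return result

# An arbitrary probability kernel on S = {0, 1, 2}
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(0),    F(1, 4), F(3, 4)]]

# Row sums P^n(x, S) for n = 0, ..., 5
row_sums = {n: [sum(row) for row in power(P, n)] for n in range(6)}
print(all(s == 1 for sums in row_sums.values() for s in sums))
```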

The operators associated with a kernel are of fundamental importance, and we can easily recover the kernel from the operators. Suppose that \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \), and let \( x \in S \) and \( A \in \mathscr{T} \). Then trivially, \(K \bs{1}_A(x) = K(x, A)\) where as usual, \( \bs{1}_A \) is the indicator function of \( A \). Trivially also \( \delta_x K(A) = K(x, A) \) where \( \delta_x \) is point mass at \( x \).

Usually our *measurable* spaces are in fact *measure* spaces, with natural measures associated with the spaces (counting measure for a countable set and Lebesgue measure for a subset of a Euclidean space, for example). In such cases, kernels are usually constructed from *density functions* in much the same way that positive measures are defined from density functions. In the discussion that follows, we assume as usual that integrals that are written exist.

Suppose that \( (S, \mathscr{S}, \lambda) \) and \( (T, \mathscr{T}, \mu) \) are measure spaces. As usual, \( S \times T \) is given the product \( \sigma \)-algebra \( \mathscr{S} \otimes \mathscr{T} \). If \( k: S \times T \to [0, \infty) \) is measurable, then the function \( K \) defined as follows is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \): \[ K(x, A) = \int_A k(x, y) \mu(dy), \quad x \in S, \, A \in \mathscr{T} \]

The measurability of \( x \mapsto K(x, A) = \int_A k(x, y) \mu(dy) \) for \( A \in \mathscr{T} \) follows from a basic property of the integral. The fact that \( A \mapsto K(x, A) = \int_A k(x, y) \mu(dy) \) is a positive measure on \( \mathscr{T} \) for \( x \in S \) also follows from a basic property of the integral. In fact, \( y \mapsto k(x, y) \) is the density of this measure with respect to \( \mu \).

Clearly the kernel \( K \) depends on the positive measure \( \mu \) on \( (T, \mathscr{T}) \) as well as the function \( k \), while the measure \( \lambda \) on \( (S, \mathscr{S}) \) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Appropriately enough, the function \( k \) is called a kernel density function (with respect to \( \mu \)), or simply a kernel function.

Suppose again that \( (S, \mathscr{S}, \lambda) \) and \( (T, \mathscr{T}, \mu) \) are measure spaces. Suppose also \( K \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \) with kernel function \( k \). If \( f: T \to \R \) is measurable, then \[ K f(x) = \int_T k(x, y) f(y) \mu(dy), \quad x \in S \]

This follows since the function \( y \mapsto k(x, y) \) is the density of the measure \( A \mapsto K(x, A) \) with respect to \( \mu \): \[ K f(x) = \int_T K(x, dy) f(y) = \int_T k(x, y) f(y) \mu(dy), \quad x \in S \]

A kernel function defines an operator on the left with functions on \( S \) in a completely analogous way to the operator on the right above with functions on \( T \).

Suppose again that \( (S, \mathscr{S}, \lambda) \) and \( (T, \mathscr{T}, \mu) \) are measure spaces and that \( k: S \times T \to [0, \infty) \) is measurable. If \( f: S \to \R \) is measurable, then the function \( f K: T \to \R \) defined as follows is also measurable \[ f K(y) = \int_S \lambda(dx) f(x) k(x, y), \quad y \in T \]

The operator defined above depends on the measure \( \lambda \) on \( (S, \mathscr{S}) \) as well as the kernel function \( k \), while the measure \( \mu \) on \( (T, \mathscr{T}) \) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Here is how our new operation on the left with *functions* relates to our old operation on the left with *measures*.

Suppose again that \( (S, \mathscr{S}, \lambda) \) and \( (T, \mathscr{T}, \mu) \) are measure spaces and that \( k: S \times T \to [0, \infty) \) is measurable. Suppose also that \( f: S \to [0, \infty) \) is measurable, and let \( \rho \) denote the measure on \( (S, \mathscr{S}) \) that has density \( f \) with respect to \( \lambda \). Then \( f K \) is the density of the measure \( \rho K \) with respect to \( \mu \).

The main tool, as usual, is an interchange of integrals. For \( B \in \mathscr{T} \), \begin{align*} \rho K(B) & = \int_S \rho(dx) K(x, B) = \int_S f(x) K(x, B) \lambda(dx) = \int_S f(x) \left[\int_B k(x, y) \mu(dy)\right] \lambda(dx) \\ & = \int_B \left[\int_S f(x) k(x, y) \lambda(dx)\right] \mu(dy) = \int_B f K(y) \mu(dy) \end{align*}

As always, we are particularly interested in stochastic kernels. With a kernel function, we can have *doubly* stochastic kernels.

Suppose again that \( (S, \mathscr{S}, \lambda) \) and \( (T, \mathscr{T}, \mu) \) are measure spaces and that \( k: S \times T \to [0, \infty) \) is measurable. Then \( k \) is a doubly stochastic kernel function if

- \( \int_T k(x, y) \mu(dy) = 1 \) for \( x \in S \)
- \( \int_S \lambda(dx) k(x, y) = 1 \) for \( y \in T \)

Of course, condition (a) simply means that the kernel associated with \( k \) is a stochastic kernel according to our original definition.

The most common and important special case is when the two spaces are the same. Thus, if \( (S, \mathscr{S}, \lambda) \) is a measure space and \( k : S \times S \to [0, \infty) \) is measurable, then we have an operator \( K \) that operates on the left and on the right with measurable functions \( f: S \to \R \): \begin{align*} f K(y) & = \int_S \lambda(dx) f(x) k(x, y), \quad y \in S \\ K f(x) & = \int_S k(x, y) f(y) \lambda(d y), \quad x \in S \end{align*} If \( f \) is nonnegative and \( \mu \) is the measure on \( (S, \mathscr{S}) \) with density function \( f \), then \( f K \) is the density function of the measure \( \mu K \) (both with respect to \( \lambda \)).

Suppose again that \( (S, \mathscr{S}, \lambda) \) is a measure space and \( k : S \times S \to [0, \infty) \) is measurable. Then \( k \) is symmetric if \( k(x, y) = k(y, x) \) for all \( (x, y) \in S^2 \).

Of course, if \( k \) is a symmetric, stochastic kernel function on \( (S, \mathscr{S}, \lambda) \) then \( k \) is doubly stochastic, but the converse is not true.
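Both halves of this remark can be illustrated on a three-point set with counting measure, where the two integral conditions become row sums and column sums. The two matrices below are illustrative choices (the second is a cyclic permutation matrix), and the helper names are assumptions.

```python
from fractions import Fraction as F

def row_sums(k):
    return [sum(row) for row in k]

def col_sums(k):
    return [sum(col) for col in zip(*k)]

def is_symmetric(k):
    n = len(k)
    return all(k[x][y] == k[y][x] for x in range(n) for y in range(n))

def is_doubly_stochastic(k):
    """Row sums and column sums (integrals against counting measure) all equal 1."""
    return all(s == 1 for s in row_sums(k) + col_sums(k))

# Symmetric and stochastic, hence doubly stochastic
symmetric = [[F(1, 2), F(1, 4), F(1, 4)],
             [F(1, 4), F(1, 2), F(1, 4)],
             [F(1, 4), F(1, 4), F(1, 2)]]

# Doubly stochastic but not symmetric: a cyclic shift 0 -> 1 -> 2 -> 0
cyclic = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]

print(is_symmetric(symmetric), is_doubly_stochastic(symmetric))
print(is_symmetric(cyclic), is_doubly_stochastic(cyclic))
```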

Suppose that \( (R, \mathscr{R}, \lambda) \), \( (S, \mathscr{S}, \mu) \), and \( (T, \mathscr{T}, \rho) \) are measure spaces. Suppose also that \( K \) is a kernel from \( (R, \mathscr{R}) \) to \( (S, \mathscr{S}) \) with kernel function \( k \), and that \( L \) is a kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \) with kernel function \( l \). Then the kernel \( K L \) from \( (R, \mathscr{R}) \) to \( (T, \mathscr{T}) \) has density \( k l \) given by \[ k l(x, z) = \int_S k(x, y) l(y, z) \mu(dy), \quad (x, z) \in R \times T \]

Once again, the main tool is an interchange of integrals via Fubini's theorem. Let \( x \in R \) and \( B \in \mathscr{T} \). Then \begin{align*} K L(x, B) & = \int_S K(x, dy) L(y, B) = \int_S k(x, y) L(y, B) \mu(dy) \\ & = \int_S k(x, y) \left[\int_B l(y, z) \rho(dz) \right] \mu(dy) = \int_B \left[\int_S k(x, y) l(y, z) \mu(dy) \right] \rho(dz) = \int_B k l(x, z) \rho(dz) \end{align*}

In this subsection, a countable set is given the power set as the \( \sigma \)-algebra (as usual). Thus all subsets of the set are measurable, and any function defined on the set is measurable. We also use counting measure \( \# \) as the natural measure on the set, so integrals become sums.

Suppose now that \( K \) is a kernel from a countable set \( S \) to a countable set \( T \). For \( x \in S \) and \( y \in T \), let \( K(x, y) = K(x, \{y\}) \). Then more generally,
\[ K(x, A) = \sum_{y \in A} K(x, y), \quad x \in S, \, A \subseteq T \]
The function \( (x, y) \mapsto K(x, y) \) is simply the kernel function of the kernel \( K \), as defined above, but in this case we usually don't bother with using a different symbol for the function as opposed to the kernel. The function \( K \) can be thought of as a *matrix*, with rows indexed by \( S \) and columns indexed by \( T \) (and so an infinite matrix if \( S \) or \( T \) is countably infinite). With this interpretation, all of the operations defined above can be thought of as matrix operations. If \( f: T \to \R \) and \( f \) is thought of as a column vector indexed by \( T \), then \( K f \) defined by
\[K f(x) = \sum_{y \in T} K(x, y) f(y), \quad x \in S \]
is simply the ordinary product of the matrix \( K \) and the vector \( f \); the product is a column vector indexed by \( S \). Similarly, if \( f: S \to \R \) and \( f \) is thought of as a row vector indexed by \( S \), then \( f K \) defined by
\[ f K(y) = \sum_{x \in S} f(x) K(x, y), \quad y \in T \]
is simply the ordinary product of the vector \( f \) and the matrix \( K \); the product is a row vector indexed by \( T \). If \( L \) is another kernel from \( T \) to another countable set \( U \), then as functions, \( K L \) defined by
\[ K L(x, z) = \sum_{y \in T} K(x, y) L(y, z), \quad (x, z) \in S \times U \]
is simply the matrix product of \( K \) and \( L \).

Let \( S = \{1, 2, 3\} \) and \( T = \{1, 2, 3, 4\} \). Define the kernel \( K \) from \( S \) to \( T \) by \( K(x, y) = x + y \) for \( (x, y) \in S \times T \). Define the function \( f \) on \( S \) by \( f(x) = x! \) for \( x \in S \), and define the function \( g \) on \( T \) by \( g(y) = y^2\) for \( y \in T \). Compute each of the following using matrix algebra:

- \( f K \)
- \( K g \)

In matrix form, \[ K = \left[\begin{matrix} 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 \\ 4 & 5 & 6 & 7 \end{matrix} \right], \quad f = \left[\begin{matrix} 1 & 2 & 6 \end{matrix} \right], \quad g = \left[\begin{matrix} 1 \\ 4 \\ 9 \\ 16 \end{matrix} \right]\]

- As a row vector indexed by \( T \), the product is \( f K = \left[\begin{matrix} 32 & 41 & 50 & 59\end{matrix}\right] \)
- As a column vector indexed by \( S \), \[ K g = \left[\begin{matrix} 130 \\ 160 \\ 190 \end{matrix}\right] \]
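The two products can be double-checked by writing the sums out directly. This is a minimal sketch that stores the kernel and the functions as dictionaries keyed by the points of \( S \) and \( T \); the variable names are chosen for the illustration.

```python
from math import factorial

S = [1, 2, 3]
T = [1, 2, 3, 4]

# K(x, y) = x + y, f(x) = x!, g(y) = y^2
K = {(x, y): x + y for x in S for y in T}
f = {x: factorial(x) for x in S}
g = {y: y ** 2 for y in T}

# f K is a row vector indexed by T; K g is a column vector indexed by S
fK = [sum(f[x] * K[x, y] for x in S) for y in T]
Kg = [sum(K[x, y] * g[y] for y in T) for x in S]

print(fK)   # [32, 41, 50, 59]
print(Kg)   # [130, 160, 190]
```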

Let \( R = \{0, 1\} \), \( S = \{a, b\} \), and \( T = \{1, 2, 3\} \). Define the kernel \( K \) from \( R \) to \( S \), the kernel \( L \) from \( S \) to \( S \) and the kernel \( M \) from \( S \) to \( T \) in matrix form as follows: \[ K = \left[\begin{matrix} 1 & 4 \\ 2 & 3\end{matrix}\right], \; L = \left[\begin{matrix} 2 & 2 \\ 1 & 5 \end{matrix}\right], \; M = \left[\begin{matrix} 1 & 0 & 2 \\ 0 & 3 & 1 \end{matrix} \right] \] Compute each of the following kernels, or explain why the operation does not make sense:

- \( K L \)
- \( L K \)
- \( K^2 \)
- \( L^2 \)
- \( K M \)
- \( L M \)

Note that these are not just abstract matrices, but rather have rows and columns indexed by the appropriate spaces. So the products make sense only when the spaces match appropriately; it's not just a matter of the number of rows and columns.

- \( K L \) is the kernel from \( R \) to \( S \) given by \[ K L = \left[\begin{matrix} 6 & 22 \\ 7 & 19 \end{matrix} \right] \]
- \( L K \) is not defined since the column space \( S \) of \( L \) is not the same as the row space \( R \) of \( K \).
- \( K^2 \) is not defined since the row space \( R \) is not the same as the column space \( S \).
- \( L^2 \) is the kernel from \( S \) to \( S \) given by \[ L^2 = \left[\begin{matrix} 6 & 14 \\ 7 & 27 \end{matrix}\right] \]
- \( K M \) is the kernel from \( R \) to \( T \) given by \[ K M = \left[\begin{matrix} 1 & 12 & 6 \\ 2 & 9 & 7 \end{matrix} \right] \]
- \( L M \) is the kernel from \( S \) to \( T \) given by \[ L M = \left[\begin{matrix} 2 & 6 & 6 \\ 1 & 15 & 7 \end{matrix}\right] \]
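One way to make the point about index spaces concrete is to carry the row and column spaces along with the matrix, and to refuse a product when they do not match. The `Kernel` class below is a hypothetical sketch for this exercise only, not a standard library type.

```python
class Kernel:
    """A finite kernel stored as a matrix with labeled row and column spaces."""

    def __init__(self, rows, cols, matrix):
        self.rows, self.cols, self.matrix = rows, cols, matrix

    def __matmul__(self, other):
        # The product is defined only if the spaces match, not just the sizes.
        if self.cols != other.rows:
            raise ValueError(
                f"column space {self.cols} does not match row space {other.rows}")
        product = [[sum(a * b for a, b in zip(row, col))
                    for col in zip(*other.matrix)]
                   for row in self.matrix]
        return Kernel(self.rows, other.cols, product)

R, S, T = (0, 1), ("a", "b"), (1, 2, 3)
K = Kernel(R, S, [[1, 4], [2, 3]])
L = Kernel(S, S, [[2, 2], [1, 5]])
M = Kernel(S, T, [[1, 0, 2], [0, 3, 1]])

print((K @ L).matrix)   # [[6, 22], [7, 19]]
print((L @ L).matrix)   # [[6, 14], [7, 27]]
print((K @ M).matrix)   # [[1, 12, 6], [2, 9, 7]]
print((L @ M).matrix)   # [[2, 6, 6], [1, 15, 7]]

# L K and K^2 have compatible matrix shapes but mismatched spaces
for name, left, right in [("L K", L, K), ("K^2", K, K)]:
    try:
        left @ right
    except ValueError as err:
        print(name, "is not defined:", err)
```

Note that \( L K \) and \( K^2 \) would both be valid products of \( 2 \times 2 \) matrices; they fail here only because \( R \ne S \).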

An important class of probability kernels arises from the distribution of one random variable, conditioned on the value of another random variable. In this subsection, suppose that \( (\Omega, \mathscr{F}, \P) \) is a probability space, and that \( (S, \mathscr{S}) \) and \( (T, \mathscr{T}) \) are measurable spaces. Further, suppose that \( X \) and \( Y \) are random variables defined on the probability space, with \( X \) taking values in \( S \) and \( Y \) taking values in \( T \). Informally, \( X \) and \( Y \) are random variables defined on the same underlying random experiment.

The function \( P \) defined as follows is a probability kernel from \( (S, \mathscr{S}) \) to \( (T, \mathscr{T}) \). \[ P(x, A) = \P(Y \in A \mid X = x), \quad x \in S, \, A \in \mathscr{T} \]

Recall that for \( A \in \mathscr{T} \), the conditional probability \( \P(Y \in A \mid X) \) is itself a random variable, and is measurable with respect to \( \sigma(X) \). That is, \( \P(Y \in A \mid X) = P(X, A) \) for some measurable function \(x \mapsto P(x, A) \) from \( S \) to \( [0, 1] \). Then, by definition, \( \P(Y \in A \mid X = x) = P(x, A) \). Trivially, of course, \( A \mapsto P(x, A) \) is a probability measure on \( (T, \mathscr{T}) \) for \( x \in S \).

The operators associated with this kernel have natural interpretations.

Let \( P \) be the conditional probability kernel of \( Y \) given \( X \) as defined in the last result.

- If \( f: T \to \R \) is measurable, then \( Pf(x) = \E[f(Y) \mid X = x] \) for \( x \in S \) (assuming as usual that the expected value exists).
- If \( \mu \) is the probability distribution of \( X \) then \( \mu P \) is the probability distribution of \( Y \).

These are basic results that we have already studied, dressed up in new notation.

- Since \( A \mapsto P(x, A) \) is the conditional distribution of \( Y \) given \( X = x \), \[ \E[f(Y) \mid X = x] = \int_T P(x, dy) f(y) = P f(x) \]
- Let \( A \in \mathscr{T} \). Conditioning on \( X \) gives \[ \P(Y \in A) = \E[\P(Y \in A \mid X)] = \int_S \mu(dx) \P(Y \in A \mid X = x) = \int_S \mu(dx) P(x, A) = \mu P(A) \]

As in the general discussion above, the measurable spaces \( (S, \mathscr{S}) \) and \( (T, \mathscr{T}) \) are usually *measure* spaces with natural measures attached. So the conditional probability distributions are often given via conditional probability density functions, which then play the role of kernel functions.

Suppose that \( X \) and \( Y \) are random variables for an experiment, taking values in \( \R \). For \( x \in \R \), the conditional distribution of \( Y \) given \( X = x \) is normal with mean \( x \) and standard deviation 1. Use the notation and operations of this section for the following problems:

- Give the probability density function for the conditional distribution of \( Y \) given \( X = x \).
- Find \( \E\left(Y^2 \bigm| X = x\right) \).
- Suppose that \( X \) has the standard normal distribution. Find the probability density function of \( Y \).

- The kernel function (with respect to Lebesgue measure, of course) is \[ p(x, y) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} (y - x)^2}, \quad x, \, y \in \R \]
- Let \( g(y) = y^2 \) for \( y \in \R \). Then \( \E\left(Y^2 \bigm| X = x\right) = P g(x) = 1 + x^2\) for \( x \in \R \), the conditional variance plus the square of the conditional mean.
- The standard normal PDF \( f \) is given by \( f(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2} \) for \( x \in \R \). Thus \( Y \) has PDF \( f P \). \[ f P(y) = \int_{-\infty}^\infty f(x) p(x, y) dx = \frac{1}{2 \sqrt{\pi}} e^{-\frac{1}{4} y^2}, \quad y \in \R\] This is the PDF of the normal distribution with mean 0 and variance 2.
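
The last two answers can be checked numerically. The following is a minimal sketch, assuming a simple midpoint rule on a truncated interval is accurate enough for these Gaussian integrands; the helper names `integrate`, `right_op`, and `left_op` are illustrative and not part of the text.

```python
import math

def p(x, y):
    """Kernel function: density at y of the normal distribution with mean x and sd 1."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

def integrate(h, lo, hi, n=4000):
    """Composite midpoint rule for the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (i + 0.5) * step) for i in range(n))

def right_op(g, x, width=12.0):
    """P g(x): integral of p(x, y) g(y) dy, truncated to [x - width, x + width]."""
    return integrate(lambda y: p(x, y) * g(y), x - width, x + width)

def left_op(f, y, width=12.0):
    """f P(y): integral of f(x) p(x, y) dx, truncated to [y - width, y + width]."""
    return integrate(lambda x: f(x) * p(x, y), y - width, y + width)

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def normal_var2_pdf(y):
    """Closed form from the answer: normal PDF with mean 0 and variance 2."""
    return math.exp(-0.25 * y * y) / (2 * math.sqrt(math.pi))

# Compare P g(x) with 1 + x^2, where g(y) = y^2.
for x in (-1.0, 0.0, 2.0):
    print(x, right_op(lambda y: y * y, x), 1 + x * x)

# Compare f P(y) with the normal PDF with mean 0 and variance 2.
for y in (-1.5, 0.0, 1.0):
    print(y, left_op(std_normal_pdf, y), normal_var2_pdf(y))
```

In each printed row the numerical value and the closed form should agree to many decimal places.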

Suppose that \( X \) and \( Y \) are random variables for an experiment, with \( X \) taking values in \( \{a, b, c\} \) and \( Y \) taking values in \( \{1, 2, 3, 4\} \). The conditional density function of \( Y \) given \( X \) is as follows: \( P(a, y) = 1/4 \), \( P(b, y) = y / 10 \), and \( P(c, y) = y^2/30 \), each for \( y \in \{1, 2, 3, 4\} \).

- Give the kernel \( P \) in matrix form and verify that it is a probability kernel.
- Find \( f P \) where \( f(a) = f(b) = f(c) = 1/3 \). The result is the density function of \( Y \) when \( X \) is uniformly distributed on \( \{a, b, c\} \).
- Find \( P g \) where \( g(y) = y \) for \( y \in \{1, 2, 3, 4\} \). The resulting function is \( \E(Y \mid X = x) \) for \( x \in \{a, b, c\} \).

- In matrix form \[ P = \left[\begin{matrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{10} & \frac{2}{10} & \frac{3}{10} & \frac{4}{10} \\ \frac{1}{30} & \frac{4}{30} & \frac{9}{30} & \frac{16}{30} \end{matrix} \right]\] Note that the row sums are 1.
- In matrix form, \( f = \left[\begin{matrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{matrix} \right]\) and \(f P = \left[\begin{matrix} \frac{23}{180} & \frac{35}{180} & \frac{51}{180} & \frac{71}{180} \end{matrix} \right]\).
- In matrix form, \[ g = \left[\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix} \right], \quad P g = \left[\begin{matrix} \frac{5}{2} \\ 3 \\ \frac{10}{3} \end{matrix} \right]\]
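
Since the spaces here are finite, the kernel operations are ordinary matrix operations and can be checked with exact rational arithmetic. The sketch below is one way to do this with Python's `fractions` module; the function names `left_op` and `right_op` are illustrative.

```python
from fractions import Fraction as F

# Rows are indexed by x in (a, b, c); columns by y in (1, 2, 3, 4).
ys = [1, 2, 3, 4]
P = [
    [F(1, 4) for y in ys],       # P(a, y) = 1/4
    [F(y, 10) for y in ys],      # P(b, y) = y/10
    [F(y * y, 30) for y in ys],  # P(c, y) = y^2/30
]

def left_op(f, P):
    """Row vector times matrix: (f P)(y) = sum over x of f(x) P(x, y)."""
    return [sum(f[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]

def right_op(P, g):
    """Matrix times column vector: (P g)(x) = sum over y of P(x, y) g(y)."""
    return [sum(P[i][j] * g[j] for j in range(len(g))) for i in range(len(P))]

row_sums = [sum(row) for row in P]  # each should be 1 for a probability kernel
f = [F(1, 3)] * 3                   # uniform density on {a, b, c}
g = [F(y) for y in ys]              # g(y) = y
fP = left_op(f, P)
Pg = right_op(P, g)

print(row_sums)
print(fP)
print(Pg)
```

Note that `Fraction` prints values in lowest terms, so \( \frac{35}{180} \) appears as `7/36` and \( \frac{51}{180} \) as `17/60`.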

A parametric probability distribution also defines a probability kernel in a natural way, with the parameter playing the role of the kernel variable, and the distribution playing the role of the measure. Such distributions are usually defined in terms of a parametric density function which then defines a kernel function, again with the parameter playing the role of the first argument and the variable the role of the second argument. If the parameter is thought of as a given value of another random variable, as in Bayesian analysis, then there is considerable overlap with the previous subsection. In most cases, the spaces involved are either subsets of Euclidean spaces which naturally have Lebesgue measure, or countable spaces which naturally have counting measure.

Consider the parametric family of exponential distributions. Let \( f \) denote the identity function on \( (0, \infty) \).

- Give the probability density function as a probability kernel function \( p \) on \( (0, \infty) \).
- Find \( P f \).
- Find \( f P \).
- Find \( p^2 \), the kernel function corresponding to the product kernel \( P^2 \).

- \( p(r, x) = r e^{-r x} \) for \( r, \, x \in (0, \infty) \).
- For \( r \in (0, \infty) \), \[ P f(r) = \int_0^\infty p(r, x) f(x) \, dx = \int_0^\infty x r e^{-r x} dx = \frac{1}{r} \] This is the mean of the exponential distribution.
- For \( x \in (0, \infty) \), \[ f P(x) = \int_0^\infty f(r) p(r, x) \, dr = \int_0^\infty r^2 e^{-r x} dr = \frac{2}{x^3} \]
- For \( r, \, y \in (0, \infty) \), \[ p^2(r, y) = \int_0^\infty p(r, x) p(x, y) \, dx = \int_0^\infty r x e^{-(r + y) x} dx = \frac{r}{(r + y)^2} \]
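
The three integrals above can be checked numerically. This is a rough sketch, assuming Simpson's rule on a truncated interval \( [0, 50 / c] \), where \( c \) is the decay rate of the integrand; the helper names are illustrative.

```python
import math

def p(r, x):
    """Exponential kernel function: density at x of the exponential distribution with rate r."""
    return r * math.exp(-r * x)

def integrate(h, upper, n=20000):
    """Composite Simpson rule for the integral of h over [0, upper]; n must be even."""
    step = upper / n
    total = h(0.0) + h(upper)
    total += 4 * sum(h((2 * i - 1) * step) for i in range(1, n // 2 + 1))
    total += 2 * sum(h(2 * i * step) for i in range(1, n // 2))
    return total * step / 3

def right_op(r):
    """P f(r) with f the identity: integrate x p(r, x) over the variable x."""
    return integrate(lambda x: p(r, x) * x, 50.0 / r)

def left_op(x):
    """f P(x) with f the identity: integrate r p(r, x) over the parameter r."""
    return integrate(lambda r: r * p(r, x), 50.0 / x)

def p2(r, y):
    """p^2(r, y): integrate p(r, x) p(x, y) over the intermediate point x."""
    return integrate(lambda x: p(r, x) * p(x, y), 50.0 / (r + y))

for r in (0.5, 2.0):
    print(r, right_op(r), 1 / r)
for x in (1.0, 1.5):
    print(x, left_op(x), 2 / x ** 3)
for r, y in ((2.0, 3.0), (0.5, 1.0)):
    print(r, y, p2(r, y), r / (r + y) ** 2)
```

In each printed row, the quadrature value and the closed form should agree closely.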

Consider the parametric family of Poisson distributions. Let \(f \) be the identity function on \(\N \) and let \( g \) be the identity function on \( (0, \infty) \).

- Give the probability density function \( p \) as a probability kernel function from \( (0, \infty) \) to \( \N \).
- Show that \( P f = g \).
- Show that \( g P = f + 1 \).

- \( p(r, n) = e^{-r} \frac{r^n}{n!} \) for \( r \in (0, \infty) \) and \( n \in \N \).
- For \( r \in (0, \infty) \), \[ P f(r) = \sum_{n=0}^\infty p(r, n) f(n) = \sum_{n=0}^\infty n e^{-r} \frac{r^n}{n!} = r \] This is the mean of the Poisson distribution.
- For \( n \in \N \), \[ g P(n) = \int_0^\infty g(r) p(r, n) \, dr = \int_0^\infty e^{-r} \frac{r^{n+1}}{n!} dr = \frac{(n + 1)!}{n!} = n + 1 = f(n) + 1 \]
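
Both operators for the Poisson kernel can be evaluated numerically. In the sketch below, the series for \( P f(r) \) is truncated after 100 terms (adequate for moderate \( r \)), and the integral for \( g P(n) \), which is the gamma integral \( \Gamma(n + 2) / n! \), is approximated by Simpson's rule on \( [0, n + 60] \). The helper names and truncation points are assumptions made for illustration.

```python
import math

def p(r, n):
    """Poisson kernel function: probability of n when the parameter is r."""
    return math.exp(-r) * r ** n / math.factorial(n)

def right_op(r, terms=100):
    """P f(r) with f the identity on N: truncated sum of n p(r, n)."""
    return sum(p(r, n) * n for n in range(terms))

def simpson(h, upper, steps=20000):
    """Composite Simpson rule for the integral of h over [0, upper]; steps must be even."""
    step = upper / steps
    total = h(0.0) + h(upper)
    total += 4 * sum(h((2 * i - 1) * step) for i in range(1, steps // 2 + 1))
    total += 2 * sum(h(2 * i * step) for i in range(1, steps // 2))
    return total * step / 3

def left_op(n):
    """g P(n) with g the identity on (0, infinity): integral of r p(r, n) dr."""
    return simpson(lambda r: r * p(r, n), n + 60.0)

# P f(r) compared with the Poisson mean r.
for r in (0.5, 3.5):
    print(r, right_op(r))

# g P(n) compared with the gamma integral Gamma(n + 2) / n!.
for n in (0, 1, 4):
    print(n, left_op(n), math.gamma(n + 2) / math.factorial(n))
```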

Clearly the Poisson distribution has some very special and elegant properties. The next family of distributions also has some very special properties. Compare this exercise with the corresponding one above.

Consider the family of normal distributions, parameterized by the mean and with variance 1.

- Give the probability density function as a probability kernel function \( p \) on \( \R \).
- Show that \( p \) is symmetric.
- Let \( f \) be the identity function on \( \R \). Show that \( P f = f \) and \( f P = f \).
- For \( n \in \N_+ \), find \( p^n \), the kernel function for the operator \( P^n \).

- For \( \mu, \, x \in \R \), \[ p(\mu, x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}(x - \mu)^2} \] That is, \( x \mapsto p(\mu, x) \) is the normal probability density function with mean \( \mu \) and variance 1.
- Note that \( p(\mu, x) = p(x, \mu) \) for \( \mu, \, x \in \R \). So \( \mu \mapsto p(\mu, x) \) is the normal probability density function with mean \( x \) and variance 1.
- Since \( f(x) = x \) for \( x \in \R \), this follows from the previous two parts: \( P f(\mu) = \mu \) for \( \mu \in \R \) and \( f P(x) = x \) for \( x \in \R \).
- For \( \mu, \, x \in \R \), \[ p^2(\mu, x) = \int_{-\infty}^\infty p(\mu, t) p(t, x) \, dt = \frac{1}{\sqrt{4 \pi}} e^{-\frac{1}{4}(x - \mu)^2} \] so that \( x \mapsto p^2(\mu, x) \) is the normal PDF with mean \( \mu \) and variance 2. By induction, \[ p^n(\mu, x) = \frac{1}{\sqrt{2 \pi n}} e^{-\frac{1}{2 n}(x - \mu)^2} \] for \( n \in \N_+ \) and \( \mu, \, x \in \R \). Thus \( x \mapsto p^n(\mu, x) \) is the normal PDF with mean \( \mu \) and variance \( n \).
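
The induction step can be checked numerically: composing the closed form for \( p^n \) with \( p \) should reproduce the closed form for \( p^{n+1} \). The following is a minimal sketch, assuming a midpoint rule on a wide truncated interval; the function names are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Normal probability density function with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_n(n, mu, x):
    """Closed form from the text: p^n(mu, x) is the normal PDF with mean mu and variance n."""
    return normal_pdf(x, mu, n)

def compose(n, mu, x, width=40.0, steps=8000):
    """Midpoint-rule approximation of the integral of p^n(mu, t) p(t, x) dt."""
    lo = min(mu, x) - width
    step = (abs(x - mu) + 2 * width) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * step
        total += p_n(n, mu, t) * p_n(1, t, x)
    return step * total

# Each composed value should match the closed form for p^(n+1).
for n, mu, x in ((1, 0.5, 2.0), (2, 0.0, -1.0), (3, -1.0, 1.0)):
    print(n, compose(n, mu, x), p_n(n + 1, mu, x))
```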

For each of the following special distributions, express the probability density function as a probability kernel function. Be sure to specify the parameter spaces.

- The general normal distribution on \( \R \).
- The beta distribution on \( (0, 1) \).
- The negative binomial distribution on \( \N \).

- The normal distribution with mean \( \mu \) and standard deviation \( \sigma \) defines a kernel function \( p \) from \( \R \times (0, \infty) \) to \( \R \) given by \[ p[(\mu, \sigma), x] = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \]
- The beta distribution with left parameter \( a \) and right parameter \( b \) defines a kernel function \( p \) from \( (0, \infty)^2 \) to \( (0, 1) \) given by \[ p[(a, b), x] = \frac{1}{B(a, b)} x^{a - 1} (1 - x)^{b - 1} \] where \( B \) is the beta function.
- The negative binomial distribution with stopping parameter \( k \) and success parameter \( \alpha \) defines a kernel function \( p \) from \( (0, \infty) \times (0, 1) \) to \( \N \) given by \[ p[(k, \alpha), n] = \binom{n + k - 1}{n} \alpha^k (1 - \alpha)^n \]
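
A quick sanity check on any probability kernel function is that it integrates (or sums) to 1 in the second argument for each fixed parameter value. The sketch below does this for the three families, assuming the standard parameterizations: the normal density with the factor \( \frac{1}{2} \) in the exponent, the beta density \( x^{a-1} (1 - x)^{b-1} / B(a, b) \), and the negative binomial density with parameter \( (k, \alpha) \) and variable \( n \). The generalized binomial coefficient is computed as \( \Gamma(n + k) / [\Gamma(k) \, n!] \). The function names, parameter values, and truncation points are illustrative.

```python
import math

def normal_kernel(mu, sigma, x):
    """p[(mu, sigma), x] for the normal family."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def beta_kernel(a, b, x):
    """p[(a, b), x] for the beta family, with B(a, b) computed from gamma functions."""
    beta_ab = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_ab

def neg_binomial_kernel(k, alpha, n):
    """p[(k, alpha), n] for the negative binomial family, computed on the log scale."""
    log_coef = math.lgamma(n + k) - math.lgamma(k) - math.lgamma(n + 1)
    return math.exp(log_coef + k * math.log(alpha) + n * math.log(1 - alpha))

def midpoint(h, lo, hi, steps=20000):
    """Composite midpoint rule for the integral of h over [lo, hi]."""
    step = (hi - lo) / steps
    return step * sum(h(lo + (i + 0.5) * step) for i in range(steps))

# Total mass of each kernel for one fixed parameter value; each should be close to 1.
normal_mass = midpoint(lambda x: normal_kernel(1.0, 2.0, x), -29.0, 31.0)
beta_mass = midpoint(lambda x: beta_kernel(2.5, 3.0, x), 0.0, 1.0)
nb_mass = sum(neg_binomial_kernel(2.5, 0.4, n) for n in range(400))

print(normal_mass, beta_mass, nb_mass)
```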