A note on the relationship between marginal probabilities and conditionals.

by d_ijk_stra

I occasionally come across people who define their models by a “Gibbs” sampling algorithm, without explicitly specifying a joint distribution. That is, to model a bivariate distribution on X and Y, these people do not define a joint distribution p(x,y); instead, they show me a pair of conditional distributions p(x \mid y) and p(y \mid x), and tell me that’s their probabilistic model. This always makes me feel very uncomfortable, but I couldn’t tell exactly why it’s a bad idea, so I gave it some thought this time. I could not find a compelling reason to be against this practice, but in the process I learned some interesting facts about marginal and conditional probabilities; I was so entertained that I wanted to write down my thought process here.

I started by thinking about whether a bivariate random variable (X,Y) is fully specified by p(x \mid y) and p(y \mid x). In other words, if you are given p(x \mid y) and p(y \mid x), how much do you know about p(x,y)? If there were a hole here, I thought that could make a strong argument. I started by coming up with a somewhat atrocious pair of conditional distributions that I thought would be a counter-example:

X \mid Y=y \sim I(X=y), \quad Y \mid X=x \sim I(Y=x),

where I(\cdot) is an indicator function. That is, given Y=y, X=y with probability 1; and likewise, given X=x, Y=x with probability 1. This pair of conditional distributions gives you absolutely no information about the marginal distribution of X, the marginal distribution of Y, or the joint distribution of (X,Y): X could follow a normal distribution, a Cauchy, or whatever distribution you name. So this is an example where specifying the conditional distributions is not enough.
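To make this concrete, here is a minimal sketch (my own illustration, not from the original argument) of what Gibbs sampling with these degenerate conditionals would look like: the chain never leaves its starting point, so two chains started at different values “converge” to two different point masses.

```python
# Hypothetical sketch: Gibbs sampling with the degenerate conditionals
# X | Y=y = y and Y | X=x = x (each with probability 1).
# Whatever value the chain is initialized at, it never moves, so the
# chain tells you nothing about any particular marginal distribution.
def degenerate_gibbs(x0, n_steps):
    x = y = x0
    samples = []
    for _ in range(n_steps):
        x = y  # draw X | Y=y: equals y with probability 1
        y = x  # draw Y | X=x: equals x with probability 1
        samples.append((x, y))
    return samples

# Two chains started at different points stay at their starting values
# forever: point masses at 0 and at 5, respectively.
chain_a = degenerate_gibbs(0.0, 100)
chain_b = degenerate_gibbs(5.0, 100)
```

Every marginal for X is consistent with these conditionals; the sampler just reproduces whatever initialization you hand it.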

I was not satisfied with this example, however, because one could argue that nobody would use such a stupid distribution to specify their model. Then I came up with a better-looking example:

X \mid Y=y \sim \mathcal{N}(y, 1^2), \quad Y \mid X=x \sim \mathcal{N}(x, 1^2),

where \mathcal{N}(\mu, \sigma^2) is a normal distribution with mean \mu and variance \sigma^2. This may look somewhat legitimate if you don’t pay enough attention, but it does not specify a well-defined probability distribution either; if you run the Gibbs sampling algorithm with these conditional distributions, for example, your samples will follow a random walk and drift anywhere in \mathbb{R}^2. So some pairs of conditional distributions do not specify a proper joint distribution, and therefore care should be taken if you specify your model with only conditionals.

However, I still felt uncomfortable, because some may still argue that my counter-examples fail some necessary regularity conditions while theirs do not. Then it occurred to me that Gibbs sampling works under fairly weak regularity conditions; since you can recover your distribution up to arbitrary precision with the Gibbs sampling algorithm, the conditional distributions should actually be enough to specify the joint distribution. So this time, I wrote down the relationship between the two conditional distributions:

p(x \mid y) p(y) = p(y \mid x) p(x).

Then, I realized that

\frac{p(y)}{p(x)} = \frac{p(y \mid x)}{p(x \mid y)}.

If you integrate both sides over y, since \int p(y) \, dy = 1, you get

\frac{1}{p(x)} =\int \frac{p(y \mid x)}{p(x \mid y)} dy,

in other words,

p(x)=\frac{1}{\int \frac{p(y \mid x)}{p(x \mid y)} dy}.

This implies that you can recover the marginal distribution p(x) from the conditional distributions alone. You can also get the joint distribution of (X,Y) as p(x) \cdot p(y \mid x), so a pair of conditional distributions actually fully specifies the model, as long as the above integral is finite with probability 1. Going back to the previous example where X \mid Y=y \sim \mathcal{N}(y, 1^2) and Y \mid X=x \sim \mathcal{N}(x, 1^2), the ratio of these two conditional densities is constant (equal to one), so the integral diverges for every x; that is why this pair of conditional distributions does not define a proper joint distribution.
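The recovery formula can be checked numerically on an example where the answer is known. The sketch below uses a standard bivariate normal with correlation \rho = 0.5 (my choice of test case, not from the derivation above): its conditionals are X \mid Y=y \sim \mathcal{N}(\rho y, 1-\rho^2) and Y \mid X=x \sim \mathcal{N}(\rho x, 1-\rho^2), and its true marginal is X \sim \mathcal{N}(0, 1). Integrating the ratio of conditional densities over y and taking the reciprocal should reproduce the standard normal density.

```python
import math

# Numerical check of p(x) = 1 / \int p(y|x) / p(x|y) dy on a standard
# bivariate normal with correlation rho = 0.5, whose conditionals are
#   Y | X=x ~ N(rho*x, 1 - rho^2),   X | Y=y ~ N(rho*y, 1 - rho^2),
# and whose true marginal is X ~ N(0, 1).
RHO = 0.5
S2 = 1.0 - RHO ** 2  # conditional variance

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def recovered_marginal(x, lo=-10.0, hi=10.0, n=4000):
    # Trapezoidal approximation of \int p(y|x) / p(x|y) dy; the ratio
    # decays like exp(-y^2/2), so a truncated grid is enough.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        ratio = normal_pdf(y, RHO * x, S2) / normal_pdf(x, RHO * y, S2)
        total += ratio * (0.5 if i in (0, n) else 1.0)
    return 1.0 / (total * h)

# recovered_marginal(x) should match the true N(0, 1) density at each x.
approx = recovered_marginal(0.7)
exact = normal_pdf(0.7, 0.0, 1.0)
```

Here the integral is finite for every x, so the formula cleanly recovers the marginal; in the \mathcal{N}(y,1), \mathcal{N}(x,1) example the same integrand is identically one and the integral diverges.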

In general, it may not be easy to check whether \int \frac{p(y \mid x)}{p(x \mid y)} dy is finite with probability 1. However, I feel like this is not a strong enough argument yet… I feel like there have to be more constraints on the conditional probabilities, but I don’t know them yet. On the other hand, the ratio of the two conditional densities \frac{p(y \mid x)}{p(x \mid y)} seems pretty interesting! I wonder whether it turns out to be useful somewhere else.