Bayesian Struggles: why we need conjugate priors

I want to start this first post of my series on Bayesian stats with this meme.

bayesian meme

(image from Bayesian Fun)

It is also important to make sure my dear reader(s) know that I am by no means an expert in statistics (let me know if you find mistakes!). I come from an applied math background but had no interest in taking stats in college (I thought it was too easy and too formulaic at the time, naive!!). So I am almost completely self-taught in this subject and prone to interpreting things in whatever familiar way helps me understand.

However, I think that because I had (a lot of) struggles, I can better share how the logic flows to overcome them.

How did we come to realize that we might need something called a conjugate prior?

Consider the example given by Bayes himself in his 1763 paper: let us roll a ball along the unit interval with uniform probability of stopping anywhere in between. This ball finally stops at distance $p$, and we then throw the ball $n$ more times. Denote the number of times the ball stops before reaching $p$ as $x$: what can we learn about $p$ now (aka what is $p(p \mid x)$)?

In this case our prior is no longer just our ‘belief’ but a physical fact. We understand that $p(p)$ is 1 since $p$ follows a uniform distribution on $[0,1]$, and $p(x \mid p) = \binom{n}{x} p^x (1-p)^{n-x}$ is just the binomial probability mass function.

To normalize, we need the marginal

$$p(x) = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\, dp = \frac{1}{n+1}$$

(keep doing integration by parts), which verifies the posterior $p(p \mid x) = (n+1)\binom{n}{x} p^x (1-p)^{n-x}$ to be the beta distribution $\mathrm{Beta}(x+1,\, n-x+1)$.
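As a quick sanity check (not from the original post; a minimal numerical sketch with arbitrary values of $n$ and $x$), we can confirm the $\frac{1}{n+1}$ result:

```python
# Numerical check: the marginal p(x) = ∫ C(n,x) p^x (1-p)^(n-x) dp equals 1/(n+1).
from scipy.integrate import quad
from scipy.special import comb

n, x = 10, 3  # arbitrary example values

marginal, _ = quad(lambda p: comb(n, x) * p**x * (1 - p)**(n - x), 0, 1)
print(marginal, 1 / (n + 1))  # both ≈ 0.0909
```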

With all the elements known, by Bayes' theorem $p(p \mid x) = \frac{p(x \mid p)\, p(p)}{p(x)}$, our posterior follows a beta distribution with the above parameters. However, if instead we take $\mathrm{Beta}(\alpha, \beta)$ as our prior, our posterior then follows $\mathrm{Beta}(\alpha + x,\; \beta + n - x)$.

Highlights:

  1. our posterior has the same form as the prior, which makes it easy to update

  2. the prior now also has explanatory power:

    while we roll the ball, each time it stops before $p$, it corresponds to the binomial probability of one more successful trial (aka $x \to x + 1$), so imagine we were live-recording the rolling: we would add 1 to $\alpha$. If the ball stops beyond the line, we would add 1 to $\beta$. Very intuitive (see the sketch right below).
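Here is a minimal sketch of that live-recording picture (the simulation setup and variable names are my own):

```python
# Simulate Bayes' ball experiment and do the "live-recording" conjugate update.
import numpy as np

rng = np.random.default_rng(0)

p_true = rng.uniform()          # the first ball fixes the unknown stopping point p
alpha, beta = 1.0, 1.0          # Beta(1, 1), i.e. the uniform prior

for _ in range(1000):           # roll the ball 1000 more times
    if rng.uniform() < p_true:  # ball stops before the line: one more "success"
        alpha += 1
    else:                       # ball stops past the line: one more "failure"
        beta += 1

# The posterior is Beta(alpha, beta); its mean should sit close to p_true.
print(p_true, alpha / (alpha + beta))
```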

Did that really make much of a difference? Yes, if we bring complicated integrals into the house!

Let's imagine a slightly more complicated case, where you are trying to infer the parameters of a dataset that you believe to follow a normal distribution. You no longer get nice properties like the prior simply equaling 1, and you need to compute the normalizing integral $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$ every time you update. This is actually why, most of the time, we want to find nice conjugate priors: to avoid computing high-dimensional integrals. The benefit will become clear after the explanation below.
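To make that integral concrete, here is a sketch (my own illustration, with an arbitrary independent, hence non-conjugate, prior and made-up data) of the brute-force grid normalization you would otherwise be stuck redoing at every update:

```python
# Brute-force alternative to conjugacy: normalize the posterior on a (mu, tau) grid.
# Without a conjugate prior, every update needs a 2-D integral like this one.
import numpy as np
from scipy.stats import norm, gamma

data = np.random.default_rng(1).normal(2.0, 1.5, size=50)

mu_grid = np.linspace(-5.0, 10.0, 300)
tau_grid = np.linspace(0.01, 3.0, 300)
d_mu, d_tau = mu_grid[1] - mu_grid[0], tau_grid[1] - tau_grid[0]
MU, TAU = np.meshgrid(mu_grid, tau_grid)

# Unnormalized log-posterior: an independent vague prior plus the normal log-likelihood.
log_post = norm.logpdf(MU, 0.0, 10.0) + gamma.logpdf(TAU, a=1.0, scale=1.0)
log_post += norm.logpdf(data[:, None, None], MU, 1.0 / np.sqrt(TAU)).sum(axis=0)

post = np.exp(log_post - log_post.max())
post /= post.sum() * d_mu * d_tau   # the 2-D integration conjugacy lets us skip
```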

Consider i.i.d. samples $x_1, \dots, x_n$ drawn from a $N(\mu, \sigma^2)$ distribution with both $\mu$ and $\sigma^2$ random. Let $\tau = 1/\sigma^2$ (the precision); it suffices to work with the parameter set $(\mu, \tau)$.

Our goal: find the posterior $p(\mu, \tau \mid x_1, \dots, x_n)$

*Tricky part:* find a conjugate prior $p(\mu, \tau)$

Similarly to the simpler problem where only $\mu$ is fixed (and we infer $\tau$ alone), consider a gamma prior $\tau \sim \mathrm{Gamma}(\alpha, \beta)$, and we use the parametrization whose PDF is $p(\tau) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \tau^{\alpha - 1} e^{-\beta \tau}$; then $\sigma^2 = 1/\tau$ is (obviously) an inverse-gamma distribution, aka $\sigma^2 \sim \mathrm{Inv\text{-}Gamma}(\alpha, \beta)$.

However, I did struggle for a while understanding what $p(\mu \mid \tau)$ should be, because $\mu$ seems independent of $\tau$. But if $\mu$ and $\tau$ are independent of each other, the posterior won't stay in this handy format, so we end up having nothing useful. Later, from the wisdom of my professor, he made $\mu \mid \tau \sim N\!\left(\mu_0, \frac{1}{\lambda \tau}\right)$ for $\lambda$ being some real number, and of course $\lambda > 0$. The reason behind this is that, once both are conditioned on the data, $\mu$ and $\tau$ should be dependent.

The full prior density is then

$$p(\mu, \tau) = p(\mu \mid \tau)\, p(\tau) \propto \tau^{1/2} e^{-\frac{\lambda \tau}{2}(\mu - \mu_0)^2} \cdot \tau^{\alpha - 1} e^{-\beta \tau} \tag{1}$$

and the posterior is

$$p(\mu, \tau \mid x_1, \dots, x_n) \propto (1) \times \prod_{i=1}^{n} \tau^{1/2} e^{-\frac{\tau}{2}(x_i - \mu)^2}.$$

Omitting some merging and substituting tricks that nobody wants to see, we come down to a posterior of exactly the same normal-gamma form, with updated parameters

$$\mu_0' = \frac{\lambda \mu_0 + n \bar{x}}{\lambda + n}, \qquad \lambda' = \lambda + n, \qquad \alpha' = \alpha + \frac{n}{2}, \qquad \beta' = \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{\lambda n (\bar{x} - \mu_0)^2}{2(\lambda + n)},$$

which leaves only four pieces of simple algebra to maintain at each update.
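Here is a minimal sketch of those four updates (my own implementation under the parametrization above; the helper name `normal_gamma_update` and the test values are made up):

```python
# Conjugate normal-gamma update: four scalar parameters instead of any integration.
import numpy as np

def normal_gamma_update(mu0, lam, alpha, beta, data):
    """Return the posterior (mu0', lambda', alpha', beta') after observing `data`."""
    data = np.asarray(data, dtype=float)
    n, xbar = data.size, data.mean()
    mu0_new = (lam * mu0 + n * xbar) / (lam + n)
    lam_new = lam + n
    alpha_new = alpha + n / 2.0
    beta_new = (beta + 0.5 * ((data - xbar) ** 2).sum()
                + lam * n * (xbar - mu0) ** 2 / (2.0 * (lam + n)))
    return mu0_new, lam_new, alpha_new, beta_new

# Usage: the posterior mean of mu should approach the true mean as data accumulate.
rng = np.random.default_rng(42)
params = (0.0, 1.0, 1.0, 1.0)           # a weak prior
for _ in range(20):                     # 20 batches of 50 observations each
    batch = rng.normal(2.0, 1.5, size=50)
    params = normal_gamma_update(*params, batch)
print(params[0])                        # ≈ 2.0
```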

To end this article, I will share a fun read, A Catalog of Noninformative Priors, which I found a while ago.


© Mengzhou (Jojo) Tang, 2018
