The goal is for you to come away from this

video understanding one of the most important formulas in all of probability, Bayes’ theorem. This formula is central to scientific discovery,

it’s a core tool in machine learning and AI, and it’s even been used for treasure

hunting, when in the 80’s a small team led by Tommy Thompson used Bayesian search tactics

to help uncover a ship that had sunk a century and a half earlier carrying what, in today’s

terms, amounts to $700,000,000 worth of gold. So it’s a formula worth understanding. But of course there are multiple different levels of possible understanding. At the simplest there’s just knowing what each part means, so

you can plug in numbers. Then there’s understanding why it’s true; and later

I’m gonna show you a certain diagram that’s helpful for rediscovering the formula on the fly as

needed. Then there’s being able to recognize when

you need to use it.

With the goal of gaining a deeper understanding,

you and I will tackle these in reverse order. So before dissecting the formula, or explaining

the visual that makes it obvious, I’d like to tell you about a man named Steve. Listen

carefully. Steve is very shy and withdrawn, invariably

helpful but with very little interest in people or in the world of reality. A meek and tidy

soul, he has a need for order and structure, and a passion for detail. Which of the following do you find more likely:

“Steve is a librarian”, or “Steve is a farmer”? Some of you may recognize this as an example

from a study conducted by the psychologists Daniel Kahneman and Amos Tversky, whose Nobel-prize-winning work was popularized in books like “Thinking, Fast and Slow” and “The Undoing Project”.

They researched human judgments, with a frequent focus on when these judgments irrationally contradict what the laws of probability suggest they should be. The example with Steve, the maybe-librarian-maybe-farmer,

illustrates one specific type of irrationality. Or maybe I should say “alleged” irrationality;

some people debate the conclusion, but more on all that in a moment. According to Kahneman and Tversky, after people are given this description of Steve as a “meek and tidy soul”, most say he is more likely

to be a librarian than a farmer. After all, these traits line up better with the stereotypical view of a librarian than that of a farmer. And according to Kahneman and Tversky, this is irrational. The point is not whether people hold correct or biased views about the personalities of librarians or farmers, it’s that almost

no one thinks to incorporate information about the ratio of farmers to librarians into their

judgments.

In their paper, Kahneman and Tversky said that in the US that ratio is about 20

to 1. The numbers I can find for today put it much higher than that, but let’s just

run with the 20 to 1 ratio since it’s a bit easier to illustrate, and proves the point

just as well. To be clear, anyone who is asked this question is not expected to have perfect information on the actual statistics of farmers, librarians,

and their personality traits. But the question is whether people even think to consider this ratio, enough to make a rough estimate. Rationality is not about knowing facts, it’s about recognizing which facts are relevant. If you do think to make this estimate, there’s a pretty simple way to reason about the question – which, spoiler alert, involves all the

essential reasoning behind Bayes’ theorem.

You might start by picturing a representative sample of farmers and librarians, say, 200 farmers and 10 librarians. Then when you hear the meek and tidy soul description, let’s say your gut instinct is that 40% of librarians would fit that description and that 10% of farmers would. That would mean that from your sample, you’d expect that about 4 librarians fit it, and that 20 farmers do. The probability that a random person who fits this description is a librarian is 4/24, or 16.7%. So even if you think a librarian is 4 times

as likely as a farmer to fit this description, that’s not enough to overcome the fact that

there are way more farmers.
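
If you want to check that arithmetic yourself, here is a minimal sketch of the representative-sample reasoning in Python. The 40% and 10% figures are just the gut-instinct estimates from above, not measured data, and the variable names are made up for illustration.

```python
# Representative sample from the example: 200 farmers and 10 librarians
# (the 20-to-1 ratio quoted from Kahneman and Tversky).
farmers = 200
librarians = 10

# Gut-instinct estimates of who fits the "meek and tidy soul" description.
# These are assumptions for illustration, not real statistics.
frac_librarians_fit = 0.40
frac_farmers_fit = 0.10

# Expected head counts in the sample that fit the description.
librarians_fit = librarians * frac_librarians_fit  # about 4
farmers_fit = farmers * frac_farmers_fit           # about 20

# Among everyone who fits the description, what share are librarians?
p_librarian_given_description = librarians_fit / (librarians_fit + farmers_fit)

print(f"{p_librarian_given_description:.3f}")  # 4 / 24, roughly 0.167
```

Try swapping in your own estimates for the two proportions; the structure of the calculation stays the same.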

The upshot, and this is the key mantra underlying Bayes’

theorem, is that new evidence should not completely determine your beliefs in a vacuum; it should update prior beliefs. If this line of reasoning makes sense to you, the way seeing evidence restricts the space of possibilities, and the ratio you need to consider after that, then congratulations! You understand the heart of Bayes’ theorem. Maybe the numbers you’d estimate would be a little bit different, but what matters is how you fit the numbers together to update a belief based on evidence. Here, see if you can take a minute to generalize what we just did and write it

down as a formula.

The general situation where Bayes’ theorem is relevant is when you have some hypothesis, say that Steve is a librarian, and you see

some evidence, say this verbal description of Steve as a “meek and tidy soul”, and

you want to know the probability that the hypothesis holds given that the evidence is

true. In the standard notation, this vertical bar means “given that”. As in, we’re

restricting our view only to the possibilities where the evidence holds. The first relevant number is the probability

that the hypothesis holds before considering the new evidence. In our example, that was

the 1/21, which came from considering the ratio of farmers to librarians in the general

population. This is known as the prior. After that, we needed to consider the proportion of librarians that fit this description; the probability we would see the evidence given that the hypothesis is true.

Again, when you see this vertical bar, it means we’re talking

about a proportion of a limited part of the total space of possibilities, in this case,

limited to the left side where the hypothesis holds. In the context of Bayes’ theorem,

this value also has a special name, it’s the “likelihood”. Similarly, we need to know how much of the other side of our space includes the evidence; the probability of seeing the evidence given

that our hypothesis isn’t true.

This little elbow symbol is commonly used to mean “not” in probability. Now remember what our final answer was. The probability that our librarian hypothesis is true given the evidence is the total number of librarians fitting the evidence, 4, divided by the total number of people fitting the

evidence, 24. Where does that 4 come from? Well it’s the

total number of people, times the prior probability of being a librarian, giving us the 10 total

librarians, times the probability that one of those fits the evidence. That same number shows up again in the denominator, but we need to add in the total number of people

times the proportion who are not librarians, times the proportion of those who fit the

evidence, which in our example gave 20.
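
Written out with the numbers from the example (210 people, a prior of 1/21, and the gut-estimate proportions 0.4 and 0.1), that fraction of counts looks like this:

```latex
\[
P(\text{librarian} \mid \text{description})
= \frac{210 \cdot \tfrac{1}{21} \cdot 0.4}
       {210 \cdot \tfrac{1}{21} \cdot 0.4 \;+\; 210 \cdot \tfrac{20}{21} \cdot 0.1}
= \frac{4}{4 + 20}
= \frac{4}{24} \approx 16.7\%
\]
```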

The total number of people in our example,

210, gets canceled out – which of course it should, that was just an arbitrary choice

we made for illustration – leaving us finally with the more abstract representation purely in terms of probabilities. This, my friends, is Bayes’ theorem. You often see this big denominator written

more simply as P(E), the total probability of seeing the evidence. In practice, to calculate it, you almost always have to break it down into the case where the hypothesis is true,

and the one where it isn’t. Piling on one final bit of jargon, this final

answer is called the “posterior”; it’s your belief about the hypothesis after seeing the evidence. Writing it all out abstractly might seem more complicated than just thinking through the example directly with a representative sample; and yeah, it is! Keep in mind, though, the value of a formula like this is that it lets

you quantify and systematize the idea of changing beliefs.
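
For reference, here is the formula just described, written in the standard notation, with H standing for the hypothesis, E for the evidence, and ¬H for “not H”. The second form uses P(E) as shorthand for the full denominator.

```latex
\[
P(H \mid E)
= \frac{P(H)\,P(E \mid H)}
       {P(H)\,P(E \mid H) + P(\neg H)\,P(E \mid \neg H)}
= \frac{P(H)\,P(E \mid H)}{P(E)}
\]
```

Here P(H) is the prior, P(E | H) is the likelihood, and P(H | E) is the posterior.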

Scientists use this formula when

analyzing the extent to which new data validates or invalidates their models; programmers use it in building artificial intelligence, where you sometimes want to explicitly and numerically model a machine’s belief. And honestly just for how you view yourself, your own opinions and what it takes for your mind to change, Bayes’ theorem can reframe how you think

about thought itself. Putting a formula to it is also all the more important as the examples get more intricate. However you end up writing it, I’d actually

encourage you not to memorize the formula, but to draw out this diagram as needed. This is sort of the distilled version of thinking with a representative sample, where we think with areas instead of counts, which is more

flexible and easier to sketch on the fly. Rather than bringing to mind some specific

number of examples, think of the space of all possibilities as a 1×1 square. Any event

occupies some subset of this space, and the probability of that event can be thought about as the area of that subset.

For example, I like to think of the hypothesis as filling

the left part of this square, with a width of P(H). I recognize I’m being a bit repetitive,

but when you see evidence, the space of possibilities gets restricted. Crucially, that restriction

may not happen evenly between the left and the right. So the new probability for the

hypothesis is the proportion it occupies in this restricted subspace. If you happen to think a farmer is just as

likely to fit the evidence as a librarian, then the proportion doesn’t change, which

should make sense. Irrelevant evidence doesn’t change your belief. But when these likelihoods are very different, that's when your belief changes a lot. This is actually a good time to step back

and consider a few broader takeaways about how to make probability more intuitive, beyond Bayes’ theorem. First off, there’s the trick of thinking about a representative sample with a specific number of examples, like our 210 librarians and farmers.

There’s actually

another Kahneman and Tversky result to this effect, which is interesting enough to interject here. They did an experiment similar to the one

with Steve, but where people were given the following description of a fictitious woman

named Linda: Linda is 31 years old, single, outspoken,

and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. They were then asked what is more likely:

That Linda is a bank teller, or that Linda is a bank teller and is active in the feminist

movement.

85% of participants said the latter is more likely, even though the set of bank

tellers active in the feminist movement is a subset of the set of bank tellers! But, what’s fascinating is that there’s

a simple way to rephrase the question that dropped this error from 85% to 0. Instead,

if participants are told there are 100 people who fit this description, and asked

to estimate how many of those 100 are bank tellers, and how many are bank tellers who

are active in the feminist movement, no one makes the error. Everyone correctly assigns a higher number to the first option than to the second. Somehow a phrase like “40 out of 100”

kicks our intuition into gear more effectively than “40%”, much less “0.4”, or abstractly

referencing the idea of something being more or less likely.
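
In symbols, the reason the second option can never be the more likely one is that a conditional probability is at most 1, so multiplying by it can only shrink the number:

```latex
\[
P(\text{bank teller} \cap \text{feminist})
= P(\text{bank teller}) \cdot P(\text{feminist} \mid \text{bank teller})
\le P(\text{bank teller})
\]
```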

That said, representative samples don’t

easily capture the continuous nature of probability, so turning to area is a nice alternative,

not just because of the continuity, but also because it’s way easier to sketch out while

you’re puzzling over some problem. You see, people often think of probability

as being the study of uncertainty. While that is, of course, how it’s applied in science,

the actual math of probability is really just the math of proportions, where turning to

geometry is exceedingly helpful. I mean, if you look at Bayes’ theorem as

a statement about proportions – proportions of people, of areas, whatever – once you

digest what it’s saying, it’s actually kind of obvious.

Both sides tell you to look

at all the cases where the evidence is true, and consider the proportion where the hypothesis is also true. That’s it. That’s all it’s saying. What’s noteworthy is that such a straightforward fact about proportions can become hugely significant for science, AI, and any situation where you

want to quantify belief. You’ll get a better glimpse of this as we get into more examples. But before any more examples, we have some unfinished business with Steve.

Some psychologists debate Kahneman and Tversky’s conclusion, that the rational thing to do is to bring to mind the ratio of farmers to librarians.

They complain that the context is ambiguous. Who is Steve, exactly? Should you expect he’s a randomly sampled American? Or would you do better to assume he’s a friend of these

two psychologists interrogating you? Or perhaps someone you’re personally likely to know? This assumption determines the prior. I, for one, run into many more librarians

in a given month than farmers. And needless to say, the probability of a librarian or

a farmer fitting this description is highly open to interpretation. But for our purposes, understanding the math, notice how any questions worth debating can be pictured in the context of the diagram.

Questions of context shift around the prior, and questions of personalities and stereotypes

shift the relevant likelihoods.
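
To see how those two kinds of disagreement play out numerically, here is a small sketch of Bayes’ theorem as a Python function. The function name `posterior` and every number other than the original 1/21, 0.4 and 0.1 are hypothetical, chosen only to illustrate the effect of changing one input at a time.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Return P(H|E) given P(H), P(E|H) and P(E|not H)."""
    # Total probability of the evidence, split by whether H holds.
    evidence = prior * likelihood + (1 - prior) * likelihood_if_not
    return prior * likelihood / evidence


# The numbers used for Steve: prior 1/21, likelihoods 0.4 and 0.1.
baseline = posterior(1 / 21, 0.4, 0.1)

# A question of context: suppose (hypothetically) librarians and farmers
# are equally common among the people you would be asked about.
different_context = posterior(0.5, 0.4, 0.1)

# A question of stereotypes: suppose (hypothetically) 80% of librarians
# fit the description instead of 40%.
stronger_stereotype = posterior(1 / 21, 0.8, 0.1)

# Evidence that is equally likely either way leaves the prior unchanged.
irrelevant_evidence = posterior(1 / 21, 0.3, 0.3)

print(f"baseline:            {baseline:.3f}")             # about 0.167
print(f"different context:   {different_context:.3f}")    # about 0.800
print(f"stronger stereotype: {stronger_stereotype:.3f}")  # about 0.286
print(f"irrelevant evidence: {irrelevant_evidence:.3f}")  # about 0.048, i.e. 1/21
```

Changing the prior moves the answer far more here than doubling the likelihood does, which is why the question of who Steve is supposed to be matters so much.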

All that said, whether or not you buy this

particular experiment, the ultimate point that evidence should not determine beliefs, but

update them, is worth tattooing in your mind. I’m in no position to say whether this does

or doesn’t run against natural human intuition, we’ll leave that to the psychologists. What’s

more interesting to me is how we can reprogram our intuitions to authentically reflect the

implications of math, and bringing to mind the right image can often do just that.