---
title: "Chapter 2 Notes"
author: "Emanuel Rodriguez"
execute:
  message: false
  warning: false
format:
  html:
    monofont: "Cascadia Mono"
    highlight-style: gruvbox-dark
    css: styles.css
    callout-icon: false
    callout-appearance: simple
    toc: false
    html-math-method: katex
---
*Note: these notes are a work in progress.*

In this chapter we step through a "fake" vs. "real" news example to build a
framework for deciding whether a new article titled "The President has a
secret!" is real or fake. We then go on to build a probability model known as
the Binomial model using the Bayesian framework.

```{r}
#| message: false
#| warning: false
# libraries
library(bayesrules)
library(dplyr)
library(tidyr)
library(gt)
library(tibble)
library(ggplot2)
data(fake_news)
fake_news <- tibble::as_tibble(fake_news)
```
What proportion of news articles were labeled fake vs. real?

```{r}
fake_news |> head()

fake_news |>
  group_by(type) |>
  summarise(
    total = n(),
    prop = total / nrow(fake_news)
  )
```
If we let $B$ be the event that a news article is "fake" news, and
$B^c$ be the event that a news article is "real", we can write the following:
$$P(B) = .4$$
$$P(B^c) = .6$$
This is the first "clue", or piece of data, that we build into our framework.
Namely, the majority of articles are "real", so without any other information
we would simply predict that the new article is "real". This knowledge becomes
our prior, and the Bayesian workflow is to update that prior as additional data
comes in. The new observation we make is the use of exclamation marks "!": the
use of "!" is more frequent in news articles labeled as "fake". We will want to
incorporate this into our framework to decide whether the new incoming article
should be labelled as real or fake.

### Likelihood
:::{.callout-note}
## Probability and Likelihood
When the event $B$ is known, we can evaluate the uncertainty of events
$A$ and $A^c$ given $B$:
$$P(A|B) \text{ vs } P(A^c|B)$$
If, on the other hand, we know event $A$, then we can evaluate the relative
compatibility of data $A$ with $B$ and $B^c$ using likelihood functions:
$$L(B|A) \text{ vs } L(B^c|A)$$
$$=P(A|B) \text{ vs } P(A|B^c)$$
:::
So in our case, we don't know whether the new incoming article is real or fake,
but we do know that its title has an exclamation mark. This means we can use
likelihood functions to evaluate how compatible this piece of data, an "!" in
the title, is with the article being fake versus real. We can formulate this as:
$$L(B|A) \text{ vs } L(B^c|A)$$
and perform the computation in R as follows:

```{r}
# within each type (fake/real), what proportion of titles use "!" vs no "!"?
prop_of_excl_within_type <- fake_news |>
  group_by(type, title_has_excl) |>
  summarise(
    total = n()
  ) |>
  ungroup() |>
  group_by(type) |>
  summarise(
    has_excl = title_has_excl,
    prop_within_type = total / sum(total)
  )
```
```{r}
prop_of_excl_within_type |>
  pivot_wider(names_from = "type", values_from = prop_within_type) |>
  gt() |>
  gt::cols_label(
    has_excl = "Contains Exclamation",
    fake = "Fake",
    real = "Real") |>
  gt::fmt_number(columns = c("fake", "real"), decimals = 3) |>
  gt::cols_width(everything() ~ px(100))
```
The table above also shows the likelihoods for the case
when an article does not contain an exclamation point in
the title. It's really important to note that these are likelihoods,
not probabilities: it is not the case that $L(B|A) + L(B^c|A) = 1$; in fact this
sum evaluates to a number less than one. However, since we have
$L(B|A) = .267$ and $L(B^c|A) = .022$, we have gained additional
knowledge: the use of "!" in a title is far more compatible
with a fake news article than a real one.

Up to this point we can summarize our framework as follows:

| event      | $B$  | $B^c$ | Total |
|------------|------|-------|-------|
| prior      | .4   | .6    | 1     |
| likelihood | .267 | .022  | .289  |

Our next goal is to come up with the normalizing factors needed to build our
probability table:

|       | $B$ | $B^c$ | Total |
|-------|-----|-------|-------|
| $A$   | (1) | (2)   |       |
| $A^c$ | (3) | (4)   |       |
| Total | .4  | .6    | 1     |

A couple of things to note about our table: (1) + (3) = .4, (2) + (4) = .6, and
(1) + (2) + (3) + (4) = 1.

(1.) $P(A \cap B) = P(A|B)P(B)$. We know the likelihood $L(B|A) = P(A|B)$ and we
also know the prior, so we insert these to get
$$P(A \cap B) = P(A|B)P(B) = .267 \times .4 = .1068$$

(3.) $P(A^c \cap B) = P(A^c|B)P(B)$. In this case we do know the prior $P(B) = .4$,
but we don't directly know the value of $P(A^c|B)$. However, we note that
$P(A|B) + P(A^c|B) = 1$, therefore we compute $P(A^c|B) = 1 - P(A|B) = 1 - .267 = .733$
$$P(A^c \cap B) = P(A^c|B)P(B) = .733 \times .4 = .2932$$

We can now confirm that $.1068 + .2932 = .4$. Moving on to (2) and (4):

(2.) $P(A \cap B^c) = P(A|B^c)P(B^c)$. In this case we know the likelihood
$L(B^c|A) = P(A|B^c)$ and we know the prior $P(B^c)$, therefore
$$P(A \cap B^c) = P(A|B^c)P(B^c) = .022 \times .6 = .0132$$

(4.) $P(A^c \cap B^c) = P(A^c|B^c)P(B^c) = (1 - .022) \times .6 = .5868$

and we can confirm that $.0132 + .5868 = .6$,
and we can fill in the rest of the table:

|       | $B$   | $B^c$ | Total |
|-------|-------|-------|-------|
| $A$   | .1068 | .0132 | .12   |
| $A^c$ | .2932 | .5868 | .88   |
| Total | .4    | .6    | 1     |
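
As a quick check, we can recompute these four cells directly in R from the
prior and likelihood values above (a small sketch; the object names are just
for illustration):

```{r}
# prior and likelihood values taken from the tables above
prior_fake <- .4
prior_real <- .6
lik_excl_given_fake <- .267  # P(A | B)
lik_excl_given_real <- .022  # P(A | B^c)

# joint probabilities: each cell is a likelihood times the matching prior
joint <- tibble::tibble(
  event = c("A", "A^c"),
  fake = c(lik_excl_given_fake, 1 - lik_excl_given_fake) * prior_fake,
  real = c(lik_excl_given_real, 1 - lik_excl_given_real) * prior_real
)
joint |> mutate(total = fake + real)
```
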
An important concept we used above is the idea of **total probability**.

:::{.callout-tip}
## total probability
The **total probability** of observing a real article is made up of the sum of
its parts. Namely,
$$P(B^c) = P(A \cap B^c) + P(A^c \cap B^c)$$
$$=P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c)$$
$$=.0132 + .5868 = .6$$
:::
In the above calculations we also stepped through **joint probabilities**.

:::{.callout-note}
## Joint and conditional probability
$$P(A \cap B) = P(A|B)P(B)$$
$A$ and $B$ are said to be independent events, if and only if
$$P(A \cap B) = P(A)P(B)$$
from this we can also derive the definition of a conditional probability
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
:::
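
Using the numbers in our table, we can also check that $A$ and $B$ are not
independent here, since $P(A \cap B) \neq P(A)P(B)$:

```{r}
# joint probability from the table vs what independence would imply
p_a_and_b <- .1068
p_a <- .12
p_b <- .4
c(joint = p_a_and_b, if_independent = p_a * p_b)
```
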
At this point we are able to answer the question, "What is the probability that
the new article is fake?" Given that the new article has an exclamation
point, we can zoom into the top row of the table of probabilities. Within
this row we have probabilities $.1068/.12 = .89$ for fake and $.0132/.12 = .11$
for real.

This is essentially Bayes' Rule. We developed a posterior probability for an event
$B$ given some observation $A$. We did so by combining the likelihood of event $B$
given the new data $A$ with the prior probability of event $B$. More formally we
have the following definition:

:::{.callout-note}
## Bayes' Rule
The posterior probability of an event $B$ given an event $A$ is:
$$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{L(B|A)P(B)}{P(A)}$$
where $L$ is the likelihood function $L(B|A) = P(A|B)$ and $P(A)$ is the
total probability of $A$.
More generally,
$$\text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}$$
:::
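
Applying the rule directly to our numbers gives back the posterior we found
above (a quick sketch using the values from this section):

```{r}
# posterior probability that the article is fake, given "!" in the title
prior <- c(fake = .4, real = .6)
likelihood <- c(fake = .267, real = .022)        # L(B|A), L(B^c|A)
normalizing_constant <- sum(likelihood * prior)  # total probability of A
likelihood * prior / normalizing_constant
```
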
### Simulation

```{r}
articles <- tibble::tibble(type = c("real", "fake"))
priors <- c(.6, .4)
articles_sim <- sample_n(articles, 10000, replace = TRUE, weight = priors)
```
```{r}
articles_sim |>
  ggplot(aes(x = type)) + geom_bar()
```
And a summary table:

```{r}
articles_sim |>
  group_by(type) |>
  summarise(
    total = n(),
    prop = total / nrow(articles_sim)
  ) |>
  gt() |>
  gt::cols_width(everything() ~ px(100))
```
The simulation of 10,000 articles shows very nearly
the same priors we had from the data. We can now add
the exclamation-point usage into the data.

```{r}
articles_sim <- articles_sim |>
  mutate(model_data = case_when(
    type == "fake" ~ .267,
    type == "real" ~ .022
  ))
```
The plan here is to iterate through the 10,000 samples
and use the `model_data` value to assign either "yes" or
"no" using the `sample()` function.

```{r}
data <- c("yes", "no")
articles_sim <- articles_sim |>
  mutate(id = row_number()) |>
  group_by(id) |>
  mutate(usage = sample(data, 1, prob = c(model_data, 1 - model_data)))
```
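
An equivalent, vectorized way to do the same thing (just a sketch, not the
book's approach) is to draw one Bernoulli trial per article with `rbinom()`:

```{r}
# one Bernoulli draw per article, with success probability model_data
articles_sim_alt <- articles_sim |>
  ungroup() |>
  mutate(usage = ifelse(rbinom(n(), size = 1, prob = model_data) == 1,
                        "yes", "no"))
```
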
```{r}
articles_sim |>
  group_by(usage, type) |>
  summarise(
    total = n()
  ) |>
  pivot_wider(names_from = type, values_from = total)
```
```{r}
articles_sim |>
  ggplot(aes(x = type, fill = usage)) +
  geom_bar() +
  scale_fill_discrete(type = c("gray8", "dodgerblue4"))
```
So far we have computed both the priors and the likelihoods; we can now simply
filter our data to reflect the incoming article and determine our
posterior.

```{r}
articles_sim |>
  filter(usage == "yes") |>
  group_by(type) |>
  summarise(
    total = n()
  ) |>
  mutate(
    prop = total / sum(total)
  )
```

## Binomial Model and the chess example

The example used here is a chess match between a human
and a computer, "Deep Blue". The set-up is that the two
faced each other in 1996, and the human won. A rematch is
scheduled for 1997. We would like to model the number of games
out of 6 that the human can win.
Let $\pi$ be the probability that the human wins any one match against
the computer. To simplify things greatly we assume that $\pi$ takes on
values of .2, .5, .8. We also assume the following prior (we are told
in the book that we will learn how to build these later on):

| $\pi$    | .2  | .5  | .8  | total |
|----------|-----|-----|-----|-------|
| $f(\pi)$ | .10 | .25 | .65 | 1     |

:::{.callout-tip}
## Note
It's important to note here that the values of $\pi$ **do not** need to add up
to 1. $\pi$ represents the chance of winning any single game, so it can take
on any value in $[0, 1]$. On the other hand, $f$ is a function that maps each
value of $\pi$ to a probability; this is defined next.
:::
:::{.callout-note}
## Discrete Probability Model
Let $Y$ be a discrete random variable. The probability model for $Y$ is
described by a **probability mass function** (pmf) defined as:
$$f(y) = P(Y = y)$$
and has the following properties
1. $0 \leq f(y) \leq 1\;\; \forall y$
2. $\sum_{\forall y}f(y) = 1$
:::
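
As a tiny concrete example (my own, not from the book): the pmf of a fair
six-sided die assigns $f(y) = 1/6$ to each face, and both properties are easy
to verify.

```{r}
# pmf of a fair six-sided die: each face has probability 1/6
die_pmf <- rep(1 / 6, 6)
all(die_pmf >= 0 & die_pmf <= 1)  # property 1: values lie in [0, 1]
sum(die_pmf)                      # property 2: probabilities sum to 1
```
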
:::{.callout-tip}
## in emanuel's words
What does this mean? It's very straightforward: a pmf is a function that takes
in some value $y$ and outputs the probability that the random variable
$Y$ equals $y$.
:::
Next we would like to add the dependency of $Y$ on $\pi$; we do so by
introducing the conditional pmf.

:::{.callout-note}
## Conditional probability model of data $Y$
Let $Y$ be a discrete random variable that depends on some parameter
$\pi$. We define the conditional probability model of $Y$ as the
conditional pmf,
$$f(y|\pi) = P(Y = y | \pi)$$
and has the following properties,
1. $0 \leq f(y|\pi) \leq 1\;\; \forall y$
2. $\sum_{\forall y}f(y|\pi) = 1$
:::
:::{.callout-tip}
## in emanuel's words
This is essentially the same probability model we defined above, except
now we condition the probabilities on some parameter $\pi$.
:::
In the example of the chess player we must make some assumptions:

1. The chance of winning any match in the rematch stays constant. So if
at match number 1 the human has a .65 probability of winning, the same
holds for matches 2 through 6.
2. Winning or losing a game does not affect the chances of winning
or losing the next game, i.e. matches are independent of one another.

These two assumptions lead us to the **Binomial Model**.

:::{.callout-note}
## The Binomial Model
Let the random variable $Y$ represent the number of successes in $n$ trials.
Assume that each trial is independent, and the probability of success in a
given trial is $\pi$. Then the conditional dependence of $Y$ on $\pi$ can
be modeled by the **Binomial Model** with parameters $n$ and $\pi$. We can
write this as,
$$Y|\pi \sim Bin(n, \pi)$$
the binomial model is specified by the pmf:
$$f(y|\pi) = {n \choose y} \pi^y(1 - \pi)^{n-y}$$
:::
Knowing this, we can represent $Y$, the total number of matches out of 6
that the human can win, as
$$Y|\pi \sim Bin(6, \pi)$$
with conditional pmf:
$$f(y|\pi) = {6 \choose y}\pi^y(1 - \pi)^{6 - y}\;\; \text{for } y \in \{0, 1, 2, 3, 4, 5, 6\}$$
With the pmf we can now determine the probability of the human winning $y$
matches out of 6 for any given value of $\pi$:

```{r}
chess_pmf <- function(y, p, n = 6) {
  choose(n, y) * (p ^ y) * (1 - p)^(n - y)
}

# what is the probability that the human wins 5 of the 6 games, given pi = .8?
chess_pmf(y = 5, p = .8)
```
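
Since `chess_pmf()` is just the binomial pmf, we can sanity-check it against
base R's built-in `dbinom()`:

```{r}
# should match chess_pmf(y = 5, p = .8)
dbinom(5, size = 6, prob = .8)
```
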
:::{.callout-tip}
## intuition
The formula for the binomial is actually pretty intuitive. First you have
the scalar ${n \choose y}$, which counts the total number of ways
the player can win $y$ games out of the possible $n$. This is multiplied
by the probability of the $y$ successes, since $p^y$ can be
re-written as $p \times p \times \cdots \times p$, and then by the
probability of the $n - y$ failures, $(1 - p)^{n - y}$.
:::
```{r}
# pmf of winning y = 4 games across a grid of pi values
pies <- seq(0, 1, by = .05)
py <- chess_pmf(y = 4, p = pies)
d <- data.frame(pies = pies, py = py)

d |>
  ggplot(aes(pies, py)) + geom_col()
```
```{r}
pies <- c(.2, .5, .8)
ys <- 0:6
d <- tidyr::expand_grid(pies, ys)
# evaluate the pmf for every (pi, y) pair
fys <- purrr::map2_dbl(d$ys, d$pies, ~ chess_pmf(.x, .y, n = 6))
d$fys <- fys
d$display_pi <- as.factor(paste("pi =", d$pies))

d |>
  ggplot(aes(x = ys, y = fys)) +
  geom_col() +
  scale_x_continuous(breaks = 0:6) +
  facet_wrap(vars(display_pi))
```
The plot shows the three possible values of $\pi$ along
with the value of the pmf for each possible number of
matches the human can win. The values of $f(y|\pi)$
are pretty intuitive: we would expect the random variable $Y$
to be lower when the value of $\pi$ is lower and higher when
the value of $\pi$ is higher.

For the sake of the exercise let's add more values of $\pi$
so that we can see this shift happen in more detail.

```{r}
pies <- seq(.1, .9, by = .1)
ys <- 0:6
d <- tidyr::expand_grid(pies, ys)
fys <- purrr::map2_dbl(d$ys, d$pies, ~ chess_pmf(.x, .y, n = 6))
d$fys <- fys
d$display_pi <- as.factor(paste("pi =", d$pies))

d |>
  ggplot(aes(x = ys, y = fys)) +
  geom_col() +
  scale_x_continuous(breaks = 0:6) +
  facet_wrap(vars(display_pi), nrow = 3)
```
As it turns out, the human ended up winning just
one game in the 1997 rematch, $Y = 1$. The next step in our
analysis is to determine how compatible this new data is with
each value of $\pi$, that is, the likelihood.
This is very easy to do with all the work we have done so far:

```{r}
d |>
  filter(ys == 1) |>
  ggplot(aes(pies, fys)) +
  geom_col() +
  scale_x_continuous(breaks = seq(.1, .9, by = .1))
```
It's very important to note the following:

```{r}
# this will sum to a value greater than 1!!
d |>
  filter(ys == 1) |>
  pull(fys) |>
  sum()
```
:::{.callout-important icon="true"}
This has been mentioned before, but it's an important message
to drive home. Note that the reason why the values sum to a
value greater than 1 is that they are **not** probabilities; they
are likelihoods. We are determining how likely each value of
$\pi$ is given that we have observed $Y = 1$.
:::
We can formalize the likelihood function $L$ in our example
as follows:
$$L(\pi|y=1) = f(y=1|\pi) = {6 \choose 1}\pi^1(1-\pi)^{6-1}$$
$$ = 6\pi(1 - \pi)^5$$
We can test this out
```{r}
6 * .2 * (.8 ^ 5)
```
which is the value we see at $\pi = .2$ in the bar plot.

```{r}
#| echo: false
d |>
  filter(ys == 1) |>
  mutate(hl = ifelse(pies == .2, "y", "n")) |>
  ggplot(aes(pies, fys, fill = hl)) +
  geom_col() +
  scale_x_continuous(breaks = seq(.1, .9, by = .1)) +
  scale_fill_manual(values = c("y" = "darkblue", "n" = "darkgrey"), guide = "none")
```
The likelihood values for $Y = 1$ are shown here:

```{r}
d |>
  filter(ys == 1) |>
  select(-display_pi) |>
  knitr::kable()
```
The overall take-away from having observed the new data
$Y = 1$ is that it is most compatible with the $\pi$ value
of .2. This means it is reasonable to conclude that the human is
the weaker player, since we can think of the value $\pi$ as
a measure of the relative strength of the human compared
to the computer, with 0 being the weakest and 1 being the
strongest.

:::{.callout-note}
## Probability mass functions vs likelihood functions
When $\pi$ is known, the conditional pmf $f(\cdot | \pi)$
allows us to compare the probabilities of the different values
of $Y$ occurring with $\pi$.
On the other hand, when $Y = y$ is known, the likelihood
function $L(\cdot|Y=y) = f(Y=y|\cdot)$ allows us to compare the
relative likelihoods of observing the data $y$ under different
values of $\pi$.
:::
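
We can see both facts with `chess_pmf()`: for a fixed $\pi$ the pmf sums to 1
across the values of $y$, while for the fixed observation $y = 1$ the
likelihoods across our three $\pi$ values do not:

```{r}
# fixed pi = .8: probabilities over y = 0..6 sum to 1
sum(chess_pmf(y = 0:6, p = .8))

# fixed y = 1: likelihoods over pi = .2, .5, .8 do not sum to 1
sum(chess_pmf(y = 1, p = c(.2, .5, .8)))
```
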
Now that we have the priors for $\pi$ and the likelihoods
for $Y = 1$, all we need is the **normalizing constant**
to make use of Bayes' Rule in order to update our priors
with this new information and develop our posterior. Recall that the normalizing
constant is just the total probability of observing $Y = 1$.
To get this we simply calculate the probability of observing
$Y = 1$ for each value of $\pi$ and weight each of these
by the corresponding prior on that value of $\pi$:

$$f(y = 1) = f(y=1|\pi = .2)f(\pi = .2) + f(y=1|\pi = .5)f(\pi = .5) + f(y = 1|\pi = .8)f(\pi=.8)$$
$$\approx .0637$$
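
Computing this normalizing constant in R with `chess_pmf()` and our priors:

```{r}
# total probability of observing y = 1, weighting each pi by its prior
pis <- c(.2, .5, .8)
prior_pi <- c(.10, .25, .65)
sum(chess_pmf(y = 1, p = pis) * prior_pi)
```
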
### Posterior
Our posterior distribution has pmf:
$$f(\pi|y = 1)$$
we can write this out as:
$$f(\pi | y = 1) = \frac{f(\pi)\times L(\pi| y = 1)}{f(y = 1)}$$
for $\pi \in \{0.2, 0.5, 0.8\}$
Using just our simplified set of $\pi$ values, we have the following prior and
posterior:

| $\pi$               | 0.2   | 0.5   | 0.8   | Total |
|---------------------|-------|-------|-------|-------|
| $f(\pi)$            | 0.10  | 0.25  | 0.65  | 1     |
| $f(\pi \mid y = 1)$ | 0.617 | 0.368 | 0.016 | 1     |
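
Putting the pieces together in R (a small sketch reusing `chess_pmf()` and the
prior above):

```{r}
# posterior = likelihood * prior, divided by the normalizing constant
pis <- c(.2, .5, .8)
prior_pi <- c(.10, .25, .65)
likelihood_y1 <- chess_pmf(y = 1, p = pis)
round(likelihood_y1 * prior_pi / sum(likelihood_y1 * prior_pi), 3)
```
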
### Chess Posterior Simulation
Set up the scenario: we have the possible values of $\pi$
and the corresponding prior probability for each one.

```{r}
chess <- tibble::tibble(pi = c(.2, .5, .8))
prior <- c(.1, .25, .65)
```
Next we use a sampling function to generate 10,000
values of $\pi$ from our data frame; we will use
each of these to simulate a 6-match game.

```{r}
chess_sim <- sample_n(chess, size = 10000, weight = prior,
                      replace = TRUE)
```
Simulate 10,000 games
```{r}
chess_sim <- chess_sim |>
  mutate(y = rbinom(10000, size = 6, prob = pi))
```
Let's check how close this simulation is to our known
conditional pmfs:

```{r}
chess_sim |>
  ggplot(aes(x = y)) +
  stat_count(aes(y = after_stat(prop))) +
  facet_wrap(~pi)
```
Let's now focus on the events where $Y = 1$ and tally up the
results to see how well these approximate the values
we formally computed as our posterior.

```{r}
chess_sim |>
  filter(y == 1) |>
  group_by(pi) |>
  tally() |>
  mutate(
    prop = n / sum(n)
  )
```