adds more work

2022-09-10 00:54:37 -07:00
parent 5c071fcbb2
commit c01c507087
4 changed files with 797 additions and 48 deletions

R/ch2.qmd

@@ -25,6 +25,8 @@ library(bayesrules)
library(dplyr)
library(tidyr)
library(gt)
library(tibble)
library(ggplot2)
data(fake_news)
fake_news <- tibble::as_tibble(fake_news)
```
@@ -180,4 +182,158 @@ parts. Namely
$$P(B^c) = P(A \cap B^c) + P(A^c \cap B^c)$$
$$=P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c)$$
$$=.0132 + .5868 = .6$$
:::
In the above calculations we also stepped through **joint probabilities**.
:::{.callout-note}
## Joint and conditional probability
$$P(A \cap B) = P(A|B)P(B)$$
$A$ and $B$ are said to be independent events if and only if
$$P(A \cap B) = P(A)P(B)$$
From this we can also derive the definition of conditional probability
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
:::
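To make the identity concrete, a quick check with the article numbers used in this chapter (priors $.4$/$.6$ for fake/real, exclamation-point likelihoods $.267$/$.022$):

```{r}
# joint probabilities P(excl & type) = P(excl | type) * P(type)
like  <- c(fake = .267, real = .022)  # P(exclamation | type)
prior <- c(fake = .4,   real = .6)    # P(type)
joint <- like * prior
joint  # fake = .1068, real = .0132
```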
At this point we are able to answer the question, "What is the probability
the new article is fake?" Given that the new article has an exclamation
point, we can zoom into the top row of the table of probabilities. Within
this row we have probabilities $.1068/.12 = .89$ for fake and $.0132 / .12 = .11$
for real.
This is essentially Bayes' Rule. We developed a posterior probability for an event
$B$ given some observation $A$. We did so by combining the likelihood of event $B$
given some new data $A$ and the prior probability of event $B$. More formally we
have the following definition:
:::{.callout-note}
## Bayes' Rule
The posterior probability of an event $B$ given an event $A$ is:
$$ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{L(B|A)P(B)}{P(A)}$$
where $L$ is the likelihood function $L(B|A) = P(A|B)$ and $P(A)$ is the
total probability of $A$.
More generally,
$$ \text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}$$
:::
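Bayes' Rule can be evaluated directly with the article numbers as a quick check (a small sketch, using the priors and likelihoods from this chapter):

```{r}
# posterior P(fake | exclamation) via Bayes' Rule
prior_fake <- .4;  like_fake <- .267   # P(fake),  P(excl | fake)
prior_real <- .6;  like_real <- .022   # P(real),  P(excl | real)
p_excl <- like_fake * prior_fake + like_real * prior_real  # total probability, .12
like_fake * prior_fake / p_excl  # posterior, .89
```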
### Simulation
```{r}
articles <- tibble::tibble(type = c("real", "fake"))
priors <- c(.6, .4)
# draw 10,000 articles, weighting each type by its prior probability
articles_sim <- sample_n(articles, 10000, replace = TRUE, weight = priors)
```
```{r}
articles_sim |>
ggplot(aes(x = type)) + geom_bar()
```
and a summary table
```{r}
articles_sim |>
group_by(type) |>
summarise(
total = n(),
prop = total / nrow(articles_sim)
) |>
gt()|>
gt::cols_width(everything() ~ px(100))
```
The simulation of 10,000 articles reproduces, very nearly, the priors
we started with. We can now add the exclamation-point usage
probabilities into the data.
```{r}
articles_sim <- articles_sim |>
  mutate(model_data = case_when(
    type == "fake" ~ .267,  # P(exclamation | fake)
    type == "real" ~ .022   # P(exclamation | real)
  ))
```
The plan here is to iterate through the 10,000 samples
and use the `model_data` value to assign either "yes" or
"no" using the `sample` function.
```{r}
data <- c("yes", "no")
articles_sim <- articles_sim |>
mutate(id = row_number()) |>
group_by(id) |>
mutate(usage = sample(data, 1, prob = c(model_data, 1 - model_data)))
```
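The per-row `group_by(id)` + `sample()` draw works, but the same simulation can also be done with one vectorized uniform draw per article. A self-contained sketch (the seed 84735 is an arbitrary choice for reproducibility, not from the text):

```{r}
# vectorized alternative to the per-row sample() draw (a sketch):
# one uniform draw per article, compared against its usage probability
library(dplyr)
set.seed(84735)  # arbitrary seed, for reproducibility
sim <- tibble::tibble(type = sample(c("real", "fake"), 10000,
                                    replace = TRUE, prob = c(.6, .4))) |>
  mutate(
    model_data = ifelse(type == "fake", .267, .022),
    usage      = ifelse(runif(n()) < model_data, "yes", "no")
  )
mean(sim$usage == "yes")  # should land near the total probability, .12
```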
```{r}
articles_sim |>
group_by(usage, type) |>
summarise(
total = n()
) |>
pivot_wider(names_from = type, values_from = total)
```
```{r}
articles_sim |>
ggplot(aes(x = type, fill = usage)) +
geom_bar() +
scale_fill_discrete(type = c("gray8", "dodgerblue4"))
```
So far we have computed both the priors and likelihoods; we can now simply
filter our data to reflect the incoming article and determine our
posterior.
```{r}
articles_sim |>
filter(usage == "yes") |>
group_by(type) |>
summarise(
total = n()
) |>
mutate(
prop = total / sum(total)
)
```
:::{.callout-note}
## Discrete Probability Model
Let $Y$ be a discrete random variable. The probability model for $Y$ is
described by a **probability mass function** (pmf) defined as:
$$f(y) = P(Y = y)$$
and has the following properties
1. $0 \leq f(y) \leq 1\;\; \forall y$
2. $\sum_{\forall y}f(y) = 1$
:::
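As a concrete instance, the article type is itself a discrete random variable whose pmf assigns $f(\text{fake}) = .4$ and $f(\text{real}) = .6$; both properties are easy to verify:

```{r}
# pmf of the article type, taken from the priors used above
f <- c(fake = .4, real = .6)
all(f >= 0 & f <= 1)  # property 1: each f(y) is in [0, 1]
sum(f)                # property 2: the pmf sums to 1
```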
:::{.callout-caution}
## in emanuel's words
What does this mean? Well, it's very straightforward: a pmf is a function that
takes in some value $y$ and outputs the probability that the random variable
$Y$ equals $y$.
:::