---
title: "Chapter 2 Notes"
author: "Emanuel Rodriguez"
execute:
  message: false
  warning: false
format:
  html:
    monofont: "Cascadia Mono"
    highlight-style: gruvbox-dark
    css: styles.css
    callout-icon: false
    callout-appearance: simple
---

In this chapter we step through an example of "fake" vs. "real" news, building a framework to determine the probability that a new news article titled "The President has a secret!" is real or fake.

```{r}
#| message: false
#| warning: false
# libraries
library(bayesrules)
library(dplyr)
library(tidyr)
library(gt)

data(fake_news)
fake_news <- tibble::as_tibble(fake_news)
```

What proportion of news articles are labeled fake vs. real?

```{r}
fake_news |> glimpse()

fake_news |>
  group_by(type) |>
  summarise(
    total = n(),
    prop = total / nrow(fake_news)
  )
```

If we let $B$ be the event that a news article is "fake" news, and $B^c$ be the event that a news article is "real", we can write the following:

$$P(B) = .4$$
$$P(B^c) = .6$$

This is the first "clue", or set of data, that we have to build into our framework. Namely, the majority of articles are "real", so we could simply predict that the new article is "real". This updated sense of reality now becomes our prior.
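
To carry these priors through the rest of the computations, we can store them in R (a minimal sketch; the variable names are my own, not from the book):

```{r}
# priors taken from the observed proportions above
prior_fake <- 0.4 # P(B):   article is fake
prior_real <- 0.6 # P(B^c): article is real
```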

Next we gather additional data and update our prior based on it. The new observation we make is the use of exclamation marks ("!"): their use is more frequent in news articles labeled "fake". We will want to incorporate this into our framework to decide whether the new incoming article should be labeled real or fake.

### Likelihood

:::{.callout-note}
## Probability and Likelihood

When the event $B$ is known, we can evaluate the uncertainty of events $A$ and $A^c$ given $B$:

$$P(A|B) \text{ vs } P(A^c|B)$$

If, on the other hand, we know event $A$, then we can evaluate the relative compatibility of the data $A$ with $B$ and $B^c$ using likelihood functions:

$$L(B|A) \text{ vs } L(B^c|A)$$
$$= P(A|B) \text{ vs } P(A|B^c)$$
:::

So in our case we don't know whether this new incoming article is real or not, but we do know that the title has an exclamation mark. This means we can evaluate the relative compatibility of this data (the "!" in the title) with the article being fake or real, using likelihood functions. We can formulate this as:

$$L(B|A) \text{ vs } L(B^c|A)$$

And perform the computation in R as follows:

```{r}
# within each type (fake vs real), what proportion of titles
# do and do not contain "!"?
prop_of_excl_within_type <- fake_news |>
  count(type, title_has_excl, name = "total") |>
  group_by(type) |>
  mutate(prop_within_type = total / sum(total)) |>
  ungroup() |>
  select(type, has_excl = title_has_excl, prop_within_type)
```

```{r}
prop_of_excl_within_type |>
  pivot_wider(names_from = "type", values_from = prop_within_type) |>
  gt() |>
  gt::cols_label(
    has_excl = "Contains Exclamation",
    fake = "Fake",
    real = "Real"
  ) |>
  gt::fmt_number(columns = c("fake", "real"), decimals = 3) |>
  gt::cols_width(everything() ~ px(100))
```

The table above also shows the likelihoods for the case when an article does not contain an exclamation point in the title. It's important to note that these are likelihoods, not probabilities: it is not the case that $L(B|A) + L(B^c|A) = 1$; in fact, this sum evaluates to a number less than one. However, since $L(B|A) = .267$ and $L(B^c|A) = .022$, we have gained additional knowledge: the use of "!" in a title is far more compatible with a fake news article than a real one.
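
We can store these likelihoods the same way (again a sketch; `L_fake` and `L_real` are names of my own choosing, holding the rounded values from the table):

```{r}
# likelihood of a "!" title under each hypothesis
L_fake <- 0.267 # L(B|A)   = P(A|B)
L_real <- 0.022 # L(B^c|A) = P(A|B^c)
L_fake + L_real # .289 -- likelihoods need not sum to 1
```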

Up to this point we can summarize our framework as follows:

| event      | $B$  | $B^c$ | Total |
|------------|------|-------|-------|
| prior      | .4   | .6    | 1     |
| likelihood | .267 | .022  | .289  |

Our next goal is to come up with the normalizing factors in order to build our probability table:

|       | $B$ | $B^c$ | Total |
|-------|-----|-------|-------|
| $A$   | (1) | (2)   |       |
| $A^c$ | (3) | (4)   |       |
| Total | .4  | .6    | 1     |

A couple of things to note about our table: (1) + (3) = .4, (2) + (4) = .6, and (1) + (2) + (3) + (4) = 1.

(1.) $P(A \cap B) = P(A|B)P(B)$. We know the likelihood $L(B|A) = P(A|B)$ and we also know the prior, so we insert these to get
$$P(A \cap B) = P(A|B)P(B) = .267 \times .4 = .1068$$

(3.) $P(A^c \cap B) = P(A^c|B)P(B)$. In this case we do know the prior $P(B) = .4$, but we don't directly know the value of $P(A^c|B)$. However, we note that $P(A|B) + P(A^c|B) = 1$, therefore we compute $P(A^c|B) = 1 - P(A|B) = 1 - .267 = .733$ and
$$P(A^c \cap B) = P(A^c|B)P(B) = .733 \times .4 = .2932$$

We can now confirm that $.1068 + .2932 = .4$.
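
In R, this is one line per cell (using the hypothetical variables defined in the earlier sketches):

```{r}
p_excl_and_fake    <- L_fake * prior_fake       # (1) P(A and B)   = .1068
p_no_excl_and_fake <- (1 - L_fake) * prior_fake # (3) P(A^c and B) = .2932
p_excl_and_fake + p_no_excl_and_fake            # recovers the prior P(B) = .4
```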

Moving on to (2) and (4):

(2.) $P(A \cap B^c) = P(A|B^c)P(B^c)$. In this case we know the likelihood $L(B^c|A) = P(A|B^c)$ and we know the prior $P(B^c)$, therefore
$$P(A \cap B^c) = P(A|B^c)P(B^c) = .022 \times .6 = .0132$$

(4.) $P(A^c \cap B^c) = P(A^c|B^c)P(B^c) = (1 - .022) \times .6 = .5868$

and we can confirm that $.0132 + .5868 = .6$.
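
And the matching sketch for the "real" column:

```{r}
p_excl_and_real    <- L_real * prior_real       # (2) P(A and B^c)   = .0132
p_no_excl_and_real <- (1 - L_real) * prior_real # (4) P(A^c and B^c) = .5868
p_excl_and_real + p_no_excl_and_real            # recovers the prior P(B^c) = .6
```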

And we can fill in the rest of the table:

|       | $B$   | $B^c$ | Total |
|-------|-------|-------|-------|
| $A$   | .1068 | .0132 | .12   |
| $A^c$ | .2932 | .5868 | .88   |
| Total | .4    | .6    | 1     |
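
As a cross-check, we can assemble the full table from the four joint probabilities computed in the sketches above; `addmargins()` adds the row and column totals:

```{r}
joint <- rbind(
  excl    = c(fake = p_excl_and_fake,    real = p_excl_and_real),
  no_excl = c(fake = p_no_excl_and_fake, real = p_no_excl_and_real)
)
addmargins(joint) # matches the table, including the .12/.88 margins
```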

An important concept we used above is the idea of **total probability**.

:::{.callout-tip}
## total probability

The **total probability** of observing a real article is made up of the sum of its parts. Namely,

$$P(B^c) = P(A \cap B^c) + P(A^c \cap B^c)$$
$$= P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c)$$
$$= .0132 + .5868 = .6$$
:::