---
title: "Chapter 2 Notes"
author: "Emanuel Rodriguez"
execute:
  message: false
  warning: false
format:
  html:
    monofont: "Cascadia Mono"
    highlight-style: gruvbox-dark
    css: styles.css
    callout-icon: false
    callout-appearance: simple
---

In this chapter we step through an example of "fake" vs "real" news
to build a framework for determining the probability that a new
news article, titled "The President has a secret!", is real or fake.

```{r}
#| message: false
#| warning: false
# libraries
library(bayesrules)
library(dplyr)
library(tidyr)
library(gt)
data(fake_news)
fake_news <- tibble::as_tibble(fake_news)
```

What proportion of the news articles were labeled fake vs real?

```{r}
fake_news |> glimpse()

fake_news |>
  group_by(type) |>
  summarise(
    total = n(),
    prop = total / nrow(fake_news)
  )
```

If we let $B$ be the event that a news article is "fake" news, and
$B^c$ be the event that a news article is "real", we can write the following:

$$P(B) = .4$$
$$P(B^c) = .6$$

This is the first "clue", or set of data, that we build into our framework.
Namely, the majority of articles are "real", so we could simply predict that
the new article is "real". This updated sense of reality now becomes our priors.

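
As a minimal sketch of how these priors could be kept as values for later steps, the chunk below recomputes the proportions from `fake_news` and pulls out one number per article type. The names `prior_tbl`, `prior_fake`, and `prior_real` are introduced here for illustration only and are not part of the original notes.

```{r}
library(bayesrules)
library(dplyr)

data(fake_news)

# Proportion of articles of each type (hypothetical helper table)
prior_tbl <- fake_news |>
  count(type) |>
  mutate(prop = n / sum(n))

# P(B): prior probability that an article is fake (reported above as .4)
prior_fake <- prior_tbl |> filter(type == "fake") |> pull(prop)

# P(B^c): prior probability that an article is real (reported above as .6)
prior_real <- prior_tbl |> filter(type == "real") |> pull(prop)
```

Because $B$ and $B^c$ are complements, the two values should sum to 1, which is a quick check that the table was built correctly.
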
Next, we gather additional data and use it to update our priors. The
new observation we make is the use of exclamation marks "!". We note that the use
of "!" is more frequent in news articles labeled as "fake". We will want to incorporate
this into our framework to decide whether the new incoming article should be labeled as
real or fake.

### Likelihood

:::{.callout-note}
## Probability and Likelihood

When the event $B$ is known, we can evaluate the uncertainty of events
$A$ and $A^c$ given $B$:

$$P(A|B) \text{ vs } P(A^c|B)$$

If, on the other hand, we know event $A$, then we can evaluate the relative
compatibility of data $A$ with $B$ and $B^c$ using likelihood functions:

$$L(B|A) \text{ vs } L(B^c|A)$$
$$=P(A|B) \text{ vs } P(A|B^c)$$

:::

In our case, we don't know whether the new incoming article is real or fake,
but we do know that its title has an exclamation mark. Letting $A$ be the event
that a title contains an "!", we can use likelihood functions to compare how
compatible this observation is with the article being fake versus real.
We can formulate this as:

$$L(B|A) \text{ vs } L(B^c|A)$$

And perform the computation in R as follows:

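```{r}
# within each article type, what are the proportions of ! vs no-!
prop_of_excl_within_type <- fake_news |>
  group_by(type, title_has_excl) |>
  # .groups = "drop" returns an ungrouped result and avoids the regrouping message
  summarise(total = n(), .groups = "drop") |>
  group_by(type) |>
  # mutate() keeps one row per (type, title_has_excl) pair;
  # a multi-row summarise() is deprecated in recent dplyr versions
  mutate(prop_within_type = total / sum(total)) |>
  ungroup() |>
  select(type, has_excl = title_has_excl, prop_within_type)
```
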
```{r}
prop_of_excl_within_type |>
  pivot_wider(names_from = "type", values_from = prop_within_type) |>
  gt() |>
  gt::cols_label(
    has_excl = "Contains Exclamation",
    fake = "Fake",
    real = "Real") |>
  gt::fmt_number(columns = c("fake", "real"), decimals = 3) |>
  gt::cols_width(everything() ~ px(100))
```

The table above also shows the likelihoods for the case
when an article does not contain an exclamation point in
the title.
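
To make the distinction between probabilities and likelihoods concrete, here is a small sketch that computes both directions from the same data. The name `excl_by_type` and the columns `p_excl` and `p_no_excl` are hypothetical, and the `as.logical()` call guards the assumption that `title_has_excl` holds TRUE/FALSE values.

```{r}
library(bayesrules)
library(dplyr)

data(fake_news)

# For each article type, the share of titles with and without "!"
excl_by_type <- fake_news |>
  group_by(type) |>
  summarise(
    p_excl = mean(as.logical(title_has_excl)),     # P(A | type)
    p_no_excl = mean(!as.logical(title_has_excl))  # P(A^c | type)
  )

# Probabilities: with the type known, the two outcomes sum to 1 in each row
excl_by_type |>
  mutate(row_total = p_excl + p_no_excl)

# Likelihoods: with the outcome known ("!" present), compare across types.
# These two values, L(B|A) and L(B^c|A), are not required to sum to 1.
excl_by_type |>
  select(type, p_excl)
```

Reading across a row gives conditional probabilities for a known article type, while reading down the `p_excl` column gives the likelihoods of fake and real for an observed "!".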