Chapter 2 Notes

Author

Emanuel Rodriguez

Note: these notes are a work in progress

In this chapter we step through an example of “fake” vs. “real” news, building a framework to determine the probability that a new article titled “The President has a secret!” is real or fake.

We then go on to build a probability model known as the Binomial model within the Bayesian framework.

# libraries
library(bayesrules)
library(dplyr)
library(tidyr)
library(gt)
library(tibble)
library(ggplot2)
data(fake_news)
fake_news <- tibble::as_tibble(fake_news)

What proportion of news articles were labeled fake vs. real?

fake_news |> head()
# A tibble: 6 × 30
  title        text  url   authors type  title…¹ text_…² title…³ text_…⁴ title…⁵
  <chr>        <chr> <chr> <chr>   <fct>   <int>   <int>   <int>   <int>   <int>
1 Clinton's E… "0 S… http… <NA>    fake       17     219     110    1444       0
2 Donald Trum… "\n\… http… <NA>    real       18     509      95    3016       0
3 Michelle Ob… "Mic… http… Sierra… fake       16     494      96    2881       1
4 Trump hits … "“Cr… http… Jack S… real       11     268      60    1674       0
5 Australia V… "Whe… http… Blair … fake        9     479      54    2813       0
6 It’s “Trump… "Lik… http… View A… real       12     220      66    1351       1
# … with 20 more variables: text_caps <int>, title_caps_percent <dbl>,
#   text_caps_percent <dbl>, title_excl <int>, text_excl <int>,
#   title_excl_percent <dbl>, text_excl_percent <dbl>, title_has_excl <lgl>,
#   anger <dbl>, anticipation <dbl>, disgust <dbl>, fear <dbl>, joy <dbl>,
#   sadness <dbl>, surprise <dbl>, trust <dbl>, negative <dbl>, positive <dbl>,
#   text_syllables <int>, text_syllables_per_word <dbl>, and abbreviated
#   variable names ¹​title_words, ²​text_words, ³​title_char, ⁴​text_char, …
fake_news |>
    group_by(type) |> 
    summarise(
        total = n(),
        prop = total / nrow(fake_news)
    ) 
# A tibble: 2 × 3
  type  total  prop
  <fct> <int> <dbl>
1 fake     60   0.4
2 real     90   0.6

If we let B be the event that a news article is “fake” news, and B^c be the event that a news article is “real”, we can write the following:

P(B) = .4 \text{ and } P(B^c) = .6

This is the first “clue” or piece of data we build into our framework. Namely, the majority of articles are “real”, so absent any other information we could simply predict that the new article is “real”. This initial understanding becomes our prior.

Next we gather additional data and use it to update our priors. The new observation is the use of exclamation marks (“!”) in titles: “!” appears more frequently in articles labeled “fake”. We want to incorporate this into our framework when deciding whether the new incoming article should be labeled real or fake.

Likelihood

Probability and Likelihood

When the event B is known, we can evaluate the uncertainty of events A and A^c given B:

P(A|B) \text{ vs } P(A^c|B)

If, on the other hand, we know event A, then we can evaluate the relative compatibility of the data A with B and B^c using likelihood functions:

L(B|A) \text{ vs } L(B^c|A) =P(A|B) \text{ vs } P(A|B^c)

In our case we don’t know whether the new incoming article is real or fake, but we do know that its title contains an exclamation mark. This means we can use likelihood functions to evaluate how compatible the “!” in the title is with the article being fake vs. real. We can formulate this as:

L(B|A) \text{ vs } L(B^c|A)

And perform the computation in R as follows:

# within each type (fake vs real), what proportion of titles use "!" vs no-"!"?
prop_of_excl_within_type <- fake_news |>
    group_by(type, title_has_excl) |>
    summarise(
        total = n()
    ) |>
    ungroup() |>
    group_by(type) |>
    mutate(
        prop_within_type = total / sum(total)
    ) |>
    ungroup() |>
    select(type, has_excl = title_has_excl, prop_within_type)
prop_of_excl_within_type |>
    pivot_wider(names_from = "type", values_from = prop_within_type) |>
    gt() |>
    gt::cols_label(
        has_excl = "Contains Exclamtion",
        fake = "Fake", 
        real = "Real") |>
    gt::fmt_number(columns=c("fake", "real"), decimals = 3) |>
    gt::cols_width(everything() ~ px(100))
Contains Exclamation   Fake    Real
FALSE                  0.733   0.978
TRUE                   0.267   0.022

The table above also shows the likelihoods for the case when an article does not contain an exclamation point in the title. It’s really important to note that these are likelihoods: it is not the case that L(B|A) + L(B^c|A) = 1; in fact, here the sum evaluates to a number less than one. However, since L(B|A) = .267 and L(B^c|A) = .022, we have gained additional knowledge: the use of “!” in a title is far more compatible with a fake news article than a real one.
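A quick check in R, using the rounded values from the table:

# the two likelihoods need not (and do not) sum to 1
.267 + .022
[1] 0.289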

Up to this point we can summarize our framework as follows

event        B      B^c     Total
prior        .4     .6      1
likelihood   .267   .022    .289

Our next goal is to come up with the normalizing factors needed to build our probability table:

        B     B^c   Total
A       (1)   (2)
A^c     (3)   (4)
Total   .4    .6    1

A couple of things to note about our table: (1) + (3) = .4, (2) + (4) = .6, and (1) + (2) + (3) + (4) = 1.

(1.) P(A \cap B) = P(A|B)P(B). We know the likelihood L(B|A) = P(A|B) and we also know the prior, so we insert these to get P(A \cap B) = P(A|B)P(B) = .267 \times .4 = .1068.

(3.) P(A^c \cap B) = P(A^c|B)P(B). In this case we know the prior P(B) = .4, but we don’t directly know the value of P(A^c|B). However, since P(A|B) + P(A^c|B) = 1, we compute P(A^c|B) = 1 - P(A|B) = 1 - .267 = .733, giving P(A^c \cap B) = P(A^c|B)P(B) = .733 \times .4 = .2932.

We can now confirm that .1068 + .2932 = .4.

Moving on to (2) and (4):

(2.) P(A \cap B^c) = P(A|B^c)P(B^c). In this case we know the likelihood L(B^c|A) = P(A|B^c) and we know the prior P(B^c); therefore P(A \cap B^c) = P(A|B^c)P(B^c) = .022 \times .6 = .0132.

(4.) P(A^c \cap B^c) = P(A^c|B^c)P(B^c) = (1 - .022) \times .6 = .5868

and we can confirm that .0132 + .5868 = .6

Now we can fill in the rest of the table:

        B       B^c     Total
A       .1068   .0132   .12
A^c     .2932   .5868   .88
Total   .4      .6      1
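As a check, here is a small sketch that rebuilds this table in R from the rounded priors and likelihoods (the names prior, lik_excl, and joint are just illustrative):

# joint probabilities from the priors and likelihoods
prior    <- c(B = .4, Bc = .6)       # P(B), P(B^c)
lik_excl <- c(B = .267, Bc = .022)   # P(A|B), P(A|B^c)

joint <- rbind(
    A  = lik_excl * prior,           # P(A and B),   P(A and B^c)
    Ac = (1 - lik_excl) * prior      # P(A^c and B), P(A^c and B^c)
)
joint
colSums(joint)   # recovers the priors: .4 and .6
rowSums(joint)   # total probability of A and A^c: .12 and .88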

An important concept we used above is the idea of total probability.

total probability

The total probability of observing a real article is made up of the sum of its parts. Namely,

P(B^c) = P(A \cap B^c) + P(A^c \cap B^c) = P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c) = .0132 + .5868 = .6

In the above calculations we also stepped through joint probabilities.

Joint and conditional probability

P(A \cap B) = P(A|B)P(B)

A and B are said to be independent events if and only if

P(A \cap B) = P(A)P(B)

From this we can also derive the definition of conditional probability:

P(A|B) = \frac{P(A \cap B)}{P(B)}
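We can verify this relationship with the numbers from our table, recovering the likelihood from the joint probability:

# P(A|B) = P(A and B) / P(B)
.1068 / .4
[1] 0.267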

At this point we are able to answer the question, “What is the probability that the new article is fake?” Given that the new article has an exclamation point in its title, we can zoom into the top row of the table of probabilities. Within this row we have probabilities .1068/.12 = .89 for fake and .0132/.12 = .11 for real.

This is essentially Bayes’ Rule. We developed a posterior probability for the event B given some observation A, by combining the likelihood of B given the new data A with the prior probability of B. More formally, we have the following definition:

Bayes’ Rule

The posterior probability of an event B given an event A is:

P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{L(B|A)P(B)}{P(A)}

where L is the likelihood function L(B|A) = P(A|B) and P(A) is the total probability of A.

More generally,

\frac{likelihood \cdot prior}{normalizing \;\; constant}
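Putting this together in R, the posterior for our fake-news example is a one-liner (using the rounded priors and likelihoods from above):

# P(B|A) = L(B|A) * P(B) / P(A)
(.267 * .4) / (.267 * .4 + .022 * .6)
[1] 0.89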

Simulation

articles <- tibble::tibble(type = c("real", "fake"))

priors <- c(.6, .4)

articles_sim <- sample_n(articles, 10000, replace = TRUE, weight = priors)
articles_sim |>
    ggplot(aes(x = type)) + geom_bar()

and a summary table

articles_sim |>
    group_by(type) |>
    summarise(
        total = n(), 
        prop = total / nrow(articles_sim)
    ) |>
    gt()|>
    gt::cols_width(everything() ~ px(100))
type total prop
fake 4037 0.4037
real 5963 0.5963

The simulation of 10,000 articles reproduces, very nearly, the priors we computed from the data. We can now add exclamation usage to the simulated data.

articles_sim <- articles_sim |>
    mutate(model_data = case_when(
        type == "fake" ~ .267, 
        type == "real" ~ .022
    ))

The plan here is to iterate through the 10,000 samples and use the model_data value to assign either “yes” or “no” with the sample function.

data <- c("yes", "no")

articles_sim <- articles_sim |>
    mutate(id = row_number()) |>
    group_by(id) |>
    mutate(usage = sample(data, 1, prob = c(model_data, 1 - model_data)))
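As an aside, the group_by(id) trick above calls sample once per row, which can be slow for large simulations. A vectorized sketch of the same idea uses one Bernoulli draw per article via rbinom; the column name usage_vec is just illustrative, and its draws will differ from usage due to randomness:

# vectorized alternative: one Bernoulli draw per article, no per-row grouping
articles_sim |>
    ungroup() |>
    mutate(usage_vec = ifelse(rbinom(n(), size = 1, prob = model_data) == 1, "yes", "no"))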
articles_sim |>
    group_by(usage, type) |>
    summarise(
        total = n()
    ) |>
    pivot_wider(names_from = type, values_from = total)
# A tibble: 2 × 3
# Groups:   usage [2]
  usage  fake  real
  <chr> <int> <int>
1 no     2919  5824
2 yes    1118   139
articles_sim |>
    ggplot(aes(x = type, fill = usage)) + 
    geom_bar() + 
    scale_fill_discrete(type = c("gray8", "dodgerblue4"))

So far we have computed both the priors and the likelihoods; now we can simply filter our data to reflect the incoming article and determine our posterior.

articles_sim |>
    filter(usage == "yes") |>
    group_by(type) |>
    summarise(
        total = n()
    ) |>
    mutate(
        prop = total / sum(total)
    )
# A tibble: 2 × 3
  type  total  prop
  <chr> <int> <dbl>
1 fake   1118 0.889
2 real    139 0.111

Binomial Model and the chess example

The example used here is a chess match between a human and the computer “Deep Blue”. The setup: the two faced each other in 1996, and the human won. A rematch is scheduled for 1997. We would like to model the number of games, out of 6, that the human can win.

Let \pi be the probability that the human wins any one game against the computer. To simplify things greatly, we assume that \pi can only take the values .2, .5, or .8. We also assume the following prior (the book tells us we will learn how to build these later on):

\pi      .2    .5    .8    Total
f(\pi)   .10   .25   .65   1
Note

It’s important to note here that the values of \pi themselves need not add up to 1. \pi represents the chance of winning any single game, so \pi could in principle take on any value in [0, 1]. On the other hand, f is a function that maps each value of \pi to a probability, and those probabilities must sum to 1; this is defined next.

Discrete Probability Model

Let Y be a discrete random variable. The probability model for Y is described by a probability mass function (pmf) defined as: f(y) = P(Y = y)

and has the following properties

  1. 0 \leq f(y) \leq 1\;\; \forall y
  2. \sum_{\forall y}f(y) = 1
in emanuel’s words

What does this mean? Well, it’s very straightforward: a pmf is a function that takes in some value y and outputs the probability that the random variable Y equals y.
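As a minimal sketch of this definition, here is the pmf of a fair six-sided die (the name die_pmf is just illustrative):

# pmf of a fair die: f(y) = 1/6 for y in 1..6, and 0 otherwise
die_pmf <- function(y) ifelse(y %in% 1:6, 1 / 6, 0)

die_pmf(3)          # P(Y = 3) = 1/6
sum(die_pmf(1:6))   # property 2: probabilities sum to 1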

Next we would like to add the dependency of Y on \pi; we do so by introducing the conditional pmf.

Conditional probability model of data Y

Let Y be a discrete random variable that depends on some parameter \pi. We define the conditional probability model of Y as the conditional pmf,

f(y|\pi) = P(Y = y | \pi)

and has the following properties,

  1. 0 \leq f(y|\pi) \leq 1\;\; \forall y
  2. \sum_{\forall y}f(y|\pi) = 1
in emanuel’s words

This is essentially the same probability model we defined above, except now we condition the probabilities on some parameter \pi.

In the example of the chess player we must make some assumptions:

  1. The chance of winning any single game stays constant across the match. So if in game 1 the human has a .65 probability of winning, the same probability applies to games 2-6.

  2. Winning or losing a game does not affect the chances of winning or losing the next game, i.e., games are independent of one another.

These two assumptions lead us to the Binomial Model.

The Binomial Model

Let the random variable Y represent the number of successes in n trials. Assume that each trial is independent, and that the probability of success in any given trial is \pi. Then the conditional dependence of Y on \pi can be modeled by the Binomial Model with parameters n and \pi. We can write this as,

Y|\pi \sim Bin(n, \pi)

the binomial model is specified by the pmf:

f(y|\pi) = {n \choose y} \pi^y(1 - \pi)^{n-y}

Knowing this, we can represent Y, the total number of games out of 6 that the human wins, as

Y|\pi \sim Bin(6, \pi)

and conditional pmf:

f(y|\pi) = {6 \choose y}\pi^y(1 - \pi)^{6 - y}\;\; \text{for } y \in \{0, 1, 2, 3, 4, 5, 6\}

With the pmf we can now determine the probability of the human winning y games out of 6 for any given value of \pi:

# binomial pmf: probability of winning y games out of n, with per-game win probability p
chess_pmf <- function(y, p, n = 6) {
    choose(n, y) * (p ^ y) * (1 - p)^(n - y)
}

# what is the probability that the human wins 5 games, given a pi value of .8?
chess_pmf(y = 5, p = .8)
[1] 0.393216

The formula for the binomial is actually pretty intuitive. First you have the scalar {n \choose y}, which counts the total number of ways the player can win y games out of the possible n. This is multiplied by the probability of the y successes, since p^y can be re-written as p\times p\times \cdots \times p, and then by the probability of the n - y failures, (1 - p)^{n - y}.
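As a sanity check, note that R ships with the binomial pmf as dbinom, so our hand-rolled chess_pmf should agree with it:

# compare our pmf against R's built-in binomial pmf
all.equal(chess_pmf(0:6, p = .8), dbinom(0:6, size = 6, prob = .8))
[1] TRUE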

pies <- seq(0, 1, by = .05)
py <- chess_pmf(y = 4, p = pies)

d <- data.frame(pies = pies, py = py)

d |>
    ggplot(aes(pies, py)) + geom_col()

pies <- c(.2, .5, .8)
ys <- 0:6

d <- tidyr::expand_grid(pies, ys)
# pass chess_pmf directly so n = 6 is forwarded to it (the formula shorthand silently drops it)
fys <- purrr::map2_dbl(d$ys, d$pies, chess_pmf, n = 6)

d$fys <- fys
d$display_pi <- as.factor(paste("pi =", d$pies))

d |>
    ggplot(aes(x = ys, y = fys)) + 
    geom_col() + 
    scale_x_continuous(breaks = 0:6) + 
    facet_wrap(vars(display_pi))

The plot shows the three possible values of \pi along with the value of the pmf for each possible number of games the human can win. The values of f(y|\pi) are pretty intuitive: we would expect the random variable Y to concentrate on lower values when \pi is low and on higher values when \pi is high.

For the sake of the exercise, let’s add more values of \pi so that we can see this shift happen in more detail.

pies <- seq(.1, .9, by = .1)
ys <- 0:6

d <- tidyr::expand_grid(pies, ys)
fys <- purrr::map2_dbl(d$ys, d$pies, chess_pmf, n = 6)

d$fys <- fys
d$display_pi <- as.factor(paste("pi =", d$pies))

d |>
    ggplot(aes(x = ys, y = fys)) + 
    geom_col() + 
    scale_x_continuous(breaks = 0:6) + 
    facet_wrap(vars(display_pi), nrow = 3)

As it turns out, the human ended up winning just one game in the 1997 rematch: Y = 1. The next step in our analysis is to determine how compatible this new data is with each value of \pi, i.e., the likelihood.

This is very easy to do with all the work we have done so far:

d |>
    filter(ys == 1) |>
    ggplot(aes(pies, fys)) + 
    geom_col() + 
    scale_x_continuous(breaks = seq(.1, .9, by = .1))

It’s very important to note the following

# this will sum to a value greater than 1!!
d |>
    filter(ys == 1) |>
    pull(fys) |>
    sum()
[1] 1.37907
Important

This has been mentioned before, but it’s an important message to drive home: the reason the values can sum to more than 1 is that they are not probabilities, they are likelihoods. We are determining how compatible each value of \pi is with the observed data Y = 1.

We can formalize the likelihood function L in our example as follows:

L(\pi|y=1) = f(y=1|\pi) = {6 \choose 1}\pi^1(1-\pi)^{6-1} = 6\pi(1 - \pi)^5

We can test this out

6 * .2 * (.8 ^ 5)
[1] 0.393216

which is the value we see at \pi = .2 in the bar plot.

The likelihood values for Y = 1 are:

d |>
    filter(ys == 1)|>
    select(-display_pi) |>
    knitr::kable()
pies   ys   fys
0.1    1    0.354294
0.2    1    0.393216
0.3    1    0.302526
0.4    1    0.186624
0.5    1    0.093750
0.6    1    0.036864
0.7    1    0.010206
0.8    1    0.001536
0.9    1    0.000054

The overall take-away from having observed the new data Y = 1 is that it is most compatible with the \pi value of .2. This suggests the human is the weaker player, since we can think of \pi as a measure of the human’s relative strength compared to the computer, with 0 being the weakest and 1 the strongest.

Probability mass functions vs likelihood functions

When \pi is known, the conditional pmf f(\cdot | \pi) allows us to compare the probabilities of different values of Y occurring under that \pi.

On the other hand, when Y = y is known, the likelihood function L(\cdot|Y=y) = f(Y=y|\cdot) allows us to compare the relative likelihoods of observing the data y under different values of \pi.

Now that we have the priors for \pi and the likelihoods for Y = 1, all we need is the normalizing constant to apply Bayes’ Rule, updating our priors with this new information to develop our posterior. Recall that the normalizing constant is just the total probability of observing Y = 1. To get this, we calculate the probability of observing Y = 1 under each value of \pi and weight each of these by the corresponding prior.

f(y = 1) = f(y=1|\pi = .2)f(\pi = .2) + f(y=1|\pi = .5)f(\pi = .5) + f(y = 1|\pi = .8)f(\pi=.8) \approx .0637
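We can compute this normalizing constant directly in R with chess_pmf and our priors:

# total probability of observing y = 1, weighted by the prior over pi
sum(chess_pmf(y = 1, p = c(.2, .5, .8)) * c(.10, .25, .65))
[1] 0.0637575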

Posterior

Our posterior distribution has pmf:

f(\pi|y = 1)

we can write this out as:

f(\pi | y = 1) = \frac{f(\pi)\times L(\pi| y = 1)}{f(y = 1)}

for \pi \in \{0.2, 0.5, 0.8\}

Using just our simplified set of \pi values, we have the following posterior:

\pi           0.2     0.5     0.8     Total
f(\pi|y=1)    0.617   0.368   0.016   1
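These posterior values come straight from Bayes’ Rule; a quick computation in R:

# posterior = likelihood * prior / normalizing constant
prior <- c(.10, .25, .65)
lik   <- chess_pmf(y = 1, p = c(.2, .5, .8))
round(lik * prior / sum(lik * prior), 3)
[1] 0.617 0.368 0.016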

Chess Posterior Simulation

Set up the scenario: we have the possible values of \pi and the corresponding prior probability for each one.

chess <- tibble::tibble(pi = c(.2, .5, .8))
prior <- c(.1, .25, .65)

Next we use the sample_n function to generate 10,000 values of \pi from our data frame; we will use each of these to simulate a 6-game match.

chess_sim <- sample_n(chess, size = 10000, weight = prior, replace = TRUE)

Simulate 10,000 six-game matches:

chess_sim <- chess_sim |>
    mutate(y = rbinom(10000, size = 6, prob = pi))

Let’s check how close this simulation is to our known conditional pmfs:

chess_sim |>
    ggplot(aes(x = y)) + 
    stat_count(aes(y = after_stat(prop))) + 
    facet_wrap(~pi)

Let’s now focus on the events where Y = 1 and tally up the results to see how well they approximate the posterior we formally computed.

chess_sim |>
    filter(y == 1) |>
    group_by(pi) |>
    tally() |>
    mutate(
        prop = n / sum(n)
    )
# A tibble: 3 × 3
     pi     n   prop
  <dbl> <int>  <dbl>
1   0.2   344 0.604 
2   0.5   218 0.382 
3   0.8     8 0.0140
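These simulated proportions line up well with the exact posterior we computed above: .617, .368, and .016.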