diff --git a/R/ch2.html b/R/ch2.html
index c574ee7..54a30d8 100644
--- a/R/ch2.html
+++ b/R/ch2.html
@@ -138,8 +138,10 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni
 library(dplyr)
 library(tidyr)
 library(gt)
-data(fake_news)
-fake_news <- tibble::as_tibble(fake_news)
+library(tibble)
+library(ggplot2)
+data(fake_news)
+fake_news <- tibble::as_tibble(fake_news)

What is the proportion of news articles that were labeled fake vs. real?

@@ -243,12 +245,12 @@ Probability and Likelihood
     gt::cols_width(everything() ~ px(100))
-
+
@@ -787,6 +789,597 @@ total probability

\[P(B^c) = P(A \cap B^c) + P(A^c \cap B^c)\] \[=P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c)\] \[=.0132 + .5868 = .6\]

+
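As a quick sanity check, this total-probability computation can be reproduced in R (the conditional rates .022 and .978 come from the table above):

```{r}
# P(B^c) = P(A | B^c) P(B^c) + P(A^c | B^c) P(B^c)
p_real <- 0.6               # prior probability the article is real, P(B^c)
p_excl_given_real <- 0.022  # P(A | B^c), "!" usage among real articles

p_excl_given_real * p_real + (1 - p_excl_given_real) * p_real
#> [1] 0.6
```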

In the above calculations we also stepped through joint probabilities.

+
+
+
+ +
+
+Joint and conditional probability
+
+
+
+

\[P(A \cap B) = P(A|B)P(B)\]

+

\(A\) and \(B\) are said to be independent events if and only if

+

\[P(A \cap B) = P(A)P(B)\]

+

from this we can also derive the definition of a conditional probability

+

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

+
+
+
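A small R sketch of these identities, plugging in the joint and marginal probabilities worked out above (all numbers come from the fake news table):

```{r}
p_fake <- 0.4      # marginal P(B), article is fake
p_excl <- 0.12     # marginal P(A), article uses an exclamation point
p_joint <- 0.1068  # joint P(A intersect B)

# conditional probability recovered from the joint: P(A|B) = P(A n B) / P(B)
p_joint / p_fake   # 0.267, the "!" rate among fake articles

# independence would require P(A n B) = P(A)P(B); here it clearly fails
p_excl * p_fake    # 0.048, far from 0.1068, so A and B are dependent
```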

At this point we are able to answer the question, “What is the probability that the new article is fake?” Given that the new article has an exclamation point, we can zoom into the top row of the table of probabilities. Within this row we have probabilities \(.1068/.12 = .89\) for fake and \(.0132 / .12 = .11\) for real.
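In R, this zoom-in amounts to normalizing the top row of joint probabilities:

```{r}
# joint probabilities in the exclamation-point row of the table
excl_row <- c(fake = 0.1068, real = 0.0132)
excl_row / sum(excl_row)  # posterior probabilities: fake = .89, real = .11
```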

+

This is essentially Bayes’ Rule. We developed a posterior probability for an event \(B\) given some observation \(A\). We did so by combining the likelihood of event \(B\) given some new data \(A\) with the prior probability of event \(B\). More formally we have the following definition:

+
+
+
+ +
+
+Bayes’ Rule
+
+
+
+

The posterior probability of an event \(B\) given an event \(A\) is:

+

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{L(B|A)P(B)}{P(A)}\]

+

where \(L\) is the likelihood function \(L(B|A) = P(A|B)\) and \(P(A)\) is the total probability of \(A\).

+

More generally,

+

\[ \text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}\]

+
+
+ +
+
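To make the definition concrete, here is a minimal helper that mirrors the formula; the name `bayes_posterior` is ours for illustration, not from any package:

```{r}
# posterior = likelihood * prior / normalizing constant
# (bayes_posterior is a hypothetical helper, defined here for illustration)
bayes_posterior <- function(likelihood, prior, normalizing_constant) {
    likelihood * prior / normalizing_constant
}

# P(fake | !): likelihood L(fake | !) = P(! | fake) = .267,
# prior P(fake) = .4, and total probability P(!) = .12
bayes_posterior(likelihood = 0.267, prior = 0.4, normalizing_constant = 0.12)
#> [1] 0.89
```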

Simulation

+
+
articles <- tibble::tibble(type = c("real", "fake"))
+
+priors <- c(.6, .4)
+
+articles_sim <- sample_n(articles, 10000, replace = TRUE, weight = priors)
+
+
+
articles_sim |>
+    ggplot(aes(x = type)) + geom_bar()
+
+

+
+
+

and a summary table

+
+
articles_sim |>
+    group_by(type) |>
+    summarise(
+        total = n(), 
+        prop = total / nrow(articles_sim)
+    ) |>
+    gt() |>
+    gt::cols_width(everything() ~ px(100))
+
+ +
+type   total    prop
+fake    3941  0.3941
+real    6059  0.6059
+
+
+
+

The simulation of 10,000 articles reproduces, very nearly, the priors we started with. We can now add the exclamation point usage into the data.

+
+
articles_sim <- articles_sim |>
+    mutate(model_data = case_when(
+        type == "fake" ~ .267, 
+        type == "real" ~ .022
+    ))
+
+

The plan here is to iterate through the 10,000 samples and use the model_data value to assign either “yes” or “no” using the sample function.

+
+
data <- c("yes", "no")
+
+articles_sim <- articles_sim |>
+    mutate(id = row_number()) |>
+    group_by(id) |>
+    mutate(usage = sample(data, 1, prob = c(model_data, 1 - model_data)))
+
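Grouping by id forces one sample call per row, which is slow for 10,000 rows. A vectorized sketch that should be equivalent (one Bernoulli draw per article via rbinom) looks like this; articles_sim_fast is just an illustrative name:

```{r}
# one Bernoulli trial per row with success probability model_data,
# recoded to "yes"/"no" (assumed equivalent to the row-wise sample above)
articles_sim_fast <- articles_sim |>
    ungroup() |>
    mutate(usage = ifelse(rbinom(n(), size = 1, prob = model_data) == 1, "yes", "no"))
```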
+
+
articles_sim |>
+    group_by(usage, type) |>
+    summarise(
+        total = n()
+    ) |>
+    pivot_wider(names_from = type, values_from = total)
+
+
# A tibble: 2 × 3
+# Groups:   usage [2]
+  usage  fake  real
+  <chr> <int> <int>
+1 no     2936  5932
+2 yes    1005   127
+
+
+
+
articles_sim |>
+    ggplot(aes(x = type, fill = usage)) + 
+    geom_bar() + 
+    scale_fill_discrete(type = c("gray8", "dodgerblue4"))
+
+

+
+
+

So far we have computed both the priors and likelihoods; we can simply filter our data to reflect the incoming article and determine our posterior.

+
+
articles_sim |>
+    filter(usage == "yes") |>
+    group_by(type) |>
+    summarise(
+        total = n()
+    ) |>
+    mutate(
+        prop = total / sum(total)
+    )
+
+
# A tibble: 2 × 3
+  type  total  prop
+  <chr> <int> <dbl>
+1 fake   1005 0.888
+2 real    127 0.112
+
+
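The simulated posterior lines up closely with the exact value from the probability table:

```{r}
# exact posterior from the table, versus 0.888 in the simulation
0.1068 / 0.12
#> [1] 0.89
```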
+
+
+
+ +
+
+Discrete Probability Model
+
+
+
+

Let \(Y\) be a discrete random variable. The probability model for \(Y\) is described by a probability mass function (pmf) defined as: \[f(y) = P(Y = y)\]

+

and has the following properties

+
+1. \(0 \leq f(y) \leq 1\;\; \forall y\)
+2. \(\sum_{\forall y}f(y) = 1\)
+
+
+
+
+
+ +
+
+In Emanuel’s words
+
+
+
+

What does this mean? Well, it’s very straightforward: a pmf is a function that takes in some value \(y\) and outputs the probability that the random variable \(Y\) equals \(y\).

+
+
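For instance, the pmf of a Binomial(2, 0.5) random variable (our choice of example) can be evaluated with dbinom, and both properties are easy to verify:

```{r}
y <- 0:2                                # support of Y ~ Binomial(2, 0.5)
f_y <- dbinom(y, size = 2, prob = 0.5)
f_y                                     # 0.25 0.50 0.25, each value in [0, 1]
sum(f_y)                                # 1, the pmf sums to one over all y
```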
diff --git a/R/ch2.qmd b/R/ch2.qmd
index ef93a80..4f2dd87 100644
--- a/R/ch2.qmd
+++ b/R/ch2.qmd
@@ -25,6 +25,8 @@ library(bayesrules)
 library(dplyr)
 library(tidyr)
 library(gt)
+library(tibble)
+library(ggplot2)
 data(fake_news)
 fake_news <- tibble::as_tibble(fake_news)
 ```
@@ -180,4 +182,158 @@ parts. Namely
 $$P(B^c) = P(A \cap B^c) + P(A^c \cap B^c)$$
 $$=P(A|B^c)P(B^c) + P(A^c|B^c)P(B^c)$$
 $$=.0132 + .5868 = .6$$
+:::
+
+In the above calculations we also stepped through **joint probabilities**.
+
+:::{.callout-note}
+## Joint and conditional probability
+
+$$P(A \cap B) = P(A|B)P(B)$$
+
+$A$ and $B$ are said to be independent events if and only if
+
+$$P(A \cap B) = P(A)P(B)$$
+
+from this we can also derive the definition of a conditional probability
+
+$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
+
+:::
+
+At this point we are able to answer the question, "What is the probability
+that the new article is fake?" Given that the new article has an exclamation
+point, we can zoom into the top row of the table of probabilities. Within
+this row we have probabilities $.1068/.12 = .89$ for fake and $.0132 / .12 = .11$
+for real.
+
+This is essentially Bayes' Rule. We developed a posterior probability for an event
+$B$ given some observation $A$. We did so by combining the likelihood of event $B$
+given some new data $A$ with the prior probability of event $B$. More formally we
+have the following definition:
+
+:::{.callout-note}
+## Bayes' Rule
+
+The posterior probability of an event $B$ given an event $A$ is:
+
+$$ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{L(B|A)P(B)}{P(A)}$$
+
+where $L$ is the likelihood function $L(B|A) = P(A|B)$ and $P(A)$ is the
+total probability of $A$.
+
+More generally,
+
+$$ \text{posterior} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}$$
+:::
+
+### Simulation
+
+```{r}
+articles <- tibble::tibble(type = c("real", "fake"))
+
+priors <- c(.6, .4)
+
+articles_sim <- sample_n(articles, 10000, replace = TRUE, weight = priors)
+```
+
+```{r}
+articles_sim |>
+    ggplot(aes(x = type)) + geom_bar()
+```
+
+and a summary table
+
+```{r}
+articles_sim |>
+    group_by(type) |>
+    summarise(
+        total = n(),
+        prop = total / nrow(articles_sim)
+    ) |>
+    gt() |>
+    gt::cols_width(everything() ~ px(100))
+```
+
+The simulation of 10,000 articles reproduces, very nearly,
+the priors we started with. We can now add
+the exclamation point usage into the data.
+
+```{r}
+articles_sim <- articles_sim |>
+    mutate(model_data = case_when(
+        type == "fake" ~ .267,
+        type == "real" ~ .022
+    ))
+```
+
+The plan here is to iterate through the 10,000 samples
+and use the `model_data` value to assign either "yes" or
+"no" using the `sample` function.
+
+```{r}
+data <- c("yes", "no")
+
+articles_sim <- articles_sim |>
+    mutate(id = row_number()) |>
+    group_by(id) |>
+    mutate(usage = sample(data, 1, prob = c(model_data, 1 - model_data)))
+```
+
+```{r}
+articles_sim |>
+    group_by(usage, type) |>
+    summarise(
+        total = n()
+    ) |>
+    pivot_wider(names_from = type, values_from = total)
+```
+
+```{r}
+articles_sim |>
+    ggplot(aes(x = type, fill = usage)) +
+    geom_bar() +
+    scale_fill_discrete(type = c("gray8", "dodgerblue4"))
+```
+
+So far we have computed both the priors and likelihoods; we can simply
+filter our data to reflect the incoming article and determine our
+posterior.
+
+```{r}
+articles_sim |>
+    filter(usage == "yes") |>
+    group_by(type) |>
+    summarise(
+        total = n()
+    ) |>
+    mutate(
+        prop = total / sum(total)
+    )
+```
+
+:::{.callout-note}
+## Discrete Probability Model
+
+Let $Y$ be a discrete random variable. The probability model for $Y$ is
+described by a **probability mass function** (pmf) defined as:
+$$f(y) = P(Y = y)$$
+
+and has the following properties
+
+1. $0 \leq f(y) \leq 1\;\; \forall y$
+2. $\sum_{\forall y}f(y) = 1$
+:::
+
+:::{.callout-caution}
+## In Emanuel's words
+What does this mean? Well, it's very straightforward: a pmf is a function that takes
+in some value $y$ and outputs the probability that the random variable
+$Y$ equals $y$.
+:::
\ No newline at end of file
diff --git a/R/ch2_files/figure-html/unnamed-chunk-11-1.png b/R/ch2_files/figure-html/unnamed-chunk-11-1.png
new file mode 100644
index 0000000..faa2b66
Binary files /dev/null and b/R/ch2_files/figure-html/unnamed-chunk-11-1.png differ
diff --git a/R/ch2_files/figure-html/unnamed-chunk-6-1.png b/R/ch2_files/figure-html/unnamed-chunk-6-1.png
new file mode 100644
index 0000000..84c0bbf
Binary files /dev/null and b/R/ch2_files/figure-html/unnamed-chunk-6-1.png differ