diff --git a/R/ch2.html b/R/ch2.html index 54a30d8..1d52094 100644 --- a/R/ch2.html +++ b/R/ch2.html @@ -103,10 +103,23 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni - +
- +
@@ -131,7 +144,9 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni
+

Note: these notes are a work in progress

In this chapter we step through a “fake” vs “real” news example to build a framework for determining the probability that a new article, titled “The President has a secret!”, is fake or real.

+

We then go on to build a probability model known as the Binomial model using the Bayesian framework.

# libraries
 library(bayesrules)
@@ -145,40 +160,24 @@ code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warni
 

What proportion of news articles were labeled fake vs real?

-
fake_news |> glimpse()
+
fake_news |> head()
-
Rows: 150
-Columns: 30
-$ title                   <chr> "Clinton's Exploited Haiti Earthquake ‘to Stea…
-$ text                    <chr> "0 SHARES Facebook Twitter\n\nBernard Sansaric…
-$ url                     <chr> "http://freedomdaily.com/former-haitian-senate…
-$ authors                 <chr> NA, NA, "Sierra Marlee", "Jack Shafer,Nolan D"…
-$ type                    <fct> fake, real, fake, real, fake, real, fake, fake…
-$ title_words             <int> 17, 18, 16, 11, 9, 12, 11, 18, 10, 13, 10, 11,…
-$ text_words              <int> 219, 509, 494, 268, 479, 220, 184, 500, 677, 4…
-$ title_char              <int> 110, 95, 96, 60, 54, 66, 86, 104, 66, 81, 59, …
-$ text_char               <int> 1444, 3016, 2881, 1674, 2813, 1351, 1128, 3112…
-$ title_caps              <int> 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0…
-$ text_caps               <int> 1, 1, 3, 3, 0, 0, 0, 12, 12, 1, 2, 5, 1, 1, 6,…
-$ title_caps_percent      <dbl> 0.000000, 0.000000, 6.250000, 0.000000, 0.0000…
-$ text_caps_percent       <dbl> 0.4566210, 0.1964637, 0.6072874, 1.1194030, 0.…
-$ title_excl              <int> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
-$ text_excl               <int> 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0…
-$ title_excl_percent      <dbl> 0.0000000, 0.0000000, 2.0833333, 0.0000000, 0.…
-$ text_excl_percent       <dbl> 0.00000000, 0.00000000, 0.06942034, 0.00000000…
-$ title_has_excl          <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
-$ anger                   <dbl> 4.24, 2.28, 1.18, 4.66, 0.82, 1.29, 2.56, 3.47…
-$ anticipation            <dbl> 2.12, 1.71, 2.16, 1.79, 1.23, 0.43, 2.05, 1.74…
-$ disgust                 <dbl> 2.54, 1.90, 0.98, 1.79, 0.41, 1.72, 2.05, 1.35…
-$ fear                    <dbl> 3.81, 1.90, 1.57, 4.30, 0.82, 0.43, 5.13, 4.25…
-$ joy                     <dbl> 1.27, 1.71, 1.96, 0.36, 1.23, 0.86, 1.54, 1.35…
-$ sadness                 <dbl> 4.66, 1.33, 0.78, 1.79, 0.82, 0.86, 2.05, 1.93…
-$ surprise                <dbl> 2.12, 1.14, 1.18, 1.79, 0.82, 0.86, 1.03, 1.35…
-$ trust                   <dbl> 2.97, 4.17, 3.73, 2.51, 2.46, 2.16, 5.13, 3.86…
-$ negative                <dbl> 8.47, 4.74, 3.33, 6.09, 2.66, 3.02, 4.10, 4.63…
-$ positive                <dbl> 3.81, 4.93, 5.49, 2.15, 4.30, 2.16, 4.10, 4.25…
-$ text_syllables          <int> 395, 845, 806, 461, 761, 376, 326, 891, 1133, …
-$ text_syllables_per_word <dbl> 1.803653, 1.660118, 1.631579, 1.720149, 1.5887…
+
# A tibble: 6 × 30
+  title        text  url   authors type  title…¹ text_…² title…³ text_…⁴ title…⁵
+  <chr>        <chr> <chr> <chr>   <fct>   <int>   <int>   <int>   <int>   <int>
+1 Clinton's E… "0 S… http… <NA>    fake       17     219     110    1444       0
+2 Donald Trum… "\n\… http… <NA>    real       18     509      95    3016       0
+3 Michelle Ob… "Mic… http… Sierra… fake       16     494      96    2881       1
+4 Trump hits … "“Cr… http… Jack S… real       11     268      60    1674       0
+5 Australia V… "Whe… http… Blair … fake        9     479      54    2813       0
+6 It’s “Trump… "Lik… http… View A… real       12     220      66    1351       1
+# … with 20 more variables: text_caps <int>, title_caps_percent <dbl>,
+#   text_caps_percent <dbl>, title_excl <int>, text_excl <int>,
+#   title_excl_percent <dbl>, text_excl_percent <dbl>, title_has_excl <lgl>,
+#   anger <dbl>, anticipation <dbl>, disgust <dbl>, fear <dbl>, joy <dbl>,
+#   sadness <dbl>, surprise <dbl>, trust <dbl>, negative <dbl>, positive <dbl>,
+#   text_syllables <int>, text_syllables_per_word <dbl>, and abbreviated
+#   variable names ¹​title_words, ²​text_words, ³​title_char, ⁴​text_char, …
fake_news |>
     group_by(type) |> 
@@ -245,12 +244,12 @@ Probability and Likelihood
     gt::cols_width(everything() ~ px(100))
-
+
@@ -855,12 +854,12 @@ Baye’s Rule gt::cols_width(everything() ~ px(100))
-
+
@@ -1276,11 +1275,11 @@ Baye’s Rule fake -3941 -0.3941 +4031 +0.4031 real -6059 -0.6059 +5969 +0.5969 @@ -1317,8 +1316,8 @@ Baye’s Rule # Groups: usage [2] usage fake real <chr> <int> <int> -1 no 2936 5932 -2 yes 1005 127 +1 no 2955 5845 +2 yes 1076 124
@@ -1345,8 +1344,46 @@ Baye’s Rule
# A tibble: 2 × 3
   type  total  prop
   <chr> <int> <dbl>
-1 fake   1005 0.888
-2 real    127 0.112
+1 fake 1076 0.897 +2 real 124 0.103 +
+
+ +
+

Binomial Model and the chess example

+

The example used here is a chess match between a human and the computer “Deep Blue”. The setup is that the two faced each other in 1996 and the human won. A rematch is scheduled for the following year, 1997. We would like to model the number of games out of 6 that the human will win.

+

Let \(\pi\) be the probability that the human wins any one game against the computer. To simplify things greatly, we assume that \(\pi\) takes only the values .2, .5, or .8. We also assume the following prior (the book says we will learn how to build these later on):

+ + + + + + + + + + + + + + + + + + + +
| \(\pi\)    | .2  | .5  | .8  | total |
|------------|-----|-----|-----|-------|
| \(f(\pi)\) | .10 | .25 | .65 | 1     |
+
+
+
+ +
+
+Note +
+
+
+

It’s important to note here that the values of \(\pi\) do not add up to 1, and they do not need to. \(\pi\) represents the chance of winning any single game, so in general it can take any value in \([0, 1]\). On the other hand, \(f\) is a function that assigns a probability to each value of \(\pi\), and those probabilities do sum to 1; this is defined next.
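The distinction between the values of \(\pi\) and the prior \(f(\pi)\) can be checked directly. This is a minimal sketch that only re-enters the numbers from the prior table; the names `pi_vals` and `prior` are illustrative and do not come from the book or from `bayesrules`.

```r
# Candidate values of pi and their prior probabilities,
# copied from the prior table in these notes
pi_vals <- c(0.2, 0.5, 0.8)
prior   <- c(0.10, 0.25, 0.65)

# The pi values are just candidate win probabilities; nothing forces
# them to add up to 1
sum_of_pi <- sum(pi_vals)

# The prior f(pi) is a pmf over those candidates, so it must add up to 1
sum_of_prior <- sum(prior)

# Each prior probability must also lie between 0 and 1
prior_is_valid <- all(prior >= 0 & prior <= 1) &&
  isTRUE(all.equal(sum_of_prior, 1))

print(sum_of_pi)
print(sum_of_prior)
print(prior_is_valid)
```

Comparing with `all.equal` rather than `==` avoids false alarms from floating-point rounding when the probabilities are summed.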

@@ -1380,6 +1417,41 @@ in emanuel’s words

What does this mean? It’s very straightforward: a pmf is a function that takes in some value \(y\) and outputs the probability that the random variable \(Y\) equals \(y\).

+
+

The Binomial Model

+
+
+
+ +
+
+Conditional probability model of data \(Y\) +
+
+
+

Let \(Y\) be a discrete random variable that depends on some parameter \(\pi\). We define the conditional probability model of \(Y\) as the conditional pmf,

+

\[f(y|\pi) = P(Y = y | \pi)\]

+

and has the following properties,

+
    +
  1. \(0 \leq f(y|\pi) \leq 1\;\; \forall y\)
  2. \(\sum_{\forall y}f(y|\pi) = 1\)
+
+
+
+
+
+ +
+
+in emanuel’s words +
+
+
+

This is essentially the same probability model we defined above, except that now the probabilities are conditioned on some parameter \(\pi\).
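The two properties of a conditional pmf can be checked numerically for the chess example. This is a minimal sketch, assuming the conditional model is Binomial with 6 independent games and a constant win probability \(\pi\); the section heading suggests this, but the notes so far do not spell it out. `dbinom` is base R's Binomial pmf, and `n_games`, `pi_vals`, `y`, and `f_cond` are illustrative names.

```r
# Assumption: Y | pi ~ Binomial(6, pi), where Y is the number of games
# (out of 6) that the human wins
n_games <- 6
pi_vals <- c(0.2, 0.5, 0.8)  # candidate values of pi from the prior table
y       <- 0:n_games         # every possible number of wins

# One column per value of pi; each column holds f(y | pi) for y = 0, ..., 6
f_cond <- sapply(pi_vals, function(p) dbinom(y, size = n_games, prob = p))
dimnames(f_cond) <- list(y = as.character(y), pi = as.character(pi_vals))

# Property 1: every f(y | pi) lies between 0 and 1
property_1 <- all(f_cond >= 0 & f_cond <= 1)

# Property 2: for each fixed pi, f(y | pi) sums to 1 over all y
column_sums <- colSums(f_cond)
property_2  <- isTRUE(all.equal(unname(column_sums), rep(1, length(pi_vals))))

print(round(f_cond, 4))
print(property_1)
print(property_2)
```

Note that the sum runs over \(y\) within a column (fixed \(\pi\)); the rows, which fix \(y\) and vary \(\pi\), are not required to sum to 1.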

+
+
+
diff --git a/R/ch2.qmd b/R/ch2.qmd index 4f2dd87..c4b1d05 100644 --- a/R/ch2.qmd +++ b/R/ch2.qmd @@ -11,12 +11,18 @@ format: css: styles.css callout-icon: false callout-apperance: simple + toc: true --- +*Note: these notes are a work in progress* + In this chapter we step through a "fake" vs "real" news example to build a framework for determining the probability that a new article, titled "The President has a secret!", is fake or real. +We then go on to build a probability model known as the Binomial model using the +Bayesian framework. + ```{r} #| message: false #| warning: false @@ -34,7 +40,7 @@ fake_news <- tibble::as_tibble(fake_news) What proportion of news articles were labeled fake vs real? ```{r} -fake_news |> glimpse() +fake_news |> head() fake_news |> group_by(type) |> @@ -316,6 +322,33 @@ articles_sim |> ) ``` +## Binomial Model and the chess example + +The example used here is a chess match between a human +and the computer "Deep Blue". The setup is that the two +faced each other in 1996 and the human won. A rematch is +scheduled for the following year, 1997. We would like to model the number of games +out of 6 that the human will win. + +Let $\pi$ be the probability that the human wins any one game against +the computer. To simplify things greatly, we assume that $\pi$ takes only +the values .2, .5, or .8. We also assume the following prior (the book says +we will learn how to build these later on): + +| $\pi$ | .2 | .5 | .8 | total | +|--------|----|----|----|-------| +|$f(\pi)$|.10 |.25 |.65 | 1 | + +:::{.callout-caution} +## Note + +It's important to note here that the values of $\pi$ **do +not** add up to 1, and they do not need to. $\pi$ represents the chance of winning any single +game, so in general it can take any value in $[0, 1]$. On +the other hand, $f$ is a function that assigns a probability to each value of $\pi$, +and those probabilities do sum to 1; this is defined next. 
+::: + :::{.callout-note} ## Discrete Probability Model @@ -336,4 +369,27 @@ and has the following properties What does this mean? It's very straightforward: a pmf is a function that takes in some value $y$ and outputs the probability that the random variable $Y$ equals $y$. +::: + +### The Binomial Model + +:::{.callout-note} +## Conditional probability model of data $Y$ + +Let $Y$ be a discrete random variable that depends on some parameter +$\pi$. We define the conditional probability model of $Y$ as the +conditional pmf, + +$$f(y|\pi) = P(Y = y | \pi)$$ + +and has the following properties, + +1. $0 \leq f(y|\pi) \leq 1\;\; \forall y$ +2. $\sum_{\forall y}f(y|\pi) = 1$ +::: + +:::{.callout-caution} +## in emanuel's words +This is essentially the same probability model we defined above, except +that now the probabilities are conditioned on some parameter $\pi$. ::: \ No newline at end of file diff --git a/R/ch2_files/figure-html/unnamed-chunk-11-1.png b/R/ch2_files/figure-html/unnamed-chunk-11-1.png index faa2b66..d6b4e1a 100644 Binary files a/R/ch2_files/figure-html/unnamed-chunk-11-1.png and b/R/ch2_files/figure-html/unnamed-chunk-11-1.png differ diff --git a/R/ch2_files/figure-html/unnamed-chunk-6-1.png b/R/ch2_files/figure-html/unnamed-chunk-6-1.png index 84c0bbf..303bf1d 100644 Binary files a/R/ch2_files/figure-html/unnamed-chunk-6-1.png and b/R/ch2_files/figure-html/unnamed-chunk-6-1.png differ