Chapter 2 Notes

Author

Emanuel Rodriguez

In this chapter we step through an example of “fake” vs. “real” news, building a framework to estimate the probability that a new article titled “The President has a secret!” is real or fake.

# libraries
library(bayesrules)
library(dplyr)

# load the fake_news data from bayesrules and convert it to a tibble
data(fake_news)
fake_news <- tibble::as_tibble(fake_news)

What proportion of the news articles are labeled fake vs. real?

fake_news |> glimpse()
Rows: 150
Columns: 30
$ title                   <chr> "Clinton's Exploited Haiti Earthquake ‘to Stea…
$ text                    <chr> "0 SHARES Facebook Twitter\n\nBernard Sansaric…
$ url                     <chr> "http://freedomdaily.com/former-haitian-senate…
$ authors                 <chr> NA, NA, "Sierra Marlee", "Jack Shafer,Nolan D"…
$ type                    <fct> fake, real, fake, real, fake, real, fake, fake…
$ title_words             <int> 17, 18, 16, 11, 9, 12, 11, 18, 10, 13, 10, 11,…
$ text_words              <int> 219, 509, 494, 268, 479, 220, 184, 500, 677, 4…
$ title_char              <int> 110, 95, 96, 60, 54, 66, 86, 104, 66, 81, 59, …
$ text_char               <int> 1444, 3016, 2881, 1674, 2813, 1351, 1128, 3112…
$ title_caps              <int> 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0…
$ text_caps               <int> 1, 1, 3, 3, 0, 0, 0, 12, 12, 1, 2, 5, 1, 1, 6,…
$ title_caps_percent      <dbl> 0.000000, 0.000000, 6.250000, 0.000000, 0.0000…
$ text_caps_percent       <dbl> 0.4566210, 0.1964637, 0.6072874, 1.1194030, 0.…
$ title_excl              <int> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ text_excl               <int> 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0…
$ title_excl_percent      <dbl> 0.0000000, 0.0000000, 2.0833333, 0.0000000, 0.…
$ text_excl_percent       <dbl> 0.00000000, 0.00000000, 0.06942034, 0.00000000…
$ title_has_excl          <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
$ anger                   <dbl> 4.24, 2.28, 1.18, 4.66, 0.82, 1.29, 2.56, 3.47…
$ anticipation            <dbl> 2.12, 1.71, 2.16, 1.79, 1.23, 0.43, 2.05, 1.74…
$ disgust                 <dbl> 2.54, 1.90, 0.98, 1.79, 0.41, 1.72, 2.05, 1.35…
$ fear                    <dbl> 3.81, 1.90, 1.57, 4.30, 0.82, 0.43, 5.13, 4.25…
$ joy                     <dbl> 1.27, 1.71, 1.96, 0.36, 1.23, 0.86, 1.54, 1.35…
$ sadness                 <dbl> 4.66, 1.33, 0.78, 1.79, 0.82, 0.86, 2.05, 1.93…
$ surprise                <dbl> 2.12, 1.14, 1.18, 1.79, 0.82, 0.86, 1.03, 1.35…
$ trust                   <dbl> 2.97, 4.17, 3.73, 2.51, 2.46, 2.16, 5.13, 3.86…
$ negative                <dbl> 8.47, 4.74, 3.33, 6.09, 2.66, 3.02, 4.10, 4.63…
$ positive                <dbl> 3.81, 4.93, 5.49, 2.15, 4.30, 2.16, 4.10, 4.25…
$ text_syllables          <int> 395, 845, 806, 461, 761, 376, 326, 891, 1133, …
$ text_syllables_per_word <dbl> 1.803653, 1.660118, 1.631579, 1.720149, 1.5887…
fake_news |>
    group_by(type) |> 
    summarise(
        total = n(),
        prop = total / nrow(fake_news)
    ) 
# A tibble: 2 × 3
  type  total  prop
  <fct> <int> <dbl>
1 fake     60   0.4
2 real     90   0.6

If we let \(B\) be the event that a news article is “fake” news, and \(B^c\) be the event that a news article is “real”, we can write the following:

\[P(B) = 0.4\] \[P(B^c) = 0.6\]

This is the first “clue”, or piece of data, that we build into our framework: the majority of articles are “real”, so, absent any other information, we would simply predict that the new article is “real”. These overall probabilities become our priors.
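Before we see any article-specific evidence, the prior alone implies a naive classifier: always predict the more common label. A quick sketch, using only the data loaded above:

```r
# libraries and data (as above)
library(bayesrules)
library(dplyr)
data(fake_news)

# the majority label in the dataset
majority <- names(which.max(table(fake_news$type)))   # "real"

# accuracy of always predicting the majority label;
# this equals the majority proportion, P(B^c) = 0.6
mean(fake_news$type == majority)
```

This classifier is right 60% of the time by construction, which is exactly what the prior \(P(B^c) = 0.6\) tells us before any other data arrives.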

The next step is to gather additional data and use it to update these priors.
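The update itself follows Bayes’ rule. As a sketch of what “additional data” could look like: the new title, “The President has a secret!”, ends in an exclamation point, and the `title_has_excl` column (visible in the `glimpse()` output above) records exactly that feature. Assuming we use it as our clue, we can estimate the likelihood of an exclamation point under each type and combine it with the priors. The variable names below are illustrative, not from the chapter:

```r
library(bayesrules)
library(dplyr)
data(fake_news)

# priors from the proportions computed earlier
prior_fake <- mean(fake_news$type == "fake")   # P(B)   = 0.4
prior_real <- 1 - prior_fake                   # P(B^c) = 0.6

# likelihood of an exclamation point in the title, given each type
lik_fake <- mean(fake_news$title_has_excl[fake_news$type == "fake"])
lik_real <- mean(fake_news$title_has_excl[fake_news$type == "real"])

# Bayes' rule:
# P(B | excl) = P(excl | B) P(B) /
#               [ P(excl | B) P(B) + P(excl | B^c) P(B^c) ]
posterior_fake <- (lik_fake * prior_fake) /
  (lik_fake * prior_fake + lik_real * prior_real)
posterior_fake
```

The posterior probability of “fake” trades off the prior (which favors “real”) against how much more common exclamation points are in one type of title than the other.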