Post hoc synopsis

Ова веројатно е еден од покорисните документи за нови TidyTuesdayAtKIKA посетители. Вклучен е код за инспекција и чистење на проблематични варијабли, како текстуални така и нумерички (датуми). Има неколку типови на визуализации на податоците, со основен ggplot2 код и специјализирани пакети. И има кратко теоретско објаснување за синтаксата на ggplot2 како и неколку корисни кратенки за RStudio.

Intro

Recap

  • .Rmd (R Markdown) е околината во која ќе работиме бидејќи овозможува код, текст, коментари и аутпут да се презентираат заедно;
  • Нов R Notebook: Мени: File -> New File -> R notebook
  • Корисни кратенки:
    • insert code chunk (ctrl+alt+i), execute all above/below cursor (ctr+alt+b/ctrl+alt+e),
    • pipes (ctrl+shift+m) [not a true bash pipe. More a function to chain other functions]
    • view()

The tidyverse

The tidyverse is an “opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” (source: https://thomasmock.netlify.com/post/tidytuesday-a-weekly-social-data-project-in-r/)

TidyTuesday 37 недела, 2 пат КИКА

  • Податоци на github
  • Добар пример за реални, ‘валкани’ податоци што треба да се исчистат. Ако на пример работите во осигурителна куќа и ви речат ајде да видиме што да се прави за полиси за осигурување за повреди во забавни паркови.

Libraties (packages) and data

Load the libraries

suppressPackageStartupMessages(library(tidyverse))

Load the data from github

tx_injuries <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/tx_injuries.csv")

safer_parks <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-10/saferparks.csv")

Средување NA вредности

A lot of free text this week, some inconsistent NAs (n/a, N/A) and dates (ymd, dmy). A good chance to do some data cleaning and then take a look at frequency, type of injury, and analyze free text.

Прашање: Колку сѐ можни форми на NA (not applicable) има во податоците?

glimpse(tx_injuries)
## Rows: 542
## Columns: 13
## $ injury_report_rec <dbl> 2032, 1897, 837, 99, 55, 780, 253, 253, 55, 55, 2...
## $ name_of_operation <chr> "Skygroup Investments LLC DBA iFly Austin", "Will...
## $ city              <chr> "Austin", "Galveston", "Grapevine", "San Antonio"...
## $ st                <chr> "TX", "TX", "TX", "TX", "AZ", "TX", "TX", "TX", "...
## $ injury_date       <chr> "2/12/2013", "3/2/2013", "3/3/2013", "3/3/2013", ...
## $ ride_name         <chr> "I Fly", "Gulf Glider", "Howlin Tornado", "Scooby...
## $ serial_no         <chr> "SV024", "GS-11-10-WG-14", "0643-C1-T1-TN60", "n/...
## $ gender            <chr> "F", "F", "F", "F", "F", "F", "F", "M", "M", "F",...
## $ age               <chr> "37", "43", "n/a", "51", "17", "40", "36", "23", ...
## $ body_part         <chr> "Mouth", "Knee", "Right Shoulder", "Lower Leg", "...
## $ alleged_injury    <chr> "Student hit mouth on wall", "Alleged arthroscopy...
## $ cause_of_injury   <chr> "Student attempted unfamiliar manuever", "Hit her...
## $ other             <chr> NA, "Prior history of problems with this knee. Fi...
#map(tx_injuries, unique)

Колку што можевме да видиме од dplyr::glimpse() командата, има:N/A, n/a. Можеме да ги конвертираме овие форми на NA во стандардна R NA со функцијата dplyr::na_if() (за потсетување, синтаксата со :: означува пакет::функција).

tx_injuries <- tx_injuries %>% 
    mutate(age = na_if(age, "n/a")) %>% 
    mutate(age = na_if(age, "N/A")) %>% 
    mutate(age = na_if(age, "0")) %>% 
    mutate(age = as.numeric(age)) %>% 
    mutate(gender = na_if(gender, "n/a")) %>% 
    mutate(gender = na_if(gender, "N/A")) %>% 
    mutate(gender=str_to_upper(gender)) %>%
    mutate(gender = as_factor(gender)) %>% 
    mutate(injury_date = na_if(injury_date, "n/a")) %>% 
    mutate(injury_date = na_if(injury_date, "#########")) 

#map(tx_injuries, unique)

Компјутеризирано барање на NA

Очигледно е дека горниот приод каде повикуваме na_if за секоја варијабла и можна форма на NA не е многу ефикасен. Особено ако имаме посериозен проблем со повеќе колони со податоци кои сакаме да ги процесираме. Во пракса ваквото чистење на податоци најдобро се прави со custom функции, особено кога иста или слична трансформација треба да се аплицира на повеќе вектори (колони).

# make a function to mutate a column to fix NAs
fix_na <- function(vect, na_strings) {
  vect[vect %in% na_strings] <- NA
  return(vect)
}

# find possible NA values
possible_na <-
  tx_injuries %>%
  gather(col_name, value) %>%
  filter(str_detect(value, "^n|^N"), str_length(value) < 5) %>%
  distinct(value, .keep_all = FALSE) %>%
  filter(!value %in% c("Neck", "Nose", "neck")) %>%
  pull(value)

possible_na
## [1] "N/A" "n/a"
# apply our custom function to all columns 
# while using all possible NA we just found
tx_injuries_na_fixed <- tx_injuries %>% 
  mutate_all(.funs = fix_na, na_strings = possible_na)

Чистење на датуми

Со пакетот lubridate

suppressPackageStartupMessages(library(lubridate))

tx_injuries$injury_date
##   [1] "2/12/2013" "3/2/2013"  "3/3/2013"  "3/3/2013"  "3/11/2013" "3/12/2013"
##   [7] "3/15/2013" "3/15/2013" "3/16/2013" "3/16/2013" "3/28/2013" "3/30/2013"
##  [13] "4/6/2013"  "4/11/2013" "4/12/2013" "4/14/2013" "4/20/2013" "4/27/2013"
##  [19] "4/27/2013" "4/28/2013" "4/28/2013" "4/28/2013" "5/11/2013" "41406"    
##  [25] "5/12/2013" "5/18/2013" "5/18/2013" "5/26/2013" "5/27/2013" "5/29/2013"
##  [31] "5/31/2013" "41426"     "6/2/2013"  "6/4/2013"  "6/5/2013"  "6/6/2013" 
##  [37] "6/7/2013"  "6/8/2013"  "41434"     "6/9/2013"  "6/13/2013" "41440"    
##  [43] "6/18/2013" "6/20/2013" "6/22/2013" "6/24/2013" "6/25/2013" "6/25/2013"
##  [49] "6/26/2013" "6/27/2013" "41452"     "41452"     "41452"     "6/30/2013"
##  [55] "41455"     "6/30/2013" "7/1/2013"  "7/2/2013"  "41458"     "41458"    
##  [61] "7/4/2013"  "7/5/2013"  "7/7/2013"  "7/7/2013"  "7/7/2013"  "7/7/2013" 
##  [67] "7/9/2013"  "7/10/2013" "7/11/2013" "7/13/2013" "7/14/2013" "7/14/2013"
##  [73] "7/14/2013" "7/16/2013" "7/18/2013" "7/18/2013" "7/19/2013" "7/19/2013"
##  [79] "7/20/2013" "7/20/2013" "7/21/2013" "7/22/2013" "7/23/2013" "7/27/2013"
##  [85] "7/27/2013" "7/27/2013" "7/27/2013" "7/27/2013" "7/28/2013" "7/28/2013"
##  [91] "7/28/2013" "8/4/2013"  "41493"     "41493"     "8/8/2013"  "8/10/2013"
##  [97] "8/10/2013" "8/10/2013" "8/11/2013" "8/12/2013" "8/14/2013" "8/15/2013"
## [103] "41504"     "41504"     "41504"     "8/21/2013" "8/23/2013" "41509"    
## [109] "8/28/2013" "8/30/2013" "8/31/2013" "8/31/2013" "9/1/2013"  "9/7/2013" 
## [115] "9/7/2013"  "9/14/2013" "41531"     "41531"     "41532"     "41539"    
## [121] "41545"     "41551"     NA          NA          NA          NA         
## [127] NA          "41580"     NA          "12/8/2013" NA          "1/11/2014"
## [133] "1/19/2014" "41664"     "41670"     "41688"     "41699"     "41699"    
## [139] "41708"     "41708"     "41710"     "41711"     "41718"     "41718"    
## [145] "41741"     "41748"     "41755"     "41756"     "41756"     "41756"    
## [151] "4/28/2014" "41761"     "5/3/2014"  "5/4/2014"  "5/9/2014"  "41768"    
## [157] "41768"     "5/9/2014"  "5/10/2014" "5/10/2014" "5/11/2014" "5/16/2014"
## [163] "5/16/2014" "5/16/2014" "5/17/2014" "41776"     "5/19/2014" "5/21/2014"
## [169] "5/24/2014" "41785"     "5/29/2014" "41790"     "6/2/2014"  "6/5/2014" 
## [175] "6/7/2014"  "6/8/2014"  "41801"     "41807"     "6/17/2014" "6/17/2014"
## [181] "6/18/2014" "41810"     "41810"     "6/21/2014" "6/21/2014" "6/22/2014"
## [187] "6/26/2014" "6/28/2014" "6/30/2014" "41823"     "7/4/2014"  "41824"    
## [193] "7/7/2014"  "7/7/2014"  "7/8/2014"  "7/9/2014"  "7/11/2014" "7/13/2014"
## [199] "7/16/2014" "7/17/2014" "7/21/2014" "7/22/2014" "41843"     "7/23/2014"
## [205] "7/25/2014" "7/25/2014" "7/26/2014" "7/27/2014" "7/30/2014" "8/5/2014" 
## [211] "8/8/2014"  "8/8/2014"  "8/9/2014"  "8/9/2014"  "8/10/2014" "8/10/2014"
## [217] "8/18/2014" "8/19/2014" "8/19/2014" "8/22/2014" "41875"     "8/24/2014"
## [223] "8/25/2014" "8/31/2014" "8/31/2014" "41882"     "9/1/2014"  "9/1/2014" 
## [229] "9/1/2014"  "9/7/2014"  "9/13/2014" "9/27/2014" NA          NA         
## [235] NA          NA          NA          NA          NA          "1/14/2015"
## [241] "1/26/2015" "2/15/2015" "2/26/2015" "42069"     "3/7/2015"  "3/13/2015"
## [247] "42083"     "4/16/2015" "4/23/2015" "5/2/2015"  "5/16/2015" "42140"    
## [253] "5/16/2015" "5/17/2015" "5/18/2015" "5/22/2015" "42148"     "42149"    
## [259] "42151"     "42151"     "42152"     "42155"     "42155"     "42157"    
## [265] "42159"     "42159"     "42159"     "42159"     "42161"     "42161"    
## [271] "42162"     "42165"     "42168"     "42168"     "42168"     "42169"    
## [277] "42169"     "42169"     "42169"     "42169"     "42170"     "42171"    
## [283] "42173"     "42173"     "42174"     "42174"     "42174"     "42174"    
## [289] "42175"     "42175"     "42175"     "42176"     "42176"     "42176"    
## [295] "42176"     "42177"     "42177"     "42177"     "42180"     "42180"    
## [301] "42181"     "42181"     "42182"     "42183"     "42186"     "42188"    
## [307] "42188"     "42189"     "42191"     "42191"     "42192"     "42192"    
## [313] "42192"     "42197"     "42197"     "42198"     "42198"     "42201"    
## [319] "42202"     "42202"     "42203"     "42203"     "42204"     "42204"    
## [325] "42205"     "42205"     "42207"     "42208"     "42210"     "42216"    
## [331] "42217"     "42217"     "42217"     "42218"     "42219"     "41855"    
## [337] "42224"     "42225"     "42225"     "42226"     "42226"     "42227"    
## [343] "42229"     "42229"     "42230"     "42230"     "42232"     "42232"    
## [349] "42233"     "42240"     "42244"     "42244"     "42246"     "42252"    
## [355] "42253"     "42259"     "42260"     "42260"     "42261"     "42271"    
## [361] "42280"     NA          NA          NA          NA          "42316"    
## [367] NA          NA          "42414"     "42442"     "42443"     "42454"    
## [373] "42455"     "42456"     "42463"     "42467"     "42483"     "42489"    
## [379] "42489"     "42496"     "42502"     "42505"     "42505"     "42506"    
## [385] "42512"     "42515"     "42515"     "42516"     "42518"     "42518"    
## [391] "42518"     "42518"     "42522"     "42523"     "42524"     "42528"    
## [397] "42531"     "42540"     "42542"     "42543"     "42544"     "42545"    
## [403] "42546"     "42546"     "42546"     "42548"     "42549"     "42549"    
## [409] "42550"     "42553"     "42553"     "42554"     "42555"     "42556"    
## [415] "42557"     "42558"     "42559"     "42559"     "42560"     "42560"    
## [421] "42560"     "42560"     "42561"     "42561"     "42563"     "42564"    
## [427] "42566"     "42567"     "42568"     "42568"     "42568"     "42570"    
## [433] "42570"     "42571"     "42571"     "42572"     "42572"     "42572"    
## [439] "42572"     "42572"     "42573"     "42574"     "42575"     "42576"    
## [445] "42581"     "42581"     "42586"     "42588"     "42589"     "42589"    
## [451] "42590"     "42590"     "42596"     "42602"     "42603"     "42610"    
## [457] "42616"     "42617"     "42623"     "42628"     "42638"     "42644"    
## [463] "42651"     NA          NA          NA          NA          NA         
## [469] "42785"     "42799"     "42806"     "42808"     "42811"     "42819"    
## [475] "42826"     "42826"     "42834"     "42834"     "42840"     "42840"    
## [481] "42861"     "42869"     "42875"     "42878"     "42880"     "42886"    
## [487] "42891"     "42896"     "42897"     "42898"     "42899"     "42900"    
## [493] "42904"     "42905"     "42906"     "42908"     "42909"     "42909"    
## [499] "42912"     "42916"     "42917"     "42919"     "42922"     "42925"    
## [505] "42928"     "42931"     "42931"     "42932"     "42936"     "42936"    
## [511] "42938"     "42938"     "42942"     "42942"     "42943"     "42944"    
## [517] "42946"     "42949"     "42949"     "42957"     "42958"     "42958"    
## [523] "42959"     "42959"     "42960"     "42960"     "42960"     "42960"    
## [529] "42961"     "42963"     "42965"     "42981"     "42982"     "43005"    
## [535] "43005"     "43009"     "43011"     NA          NA          NA         
## [541] "43073"     NA
tx_injuries <- tx_injuries %>% 
    mutate(injury_date = case_when(
      str_length(injury_date) == 5 ~ as_date(
        as.numeric(injury_date), origin='1899-12-30'),
      str_detect(injury_date, "/") == TRUE ~ mdy(injury_date),
      TRUE ~ as.Date(NA)
      ))
      

tx_injuries$injury_date
##   [1] "2013-02-12" "2013-03-02" "2013-03-03" "2013-03-03" "2013-03-11"
##   [6] "2013-03-12" "2013-03-15" "2013-03-15" "2013-03-16" "2013-03-16"
##  [11] "2013-03-28" "2013-03-30" "2013-04-06" "2013-04-11" "2013-04-12"
##  [16] "2013-04-14" "2013-04-20" "2013-04-27" "2013-04-27" "2013-04-28"
##  [21] "2013-04-28" "2013-04-28" "2013-05-11" "2013-05-12" "2013-05-12"
##  [26] "2013-05-18" "2013-05-18" "2013-05-26" "2013-05-27" "2013-05-29"
##  [31] "2013-05-31" "2013-06-01" "2013-06-02" "2013-06-04" "2013-06-05"
##  [36] "2013-06-06" "2013-06-07" "2013-06-08" "2013-06-09" "2013-06-09"
##  [41] "2013-06-13" "2013-06-15" "2013-06-18" "2013-06-20" "2013-06-22"
##  [46] "2013-06-24" "2013-06-25" "2013-06-25" "2013-06-26" "2013-06-27"
##  [51] "2013-06-27" "2013-06-27" "2013-06-27" "2013-06-30" "2013-06-30"
##  [56] "2013-06-30" "2013-07-01" "2013-07-02" "2013-07-03" "2013-07-03"
##  [61] "2013-07-04" "2013-07-05" "2013-07-07" "2013-07-07" "2013-07-07"
##  [66] "2013-07-07" "2013-07-09" "2013-07-10" "2013-07-11" "2013-07-13"
##  [71] "2013-07-14" "2013-07-14" "2013-07-14" "2013-07-16" "2013-07-18"
##  [76] "2013-07-18" "2013-07-19" "2013-07-19" "2013-07-20" "2013-07-20"
##  [81] "2013-07-21" "2013-07-22" "2013-07-23" "2013-07-27" "2013-07-27"
##  [86] "2013-07-27" "2013-07-27" "2013-07-27" "2013-07-28" "2013-07-28"
##  [91] "2013-07-28" "2013-08-04" "2013-08-07" "2013-08-07" "2013-08-08"
##  [96] "2013-08-10" "2013-08-10" "2013-08-10" "2013-08-11" "2013-08-12"
## [101] "2013-08-14" "2013-08-15" "2013-08-18" "2013-08-18" "2013-08-18"
## [106] "2013-08-21" "2013-08-23" "2013-08-23" "2013-08-28" "2013-08-30"
## [111] "2013-08-31" "2013-08-31" "2013-09-01" "2013-09-07" "2013-09-07"
## [116] "2013-09-14" "2013-09-14" "2013-09-14" "2013-09-15" "2013-09-22"
## [121] "2013-09-28" "2013-10-04" NA           NA           NA          
## [126] NA           NA           "2013-11-02" NA           "2013-12-08"
## [131] NA           "2014-01-11" "2014-01-19" "2014-01-25" "2014-01-31"
## [136] "2014-02-18" "2014-03-01" "2014-03-01" "2014-03-10" "2014-03-10"
## [141] "2014-03-12" "2014-03-13" "2014-03-20" "2014-03-20" "2014-04-12"
## [146] "2014-04-19" "2014-04-26" "2014-04-27" "2014-04-27" "2014-04-27"
## [151] "2014-04-28" "2014-05-02" "2014-05-03" "2014-05-04" "2014-05-09"
## [156] "2014-05-09" "2014-05-09" "2014-05-09" "2014-05-10" "2014-05-10"
## [161] "2014-05-11" "2014-05-16" "2014-05-16" "2014-05-16" "2014-05-17"
## [166] "2014-05-17" "2014-05-19" "2014-05-21" "2014-05-24" "2014-05-26"
## [171] "2014-05-29" "2014-05-31" "2014-06-02" "2014-06-05" "2014-06-07"
## [176] "2014-06-08" "2014-06-11" "2014-06-17" "2014-06-17" "2014-06-17"
## [181] "2014-06-18" "2014-06-20" "2014-06-20" "2014-06-21" "2014-06-21"
## [186] "2014-06-22" "2014-06-26" "2014-06-28" "2014-06-30" "2014-07-03"
## [191] "2014-07-04" "2014-07-04" "2014-07-07" "2014-07-07" "2014-07-08"
## [196] "2014-07-09" "2014-07-11" "2014-07-13" "2014-07-16" "2014-07-17"
## [201] "2014-07-21" "2014-07-22" "2014-07-23" "2014-07-23" "2014-07-25"
## [206] "2014-07-25" "2014-07-26" "2014-07-27" "2014-07-30" "2014-08-05"
## [211] "2014-08-08" "2014-08-08" "2014-08-09" "2014-08-09" "2014-08-10"
## [216] "2014-08-10" "2014-08-18" "2014-08-19" "2014-08-19" "2014-08-22"
## [221] "2014-08-24" "2014-08-24" "2014-08-25" "2014-08-31" "2014-08-31"
## [226] "2014-08-31" "2014-09-01" "2014-09-01" "2014-09-01" "2014-09-07"
## [231] "2014-09-13" "2014-09-27" NA           NA           NA          
## [236] NA           NA           NA           NA           "2015-01-14"
## [241] "2015-01-26" "2015-02-15" "2015-02-26" "2015-03-06" "2015-03-07"
## [246] "2015-03-13" "2015-03-20" "2015-04-16" "2015-04-23" "2015-05-02"
## [251] "2015-05-16" "2015-05-16" "2015-05-16" "2015-05-17" "2015-05-18"
## [256] "2015-05-22" "2015-05-24" "2015-05-25" "2015-05-27" "2015-05-27"
## [261] "2015-05-28" "2015-05-31" "2015-05-31" "2015-06-02" "2015-06-04"
## [266] "2015-06-04" "2015-06-04" "2015-06-04" "2015-06-06" "2015-06-06"
## [271] "2015-06-07" "2015-06-10" "2015-06-13" "2015-06-13" "2015-06-13"
## [276] "2015-06-14" "2015-06-14" "2015-06-14" "2015-06-14" "2015-06-14"
## [281] "2015-06-15" "2015-06-16" "2015-06-18" "2015-06-18" "2015-06-19"
## [286] "2015-06-19" "2015-06-19" "2015-06-19" "2015-06-20" "2015-06-20"
## [291] "2015-06-20" "2015-06-21" "2015-06-21" "2015-06-21" "2015-06-21"
## [296] "2015-06-22" "2015-06-22" "2015-06-22" "2015-06-25" "2015-06-25"
## [301] "2015-06-26" "2015-06-26" "2015-06-27" "2015-06-28" "2015-07-01"
## [306] "2015-07-03" "2015-07-03" "2015-07-04" "2015-07-06" "2015-07-06"
## [311] "2015-07-07" "2015-07-07" "2015-07-07" "2015-07-12" "2015-07-12"
## [316] "2015-07-13" "2015-07-13" "2015-07-16" "2015-07-17" "2015-07-17"
## [321] "2015-07-18" "2015-07-18" "2015-07-19" "2015-07-19" "2015-07-20"
## [326] "2015-07-20" "2015-07-22" "2015-07-23" "2015-07-25" "2015-07-31"
## [331] "2015-08-01" "2015-08-01" "2015-08-01" "2015-08-02" "2015-08-03"
## [336] "2014-08-04" "2015-08-08" "2015-08-09" "2015-08-09" "2015-08-10"
## [341] "2015-08-10" "2015-08-11" "2015-08-13" "2015-08-13" "2015-08-14"
## [346] "2015-08-14" "2015-08-16" "2015-08-16" "2015-08-17" "2015-08-24"
## [351] "2015-08-28" "2015-08-28" "2015-08-30" "2015-09-05" "2015-09-06"
## [356] "2015-09-12" "2015-09-13" "2015-09-13" "2015-09-14" "2015-09-24"
## [361] "2015-10-03" NA           NA           NA           NA          
## [366] "2015-11-08" NA           NA           "2016-02-14" "2016-03-13"
## [371] "2016-03-14" "2016-03-25" "2016-03-26" "2016-03-27" "2016-04-03"
## [376] "2016-04-07" "2016-04-23" "2016-04-29" "2016-04-29" "2016-05-06"
## [381] "2016-05-12" "2016-05-15" "2016-05-15" "2016-05-16" "2016-05-22"
## [386] "2016-05-25" "2016-05-25" "2016-05-26" "2016-05-28" "2016-05-28"
## [391] "2016-05-28" "2016-05-28" "2016-06-01" "2016-06-02" "2016-06-03"
## [396] "2016-06-07" "2016-06-10" "2016-06-19" "2016-06-21" "2016-06-22"
## [401] "2016-06-23" "2016-06-24" "2016-06-25" "2016-06-25" "2016-06-25"
## [406] "2016-06-27" "2016-06-28" "2016-06-28" "2016-06-29" "2016-07-02"
## [411] "2016-07-02" "2016-07-03" "2016-07-04" "2016-07-05" "2016-07-06"
## [416] "2016-07-07" "2016-07-08" "2016-07-08" "2016-07-09" "2016-07-09"
## [421] "2016-07-09" "2016-07-09" "2016-07-10" "2016-07-10" "2016-07-12"
## [426] "2016-07-13" "2016-07-15" "2016-07-16" "2016-07-17" "2016-07-17"
## [431] "2016-07-17" "2016-07-19" "2016-07-19" "2016-07-20" "2016-07-20"
## [436] "2016-07-21" "2016-07-21" "2016-07-21" "2016-07-21" "2016-07-21"
## [441] "2016-07-22" "2016-07-23" "2016-07-24" "2016-07-25" "2016-07-30"
## [446] "2016-07-30" "2016-08-04" "2016-08-06" "2016-08-07" "2016-08-07"
## [451] "2016-08-08" "2016-08-08" "2016-08-14" "2016-08-20" "2016-08-21"
## [456] "2016-08-28" "2016-09-03" "2016-09-04" "2016-09-10" "2016-09-15"
## [461] "2016-09-25" "2016-10-01" "2016-10-08" NA           NA          
## [466] NA           NA           NA           "2017-02-19" "2017-03-05"
## [471] "2017-03-12" "2017-03-14" "2017-03-17" "2017-03-25" "2017-04-01"
## [476] "2017-04-01" "2017-04-09" "2017-04-09" "2017-04-15" "2017-04-15"
## [481] "2017-05-06" "2017-05-14" "2017-05-20" "2017-05-23" "2017-05-25"
## [486] "2017-05-31" "2017-06-05" "2017-06-10" "2017-06-11" "2017-06-12"
## [491] "2017-06-13" "2017-06-14" "2017-06-18" "2017-06-19" "2017-06-20"
## [496] "2017-06-22" "2017-06-23" "2017-06-23" "2017-06-26" "2017-06-30"
## [501] "2017-07-01" "2017-07-03" "2017-07-06" "2017-07-09" "2017-07-12"
## [506] "2017-07-15" "2017-07-15" "2017-07-16" "2017-07-20" "2017-07-20"
## [511] "2017-07-22" "2017-07-22" "2017-07-26" "2017-07-26" "2017-07-27"
## [516] "2017-07-28" "2017-07-30" "2017-08-02" "2017-08-02" "2017-08-10"
## [521] "2017-08-11" "2017-08-11" "2017-08-12" "2017-08-12" "2017-08-13"
## [526] "2017-08-13" "2017-08-13" "2017-08-13" "2017-08-14" "2017-08-16"
## [531] "2017-08-18" "2017-09-03" "2017-09-04" "2017-09-27" "2017-09-27"
## [536] "2017-10-01" "2017-10-03" NA           NA           NA          
## [541] "2017-12-04" NA

Графици 101

Ако се присетите на чистењето на податоците, некако излегува дека решивме да работиме со три променливи кои се различни по тип: age, gender, injury_date.

glimpse(tx_injuries)
## Rows: 542
## Columns: 13
## $ injury_report_rec <dbl> 2032, 1897, 837, 99, 55, 780, 253, 253, 55, 55, 2...
## $ name_of_operation <chr> "Skygroup Investments LLC DBA iFly Austin", "Will...
## $ city              <chr> "Austin", "Galveston", "Grapevine", "San Antonio"...
## $ st                <chr> "TX", "TX", "TX", "TX", "AZ", "TX", "TX", "TX", "...
## $ injury_date       <date> 2013-02-12, 2013-03-02, 2013-03-03, 2013-03-03, ...
## $ ride_name         <chr> "I Fly", "Gulf Glider", "Howlin Tornado", "Scooby...
## $ serial_no         <chr> "SV024", "GS-11-10-WG-14", "0643-C1-T1-TN60", "n/...
## $ gender            <fct> F, F, F, F, F, F, F, M, M, F, M, F, NA, M, M, F, ...
## $ age               <dbl> 37, 43, NA, 51, 17, 40, 36, 23, 40, 48, 10, NA, 4...
## $ body_part         <chr> "Mouth", "Knee", "Right Shoulder", "Lower Leg", "...
## $ alleged_injury    <chr> "Student hit mouth on wall", "Alleged arthroscopy...
## $ cause_of_injury   <chr> "Student attempted unfamiliar manuever", "Hit her...
## $ other             <chr> NA, "Prior history of problems with this knee. Fi...

Template за графици

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Извор: https://r4ds.had.co.nz/data-visualisation.html#first-steps

MAPPING: A variable is mapped onto a geom.

Examples: age is mapped to x, gender is mapped to fill

Many geoms. They have a corresponding stat function that transforms the data to make it suitable for the geom (?stat_boxplot)

Histogram

ggplot(data = drop_na(tx_injuries, gender)) + 
  geom_histogram(mapping = aes(x = age, fill = gender), 
                 na.rm = TRUE, color="white") +
  labs(x="Години", y="Број") +
  labs(title="Број на повреди по години по пол", fill = "Пол") +
  theme_minimal()

Scatterplot

ggplot(data = drop_na(tx_injuries, gender)) + 
  geom_point(mapping = aes(x = age, y = injury_date, color=gender)) +
  labs(x="Година", y="Возраст") +
  labs(title="Повреди по години по возраст") +
  #facet_wrap("gender", ncol=2)
  theme_minimal()

Heatmap

tx_injuries %>% select(age, gender, injury_date) %>% 
  mutate(Month=month.name[month(injury_date)]) %>% 
  mutate(Year=factor(year(injury_date))) %>% 
  drop_na(Month, Year, gender) %>% 
  group_by(Year, Month, gender) %>% 
  tally() %>% 
  ggplot(data=.) + 
    geom_tile(aes(y=Month, x=Year, fill=n), color="white") +
    facet_wrap("gender", ncol=2) +
      theme_classic()

Ridges

suppressPackageStartupMessages(library(ggridges))

tx_injuries %>% select(age, gender, injury_date) %>%
  mutate(Month=month.name[month(injury_date)]) %>%
  mutate(Year=factor(year(injury_date))) %>%
  group_by(Year) %>%
  mutate(Day_of_year=injury_date-min(injury_date)) %>%
  drop_na(Year, Day_of_year, gender) %>%
  ggplot(data=.) +
    geom_density_ridges(mapping = aes(
      x=Day_of_year, y=Year, fill=gender), alpha=.5) +
    facet_wrap("gender", ncol=2)

Homework: What was potentially wrong with the assumptions for the previous graph?

R референци