Во овој документ започнавме да разгледуваме процедури за линеарна регресија на групирани податоци. Со пакетите од tidyverse
и broom
можеме многу лесно да направиме регресија за секоја категорија (пол, тежина, натпреварување). Графирање на ваквите резултати е исто така лесно со овие платформи. Клучната техника тука е вгнездување (nest()
) на подгрупите, вршење операции врз ваквите вгнездени податоци, и потоа ‘одгнездување’ на посакуваните елементи. Оваа е моќна техника, но не е наједноставна за совладување, бидејќи бара подетално познавање на структурата на tibble
, data.frame
и list
. Овој документ може да служи како минимален вовед за овие процедури и објекти.
library(tidyverse)
library(broom)
lifting <-
ipf_lifts <-
readr::read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv"
)
glimpse(lifting)
## Rows: 41,152
## Columns: 16
## $ name <chr> "Hiroyuki Isagawa", "David Mannering", "Eddy Penge...
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", ...
## $ event <chr> "SBD", "SBD", "SBD", "SBD", "SBD", "SBD", "SBD", "...
## $ equipment <chr> "Single-ply", "Single-ply", "Single-ply", "Single-...
## $ age <dbl> NA, 24.0, 35.5, 19.5, NA, NA, 32.5, 31.5, NA, NA, ...
## $ age_class <chr> NA, "24-34", "35-39", "20-23", NA, NA, "24-34", "2...
## $ division <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ bodyweight_kg <dbl> 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 90.0, 90...
## $ weight_class_kg <chr> "67.5", "67.5", "67.5", "67.5", "67.5", "67.5", "6...
## $ best3squat_kg <dbl> 205.0, 225.0, 245.0, 195.0, 240.0, 200.0, 220.0, 2...
## $ best3bench_kg <dbl> 140.0, 132.5, 157.5, 110.0, 140.0, 100.0, 140.0, 2...
## $ best3deadlift_kg <dbl> 225.0, 235.0, 270.0, 240.0, 215.0, 230.0, 235.0, 3...
## $ place <chr> "1", "2", "3", "4", "5", "6", "7", "1", "2", "2", ...
## $ date <date> 1985-08-03, 1985-08-03, 1985-08-03, 1985-08-03, 1...
## $ federation <chr> "IPF", "IPF", "IPF", "IPF", "IPF", "IPF", "IPF", "...
## $ meet_name <chr> "World Games", "World Games", "World Games", "Worl...
Convert to long format such that we have one column and weight lifted.
ll <- lifting_long <-
lifting %>% gather(type, weight, starts_with("best"))
ll <- group_by(ll, sex, type)
Now we can fit a linear model of weight ~ boyweight
while keeping track of relevant groupings.
Check these step by step to see what nest does what the map within mutate does and unnest.
LMs_by_sex_and_type <-
ll %>%
tidyr::nest() %>%
mutate(fit = purrr::map(data, ~ lm(weight ~ bodyweight_kg, data = .))) %>%
mutate(results = purrr::map(fit, broom::tidy)) %>%
tidyr::unnest(results)
# if you get this error:
# Error: 'vec_dim' is not an exported object from 'namespace:vctrs'
# reinstall pillar with install.packages('pillar')
# https://github.com/r-lib/vctrs/issues/487
Now we can ask what are the coefficients of the linear regressions for each group
LMs_by_sex_and_type %>%
select(-data, -fit) %>%
filter(term == "(Intercept)") %>%
knitr::kable(caption = "Inercepts for lifted weight by bodyweight by sex and type of lifting")
sex | type | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
M | best3squat_kg | (Intercept) | 98.74112 | 1.3982126 | 70.61954 | 0 |
F | best3squat_kg | (Intercept) | 63.31427 | 1.4309542 | 44.24619 | 0 |
M | best3bench_kg | (Intercept) | 50.68200 | 0.9550533 | 53.06719 | 0 |
F | best3bench_kg | (Intercept) | 31.15649 | 0.8731337 | 35.68352 | 0 |
M | best3deadlift_kg | (Intercept) | 144.95591 | 1.1529829 | 125.72252 | 0 |
F | best3deadlift_kg | (Intercept) | 96.62448 | 1.1283408 | 85.63413 | 0 |
# caption will render in Rmd
LMs_by_sex_and_type %>%
select(-data, -fit) %>%
filter(term == "bodyweight_kg") %>%
knitr::kable(caption = "Inercepts for lifted weight by bodyweight by sex and type of lifting")
sex | type | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|---|
M | best3squat_kg | bodyweight_kg | 1.7301136 | 0.0154233 | 112.17530 | 0 |
F | best3squat_kg | bodyweight_kg | 1.3466930 | 0.0213294 | 63.13781 | 0 |
M | best3bench_kg | bodyweight_kg | 1.3672329 | 0.0104211 | 131.19822 | 0 |
F | best3bench_kg | bodyweight_kg | 0.9039786 | 0.0129411 | 69.85324 | 0 |
M | best3deadlift_kg | bodyweight_kg | 1.2396643 | 0.0127121 | 97.51838 | 0 |
F | best3deadlift_kg | bodyweight_kg | 0.9465361 | 0.0168002 | 56.34075 | 0 |
Neat, everything is significant.
To visualize, we can use this table, or plot the raw data and use geom_smooth while faceting by sex and type.
LMs_by_sex_and_type_wide <-
LMs_by_sex_and_type %>%
select(sex, type, term, estimate) %>%
spread(term, estimate) %>%
rename("Intercept" = `(Intercept)`) %>%
rename("Slope" = bodyweight_kg)
ggplot(data = LMs_by_sex_and_type_wide) +
# aes(group=interaction(sex, type)) +
geom_abline(size = 1,
aes(
slope = Slope,
intercept = Intercept,
color = type,
linetype = sex
)) +
# since we don't have mapping for x and y
# the slope lines are not visible
# add manual axis limits to see them
scale_x_continuous(limits = c(0, 100)) +
scale_y_continuous(limits = c(0, 200)) +
labs(x="bodyweight (kg)") +
labs(y="weight lifter (kg)") +
theme_bw() +
theme(legend.position="top")
Then you can ask are slopes for women higher than men? Are slopes significantly different from each other (maybe use likelihood ratio test or anova to compare models).