R - How to efficiently implement dplyr do call for lmer function?
I have a dataset of ~400,000 rows and am trying to extract lme4 mixed-model variance components using a dplyr do() call in R. The function is:
library(lme4)

myfunc <- function(dat) {
  if (sum(!is.na(dat$value)) > 840) {  # >70% data present
    # variance components as a data frame; column 4 holds the variances
    v <- data.frame(VarCorr(lmer(value ~ 0 + (1|gid) + (1|trial:rep) + (1|trial:rep:block), data = dat)))
    data.frame(a = round(v[1,4] / (v[1,4] + (v[4,4] / 2)), 2),
               b = round(v[1,4], 2),
               c = round(v[4,4], 2),
               n_obs = nrow(dat),
               na_obs = sum(is.na(dat$value)))
  } else {
    # too little data: return NA placeholders with the same columns
    data.frame(a = NA, b = NA, c = NA,
               n_obs = nrow(dat),
               na_obs = sum(is.na(dat$value)))
  }
}
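The positional indices v[1,4] and v[4,4] depend on the row order of the variance-component table. A minimal sketch of looking the values up by group name instead, assuming the table has the columns that as.data.frame() returns for a VarCorr object (grp, var1, var2, vcov, sdcor). The data frame below is a hand-made stand-in with invented numbers, not output from the model above, and the helper name get_vcov is hypothetical:

```r
# Hand-made stand-in for data.frame(VarCorr(fit)); the numbers are invented
# purely for illustration and do not come from any real model fit.
v <- data.frame(
  grp   = c("gid", "trial:rep:block", "trial:rep", "Residual"),
  var1  = c("(Intercept)", "(Intercept)", "(Intercept)", NA),
  var2  = NA_character_,
  vcov  = c(4.0, 1.0, 0.5, 2.0),
  sdcor = sqrt(c(4.0, 1.0, 0.5, 2.0)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fetch one variance component by its group label.
get_vcov <- function(v, group) {
  out <- v$vcov[v$grp == group]
  if (length(out) != 1) stop("expected exactly one row for group: ", group)
  out
}

v_gid <- get_vcov(v, "gid")
v_res <- get_vcov(v, "Residual")

# Same ratio as `a` in myfunc, but independent of row order.
a <- round(v_gid / (v_gid + v_res / 2), 2)
```

With a name-based lookup, the ratio stays the same even if the rows of the table come back in a different order.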
This function is called from a dplyr do() call after grouping the data by 4 grouping variables. The final dplyr call is:
system.time(out <- tst %>% group_by(iyear,ilocation,trait_id,date) %>% do(myfunc(.)))
Now, when the code is run on a smaller test data frame of 11,000 rows, it takes 25 seconds. Running it on the full set of 443k rows takes 8-9 hours to finish, which is awfully slow. It seems obvious that some part of the code is pulling down performance, but I can't figure out whether the lmer part or dplyr is causing the slowdown. I have a feeling there is something wrong with the way the function handles the vectorized operation, but I'm not sure. I tried initializing the 'out' matrix outside the function call, but that didn't improve performance.
Unfortunately, I don't have a smaller reproducible dataset to share. I would like to hear thoughts on how to make this code more efficient.
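One way to tell whether the model fit or the grouping machinery dominates is to time a single group's fit and scale that by the number of groups. A rough sketch on synthetic data, using lm() as a stand-in for the lmer() call so that it runs with base R only; toy, fit_one and the column names are made up for illustration:

```r
set.seed(1)

# Synthetic stand-in for tst: 50 groups of 200 rows each.
toy <- data.frame(
  comb  = rep(sprintf("g%02d", 1:50), each = 200),
  x     = rnorm(50 * 200),
  value = rnorm(50 * 200)
)

# Stand-in for the per-group model fit (lm instead of lmer).
fit_one <- function(dat) {
  fit <- lm(value ~ x, data = dat)
  data.frame(slope = unname(coef(fit)[2]), n_obs = nrow(dat))
}

# Split once so the grouping cost is paid up front.
groups <- split(toy, toy$comb)

# Cost of one fit, extrapolated to all groups.
t_one <- system.time(fit_one(groups[[1]]))["elapsed"]
est_total <- unname(t_one) * length(groups)

# Cost of the full loop, for comparison with the estimate.
t_all <- system.time(res <- do.call(rbind, lapply(groups, fit_one)))["elapsed"]

cat("estimated from one fit:", est_total, "s; measured:", unname(t_all), "s\n")
```

If the extrapolated time is close to the measured total, the per-group fit is the likely bottleneck rather than the grouping step.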
Solution: the mclapply function from the parallel package came to the rescue. As @gregor rightly pointed out, the lmer part was slowing things down. I ended up parallelizing the function call:
library(lme4)
library(parallel)

myfunc <- function(i) {
  dat <- tst[tst$comb == unique(tst$comb)[i], ]  # comb = concatenated iyear,ilocation....columns
  if (sum(!is.na(dat$value)) > 840) {  # >70% data present per column
    # rand_factor and nested_random_factor stand in for the random-effect terms
    v <- data.frame(VarCorr(lmer(value ~ 0 + rand_factor + nested_random_factor, data = dat)))
    data.frame(trait = unique(tst$comb)[i],
               a = round(v[1,4])/5,
               b = round(v[1,4], 2),
               c = round(v[4,4], 2),
               n_obs = nrow(dat),
               na_obs = sum(is.na(dat$value)))
  } else {
    data.frame(trait = unique(tst$comb)[i],
               a = NA, b = NA, c = NA,
               n_obs = nrow(dat),
               na_obs = sum(is.na(dat$value)))
  }
}

# initialize empty matrix
out <- matrix(NA, length(unique(tst$comb)), 6)

## apply function in parallel. output is a list
n_cores <- detectCores() - 2
system.time(my.h2 <- mclapply(1:length(unique(tst$comb)), FUN = myfunc, mc.cores = n_cores))
On a twelve-core Unix machine, this took ~2 minutes to complete.
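mclapply returns a list, and the version above filters tst once per group. A possible variation, sketched on synthetic data with a stand-in for the model fit, splits the data once and binds the per-group results into a single data frame. Here toy, summarise_group and the threshold of 5 are illustrative assumptions, not part of the original code:

```r
library(parallel)

set.seed(1)

# Synthetic stand-in for tst: three groups, the last one deliberately small.
toy <- data.frame(
  comb  = rep(c("a", "b", "c"), times = c(10, 10, 4)),
  value = rnorm(24),
  stringsAsFactors = FALSE
)

# Stand-in for myfunc: returns one row per group, NA when data are too sparse.
summarise_group <- function(dat) {
  enough <- sum(!is.na(dat$value)) > 5
  data.frame(
    trait  = dat$comb[1],
    b      = if (enough) round(var(dat$value, na.rm = TRUE), 2) else NA,
    n_obs  = nrow(dat),
    na_obs = sum(is.na(dat$value))
  )
}

# Split once up front instead of filtering the full table inside every call.
groups <- split(toy, toy$comb)

# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else
  max(1L, detectCores() - 2L, na.rm = TRUE)

res_list <- mclapply(groups, summarise_group, mc.cores = n_cores)

# Combine the list of one-row data frames into one result table.
out <- do.call(rbind, res_list)
```

Splitting first means each worker receives only its own rows, and do.call(rbind, ...) turns the list output into a table with one row per group.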