r - How to efficiently implement dplyr do call for lmer function? -
i have dataset ~400000 rows trying extract lme4 mixed model variance components using dplyr do call in r.  function is:
myfunc <- function(dat) {     if (sum(!is.na(dat$value)) > 840) {  # >70% data present             v = data.frame(varcorr(lmer(value ~ 0 + (1|gid) + (1|trial:rep) + (1|trial:rep:block), data=dat)))            data.frame(a=round(v[1,4]/(v[1,4]+(v[4,4]/2)),2), b=round(v[1,4],2), c=round(v[4,4],2), n_obs=nrow(dat), na_obs=sum(is.na(dat$value)))      } else {          data.frame(a=na, b=na, c=na, n_obs= nrow(dat), na_obs=sum(is.na(dat$value)))     } } this function called dplyr do call after grouping data 4 grouping variables. final dplyr call is:
system.time(out <- tst %>% group_by(iyear,ilocation,trait_id,date) %>%            do(myfunc(.))) now, when code run on smaller test dataframe of 11000 rows, takes 25 seconds. running on full set of 443k rows takes 8-9 hours finish, awefully slow. seems obvious there part of code pulling down performance can't seem figure out whether lmer part or dplyr causing slow down. have feeling there wrong way function handling vectorization operation not sure. tried initializing 'out' matrix outside function call, didn't improve performance.
 unfortunately, don't have smaller reproducible dataset share. hear thoughts on how make code more efficient.
solution:  mclapply function parallel package came rescue. @gregor rightly pointed, lmer part slowing things down. ended parallelizing function call:
myfunc <- function(i) {      dat = tst[tst$comb==unique(tst$comb)[i],]  #comb concatenated iyear,ilocation....columns      if (sum(!is.na(dat$value)) > 840) {  # >70% data present per column          v = data.frame(varcorr(lmer(value ~ 0 + rand_factor + nested_random_factor), data=dat)))          data.frame(trait=unique(tst$comb)[i], a=round(v[1,4])/5, b=round(v[1,4],2), c=round(v[4,4],2), n_obs=nrow(dat), na_obs=sum(is.na(dat$value)))       } else {           data.frame(trait=unique(tst$comb)[i], a=na, b=na, c=na, n_obs= nrow(dat), na_obs=sum(is.na(dat$value)))       } }  #initialize empty matrix out <- matrix(na,length(unique(tst$comb)),6)  ## apply function in parallel. output list n_cores = detectcores() - 2 system.time(my.h2 <- mclapply(1:length(unique(tst$comb)),fun = myfunc, mc.cores = n_cores)) a twelve core unix machine took ~2 minutes complete.
wiki
Comments
Post a Comment