r - Split comma-separated column into separate rows -

i have data frame, so:

data.frame(director = c("aaron blaise,bob walker", "akira kurosawa",                          "alan j. pakula", "alan parker", "alejandro amenabar", "alejandro gonzalez inarritu",                          "alejandro gonzalez inarritu,benicio del toro", "alejandro gonzález iñárritu",                          "alex proyas", "alexander hall", "alfonso cuaron", "alfred hitchcock",                          "anatole litvak", "andrew adamson,marilyn fox", "andrew dominik",                          "andrew stanton", "andrew stanton,lee unkrich", "angelina jolie,john stevenson",                          "anne fontaine", "anthony harvey"), ab = c('a', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'a'))

as can see, entries in director column multiple names separated commas. split these entries separate rows while maintaining values of other column. example, first row in data frame above should split 2 rows, single name each in director column , 'a' in ab column.

this old question being used dupe target (tagged r-faq). of today, has been answered 3 times offering 6 different approaches is lacking benchmark guidance of approaches fastest¹.

the benchmarked solutions include

matthew lundberg's base r approach modified according rich scriven's comment,
jaap's 2 data.table methods , 2 dplyr / tidyr approaches,
ananda's splitstackshapesolution,
and 2 additional variants of jaap's data.table methods.

overall 8 different methods benchmarked on 6 different sizes of data frames using microbenchmark package (see code below).

the sample data given op consists of 20 rows. create larger data frames, these 20 rows repeated 1, 10, 100, 1000, 10000, , 100000 times give problem sizes of 2 million rows.

benchmark results

the benchmark results show sufficiently large data frames data.table methods faster other method. data frames more 5000 rows, jaap's data.table method 2 , variant dt3 fastest, magnitudes faster slowest methods.

remarkably, timings of 2 tidyverse methods , splistackshape solution similar it's difficult distiguish curves in chart. slowest of benchmarked methods across data frame sizes.

for smaller data frames, matt's base r solution , data.table method 4 seem have less overhead other methods.

code

director <-    c("aaron blaise,bob walker", "akira kurosawa", "alan j. pakula",      "alan parker", "alejandro amenabar", "alejandro gonzalez inarritu",      "alejandro gonzalez inarritu,benicio del toro", "alejandro gonzález iñárritu",      "alex proyas", "alexander hall", "alfonso cuaron", "alfred hitchcock",      "anatole litvak", "andrew adamson,marilyn fox", "andrew dominik",      "andrew stanton", "andrew stanton,lee unkrich", "angelina jolie,john stevenson",      "anne fontaine", "anthony harvey") ab <- c("a", "b", "a", "a", "b", "b", "b", "a", "b", "a", "b", "a",          "a", "b", "b", "b", "b", "b", "b", "a")  library(data.table) library(magrittr)

define function benchmark runs of problem size `n`

run_mb <- function(n) {   # compute number of benchmark runs depending on problem size `n`   mb_times <- scales::squish(10000l / n , c(3l, 100l))    cat(n, " ", mb_times, "\n")   # create data   df <- data.frame(director = rep(director, n), ab = rep(ab, n))   dt <- as.data.table(df)   # start benchmarks   microbenchmark::microbenchmark(     matt_mod = {       s <- strsplit(as.character(df$director), ',')       data.frame(director=unlist(s), ab=rep(df$ab, lengths(s)))},     jaap_dt1 = {       dt[, lapply(.sd, function(x) unlist(tstrsplit(x, ",", fixed=true))), = ab          ][!is.na(director)]},     jaap_dt2 = {       dt[, strsplit(as.character(director), ",", fixed=true),           = .(ab, director)][,.(director = v1, ab)]},     jaap_dplyr = {       df %>%          dplyr::mutate(director = strsplit(as.character(director), ",")) %>%         tidyr::unnest(director)},     jaap_tidyr = {       tidyr::separate_rows(df, director, sep = ",")},     csplit = {       splitstackshape::csplit(df, "director", ",", direction = "long")},     dt3 = {       dt[, strsplit(as.character(director), ",", fixed=true),          = .(ab, director)][, director := null][            , setnames(.sd, "v1", "director")]},     dt4 = {       dt[, .(director = unlist(strsplit(as.character(director), ",", fixed = true))),           = .(ab)]},     times = mb_times   ) }

run benchmark different problem sizes

# define vector of problem sizes n_rep <- 10l^(0:5) # run benchmark different problem sizes mb <- lapply(n_rep, run_mb)

prepare data plotting

mbl <- rbindlist(mb, idcol = "n") mbl[, n_row := nrow(director) * n_rep[n]] mba <- mbl[, .(median_time = median(time), n = .n), = .(n_row, expr)] mba[, expr := forcats::fct_reorder(expr, -median_time)]

create chart

library(ggplot2) ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +    geom_point() + geom_smooth(se = false) +    scale_x_log10(breaks = nrow(director) * n_rep) + scale_y_log10() +    xlab("number of rows") + ylab("median of execution time [ms]") +   ggtitle("microbenchmark results") + theme_bw()

session info & package versions (excerpt)

devtools::session_info() #session info # version  r version 3.3.2 (2016-10-31) # system   x86_64, mingw32 #packages # data.table      * 1.10.4  2017-02-01 cran (r 3.3.2) # dplyr             0.5.0   2016-06-24 cran (r 3.3.1) # forcats           0.2.0   2017-01-23 cran (r 3.3.2) # ggplot2         * 2.2.1   2016-12-30 cran (r 3.3.2) # magrittr        * 1.5     2014-11-22 cran (r 3.3.0) # microbenchmark    1.4-2.1 2015-11-25 cran (r 3.3.3) # scales            0.4.1   2016-11-09 cran (r 3.3.2) # splitstackshape   1.4.2   2014-10-23 cran (r 3.3.3) # tidyr             0.6.1   2017-01-10 cran (r 3.3.2)

¹_{my curiosity piqued this exuberant comment brilliant! orders of magnitude faster! tidyverse answer of a question closed duplicate of question.}

wiki

Search This Blog

tL