r - Split comma-separated column into separate rows -
i have data frame, so:
data.frame(director = c("aaron blaise,bob walker", "akira kurosawa", "alan j. pakula", "alan parker", "alejandro amenabar", "alejandro gonzalez inarritu", "alejandro gonzalez inarritu,benicio del toro", "alejandro gonzález iñárritu", "alex proyas", "alexander hall", "alfonso cuaron", "alfred hitchcock", "anatole litvak", "andrew adamson,marilyn fox", "andrew dominik", "andrew stanton", "andrew stanton,lee unkrich", "angelina jolie,john stevenson", "anne fontaine", "anthony harvey"), ab = c('a', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'a'))
as can see, entries in director
column multiple names separated commas. split these entries separate rows while maintaining values of other column. example, first row in data frame above should split 2 rows, single name each in director
column , 'a' in ab
column.
this old question being used dupe target (tagged r-faq
). of today, has been answered 3 times offering 6 different approaches is lacking benchmark guidance of approaches fastest1.
the benchmarked solutions include
- matthew lundberg's base r approach modified according rich scriven's comment,
- jaap's 2
data.table
methods , 2dplyr
/tidyr
approaches, - ananda's
splitstackshape
solution, - and 2 additional variants of jaap's
data.table
methods.
overall 8 different methods benchmarked on 6 different sizes of data frames using microbenchmark
package (see code below).
the sample data given op consists of 20 rows. create larger data frames, these 20 rows repeated 1, 10, 100, 1000, 10000, , 100000 times give problem sizes of 2 million rows.
benchmark results
the benchmark results show sufficiently large data frames data.table
methods faster other method. data frames more 5000 rows, jaap's data.table
method 2 , variant dt3
fastest, magnitudes faster slowest methods.
remarkably, timings of 2 tidyverse
methods , splistackshape
solution similar it's difficult distiguish curves in chart. slowest of benchmarked methods across data frame sizes.
for smaller data frames, matt's base r solution , data.table
method 4 seem have less overhead other methods.
code
director <- c("aaron blaise,bob walker", "akira kurosawa", "alan j. pakula", "alan parker", "alejandro amenabar", "alejandro gonzalez inarritu", "alejandro gonzalez inarritu,benicio del toro", "alejandro gonzález iñárritu", "alex proyas", "alexander hall", "alfonso cuaron", "alfred hitchcock", "anatole litvak", "andrew adamson,marilyn fox", "andrew dominik", "andrew stanton", "andrew stanton,lee unkrich", "angelina jolie,john stevenson", "anne fontaine", "anthony harvey") ab <- c("a", "b", "a", "a", "b", "b", "b", "a", "b", "a", "b", "a", "a", "b", "b", "b", "b", "b", "b", "a") library(data.table) library(magrittr)
define function benchmark runs of problem size n
run_mb <- function(n) { # compute number of benchmark runs depending on problem size `n` mb_times <- scales::squish(10000l / n , c(3l, 100l)) cat(n, " ", mb_times, "\n") # create data df <- data.frame(director = rep(director, n), ab = rep(ab, n)) dt <- as.data.table(df) # start benchmarks microbenchmark::microbenchmark( matt_mod = { s <- strsplit(as.character(df$director), ',') data.frame(director=unlist(s), ab=rep(df$ab, lengths(s)))}, jaap_dt1 = { dt[, lapply(.sd, function(x) unlist(tstrsplit(x, ",", fixed=true))), = ab ][!is.na(director)]}, jaap_dt2 = { dt[, strsplit(as.character(director), ",", fixed=true), = .(ab, director)][,.(director = v1, ab)]}, jaap_dplyr = { df %>% dplyr::mutate(director = strsplit(as.character(director), ",")) %>% tidyr::unnest(director)}, jaap_tidyr = { tidyr::separate_rows(df, director, sep = ",")}, csplit = { splitstackshape::csplit(df, "director", ",", direction = "long")}, dt3 = { dt[, strsplit(as.character(director), ",", fixed=true), = .(ab, director)][, director := null][ , setnames(.sd, "v1", "director")]}, dt4 = { dt[, .(director = unlist(strsplit(as.character(director), ",", fixed = true))), = .(ab)]}, times = mb_times ) }
run benchmark different problem sizes
# define vector of problem sizes n_rep <- 10l^(0:5) # run benchmark different problem sizes mb <- lapply(n_rep, run_mb)
prepare data plotting
mbl <- rbindlist(mb, idcol = "n") mbl[, n_row := nrow(director) * n_rep[n]] mba <- mbl[, .(median_time = median(time), n = .n), = .(n_row, expr)] mba[, expr := forcats::fct_reorder(expr, -median_time)]
create chart
library(ggplot2) ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + geom_point() + geom_smooth(se = false) + scale_x_log10(breaks = nrow(director) * n_rep) + scale_y_log10() + xlab("number of rows") + ylab("median of execution time [ms]") + ggtitle("microbenchmark results") + theme_bw()
session info & package versions (excerpt)
devtools::session_info() #session info # version r version 3.3.2 (2016-10-31) # system x86_64, mingw32 #packages # data.table * 1.10.4 2017-02-01 cran (r 3.3.2) # dplyr 0.5.0 2016-06-24 cran (r 3.3.1) # forcats 0.2.0 2017-01-23 cran (r 3.3.2) # ggplot2 * 2.2.1 2016-12-30 cran (r 3.3.2) # magrittr * 1.5 2014-11-22 cran (r 3.3.0) # microbenchmark 1.4-2.1 2015-11-25 cran (r 3.3.3) # scales 0.4.1 2016-11-09 cran (r 3.3.2) # splitstackshape 1.4.2 2014-10-23 cran (r 3.3.3) # tidyr 0.6.1 2017-01-10 cran (r 3.3.2)
1my curiosity piqued this exuberant comment brilliant! orders of magnitude faster! tidyverse
answer of a question closed duplicate of question.
wiki
Comments
Post a Comment