Spark's DataFrame count() function taking very long
In my code, I have a sequence of DataFrames and I want to filter out the ones that are empty. I'm doing it like this:

Seq(df1, df2).map(df => df.count() > 0)
However, this is taking extremely long: around 7 minutes for 2 DataFrames of approximately 100k rows each.
My question: why is Spark's implementation of count() so slow, and is there a work-around?
count() itself is not the problem, and it does not matter how big the DataFrame is. DataFrame transformations in Spark are lazy: if you have many costly operations defined on a DataFrame, Spark only actually executes them once an action such as count() is called.
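To illustrate the laziness, here is a minimal sketch (runnable in spark-shell, where a SparkSession already exists; the data and column names are made up for the example):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // These lines return immediately: transformations are lazy and
    // Spark only records the execution plan, it computes nothing yet.
    val df = (1 to 1000000).toDF("n")
    val transformed = df
      .filter($"n" % 2 === 0)
      .groupBy(($"n" % 10).as("bucket"))
      .count()

    // Only this action forces the whole plan above to run,
    // including the shuffle introduced by groupBy.
    println(transformed.count())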
Some of those costly operations may be ones that require shuffling data across the cluster: groupBy, reduce, etc.
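You can check whether your own pipeline contains shuffles by printing the physical plan (using df1 from the question as an example):

    // Exchange nodes in the printed plan mark shuffle boundaries,
    // typically introduced by groupBy, join, distinct or repartition.
    df1.explain()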
So my guess is that you either have some complex processing on these DataFrames, or the initial data the DataFrames are built from is huge.
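As a work-around for the emptiness check itself: if you only need to know whether each DataFrame has at least one row, you don't need a full count. A sketch, mirroring the Seq(df1, df2) from the question:

    // head(1) (or take(1)) can stop as soon as a single row is produced,
    // instead of materializing and counting every row.
    Seq(df1, df2).map(df => df.head(1).nonEmpty)

    // On Spark 2.4+, Dataset.isEmpty does the same more directly:
    // Seq(df1, df2).map(df => !df.isEmpty)

Note that if there are shuffles upstream, Spark still has to compute them even for head(1); if you reuse the same DataFrames afterwards, calling df.cache() before the first action avoids re-running the expensive lineage on every subsequent action.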