Spark's DataFrame count() function taking very long




In my code, I have a sequence of DataFrames and I want to filter out the ones that are empty. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long: around 7 minutes for 2 DataFrames of approximately 100k rows each.

My question: why is Spark's implementation of count() so slow? Is there a workaround?

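count() is an action, and it is the transformations before it that are lazy. So the size of the resulting DataFrame is not what matters. If many costly operations were needed to produce the DataFrame, then once count() is called Spark actually runs all of those operations to get it.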

Some of the costly operations may be ones that need to shuffle data, like groupBy, reduce, etc.
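To make this concrete, here is a minimal sketch, assuming a local SparkSession and a made-up dataset (the names `LazyCountDemo`, `groupedBuckets` and the `bucket` column are illustrative, not from the question). Building the grouped DataFrame returns right away because nothing is computed yet; the work, including the shuffle for groupBy, only happens when count() is called.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LazyCountDemo {
  // Hypothetical pipeline: only transformations, so nothing runs here.
  def groupedBuckets(spark: SparkSession): DataFrame =
    spark.range(0, 100000)
      .withColumn("bucket", col("id") % 10)
      .groupBy("bucket") // needs a shuffle once it is executed
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-count-demo")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    val grouped = groupedBuckets(spark)

    // Prints the physical plan without running it; an Exchange node
    // in the output marks the shuffle.
    grouped.explain()

    // Action: the whole plan above is executed now.
    println(grouped.count())

    spark.stop()
  }
}
```

Calling explain() on your own DataFrames should show how much work is sitting behind the count().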

So my guess is that you have some complex processing to get these DataFrames, or the initial data you used to build them is huge.
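As for a workaround, one option is to avoid counting every row when all you need to know is whether there is at least one. The sketch below assumes that is the only goal; the `nonEmpty` helper and the sample DataFrames are made up for illustration. It uses `head(1)`, which asks Spark for at most one row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NonEmptyFrames {
  // Hypothetical helper: true if the DataFrame has at least one row.
  // head(1) fetches at most one row instead of counting all of them.
  def nonEmpty(df: DataFrame): Boolean = df.head(1).nonEmpty

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("non-empty-frames")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for df1 and df2 from the question.
    val df1 = Seq(1, 2, 3).toDF("value")
    val df2 = spark.emptyDataset[Int].toDF("value")

    // Keep only the DataFrames that contain data.
    val kept = Seq(df1, df2).filter(nonEmpty)
    println(kept.size)

    spark.stop()
  }
}
```

This is not guaranteed to be fast: if the DataFrame sits behind a shuffle (a groupBy, for example), Spark still has to run most of that work before it can return the first row. If the same DataFrames are used again afterwards, calling cache() on them before the check may avoid recomputing the pipeline a second time.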




