Storing dask collection to files/CSV asynchronously




I'm implementing various kinds of data processing pipelines using dask.distributed. The original data is read from S3 and, in the end, the processed (large) collection is written to CSV on S3 as well.

I can run the processing asynchronously and monitor its progress, but I've noticed that all the to_xxx() methods that store collections to file(s) appear to be synchronous calls. One downside is that the call blocks, potentially for a long time. Second, I cannot construct a complete graph to be executed later.

Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?

PS: I'm pretty sure I can implement async storage myself, e.g. by converting the collection to delayed() and storing each partition. But this seems like a common case - unless I missed an existing feature, it would be nice to have it included in the framework.
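For reference, a minimal sketch of that manual approach (assuming a dask.distributed Client and a dask DataFrame df; the write_partition helper and the bucket path are illustrative only, and writing to s3:// paths from pandas requires s3fs):

    import dask
    from dask.distributed import Client

    client = Client()  # connect to the distributed scheduler

    def write_partition(pdf, path):
        # pdf is a plain pandas DataFrame holding one partition
        pdf.to_csv(path, index=False)

    # Break the collection into a list of delayed partitions
    parts = df.to_delayed()
    tasks = [
        dask.delayed(write_partition)(part, 's3://bucket/part-%d.csv' % i)
        for i, part in enumerate(parts)
    ]

    # Submit the whole graph; returns futures immediately instead of blocking
    futures = client.compute(tasks)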

Most to_* functions have a compute=True keyword argument that can be replaced with compute=False. In these cases they will return a sequence of delayed values that you can then compute asynchronously:

    values = df.to_csv('s3://...', compute=False)
    futures = client.compute(values)
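Because client.compute() returns futures immediately, the call no longer blocks, and you can monitor or wait on the writes like any other asynchronous work (a sketch, assuming the futures from above):

    from dask.distributed import progress, wait

    progress(futures)  # display a progress bar for the pending writes
    wait(futures)      # block only at the point where the writes must be finished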



