Storing a dask collection to files/CSV asynchronously
I'm implementing various kinds of data-processing pipelines using dask.distributed. The original data is read from S3, and in the end the processed (large) collection is written to CSV on S3 as well.
I can run the processing asynchronously and monitor progress, but I've noticed that all the to_xxx() methods that store collections to file(s) appear to be synchronous calls. One downside is that the call blocks, potentially for a long time. Second, I cannot construct a complete graph to be executed later.
Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?
PS: I'm pretty sure I can implement asynchronous storage myself, e.g. by converting the collection to delayed() and storing each partition. But this seems like a common case, so unless I've missed an existing feature it would be nice to have it included in the framework.
Most to_* functions have a compute=True keyword argument that can be replaced with compute=False. In these cases they return a sequence of delayed values that you can then compute asynchronously:

values = df.to_csv('s3://...', compute=False)
futures = client.compute(values)