pandas - Python pivot table for large file in chunks: memory error
I'm working with a bunch of dataframe chunks like this, where version and assay form a unique identifier:
  version assay  resp_rob_sigmas
0    a123     f             0.56
1    b234     g             0.78
2    c345     r             0.9
3    d456     f             1.0
4    d456     g             0.3
I'm creating a pivot table that needs to look like this:
         f     g    r
a123  0.56   NaN  NaN
b234   NaN  0.78  NaN
c345   NaN   NaN  0.9
d456   1.0   0.3  NaN
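On the small sample this is easy to produce in one go; a minimal self-contained sketch of the transform I'm after, using the column names from the data above (the real data comes from the zipped CSV shown below):

import pandas as pd

sample = pd.DataFrame({
    "version": ["a123", "b234", "c345", "d456", "d456"],
    "assay": ["f", "g", "r", "f", "g"],
    "resp_rob_sigmas": [0.56, 0.78, 0.9, 1.0, 0.3],
})

# one row per version, one column per assay, NaN where a combination is missing
wide = pd.pivot_table(sample, index="version", columns="assay",
                      values="resp_rob_sigmas")
print(wide)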
Pre-chunking and pre-unzipping, the data frame is 13 GB, and the pivot table explodes in size during creation, causing a memory error. My current code looks like this:
import pandas as pd
import zipfile

# number of lines to read at a time from the CSV
chunk_size = 10 ** 5
merged_df = pd.DataFrame([])
folder = zipfile.ZipFile(op_directory + "/file.zip")

# reading the CSV in chunks, dropping columns, dropping rows with null responses
for chunk in pd.read_csv(folder.open("file.csv"), chunksize=chunk_size):
    df = pd.DataFrame(chunk)
    # operations on df ...
    ...
    merged_df = merged_df.append(df)

# pivoting the data to create the matrix
df = pd.pivot_table(merged_df, index=['version'], values=['resp_rob_sigmas'], columns=['assay'])
df.to_csv("output.csv")
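One direction I'm wondering about (not sure it's sound) is pivoting each chunk on its own and merging the partial wide tables, so the full long-format merged_df never has to exist in memory at once. A rough sketch, assuming a given version/assay pair never repeats across chunks (file and column names as above, per-chunk cleaning elided):

import pandas as pd
import zipfile

chunk_size = 10 ** 5
folder = zipfile.ZipFile(op_directory + "/file.zip")

partial = None
for chunk in pd.read_csv(folder.open("file.csv"), chunksize=chunk_size):
    # pivot just this chunk into the wide shape
    piece = pd.pivot_table(chunk, index="version", columns="assay",
                           values="resp_rob_sigmas")
    # combine_first keeps values already seen and fills the gaps from the new piece
    partial = piece if partial is None else partial.combine_first(piece)

partial.to_csv("output.csv")

But I'm not sure this is actually any lighter on memory once the partial wide table grows, hence the question.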
How can I prevent the memory error and optimize this?