pandas - Python pivot table for large file in chunks: memory error




I'm working with a bunch of dataframe chunks that look like this, where version and assay together form a unique identifier:

   version  assay  resp_rob_sigmas
   0  a123    f    0.56
   1  b234    g    0.78
   2  c345    r    0.9
   3  d456    f    1.0
   4  d456    g    0.3

I'm creating a pivot table that needs to look like this:

           f      g      r
   a123   0.56   na     na
   b234   na     0.78   na
   c345   na     na     0.9
   d456   1.0    0.3    na
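
For reference, a minimal sketch of that reshape at small scale (the sample rows above hard-coded just to illustrate; not the actual data):

    import pandas as pd

    sample = pd.DataFrame({
        "version": ["a123", "b234", "c345", "d456", "d456"],
        "assay": ["f", "g", "r", "f", "g"],
        "resp_rob_sigmas": [0.56, 0.78, 0.9, 1.0, 0.3],
    })

    # one row per version, one column per assay, NaN where no value exists
    matrix = sample.pivot_table(index="version", columns="assay",
                                values="resp_rob_sigmas")
    print(matrix)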

Pre-chunking and pre-unzipping, the data frame is 13 GB, and the pivot table explodes in size during creation, causing a memory error. My current code looks like this:

    import pandas as pd
    import zipfile

    # number of lines read at a time from the csv
    chunk_size = 10 ** 5
    merged_df = pd.DataFrame([])
    folder = zipfile.ZipFile(op_directory + "/file.zip")

    # reading csv in chunks, dropping columns, dropping rows with null responses
    for chunk in pd.read_csv(folder.open("file.csv"), chunksize=chunk_size):
        df = pd.DataFrame(chunk)
        # operations on df
        ...
        ...

        merged_df = merged_df.append(df)

    # pivoting the data to create the matrix
    df = pd.pivot_table(merged_df, index=['version'], values=['resp_rob_sigmas'],
                        columns=['assay'])
    df.to_csv("output.csv")

How can I prevent the memory error and optimize this?
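
Not a definitive fix, but one common memory saving, sketched below: collect the filtered chunks in a list and concatenate once instead of calling append inside the loop (append re-copies the accumulated frame on every iteration and has been removed in pandas 2.x), and downcast the value column while reading. The column names, usecols and op_directory are taken from the question; the dropna line stands in for the unspecified per-chunk filtering.

    import pandas as pd
    import zipfile

    chunk_size = 10 ** 5
    folder = zipfile.ZipFile(op_directory + "/file.zip")  # op_directory as in the question

    pieces = []
    for chunk in pd.read_csv(folder.open("file.csv"), chunksize=chunk_size,
                             usecols=["version", "assay", "resp_rob_sigmas"],
                             dtype={"resp_rob_sigmas": "float32"}):
        # placeholder for the question's own filtering: drop rows with null responses
        chunk = chunk.dropna(subset=["resp_rob_sigmas"])
        pieces.append(chunk)

    # one concat at the end instead of merged_df.append() inside the loop,
    # which re-copies the accumulated frame on every iteration
    merged_df = pd.concat(pieces, ignore_index=True)

    matrix = merged_df.pivot_table(index="version", columns="assay",
                                   values="resp_rob_sigmas")
    matrix.to_csv("output.csv")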




