python - Sort rows and get column IDs in a pandas dataframe -
with given pandas dataframe, i'd create new columns highest, second highest, third highest, etc... values in row. , create column corresponding column name of each of those. code below max value of row, not follow.
adapted find column name has maximum value each row
import pandas pd df = pd.dataframe({'a': (23, 24, 55, 77, 33, 66), 'b': (12, 33, 0.2, 44, 23.5, 66), 'c': (1, 33, 66, 44, 5, 62), 'd': (9, 343, 4, 64, 24, 63), 'e': (123, 33, 2.2, 42, 2, 99)}) # determine max value , column name , add columns df df['max1'] = df.max(axis=1) df['col_max1'] = df.idxmax(axis=1) # determine 2nd , 3rd max pr , threshold levels , add columns # ??????????? print(df)
this produces:
b c d e max1 col_max1 0 23 12.0 1 9 123.0 123.0 e 1 24 33.0 33 343 33.0 343.0 d 2 55 0.2 66 4 2.2 66.0 c 3 77 44.0 44 64 42.0 77.0 4 33 23.5 5 24 2.0 33.0 5 66 66.0 62 63 99.0 99.0 e process finished exit code 0
only caveat possible have large number of columns, if matters performance. guys.
one approach using underlying array data focus on performance -
a = df.values c = df.columns idx = a.argsort(1)[:,::-1] vals = a[np.arange(idx.shape[0])[:,none], idx] ids = c[idx] names_vals = ['max'+str(i+1) in range(a.shape[1])] names_ids = ['col_max'+str(i+1) in range(a.shape[1])] df_vals = pd.dataframe(vals, columns=names_vals) df_ids = pd.dataframe(ids, columns=names_ids) df_out = pd.concat([df, df_vals, df_ids], axis=1)
sample input, output -
in [40]: df out[40]: b c d e 0 23 12.0 1 9 123.0 1 24 33.0 33 343 33.0 2 55 0.2 66 4 2.2 3 77 44.0 44 64 42.0 4 33 23.5 5 24 2.0 5 66 66.0 62 63 99.0 in [41]: df_out out[41]: b c d e max1 max2 max3 max4 max5 col_max1 col_max2 \ 0 23 12.0 1 9 123.0 123.0 23.0 12.0 9.0 1.0 e 1 24 33.0 33 343 33.0 343.0 33.0 33.0 33.0 24.0 d e 2 55 0.2 66 4 2.2 66.0 55.0 4.0 2.2 0.2 c 3 77 44.0 44 64 42.0 77.0 64.0 44.0 44.0 42.0 d 4 33 23.5 5 24 2.0 33.0 24.0 23.5 5.0 2.0 d 5 66 66.0 62 63 99.0 99.0 66.0 66.0 63.0 62.0 e b col_max3 col_max4 col_max5 0 b d c 1 c b 2 d e b 3 c b e 4 b c e 5 d c
if need values , ids in sequence, need modify last few steps there -
df0 = pd.dataframe(np.dstack((vals, ids)).reshape(a.shape[0],-1)) df0.columns = np.vstack((names_vals, names_ids)).t.ravel() df_out = pd.concat([df, df0], axis=1)
sample output -
in [62]: df_out out[62]: b c d e max1 col_max1 max2 col_max2 max3 col_max3 max4 \ 0 23 12.0 1 9 123.0 123 e 23 12 b 9 1 24 33.0 33 343 33.0 343 d 33 e 33 c 33 2 55 0.2 66 4 2.2 66 c 55 4 d 2.2 3 77 44.0 44 64 42.0 77 64 d 44 c 44 4 33 23.5 5 24 2.0 33 24 d 23.5 b 5 5 66 66.0 62 63 99.0 99 e 66 b 66 63 col_max4 max5 col_max5 0 d 1 c 1 b 24 2 e 0.2 b 3 b 42 e 4 c 2 e 5 d 62 c
wiki
Comments
Post a Comment