python - How to use join with many conditions in pyspark?
I am able to use the dataframe join statement with a single on condition (in PySpark), but if I try to add multiple conditions, it fails.
Code:
summary2 = summary.join(county_prop, ["category_id", "bucket"], how="leftouter")
The above code works. But if I add another condition to the list, like summary.bucket == 9 or something similar, it fails. Please help me fix this issue.
Statement that errors:
summary2 = summary.join(county_prop, ["category_id", (summary.bucket) == 9], how="leftouter")
Error:
TypeError: 'Column' object is not callable
Edit:
Adding a full working example.
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("category", StringType()),
                     StructField("category_id", StringType()),
                     StructField("bucket", StringType()),
                     StructField("prop_count", StringType()),
                     StructField("event_count", StringType()),
                     StructField("accum_prop_count", StringType())])

bucket_summary = sqlContext.createDataFrame([], schema)

temp_county_prop = sqlContext.createDataFrame([("nation", "nation", 1, 222, 444, 555),
                                               ("nation", "state", 2, 222, 444, 555)], schema)
bucket_summary = bucket_summary.unionAll(temp_county_prop)

county_prop = sqlContext.createDataFrame([("nation", "state", 2, 121, 221, 551)], schema)
I want to join on the category_id and bucket columns, and I want to replace the values in bucket_summary with those from county_prop.
cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")
1. It works if I spell out the whole condition with column expressions, and it also works if I pass a list of column names like ["category_id", "bucket"].
2. But if I use a combination of both, e.g. cond = ["bucket", bucket_summary.category_id == "state"], it does not work.
What can be going wrong with statement 2?
You can join on multiple conditions by passing a list of column expressions, e.g.
df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')
But in your case, (summary.bucket) == 9 should not appear in the join condition: it references only one of the DataFrames, so it is better expressed as a filter.
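As a minimal sketch (assuming the intent is simply to keep rows where bucket is 9, using the DataFrames from the question), you could join on the shared column names and apply the extra condition with filter afterwards:

from pyspark.sql.functions import col

# Sketch: join on the shared key columns, then apply the
# single-DataFrame condition as a filter on the joined result.
summary2 = (summary
            .join(county_prop, ["category_id", "bucket"], how="leftouter")
            .filter(col("bucket") == 9))

Filtering summary before the join should give the same rows here, since the left outer join keeps the left side's bucket values.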
Update:
In the join condition you can pass either a list of Column join expressions, or a list of column names (Column / column_name).
wiki
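For the bucket_summary / county_prop case from the question, one way to avoid the mixing problem is a sketch like the following, which writes every condition as a Column expression instead of combining expressions with plain column-name strings (assuming the goal is the leftouter join plus the bucket == 2 restriction):

# Sketch: express all join conditions as Column expressions.
cond = [bucket_summary.category_id == county_prop.category_id,
        bucket_summary.bucket == county_prop.bucket,
        bucket_summary.bucket == 2]

bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")

Note that joining on expressions (rather than on column names) keeps both bucket_summary.bucket and county_prop.bucket in the result, so you may want to drop or rename one of them afterwards.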