python - How to use join with many conditions in pyspark?
I am able to use the dataframe join statement with a single on condition (in PySpark), but if I try to add multiple conditions, it fails.
Code:
summary2 = summary.join(county_prop, ["category_id", "bucket"], how="leftouter")
The above code works. But if I add another condition to the list, like summary.bucket == 9 or something similar, it fails. Please help me fix this issue.
Statement that errors:
summary2 = summary.join(county_prop, ["category_id", (summary.bucket) == 9], how="leftouter")
Error:
TypeError: 'Column' object is not callable
Edit:
Adding a full working example.
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("category", StringType()),
                     StructField("category_id", StringType()),
                     StructField("bucket", StringType()),
                     StructField("prop_count", StringType()),
                     StructField("event_count", StringType()),
                     StructField("accum_prop_count", StringType())])

bucket_summary = sqlContext.createDataFrame([], schema)

temp_county_prop = sqlContext.createDataFrame([("nation", "nation", 1, 222, 444, 555),
                                               ("nation", "state", 2, 222, 444, 555)], schema)
bucket_summary = bucket_summary.unionAll(temp_county_prop)

county_prop = sqlContext.createDataFrame([("nation", "state", 2, 121, 221, 551)], schema)
I want to join on the category_id and bucket columns, and I want to replace the values in bucket_summary with those from county_prop.
cond = [bucket_summary.bucket == county_prop.bucket, bucket_summary.bucket == 2]
bucket_summary2 = bucket_summary.join(county_prop, cond, how = "leftouter")
1. It works if I spell out the whole condition with column expressions, and it also works if I pass a list of column names like ["category_id", "bucket"].
2. But if I use a combination of both, e.g. cond = ["bucket", bucket_summary.category_id == "state"], it does not work.
What can be going wrong with statement 2?
You can join on multiple conditions by passing a list of column expressions, e.g.
df1.join(df2, on=[df1['age'] == df2['age'], df1['sex'] == df2['sex']], how='left_outer')
But in your case, (summary.bucket) == 9 should not appear in the join condition: it references only one of the DataFrames, so it is better expressed as a filter.
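As a minimal sketch (assuming the intent is simply to keep rows where bucket is 9, using the DataFrames from the question), you could join on the shared column names and apply the extra condition with filter afterwards:

from pyspark.sql.functions import col

# Sketch: join on the shared key columns, then apply the
# single-DataFrame condition as a filter on the joined result.
summary2 = (summary
            .join(county_prop, ["category_id", "bucket"], how="leftouter")
            .filter(col("bucket") == 9))

Filtering summary before the join should give the same rows here, since the left outer join keeps the left side's bucket values.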
Update:
In the join condition you can pass either a list of Column join expressions, or a list of column names (Column / column_name).
wiki
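For the bucket_summary / county_prop case from the question, one way to avoid the mixing problem is a sketch like the following, which writes every condition as a Column expression instead of combining expressions with plain column-name strings (assuming the goal is the leftouter join plus the bucket == 2 restriction):

# Sketch: express all join conditions as Column expressions.
cond = [bucket_summary.category_id == county_prop.category_id,
        bucket_summary.bucket == county_prop.bucket,
        bucket_summary.bucket == 2]

bucket_summary2 = bucket_summary.join(county_prop, cond, how="leftouter")

Note that joining on expressions (rather than on column names) keeps both bucket_summary.bucket and county_prop.bucket in the result, so you may want to drop or rename one of them afterwards.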