python - How to split Vector into columns - using PySpark




Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT.

An example:

word   | vector
assert | [435,323,324,212...]

And I want this:

word   | v1  | v2   | v3  | v4  | v5 | v6 ......
assert | 435 | 5435 | 698 | 356 | ....

Question:

How can I split a column of vectors into several columns, one for each dimension, using PySpark?

Thanks in advance.

One possible approach is to convert to an RDD and back:

    from pyspark.ml.linalg import Vectors

    df = sc.parallelize([
        ("assert", Vectors.dense([1, 2, 3])),
        ("require", Vectors.sparse(3, {1: 2}))
    ]).toDF(["word", "vector"])

    def extract(row):
        # Word first, then one plain Python float per vector dimension
        return (row.word, ) + tuple(row.vector.toArray().tolist())

    df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...

    ## +-------+---+---+---+
    ## |   word| _2| _3| _4|
    ## +-------+---+---+---+
    ## | assert|1.0|2.0|3.0|
    ## |require|0.0|2.0|0.0|
    ## +-------+---+---+---+
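The "require" row comes out as 0.0, 2.0, 0.0 because toArray() expands a sparse vector into a dense array and fills every unspecified index with zero. The plain-Python sketch below mimics that step without Spark. Note that densify and extract_plain are hypothetical stand-ins written for illustration, not PySpark functions.

```python
def densify(size, entries):
    """Stand-in for SparseVector.toArray().tolist(): missing indices become 0.0."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = float(value)
    return dense


def extract_plain(word, values):
    """Mirror of extract(): the word followed by one tuple element per dimension."""
    return (word,) + tuple(values)


print(extract_plain("assert", [1.0, 2.0, 3.0]))      # ('assert', 1.0, 2.0, 3.0)
print(extract_plain("require", densify(3, {1: 2})))  # ('require', 0.0, 2.0, 0.0)
```

Each tuple element becomes one column once the mapped RDD is converted back to a DataFrame, which is why the vectors must all have the same length.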

An alternative solution would be to create a UDF:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType

    def to_array(col):
        def to_array_(v):
            return v.toArray().tolist()
        # Wrap the conversion in a UDF that returns array<double>
        return udf(to_array_, ArrayType(DoubleType()))(col)

    (df
        .withColumn("xs", to_array(col("vector")))
        .select(["word"] + [col("xs")[i] for i in range(3)]))

    ## +-------+-----+-----+-----+
    ## |   word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+
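Neither approach above produces the v1, v2, ... column names asked for in the question. The sketch below is a variation, not the method from the answer: it assumes Spark 3.0 or later, where pyspark.ml.functions.vector_to_array is available as a built-in replacement for the Python UDF, and it assumes every vector has the same, known size. The helper names dimension_names and split_vector are hypothetical.

```python
def dimension_names(size, prefix="v"):
    """Return column names v1..v<size>, matching the layout in the question."""
    return [f"{prefix}{i + 1}" for i in range(size)]


def split_vector(df, size, vector_col="vector", keep=("word",)):
    """Expand a VectorUDT column into one double column per dimension.

    Assumes Spark 3.0+ (for vector_to_array) and a fixed vector size.
    """
    # Imported lazily so dimension_names() works without Spark installed.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    arr = vector_to_array(col(vector_col))
    # arr[i] selects the i-th array element; alias() gives it a readable name.
    dims = [arr[i].alias(name) for i, name in enumerate(dimension_names(size))]
    return df.select(*keep, *dims)


print(dimension_names(3))  # ['v1', 'v2', 'v3']
```

With the example DataFrame, split_vector(df, size=3) should give the columns word, v1, v2 and v3. On older Spark versions the same alias() call can be applied to col("xs")[i] in the UDF version.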



