python - How to split Vector into columns - using PySpark
Context: I have a DataFrame with 2 columns: `word` and `vector`, where the column type of `vector` is VectorUDT.

An example:

word   | vector
assert | [435, 323, 324, 212, ...]

And I want this:

word   | v1  | v2  | v3  | v4  | ...
assert | 435 | 323 | 324 | 212 | ...

Question: how can I split a column of vectors into several columns, one per dimension, using PySpark?

Thanks in advance.
One possible approach is to convert to and from an RDD:

from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])

def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...

## +-------+---+---+---+
## |   word| _2| _3| _4|
## +-------+---+---+---+
## | assert|1.0|2.0|3.0|
## |require|0.0|2.0|0.0|
## +-------+---+---+---+
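The core of the RDD approach is the `extract` function: it prepends the word and then splats each vector element into its own tuple field, so `toDF` turns each field into a column. A minimal pure-Python sketch of that flattening idea, using a `namedtuple` as a stand-in for a Spark `Row` (no Spark required; the names here are illustrative):

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row carrying a word and a dense vector.
Row = namedtuple("Row", ["word", "vector"])

def extract(row):
    # Prepend the word, then unpack every vector element into its own field.
    return (row.word,) + tuple(row.vector)

row = Row("assert", [1.0, 2.0, 3.0])
print(extract(row))  # ('assert', 1.0, 2.0, 3.0)
```

In Spark the same per-row tuple is what `df.rdd.map(extract)` produces before `toDF` names the columns.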
An alternative solution is to create a UDF:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    return udf(to_array_, ArrayType(DoubleType()))(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))

## +-------+-----+-----+-----+
## |   word|xs[0]|xs[1]|xs[2]|
## +-------+-----+-----+-----+
## | assert|  1.0|  2.0|  3.0|
## |require|  0.0|  2.0|  0.0|
## +-------+-----+-----+-----+
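The `select` step builds one output column per index `i`, which is why the vector length (3 here) must be known up front. A pure-Python sketch of that indexing step, outside Spark (the data and function name are illustrative):

```python
# Each row's vector has already been converted to a plain list (what the
# to_array UDF does); element i of that list becomes column xs[i].
rows = [("assert", [1.0, 2.0, 3.0]), ("require", [0.0, 2.0, 0.0])]

def split_vector(rows, n):
    # One output tuple per row: (word, xs[0], ..., xs[n-1]).
    return [(word,) + tuple(xs[i] for i in range(n)) for word, xs in rows]

print(split_vector(rows, 3))
# [('assert', 1.0, 2.0, 3.0), ('require', 0.0, 2.0, 0.0)]
```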