How to run Python 3 on Google's Dataproc PySpark




I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of the default 2.7.

The best I've been able to find is adding these initialization commands.

However, when I SSH into the cluster:
(a) the python command is still Python 2, and
(b) my job fails due to a Python 2 incompatibility.

I've tried uninstalling Python 2 and also adding alias python='python3' to my init.sh script, but alas, no success. The alias doesn't seem to stick.

I create the cluster like this:
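One likely reason the alias has no effect: shell aliases are normally expanded only by interactive shells, while PySpark starts its interpreter by executable name. As a rough sketch, assuming the Spark 2.x behavior of reading the PYSPARK_PYTHON environment variable and falling back to python, the helper below shows which executable that name resolves to on the machine where it runs. The name resolve_pyspark_python is hypothetical and not part of Spark.

```python
import os
import shutil


def resolve_pyspark_python(env=None):
    """Return (name, path) for the interpreter PySpark would be asked to start.

    Assumption: the interpreter name comes from PYSPARK_PYTHON and defaults
    to "python", mirroring Spark 2.x. Shell aliases are never consulted.
    """
    env = os.environ if env is None else env
    name = env.get("PYSPARK_PYTHON", "python")
    # shutil.which searches the directories on PATH; None means "not found".
    return name, shutil.which(name, path=env.get("PATH"))


if __name__ == "__main__":
    print(resolve_pyspark_python())
```

Running this over SSH on a cluster node gives a quick view of what the name resolves to there, independent of any alias defined in a shell profile.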

    from googleapiclient.discovery import build
    from oauth2client.client import GoogleCredentials

    # Request body for the Dataproc v1 clusters.create call.
    cluster_config = {
        "projectId": self.project_id,
        "clusterName": cluster_name,
        "config": {
            "gceClusterConfig": gce_cluster_config,
            "masterConfig": master_config,
            "workerConfig": worker_config,
            # initializationActions is a flat list of action objects.
            "initializationActions": [
                {
                    "executableFile": executable_file_uri,
                    "executionTimeout": execution_timeout,
                }
            ],
        }
    }

    # Authenticate with the application default credentials.
    credentials = GoogleCredentials.get_application_default()
    api = build('dataproc', 'v1', credentials=credentials)

    response = api.projects().regions().clusters().create(
        projectId=self.project_id,
        region=self.region,
        body=cluster_config,
    ).execute()
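The request body can also be assembled in a small helper, which makes its shape easy to inspect before sending it. This is only a sketch, assuming the API expects initializationActions as a flat list of action objects with the Dataproc v1 field casing. The function name build_cluster_body, the default timeout of "600s", and the bucket path in the usage lines are all illustrative, not taken from the question.

```python
def build_cluster_body(project_id, cluster_name, init_script_uri,
                       gce_cluster_config=None, master_config=None,
                       worker_config=None, execution_timeout="600s"):
    """Assemble a request body for clusters().create().

    Field names use the Dataproc v1 REST casing (projectId, clusterName, ...).
    The default timeout is an illustrative value, not a recommendation.
    """
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "gceClusterConfig": gce_cluster_config or {},
            "masterConfig": master_config or {},
            "workerConfig": worker_config or {},
            # One dict per init script, kept in a flat list.
            "initializationActions": [
                {
                    "executableFile": init_script_uri,
                    "executionTimeout": execution_timeout,
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical project, cluster, and bucket names.
    body = build_cluster_body("my-project", "my-cluster", "gs://my-bucket/init.sh")
    print(body["config"]["initializationActions"])
```

The returned dictionary can be passed as body= to the create call shown above.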

My executable_file_uri sits on Google Storage; init.sh:

    apt-get -y update
    apt-get install -y python-dev
    # -O (capital) writes the download to the given file
    wget -O /root/get-pip.py https://bootstrap.pypa.io/get-pip.py
    python /root/get-pip.py
    apt-get install -y python-pip
    pip install --upgrade pip
    pip install --upgrade six
    pip install --upgrade gcloud
    pip install --upgrade requests
    pip install numpy

I found an answer to this here; the initialization script looks like this:

    #!/bin/bash

    # Install tools
    apt-get -y install python3 python-dev build-essential python3-pip
    easy_install3 -U pip

    # Install requirements
    pip3 install --upgrade google-cloud==0.27.0
    pip3 install --upgrade google-api-python-client==1.6.2
    pip3 install --upgrade pytz==2013.7

    # Set up Python 3 for Dataproc
    echo "export PYSPARK_PYTHON=python3" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
    echo "export PYTHONHASHSEED=0" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
    echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
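To confirm that an initialization script like this took effect, the interpreter version can be checked from inside a job. The following is a minimal sketch, assuming an existing SparkContext is passed in as sc. The names interpreter_info and check_cluster are illustrative, and the partition count is arbitrary.

```python
import sys


def interpreter_info(_=None):
    """Return (major version, executable path) of the running interpreter."""
    return (sys.version_info[0], sys.executable)


def check_cluster(sc, partitions=4):
    """Compare the driver's interpreter with the ones used by the executors.

    sc is assumed to be a live SparkContext. The tasks may not land on every
    worker, so treat the result as a spot check rather than a guarantee.
    """
    driver = interpreter_info()
    executors = (
        sc.parallelize(range(partitions), partitions)
        .map(interpreter_info)
        .distinct()
        .collect()
    )
    return driver, executors
```

In a job submitted to the cluster, printing check_cluster(sc) should show a major version of 3 on both sides if PYSPARK_PYTHON was picked up.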



