How to run python3 on Google's Dataproc pyspark
I want to run a pyspark job through Google Cloud Platform Dataproc, but I can't figure out how to set up pyspark to run python3 instead of the default 2.7.
The best I've been able to find is adding these initialization commands.
However, when I ssh into the cluster, (a) the python command is still python2, and (b) my job fails due to a python 2 incompatibility.
I've tried uninstalling python2 and aliasing alias python='python3' in the init.sh script, but alas, no success. The alias doesn't seem to stick.
I create the cluster like this:
from oauth2client.client import GoogleCredentials
from googleapiclient.discovery import build

cluster_config = {
    "projectId": self.project_id,
    "clusterName": cluster_name,
    "config": {
        "gceClusterConfig": gce_cluster_config,
        "masterConfig": master_config,
        "workerConfig": worker_config,
        # list of initialization actions, each pointing at a script on Google Storage
        "initializationActions": [
            {
                "executableFile": executable_file_uri,
                "executionTimeout": execution_timeout,
            }
        ],
    }
}

credentials = GoogleCredentials.get_application_default()
api = build('dataproc', 'v1', credentials=credentials)

response = api.projects().regions().clusters().create(
    projectId=self.project_id,
    region=self.region,
    body=cluster_config
).execute()
My executable_file_uri sits on Google Storage; init.sh looks like this:
apt-get -y update
apt-get install -y python-dev
wget -O /root/get-pip.py https://bootstrap.pypa.io/get-pip.py
python /root/get-pip.py
apt-get install -y python-pip
pip install --upgrade pip
pip install --upgrade six
pip install --upgrade gcloud
pip install --upgrade requests
pip install numpy
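For context on the failure in (b) above, the job itself is submitted through the same v1 API client. The following is only a rough sketch of that call, not code from the original question: the jobs().submit body follows the Dataproc v1 REST API, and the gs:// path to the main job file is a placeholder I made up.

# Hypothetical sketch: submitting the pyspark job to the cluster created above.
# The mainPythonFileUri value is a placeholder, not a path from the question.
job_body = {
    "job": {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {
            "mainPythonFileUri": "gs://my-bucket/jobs/my_job.py",
        },
    }
}

job_response = api.projects().regions().jobs().submit(
    projectId=self.project_id,
    region=self.region,
    body=job_body
).execute()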
I found an answer here, such that the initialization script looks like this:
#!/bin/bash

# install tools
apt-get -y install python3 python-dev build-essential python3-pip
easy_install3 -U pip

# install requirements
pip3 install --upgrade google-cloud==0.27.0
pip3 install --upgrade google-api-python-client==1.6.2
pip3 install --upgrade pytz==2013.7

# setup python3 for dataproc
echo "export PYSPARK_PYTHON=python3" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "export PYTHONHASHSEED=0" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
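Once a cluster is created with this script, a quick way to confirm the setting took effect is to check sys.version on both the driver and the executors from inside a pyspark job. This snippet is my own sketch, not part of the linked answer:

import sys
from pyspark import SparkContext

sc = SparkContext()

# Interpreter used by the driver
print("driver:", sys.version)

# Interpreters used by the executors; with PYSPARK_PYTHON=python3 picked up
# from spark-env.sh, every entry should report a 3.x version
versions = sc.parallelize(range(4), 4).map(lambda _: sys.version).distinct().collect()
print("executors:", versions)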