pyspark 3.x in PyPI limited to Hadoop 2.7.4

I want to pip install pyspark into my Python 3 virtual environment, but the only choice I have is the PyPI version compiled against Hadoop 2.7.4 dependencies. I need a Hadoop 3.x build, since 2.7.4 is too old for modern AWS S3 integration.
Does anyone know why there isn't an option to pip install pyspark with Hadoop 3.x support?
Is my only option to build my own pyspark from source?
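One workaround, offered here only as a sketch (the 3.1.2 release and archive name are examples, not taken from the question), is to download a Spark distribution pre-built for Hadoop 3 and pip install its bundled python package into the virtualenv, much like the Spark 2.1 recipe in one of the answers further down this page:
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
cd spark-3.1.2-bin-hadoop3.2/python
pip install -e .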

Related

How to install Python 3.10 on the base environment of Anaconda?

I have Python 3.9 installed in the Anaconda base environment and am having trouble installing the latest 3.10 version. I have tried "conda install -c conda-forge python=3.10", but it does not resolve my problem: over half a day I have run the command in the terminal repeatedly, and the package still will not install. Any suggestions?
I've tried "conda install -c conda-forge python=3.10" and it did not work. My base environment still runs Python 3.9 and I cannot use the latest Python version there.
The Anaconda base environment currently doesn't support versions higher than 3.9.
You can instead create your own environment, from your IDE or the command line, that uses a newer Python (3.10 or 3.11) and install all the required libraries via pip.
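As a rough sketch of that approach (the environment name py310 is just an example), creating a fresh conda environment with the newer interpreter avoids touching the pinned base environment:
conda create -n py310 python=3.10
conda activate py310
python --version
After that, the required libraries can be installed into py310 with pip, as the answer suggests.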

pip search shows apache-beam 2.9 but pip install apache-beam only installs apache-beam 2.2

In my fresh new virtual environment, I ran
pip search apache-beam
and got
apache-beam (2.9.0)
Then I ran
pip install apache-beam
pip list
But apache-beam 2.2 was installed instead of 2.9:
apache-beam 2.2.0
I then ran
python -m apache_beam.examples.wordcount --output cout
and got the error
The Apache Beam SDK for Python is supported only on Python 2.7.
According to this article:
https://towardsdatascience.com/hands-on-apache-beam-building-data-pipelines-in-python-6548898b66a5
Beam 2.9 should support Python 3. pip search finds apache-beam 2.9, yet pip install still gives me apache-beam 2.2.
Please help.
I had the same kind of requirement, and this is how I installed Apache Beam;
it worked for me.
Step 01: Make sure Python 3.7 or above is installed
Step 02: Pick the Beam version; I chose 2.27 for my requirement
pip3 install apache-beam==2.27.0
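As a quick check that this works end to end (input.txt and the counts output prefix are placeholders, not from the answer), the wordcount example from the question can be re-run against a local file in a Python 3 virtualenv:
python3 -m venv beam-env
source beam-env/bin/activate
pip3 install apache-beam==2.27.0
python3 -m apache_beam.examples.wordcount --input input.txt --output counts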

Homebrew: install a specific version of Hadoop (2.8.0 instead of 3.1.1)

How do I install Hadoop 2.8.0 instead of Hadoop 3.1.1 via brew?
Alternatively, how can I use brew to install a Hadoop 2.8.0 archive already downloaded to my desktop?
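If pinning the formula version in brew turns out to be awkward, one generic fallback (sketched here on the assumption that hadoop-2.8.0.tar.gz is already on the desktop; the target directory is arbitrary) is to install the downloaded tarball by hand:
mkdir -p ~/hadoop
tar -xzf ~/Desktop/hadoop-2.8.0.tar.gz -C ~/hadoop
export HADOOP_HOME=~/hadoop/hadoop-2.8.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
hadoop version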

Installing Apache Spark using yum

I am in the process of installing spark in my organization's HDP box. I run yum install spark and it installs Spark 1.4.1. How do I install Spark 2.0? Please help!
Spark 2 is supported (as a technical preview) in HDP 2.5. You can add the HDP 2.5 repo to your yum repo directory and then install it from there. Spark 1.6.2 is the default version in HDP 2.5.
wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.5.0.0/hdp.repo
sudo cp hdp.repo /etc/yum.repos.d/hdp.repo
sudo yum install spark2-master
or
sudo yum install spark2 (also seemed to do the same when I tried)
See what's new in HDP 2.5: http://hortonworks.com/products/data-center/hdp/
For the full list of repos, see https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_release-notes/content/download-links-250.html
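Once the spark2 package is installed, a quick sanity check (assuming the usual HDP client layout under /usr/hdp/current; the path may differ on your box) is:
/usr/hdp/current/spark2-client/bin/spark-submit --version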

pyspark: pip install couldn't find a version

I am trying to install pyspark using pip install as shown below, but I get the following errors.
(python_virenv)edamame$ pip install pyspark
Collecting pyspark
Could not find a version that satisfies the requirement pyspark (from versions: )
No matching distribution found for pyspark
Does anyone have any idea? Thanks!
As of Spark 2.2, PySpark is now available in PyPI.
pip install pyspark
As of Spark 2.1, PySpark is pip installable but not yet from PyPI; publishing to PyPI is under consideration for 2.2 in this ticket. To install PySpark you now just need to download Spark 2.1+ and pip install from its python directory:
cd spark-2.1/python/
pip install -e .
Big thanks to #Holden!
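Whichever of the two routes is used, a one-line check (not from the answer, just a generic verification) confirms what ended up installed:
python -c "import pyspark; print(pyspark.__version__)"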
pyspark is not on PyPI, so you cannot directly use pip install to install it.
Instead, download a suitable version of Spark from http://spark.apache.org/downloads.html; you will get a compressed TAR file. Unpack it, and pyspark is in its python folder.
To open the Python version of the Spark shell, you could go into your Spark directory and type:
bin/pyspark
or
bin\pyspark
in Windows.
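To import pyspark from your own scripts rather than only through bin/pyspark, a common follow-up (the directory name and py4j zip version below are examples and depend on which Spark you downloaded) is to point SPARK_HOME and PYTHONPATH at the unpacked folder:
export SPARK_HOME=~/spark-2.1.0-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
python -c "import pyspark"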
pyspark doesn't even exist on PyPI, as you can see from https://pypi.python.org/pypi?%3Aaction=search&term=pyspark&submit=search; that's why pip is telling you it can't find it.
PySpark can be installed in the following ways.
Download Spark from: Spark Downloads
Download and extract the compressed file, go to the extracted Spark directory, and execute
./bin/pyspark
You might want to add the bin folder to the $PATH variable of your shell as well.
Or,
You can install it from the CDH distribution:
Add CDH keys following the steps here:
http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html
Install Spark following the steps here:
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_spark_install.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--7ef8
