Installing Spark on Hadoop - hadoop

I installed Hadoop 2.7 on my Mac. Now I want to install Spark on it, but I can't find any documentation for this. Can anybody explain step by step how to install Spark on Hadoop?

Steps to Install Apache Spark
1) Open the Apache Spark website http://spark.apache.org/
2) Click on the Downloads tab; a new page will open
3) Choose Pre-built for Hadoop 2.7 and later
4) Choose Direct Download
5) Click on Download Spark: spark-2.0.2-bin-hadoop2.7.tgz and save it to your desired location
6) Go to the downloaded .tgz file and extract it
7) Extract the resulting spark-2.0.2-bin-hadoop2.7.tar [file name will differ as the version changes] to generate the spark-2.0.2-bin-hadoop2.7 folder
8) Now open a shell prompt and go to the bin directory of the spark-2.0.2-bin-hadoop2.7 folder [folder name will differ as the version changes]
9) Execute the command ./spark-shell (the launcher script is named spark-shell, not spark-shell.sh)
You will be in the Spark shell and can execute Spark commands.
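Condensed into shell commands, the steps above look roughly like this (a sketch; the version and the Apache archive URL are assumptions based on the release named above, so substitute whatever you actually download):
$ wget https://archive.apache.org/dist/spark/spark-2.0.2/spark-2.0.2-bin-hadoop2.7.tgz
$ tar -xzf spark-2.0.2-bin-hadoop2.7.tgz      # unpacks straight into the spark-2.0.2-bin-hadoop2.7 folder
$ cd spark-2.0.2-bin-hadoop2.7/bin
$ ./spark-shell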
Quick start guide from Spark: https://spark.apache.org/docs/latest/quick-start.html
Hope this Helps!!!

For running Spark on a YARN cluster there are quite a few steps to install Hadoop, Spark and everything around them, so I wrote a blog post that walks through it step by step. You can follow it to install everything and run the Spark shell on YARN; see the link below:
https://blog.knoldus.com/2016/01/30/spark-shell-on-yarn-resource-manager-basic-steps-to-create-hadoop-cluster-and-run-spark-on-it/

Here are the steps I took to install Apache Spark on a Linux CentOS system with Hadoop (a consolidated sketch follows the list):
Install a default Java runtime (e.g. sudo yum install java-11-openjdk)
Download the latest release of Apache Spark from spark.apache.org
Extract the Spark tarball (tar xvf spark-2.4.5-bin-hadoop2.7.tgz)
Move the Spark folder created by the extraction to the /opt/ directory (sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark)
Run /opt/spark/bin/spark-shell if you want to work with Scala, or /opt/spark/bin/pyspark if you want to work with Python
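Put together as one shell session, those steps look like this (a sketch; the version matches the tarball named above, and the final PATH line is an optional addition of mine, not part of the original steps):
$ sudo yum install java-11-openjdk
$ tar xvf spark-2.4.5-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark
$ /opt/spark/bin/spark-shell                              # Scala shell
$ /opt/spark/bin/pyspark                                  # Python shell
$ echo 'export PATH=$PATH:/opt/spark/bin' >> ~/.bashrc    # optional: put spark-shell/pyspark on your PATH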

Related

How to install custom Spark version in Cloudera

I am new to Spark, Hadoop and Cloudera. We need to use a specific version (1.5.2) of Spark, and we are also required to use Cloudera for cluster management, including for Spark.
However, CDH 5.5 comes with Spark 1.5.0 and it cannot be changed very easily.
People suggest to "just download" a custom version of Spark manually. But how can this "custom" Spark version be managed by Cloudera, so I can distribute it across the cluster? Or does it need to be operated and provisioned completely separately from Cloudera?
Thanks for any help and explanation.
Yes, it is possible to run any Apache Spark version!
Things to make sure of before doing it:
You have YARN configured in Cloudera Manager (CM). After that you can run your application as a YARN application with spark-submit; please refer to this link. It will work like any other YARN application.
It is not mandatory to install Spark on the cluster; you can still run your application.
Under YARN you can run any application, with any version of Spark. After all, Spark is just a bunch of libraries, so you can package your jar with your dependencies and send it to YARN (see the sketch below). However, there are some additional small tasks to be done.
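As a minimal sketch of that idea (the class name, jar name and resource settings below are hypothetical placeholders, not anything from the question), submitting your own assembly jar to YARN looks roughly like:
# class and jar are hypothetical; replace with your application's
$ ./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MyApp \
    --num-executors 4 \
    --executor-memory 2g \
    my-app-assembly.jar arg1 arg2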
In the following link, dlb8 provides a list of tasks to be done to run Spark 2.0 on an installation with a previous version. Just change the versions/paths accordingly.
Find the version of CDH and Hadoop running on your cluster using
$ hadoop version
Hadoop 2.6.0-cdh5.4.8
Download Spark and extract the sources. Pre-built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build, in which case you can use spark-2.0.0-bin-without-hadoop.tgz.
(Optional) You can also build Spark by opening the distribution directory in the shell and running the following command using the CDH and Hadoop version from step 1
$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
Note: With Spark 2.0 the default build uses Scala version 2.11. If you need to stick to Scala 2.10, use the -Dscala-2.10 property or
$ ./dev/change-scala-version.sh 2.10
Note that -Phadoop-provided enables the profile to build the assembly without including Hadoop-ecosystem dependencies provided by Cloudera.
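For reference, a sketch of the same build with that profile added (flags otherwise as in the command above):
$ ./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-provided -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn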
Extract the tgz file.
$ tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz
cd into the custom Spark distribution and configure it by copying the configuration from your current Spark version
$ cp -R /etc/spark/conf/* conf/
$ cp /etc/hive/conf/hive-site.xml conf/
Change SPARK_HOME to point to folder with the Spark 2.0 distribution
$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
Change spark.master to yarn from yarn-client in spark-defaults.conf
$ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-defaults.conf
Delete spark.yarn.jar from spark-defaults.conf
$ sed -i '/spark.yarn.jar/d' conf/spark-defaults.conf
Finally test your new Spark installation:
$ ./bin/run-example SparkPi 10 --master yarn
$ ./bin/spark-shell --master yarn
$ ./bin/pyspark
Update log4j.properties to suppress annoying warnings. Add the following to conf/log4j.properties
echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties
However, it can be adapted to the opposite direction as well, since the bottom line is "use a Spark version on an installation that ships with a different version".
It's even simpler if you don't have to deal with 1.x to 2.x version changes, because then you don't need to pay attention to the change of Scala version and of the assembly approach.
I tested it on a CDH 5.4 installation to set up 1.6.3 and it worked fine. I did it with the "spark.yarn.jars" option:
#### set "spark.yarn.jars"
$ cd $SPARK_HOME
$ hadoop fs -mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

Hadoop installation status

I'm running Debian and I'm new to Hadoop. Some time back I tried to install Hadoop, and I'm not sure whether the installation succeeded. When I enter this command at the terminal
hadoop version
I see the output:
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /home/xxxxxxx/java/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
Is Hadoop installed properly? If not, what other tests do I have to do? If yes, is there some simple "getting started" tutorial/exercise you're aware of that can help me get started?
Thank you!
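For what it's worth, hadoop version only proves the binaries are on your PATH. A sketch of further checks for a single-node setup, assuming core-site.xml/hdfs-site.xml are configured and the NameNode has been formatted:
$ start-dfs.sh          # start the HDFS daemons (requires the sbin directory on your PATH)
$ jps                   # should list NameNode, DataNode and SecondaryNameNode
$ hdfs dfs -ls /        # a trivial HDFS operation; it succeeds only if HDFS is really up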

how to install Spark and Hadoop from tarball separately [Cloudera]

I want to install the Cloudera distribution of Hadoop and Spark using tarballs.
I have already set up Hadoop in pseudo-distributed mode on my local machine and successfully ran a YARN example.
I have downloaded the latest CDH 5.3.x tarballs from here
But the folder structure of the Spark downloaded from Cloudera is different from the one on the Apache website. This may be because Cloudera provides its own version, maintained separately.
So far I have not found any documentation on installing Spark separately from this Cloudera tarball.
Could someone help me to understand how to do it?
Spark can be extracted to any directory. You just need to run the ./bin/spark-submit command (available in the extracted Spark directory) with the required parameters to submit a job. To start the Spark interactive shell, use ./bin/spark-shell.
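For example, to run the bundled SparkPi example on YARN (a sketch; the examples jar path and the yarn-client master string match older CDH/Spark 1.x builds, so adjust both if your layout or Spark version differs):
$ cd /path/to/extracted/spark
$ ./bin/spark-submit \
    --master yarn-client \
    --class org.apache.spark.examples.SparkPi \
    lib/spark-examples*.jar 10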

How to download CDH4 setup manually

How can I download the CDH4 setup manually?
I mean I want to download the setup without using apt-get from the ubuntu command prompt.
CDH4 is Cloudera's distribution of Apache Hadoop. CDH is a collection of Apache Hadoop and several components of its ecosystem.
Assuming that you are asking for the source, each component can be downloaded as a tarball from the following location:
CDH Packaging and Tarball Information

Hadoop - install process for /usr/libexec etc

I'm trying to compile/install/run Hadoop as a single-node cluster on Mac OS X 10.7.5.
I've downloaded the hadoop-2.2.0-src, and am able to compile all modules with
mvn install
The install is successful, and the tests check out too.
When trying to run Hadoop (specifically, hdfs namenode -format to start off with), I see a requirement for Hadoop components to exist in directories like:
/usr/libexec
/usr/lib/conf etc.
What is the install step required to get the files into these directories? Can it be done from Maven, or is there a manual install step required?
One option, and I'm not sure if it's correct, is to set my HADOOP_HOME - is this where Hadoop finds its libexec scripts?
Thanks guys
Pete
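For what it's worth, a sketch of one common route: build the full binary distribution from the source tree and point HADOOP_HOME at the extracted result, so bin/, sbin/, libexec/ and etc/hadoop are all resolved from there instead of from /usr (the output path below is the usual location for the 2.2.0 build, but double-check it on your machine):
$ cd hadoop-2.2.0-src
$ mvn package -Pdist -DskipTests -Dtar        # builds the binary distribution tarball
$ mkdir -p ~/hadoop && tar xzf hadoop-dist/target/hadoop-2.2.0.tar.gz -C ~/hadoop
$ export HADOOP_HOME=~/hadoop/hadoop-2.2.0
$ export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
$ hdfs namenode -format                       # should now find its libexec/ scripts under HADOOP_HOME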
