Installing Apache Mahout 0.11.0 on RHEL on multinode hadoop cluster - hadoop

Need help on how to configure mahout to run on multinode hadoop cluster. Also how to run sample programs shipped with Mahout.
Steps done till now:
Download tar gz file.
sudo tar -zxvf mahout-distribution-x.x.tar.gz.
sudo mv mahout-distribution-x.x /usr/lib/mahout
sudo gedit ~/.bashrc".
export MAHOUT_HOME=/usr/local/mahout
source ~/.bashrc".

You also need to add:
export PATH=$PATH:$MAHOUT_HOME/bin

Related

Installing spark on hadoop

I installed hadoop 2.7 on my mac. Then i want to install spark on it. But there is no any document for this.can anybody explain step by step how to install spark on hadoop?
Steps to Install Apache Spark
1) Open Apache Spark Website http://spark.apache.org/
2) Click on Downloads Tab a new Page will get open
3) Choose Pre-built for Hadoop 2.7 and later
4) Choose Direct Download
5) Click on Download Spark: spark-2.0.2-bin-hadoop2.7.tgz and save it on your desired location.
6) Go to the Downloaded Tar file and Extract it.
7) Again Extract the spark-2.0.2-bin-hadoop2.7.tar [File name will differ as version changes] to generate spark-2.0.2-bin-hadoop2.7 folder
8) Now open Shell Prompt and go to the bin directory of spark-2.0.2-bin-hadoop2.7 folder [Folder name will differ as version changes ]
9) Execute command spark-shell.sh
You will be in Spark Shell you can execute the spark commands
https://spark.apache.org/docs/latest/quick-start.html <-- Quick start Guide from spark
Hope this Helps!!!
For running spark on yarn cluster there is lot of steps to install hadoop and spark and all so i write one blog on it step by step you can install it and run spark shell on yarn see the below link
https://blog.knoldus.com/2016/01/30/spark-shell-on-yarn-resource-manager-basic-steps-to-create-hadoop-cluster-and-run-spark-on-it/
Here are the steps I took to install Apache Spark to a Linux Centos system with hadoop:
Install a default Java system (ex: sudo yum install java-11-openjdk)
Download latest release of Apache Spark from spark.apache.org
Extract the Spark tarball (tar xvf spark-2.4.5-bin-hadoop2.7.tgz)
Move Spark folder created after extraction to the /opt/ directory (sudo mv spark-2.4.5-bin-hadoop2.7/ /opt/spark)
Execute with command /opt/spark/bin/spark-shell if you wish to work with Scala or /opt/spark/bin/pyspark if you want to work with Python

How to install custom Spark version in Cloudera

I am new to Spark, Hadoop and Cloudera. We need to use a specific version (1.5.2) of Spark and also have the requirement to use Cloudera for the cluster management, also for Spark.
However, CDH 5.5 comes with Spark 1.5.0 and can not be changed very easily.
People are mentioning to "just download" a custom version of spark manually. But how to manage this "custom" spark version by Cloudera, so I can distribute it across the cluster? Or, does it need to be operated and provisioned completely separate from Cloudera?
Thanks for any help and explanation.
Yes, It is possible to run any Apache Spark version .!!
Steps we need to make sure before doing it:
You have YARN configured in the CM. After which you can run your application as a YARN application with spark-submit. please refer to this link. It will be used to work like any other YARN application.
It is not mandatory to install spark, you can run your application.
Under YARN, you can run any application, with any version of Spark. After all, Spark it's a bunch of libraries, so you can pack your jar with your dependencies and send it to YARN. However there are some additional, small tasks to be done.
In the following link, dlb8 provides a list of tasks to be done to run Spark 2.0 in an installation with a previous version. Just change version/paths accordingly.
Find the version of CDH and Hadoop running on your cluster using
$ hadoop version
Hadoop 2.6.0-cdh5.4.8
Download Spark and extract the sources. Pre built Spark binaries should work out of the box with most CDH versions, unless there are custom fixes in your CDH build in which case you can use the spark-2.0.0-bin-without-hadoop.tgz.
(Optional) You can also build Spark by opening the distribution directory in the shell and running the following command using the CDH and Hadoop version from step 1
$ ./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
Note: With Spark 2.0 the default build uses Scala version 2.11. If you need to stick to Scala 2.10, use the -Dscala-2.10 property or
$ ./dev/change-scala-version.sh 2.10
Note that -Phadoop-provided enables the profile to build the assembly without including Hadoop-ecosystem dependencies provided by Cloudera.
Extract the tgz file.
$tar -xvzf /path/to/spark-2.0.0-bin-hadoop2.6.tgz
cd into the custom Spark distribution and configure the custom Spark distribution by copying the configuration from your current Spark version
$ cp -R /etc/spark/conf/* conf/
$ cp /etc/hive/conf/hive-site.xml conf/
Change SPARK_HOME to point to folder with the Spark 2.0 distribution
$ sed -i "s#\(.*SPARK_HOME\)=.*#\1=$(pwd)#" conf/spark-env.sh
Change spark.master to yarn from yarn-client in spark-defaults.conf
$ sed -i 's/spark.master=yarn-client/spark.master=yarn/' conf/spark-
defaults.conf
Delete spark.yarn.jar from spark-defaults.conf
$ sed '-i /spark.yarn.jar/d' conf/spark-defaults.conf
Finally test your new Spark installation:
$ ./bin/run-example SparkPi 10 --master yarn
$ ./bin/spark-shell --master yarn
$ ./bin/pyspark
Update log4j.properties to suppress annoying warnings. Add the following to conf/log4j.properties
echo "log4j.logger.org.spark_project.jetty=ERROR" >> conf/log4j.properties
However, it can be adapted to the opposite, since the bottom line is "to use a Spark version on an installation with a different version".
It's even simpler if you don't have to deal with 1.x - 2.x version changes, because you don't need to pay attention to the change of scala version and of the assembly approach.
I tested it in a CDH5.4 installation to set 1.6.3 and it worked fine. I did it with the "spark.yarn.jars" option:
#### set "spark.yarn.jars"
$ cd $SPARK_HOME
$ hadoop fs mkdir spark-2.0.0-bin-hadoop
$ hadoop fs -copyFromLocal jars/* spark-2.0.0-bin-hadoop
$ echo "spark.yarn.jars=hdfs:///nameservice1/user/<yourusername>/spark-2.0.0-bin-hadoop/*" >> conf/spark-defaults.conf

Installing hadoop on Centos 7 but command is not working

I am trying to install cluster hadoop on centos7. But command is not responding , I manually download the hadoop from this link on windows and then copy on VMware and installing using this command but this is not working.
What if you run:
tar xvzf /home/hadoop.2.7.0.tar.gz
Also could you please run and paste results for:
du /home/hadoop.2.7.0.tar.gz
tar tvzf /home/hadoop.2.7.0.tar.gz
md5sum /home/hadoop.2.7.0.tar.gz
You can find distribution with checksums here

Hadoop installation on Ubuntu

Can anybody provide me with the commands to install Hadoop 2.2 on Ubuntu 14.04?
I have checked various sites but they all seem to have different procedures.
I successfully installed it using this step-by-step guide (in my case it was on 14.10, but I doubt there will be any difference)
Here is a important part:
wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-
2.2..tar.gz
tar –xvzf hadoop-2.2.0.tar.gz
mv hadoop-2.2.0 hadoop
sudo mv hadoop /usr/local/
sudo chown -R hduser:hadoop Hadoop
You can further configure it for your convenience, choose interface etc.
Hope this helps

Hadoop showing old version despite latest version installation

I am trying to install hadoop in my ubuntu OS. I followed each and every step exactly from this link Hadoop Install Tutorial and everything was going as expected until i tried to run
$ start-dfs.sh and $ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5 command. These commands doesn't work as expected.I tried R&D and somehow came to know that i was using older hadoop version Hadoop 1.0.2 despite of me getting latest 2.2.0 version.
As i could not solve this, i tried to uninstall hadoop completely, Now when i try doing it, it says
$ sudo dpkg -r hadoop
dpkg: dependency problems prevent removal of hadoop:
hadoop-native depends on hadoop (= 1.0.2-0ubuntu1~hadoop1).
dpkg: error processing hadoop (--remove):
dependency problems - not removing
Errors were encountered while processing:
hadoop
Appreciate any help !
I dont know whether its a proper way to remove hadoop or not, but i have removed it using below method.
I first manually deleted the /usr/local/hadoop folder from all the users(If any).If you are not able to remove it due to lack of permissions, then make sure about the permissions of the folder. Make the permission of the folder to "Sudo" and on "Creating and deleting files" so that every user can delete from their instances.
Then from Terminal $ rm -r hadoop does the job going to the /usr/local path.
After this, i checked $ hadoop version again in terminal ..and boom it again showed its existence. Then i did below step.
2.Goto terminal sudo apt-get purge hadoop or sudo apt-get remove hadoop...then it worked

Resources