Spark installed but no command 'hdfs' or 'hadoop' found - hadoop

I am a new pyspark user.
I just downloaded and installed a spark cluster ("spark-2.0.2-bin-hadoop2.7.tgz")
after installation I wanted to access the file system (upload local files to cluster). But when I tried to type hadoop or hdfs in command it will say "no command found".
Am I gonna install hadoop/HDFS (I thought it's built in the spark, I don't get)?
Thanks in advance.

You have to install hadoop first to access HDFS.
Follow this http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Choose the latest version of hadoop from the apache site.
Once you done with hadoop setup go to spark http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz download this, Extract files. Setup java_home and hadoop_home in spark-env.sh.

You don't have hdfs or hadoop on classpath so this is the reason why you are getting message: "no command found".
If you run \yourparh\hadoop-2.7.1\bin\hdfs dfs -ls / it should works and show root content.
But, You can add your hadoop/bin (hdfs, hadoop ...) commands to classpath with something like this:
export PATH $PATH:$HADOOP_HOME/bin
where HADOOP_HOME is your env. variable with path to hadoop installation folder (download and install is required)

Related

spark-shell throws error in Apache spark

I have installed hadoop on ubuntu on virtual box(host os Windows 7).I have also installed Apache spark, configured SPARK_HOME in .bashrc and added HADOOP_CONF_DIR to spark-env.sh. Now when I start the spark-shell it throws error and do not initialize spark context, sql context. Am I missing something in installation and also I would want to run it on a cluster (hadoop 3 node cluster is set up).
I have the same issue when trying to install Spark local with Windows 7. Please make sure the below paths is correct and I am sure I will work with you. I answer same question in this link So, you can follow the below and it will work.
Create JAVA_HOME variable: C:\Program Files\Java\jdk1.8.0_181\bin
Add the following part to your path: ;%JAVA_HOME%\bin
Create SPARK_HOME variable: C:\spark-2.3.0-bin-hadoop2.7\bin
Add the following part to your path: ;%SPARK_HOME%\bin
The most important part Hadoop path should include bin file before winutils.ee as the following: C:\Hadoop\bin Sure you will locate winutils.exe inside this path.
Create HADOOP_HOME Variable: C:\Hadoop
Add the following part to your path: ;%HADOOP_HOME%\bin
Now you can run the cmd and write spark-shell it will work.

Create hdfs when using integrated spark build

I'm working with Windows and trying to set up Spark.
Previously I installed Hadoop in addition to Spark, edited the config files, run the hadoop namenode -format and away we went.
I'm now trying to achieve the same by using the bundled version of Spark that is pre built with hadoop - spark-1.6.1-bin-hadoop2.6.tgz
So far it's been a much cleaner, simpler process however I no longer have access to the command that creates the hdfs, the config files for the hdfs are no longer present and I've no 'hadoop' in any of the bin folders.
There wasn't an Hadoop folder in the spark install, I created one for the purpose of winutils.exe.
It feels like I've missed something. Do the pre-built versions of spark not include hadoop? Is this functionality missing from this variant or is there something else that I'm overlooking?
Thanks for any help.
By saying that Spark is built with Hadoop, it is meant that Spark is built with the dependencies of Hadoop, i.e. with the clients for accessing Hadoop (or HDFS, to be more precise).
Thus, if you use a version of Spark which is built for Hadoop 2.6 you will be able to access HDFS filesystem of a cluster with the version 2.6 of Hadoop via Spark.
It doesn't mean that Hadoop is part of the pakage and downloading it Hadoop is installed as well. You have to install Hadoop separately.
If you download a Spark release without Hadoop support, you'll need to include the Hadoop client libraries in all the applications you write wiƬhich are supposed to access HDFS (by a textFile for instance).
I am also using same spark in my windows 10. What I have done create C:\winutils\bin directory and put winutils.exe there. Than create HADOOP_HOME=C:\winutils variable. If you have set all
env variables and PATH like SPARK_HOME,HADOOP_HOME etc than it should work.

Spark: Run Spark shell from a different directory than where Spark is installed on slaves and master

I have a small cluster (4 machines) set up with 3 slaves and a master node, all installed to /home/spark/spark. (I.e, $SPARK_HOME is /home/spark/spark)
When I use the spark shell: /home/spark/spark/bin/pyspark --master spark://192.168.0.11:7077 everything works fine. However I'd like for my colleagues to be able to connect to the cluster from a local instance of spark on their machine installed in whatever directory they wish.
Currently if somebody has spark installed in say /home/user12/spark and run /home/user12/spark/bin/pyspark --master spark://192.168.0.11:7077 the spark shell will connect to the master without problems but fails with an error when I try to run code:
class java.io.IOException: Cannot run program
"/home/user12/bin/compute-classpath.sh"
(in directory "."): error=2, No such file or directory)
The problem here is that Spark is looking for the spark installation in /home/user12/spark/, where as I'd like to just tell spark to look in /home/spark/spark/ instead.
How do I do this?
You need to edit three files, spark-submit, spark-class and pyspark (all in the bin folder).
Find the line
export SPARK_HOME = [...]
Then change it to
SPARK_HOME = [...]
Finally make sure you set SPARK_HOME to the directory where spark is installed on the cluster.
This works for me.
Here you can find a detailed explanation.
http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot-find-compute-classpath-sh-td859.html

Hive failed to create /user/hive/warehouse

I just get started on Apache Hive, and I am using my local Ubuntu box 12.04, with Hive 0.10.0 and Hadoop 1.1.2.
Following the official "Getting Started" guide on Apache website, I am now stuck at the Hadoop command to create the hive metastore with the command in the guide:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
the error was mkdir: failed to create /user/hive/warehouse
Does Hive require hadoop in a specific mode? I know I didn't have to do much to my Hadoop installation other that update JAVA_HOME so it is in standalone mode. I am sure Hadoop itself is working since I am run the PI example that comes with hadoop installation.
Also, the other command to create /tmp shows the /tmp directory already exists so it didn't recreate, and /bin/hadoop fs -ls is listing the current directory.
So, how can I get around it?
Almost all examples of the documentation have this command wrong. Just like unix you will need the "-p" flag to create the parent directories as well unless you have already created them. This command will work.
$HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
When running hive on local system, just add to ~/.hiverc:
SET hive.metastore.warehouse.dir=${env:HOME}/Documents/hive-warehouse;
You can specify any folder to use as a warehouse. Obviously, any other hive configuration method will do (hive-site.xml or hive -hiveconf, for example).
That's possibly what Ambarish Hazarnis kept in mind when saying "or Create the warehouse in your home directory".
This seems like a permission issue. Do you have access to root folder / ?
Try the following options-
1. Run command as superuser
OR
2.Create the warehouse in your home directory.
Let us know if this helps. Good luck!
When setting hadoop properties in the spark configuration, prefix them with spark.hadoop.
Therefore set
conf.set("spark.hadoop.hive.metastore.warehouse.dir","/new/location")
This works for older versions of Spark. The property has changed in spark 2.0.0
Adding answer for ref to Cloudera CDH users who are seeing this same issue.
If you are using Cloudera CDH distribution, make sure you have followed these steps:
launched Cloudera Manager (Express / Enterprise) by clicking on the desktop icon.
Open Cloudera Manager page in browser
Start all services
Cloudera has /user/hive/warehouse folder created by default. Its just that YARN and HDFS might not be up and running to access this path.
While this is a simple permission issue that was resolved with sudo in my comment above, there are a couple of notes:
create it in home directory should work as well, but then you may need to update hive setting for the path of metastore, which I think defaults to /user/hive/warehouse
I ran into another error of CREATE TABLE statement with Hive shell, the error was something like this:
hive> CREATE TABLE pokes (foo INT, bar STRING);
FAILED: Error in metadata: MetaException(message:Got exception: java.io.FileNotFoundException File file:/user/hive/warehouse/pokes does not exist.)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
It turns to be another permission issue, you have to create a group called "hive" and then add the current user to that group and change ownership of /user/hive/warehouse to that group. After that, it works. Details can be found from this link below:
http://mail-archives.apache.org/mod_mbox/hive-user/201104.mbox/%3CBANLkTinq4XWjEawu6zGeyZPfDurQf+j8Bw#mail.gmail.com%3E
if you r running linux check (in hadoop core-site.xml ) data directory & permission, it looks like you ve kept the default which is /data/tmp and im most cases that will take root permission ..
change the xml config file , delete /data/tmp and run fs format (OC after you ve modified the core xml config)
I recommend using upper versions of hive i.e. 1.1.0 version, 0.10.0 is very buggy.
Run this command and try to create a directory it would grant full permission for the user in hdfs /user directory.
hadoop fs -chmod -R 755 /user
I am using MacOS and homebrew as package manager. I had to set the property in hive-site.xml as
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/local/Cellar/hive/2.3.1/libexec/conf/warehouse</value>
</property>

HADOOP_HOME and hadoop streaming

Hi I am trying to run hadoop on a server that has hadoop installed but I have no idea the directory where hadoop resides. The server was configure by the server admin.
In order to load hadoop I use the use command from the dotkit package.
There may be several solutions but wanted to know where the hadoop package was installed, how to set up the $HADOOP_HOME variable, and how to approp run a hadoop streaming job, such as $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar, aka, http://wiki.apache.org/hadoop/HadoopStreaming.
Thanks! any help would be greatly appreciated!
If you're using a cloudera distribution then it's most probably in /usr/lib/hadoop, otherwise it could be anywhere (at the discretion of your system admin).
There are some tricks you can use to try and locate it:
locate hadoop-env.sh (assuming that locate has been installed and updatedb has been run recently)
If the machine you're running this on is running a hadoop service (such as data node, job tracker, task tracker, name node), then you can perform a process list and grep for the hadoop command: ps axww | grep hadoop
Failing the above two, look for the hadoop root directory in some common locations such as: /usr/lib, /usr/local, /opt
Failing all this, and assuming your current user has the permissions: find / -name hadoop-env.sh
If you're install with rpm then it's most probably in /etc/hadoop.
Why don't you try:
echo $HADOOP_HOME
Obiviously the above env variable has to be set before you could even issue hadoop executables from anywhere on the box.

Resources