How does Pig automatically connect to the default HDFS? - hadoop

I started learning Hadoop and Pig two days ago for an upcoming project.
To experiment, I installed Hadoop in pseudo-distributed mode (HDFS on the default localhost:9000) and Pig (MapReduce mode).
When I started Pig with ./bin/pig, it launched the Grunt shell and printed a message saying that it had connected to HDFS (localhost:9000). After that I was able to access HDFS through Pig.
I was expecting to have to configure Pig manually to access HDFS, as various articles on the internet describe.
My question is: where did Pig get the default HDFS configuration (localhost:9000) from? I checked pig.properties but found nothing there. I need to know this because I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.

When installing Pig (I assume v0.10.0), you have to tell it how to connect to HDFS.
I don't know how you did this, but it is generally done by adding the Hadoop conf directory to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory containing the Hadoop configuration XML files and reads the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and the JobTracker.
For reference, have a look at the Pig shell script to see how the environment variables are collected and evaluated.
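The lookup described above can be sketched in shell. This is only an illustration, not the contents of bin/pig: the function names resolve_hadoop_conf_dir and read_hadoop_property are hypothetical, and the precedence (HADOOP_CONF_DIR first, then directories under HADOOP_HOME) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: approximates how a launcher script might pick the Hadoop
# configuration directory. The precedence is an assumption, not copied
# from bin/pig.
resolve_hadoop_conf_dir() {
  if [ -n "${HADOOP_CONF_DIR:-}" ]; then
    printf '%s\n' "$HADOOP_CONF_DIR"
  elif [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME/etc/hadoop" ]; then
    printf '%s\n' "$HADOOP_HOME/etc/hadoop"   # Hadoop 2.x layout
  elif [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME/conf" ]; then
    printf '%s\n' "$HADOOP_HOME/conf"         # Hadoop 1.x layout
  else
    return 1
  fi
}

# Print the <value> of the named property from a Hadoop XML file.
# Assumes each <name> and <value> element fits on a single line.
read_hadoop_property() {
  local file=$1 name=$2
  awk -v want="$name" '
    /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
    /<value>/ { if (n == want) {
                  v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                  print v; exit } }
  ' "$file"
}

# Usage (not executed here):
#   conf_dir=$(resolve_hadoop_conf_dir) &&
#     read_hadoop_property "$conf_dir/core-site.xml" fs.default.name
```

On a setup like the one in the question, the usage line would be expected to print the HDFS address from core-site.xml, which is the value to change if the default HDFS configuration changes.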

Pig can connect to the underlying HDFS in three ways:
1. Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set in your .bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2. Alternatively, HADOOP_CONF_DIR may already be set. It points to the directory that contains the Hadoop configuration XML files:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3. If neither is set, you can still connect to the underlying HDFS by editing pig.properties, which is located in the PIG_HOME/conf directory.

Related

How can Hive access a Hadoop setup installed under a different user?

Suppose I install Hadoop as the 'hadoop' user and Hive as the 'hive' user on the same node (pseudo-distributed mode).
How can Hive access Hadoop?
When I run 'hive --version', I get this error:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path.
I think the problem is that the hive user has no right to access Hadoop, but I don't know how to fix it.
Thanks a lot.
As the error says, $HADOOP_HOME or $HADOOP_PREFIX must be set, or hadoop must be in the path.
So edit /home/hive/.bash_profile (assuming you're on Linux) and set one of those environment variables to the location of the downloaded Hadoop package.
For example:
export HADOOP_HOME=/opt/hadoop # example
export PATH=$HADOOP_HOME/bin:$PATH
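To check the result as the hive user, a small lookup along the lines of the error message can help. The sketch below is an approximation, not the logic of the real bin/hive script, and find_hadoop is a hypothetical name. It also tests that the hadoop binary is executable by the current user, which covers the permissions concern raised in the question.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the three places named in the error message
# (HADOOP_HOME, HADOOP_PREFIX, PATH). The real bin/hive may differ in detail.
find_hadoop() {
  local home="${HADOOP_HOME:-${HADOOP_PREFIX:-}}"
  if [ -n "$home" ] && [ -x "$home/bin/hadoop" ]; then
    # Found via environment variable, and executable by the current user.
    printf '%s\n' "$home/bin/hadoop"
  elif command -v hadoop >/dev/null 2>&1; then
    # Found on PATH.
    command -v hadoop
  else
    echo "Cannot find hadoop installation" >&2
    return 1
  fi
}

# Usage (not executed here), run as the hive user:
#   find_hadoop && echo "hadoop is visible to $(id -un)"
```

If the function fails even though the variable is set, the hive user most likely lacks read or execute permission on the Hadoop installation directory.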

Hive installation

After installing hive-3.2.1 on Hadoop-3.3.0 on Ubuntu, we start the Hive services. I am not sure how Hive identifies the Hadoop services, since we don't give it anything related to Hadoop during the Hive setup process. Does Hive identify Hadoop by means of the HADOOP_HOME environment variable defined in the .bashrc file?
Can someone please confirm my understanding?
Thanks!
Yes. Hive uses HADOOP_HOME to find the Hadoop installation and its configuration directory (conf in Hadoop 1.x, etc/hadoop in later versions), which is how it discovers the cluster. HADOOP_HOME can also be set in hive-env.sh.

Why does Hive search for its configuration files in HADOOP_CONF_DIR first?

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive uses that copy instead of the one in $HIVE_HOME/conf. It also searches for hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If it is not found there, Hive uses the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf. Why?
I searched the official Hive manual on apache.org for "copy hive-site.xml to HADOOP_HOME" but could not find any explanation.
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
You've mentioned Sqoop, so I'll point out the proper ways of getting the Hive XML configuration picked up.
1) If the file isn't found, that is a classpath problem. Copying the file is one solution, but a poor one; a symlink is preferred.
Every time I've used Sqoop, I never had to manage any XML files myself; it just worked. Therefore, both HDP and CDH must set up the proper classpath and/or symlinks.
2) The documentation states where configurations are loaded from:
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from.
3) You can also supply extra files at runtime:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
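The symlink approach from point 1 could look like the following sketch. The function name link_hive_site is hypothetical, and both directories are passed in by the caller rather than assumed.

```shell
#!/usr/bin/env bash
# Sketch only: link hive-site.xml into a Hadoop configuration directory
# instead of copying it, so there is a single file to maintain.
link_hive_site() {
  local hive_conf_dir=$1 hadoop_conf_dir=$2
  if [ ! -f "$hive_conf_dir/hive-site.xml" ]; then
    echo "no hive-site.xml in $hive_conf_dir" >&2
    return 1
  fi
  # -s creates a symbolic link; -f replaces an existing copy or stale link.
  ln -sf "$hive_conf_dir/hive-site.xml" "$hadoop_conf_dir/hive-site.xml"
}

# Usage (not executed here):
#   link_hive_site "$HIVE_HOME/conf" "$HADOOP_CONF_DIR"
```

With a link in place, edits to the file in $HIVE_HOME/conf are seen by anything that reads the Hadoop configuration directory, which avoids the two copies drifting apart.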

What should be the correct flow of data in Hadoop and Mahout?

I am working with Hadoop, Hive and Mahout.
I am processing some data with a MapReduce job in Hadoop, to be used for recommendations in Mahout.
I want to know the correct workflow for this setup. Once Hadoop has processed the data and stored it in HDFS, how does Mahout get and use this data? And after Mahout has processed it, where does Mahout put the recommendation output?
Note: I am working with Hadoop to process the data, and my colleague is working with Mahout on a different machine.
I hope my question is clear.
If you want Mahout to take its input from HDFS, follow these steps.
First, copy the input file to HDFS with the command:
hadoop fs -copyFromLocal input /
Then run the Mahout recommendation command, which takes its input from HDFS and saves its output in HDFS.
Assuming your JAVA_HOME is appropriately set and Mahout was installed properly we’re ready to configure our syntax. Enter the following command:
$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/inputfile -o hdfs://localhost:9000/output --numRecommendations 25
Running the command will execute a series of jobs the final product of which will be an output file deposited to the directory specified in the command syntax. The output file will contain two columns: the userID and an array of itemIDs and scores.
It all depends on how Mahout is configured to run. Mahout can run in local mode or distributed mode, and the MAHOUT_LOCAL variable controls which one is used. It is described as:
MAHOUT_LOCAL set to anything other than an empty string to force mahout to run locally even if HADOOP_CONF_DIR and HADOOP_HOME are set
For example, if you don't set MAHOUT_LOCAL and run any Mahout algorithm, you will see the following on the console:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop,
When running in distributed mode, Mahout treats all paths as HDFS paths. So after Mahout has processed your data, the final output is also stored in HDFS.
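The mode decision can be sketched as a small shell function. This is an approximation based on the description above, not the code of the Mahout launcher. The name mahout_run_mode is hypothetical, and falling back to local mode when no Hadoop variable is set is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether Mahout would run locally or on Hadoop.
mahout_run_mode() {
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    # Any non-empty value forces local mode, even with Hadoop configured.
    echo local
  elif [ -n "${HADOOP_HOME:-}" ] || [ -n "${HADOOP_CONF_DIR:-}" ]; then
    # Hadoop is configured: paths are treated as HDFS paths.
    echo hadoop
  else
    # Assumption: with no Hadoop settings at all, fall back to local mode.
    echo local
  fi
}

# Usage (not executed here):
#   echo "Mahout would run in $(mahout_run_mode) mode"
```

For the two-machine setup in the question, the Mahout machine would need MAHOUT_LOCAL unset and its Hadoop configuration pointing at the same cluster, so that both sides read and write the same HDFS.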

Configuring Pig's relation with Hadoop

I'm having trouble understanding the relation between Hadoop and Pig.
I understand that Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.
What I don't understand is how Hadoop and Pig are linked. So far, the installation procedures I have found all seem to assume that Pig runs on the same machine as the main Hadoop node.
Indeed, it uses the Hadoop configuration files.
Is this because Pig only translates the scripts into MapReduce code and sends it to Hadoop?
If that's the case, how can I configure Pig to send the scripts to a remote server?
If not, does it mean we always need to have Hadoop running alongside Pig?
Pig can run in two modes:
Local mode. In this mode a Hadoop cluster is not used at all. All processes run in a single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:
pig -x local
MapReduce mode. In this mode Pig converts scripts to MapReduce jobs and runs them on a Hadoop cluster. It is the default mode.
The cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find the Hadoop installation on the local machine (see Installing Pig).
To connect to a remote cluster, specify the cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify the remote cluster address on the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021
Hence, you can install Pig on any machine and connect to a remote cluster. Pig includes a Hadoop client, so you don't have to install Hadoop to use Pig.
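If you switch between clusters, the two MRv1 properties above can be generated rather than typed by hand. This is a minimal sketch: pig_remote_properties is a hypothetical helper, and the host:port values are whatever the caller passes in.

```shell
#!/usr/bin/env bash
# Sketch only: print the two MRv1 properties for a remote cluster.
# $1 = NameNode host:port, $2 = JobTracker host:port
pig_remote_properties() {
  local namenode=$1 jobtracker=$2
  printf 'fs.default.name=hdfs://%s/\n' "$namenode"
  printf 'mapred.job.tracker=%s\n' "$jobtracker"
}

# Usage (not executed here); appends to the file Pig reads at startup:
#   pig_remote_properties namenode_address:8020 jobtracker_address:8021 \
#     >> "$PIG_HOME/conf/pig.properties"
```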

Resources