Configuring Pig's relation with Hadoop

I'm having trouble understanding the relationship between Hadoop and Pig.
I understand Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.
What I don't understand is how Hadoop and Pig are linked. So far, all the installation procedures I have found seem to assume that Pig runs on the same machine as the main Hadoop node.
Indeed, it uses the Hadoop configuration files.
Is this because Pig only translates the scripts into MapReduce jobs and sends them to Hadoop?
If that's the case, how can I configure Pig to send the scripts to a remote server?
If not, does it mean we always need to have Hadoop running alongside Pig?

Pig can run in two modes:
Local mode. In this mode the Hadoop cluster is not used at all. All processes run in a single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:
pig -x local
MapReduce mode. In this mode Pig converts scripts into MapReduce jobs and runs them on a Hadoop cluster. It is the default mode.
The cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find the Hadoop installation on the local machine (see Installing Pig).
If you want to connect to a remote cluster, you should specify the cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
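If your cluster runs YARN (MRv2) instead, the equivalent pig.properties entries would look roughly like this (a sketch using the standard Hadoop 2.x property names; the host names and ports are placeholders):
fs.defaultFS=hdfs://namenode_address:8020/
mapreduce.framework.name=yarn
yarn.resourcemanager.address=resourcemanager_address:8032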
You can also specify the remote cluster address on the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021
Hence, you can install Pig on any machine and connect to a remote cluster. Pig includes a Hadoop client, so you don't have to install Hadoop to use Pig.

Related

Types of clusters in Hadoop

How can I differentiate between Hadoop standalone mode and pseudo-distributed mode? Can anyone explain the difference between running all Hadoop daemons as a single Java process versus as separate Java processes?
Hadoop standalone mode means running Hadoop commands without starting any Hadoop daemons, i.e. against the local file system.
Pseudo-distributed mode means running all the Hadoop daemons on a single machine.
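To make the difference concrete, here is a rough sketch (assuming a standard Hadoop 2.x layout and that the Hadoop scripts are on your PATH; the example jar path and directories are illustrative):
# Standalone mode: no daemons are started; input and output are plain local directories
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount ./input ./output
# Pseudo-distributed mode: start the daemons first; input and output live in HDFS
start-dfs.sh
start-yarn.sh
hdfs dfs -put ./input /input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output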

Can Sqoop run without Hadoop?

Just wondering, can Sqoop run without a Hadoop cluster, sort of in a standalone mode? Has anyone tried to run Sqoop on Spark? Please share some experiences with it.
To run Sqoop commands (both Sqoop 1 and Sqoop 2), Hadoop is a mandatory prerequisite. You cannot run Sqoop commands without the Hadoop libraries.
Sqoop also works in local mode, so it is not a requirement that the Hadoop daemons be running. To run Sqoop in local mode:
sqoop [tool-name] -fs local -jt local [tool-arguments]
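For example, a local-mode import might look like this (a sketch; the JDBC URL, credentials, table and target directory are all placeholders):
sqoop import -fs local -jt local --connect jdbc:mysql://dbhost/sales --username reporter -P --table orders --target-dir /tmp/orders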
Sqoop on Spark is still in progress; see SQOOP-1532.

Run Pig in local mode from Oozie

I want to run Pig in local mode, which is very easy:
pig -x local file.pig
My requirement is to run Pig in local mode from Oozie.
Is that possible? I think Oozie will automatically launch a map task first.
It's possible. When a Pig script is run by Oozie, it runs as a one-map MapReduce job that only launches the Pig script, which in turn launches other MapReduce jobs (when Pig runs in MapReduce mode).
It seems that the Pig action configuration doesn't allow running in local mode, but you can still run a Pig script in local mode using the shell action type. You only have to make sure that your script, input and output data are in HDFS.
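A rough sketch of such a shell action is shown below (the action name, workflow properties and file names are illustrative; run_pig_local.sh would simply call pig -x local file.pig, and the ok/error transitions must point to nodes that exist in your workflow):
<action name="pig-local">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_pig_local.sh</exec>
        <file>${appPath}/run_pig_local.sh</file>
        <file>${appPath}/file.pig</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>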
I don't think we can run Pig in local mode from Oozie; the comment Vishal wrote makes sense. In some cases, where there is a smaller amount of data, it's better to run Pig in local mode. To run in local mode, you can write a shell script and schedule it with crontab. To my knowledge, doing this through Oozie won't suit you well, because Oozie is meant to work with HDFS.
If you want Oozie to run on some data, it expects that data to be in HDFS (i.e. distributed), and you must have the Pig script in HDFS as well. I remember seeing a post from Alan Gates where he mentioned that Pig is designed to process data from/to HDFS, while Hive is for local-to-HDFS or HDFS-to-HDFS.

What should be the correct flow of data in Hadoop and Mahout?

I am working with the Hadoop, Hive and Mahout technologies.
I am processing some data with a MapReduce job in Hadoop for recommendation purposes in Mahout.
I want to know the correct workflow of the above model, i.e. once Hadoop has processed the data and stored it in HDFS, how will Mahout get and use this data, and after Mahout processes the data, where will it put the recommendation output?
Note: I am working with Hadoop for processing the data and my colleague is working with Mahout on a different machine.
I hope you understand my question correctly.
If you want Mahout to take its input from HDFS, you have to do the following steps.
First, copy the input file to HDFS with the command:
hadoop dfs -copyFromLocal input /
Then run the Mahout recommendation command, which takes its input from HDFS and saves its output to HDFS.
Assuming your JAVA_HOME is set appropriately and Mahout was installed properly, enter the following command:
$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/inputfile -o hdfs://localhost:9000/output --numRecommendations 25
Running the command executes a series of jobs, the final product of which is an output file deposited in the directory specified in the command. The output file contains two columns: the userID and an array of itemIDs with scores.
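For illustration, a single line of that output might look like this (the userID and itemID:score pairs are made up):
3	[103:4.5,102:4.0,106:3.5,105:3.0]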
It all depends on how Mahout is configured to run. Mahout can run in local mode or distributed mode. We need to set the "MAHOUT_LOCAL" variable.
MAHOUT_LOCAL: set to anything other than an empty string to force Mahout to run locally, even if HADOOP_CONF_DIR and HADOOP_HOME are set.
For example, if we don't configure MAHOUT_LOCAL and try to execute any Mahout algorithm, you will see the following in the console:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop,
When running in distributed mode, Mahout treats all paths as HDFS paths, so even after Mahout has processed your data, the final output will be stored in HDFS.
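For example, to force local execution regardless of the Hadoop variables, you could set MAHOUT_LOCAL before running the job (a sketch; the input and output paths are placeholders and now refer to the local filesystem):
export MAHOUT_LOCAL=true
mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input.csv -o /path/to/output --numRecommendations 25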

Pig automatically connected to the default HDFS - how?

I just started learning Hadoop and Pig (in the last two days!) for one of my future projects.
For experiments I've installed Hadoop (HDFS on the default localhost:9000) in pseudo-distributed mode and Pig (MapReduce mode).
When I started Pig by typing the ./bin/pig command, it launched the Grunt command line, and I got a message that Pig had connected to HDFS (localhost:9000); afterwards I was able to access HDFS through Pig.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is: where did Pig pick up the default HDFS configuration (localhost:9000)? I checked pig.properties but didn't find anything there. I need this information as I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how it will connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf directory path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory of the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and JobTracker.
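In your case, the localhost:9000 value almost certainly comes from a core-site.xml entry along these lines (a minimal sketch of a pseudo-distributed setup):
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>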
For reference, you may have a look at the Pig shell script to see how the environment variables are collected and evaluated.
Pig can connect to the underlying HDFS in 3 ways:
1. Pig uses HADOOP_HOME to find the Hadoop client to run. Your HADOOP_HOME should already be set up in your .bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2. Otherwise, your HADOOP_CONF_DIR may already be set up, pointing to the directory that contains the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3. And if neither of these is set up, you can also connect to the underlying HDFS by changing pig.properties, which is present under the PIG_HOME/conf directory.
