Run time environment for Hadoop Pig - hadoop

I am building a Hadoop Pig editor, similar to a SQL editor, where users can write and execute their Pig commands and view the history of executed commands. It has some built-in intelligence as well.
I need to know how to parse a Pig command and run it.
Thanks in advance.
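One possible approach for an editor backend, as a minimal sketch rather than a definitive design: shell out to the pig launcher and use its -check option for a syntax-only pass before actually running the script. The function names build_pig_command and run_pig are hypothetical, and the sketch assumes the pig executable is on the PATH.

```python
import subprocess


def build_pig_command(script_path, exectype="local", check_only=False):
    """Return the argument list for running (or only syntax-checking) a Pig script."""
    cmd = ["pig", "-x", exectype]
    if check_only:
        # -check asks Pig to parse and validate the script without executing it.
        cmd.append("-check")
    cmd.append(script_path)
    return cmd


def run_pig(script_path, exectype="local", check_only=False):
    """Run pig and return (exit_code, stdout, stderr).

    Assumption: `pig` is installed and reachable through PATH.
    """
    result = subprocess.run(
        build_pig_command(script_path, exectype, check_only),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

An editor could call run_pig(path, check_only=True) while the user types, then run_pig(path) on execute, storing each returned tuple to build the command history.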

Related

Run PIG in local mode from oozie

I want to run Pig in local mode, which is very easy:
pig -x local file.pig
My requirement is to run Pig in local mode from Oozie.
Is that possible? I think Oozie will automatically launch a map task first.
It's possible. When Oozie runs a Pig script, it launches a single-map MapReduce job whose only task is to run the Pig script; that script in turn launches further MapReduce jobs (when Pig runs in mapreduce mode).
It seems that the Pig action configuration doesn't allow running in local mode, but you can still run a Pig script in local mode using the shell action type. You only have to make sure that your script, input, and output data are in HDFS.
I don't think we can run Pig in local mode from Oozie. The comment Vishal wrote makes sense. In some cases, where there is only a small amount of data, it's better to run Pig in local mode. To do that, you can write a shell script and schedule it with crontab. As far as I know, doing this through Oozie won't work well, because Oozie is meant to work with HDFS.
If you want Oozie to run on some data, it expects that data to be in HDFS (i.e., distributed), and the Pig script must be in HDFS as well. I remember seeing a post from Alan Gates in which he mentioned that Pig is designed to process data from/to HDFS, and Hive is for local to HDFS or HDFS to HDFS.

Renaming part files of PIG output

I have a requirement to change the part file naming convention after running my Pig job. I want part-r-00000 to be userdefinedName-r-00000.
Is there any possible solution for that?
I am avoiding the hadoop fs -cp and hadoop fs -mv commands.
Thanks
These files are created by the MapReduce jobs that Pig generates, so you should configure Hadoop MapReduce. The corresponding property is mapreduce.output.basename.
You can define any Hadoop property directly in your pig script:
SET mapreduce.output.basename 'custom-name';
Starting Pig like this would do the same:
pig -Dmapreduce.job.queuename=my-queue -Dmapreduce.output.basename=my-outputfilename
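Both forms carry the same key/value pairs, so a wrapper that launches Pig could render either one from a single dictionary. A minimal sketch, assuming string-valued properties; the helper names as_set_statements and as_cli_flags are hypothetical and not part of Pig:

```python
def as_set_statements(props):
    """Render Hadoop properties as Pig Latin SET statements (string values quoted)."""
    return [f"SET {key} '{value}';" for key, value in props.items()]


def as_cli_flags(props):
    """Render the same properties as -Dkey=value flags for the pig launcher."""
    return [f"-D{key}={value}" for key, value in props.items()]


props = {"mapreduce.output.basename": "custom-name"}
print("\n".join(as_set_statements(props)))
print(" ".join(["pig"] + as_cli_flags(props)))
```

The SET form travels with the script, which helps when the script is launched by a tool that does not expose command-line options.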

pass pig arguments via hue (multiquery)

I am running a Pig job from Hue.
On my current project I am required to run Pig with
pig -no_multiquery
Where (and how) do I pass this when using Hue? I can't run this job with multiquery enabled.
Alternatively, is there a way to switch off multiquery?
I didn't look hard enough.
SET opt.multiquery false;
Putting this in the Pig script itself seems to run my job as intended.

Configuring pig relation with Hadoop

I'm having trouble understanding the relation between Hadoop and Pig.
I understand that Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.
What I don't understand is how Hadoop and Pig are linked. So far, the installation procedures I have found all seem to assume that Pig runs on the same machine as the main Hadoop node.
Indeed, it uses the Hadoop configuration files.
Is this because Pig only translates the scripts into MapReduce code and sends them to Hadoop?
If that's the case, how can I configure Pig to send the scripts to a remote server?
If not, does it mean we always need to have Hadoop running within Pig?
Pig can run in two modes:
Local mode. In this mode the Hadoop cluster is not used at all. All processes run in a single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:
pig -x local
MapReduce mode. In this mode Pig converts scripts to MapReduce jobs and runs them on a Hadoop cluster. It is the default mode.
The cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find the Hadoop installation on the local machine (see Installing Pig).
If you want to connect to a remote cluster, specify the cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify the remote cluster address on the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021
Hence, you can install Pig on any machine and connect to a remote cluster. Pig includes the Hadoop client, so you don't have to install Hadoop to use Pig.
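To illustrate that the properties file and the command-line options describe the same two addresses, here is a small sketch that renders both from one pair of values. The function names are hypothetical and the addresses are the placeholders used in this answer, not real hosts.

```python
def pig_properties_lines(namenode, jobtracker):
    """Return the MRv1 pig.properties lines that point Pig at a remote cluster."""
    return [
        f"fs.default.name=hdfs://{namenode}/",
        f"mapred.job.tracker={jobtracker}",
    ]


def pig_remote_args(namenode, jobtracker):
    """Return the equivalent command line using the -fs and -jt options."""
    return ["pig", "-fs", namenode, "-jt", jobtracker]


# Placeholder addresses, as in the example above.
namenode = "namenode_address:8020"
jobtracker = "jobtracker_address:8021"
print("\n".join(pig_properties_lines(namenode, jobtracker)))
print(" ".join(pig_remote_args(namenode, jobtracker)))
```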

How to run the hadoop simple program through command line

I'm new to Hadoop technologies. How do I run a simple program from the command line? I'm using a Windows environment and have installed Cygwin. Can you help me?
Try the URLs below.
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
If you are new to Hadoop, try using one of the IDE plugins. This will help you get started quickly.
http://karmasphere.com/Studio-Eclipse/quick-click-guide.html
http://wiki.apache.org/hadoop/EclipsePlugIn
FYI: Hadoop on Windows is not recommended for production.
Is your program written in Java? If so, you need to compile your program and pack the compiled files into a JAR file. Then run the program with the hadoop command:
${hadoop_home}/bin/hadoop jar ${your_program_jar_file} ${main_class_of_jar}
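If you would rather launch the job from a script than type the command, the same invocation can be assembled programmatically. A minimal sketch; the function name hadoop_jar_command and the example paths and class name are hypothetical:

```python
def hadoop_jar_command(hadoop_home, jar_path, main_class, *job_args):
    """Build the argument list for `hadoop jar <jar> <main class> [args...]`."""
    return [f"{hadoop_home}/bin/hadoop", "jar", jar_path, main_class, *job_args]


# Hypothetical paths and class name, for illustration only.
cmd = hadoop_jar_command("/opt/hadoop", "wordcount.jar", "org.example.WordCount", "in", "out")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on a machine where Hadoop is installed.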
You can run Hadoop commands from anywhere in the terminal/command line, but only if the PATH variable is set properly.
The syntax would be like this:
hadoop fs -<command> or hdfs dfs -<command>
You can review the docs for more information.
