Running a Pig script with a specific MapReduce job - hadoop

I have written my first MapReduce job in Java and my first Pig script.
How do I tell the Pig engine which MapReduce job to use?

Related

Hive: outputting map-reduce counters

Since HQL (Hive SQL) uses MapReduce underneath, how can I output the job's counter info (to the console or to a log file), such as the counter info a MapReduce job prints by default when it finishes?
I use counters in my custom Hive UDTF to log debug info, for example:
context.getReporter().getCounter("MyCounter","Forward_Count").increment(1);
An example screenshot of the counter info I mentioned above:
[map-reduce counter example pic]
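For context, a minimal sketch of the kind of UDTF counter usage described above (this is not the asker's code; it assumes a Hive version where GenericUDTF exposes configure(MapredContext), and the class name and output column are made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.mapred.Reporter;

public class ForwardCountUDTF extends GenericUDTF {
    private Reporter reporter;

    @Override
    public void configure(MapredContext context) {
        // The reporter is only available when the UDTF runs inside an actual map/reduce task.
        reporter = context.getReporter();
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        // Declare a single string output column named "value".
        List<String> names = new ArrayList<String>();
        List<ObjectInspector> ois = new ArrayList<ObjectInspector>();
        names.add("value");
        ois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(names, ois);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (reporter != null) {
            // Shows up under group "MyCounter" in the job's counter listing.
            reporter.getCounter("MyCounter", "Forward_Count").increment(1);
        }
        forward(new Object[] { String.valueOf(args[0]) });
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up.
    }
}

Once the query has run, the custom counter appears in the job's counter table alongside the built-in MapReduce counters.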

How does Pig pass data from the first job to the next job?

Apache Pig is a data flow language. If I write a Pig script and Pig decides to split it into two or more jobs to execute the task at hand, how does Pig store the data it passes from job 1 to job 2?
I read the Pig documentation, which says:
"Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner."
(url : http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)
So does Pig have a writer that stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and a corresponding reader that reads the data directly from memory to pass it to the next job for processing?
In MapReduce, we write all the data to disk and then read it back again before the next job starts.
Does Pig have an advantage here by implementing readers and writers that write to RAM (spilling to disk if required) and read from RAM (and disk if required) for better performance?
Please share your views on the quoted passage from the Pig documentation and what it actually means.
Thanks in Advance,
Cheers :))
If a Pig script produces multiple jobs, the output of each job is written to a temporary folder in HDFS, defined by pig.temp.dir (the default is /tmp). See the Pig docs, "Storing Intermediate Results". While a script is running you can do hadoop fs -ls /tmp/pig* to see these folders; when jobs are interrupted they are sometimes not cleaned up and have to be removed manually.
The spilling of bags refers to what happens within the mapper stage; there is no RAM-based communication between MR jobs.
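As a side note, pig.temp.dir can be overridden per script. A minimal sketch using the embedded PigServer API (the paths and aliases here are made up):

import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TempDirExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Intermediate results handed between the script's MR jobs land here
        // instead of the default /tmp; Pig removes them on normal completion.
        props.setProperty("pig.temp.dir", "/user/me/pig_temp");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        pig.registerQuery("a = LOAD '/data/input' AS (k:chararray, v:int);");
        pig.registerQuery("b = GROUP a BY k;");
        pig.registerQuery("c = FOREACH b GENERATE group, SUM(a.v);");
        pig.store("c", "/data/output");
    }
}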

Hive always creates a MapReduce job

I've been using Hive at work. When I run a select like this:
"Select * from TABLENAME"
Hive executes a MapReduce job, but when I run
"Select * from TABLENAME LIMIT X"
(independently of X), Hive doesn't execute any MapReduce job.
I use Hive 1.2.1, HDP 2.3.0, Hue 2.6.1 and Hadoop 2.7.1.
Any ideas about this?
Thanks!
Select * from table;
requires no map or reduce phase. There is no filter (WHERE clause) or aggregation function here; the query simply reads from HDFS.
This is Hive's essential purpose: it is an abstraction over MapReduce jobs. The Facebook engineers who created it had to write hundreds of MapReduce jobs for ad-hoc analysis, and writing MapReduce jobs by hand is painful, so they abstracted it behind an SQL-like language that gets translated into MapReduce jobs.
The same is true of Pig (from Yahoo).
P.S. Some queries are so simple that they aren't translated into MapReduce jobs at all but are executed locally as a single fetch task (see hive.fetch.task.conversion), as far as I know.

How to see all Hadoop counters when running Pig

I run my Pig script via the command line, and I want to see all Hadoop counters after the run is finished.
I have written a UDF that writes to a Hadoop counter, based on this blog, but I want to test it. When Pig starts I can see logs from the constructor, but later I see no logs.
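For reference, a minimal sketch of the kind of counter-writing UDF being described (not the asker's code; the UDF and counter names are made up):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.tools.pigstats.PigStatusReporter;

public class UpperWithCounter extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // The reporter and counter are only available once the UDF is running
        // inside a map or reduce task, hence the null checks.
        PigStatusReporter reporter = PigStatusReporter.getInstance();
        if (reporter != null) {
            Counter counter = reporter.getCounter("MyUDF", "ProcessedRecords");
            if (counter != null) {
                counter.increment(1);
            }
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

A counter incremented this way only shows up in the counter listing of the MapReduce job that actually executed the UDF.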
Currently all I see are the simple statistics below:
Counters:
Total records written : 3487
Total bytes written : 38078
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 101
Total records proactively spilled: 12464701
A Pig job is actually a MapReduce job, so you can see the status of the job and its complete list of counters on the JobTracker page (if using MR1) or the Application Master page (if using YARN).
A single Pig script may create multiple jobs depending on its complexity. You can query all the counters of each job from the command line by running:
mapred job -status <job-id>
If you know which counter you are interested in, you can retrieve it individually with:
mapred job -counter <job-id> <group-name> <counter-name>
Of course, you need to know the job id(s); these appear in the original Pig output following the line 'Job DAG:'.
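If you would rather pull the counters programmatically than via the mapred CLI, here is a hedged sketch using the MapReduce client API (the job id string is a placeholder; substitute one of the ids printed after 'Job DAG:'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class DumpJobCounters {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        // Placeholder id: replace with a real id from the Pig output.
        Job job = cluster.getJob(JobID.forName("job_201601010000_0001"));
        if (job == null) {
            System.err.println("Job not found");
            return;
        }
        for (CounterGroup group : job.getCounters()) {
            System.out.println(group.getDisplayName());
            for (Counter counter : group) {
                System.out.println("  " + counter.getDisplayName() + " = " + counter.getValue());
            }
        }
    }
}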

How to use MapReduce output as input for another MapReduce job?

In the first MapReduce job I process an HBase table and output a smaller list of row keys. I need to use this list of strings in a second MapReduce job, which reads from a different HBase table and writes to another HBase table. What is the proper way to store and access the output of the first MapReduce job?
Hadoop doesn't support streaming the output of one MR job directly into another. The output of the first MR job has to be stored in HDFS (or some other persistent storage) and then read by the second MR job. Create a DAG of jobs using Oozie or Azkaban; for a simple workflow, use Hadoop's JobControl API.
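A hedged sketch of the JobControl approach mentioned above; the mapper/reducer and HBase setup are elided, and every path and job name is a placeholder:

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The persisted hand-off between the two jobs lives in HDFS.
        Path intermediate = new Path("/tmp/rowkeys");

        Job first = Job.getInstance(conf, "extract-rowkeys");
        // first.setJarByClass(...), mapper, HBase input setup, etc. go here.
        FileInputFormat.addInputPath(first, new Path("/data/source"));
        FileOutputFormat.setOutputPath(first, intermediate);

        Job second = Job.getInstance(conf, "process-rowkeys");
        // second.setJarByClass(...), mapper, HBase output setup, etc. go here.
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, new Path("/data/final-output"));

        // The second job is declared to depend on the first, so JobControl
        // only submits it after the first job's output is on HDFS.
        ControlledJob cFirst = new ControlledJob(first, null);
        ControlledJob cSecond = new ControlledJob(second, Collections.singletonList(cFirst));

        JobControl control = new JobControl("two-stage-flow");
        control.addJob(cFirst);
        control.addJob(cSecond);

        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(2000);
        }
        control.stop();
    }
}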
Apache Tez, which is still in the incubator phase, allows streaming of data across MR tasks; since it is still incubating, use it with a bit of caution.
