How to see all Hadoop counters when running Pig - hadoop

I run my Pig script via the command line, and I want to see all the Hadoop counters after the run finishes.
I have written a UDF that writes to a Hadoop counter, based on this blog, but I want to test it - when the Pig job starts I can see logs from the constructor, but later I see no logs.
Currently all I see are the simple statistics below:
Counters:
Total records written : 3487
Total bytes written : 38078
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 101
Total records proactively spilled: 12464701
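
For context, here is a minimal sketch of the kind of counter-writing UDF I mean - the class name, counter group and counter name are just placeholders, and my real UDF follows the blog post:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.tools.pigstats.PigStatusReporter;

public class MyCountingUdf extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // PigStatusReporter is only wired up inside a running task, so guard
        // against nulls (e.g. when the UDF is instantiated on the front end).
        PigStatusReporter reporter = PigStatusReporter.getInstance();
        if (reporter != null) {
            Counter counter = reporter.getCounter("MyUdfCounters", "RecordsSeen");
            if (counter != null) {
                counter.increment(1);
            }
        }
        return input.get(0) == null ? null : input.get(0).toString();
    }
}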

A Pig job is actually a MapReduce job, so you can see the status of the job and its complete list of counters from the JobTracker page (if using MR1) or the Application Master page (if using YARN).

A single Pig script may create multiple jobs, depending on its complexity. You can query all the counters for each job from the command line by running
mapred job -status <job-id>
If you know the actual counter you are interested in, you can retrieve individual counters with
mapred job -counter <job-id> <group-name> <counter-name>
Of course, you need to know the job id(s) - those should be available in the original Pig output following the line 'Job DAG:'.
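
If you would rather pull the counters programmatically than via the CLI, something along these lines should work with the MapReduce client API. This is only a sketch: the job id is whatever Pig printed after 'Job DAG:', and for completed jobs it normally needs the JobHistory server to be running.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class DumpJobCounters {
    public static void main(String[] args) throws Exception {
        String jobIdStr = args[0]; // e.g. job_1662600000000_0042

        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(jobIdStr));
        if (job == null) {
            System.err.println("Job not found (it may have aged out of the history server): " + jobIdStr);
            return;
        }

        // Counters is iterable: groups first, then the counters inside each group
        Counters counters = job.getCounters();
        for (CounterGroup group : counters) {
            System.out.println(group.getDisplayName());
            for (Counter counter : group) {
                System.out.println("  " + counter.getDisplayName() + " = " + counter.getValue());
            }
        }
        cluster.close();
    }
}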

Related

ResourceManager webpage UI not showing Hive mapreduce job

Hive version: 3.1.3, Hadoop version: 3.3.4
I'm new to the Hive and Hadoop environment. I followed these links for the Hadoop and Hive installations, and was trying out a Hive insert using the example shown here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
show databases; and the CREATE query worked fine, but when INSERT INTO was called I got an execution error; the log is the following:
hive> INSERT INTO students VALUES ('fred gg', 36, 1.48), ('barney tt', 46, 2.02);
Query ID = vincent.chandra_20220908020315_f032aacb-ac55-47cf-bcb2-c6ba0f89dd88
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2022-09-08 02:03:17,325 Stage-1 map = 0%, reduce = 0%
Ended Job = job_local1202096292_0007 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
From searching around, it seems the console doesn't show much insightful information about the error, and I came across a post saying to check the details in the ResourceManager web UI. I accessed the page (localhost:8088), but it isn't displaying any jobs, as shown below.
In case this helps in any way, here are the jps results:
5569 DataNode
6674 Jps
5842 ResourceManager
5461 NameNode
6567 RunJar
5944 NodeManager
I'm at a loss as to what to do; debugging doesn't seem to be an option with the ResourceManager UI not showing any Hive MapReduce jobs. Any advice would be helpful and appreciated.
EDIT: Actually, I just realized the SecondaryNameNode also seems to be missing. I'll update if I find anything useful.

How to access MR job counters after job completion?

I have an MR job which runs fine on the cluster.
After the job completes I'm able to get the YARN logs, but I couldn't find the MR job counters, such as the number of input records and output records.
Is it possible to get that information after job completion?

hive outputs map-reduce counter

Since HQL (Hive SQL) uses MapReduce underneath, I wonder how I can output the job's counter info (to the console or to a log file), like the counter info a MapReduce job outputs by default when it finishes.
I use counters in my custom Hive UDTF to get debug info logged, for example:
context.getReporter().getCounter("MyCounter","Forward_Count").increment(1);
An example screenshot of the "counters info" I mentioned above:
[map-reduce counter example pic]
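
For context, here is a rough sketch of how that snippet sits inside the UDTF - the class name, output schema and counter names are illustrative, not my actual code:

import java.util.Collections;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyCountingUdtf extends GenericUDTF {
    private MapredContext context;

    @Override
    public void configure(MapredContext mapredContext) {
        // Hive hands us the MapredContext when the task starts; keep it for counter access
        this.context = mapredContext;
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Single string output column
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                Collections.singletonList("col"),
                Collections.singletonList(
                        (ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (context != null && context.getReporter() != null) {
            // Same call as the snippet above: bump a custom counter per forwarded row
            context.getReporter().getCounter("MyCounter", "Forward_Count").increment(1);
        }
        forward(new Object[] { String.valueOf(args[0]) });
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}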

How does Pig Pass Data from First Job and its next Job

As we all know, Apache Pig is a data flow language. If I write a Pig script and Pig decides to split the work into two or more jobs to execute the task at hand, how does Pig store the data that it passes from job 1 to job 2?
I read the Pig documentation, which says:
"Pig allocates a fix amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner."
(url : http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)
So does Pig have a writer which stores the output of an intermediate job in memory/RAM for better performance (spilling to disk if required), and has Pig implemented a reader which reads that data directly from memory to pass it to the next job for processing?
In MapReduce, we write all the data to disk and then read it again for the next job to start.
Does Pig have an upper hand here by implementing readers and writers which write to RAM/memory (spilling if required) and read from RAM (and disk if required) for better performance?
Kindly share your expertise/views on the highlighted comment from the Pig documentation, as to what it actually means or whether it is stating something else.
Thanks in Advance,
Cheers :))
If a Pig script has multiple jobs, then the output of each job is written to a temporary folder in HDFS, which is defined by pig.temp.dir (the default is /tmp). See the Pig docs, "Storing Intermediate Results". Also, while the script is running, do hadoop fs -ls /tmp/pig*; sometimes, when jobs are interrupted, these folders are not cleaned up and need to be freed manually.
The spilling of bags refers to what happens in the Mapper stage; there is no RAM communication between MR jobs.

Can we read built-in counters in Hadoop for individual tasks

Can we read the built-in counters in Hadoop for individual tasks, in a periodic manner (say every 500 ms or 1 s), and record them in a file? If we can do that, how?
How do we get the individual task PIDs?
