Logging in Pig UDF

I am calling a UDF written in Java from a Pig script.
In the UDF, if the input is malformed for some reason, I return null and that particular row/line is skipped.
There are many reasons why a row might be skipped. I am currently using the following log statements in my UDF:
warn("XML is null, so skipping it", PigWarning.UDF_WARNING_1);
....
warn("Entity is null, so skipping it", PigWarning.UDF_WARNING_5);
.... and so on
Once the Pig script is done, this gives me a consolidated summary like the one below:
2013-01-21 07:03:42,163 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning UDF_WARNING_5 5473 time(s).
2013-01-21 07:03:42,163 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning UDF_WARNING_1 1466 time(s).
But now I want to know the reason why each line failed, instead of just the numbers. Is there any way to do this in Pig?

I found a library called Penny, which allowed me to do logging and filtering in Pig.
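If pulling in Penny is not an option, another approach is to stop returning a bare null and hand back the rejection reason alongside the value, so the script can route rejected rows into their own relation. Below is a minimal plain-Java sketch of that idea. RowOutcome, RowValidator and the two checks are hypothetical names standing in for the real UDF logic, and the wiring into the UDF's exec() method is only described in comments.

```java
/** Result of checking one input row: a value, or the reason the row was skipped. */
final class RowOutcome {
    final String value;   // cleaned-up payload; null when the row was skipped
    final String reason;  // why the row was skipped; null when it was accepted

    private RowOutcome(String value, String reason) {
        this.value = value;
        this.reason = reason;
    }

    static RowOutcome accepted(String value) {
        return new RowOutcome(value, null);
    }

    static RowOutcome skipped(String reason) {
        return new RowOutcome(null, reason);
    }

    boolean isSkipped() {
        return reason != null;
    }
}

/** Hypothetical stand-in for the checks the real UDF performs on each row. */
final class RowValidator {
    static RowOutcome check(String xml) {
        if (xml == null || xml.trim().isEmpty()) {
            return RowOutcome.skipped("XML is null");
        }
        // Assumption for illustration: a usable row contains an <entity> element.
        if (!xml.contains("<entity>")) {
            return RowOutcome.skipped("Entity is null");
        }
        // In a real UDF, exec() would return a two-field tuple (value, reason)
        // built from this outcome instead of returning null.
        return RowOutcome.accepted(xml.trim());
    }
}
```

With a (value, reason) pair coming back from the UDF, the Pig script could split the relation on whether reason is null and store the rejected rows together with their reasons. The existing warn(...) calls can stay in place for the aggregate counts.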

Related

Not able to export Hbase table into CSV file using HUE Pig Script

I have installed Apache Ambari and configured Hue. I want to export HBase table data to a CSV file using a Pig script, but I am getting the following error:
2017-06-03 10:27:45,518 [ATS Logger 0] INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Exception caught by TimelineClientConnectionRetry, will try 30 more time(s).
Message: java.net.ConnectException: Connection refused
2017-06-03 10:27:45,703 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-06-03 10:27:45,709 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 101: file '/usr/lib/hbase/lib/hbase-common-1.2.0-cdh5.11.0.jar' does not exist.
2017-06-03 10:27:45,899 [main] INFO org.apache.pig.Main - Pig script completed in 4 seconds and 532 milliseconds (4532 ms)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exit code 2
Oozie Launcher failed, finishing Hadoop job gracefully
Please help me understand what I am doing wrong.

Pig error while entering simple scripts in hadoop 2 environment

I am using hadoop-2.5.1 and pig-0.13.0, and my Hadoop cluster is running fine. When I try to run a simple Pig script
test = load '/input-data/data10' using PigStorage(',');
I am getting an error:
2014-11-13 15:41:19,278 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-11-13 15:41:19,279 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.addres
If anyone has a solution, please let me know.
This is not an error; these are just INFO-level log messages.
If your script does not attempt to perform any action on the loaded data and doesn't throw any error, then it probably works fine. Pig does not actually execute a LOAD until a DUMP or STORE needs its output.
Try adding an action on the test relation, for example:
DUMP test;

Pig Error: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

I have just upgraded Pig from 0.12.0 to 0.13.0 on Hortonworks HDP 2.1.
I am getting the error below when I try to use XMLLoader in my script, even though I have already registered piggybank.
Script:
A = load 'EPAXMLDownload.xml' using org.apache.pig.piggybank.storage.XMLLoader('Document') as (x:chararray);
Error:
dump A
2014-08-10 23:08:56,494 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:56,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:56,651 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-08-10 23:08:56,727 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-08-10 23:08:57,191 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:57,199 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:57,214 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-08-10 23:08:57,223 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-08-10 23:08:57,247 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Note that Pig decides which Hadoop version to use depending on which environment variable you have set:
HADOOP_HOME -> v1
HADOOP_PREFIX -> v2
If you use Hadoop 2, you need to recompile piggybank (which is compiled for Hadoop 1 by default):
go to pig/contrib/piggybank/java
$ ant -Dhadoopversion=23
then copy the resulting jar over pig/lib/piggybank.jar
A few more details because the other answers didn't work for me:
Git clone the pig git mirror https://github.com/apache/pig
cd into the cloned directory
if you've already built pig in the past in this directory, you should run a clean
ant clean
build pig for hadoop 2
ant -Dhadoopversion=23
cd into piggybank
cd contrib/piggybank/java
again, if you've built piggybank before, make sure to clean out the old build files
ant clean
build piggybank for hadoop 2 (same command, different directory)
ant -Dhadoopversion=23
If you don't build pig first, piggybank will throw a bunch of "symbol not found" exceptions while compiling. In addition, since I had previously built pig for Hadoop 1 (accidentally), without running a clean, I ran into runtime errors.
Sometimes you may run into a problem like the one below after installing Pig:
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.hcatalog.common.HCatUtil.checkJobContextIfRunningFromBackend(HCatUtil.java:88)
at org.apache.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:162)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:540)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:322)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:199)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:277)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1367)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1352)
at org.apache.pig.PigServer.execute(PigServer.java:1341)
Many blogs suggest recompiling Pig by executing the command:
ant clean jar-all -Dhadoopversion=23
or recompiling piggybank.jar with the steps below:
cd contrib/piggybank/java
ant clean
ant -Dhadoopversion=23
But this may not solve your problem. The actual cause here is related to HCatalog, so try updating it. In my case, I was using Hive 0.13 and Pig 0.13, with the HCatalog provided with Hive 0.13.
I then updated Pig to 0.15 and used separate hive-hcatalog-0.13.0.2.1.1.0-385 library jars, and the problem was resolved.
I later identified that it was not Pig causing the problem but the Hive-HCatalog libraries. Hope this helps.
I faced the same error with Hadoop version 2.2.0 too.
The workaround is to register the following jar files from the grunt shell.
The paths below are for hadoop-2.2.0; find the jars that match your version.
/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
Use the REGISTER command to register these jars along with piggybank.
Then run the Pig script/command again and report back if you face any issue.

How to make pig abort after a certain number of warnings?

I just had a job terminate with this stdout:
Success!
...
Input(s):
Successfully read 14982562 records from: "..."
Successfully read 21532901 records from: "..."
Successfully read 9322681 records from: "..."
Output(s):
Successfully stored 0 records in: "..."
Successfully stored 0 records in: "..."
...
2013-11-14 22:50:46,179 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning TOO_LARGE_FOR_INT 7 time(s).
2013-11-14 22:50:46,180 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 9322681 time(s).
2013-11-14 22:50:46,180 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 61711157 time(s).
2013-11-14 22:50:46,180 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Obviously, I did something wrong, but instead of telling me what, pig
counted 70M+ warnings and claimed success.
Clearly one cannot store detailed logs for 70M warnings, but claiming success in
such a situation is absurd.
Is there a way to configure it to report every warning in excruciating
detail (which field in which script line for
ACCESSING_NON_EXISTENT_FIELD, which field, which value, which type,
which script line for FIELD_DISCARDED_TYPE_CONVERSION_FAILED) for the
first, say, 10 or 100 or 1000 warnings of each type on each host and then abort?
Is there a way to abort after X warnings?
I don't think so. The reason is that the UDF you are using thinks it is acceptable to skip over these errors. You could create your own UDF based on the UDF that you are using, and throw an exception instead of a warning upon an error. See this presentation for more details (slide 15).
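The throw-instead-of-warn idea could look roughly like the sketch below. FailureBudget is a hypothetical helper, not part of Pig; a custom UDF would call record(...) at the point where it currently calls warn(...). Keep in mind that the counter lives inside one task JVM, so the limit applies per task rather than to the job as a whole.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tolerates a fixed number of bad rows, then fails loudly. */
final class FailureBudget {
    private final int maxFailures;
    private final List<String> samples = new ArrayList<>();
    private int failures;

    FailureBudget(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Records one bad row; throws once more than maxFailures have been seen. */
    void record(String reason) {
        failures++;
        // Keep only the first maxFailures + 1 reasons so memory use stays bounded.
        if (samples.size() <= maxFailures) {
            samples.add(reason);
        }
        if (failures > maxFailures) {
            throw new IllegalStateException(
                "Aborting after " + failures + " bad rows: " + samples);
        }
    }

    int failures() {
        return failures;
    }
}
```

An exception escaping the UDF should fail the task attempt, and the job should fail once the task has used up its retries, which is about as close to "abort after X warnings" as this approach gets. The reasons carried in the exception message give you the first few offending cases in the task log.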
Are you using HBaseStorage? If so, this other SO question could help you solve this problem. Quick summary: use HBaseBinaryConverter.
Is there a way to configure it to report every warning in excruciating detail [...]?
Yes, you can take a look at each warning independently. See this other SO question for details on how to do this. Quick summary: parse the logs yourself and look for the WARN messages.
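For the summary lines at least, parsing the log yourself could be sketched as follows. The pattern is derived only from the "Encountered Warning ... time(s)." lines shown in the question; PigWarningSummary and its methods are hypothetical names, and a wrapper script would have to decide what to do when the limit is exceeded.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts per-warning totals from the summary lines Pig prints at the end of a run. */
final class PigWarningSummary {
    // Matches e.g. "Encountered Warning TOO_LARGE_FOR_INT 7 time(s)."
    private static final Pattern SUMMARY =
        Pattern.compile("Encountered Warning (\\S+) (\\d+) time\\(s\\)");

    /** Returns warning name -> count, in the order the warnings appear in the log. */
    static Map<String, Long> parse(List<String> logLines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = SUMMARY.matcher(line);
            if (m.find()) {
                // Sum repeated entries in case the same warning is reported more than once.
                counts.merge(m.group(1), Long.parseLong(m.group(2)), Long::sum);
            }
        }
        return counts;
    }

    /** True when any single warning type occurred more than limit times. */
    static boolean exceeds(Map<String, Long> counts, long limit) {
        for (long n : counts.values()) {
            if (n > limit) {
                return true;
            }
        }
        return false;
    }
}
```

A wrapper around the pig invocation could feed the captured log through parse(...) and treat the run as failed when exceeds(...) returns true, even though Pig itself reports success. This only recovers the totals; per-row detail still has to come from the task logs or from the UDF.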

Running pig script gives the error as: job has failed. Stop running all dependent jobs

I need help understanding why I am getting an error while running a Pig script. When I try the same script on a smaller dataset, it executes successfully.
There are a few questions with similar issues, but none of them has a solution.
My script looks like this:
A = load 'test.txt' using TextLoader();
B = foreach A generate STRSPLIT($0,'","') as t;
C = FILTER B BY (t.$1==2 and t.$2 matches '.*xxx.*');
Store C into 'temp';
The error is:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete
2013-07-15 14:21:41,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201307111759_7495 has failed! Stop running all dependent jobs
2013-07-15 14:21:41,914 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-07-15 14:21:42,754 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /xxx/ temp/_temporary/_attempt_201307111759_7495_m_000527_0/part-m-00527 File does not exist. Holder DFSClient_attempt_201307111759_7495_m_000527_0 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1606)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1597)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1652)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:689)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.c
2013-07-15 14:21:42,754 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
Any help will be appreciated.
Thanks.
After some research, I found that the problem here is the LeaseExpiredException. This might be because the mapper's output was removed. One possible reason is the quota allocated to the user. In my case, I was running this script on a very large dataset, and my quota was insufficient to process/store the data.
We can check the quota with the following command:
hadoop fs -count -q /user/username
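If you want to check this automatically before launching a large job, the output of that command can be parsed. The sketch below assumes the plain eight-column layout of hadoop fs -count -q (QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) with no extra flags such as -h, and that unset quotas are printed as "none" and "inf". QuotaLine is a hypothetical helper name.

```java
/** Parses one line of `hadoop fs -count -q` output (plain eight-column layout assumed). */
final class QuotaLine {
    final String spaceQuota;           // bytes, or "none" when no space quota is set
    final String remainingSpaceQuota;  // bytes, or "inf" when no space quota is set
    final String path;

    private QuotaLine(String spaceQuota, String remainingSpaceQuota, String path) {
        this.spaceQuota = spaceQuota;
        this.remainingSpaceQuota = remainingSpaceQuota;
        this.path = path;
    }

    static QuotaLine parse(String line) {
        // Columns: QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA
        //          DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
        String[] cols = line.trim().split("\\s+", 8);
        if (cols.length != 8) {
            throw new IllegalArgumentException("Expected 8 columns but got " + cols.length);
        }
        return new QuotaLine(cols[2], cols[3], cols[7]);
    }

    /** True when a space quota is set and no bytes of it remain. */
    boolean spaceExhausted() {
        if ("none".equals(spaceQuota) || "inf".equals(remainingSpaceQuota)) {
            return false;
        }
        return Long.parseLong(remainingSpaceQuota) <= 0;
    }
}
```

Note that the space quota counts replicated bytes, so a job's output consumes it faster than the raw file sizes suggest.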
Thank you.
