Hadoop job fails when using Pig

I am trying to run a basic Pig script on the cluster, but I am not able to submit the job; it fails with the following warning message:
WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
I am not able to read the file in the cluster using Pig even when I specify the correct location. Can anyone help me with this?
The error:
Failed to read data from "hdfs://172.16.30.119:8020/user/hadoop/wordfile"
The code:
data = load 'hdfs://172.16.30.119:8020/user/hadoop/wordfile';
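One way to narrow this down (a sketch, not a confirmed fix): check that the path is actually visible to the user submitting the Pig job, and load it with an explicit loader and schema. The TextLoader and the single-column chararray schema below are assumptions for illustration.
# shell: confirm the file exists and is readable by the user running Pig
hdfs dfs -ls hdfs://172.16.30.119:8020/user/hadoop/wordfile
-- Pig: same path, but with an explicit loader and schema
data = LOAD 'hdfs://172.16.30.119:8020/user/hadoop/wordfile' USING TextLoader() AS (line:chararray);
DUMP data;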

Related

Pig job hangs on first failure

I'm encountering a problem with Pig and Oozie.
I have a Pig script that tries to read data from a non-existent table, so an exception happens in the initialize method of the RecordReader. That is expected, since the table definitely does not exist.
The problem starts when the script is launched via Oozie on a multi-node Hadoop cluster: after the first attempt the job just hangs and does nothing until another job is submitted to the cluster.
If launched from the command line (pig -f test.pig) it doesn't hang. It also doesn't hang when launched in local mode or on a single-node cluster (from the command line or via Oozie).
I really hope someone has run into a problem like this and can help me.

Workflow error logs disabled in Oozie 4.2

I am using Oozie 4.2 that comes bundled with HDP 2.3.
While working with a few of the example workflows that come with the Oozie package, I noticed that the job error log is disabled, which makes debugging really difficult in the event of a failure. I ran the commands below:
# oozie job -config /home/santhosh/examples/apps/hive/job.properties -run
job: 0000063-150904123805993-oozie-oozi-W
# oozie job -errorlog 0000063-150904123805993-oozie-oozi-W
Error Log is disabled!!
Can someone please tell me how to enable the workflow error log for oozie?
In the Oozie UI, 'Job Error Log' is a tab that was introduced in HDP 2.3 with Oozie 4.2.
It is the simplest way to find the errors for a given Oozie job in the Oozie log file.
To enable Oozie's Job Error Log, make the following changes in the Oozie log4j properties file.
Add the lines below after the log4j.appender.oozie block and before log4j.appender.oozieops:
log4j.appender.oozieError=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.oozieError.RollingPolicy=org.apache.oozie.util.OozieRollingPolicy
log4j.appender.oozieError.File=${oozie.log.dir}/oozie-error.log
log4j.appender.oozieError.Append=true
log4j.appender.oozieError.layout=org.apache.log4j.PatternLayout
log4j.appender.oozieError.layout.ConversionPattern=%d{ISO8601} %5p %c{1}:%L - SERVER[${oozie.instance.id}] %m%n
log4j.appender.oozieError.RollingPolicy.FileNamePattern=${log4j.appender.oozieError.File}-%d{yyyy-MM-dd-HH}
log4j.appender.oozieError.RollingPolicy.MaxHistory=720
log4j.appender.oozieError.filter.1 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.1.levelToMatch = WARN
log4j.appender.oozieError.filter.2 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.2.levelToMatch = ERROR
log4j.appender.oozieError.filter.3 = org.apache.log4j.varia.LevelMatchFilter
log4j.appender.oozieError.filter.3.levelToMatch = FATAL
log4j.appender.oozieError.filter.4 = org.apache.log4j.varia.DenyAllFilter
Then modify the Oozie logger line from log4j.logger.org.apache.oozie=WARN, oozie to log4j.logger.org.apache.oozie=ALL, oozie, oozieError, as shown below.
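In the Oozie log4j properties file the change looks like this:
# before
log4j.logger.org.apache.oozie=WARN, oozie
# after
log4j.logger.org.apache.oozie=ALL, oozie, oozieError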
Finally, restart the Oozie service. The job error log will be available for new jobs launched after the restart.
As mentioned, the error log is new and may not be available for good reasons. However, it seems that you have the wrong expectation of the Oozie error log.
The error log is meant to be a subset of the log file, not an addition to it.
So yes, it can make things easier to debug, but if you have checked the Oozie log and did not find what you are looking for, the error log will not be the solution for you.
You will probably want to look at the logs of the underlying tasks, which can be found via the external ID.
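For example, reusing the workflow id from the question (action names and ids will differ on your cluster, and <application_id> is a placeholder):
# list the workflow's actions; the 'External ID' column holds the Hadoop job id of each action
oozie job -info 0000063-150904123805993-oozie-oozi-W
# the task-level logs can then be pulled through the YARN CLI using the matching application id
yarn logs -applicationId <application_id>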

Pig, Oozie, and HBase - java.io.IOException: No FileSystem for scheme: hbase

My Pig script works fine on its own, but when I put it in an Oozie workflow I receive the following error:
ERROR 2043: Unexpected error during execution.
org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
...
Caused by: java.io.IOException: No FileSystem for scheme: hbase
I registered the HBase and ZooKeeper jars successfully, but received the same error.
I also attempted to set the ZooKeeper quorum by adding variations of this line to the Pig script:
SET hbase.zookeeper.quorum 'vm-myhost-001,vm-myhost-002,vm-myhost-003'
Some searching on the internet instructed me to add this to the beginning of my workflow.xml:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false
This solved the problem. I was even able to remove the registration of the HBase and Zookeeper jars and the Zookeeper quorum.
Now, after double-checking, I noticed that my jobs actually do their work: they store the results in HBase as expected. But Oozie claims that a failure occurred when it didn't.
I don't think that setting mapreduce.fileoutputcommitter.marksuccessfuljobs to false constitutes a solution.
Are there any other solutions?
It seems that there is currently no real solution for this.
However, this answer to a different question seems to indicate that the best workaround is to create the success flag 'manually', along the lines of the sketch below.
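A rough sketch of that workaround, assuming the results land in an HDFS directory such as /user/me/output (a hypothetical path) and that ${nameNode} is defined in job.properties as in the standard Oozie examples: add an fs action after the Pig action that touches the _SUCCESS flag the committer would otherwise have written.
<!-- hypothetical action and transition names; adjust the path to the real output directory -->
<action name="mark-success">
    <fs>
        <touchz path="${nameNode}/user/me/output/_SUCCESS"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
</action>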

UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation

I'd like your input on why this error is happening. On the production environment onshore we're using CDH4; on our local testing environment we're using plain Apache Hadoop v2.2.0. When I run the same jar compiled on CDH4, the MR jobs execute fine. But when I run the jar on Hadoop v2.2.0 (YARN enabled), I get this error:
INFO mapreduce.Job: Task Id : attempt_1391062333435_0001_m_000000_0, Status : FAILED
Error: java.lang.UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation
The log showed that the map tasks ran successfully, but the reduce tasks all failed with the error above. There aren't many hits on Google for this error, so I have nowhere else to turn but here.
Any thoughts? Thanks.
Sorry for the lateness of this reply.
This problem was solved when we synced our environment with the one onshore. That is, instead of using plain Apache Hadoop, we used the Cloudera distribution.
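In other words, the jar has to be built against the same Hadoop version the target cluster runs. Going the other way (rebuilding the jar for the plain Apache cluster instead of swapping the cluster) would look roughly like this with Maven; the pom.xml fragment below is an illustrative assumption, not something from the original post.
<!-- build against the cluster's Hadoop version; 'provided' keeps the cluster's own jars in charge at runtime -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>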

Making a Cassandra Connection inside Hadoop MapReduce Task

I am successfully using the DataStax Java Driver to access Cassandra inside my Java code just before I start a MapReduce Job.
cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
However, I need to check additional information to decide, on a per-record basis, how to reduce each record. If I attempt to use the same code inside a Hadoop Reducer class, it fails to connect with the error:
INFO mapred.JobClient: Task Id :
attempt_201310280851_0012_r_000000_1, Status : FAILED
com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /127.0.0.1 ([/127.0.0.1]
Unexpected error during transport initialization
(com.datastax.driver.core.TransportException: [/127.0.0.1] Error writing)))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:186)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:81)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:662)
at com.datastax.driver.core.Cluster$Manager.access$100(Cluster.java:604)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:69)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:96)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:585)
The MapReduce input and output successfully read from and write to Cassandra. As I mentioned, I can connect before I run the job, so I do not think it is an issue with the Cassandra server.
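One thing worth checking (a sketch, not a confirmed diagnosis): inside a reducer, 127.0.0.1 refers to whichever worker node the task happens to run on, not to the machine the job was submitted from, so the contact point must be an address the workers can reach. A hedged Java sketch; the cassandra.contact.point configuration key, the class name, and the Text types are assumptions for illustration:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LookupReducer extends Reducer<Text, Text, Text, Text> {
    private Cluster cluster;   // one driver instance per reducer task
    private Session session;

    @Override
    protected void setup(Context context) {
        // "cassandra.contact.point" is a made-up key for this sketch; set it on the job
        // Configuration to a Cassandra host that the worker nodes can actually reach.
        String contactPoint = context.getConfiguration().get("cassandra.contact.point");
        cluster = Cluster.builder().addContactPoint(contactPoint).build();
        session = cluster.connect();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws java.io.IOException, InterruptedException {
        // per-record lookups against Cassandra via 'session' go here,
        // followed by the usual context.write(...)
    }

    @Override
    protected void cleanup(Context context) {
        if (cluster != null) {
            cluster.close();   // on driver 1.x this would be cluster.shutdown()
        }
    }
}
If Cassandra really is reachable at 127.0.0.1 from every worker, the next thing to rule out would be a firewall between the task JVMs and the port the driver uses.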
