What is causing "org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: null"? - hadoop

I have an Elastic MapReduce job which uses elasticsearch-hadoop via scalding-taps to transfer data from Amazon S3 to Amazon Elasticsearch Service. For a long time this job ran successfully. However, it has recently started failing with the following stack trace:
2016-03-02 07:28:34,003 FATAL [IPC Server handler 0 on 41019] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1456902751849_0012_m_000000_0 - exited : cascading.tuple.TupleException: unable to sink into output identifier: myindex/mytable
at cascading.tuple.TupleEntrySchemeCollector.collect(TupleEntrySchemeCollector.java:160)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:95)
at cascading.tuple.TupleEntrySchemeCollector.add(TupleEntrySchemeCollector.java:134)
at cascading.flow.stream.SinkStage.receive(SinkStage.java:90)
at cascading.flow.stream.SinkStage.receive(SinkStage.java:37)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
at com.twitter.scalding.MapFunction.operate(Operations.scala:59)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:170)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: null
We have enabled the "es.nodes.wan.only" setting.
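For reference, the elasticsearch-hadoop settings on the job look roughly like the following. The endpoint name is a placeholder and, apart from es.nodes.wan.only and the myindex/mytable resource shown in the error above, the values are illustrative rather than our exact configuration:
es.nodes.wan.only = true
es.nodes = search-mydomain.us-east-1.es.amazonaws.com
es.port = 443
es.net.ssl = true
es.resource = myindex/mytable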
What could be causing this failure?

Related

Not able to run oozie workflow in AWS EMR

We are trying to implement a POC running Oozie on AWS EMR. For security reasons I cannot post the workflow, but it is a simple example with only a rename action that renames a file. The rest of the actions are the standard ones such as start, end, fatal error, error handler, etc.
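For illustration only (since the real workflow cannot be shared), the error-handler email action follows the standard Oozie email-action pattern; the addresses and transition targets below are placeholders:
<action name="ErrorHandler">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>ops@example.com</to>
        <subject>Workflow ${wf:id()} failed</subject>
        <body>Failed node: ${wf:lastErrorNode()}, error: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>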
The same workflow worked fine on an EC2 instance, but when we try to run it on EMR we get the following error:
2019-09-12 19:34:41,300 WARN ActionStartXCommand:523 - SERVER[<hostname>] USER[hadoop] GROUP[-] TOKEN[] APP[<WorkflowName>] JOB[0000006-190911195656052-oozie-oozi-W] ACTION[0000006-190911195656052-oozie-oozi-W#ErrorHandler] Error starting action [ErrorHandler]. ErrorType [ERROR], ErrorCode [EM007], Message [EM007: Encountered an error while sending the email message over SMTP.]
org.apache.oozie.action.ActionExecutorException: EM007: Encountered an error while sending the email message over SMTP.
at org.apache.oozie.action.email.EmailActionExecutor.email(EmailActionExecutor.java:304)
at org.apache.oozie.action.email.EmailActionExecutor.validateAndMail(EmailActionExecutor.java:173)
at org.apache.oozie.action.email.EmailActionExecutor.start(EmailActionExecutor.java:112)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:243)
at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:68)
at org.apache.oozie.command.XCommand.call(XCommand.java:291)
at org.apache.oozie.command.wf.SignalXCommand.execute(SignalXCommand.java:459)
at org.apache.oozie.command.wf.SignalXCommand.execute(SignalXCommand.java:82)
at org.apache.oozie.command.XCommand.call(XCommand.java:291)
at org.apache.oozie.command.wf.ActionEndXCommand.execute(ActionEndXCommand.java:283)
at org.apache.oozie.command.wf.ActionEndXCommand.execute(ActionEndXCommand.java:62)
at org.apache.oozie.command.XCommand.call(XCommand.java:291)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:244)
at org.apache.oozie.command.wf.ActionCheckXCommand.execute(ActionCheckXCommand.java:56)
at org.apache.oozie.command.XCommand.call(XCommand.java:291)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.mail.MessagingException: Could not connect to SMTP host: <hostname>, port: 25;
nested exception is:
java.net.ConnectException: Connection refused (Connection refused)
at com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:1961)
at com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:654)
When we check the application logs, we see the following error:
Launcher AM execution failed
java.lang.UnsupportedOperationException: Not implemented by the S3FileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2564)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.oozie.action.hadoop.FSLauncherURIHandler.create(FSLauncherURIHandler.java:36)
at org.apache.oozie.action.hadoop.PrepareActionsHandler.execute(PrepareActionsHandler.java:86)
at org.apache.oozie.action.hadoop.PrepareActionsHandler.prepareAction(PrepareActionsHandler.java:73)
at org.apache.oozie.action.hadoop.LauncherAM.executePrepare(LauncherAM.java:371)
at org.apache.oozie.action.hadoop.LauncherAM.access$000(LauncherAM.java:55)
at org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:220)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
at org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented by the S3FileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2564)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1060)
at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1371)
Hadoop distribution: Amazon 2.8.5
Oozie version: 5.1.0
EMR version: emr-5.26.0
Appreciate any guidance here.
The issue was resolved after we switched to an older version of Oozie (4.3), with no other changes; it now works fine. I had read in one of the AWS links that some people were unable to run Oozie 5.x. I will update this answer once we get a concrete reply from AWS.

MapReduce Jobs failing, after accepted by YARN

Even a simple WordCount MapReduce job fails with the same error.
Hadoop 2.6.0
Below are the YARN logs.
It seems some sort of timeout happens during resource negotiation, but I am unable to verify exactly what causes the timeout.
2016-11-11 15:38:09,313 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1478856936677_0004_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]; Host Details : local host is: "platform-demo/10.0.37.145"; destination host is: "platform-demo":60487;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy79.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 9 more
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
... 12 more
2016-11-11 15:38:09,319 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1478856936677_0004_000002 with final state: FAILED, and exit status: -1000
2016-11-11 15:38:09,319 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478856936677_0004_000002 State change from ALLOCATED to FINAL_SAVING
I tried changing the following properties:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2200</value>
  <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>500</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>3000000</value>
</property>
Q1. MapReduce jobs failing after being accepted by YARN
Reason: multiple connections (around 130) were stuck on port 60487.
Q2. MapReduce jobs failing after being accepted by YARN
The issue was due to the Hadoop tmp directory /app/hadoop/tmp. Empty this directory and re-run the MapReduce job; it then executed successfully.
Q3. Unhealthy node: local-dirs are bad: /tmp/hadoop-hduser/nm-local-dir
Edit yarn-site.xml with the following property:
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
Refer to: Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?

Spark 2.0 Thrift server not started in yarn mode

I have started the Spark 2.0 Thrift server in a local environment and it works fine. When I tried it in the cluster environment, the following exception was thrown:
16/06/02 10:21:06 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended!
It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2246)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:749)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:57)
at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:81)
at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/06/02 10:21:06 INFO util.ShutdownHookManager: Shutdown hook called
When checking the application master logs, we see:
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
spark-defaults configuration:
spark.executor.memory 2g
spark.driver.memory 4g
spark.executor.cores 1
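For reference, the server is launched with the standard script; the cluster-environment invocation looks roughly like this (the YARN master option is the relevant part, the path is illustrative):
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn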

Hadoop job fails, Resource Manager doesn't recognize AttemptID

I'm trying to aggregate some data in an Oozie workflow; however, the aggregation step fails.
I found two points of interest in the logs. The first is an error(?) that seems to occur repeatedly: after a container finishes, it gets killed but exits with the non-zero exit code 143.
It finishes:
2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312
2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0
And then when it gets killed by the ApplicationMaster:
2015-05-04 15:35:13,831 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1430730089455_0009_m_000048_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
The second point of interest is the actual error that crashes the job completely. This happens in the reduce phase; I'm not sure whether the two are related:
2015-05-04 15:35:28,767 INFO [IPC Server handler 20 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:29,930 INFO [IPC Server handler 10 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000052_0 is : 0.19511986
2015-05-04 15:35:31,549 INFO [IPC Server handler 1 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000050_0 is : 0.5324404
2015-05-04 15:35:31,771 INFO [IPC Server handler 28 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:31,890 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:675)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:244)
at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
at java.lang.Thread.run(Thread.java:695)
Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy36.allocate(Unknown Source)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:188)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:667)
... 3 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy35.allocate(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
... 11 more
After that, the oozie:launcher job and the job that hit the error just sit there indefinitely with STATE: accepted, FINALSTATUS: undefined, and TRACKING UI: unassigned.
Does anyone know what is causing this error and how I can fix it?
The same workflow worked before, and I can't say that I changed anything in between...
Just in case somebody else stumbles upon this error: it turned out to be caused by Hadoop running out of disk space... a pretty cryptic error for something that simple. I thought ~90 GB would be enough to work on my 30 GB dataset; I was wrong.
I was also facing a similar issue. In my case the task completed successfully, but it took a very long time to finish and there were very large logs on the JobTracker; most of the log entries were progress reports such as the following:
2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312
2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0
On debugging my code I found that the distributed cache was being read inside the map function of the Mapper class. After moving the logic that reads the distributed-cache data from the map function to the configure method of the Mapper class, my issue was resolved.
The configure method should be used for this kind of setup work, i.e. anything that needs to run only once before map is called, as sketched below.
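A minimal sketch of the change using the old mapred API (the class, field, and file handling are hypothetical, not the poster's actual code): the cached file is read once in configure(), and map() only performs in-memory lookups.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> lookup = new HashSet<String>();

    @Override
    public void configure(JobConf job) {
        // Runs once per task attempt: load the distributed-cache file here, not in map().
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            if (cached != null && cached.length > 0) {
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    lookup.add(line.trim());
                }
                reader.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to read distributed cache", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // map() only consults the in-memory set built in configure().
        if (lookup.contains(value.toString().trim())) {
            output.collect(value, new IntWritable(1));
        }
    }
}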

Hadoop error stalling job reduce process

I have been running a Hadoop job (the word-count example) a few times on my two-node cluster setup, and it has been working fine up until now. Now I keep getting a RuntimeException that stalls the reduce process at 19%:
2013-04-13 18:45:22,191 INFO org.apache.hadoop.mapred.Task: Task:attempt_201304131843_0001_m_000000_0 is done. And is in the process of commiting
2013-04-13 18:45:22,299 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201304131843_0001_m_000000_0' done.
2013-04-13 18:45:22,318 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-04-13 18:45:23,181 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error while running command to get file permissions : org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Does anyone have any idea what might be causing this?
Edit: I solved it myself.
If anyone else runs into the same problem: this was caused by the /etc/hosts file on the master node. I hadn't entered the hostname and address of the slave node.
This is how my hosts file is structured on the master node:
127.0.0.1 MyUbuntuServer
192.xxx.x.xx2 master
192.xxx.x.xx3 MySecondUbuntuServer
192.xxx.x.xx3 slave
A similar problem is described here:
http://comments.gmane.org/gmane.comp.apache.mahout.user/8898
The info there might relate to a different version of Hadoop. It says:
java.lang.RuntimeException: Error while running command to
get file permissions : java.io.IOException: Cannot run program
"/bin/ls": error=12, Not enough space
The solution there was to increase the heap size through mapred.child.java.opts, e.g. -Xmx1200M.
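A minimal sketch of that setting as a mapred-site.xml property (the -Xmx1200M value is the one quoted above; adjust it to your cluster):
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1200M</value>
</property>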
See also: https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/BHGYJDNKMGE
HTH,
Avner
