I'm getting the following warning while running my MapReduce jobs under CDH4:
java.io.IOException: Lease timeout of 0 seconds expired.
at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:1700)
at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:652)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:604)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:411)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:436)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:70)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:297)
at java.lang.Thread.run(Thread.java:662)
Any idea what this means?
Related
I am trying to run a Pig program using Oozie from the command prompt, but I am getting errors like:
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 1 sec. Retry count = 1
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 2 sec. Retry count = 2
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 4 sec. Retry count = 3
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 8 sec. Retry count = 4
Error: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 4. Exception = Connection timed out
and I am running this command:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
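For reference, a quick way to check whether the Oozie server is actually listening at that URL (assuming a standard Oozie install on port 11000; these checks are not from the original post):

curl http://localhost:11000/oozie/v1/admin/status         # expect {"systemMode":"NORMAL"}
oozie admin -oozie http://localhost:11000/oozie -status   # expect "System mode: NORMAL"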
Even a simple WordCount MapReduce job fails with the same error.
Hadoop 2.6.0
Below are the YARN logs.
It seems some sort of timeout happens during resource negotiation, but I am unable to verify exactly what causes the timeout.
2016-11-11 15:38:09,313 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error launching appattempt_1478856936677_0004_000002. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]; Host Details : local host is: "platform-demo/10.0.37.145"; destination host is: "platform-demo":60487;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy79.startContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:643)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:730)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 9 more
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.37.145:49054 remote=platform-demo/10.0.37.145:60487]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:553)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:368)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:722)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:718)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:717)
... 12 more
2016-11-11 15:38:09,319 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1478856936677_0004_000002 with final state: FAILED, and exit status: -1000
2016-11-11 15:38:09,319 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1478856936677_0004_000002 State change from ALLOCATED to FINAL_SAVING
I tried changing the properties below:
yarn.nodemanager.resource.memory-mb = 2200 (amount of physical memory, in MB, that can be allocated for containers)
yarn.scheduler.minimum-allocation-mb = 500
dfs.datanode.socket.write.timeout = 3000000
dfs.socket.timeout = 3000000
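For reference (this placement is standard Hadoop configuration, not something stated in the original post), the yarn.* properties belong in yarn-site.xml and the dfs.* properties in hdfs-site.xml, in the usual <property> form:

<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2200</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>500</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>3000000</value>
</property>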
Q1. MapReduce Jobs failing, after accepted by YARN
Reason: multiple connections (around 130) were stuck on port 60487.
Q2. MapReduce Jobs failing, after accepted by YARN
The issue was the Hadoop tmp directory /app/hadoop/tmp. After emptying this directory and re-running the MapReduce job, it executed successfully.
Q3. Unhealthy Node local-dirs are bad: /tmp/hadoop-hduser/nm-local-dir
Edit yarn-site.xml with the following property:
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>98.5</value>
</property>
Refer to: Why does Hadoop report "Unhealthy Node local-dirs and log-dirs are bad"?
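As a sanity check (not part of the original answer), the unhealthy state is usually caused by disk utilization on the NodeManager local dirs crossing the threshold, so it is worth confirming that before or after raising it:

df -h /tmp             # the flagged dir /tmp/hadoop-hduser/nm-local-dir lives under /tmp
yarn node -list -all   # after restarting the NodeManager, the node should report as RUNNING again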
I am reading millions of XML files via:
val xmls = sc.binaryFiles(xmlDir)
The operation runs fine locally, but on YARN it fails with:
client token: N/A
diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt appattempt_1433491939773_0012_000002 timed out. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1433750951883
final status: FAILED
tracking URL: http://controller01:8088/cluster/app/application_1433491939773_0012
user: ariskk
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
In the Hadoop userlogs I am frequently getting these messages:
15/06/08 09:15:38 WARN util.AkkaUtils: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@2b4f336b,BlockManagerId(1, controller01.stratified, 58510))] in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
I run my Spark job via spark-submit, and it works for another HDFS directory that contains only 37k files. Any ideas how to resolve this?
OK, after getting some help on the Spark mailing list, I found out there were two issues:
The source directory: if it is given as /my_dir/, Spark fails and produces the heartbeat issues. Instead it should be given as hdfs:///my_dir/*.
An out-of-memory error appears in the logs after fixing #1. This is the Spark driver, running on YARN, running out of memory because of the number of files (apparently it keeps all file info in memory). So I submitted the job with --conf spark.driver.memory=8g, which fixed the issue (a sketch of both fixes follows).
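A minimal sketch of those two fixes (the class and jar names are placeholders, not from the original post):

// 1. Read with an explicit HDFS glob instead of a bare /my_dir/ path:
val xmls = sc.binaryFiles("hdfs:///my_dir/*")

// 2. Give the YARN driver more memory at submit time, e.g.:
//    spark-submit --master yarn-cluster --conf spark.driver.memory=8g \
//      --class com.example.XmlJob my-app.jar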
A MapReduce job failed because a container failed, with the log below:
15/03/21 20:18:25 INFO mapreduce.Job: Job job_1426295876693_0015 failed with state FAILED due to: Application application_1426295876693_0015 failed 2 times due to Error launching appattempt_1426295876693_0015_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1426996344559 found 1426969281613
It means that the nodes in your cluster do not have synchronized system clocks. Install an NTP server (or enable NTP on every node); that will fix the issue.
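For example, on a RHEL/CentOS-style node (package and service names vary by distribution, so treat this as a sketch and run it on every node in the cluster):

sudo yum install -y ntp        # install the NTP daemon
sudo ntpdate -u pool.ntp.org   # one-off clock sync before starting the daemon
sudo service ntpd start        # keep the clock in sync from now on
date                           # compare the output across all nodes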
I am running a Spark Streaming job 24x7, using the updateStateByKey function to save the computed historical data, as in the NetworkWordCount example.
I tried to stream a file with 300,000 (3 lakh) records, with a 1-second sleep after every 1,500 records.
I am using 3 workers.
Over time the state held by updateStateByKey keeps growing, and then the program throws the following exceptions:
ERROR Executor: Exception in task ID 1635
java.lang.ArrayIndexOutOfBoundsException: 3
14/10/23 21:20:43 ERROR TaskSetManager: Task 29170.0:2 failed 1 times; aborting job
14/10/23 21:20:43 ERROR DiskBlockManager: Exception while deleting local spark dir: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232
java.io.IOException: Failed to delete: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232/24
14/10/23 21:20:43 ERROR Executor: Exception in task ID 8037
java.io.FileNotFoundException: /var/folders/3j/9hjkw0890sx_qg9yvzlvg64cf5626b/T/spark-local-20141023204346-b232/22/shuffle_81_0_1 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
How do I handle this?
I guess the updateStateByKey state should be periodically reset, since it is growing at a rapid rate; please share an example of when and how to reset it. Or is there some other problem? Please shed some light.
Any help is much appreciated. Thanks for your time.
Did you set the checkpoint directory? updateStateByKey requires one:
ssc.checkpoint("path to checkpoint")
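A minimal, hedged sketch of how the checkpoint directory and updateStateByKey fit together (the checkpoint path, source host/port and batch interval are placeholders); returning None from the update function is what removes a key from the state and keeps it from growing without bound:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations on older Spark versions

val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))

// updateStateByKey requires a checkpoint directory; use a fault-tolerant one such as HDFS.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

val lines = ssc.socketTextStream("localhost", 9999)              // placeholder source
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Sum the new values into the existing state; return None instead to drop a key.
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunc _)
runningCounts.print()

ssc.start()
ssc.awaitTermination()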