I have several parallel Spark jobs doing the same thing; they work on separate input/output directories and, at the end, write their results from a DataFrame to Parquet, using one of the columns as the partitioning column. The jobs with the biggest inputs often fail: some executors start to fail with the exceptions below, then a stage fails and the failed partitions are recomputed. If the number of failed stages reaches 4, the whole job is cancelled (sometimes it doesn't reach 4 and the whole job finishes successfully).
Stages fail with these failure reasons (from the Spark UI):
org.apache.spark.shuffle.FetchFailedException
Connection closed by peer
I tried to find clues on the Internet, and it seems the cause might be speculative execution, but I haven't enabled it in Spark. Does anyone have other ideas about what could be causing this?
Spark job code:
sqlContext
    .createDataFrame(finalRdd, structType)
    .write()
    .partitionBy(PARTITION_COLUMN_NAME)
    .parquet(tmpDir);
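Since speculative execution is the usual suspect for two attempts racing to create the same partition file, here is a minimal sketch (Spark 1.6 Java API; the app name is a placeholder) of pinning it off explicitly and logging the effective value, in case cluster defaults override it:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class SpeculationCheck {
    public static void main(String[] args) {
        // Hypothetical app name; explicitly pin speculation off so no duplicate
        // attempt of a slow task can race for the same partition file.
        SparkConf conf = new SparkConf()
                .setAppName("partitioned-parquet-writer")
                .set("spark.speculation", "false");

        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc); // the sqlContext used for the DataFrame write above

        // Sanity check at runtime: log the effective value, in case
        // spark-defaults.conf on the cluster turns speculation on.
        System.out.println("spark.speculation = "
                + sc.getConf().get("spark.speculation", "false"));

        sc.stop();
    }
}

If speculation really is off, the duplicate DFSClient leases are more likely coming from tasks of a retried stage being re-run while the original, zombie attempts still hold their HDFS leases on the same partition files.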
Exceptions in executors:
16/09/14 11:04:06 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006023_0/partition=2/part-r-06023-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_1489398656_198] for client [10.117.102.72], because this file is already being created by [DFSClient_NONMAPREDUCE_-2049022202_200] on [10.117.102.15]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141105_0001_m_006489_0/partition=2/part-r-06489-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318361396): File does not exist. Holder DFSClient_NONMAPREDUCE_-1428957718_196 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141105_0001_m_006310_0/partition=2/part-r-06310-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_-419723425_199] for client [10.117.102.44], because this file is already being created by [DFSClient_NONMAPREDUCE_596138765_198] on [10.117.102.35]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_005877_0/partition=2/part-r-05877-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359423): File does not exist. Holder DFSClient_NONMAPREDUCE_193375828_196 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_005621_0/partition=2/part-r-05621-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_498917218_197] for client [10.117.102.36], because this file is already being created by [DFSClient_NONMAPREDUCE_-578682558_197] on [10.117.102.16]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006311_0/partition=2/part-r-06311-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359109): File does not exist. Holder DFSClient_NONMAPREDUCE_-60951070_198 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006215_0/partition=2/part-r-06215-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet (inode 318359393): File does not exist. Holder DFSClient_NONMAPREDUCE_-331523575_197 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/erm/data/core/internal/ekp/stg/tmp/Z_PLAN_OPER/_temporary/0/_temporary/attempt_201609141104_0001_m_006311_0/partition=2/part-r-06311-482b0b4d-1174-4c76-b203-92b2b47c78cb.parquet] for [DFSClient_NONMAPREDUCE_1869576560_198] for client [10.117.102.44], because this file is already being created by [DFSClient_NONMAPREDUCE_-60951070_198] on [10.117.102.70]
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3152)
Spark UI:
We use Spark 1.6 (CDH 5.8).
Related
I'm trying to checkpoint/savepoint my Flink state, running on EMR, to an S3 bucket on AWS. Please note:
The instances (master and core nodes) have the IAM role properly set up to access the S3 bucket and all the directories/files inside it (the AmazonS3FullAccess policy is attached to the role and nothing overrides it).
I can successfully use aws s3 cp xxx s3://flink-bc/checkpoints from the slave and master nodes to copy files to the bucket.
Using HDFS for savepoints/checkpoints works.
If I set the checkpoints to use HDFS and then try to savepoint to S3, the savepoint operation error looks like:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 16c162c47f225cddad974056c9494b6d failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.........
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
and the JobManager logs show:
java.io.IOException: Cannot instantiate file system for URI: s3://flink-bc/savepoints
at org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:187)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:399)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.initializeLocationForSavepoint(AbstractFsCheckpointStorage.java:147)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:511)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:370)
at org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:951)
I faced a similar issue while using the latest Flink version (1.10.0) with S3 to store the checkpoints in an S3 bucket.
Please see the detailed working answer I have provided here.
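For completeness, a minimal Java sketch of what pointing a job's checkpoints at S3 can look like. The bucket/prefix and interval are placeholders, and it assumes an S3 filesystem implementation (e.g. flink-s3-fs-hadoop or flink-s3-fs-presto) is available on the Flink classpath; the "Cannot instantiate file system for URI: s3://..." error typically means it is not:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class S3CheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s; the s3:// path is a placeholder bucket/prefix.
        env.enableCheckpointing(60_000);
        env.setStateBackend(new FsStateBackend("s3://flink-bc/checkpoints"));

        // Trivial stand-in for the real pipeline.
        env.fromElements(1, 2, 3).print();

        env.execute("s3-checkpointing-job");
    }
}

Note that the s3:// scheme also has to be resolvable on the JobManager, which is why the failure above shows up in the JobManager logs rather than in the job code.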
When I SSH onto an EMR cluster and run the following command:
hadoop fs -get s3://path/to/my/files
I am getting the following error, and the file transfer fails partway through. I have used this command in the past, so I'm not sure what's up. Could it be related to the files' encryption? What would cause the stream to consistently close?
WARN internal.S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: Stream Closed
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:253)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:74)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:108)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:261)
at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:478)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:395)
at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:328)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:263)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:248)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:319)
at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:291)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:319)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
Caused by: java.io.IOException: Stream Closed
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:251)
... 30 more
My best guess: Not enough space on the cluster for the files.
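If free space is the suspect, a quick way to check what the local target filesystem reports before re-running the copy (a plain Java sketch; the directory argument is whatever local path the files are being pulled into):

import java.io.File;

public class FreeSpaceCheck {
    public static void main(String[] args) {
        // Placeholder: the local directory the files are being copied into.
        File target = new File(args.length > 0 ? args[0] : ".");
        double gib = target.getUsableSpace() / (1024.0 * 1024 * 1024);
        System.out.printf("Usable space under %s: %.2f GiB%n",
                target.getAbsolutePath(), gib);
    }
}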
I am trying to execute a program on Spark. I have a cluster with a master and two slave nodes. I am receiving the following error during execution.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4.0 failed 4 times, most recent failure: Lost task 3.3 in stage 4.0 (TID 44, hadoopslave3): java.lang.RuntimeException: java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_4-4ccc8c8_68.bin does not exist
Driver stacktrace is as follows:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/01/31 10:56:08 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 4.0 (TID 45) on executor hadoopslave3: java.lang.RuntimeException (java.io.FileNotFoundException: File /home/ubuntu/hadoop/hadoop-te/dl4j/1485860107978_-4ccc8c8/0/data/dataset_2-4ccc8c8_77.bin does not exist) [duplicate 3]
However, I can see all the dataset objects (.bin files) created on HDFS. Any suggestions?
Since you have a cluster with "two slave nodes" set up: do you also have a Hadoop filesystem set up? If not, then that's your issue.
The example you're linking to, when using a non-local cluster, transfers references to datasets using Hadoop. The behavior of this example has been made more predictable (with a big error message) in release 0.8.0.
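One way to check what the answer is getting at is to see which filesystem a dataset path actually resolves to with the Hadoop configuration on the classpath (a sketch; the path is taken from the error message above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With no Hadoop config on the classpath, fs.defaultFS falls back to
        // file:///, so the .bin paths get looked up on each worker's local disk.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

        Path p = new Path("/home/ubuntu/hadoop/hadoop-te/dl4j"); // directory from the error
        FileSystem fs = p.getFileSystem(conf);
        System.out.println("resolves to: " + fs.getUri());
        System.out.println("exists there: " + fs.exists(p));
    }
}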
I am facing task attempt failures with the error below, related to Teradata export (batch insert) jobs. Other jobs exporting data to Oracle etc. are running fine.
Task attempt_1234_m_000000_0 failed to report status for 600 seconds. Killing!,
java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.mapred.LinuxTaskController.createLogDir(LinuxTaskController.java:313) at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:295)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:215)
I can also see the below error message in the task stdout logs:
"main" prio=10 tid=0x00007f8824018800 nid=0x3395 runnable [0x00007f882bffb000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at com.teradata.jdbc.jdbc_4.io.TDNetworkIOIF.read(TDNetworkIOIF.java:693)
at com.teradata.jdbc.jdbc_4.io.TDPacketStream.readStream(TDPacketStream.java:774)
Hadoop version: Hadoop 2.5.0-cdh5.3.8
Specifically, it would be really helpful if you could tell me why this issue is happening.
Is this an issue related to a limit on the number of connections to Teradata?
Found the root cause of the issue:
- The Sqoop task is inserting around 4 million records into Teradata, so the task is fairly long-running.
- Because it is long-running, the insert query goes into the Teradata delay queue (workload management on the Teradata end, set by the DBAs), so the Sqoop MapReduce task does not get a response from Teradata for 600 seconds.
- Since the default task timeout is 600 seconds, the map task aborted the transaction, resulting in task failure.
Ref: http://apps.teradata.com/TDMO/v08n04/Tech2Tech/TechSupport/RoadRules.aspx
Solution:
1. Increase the task timeout on the MapReduce end (a sketch follows below).
2. Change the configuration related to the delay queue on the Teradata end for the specific user.
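For item 1, a minimal sketch of raising the MapReduce task timeout. The 30-minute value is a placeholder; the same property, mapreduce.task.timeout (mapred.task.timeout on older MR1 clusters), can also be passed to the Sqoop job with -D on the command line:

import org.apache.hadoop.conf.Configuration;

public class TimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is 600000 ms (600 s); raise it so the long-running Teradata
        // insert sitting in the delay queue is not killed for failing to report status.
        conf.setLong("mapreduce.task.timeout", 1_800_000L); // 30 minutes, placeholder value
        System.out.println("mapreduce.task.timeout = " + conf.get("mapreduce.task.timeout"));
    }
}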
I came to work today and found my Hudson with this problem! I've tried to research it, but I didn't find anything that helps me.
Here is the full stack trace:
hudson.util.IOException2: Failed to create a temp file on /home/cpcaserver5/.hudson/jobs/SVN/workspace
at hudson.FilePath.createTextTempFile(FilePath.java:966)
at hudson.tasks.CommandInterpreter.createScriptFile(CommandInterpreter.java:124)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:68)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:630)
at hudson.model.Build$RunnerImpl.build(Build.java:175)
at hudson.model.Build$RunnerImpl.doRun(Build.java:137)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:429)
at hudson.model.Run.run(Run.java:1366)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:145)
Caused by: hudson.util.IOException2: Failed to create a temporary directory in /etc/tomcat6/apache-tomcat-6.0.35/temp
at hudson.FilePath$12.invoke(FilePath.java:955)
at hudson.FilePath$12.invoke(FilePath.java:944)
at hudson.FilePath.act(FilePath.java:758)
at hudson.FilePath.act(FilePath.java:740)
at hudson.FilePath.createTextTempFile(FilePath.java:944)
... 12 more
Caused by: java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1716)
at java.io.File.createTempFile(File.java:1804)
at hudson.FilePath$12.invoke(FilePath.java:953)
... 16 more
Email was triggered for: Failure
Sending email for trigger: Failure
It looks like you have a permissions problem. Make sure you run Jenkins/Tomcat with appropriate user permissions. Ditto if this happens on a slave: check that the slave process runs as a user that has the appropriate permissions.
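One quick way to confirm the diagnosis is to reproduce the failing call as the user that runs Tomcat/Hudson (a sketch; the directory is the one from the "Failed to create a temporary directory" line of the stack trace):

import java.io.File;
import java.io.IOException;

public class TempDirProbe {
    public static void main(String[] args) {
        // Directory taken from the "Failed to create a temporary directory in ..." line.
        File tempDir = new File("/etc/tomcat6/apache-tomcat-6.0.35/temp");
        try {
            File probe = File.createTempFile("hudson-probe", ".tmp", tempDir);
            System.out.println("OK, created " + probe.getAbsolutePath());
            probe.delete();
        } catch (IOException e) {
            // The same "Permission denied" the build shows, if the Tomcat user cannot write here.
            System.out.println("Cannot write to " + tempDir + ": " + e.getMessage());
        }
    }
}

If this fails with "Permission denied", fix the ownership or permissions on that temp directory (or point Tomcat's java.io.tmpdir at a location the Hudson user can write to).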