Spark Streaming: How to cleanly restart a Spark Streaming job running on HDFS - hadoop

We have a Spark Streaming job which reads data from Kafka, running on a 4-node cluster, that uses a checkpoint directory on HDFS. We hit an I/O error when we ran out of space, and we had to go in and delete a few HDFS folders to free some up. Now we have bigger disks mounted and want to restart cleanly, with no need to preserve checkpoint data or Kafka offsets. We are getting this error:
Application application_1482342493553_0077 failed 2 times due to AM Container for appattempt_1482342493553_0077_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1266542908-96.118.179.119-1479844615420:blk_1073795938_55173 file=/user/hadoopuser/streaming_2.10-1.0.0-SNAPSHOT.jar
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1484420770001
final status: FAILED
tracking URL: http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077
user: hadoopuser
From the error, what I can make out is that it's still looking for the old HDFS blocks which we deleted.
From research I found that changing the checkpoint directory should help. I tried changing it and pointing it to a new directory, but that still isn't letting Spark restart on a clean slate; it's still giving the same block exception. Are we missing anything while making the configuration changes? And how can we make sure that Spark starts on a clean slate?
Also, this is how we are setting the checkpoint directory:
val ssc = new StreamingContext(sparkConf, Seconds(props.getProperty("spark.streaming.window.seconds").toInt))
ssc.checkpoint(props.getProperty("spark.checkpointdir"))
val sc = ssc.sparkContext
The current checkpoint directory in the property file looks like this:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir1
Previously it was like this:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir
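For reference, a minimal sketch of a clean-slate start under these assumptions: the property names are the ones from the question, the properties file path app.properties is hypothetical, the old checkpoint directory is removed with the Hadoop FileSystem API, and the context is built directly with new StreamingContext (rather than recovered via StreamingContext.getOrCreate), so no state or Kafka offsets are restored.
import java.io.FileInputStream
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical properties file; the property names mirror the question.
val props = new Properties()
props.load(new FileInputStream("app.properties"))
val checkpointDir = props.getProperty("spark.checkpointdir")
val batchInterval = Seconds(props.getProperty("spark.streaming.window.seconds").toInt)

// Delete any stale checkpoint data so nothing can be recovered from it.
val checkpointPath = new Path(checkpointDir)
val fs = checkpointPath.getFileSystem(new Configuration())
fs.delete(checkpointPath, true) // recursive; returns false if the path does not exist

// Build the context directly instead of via StreamingContext.getOrCreate,
// so the job starts with an empty checkpoint directory.
val ssc = new StreamingContext(new SparkConf(), batchInterval)
ssc.checkpoint(checkpointDir)
val sc = ssc.sparkContext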

Related

Unable to Start the Name Node in hadoop

I am running Hadoop on my local system, but when I run the ./start-all.sh command it starts everything except the NameNode; the NameNode gets a connection refused error and the log file prints the exception below.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 291.
Can you please help me?
Start the NameNode with the recover flag enabled. Use the following command:
./bin/hadoop namenode -recover
The metadata in Hadoop NN consists of:
fsimage: contains the complete state of the file system at a point in time
edit logs: contains each file system change (file creation/deletion/modification) that was made after the most recent fsimage.
If you list all the files inside your NN workspace directory, you'll see files like:
fsimage_0000000000000000000 (fsimage)
fsimage_0000000000000000000.md5
edits_0000000000000003414-0000000000000003451 (edit logs; there are many of these, with different names)
seen_txid (a separate file containing the last seen transaction id)
When the NN starts, Hadoop loads the fsimage and applies all edit logs, performing a lot of consistency checks along the way; it aborts if a check fails. To make this happen, I removed edits_0000000000000000001-0000000000000000002 from the edit logs in my NN workspace and then tried sbin/start-dfs.sh; I got an error message in the log like:
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 3.
So your error message indicates that your edit logs are inconsistent (maybe corrupted, or maybe some of them are missing). If you just want to play with Hadoop locally and don't care about its data, you can simply run hadoop namenode -format to re-format it and start from the beginning; otherwise you need to recover your edit logs, from the SNN or from wherever you backed them up before.

Operation not permitted while launching MR job

I changed my Kerberos cluster to un-Kerberized.
After that, the exception below occurred while launching MR jobs.
Application application_1458558692044_0001 failed 1 times due to AM Container for appattempt_1458558692044_0001_000001 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdp1.impetus.co.in:8088/proxy/application_1458558692044_0001/Then, click on links to logs of each attempt.
Diagnostics: Operation not permitted
Failing this attempt. Failing the application.
I was able to continue my work by doing the following:
delete the YARN folder defined by the yarn.nodemanager.local-dirs property from all the nodes,
then restart the YARN processes.

Spark streaming jobs fail when chained

I'm running a few Spark Streaming jobs in a chain (each one looking for input in the output folder of the previous one) on a Hadoop cluster, using HDFS, running in yarn-cluster mode.
job 1 --> reads from folder A outputs to folder A'
job 2 --> reads from folder A' outputs to folder B
job 3 --> reads from folder B outputs to folder C
...
When running the jobs independently they work just fine.
But when they are all waiting for input and I place a file in folder A, job1 will change its status from running to accepting to failed.
I can not reproduce this error when using the local FS, only when running it on a cluster (using HDFS)
Client: Application report for application_1422006251277_0123 (state: FAILED)
INFO Client:
client token: N/A
diagnostics: Application application_1422006251277_0123 failed 2 times due to AM Container for appattempt_1422006251277_0123_000002 exited with exitCode: 15 due to: Exception from container-launch.
Container id: container_1422006251277_0123_02_000001
Exit code: 15
Even though MapReduce ignores files that start with . or _, Spark Streaming does not.
The problem is that when a file is still being copied or processed and a trace of it is visible on HDFS (e.g. "somefilethatsuploading.txt.tmp"), Spark will try to process it.
By the time the process starts to read the file, it's either gone or not complete yet.
That's why the processes kept blowing up.
Ignoring files that start with . or _ or end with .tmp fixes this issue.
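For illustration, a sketch of such a filter using the fileStream variant that takes a path filter; the folder name, app name, and batch interval are placeholders based on the example above, not values from the original job:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("job1"), Seconds(30))

// Accept only files that are neither hidden nor still being written.
def isCompleteFile(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")
}

// newFilesOnly = true: only pick up files that appear after the job starts.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///data/folderA", isCompleteFile _, newFilesOnly = true
).map(_._2.toString)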
Addition:
We kept having issues with the chained jobs. It appears that as soon as Spark notices a file (even if it's not completely written) it will try to process it, ignoring any data appended afterwards. Because a file rename on HDFS is atomic, writing to a temporary name and renaming the file once it is complete prevents this.
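A hedged sketch of that write-then-rename pattern with the Hadoop FileSystem API (the paths and contents are hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Write under a name the downstream job ignores, then rename once the file is complete.
val tmp   = new Path("hdfs:///data/folderA/_part-00000.tmp")
val ready = new Path("hdfs:///data/folderA/part-00000")
val fs = tmp.getFileSystem(new Configuration())

val out = fs.create(tmp)
out.writeBytes("example record\n")
out.close()

fs.rename(tmp, ready) // a rename within HDFS is atomic, so readers see either nothing or the complete file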

Error with flume and remote hdfs sink

I'm trying to run Flume with an HDFS sink. HDFS is running properly on a different machine, and I can even interact with HDFS from the Flume machine, but when I run Flume and send events to it I get the following error:
2013-05-26 14:22:11,399 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:456)] HDFS IO error
java.io.IOException: Callable timed out after 25000 ms
at org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:352)
at org.apache.flume.sink.hdfs.HDFSEventSink.append(HDFSEventSink.java:727)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:430)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
at java.util.concurrent.FutureTask.get(FutureTask.java:119)
at org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:345)
... 5 more
Again, connectivity is not an issue since I can interact with HDFS using the Hadoop command line (the Flume machine is NOT a DataNode).
The weirdest part is that after killing Flume I can see that the tmp file has been created in HDFS, but it's empty (and the .tmp extension remains).
Any ideas as to why this could be happening? Thanks a lot!
Check three things. First, that your firewall is off, i.e. iptables is stopped. Second, that the property agent.sinks.hdfs-sink.hdfs.path = hdfs://PUBLIC_IP:8020/user/hdfs/flume uses the public IP and not the private IP.
And third, change
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000, because the default is 10000 ms, which gives HDFS very little time to react.
Thanks,
Shilpa

How do you establish single node Hadoop instance on AWS using Apache Whirr?

I am attempting to run a single-node instance of Hadoop on Amazon Web Services using Apache Whirr. I set whirr.instance-templates equal to 1 jt+nn+dn+tt. The instance starts up fine. I am able to create directories, but when I try to put files, I get a "File could only be replicated to 0 nodes, instead of 1" error. When I do a hadoop fsck / I get an "Exception in thread "main" java.net.ConnectException: Connection refused" error. Does anyone know what is wrong with my configuration?
In my experience, Whirr does not always start all services reliably. It sounds like the NameNode started (the NameNode is responsible for storing directory information) but the DataNode did not (the DataNode stores the data).
Try running
hadoop dfsadmin -report
to see if a datanode is available.
If not, it often helps to restart the cluster.
