Can we specify spark streaming to start from a specific batch of the checkpoint folder? - spark-streaming

I had been running my streaming job for a while and it had processed thousands of batches.
There is a retention policy on the checkpoint file system, and the older directories get removed. When I restarted my streaming job, it failed with the following error:
terminated with error",throwable.class="java.lang.IllegalStateException",throwable.msg="failed to read log file for batch 0"
This is because the corresponding batch directory is no longer available. Is there a way to make the streaming job start from a specific batchId?
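For context, a DStream-based job normally comes back from its checkpoint via StreamingContext.getOrCreate, which replays the metadata of every batch recorded in the checkpoint directory. A minimal sketch of that restart path (the path and app name below are made up for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/me/checkpoints"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-job")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream pipeline here ...
  ssc
}

// Restores the full batch lineage from checkpointDir if it exists,
// otherwise builds a fresh context. As far as I know, there is no
// argument here (or elsewhere in the public API) for resuming from a
// chosen batchId, which is why recovery fails once retention has
// deleted the older batch files.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()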

Related

How can I get distcp failed files and replay the task?

I used distcp to copy files between two HDFS clusters of the same version. When the copy failed, I wanted to find the failed MapReduce tasks and the related file paths so that I could replay them.
Retrying already happens automatically: each copy is attempted up to mapred.map.max.attempts times.
If you rerun distcp, it will only try to copy files that haven't already been copied (files successfully copied by a previous run are marked as "skipped" on re-execution).
If you would like a log of the files that couldn't be copied, you can specify -i and -log <logdir>. This will ignore failures but write out a more complete log of what failed and why.
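Putting that together, a rerun might look like this (the cluster addresses and paths are placeholders):
hadoop distcp -i -log /tmp/distcp-logs hdfs://source-nn:8020/data/src hdfs://dest-nn:8020/data/dst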

executing 2 file watchers on the same job on talend

I have two file watchers in the same Talend job, and I want both of them to run at the same time. Right now only one of them checks its corresponding directory. They are both in the same job design; can I run them in parallel?

Unable to start the NameNode in Hadoop

I am running Hadoop on my local system, but when I run the ./start-all.sh command, everything starts except the NameNode; it fails with "connection refused", and the log file prints the exception below:
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 291.
Can you please help me?
Start the NameNode with the recover flag enabled, using the following command:
./bin/hadoop namenode -recover
The metadata in the Hadoop NameNode (NN) consists of:
fsimage: contains the complete state of the file system at a point in time
edit logs: contains each file system change (file creation/deletion/modification) that was made after the most recent fsimage.
If you list all files inside your NN workspace directory, you'll see files including:
fsimage_0000000000000000000 (fsimage)
fsimage_0000000000000000000.md5
edits_0000000000000003414-0000000000000003451 (edit logs; there are typically many of them, with different names)
seen_txid (a separate file containing the last seen transaction ID)
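On a local install you can inspect these directly. The path below is the usual default under hadoop.tmp.dir, but it is an assumption; check dfs.namenode.name.dir in your hdfs-site.xml:
ls /tmp/hadoop-$(whoami)/dfs/name/current
cat /tmp/hadoop-$(whoami)/dfs/name/current/seen_txid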
When the NN starts, Hadoop loads the fsimage, applies all the edit logs, and meanwhile performs a lot of consistency checks; it aborts if a check fails. To make this happen, I removed edits_0000000000000000001-0000000000000000002 from the edit logs in my NN workspace and then ran sbin/start-dfs.sh, and I got an error message in the log like:
java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 3.
So your error message indicates that your edit logs are inconsistent (they may be corrupted, or some of them may be missing). If you are just playing with Hadoop on your local machine and don't care about its data, you can simply run hadoop namenode -format to re-format it and start from scratch; otherwise you need to recover your edit logs, from the SecondaryNameNode (SNN) or from a backup you made earlier.
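As rough command sketches of those two paths (the checkpoint location is an assumption; it is governed by dfs.namenode.checkpoint.dir):
# Throwaway local setup only: wipes all HDFS metadata and starts from scratch
./bin/hadoop namenode -format
# Otherwise, with the NameNode stopped, import the SecondaryNameNode's last
# checkpoint (read from dfs.namenode.checkpoint.dir) into the name directory:
./bin/hadoop namenode -importCheckpoint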

Spark streaming jobs fail when chained

I'm running a few Spark Streaming jobs in a chain (each one looking for input in the output folder of the previous one) on a Hadoop cluster, using HDFS and running in yarn-cluster mode.
job 1 --> reads from folder A, outputs to folder A'
job 2 --> reads from folder A', outputs to folder B
job 3 --> reads from folder B, outputs to folder C
...
When running the jobs independently they work just fine.
But when they are all waiting for input and I place a file in folder A, job 1 will change its status from running to accepted to failed.
I cannot reproduce this error when using the local file system, only when running on a cluster (using HDFS):
Client: Application report for application_1422006251277_0123 (state: FAILED)
INFO Client:
client token: N/A
diagnostics: Application application_1422006251277_0123 failed 2 times due to AM Container for appattempt_1422006251277_0123_000002 exited with exitCode: 15 due to: Exception from container-launch.
Container id: container_1422006251277_0123_02_000001
Exit code: 15
Even though MapReduce ignores files that start with . or _, Spark Streaming does not.
The problem is that when a file is still being copied or otherwise written, and a trace of it is visible on HDFS (e.g. "somefilethatsuploading.txt.tmp"), Spark will try to process it.
By the time the process starts to read the file, it's either gone or not complete yet.
That's why the processes kept blowing up.
Ignoring files that start with . or _ or end with .tmp fixes this issue (see the sketch after this answer).
Addition:
We kept having issues with the chained jobs. It appears that as soon as Spark notices a file (even if it's not completely written), it will try to process it and ignore any data appended afterwards. Writing the file somewhere else first and then renaming it into the watched folder avoids this: the rename is atomic on HDFS, so the file only becomes visible once it is complete.
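A minimal sketch of the filter fix, assuming the jobs ingest text files through fileStream (ssc is the StreamingContext, and the watched path is a placeholder):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Accept only files that are plausibly complete: skip hidden/marker files
// and anything still carrying an upload suffix.
def completedFilesOnly(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")
}

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///chain/folderA",  // hypothetical watched folder
  completedFilesOnly _,
  newFilesOnly = true
).map(_._2.toString)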

How do I run map/reduce on files in the local file system?

How do I run a Java map/reduce job on files available in local file system? For instance, I have a 3 node cluster, and all the nodes have a log file in their local file system, say /home/log/log.txt.
How do I run a job on these files? Do I need to combine them and transfer them to HDFS before running the job?
Thanks.
You can upload all the individual files into one folder on HDFS and provide that folder's path as the input path to your MapReduce program; the job will then run on all the files in that folder.
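For example (the HDFS paths, file names, and job jar below are placeholders):
hdfs dfs -mkdir -p /user/me/logs
# from each node, push its local log into the shared input folder
hdfs dfs -put /home/log/log.txt /user/me/logs/log-node1.txt
hadoop jar my-job.jar com.example.MyJob /user/me/logs /user/me/output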
