What is Apache Spark doing before a job starts? - hadoop

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
This is the CPU use:
And this is the network:
Notice the gap between each column: it is almost the same size as the activity column!
At first I thought these two columns were shifted (when it was pulling from S3, it wasn't using a lot of CPU and vice-versa) but then I noticed that these two graphs actually follow each other. This makes sense since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question: what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that time. It is as if the cluster decided to take a break before each job.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice that at 01:13:13 it just hangs there until 02:30:45.

I found the issue. The problem was in the way I was pulling from S3.
We have our data in S3 separated by a date pattern, as in s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40.
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
But that is bad. In retrospect, this makes sense: for each star (*), Spark needs to list everything in that "directory", and then list everything in each directory under it. A single month has about 30 entries, each day has 24, and each of those has 60. So the above pattern issues a "list files" call for each star and then another "list files" call on every entry returned, all the way down to the minutes! This is so that it can eventually get all of the **/actualData.txt files and then union all of their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings for all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed into textFile.
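A minimal sketch of building the paths in code, written in Python for illustration. The helper name minute_paths, the one-minute step, and the example time range are assumptions; the bucket layout follows the date pattern described above.

```python
from datetime import datetime, timedelta


def minute_paths(start, end, step_minutes=1,
                 base="s3n://bucket", filename="actualData.txt"):
    """Return one explicit path per time slot in [start, end).

    Hypothetical helper: it only formats strings and never contacts S3.
    """
    paths = []
    current = start
    while current < end:
        # Layout assumed from the question: base/YYYY/MM/DD/HH/mm/filename
        paths.append(f"{base}/{current:%Y/%m/%d/%H/%M}/{filename}")
        current += timedelta(minutes=step_minutes)
    return paths


paths = minute_paths(datetime(2015, 1, 3, 10, 40), datetime(2015, 1, 3, 10, 43))
joined = ",".join(paths)  # comma-separated string of explicit paths

# With an existing SparkContext named sc (not created here):
#   rdd = sc.textFile(joined)
```

Because every path is spelled out, no wildcard has to be expanded level by level before the job can start.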
If in your case you can't determine all of the possible paths, consider restructuring your data, building as much of each path as possible and only using * towards the end of the path, or using the AmazonS3Client to get all the keys through the list-objects API (which lets you get all keys in a bucket with a given prefix very quickly) and then passing them as a comma-separated string into textFile. It will still make a listStatus call for each file, and it will still be serial, but there will be far fewer calls.
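The key-listing approach can be sketched as follows. This is an illustration in Python using the boto3 paginator for list_objects_v2 rather than the Java AmazonS3Client mentioned above; the function names list_keys and to_textfile_arg are hypothetical, and the client is passed in so the functions themselves have no dependency beyond the object they are given.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Collect every key under `prefix` via paginated list_objects_v2 calls.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


def to_textfile_arg(bucket, keys, scheme="s3n"):
    """Join keys into the comma-separated path string textFile accepts."""
    return ",".join(f"{scheme}://{bucket}/{key}" for key in keys)


# Usage (needs boto3, AWS credentials and a SparkContext sc; not run here):
#   import boto3
#   keys = list_keys(boto3.client("s3"), "bucket", prefix="2015/01/")
#   rdd = sc.textFile(to_textfile_arg("bucket", keys))
```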
However, all of these solutions just delay the inevitable: as more and more data accumulates, more and more list status calls will be made serially. The root of the issue seems to be that sc.textFile(s3n://) pretends that S3 is a file system, which it is not. It is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that doesn't assume a file system.


Invalid state: The Flow Controller is initializing the Data Flow

I'm trying out a test scenario: adding a new node to an existing cluster (for now 1-node) using an external ZooKeeper.
I keep getting the repeated lines below, and the UI shows "Invalid state: The Flow Controller is initializing the Data Flow."
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at nifi-02:9489; will use this address for sending heartbeat messages
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.p.AbstractNodeProtocolSender Cluster Coordinator is located at nifi-02:9489. Will send Cluster Connection Request to this address
2022-02-28 17:51:37,572 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-02-28 17:52:36,914 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#13c90c06 checkpointed with 1 Records and 0 Swap Files in 4 milliseconds (Stop-the-world time = 1 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1
2022-02-28 17:52:37,581 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
NiFi 1.15.3 is being used (unsecured setup).
It seems that the cluster coordinator is not running on the mentioned port on the node that is already in the cluster. I suspected this because of the timeout, yet the new node is able to detect that a cluster coordinator is present on that node. How can I solve this?
nc (netcat) also times out on the same port.
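For reference, the same reachability check can be reproduced without nc. This is only a sketch in Python: can_connect is a hypothetical helper, and the host and port in the usage comment are the ones from the log lines above.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False


# Usage, run from the new node (not executed here):
#   print(can_connect("nifi-02", 9489))
```

A timeout (rather than an immediate refusal) usually points at something dropping packets on the way, but that is an interpretation and not something the logs above confirm.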

Why is RandomForestClassificationModel broadcast to executors for every mini-batch in Spark Streaming?

A random forest model was trained offline and stored in the file system.
This model is loaded once at the start of the Spark Streaming application using Pipeline.load.
The predict function is called for every batch (model.transform(input_data_frame)).
Observation: From the Spark UI we can see that every task of this stage spends most of its time (more than 95%) on task deserialization. Our assumption is that every task is deserializing the model that was loaded initially, so we tried broadcasting the model (broadcast variables are useful when caching the data in deserialized form is important), but it still shows high task deserialization time.
Spark standalone cluster details: Spark version = 2.2.1, executor cores = 4, executor memory = 4 GB, total executors = 24
Model size: 45 MB
Spark Kafka streaming job jar size: 8 MB
1) Why is there a delay between these two steps? What is happening between them?
Attached is the Spark Kafka streaming log:
18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
2) Why is Spark broadcasting the model to the executors for every mini-batch?
18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)

Hadoop Streaming MapReduce slow finding files

I have a Hadoop job that is taking a very long time to initialize when fed a large number of input files, and I'm not sure why. The job will find all of the nodes and files within a few seconds, regardless of how many files are used, but takes significant time (minutes) to determine the number of splits if given 10,000 files. When I run the job as a different user, the job will determine the number of splits almost immediately.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D 'mapreduce.job.name=...'
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar] /tmp/streamjob4556879591438635545.jar tmpDir=null
17/08/07 22:01:40 INFO client.RMProxy: Connecting to ResourceManager at jobtracker-dev.la.prod.factual.com/
17/08/07 22:01:41 INFO security.TokenCache: Got dt for hdfs://dev; Kind: HDFS_DELEGATION_TOKEN....
17/08/07 22:01:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/08/07 22:01:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 674c65bbf0f779edc3e00a00c953b121f1988fe1]
17/08/07 22:01:41 INFO mapred.FileInputFormat: Total input paths to process : 10000
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/211/...
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/210/...
17/08/07 22:31:39 INFO mapreduce.JobSubmitter: number of splits:10000
This is not a lot of information, obviously, but does anyone have an idea what might be going on?
The time taken depends on many parameters. To start, check your cluster capacity and, specifically, your YARN configuration.
If you have 10k splits, AM coordination with the tasks will take a significant amount of time. Remember that Hadoop is built for processing big files, not a large number of small files.
Also check your HDFS block size and how much data you are putting in.
If you are running in distributed mode, make sure you have set up passwordless connections to your data nodes.
As for "When I run the job as a different user, the job will determine the number of splits almost immediately": this is not a Hadoop capacity issue, so check your configuration carefully. If possible, use Ambari to manage your cluster, if you have the budget to do so.

kafka spark streaming job with many active jobs

I ran into a “many active jobs” issue when using direct Kafka streaming on YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1).
The problem happens when kafka has almost NO traffic.
From the application UI, I see many ‘active’ jobs that keep running for hours. And finally the driver logs “Requesting 4 new executors because tasks are backlogged”.
But when looking at the driver log for one of these ‘active’ jobs, the log says the job is finished. So why does the application UI show this job as active, seemingly forever?
Here is the related log info for one of the ‘active’ jobs.
There are two stages: a reduceByKey following a flatMap. The log says both stages finished in ~20 ms and the job finished in 64 ms.
Got job 6567
Final stage: ResultStage 9851(foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) finished in 0.022 s
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 1468592373000 ms
I am facing a similar issue. Mine is a Spark Streaming application in which my only action is to write to a Cassandra table, and this write fails due to an SSL authentication problem. Ideally such batches should be shown as failed in Streaming, but they remain in the active state forever; inside the batch the jobs are reported as completed successfully, when they should have been marked as failed.

hadoop streaming jobs fail to report?

All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors due to one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt Task Machine State Error Logs
attempt_201110302152_0002_m_000037_0 task_201110302152_0002_m_000037 worker2 FAILED
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description for mapred.task.timeout, which defaults to 600s, says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out if more than 600s is actually required for the map task to complete processing the input data or if there is a bug in the code which needs to be debugged.
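If the map task legitimately needs a long time per record, another option is to update the status string from the streaming script itself: Hadoop Streaming treats lines of the form reporter:status:<message> written to stderr as status updates. A minimal sketch, assuming a Python mapper; run_mapper, the 30-second interval, and the upper-casing placeholder work are all hypothetical.

```python
import sys
import time


def run_mapper(lines, out=sys.stdout, err=sys.stderr,
               report_every=30.0, clock=time.monotonic):
    """Process input lines and periodically report status on stderr."""
    last_report = clock()
    for count, line in enumerate(lines, start=1):
        # Placeholder for the real (possibly slow) per-record work.
        out.write(line.rstrip("\n").upper() + "\n")

        now = clock()
        if now - last_report >= report_every:
            # Hadoop Streaming reads this stderr line as a status update,
            # which counts as the task reporting progress.
            err.write(f"reporter:status:processed {count} records\n")
            err.flush()
            last_report = now


# In the actual mapper script (not executed here):
#   run_mapper(sys.stdin)
```

This only keeps a slow but healthy task alive; it does not help if the task is stuck because of a bug or a failing worker machine.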
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
