Spark Streaming: Issues when processing time > batch time - hadoop

I am running a Spark Streaming (1.6.1) job on YARN, using the direct API to read events from a Kafka topic with 50 partitions and write them to HDFS. The batch interval is 60 seconds. I was receiving around 500K messages per batch, which were processed in under 60 seconds.
Suddenly Spark started receiving 15-20 million messages, which take around 5-6 minutes to process with the 60-second batch interval. I have configured "spark.streaming.concurrentJobs=4".
So when a batch takes a long time to process, Spark runs 4 batches concurrently to work through the backlog, but the backlog still grows over time because the batch interval is too short for this volume of data.
I have a few questions about this.
When I receive 15-20 million messages and processing takes around 5-6 minutes with a 60-second batch interval, I still see a new set of files on HDFS every 60 seconds, each with 50 part files. This confuses me: if my batch takes 5-6 minutes to process, how are files written to HDFS every minute, given that 'saveAsTextFile' is called only once per batch? The total record count across the 50 part files comes to around 3.3 million.
To handle 15-20 million messages, I increased the batch interval to 8-10 minutes. Spark then started consuming around 35-40 million messages from Kafka per batch, and its processing time again exceeded the batch interval.
I have configured 'spark.streaming.kafka.maxRatePerPartition=50' and 'spark.streaming.backpressure.enabled=true'.
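For reference, here is a minimal sketch of the setup described above (Spark 1.6 direct Kafka API); the broker list, topic name and output path are placeholder values, not taken from the question:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-hdfs")
      // Allow up to 4 batches to be processed at the same time.
      .set("spark.streaming.concurrentJobs", "4")
      // Let Spark adapt the ingestion rate to the observed processing time...
      .set("spark.streaming.backpressure.enabled", "true")
      // ...but never pull more than 50 records/sec from each partition, i.e.
      // at most 50 partitions * 50 rec/s * 60 s = 150,000 records per batch.
      .set("spark.streaming.kafka.maxRatePerPartition", "50")

    val ssc = new StreamingContext(conf, Seconds(60)) // 60-second batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // saveAsTextFile is called once per batch; the resulting RDD has 50
    // partitions (one per Kafka partition), which is why each output
    // directory contains 50 part files.
    stream.map(_._2).foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///data/events/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}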

I think one thing that may have confused you is the relationship between how long a job takes and how often one is started.
From what you describe, with the resources available, each job ends up taking about 5 minutes to complete, while your batch frequency is 1 minute.
So every minute you kick off a batch that takes about 5 minutes to complete.
As a result, you should expect HDFS to receive nothing for the first few minutes, and after that to receive something every minute, but with roughly a 5-minute 'delay' from when the data went in.
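A rough back-of-the-envelope check, using the rounded figures from the question, of why the backlog keeps growing even with four concurrent jobs (assuming the four jobs genuinely run in parallel without slowing each other down):

object BacklogCheck extends App {
  val batchIntervalSec  = 60.0  // a new batch is queued every minute
  val processingTimeSec = 300.0 // each batch takes ~5 minutes to process
  val concurrentJobs    = 4.0   // spark.streaming.concurrentJobs

  // Best case: 4 batches finish every ~300 s, i.e. one roughly every 75 s,
  // while a new batch is queued every 60 s.
  val arrivalRate    = 1.0 / batchIntervalSec             // batches queued per second
  val completionRate = concurrentJobs / processingTimeSec // batches finished per second

  println(f"arrival: $arrivalRate%.4f batches/s, completion: $completionRate%.4f batches/s")
  println(s"backlog grows without bound: ${arrivalRate > completionRate}")
  // To stop the queue from growing, either the batch interval or the per-batch
  // volume (via backpressure / maxRatePerPartition) has to change so that a
  // batch finishes within roughly one batch interval.
}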

Related

How to configure PutFile so that it runs on a 24 hr schedule but executes as many times as there are incoming flowfiles

How can I configure PutFile so that it runs on a 24-hour schedule but executes as many times as there are incoming flowfiles? Currently, if PutFile is set to a 24-hour schedule it only produces one file (for the first incoming flowfile). My requirement is to have it run once a day and handle all incoming flowfiles.
One possible way of solving this is to use the Run Duration and Concurrent Tasks settings under the processor's scheduling tab.
When you set the Run Duration to 2s, the PutFile processor will keep running for two seconds once it is started (after 24 hours in your case). You can also increase the number of Concurrent Tasks to a higher value (e.g. 100) so that within those two seconds the processor can handle more files concurrently.
NOTE: This does not guarantee that every flowfile in the queue will be processed, but it will process a large number of files within those two seconds.

HBase write performance degrades 4-5 days after a restart

We are facing this issue in our cluster, where we use Phoenix to write the data. Our jobs work fine initially, but after a few days (4-5) the job time increases drastically (from 4 minutes to 30 minutes) even though the input data size stays almost the same. Restarting HBase solves the issue for the next 4-5 days.
We have 70 region servers with 128 GB each, and each job issues about 50K puts per region server across the 70 region servers.
From the region server logs I can see the frequency of responseTooSlow warnings increase from 40K/day to 280K/day, although the response time reported in those logs is less than 1000 ms.
2018-04-18 00:00:07,831 WARN [RW.default.writeRpcServer.handler=10,queue=4,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1524009607697,"responsesize":106,"method":"Multi","processingtimems":134,"client":"192.168.25.70:54718","queuetimems":0,"class":"HRegionServer"}

Job unexpectedly cancelled due to time limit

There are several partitions on the cluster I work on, and with sinfo I can see the time limit for each partition. I submitted my job to the mid1 partition, which has a time limit of 8-00:00:00, which I understand to mean 8 days. The job waited in the queue for 1-15:23:41 (nearly 1 day and 15 hours), but it ran for only 00:02:24 (about 2.5 minutes, and the solution was converging). I also did not set a time limit in the file submitted with sbatch. The reason given for my job being stopped was:
JOB 3216125 CANCELLED AT 2015-12-19T04:22:04 DUE TO TIME LIMIT
So why was my job stopped if I did not exceed the time limit? I asked the people responsible for the cluster, but they have not replied.
Look at the value of DefaultTime in the output of scontrol show partitions. This is the time limit allocated to your job when you do not specify one yourself with --time.
Most probably this value is set to 2 minutes to force you to specify a sensible time limit (within the limits of the partition).

Reducer takes more time after 92%

I have a bulk load job that uses HFileOutputFormat to load an HBase table. My mappers complete within 2-3 minutes, and the reducer (the PutSortReducer invoked by HFileOutputFormat) reaches 92% in the next 2 minutes, but then takes around 9 minutes to finish the remaining 8%.
In total, 10 reduce tasks are spawned in my job; 8 or 9 of them always complete within 2-3 minutes, while the remaining one or two then take about 9 minutes. These last one or two tasks are usually the ones restarted in place of failed tasks, and the logs don't show any obvious errors explaining why those tasks failed.

Why do map tasks slow down after the first set of mappers completes?

Say I have 100 mappers running in parallel, out of a total of 500 mappers in the job.
The input size received by each mapper is almost the same, so the processing time each mapper takes should be more or less identical.
But the first 100 mappers finish in 20 minutes, the next 100 take 25-30 minutes, and the batch after that takes around 40-50 minutes each. Later on we get a GC overhead error.
Why is this happening?
I already have the following configuration set:
<property><name>mapred.child.java.opts</name><value>-Xmx4096m</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>1</value></property>
What else can be done here?
