How does back pressure property work in Spark Streaming? - hadoop

I have a CustomReceiver which receives a single event (a String). The received event is used during the Spark application's run time to read data from NoSQL and to apply transformations. When the processing time for each batch was observed to be greater than the batch interval, I set this property:
spark.streaming.backpressure.enabled=true
After that, I expected the CustomReceiver not to trigger and receive the event while a batch is processing longer than the batch window. That didn't happen, and a backlog of batches was still being added. Am I missing something here?

Try checking this article and this one.
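For context, here is a sketch of the settings that usually accompany backpressure. As the Spark configuration documentation describes it, backpressure adjusts the maximum rate at which receivers ingest records, based on batch scheduling delay and processing time. It does not stop batches from being generated at every batch interval, which would be consistent with the backlog described in the question. The class name and all numeric values below are placeholders for illustration, not recommendations.

```java
import java.util.Properties;

class BackpressureConfSketch {
    // Builds the streaming rate settings as plain key/value pairs.
    // The numbers are placeholders chosen for illustration only.
    static Properties build() {
        Properties conf = new Properties();
        // Let Spark adapt the receiving rate to scheduling delay and processing time.
        conf.setProperty("spark.streaming.backpressure.enabled", "true");
        // Rate used for the first batch, before any feedback is available (placeholder).
        conf.setProperty("spark.streaming.backpressure.initialRate", "100");
        // Upper bound in records per second per receiver (placeholder).
        conf.setProperty("spark.streaming.receiver.maxRate", "1000");
        return conf;
    }

    public static void main(String[] args) {
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Each pair can be passed to spark-submit as --conf key=value or set on the SparkConf before the streaming context is created.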


kg.apc.jmeter.functions.FifoTimeout does not work if added to user.properties in jmeter

I am using the Inter-Thread Communication plugin to share data between two thread groups.
TG-1: generates an ID -> stores it in the queue named Q1
TG-2: picks an ID from the queue -> does the processing
After some time, when the run duration of TG-1 is completed, it stops storing IDs in Q1. TG-2 processes all the data in the queue and keeps waiting for new data in Q1, even though Q1 will never receive any more. My expectation was that, once the run duration of TG-2 is also completed, TG-2 would finish its job and exit. Why does TG-2 keep waiting for data in Q1? This exhausts the heap space and the test never stops, which is a serious issue.
To prevent this, I tried adding kg.apc.jmeter.functions.FifoTimeout=120 to the user.properties file, as suggested by Dmitri T in one of my previous questions on the same topic. However, this property is not taking effect. Has anybody else experienced the same thing with this plugin? What is the alternative?
We are not telepathic enough to guess what your setup is, which components of the Inter-Thread Communication Plugin you're using, and how they're configured.
If you're using Functions - the timeout works fine for the __fifoPop() function, just make sure to restart JMeter after amending the property. The __fifoGet() function will just return an empty value if the queue is empty.
If you're using the jp@gc - Inter-Thread Communication PreProcessor - it is possible to specify the timeout directly in the GUI.
Also, it is always possible to stop the test via the Flow Control Action sampler.
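The difference between a pop that waits forever and a pop with a timeout can be illustrated with a plain java.util.concurrent queue. This is only an analogy under stated assumptions, not the plugin's implementation: the class FifoTimeoutSketch, the helper popWithTimeout and the choice to return an empty string on timeout are all hypothetical.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

class FifoTimeoutSketch {
    // Hypothetical helper: waits up to timeoutMillis for an item and
    // returns an empty string if nothing arrives in that time.
    static String popWithTimeout(LinkedBlockingDeque<String> queue, long timeoutMillis) {
        try {
            String value = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return value == null ? "" : value;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return "";
        }
    }

    public static void main(String[] args) {
        LinkedBlockingDeque<String> q1 = new LinkedBlockingDeque<>();
        q1.add("id-1");                                        // producer stores one ID, then stops
        System.out.println(popWithTimeout(q1, 200));           // consumer receives the stored ID
        System.out.println(popWithTimeout(q1, 200).isEmpty()); // queue is empty: the call gives up after the timeout
    }
}
```

A consumer that calls take() instead of a timed poll() would block indefinitely on the empty queue, which matches the symptom of TG-2 never finishing.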

Spring batch file resume after Server failure

I have configured Spring Boot Batch to process a fixed-length flat file. I read and split the columns using FlatFileItemReader and FixedLengthTokenizer, and write the data into the database using an ItemWriter and a JPA repository.
My scenario is this: the server crashed or was stopped while a file was being processed. At that point half of the file had been processed (meaning half of the data had been written to the DB). When the next job runs (once the server is back up), the file has to resume from where it stopped.
For example, a file has 1000 lines and the server was shut down after processing 500 rows. In the next job, processing has to start from row 501.
I googled for a solution but found nothing relevant. Any help appreciated.
As far as I know, what you are asking for (restart at chunk level) doesn't automatically exist in the Spring Batch API and is something the programmer has to implement on their own.
Spring Batch provides a job restart feature via JobOperator.restart. This is a job-level restart: a new execution id is created for the next run and the whole job reruns. The reason is that there are other concerns: somebody may have put in a new file, or renamed an existing file to be processed in place of the old one, so how would the batch know that the input file is the same content-wise, or that the DB has not changed since the last run?
Due to these concerns, it is imperative that the programmer handles these situations via custom code.
A second concern is that after a server failure the job status would still be STARTED and not FAILED, since the failure happens all of a sudden and the framework cannot update the status correctly.
You need to implement the following steps:
1. Implement custom logic to decide whether the last job execution was successful or a restart is needed.
2. If a restart is needed, mark the previous job execution as FAILED and then use JobOperator.restart(long executionId). For a non-partitioned job, the only useful effect is that the job status is correctly marked as FAILED; the whole job will still restart from the beginning.
There are many scenarios, such as:
a) the job status is STARTED but all steps are marked COMPLETED, etc.
b) for a partitioned job, a few steps are completed, a few failed and a few are still in STARTED, etc.
3. If a restart is not needed, launch a new job using JobLauncher.run.
So with the above steps, a real chunk-level job restart is not achieved, but these are the primary things you first need to understand and implement.
The next step is to change your input at job restart, i.e. you devise a mechanism to mark input records as processed for completed chunks (i.e. read, processed and written) and a way to know which input records are not yet processed. Then, in the next job run, you feed in the modified input that is still unprocessed. So it is all going to be custom logic specific to your use case.
I am not aware of any built-in mechanism in the framework itself to achieve this. To me, a job restart is a brand new job execution with modified / reduced input.
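The "mark processed input and feed only the rest" idea can be sketched in plain Java, without any Spring Batch API. Everything here is an assumption made for illustration: the class ResumableFileSketch, the run method, the in-memory map standing in for a durable checkpoint table, and the list add standing in for the database write.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ResumableFileSketch {
    // Processes lines chunk by chunk, starting at the stored checkpoint.
    // maxChunks lets the caller cut a run short to mimic a server crash.
    // Returns the lines "written" during this run.
    static List<String> run(String fileName, List<String> lines,
                            Map<String, Integer> checkpoints,
                            int chunkSize, int maxChunks) {
        List<String> written = new ArrayList<>();
        int position = checkpoints.getOrDefault(fileName, 0);
        int chunks = 0;
        while (position < lines.size() && chunks < maxChunks) {
            int end = Math.min(position + chunkSize, lines.size());
            written.addAll(lines.subList(position, end)); // stand-in for the DB write
            position = end;
            checkpoints.put(fileName, position);          // record progress after the chunk
            chunks++;
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6");
        Map<String, Integer> checkpoints = new HashMap<>(); // stand-in for a durable table
        // First run is cut off after one chunk, as if the server went down.
        System.out.println(run("input.txt", lines, checkpoints, 2, 1));
        // Second run resumes from the recorded position.
        System.out.println(run("input.txt", lines, checkpoints, 2, Integer.MAX_VALUE));
    }
}
```

In a real job the checkpoint would have to be persisted in the same transaction as the chunk's database write; otherwise a crash between the two could cause a chunk to be written twice or skipped.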

Batch transfer rate upper bound in a channel - batch creation start trigger

From this Stack Overflow question I understand that batches are sent out one at a time (leaving pipelining aside in this discussion), meaning a second batch won't be sent until the first one is delivered.
My follow-up question is: what condition starts the batch creation process? If I understand correctly (I could obviously be wrong), a batch is created/cut, or let's say the batch creation process is completed, when BATCHSZ is reached, or BATCHLIM is reached, or BATCHINT (non-zero) is reached, or the XMIT-Q is empty. But what starts a batch creation process? Is batch creation synchronous or asynchronous with respect to batch transfer? Does it start only after the previous batch is delivered (synchronous), or is it totally decoupled from the previous batch (e.g. it starts while the previous batch is still in transfer)?
This is a sibling/follow-up question to 1. The intention is to estimate the upper limit of our QRepl-MQ transfer. As documented in the entry "[added on Dec.20]" in the first (self-)answer in 1, our observation seems to support that the batch creation process starts synchronously AFTER the previous batch transfer is complete, but I couldn't find IBM references documenting the details.
Thanks for your help.
"our observation seems to support that the batch creation process starts synchronously AFTER the previous batch transfer is complete, but I couldn't find IBM references documenting the details."
Yes, that is how it works. If a 2nd batch started before the 1st batch finished, then newer messages would jump in front of older messages, which could cause all kinds of issues.
Yes, I know, applications are not supposed to rely on messages arriving in a logical order (i.e. 1, 2, 3, etc.), but they do.
Think of the MCA (Message Channel Agent), the process that gets messages from the XMIT queue, as a security guard at a store on Black Friday. He lets in 50 people from the line (a batch). After many people leave the store, he lets another 50 people in. Would you want ASYNC batching of the line at the store? Absolutely not. The security guard wants order, not chaos.
The same is true for MQ's MCA. It creates a batch of "n" messages, sends them, acknowledges them, then goes on to the next batch.
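The sequential behaviour described above can be modelled in a few lines of plain Java. This is a toy model, not MQ code: the class SyncBatchSketch, both method names and the idea of treating a list append as "send and wait for acknowledgement" are assumptions for illustration. The second method gives a rough upper bound for the estimate the question is after: if batches are strictly sequential, at most one batch of messages completes per batch round trip.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class SyncBatchSketch {
    // Cuts batches of at most batchSize messages from the transmission queue.
    // A new batch is started only after the previous one has been handed off,
    // so message order is preserved across batches.
    static List<List<String>> drain(Deque<String> xmitQueue, int batchSize) {
        List<List<String>> delivered = new ArrayList<>();
        while (!xmitQueue.isEmpty()) {
            List<String> batch = new ArrayList<>();
            // The batch is cut when it reaches batchSize or the queue runs empty.
            while (batch.size() < batchSize && !xmitQueue.isEmpty()) {
                batch.add(xmitQueue.poll());
            }
            delivered.add(batch); // stand-in for "send, then wait for acknowledgement"
        }
        return delivered;
    }

    // Rough upper bound: with strictly sequential batches, at most batchSize
    // messages complete per batch round trip of secondsPerBatch seconds.
    static double maxMessagesPerSecond(int batchSize, double secondsPerBatch) {
        return batchSize / secondsPerBatch;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(Arrays.asList("m1", "m2", "m3", "m4", "m5"));
        System.out.println(drain(queue, 2));
        System.out.println(maxMessagesPerSecond(50, 0.25));
    }
}
```

The round-trip time per batch would have to be measured on the actual channel; the 0.25 seconds used in main is a made-up figure.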

Using Quartz for long running job

I'm planning to use Quartz scheduler to process a one-time job.
My use case is that I need to migrate BLOBs from one storage to another. BLOBs can be as big as 100GB, so a particular job can run for a very long time.
I'm using Quartz because of its clustering support, fault tolerance and retry capabilities in case a job fails. The only thing I'm concerned about is that I might hit a lot of misfired triggers and a lot of database locking, which could hamper live production traffic on those database hosts. I will probably be scheduling tens of thousands of jobs in one shot.
A few of the things that I have figured out:
I can set a high value for org.quartz.jobStore.misfireThreshold so that misfires do not happen. I don't really care about when a job gets picked up, as it's a background job with no SLA as such. The only thing I care about is that the job eventually gets picked up and gets the work done.
I can also set the batch mode properties org.quartz.scheduler.batchTriggerAcquisitionMaxCount and org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow. I understand that the batch max count should be roughly equal to the thread pool size, which gives the biggest performance gain, but what should the value of the fire-ahead time window be?
I'm using Quartz with Spring Boot and will be leveraging org.quartz.impl.jdbcjobstore.JobStoreCMT. My understanding is that the execute method of the job gets wrapped in a transaction. Will this cause any problems, since the transaction will be open for a long time while the job takes hours to complete? Is this OK? I will be using an Oracle database.
Am I missing something here? Can someone share their experience with a similar use case?
Thanks!
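For reference, the properties named in the question can be collected in one place. This is only a sketch of the setup the question describes, not a recommendation: the class name and every value are placeholders. The fire-ahead window is left at 0, which I believe is the documented default, and acquireTriggersWithinLock is included because the Quartz configuration reference asks for it when the batch count is above 1 with a JDBC job store.

```java
import java.util.Properties;

class QuartzPropsSketch {
    // Collects the scheduler settings discussed in the question.
    // threadCount is reused as the batch acquisition count, as the question suggests.
    static Properties build(int threadCount) {
        Properties p = new Properties();
        p.setProperty("org.quartz.threadPool.threadCount", String.valueOf(threadCount));
        p.setProperty("org.quartz.scheduler.batchTriggerAcquisitionMaxCount", String.valueOf(threadCount));
        // Placeholder: 0 means triggers are not acquired ahead of their fire time.
        p.setProperty("org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow", "0");
        // Placeholder: one hour, in milliseconds, before a late trigger counts as misfired.
        p.setProperty("org.quartz.jobStore.misfireThreshold", "3600000");
        p.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreCMT");
        // Requested by the Quartz docs for JDBC job stores when the batch count is above 1.
        p.setProperty("org.quartz.jobStore.acquireTriggersWithinLock", "true");
        return p;
    }

    public static void main(String[] args) {
        build(10).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same keys can go into quartz.properties unchanged; a working JobStoreCMT setup needs further data source settings that are not shown here.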

How to architect file processing in Laravel

My task is to watch a folder where files arrive via SFTP. The files are big and processing one file is relatively time-consuming. I am looking for the best approach. Here are some ideas, but I am not sure which is best.
Run the scheduler every 5 minutes to check for new files.
For each new file, trigger an event announcing the new file.
Create a listener for this event that uses queues. In the listener, copy the new file into the processing folder and process it. When processing of a file starts, insert a record in the DB with the status "processing". When processing is done, change the record's status and copy the file to the processed folder.
In this solution I have 2 copy operations for each file. This is because, if a second scheduler run executes before all files are processed, some files could end up in 2 overlapping processing jobs.
What is the best way to do it? Should I use another approach to avoid the 2 copy operations? For example, a database check during the scheduler run to see whether the file is already in the processing state?
You should use ->withoutOverlapping(), as described in the task scheduling section of the manual.
Using this, you make sure that only one instance of the task runs at any given time.
