Will the journal be lost if there is no time to flush to the journal writer before the JVM hangs? - alluxio

Many journal entries are written asynchronously to the corresponding journal writer through AsyncJournalWriter. If an entry is still in AsyncJournalWriter.mQueue and there is no time to flush it to the journal writer before the JVM hangs, will that journal entry be lost?

The operation is finished and the journal entry is written to the local journal before success is returned to the client. So if the JVM hangs, the client simply takes longer to receive the operation's success result, because the response has to wait for the journal entry to be written.
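To make the ordering concrete, here is a minimal sketch of the general write-ahead pattern that answer describes: the RPC thread appends an entry, then blocks until a flusher thread has durably written past that entry before acknowledging the client. This is not Alluxio's actual AsyncJournalWriter code; the class, field, and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

public class AsyncJournalSketch {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    private final AtomicLong appended = new AtomicLong();  // tickets handed to writers
    private final AtomicLong flushed = new AtomicLong();   // last durably flushed ticket
    private final Writer journalOut;                       // local journal file writer

    public AsyncJournalSketch(Writer journalOut) {
        this.journalOut = journalOut;
    }

    // Called on the RPC thread; success is returned to the client only after this returns.
    public void appendAndWait(String entry) throws InterruptedException {
        synchronized (this) {
            queue.add(entry);
            long ticket = appended.incrementAndGet();
            while (flushed.get() < ticket) {
                wait();   // monitor is released here, so the flusher can make progress
            }
        }
    }

    // Runs on a dedicated flusher thread.
    public void flushLoop() throws IOException, InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            long last = flushed.get();
            String entry;
            while ((entry = queue.poll()) != null) {
                journalOut.write(entry + "\n");    // append to the local journal file
                last++;
            }
            if (last == flushed.get()) {
                Thread.sleep(5);                   // nothing new; back off briefly
                continue;
            }
            journalOut.flush();                    // make the batch durable (a real writer would fsync)
            flushed.set(last);
            synchronized (this) {
                notifyAll();                       // wake RPC threads waiting on their ticket
            }
        }
    }
}
```

Under this scheme a JVM hang before the flush cannot silently lose an acknowledged entry; the client simply never receives the success response for the unflushed operation.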

Related

How to get Job version from allocation JSON that has no job information?

I have a persistent Nomad database of jobs, allocations and evaluations (with my own cleanup settings, not in the scope of the question). I consume the Nomad event stream https://developer.hashicorp.com/nomad/api-docs/events, listen to allocations, evaluations and jobs, and save all of the JSONs to a database.
Allocations from the Nomad event stream contain no Job information. I can get the evaluation from the allocation using the "EvalID" field, but I do not know how to get the Job version from the evaluation. The evaluation JSON has only "JobID"; it has no "JobModifyIndex" or "JobVersion" field that I could connect to the Job history.
How can I find which Job version is associated with an allocation? The Nomad UI shows it - how can I get that information using only the Nomad event stream? The evaluation has "ModifyIndex" - can I use that?
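For reference, a minimal sketch of consuming the event stream the question describes, using Java's built-in HTTP client. The agent address, port, and topic filters are assumptions about a default local setup; this only shows how the frames arrive and does not by itself surface the Job version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Stream;

public class NomadEventTail {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Long-lived streaming request; adjust the address and topics for your cluster.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:4646/v1/event/stream"
                        + "?topic=Allocation&topic=Evaluation&topic=Job"))
                .build();

        // Each line is a JSON frame that can be parsed and stored as-is.
        HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        response.body().forEach(System.out::println);
    }
}
```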

NiFi - Stop upon failure

I've been trying to google and search Stack Overflow for the answer but have been unable to find one.
Using NiFi, is it possible to stop a process upon previous job failure?
We have user data we need to process but the data is sequentially constructed so that if a job fails, we need to stop further jobs from running.
I understand we can create scripts to fail a process upon a previous process's failure, but what if I need the entire group to halt upon failure - is this possible? We don't want each job in the queue to follow the failure path; we want it to halt until we can look at the data and analyze the failure.
TL;DR - can we STOP a process upon a failure, rather than just funnel all remaining jobs into the failure flow? We want data in the queues to wait until we fix the problem - stop the process, not just fail again and again.
Thanks for any feedback, cheers!
Edit: typos
You can configure backpressure on the queues to stop upstream processes. If you set the backpressure threshold to 1 on a failure queue, it would effectively stop the processor until you had a chance to address the failure.
The screenshot shows the failure relationship routing back to the processor, but this is not required. What matters is that the downstream processor does not pull flowfiles off the failure queue, so the backpressure holds until you take action.
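Backpressure is normally configured in the NiFi UI (connection settings), but it can also be set over the REST API. The sketch below is a hedged example of that approach: the connection id, the revision version, and the JSON field names are assumptions based on the ConnectionDTO as I recall it, and in practice you would first GET /nifi-api/connections/{id} to obtain the current revision and component before updating it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetFailureBackpressure {
    public static void main(String[] args) throws Exception {
        String connectionId = "your-failure-connection-id";   // placeholder

        // Hypothetical payload: the revision version must match the connection's
        // current revision (fetch it first); only the threshold is being changed.
        String body = """
            {
              "revision": { "version": 0 },
              "component": {
                "id": "%s",
                "backPressureObjectThreshold": 1
              }
            }
            """.formatted(connectionId);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/connections/" + connectionId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

With the threshold at 1, a single flowfile sitting in the failure queue applies backpressure and the upstream processor stops being scheduled until you empty or reroute the queue.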

Is it possible to restart a "killed" Hadoop job from where it left off?

I have a Hadoop job that processes log files and reports some statistics. This job died about halfway through because it ran out of file handles. I have fixed the file-handle issue and am wondering if it is possible to restart a "killed" job.
As it turns out, there is not a good way to do this; once a job has been killed there is no way to re-instantiate that job and re-start processing immediately prior to the first failure. There are likely some really good reasons for this but I'm not qualified to speak to this issue.
In my own case, I was processing a large set of log files and loading them into an index, while also creating a report on the contents of those files. To make the job more tolerant of failures on the indexing side (a side effect; this isn't related to Hadoop at all), I altered my job to instead create many smaller jobs, each one processing a chunk of the log files. When one of these jobs finishes, it renames the processed log files so that they are not processed again. Each job waits for the previous job to complete before running.
Chaining multiple MapReduce jobs in Hadoop
When one job fails, all of the subsequent jobs quickly fail afterward. Simply fixing whatever the issue was and re-submitting my job will, roughly, pick up processing where it left off. In the worst-case scenario, where a job was 99% complete at the time of its failure, that one job will be erroneously and wastefully re-processed.
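A rough sketch of that "many small jobs" driver is shown below. The mapper/reducer classes, paths, and the ".done" renaming convention are placeholders for whatever the real pipeline uses; the point is that each chunk runs as its own job, failures stop the chain, and already-processed chunks are skipped on re-submission.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChunkedLogDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/logs/incoming");   // placeholder input directory

        for (FileStatus chunk : fs.listStatus(inputDir)) {
            if (chunk.getPath().getName().endsWith(".done")) {
                continue;                              // already processed in an earlier run
            }

            Job job = Job.getInstance(conf, "index-" + chunk.getPath().getName());
            job.setJarByClass(ChunkedLogDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...);  // your classes here
            FileInputFormat.addInputPath(job, chunk.getPath());
            FileOutputFormat.setOutputPath(job,
                    new Path("/logs/output/" + chunk.getPath().getName()));

            // Each small job waits for the previous one; stop the chain on failure so
            // unprocessed chunks are left untouched for the next run.
            if (!job.waitForCompletion(true)) {
                System.err.println("Job failed on " + chunk.getPath() + "; stopping.");
                System.exit(1);
            }

            // Mark the chunk as processed so a re-submission skips it.
            fs.rename(chunk.getPath(), chunk.getPath().suffix(".done"));
        }
    }
}
```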

Win32 API TerminateProcess() clarification with pending I/O operations which are cancelled or completed

According to http://msdn.microsoft.com/en-us/library/ms686714(VS.85).aspx:
TerminateProcess initiates termination and returns immediately. This stops execution of all threads within the process and requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been completed or canceled.
In my application, I sometimes need to forcibly kill a process that enters a bad state. I am using Lucene for indexing, and the statement above worries me: although Lucene is designed to be tolerant of crashes, if I/O operations can be "canceled" rather than "completed", it seems to me the index could still be corrupted.
Can anyone shed any more light on when/if an I/O operation can be cancelled?
I am reading
This [...] requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been completed or canceled.
as
This [...] requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been canceled. Some pending I/O may complete slightly before it would have been canceled.
I would therefore expect anything from any to all of the pending I/O to complete.
If you want to "forcibly kill a process that enters a bad state" you cannot expect that the application state/data will be left in a good state.

A lot of time spent on the following waits: 'SQL*Net message from client' and 'wait for unread message on broadcast channel'

My application, which wraps Oracle Data Pump's executables IMPDP and EXPDP, takes varying amounts of time for the same work. On further investigation, I see it waiting, again for varying amounts of time, on the event 'wait for unread message on broadcast channel'. This makes the application take anywhere between 10 minutes and over an hour for the same work.
I cannot tell whether this has something to do with the way my application uses these executables, with the load on my server, or with something else entirely.
There are a number of processes and sessions involved in a Data Pump operation.
I suspect you are looking at the master process, not at the worker processes. All that event is telling you is that the master process spends more time waiting for the worker processes when the job takes longer, which is fairly useless information.
You need to monitor the worker processes and see why they are taking longer.
Those wait events are usually considered "idle" waits - i.e. Oracle has nothing to do and is waiting for further data/instructions.
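One way to look at the worker sessions instead of the idle master waits is to join DBA_DATAPUMP_SESSIONS to V$SESSION and see what each session is actually waiting on. The sketch below runs that query over JDBC; the connection string and credentials are placeholders, and the query needs DBA-level privileges.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataPumpWorkerWaits {
    public static void main(String[] args) throws Exception {
        // Map each Data Pump session to its current wait event.
        String sql =
            "SELECT s.sid, d.job_name, s.event, s.seconds_in_wait " +
            "FROM dba_datapump_sessions d " +
            "JOIN v$session s ON s.saddr = d.saddr " +
            "ORDER BY d.job_name, s.sid";

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB", "system", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s sid=%d event=%s wait=%ds%n",
                        rs.getString("job_name"), rs.getInt("sid"),
                        rs.getString("event"), rs.getInt("seconds_in_wait"));
            }
        }
    }
}
```

Sessions stuck on non-idle events (I/O, enqueue, CPU-related waits) are the ones worth investigating; the master's broadcast-channel wait will simply mirror however long the workers take.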
