Can anyone explain to me the definitions of, and differences between, sequential consistency and quiescent consistency? In the simplest terms possible :|
I did read this: Example of execution which is sequentially consistent but not quiescently consistent
But I am not able to understand sequential and quiescent consistency themselves :(
Sequential consistency requires that operations appear to take effect in the order they are specified in each program. Basically it enforces program order within each individual process, and allows all processes to assume they are observing the same order of operations. Let's say we have 2 processes enqueuing and dequeuing items on a queue q:
P1 -- q.enq(x) -----------------------------
P2 -------------- q.enq(y) ---- q.deq():y --
This is not the behaviour we'd expect from a FIFO queue: we'd expect to dequeue x, because P1 enqueues x before P2 enqueues y. However, this scenario is allowed under sequential consistency, because sequential consistency doesn't require the order seen by all processes to match the real-time order. There's at least one sequential execution that can explain these results, and one is:
P2:q.enq(y) P1:q.enq(x) P2:q.deq():y
In this execution each process executes its operations in program order, i.e. in the order they are specified in that process's code.
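To make that concrete, here's a small Python sketch (mine, not part of the original answer) that enumerates every interleaving preserving each process's program order and checks which ones a FIFO queue could legally produce; the witness ordering above shows up among them.

    from itertools import permutations
    from collections import deque

    # Per-process histories (program order): op = (process, action, value)
    P1 = [("P1", "enq", "x")]
    P2 = [("P2", "enq", "y"), ("P2", "deq", "y")]

    def legal_fifo(history):
        """Replay a sequential history against a FIFO queue; True if every deq matches."""
        q = deque()
        for _, action, value in history:
            if action == "enq":
                q.append(value)
            else:  # deq
                if not q or q.popleft() != value:
                    return False
        return True

    def preserves_program_order(history, *programs):
        """Each process's ops must appear in the same relative order as its program."""
        return all([op for op in history if op[0] == prog[0][0]] == prog for prog in programs)

    ops = P1 + P2
    witnesses = [h for h in permutations(ops)
                 if preserves_program_order(list(h), P1, P2) and legal_fifo(h)]
    # The witness P2:enq(y), P1:enq(x), P2:deq():y is among the results,
    # so the history is sequentially consistent.
    print(witnesses)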
Quiescent consistency requires non-overlapping operations to appear to take effect in their real-time order, but overlapping operations might be reordered. Therefore the same scenario is not allowed under quiescent consistency, because we expect q.enq(x) to appear to take effect before q.enq(y), and q.deq() to return x instead of y. Also, quiescent consistency doesn't necessarily preserve program order: if q.enq(x) and q.enq(y) were concurrent (overlapping) operations, they could be reordered and q.deq():y would be quiescently consistent.
Basically some executions are sequentially consistent but not quiescently consistent, and vice versa.
First you should understand what program order is: it is simply the order in which the instructions appear in your program, i.e. the order in which you expect them to run.
But program order only exists within a single-threaded program. With multiple threads a problem arises: the program order may not hold, or may not even exist, because sometimes you cannot tell which thread's method call happens first.
Quiescent consistency gives you a clear order over all threads' behaviour across quiescent periods: two method calls separated by a period of quiescence do not overlap, so they must appear to take effect in their real-time order.
Sequential consistency allows overlaps, but requires that you can find some sequential order in which all the method calls can be placed, consistent with each thread's program order, such that every call returns a correct value and the object behaves correctly.
Related
Is MPI_Gather compulsory after MPI_Scatter, or can we just scatter and leave the data on the nodes?
I scattered a 2D array and counted the evens and odds. The program is working fine without a gather.
I think that, since gather only returns the scattered items, it would be fine if I do not gather in my case.
No, it is not compulsory (MPI doesn't even care). In theory there is no relation between MPI_Scatter and MPI_Gather: they are separate, independent collective operations. Because of their opposite behaviour (scatter distributes data, gather collects it), they are often used one after the other to send out and then collect data, but neither requires the other.
Only point-to-point communication needs a corresponding receive for every send, because point-to-point communication sends messages between two different MPI processes.
Collectives, on the other hand, involve all the processes in the communicator, and hence don't need a matching pair.
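For illustration, here is a minimal sketch using mpi4py (a Python stand-in for the C calls; the array contents and layout are made up): each rank receives its share of rows, counts evens and odds locally, and never calls a gather.

    # Run with: mpiexec -n 4 python scatter_only.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # example 2D array
        chunks = [data[i::size] for i in range(size)]            # split rows among ranks
    else:
        chunks = None

    rows = comm.scatter(chunks, root=0)   # each rank gets its share of rows
    evens = sum(1 for row in rows for v in row if v % 2 == 0)
    odds = sum(1 for row in rows for v in row if v % 2 != 0)
    print(f"rank {rank}: evens={evens}, odds={odds}")
    # No gather needed: the per-rank results simply stay on each rank.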
(Note: this isn't about parallel execution of a query inside the RDBMS, but about the performance characteristics of submitting queries in parallel.)
I have a process that executes thousands (if not tens of thousands) of queries in a single-threaded manner (i.e. send, wait for response, process, send...), loosely of the form
select a,b from table where id = 123
i.e. it queries a single record on an already-indexed field, against an Oracle database.
This process takes longer than desired. Having gathered some metrics on it, I'm sure that 90% of the time is spent in server-side execution (and transport) rather than client-side.
This process can naturally be split into N 'jobs', and it's been suggested that this could/should speed things up.
Naively you would expect it to run N times quicker (with a small overhead to merge the answer).
Given that (loosely speaking) SQL is 'serialised', is this actually the case though? That would imply that it would probably not run any quicker at all.
I assume that for an update on a single record (for example) N updates would have to be effectively serialised, but for N reads, this may not be the case.
Which theory is most accurate (or is it neither)?
I'm not a DBA, but it looks like reads never block reads, so assuming infinite resources the theory would be that N reads could run completely in parallel with no blocking. Mixing writes and reads gets more complex depending on how you set up your transactions/locks, but that's out of scope here.
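As a rough sketch of how the client-side split might look (Python; get_connection, table_name and the chunking are placeholders I've invented, not details from the question): give each of the N workers its own connection and a slice of the IDs, then merge the results.

    from concurrent.futures import ThreadPoolExecutor

    def fetch_chunk(ids):
        conn = get_connection()           # placeholder: e.g. an oracledb/cx_Oracle connection
        cur = conn.cursor()
        out = []
        for record_id in ids:
            cur.execute("select a, b from table_name where id = :id", {"id": record_id})
            out.extend(cur.fetchall())
        conn.close()
        return out

    def parallel_lookup(all_ids, n_jobs=8):
        chunks = [all_ids[i::n_jobs] for i in range(n_jobs)]    # split the work into N jobs
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            results = pool.map(fetch_chunk, chunks)
        return [row for chunk in results for row in chunk]      # merge the answers

If reads really don't block each other, the speedup is then bounded mostly by server CPU/IO and the number of connections, not by locking.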
I would like to ask if someone has an idea of the best (fastest) algorithm for the following scenario:
X processes generate a list of very large files. Each process generates one file at a time
Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
At a given time, one X process will notify one Y process through a Load Balancer that uses the round-robin algorithm
Each file has a size and naturally, bigger files will keep both X and Y more busy
Limitations
Once a file gets on a Y process it would be impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages to this approach
Sometimes X falls behind (files are no longer pushed). This isn't really affected by the queueing system; no matter how I change it, X will still have slow and good times.
Sometimes Y falls behind (a lot of files gather in the queues). Again, same as before.
One Y process is busy with a very large file while it also has several small files in its queue that could be taken on by other Y processes.
The notification itself goes over HTTP and seems somewhat unreliable at times. Notifications fail and debugging has not revealed anything.
There are some more details that would help to see the picture more clearly.
Y processes are DB threads/jobs
X processes are web apps
Once files reach the X processes, these also burn resources on the DB side by querying it, which has an impact on the producing part.
Now I considered the following approach:
X will produce files like it has before but will not notify Y. It will hold a buffer (table) to populate the file list
Y will constantly search for files in the buffer and retrieve them itself and store them in its own queue.
Now, would this change be practical? Like I said, each Y process has its own queue, and it doesn't seem efficient to keep those any more. If so, then I'm still undecided on the next bit:
How to decide which files to fetch
I've read through the knapsack problem and I think it would apply if I had the entire list of files from the beginning, which I don't. Actually, I do have the list and the size of each file, but I wouldn't know when each file would be ready to be taken.
I've gone through the producer-consumer problem, but that centers around a fixed buffer and optimising its use, whereas in this scenario the buffer is unlimited and I don't really care whether it is large or small.
The next best option would be a greedy approach where each Y process locks on the smallest file and takes it. At first glance it appears to be the fastest approach, and I'm currently building a simulation to verify that, but a second opinion would be fantastic.
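For what it's worth, a minimal sketch of that greedy pull (Python, with the buffer table mocked as an in-memory list guarded by a lock; purely illustrative): each Y worker repeatedly locks the buffer, claims the smallest ready file, and processes it.

    import threading

    buffer_lock = threading.Lock()
    ready_files = []   # stand-in for the buffer table: list of (size, path) rows

    def take_smallest():
        """Greedy choice: lock the buffer and claim the smallest ready file, if any."""
        with buffer_lock:
            if not ready_files:
                return None
            ready_files.sort(key=lambda f: f[0])   # smallest size first
            return ready_files.pop(0)

    def y_worker(process_file, stop_event):
        while not stop_event.is_set():
            item = take_smallest()
            if item is None:
                stop_event.wait(0.5)    # nothing ready yet; poll the buffer again shortly
                continue
            size, path = item
            process_file(path)

One thing worth checking in the simulation: always taking the smallest file first can starve the largest files if small ones keep arriving.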
Update
Just to be sure that everyone gets the big picture, I'm linking a quickly drawn diagram here.
Jobs are independent of Processes. They run at their own speed and process as many files as possible.
When a Job finishes with a file, it sends an HTTP request to the LB.
Each process queues requests (files) coming from the LB
The LB works on a round robin rule
Diagram
The current LB idea is not good
The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.
Just have consumers notify the LB when they're idle. When a new job arrives from a producer, the LB selects the first idle consumer and sends the job there. When there are no idle consumers, the producer's job is queued in the LB, waiting for a free consumer to appear.
This way consumers will always be optimally busy.
You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. You need it only to be 100 times faster (because you infer there are 100 consumers) than the average "very large file" handling operation. File handling is normally 10th of seconds or seconds. A queue handler based, say, on an Apache mod or Redis for two random choices, could pretty easily serve 10,000 requests per second. This is a factor of 10 away from being a bottleneck.
If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.
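A minimal sketch of that design (Python; the names are illustrative, not from the answer): the LB keeps a FIFO of idle consumers and a FIFO of pending jobs, and matches them as events arrive.

    from collections import deque

    class LoadBalancer:
        """Sketch of the 'queue in the LB, dispatch to idle consumers' design."""
        def __init__(self, send):
            self.idle = deque()      # consumers that reported they're idle (FIFO)
            self.pending = deque()   # jobs waiting because no consumer is idle
            self.send = send         # callback: send(job, consumer) actually dispatches

        def consumer_idle(self, consumer):
            if self.pending:                         # work waiting? hand it out immediately
                self.send(self.pending.popleft(), consumer)
            else:
                self.idle.append(consumer)           # otherwise remember the idle consumer

        def job_arrived(self, job):
            if self.idle:                            # first idle consumer gets the job
                self.send(job, self.idle.popleft())
            else:
                self.pending.append(job)             # no one idle: queue it in the LB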
If the LB absolutely cannot queue work
Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to make Ty(t) values equal for all y and t. This is the ideal.
To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, the LB finds the consumer y with the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.
The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.
A standard approach (e.g. from OS scheduling) is to maintain a pair [t, T]_y for each consumer y, where T is the estimate of Ty(t) that was computed at the past epoch t. Thus at a later epoch t+d, we can estimate Ty(t+d) as max(T-d, 0). The max is there because for d > T the estimated work has expired, so the consumer should be finished.
The LB uses whatever information it can get to update the model. Examples are estimates of the time a job will require (from your description, probably based on file size and other characteristics), notification that a consumer has actually finished a job (the LB decreases T by the estimated duration of the completed job and updates t), assignment of a new job (the LB increases T by the estimated duration of the new job and updates t), and intermediate progress updates of estimated time remaining from consumers during long jobs.
If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example a list of estimated job durations, where the head of the list is the one currently being executed.
The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.
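As a sketch of the [t, T]_y bookkeeping described above (Python; time.monotonic() supplies the epochs and the duration estimator is a placeholder): on each assignment the LB decays every estimate by the elapsed time, picks the consumer with the least estimated remaining work, and grows that consumer's estimate by the new job's expected duration.

    import time

    class LeastTimeRemainingLB:
        """Sketch of the [t, T]_y model: T estimates a consumer's remaining work,
        t is the epoch at which that estimate was computed."""
        def __init__(self, consumers, estimate_duration):
            now = time.monotonic()
            self.state = {y: (now, 0.0) for y in consumers}     # y -> (t, T)
            self.estimate_duration = estimate_duration          # e.g. based on file size

        def remaining(self, y, now):
            t, T = self.state[y]
            return max(T - (now - t), 0.0)      # decay the estimate by the elapsed time d

        def assign(self, job):
            now = time.monotonic()
            # Pick the consumer with the least estimated time remaining...
            y = min(self.state, key=lambda c: self.remaining(c, now))
            # ...and grow its estimate by the new job's expected duration.
            self.state[y] = (now, self.remaining(y, now) + self.estimate_duration(job))
            return y    # caller dispatches the job to consumer y
        # Finish notifications and progress reports from consumers would be used
        # to correct (t, T) in the same way, as described above.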
We have a list of tasks with different lengths, a number of CPU cores, and a context-switch time.
We want to find the best scheduling of tasks among the cores to maximize processor utilization.
How could we find this?
Isn't it the case that if we choose the biggest available tasks from the list and hand them one by one to whichever cores are currently ready, we get the best schedule? Or do you think we must try all orders to find out which is best?
I must add that all cores are ready at time unit 0 and the tasks are supposed to run concurrently.
The idea here is that there's no silver bullet; you must consider what types of tasks are being executed and try to schedule them as sensibly as possible.
CPU-bound tasks don't do much communication (I/O), and thus need to run continuously, being interrupted only when necessary -- according to the policy being used;
I/O-bound tasks may be repeatedly set aside during execution, allowing other processes to work, since they will be sleeping for long periods waiting for data to arrive in primary memory;
interactive tasks must be kept running, but need not run without interruption, as they themselves generate interruptions while waiting for user input; however, they need a high priority so that the user does not notice delays in their execution.
Considering this, and the context-switch costs, you must evaluate what types of tasks you have, and choose one or more policies for your scheduler accordingly.
Edit:
I thought this was simply a conceptual question. Given that you have to implement a solution, you must analyze the requirements.
Since you have the task lengths and the context-switch times, and you have to keep the cores busy, this becomes an optimization problem: you must keep the number of idle cores minimal toward the end of the schedule, while also keeping the number of context switches down so that your overall execution time does not grow too much.
As pointed out by svick, this sounds like the partition problem, which is NP-complete, and in which you need to divide a sequence of numbers into a given number of lists so that the sums of the lists are all equal.
In your problem you'd have a relaxation of the objective: you no longer need all the cores to execute for the same amount of time, but you want the difference between any two cores' execution times to be as small as possible.
In the reference given by svick, you can see a dynamic programming approach that you may be able to map onto your problem.
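If an approximation is acceptable, a greedy longest-processing-time (LPT) assignment is a common starting point. Here is an illustrative Python sketch (my own, not from the referenced answer) that also charges a context-switch cost between tasks on the same core; it approximates the balanced-partition objective but is not guaranteed optimal.

    import heapq

    def lpt_schedule(task_lengths, n_cores, switch_cost=0.0):
        """Greedy LPT: hand the longest remaining task to the least-loaded core."""
        # Heap of (current_finish_time, core_id, assigned_tasks)
        cores = [(0.0, c, []) for c in range(n_cores)]
        heapq.heapify(cores)
        for length in sorted(task_lengths, reverse=True):        # biggest tasks first
            finish, core_id, tasks = heapq.heappop(cores)        # least-loaded core
            overhead = switch_cost if tasks else 0.0             # pay a switch between tasks
            heapq.heappush(cores, (finish + overhead + length, core_id, tasks + [length]))
        makespan = max(finish for finish, _, _ in cores)
        return makespan, cores

    # Example: 3 cores, tasks of varying length, 0.1 time units of switch overhead.
    print(lpt_schedule([7, 5, 4, 4, 3, 2, 2, 1], n_cores=3, switch_cost=0.1))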
I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases the number of operations is so large compared to the time each operation takes that simply creating a task per operation is inappropriate due to overhead, even though GCD's overhead is typically low.
So what you'd want to do is split the operations into reasonably sized chunks, where each task operates on one chunk. But how can I determine the appropriate number of tasks/chunks?
Testing and profiling. What makes sense, and what works well, is application-specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker size/task chunk size, and output it at the end. The highest throughput is your best combination.
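As an illustration of that sweep, here's a rough Python sketch (a process pool stands in for GCD; the per-item work and the parameter grids are made up) that measures items processed per second for each (workers, chunk size) pair and reports the best combination.

    import time
    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk):
        # Stand-in for your real per-item work.
        return [item * item for item in chunk]

    def benchmark(items, worker_counts, chunk_sizes):
        results = {}
        for workers in worker_counts:
            for chunk_size in chunk_sizes:
                chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
                start = time.perf_counter()
                with ProcessPoolExecutor(max_workers=workers) as pool:
                    list(pool.map(process_chunk, chunks))
                elapsed = time.perf_counter() - start
                results[(workers, chunk_size)] = len(items) / elapsed   # items per second
        return max(results, key=results.get), results

    if __name__ == "__main__":
        best, all_results = benchmark(list(range(100_000)),
                                      worker_counts=[2, 4, 8],
                                      chunk_sizes=[100, 1_000, 10_000])
        print("best (workers, chunk_size):", best)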
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time while others take 3X), then you can take a couple of approaches. Depending on the nature of your incoming work, you can try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can read the data in a task, and the data will give you an idea of whether or not that task will take X time, or X*3 (or something in between) you can use that information before processing the tasks themselves to dynamically adjust the worker/task size to achieve the best throughput depending on current workload. This approach is taken with Amazon EC2 where customers will spin-up extra VMs when needed to handle higher load, and spin them back down when load drops, for example.
Whatever you choose, any unknown speed question should almost always be answered with some kind of benchmarking, if the speed at which it runs is critical to the success of your application (sometimes the time to process is so small that it's negligible).
Good luck!