Dijkstra's Bankers Algorithm - algorithm

Could somebody please provide a step-through approach to solving the following problem using the Banker's Algorithm? How do I determine whether a "safe-state" exists? What is meant when a process can "run to completion"?
In this example, I have four processes and 10 instances of the same resource.
Resources Allocated | Resources Needed
Process A 1 6
Process B 1 5
Process C 2 4
Process D 4 7

Per Wikipedia,
A state (as in the above example) is considered safe if it is possible for all processes to finish executing (terminate). Since the system cannot know when a process will terminate, or how many resources it will have requested by then, the system assumes that all processes will eventually attempt to acquire their stated maximum resources and terminate soon afterward. This is a reasonable assumption in most cases since the system is not particularly concerned with how long each process runs (at least not from a deadlock avoidance perspective). Also, if a process terminates without acquiring its maximum resources, it only makes it easier on the system.
A process can run to completion when the number of each type of resource that it needs is available, between itself and the system. If a process needs 8 units of a given resource, and has allocated 5 units, then it can run to completion if there are at least 3 more units available that it can allocate.
Given your example, the system is managing a single resource, with 10 units available. The running processes have already allocated 8 (1+1+2+4) units, so there are 2 units left. The amount that any process needs to complete is its maximum less whatever it has already allocated, so at the start, A needs 5 more (6-1), B needs 4 more (5-1), C needs 2 more (4-2), and D needs 3 more (7-4). There are 2 available, so Process C is allowed to run to completion, thus freeing up 2 units (leaving 4 available). At this point, either B or D can be run (we'll assume D). Once D has completed, there will be 8 units available, after which either A or B can be run (we'll assume A). Once A has completed, there will be 9 units available, and then B can be run, which will leave all 10 units left for further work. Since we can select an ordering of processes that will allow all processes to be run, the state is considered 'safe'.

Resources Allocated | Resources Needed claim
Process A 1 6 5
Process B 1 5 4
Process C 2 4 2
Process D 4 7 3
Total resources allocated is 8
Hence 2 resources are yet to be allocated hence that is allocated to process C. and process c after finishing relieves 4 resources that can be given to process B ,Process B after finishing relives 5 resources which is allocated to PROCESS A the n process A after finishing allocates 2 resources to process D

Related

Are all processor cores on a cache-coherent system required to see the same value of a shared data at any point in time

From what I've learnt, cache coherence is defined by the following 3 requirements:
Read R from an address X on a core C returns the value written by the most recent write W to X on C if no other core has written to X between W and R.
If a core C1 writes to X and a core C2 reads after a sufficient time, and there are no other writes in between, C2's read returns the value from C1's write.
Writes to the same location are serialized: any two writes to X must be seen to occur in the same order on all cores.
As far as I understand these rules, they basically require all threads to see updates made by other threads within some reasonable time and in the same order, but there seems to be no requirement about seeing the same data at any point in time. For example, say thread A wrote a value to a shared memory location X, then thread B wrote another value to X. Threads C and D reading from X must see the same order of updates: A, B. Imagine that thread C has already seen both updates A and B, while thread D has only observed A (the event B is yet to be seen). Provided that the time interval between writes to X and reads from X is small enough (less than what we consider a sufficient time), this situation doesn't violate any rules of coherence, does it?
On the other hand, coherence protocols, e.g. MSI use write-invalidation to guarantee that all cores have an up-to-date value of a shared variable. Wiki says: "The intention is that two clients must never see different values for the same shared data". If what I wrote about the coherence rules is true, I don't understand where this point comes from. I mean, I realize it's useful, but don't see where it is defined.

Optimum memory consumption program

Write a program to find the processes which utilize the memory optimally, given the list of the processes with their memory usage and the total memory available.
Example:-
Total memory :- 10
First column denotes process id's and 2nd column is the memory consumption of respective process.
1 2
2 3
3 4
4 5
Answer should be processes {1,2,4} with memory consumption {2,3,5} as 2+3+5=10
This question is Knapsack problem
I believe you can find many sample code on Google

When/how to benefit from parallel processing of scripts?

I have a sequence of scripts to run on a computer with 1 physical & logical core.
I have tried running them in sequence, and also forking them with something like the bash script below. The background processes, running in parallel, actually took longer than running them in sequence.
My question is: Under what circumstances should a workload like this be run in parallel? That is, must I have particular hardware or can I do this more efficiently with the single-processor computer that I have?
#!/bin/bash
# Run them in sequence...
T1=$(date +%s)
python proc_test1.py
python proc_test2.py
python proc_test3.py
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
# Now fork them...
T1=$(date +%s)
python proc_test1.py &
python proc_test2.py &
python proc_test3.py &
wait
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
exit 0
Parallel processing usually becomes beneficial when hardware is available. For example, if you had three logical CPUs, then three computations could occur simultaneously. Thus, ideally, if you ran proc_test1.py three times with forks with three available processors, all three would finish in the same time it would take to run just one instance of proc_test1.py.
In other words, given sufficient hardware, running three proc_test1.py's in serial will take three times as long as running them with forks.
Now, given that you just have one hardware cpu, it makes sense that the parallel jobs would run more slowly than the serial ones, as each python program will be competing with each other for cpu time. The cpu stopping one job and resuming another costs cpu time itself.
For example, say you had 6 oranges and two hands, and you had to hold all 6 oranges for 5 seconds total. Say it takes you one second to pick up or swap out oranges. You could do this task in serial, and pick up two oranges at a time for five seconds before swapping for a new pair. This would take you
1 + 5 + 1 + 5 + 1 + 5
= 3 * 5 + 3 = 18
seconds to complete.
Now suppose the parallel analogy. Then all 6 oranges are begging to be picked up and you holding one does not mean that you wont immediately drop it for an alternative. There isn't necessarily an upper bound on how long it will take you to complete the task as we have defined it, so suppose that you have to hold the oranges for at least 2.5 seconds before you swap them out, and you only swap in pairs. Then, it will take you
1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1
= 3 * 5 + 7 = 22
seconds to complete. Note that by "forking", it takes you 22% longer to hold 6 oranges for 5 seconds each. Since you have two hands, it still takes you 15 seconds to complete the task, but there is a variable overhead in switching time based on your strategy. Note that if you had 6 hands, it would take only 7 seconds to complete the task.
Thus when you have more processors, fork processes, otherwise you're just juggling jobs on limited hardware.
Your simple question is in practice extremely hard to answer in general. In real life I would always measure to see if reality agrees with my theory.
An example where reality did not agree with my theory was on my Intel Core i7. It has 4 cores and has hyperthreading. This would suggest that running 8 threads in parallel will be optimal: You will be using the 4 processing units and use 4 additional threads for keeping the pipelines filled.
However, Core i7 has 6MB of shared cache. It just so happened, that the working set of my data fit inside the 6MB. So I saw an extreme speedup by running 1 thread instead of 2 or even 8: Running more than 1 would simply flush the cache all the time. This would not have been true if the cache had not been shared, but it just shows that it is not simple to say when parallelizing will be faster.

Peterson-2 mutual exclusion algorithm

The contention-free complexity for classic Peterson-2 algorithm is 4 (because it performs 4 read/write operations to shared-registers memory)Is there some verion of Peterson-2 algorithm, which requires less accesses to shared-registers memory ?
It is obvious that 1 access is impossible.But what about 2 or 3 accesses?
Thank you
At least three operations per critical section are needed: a write and a read on entry (to declare acquisition of the mutex and verify that the other process has not acquired), a write on exit (to release the mutex). On entry, process id in Peterson's algorithm writes the single-writer register interested[id] and the multi-writer register turn. At the cost of turning a bounded register into one that also holds an unbounded version number, for two processes, there's a simulation of a multi-writer register by two single-writer registers that makes 1 write per write and 1 read per read, allowing the merger of the two writes in Peterson's algorithm.

FPGA timing question

I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.
I have read that latency is calculated in terms of cycle-time. Hence, overall execution time = latency * cycle time.
I want to optimize the time needed in processing the data, I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it to calculate in two cycles (result1 = b * c) & (a = result1 * d), the overall execution time would be latency of 2 * cycle time(which is determined by the delay of the multiplication operation say value X) = 2X
If I make the calculation in one cycle ( a = b * c * d). the overall execution time would be latency of 1 * cycle time (say value 2X since it has twice of the delay because of two multiplication instead of one) = 2X
So, it seems that for optimizing the performance in terms of execution time, if I focus only on decreasing the latency, the cycle time would increase and vice versa. Is there a case where both latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency and when should I focus on cycle-time?
Also, when I am programming in C++, it seems that when I want to optimize the code, I would like to optimize the latency( the cycles needed for the execution). However, it seems that for FPGA programming, optimizing the latency is not adequate as the cycle time would increase. Hence, I should focus on optimizing the execution time ( latency * cycle time). Am I correct in this if I could like to increase the speed of the program?
Hope that someone would help me with this. Thanks in advance.
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, to process 10 items doing a = b x c x d in one cycle (one cycle = 2t) would take 20t. However doing it in two 1t cycles, to process 10 items would take 11t.
Hope that helps.
Edit Add timing.
Calculation in one 2t cycle. 10 calculations.
Time 0 2 2 2 2 2 2 2 2 2 2 = 20t
Input 1 2 3 4 5 6 7 8 9 10
Output 1 2 3 4 5 6 7 8 9 10
Calculation in two 1t cycles, pipelined, 10 calculations
Time 0 1 1 1 1 1 1 1 1 1 1 1 = 11t
Input 1 2 3 4 5 6 7 8 9 10
Stage1 1 2 3 4 5 6 7 8 9 10
Output 1 2 3 4 5 6 7 8 9 10
Latency for both solutions is 2t, one 2t cycle for the first one, and two 1t cycles for the second one. However the through put of the second solution is twice as fast. Once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required say 5 1t cycles, then the latency would be 5t, but the through put would still be 1t.
You need another word in addition to latency and cycle-time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be increased by 2x over the "do it all in one cycle".
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles) you can do it in 2 lots of 20ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, butmakes the sums easy). So now it takes 2x25+10=50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier) you can push new data into the pipeline every 25+a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in at 35ns times or nearly 30M items/sec, which is better than your started with.
In real life the 10ns will bemuch less, often 100s of ps, so the gains are much larger.
George described accurately the meaning latency (which does not necessary relate to computation time). Its seems you want to optimize your design for speed. This is very complex and requires much experience. The total runtime is
execution_time = (latency + (N * computation_cycles) ) * cycle_time
Where N is the number of calculations you want to perform. If you develop for acceleration you should only compute on large data sets, i.e. N is big. Usually you then dont have requirements for latency (which could be in real time applications different). The determining factors are then the cycle_time and the computation_cycles. And here it is really hard to optimize, because there is a relation. The cycle_time is determined by the critical path of your design, and that gets longer the fewer registers you have on it. The longer it gets, the bigger is the cycle_time. But the more registers you have the higher is your computation_cycles (each register increases the number of required cycles by one).
Maybe I should add, that the latency is usually the number of computation_cycles (its the first computation that makes the latency) but in theory this can be different.

Resources