Optimum memory consumption program - algorithm

Write a program to find the set of processes that uses the available memory optimally, given a list of processes with their memory usage and the total memory available.
Example:
Total memory: 10
The first column is the process ID and the second column is the memory consumption of that process.
1 2
2 3
3 4
4 5
The answer should be processes {1,2,4} with memory consumption {2,3,5}, since 2+3+5=10.

This is the 0/1 knapsack problem (here, with each process's value equal to its memory use, it reduces to subset sum).
I believe you can find many sample implementations online.
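
For concreteness, here is a minimal Python sketch of that approach (the function name and structure are my own). Since each process's "value" equals its memory use, the DP below maximizes the memory used without exceeding the capacity and reconstructs the chosen process IDs:

    def best_fit(processes, capacity):
        # best[m] holds a list of process IDs whose memory sums to exactly m,
        # or None if no subset reaches m.
        best = [None] * (capacity + 1)
        best[0] = []
        for pid, mem in processes:
            # Iterate downwards so each process is used at most once.
            for m in range(capacity, mem - 1, -1):
                if best[m] is None and best[m - mem] is not None:
                    best[m] = best[m - mem] + [pid]
        total = max(m for m in range(capacity + 1) if best[m] is not None)
        return total, best[total]

    print(best_fit([(1, 2), (2, 3), (3, 4), (4, 5)], 10))  # -> (10, [1, 2, 4])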

Related

Processor cycles spent on useful work are lower when L1 access latency is high

I ran a benchmark program on a processor simulator with two different configurations.
Config 1 has an L1 access latency of 1 cycle (both hitDelay and missDelay are 1 cycle).
Config 2 has an L1 access latency of 7 cycles.
The total number of dynamic instructions completed in both runs of the same benchmark is 13743658, but the cycles attributed to completing and committing useful instructions are 68,782.17 in Config 2 and 158,498.33 in Config 1.
What's bizarre about this is that the processor spends fewer cycles (68,782.17) when the L1 access latency is 7 cycles than it does (158,498.33) when the L1 access latency is 1 cycle.
Can somebody explain why this would be the case? It seems counterintuitive.

Algo: given a sequence of memory accesses, find the minimum number of cache misses

The inputs are: the size of the cache s, the number of memory entries n, and a series of memory accesses.
Give the minimum number of cache misses possible.
Example:
s = 3, n = 4
1 2 3 1 4 1 2 3
min_miss = 4
I've been stuck the entire day. Thanks in advance!
You get to decide whatever behaviour the cache takes. You don't have to take in an entry even when it's accessed, for example, and the behaviour need not be regular: you don't need to follow a fixed "rule" for what to cache.
Try following http://en.wikipedia.org/wiki/Page_replacement_algorithm#The_theoretically_optimal_page_replacement_algorithm : when you need to swap something out, swap out the item that will not be used again for the longest possible time. Since you get the entire sequence of memory accesses ahead of time, this is feasible for you. It is obviously locally optimal at least up to the first cache miss after the cache becomes full, because every other strategy has had at least one cache miss by then. It is not obvious to me that it is globally optimal; searching, I found a proof at http://www.stanford.edu/~bvr/psfiles/paging.pdf, which claims that other proofs of its optimality exist but are even longer.
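
To make that concrete, here is a small Python sketch of the farthest-in-future policy (names are my own). It also uses the freedom noted in the question to bypass the cache when the incoming entry would itself be the one reused farthest in the future:

    def min_cache_misses(s, accesses):
        # Belady's optimal replacement: on a miss with a full cache, evict
        # (or bypass in favour of) the entry whose next use is farthest away.
        cache = set()
        misses = 0
        n = len(accesses)
        for i, x in enumerate(accesses):
            if x in cache:
                continue  # hit
            misses += 1
            if len(cache) < s:
                cache.add(x)
                continue
            def next_use(y):  # index of y's next access, or n if never again
                for j in range(i + 1, n):
                    if accesses[j] == y:
                        return j
                return n
            victim = max(cache, key=next_use)
            # Replace the victim only if x is reused sooner than it;
            # otherwise bypass the cache entirely (the problem allows this).
            if next_use(x) < next_use(victim):
                cache.remove(victim)
                cache.add(x)
        return misses

    print(min_cache_misses(3, [1, 2, 3, 1, 4, 1, 2, 3]))  # -> 4

The next_use scan makes this quadratic; for long traces you would precompute each access's next-use index in one backward pass.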

Memory allocation algorithm

How can we implement a data structure that allocates and tracks memory with the following constraints?
Allocate and free memory in O(1).
Minimal fragmentation.
Let's say memory comes in 1 KB units, and you need to allocate between 2 KB and 64 KB of memory.
For example:
A - 1
B - 1
C - 4
D - 2

    0 1 2 3 4 5 6 7 8 9 10
    x A x B C C C C D D x

If we always allocate memory at the minimum available address, then after some memory is freed (shown as x above) we end up with fragmentation. So in the example above, even though 3 units are free, we can't allocate 3 contiguous units.
Search for "buddy system" or "buddy memory allocation". That's probably the best solution you will find. It is not purely O(1), though; it suffers from some internal fragmentation, and external fragmentation can still happen.
You can avoid internal fragmentation completely by using an augmented tree, but then operations take O(log N) time, and you still have external fragmentation.
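
To make the buddy-system idea concrete, here is a minimal Python sketch (class and method names are my own, and a real allocator would use intrusively linked free lists so removal is O(1) rather than Python's list.remove):

    class BuddyAllocator:
        # Minimal buddy-system sketch (illustrative, not production code).
        # Manages 2**max_order units; requests are rounded up to a power
        # of two, which is where the internal fragmentation comes from.
        def __init__(self, max_order):
            self.max_order = max_order
            # free_lists[k] holds start offsets of free blocks of size 2**k
            self.free_lists = [[] for _ in range(max_order + 1)]
            self.free_lists[max_order].append(0)
            self.allocated = {}  # start offset -> block order

        def alloc(self, size):
            order = max(0, (size - 1).bit_length())  # round up to 2**order
            k = order
            while k <= self.max_order and not self.free_lists[k]:
                k += 1  # find the smallest free block that fits
            if k > self.max_order:
                raise MemoryError("no block large enough")
            start = self.free_lists[k].pop()
            while k > order:  # split off upper halves until it fits exactly
                k -= 1
                self.free_lists[k].append(start + (1 << k))
            self.allocated[start] = order
            return start

        def free(self, start):
            order = self.allocated.pop(start)
            while order < self.max_order:
                buddy = start ^ (1 << order)  # the sibling block's offset
                if buddy not in self.free_lists[order]:
                    break  # buddy still in use: cannot coalesce further
                self.free_lists[order].remove(buddy)
                start = min(start, buddy)  # merged block starts at the lower offset
                order += 1
            self.free_lists[order].append(start)

    heap = BuddyAllocator(6)   # 2**6 = 64 one-KB units
    a = heap.alloc(3)          # rounded up to a 4-unit block
    heap.free(a)               # coalesces back into one 64-unit block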

FPGA timing question

I am new to FPGA programming and I have a question regarding performance in terms of overall execution time.
I have read that latency is counted in cycles, so overall execution time = latency * cycle time.
Since I want to optimize the time needed to process the data, I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I compute it in two cycles (result1 = b * c, then a = result1 * d), the overall execution time is a latency of 2 * the cycle time, which is determined by the delay of one multiplication (say X): 2X in total.
If I do the calculation in one cycle (a = b * c * d), the overall execution time is a latency of 1 * the cycle time, which is now 2X because the cycle must fit two multiplications in series: again 2X in total.
So it seems that when optimizing for execution time, if I focus only on decreasing the latency, the cycle time increases, and vice versa. Is there a case where both the latency and the cycle time can be decreased, so that the execution time decreases? When should I focus on optimizing latency, and when on cycle time?
Also, when programming in C++, it seems that optimizing the code means optimizing the latency (the number of cycles needed for execution). For FPGA programming, however, optimizing latency alone is not adequate, since the cycle time can increase; I should instead focus on optimizing the execution time (latency * cycle time). Am I correct about this, if I want to increase the speed of the design?
I hope someone can help me with this. Thanks in advance.
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, processing 10 items doing a = b * c * d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles, pipelined, 10 items would take 11t.
Hope that helps.
Edit: added timing.
Calculation in one 2t cycle, 10 calculations:

    Time    0  2  2  2  2  2  2  2  2  2  2   = 20t
    Input      1  2  3  4  5  6  7  8  9  10
    Output     1  2  3  4  5  6  7  8  9  10

Calculation in two 1t cycles, pipelined, 10 calculations:

    Time    0  1  1  1  1  1  1  1  1  1  1   1   = 11t
    Input      1  2  3  4  5  6  7  8  9  10
    Stage1     1  2  3  4  5  6  7  8  9  10
    Output        1  2  3  4  5  6  7  8  9   10
Latency for both solutions is 2t: one 2t cycle for the first, two 1t cycles for the second. However, the throughput of the second solution is twice as high; once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, the latency would be 5t, but the throughput would still be one result per 1t cycle.
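
A quick way to sanity-check those totals: once the pipeline is full you get one result per cycle, so the total time is (pipeline depth + items - 1) * cycle time. A tiny Python sketch of that arithmetic (my own formulation, not from the answer above):

    def total_time(n_items, n_stages, cycle_time):
        # n_stages cycles to fill the pipeline and produce the first result,
        # then one new result per cycle for the remaining items.
        return (n_stages + (n_items - 1)) * cycle_time

    print(total_time(10, 1, 2))  # one 2t cycle:            20t
    print(total_time(10, 2, 1))  # two pipelined 1t cycles: 11t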
You need another word in addition to latency and cycle time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be 2x that of "do it all in one cycle".
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles), you can do it in two lots of 20 ns plus a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now one complete calculation takes 2 * 20 + 10 = 50 ns => 20M items/sec. Worse!
But if you can make the two stages independent of each other (in your case, not sharing the multiplier), you can push new data into the pipeline every 25-and-a-bit ns. This "bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in every 35 ns, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often hundreds of picoseconds, so the gains are much larger.
George accurately described the meaning of latency (which does not necessarily relate to computation time). It seems you want to optimize your design for speed. This is very complex and requires much experience. The total runtime is
execution_time = (latency + N * computation_cycles) * cycle_time
where N is the number of calculations you want to perform. If you are developing for acceleration you should only compute on large data sets, i.e. N is big; usually you then don't have latency requirements (real-time applications are different). The determining factors are then cycle_time and computation_cycles, and here it is really hard to optimize, because the two are related: cycle_time is determined by the critical path of your design, which gets longer the fewer registers are on it, and the longer the critical path, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one).
Maybe I should add that the latency is usually the number of computation_cycles (it's the first computation that incurs the latency), but in theory this can be different.

Dijkstra's Banker's Algorithm

Could somebody please provide a step-by-step approach to solving the following problem using the Banker's Algorithm? How do I determine whether a "safe state" exists? What is meant when a process can "run to completion"?
In this example, I have four processes and 10 instances of the same resource.
                Resources Allocated | Resources Needed
    Process A            1                   6
    Process B            1                   5
    Process C            2                   4
    Process D            4                   7
Per Wikipedia,
A state (as in the above example) is considered safe if it is possible for all processes to finish executing (terminate). Since the system cannot know when a process will terminate, or how many resources it will have requested by then, the system assumes that all processes will eventually attempt to acquire their stated maximum resources and terminate soon afterward. This is a reasonable assumption in most cases since the system is not particularly concerned with how long each process runs (at least not from a deadlock avoidance perspective). Also, if a process terminates without acquiring its maximum resources, it only makes it easier on the system.
A process can run to completion when the number of each type of resource that it needs is available, between itself and the system. If a process needs 8 units of a given resource, and has allocated 5 units, then it can run to completion if there are at least 3 more units available that it can allocate.
Given your example, the system is managing a single resource, with 10 units available. The running processes have already allocated 8 (1+1+2+4) units, so there are 2 units left. The amount that any process needs to complete is its maximum less whatever it has already allocated, so at the start, A needs 5 more (6-1), B needs 4 more (5-1), C needs 2 more (4-2), and D needs 3 more (7-4). There are 2 available, so Process C is allowed to run to completion, thus freeing up 2 units (leaving 4 available). At this point, either B or D can be run (we'll assume D). Once D has completed, there will be 8 units available, after which either A or B can be run (we'll assume A). Once A has completed, there will be 9 units available, and then B can be run, which will leave all 10 units left for further work. Since we can select an ordering of processes that will allow all processes to be run, the state is considered 'safe'.
                Resources Allocated | Resources Needed | Remaining Claim
    Process A            1                   6                  5
    Process B            1                   5                  4
    Process C            2                   4                  2
    Process D            4                   7                  3
Total resources allocated is 8 (1 + 1 + 2 + 4), so 2 of the 10 resources remain.
Those 2 are allocated to process C, whose remaining claim is exactly 2. After finishing, process C releases its 4 resources, which can be given to process B. After finishing, process B releases 5 resources, which are allocated to process A. After process A finishes, 6 resources are available, which covers the 3 that process D still needs. The safe order is therefore C, B, A, D.
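
For reference, here is a compact Python sketch of the safety check for a single resource type (the function and variable names are my own); it finds the same safe order, C, B, A, D:

    def is_safe(allocated, maximum, available):
        # Banker's safety check: repeatedly find a process whose remaining
        # need fits in the available pool, let it run to completion, and
        # reclaim its allocation. Returns a safe order, or None if unsafe.
        n = len(allocated)
        need = [maximum[i] - allocated[i] for i in range(n)]
        finished = [False] * n
        order = []
        while len(order) < n:
            for i in range(n):
                if not finished[i] and need[i] <= available:
                    available += allocated[i]  # process i ends, releasing its units
                    finished[i] = True
                    order.append(i)
                    break
            else:
                return None  # no process can finish: the state is unsafe
        return order

    # A..D hold 1,1,2,4 of 10 units; their maxima are 6,5,4,7.
    print(is_safe([1, 1, 2, 4], [6, 5, 4, 7], 2))  # -> [2, 1, 0, 3], i.e. C, B, A, D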
