I would like to implement queued locking in C++ for one of my applications.
I was going through the algorithm from the following paper :
http://www.cs.rice.edu/~johnmc/papers/tocs91.pdf
type qnode = record
    next : ^qnode
    locked : Boolean
type lock = ^qnode

// parameter I, below, points to a qnode record allocated
// (in an enclosing scope) in shared memory locally-accessible
// to the invoking processor

procedure acquire_lock (L : ^lock, I : ^qnode)
    I->next := nil
    predecessor : ^qnode := fetch_and_store (L, I)
    if predecessor != nil               // queue was non-empty
        I->locked := true
        predecessor->next := I          // ---A
        repeat while I->locked          // spin ---C

procedure release_lock (L : ^lock, I : ^qnode)
    if I->next = nil                    // no known successor
        if compare_and_swap (L, I, nil) // returns true iff it swapped
            return
        repeat while I->next = nil      // spin ---B
    I->next->locked := false            // ---D
A and B access the same variable (predecessor->next and I->next), and so do C and D (the locked field), but there is no locking around these accesses. Am I missing something here?
It's true that these concurrent accesses can race, but the algorithm is designed to be tolerant of that.
The spinning at B is actually to prevent a race with A. At D, we need I->next to be non-nil. I->next (known as predecessor->next here) is set to non-nil by A. As you noticed, this could race, so there is a spinning loop at B to wait for the other thread to set I->next to something valid.
Let's look at C & D. The repeat while I->locked line is the actual "spinning" part of the lock; if a thread trying to acquire the lock has to wait for another thread to release it, it spins in this loop. If the thread releasing the lock sets I->next->locked to false before the acquiring thread reaches repeat while I->locked, the loop will simply never start.
Using only atomic operations, how can I implement the following code?
const Max = 8

var index int32

func add() int32 {
    index++
    if index >= Max {
        index = 0
    }
    return index
}
Such as:

func add() int32 {
    atomic.AddInt32(&index, 1)
    // error: race condition between the two atomic operations
    atomic.CompareAndSwapInt32(&index, Max, 0)
    return index
}
But it is wrong: there is a race condition between the two atomic operations. Can this be implemented without using a lock?
Solving it without loops and locks
A simple implementation may look like this:
const Max = 8

var index int64

func Inc() int64 {
    value := atomic.AddInt64(&index, 1)
    if value < Max {
        return value // We're done
    }
    // Must normalize, optionally reset:
    value %= Max
    if value == 0 {
        atomic.AddInt64(&index, -Max) // Reset
    }
    return value
}
How does it work?
It simply adds 1 to the counter; atomic.AddInt64() returns the new value. If it's less than Max, "we're done", we can return it.
If it's greater than or equal to Max, then we have to normalize the value (make sure it's in the range [0..Max)) and reset the counter.
Reset may only be done by a single caller (a single goroutine), which will be selected by the counter's value. The winner will be the one that caused the counter to reach Max.
And the trick to avoid the need of locks is to reset it by adding -Max, not by setting it to 0. Since the counter's value is normalized, it won't cause any problems if other goroutines are calling it and incrementing it concurrently.
Of course, with many goroutines calling this Inc() concurrently, the counter may be incremented more than Max times before a goroutine that ought to reset it can actually carry out the reset, which would cause the counter to reach or exceed 2 * Max or even 3 * Max (in general: n * Max). So we handle this by using a value % Max == 0 condition to decide whether a reset should happen, which again will be true in only a single goroutine for each possible value of n.
Simplification
Note that the normalization does not change values already in the range [0..Max), so you may opt to always perform the normalization. If you want to, you may simplify it to this:
func Inc() int64 {
    value := atomic.AddInt64(&index, 1) % Max
    if value == 0 {
        atomic.AddInt64(&index, -Max) // Reset
    }
    return value
}
Reading the counter without incrementing it
The index variable should not be accessed directly. If there's a need to read the counter's current value without incrementing it, the following function may be used:
func Get() int64 {
    return atomic.LoadInt64(&index) % Max
}
Extreme scenario
Let's analyze an "extreme" scenario. In this, Inc() is called 7 times, returning the numbers 1..7. Now the next call to Inc() after the increment will see that the counter is at 8 = Max. It will then normalize the value to 0 and wants to reset the counter. Now let's say before the reset (which is to add -8) is actually executed, 8 other calls happen. They will increment the counter 8 times, and the last one will again see that the counter's value is 16 = 2 * Max. All the calls will normalize the values into the range 0..7, and the last call will again go on to perform a reset. Let's say this reset is again delayed (e.g. for scheduling reasons), and yet another 8 calls come in. For the last, the counter's value will be 24 = 3 * Max, the last call again will go on to perform a reset.
Note that all calls will only return values in the range [0..Max). Once all reset operations are executed, the counter's value will be 0, properly, because it had a value of 24 and there were 3 "pending" reset operations. In practice there's only a slight chance for this to happen, but this solution handles it nicely and efficiently.
I assume your goal is to never let index reach a value equal to or greater than Max. This can be solved using a CAS (compare-and-swap) loop:
const Max = 8

var index int32

func add() int32 {
    var next int32
    for {
        prev := atomic.LoadInt32(&index)
        next = prev + 1
        if next >= Max {
            next = 0
        }
        if atomic.CompareAndSwapInt32(&index, prev, next) {
            break
        }
    }
    return next
}
CAS can be used to implement almost any operation atomically like this. The algorithm is:
1. Load the value.
2. Perform the desired operation on it.
3. Try to CAS the new value in; on failure, go back to step 1.
So, I have a piece of code that is concurrent and is meant to be run on each CPU/core.
There are two large vectors with input/output values
var (
    input  = make([]float64, rowCount)
    output = make([]float64, rowCount)
)
these are filled and I want to compute the distance (error) between each input-output pair. Being the pairs independent, a possible concurrent version is the following:
var d float64 // Error to be computed

// Set up one worker per CPU
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
    go func(id int) {
        var wd float64
        // e.g. nw = 4:
        // worker0: i = 0, 4, 8, 12...
        // worker1: i = 1, 5, 9, 13...
        // worker2: i = 2, 6, 10, 14...
        // worker3: i = 3, 7, 11, 15...
        for i := id; i < rowCount; i += nw {
            res := compute(input[i])
            wd += distance(res, output[i])
        }
        ch <- wd
    }(w)
}

// Compute the total distance
for w := 0; w < nw; w++ {
    d += <-ch
}
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
Now, I'm using Go 1.7, so runtime.GOMAXPROCS should already be set to runtime.NumCPU(), but even setting it explicitly does not improve performance.
distance is just (a-b)*(a-b);
compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
no other goroutine is running.
So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).
I also compiled with -race and nothing emerged.
My host has 4 virtual cores, but when I run this code I see (using htop) CPU usage of about 102%, while I expected something around 380%, as has happened in the past with other Go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedule goroutines.
How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
Thanks in advance
Sorry, but in the end I got the measurement wrong. @JimB was right: I had a minor leak, but not enough to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, so the performance improvement was minor.
After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section was the most important.
Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!
On the internet there are quite a number of tutorials on how to control a shift register with a microcontroller, but is it actually possible to implement the shift register function with only the microcontroller?
If you have enough pins, I don't see why the naive way wouldn't work...
For an n-bit shift-in register, you need n + 2 pins:
One clock-in
One data-in
n data-out
The pseudocode of the implementation is:
var byte r := 0 // Assuming n=8, so 8 bits fit into a single byte
var byte i := 0
forever:
    wait for clock-in = low
    wait for clock-in = high
    r := r << 1 | data-in
    i := i + 1
    if i = n:
        data-out<1..n> := r
        i := 0
If you want to run this concurrently with other code, you should be able to use a pin for clock-in that can trigger an interrupt.
I have written the MAX-HEAPIFY(A, i) method from the Introduction to Algorithms book. Now I want to write it without recursion, using a while loop. Can you help me please?
You can use a while loop with the condition i <= heapsize, keeping all the other conditions the same; when you find the right position, just break out of the loop.
Code:

while (i <= heapsize) {
    le <- left(i)
    ri <- right(i)
    if (le <= heapsize) and (A[le] > A[i])
        largest <- le
    else
        largest <- i
    if (ri <= heapsize) and (A[ri] > A[largest])
        largest <- ri
    if (largest != i) {
        exchange A[i] <-> A[largest]
        i <- largest
    }
    else
        break
}
The solution above works but I think that following code is closer to the recursive version
(* Code TP compatible *)
const maxDim = 1000;
type TElem = integer;
     TArray = array[1..maxDim] of TElem;

procedure heapify(var A: TArray; i, heapsize: integer);
var l, r, largest, save: integer;
    temp: TElem;
(* i       - index of the node that violates the heap property
   l       - index of the left child of node i
   r       - index of the right child of node i
   largest - index of the largest element of the triplet (i, l, r)
   save    - auxiliary variable to save the value of i
   temp    - auxiliary variable used for swapping *)
begin
  repeat
    l := 2*i;
    r := 2*i + 1;
    if (l <= heapsize) and (A[l] > A[i]) then
      largest := l
    else
      largest := i;
    if (r <= heapsize) and (A[r] > A[largest]) then
      largest := r;
    (* Save the value of i to test the termination condition of the
       repeat-until loop properly; i may be modified in the if below *)
    save := i;
    if largest <> i then
    begin
      temp := A[i];
      A[i] := A[largest];
      A[largest] := temp;
      i := largest;
    end;
  until largest = save;
  (* Why did I use repeat-until instead of while? Because the body of
     the recursive procedure is executed at least once *)
end;
One more thing: in Wirth's Algorithms + Data Structures = Programs you can find a sift procedure without recursion, but a boolean variable or a break has to be introduced to eliminate the goto statement.
I'm having a really hard time understanding the Second Algorithm to the Readers-Writers problem. I understand the general concept, that the writers will get priority over the readers (readers can starve). I even understand the conditional variable implementation of this algorithm Reader/Writer Locks in C++. However, the semaphore & mutex implementation makes no sense to me. This is an example from Wikipedia:
int readcount, writecount;                  (initial value = 0)
semaphore mutex 1, mutex 2, mutex 3, w, r;  (initial value = 1)

READER
  P(mutex 3);
    P(r);
      P(mutex 1);
        readcount := readcount + 1;
        if readcount = 1 then P(w);
      V(mutex 1);
    V(r);
  V(mutex 3);

  reading is done

  P(mutex 1);
    readcount := readcount - 1;
    if readcount = 0 then V(w);
  V(mutex 1);

WRITER
  P(mutex 2);
    writecount := writecount + 1;
    if writecount = 1 then P(r);
  V(mutex 2);

  P(w);
    writing is performed
  V(w);

  P(mutex 2);
    writecount := writecount - 1;
    if writecount = 0 then V(r);
  V(mutex 2);
(Source: http://en.wikipedia.org/wiki/Readers-writers_problem)
I don't understand what the three semaphores (mutex 3, r, and mutex 1) are for in the reader lock. Isn't one semaphore enough for the readcount?
mutex 1 protects the readcount variable; mutex 2 protects the writecount variable; semaphore r protects the reading operations, and semaphore w protects the writing operations.
1) Let's suppose a writer comes in:
It waits on mutex 2 (P) and increments writecount to account for the extra writer (itself).
Since it is the only process that can change writecount (it is holding mutex 2), it can safely test whether it is the only writer (writecount = 1); if so, it acquires r (P) to keep new readers from entering. Other writers (writecount > 1) benefit from r being held already.
The writer then acquires w (P) to protect its changes from other (concurrent) writers.
On exit, the last writer (the one that decrements writecount to 0) releases r (V) to let readers perform their tasks.
2) Let's suppose a reader comes in:
It acquires mutex 3 (P) to protect the readers' entry logic from other readers; then acquires r (P) to wait out any active writers (remember, r is held while writers are operating); then acquires mutex 1 (P) to protect readcount (from other readers that might be exiting), and if it is the first reader (readcount = 1), it acquires w (P) to exclude writers from performing their operations.
Reading can be done in parallel, so no protection is needed from other readers while reading (remember, w is held at this point, so there is no interference from writers).
Then the last reader releases w (V) to allow writers again.
The trick that prevents writer starvation is that writers pose as readers by acquiring r, so they have a good chance of getting scheduled even when there are many readers. Also, mutex 3 prevents too many readers from queuing on r, so writers have a good chance of acquiring r when they arrive.
Have a look at Concurrent Control with "Readers" and "Writers" by P.J. Courtois, F. Heymans, and D.L. Parnas, which is the reference for the code on Wikipedia. It explains why all the mutexes are needed.