What breaks a tie in Optimal Page Replacement?

Assume a process has 3 frames holding the pages (1, 7, 0), and the remaining references in its reference string are: 4, 6, 7. The reference to 4 causes a page fault, and only 7 is present in the future, so which frame should be replaced here, 1 or 0? Do I apply FIFO to break this tie, or can I select either of them to replace?

Optimal means you have all future information, so if two pages are truly never used again you can pick either and still be optimal. If you'd rather use LRU or FIFO to break the tie, that works too (i.e. it doesn't matter).
What would be interesting is thinking about optimal replacement applied globally rather than to a single process. If we assume other processes might touch 1 or 0 later, that could be part of the future information used to make the optimal choice. Optimal decision-making would also become more complicated if multiple CPUs were touching memory concurrently.
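To make the tie concrete, here is a minimal Python sketch of the optimal (Belady) policy applied to the numbers from the question; the helper next_use is a made-up name for illustration:

def next_use(page, future):
    """Index of the next reference to page, or infinity if it is never used again."""
    try:
        return future.index(page)
    except ValueError:
        return float('inf')

frames = [1, 7, 0]
future = [4, 6, 7]   # remaining reference string at the fault on 4

# The victim is the frame whose next use lies farthest in the future.
# Pages 1 and 0 are both never referenced again, so max() simply returns
# the first of the tied pages; any tie-breaker (FIFO, LRU, random) gives
# the same number of future faults.
victim = max(frames, key=lambda p: next_use(p, future))
print(victim)        # -> 1 (replacing 0 would be equally optimal)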

Related

Are add/remove set CRDTs monotonic?

The internals of an add/remove set CRDT are monotonic, because we only ever add to the internal sets, so the internal state of the CRDT can never go backwards in logical time.
However, the observed state of the CRDT is that we're adding and removing elements, so the observed state doesn't have to be monotonic.
If we chain these systems together and take actions based on the presence or non-presence of an element, it doesn't look very monotonic anymore. The final state will still converge eventually, but we may or may not see some elements for a while before it stabilizes. It's quite possible that some side effect happens because of that intermediate state, such as a user reading the state of the system and reacting before it converges.
What does it mean for a CRDT to be monotonic?
Just to add a TL;DR on top of alekibango's awesome answer:
Monotonicity refers to the fact that once an operation is observed and applied by a replica, the object's state will always take that operation into consideration.
Once an operation is applied, it will never be un-applied.
The observed non-monotonicity of (most) CRDT Sets does not invalidate the CRDT monotonic property.
CRDT sets that support the remove operation are at their core two G-Sets:
One of the G-Sets is the set of elements that were added.
The other G-set is the set of elements that were removed.
The observed state is the set of added elements minus the set of removed elements. Although each of the internal sets is clearly monotonic, their difference can appear to be non-monotonic.
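To make that concrete, here is a minimal Python sketch of such a two-G-Set construction (a 2P-Set; all names are illustrative):

class TwoPhaseSet:
    def __init__(self):
        self.added = set()     # G-Set of added elements: only ever grows
        self.removed = set()   # G-Set of removed elements (tombstones): only ever grows

    def add(self, e):
        self.added.add(e)

    def remove(self, e):
        self.removed.add(e)

    def merge(self, other):
        # The join is element-wise set union: commutative, associative, idempotent.
        self.added |= other.added
        self.removed |= other.removed

    def observed(self):
        # The observed state is the difference, which need not be monotonic.
        return self.added - self.removed

s = TwoPhaseSet()
s.add('x'); print(s.observed())      # {'x'}
s.remove('x'); print(s.observed())   # set() -- observed state shrank, internal sets only grew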
CRDT means Conflict-Free Replicated Data Type.
This means that diverging instances of a CRDT can be merged together (in whatever order, and with repetitions) to finally get into a correct, consistent state.
Monotonicity might help with implementing this (see CALM: Consistency As Logical Monotonicity), but it is not a requirement for your set instance.
Read these notes on CRDTs: https://github.com/pfrazee/crdt_notes
Some examples of CRDT sets are:
G-Set (grow-only set: items can only be added)
2P-Set (keeps tombstones; an element can only be inserted once)
LWW-Set (uses timestamps to mark the 'time' of adding/removing an item, and allows adding/removing an item multiple times; a concurrent add and remove is decided using a bias)
OR-Set (similar to the LWW-Set, but uses unique tags to be sure which element we are removing; see the sketch below)
Optimized OR-Sets (usable without keeping many tombstones around; look at these if your sets are big and have many changes)
Some links to read more:
https://hal.inria.fr/inria-00609399v1/document
https://github.com/CBaquero/delta-enabled-crdts
https://doc.akka.io/docs/akka/2.5/distributed-data.html
https://www.youtube.com/watch?v=ebWVLVhiaiY
https://www.youtube.com/watch?v=veeWamWy8dk
https://www.youtube.com/watch?v=xxjHC3yLDqw
https://www.youtube.com/watch?v=PQzNW8uQ_Y4

brainf_ck not operation in a list

I'm having a problem with the not operation (and nearly all operations) in a list. What I mean by a list is 0 i1 i2 i3 ... i(n-1) i(n) 0 with an unknown n.
In my program I'm at an unknown index in that list and I need to check whether the value there is 0.
For the not algorithm you need a temporary cell, but you can only reach such a cell with a [<] or a [>], and then you lose your place in the list.
Reminder: the a == 0 algorithm goes like this (t0 is a temporary cell, and names like t0 and a mean "move the pointer to that cell"):
t0[-]+    set t0 to 1
a[t0-]    if a is nonzero then drain t0 to zero
t0[       enters only if a was zero
<code>
]         code must leave the current cell at zero to exit
The only thing I could come up with is leaving a 1 between each index, but that seems extremely inelegant.
So my question is: is there a better way to do this?
Actually, the 1-between-each-element approach is really one of the more efficient ways to do it. You simply walk back and forth until you meet a zero, which tells you which end of the sequence you are at and also how many elements there are. The markers are also really easy to clear up after each operation.
There are ways to use only one cell per element, but they would require moving all elements to the left of the one you want one position to the left, and then moving them all back, for each operation. In some cases this might be faster, if you only store small values in each element and have a lot of elements.
It depends on what you want to achieve. Personally I think the first option, leaving a trail of 1s and clearing them afterwards, is the better one: even though it requires twice the space, it is usually significantly faster in the general case.

Algorithm for concurrent queue (single consumer, multiple producers) based on shared dictionary data structure

I am looking for a queue algorithm that fulfills the following properties:
Processes communicate using only a shared dictionary (key-value-store)
Does not use any atomic operations other than load and store (no CAS, for example)
Supports multiple producers
Supports a single consumer
Producers can die at any time and queue must remain operational
The consumer can also die at any time and be restarted later, but there will never be more than one consumer-process running at a time
This is meant as a general question about a suitable algorithm, since I'd like to use it in a couple of different scenarios. But to help visualize the requirements, here is an example use-case:
I have a website with two pages: producer.html and consumer.html
producer.html can be opened in multiple tabs simultaneously
Each producer.html adds events to the queue
One copy of consumer.html is open and consumes these events (to aggregate and stream them to a webserver, for example)
If the producer tabs are opened by the user rather than by the page, these tabs have no references to each other, so the usual communication methods (postMessage or calling directly into the other tab's JS code) are out. One way they can still communicate is via LocalStorage, as suggested here: Javascript; communication between tabs/windows with same origin. But LocalStorage is not "thread-safe", as detailed here.
Note: There may be other ways to implement cross-tab communication in the browser (Flash, ...), but these are NOT the aim of this question as they won't translate to my other use-cases. This is really just an example use-case for the general queue algorithm that I am trying to find.
A couple more parameters:
The number of producers will never be very large (10s or 100s maybe), so the scaling of the number of reads and writes needed with respect to the number of producers is not really a concern.
I don't know beforehand how many producers I might have, and there is no immediately obvious way to assign a number or index to them. (Many mutex algorithms (Lamport's Bakery, Eisenberg & McGuire, Szymański's, ...) maintain an array of state for each process, which wouldn't necessarily be a natural approach here, although I don't want to exclude these approaches ex ante if they can be implemented using the shared dictionary in some way...)
The algorithm should be 100% reliable. So, I'd like to avoid things like the delay in Lamport's first Fast Mutex algorithm (page 2 in the PDF) since I don't have any kind of real-time guarantees.
It would be very helpful if the queue was FIFO, but it's not strictly required.
The algorithm should not be encumbered by any patents, etc.
Update:
The Two-Lock Concurrent Queue Algorithm by Michael and Scott looks like it could work, but I would need two things to implement it:
A locking mechanism using the shared dictionary that can survive the crash of a lock-holder
A reliable way to allocate a new node (if I move the allocation into the locked section, I could just generate new random keys until I find one that's not in use yet, but there might be a better way?)
Update 2:
It seems I wasn't being specific enough about the dictionary:
It's really nothing more than a trivial key-value-store. It provides the functions get(key) to read the value of a key, put(key, value) to change the value of a key, and delete(key) to remove a key. In some of my use-cases, I can also iterate over keys, but if possible, I'd like to avoid it for generality. Keys are arbitrary and the producers and consumers can create or calculate them as needed. The dictionary does not provide any facilities for automatically generating unique keys.
Examples are HTML LocalStorage, Google AppEngine's Datastore, a Java Map, a Python dictionary, or even a file-system with only a single directory (where the keys would be the file-names and the values the content of the files).
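To pin the interface down, this is all the following assumes of the dictionary (a trivial in-memory stand-in in Python):

class Dictionary:
    """Minimal stand-in for the shared key-value store described above."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)    # None if the key is absent

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)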
After quite a bit of further reading and sleeping on things for a night, I came up with one way that should be able to accomplish what I need, but it might not be the most elegant:
The paper Wait-Free Algorithms for Fast, Long-Lived Renaming by Moir and Anderson generalizes Lamport's Fast Mutex Algorithm #2 (page 6 here) into the following building block (Figure 2):
When n processes enter this section of code, at most one of them will stop, at most n-1 will move right and at most n-1 will move down.
In Lamport's algorithm, stopping means the process acquired the lock, whereas moving right or down simply sends the process back to the beginning of this section of code. To release the lock, a process simply sets Y back to false. (Not quite correct, actually... See "Update" below...)
The big problem with this is that if any of the processes ever die while holding the lock (i.e. before releasing it), the block will simply stay locked forever.
Another problem is that every process needs to be assigned a unique process ID p.
The locked-forever problem can be fixed by borrowing an idea from Moir and Anderson, namely to send processes that end up moving right or down into a different building block rather than back to this one, leading to a structure like this (Figure 3 in the paper):
Except that in this case, I won't be using this grid to assign process IDs as M&A did (although I could probably solve the problem of the unique values for p with this). Instead, every box in the grid will correspond to a very simple queue. If a process stops on a box, it acquired the tail-lock for the corresponding queue (e.g. as per the algorithm by Michael and Scott) and proceeds to enqueue a new element to that queue. Upon completion, it sets the Y value of the box back to false to allow other processes to use this queue. This way, if there is high contention or if processes die before releasing locks, new queues will be created dynamically as needed.
The consumer-process doesn't need to worry about locking the heads of the queues when dequeuing elements, since it's the only process to ever do so. So, it simply traverses the tree of boxes to find all queues and trivially helps itself to their contained elements. One thing to note is that while each individual queue will be FIFO, there is no synchronization between the queues, so the combined queue will not necessarily be FIFO.
If we now change the boolean Y to a time-stamp (or null/0 to indicate false), the consumer can also expire locks after some safe timeout to re-activate dead queues.
A note about implementation using the dictionary:
The shared variables X and Y can be entries in the dictionaries with key-names X_123 and Y_123, where 123 is the number of the box.
p can simply be any unique random string and will be stored as the value of key X_123.
The boolean or time-stamp is also simply stored as the value of key Y_123. The producer-processes interpret a missing entry for Y_123 as false or null/0.
The box-numbers 123 need to be calculated from the move-pattern. One way to do this would be to start with 1 in the top-left corner. If the process stops in that box, we're done. If not, the current number (starting with 1) is shifted left by 1 (i.e. multiplied by 2) and, if the process moved down, also incremented by 1. Smaller (and fewer) numbers can be calculated with a different numbering scheme (I still need to work it out), but this one should work.
The queues then consist of one entry with key H_123 that holds the index of the current head of the queue in its value and one entry with key T_123 that holds the index of the tail. Both default to 0 if they don't exist.
To enqueue an item into queue 123, the tail index is read from T_123 (let's say it yields 48) and an entry with key Q_123_48 is put into the dictionary with its value containing the enqueued item. After, T_123 is incremented by 1.
After the item is enqueued, the Y_123 entry is set back to false or null/0 (not deleted!)
To dequeue an item, the head index is read from H_123 (let's say it yields 39) and compared to the tail index T_123. If it is smaller, an item is available at Q_123_39, which is then read and deleted from the dictionary. Afterwards, H_123 is incremented by 1. (A sketch of the enqueue and dequeue operations follows after this list.)
To traverse the box-tree, the consumer starts with the box in the top left corner. For each box (e.g. 123), if a key Y_123 exists in the dictionary (even if it contains values null/0 or false), the consumer dequeues items from the corresponding queue, and then recursively moves right and down to the adjacent boxes. If no key Y_123 exists, this box hasn't been used by any processes yet and doesn't need to be considered (and neither do the boxes below or to its right).
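Putting the notes together, here is a sketch of the per-box queue operations in Python, using only get/put/delete on the shared dictionary. It assumes the enqueuing producer already holds the box's tail-lock and that dequeue is only ever called by the single consumer:

def enqueue(d, box, item):
    tail = d.get(f'T_{box}') or 0      # missing key defaults to 0
    d.put(f'Q_{box}_{tail}', item)     # store the element first
    d.put(f'T_{box}', tail + 1)        # then publish the new tail index

def dequeue(d, box):
    head = d.get(f'H_{box}') or 0
    tail = d.get(f'T_{box}') or 0
    if head >= tail:
        return None                    # queue is currently empty
    item = d.get(f'Q_{box}_{head}')
    d.delete(f'Q_{box}_{head}')
    d.put(f'H_{box}', head + 1)        # safe: only the single consumer writes H
    return item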
I haven't actually implemented this yet, but I'll do that next. I just wanted to post this already to see if it could inspire other approaches or if anyone can see anything wrong with this idea.
Update:
I just noticed one complication: It is possible that if two processes are trying to acquire the lock for a queue simultaneously, both will fail and move on to the next block. This will leave that queue locked forever as no-one will be left to set Y back to false or null/0.
This is the reason why the "Long-Lived Renaming" algorithm by M&A as well as Lamport's algorithm #2 use an array of Y-values in which every process has its own entry, which it also resets if it moves on to another block. Y is then only considered false if all entries are false.
Since I don't know beforehand how many processes I will have, I could implement this only if the dictionary had some way of enumerating keys (the keys would then be Y_123_456, where 456 is the value of p for each process).
But, with rare contention and the above described timeout-mechanism for reactivating dead queues, the issue might lead to only a little bit of memory inefficiency, rather than a major problem.
Update 2:
A better way to label the boxes would be this pattern:
If we call the total number of moves n (counting the move into the top left box also, i.e. n ≥ 1) and the number of moves to the right r, then the box-number can be calculated using
box = (n × (n - 1))/2 + r
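A quick sanity check of the formula in Python: each diagonal n occupies the consecutive range from n(n-1)/2 to n(n-1)/2 + (n-1), so every box gets a unique number with no gaps.

def box(n, r):
    # n: total number of moves (>= 1, counting the move into the top-left box)
    # r: how many of those moves went to the right
    return n * (n - 1) // 2 + r

print(box(1, 0))   # 0 -> the top-left corner
print(box(2, 0))   # 1 -> one move down
print(box(2, 1))   # 2 -> one move right
print(box(3, 2))   # 5 -> last box of the third diagonal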
Just use an RDBMS. It's pretty simple in MS SQL; for PostgreSQL you'd have to use the RETURNING keyword, and for MySQL you'd probably have to use triggers.
CREATE TABLE Q ([Key] BIGINT IDENTITY(1,1) PRIMARY KEY, [Message] NVARCHAR(4000))
INSERT INTO Q ([Message]) OUTPUT inserted.* VALUES (@message)
DELETE TOP(1) FROM Q WITH (READPAST) OUTPUT deleted.*
If you were really hoping for an algorithmic solution, just use a ring buffer.
// Fixed-size ring buffer; assumes producers never lap the consumer by more
// than MAX_Q_SIZE entries, and that Consume is only called when the queue
// is known to be non-empty (otherwise a slot is skipped).
const int MAX_Q_SIZE = 20000000;
static string[] Q = new string[MAX_Q_SIZE];
static long ProducerID = 0;
static long ConsumerID = 0;
public static long Produce(string message) {
    // Atomically reserve the next slot; safe with multiple producers.
    long key = Interlocked.Increment(ref ProducerID);
    int idx = (int)(key % MAX_Q_SIZE);
    // Publish with a release store so the consumer sees the message.
    Volatile.Write(ref Q[idx], message);
    return key;
}
public static string Consume() {
    long key = Interlocked.Increment(ref ConsumerID);
    int idx = (int)(key % MAX_Q_SIZE);
    // Acquire load; null means the producer that reserved this slot has
    // incremented ProducerID but not written its message yet.
    string message = Volatile.Read(ref Q[idx]);
    return message;
}

Is it possible to come up with a distributed / multi-core implementation of a prime sieve?

I have been working on a prime sieve algorithm, and the basic implementation is working fine for me. What I am currently struggling with is a way to divide and distribute the calculation onto multiple processors.
I know it would require storing the actual sieve in a shared memory area or a text file, but how would one go about dividing the calculation-related steps?
Any lead would help. Thanks!
Split the numbers into sections of equal size; each processor will be responsible for one of these sections.
Another processor (or one of the same processors) generates the numbers whose multiples need to be crossed off and passes each such number to the other processors.
Each processor then uses the given number together with the starting offset of its own section to determine the first multiple inside that section, and loops through, crossing off the applicable numbers. (A sketch of this scheme follows below.)
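Here is a rough Python sketch of that scheme under simplifying assumptions: the coordinator sieves the base primes up to sqrt(limit) up front and hands each worker its whole section plus the full list of base primes (instead of streaming numbers one at a time), with multiprocessing.Pool standing in for separate processors:

from math import isqrt
from multiprocessing import Pool

def small_primes(n):
    # Plain sequential sieve up to n, run once by the coordinator.
    flags = [True] * (n + 1)
    flags[0:2] = [False, False]
    for p in range(2, isqrt(n) + 1):
        if flags[p]:
            flags[p * p :: p] = [False] * len(range(p * p, n + 1, p))
    return [i for i, is_prime in enumerate(flags) if is_prime]

def sieve_section(args):
    # Cross off multiples of each base prime inside [start, stop).
    start, stop, base = args
    flags = [True] * (stop - start)
    for p in base:
        first = max(p * p, ((start + p - 1) // p) * p)  # first multiple in range
        for m in range(first, stop, p):
            flags[m - start] = False
    return [start + i for i, is_prime in enumerate(flags) if is_prime]

if __name__ == '__main__':
    limit, workers = 1_000_000, 4
    base = small_primes(isqrt(limit))
    step = (limit - 1 + workers - 1) // workers   # numbers 2..limit, split evenly
    sections = [(lo, min(lo + step, limit + 1), base)
                for lo in range(2, limit + 1, step)]
    with Pool(workers) as pool:
        primes = sorted(p for part in pool.map(sieve_section, sections) for p in part)
    print(len(primes))   # 78498 primes up to one million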
Alternatively, one could take a much simpler approach by just using shared memory.
Let the first processor start crossing off multiples of 2, the second multiples of 3, the third multiples of 5, and so on; essentially, each processor just grabs the next number from the array and runs with it.
If you don't do this carefully, you may end up with the third processor crossing off multiples of 4, because the first hadn't reached 4 yet when the third started, so 4 wasn't crossed off yet. This shouldn't result in too much extra work, though: it takes increasingly long for a multiple of some prime to be grabbed by a processor, while that multiple is always among the first values crossed off by a processor handling that prime, so the likelihood of this redundancy decreases very quickly.
Using shared memory like this tends to be risky, though. If you plan on using one bit per index, most languages don't let you work at that level, so you'll end up doing bitwise operations (probably a bitwise AND or OR, depending on representation) on a few bytes to make your desired changes (although this complexity might be hidden by some API). In many languages such an operation is also not atomic: one thread can read a value, modify it, and write it back, while another thread reads the same value before the first thread's write and writes its own modified value back after it, causing the first thread's change to be lost. There's no simple, efficient fix for this; what exactly you need to do will depend on the language. (A sketch of the hazard follows below.)
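A small Python sketch of that lost-update hazard (the function name is made up, and a bytearray stands in for the shared bit array):

def mark_composite(sieve: bytearray, i: int) -> None:
    byte, bit = i >> 3, i & 7
    old = sieve[byte]           # 1. read the whole byte
    new = old | (1 << bit)      # 2. set one bit in a local copy
    sieve[byte] = new           # 3. write the byte back
    # If another thread performs steps 1-3 for a different bit of the same
    # byte between our steps 1 and 3, one of the two updates is lost.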

brute force search optimisation

I have a function that is engineered as follows:
int brutesearch(startNumber, endNumber);
This function performs a linear search and returns the number that matches my criteria, or null if none of the searched numbers match.
Say that:
I want to search all 6-digit numbers to find one that does something I want
I can run the brutesearch() function multithreaded
I have a laptop with 4 cores
My question is the following:
What is my best bet for optimising this search? Dividing the number space into 4 segments and running 4 instances of the function, one on each core? Or dividing into, say, 10 segments and running them all together? Or dividing into 12 segments and running them in batches of 4 using a queue?
Any ideas?
Knowing nothing about your search criteria (there may be other considerations created by the memory subsystem), the tradeoff here is between the cost of having some processors do more work than others (e.g., because the search predicate is faster on some values than on others, or because other threads were scheduled in) and the cost of coordinating the work. A strategy that has worked well for me is to have a work queue from which threads each grab a constant/#threads fraction of the remaining tasks. With only four processors it's pretty hard to go wrong, though; the really big running-time wins are in algorithms.
There is no general answer. You need to give more information.
If each comparison is completely independent of the others, if there are no opportunities for saving computation via a global resource (say, no global hash tables are involved), and if your operations are all done in a single stage, then your best bet is to simply divide your problem space by the number of cores you have available, in this case 4, and send 1/4 of the data to each core.
For example, if you had 10 million unique numbers that you wanted to test for primality, or 10 million passwords you were trying to hash to find a match, you would just divide by 4 (see the sketch below).
If you have a real-world problem, then you need to know a lot more about the underlying operations to get a good solution. For example, if a global resource is involved, then you won't get any improvement from parallelism unless you isolate the operations on that resource somehow.
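A minimal Python sketch of that even 4-way split, with a placeholder predicate standing in for the real criteria (note that pool.map waits for all segments to finish even after a hit; cancelling the other workers early would take extra machinery):

from multiprocessing import Pool

def matches(n):
    return n == 654_321              # placeholder criterion, for illustration only

def brutesearch(bounds):
    start, end = bounds              # linear search over [start, end)
    for n in range(start, end):
        if matches(n):
            return n
    return None                      # nothing found in this segment

if __name__ == '__main__':
    lo, hi, cores = 100_000, 1_000_000, 4
    step = (hi - lo) // cores
    segments = [(lo + i * step, lo + (i + 1) * step) for i in range(cores)]
    with Pool(cores) as pool:
        hits = [r for r in pool.map(brutesearch, segments) if r is not None]
    print(hits)                      # -> [654321]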
