Assume we have a distributed system and there are K machines in the cluster. Each machine stores several integers. I would like to remove all the duplicate values from the system. So if integer 123 appears in machine1 and machine2, we should only keep one 123 in the system. How should I handle this?
My idea is to first let each machine do a removeDuplicate operation using something like bucket sorting (all numbers are integers), and then let one machine be the master node to do a reduce. Is there any better idea?
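A minimal single-process sketch of that idea, purely for illustration: each machine's removeDuplicate step becomes a local set, and the master's reduce keeps the first machine that reported each value. The function and variable names are made up, and the network transport is ignored.

    def local_dedup(values):
        # Each machine first removes its own duplicates (the per-machine step).
        return set(values)

    def master_reduce(per_machine_sets):
        # The master decides, for every value, which single machine keeps it;
        # all other machines would then be told to drop their copies.
        owner = {}
        for machine_id, values in enumerate(per_machine_sets):
            for v in values:
                owner.setdefault(v, machine_id)
        return owner  # value -> machine that keeps it, e.g. 123 -> 0

    machines = [[1, 2, 2, 123], [123, 4], [4, 5, 5]]
    print(master_reduce([local_dedup(m) for m in machines]))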
The easy answer would be to not end up with unmanaged duplicate values on different machines in the first place, by using a distributed hash ring or similar technology to make sure a certain value ends up on a certain node.
If that's not good enough, I'd look into heuristic optimizations. Since you already have multiple copies on different machines, I'm assuming that you want to deduplicate these values for a little bit of extra performance, rather than application correctness.
If this is the case, let each node slowly pass through its keyspace (for each integer on the node) and ask all other nodes whether they have a copy of that same value. If they do, we deduplicate it. If someone doesn't respond (fast enough), ignore them and continue. This gives a decentralized deduplication algorithm that handles node failures and can be run at any speed, allowing more important traffic to be prioritized when needed.
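A rough sketch of one node's sweep, assuming hypothetical ask_peer and delete_local helpers that stand in for whatever RPC and storage calls the system actually provides; a tie-break rule (e.g. only the node with the lower ID deletes) would also be needed so two nodes sweeping the same value at the same time don't both drop it:

    import time

    def dedup_sweep(local_values, peers, ask_peer, delete_local, pause=0.01):
        # Slowly walk this node's keyspace; for each value, ask every peer
        # whether it also holds a copy. If any peer confirms, drop ours.
        # Peers that fail or respond too slowly are simply skipped.
        for value in list(local_values):
            for peer in peers:
                try:
                    if ask_peer(peer, value, timeout=1.0):
                        delete_local(value)
                        break
                except Exception:
                    continue  # unreachable or slow peer: ignore and move on
            time.sleep(pause)  # throttle so more important traffic keeps priority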
I'm guessing that the keys are accessed according to a power-law distribution, so sweeping through the most commonly updated keys more often could be more efficient, but there's no guarantee of it.
Not sure what type of system you are interested in, but if shared memory is an option you can keep a counter array. Since all your numbers are integers, you can flag each integer that appears in this shared array; if the integer is already flagged, drop it. This results in O(1) work for each integer received and no duplicates.
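A toy sketch of the flag-array idea, assuming the integers fall in a known range; the list below stands in for the shared counter array, which in a real system would live in the shared memory region:

    def make_dedup_filter(max_value):
        seen = [False] * (max_value + 1)  # one flag per possible integer value

        def accept(n):
            # True the first time a value is seen, False for every duplicate.
            if seen[n]:
                return False
            seen[n] = True
            return True

        return accept

    accept = make_dedup_filter(1000)
    print([n for n in [123, 5, 123, 7, 5] if accept(n)])  # [123, 5, 7]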
I have a set of nodes, each identified by a unique integer (UID), which are arranged into one or more networks. Each network has one or more root nodes. I need to know which nodes are not connected to any root nodes.
At present, from a previous product iteration, my connectivity check starts at each root node and follows all connections. For every found node, a bit in a bitmap is set so that you can quickly check if a node has already been found / processed. Once all paths for all root nodes have been followed, the complete set of nodes is compared against the 'found' bitmap to show all the unconnected nodes.
Previously, UIDs were sequential and I could consolidate them to remove gaps. So using the first ID as an offset, I just made my found array quite large and indexed found nodes into the bitmap directly using the UID (i.e., if I found node 1000, I'd set the 1000th bit in the bitmap). However, with the current version of the product, I have less control over the node UIDs. I can't consolidate them, and third party interaction is unpredictably making very large gaps (e.g., UIDs might jump from being in the thousands to being in the tens of millions). I have come across instances where my bitmap array is too small to accommodate the gaps and the connectivity check fails.
Now, I could just go on increasing the size of my bitmap array, but that always runs the risk of still being too small and is not very resource-efficient. Thus, I'm looking to replace it with something else. Obviously, I'd like it to be as fast and as resource-efficient as possible - I think some sort of hash map is what I need. Unfortunately, I have to make this work in Fortran, so I don't have access to <map> etc.
What is the best way to hash my UIDs into a lookup structure, such that I can easily check if I already found that node?
If you can modify the node types themselves, you could add a found field, and use that?
If that is not possible, then yes, a hash table sounds like the obvious solution. Depending on how the UIDs are distributed, you might get away with something as simple as mod(UID, bitmap_size). But you still need to handle collisions somehow. There's a lot of literature on that topic, so I won't go into it here, except to note that Robin Hood hashing is pretty cool (though maybe a bit complicated for a one-off use).
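You'd have to write this in Fortran against plain arrays, but the structure is simple enough that a short Python sketch still shows the shape: mod(UID, table size) as the hash, linear probing for collisions, and a table sized comfortably larger than the number of UIDs expected (all names are illustrative, and there is no resizing):

    EMPTY = -1  # sentinel; assumes real UIDs are non-negative

    class FoundSet:
        def __init__(self, table_size):
            self.slots = [EMPTY] * table_size

        def _probe(self, uid):
            # Start at mod(uid, size) and walk forward until we hit either
            # an empty slot or the slot already holding this UID.
            i = uid % len(self.slots)
            while self.slots[i] not in (EMPTY, uid):
                i = (i + 1) % len(self.slots)
            return i

        def add(self, uid):
            # Returns True if the UID was newly recorded, False if already found.
            i = self._probe(uid)
            if self.slots[i] == uid:
                return False
            self.slots[i] = uid
            return True

    found = FoundSet(1 << 20)
    print(found.add(10_000_000), found.add(10_000_000))  # True False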
I had some interviews recently and it's quite normal to be asked some scale problems.
For example: you have a long list of words (a dictionary) and a list of characters as the inputs; design an algorithm to find the shortest word in the dictionary that contains all the chars in the char list. Then the interviewer asked how to scale the algorithm to multiple machines.
Another example: you have designed a traffic light control system for an intersection in a city. How do you scale this control system to the whole city, which has many intersections?
I always have trouble with this kind of "scale" problem, so any suggestions and comments are welcome.
Your first question is completely different from your second question. In fact, the control of traffic lights in cities is a local operation: there are boxes nearby that you can tune and optical sensors on top of the lights that detect waiting cars. I guess if you need to optimize for some objective function of flow, you can route information to a server process, and then it becomes a question of how to scale that server process over multiple machines.
I am no expert in the design of distributed algorithms, which is a whole field of research, but the questions in undergrad interviews usually are not that specialized; after all, they are not interviewing a graduate student specializing in those fields. Take your first question as an example: it is quite generic.
Normally these questions involve multiple data structures (several lists and hash tables) interacting (joining, iterating, etc.) to solve a problem. Once you have worked out a basic solution, scaling is basically copying that solution onto many machines and running them on partitions of the input at the same time. (Of course, in many cases this is difficult if not impossible, but interview questions won't be that hard.)
That is, you have many identical workers splitting the input workload and working at the same time, but those workers are processes on different machines. That brings in the problems of communication protocols, network latency, and so on, but we will ignore those to get to the basics.
The most common way to scale is to let the workers hold copies of the smaller data structures and have them split the larger data structures as the workload. In your example (the first question), the list of characters is small, so you would give each worker a copy of the list and a portion of the dictionary to work on with it. Notice that the other way around won't work, because each worker holding a copy of the dictionary would consume a large amount of memory in total, and scaling up would save you nothing.
If your problem gets larger, you may need more layers of splitting, which also implies you need a way of combining the outputs from the workers that take in the split input. This is the general concept and motivation behind the MapReduce framework and its derivatives.
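A minimal single-process sketch of that split-and-combine shape for the first question; each "worker" here is just a function call over one slice of the dictionary, whereas in a real system each slice would be shipped to a different machine (names are illustrative):

    def shortest_covering_word(words, chars):
        # Worker step: each worker holds a copy of the (small) char list
        # and scans its own partition of the (large) dictionary.
        required = set(chars)
        best = None
        for w in words:
            if required <= set(w) and (best is None or len(w) < len(best)):
                best = w
        return best

    def scale_out(dictionary, chars, n_workers=4):
        # Split the dictionary, run the worker on every partition, then
        # combine: the global answer is the shortest of the local answers.
        parts = [dictionary[i::n_workers] for i in range(n_workers)]
        local_best = [shortest_covering_word(p, chars) for p in parts]
        candidates = [w for w in local_best if w is not None]
        return min(candidates, key=len) if candidates else None

    print(scale_out(["cat", "cart", "track", "art"], ["a", "r", "t"]))  # art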
Hope it helps...
For the first question, finding the words that contain all the chars in the char list so that the work can run on different machines at the same time (not yet the shortest word), I would do it with map-reduce as the base.
First, this problem actually can run on different machines at the same time, because each word in the database can be checked independently: to check one word, you don't have to wait for the previous or the next word, so you can literally send each word to a different computer to be checked.
Using map-reduce, you can map over each word as a value and then check whether it contains every char in the char list.
    Map(word, keyout, valueout) {
        // word comes from the database; keyout & valueout form the input for Reduce
        if (word contains every char in the char list) {
            sharedOutput(keyout, word)  // send the word to a shared output file,
                                        // managed by the framework (e.g. Hadoop)
        }
    }
After this Map runs, all the words you want from the database end up in the shared file. For the reduce step, you can then use a simple pass that reduces them based on length, and tada, you get the shortest one.
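A tiny sketch of that reduce step, assuming the mappers' output is simply gathered into a list of matching words (how the framework stores and shuffles them is left out):

    def reduce_shortest(matching_words):
        # Keep only the shortest of the words the mappers emitted.
        return min(matching_words, key=len) if matching_words else None

    print(reduce_shortest(["cart", "art", "track"]))  # art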
As for the second question, multithreading comes to mind. The intersections are actually problems that don't relate to each other; each intersection has its own timer, right? So to be able to handle tons of intersections, you should use multithreading.
In simple terms, use each core in the processor to control intersections, rather than looping through all the intersections one by one. You can allocate them across the cores so that the processing is faster.
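A toy sketch of the one-worker-per-intersection idea, using Python threads purely to show the shape; a real controller would drive actual signal hardware instead of printing, and the timings here are made up:

    import threading
    import time

    def run_intersection(name, cycle_seconds, stop):
        # Each intersection runs its own independent timer loop.
        green_north_south = True
        while not stop.is_set():
            print(f"{name}: green for {'N-S' if green_north_south else 'E-W'}")
            green_north_south = not green_north_south
            time.sleep(cycle_seconds)

    stop = threading.Event()
    threads = [threading.Thread(target=run_intersection,
                                args=(f"intersection-{i}", 0.1, stop))
               for i in range(4)]
    for t in threads:
        t.start()
    time.sleep(0.5)
    stop.set()
    for t in threads:
        t.join()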
I understand that with the classical consistent hashing algorithm, when adding or removing a node, some keys have to be remapped to different nodes. Is there an algorithm that supports no remapping at all, if I loosen some requirements?
In my application, I want to incrementally assign keys to nodes:
Once a key has been assigned to a node, it stays there forever.
Nodes are added but not removed. A node is never down after being added - assume a replication/backup mechanism is at work.
Keys don't need to be distributed uniformly among the nodes. Best effort is OK: when a new node is added, more new keys are assigned to it than to the old nodes.
Is there an algorithm for this scenario?
I can imagine two similar workarounds that could give you what you're asking for, but both come with conditions that probably are not acceptable:
If cache clients know in what sequence keys were first requested, i.e. if cache keys include a monotonically increasing id or version number of some kind, then you could keep track of the sequence numbers at which the cluster size increased, and compute the hash according to the number of nodes that existed at that time.
If you don’t mind a two-stage lookup, you could keep a key → numnodes lookup table that records how many nodes there were at the time a key was cached, then use that to compute the hash code. Or just keep a key → cachenode lookup table.
(A variation on #2 if the two-stage lookup is OK, but size of the lookup table is a concern: keep a hash(key) → cachenode lookup table, and make that hash as small as you need it to be to keep the lookup table small. If two keys happen to have the same hash, they end up on the same node — but that’s not a concern if the balancing isn’t strict.)
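A minimal sketch of workaround #2, with a plain dict standing in for the key → numnodes lookup table (in practice it would sit in some shared store), and Python's built-in hash standing in for whatever naive hash code you use:

    class GrowOnlyPlacement:
        # Records how many nodes existed when each key was first placed, so
        # the key hashes to the same node forever, even as nodes are added.
        def __init__(self, initial_nodes):
            self.num_nodes = initial_nodes
            self.first_seen = {}  # key -> cluster size at first placement

        def add_node(self):
            self.num_nodes += 1   # nodes are only ever added, never removed

        def node_for(self, key):
            n = self.first_seen.setdefault(key, self.num_nodes)
            return hash(key) % n  # naive modulo hashing, as noted above

    placement = GrowOnlyPlacement(3)
    before = placement.node_for("user:42")
    placement.add_node()
    assert placement.node_for("user:42") == before  # existing keys never move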
Neither of these techniques even relies on consistent hashing — just naive hash codes — but both are quite limiting.
In the general case, without something that ties a key to information about the state of the cache at the time that key was first cached, then no, I don’t think what you’re asking for is possible.
I'm looking for something like Consistent Hashing, but with a guarantee that a distribution ends up as fair as possible (not just on average for random keys) - is there such a thing and where can I find it if so?
Edit: In my specific case, the set of keys is known up front (and "small"). Exactly these keys will always be present and must be allocated to exactly one node each at any given point in time.
Sounds to me like you're looking for a minimal perfect hash.
"not just on average for random keys"
This is not an accurate description of the guarantees provided by consistent hashing. First, "on average" does not capture the fact that, with random placement of a large number of virtual nodes on the circle and a good family of hash functions (e.g., one that is log-wise independent), the load imbalance is seriously unlikely to be large (I believe the usual imbalance should be on the order of the square root of the number of keys assigned to a particular machine). Second, the keys don't have to be random as long as they don't depend on the randomly chosen hash function (oblivious adversary).
Since you want hashing that is always fair, randomization won't help: the RNG can still produce an unbalanced outcome. And no deterministic algorithm can statically assign node preferences to keys without the possibility of imbalance, unless the keys are known offline.
If you have sufficiently few items that you care about square root imbalances, you can do old-fashioned stateful load balancing.
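Given the edit that the key set is known up front and small, a stateful balancer can be as simple as the sketch below: assign each key to whichever node currently has the fewest keys, so the counts differ by at most one by construction (names are illustrative):

    def balance(keys, nodes):
        # Deterministic, stateful assignment over a known key set.
        load = {n: 0 for n in nodes}
        assignment = {}
        for k in sorted(keys):  # fixed iteration order -> reproducible result
            target = min(nodes, key=lambda n: load[n])
            assignment[k] = target
            load[target] += 1
        return assignment

    print(balance(["a", "b", "c", "d", "e"], ["node1", "node2"]))
    # {'a': 'node1', 'b': 'node2', 'c': 'node1', 'd': 'node2', 'e': 'node1'}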
A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision tree. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or its feasibility, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very, very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per state in your database is its game-theoretic value, i.e. whether it's a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits per state to encode this information.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check whether the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then compute the game-theoretic value of the original node using min/max and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left over to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X", or "O". This naive analysis gives you 3^9 ≈ 2^14.26 states, i.e. 15 bits per state, so you would have an array of 2^16 bits.
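A small sketch of the two-bits-per-state table described above, packed into a Python bytearray; the 00/01/10/11 encoding follows the note, and state_index is whatever compact encoding you chose for your game:

    UNKNOWN, LOSS, WIN, DRAW = 0b00, 0b01, 0b10, 0b11

    class ValueTable:
        # One 2-bit game-theoretic value per state, four states per byte.
        def __init__(self, encoding_bits):
            self.data = bytearray((1 << encoding_bits) // 4)

        def get(self, state_index):
            byte, slot = divmod(state_index, 4)
            return (self.data[byte] >> (2 * slot)) & 0b11

        def set(self, state_index, value):
            byte, slot = divmod(state_index, 4)
            self.data[byte] &= ~(0b11 << (2 * slot)) & 0xFF
            self.data[byte] |= value << (2 * slot)

    table = ValueTable(20)           # 2^20 states -> 2^18 bytes, as above
    table.set(123456, WIN)
    print(table.get(123456) == WIN)  # True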
You undoubtedly want a task queue service of some sort, such as RabbitMQ, probably in conjunction with a database which can store the data once you've calculated it. Alternatively, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the database. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
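A rough sketch of that worker loop with the queue and database replaced by in-memory stand-ins (a real deployment would consume from RabbitMQ or SQS and write to a persistent store); it also shows the duplicate check from the second caveat:

    from collections import deque

    def explore(initial_state, successors, is_end_state, score):
        # Consume a state, record it, enqueue unseen successors.
        # 'seen' enforces the duplicate check, since several move orders
        # can reach the same position.
        queue = deque([initial_state])
        database = {}  # state -> score for end-states, None when still unknown
        seen = {initial_state}
        while queue:
            state = queue.popleft()
            database[state] = score(state) if is_end_state(state) else None
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return database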
The first thing that popped into my mind is a Linda-style shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.