What alternative to bitmap for network connectivity check?

I have a set of nodes, each identified by a unique integer (UID), which are arranged into one or more networks. Each network has one or more root nodes. I need to know which nodes are not connected to any root nodes.
At present, from a previous product iteration, my connectivity check starts at each root node and follows all connections. For every found node, a bit in a bitmap is set so that you can quickly check if a node has already been found / processed. Once all paths for all root nodes have been followed, the complete set of nodes is compared against the 'found' bitmap to show all the unconnected nodes.
Previously, UIDs were sequential and I could consolidate them to remove gaps. So using the first ID as an offset, I just made my found array quite large and indexed found nodes into the bitmap directly using the UID (i.e., if I found node 1000, I'd set the 1000th bit in the bitmap). However, with the current version of the product, I have less control over the node UIDs. I can't consolidate them, and third party interaction is unpredictably making very large gaps (e.g., UIDs might jump from being in the thousands to being in the tens of millions). I have come across instances where my bitmap array is too small to accommodate the gaps and the connectivity check fails.
Now, I could just go on increasing the size of my bitmap array, but that always runs the risk of still being too small and is not very resource efficient. Thus, I'm looking to replace it with something else. Obviously, I'd like it to be as fast and as resource efficient as possible - I think some sort of hashed map is what I need. Unfortunately, I have to make this work in Fortran, so I don't have access to <map> etc.
What is the best way to hash my UIDs into a lookup structure, such that I can easily check if I already found that node?

If you can modify the node types themselves, could you add a 'found' field and use that?
If that is not possible, then yes, a hash table sounds like the obvious solution. Depending on how the UIDs are distributed, you might get away with something as simple as mod(UID, bitmap_size). But you still need to handle collisions somehow. There's a lot of literature on that topic, so I won't go into it here, except to note that Robin Hood hashing is pretty cool (though maybe a bit complicated for a one-off use).
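For illustration, here is a minimal sketch of such a lookup structure, written in Python purely to show the layout a Fortran integer-array version could mirror: an open-addressing hash set keyed on mod(UID, table_size) with linear probing for collisions. The starting capacity, growth policy, and the -1 'empty' sentinel are assumptions, not anything from the question.
```python
class UIDSet:
    """Open-addressing hash set for integer UIDs, using linear probing.
    It grows and rehashes when it fills up, so huge gaps in the UID range cost nothing."""
    EMPTY = -1  # assumes real UIDs are non-negative

    def __init__(self, capacity=1 << 16):
        self.slots = [self.EMPTY] * capacity
        self.count = 0

    def _probe(self, uid):
        i = uid % len(self.slots)            # the mod(UID, table_size) hash
        while self.slots[i] not in (self.EMPTY, uid):
            i = (i + 1) % len(self.slots)    # linear probing on collision
        return i

    def add(self, uid):
        """Mark a node as found; returns False if it was already marked."""
        if self.count * 3 > len(self.slots) * 2:          # keep load factor below ~2/3
            old = self.slots
            self.slots = [self.EMPTY] * (len(old) * 2)
            self.count = 0
            for v in old:                                  # rehash into the larger table
                if v != self.EMPTY:
                    self.add(v)
        i = self._probe(uid)
        if self.slots[i] == uid:
            return False
        self.slots[i] = uid
        self.count += 1
        return True

    def __contains__(self, uid):
        return self.slots[self._probe(uid)] == uid
```
During the walk from each root you would call add(uid) for every node reached; afterwards, any node whose UID is not in the set is unconnected.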

Related

Algorithm for predictably splitting data structures no matter the order it's built

The requirements:
I'm adding a feature to a program that builds Solr indexes. The system is multi-threaded, so the search entries will be created in a random order every time. The Solr indexes also need to be split into multiple files, because if a user tries to upload one big file, the server can run out of memory.
The problem:
In order to keep the system reliable and make things easier overall, the resulting Solr index files need to be the same no matter what order they're processed in. The indices need to be balanced across the files (or close enough to balanced) and have a maximum number of entries. If a file goes beyond the maximum number of entries, it needs to be split. These files will also be updated across runs, so entries will be added, removed, and changed.
What's needed:
I'm looking for an algorithm that can be adapted to these requirements. I think I need some kind of B-tree, but I don't know of any B-tree variant that fits this particular set of requirements.
Is there an algorithm or data structure out there that can help with these requirements?
Use a UUID based on contents. For splitting the file, send each item to a bucket based on the range its UUID falls in. No matter what order you get items, this will reliably send them to buckets of relatively even sizes, and the unique index will guarantee that the result comes out the same.
See https://wiki.apache.org/solr/UniqueKey for more detailed advice, and https://wiki.apache.org/solr/LargeIndexes for other useful tips.
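A minimal sketch of the idea, assuming an MD5 of the serialized entry stands in for the content-based UUID and the bucket count is fixed; both choices are assumptions, not part of the answer above.
```python
import hashlib
import json

NUM_BUCKETS = 16  # assumed; pick so each bucket stays under the entry limit

def content_id(entry: dict) -> str:
    """Deterministic ID derived only from the entry's contents, never its arrival order."""
    canonical = json.dumps(entry, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def bucket_for(entry: dict) -> int:
    """Send the entry to a bucket based on the range its content ID falls in."""
    return int(content_id(entry), 16) % NUM_BUCKETS
```
Because the bucket depends only on the entry's contents, the same entry lands in the same file regardless of thread interleaving, so the resulting files are reproducible across runs.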

How to remove duplicated values in distributed system?

Assume we have a distributed system and there are K machines in the cluster. Each machine stores several integers. I would like to remove all the duplicate values from the system. So if integer 123 appears in machine1 and machine2, we should only keep one 123 in the system. How should I handle this?
My idea is to first let each machine do a removeDuplicate operation using something like bucket sorting (all numbers are integers), and then let one machine be the master node to do a reduce. Is there any better idea?
The easy answer would be to not end up with unmanaged duplicate values on different machines in the first place, by using a distributed hash ring or similar technology to make sure a certain value ends up on a certain node.
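A minimal sketch of that hash-ring assignment, assuming MD5 as the hash and 64 virtual tokens per node (both arbitrary choices): every machine computes the same owner for a value, so duplicates never arise in the first place.
```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Place each node at several hashed positions ("virtual nodes") on the ring.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def owner(self, value: int) -> str:
        """First node clockwise from the value's hash; identical on every machine."""
        i = bisect.bisect(self._keys, _h(str(value))) % len(self._ring)
        return self._ring[i][1]

# Usage: HashRing(["machine1", "machine2", "machine3"]).owner(123)
# gives the single machine responsible for storing 123.
```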
If that's not good enough, I'd look into heuristic optimizations. Since you already have multiple copies on different machines, I'm assuming that you want to deduplicate these values for a little bit of extra performance, rather than application correctness.
If this is the case, let each node slowly pass through its keyspace (for each integer on the node) and ask all other nodes if they have a copy of that same value. If they do, we deduplicate it. If someone doesn't respond (fast enough), ignore them and continue. This allows for a decentralized deduplication algorithm that handles node failures and that can be run at any speed, allowing more important traffic to be prioritized when needed.
I'm guessing that the keys are accessed according to a power-law distribution, so sweeping through the most commonly updated keys more often could be more efficient, but there's no guarantee of it.
Not sure what type of system you are interested in, but if shared memory is an option you can keep a counter array. Since all your numbers are integers, you can flag each integer that appears in this shared array; if an integer is already flagged, drop it. This results in O(K) operations for each integer received and no duplicates.

Is there a consistent hashing algorithm that support zero remapping of keys?

I understand that with the classic consistent hashing algorithm, when adding/removing a node, some keys have to be remapped to different nodes. Is there an algorithm that supports no remapping at all, if I loosen some requirements?
In my application, I want to incrementally assign keys to nodes:
Once a key has been assigned to a node, it stays there forever.
Nodes are added but not removed. A node is never down after being added - assume a replication/backup mechanism is at work.
Keys don't need to be distributed uniformly among the nodes. Best effort is OK: when a new node is added, more new keys are assigned to it than to the old nodes.
Is there an algorithm for this scenario?
I can imagine two similar workarounds that could give you what you're asking for, but both come with conditions that probably are not acceptable:
If cache clients know in what sequence keys were first requested, i.e. if cache keys include a monotonically increasing id or version number of some kind, then you could keep track of the sequence numbers at which the cluster size increased, and compute the hash according to the number of nodes that existed at that time.
If you don’t mind a two-stage lookup, you could keep a key → numnodes lookup table that records how many nodes there were at the time a key was cached, then use that to compute the hash code. Or just keep a key → cachenode lookup table.
(A variation on #2 if the two-stage lookup is OK, but size of the lookup table is a concern: keep a hash(key) → cachenode lookup table, and make that hash as small as you need it to be to keep the lookup table small. If two keys happen to have the same hash, they end up on the same node — but that’s not a concern if the balancing isn’t strict.)
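A sketch of workaround #2 under the stated assumptions (nodes are only ever added, a plain in-memory dict stands in for the key → numnodes lookup table, and MD5 is used as a stable hash); none of these details come from the question itself.
```python
import hashlib

def _stable_hash(key: str) -> int:
    # A process-independent hash; Python's built-in hash() is salted per run.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class StickyAssigner:
    """Remember how many nodes existed when a key was first seen, and always
    hash against that historical count, so the key never moves afterwards."""
    def __init__(self, initial_nodes):
        self.nodes = list(initial_nodes)   # nodes are only ever appended, never removed
        self.count_at_first_use = {}       # key -> number of nodes at first assignment

    def add_node(self, node):
        self.nodes.append(node)            # only keys seen after this can land here

    def node_for(self, key):
        n = self.count_at_first_use.setdefault(key, len(self.nodes))
        return self.nodes[_stable_hash(key) % n]
```
Old keys keep hashing against the node count they were first assigned under, so nothing ever remaps; distribution is only best-effort, which matches the loosened requirements.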
Neither of these techniques even relies on consistent hashing — just naive hash codes — but both are quite limiting.
In the general case, without something that ties a key to information about the state of the cache at the time that key was first cached, then no, I don’t think what you’re asking for is possible.

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision tree. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasibility of it, what is the best way to implement such a database? I made a quick prototype in SQL Server that stored a string representation of each state. It worked, but my solver client ran very slowly, as it pulled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work? I'm thinking of something along the lines of a program that checks out an assignment, generates a few million states, and submits them back to be integrated into the main database. I'm just not sure whether something like that will work well, or whether there is prior work on methods for doing that kind of thing.
In order to solve a game, what you really need to know per state in your database is its game-theoretic value, i.e. whether it is a win for the player whose turn it is to move, a loss, or a forced draw. You need two bits to encode this information per state.
You then find as compact an encoding as possible for the set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 2^21 bits on your hard disk, i.e. 2^18 bytes. When you analyze an end-game position, you first check whether the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate the game-theoretic value of the original node using min/max and store it in the database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left over to denote 'not known'; e.g. 00 = not known, 11 = draw, 10 = player to move wins, 01 = player to move loses.)
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X" or "O". This naive analysis gives you 3^9 = 2^14.26 states, i.e. 15 bits per state, so you would have an array of 2^16 bits.
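A minimal sketch of that scheme in Python: the 2-bit table packed into a bytearray plus the recursive min/max evaluation. encode(), successors() and terminal_value() are hypothetical game-specific functions, the 20-bit figure is just the example size from the answer, and recursion depth and symmetry reduction are ignored.
```python
UNKNOWN, LOSS, WIN, DRAW = 0b00, 0b01, 0b10, 0b11   # matches the bit patterns above
ENCODING_BITS = 20                                   # example figure from the answer
table = bytearray(1 << (ENCODING_BITS - 2))          # 2 bits/state = 4 states per byte

def get(code):
    return (table[code >> 2] >> ((code & 3) * 2)) & 0b11

def put(code, value):
    shift = (code & 3) * 2
    table[code >> 2] = (table[code >> 2] & ~(0b11 << shift) & 0xFF) | (value << shift)

def solve(state):
    """Game-theoretic value for the player to move, memoised in the 2-bit table."""
    code = encode(state)                    # hypothetical: state -> 20-bit integer
    if get(code) != UNKNOWN:
        return get(code)
    succs = list(successors(state))         # hypothetical: positions after each legal move
    if not succs:
        value = terminal_value(state)       # hypothetical: WIN/LOSS/DRAW for the mover
    elif any(solve(s) == LOSS for s in succs):
        value = WIN                         # some move leaves the opponent lost
    elif all(solve(s) == WIN for s in succs):
        value = LOSS                        # every move hands the opponent a win
    else:
        value = DRAW
    put(code, value)
    return value
```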
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the queue. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
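A rough single-process sketch of that consumer loop, using Python's queue module as a stand-in for RabbitMQ/SQS and a dict as the database; expand_state() and propagate_scores() are hypothetical, and real duplicate elimination would have to happen in the shared database rather than a local dict.
```python
import queue

work = queue.Queue()      # in-process stand-in for RabbitMQ / Amazon SQS
db = {}                   # stand-in for the database: state -> recorded information

def worker():
    while True:
        try:
            state = work.get(timeout=1)
        except queue.Empty:
            return                           # no work left for this client
        if state in db:                      # second path to an already-seen state
            work.task_done()
            continue                         # eliminate the duplicate
        succs = expand_state(state)          # hypothetical successor generator
        db[state] = {"successors": succs}
        for s in succs:
            work.put(s)                      # each item enqueues several more
        if not succs:                        # end-state: push scores toward the root
            propagate_scores(state, db)      # hypothetical minimax back-propagation
        work.task_done()
```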
The first thing that popped into my mind is the Linda-style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding@home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.

How to subdivide a 2d game world for better collision detection

I'm developing a game which features a sizeable square 2d playing area. The gaming area is tileless with bounded sides (no wrapping around). I am trying to figure out how I can best divide up this world to increase the performance of collision detection. Rather than checking each entity for collision with all other entities I want to only check nearby entities for collision and obstacle avoidance.
I have a few special concerns for this game world...
I want to be able to use a large number of entities in the game world at once. However, a percentage of entities won't collide with entities of the same type. For example, projectiles won't collide with other projectiles.
I want to be able to use a large range of entity sizes. I want there to be a very large size difference between the smallest entities and the largest.
There are very few static or non-moving entities in the game world.
I'm interested in using something similar to what's described in the answer here: Quadtree vs Red-Black tree for a game in C++?
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
I'm mostly looking for any good algorithms or ideas that will help reduce the number of collision detection and obstacle avoidance calculations.
If I were you I'd start off by implementing a simple BSP (binary space partitioning) tree. Since you are working in 2D, bounding box checks are really fast. You basically need three classes: CBspTree, CBspNode and CBspCut (not really needed)
CBspTree has one root node instance of class CBspNode
CBspNode has an instance of CBspCut
CBspCut symbolizes how you cut a set into two disjoint sets. This can neatly be solved by introducing polymorphism (e.g. CBspCutX or CBspCutY or some other cutting line). CBspCut also has two CBspNodes (see the sketch below).
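A minimal sketch of that class layout, in Python for brevity; the answer's names are kept, but a single CBspCut with an axis field stands in for the CBspCutX/CBspCutY polymorphism, and the split threshold and the items' .center attribute are assumptions.
```python
class CBspCut:
    """Cuts a set of items into two disjoint sets along one axis."""
    def __init__(self, axis, position):
        self.axis, self.position = axis, position    # axis 0 = X cut, 1 = Y cut
        self.front, self.back = CBspNode(), CBspNode()

    def side(self, item):
        return self.front if item.center[self.axis] >= self.position else self.back

class CBspNode:
    """Either a leaf holding items, or an inner node holding a cut."""
    def __init__(self):
        self.items, self.cut = [], None

    def insert(self, item, max_items=8):
        if self.cut is not None:
            self.cut.side(item).insert(item, max_items)
            return
        self.items.append(item)
        if len(self.items) > max_items:               # split threshold reached
            axis = 0  # assumption: a real version would pick the axis with more spread
            pos = sum(i.center[axis] for i in self.items) / len(self.items)
            cut = CBspCut(axis, pos)
            front = [i for i in self.items if cut.side(i) is cut.front]
            if 0 < len(front) < len(self.items):      # only cut if it actually separates
                self.cut = cut
                for i in self.items:
                    cut.side(i).insert(i, max_items)
                self.items = []

class CBspTree:
    """The interface the rest of the game talks to."""
    def __init__(self):
        self.root = CBspNode()

    def insert(self, item):
        self.root.insert(item)
```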
The interface towards the divided world will be through the tree class, and it can be a really good idea to create one more layer on top of that in case you would like to replace the BSP solution with e.g. a quad tree once you get the hang of it. But in my experience, a BSP will do just fine.
There are different strategies for how to store your items in the tree. What I mean by that is that you can choose to have e.g. some kind of container in each node that holds references to the objects occupying that area. This means though (as you are asking yourself) that large items will occupy many leaves, i.e. there will be many references to large objects, while very small items will show up at single leaves.
In my experience this doesn't have that large an impact. Of course it matters, but you'd have to do some testing to check whether it's really an issue or not. You can get around it by simply leaving those items at branch nodes in the tree, i.e. not storing them at "leaf level". This means you will find those objects quickly while traversing down the tree.
When it comes to your first question: if you are only going to use this subdivision for collision testing and nothing else, I suggest that things that can never collide are never inserted into the tree. A missile, for example, as you say, can't collide with another missile, which would mean that you don't even have to store the missiles in the tree.
However, you might want to use the BSP for other things as well; you didn't specify, but keep that in mind (e.g. for picking objects with the mouse). Otherwise I propose that you store everything in the BSP and resolve the collisions later on. Just ask the BSP for a list of objects in a certain area to get a limited set of possible collision candidates, and perform the check after that (assuming objects know what they can collide with, or some other external mechanism).
If you want to speed things up, you also need to take care of merging and splitting, i.e. when things are removed from the tree, a lot of nodes will become empty, or the number of items below some node will drop below some merge threshold. Then you want to merge two subtrees into one node containing all the items. Splitting happens when you insert items into the world: when the number of items exceeds some splitting threshold, you introduce a new cut, which splits the world in two. These merge and split thresholds should be two constants that you can use to tune the efficiency of the tree.
Merging and splitting are mainly used to keep the tree balanced and to make sure that it works as efficiently as it can according to its specifications. This is really what you need to worry about. Moving things from one location and thus updating the tree is, in my opinion, fast. But merging and splitting can become expensive if you do them too often.
This can be avoided by introducing some kind of lazy merge and split system, i.e. some kind of dirty flag or modification count. Batch up all operations that can be batched, i.e. moving 10 objects and inserting 5 might be one batch. Once that batch of operations is finished, you check whether the tree is dirty and then do the needed merge and/or split operations.
Post some comments if you want me to explain further.
Cheers !
Edit
There are many things that can be optimized in the tree. But as you know, premature optimization is the root of all evil. So start off simple. For example, you might create some generic callback system that you can use while traversing the tree. This way you don't have to query the tree to get a list of objects that matched the bounding-box "question"; instead you can just traverse down the tree and execute that callback each time you hit something: "If this bounding box I'm providing intersects you, then execute this callback with these parameters."
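A sketch of that callback style, assuming a generic tree node that exposes bounds, items and children (the exact fields depend on whichever tree you settle on) and rectangles given as (xmin, ymin, xmax, ymax):
```python
def intersects(a, b):
    """Axis-aligned rectangle overlap test; rectangles are (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def visit_overlapping(node, query_box, callback):
    """Walk the tree and invoke the callback for every stored item whose bounds
    intersect the query box, without building an intermediate result list."""
    if node is None or not intersects(node.bounds, query_box):
        return
    for item in node.items:
        if intersects(item.bounds, query_box):
            callback(item)
    for child in node.children:
        visit_overlapping(child, query_box, callback)
```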
You most definitely want to check out this list of collision detection resources from gamedev.net. It's full of resources with game development conventions.
For other than collision detection only, check their entire list of articles and resources.
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
Use a quad tree. For objects that exist in multiple areas you have a few options:
Store the object in both branches, all the way down. Everything ends up in leaf nodes but you may end up with a significant number of extra pointers. May be appropriate for static things.
Split the object on the zone border and insert each part in their respective locations. Creates a lot of pain and isn't well defined for a lot of objects.
Store the object at the lowest point in the tree you can. Sets of objects now exist in leaf and non-leaf nodes, but each object has one pointer to it in the tree. Probably best for objects that are going to move.
By the way, the reason you're using a quad tree is because it's really really easy to work with. You don't have any heuristic based creation like you might with some BSP implementations. It's simple and it gets the job done.
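A minimal sketch of option 3 from the list above: each object is pushed down only while a single child quadrant fully contains it, so small objects end up deep in the tree, large or straddling objects sit higher, and every object has exactly one pointer to it. Rectangles as (xmin, ymin, xmax, ymax) and the depth limit are assumptions.
```python
def contains(outer, inner):
    """True if the outer rectangle fully contains the inner one."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

class QuadNode:
    def __init__(self, bounds, depth=0, max_depth=8):
        self.bounds, self.depth, self.max_depth = bounds, depth, max_depth
        self.items = []          # objects that fit here but in no single child
        self.children = None     # four child quadrants once subdivided

    def _subdivide(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quads = [(x0, y0, mx, my), (mx, y0, x1, my),
                 (x0, my, mx, y1), (mx, my, x1, y1)]
        self.children = [QuadNode(q, self.depth + 1, self.max_depth) for q in quads]

    def insert(self, obj, bounds):
        """Push the object down while exactly one child fully contains it."""
        if self.depth < self.max_depth:
            if self.children is None:
                self._subdivide()
            for child in self.children:
                if contains(child.bounds, bounds):
                    child.insert(obj, bounds)
                    return
        self.items.append((obj, bounds))   # straddles a boundary: store it at this level
```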
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
There will be overhead to keeping your entities in the correct spots in the tree every time they move, yes, and it can be significant. But the whole point is that you're doing much much less work in your collision code. Even though you're adding some overhead with the tree traversal and update it should be much smaller than the overhead you just removed by using the tree at all.
Obviously, depending on the number of objects, the size of the game world, etc., the trade-off might not be worth it. Usually it turns out to be a win, but it's hard to know without doing it.
There are lots of approaches. I'd recommend setting some specific goals (e.g., x collision tests per second with a ratio of y between the smallest and largest entities) and doing some prototyping to find the simplest approach that achieves those goals. You might be surprised how little work you have to do to get what you need. (Or it might be a ton of work, depending on your particulars.)
Many acceleration structures (e.g., a good BSP) can take a while to set up and thus are generally inappropriate for rapid animation.
There's a lot of literature out there on this topic, so spend some time searching and researching to come up with a list of candidate approaches. Mock them up and profile.
I'd be tempted just to overlay a coarse grid over the play area to form a 2D hash. If each grid square is at least the size of the largest entity, then you only ever have 9 grid squares to check for collisions, and it's a lot simpler than managing quad-trees or arbitrary BSP trees. The overhead of determining which coarse grid square you're in is typically just 2 arithmetic operations, and when a change is detected the grid just has to remove one reference/ID/pointer from one square's list and add the same to another square.
Further gains can be had from keeping the projectiles out of the grid/tree/etc lookup system - since you can quickly determine where the projectile would be in the grid, you know which grid squares to query for potential collidees. If you check collisions against the environment for each projectile in turn, there's no need for the other entities to then check for collisions against the projectiles in reverse.
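A minimal sketch of that coarse grid, assuming cell_size is at least the size of the largest entity so a query only ever touches the 3x3 block of cells around a point; the dict-of-lists layout and the use of entity IDs are assumptions.
```python
from collections import defaultdict

class CoarseGrid:
    def __init__(self, cell_size):
        self.cell_size = cell_size            # at least the size of the largest entity
        self.cells = defaultdict(list)        # (cx, cy) -> list of entity ids

    def _cell(self, x, y):
        # the cheap lookup: one floor-divide per axis
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, entity_id, x, y):
        self.cells[self._cell(x, y)].append(entity_id)

    def move(self, entity_id, old_xy, new_xy):
        old, new = self._cell(*old_xy), self._cell(*new_xy)
        if old != new:                         # only touch the grid when the cell changes
            self.cells[old].remove(entity_id)
            self.cells[new].append(entity_id)

    def nearby(self, x, y):
        """Candidate colliders: everything in the 9 surrounding cells."""
        cx, cy = self._cell(x, y)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                out.extend(self.cells.get((cx + dx, cy + dy), ()))
        return out
```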
