JUNG - How to get the giant connected component of a graph? - gcc

Currently, what I'm doing is:
WeakComponentClusterer<Integer, String> wcc = new WeakComponentClusterer<Integer, String>();
Collection<Graph<Integer,String>> ccs = FilterUtils.createAllInducedSubgraphs(wcc.transform(graph),graph);
The problem is that in ccs is stored all the connected components but I just want the giant one (GCC). Since the order of the clusters in the collection css is not determinated by their size, I have to iterate over the whole collection in order to find the giant cluster. The bad thing is that the graph I'm using is huge and has many clusters; so, that iteration costs a lot.
Since I'm new at JUNG I was just wondering if there's a fast way of retrieving the GCC of a graph. Any help is valid.

Probably the easiest way to solve your problem would be to hack WeakComponentClusterer so that it kept track of the component sizes (or of which one was the largest, since that's what you're interested in) as it was constructing them, and then exposed that information to the user.
This is a modification we might make at some point but it's easy enough for you to to in your local copy of the code.

Related

Understanding thread-safety

Let use as example standard Binary Search Tree, its nodes having an encapsulated value and pointers to the left and right children and its parent. Its primary methods are addNode(), removeNode() and searchNode().
I want to understand how to make my data structure safe. I however do not know much about thread-safety and want to ask if what I am thinking is correct or not.
My first problem is one of decision: what I want to happen when I have concurrent methods? Let us say I am using my BST to manage a map in a game: the position of the character of a player is represented by an array of doubles, I store the player's positions in my BST in lexicographic order. One player destroyed a target (and his thread want to remove it from my map), but another player is initiating an attack and needs a list of all possible targets inside a certain distance from him, including the target just removed (and his thread want to execute a search that will include the target). How I manage the situation? What about other situations?
My second problem, (I think) my biggest, is one of actuation: how I make happen what I want to happen. I read about things like atomic-operations, linearity, obstruction-free, lock-free, wait-free. I never found out in a single place a simple definition for each, the advantages of that strategy, the disadvantages. What are those things? What am I doing if I am locking the nodes, or if i am locking the pointers to other nodes, or the values encapsulates in the node?
Am I understanding correctly the problematics of thread-safety? Do you have suggestions on problems I should think about, or on written papers and pdfs I should read in order to better understand the problems and their many solutions?

Barnes-Hut tree creating

I am currently trying to create a Barnes-Hut octree, however, I still not fully understand how to do this properly. I have read threads here, this article and some others. I believe I do understand how to make a tree if every node contains the information about the indices of particles inside, and if you keep storing the empty nodes. But if you do not want to? How to make a tree such that at the end you will only have necessary information: say, monopoles and quadrupoles for all non-empty nodes. I made so many different attempts that now I am completely confused, to be honest. What should I contain in each node? What would be the pseudocode for such thing?
P.S. By the way, is it different for monopoles and quadrupoles? I mean I can imagine that you do not need the exact information about the particles inside the node to calculate a monopole (it is just a full mass of node), but for quadruple?
Thank you in advance!
P.S. By the way, I use julia language if it is somehow relevant.

How can I store a graph and run page rank like analytics on it hbase?

Sorry if this question seems a bit complex but I think its all related so I wanted try to get the answer in one shot. Basically I have a layered graph*, that has various sets of data that are connected to only the next set of data(so set1 has vertexes that have edges to set2, and so on but set1 has nothing connecting to set3 or anything other than set2. It might be relevant not sure). Generally, you can think of my data as one massive family tree(every set I add about a billion nodes) that I keep loading new generations with every new set(families create new families and no edges go backwards).
I have an Hbase/hadoop system running and I know how to use java to add columns and values, but what I don't know how to do is:
add data to hbase in a graph type format(since its hbase, I want to load it in a way that I can add a ton of data and it'll scale..unlike other databases that limit graphs to the size of the system). I know how to add data but don't understand how to do it in a scalable graph way.
Once the graph is loaded I want to know how to apply some kind of analytics to it. Pagerank is popular so I thought I would say it, but pretty much anything that is based on processing a graph.
I guess the simplified way of asking the question is how to do I specifically get a graph into hbase and once its there how do I analyze it? Is there a tutorial? There's a lot of hbase information on the internet(I read the hbase book) but I could not find anything specific to graphs. I found, giraph, but I don't think it can connect to hbase(yet). Seeing how hadoop/hbase are versions of mapreduce/bigtables I suspect there is a way to process graphs I'm just not having luck finding anything.
*A layered graph is a directed graph with a level for different set of vertex's, like so: http://en.wikipedia.org/wiki/Layered_graph_drawing
I think this question on SO could help:
https://stackoverflow.com/questions/9865738/is-it-possible-to-store-graphs-hbase-if-so-how-do-you-model-the-database-to-sup/9867563#9867563
This part of my answer to this question might be of use.
Using HBase/Accumulo as input to giraph has been submitted recently (7
Mar 2012) as a new feature request to Giraph: HBase/Accumulo Input
and Output formats (GIRAPH-153)
We use giraph in this way, it only store minimum data in each vertex, and then run the graph algorithm with giraph, then we assemble the result with rich data using pig, for page rank algo, each vertex only needs to store vertex id, rank, thus it could scale to almost billion level.

Database for brute force solving board games

A few years back, researchers announced that they had completed a brute-force comprehensive solution to checkers.
I have been interested in another similar game that should have fewer states, but is still quite impractical to run a complete solver on in any reasonable time frame. I would still like to make an attempt, as even a partial solution could give valuable information.
Conceptually I would like to have a database of game states that has every known position, as well as its succeeding positions. One or more clients can grab unexplored states from the database, calculate possible moves, and insert the new states into the database. Once an endgame state is found, all states leading up to it can be updated with the minimax information to build a decision trees. If intelligent decisions are made to pick probable branches to explore, I can build information for the most important branches, and then gradually build up to completion over time.
Ignoring the merits of this idea, or the feasability of it, what is the best way to implement such a database? I made a quick prototype in sql server that stored a string representation of each state. It worked, but my solver client ran very very slow, as it puled out one state at a time and calculated all moves. I feel like I need to do larger chunks in memory, but the search space is definitely too large to store it all in memory at once.
Is there a database system better suited to this kind of job? I will be doing many many inserts, a lot of reads (to check if states (or equivalent states) already exist), and very few updates.
Also, how can I parallelize it so that many clients can work on solving different branches without duplicating too much work. I'm thinking something along the lines of a program that checks out an assignment, generates a few million states, and submits it back to be integrated into the main database. I'm just not sure if something like that will work well, or if there is prior work on methods to do that kind of thing as well.
In order to solve a game, what you really need to know per a state in your database is what is its game-theoretic value, i.e. if it's win for the player whose turn it is to move, or loss, or forced draw. You need two bits to encode this information per a state.
You then find as compact encoding as possible for that set of game states for which you want to build your end-game database; let's say your encoding takes 20 bits. It's then enough to have an array of 221 bits on your hard disk, i.e. 213 bytes. When you analyze an end-game position, you first check if the corresponding value is already set in the database; if not, calculate all its successors, calculate their game-theoretic values recursively, and then calculate using min/max the game-theoretic value of the original node and store in database. (Note: if you store win/loss/draw data in two bits, you have one bit pattern left to denote 'not known'; e.g. 00=not known, 11 = draw, 10 = player to move wins, 01 = player to move loses).
For example, consider tic-tac-toe. There are nine squares; every one can be empty, "X" or "O". This naive analysis gives you 39 = 214.26 = 15 bits per state, so you would have an array of 216 bits.
You undoubtedly want a task queue service of some sort, such as RabbitMQ - probably in conjunction with a database which can store the data once you've calculated it. Alternately, you could use a hosted service like Amazon's SQS. The client would consume an item from the queue, generate the successors, and enqueue those, as well as adding the outcome of the item it just consumed to the queue. If the state is an end-state, it can propagate scoring information up to parent elements by consulting the database.
Two caveats to bear in mind:
The number of items in the queue will likely grow exponentially as you explore the tree, with each work item causing several more to be enqueued. Be prepared for a very long queue.
Depending on your game, it may be possible for there to be multiple paths to the same game state. You'll need to check for and eliminate duplicates, and your database will need to be structured so that it's a graph (possibly with cycles!), not a tree.
The first thing that popped into my mind is the Linda-style of a shared 'whiteboard', where different processes can consume 'problems' off the whiteboard, add new problems to the whiteboard, and add 'solutions' to the whiteboard.
Perhaps the Cassandra project is the more modern version of Linda.
There have been many attempts to parallelize problems across distributed computer systems; Folding#Home provides a framework that executes binary blob 'cores' to solve protein folding problems. Distributed.net might have started the modern incarnation of distributed problem solving, and might have clients that you can start from.

How to subdivide a 2d game world for better collision detection

I'm developing a game which features a sizeable square 2d playing area. The gaming area is tileless with bounded sides (no wrapping around). I am trying to figure out how I can best divide up this world to increase the performance of collision detection. Rather than checking each entity for collision with all other entities I want to only check nearby entities for collision and obstacle avoidance.
I have a few special concerns for this game world...
I want to be able to be able to use a large number of entities in the game world at once. However, a % of entities won't collide with entities of the same type. For example projectiles won't collide with other projectiles.
I want to be able to use a large range of entity sizes. I want there to be a very large size difference between the smallest entities and the largest.
There are very few static or non-moving entities in the game world.
I'm interested in using something similar to what's described in the answer here: Quadtree vs Red-Black tree for a game in C++?
My concern is how well will a tree subdivision of the world be able to handle large size differences in entities? To divide the world up enough for the smaller entities the larger ones will need to occupy a large number of regions and I'm concerned about how that will affect the performance of the system.
My other major concern is how to properly keep the list of occupied areas up to date. Since there's a lot of moving entities, and some very large ones, it seems like dividing the world up will create a significant amount of overhead for keeping track of which entities occupy which regions.
I'm mostly looking for any good algorithms or ideas that will help reduce the number collision detection and obstacle avoidance calculations.
If I were you I'd start off by implementing a simple BSP (binary space partition) tree. Since you are working in 2D, bound box checks are really fast. You basically need three classes: CBspTree, CBspNode and CBspCut (not really needed)
CBspTree has one root node instance of class CBspNode
CBspNode has an instance of CBspCut
CBspCut symbolize how you cut a set in two disjoint sets. This can neatly be solved by introducing polymorphism (e.g. CBspCutX or CBspCutY or some other cutting line). CBspCut also has two CBspNode
The interface towards the divided world will be through the tree class and it can be a really good idea to create one more layer on top of that, in case you would like to replace the BSP solution with e.g. a quad tree. Once you're getting the hang of it. But in my experience, a BSP will do just fine.
There are different strategies of how to store your items in the tree. What I mean by that is that you can choose to have e.g. some kind of container in each node that contains references to the objects occuping that area. This means though (as you are asking yourself) that large items will occupy many leaves, i.e. there will be many references to large objects and very small items will show up at single leaves.
In my experience this doesn't have that large impact. Of course it matters, but you'd have to do some testing to check if it's really an issue or not. You would be able to get around this by simply leaving those items at branched nodes in the tree, i.e. you will not store them on "leaf level". This means you will find those objects quick while traversing down the tree.
When it comes to your first question. If you only are going to use this subdivision for collision testing and nothing else, I suggest that things that can never collide never are inserted into the tree. A missile for example as you say, can't collide with another missile. Which would mean that you dont even have to store the missile in the tree.
However, you might want to use the bsp for other things as well, you didn't specify that but keep that in mind (for picking objects with e.g. the mouse). Otherwise I propose that you store everything in the bsp, and resolve the collision later on. Just ask the bsp of a list of objects in a certain area to get a limited set of possible collision candidates and perform the check after that (assuming objects know what they can collide with, or some other external mechanism).
If you want to speed up things, you also need to take care of merge and split, i.e. when things are removed from the tree, a lot of nodes will become empty or the number of items below some node level will decrease below some merge threshold. Then you want to merge two subtrees into one node containing all items. Splitting happens when you insert items into the world. So when the number of items exceed some splitting threshold you introduce a new cut, which splits the world in two. These merge and split thresholds should be two constants that you can use to tune the efficiency of the tree.
Merge and split are mainly used to keep the tree balanced and to make sure that it works as efficient as it can according to its specifications. This is really what you need to worry about. Moving things from one location and thus updating the tree is imo fast. But when it comes to merging and splitting it might become expensive if you do it too often.
This can be avoided by introducing some kind of lazy merge and split system, i.e. you have some kind of dirty flagging or modify count. Batch up all operations that can be batched, i.e. moving 10 objects and inserting 5 might be one batch. Once that batch of operations is finished, you check if the tree is dirty and then you do the needed merge and/or split operations.
Post some comments if you want me to explain further.
Cheers !
Edit
There are many things that can be optimized in the tree. But as you know, premature optimization is the root to all evil. So start off simple. For example, you might create some generic callback system that you can use while traversing the tree. This way you dont have to query the tree to get a list of objects that matched the bound box "question", instead you can just traverse down the tree and execute that call back each time you hit something. "If this bound box I'm providing intersects you, then execute this callback with these parameters"
You most definitely want to check this list of collision detection resources from gamedev.net out. It's full of resources with game development conventions.
For other than collision detection only, check their entire list of articles and resources.
My concern is how well will a tree
subdivision of the world be able to
handle large size differences in
entities? To divide the world up
enough for the smaller entities the
larger ones will need to occupy a
large number of regions and I'm
concerned about how that will affect
the performance of the system.
Use a quad tree. For objects that exist in multiple areas you have a few options:
Store the object in both branches, all the way down. Everything ends up in leaf nodes but you may end up with a significant number of extra pointers. May be appropriate for static things.
Split the object on the zone border and insert each part in their respective locations. Creates a lot of pain and isn't well defined for a lot of objects.
Store the object at the lowest point in the tree you can. Sets of objects now exist in leaf and non-leaf nodes, but each object has one pointer to it in the tree. Probably best for objects that are going to move.
By the way, the reason you're using a quad tree is because it's really really easy to work with. You don't have any heuristic based creation like you might with some BSP implementations. It's simple and it gets the job done.
My other major concern is how to
properly keep the list of occupied
areas up to date. Since there's a lot
of moving entities, and some very
large ones, it seems like dividing the
world up will create a significant
amount of overhead for keeping track
of which entities occupy which
regions.
There will be overhead to keeping your entities in the correct spots in the tree every time they move, yes, and it can be significant. But the whole point is that you're doing much much less work in your collision code. Even though you're adding some overhead with the tree traversal and update it should be much smaller than the overhead you just removed by using the tree at all.
Obviously depending on the number of objects, size of game world, etc etc the trade off might not be worth it. Usually it turns out to be a win, but it's hard to know without doing it.
There are lots of approaches. I'd recommend settings some specific goals (e.g., x collision tests per second with a ratio of y between smallest to largest entities), and do some prototyping to find the simplest approach that achieves those goals. You might be surprised how little work you have to do to get what you need. (Or it might be a ton of work, depending on your particulars.)
Many acceleration structures (e.g., a good BSP) can take a while to set up and thus are generally inappropriate for rapid animation.
There's a lot of literature out there on this topic, so spend some time searching and researching to come up with a list candidate approaches. Mock them up and profile.
I'd be tempted just to overlay a coarse grid over the play area to form a 2D hash. If the grid is at least the size of the largest entity then you only ever have 9 grid squares to check for collisions and it's a lot simpler than managing quad-trees or arbitrary BSP trees. The overhead of determining which coarse grid square you're in is typically just 2 arithmetic operations and when a change is detected the grid just has to remove one reference/ID/pointer from one square's list and add the same to another square.
Further gains can be had from keeping the projectiles out of the grid/tree/etc lookup system - since you can quickly determine where the projectile would be in the grid, you know which grid squares to query for potential collidees. If you check collisions against the environment for each projectile in turn, there's no need for the other entities to then check for collisions against the projectiles in reverse.

Resources