Ways to determine a group of units in RTS - algorithm

I'm looking for an algorithm that can be used to determine groups of units that move together as a squad in a real-time strategy game like StarCraft. The direction I am currently looking at is a clustering algorithm, but I'm having a hard time finding which one would work best, since the units are moving as a group rather than just standing still. Any help would be great.

K-means is not the best choice, as it requires you to specify the number of clusters you expect to find; with the wrong number, some clusters may end up containing only a single unit.
I recommend adapting DBSCAN, in particular the generalized version, GDBSCAN.
For this, you need to define what constitutes the neighborhood of a unit - say, any other unit within a range of 2 that belongs to the same player and is moving in approximately the same direction (up to a certain delta threshold in x and y velocity).
Next, you need to specify how many neighbors a unit needs before you consider it to seed a cluster, i.e. to be a "core point". Say that minimum is 3 units.
Using DBSCAN is then quite straightforward and should give you good results, though you will need to fine-tune the parameters a bit. Things like this minimum size are clearly input parameters and depend on your use case. So is the neighborhood definition: you are looking for groups that move in the same direction, and this information needs to be fed into the algorithm somehow. With GDBSCAN this is trivial, since you only have to adjust the neighborhood definition.
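For illustration, here is a minimal Python sketch of that idea (not a full GDBSCAN implementation); the unit fields, the range of 2, the velocity delta of 0.5, and the minimum core size of 3 are example values or assumptions you would tune for your game:

```python
# A minimal GDBSCAN-style grouping sketch. Assumes each unit is a dict with
# 'x', 'y', 'vx', 'vy' and 'player' fields; the thresholds are placeholders.

def neighbors(unit, units, rng=2.0, vdelta=0.5):
    """Other units of the same player within range that move roughly the same way."""
    result = []
    for other in units:
        if other is unit or other['player'] != unit['player']:
            continue
        if (other['x'] - unit['x']) ** 2 + (other['y'] - unit['y']) ** 2 > rng ** 2:
            continue
        if abs(other['vx'] - unit['vx']) > vdelta or abs(other['vy'] - unit['vy']) > vdelta:
            continue
        result.append(other)
    return result

def group_units(units, min_core=3):
    """DBSCAN-style expansion: core units (>= min_core neighbors) seed a group,
    which grows through the neighbors of its core members. Units that never
    get a label are not part of any squad."""
    labels = {}           # id(unit) -> group index
    group = 0
    for unit in units:
        if id(unit) in labels:
            continue
        seed = neighbors(unit, units)
        if len(seed) < min_core:
            continue      # not a core unit; may still be absorbed by a group later
        labels[id(unit)] = group
        queue = list(seed)
        while queue:
            nxt = queue.pop()
            if id(nxt) in labels:
                continue
            labels[id(nxt)] = group
            nxt_nb = neighbors(nxt, units)
            if len(nxt_nb) >= min_core:   # only core units extend the group
                queue.extend(nxt_nb)
        group += 1
    return labels
```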

You may want to look at a number of classification algorithms, such as k-Nearest Neighbors or Support Vector Machines.

The k-means algorithm is a simple and standard approach; you can check whether it works for your case:
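For example, a minimal sketch using scikit-learn's KMeans on unit positions (the positions and k = 2 are made up for illustration); note that the number of squads has to be chosen in advance, which is exactly the limitation pointed out above:

```python
# Hypothetical illustration: clustering unit positions with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

positions = np.array([[1.0, 1.2], [1.1, 0.9], [0.8, 1.0],        # squad near (1, 1)
                      [9.8, 10.1], [10.2, 9.9], [10.0, 10.3]])   # squad near (10, 10)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(positions)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] - one group index per unit
print(kmeans.cluster_centers_)  # the two squad centers
```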


Select relevant features with PCA and K-MEANS

I am trying to understand the PCA and K-Means algorithms in order to extract some relevant features from a set of features.
I don't know which branch of computer science studies these topics; there don't seem to be good resources on the internet, just some papers that I don't understand well. An example of such a paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf
I have CSV files of people's walks, composed as follows:
TIME, X, Y, Z - these values are recorded by the accelerometer.
What I did:
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2000+ features per walk.
I have to use PFA (Principal Feature Analysis) to select the relevant features from the set of feature vectors.
In order to do the last point, I have to reduce the dimension of the set of walk features with PCA (PCA will make the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here I have the first question:
What should the input of PCA look like? Are the rows the walks and the columns the features, or vice versa, i.e. the rows are the features and the columns are the walks?
After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. What should the input to K-Means look like? And what is the purpose of using this algorithm? All I know is that this algorithm is used to 'cluster' some data, so that each cluster contains some 'points' grouped by some rule. What I did and think is this:
If the input to PCA has the walks as rows and the features as columns, then for K-Means I should swap rows and columns, because that way each point is a feature (but this is no longer the original data with the features, it's just the reduced version, so I am not sure). Then, for each cluster, I would use the Euclidean distance to see which point is closest to the centroid and select that feature. But how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I tell which feature in the original set a point in the reduced data corresponds to?
Maybe what I am saying is not correct, but I am trying to understand this. Can some of you help me? Am I on the right track? Thanks!
For PCA, make sure you separate the understanding of the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A', where the dimension (the number of features, in your case) of A' is possibly smaller than that of A.
So the first feature/element in space A' is a linear combination of the features of A.
Whether rows or columns hold the features depends on the implementation, but if you use scikit-learn's PCA, the columns are the features.
You can feed the PCA output, the A' space, to K-Means, and it will cluster the walks in a space of (usually) reduced dimension.
Each point will be part of a cluster, and the idea is that if you ran K-Means on A, you would probably end up with the same or similar clusters as with A'; computationally, A' is a lot cheaper. You then have a clustering on both A' and A, as we agree that points similar in A' are also similar in A.
The number of clusters is difficult to pin down; if you have nothing else to go on, look up the elbow method. But if you just want to get a sense of the different kinds of walks you have, I would argue for roughly 3 to 8 clusters rather than too many: compare the 2-3 points closest to each center, and you have something consumable. The number of features can be larger than the number of clusters, and vice versa; e.g. if we want to find the densest spots in some 2D area, we can easily use 50 clusters to get a sense of where 50 cities could be. Here the number of clusters is far higher than the space dimension, and it makes sense.
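To make the input shapes concrete, here is a hedged sketch of the pipeline with scikit-learn, assuming one row per walk and one column per tsfresh feature; the array sizes, the 10 components, and the 4 clusters are placeholder values:

```python
# Sketch: rows = walks (samples), columns = features, as scikit-learn expects.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(50, 2000)                      # placeholder: 50 walks, 2000+ features

X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale
pca = PCA(n_components=10)                        # the mapping A -> A', 10 components here
X_reduced = pca.fit_transform(X_scaled)           # shape (50, 10): still one row per walk

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_reduced)
print(kmeans.labels_)                  # cluster index per walk
print(pca.explained_variance_ratio_)   # how much variance each component keeps
```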

Bark at a Sphere Tree or Look Somewhere Else?

The scenario: a large number of players, playing a real-time game in 3D space, must be organized in a way where a server can efficiently update other players, and any other observer, about a player's moves and actions. Which objects 'talk' to one another needs to be culled based on their range from one another in the simulation; this is to preserve network sanity, programmer sanity, and also to allow server-lets to handle smaller chunks of the overall world play-space.
However, if you have 3000 players, this runs into the issue that one must seemingly run 3000! calculations to find the ranges between everything. (Google tells me that ends up as a number with over 9000 digits; that's insane and not worth considering for a near-real-time environment.)
Daybreak Games seems to have solved this problem with their massive online first-person shooter Planetside 2; it allowed 3000 players to play in a shared space with real-time responsiveness. They've apparently done it through a "sphere tree" data structure.
However, I'm not positive this is the solution they use, and I'm still questioning how to apply the concept of "sphere trees" to reduce the range calculations for culling to a reasonable amount.
If sphere trees are not the right tree to bark up, what else should I be directing my attention at to tackle this problem?
(I'm a C# programmer (mainly), but I'm looking for a logical answer, not a code one.)
References I've found about sphere trees:
http://isg.cs.tcd.ie/spheretree/#algorithms
https://books.google.com/books?id=1-NfBElV97IC&pg=PA385&lpg=PA385#v=onepage&q&f=false
Here are a few of my thoughts:
Let n denote the total number of players.
I think your estimate of 3000! is wrong. If you want to calculate all pairwise distances, you perform 3000 choose 2 (about 4.5 million) distance computations, which is on the order of O(n^2 * t), where t is the cost of calculating the distance between two players. If you build the graph underlying the players with edge weights being the Euclidean distance, you can reduce this to the all-pairs shortest paths problem, which is solvable via the Floyd-Warshall algorithm in O(n^3).
What you're describing sounds pretty similar to doing a range query: https://en.wikipedia.org/wiki/Range_searching. There are a lot of data structures that can help you, such as range trees and k-d trees.
If objects only need to interact with objects that are, e.g., <= 100m away, then you can divide the world up into 100m x 100m tiles (or voxels), and keep track of which objects are in each non-empty tile.
Then, when one object needs to 'talk', you only need to check the objects in at most 9 tiles to see if they are close enough to hear it.
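As an illustration of that scheme, here is a small 2D Python sketch (the 100 m tile size and the data layout are assumptions); the same idea extends to 3D voxels:

```python
# Bucket objects by 100 m tile; a range query then only inspects the 3 x 3
# block of tiles around the speaker instead of every object in the world.
from collections import defaultdict
import math

TILE = 100.0   # tile edge length; must be >= the interaction radius

def tile_of(x, y):
    return (int(math.floor(x / TILE)), int(math.floor(y / TILE)))

def build_grid(objects):
    """objects: iterable of (obj_id, x, y) tuples."""
    grid = defaultdict(list)
    for obj_id, x, y in objects:
        grid[tile_of(x, y)].append((obj_id, x, y))
    return grid

def in_range(grid, x, y, radius=100.0):
    """Return the ids of objects within `radius` of (x, y)."""
    tx, ty = tile_of(x, y)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for obj_id, ox, oy in grid.get((tx + dx, ty + dy), []):
                if (ox - x) ** 2 + (oy - y) ** 2 <= radius ** 2:
                    hits.append(obj_id)
    return hits
```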

clustering algorithm for objects which have multiple feature time series information

I am looking for a clustering algorithm that can handle multiple time series for each object.
For example, for company "A" we have time series of 3 features (e.g. income, sales, inventory).
In the same way, company "B" has time series of the same features, and so on.
Then, how can we cluster the set of companies?
Is there some wise way to handle this?
A lot of clustering algorithms ask you to provide some measure of the similarity or distance between two points. It is really up to you to decide which features are important and what the distance really is. One way forward would be to use the correlation between two time series; this gives you a similarity. If you have to convert this to a distance, I would use sqrt(1 - r), where r is the correlation, because if you look e.g. at the equation at the bottom of http://www.analytictech.com/mb876/handouts/distance_and_correlation.htm you can see that this is proportional to a distance between points in n-dimensional space. If you have three different time series (income, sales, inventory), I would use the sum of the three distances worked out from the correlations between the corresponding pairs of time series.
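A minimal sketch of that distance, using made-up company data and SciPy's hierarchical clustering; the series values, the 'average' linkage, and the choice of 2 clusters are all illustrative assumptions:

```python
# sqrt(1 - r) distance summed over the three feature series, per company pair.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

companies = {
    'A': {'income': [1, 2, 3, 4], 'sales': [2, 2, 3, 5], 'inventory': [5, 4, 4, 3]},
    'B': {'income': [1, 2, 2, 4], 'sales': [2, 3, 3, 5], 'inventory': [5, 5, 4, 3]},
    'C': {'income': [4, 3, 2, 1], 'sales': [5, 3, 2, 2], 'inventory': [3, 4, 4, 5]},
}
names = list(companies)

def distance(a, b):
    """Sum of sqrt(1 - r) over the three feature series."""
    total = 0.0
    for feature in ('income', 'sales', 'inventory'):
        r = np.corrcoef(a[feature], b[feature])[0, 1]
        total += np.sqrt(max(0.0, 1.0 - r))
    return total

n = len(names)
dmat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dmat[i, j] = dmat[j, i] = distance(companies[names[i]], companies[names[j]])

clusters = fcluster(linkage(squareform(dmat), method='average'), t=2, criterion='maxclust')
print(dict(zip(names, clusters)))   # e.g. {'A': 1, 'B': 1, 'C': 2}
```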
Another option, especially if the time series are not very long, would be to regard a time series of length n as a point in n-dimensional space and feed this into the clustering algorithm, or to use http://en.wikipedia.org/wiki/Principal_component_analysis to reduce the n dimensions down to one by looking at the most significant components. (While you are doing this, it never hurts to plot the points using the least significant components and investigate points that stand out from the others; points where the data is in error sometimes stand out here.)

How do I extend a support vector machine algorithm to a high dimensional data set?

I'm trying to implement an SVM algorithm, but I'm having a hard time understanding how d-dimensional data sets are actually handled. In my particular case, each 'point' has nearly 400 identifying features.
In two-dimensional space, it basically tries to find a line between the two groups that maximizes the margin from any point on either side. I can sort of imagine what such a 'line' would look like in a d-dimensional space, but I'm completely lost on how the classification would actually work.
There is a similar question here, but I'm not getting it. I sort of get how the separation would occur after you have the classifier, but I'm lost on how to actually get the classifier.
If you can imagine how the line of the 2D case becomes a d-dimensional hyperplane in higher dimensions, then you are pretty much done. The actual classification occurs when you test a point against the hyperplane, which gives you a positive number if the point belongs to class 1 or a negative one if it belongs to class 2.
Notice that in the decision function, sign(w · x − b), there is no restriction on the dimension of each point x; w is the normal vector of the hyperplane and b its offset. (The formula image was courtesy of Wikipedia.)
And in case you are curious about what happens in the non-linear case when you use the kernel trick, here is a video that illustrates the idea very well:
http://www.youtube.com/watch?v=3liCbRZPrZA
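For the linear case, a hedged sketch with scikit-learn on roughly 400-dimensional points; the synthetic data and the LinearSVC choice are assumptions, and the point is only that the sign of the linear score (w · x plus a bias, in scikit-learn's convention) classifies regardless of dimension:

```python
# A linear SVM on ~400-dimensional synthetic data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 400))            # 200 points, 400 features each
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a simple, roughly linearly separable label

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

x_new = rng.normal(size=(1, 400))
score = x_new @ clf.coef_.T + clf.intercept_   # the linear score; a single number here
print(score, clf.predict(x_new))               # positive score -> class 1, negative -> class 0
```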

Graph Simplification Algorithm Advice Needed

I need to take a 2D graph of n points and reduce it to r points (where r is a specific number less than n). For example, I may have two datasets with slightly different numbers of total points, say 1021 and 1001, and I'd like to force both datasets down to 1000 points. I am aware of a couple of simplification algorithms: Lang simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for is:
1) must preserve the shape of the line
2) must allow me to reduce the dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not enough of an expert in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance, and the array of points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively, we could use Douglas-Peucker, but again I am not sure whether the number of points in the output can be specified.
I should add I am not an expert on these types of algorithms or any kind of math wiz, so I am looking for mere mortal type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list, it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn, where Pm is the point between P0 and Pn which is furthest from the line P0-Pn; the next level (if full) will represent a four-line approximation, and so on. You can then trim the tree either on the basis of depth or on the basis of the distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then, when you've finished, the queue tells you which points to remove (or keep, according to the order of the priorities).
You can find my C++ implementation of, and article on, Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue, as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know whether it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation of Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with the less strict requirement 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach: guess an initial lookahead of point count / desired point count, then slowly increase the lookahead until you approximately hit the desired point count.
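A sketch of that iterative loop; `lang_simplify` here is a hypothetical placeholder for whatever Lang implementation you end up using, and only the lookahead adjustment is the point:

```python
# Iteratively nudge the Lang lookahead toward a desired point count.
def reduce_to_approx(points, target, tolerance, lang_simplify, max_lookahead=64):
    """Increase the lookahead until the simplified line has roughly `target` points.
    `lang_simplify(points, lookahead, tolerance)` is assumed, not a real library call."""
    lookahead = max(2, len(points) // target)     # initial guess, as suggested above
    best = points
    while lookahead <= max_lookahead:
        simplified = lang_simplify(points, lookahead, tolerance)
        best = simplified
        if len(simplified) <= target:             # close enough; stop increasing
            break
        lookahead += 1
    return best
```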
I hope this helps.
P.S.: I just remembered something: you could also try the Visvalingam-Whyatt algorithm. In short (a sketch follows below):
- compute the triangle area of each point with its direct neighbors
- sort these areas
- remove the point with the smallest area
- update the areas of its neighbors
- re-sort
- continue until only the desired number of points remains
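Here is a hedged Python sketch of those steps; it rescans for the minimum area each pass for clarity (a heap would be faster), never removes the endpoints, and the example polyline is made up:

```python
# Visvalingam-Whyatt: repeatedly drop the interior point whose triangle with
# its two neighbors has the smallest area, until only r points remain.

def triangle_area(a, b, c):
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam_whyatt(points, r):
    pts = list(points)
    while len(pts) > max(r, 2):
        # area of every interior point with its direct neighbors
        areas = [(triangle_area(pts[i - 1], pts[i], pts[i + 1]), i)
                 for i in range(1, len(pts) - 1)]
        _, idx = min(areas)       # the point contributing least to the shape
        del pts[idx]              # removing it implicitly updates its neighbors' areas
    return pts

# Example: reduce a small polyline to exactly 4 points.
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(visvalingam_whyatt(line, 4))
```

This directly satisfies requirement 2, since the loop stops exactly at r points.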
