Exploration Algorithm - algorithm

Massively edited this question to make it easier to understand.
Given an environment with arbitrary dimensions and arbitrary positioning of an arbitrary number of obstacles, I have an agent exploring the environment with a limited range of sight (obstacles don't block sight). It can move in the four cardinal directions of NSEW, one cell at a time, and the graph is unweighted (each step has a cost of 1). Linked below is a map representing the agent's (yellow guy) current belief of the environment at the instant of planning. Time does not pass in the simulation while the agent is planning.
http://imagizer.imageshack.us/a/img913/9274/qRsazT.jpg
What exploration algorithm can I use to maximise the cost-efficiency of utility, given that revisiting cells is allowed? Each cell holds a utility value. Ideally, I would seek to maximise the sum of utility of all cells SEEN (not visited) divided by the path length, although if that is too complex for any suitable algorithm then the number of cells seen will suffice. There is a maximum path length but it is generally in the hundreds or higher. (The actual test environments used on my agent are at least 4x bigger, although theoretically there is no upper bound on the dimensions that can be set, and the maximum path length would thus increase accordingly.)
I consider BFS and DFS to be intractable, A* to be non-optimal given a lack of suitable heuristics, and Dijkstra's inappropriate for generating a single unbroken path. Is there any algorithm you can think of? Also, I need help with loop detection: this is the first time I am allowing revisits, so I have never had to do it before.
One approach I have considered is to reduce the map into a spanning tree, except that instead of defining it as a tree that connects all cells, it is defined as a tree that can see all cells. My approach would result in the following:
http://imagizer.imageshack.us/a/img910/3050/HGu40d.jpg
In the resultant tree, the agent can go from a node to any adjacent nodes that are 0-1 turn away at intersections. This is as far as my thinking has gotten right now. A solution generated using this tree may not be optimal, but it should at least be near-optimal with far fewer cells being processed by the algorithm, so if that would make the algorithm more likely to be tractable, then I guess that is an acceptable trade-off. However, I'm still stuck on how exactly to generate a path for this.

Your problem is very similar to a canonical Reinforcement Learning (RL) problem, the Grid World. I would formalize it as a standard Markov Decision Process (MDP) and use any RL algorithm to solve it.
The formalization would be:
States s: your NxM discrete grid.
Actions a: UP, DOWN, LEFT, RIGHT.
Reward r: the value of the cells that the agent can see from the destination cell s', i.e. r(s,a,s') = sum(value(seen(s'))).
Transition function: P(s' | s, a) = 1 if s' is not out of the boundaries or a black cell, 0 otherwise.
Since you are interested in the average reward, the discount factor is 1 and you have to normalize the cumulative reward by the number of steps. You also said that each step has a cost of one, so you could subtract 1 from the immediate reward r at each time step, but this would not add anything since you will already average by the number of steps.
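To make the formalization concrete, here is a minimal sketch of the transition and reward functions. The grid, the sight radius `SIGHT` and the rule that a blocked move leaves the agent in place are assumptions made for illustration, not details from the question. Note that, exactly as the reward is written above, a cell that is seen again is counted again.

```python
# Hypothetical belief map: None marks an obstacle, numbers are cell utilities.
GRID = [
    [1, 1, 2, 1, 1],
    [1, None, 2, None, 1],
    [1, 1, 3, 1, 1],
    [1, 1, 1, 1, 5],
]
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
SIGHT = 1  # assumed sight radius in cells; obstacles do not block sight


def is_free(cell):
    """True if the cell is inside the grid and is not an obstacle."""
    r, c = cell
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] is not None


def step(state, action):
    """Deterministic transition; a blocked move keeps the agent in place (assumption)."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if is_free(nxt) else state


def seen(cell):
    """All non-obstacle cells within SIGHT of the given cell."""
    r0, c0 = cell
    return {
        (r, c)
        for r in range(r0 - SIGHT, r0 + SIGHT + 1)
        for c in range(c0 - SIGHT, c0 + SIGHT + 1)
        if is_free((r, c))
    }


def reward(next_state):
    """r(s, a, s') = sum of the values of the cells seen from s'."""
    return sum(GRID[r][c] for r, c in seen(next_state))


def average_reward(start, actions):
    """Cumulative reward of an action sequence, normalized by the number of steps."""
    state, total = start, 0
    for action in actions:
        state = step(state, action)
        total += reward(state)
    return total / len(actions)
```

If you only want to reward newly seen cells, the set of cells already seen has to become part of the state, which makes the state space much larger.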
Since the problem is discrete the policy could be a simple softmax (or Gibbs) distribution.
As a solving algorithm you can use Q-learning, which guarantees the optimality of the solution provided a sufficient number of samples. However, if your grid is too big (and you said that there is no limit) I would suggest policy search algorithms, like policy gradient or relative entropy (although they guarantee convergence only to local optima). You can find something about Q-learning basically everywhere on the Internet. For a recent survey on policy search I suggest this.
The cool thing about these approaches is that they encode the exploration in the policy (e.g., the temperature in a softmax policy, the variance in a Gaussian distribution) and will try to maximize the cumulative long term reward as described by your MDP. So usually you initialize your policy with high exploration (e.g., a completely random policy) and by trial and error the algorithm will make it deterministic and converge to the optimal one (however, sometimes a stochastic policy is also optimal).
The main difference between all the RL algorithms is how they perform the update of the policy at each iteration and manage the exploration-exploitation tradeoff (how much should I explore vs. how much should I exploit the information I already have).
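As a small illustration of how the temperature encodes exploration, here is a sketch of a softmax (Gibbs) policy over the four actions. The function names and the preference values are made up for the example; in a real learner the preferences would come from the algorithm (e.g. Q-values or policy parameters).

```python
import math
import random


def softmax_policy(preferences, temperature):
    """Map action preferences to probabilities; a high temperature means more exploration."""
    top = max(preferences.values())  # subtract the max for numerical stability
    exps = {a: math.exp((p - top) / temperature) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def sample_action(probabilities, rng=random):
    """Draw one action according to the policy's probabilities."""
    actions = list(probabilities)
    weights = [probabilities[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

With a large temperature the policy is close to uniform (almost pure exploration); as the temperature goes towards zero it concentrates on the action with the highest preference.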
As suggested by Demplo, you could also use Genetic Algorithms (GA), but they are usually slower and require more tuning (elitism, crossover, mutation...).
I have also tried some policy search algorithms on your problem and they seem to work well, although I initialized the grid randomly and do not know the exact optimal solution. If you provide some additional details (a test grid, the max number of steps and whether the initial position is fixed or random) I can test them more precisely.

Related

Traveling Salesman Problem for a large number of vertices

I need to solve TSP for a large number of vertices (30-100) with good accuracy and in adequate time (like 1-2 days). My graph can contain asymmetrical edges (g[i][j] not equal to g[j][i]).
I tried greedy, Little's algorithm (maybe my bad, but it shows worse results than greedy), a simple genetic algorithm (barely better than greedy), and dynamic programming in O(2^n*n) (quickly runs out of memory).
Well, 30-100 is not really a large number of vertices. Did you miss some zeroes? Or are you facing some special hard-to-solve cases like p43 from TSPLIB?
In any case, if you are looking for a good heuristic, I used to use Ant Colony Optimization for Asymmetric TSP. It is easy to implement and provides quite good performance.
You might take a look at my old implementation: https://github.com/aligusnet/optimer/tree/master/src/heuristics/aco
If you can accept a "close to optimal" solution rather than an optimal one, I can suggest a "random traveling" algorithm. The idea of this algorithm: do not run a BFS/DFS search through the entire combination tree, but search only random DFS subtrees.
For example, you have vertices [A-Z], and your start point is [A]. Make 10,000 attempts for each path prefix (25 prefixes in total for [A-Z]), starting from [A-B-...], [A-C-...] and so on, where [...] is a randomly selected full-depth path through your graph, according to your rules. Keep, for each prefix, the sum of the costs of all paths tried with that prefix. Because every "start prefix" gets the same number of attempts, the prefix with the smallest sum shows you the best step from [A]. Of course, this does not guarantee the optimum, but it is likely to be close.
For example, suppose the sum of the 10,000 attempts with prefix [A-K] is the lowest. The next step is to accept [A-K] as the first step and repeat the algorithm from there, until you have found the full solution.
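A rough sketch of this "random traveling" idea, under the assumption that the graph is given as a full (possibly asymmetric) cost matrix and that the tour returns to the start. The names `random_rollout_tsp` and `attempts` are mine, not from any library.

```python
import random


def tour_cost(cost, tour):
    """Cost of the closed tour, including the edge back to the start."""
    return sum(cost[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def random_rollout_tsp(cost, start=0, attempts=200, seed=0):
    """Build a tour one vertex at a time, scoring each candidate by random completions."""
    rng = random.Random(seed)
    tour = [start]
    remaining = set(range(len(cost))) - {start}
    while remaining:
        best_next, best_total = None, float("inf")
        for candidate in sorted(remaining):
            rest = sorted(remaining - {candidate})
            total = 0.0
            # Same number of attempts for every prefix, so the sums are comparable.
            for _ in range(attempts):
                rng.shuffle(rest)
                total += tour_cost(cost, tour + [candidate] + rest)
            if total < best_total:
                best_next, best_total = candidate, total
        tour.append(best_next)
        remaining.remove(best_next)
    return tour
```

Each step costs roughly (remaining vertices) x `attempts` tour evaluations, so the number of attempts is the knob that trades running time for quality.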
Here's the TSP source code of the OptaPlanner implementation, fwiw. It deals with datasets up to 10k visits pretty well when NearbySelection is activated (or up to 500 visits or so if it's not activated) - to go above 10k you'll need to activate Partitioned Search, which comes at a trade-off.
It has asymmetric datasets (using OpenStreetMap data) in the import/belgium/road-time directory. It can't prove if it reaches the optimal solution or not. Usually termination is set on either a few minutes or on a few unimproved minutes.
Benchmarks showed that Late Acceptance has slightly better results than Simulated Annealing and Tabu Search, given a specific set of MoveSelectors configured, but your mileage may vary...

A* Algorithm for very large graphs, any thoughts on caching shortcuts?

I'm writing a courier/logistics simulation on OpenStreetMap maps and have realised that the basic A* algorithm as pictured below is not going to be fast enough for large maps (like Greater London).
The green nodes correspond to ones that were put in the open set/priority queue and due to the huge number (the whole map is something like 1-2 million), it takes 5 seconds or so to find the route pictured. Unfortunately 100ms per route is about my absolute limit.
Currently, the nodes are stored in both an adjacency list and also a spatial 100x100 2D array.
I'm looking for methods where I can trade off preprocessing time, space and if needed optimality of the route, for faster queries. The straight-line Haversine formula for the heuristic cost is the most expensive function according to the profiler - I have optimised my basic A* as much as I can.
For example, I was thinking that if I chose an arbitrary node X from each quadrant of the 2D array and ran A* between each pair, I could store the routes to disk for subsequent simulations. When querying, I could run A* search only within the quadrants, to get between the precomputed route and the X.
Is there a more refined version of what I've described above, or perhaps a different method I should pursue? Many thanks!
For the record, here are some benchmark results for arbitrarily weighting the heuristic cost and computing the path between 10 pairs of randomly picked nodes:
Weight   AvgDist (ratio)   Time (ms)
1        1                 1461.2
1.05     1                 1327.2
1.1      1                 900.7
1.2      1.019658848       196.4
1.3      1.027619169       53.6
1.4      1.044714394       33.6
1.5      1.063963413       25.5
1.6      1.071694171       24.1
1.7      1.084093229       24.3
1.8      1.092208509       22
1.9      1.109188175       22.5
2        1.122856792       18.2
2.2      1.131574742       16.9
2.4      1.139104895       15.4
2.6      1.140021962       16
2.8      1.14088128        15.5
3        1.156303676       16
4        1.20256964        13
5        1.19610861        12.9
Surprisingly increasing the coefficient to 1.1 almost halved the execution time whilst keeping the same route.
You should be able to make it much faster by trading off optimality. See "Admissibility and optimality" on Wikipedia.
The idea is to use an epsilon value which will lead to a solution no worse than 1 + epsilon times the optimal path, but which will cause fewer nodes to be considered by the algorithm. Note that this does not mean that the returned solution will always be 1 + epsilon times the optimal path. This is just the worst case. I don't know exactly how it would behave in practice for your problem, but I think it is worth exploring.
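Here is a minimal sketch of that idea (usually called weighted A*) on a small 4-connected grid with unit step costs, where `weight` plays the role of 1 + epsilon. The grid representation and the Manhattan heuristic are assumptions for the example; on a road network you would plug in your own graph and distance heuristic.

```python
import heapq


def weighted_astar(grid, start, goal, weight=1.0):
    """Return (path_cost, expanded_nodes); path_cost is None if the goal is unreachable.

    grid[r][c] == 0 means free, anything else is blocked. With weight == 1 this is
    plain A*; with weight > 1 the cost is at most weight times the optimal cost.
    """
    def h(cell):
        # Manhattan distance: admissible and consistent for unit-cost 4-connected grids.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_g = {start: 0}
    frontier = [(weight * h(start), 0, start)]
    closed = set()
    expanded = 0
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell in closed:
            continue  # stale queue entry
        closed.add(cell)
        expanded += 1
        if cell == goal:
            return g, expanded
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])):
                continue
            if grid[nr][nc] != 0 or nxt in closed:
                continue
            new_g = g + 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                # Inflating the heuristic makes the search greedier towards the goal.
                heapq.heappush(frontier, (new_g + weight * h(nxt), new_g, nxt))
    return None, expanded
```

Comparing the `expanded` count for a few weights on your own data is a cheap way to see how much work a given weight saves.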
Wikipedia lists a number of algorithms that rely on this idea. I believe this is your best bet to improve the algorithm and that it has the potential to run within your time limit while still returning good paths.
Since your algorithm does deal with millions of nodes in 5 seconds, I assume you also use binary heaps for the implementation, correct? If you implemented them manually, make sure they are implemented as simple arrays and that they are binary heaps.
There are specialist algorithms for this problem that do a lot of pre-computation. From memory, the pre-computation adds information to the graph that A* uses to produce a much more accurate heuristic than straight line distance. Wikipedia gives the names of a number of methods at http://en.wikipedia.org/wiki/Shortest_path_problem#Road_networks and says that Hub Labelling is the leader. A quick search on this turns up http://research.microsoft.com/pubs/142356/HL-TR.pdf. An older one, using A*, is at http://research.microsoft.com/pubs/64505/goldberg-sp-wea07.pdf.
Do you really need to use Haversine? To cover London, I would have thought you could have assumed a flat earth and used Pythagoras, or stored the length of each link in the graph.
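To give a feel for how little you lose, here is a sketch comparing Haversine with a flat-earth (equirectangular) approximation that scales longitude by the cosine of one fixed reference latitude. The Earth radius constant and the reference latitude of 51.5 degrees for London are assumptions; the point is that the cosine is computed once for the whole map, so the per-call cost is a couple of multiplications and one square root.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, a common approximation
COS_LONDON = math.cos(math.radians(51.5))  # computed once for the whole map


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flat_earth_m(lat1, lon1, lat2, lon2, cos_ref_lat=COS_LONDON):
    """Pythagoras on a locally flat map; only reasonable for small areas."""
    x = math.radians(lon2 - lon1) * cos_ref_lat
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)
```

One caveat: the approximation can come out slightly above the true distance, so if you need a strictly admissible heuristic, scale it down by a small safety factor.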
There's a really great article that Microsoft Research wrote on the subject:
http://research.microsoft.com/en-us/news/features/shortestpath-070709.aspx
The original paper is hosted here (PDF):
http://www.cc.gatech.edu/~thad/6601-gradAI-fall2012/02-search-Gutman04siam.pdf
Essentially there's a few things you can try:
Start from both the source and the destination. This helps to minimize the amount of wasted work that you'd perform when traversing from the source outwards towards the destination.
Use landmarks and highways. Essentially, find some positions in each map that are commonly taken paths and perform some pre-calculation to determine how to navigate efficiently between those points. If you can find a path from your source to a landmark, then to other landmarks, then to your destination, you can quickly find a viable route and optimize from there.
Explore algorithms like the "reach" algorithm. This helps to minimize the amount of work that you'll do when traversing the graph by minimizing the number of vertices that need to be considered in order to find a valid route.
GraphHopper does two more things to get fast, non-heuristic and flexible routing (note: I'm the author and you can try it online here).
A not so obvious optimization is to avoid 1:1 mapping of OSM nodes to internal nodes. Instead GraphHopper uses only junctions as nodes and saves roughly 1/8th of traversed nodes.
It has efficient implementations of A*, Dijkstra and e.g. one-to-many Dijkstra, which make a route in under 1s possible through the whole of Germany. The (non-heuristic) bidirectional version of A* makes this even faster.
So it should be possible to get you fast routes for greater London.
Additionally, the default mode is the speed mode, which makes everything an order of magnitude faster (e.g. 30ms for Europe-wide routes) but less flexible, as it requires preprocessing (Contraction Hierarchies). If you don't like this, just disable it and also further fine-tune the included streets for car or, probably better, create a new profile for trucks - e.g. exclude service streets and tracks, which should give you a further 30% boost. And as with any bidirectional algorithm you could easily implement a parallel search.
I think it's worth working out your idea with "quadrants". More strictly, I'd call it a low-resolution route search.
You may pick X connected nodes that are close enough, and treat them as a single low-resolution node. Divide your whole graph into such groups, and you get a low-resolution graph. This is a preparation stage.
In order to compute a route from source to target, first identify the low-res nodes they belong to, and find the low-resolution route. Then improve your result by finding the route on the high-resolution graph, restricting the algorithm to nodes that belong to the low-resolution nodes of the low-resolution route (optionally you may also consider neighbor low-resolution nodes up to some depth).
This may also be generalized to multiple resolutions, not just high/low.
At the end you should get a route that is close enough to optimal. It's locally optimal, but may be somewhat worse than the global optimum, to an extent that depends on the resolution jump (i.e. the approximation you make when a group of nodes is defined as a single node).
There are dozens of A* variations that may fit the bill here. You have to think about your use cases, though.
Are you memory- (and also cache-) constrained?
Can you parallelize the search?
Will your algorithm implementation be used in one location only (e.g. Greater London and not NYC or Mumbai or wherever)?
There's no way for us to know all the details that you and your employer are privy to. Your first stop thus should be CiteSeer or Google Scholar: look for papers that treat pathfinding with the same general set of constraints as you.
Then downselect to three or four algorithms, do the prototyping, test how they scale up and finetune them. You should bear in mind you can combine various algorithms in the same grand pathfinding routine based on distance between the points, time remaining, or any other factors.
As has already been said, given the small scale of your target area, dropping Haversine is probably your first step, saving precious time on expensive trig evaluations. NOTE: I do not recommend using Euclidean distance in lat, lon coordinates - reproject your map into e.g. a transverse Mercator projection centered near the middle of the map and use Cartesian coordinates in yards or meters!
Precomputing is the second one, and changing compilers may be an obvious third idea (switch to C or C++ - see https://benchmarksgame.alioth.debian.org/ for details).
Extra optimization steps may include getting rid of dynamic memory allocation, and using efficient indexing for search among the nodes (think R-tree and its derivatives/alternatives).
I worked at a major Navigation company, so I can say with confidence that 100 ms should get you a route from London to Athens even on an embedded device. Greater London would be a test map for us, as it's conveniently small (easily fits in RAM - this isn't actually necessary)
First off, A* is entirely outdated. Its main benefit is that it "technically" doesn't require preprocessing. In practice, you need to pre-process an OSM map anyway so that's a pointless benefit.
The main technique to give you a huge speed boost is arc flags. If you divide the map in say 5x6 sections, you can allocate 1 bit position in a 32 bits integer for each section. You can now determine for each edge whether it's ever useful when traveling to section {X,Y} from another section. Quite often, roads are bidirectional and this means only one of the two directions is useful. So one of the two directions has that bit set, and the other has it cleared. This may not appear to be a real benefit, but it means that on many intersections you reduce the number of choices to consider from 2 to just 1, and this takes just a single bit operation.
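A toy sketch of how arc flags are used at query time. The four-node graph, its two regions and the flag values are hypothetical and were worked out by hand for this example; in a real system the flags come out of the preprocessing step. Bit i of an edge's flags is set if the edge starts some shortest path into region i.

```python
import heapq

# Hypothetical toy data: which region each node belongs to.
REGION_OF = {"a": 0, "b": 0, "c": 1, "d": 1}

# EDGES[u] = list of (v, length, arc_flags); the flags were precomputed by hand.
EDGES = {
    "a": [("b", 1, 0b11), ("c", 5, 0b00)],
    "b": [("a", 1, 0b01), ("c", 1, 0b10), ("d", 4, 0b00)],
    "c": [("b", 1, 0b01), ("d", 1, 0b10)],
    "d": [("c", 1, 0b11)],
}


def dijkstra(source, target, use_flags=True):
    """Return (distance, number_of_edges_relaxed); distance is None if unreachable."""
    target_bit = 1 << REGION_OF[target]
    dist = {source: 0}
    heap = [(0, source)]
    relaxed = 0
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d, relaxed
        if d > dist[u]:
            continue  # stale queue entry
        for v, length, flags in EDGES[u]:
            if use_flags and not (flags & target_bit):
                continue  # one bit test rules the edge out for this target region
            relaxed += 1
            nd = d + length
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None, relaxed
```

On this toy graph the flagged search from "a" to "d" relaxes fewer edges than the unflagged one while returning the same distance, which is the whole point of the technique.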
Usually A* struggles with memory consumption rather than with time.
However, I think it could be useful to first compute only with nodes that are part of "big streets": you would usually choose a highway over a tiny alley.
I guess you may already use this for your weight function, but you can be faster if you use a priority queue to decide which node to test next for further travelling.
Also you could try reducing the graph to only the nodes that are part of low-cost edges and then find a way from the start/end to the closest of these nodes.
So you have 2 paths: from the start to the "big street" and from the "big street" to the end.
You can now compute the best path between the two nodes that are part of the "big streets" in a reduced graph.
Old question, but yet:
Try to use a different heap than the binary heap. The heap with the best asymptotic complexity is definitely the Fibonacci heap, and its wiki page has a nice overview:
https://en.wikipedia.org/wiki/Fibonacci_heap#Summary_of_running_times
Note that a binary heap has simpler code and is implemented over an array, and traversal of an array is predictable, so a modern CPU executes binary heap operations much faster.
However, given a big enough dataset, other heaps will win over the binary heap because of their complexities...
This question seems to have a big enough dataset.

Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because it is guaranteed that these distances are greater than the current known minimum distance.
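The guarantee comes from the triangle inequality: for any third point R, dist(P,Q) >= |dist(P,R) - dist(R,Q)|. A small sketch of that bound for the example above, assuming the distances form a metric (the helper names are made up for illustration):

```python
# Distances that have already been evaluated (symmetric).
KNOWN = {("A", "B"): 0.5, ("A", "C"): 5.0, ("C", "D"): 2.0}
POINTS = ["A", "B", "C", "D"]


def known_dist(p, q):
    """Return the stored distance, or None if the pair has not been evaluated."""
    return KNOWN.get((p, q), KNOWN.get((q, p)))


def lower_bound(p, q):
    """Best bound |d(p,r) - d(r,q)| over pivots r with both distances known."""
    best = 0.0
    for r in POINTS:
        if r in (p, q):
            continue
        d1, d2 = known_dist(p, r), known_dist(r, q)
        if d1 is not None and d2 is not None:
            best = max(best, abs(d1 - d2))
    return best
```

Here dist(B,C) is at least 4.5 and dist(A,D) is at least 3, both above the current minimum of 0.5, so neither needs to be evaluated. For the pair (B,D) there is no pivot with both distances known, so the bound is 0 and that pair cannot be skipped yet.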
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximated solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a great approximation for small n.
Processing time (measured in big O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because this HashMap contains the highest density of points, thus it is a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule out it having something interesting to say, meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V^2), by using planar distances to deduce limitations on sets of edges, allowing us to avoid needing to look at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set in two parts: points that are closer to the first point and points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest neighbour search algorithms can be used.
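A minimal sketch of this pivot-based splitting, used here for an approximate closest pair: split recursively until the parts are small, brute-force each part, and keep the best pair. It only needs a distance function, not coordinates. The name `leaf_size` and the random choice of pivots are assumptions; the result is approximate because the true closest pair can end up on opposite sides of a split.

```python
import math
import random
from itertools import combinations


def brute_force(points, dist):
    """Exact closest pair by checking every pair; returns (distance, p, q)."""
    pairs = ((dist(p, q), p, q) for p, q in combinations(points, 2))
    return min(pairs, key=lambda t: t[0], default=(math.inf, None, None))


def approx_closest_pair(points, dist, leaf_size=16, rng=None):
    """Approximate closest pair; leaf_size must be at least 1."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return brute_force(points, dist)
    a, b = rng.sample(points, 2)  # the two pivots that define the split
    near_a, near_b = [], []
    for p in points:
        (near_a if dist(p, a) <= dist(p, b) else near_b).append(p)
    if not near_a or not near_b:
        # Degenerate split (e.g. duplicate pivots): fall back to the exact search.
        return brute_force(points, dist)
    left = approx_closest_pair(near_a, dist, leaf_size, rng)
    right = approx_closest_pair(near_b, dist, leaf_size, rng)
    return min(left, right, key=lambda t: t[0])
```

A larger `leaf_size` means more distance evaluations but a smaller chance of missing the true pair; with `leaf_size >= len(points)` it degenerates to the exact O(n²) search.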

Algorithm to optimize parameters based on imprecise fitness function

I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walk and markov chains as that is my best guesses about where to look, but so far have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for falls in a broad research area called metaheuristics (which includes many non-linear optimization algorithms such as genetic algorithms, simulated annealing or tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
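The three components above can be sketched in a few lines. This is only an illustration under assumptions of my own: the class and method names are invented, every operation may follow every other one (including itself), and a vote is worth `amount` on each edge of the voted sequence.

```python
import random


class OperationChain:
    """Weighted transition graph over operations, refined by human votes."""

    def __init__(self, operations, seed=None):
        self.ops = list(operations)
        self.rng = random.Random(seed)
        # Generator: complete directed graph, every edge starts with weight 1.
        self.weights = {(a, b): 1.0 for a in self.ops for b in self.ops}

    def generate(self, length):
        """Traverse the graph, picking each next operation in proportion to edge weight."""
        seq = [self.rng.choice(self.ops)]
        while len(seq) < length:
            current = seq[-1]
            edge_weights = [self.weights[(current, nxt)] for nxt in self.ops]
            seq.append(self.rng.choices(self.ops, weights=edge_weights, k=1)[0])
        return seq

    def vote(self, seq, amount=1.0):
        """Reward every edge used by a 'pleasing' sequence, then renormalise."""
        for edge in zip(seq, seq[1:]):
            self.weights[edge] += amount
        top = max(self.weights.values())
        for edge in self.weights:
            self.weights[edge] /= top  # the highest voted edge ends up at 1
```

Usage would be a loop of `generate`, render the image, collect votes, `vote`. Because no weight ever drops to zero, rarely chosen transitions still get explored occasionally.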
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. Although very efficient, you can easily melt your brain trying to follow and debug the convergence steps of a simplex crawler!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more aggressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in fewer tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.

Determining the best k for a k nearest neighbour

I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity for finding a good k? (Probably reported in number of times to run the k-means algorithm.)
Is k-means a good starting point for this type of problem? If not, what other approaches would you recommend? A specific example, backed by an anecdote/experience, would be maxi-bon.
What shortcuts/approximations would you recommend to increase the performance?
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
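To make the "snapshot for any number of clusters" idea concrete, here is a minimal sketch of the centroid-merging procedure described above. It is a naive pure-Python illustration, not a ready-made library implementation: the function name `agglomerative_snapshots` and the choice of Euclidean distance between centroids are my own assumptions, and the repeated closest-pair scan is far too slow for large data sets.

```python
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative_snapshots(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns a dict mapping each cluster count k to the partition (a list of
    lists of point indices) that existed when k clusters remained.
    (Hypothetical helper name; not taken from any library.)
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    snapshots = {len(clusters): [c[:] for c in clusters]}
    dims = len(points[0])

    def centroid(cluster):
        return tuple(
            sum(points[i][d] for i in cluster) / len(cluster) for d in range(dims)
        )

    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        # Naive scan over all pairs for the two closest centroids.
        _, a, b = min(
            (dist(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        # Record the partition for this number of clusters.
        snapshots[len(clusters)] = [c[:] for c in clusters]
    return snapshots
```

After one run, `snapshots[k]` gives the grouping for any k between 1 and the number of points, so you can flip through candidate cluster counts in a visualisation without re-running anything.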
To use k-means, you need to know how many clusters there are. You can't do a naive meta-optimisation over k, because the more clusters you add (up to one cluster per data point), the more you over-fit. You could pick a cluster validation method and optimise the hyperparameter k against it, but in my experience that rarely works well, and it is very costly too.
If I were you, I would do a PCA, possibly in a polynomial feature space (mind your available time) depending on what you know about your input, and cluster along the most representative components.
More information about your data set would help in giving a more precise answer.
Here's my approximate solution:
1. Start with k=2.
2. For a number of tries:
   a. Run the k-means algorithm to find k clusters.
   b. Find the mean square distance from the origin to the cluster centroids.
3. From the repeated runs in step 2, find the standard deviation of those distances. This is a proxy for the stability of the clusters.
4. If stability of clusters for k < stability of clusters for k - 1, then return k - 1.
5. Increment k by 1 and go back to step 2.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
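A rough sketch of how the loop above could look in code, using only the standard library. Several things here are my own assumptions rather than part of the description: I read "stability" as the inverse of the standard deviation (so a larger spread at k than at k - 1 means "less stable, stop"), I cap the search at a hypothetical `max_k`, and the tiny `kmeans` helper is a plain Lloyd's iteration with random initial centroids.

```python
import random
from math import dist
from statistics import pstdev


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm with random initial centroids; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid; an empty group keeps its old centroid.
        new = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids


def spread(points, k, tries, rng):
    """Std. dev., over `tries` runs, of the mean squared centroid distance from the origin."""
    origin = (0.0,) * len(points[0])
    scores = []
    for _ in range(tries):
        cents = kmeans(points, k, rng)
        scores.append(sum(dist(origin, c) ** 2 for c in cents) / k)
    return pstdev(scores)


def pick_k(points, tries=10, max_k=10, seed=0):
    """Increase k until the spread grows (assumed to mean 'less stable')."""
    rng = random.Random(seed)
    prev = spread(points, 2, tries, rng)
    for k in range(3, max_k + 1):
        cur = spread(points, k, tries, rng)
        if cur > prev:  # less stable than k - 1, so keep k - 1
            return k - 1
        prev = cur
    return max_k  # assumption: give up at the cap
```

Because k-means is randomly initialised, the returned k can change with `seed` and `tries`, so treat the result as a suggestion to check visually rather than a definitive answer.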
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the K-Means algorithm called X-Means, which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might want to refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure, which evaluates how close a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.
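For illustration, here is a small sketch of the silhouette coefficient in its usual textbook form, which compares average distances to cluster members rather than distances to centroids. For each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the score is (b - a) / max(a, b). The function name `mean_silhouette` is just a placeholder of mine.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of p's own cluster
        # (the sum includes p itself at distance 0, hence len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against coincident points
            total += (b - a) / denom
    return total / len(points)
```

Scores near 1 mean points sit well inside their own cluster, scores near 0 mean they sit between clusters, and negative scores suggest wrong assignments. To choose k, you would compute this for the labels produced at each candidate k and prefer the k with the highest average.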
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your wikipedia link:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in general Euclidean space d even for 2 clusters
NP-hard for a general number of clusters k even in the plane
If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered
Thus, a variety of heuristic algorithms are generally used.
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point: it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that subset to find your k-means.
Choosing the best K can be seen as a Model Selection problem. One possible approach is Minimum Description Length, which in this context means: at one extreme, you could store a table with all the points (in which case K=N). At the other extreme, you have K=1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning, Raghavan and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
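As I recall it from that section of the book (please check it against the text before relying on it), the AIC criterion for K-means takes roughly the following form, where RSS_min(K) is the smallest residual sum of squares found over clusterings with K clusters and M is the dimensionality of the data:

```latex
K^{*} = \arg\min_{K} \left[ \mathrm{RSS}_{\min}(K) + 2MK \right]
```

The first term rewards a tighter fit and always shrinks as K grows, while the second term penalises the number of parameters (K centroids with M coordinates each), so the minimum lands on a K where adding another cluster no longer pays for itself.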
This problem belongs to the "internal evaluation" class of "clustering optimisation problems", for which the current state-of-the-art solution seems to be the Silhouette coefficient, as stated here:
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample usage implementation of the methodology here
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Resources