URL path similarity/string similarity algorithm

My problem is that I need to compare URL paths and deduce if they are similar. Below I provide example data to process:
# GROUP 1
/robots.txt
# GROUP 2
/bot.html
# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php
# GROUP 4
//phpMyAdmin/
I tried Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think 90% and above is a must.
I think I need some sort of classifier, but the problem is that each new batch of data can contain paths that belong to a new, previously unknown class.
Could you please point me in the right direction?
Thanks

Levenshtein distance is the best option, but it needs tuning. Use a weighted edit distance and possibly split the path into tokens (words and numbers). For example, versions such as "2.5.6-rc2" and "2.5.6" can be treated as a difference of weight 0, while name tokens such as phpMyAdmin and javaMyAdmin give a difference of weight 1.
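A minimal sketch of that idea in Python, under some assumptions of mine: the tokenizer, the list of version suffixes (rc, pl, alpha, beta) and the 0/1 weights are illustrative choices, not something the answer specifies. Version-like tokens are free to insert, delete or swap; any other differing token costs 1.

```python
import re

# Assumption: these suffix words count as part of a version string.
VERSION_WORDS = {"rc", "pl", "alpha", "beta"}


def tokenize(path):
    """Split a path into letter runs, digit runs and single other characters."""
    return re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]", path)


def is_version_token(token):
    return token.isdigit() or token in {".", "-"} or token.lower() in VERSION_WORDS


def indel_cost(token):
    # Inserting or deleting a version-like token is free.
    return 0.0 if is_version_token(token) else 1.0


def substitution_cost(a, b):
    if a == b or (is_version_token(a) and is_version_token(b)):
        return 0.0
    return 1.0


def weighted_distance(path_a, path_b):
    """Token-level edit distance with the weights defined above."""
    a, b = tokenize(path_a), tokenize(path_b)
    prev = [0.0]
    for token in b:
        prev.append(prev[-1] + indel_cost(token))
    for ta in a:
        cur = [prev[0] + indel_cost(ta)]
        for j, tb in enumerate(b, start=1):
            cur.append(min(prev[j] + indel_cost(ta),       # delete ta
                           cur[j - 1] + indel_cost(tb),    # insert tb
                           prev[j - 1] + substitution_cost(ta, tb)))
        prev = cur
    return prev[-1]
```

With these weights the GROUP 3 paths from the question end up at distance 0 from each other, while swapping phpMyAdmin for javaMyAdmin costs 1. The weights would need tuning on real data.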

While checking #jakub.gieryluk's suggestion I accidentally found a solution that satisfies me: the "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."
Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.
The next thing is that I have not yet checked the active learning ability of this algorithm ;)

I know it's not the exact answer to your question, but are you familiar with the k-means algorithm?
I guess even Levenshtein distance can work here; the difficulty, however, is how to compute centroids with that approach.
Perhaps you can divide the input set into disjoint subsets, then for each URL in each subset compute the distance to all the other URLs in the same subset; the URL with the lowest sum of distances should be the centroid (of course, it depends on how big the input set is; for huge sets this might not be a good idea).
The good thing about k-means is that you can start with an absolutely random division and then iteratively make it better.
The bad thing about k-means is that you have to specify k before you start. However, during the run (perhaps once the situation has stabilized after the first couple of iterations), you can measure the intra-similarity of each set, and if it is low, you can divide the set into two subsets and go on with the same algorithm.
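The "URL with the lowest sum of distances" is usually called a medoid. A small sketch of how it could be computed with plain Levenshtein distance; the function names are mine, and the quadratic cost per subset is exactly the caveat mentioned above.

```python
def levenshtein(a, b):
    """Classic character-level edit distance (insert, delete, substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def medoid(urls, distance=levenshtein):
    """Return the URL whose summed distance to all URLs in the subset is lowest."""
    return min(urls, key=lambda u: sum(distance(u, v) for v in urls))
```

Any other distance function can be passed as `distance`, for instance a weighted token distance instead of the character-level one.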

Related

Algorithm for highest value inside budget

I wasn't entirely sure the best way to ask this question (or do the research to see if it has been previously answered).
Given a data set where each entry has a Point value and a Dollar value, I'm looking to generate a list of length N entries that yields the highest aggregate Point value whilst staying within budget B.
Example data set:
Item Points Dollars
Apple 3.0 $1.00
Pear 2.5 $0.75
Peach 2.8 $0.88
And with this (small) data set, say my budget (B) is $2.25, and list length (N) must be 2. You MUST use the fixed list length, but are not required to use ALL of the budget.
Obviously the example provided is easy to do in one's head, but given a much larger data set, and both higher N and B values, I'm looking for an algorithm that can generate the list. Having a hard time wrapping my head around this one.
Just looking for a pseudo-algorithm, but if you prefer any given language feel free to respond with that!
I am quite positive that this can be reduced to an NP-complete problem, so it is not really worth trying to develop a process that will always give you the 'correct' answer: many people have tried and failed to do this efficiently over large data sets. However, you can use a much more efficient approximation technique. It will not guarantee the correct answer, but many popular approximation algorithms achieve a high degree of accuracy.
Hope this helps you out :)
This problem is NP-complete (in NP and NP-hard), meaning that so far no algorithm has been found that solves it in polynomial time (polynomial in the input size). If you found one, you would have solved one of the greatest open problems in computer science (P = NP), which would bring you at least a million-dollar reward.
If you are satisfied with an approximation, I would recommend the Greedy-Algorithm:
https://en.wikipedia.org/wiki/Greedy_algorithm
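For what it's worth, when prices can be expressed as whole cents and the budget is not huge, an exact answer is still practical with the standard knapsack-style dynamic programme (a different technique from the greedy approach above). Its running time grows with the budget value, so it is pseudo-polynomial and does not contradict the NP-hardness remarks. The sketch below assumes each item may be picked at most once and that costs are given in integer cents; the function name is made up for illustration.

```python
def best_fixed_length_selection(items, budget_cents, n):
    """items: list of (name, points, cost_cents); each item is usable at most once.

    Returns (total_points, names) for the best choice of exactly n items whose
    total cost is at most budget_cents, or None if no such choice exists.
    """
    unreachable = float("-inf")
    # best[c][b] = best (points, names) using exactly c items with total cost <= b
    best = [[(unreachable, ())] * (budget_cents + 1) for _ in range(n + 1)]
    best[0] = [(0.0, ())] * (budget_cents + 1)
    for name, points, cost in items:
        # Go through counts in descending order so row c - 1 has not yet
        # been updated with the current item (each item is used at most once).
        for c in range(n, 0, -1):
            for b in range(cost, budget_cents + 1):
                prev_points, prev_names = best[c - 1][b - cost]
                if prev_points + points > best[c][b][0]:
                    best[c][b] = (prev_points + points, prev_names + (name,))
    result = best[n][budget_cents]
    return None if result[0] == unreachable else result


example = [("Apple", 3.0, 100), ("Pear", 2.5, 75), ("Peach", 2.8, 88)]
selection = best_fixed_length_selection(example, budget_cents=225, n=2)
```

On the example data from the question this picks Apple and Peach. Memory use is roughly N times B table cells, so for very large budgets an approximation is still the way to go.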

Algorithm to find smallest number of points to cover area (war game)

I'm dealing with a war game. I have a list of my bases B(x,y) from which I can send attacks on the enemy (they have bases between my own bases). Each base B can attack at a range R (the same radius for all bases). How can I choose my bases so that I can attack as many enemy bases as possible while using a minimum number of my bases?
I've reduced the problem to finding the minimum number of bases (and their coordinates) required to cover the largest area possible. I wonder if there is a better way than looking at all the possible combinations, because the number of bases could reach thousands.
Example: If the attack radius is 10 and I have five bases in a square and its center: (0,0), (10,0), (10,10), (0,10), (5,5) then the answer is that only the first four would be needed because all the area covered by the one in the center is already covered by the others.
Note 1 The solution must be single-threaded.
Note 2 The solution doesn't have to be perfect if that means a big gain in speed. The number of bases reaches thousands and this needs to use as little time as possible. I would consider running time greater than 100 ms for 10,000 bases in Python on a modern computer unacceptable, so I was thinking maybe I could start by eliminating the obvious, like if there are multiple bases within R/10 distance of each other, simply eliminate all except for one (whichever).
If I understand you correctly, the enemy bases and your bases are given as well as the (constant) attack radius. I.e. if you select one of your bases, you know exactly which of the enemy bases get attacked due to the selection.
The first step would be to eliminate from the problem those enemy bases which cannot be attacked by any of your bases. Then, selecting all of your bases guarantees attacking all attackable enemy bases, so there is a solution that attacks as many enemy bases as possible.
Among all those solutions you are looking for the one that uses the minimum number of your bases. This problem is equivalent to the set cover problem (https://en.wikipedia.org/wiki/Set_cover_problem), which is unfortunately NP-hard. You can apply all known solution methods such as Integer Linear Programming or the already mentioned greedy algorithm / metaheuristics.
If your problem instance is large and runtime is the primary concern, greedy is probably the way to go. For example you could always add that particular base of yours to the selection which adds the highest number of enemy bases that can be attacked which were previously not under attack by your already selected bases.
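A rough sketch of that greedy rule, assuming bases are plain (x, y) tuples and that the enemy positions are known; the function name is made up. Enemy bases nobody can reach are dropped automatically because they never appear in any coverage set.

```python
import math


def greedy_base_selection(my_bases, enemy_bases, radius):
    """Return indices of my bases chosen by the greedy set cover heuristic."""
    # coverage[i] = indices of the enemy bases within range of my base i
    coverage = [
        {j for j, enemy in enumerate(enemy_bases) if math.dist(base, enemy) <= radius}
        for base in my_bases
    ]
    uncovered = set().union(*coverage)  # only the attackable enemy bases
    chosen = []
    while uncovered:
        # Pick the base that attacks the most enemy bases not yet under attack.
        best = max(range(len(my_bases)), key=lambda i: len(coverage[i] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen
```

Building the coverage sets this way compares every pair of bases, so for thousands of bases a spatial grid or similar index would probably be needed to get anywhere near the stated time budget. The result is not guaranteed to be minimal.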
Hmm, the solution depends on your needs. If you need a real-time answer, a greedy algorithm could provide a good solution.
Another option would be a metaheuristic with a time constraint (http://en.wikipedia.org/wiki/Metaheuristic). I would probably use a genetic algorithm to search for a solution to this problem under a limited time.
If you are interested, I can provide a toy example of an implementation in Python.
EDIT:
When you have to provide a solution quickly, a greedy algorithm is often better. But in your case I have doubts. A particularity of many greedy algorithms is that you need to start from scratch each time you compute a new result.
Coming back to the genetic algorithm: each time you have to make a decision, you could restart the search process from its last result. In fact, you could probably let it run as a subprocess and, every 100 ms, take the best solution computed during the last loop.
If it is not too greedy in computing resources, this approach would provide better results than the greedy one in the long run, because the solution will probably need to adapt to changes in the situation while many elements stay unchanged. Just be aware that initializing a metaheuristic search with the solution of a greedy algorithm is a good idea anyway!

How to mix genetic algorithm with some heuristic

I'm working on a university scheduling problem and using a simple genetic algorithm for it. Actually it works great and improves the objective function value from 0% to approximately 90% within 1 hour. But then the process slows down dramatically and it takes days to get the best solution. I have seen a lot of papers saying that it is reasonable to mix other algorithms with a genetic one. Could you please give me some advice on which algorithm can be mixed with a genetic one and how it can be applied to speed up the solving process? The main question is: how can any heuristic be applied to such a complex-structured problem? I have no idea how, for instance, greedy heuristics could be applied there.
Thanks to everyone in advance! Really appreciate your help!
Problem description:
I have:
array filled by ScheduleSlot objects
array filled by Lesson objects
I do:
Standard two-point crossover
Mutation (Move random lesson to random position)
Rough selection (select only n best individuals to next population)
Additional information for #Dougal and #izomorphius:
I'm trying to construct a university schedule which will have no breaks between lessons, no overlaps, and no geographically distributed lessons for groups and professors.
The fitness function is really simple: fitness = -1000*numberOfOverlaps - 1000*numberOfDistrebutedLessons - 20*numberOfBreaks. (or something like that; we can simply change the coefficients in front of the variables)
At the very beginning I generate my individuals by just placing lessons in a random room, time and day.
Mutation and crossover, as described above, are really trivial:
Crossover - take two parent schedules, randomly choose the point and the range of crossover, and just exchange the parts of the parent schedules, generating two child schedules.
Mutation - take a child schedule and move n random lessons to random position.
My initial observation: you have chosen the coefficients in front of numberOfOverlaps, numberOfDistrebutedLessons and numberOfBreaks somewhat randomly. My experience shows that such choices are usually not the best ones, and you had better let the computer choose them. I propose writing a second algorithm to choose them; it could be a neural network, a second genetic algorithm or hill climbing. The idea is to compute how good a result you get after a certain amount of time and to optimize the choice of these 3 values.
Another idea: after getting the result, you may try to brute-force optimize it. What I mean is the following: for the initial problem, the "silly" solution would be a backtracking search that checks all the possibilities, and this is usually done using DFS. This would be very slow, but you may try using depth-first search with iterative deepening or simply a depth-restricted DFS.
For many problems, I find that a Lamarckian-style of GA works well, combining a local search into the GA algorithm.
For your case, I would try to introduce a partial systematic search as the local search. There are two obvious ways to do this, and you should probably try both.
1) Alternate GA iterations with local search iterations. For your local search you could, for example, brute force all the lessons assigned in a single day while leaving everything else unchanged. Another possibility is to move a randomly selected lesson to all free slots to find the best choice for that. The key is to minimise the cost of the brute-search while still having the chance to find local improvements.
2) Add a new operator alongside mutation and crossover that performs your local search. (You might find that the mutation operator is less useful in the hybrid scheme, so just replacing that could be viable.)
In essence, you will be combining the global exploration of the GA with an efficient local search. Several GA frameworks include features to assist in this combination. For example, GAUL implements the alternate scheme 1 above, with either the full population or just the new offspring at each iteration.
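To make the "move a randomly selected lesson to all free slots" idea concrete, here is a toy sketch. The schedule representation (a dict mapping lesson name to an integer slot) and the break-counting fitness are simplifications invented for the example; the real ScheduleSlot and Lesson objects would replace them.

```python
import random


def toy_fitness(schedule):
    """Toy stand-in for the real fitness: -20 per empty slot between lessons."""
    slots = sorted(schedule.values())
    breaks = sum(later - earlier - 1 for earlier, later in zip(slots, slots[1:]))
    return -20 * breaks


def best_slot_move(schedule, lesson, all_slots, fitness):
    """Try the lesson in every free slot and keep the best schedule found."""
    occupied = set(schedule.values())
    best, best_score = schedule, fitness(schedule)
    for slot in all_slots:
        if slot in occupied:
            continue
        candidate = dict(schedule)
        candidate[lesson] = slot
        score = fitness(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def local_search_operator(schedule, all_slots, fitness, rng=random):
    """GA operator: pick a random lesson and move it to its best free slot."""
    lesson = rng.choice(sorted(schedule))
    return best_slot_move(schedule, lesson, all_slots, fitness)
```

Because the operator never returns a worse schedule than it was given, it can be applied to offspring after crossover (the Lamarckian style) or used in place of the random mutation.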

Graph Simplification Algorithm Advice Needed

I have a need to take a 2D graph of n points and reduce it the r points (where r is a specific number less than n). For example, I may have two datasets with slightly different number of total points, say 1021 and 1001 and I'd like to force both datasets to have 1000 points. I am aware of a couple of simplification algorithms: Lang Simplification and Douglas-Peucker. I have used Lang in a previous project with slightly different requirements.
The specific properties of the algorithm I am looking for is:
1) must preserve the shape of the line
2) must allow me reduce dataset to a specific number of points
3) is relatively fast
This post is a discussion of the merits of the different algorithms. I will post a second message for advice on implementations in Java or Groovy (why reinvent the wheel).
I am concerned about requirement 2 above. I am not an expert enough in these algorithms to know whether I can dictate the exact number of output points. The implementation of Lang that I've used took lookAhead, tolerance and the array of Points as input, so I don't see how to dictate the number of points in the output. This is a critical requirement of my current needs. Perhaps this is due to the specific implementation of Lang we had used, but I have not seen a lot of information on Lang on the web. Alternatively we could use Douglas-Peucker but again I am not sure if the number of points in the output can be specified.
I should add I am not an expert on these types of algorithms or any kind of math wiz, so I am looking for mere mortal type advice :) How do I satisfy requirements 1 and 2 above? I would sacrifice performance for the right solution.
I think you can adapt Douglas-Peucker quite straightforwardly. Adapt the recursive algorithm so that rather than producing a list it produces a tree mirroring the structure of the recursive calls. The root of the tree will be the single-line approximation P0-Pn; the next level will represent the two-line approximation P0-Pm-Pn where Pm is the point between P0 and Pn which is furthest from P0-Pn; the next level (if full) will represent a four-line approximation, etc. You can then trim the tree either on the basis of depth or on the basis of distance of the inserted point from the parent line.
Edit: in fact, if you take the latter approach you don't need to build a tree. Instead you populate a priority queue where the priority is given by the distance of the inserted point from the parent line. Then when you've finished the queue tells you which points to remove (or keep, according to the order of the priorities).
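One possible reading of the priority-queue variant, sketched in Python rather than Java or Groovy: keep the two endpoints, then repeatedly take the segment whose farthest point has the largest distance and add that point, until the requested count is reached. The function names are mine, and I measure distance to the segment rather than to the infinite line so that degenerate segments are handled.

```python
import heapq
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_to_count(points, target):
    """Douglas-Peucker style simplification that keeps exactly `target` points."""
    n = len(points)
    if target >= n:
        return list(points)
    if target < 2:
        raise ValueError("target must be at least 2")
    keep = {0, n - 1}
    heap = []  # entries: (-distance, index of farthest point, segment start, segment end)

    def push_segment(i, j):
        if j - i > 1:
            k = max(range(i + 1, j),
                    key=lambda m: point_segment_distance(points[m], points[i], points[j]))
            d = point_segment_distance(points[k], points[i], points[j])
            heapq.heappush(heap, (-d, k, i, j))

    push_segment(0, n - 1)
    while len(keep) < target:
        _, k, i, j = heapq.heappop(heap)
        keep.add(k)
        push_segment(i, k)
        push_segment(k, j)
    return [points[i] for i in sorted(keep)]
```

For example, `simplify_to_count(points, 1000)` would bring both the 1021-point and the 1001-point dataset from the question down to exactly 1000 points.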
You can find my C++ implementation and article on Douglas-Peucker simplification here and here. I also provide a modified version of the Douglas-Peucker simplification that allows you to specify the number of points of the resulting simplified line. It uses a priority queue as mentioned by 'Peter Taylor'. It's a lot slower though, so I don't know if it would satisfy the 'is relatively fast' requirement.
I'm planning on providing an implementation for Lang simplification (and several others). Currently I don't see any easy way to adjust Lang to reduce to a fixed point count. If you could live with a less strict requirement, 'must allow me to reduce the dataset to an approximate number of points', then you could use an iterative approach. Guess an initial value for the lookahead: point count / desired point count. Then slowly increase the lookahead until you approximately hit the desired point count.
I hope this helps.
p.s.: I just remembered something, you could also try the Visvalingam-Whyatt algorithm. In short:
-compute the triangle area for each point with its direct neighbors
-sort these areas
-remove the point with the smallest area
-update the area of its neighbors
-re-sort
-continue until the desired number of points remain
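A deliberately simple sketch of those steps in Python. It recomputes all triangle areas on every pass instead of maintaining a sorted structure, so it is easy to follow but slow for large inputs; a heap with neighbour updates would follow the steps above more literally. The endpoints are always kept, which is my assumption.

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by three 2D points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def visvalingam_whyatt(points, target):
    """Remove the point with the smallest triangle area until `target` points remain."""
    if target < 2:
        raise ValueError("target must be at least 2")
    pts = list(points)
    while len(pts) > target:
        # Area of the triangle each interior point forms with its direct neighbours.
        areas = [triangle_area(pts[i - 1], pts[i], pts[i + 1])
                 for i in range(1, len(pts) - 1)]
        smallest = min(range(len(areas)), key=areas.__getitem__) + 1
        del pts[smallest]
    return pts
```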

Determining the best k for a k nearest neighbour

I have need to do some cluster analysis on a set of 2 dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity for finding a good k. (probably reported in number of times to run the k-means algorithm).
is k-means a good starting point for this type of problem? If so, what other approaches would you recommend. A specific example, backed by an anecdote/experience would be maxi-bon.
what short cuts/approximations would you recommend to increase the performance.
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
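To illustrate the "snapshot for any number of clusters" property, here is a naive pure-Python sketch that merges the two clusters with the closest centroids and records the grouping after every merge. It is cubic in the number of points and only meant to show the idea; a ready-made library implementation would be the practical choice. The function names are mine.

```python
import math


def centroid(cluster):
    """Mean point of a cluster of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def agglomerate(points):
    """Return {k: list of clusters} for every k from len(points) down to 1."""
    clusters = [[p] for p in points]
    snapshots = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        centroids = [centroid(c) for c in clusters]
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        # Merge the two clusters whose centroids are closest.
        i, j = min(pairs, key=lambda ab: math.dist(centroids[ab[0]], centroids[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        snapshots[len(clusters)] = [list(c) for c in clusters]
    return snapshots
```

After a single run, `snapshots[k]` gives the grouping for any k, so each candidate can be drawn and checked visually without re-running anything.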
In order to use k-means, you need to know how many clusters there are. You can't try a naive meta-optimisation, since the more clusters you add (up to one cluster per data point), the more you over-fit. You may look for cluster validation methods and optimize the k hyperparameter with them, but in my experience this rarely works well. It's very costly too.
If I were you, I would do a PCA, possibly in polynomial space (mind your available time) depending on what you know about your input, and cluster along the most representative components.
More information about your data set would be very helpful for a more precise answer.
Here's my approximate solution:
1) Start with k=2.
2) For a number of tries:
   - Run the k-means algorithm to find k clusters.
   - Find the mean square distance from the origin to the cluster centroids.
3) From the repeated runs in step 2, find the standard deviation of the distances. This is a proxy for the stability of the clusters.
4) If stability of clusters for k < stability of clusters for k - 1 then return k - 1
5) Increment k by 1 and go back to step 2.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the K-Means algorithm called X-Means, which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might want to refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure that evaluates how closely a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.
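For reference, a small pure-Python sketch of the usual silhouette definition: for each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). Treating singleton clusters as scoring 0 is a common convention that I assume here. Running it for each candidate k and preferring the highest mean score is one way to apply the idea described above.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; values near 1 mean well-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes 0
        # The point's distance to itself is 0, so dividing by len(own) - 1
        # gives the mean distance to the *other* members.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```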
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your Wikipedia link:
"Regarding computational complexity, the k-means clustering problem is:
- NP-hard in general Euclidean space d even for 2 clusters
- NP-hard for a general number of clusters k even in the plane
- If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered
Thus, a variety of heuristic algorithms are generally used."
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point, it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that subset to find your k-means.
Choosing the best K can be seen as a model selection problem. One possible approach is Minimum Description Length, which in this context means the following: at one extreme, you could store a table with all the points (in which case K=N). At the other extreme, you have K=1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning, Raghavan and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
This problem belongs to the "internal evaluation" class of clustering optimisation problems, for which the current state-of-the-art solution seems to be the Silhouette coefficient, as stated here
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample usage implementation of the methodology here
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Resources