I wasn't entirely sure the best way to ask this question (or do the research to see if it has been previously answered).
Given a data set where each entry has a Point value and a Dollar value, I'm looking to generate a list of length N entries that yields the highest aggregate Point value whilst staying within budget B.
Example data set:
Item Points Dollars
Apple 3.0 $1.00
Pear 2.5 $0.75
Peach 2.8 $0.88
And with this (small) data set, say my budget (B) is $2.25, and list length (N) must be 2. You MUST use the fixed list length, but are not required to use ALL of the budget.
Obviously the example provided is easy to do in one's head, but given a much larger data set, and both higher N and B values, I'm looking for an algorithm that can generate the list. Having a hard time wrapping my head around this one.
Just looking for a pseudo-algorithm, but if you prefer any given language feel free to respond with that!
I am quite positive that this can be reduced to an NP-complete problem and hence it's not really worth trying to develop a process that will always give you the 'correct' answer as many people have tried and failed to do this efficiently over a large data set. However, you can use a much more efficient approximation technique that whilst it will not guarantee to give you the correct answer, many popular approximation algorithms are capable of achieving a high degree of accuracy.
Hope this helps you out :)
This problem is NP-Complete (NP and NP-Hard), meaning, that until now there is no algorithm found, that solves this problem in a polynomial amount time (polynomial to the input size) and if you find an algorithm that does, you would have solved one of the greatest problems in computer science (P=NP), which would you at least bring a million dollar reward.
If you are satisfied with an approximation, I would recommend the Greedy-Algorithm:
https://en.wikipedia.org/wiki/Greedy_algorithm
Related
My problem is that I need to compare URL paths and deduce if they are similar. Below I provide example data to process:
# GROUP 1
/robots.txt
# GROUP 2
/bot.html
# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php
# GROUP 4
//phpMyAdmin/
I tried Levenshtein distance to compare, but for me is not enough accurate. I do not need 100% accurate algorithm, but I think 90% and above is a must.
I think that I need some sort of classifier, but the problem is that each portion of new data can containt path that should be classified to the new unknown class.
Could you please direct me to the right thoutht?
Thanks
Levenshtein distance is best option, but tuned distance. You have to use weighted Edit distance and possibly split path on tokens - words and numbers. So for example version like "2.5.6-rc2 and 2.5.6" can be treated as 0 weight difference, but name token like phpMyAdmin and javaMyAdmin give 1 weight difference.
When checking #jakub.gieryluk suggestion I accidentally have found solution that satisfy me - "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."
Tests of PERL library implemented by Bruno Vecchi gave me really good results. The only problem is that I need Python implementation, but I belive that I can either find one on the Internet or reimplement code by myself.
Next thing is that I have not checked active learning ability of this algorithm yet ;)
I know it's not the exact answer to your question, but are you familiar with k-means algorithm?
I guess even the Levenshtein can work here, the difficulty however is how to compute centroids with that approach.
Perhaps you can divide input set into disjoint subsets, then for each URL in each subset compute the distance to all the other URLs in the same subset, and the URL that has lowest sum of distances, should be the centroid (of course, it depends on how big is the input set; for huge sets it might be not a good idea to do so).
The good thing about k-means is that you can start with absolutely random division, and then iteratively make it better.
The bad thing about k-means is that you have to precise k before start. However, during the run (perhaps where the situation stabilized after first couple of iterations), you can measure intra-similarity of each set, and if it is low, you can divide the set into two subsets and go on with the same algorithm.
I have 'n' number of amounts (non-negative integers). My requirement is to determine an optimal set of amounts so that the sum of the combination is less than or equal to a given fixed limit and the total is as large as possible. There is no limit to the number of amounts that can be included in the optimal set.
for sake of example: amounts are 143,2054,546,3564,1402 and the given limit is 5000.
As per my understanding the knapsack problem has 2 attributes for each item (weight and value). But the problem stated above has only one attribute (amount). I hope that would make things simpler? :)
Can someone please help me with the algorithm or source code for solving this?
this is still an NP-hard problem, but if you want to (or have to) to do something like that, maybe this topic helps you out a bit:
find two or more numbers from a list of numbers that add up towards a given amount
where i solved it like this and NikiC modified it to be faster. only difference: that one was about getting the exact amount, not "as close as possible", but that would be only some small changes in code (and you'll have to translate it into the language you're using).
take a look at the comments in my code to understand what i'm trying to do, wich is, in short form:
calculating all possible combinations of the given parts and sum them up
if the result is the amount i'm looking for, save the solution to an array
at least, sort all possible solutions to get the one using the least parts
so you'll have to change:
save a solution if it's lower than the amount you're looking for
sort solutions by total amount instead of number of used parts
The book "Knapsack Problems" By Hans Kellerer, Ulrich Pferschy and David Pisinger calls this The Subset Sum Problem and dedicates an entire chapter (Ch 4) to it. The chapter is very comprehensive and covers algorithms as well as computational results.
Even though this problem is a special case of the knapsack problem, it is still NP-hard.
I do not know a whole lot about math, so I don't know how to begin to google what I am looking for, so I rely on the intelligence of experts to help me understand what I am after...
I am trying to find the smallest string of equations for a particular large number. For example given the number
"39402006196394479212279040100143613805079739270465446667948293404245721771497210611414266254884915640806627990306816"
The smallest equation is 64^64 (that I know of) . It contains only 5 bytes.
Basically the program would reverse the math, instead of taking an expression and finding an answer, it takes an answer and finds the most simplistic expression. Simplistic is this case means smallest string, not really simple math.
Has this already been created? If so where can I find it? I am looking to take extremely HUGE numbers (10^10000000) and break them down to hopefully expressions that will be like 100 characters in length. Is this even possible? are modern CPUs/GPUs not capable of doing such big calculations?
Edit:
Ok. So finding the smallest equation takes WAY too much time, judging on answers. Is there anyway to bruteforce this and get the smallest found thus far?
For example given a number super super large. Sometimes taking the sqaureroot of number will result in an expression smaller than the number itself.
As far as what expressions it would start off it, well it would naturally try expressions that would the expression the smallest. I am sure there is tons of math things I dont know, but one of the ways to make a number a lot smaller is powers.
Just to throw another keyword in your Google hopper, see Kolmogorov Complexity. The Kolmogorov complexity of a string is the size of the smallest Turing machine that outputs the string, given an empty input. This is one way to formalize what you seem to be after. However, calculating the Kolmogorov complexity of a given string is known to be an undecidable problem :)
Hope this helps,
TJ
There's a good program to do that here:
http://mrob.com/pub/ries/index.html
I asked the question "what's the point of doing this", as I don't know if you're looking at this question from a mathemetics point of view, or a large number factoring point of view.
As other answers have considered the factoring point of view, I'll look at the maths angle. In particular, the problem you are describing is a compressibility problem. This is where you have a number, and want to describe it in the smallest algorithm. Highly random numbers have very poor compressibility, as to describe them you either have to write out all of the digits, or describe a deterministic algorithm which is only slightly smaller than the number itself.
There is currently no general mathemetical theorem which can determine if a representation of a number is the smallest possible for that number (although a lower bound can be discovered by understanding shannon's information theory). (I said general theorem, as special cases do exist).
As you said you don't know a whole lot of math, this is perhaps not a useful answer for you...
You're doing a form of lossless compression, and lossless compression doesn't work on random data. Suppose, to the contrary, that you had a way of compressing N-bit numbers into N-1-bit numbers. In that case, you'd have 2^N values to compress into 2^N-1 designations, which is an average of 2 values per designation, so your average designation couldn't be uncompressed. Lossless compression works well on relatively structured data, where data we're likely to get is compressed small, and data we aren't going to get actually grows some.
It's a little more complicated than that, since you're compressing partly by allowing more information per character. (There are a greater number of N-character sequences involving digits and operators than digits alone.) Still, you're not going to get lossless compression that, on the average, is better than just writing the whole numbers in binary.
It looks like you're basically wanting to do factoring on an arbitrarily large number. That is such a difficult problem that it actually serves as the cornerstone of modern-day cryptography.
This really appears to be a mathematics problem, and not programming or computer science problem. You should ask this on https://math.stackexchange.com/
While your question remains unclear, perhaps integer relation finding is what you are after.
EDIT:
There is some speculation that finding a "short" form is somehow related to the factoring problem. I don't believe that is true unless your definition requires a product as the answer. Consider the following pseudo-algorithm which is just sketch and for which no optimization is attempted.
If "shortest" is a well-defined concept, then in general you get "short" expressions by using small integers to large powers. If N is my integer, then I can find an integer nearby that is 0 mod 4. How close? Within +/- 2. I can find an integer within +/- 4 that is 0 mod 8. And so on. Now that's just the powers of 2. I can perform the same exercise with 3, 5, 7, etc. We can, for example, easily find the nearest integer that is simultaneously the product of powers of 2, 3, 5, 7, 11, 13, and 17, call it N_1. Now compute N-N_1, call it d_1. Maybe d_1 is "short". If so, then N_1 (expressed as power of the prime) + d_1 is the answer. If not, recurse to find a "short" expression for d_1.
We can also pick integers that are maybe farther away than our first choice; even though the difference d_1 is larger, it might have a shorter form.
The existence of an infinite number of primes means that there will always be numbers that cannot be simplified by factoring. What you're asking for is not possible, sorry.
There was a post on here recently which posed the following question:
You have a two-dimensional plane of (X, Y) coordinates. A bunch of random points are chosen. You need to select the largest possible set of chosen points, such that no two points share an X coordinate and no two points share a Y coordinate.
This is all the information that was provided.
There were two possible solutions presented.
One suggested using a maximum flow algorithm, such that each selected point maps to a path linking (source → X → Y → sink). This runs in O(V3) time, where V is the number of vertices selected.
Another (mine) suggested using the Hungarian algorithm. Create an n×n matrix of 1s, then set every chosen (x, y) coordinate to 0. The Hungarian algorithm will give you the lowest cost for this matrix, and the answer is the number of coordinates selected which equal 0. This runs in O(n3) time, where n is the greater of the number of rows or the number of columns.
My reasoning is that, for the vast majority of cases, the Hungarian algorithm is going to be faster; V is equal to n in the case where there's one chosen point for each row or column, and substantially greater for any case where there's more than that: given a 50×50 matrix with half the coordinates chosen, V is 1,250 and n is 50.
The counterargument is that there are some cases, like a 109×109 matrix with only two points selected, where V is 2 and n is 1,000,000,000. For this case, it takes the Hungarian algorithm a ridiculously long time to run, while the maximum flow algorithm is blinding fast.
Here is the question: Given that the problem doesn't provide any information regarding the size of the matrix or the probability that a given point is chosen (so you can't know for sure) how do you decide which algorithm, in general, is a better choice for the problem?
You can't, it's an imponderable.
You can only define which is better "in general" by defining what inputs you will see "in general". So for example you could whip up a probability model of the inputs, so that the expected value of V is a function of n, and choose the one with the best expected runtime under that model. But there may be arbitrary choices made in the construction of your model, so that different models give different answers. One model might choose co-ordinates at random, another model might look at the actual use-case for some program you're thinking of writing, and look at the distribution of inputs it will encounter.
You can alternatively talk about which has the best worst case (across all possible inputs with given constraints), which has the virtue of being easy to define, and the flaw that it's not guaranteed to tell you anything about the performance of your actual program. So for instance HeapSort is faster than QuickSort in the worst case, but slower in the average case. Which is faster? Depends whether you care about average case or worst case. If you don't care which case, you're not allowed to care which "is faster".
This is analogous to trying to answer the question "what is the probability that the next person you see will have an above (mean) average number of legs?".
We might implicitly assume that the next person you meet will be selected at random with uniform distribution from the human population (and hence the answer is "slightly less than one", since the mean is less than the mode average, and the vast majority of people are at the mode).
Or we might assume that your next meeting with another person is randomly selected with uniform distribution from the set of all meetings between two people, in which case the answer is still "slightly less than one", but I reckon not the exact same value as the first - one-and-zero-legged people quite possibly congregate with "their own kind" very slightly more than their frequency within the population would suggest. Or possibly they congregate less, I really don't know, I just don't see why it should be exactly the same once you take into account Veterans' Associations and so on.
Or we might use knowledge about you - if you live with a one-legged person then the answer might be "very slightly above 0".
Which of the three answers is "correct" depends precisely on the context which you are forbidding us from talking about. So we can't talk about which is correct.
Given that you don't know what each pill does, do you take the red pill or the blue pill?
If there really is not enough information to decide, there is not enough information to decide. Any guess is as good as any other.
Maybe, in some cases, it is possible to divine extra information to base the decision on. I haven't studied your example in detail, but it seems like the Hungarian algorithm might have higher memory requirements. This might be a reason to go with the maximum flow algorithm.
You don't. I think you illustrated that clearly enough. I think the proper practical solution is to spawn off both implementations in different threads, and then take the response that comes back first. If you're more clever, you can heuristically route requests to implementations.
Many algorithms require huge amounts of memory beyond the physical maximum of a machine, and in these cases, the algorithmically more ineffecient in time but efficient in space algorithm is chosen.
Given that we have distributed parallel computing, I say you just let both horses run and let the results speak for themselves.
This is a valid question, but there's no "right" answer — they are incomparable, so there's no notion of "better".
If your interest is practical, then you need to analyze the kinds of inputs that are likely to arise in practice, as well as the practical running times (constants included) of the two algorithms.
If your interest is theoretical, where worst-case analysis is often the norm, then, in terms of the input size, the O(V3) algorithm is better: you know that V ≤ n2, but you cannot polynomially bound n in terms of V, as you showed yourself. Of course the theoretical best algorithm is a hybrid algorithm that runs both and stops when whichever one of them finishes first, thus its running time would be O(min(V3,n3)).
Theoretically, they are both the same, because you actually compare how the number of operations grows when the size of the problem is increased to infinity.
The way your problem is defined, it has 2 sizes - n and number of points, so this question has no answer.
This problem came up in the real world, but I've translated it into a more generic "textbook-like" formulation. I suspect it is NP, but I'm particularly interested in knowing if it has a name or is well known since I think I can't be the first one to encounter it. ;-)
Imagine there is a potluck party with N guests. Each guest may bring his/her "signature dish" to the party, or bring nothing. Each guest either likes or hates each of the dishes that the other guests may bring (and this is known in advance since they are all old friends!), but they all like their own dishes.
Is there a deterministic algorithm that does not take exponential time to find the smallest set of dishes that satisfies the constraint that all guests will find at least one dish to their liking? I say "the" smallest, but actually there may be multiple solutions, and I'd like to know them all if possible.
Or, in a more abstract way, imagine a square matrix where all elements are either 0 or 1, and all diagonal elements are 1. What are the smallest sets of rows such that the sum (or the binary OR) of the rows in each set have no zeroes? (The rows would be the dishes, the columns would be the guests, 1 would mean that a guest likes a dish, and diagonal elements are 1 since everyone likes their own dish.)
This could be generalized to non-square matrices, or by removing diagonal=1 rule (although the latter guarantees that there will always be at least one solution). But I don't care about those cases for now...
I already have a program that solves it through an exhaustive search and is fast enough for N around 20, but it takes exponential time. I'm thinking I may need to recur to stochastic algorithms to find good-enough solutions for larger values of N.
Added
Wow, thanks for the quick answer! "Set cover", that's the name I was looking for. :)
This is called the SET COVER problem and it is NP-complete.
The set cover problem, as described in the Wikipedia article which Antti Huima linked to, lacks the feature of each guest liking his own dish. Offhand, I don't know whether this makes any difference.