What does "algorithm problem size" actually mean?

I'm currently in a Data Structures course at my university and did do some algorithm analysis in a prior class, but it was the section I had the most difficult time with in the previous course. We are now going over algorithm analysis in my data structures course and so I'm going back through my textbook from the previous course to see what it says on the matter.
In the textbook, it says "For every algorithm we want to analyze, we need to define the size of the problem." Doing some Google searching, it's not entirely clear what "problem size" actually means. I'm trying to get a more concrete definition of what a problem size is so I can identify it in an algorithm.
I know that, if I have an algorithm that is sorting a list of numbers, the problem size is n, the size of the list. With that said, saying that doesn't clarify what "problem size" actually is, except for in that context. An algorithm is not just a process to sort numbers, so I can't always say that the problem size is the number of elements in a list.
Hoping someone out there can clarify things for me, and that you all are doing well.
Thank you

The answer is right there in the part you quoted (emphasis mine):
For every algorithm we want to analyze, we need to define the size of the problem
The "problem size" is only defined numerically relative to the algorithm. For an algorithm where the input is an array or a list, the problem size is typically measured by its length; for a graph algorithm, the problem size is typically measured by the number of vertices and the number of edges (so it has two variables); for an algorithm where the input is a single number, the problem size may be measured by the number itself, or by the number of bits required to represent the number in binary, depending on context.
So the meaning of "problem size" is specific to the problem that the algorithm solves. If you want a more universal definition which could apply to all problems, then the problem size can be defined as the number of bits required to represent the input; but this definition is not practical, and is only used in theory to talk about classes of problems (such as those which are solvable in polynomial time).
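As a small illustration of the measures listed above, here is a Python sketch. The sample list, graph, and number are made up for the example and do not come from the textbook.

```python
# Input is a list: the problem size is usually its length.
numbers = [5, 3, 8, 1]
n = len(numbers)

# Input is a graph (adjacency lists): two size parameters, V and E.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
num_vertices = len(graph)
num_edges = sum(len(neighbours) for neighbours in graph.values())

# Input is a single number: either its value or its length in bits.
value = 1_000_000
bits = value.bit_length()

print(n, (num_vertices, num_edges), (value, bits))
```

The last case is where the choice matters most: a loop that runs `value` times is linear in the value but exponential in `bits`.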

The problem size is the number of bits needed to store an instance of the problem, when it is specified in a reasonable encoding.

To clarify the concept, let me put this in layman's terms:
Given:
You have a big phone book.
Problem:
You are told to find the number of a person named John Mcallister.
Approach:
You can either search for this entry through each page (in the linear manner);
or, if the phone-book is sorted, you can utilize Binary Search;
Answer to your question:
The algorithm's problem here is finding an entry in the phone book.
The problem size is the size of the data your algorithm has to work on. In your case, that is the size of your phone book: if it has 10 entries per page and 50 pages, the size is 50 x 10 = 500 entries.
Since your algorithm may have to examine the entire phone book, the size of the task/problem you implement the algorithm for is 500.
Problem size is generally denoted by n, and it literally means the size of the input data.
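To make the phone-book example concrete, here is a minimal Python sketch that counts how many entries each approach inspects. The phone book is modelled as a sorted list of 500 integers, which is an assumption made purely for illustration.

```python
def linear_search_steps(entries, target):
    """Count entries inspected when scanning from the front."""
    steps = 0
    for entry in entries:
        steps += 1
        if entry == target:
            break
    return steps


def binary_search_steps(entries, target):
    """Count entries inspected by binary search on a sorted list."""
    lo, hi = 0, len(entries) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if entries[mid] == target:
            break
        if entries[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


book = list(range(500))  # 50 pages x 10 entries, already sorted
print(linear_search_steps(book, 499))  # worst case for the linear scan
print(binary_search_steps(book, 499))
```

With n = 500 the linear scan may inspect all 500 entries, while binary search needs at most 9 inspections. Both counts are expressed in terms of the same problem size n.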

Related

Problem size, input size and asymptotic behavior for an algorithm (re post)

I am re-posting my question because I accidentally said that another thread (with a similar topic) answered my question, which wasn't the case. I am sorry for any inconvenience.
I am trying to understand how the input size, problem size and the asymptotic behavior of an arbitrary algorithm given in pseudo code format differ from each other. While I fully understand the input and the asymptotic behavior, I have problems understanding the problem size. To me it looks as if problem size= space complexity for a given problem. But I am not sure. I'd like to illustrate my confusion with the following example:
We have the following pseudo code:
ALGONE(x, y)
    if x = 0 or x = y then
        return 1
    end
    return ALGONE(x-1, y-1) + ALGONE(x, y-1)
So let's say we give two inputs, $x$ and $y$, and $n$ represents the number of digits.
Since addition is our main operation, addition is an elementary operation, and adding two numbers of n digits takes n operations, the asymptotic behavior is of the form O(n).
But what about the problem size in this case? I don't understand what I am supposed to say. The term "problem size" is so vague. It depends on the algorithm, but even if one understands the algorithm, what do you give as an answer?
I'd assume that in this particular case the problem size might be the number of bits we need to represent the input. But this is a guess of mine, grounded in nothing.
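One way to probe this is to run the pseudocode and count the recursive calls. The Python sketch below is a direct transcription that assumes 0 <= x <= y (otherwise the recursion never reaches a base case). The `counter` argument is added only for instrumentation and is not part of the original pseudocode.

```python
def algone(x, y, counter):
    """Transcription of ALGONE; counter[0] records the number of calls."""
    counter[0] += 1
    if x == 0 or x == y:
        return 1
    return algone(x - 1, y - 1, counter) + algone(x, y - 1, counter)


for x, y in [(2, 4), (5, 10)]:
    calls = [0]
    value = algone(x, y, calls)
    print(x, y, value, calls[0])
```

Under that assumption the function returns the binomial coefficient C(y, x), and the number of calls is 2·C(y, x) − 1. So the work depends on the values of x and y and not only on how many digits they have, which is why it matters whether the problem size is taken to be the values or the number of bits.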

Algorithm for highest value inside budget

I wasn't entirely sure the best way to ask this question (or do the research to see if it has been previously answered).
Given a data set where each entry has a Point value and a Dollar value, I'm looking to generate a list of length N entries that yields the highest aggregate Point value whilst staying within budget B.
Example data set:
Item   Points  Dollars
Apple  3.0     $1.00
Pear   2.5     $0.75
Peach  2.8     $0.88
And with this (small) data set, say my budget (B) is $2.25, and list length (N) must be 2. You MUST use the fixed list length, but are not required to use ALL of the budget.
Obviously the example provided is easy to do in one's head, but given a much larger data set, and both higher N and B values, I'm looking for an algorithm that can generate the list. Having a hard time wrapping my head around this one.
Just looking for a pseudo-algorithm, but if you prefer any given language feel free to respond with that!
I am quite positive that an NP-complete problem can be reduced to this one, and hence it's not really worth trying to develop a process that will always give you the 'correct' answer: many people have tried and failed to do this efficiently over a large data set. However, you can use a much more efficient approximation technique. It will not guarantee the correct answer, but many popular approximation algorithms are capable of achieving a high degree of accuracy.
Hope this helps you out :)
This problem is NP-Complete (NP and NP-Hard), meaning that no algorithm has been found so far that solves it in polynomial time (polynomial in the input size). If you found one, you would have solved one of the greatest open problems in computer science (P = NP), which would at least bring you a million-dollar reward.
If you are satisfied with an approximation, I would recommend the Greedy-Algorithm:
https://en.wikipedia.org/wiki/Greedy_algorithm
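For small data sets, an exhaustive search gives the exact answer and can serve as a baseline for judging any approximation. The Python sketch below is not the greedy algorithm linked above. It simply tries every combination of N entries, and it assumes each entry may be used at most once (the question does not say). Prices are stored in cents to avoid floating-point rounding. The function and variable names are made up for the example.

```python
from itertools import combinations


def best_list(items, n, budget_cents):
    """items: (name, points, price_cents) tuples.

    Returns (total_points, chosen_items) for the best n-item list within
    budget, or None if no n-item list fits. Tries all C(len(items), n)
    combinations, so it is only practical for small inputs.
    """
    best = None
    for combo in combinations(items, n):
        cost = sum(price for _, _, price in combo)
        if cost > budget_cents:
            continue
        points = sum(p for _, p, _ in combo)
        if best is None or points > best[0]:
            best = (points, combo)
    return best


items = [("Apple", 3.0, 100), ("Pear", 2.5, 75), ("Peach", 2.8, 88)]
result = best_list(items, n=2, budget_cents=225)
print([name for name, _, _ in result[1]], result[0])
```

On the example data this selects Apple and Peach, the pair with the highest combined points that stays within $2.25.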

variant of knapsack problem

I have 'n' number of amounts (non-negative integers). My requirement is to determine an optimal set of amounts so that the sum of the combination is less than or equal to a given fixed limit and the total is as large as possible. There is no limit to the number of amounts that can be included in the optimal set.
For the sake of example: the amounts are 143, 2054, 546, 3564, 1402 and the given limit is 5000.
As per my understanding the knapsack problem has 2 attributes for each item (weight and value). But the problem stated above has only one attribute (amount). I hope that would make things simpler? :)
Can someone please help me with the algorithm or source code for solving this?
This is still an NP-hard problem, but if you want to (or have to) do something like that, maybe this topic helps you out a bit:
find two or more numbers from a list of numbers that add up towards a given amount
There I solved it like this, and NikiC modified it to be faster. The only difference is that that one was about hitting the exact amount, not getting "as close as possible", but that would take only some small changes in the code (and you'll have to translate it into the language you're using).
Take a look at the comments in my code to understand what I'm trying to do, which is, in short:
calculate all possible combinations of the given parts and sum them up
if the result is the amount I'm looking for, save the solution to an array
finally, sort all possible solutions to get the one using the fewest parts
So you'll have to change:
save a solution if it's lower than or equal to the amount you're looking for
sort solutions by total amount instead of by number of used parts
The book "Knapsack Problems" by Hans Kellerer, Ulrich Pferschy and David Pisinger calls this the Subset Sum Problem and dedicates an entire chapter (Ch. 4) to it. The chapter is very comprehensive and covers algorithms as well as computational results.
Even though this problem is a special case of the knapsack problem, it is still NP-hard.
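Although the problem is NP-hard in general, it can be solved exactly with dynamic programming over reachable sums when the limit is a moderately sized integer, as in the question. The following is a minimal Python sketch of that standard technique. It is not taken from the book or from the linked code, and the function name is made up.

```python
def best_subset_sum(amounts, limit):
    """Return (best_total, subset) with best_total <= limit as large as possible.

    reachable maps each achievable sum to one subset producing it.
    Time and memory grow with the number of distinct reachable sums,
    which is at most limit + 1 (pseudo-polynomial).
    """
    reachable = {0: ()}
    for amount in amounts:
        # Iterate over a snapshot so each amount is used at most once.
        for total, subset in list(reachable.items()):
            new_total = total + amount
            if new_total <= limit and new_total not in reachable:
                reachable[new_total] = subset + (amount,)
    best = max(reachable)
    return best, reachable[best]


print(best_subset_sum([143, 2054, 546, 3564, 1402], 5000))
# → (4966, (3564, 1402))
```

Note that the running time depends on the numeric value of the limit, not just on the number of amounts, which is another case where the choice of problem size matters.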

How do you evaluate the efficiency of an algorithm, if the problem space is underspecified?

There was a post on here recently which posed the following question:
You have a two-dimensional plane of (X, Y) coordinates. A bunch of random points are chosen. You need to select the largest possible set of chosen points, such that no two points share an X coordinate and no two points share a Y coordinate.
This is all the information that was provided.
There were two possible solutions presented.
One suggested using a maximum flow algorithm, such that each selected point maps to a path linking (source → X → Y → sink). This runs in O(V³) time, where V is the number of vertices selected.
Another (mine) suggested using the Hungarian algorithm. Create an n×n matrix of 1s, then set every chosen (x, y) coordinate to 0. The Hungarian algorithm will give you the lowest cost for this matrix, and the answer is the number of coordinates selected which equal 0. This runs in O(n³) time, where n is the greater of the number of rows or the number of columns.
My reasoning is that, for the vast majority of cases, the Hungarian algorithm is going to be faster; V is equal to n in the case where there's one chosen point for each row or column, and substantially greater for any case where there's more than that: given a 50×50 matrix with half the coordinates chosen, V is 1,250 and n is 50.
The counterargument is that there are some cases, like a 10⁹×10⁹ matrix with only two points selected, where V is 2 and n is 1,000,000,000. For this case, it takes the Hungarian algorithm a ridiculously long time to run, while the maximum flow algorithm is blinding fast.
Here is the question: Given that the problem doesn't provide any information regarding the size of the matrix or the probability that a given point is chosen (so you can't know for sure) how do you decide which algorithm, in general, is a better choice for the problem?
You can't, it's an imponderable.
You can only define which is better "in general" by defining what inputs you will see "in general". So for example you could whip up a probability model of the inputs, so that the expected value of V is a function of n, and choose the one with the best expected runtime under that model. But there may be arbitrary choices made in the construction of your model, so that different models give different answers. One model might choose co-ordinates at random, another model might look at the actual use-case for some program you're thinking of writing, and look at the distribution of inputs it will encounter.
You can alternatively talk about which has the best worst case (across all possible inputs with given constraints), which has the virtue of being easy to define, and the flaw that it's not guaranteed to tell you anything about the performance of your actual program. So for instance HeapSort is faster than QuickSort in the worst case, but slower in the average case. Which is faster? Depends whether you care about average case or worst case. If you don't care which case, you're not allowed to care which "is faster".
This is analogous to trying to answer the question "what is the probability that the next person you see will have an above (mean) average number of legs?".
We might implicitly assume that the next person you meet will be selected at random with uniform distribution from the human population (and hence the answer is "slightly less than one", since the mean is less than the mode average, and the vast majority of people are at the mode).
Or we might assume that your next meeting with another person is randomly selected with uniform distribution from the set of all meetings between two people, in which case the answer is still "slightly less than one", but I reckon not the exact same value as the first - one-and-zero-legged people quite possibly congregate with "their own kind" very slightly more than their frequency within the population would suggest. Or possibly they congregate less, I really don't know, I just don't see why it should be exactly the same once you take into account Veterans' Associations and so on.
Or we might use knowledge about you - if you live with a one-legged person then the answer might be "very slightly above 0".
Which of the three answers is "correct" depends precisely on the context which you are forbidding us from talking about. So we can't talk about which is correct.
Given that you don't know what each pill does, do you take the red pill or the blue pill?
If there really is not enough information to decide, there is not enough information to decide. Any guess is as good as any other.
Maybe, in some cases, it is possible to divine extra information to base the decision on. I haven't studied your example in detail, but it seems like the Hungarian algorithm might have higher memory requirements. This might be a reason to go with the maximum flow algorithm.
You don't. I think you illustrated that clearly enough. I think the proper practical solution is to spawn off both implementations in different threads, and then take the response that comes back first. If you're more clever, you can heuristically route requests to implementations.
Many algorithms require huge amounts of memory, beyond the physical maximum of a machine, and in these cases the algorithm that is less efficient in time but more efficient in space is chosen.
Given that we have distributed parallel computing, I say you just let both horses run and let the results speak for themselves.
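A rough Python sketch of the "let both horses run" idea follows. The `race` helper and the two stand-in solvers are hypothetical. They only simulate a fast and a slow implementation with `time.sleep`, so they are not the max-flow or Hungarian algorithms themselves.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def race(*solvers):
    """Start every solver in its own thread; return the first result to arrive."""
    pool = ThreadPoolExecutor(max_workers=len(solvers))
    futures = [pool.submit(solver) for solver in solvers]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # Threads cannot be killed: the slower solver keeps running in the
    # background, we just stop waiting for it.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


def quick_solver():   # stand-in for whichever algorithm suits this input
    time.sleep(0.01)
    return "quick"


def slow_solver():    # stand-in for the algorithm that suits it badly
    time.sleep(0.3)
    return "slow"


print(race(quick_solver, slow_solver))
```

In CPython, CPU-bound threads share the global interpreter lock, so a real implementation would more likely use separate processes or machines. The sketch only shows the control flow.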
This is a valid question, but there's no "right" answer — they are incomparable, so there's no notion of "better".
If your interest is practical, then you need to analyze the kinds of inputs that are likely to arise in practice, as well as the practical running times (constants included) of the two algorithms.
If your interest is theoretical, where worst-case analysis is often the norm, then, in terms of the input size, the O(V³) algorithm is better: you know that V ≤ n², but you cannot polynomially bound n in terms of V, as you showed yourself. Of course the theoretical best algorithm is a hybrid algorithm that runs both and stops as soon as either one of them finishes, so its running time would be O(min(V³, n³)).
Theoretically, they are both the same, because you actually compare how the number of operations grows when the size of the problem is increased to infinity.
The way your problem is defined, it has 2 sizes - n and number of points, so this question has no answer.
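To show how an approach whose cost depends on the number of chosen points (rather than on the grid size) can look, here is a minimal Python sketch. It uses simple augmenting-path bipartite matching between X values and Y values, which is the unit-capacity special case of the max-flow formulation from the question. It is not the exact algorithm either poster proposed, and the function name is made up.

```python
def max_independent_points(points):
    """Largest subset of points with pairwise distinct x and pairwise distinct y."""
    ys_by_x = {}
    for x, y in points:
        ys_by_x.setdefault(x, []).append(y)
    match_of_y = {}  # y value -> x value currently matched to it

    def try_assign(x, seen):
        # Look for a free y for x, re-routing earlier matches if necessary.
        for y in ys_by_x[x]:
            if y in seen:
                continue
            seen.add(y)
            if y not in match_of_y or try_assign(match_of_y[y], seen):
                match_of_y[y] = x
                return True
        return False

    for x in ys_by_x:
        try_assign(x, set())
    return [(x, y) for y, x in match_of_y.items()]


# Two points on a 10^9 x 10^9 grid: no n-by-n matrix is ever built.
print(max_independent_points([(0, 0), (10**9 - 1, 10**9 - 1)]))
print(len(max_independent_points([(0, 0), (0, 1), (1, 0)])))
```

The recursion depth can grow with the number of distinct X values, so very large inputs would need an iterative version.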

Given a number series, finding the Check Digit Algorithm...?

Suppose I have a series of index numbers, each of which includes a check digit. If I have a fair enough sample (say 250 sample index numbers), is there a way to extract the algorithm that was used to generate the check digit?
I think there should be a programmatic approach, at least to find a set of possible algorithms.
UPDATE: The length of an index number is 8 digits, including the check digit.
No, not in the general case, since the number of possible algorithms is far more than what you may think. A sample space of 250 may not be enough to do proper numerical analysis.
For an extreme example, let's say your samples are all 15 digits long. You would not be able to reliably detect the algorithm if it changed its behaviour for numbers longer than 15 digits.
If you wanted to be sure, you should reverse engineer the code that checks the numbers for validity (if available).
If you know that the algorithm is drawn from a smaller subset than "every possible algorithm", then it might be possible. But algorithms may be only half the story - there's also the case where multipliers, exponentiation and wrap-around points change even using the same algorithm.
paxdiablo is correct, and you can't guess the algorithm without making any other assumption (or just having the whole sample space - then you can define the algorithm by a look up table).
However, if the check digit is calculated using some linear formula dependent on the "data digits" (which is a very common case, as you can see in the Wikipedia article), given enough samples you can use Gaussian elimination to solve for the formula's coefficients.
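As a sketch of the linear case, suppose the check value is a weighted sum of the 7 data digits modulo a prime. Modulus 11 is used here, as in ISBN-10. That is purely an assumption, and a remainder of 10 would need a special symbol in a real scheme. The weights can then be recovered by Gaussian elimination modulo that prime. All names and the synthetic samples below are hypothetical.

```python
import random

P = 11  # assumed prime modulus, not given in the question


def solve_mod_p(rows, rhs, p=P):
    """Solve rows . w = rhs (mod p) by Gaussian elimination over GF(p).

    Returns one solution (free variables set to 0), or None if the
    samples are inconsistent with a linear formula mod p.
    """
    n = len(rows[0])
    m = [[v % p for v in row] + [b % p] for row, b in zip(rows, rhs)]
    pivot_cols = []
    r = 0
    for c in range(n):
        pr = next((i for i in range(r, len(m)) if m[i][c]), None)
        if pr is None:
            continue  # no pivot in this column: free variable
        m[r], m[pr] = m[pr], m[r]
        inv = pow(m[r][c], -1, p)  # modular inverse (Python 3.8+)
        m[r] = [(v * inv) % p for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c]:
                f = m[i][c]
                m[i] = [(a - f * b) % p for a, b in zip(m[i], m[r])]
        pivot_cols.append(c)
        r += 1
    if any(row[-1] for row in m[r:]):
        return None  # a row reads 0 = nonzero
    w = [0] * n
    for i, c in enumerate(pivot_cols):
        w[c] = m[i][-1]
    return w


# Synthetic demo: generate 250 samples from hidden weights, then recover them.
hidden = [8, 7, 6, 5, 4, 3, 2]  # hypothetical weights
rng = random.Random(0)
samples = [[rng.randrange(10) for _ in range(7)] for _ in range(250)]
checks = [sum(w * d for w, d in zip(hidden, s)) % P for s in samples]
print(solve_mod_p(samples, checks))
```

If the function returns None for real samples, the scheme is not linear modulo the assumed prime, and another modulus or another family of algorithms would have to be tried.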
