I know that the algorithm for live variable analysis always terminates and gives a solution. However, I'd like to know whether the number of iterations is determined, i.e., can I calculate the iteration count from some parameters? I guess the parameters would be related to the program being analyzed.
Although I still don't know how to calculate the exact number of iterations, it's easy to compute the maximum number of iterations we need; it follows from basic lattice theory.
For simplicity, assume the height of the lattice corresponding to the analysis is i and the number of nodes in the CFG is k; then the maximum number of iterations is i*k.
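To get a concrete feel for the bound, here is a minimal sketch of a round-robin live-variable analysis on a toy, made-up CFG (the node names and variables are invented for illustration). The pass counter stays within the i*k bound, plus the final pass that detects no change.

```python
# A minimal round-robin live-variable analysis on a tiny, hypothetical CFG.
# Each node is (defined_vars, used_vars, successors). The fixpoint must be
# reached within height * k passes, where height = number of variables
# (powerset lattice) and k = number of CFG nodes, because every pass that
# changes anything moves at least one node's set strictly up the lattice.

cfg = {
    "entry": (set(),  set(),      ["loop"]),
    "loop":  ({"i"},  {"i", "n"}, ["body", "exit"]),
    "body":  ({"s"},  {"s", "i"}, ["loop"]),
    "exit":  (set(),  {"s"},      []),
}

live_in  = {n: set() for n in cfg}
live_out = {n: set() for n in cfg}

passes = 0
changed = True
while changed:
    changed = False
    passes += 1
    for node, (defs, uses, succs) in cfg.items():
        out_set = set().union(*(live_in[s] for s in succs))  # union of successors' IN sets
        in_set = uses | (out_set - defs)
        if out_set != live_out[node] or in_set != live_in[node]:
            live_out[node], live_in[node] = out_set, in_set
            changed = True

num_vars = len({v for d, u, _ in cfg.values() for v in d | u})
print(f"converged after {passes} passes "
      f"(upper bound: {num_vars} * {len(cfg)} = {num_vars * len(cfg)})")
```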
Something I'm struggling with is figuring out a precise way of knowing how our size n should be defined.
To demonstrate what I mean by that, take binary search as an example. The input size n of T(n) is defined to be high - low + 1. I don't really understand how we can
figure that out from just the algorithm without taking an "educated guess", or
confirm that we are not wasting our time proving the recurrence equation with a falsely derived input size.
I would really appreciate some advice, thanks in advance.
Indeed the input size is usually not arbitrary. It is necessary to correctly determine what that is.
So how do we do it?
First of all, you have to understand the algorithm: what it does and why you even use it. Usually you can brute force any problem quite easily, but you still decide to use something better. Why? Answering that question should make everything clear.
Let's take a look at your binary search example. You want to find a specific value by calling a monotonic function that tells you whether your choice is too low or too high. For example, you may want to find the largest value less than a specific number in an array.
What is the brute force approach? Well, you can ask about every possible value. What affects the complexity? The number of values you can choose from: that's your input size. To decrease the complexity you may want to use binary search, which lets you make far fewer queries. The input size remains the same.
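As a rough sketch of that comparison (the too_low predicate and the 0..1,000,000 range are just illustrative assumptions), both functions below work on the same search space of size n = high - low + 1; only the number of queries differs:

```python
# Sketch: the same search space of size n = high - low + 1 queried two ways.
# "too_low" is a hypothetical monotonic predicate (e.g. "is x below the target?").

def brute_force_search(low, high, too_low):
    """Ask about every candidate: O(n) queries, n = high - low + 1."""
    queries = 0
    answer = low
    for x in range(low, high + 1):
        queries += 1
        if too_low(x):
            answer = x
    return answer, queries

def binary_search(low, high, too_low):
    """Same input size n, but only O(log n) queries."""
    queries = 0
    while low < high:
        mid = (low + high + 1) // 2
        queries += 1
        if too_low(mid):
            low = mid          # mid is still too low, move right
        else:
            high = mid - 1     # mid is too high, move left
    return low, queries

target = 700_000
pred = lambda x: x < target    # largest value strictly below target
print(brute_force_search(0, 1_000_000, pred))   # ~10^6 queries
print(binary_search(0, 1_000_000, pred))        # ~20 queries
```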
Let's have a look at some other algorithms:
Euclidean algorithm for GCD - brute force? Check every number less than or equal to the smaller of your two inputs. What affects the complexity? The number of candidate values you have to check, that is, the numbers for which you want to find the GCD (see the sketch after this list).
BFS/DFS (graph traversal) - you want to visit each node once, and for each node you additionally have to check each of its edges. In total, the input size is the number of nodes plus the number of edges.
KMP/Karp-Rabin/any other pattern matching algorithm - you want to find the occurrences of a pattern in a text, so the input size is obviously the text size, but also the pattern size.
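Here is the GCD sketch referred to above, contrasting the brute force scan with Euclid's algorithm on the same pair of inputs (the example values are arbitrary):

```python
# Rough sketch contrasting the two GCD strategies: both take the same
# input (a, b); only the number of steps differs.

def gcd_brute_force(a, b):
    """Try every candidate divisor: roughly min(a, b) steps."""
    for d in range(min(a, b), 0, -1):
        if a % d == 0 and b % d == 0:
            return d
    return 1   # unreachable for positive a, b, since d = 1 always divides

def gcd_euclid(a, b):
    """Euclid's algorithm: O(log min(a, b)) steps."""
    while b:
        a, b = b, a % b
    return a

print(gcd_brute_force(1071, 462), gcd_euclid(1071, 462))   # 21 21
```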
Once you understand the problem the algorithm solves, think of a brute force approach and determine what affects the speed and what can be improved. That is most likely your input size. In more complex cases, the input size is stated in the algorithm's description.
I've read that the naive approach to testing primality has exponential complexity because you judge the algorithm by the size of its input. Mysteriously, people insist that when discussing primality of an integer, the appropriate measure of the size of the input is the number of bits (not n, the integer itself).
However, when discussing an algorithm like Floyd's, the complexity is often stated in terms of the number of nodes without regard to the number of bits required to store those nodes.
I'm not trying to make an argument here. I honestly don't understand the reasoning. Please explain. Thanks.
Traditionally, complexity is measured against the size of the input.
In the case of numbers, the size of the input is the logarithm of the number (because that is the length of its binary representation); in the case of graphs, all edges and vertices must be represented somehow in the input, so the size of the input is linear in |V| and |E|.
For example, a naive primality test that runs in time linear in the number itself is called pseudo-polynomial: it is polynomial in the value n, but it is NOT polynomial in the size of the input, which is log(n); in fact it is exponential in the size of the input.
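A small sketch of that distinction (the specific prime below is just an example): the trial-division loop does roughly n iterations, which is about 2^b iterations for a b-bit input.

```python
# Naive (trial-division) primality test: the loop runs about n times, which is
# linear in the value n but exponential in the input size b = number of bits,
# since n can be as large as 2**b.

def is_prime_naive(n):
    if n < 2:
        return False
    for d in range(2, n):       # ~n iterations: pseudo-polynomial
        if n % d == 0:
            return False
    return True

n = (1 << 31) - 1               # a 31-bit prime (2^31 - 1)
print(n.bit_length())           # input size in bits: 31
# is_prime_naive(n) would need ~2^31 divisions -- exponential in those 31 bits.
```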
As a side note, it does not matter whether you measure the size of the input in bits, bytes, or any other unit that differs only by a constant factor, because the constant is discarded anyway when you move to asymptotic notation.
The main difference is that when discussing algorithms we keep in the back of our minds hardware that can perform operations on the data in O(1) time. When being strict, or when the data does not fit into a processor register, taking the number of bits into account becomes important.
Although the size of the input is measured in bits, in many cases we can use a shortcut that lets us factor out a constant number of bits. This constant factor is embedded in the representation that we choose for our data structure.
When discussing graph algorithms, we assume that each vertex and each edge has a fixed representation cost in bits, one that does not depend on the number of vertices and edges. This assumption requires that the weights associated with vertices and edges have a fixed size in bits (e.g. all fixed-width integers, all floats, etc.).
With this assumption in place, adjacency list representation has fixed size per edge or vertex, because we need one pointer per edge and one pointer per vertex, in addition to the weights, which we presume to be of constant size as well.
The same goes for the adjacency matrix representation, because we need W(V^2 + V) bits for the matrix, where W is the number of bits required to store a weight.
In the rare situations where the weights themselves depend on the number of vertices or edges, the fixed-weight assumption no longer holds, and we must go back to counting the number of bits.
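As a back-of-the-envelope sketch of this accounting (the 64-bit pointer and weight widths are assumptions of the sketch, not anything prescribed above):

```python
# Rough bit accounting for the two representations discussed above.
# P = bits per pointer, W = bits per weight -- both assumed constant (64 here).

def adjacency_list_bits(V, E, P=64, W=64):
    # one pointer plus one weight per vertex and per edge
    return (V + E) * (P + W)

def adjacency_matrix_bits(V, W=64):
    # W * (V^2 + V): a weight per vertex pair, plus a weight per vertex
    return W * (V * V + V)

V, E = 1_000, 10_000
print(adjacency_list_bits(V, E))    # grows linearly in V + E
print(adjacency_matrix_bits(V))     # grows quadratically in V
```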
Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (based on my understanding of the problem):
Select a point k from the dataset and build a list of all points sorted in ascending order of their distance from k, in O(N log N).
Keeping k as the center, add points from the sorted list into a set until the variance exceeds C, then stop.
Do this for every point k.
Keep track of the largest set found.
Time complexity: O(N^2 log N), where N is the size of the dataset.
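A direct sketch of this procedure, treating the points as Python complex numbers (the sample data and the value of C are made up):

```python
# For every candidate "center" point, greedily grow the set outward by
# distance and stop once the variance exceeds C. Complex numbers act as
# 2-D points; Var = E[|p|^2] - |mean|^2.

def largest_low_variance_set(points, C):
    best = []
    for k in points:
        # O(N log N): sort all points by distance from k
        ordered = sorted(points, key=lambda p: abs(p - k))
        chosen, total, total_sq = [], 0j, 0.0
        for p in ordered:
            chosen.append(p)
            total += p
            total_sq += abs(p) ** 2
            mean = total / len(chosen)
            variance = total_sq / len(chosen) - abs(mean) ** 2
            if variance > C:
                chosen.pop()
                break
        if len(chosen) > len(best):
            best = chosen
    return sum(best) / len(best), best   # "typical value" and its support set

pts = [1 + 1j, 1.1 + 0.9j, 0.9 + 1.1j, 5 + 5j, -3 - 7j]
print(largest_low_variance_set(pts, C=0.5))
```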
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k, then remove all points that contribute too much to the variance, decrease k, and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such outliers.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.
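As a sketch of that 1NN filter, using scipy's k-d tree rather than an R-tree (the query is still roughly O(n log n)); the distance threshold is an illustrative assumption you would have to tune:

```python
# 1NN outlier filter: keep only points whose nearest neighbour is close enough.
import numpy as np
from scipy.spatial import cKDTree

def remove_1nn_outliers(points, distance_threshold):
    xy = np.column_stack([points.real, points.imag])   # complex -> 2-D coordinates
    tree = cKDTree(xy)
    dist, _ = tree.query(xy, k=2)      # k=2: the closest hit is the point itself
    keep = dist[:, 1] < distance_threshold
    return points[keep]

pts = np.array([1 + 1j, 1.05 + 0.95j, 0.95 + 1.05j, 40 - 3j, -25 + 60j])
print(remove_1nn_outliers(pts, distance_threshold=1.0))
```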
The Hungarian algorithm solves the assignment problem in polynomial time. Given workers and tasks, and an n×n matrix containing the cost of assigning each worker to a task, it can find the cost minimizing assignment.
I want to find the assignment for which the cost is maximal. Can I do it using the Hungarian algorithm or a similar method, or can this only be done in exponential time?
Wikipedia says:
If the goal is to find the assignment that yields the maximum cost, the problem can be altered to fit the setting by replacing each cost with the maximum cost subtracted by the cost.
So if I understand correctly: among all the costs you have as input, you find the maximum value. Then you replace each cost x by max - x. This way you still have nonnegative costs and you can run the Hungarian algorithm.
Said differently: the Hungarian algorithm minimizes the assignment cost, so if you are looking for the maximum you can negate the costs: x -> -x. However, some implementations (I don't know if all or any) require nonnegative numbers, so the idea is to add a constant value to each cost in order to make them nonnegative. This constant does not change the resulting assignment.
As David said in the comment:
Multiply the cost matrix by -1 for maximization.
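A quick sketch with SciPy's linear_sum_assignment (a Hungarian-style solver) showing that negating the costs and the max - x transformation from the quote give the same maximal total; the 3x3 matrix is arbitrary:

```python
# Maximisation with an assignment solver: either negate the costs or
# (in newer SciPy versions) pass maximize=True.
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

# Option 1: minimise the negated matrix (same optimal assignment).
rows, cols = linear_sum_assignment(-cost)
print(cols, cost[rows, cols].sum())     # maximum total cost

# Option 2: the equivalent "max - x" transformation.
rows2, cols2 = linear_sum_assignment(cost.max() - cost)
print(cols2, cost[rows2, cols2].sum())  # same total
```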
The question is like this:
Assume we have N machines, and each machine stores and can manipulate N elements. How can we find the median of all N^2 elements at the lowest cost?
It has really been bothering me; I hope to get an answer from you, thanks!
Sorry, I wrote it down too simply. The elements stored on each machine are random and unordered. The cost includes I/O, communication between machines, RAM, and time; everything should be considered. I just want to find the most efficient way to get the median.
These are some solutions I have come up with:
use an external sort such as merge sort, then pick the middle element.
use a bucket sort: divide all the elements into X consecutive buckets by value, so we can tell which bucket the median falls into; scan that bucket and we get the median.
I think the O(N) selection algorithm (finding the k-th smallest element) from "Introduction to Algorithms" should work here?
But still, all these solutions need an extra machine to do the job. I'm wondering whether there is a way that we can only use these N machines to get the median?
Thanks!
You'll need to have a process that counts all the values (total across all the stores). Pick the middle index. Adjust the index to be an offset from the start of items on the appropriate machine. Ask that machine to sort the items and return the value for that index.
Step 1: Sort the numbers at each machine individually
Step 2: Send the median at each machine to a central place
Step 3: Sort the medians and send them to each machine
Step 4: For each element in the sorted medians calculate the rank at machine level
Step 5: Calculate the rank of each element over all machines (just sum the per-machine ranks)
Step 6: Find two elements in the sorted medians between which the global median exists
Step 7: For the next iteration, consider only the elements between those two medians, and repeat the whole thing.
In the worst case all the remaining elements in the second iteration will be on a single machine.
Complexity: pretty sure it is O(n log n) (i.e., counting the total work across all machines it can be O(n^2 log n)).
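Here is a simplified, single-process simulation of the rank-counting idea from steps 4-5. Instead of iterating on the machine medians, it binary-searches over an (assumed integer) value range, asking each machine only for the count of its elements that are at most a candidate value:

```python
# Simplified rank-counting sketch: each "machine" only answers
# "how many of my elements are <= x?", and a coordinator binary searches
# over the value range. Assumes integer elements.
from bisect import bisect_right
import random

def distributed_kth(machines, k):
    """k-th smallest (0-based) across all machines, each machine sorted locally."""
    sorted_machines = [sorted(m) for m in machines]        # step 1
    lo = min(m[0] for m in sorted_machines)
    hi = max(m[-1] for m in sorted_machines)
    while lo < hi:
        mid = (lo + hi) // 2
        # global rank: one O(log n) query per machine
        rank = sum(bisect_right(m, mid) for m in sorted_machines)
        if rank >= k + 1:
            hi = mid
        else:
            lo = mid + 1
    return lo

N = 8
machines = [[random.randint(0, 1000) for _ in range(N)] for _ in range(N)]
everything = sorted(x for m in machines for x in m)
k = (N * N) // 2
assert distributed_kth(machines, k) == everything[k]
print(everything[k])
```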
Can you estimate it rather than get it exactly?
If so, pick a constant K and fit a K-coefficient polynomial to the data on each machine, send the coefficients to a central machine that adds them up, and then find the median by
integrating the curve over the range to find the area under the curve, and
running a root-finding algorithm to find the point that splits that area in half.
The bigger K is, the less error there will be. The smaller K is, the more efficient it will be.
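A hedged sketch of this estimation scheme: each machine fits a K-coefficient polynomial to its local histogram over a shared value range, only the coefficients travel to the coordinator, and the median estimate comes from bisecting the integrated sum. The shared range, bin count, and value of K are all assumptions of the sketch.

```python
# Approximate distributed median: polynomial fit per machine, summed centrally.
import numpy as np

def approximate_median(machines, lo, hi, K=6, bins=32):
    summed = np.zeros(K)
    for data in machines:                        # done locally on each machine
        counts, edges = np.histogram(data, bins=bins, range=(lo, hi))
        centers = (edges[:-1] + edges[1:]) / 2
        summed += np.polyfit(centers, counts, K - 1)   # K coefficients sent out

    integral = np.polyint(summed)                # antiderivative of summed curve
    area = lambda x: np.polyval(integral, x) - np.polyval(integral, lo)
    half = area(hi) / 2

    a, b = lo, hi                                # bisect for the half-area point
    for _ in range(60):
        mid = (a + b) / 2
        if area(mid) < half:
            a = mid
        else:
            b = mid
    return (a + b) / 2

rng = np.random.default_rng(0)
machines = [rng.normal(50, 10, size=1000) for _ in range(8)]
est = approximate_median(machines, lo=0, hi=100)
print(est, np.median(np.concatenate(machines)))   # estimate vs exact
```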