Logarithmic growth [duplicate] - performance

Possible Duplicate:
Plain English explanation of Big O
The Wikipedia article on logarithmic growth is a stub. Many of the answers I read on Stack Overflow describe how efficient a process or function is using a logarithmic expression written with an O (I assume [see below] it is a 0 [zero] and not an O [the letter, as in M, N, O, P, Q], but please correct my assumption if it is wrong) and an n or N.
Can someone explain these logarithmic descriptions of computational cost more concretely? Maybe in terms of time in seconds (milliseconds are also welcome; I'm just trying to conceptualize it as real-life time differences), in terms of size, and/or in terms of weight?
I have seen the following frequently: (please feel free to include other ones as well)
O(1)
O(N)
My assumption is based on the fact that a 0 [zero] outside of a code block does not have a slash through it, while inside a code block a 0 does have a slash through it.

What this means is that the execution time (or other resource) is some function of the amount of data. Let's say it takes you 5 minutes to blow up 10 balloons. If the function is O(1), then blowing up 50 balloons also takes 5 minutes. If it is O(n), then blowing up 50 balloons will take 25 minutes.
O(log n) means that as things scale up, managing larger n becomes relatively easier (the cost follows logarithmic growth). Let's say you want to find an address in a directory of n entries, and suppose it takes 6.64 seconds to search a directory of 100 entries [6.64 = log_2(100)]. Then it might take only 7.64 seconds to search 200 entries. That is logarithmic growth.
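To make that concrete in code, here is a minimal Java sketch (my own illustration, not part of the original answer) that counts the steps taken by a linear scan (O(N)) and a binary search (O(log N)) as the input size doubles; the linear count doubles each time while the binary count grows by roughly one step:

    public class GrowthDemo {
        // Linear search: steps grow in proportion to n, i.e. O(n).
        static int linearSteps(int[] sorted, int target) {
            int steps = 0;
            for (int value : sorted) {
                steps++;
                if (value == target) break;
            }
            return steps;
        }

        // Binary search: each step halves the remaining range, i.e. O(log n).
        static int binarySteps(int[] sorted, int target) {
            int lo = 0, hi = sorted.length - 1, steps = 0;
            while (lo <= hi) {
                steps++;
                int mid = (lo + hi) >>> 1;
                if (sorted[mid] == target) break;
                if (sorted[mid] < target) lo = mid + 1; else hi = mid - 1;
            }
            return steps;
        }

        public static void main(String[] args) {
            for (int n : new int[]{100, 200, 400, 800}) {
                int[] data = new int[n];
                for (int i = 0; i < n; i++) data[i] = i;
                // Worst case: look for the last element.
                System.out.printf("n=%4d  linear=%4d steps  binary=%2d steps%n",
                        n, linearSteps(data, n - 1), binarySteps(data, n - 1));
            }
        }
    }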

Related

Performance comparison: Algorithm S and Algorithm Z

Recently I ran into two sampling algorithms: Algorithm S and Algorithm Z.
Suppose we want to sample n items from a data set. Let N be the size of the data set.
When N is known, we can use Algorithm S
When N is unknown, we can use Algorithm Z (optimized atop Algorithm R)
Performance of the two algorithms:
Algorithm S
Time complexity: the average number of scanned items is n(N+1)/(n+1) (I computed this myself; Knuth's book leaves it as an exercise), so we can say it is O(N)
Space complexity: O(1), or O(n) if returning an array
Algorithm Z (I searched the web and found the paper https://www.cs.umd.edu/~samir/498/vitter.pdf)
Time complexity: O(n(1+log(N/n)))
Space complexity: TAOCP vol. 2, section 3.4.2 mentions that Algorithm R's space complexity is O(n(1+log(N/n))), so I suppose Algorithm Z might be the same
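To get a feel for the gap implied by those two expressions, here is a tiny Java calculation (my own illustration; the values of N and n are arbitrary, and a natural logarithm is assumed in the Algorithm Z bound):

    public class ScanEstimate {
        public static void main(String[] args) {
            double N = 1_000_000; // hypothetical data set size
            double n = 100;       // hypothetical sample size
            // Algorithm S: expected number of scanned items, n(N+1)/(n+1)
            double s = n * (N + 1) / (n + 1);
            // Algorithm Z: on the order of n(1 + log(N/n))
            double z = n * (1 + Math.log(N / n));
            System.out.printf("Algorithm S: ~%.0f items, Algorithm Z: ~%.0f items%n", s, z);
        }
    }

With these numbers the Algorithm S estimate is roughly 990,000 items, while the Algorithm Z estimate is on the order of a thousand, which is where the time advantage comes from.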
My question
The model for Algorithm Z is: keep calling the next method on the data set until we reach the end. So even when N is known, we can still use Algorithm Z.
Based on the above performance comparison, Algorithm Z has better time complexity than Algorithm S, and worse space complexity.
If space is not a problem, should we use Algorithm Z even when N is known?
Is my understanding correct? Thanks!
Is the Postgres code mentioned in your comment actually used in production? In my opinion, it really should be reviewed by someone who has at least some understanding of the problem domain. The problem with random sampling algorithms, and random algorithms in general, is that it is very hard to diagnose biased sampling bugs. Most samples "look random" if you don't look too hard, and biased sampling is only obvious when you do a biased sample of a biased dataset. Or when your biased sample results in a prediction which is catastrophically divergent from reality, which will eventually happen but maybe not when you're doing the code review.
Anyway, by way of trying to answer the questions, both the one actually in the text of this post and the ones added or implied in the comment stream:
Properly implemented, Vitter's algorithm Z is much faster than Knuth's algorithm S. If you have a use case in which reservoir sampling is indicated, then you should probably use Vitter, subject to the code testing advice above: Vitter's algorithm is more complicated and it might not be obvious how to validate the implementation.
I noticed in the Postgres code that it just uses the threshold value of 22 to decide whether to use the more complicated code, based on testing done almost 40 years ago on hardware which you'd be hard pressed to find today. It's possible that 22 is not a bad threshold, but it's just a number pulled out of thin air. At least some attempt should be made to verify or, more likely, correct it.
Forty years ago, when those algorithms were developed, large datasets were typically stored on magnetic tape. Magnetic tape is still used today, but applications have changed; I think that you're not likely to find a Postgres installation in which a live database is stored on tape. This matters because the way you get data off a tape drive is radically different from the way you get data from a file server. Or a sharded distributed collection of file servers, which also has its particular needs.
Data on a reel of tape can only be accessed linearly, although it is possible to skip tape somewhat faster than you can read it. On a file server, data is random access; there may be a slight penalty for jumping around in a file, but there might not be. (On the sharded distributed model, it might well be faster than linear reads.) But trying to read out of order on a tape drive might turn an input operation which takes an hour into an operation which takes a week. So it's very important to access the sample in order. Moreover, you really don't want to have to read the tape twice, which would take twice as long.
One of the other assumptions that was made in those algorithms is that you might not have enough memory to store the entire sample; in 1985, main memory was horribly expensive and databases were already quite large. So a common way to collect a large sample from a huge database was to copy the sampled blocks onto secondary memory, such as another tape drive. But there's a bit of a catch with reservoir sampling: as the sampling algorithm proceeds, some items which were initially inserted in the sample are later replaced with other items. But you can't replace data written on tape, so you need to just keep on appending the newly selected samples. What you do hold in random access memory is a list of locations of the sample; once you've finished selecting the sample, you can sort this list of locations and then use it to read out the final selection in storage order, skipping over the rejected items. That means that the temporary sample storage ends up holding both the final sample, and some number of later rejected items. The O(n(1+log(N/n))) space complexity in Algorithm R refers to precisely this storage, and it's actually a reasonably small multiplier, considering.
All that is irrelevant if you can just allocate enough random access storage somewhere to hold the entire sample. Or, even better, if you can read the data directly from the database. There could well still be good reasons to read the sample into local storage, but nothing stops you from updating a block of local storage with a different block.
On the other hand, in many common cases, you don't need to read the data in order to sample it. You can just take a list of item numbers, select a sample of the desired size from that list, and then set about acquiring the sample from the list of selected item numbers. And that presents a rather different problem: how to choose an unbiased sample of size k from a set of K item indexes.
There's a fast and simple solution to that (also described by Knuth, unsurprisingly): make an array of all the item numbers (say, the integers from 0 to K-1), and then shuffle the array using the standard Knuth/Fisher-Yates shuffle, with a slight modification: you run the algorithm from front to back (instead of back to front, as it is often presented), and stop after k iterations. At that point the first k elements in the partially shuffled array are an unbiased sample. (In fact, you don't need the entire vector of K indices, as long as k is much smaller than K. You're only going to touch O(k) of the values, and you can keep the ones you touched in a hash table of size O(k).)
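Here is a minimal Java sketch of that truncated front-to-back Fisher-Yates idea (my own illustration; names and sizes are placeholders):

    import java.util.Random;

    public class PartialShuffleSample {
        // Draw k distinct indices uniformly from 0..K-1 by running a
        // front-to-back Fisher-Yates shuffle and stopping after k swaps.
        static int[] sampleIndices(int K, int k, Random rng) {
            int[] items = new int[K];
            for (int i = 0; i < K; i++) items[i] = i;
            int[] sample = new int[k];
            for (int i = 0; i < k; i++) {
                int j = i + rng.nextInt(K - i); // pick from the not-yet-fixed tail
                int tmp = items[i]; items[i] = items[j]; items[j] = tmp;
                sample[i] = items[i];           // position i is now final
            }
            return sample;
        }

        public static void main(String[] args) {
            for (int idx : sampleIndices(1000, 10, new Random())) {
                System.out.println(idx);
            }
        }
    }

When k is much smaller than K, the full items array can be replaced by a hash map that records only the swapped positions, as noted above.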
And there's an even simpler algorithm, again for the case where the sample is small relative to the dataset: just keep one bit for each item in the dataset, which indicates whether the item has been selected. Now select k items at random, marking the bit vector as you go; if the relevant bit is already marked, then that item is already in the sample; you just ignore that selection and continue with the next random choice. The expected number of ignored selections is very small unless the sample size is a significant fraction of the dataset size.
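And a sketch of that bit-vector variant (again my own illustration), which simply redraws on collisions:

    import java.util.BitSet;
    import java.util.Random;

    public class BitmapSample {
        // Select k distinct items out of N by marking chosen indices in a bit set;
        // repeats are ignored and redrawn, which is cheap when k is much smaller than N.
        static int[] sample(int N, int k, Random rng) {
            BitSet chosen = new BitSet(N);
            int[] result = new int[k];
            int count = 0;
            while (count < k) {
                int candidate = rng.nextInt(N);
                if (!chosen.get(candidate)) {
                    chosen.set(candidate);
                    result[count++] = candidate;
                }
            }
            return result;
        }

        public static void main(String[] args) {
            for (int idx : sample(1_000_000, 5, new Random())) {
                System.out.println(idx);
            }
        }
    }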
There's one other criterion which weighed on the minds of Vitter and Knuth: you'll normally want to do something with the selected sample. And given the amount of time it takes to read through a tape, you want to be able to start processing each item immediately as it is accepted. That precludes algorithms which include, for example, "sort the selected indices and then read the indicated items" (see above). For immediate processing to be possible, you must not depend on being able to "deselect" already selected items.
Fortunately, both of the quick algorithms mentioned above satisfy this requirement. In both cases, an item once selected will never be rejected later.
There is at least one use case for reservoir sampling which is still very much relevant: sampling a datastream which is too voluminous or too high-bandwidth to store. That might be some kind of massive social media feed, or it might be telemetry data from a large sensor array, or whatever. In that case, you might want to reduce the size of the datastream by extracting only a small sample, and reservoir sampling is a good candidate. However, that has nothing to do with the Postgres example.
In summary:
Yes, you can (and probably should) use Vitter's Algorithm Z in preference to Knuth's Algorithm S, even if you know how big the data set is.
But there are certainly better algorithms, some of which are outlined above.

Getting the maximum data an algorithm can process in a certain time span

Meanwhile it's the second time I've gotten an exercise (in this case about sorting algorithms) where I have to determine how many numbers I can sort with a certain algorithm, on my own computer, so that the algorithm runs for exactly one minute.
This is a practical exercise, meaning I actually have to generate enough numbers that it runs that long. Now I ask myself, since I haven't had this problem in ten years of programming: how can I possibly do this? My first attempt was a bit brute-forcy, which resulted in an instant StackOverflowError.
I could make an array (or several) and fill them with random numbers, but determining how many add up to one minute of runtime would be a terribly long task, since you would always have to wait.
What can I do to find this out efficiently? Measure the difference between, say, 10 and 20 numbers and extrapolate how many would fill a minute? Sounds easy, but algorithms (especially sorting algorithms) are rarely linear.
You know the time complexity of each algorithm in question. For example, bubble sort takes O(n*n) time. Make a relatively small sample run of D = 1000 records and measure the time it takes (T milliseconds). For example, say it takes 15 seconds = 15000 milliseconds.
Now, with more or less accuracy, you can expect that D*2 records will be processed 4 times slower. And vice versa: you need about D*sqrt(60000/T) records to process them in 1 minute. For example, you need D*sqrt(60000/15000) = D*sqrt(4) = D*2 = 2000 records.
This method is not accurate enough to get an exact number, and in most cases an exact number of records isn't fixed anyway; it fluctuates from run to run. Also, for many algorithms, the time taken depends on the values in your record set. For example, the worst case for quicksort is O(n*n), while the average case is O(n*log(n)).
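A small Java sketch of that extrapolation (my own illustration; the exponent comes from the algorithm's complexity, e.g. 2 for bubble sort):

    public class SizeEstimator {
        // Estimate how many records fit in targetMillis, assuming the measured
        // run scales like n^exponent.
        static long estimateRecords(long sampleRecords, double sampleMillis,
                                    double targetMillis, double exponent) {
            return Math.round(sampleRecords * Math.pow(targetMillis / sampleMillis, 1.0 / exponent));
        }

        public static void main(String[] args) {
            // Numbers from the answer: 1000 records took 15000 ms with an O(n*n) sort.
            System.out.println(estimateRecords(1000, 15000, 60000, 2)); // prints 2000
        }
    }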
You could use something like this:
    long startTime = System.currentTimeMillis();
    int times = 0;
    boolean done = false;
    while (!done) {
        // run the algorithm once per iteration
        times++;
        if (System.currentTimeMillis() - startTime >= 60000) {
            done = true;
        }
    }
Or, if you don't want to wait that long, you can replace the 60000 with 1000 and then multiply times by 60; it won't be very accurate, though.
It would be time-consuming to generate a new number every time, so you can use an array that you populate beforehand and then index with the times variable, or you can always use the same value, the one you know is the most expensive to process, so that you get the minimum number of times it would run in a minute.

brute force search optimisation

I have a function with the following signature:
int brutesearch(startNumber,endNumber);
This function performs a linear search and returns the matching number if one satisfies my criteria, or null if none is found among the searched numbers.
Say that:
I want to search all 6-digit numbers to find one that does something I want
I can run the brutesearch() function multithreaded
I have a laptop with 4 cores
My question is the following:
What is my best bet for optimising this search? Dividing the number space into 4 segments and running 4 instances of the function, one on each core? Or dividing it into, say, 10 segments and running them all together, or dividing it into 12 segments and running them in batches of 4 using a queue?
Any ideas?
Knowing nothing about your search criteria (there may be other considerations created by the memory subsystem), the tradeoff here is between the cost of having some processors do more work than others (e.g., because the search predicate is faster on some values than others, or because other threads were scheduled) and the cost of coordinating the work. A strategy that's worked well for me is to have a work queue from which threads grab a constant/#threads fraction of the remaining tasks each time, but with only four processors, it's pretty hard to go wrong, though the really big running-time wins are in algorithms.
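By way of illustration, here is a rough Java sketch of that shared-work-queue idea, where each thread repeatedly grabs about 1/#threads of the remaining range (my own illustration; matches() and the 6-digit bounds are placeholders, not the asker's actual brutesearch):

    import java.util.concurrent.atomic.AtomicLong;

    public class WorkQueueSearch {
        static final int LOW = 100_000, HIGH = 999_999; // assumed 6-digit search space
        static final int THREADS = 4;
        static final AtomicLong found = new AtomicLong(-1);
        static int nextStart = LOW;                      // guarded by grabChunk()

        // Placeholder predicate standing in for the real search criterion.
        static boolean matches(int candidate) { return false; }

        // Hand out roughly (remaining / THREADS) numbers per call, so chunks
        // shrink as the work runs out and no thread is left idle for long.
        static synchronized int[] grabChunk() {
            if (nextStart > HIGH) return null;
            int remaining = HIGH - nextStart + 1;
            int size = Math.max(1, remaining / THREADS);
            int[] range = {nextStart, nextStart + size - 1};
            nextStart += size;
            return range;
        }

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[THREADS];
            for (int t = 0; t < THREADS; t++) {
                workers[t] = new Thread(() -> {
                    int[] range;
                    while (found.get() < 0 && (range = grabChunk()) != null) {
                        for (int i = range[0]; i <= range[1] && found.get() < 0; i++) {
                            if (matches(i)) found.set(i);
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.println(found.get() < 0 ? "no match" : "match: " + found.get());
        }
    }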
There is no general answer. You need to give more information.
If each comparison is completely independent of the others, there are no opportunities for saving computation via a global resource (say, no global hash tables involved), and your operations are all done in a single stage,
then your best bet is to just divide your problem space by the number of cores you have available, in this case 4, and send 1/4 of the data to each core.
For example, if you had 10 million unique numbers that you wanted to test for primality, or 10 million passwords you were trying to hash to find a match, then just divide by 4.
If you have a real world problem, then you need to know a lot more about the underlying operations to get a good solution. For example if a global resource is involved, then you won't get any improvement from parallelism unless you isolate the operations on the global resource somehow.
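For the fully independent case, the static split described above is only a few lines of Java; this is a sketch with an invented isMatch() predicate standing in for the primality test or hash comparison:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class StaticSplit {
        // Stand-in for the per-item test (primality check, password hash, ...).
        static boolean isMatch(long value) { return false; }

        public static void main(String[] args) throws Exception {
            long total = 10_000_000L;  // hypothetical number of items to test
            int cores = 4;
            long perCore = total / cores;
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            List<Future<Long>> results = new ArrayList<>();
            for (int c = 0; c < cores; c++) {
                long start = c * perCore;
                long end = (c == cores - 1) ? total : start + perCore; // last core takes the remainder
                results.add(pool.submit(() -> {
                    for (long i = start; i < end; i++) {
                        if (isMatch(i)) return i;
                    }
                    return -1L;
                }));
            }
            for (Future<Long> f : results) {
                long r = f.get();
                if (r >= 0) System.out.println("match: " + r);
            }
            pool.shutdown();
        }
    }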

Eliminating deviants while computing score

I have a relatively simple algorithmic problem where I recommend questions to users.
I have a set of questions with answers (and likes and comments for each answer).
I want to score how engaging each question is.
Current implementation:
(total comments + likes for all answers for a question) / sqrt (number of answers)
Problems:
Sometimes, one answer that has a tonne of activity skews the score for the question, even if the other 20 answers generate very little interest
Some reduction should be applied for questions with very few answers.
Would appreciate any suggestions on how these 2 problems can be mitigated.
Usually, when we want to keep one sample from being too powerful, the standard way to do it is one of these:
use log(N) instead of N, making the effect of each observation less powerful (1)
leave the "strange" observations out: take only the middle X% and use them; for example, take only observations that have between 1/4 and 3/4 of the maximum likes for this question, and leave the skewing examples out.
For the second issue, one thing I can think of is a varying exponent: instead of using sqrt(number of answers), you can try (number_of_answers)^(log(number_of_answers+1)/log(max_answers+1)), where max_answers is the maximal number of answers per question in your data set.
It will result in boosting up questions with few answers, which I think is what you are after.
(1): We usually take log(N+1) - so it will be defined for N==0 as well.
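A short Java sketch combining both ideas, the log damping and the varying exponent (my own illustration; the input format and field names are assumptions):

    public class EngagementScore {
        // Score a question from the (likes + comments) count of each answer.
        // log(activity + 1) damps a single runaway answer, and the variable
        // exponent softens the penalty for questions with few answers.
        static double score(int[] activityPerAnswer, int maxAnswersInDataset) {
            double total = 0;
            for (int a : activityPerAnswer) {
                total += Math.log(a + 1);   // log(N + 1) so that N == 0 is defined
            }
            int answers = activityPerAnswer.length;
            double exponent = Math.log(answers + 1) / Math.log(maxAnswersInDataset + 1);
            return total / Math.pow(answers, exponent);
        }

        public static void main(String[] args) {
            int[] skewed = {500, 1, 2, 0, 1};       // one very active answer
            int[] balanced = {20, 25, 18, 22, 15};  // evenly spread activity
            System.out.println(score(skewed, 50));
            System.out.println(score(balanced, 50));
        }
    }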

How do I find the running time given algorithm speed and computer speed?

I'm currently working through an assignment that deals with Big-O and running times. I have this one question presented to me that seems to be very easy, but I'm not sure if I'm doing it correctly. The rest of the problems have been quite difficult, and I feel like I'm overlooking something here.
First, you have these things:
Algorithm A, which has a running time of 50n^3.
Computer A, which has a speed of 1 millisecond per operation.
Computer B, which has a speed of 2 milliseconds per operation.
An instance of size 300.
I want to find how long it takes algorithm A to solve this instance on computer A, and how long it takes it on computer B.
What I want to do is substitute 300 for n, so you have 50*(300^3) = 1,350,000,000.
Then, multiply that by 1 for the first computer, and by 2 for the second computer.
This feels odd to me, though, because it says the "running time" is 50n^3, not, "the number of operations is 50n^3", so I get the feeling that I'm multiplying time by time, and would end up with units of milliseconds squared, which doesn't seem right at all.
I would like to know if I'm right, and if not, what the question actually means.
It wouldn't make sense if you had O(n^3), but you are not using big-O notation in your example. That is, if you had O(n^3) you would know the complexity of your algorithm, but you would not know the exact number of operations, as you said.
Instead, it looks as though you are given the exact number of operations that are performed (even though it is not explicitly stated). So substituting for n makes sense.
Big-O notation describes how the size of the input affects your running time or memory usage, but with big O you could not deduce an exact running time even given the speed of each operation.
Including an explanation of why your answer looks so simple (as I described above) would also be a safe move. But I'm sure even without it you'll get the marks for the question.
Well, aside from the pointlessness of figuring out how long something will take this way on most modern computers (though it might have some meaning in an embedded system), it does look right to me the way you did it.
If the algorithm needs 50n^3 operations to complete something, where n is the number of elements to process, then substituting 300 for n will give you the number of operations to perform, not a time-unit.
So multiply by the time per operation and you get the time needed.
Looks right to me.
Your 50*n^3 data is called "running time", but that's because the model used for speed evaluations assumes a machine with several basic operations, where each of these takes 1 time unit.
In your case, running the algorithm takes 50*300^3 time units. On computer A each time unit is 1 ms, and on computer B 2 ms.
Hope this puts some sense into the units,
Asaf.
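Running the arithmetic for the instance in the question (with n = 300; a sanity check of my own, not part of the original answers):

    public class RunningTime {
        public static void main(String[] args) {
            long operations = 50L * 300 * 300 * 300;      // 50 * n^3 = 1,350,000,000 operations
            double secondsA = operations * 1 / 1000.0;    // computer A: 1 ms per operation
            double secondsB = operations * 2 / 1000.0;    // computer B: 2 ms per operation
            System.out.printf("Computer A: %.0f s (about %.1f days)%n", secondsA, secondsA / 86400);
            System.out.printf("Computer B: %.0f s (about %.1f days)%n", secondsB, secondsB / 86400);
        }
    }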
