Fastest way to find an unique elements out of given numbers

Fastest way to find an unique elements out of given numbers - algorithm

I have n numbers. n <= 1000000. Each number will be positive integer and less than 10^9.
It is sure that there will be only one number will occur once, rest will occur twice or even number of times.
The shortest solution I know is the result of XOR of all numbers. I want to know
What will be the complexity of the standard XOR solution.
How can we optimize the solution.

XORing all the numbers will be of O(n) complexity, since you'll need to visit each element only once.
I can't think of any way to optimize this further, given your requirements. XOR is a very cheap operation, and the nature of the problem requires you to visit each element at least once: otherwise, you cannot possibly know which value is unique.

The XOR algorithm is the right algorithm and the fastest one. The slow part is the way that you are reading the input.
For instance, scanf in C is slower than handrolling your own number algorithm with getchar (or even better getchar_unlocked). On the SPOJ problem that you mentioned, I got an improvement from 1.35s to 0.14s just by making this change. I'm sure that the remaining 0.04 to get the best time on the site is just due to better low-level IO than my code.

You can go for hashing. A raw solution would be to use the unique number as a key to your hash table. If that is possible, then you can probably use the hashing algorithm. A simple example is to use the numbers as an index in an array. Now, the space will be too much (I mean too too much), but can be optimized further.

Related

Is there a better way to check if a number is part of a certain sequence?

I was wondering, if there is a better way, to check, if a certain number n is part of a specific sequence of natural numbers:
The way i think of now would iteratively/recursively compute the next number of the sequence, until you find the n. And, if a number x, bigger than n, would be computed, n would not be in the sequence, and i would stop the loop.
But this seems to be computationally quite demanding, especially if the given number is huge.
So, i was wondering, if there are fundamentally better ways (maybe another algorithmic approach) to achieve the same thing?

Fast algorithm to determine a negative number in a given array

Please suggest if there is quicker way to find the negative number in a given array, provided that the array has only one negative number. I think sorting is an option, but it will be helpful if there is a quicker way.

Sorting won't be quicker than going through all the elements of the array (because to sort you also have to do that).
The fastest possible thing to do is to go through the all array and stop once you detect one negative number.

Just traverse the array. That is order n. Sorting is at best order n(log n); at worst n2.

Probably the fastest is to just scan the array until you find it.
If you're just doing this once, and don't need the array sorted for other purposes, it'll be faster to scan for the negative number than to do the sort. If, however, you need (or can use) the sorting for other purposes, or you may need to find the negative number several times, then sorting can end up saving time. Likewise, with some programs, spending extra time in preparation to get faster response when really crucial can be justified (but I've no idea whether that applies here or not).

What sorting techniques can I use when comparing elements is expensive?

Problem
I have an application where I want to sort an array a of elements a0, a1,...,an-1. I have a comparison function cmp(i,j) that compares elements ai and aj and a swap function swap(i,j), that swaps elements ai and aj of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than any other steps in the sort (except for other cmp(i,j) calls, of course) together. You may think of cmp(i,j) as a rather lengthy IO operation.
Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.
Questions
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.
Would the existance of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I'd like pointers to further material on this topic.
Example
This is an example that is not entirely unlike the application I have.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterium and then traversing them in order, outputting sequences of equal files that were encountered.
Of course reader in large amounts of data is expensive, therefor one can, for instance, only read the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold. Two large file could only differ in one byte near the end.
The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is neccessary.

I'll try to answer each question as best as I can.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Traditional sorting methods may have some variation, but in general, there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are often not inexpensive. You could try sorting by something else, or try using a shortcut that may be faster that may approximate the real solution.
Would the existance of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
I'd like pointers to further material on this topic.
Check these out:
Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
Wikipedia of course has some insight into the matter.
Sorting an array with minimal number of comparisons
How do I figure out the minimum number of swaps to sort a list in-place?
Limitations of comparison based sorting techniques

The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg (n!), which is about n lg n - n. There's no way to do better than this on average if you're using comparisons to order the elements.
Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.44 n lg n for quicksort and about n lg n + 2n for heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.
If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.
Hope this helps!

A technique called the Schwartzian transform can be used to reduce any sorting problem to that of sorting integers. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.
(Python-oriented answer, when I thought the question was tagged [python])
If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using
sort(L, key=f)
Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.
Python 3 does not support specifying a cmp function, only the key parameter. This page provides a way of easily converting any cmp function to a key function.

Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
Not that I know of, but perhaps you'll find it in these papers below.
I'd like pointers to further material on this topic.
On Optimal and Eﬃcient in Place Merging
Stable Minimum Storage Merging by Symmetric Comparisons
Optimal Stable Merging (this one seems to be O(n log2 n) though
Practical In-Place Mergesort
If you implement any of them, posting them here might be useful for others too! :)

Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Merge insertion algorithm, described in D. Knuth's "The art of computer programming", Vol 3, chapter 5.3.1, uses less comparisons than other comparison-based algorithms. But still it needs O(N log N) comparisons.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I think some of existing sorting algorithms may be modified to take into account expensive(i,j) predicate. Let's take the simplest of them - insertion sort. One of its variants, named in Wikipedia as binary insertion sort, uses only O(N log N) comparisons.
It employs a binary search to determine the correct location to insert new elements. We could apply expensive(i,j) predicate after each binary search step to determine if it is cheap to compare the inserted element with "middle" element found in binary search step. If it is expensive we could try the "middle" element's neighbors, then their neighbors, etc. If no cheap comparisons could be found we just return to the "middle" element and perform expensive comparison.
There are several possible optimizations. If predicate and/or cheap comparisons are not so cheap we could roll back to the "middle" element earlier than all other possibilities are tried. Also if move operations cannot be considered as very cheap, we could use some order statistics data structure (like Indexable skiplist) do reduce insertion cost to O(N log N).
This modified insertion sort needs O(N log N) time for data movement, O(N2) predicate computations and cheap comparisons and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them.
If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on hash value computed for first megabyte of data from each file. If there are more than one file in some bucket, take other 10, 100, 1000, ... megabytes. If still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.

Most sorting algorithm out there try minimize the amount of comparisons during sorting.
My advice:
Pick quick-sort as a base algorithm and memorize results of comparisons just in case you happen to compare the same problems again. This should help you in the O(N^2) worst case of quick-sort. Bear in mind that this will make you use O(N^2) memory.
Now if you are really adventurous you could try the Dual-Pivot quick-sort.

Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.

We can look at your problem in the another direction, Seems your problem is IO related, then you can use advantage of parallel sorting algorithms, In fact you can run many many threads to run comparison on files, then sort them by one of a best known parallel algorithms like Sample sort algorithm.

Quicksort and mergesort are the fastest possible sorting algorithm, unless you have some additional information about the elements you want to sort. They will need O(n log(n)) comparisons, where n is the size of your array.
It is mathematically proved that any generic sorting algorithm cannot be more efficient than that.
If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).
If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.
You can look for all the mentioned algorithms on wikipedia.
As far as I know, you can't benefit from the expensive relationship. Even if you know that, you still need to perform such comparisons. As I said, you'd better try and cache some results.
EDIT I took some time to think about it, and I came up with a slightly customized solution, that I think will make the minimum possible amount of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
n is the size of the input vector
m is the number of distinct component which are easy to compare between each other
k is the maximum number of elements which are hard to compare and have consecutive ranks.
Here is the description of the algorithm. It's worth nothing saying that it will perform much worse than a simple merge sort, unless m is big and k is little. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out from the asymptotic notation. That n^4 can probably be further reduced, at least in the mean case.
EDIT The file I posted contained some errors. While trying it, I also fixed them (I overlooked the pseudocode for insert_sorted function, but the idea was correct. I made a Java program that sorts a vector of integers, with delays added as you described. Even if I was skeptical, it actually does better than mergesort, if the delay is significant (I used 1s delay agains integer comparison, which usually takes nanoseconds to execute)

why is integer factorization a non-polynomial time?

I am just a beginner of computer science. I learned something about running time but I can't be sure what I understood is right. So please help me.
So integer factorization is currently not a polynomial time problem but primality test is. Assume the number to be checked is n. If we run a program just to decide whether every number from 1 to sqrt(n) can divide n, and if the answer is yes, then store the number. I think this program is polynomial time, isn't it?
One possible way that I am wrong would be a factorization program should find all primes, instead of the first prime discovered. So maybe this is the reason why.
However, in public key cryptography, finding a prime factor of a large number is essential to attack the cryptography. Since usually a large number (public key) is only the product of two primes, finding one prime means finding the other. This should be polynomial time. So why is it difficult or impossible to attack?

Casual descriptions of complexity like "polynomial factoring algorithm" generally refer to the complexity with respect to the size of the input, not the interpretation of the input. So when people say "no known polynomial factoring algorithm", they mean there is no known algorithm for factoring N-bit natural numbers that runs in time polynomial with respect to N. Not polynomial with respect to the number itself, which can be up to 2^N.

The difficulty of factorization is one of those beautiful mathematical problems that's simple to understand and takes you immediately to the edge of human knowledge. To summarize (today's) knowledge on the subject: we don't know why it's hard, not with any degree of proof, and the best methods we have run in more than polynomial time (but also significantly less that exponential time). The result that primality testing is even in P is pretty recent; see the linked Wikipedia page.
The best heuristic explanation I know for the difficulty is that primes are randomly distributed. One of the easier-to-understand results is Dirichlet's theorem. This theorem say that every arithmetic progression contains infinitely many primes, in other words, you can think of primes as being dense with respect to progressions, meaning you can't avoid running into them. This is the simplest of a rather large collection of such results; in all of them, primes appear in ways very much analogous to random numbers.
The difficult of factoring is thus analogous to the impossibility of reversing a one-time pad. In a one-time pad, there's a bit we don't know XOR with another one we don't. We get zero information about an individual bit knowing the result of the XOR. Replace "bit" with "prime" and multiplication with XOR, and you have the factoring problem. It's as if you've multiplied two random numbers together, and you get very little information from product (instead of zero information).

If we run a program just to decide whether every number from 1 to sqrt(n) can divide n, and if the answer is yes, then store the number.
Even ignoring that the divisibility test will take longer for bigger numbers, this approach takes almost twice as long if you just add a single (binary) digit to n. (Actually it will take twice as long if you add two digits)
I think that is the definition of exponential runtime: Make n one bit longer, the algorithm takes twice as long.
But note that this observation applies only to the algorithm you proposed. It is still unknown if integer factorization is polynomial or not. The cryptographers sure hope that it is not, but there are also alternative algorithms that do not depend on prime factorization being hard (such as elliptic curve cryptography), just in case...

Is there such a thing as "negative" big-O complexity? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Are there any O(1/n) algorithms?
This just popped in my head for no particular reason, and I suppose it's a strange question. Are there any known algorithms or problems which actually get easier or faster to solve with larger input? I'm guessing that if there are, it wouldn't be for things like mutations or sorting, it would be for decision problems. Perhaps there's some problem where having a ton of input makes it easy to decide something, but I can't imagine what.
If there is no such thing as negative complexity, is there a proof that there cannot be? Or is it just that no one has found it yet?

No that is not possible. Since Big-Oh is suppose to be an approximation of the number of operations an algorithm performs related to its domain size then it would not make sense to describe an algorithm as using a negative number of operations.
The formal definition section of the wikipedia article actually defines the Big-Oh notation in terms of using positive real numbers. So there actually is not even a proof because the whole concept of Big-Oh has no meaning on the negative real numbers per the formal definition.
Short answer: Its not possible because the definition says so.

update
Just to make it clear, I'm answering this part of the question: Are there any known algorithms or problems which actually get easier or faster to solve with larger input?
As noted in accepted answer here, there are no algorithms working faster with bigger input.
Are there any O(1/n) algorithms?
Even an algorithm like sleep(1/n) has to spend time reading its input, so its running time has a lower bound.
In particular, author referes relatively simple substring search algorithm:
http://en.wikipedia.org/wiki/Horspool
PS But using term 'negative complexity' for such algorithms doesn't seem to be reasonable to me.

To think in an algorithm that executes in negative time, is the same as thinking about time going backwards.
If the program starts executing at 10:30 AM and stops at 10:00 AM without passing through 11:00 AM, it has just executed with time = O(-1).
=]
Now, for the mathematical part:
If you can't come up with a sequence of actions that execute backwards in time (you never know...lol), the proof is quite simple:
positiveTime = O(-1) means:
positiveTime <= c * -1, for any C > 0 and n > n0 > 0
Consider the "C > 0" restriction.
We can't find a positive number that multiplied by -1 will result in another positive number.
By taking that in account, this is the result:
positiveTime <= negativeNumber, for any n > n0 > 0
Wich just proves that you can't have an algorithm with O(-1).

Not really. O(1) is the best you can hope for.
The closest I can think of is language translation, which uses large datasets of phrases in the target language to match up smaller snippets from the source language. The larger the dataset, the better (and to a certain extent faster) the translation. But that's still not even O(1).

Well, for many calculations like "given input A return f(A)" you can "cache" calculation results (store them in array or map), which will make calculation faster with larger number of values, IF some of those values repeat.
But I don't think it qualifies as "negative complexity". In this case fastest performance will probably count as O(1), worst case performance will be O(N), and average performance will be somewhere inbetween.
This is somewhat applicable for sorting algorithms - some of them have O(N) best-case scenario complexity and O(N^2) worst case complexity, depending on the state of data to be sorted.
I think that to have negative complexity, algorithm should return result before it has been asked to calculate result. I.e. it should be connected to a time machine and should be able to deal with corresponding "grandfather paradox".

As with the other question about the empty algorithm, this question is a matter of definition rather than a matter of what is possible or impossible. It is certainly possible to think of a cost model for which an algorithm takes O(1/n) time. (That is not negative of course, but rather decreasing with larger input.) The algorithm can do something like sleep(1/n) as one of the other answers suggested. It is true that the cost model breaks down as n is sent to infinity, but n never is sent to infinity; every cost model breaks down eventually anyway. Saying that sleep(1/n) takes O(1/n) time could be very reasonable for an input size ranging from 1 byte to 1 gigabyte. That's a very wide range for any time complexity formula to be applicable.
On the other hand, the simplest, most standard definition of time complexity uses unit time steps. It is impossible for a positive, integer-valued function to have decreasing asymptotics; the smallest it can be is O(1).

I don't know if this quite fits but it reminds me of bittorrent. The more people downloading a file, the faster it goes for all of them

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio