What data structure to use with arbitrarily large integer numbers? [duplicate]

Possible Duplicate:
What data-structure should I use to create my own “BigInteger” class?
Out of pure interest, I am trying to design a type that can hold an arbitrarily large integer. I want to support four basic operations [+, -, *, /] and optimise for speed of those operations.
I was thinking about some sort of a doubly-linked list and a bit flag to indicate positive or negative value. But I am not really sure how to add, for example, two large numbers of different sizes. Should I walk to the last element of both numbers and then walk back (using the reverse pointer to the previous element)?
123456789 //one large number
+ 123 //another large number with different size
Providing I can have an arbitrarily large memory, what is the best data structure for this task?
I would appreciate a small hint and any comments on worst-case complexity of the arithmetic operations. Thanks!

Usually one would go for an array/vector in this case, stored little-endian (least-significant word first). If you implement in-place operations and grow the array by a constant factor, the amortized cost of reallocation stays O(1).
Addition and subtraction are doable in O(n) run time, where n is the size of the input. EDIT: No, of course, multiplication and division will need more; the best known bounds are on the order of O(n log n) at least.
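As a hedged sketch of the vector layout (only the addition of two non-negative values; the BigNat name and the choice of 32-bit limbs with a 64-bit carry are illustrative assumptions, not a full design):
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: digits stored little-endian (least-significant limb first).
using BigNat = std::vector<std::uint32_t>;

// Adds two non-negative numbers in O(max(n, m)) time.
BigNat add(const BigNat& a, const BigNat& b) {
    BigNat result;
    result.reserve(std::max(a.size(), b.size()) + 1);
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size(); ++i) {
        std::uint64_t sum = carry;
        if (i < a.size()) sum += a[i];
        if (i < b.size()) sum += b[i];
        result.push_back(static_cast<std::uint32_t>(sum));  // low 32 bits become the digit
        carry = sum >> 32;                                   // high bits become the carry
    }
    if (carry) result.push_back(static_cast<std::uint32_t>(carry));
    return result;
}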
Just out of curiosity: why are you reinventing the wheel? Java ships an implementation (java.math.BigInteger), and C# has one with .NET 4.0, too. While it might be a good exercise to implement this yourself (I remember doing it once), if you just need the functionality then it's already there in many computing environments.

Related

Is there an efficient way to iterate over an unsorted container in a specific order without sorting/copying/referencing the original container?

What I have in mind is a SortedIterator, which accepts a Less function defining the order, just as the well-known sorting algorithms would use it to sort the container.
The brute force implementation would of course either keep a copy of the original elements, or keep references/pointers to the elements in the original list.
Is there an efficient way to iterate in a well-defined order without actually sorting the list? I'm asking this out of algorithmic curiosity, and I expect the answer to be no (or yes with a big but). It is asked out of a C++ style mindset, but is in fact a quite general language-agnostic premise.
If you want O(1) memory, the O(n^2) approach is the only one we know of. Otherwise we could improve the selection-sort algorithm the same way. Any other sorting mechanism relies on being able to restructure part of the array (merge sort relies on sorting parts of the array, quicksort relies on splitting the array around a pivot, and so on).
Now if you relax the memory constraint you can do something a bit more efficient. For example, you could keep a heap containing the x lowest values. After one pass, O(N log x), you get x elements for your iterator. For the next pass, restrict attention to elements greater than the last element you emitted. You'll need N/x passes to get all of them. If x == 1 the solution is O(N^2); if x == N the solution is O(N log N) (but with a larger constant than a typical quicksort). If the data is on disk, I would set x to about as much RAM as you can spare, minus a few MB so you can still read large chunks from the drive.
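A rough sketch of one such pass in C++, assuming int elements and treating x as a tuning parameter (all names are made up for illustration):
#include <algorithm>
#include <cstddef>
#include <queue>
#include <vector>

// One pass of the scheme described above: return, in ascending order, the next
// `x` smallest values strictly greater than `last_emitted`.
// Cost of the pass: O(N log x) time, O(x) extra memory.
// Note: duplicates of last_emitted are skipped, as in the description above.
std::vector<int> next_chunk(const std::vector<int>& data, int last_emitted, std::size_t x) {
    std::priority_queue<int> heap;                 // max-heap of the x smallest candidates
    for (int v : data) {
        if (v <= last_emitted) continue;           // already covered by a previous pass
        if (heap.size() < x) {
            heap.push(v);
        } else if (v < heap.top()) {
            heap.pop();
            heap.push(v);
        }
    }
    std::vector<int> chunk;
    while (!heap.empty()) { chunk.push_back(heap.top()); heap.pop(); }
    std::reverse(chunk.begin(), chunk.end());      // the heap pops largest-first
    return chunk;
}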

Preferred Sorting For People Based On Their Age

Suppose we have 1 million entries of an object 'Person' with two fields 'Name', 'Age'. The problem was to sort the entries based on the 'Age' of the person.
I was asked this question in an interview. I answered that we could store the objects in an array and use quicksort, since that would save us from using additional space, but the interviewer said that memory was not a factor.
My question is what would be the factor that would decide which sort to use?
Also what would be the preferred way to store this?
In this scenario does any sorting algorithm have an advantage over another sorting algorithm and would result in a better complexity?
This Stackoverflow link may be useful to you.
The answers above are sufficient, but I would like to add some more information from that link; I am copying parts of the answers there over here.
We should note that even if the fields in the object are very big (i.e. long names) you do not need to use a file-system sort; you can use an in-memory sort, because key (age) + a pointer to the struct require only 8 bytes on a 32-bit system, so 10^8 elements * 8 bytes ~= 762 MB (most modern systems have enough memory for that).
It is important to minimize disk accesses, because disks are not random access and disk accesses are MUCH slower than RAM accesses.
Now, use a sort of your choice on that - and avoid using disk for the sorting process.
Some possible sorts (in RAM) for this case are:
Standard quicksort or merge sort (which you had already thought of)
Bucket sort can also be applied here, since the range is limited to [0, 150] (others here have suggested the same idea under the name counting sort)
Radix sort (for the same reason; radix sort will need ceil(log_2(150)) ~= 8 passes)
I wanted to point out the memory aspect in case you encounter the same question but need to answer it taking memory constraints into consideration. In fact your constraints are even smaller (10^6 elements compared to the 10^8 in the other question).
As for the matter of storing it -
The quickest way to sort it would be to allocate 151 linked lists/vectors (call them buckets or whatever you prefer, depending on the language) and put each person's data structure into the bucket matching his/her age (all ages are between 0 and 150):
bucket[person->age].add(person)
As others have pointed out Bucket Sort is going to be the better option for you.
In fact the beauty of bucket sort is that if you have to perform any operation on ranges of ages (like 10-50 years of age) you can partition your buckets according to your requirements (like giving each bucket a different age range).
Again, I have copied this information from the answers in the link given above, but I believe it will be useful to you.
If the array has n elements, then quicksort (or, actually, any comparison-based sort) is Ω(n log(n)).
Here, though, it looks like you have an alternative to comparison-based sorting, since you need to sort only on age. Suppose there are m distinct ages. In this case, counting sort will be Θ(m + n). For the specifics of your question, assuming that age is in years, m is much smaller than n, and you can do this in linear time.
The implementation is trivial. Simply create an array of, say, 200 entries (200 being an upper bound on the age). The array is of linked lists. Scan over the people, and place each person in the linked list in the appropriate entry. Now, just concatenate the lists according to the positions in the array.
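A minimal sketch of that idea (the Person fields and the bound of 200 come from the discussion above; using vectors instead of linked lists is a simplification of mine):
#include <string>
#include <vector>

struct Person {
    std::string name;
    int age;   // assumed to lie in [0, 199]
};

// Bucket/counting sort on age: O(n + m), where m = 200 possible ages.
std::vector<Person> sort_by_age(const std::vector<Person>& people) {
    std::vector<std::vector<Person>> buckets(200);
    for (const Person& p : people)
        buckets[p.age].push_back(p);           // place each person in the bucket for their age
    std::vector<Person> sorted;
    sorted.reserve(people.size());
    for (const auto& bucket : buckets)         // concatenate buckets in age order
        sorted.insert(sorted.end(), bucket.begin(), bucket.end());
    return sorted;                             // stable: original order preserved within each age
}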
Different sorting algorithms perform at different complexities, yes. Some use different amounts of space. And in practice, real performance with the same complexity varies too.
http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html
There are different ways to set up a quicksort's partition step that could matter for ages. Shell sort can use different gap sequences that perform better for certain kinds of input. But maybe your interviewer was more interested in whether you would notice that 1 million people implies a lot of duplicate ages, which might point you towards a 3-way quicksort or, as suggested in the comments, a counting sort.
This is an interview question, so I guess the interviewee's reasoning matters more than naming the one correct sorting algorithm. Your problem is sorting an array of objects whose key, age, is an integer. Age has some special properties:
integer: there are sorting algorithms designed specifically for integers.
finite: you know the maximum age of a person; say it is 200.
I will list some sorting algorithms for this problem, with the advantages and disadvantages that matter in an interview:
Quicksort: complexity is O(N log N) and it applies to any data set. Quicksort is the fastest sort that uses a comparison operator between two elements. Its biggest disadvantage is that it isn't stable: two objects equal in age do not keep their relative order after sorting.
Merge sort: complexity is O(N log N). A little slower than quicksort, but it is a stable sort, and it too applies to any data set.
Radix sort: complexity is O(w*n), where n is the size of your list and w is the maximum number of digits in your data set. For example: 12 has 2 digits, 154 has 3. So if the maximum age is 99, the complexity is O(2*n). This algorithm only applies to integers or strings.
Counting sort: complexity is O(m+n), where n is the size of your list and m is the number of distinct ages. This algorithm only applies to integers.
Because we are sorting a million entries whose values are integers in the range 0..200, there are tons of duplicate values, so counting sort is the best fit, with complexity O(200 + N) where N ~= 1,000,000; the 200 is negligible.
If you assume that there is a finite number of different age values (usually people are not older than 100) then you could use
counting sort (https://en.wikipedia.org/wiki/Counting_sort). You would be able to sort in linear time.

Performance(memory and speed wise) for a STL vector+sort+equality vs. unordered_set vs. using pure set

I have the following scenario:
I have a bunch of elements that do not need to be kept in any particular order.
I insert all the elements only once, during initialization.
I need to perform a containerA == containerB operation.
The number of elements N can be at most 100, but for average-case analysis let's say N can be 100, 10k or 100k.
Given my requirements, std::set is not a good option. I can insert all the elements into a vector with push_back (N * O(1)), std::sort them (O(N log N)) and do an equality comparison (N); a total of roughly 2N + N log N, which would beat std::set on both memory and speed easily.
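As a minimal sketch, the vector approach described above amounts to something like this (int stands in for the actual element type):
#include <algorithm>
#include <vector>

// Sort copies of both containers, then compare element-wise:
// N inserts + O(N log N) sort + O(N) equality check.
bool same_elements(std::vector<int> a, std::vector<int> b) {   // taken by value on purpose
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;                                             // element-wise comparison, O(N)
}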
The topic is already well reviewed here:
http://lafstern.org/matt/col1.pdf
and here:
What is the difference between std::set and std::vector?
Let's move on to what happens if I use the newer unordered_set. The insertion (N * O(1)) plus the equality lookup (N in the average case) for N elements totals roughly 2N.
Now, for unordered_set I need to write a hasher, which is not easy in my case, and I am guessing the hashing alone will push this past 2N for my complicated data structure.
However, why then, for a simple unique_ptr value insertion, would someone get the following performance results:
http://kohei.us/2010/03/31/stl-container-performance-on-data-insertion/
It seems vector sort + equality still works better than unordered_set up to a large number of elements (100k). unordered_set does not use a red-black tree, right? So where is this performance hit coming from?
A slightly relevant post is here:
Performance of vector sort/unique/erase vs. copy to unordered_set
If your elements have a simple ordering function, and you know that they are distinct, then you will always be better off putting them in a vector and sorting them. In theory, a hash-table based solution with a good hash function could make the comparison O(n) rather than O(n log n), but there are a number of mitigating facts:
log n is a small number. If n is two thousand million, for example, log n is 31 (using binary logs, which is usually what is implied).
A standard library unordered collection requires an allocation for each element. This is effectively required by the specification, because adding elements to an unordered collection does not invalidate references to existing elements, unlike the case with a standard library vector.
Iteration over an unordered collection is done per bucket (again, this is in the specification), with the result that the iteration involves random memory access. Iterating over a vector is sequential, which is much more cache-friendly.
In short, even though the sort is O(n log n), it is highly likely that the O(n) hash-based solution has a large per-element constant, and since log n is a small number, the vector-based solution will be faster. Often much faster.
How much slower a hash-based solution will be depends on the speed of the allocator, and there is considerable variation between different standard library implementations. But even a super-rapid allocator is unlikely to give you competitive performance, and the cache-unfriendliness of the hash table will become important when your tables grow sufficiently large.
Even if you have some duplicate elements, you might be better off with the vector, but that will depend on how many duplicates you have. Since the hash table is likely to occupy at least twice as much memory as the vector with the same number of elements, a simple rule of thumb might be to use vectors as long as you don't expect the number of elements to be more than twice the number of unique elements. (It is easy to eliminate duplicates after sorting. There is a standard library function which will do that.)
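The standard library function alluded to is presumably std::unique; a typical sort-then-deduplicate sequence looks like this (sketch with int elements):
#include <algorithm>
#include <vector>

// Sort, then drop adjacent duplicates: O(n log n) + O(n).
void sort_and_dedupe(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}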

What sorting techniques can I use when comparing elements is expensive?

Problem
I have an application where I want to sort an array a of elements a_0, a_1, ..., a_(n-1). I have a comparison function cmp(i,j) that compares elements a_i and a_j, and a swap function swap(i,j) that swaps elements a_i and a_j of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than all the other steps in the sort (except other cmp(i,j) calls, of course) combined. You may think of cmp(i,j) as a rather lengthy IO operation.
Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.
Questions
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparison operations? If yes, can you point me to such an algorithm?
I'd like pointers to further material on this topic.
Example
This is an example that is not entirely unlike the application I have.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.
Of course, reading in large amounts of data is expensive, therefore one can, for instance, only read the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold: two large files could differ only in one byte near the end.
The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is necessary.
I'll try to answer each question as best as I can.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Traditional sorting methods may have some variation, but in general there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are rarely cheap. You could try sorting by something else, or try using a shortcut that approximates the real solution and may be faster.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparison operations? If yes, can you point me to such an algorithm?
I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
I'd like pointers to further material on this topic.
Check these out:
Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
Wikipedia of course has some insight into the matter.
Sorting an array with minimal number of comparisons
How do I figure out the minimum number of swaps to sort a list in-place?
Limitations of comparison based sorting techniques
The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg(n!), which is about n lg n - 1.44n. There's no way to do better than this on average if you're using comparisons to order the elements.
Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.44 n lg n for quicksort and roughly 2 n lg n for standard heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.
If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.
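If you want to see how many comparisons each candidate actually makes on your data before committing, one cheap trick is to wrap the expensive comparator in a counting adapter; a sketch (the CountingLess name and the MyExpensiveLess comparator in the usage comment are made up):
#include <algorithm>
#include <cstddef>
#include <vector>

// Wraps a comparator and counts how often it is invoked, so different sorting
// strategies can be measured on real data before committing to one.
template <typename Less>
struct CountingLess {
    Less less;
    std::size_t* calls;   // shared counter; the comparator gets copied by the sort
    template <typename T>
    bool operator()(const T& a, const T& b) const {
        ++*calls;
        return less(a, b);
    }
};

// Usage sketch (MyExpensiveLess is a placeholder for the real comparator):
// std::size_t calls = 0;
// std::stable_sort(v.begin(), v.end(),
//                  CountingLess<MyExpensiveLess>{MyExpensiveLess{}, &calls});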
Hope this helps!
A technique called the Schwartzian transform can be used to reduce the sorting problem to that of sorting the precomputed keys, which are often simple integers. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.
(Python-oriented answer, when I thought the question was tagged [python])
If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using
L.sort(key=f) (or sorted(L, key=f) to get a new list)
Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.
Python 3 does not support specifying a cmp function, only the key parameter; the standard library's functools.cmp_to_key provides a way of easily converting any cmp function to a key function.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
Not that I know of, but perhaps you'll find it in these papers below.
I'd like pointers to further material on this topic.
On Optimal and Efficient in Place Merging
Stable Minimum Storage Merging by Symmetric Comparisons
Optimal Stable Merging (this one seems to be O(n log^2 n), though)
Practical In-Place Mergesort
If you implement any of them, posting them here might be useful for others too! :)
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
The merge insertion algorithm, described in D. Knuth's "The Art of Computer Programming", Vol. 3, chapter 5.3.1, uses fewer comparisons than other comparison-based algorithms. But it still needs O(N log N) comparisons.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I think some existing sorting algorithms may be modified to take the expensive(i,j) predicate into account. Let's take the simplest of them, insertion sort. One of its variants, named binary insertion sort in Wikipedia, uses only O(N log N) comparisons.
It employs a binary search to determine the correct location to insert new elements. We could apply the expensive(i,j) predicate after each binary search step to determine whether it is cheap to compare the inserted element with the "middle" element found in that step. If it is expensive we could try the "middle" element's neighbors, then their neighbors, and so on. If no cheap comparison can be found we just return to the "middle" element and perform the expensive comparison.
There are several possible optimizations. If the predicate and/or the cheap comparisons are not that cheap, we could roll back to the "middle" element earlier, before all other possibilities have been tried. Also, if move operations cannot be considered very cheap, we could use some order-statistics data structure (like an indexable skiplist) to reduce the insertion cost to O(N log N).
This modified insertion sort needs O(N log N) time for data movement, O(N^2) predicate computations and cheap comparisons, and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
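For reference, a plain binary insertion sort on which the modification described above could be layered might look like this; where the expensive(i,j) logic would hook in is only indicated by a comment:
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Binary insertion sort: O(N log N) comparisons, O(N^2) element moves.
// The modification described above would replace `cmp` inside the binary search
// with logic that consults expensive(i, j) and prefers cheap neighbours first.
template <typename T, typename Cmp>
void binary_insertion_sort(std::vector<T>& v, Cmp cmp) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        T value = std::move(v[i]);
        // Binary search for the insertion point among v[0..i).
        auto pos = std::upper_bound(v.begin(), v.begin() + i, value, cmp);
        // Shift the tail right by one slot and drop the value into place.
        std::move_backward(pos, v.begin() + i, v.begin() + i + 1);
        *pos = std::move(value);
    }
}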
Consider a set of possibly large files. In this application the goal is to find duplicate files among them.
If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on a hash value computed from the first megabyte of data of each file. If there is more than one file in some bucket, take another 10, 100, 1000, ... megabytes. If there is still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.
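A sketch of that first bucketing step (the function names are illustrative; std::hash is used only because it is convenient for bucketing, not because it is a particularly good file hash):
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Read the first `n_bytes` of the file and hash them.
std::size_t hash_of_prefix(const std::string& path, std::size_t n_bytes) {
    std::ifstream in(path, std::ios::binary);
    std::string prefix(n_bytes, '\0');
    in.read(&prefix[0], static_cast<std::streamsize>(n_bytes));
    prefix.resize(static_cast<std::size_t>(in.gcount()));   // short files yield fewer bytes
    return std::hash<std::string>{}(prefix);
}

// Group files by the hash of their first megabyte; only files in the same
// bucket can be duplicates and need the expensive deep comparison.
std::unordered_map<std::size_t, std::vector<std::string>>
bucket_by_prefix_hash(const std::vector<std::string>& paths) {
    std::unordered_map<std::size_t, std::vector<std::string>> buckets;
    for (const std::string& path : paths)
        buckets[hash_of_prefix(path, 1 << 20)].push_back(path);
    return buckets;
}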
Most sorting algorithms out there try to minimize the number of comparisons during sorting.
My advice:
Pick quicksort as a base algorithm and memoize the results of comparisons in case you happen to compare the same elements again. This should help you in the O(N^2) worst case of quicksort. Bear in mind that this can make you use O(N^2) memory.
Now if you are really adventurous you could try the Dual-Pivot quick-sort.
Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.
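A minimal sketch of such a memoized comparator, keyed on element indices and assuming (as stated above) that the result of cmp(i, j) never changes:
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>

// Caches the result of an expensive three-way comparison between elements
// identified by their indices. Assumes cmp(i, j) never changes over time.
class MemoizedCmp {
public:
    explicit MemoizedCmp(std::function<int(std::size_t, std::size_t)> cmp)
        : cmp_(std::move(cmp)) {}

    int operator()(std::size_t i, std::size_t j) {
        if (i == j) return 0;
        std::pair<std::size_t, std::size_t> key = std::minmax(i, j);  // store each pair once
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, cmp_(key.first, key.second)).first;
        int result = it->second;
        return (i < j) ? result : -result;          // flip the sign for the reversed order
    }

private:
    std::function<int(std::size_t, std::size_t)> cmp_;
    std::map<std::pair<std::size_t, std::size_t>, int> cache_;  // up to O(N^2) entries
};
Handing this wrapper to the sort instead of the raw cmp is the only change needed; the price, as noted above, is up to O(N^2) cached results.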
We can look at your problem from another direction: it seems your problem is IO-related, so you can take advantage of parallel sorting algorithms. In fact you can run many, many threads to perform the comparisons on files, and then sort them with one of the best-known parallel algorithms, such as sample sort.
Quicksort and mergesort are the fastest possible sorting algorithms, unless you have some additional information about the elements you want to sort. They need O(n log(n)) comparisons, where n is the size of your array.
It is mathematically proven that no generic comparison-based sorting algorithm can be asymptotically more efficient than that.
If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).
If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.
You can look for all the mentioned algorithms on wikipedia.
As far as I know, you can't benefit from the expensive relationship. Even if you know that, you still need to perform such comparisons. As I said, you'd better try and cache some results.
EDIT: I took some time to think about it, and I came up with a slightly customized solution that I think will make the minimum possible number of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
n is the size of the input vector
m is the number of distinct components which are easy to compare with each other
k is the maximum number of elements which are hard to compare and have consecutive ranks.
Here is the description of the algorithm. It's worth noting that it will perform much worse than a simple merge sort unless m is big and k is little. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out of the asymptotic notation; that n^4 can probably be further reduced, at least in the mean case).
EDIT: The file I posted contained some errors; while trying it out, I also fixed them (I had overlooked the pseudocode for the insert_sorted function, but the idea was correct). I made a Java program that sorts a vector of integers, with delays added as you described. Even though I was skeptical, it actually does better than mergesort if the delay is significant (I used a 1 s delay against integer comparisons, which usually take nanoseconds to execute).

is selection sort faster than insertion for big arrays? [duplicate]

Possible Duplicate:
Efficency of Insertion Sort vs Bubble sort vs Selection sort?
Is selection sort faster than insertion sort for big arrays, and in the worst case?
I know insertion sort is usually faster than selection sort, but what about large arrays and the worst case?
The size of the array involved is rarely of much consequence.
The real question is the speed of comparison vs. copying. The time a selection sort will win is when a comparison is a lot faster than copying. Just for example, let's assume two fields: a single int as a key, and another megabyte of data attached to it.
In such a case, comparisons involve only that single int, so it's really fast, but copying involves the entire megabyte, so it's almost certainly quite a bit slower.
Since the selection sort does a lot of comparisons, but relatively few copies, this sort of situation will favor it. The insertion sort does a lot more copies, so in a situation like this, the slower copies will slow it down quite a bit.
As far as the worst case for an insertion sort, it'll be pretty much the opposite: anything where copying is fast but comparison is slow. There are a few more cases that favor insertion sort as well, such as when the elements are only slightly scrambled, so each is still within a short distance of its final sorted location.
If the data doesn't provide a solid indication in either direction, chances are pretty decent that insertion sort will work out better.
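To make the comparison-vs-copy trade-off concrete, here is a toy sketch with a tiny key and a large payload (the Record type is made up for illustration); selection sort does many cheap key comparisons but only O(n) expensive swaps:
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

// Toy element from the example above: a tiny key plus a large payload,
// so comparisons are cheap and copies/swaps are expensive.
struct Record {
    int key;
    std::array<char, 1 << 20> payload;   // ~1 MB dragged along by every swap
};

// Selection sort: ~n^2/2 (cheap) comparisons but only O(n) (expensive) swaps.
void selection_sort(std::vector<Record>& v) {
    for (std::size_t i = 0; i + 1 < v.size(); ++i) {
        std::size_t min = i;
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[j].key < v[min].key) min = j;      // cheap: compares ints only
        if (min != i) std::swap(v[i], v[min]);       // expensive: moves the payload
    }
}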
According to the Wikipedia article, "In general, insertion sort will write to the array O(n^2) times, whereas selection sort will write only O(n) times. For this reason selection sort may be preferable in cases where writing to memory is significantly more expensive than reading, such as with EEPROM or flash memory."
That's going to be true regardless of the array size. In fact, the difference will be more pronounced as the arrays get larger.
Insertion sort, if well implemented, uses memmove() to shift the other values. So it depends on the processor, the cache speed (first-level and second-level cache), and the cache size, when one algorithm becomes faster than the other.
I remember an implementation (was it Java?) where one algorithm was used when the number of elements did not exceed a specific hard-coded threshold, and another algorithm otherwise.
So, you simply have to measure it.
Big-O notation is a bit misleading for small and medium arrays, since O(N) really means c * N, and the constant factor c influences the total execution time.
