Time complexity of built-in 'sort' in Clojure - sorting

I want to know the complexity (Big-O notation) of the built-in sort function that the Clojure programming language provides. I searched for it on the ClojureDocs page but did not find anything about it.
Thanks in advance.

The sort built-in actually calls java.util.Arrays.sort:
(defn sort
  "Returns a sorted sequence of the items in coll. If no comparator is
  supplied, uses compare. comparator must implement
  java.util.Comparator. Guaranteed to be stable: equal elements will
  not be reordered. If coll is a Java array, it will be modified. To
  avoid this, sort a copy of the array."
  {:added "1.0"
   :static true}
  ([coll]
   (sort compare coll))
  ([^java.util.Comparator comp coll]
   (if (seq coll)
     (let [a (to-array coll)]
       (. java.util.Arrays (sort a comp))
       (seq a))
     ())))
The Java sort on generic Object values has the following comment (emphasis added):
Implementation note: This implementation is a stable, adaptive, iterative mergesort that requires far fewer than n lg(n) comparisons when the input array is partially sorted, while offering the performance of a traditional mergesort when the input array is randomly ordered. If the input array is nearly sorted, the implementation requires approximately n comparisons. Temporary storage requirements vary from a small constant for nearly sorted input arrays to n/2 object references for randomly ordered input arrays.
The implementation takes equal advantage of ascending and descending order in its input array, and can take advantage of ascending and descending order in different parts of the same input array. It is well-suited to merging two or more sorted arrays: simply concatenate the arrays and sort the resulting array.
The implementation was adapted from Tim Peters's list sort for Python (TimSort). It uses techniques from Peter McIlroy's "Optimistic Sorting and Information Theoretic Complexity", in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp 467-474, January 1993.
However, there are other overloads for primitive types, such as int[], for which the following holds:
The sorting algorithm is a Dual-Pivot Quicksort by Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch. This algorithm offers O(n log(n)) performance on many data sets that cause other quicksorts to degrade to quadratic performance, and is typically faster than traditional (one-pivot) Quicksort implementations.
The overall complexity also depends on the cost of the comparison function you supply.
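As a rough illustration (mine, not from the answer): the comparator is invoked O(n log n) times, so its per-call cost multiplies the total. A hedged Java sketch:

// (In Java)
import java.util.Arrays;
import java.util.Comparator;

public class ComparatorCost {
    public static void main(String[] args) {
        String[] words = {"pear", "apple", "fig", "banana"};

        // Natural (lexicographic) order: each comparison is cheap for short strings.
        Arrays.sort(words, Comparator.naturalOrder());

        // A comparator doing O(k) work per call (k = string length, here summing
        // the characters): the total cost grows to roughly O(k * n log n).
        Arrays.sort(words, Comparator.comparingInt((String w) -> w.chars().sum()));

        System.out.println(Arrays.toString(words));
    }
}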

Looking at the source, you can see that it just delegates to java.util.Arrays.sort:
(defn sort
  ([coll]
   (sort compare coll))
  ([^java.util.Comparator comp coll]
   (if (seq coll)
     (let [a (to-array coll)]
       (. java.util.Arrays (sort a comp)) ; Here
       (seq a))
     ())))
Apparently java.util.Arrays.sort uses Timsort, which according to Wikipedia has a worst-case runtime of O(n log n).
To verify, following the ctrl+B rabbit hole in IntelliJ, I eventually end up at the sort with the signature:
public static <T> void sort(T[] a, Comparator<? super T> c)
And in the body of that function is this bit:
...
if (LegacyMergeSort.userRequested)
    legacyMergeSort(a, c);
else
    TimSort.sort(a, 0, a.length, c, null, 0, 0);
...
So it does appear to use a Timsort unless the user has requested it to use a legacy Merge Sort implementation instead.

Related

Is there a data structure with O(1) insertion time which also maintains sorted order?

Hash tables can insert in O(1), but they aren't sorted.
BSTs maintain an ordering (the elements can be read out in sorted order with an in-order traversal), but insertion is O(log N).
Is there any data structure that:
guarantees O(1) insertion
maintains an ordering over the elements?
If not, is there a proof that such a data structure cannot exist?
Thanks
In the general case, such a data structure is impossible because you could use it to sort a sequence with O(n) comparison operations, simply by inserting the sequence elements into it one-by-one. It is straightforward to show that Ω(n log n) is a lower bound for the number of comparisons required to sort a list, so insertion into a data structure which "maintains sorted order" must take at least Ω(log n) time per element.
Here I am assuming that it is possible to iterate over the data structure in order to output a sorted list, without doing any additional comparisons. If additional comparisons are required simply to iterate over the data structure's contents, then it is fair to say the data structure doesn't "maintain" sorted order internally. If you don't mind it taking O(n log n) time to iterate over a collection of size n, then you could just use an unsorted list as your data structure, with O(1) insertion time.
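To make the reduction concrete, here is a hedged Java sketch (mine, not from the answer; the OrderedSet interface is hypothetical, not an existing API): if such a structure existed, the routine below would sort any list with only O(n) comparison work, contradicting the lower bound.

// (In Java)
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface (not a real API): comparison-based structure with a
// claimed O(1) insert and a comparison-free in-order iterator.
interface OrderedSet<T extends Comparable<T>> extends Iterable<T> {
    void insert(T value); // claimed O(1)
}

class ImpossibleSort {
    // If such a structure existed, this would sort with only O(n) comparison
    // work, contradicting the Omega(n log n) lower bound for comparison sorts.
    static <T extends Comparable<T>> List<T> sortWith(List<T> input, OrderedSet<T> set) {
        for (T x : input) set.insert(x);   // n inserts, O(1) each (by assumption)
        List<T> out = new ArrayList<>();
        for (T x : set) out.add(x);        // comparison-free in-order walk
        return out;
    }
}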
You can trade more space for this O(1). If the value domain is integral, say from 0 to N-1, then you could keep a frequency array:
// (In Java)
int[] valueToFrequency = new int[N];

void insert(int value) {
    ++valueToFrequency[value];
}
For a domain of N possible values this costs O(N) space.
The difficulty (besides the value mapping and 0-based indexing) is that the array is sparse: many zeroes.
Hence the output is slow.
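To illustrate why the output is slow (my sketch, not part of the original answer, reusing the valueToFrequency array from above): emitting the sorted contents has to scan the whole domain of size N, even when only a few values were ever inserted.

// (In Java)
void printSorted(int[] valueToFrequency) {
    // Walk the entire domain; most slots are zero but must still be visited.
    for (int value = 0; value < valueToFrequency.length; value++) {
        for (int k = 0; k < valueToFrequency[value]; k++) {
            System.out.println(value);
        }
    }
}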
For unique values (as the "hash table" comparison suggests), one can use a BitSet in Java. This is also a bit faster on output (nextSetBit).
// (In Java)
BitSet valueToFrequency = new BitSet(N);

void insert(int value) {
    valueToFrequency.set(value);
}
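A hedged sketch (mine) of why the BitSet output is faster: nextSetBit jumps from one set bit to the next instead of inspecting every slot of the domain.

// (In Java)
void printSortedUnique(BitSet values) {
    // Visit only the bits that are set, in ascending order.
    for (int v = values.nextSetBit(0); v >= 0; v = values.nextSetBit(v + 1)) {
        System.out.println(v);
    }
}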

Efficient algorithm to determine if two sets of numbers are disjoint

Practicing for software developer interviews and got stuck on an algorithm question.
Given two sets of unsorted integers, one in an array of length m and the other in an
array of length n, where m < n, find an efficient algorithm to determine if
the sets are disjoint. I've found solutions in O(nm) time, but haven't
found any that are more efficient, such as O(n log m).
Using a data structure with O(1) lookup/insertion, you can easily insert all elements of the first set.
Then, for each element of the second set, check whether it is present: if it is, the sets are not disjoint; if no element of the second set is found, they are disjoint.
Pseudocode:
function isDisjoint(list1, list2)
    seen = new HashMap();
    foreach (x in list1)
        seen.put(x, true);
    foreach (y in list2)
        if (seen.hasKey(y))
            return false;
    return true;
This will give you an O(n + m) solution
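A hedged Java version of the same idea (mine, not from the answer; a HashSet suffices since only membership matters):

// (In Java)
import java.util.HashSet;
import java.util.Set;

public class Disjoint {
    // O(m) expected to insert the first set, O(n) expected to probe the second.
    static boolean isDisjoint(int[] first, int[] second) {
        Set<Integer> seen = new HashSet<>();
        for (int x : first) seen.add(x);
        for (int y : second) {
            if (seen.contains(y)) return false; // common element found
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isDisjoint(new int[]{1, 2, 3}, new int[]{4, 5})); // true
        System.out.println(isDisjoint(new int[]{1, 2, 3}, new int[]{3, 7})); // false
    }
}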
Fairly obvious approach - sort the array of length m - O(m log m).
For every element in the array of length n, use binary search to check if it exists in the array of length m: O(log m) per element, O(n log m) in total. Since m < n, the overall O(m log m + n log m) cost is O(n log m).
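A hedged Java sketch of this approach (mine), using Arrays.sort and Arrays.binarySearch, and sorting a copy so the input is left untouched:

// (In Java)
import java.util.Arrays;

// Sort the smaller array (O(m log m)), then binary-search each element of
// the larger array in it (O(n log m)). Total: O((m + n) log m).
static boolean isDisjoint(int[] smaller, int[] larger) {
    int[] sorted = smaller.clone(); // copy so the caller's array is not modified
    Arrays.sort(sorted);
    for (int y : larger) {
        if (Arrays.binarySearch(sorted, y) >= 0) return false;
    }
    return true;
}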
Here's a link to a post that I think answers your question.
3) Sort the smaller array: O((m + n) log m)
Say m < n; sort A.
Binary search for each element of B in A.
Disadvantage: modifies the input.
Looks like Cheruvian beat me to it, but you can use a hash table to get O(n+m) in average case:
* Insert all elements of the first set (size m) into the table, taking (probably) constant time for each, assuming there aren't a lot with the same hash. This step is O(m).
* For each element of the second set (size n), check whether it is in the table. If it is, return false. Otherwise, move on to the next. This takes O(n).
* If none are in the table, return true.
As I said before, this works because a hash table gives constant lookup time in the average case. In the rare event that many elements of the first set share the same hash, it will take slightly longer. However, most people don't need to care about hypothetical worst cases. For example, quicksort is used more than merge sort because it gives better average performance, despite its O(n^2) upper bound.

Sorting algorithm for sorted vectors of "moving" values

This question is related to Which sort algorithm works best on mostly sorted data?
The difference is that I have another very important restriction: the values change by small amounts after every sort.
This means that the vector stays almost sorted and the displaced values are nearly in their correct positions. After running some tests, it seems the same answer applies to my case.
Do you know other algorithms that may be better in this case?
Consider timsort or smoothsort. These are designed with mostly-sorted data in mind.
If updates are frequent, maintaining an index structure (e.g. a binary search tree) is perhaps a better choice than sorting the vector over and over again.
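A hedged sketch (mine) of that index-structure idea, assuming duplicate values are possible, so a TreeMap from value to count stands in for a sorted multiset: when a value drifts, remove the old key and insert the new one in O(log n) instead of re-sorting everything.

// (In Java)
import java.util.Map;
import java.util.TreeMap;

public class DriftingValues {
    // value -> number of elements currently holding that value
    private final TreeMap<Double, Integer> counts = new TreeMap<>();

    void insert(double value) {
        counts.merge(value, 1, Integer::sum);
    }

    // O(log n) per changed element, instead of re-sorting the whole vector.
    void update(double oldValue, double newValue) {
        counts.computeIfPresent(oldValue, (k, c) -> c == 1 ? null : c - 1);
        counts.merge(newValue, 1, Integer::sum);
    }

    // Sorted traversal: the TreeMap iterates its keys in ascending order.
    void printSorted() {
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) System.out.println(e.getKey());
        }
    }
}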
Insertion sort and bubble sort both have linear best-case complexity on an input that is already sorted (which is optimal, since the values change continuously and you have to look at each element of the input vector anyway), and they are stable (which seems to be a useful property given your problem description).
Compare all the pairs a[i] <= a[i+1]. Whenever this is false, move the second element of the pair to a new array.
Sort the new array (merge sort, heapsort, or any other O(n log n) algorithm), and then merge the new and old arrays again.
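A hedged Java sketch of that idea (mine; it compares each element against the last element kept rather than its direct predecessor, so the kept part is guaranteed to stay sorted):

// (In Java)
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: pull out the elements that break ascending order, sort only those
// (k of them), and merge the two sorted sequences back in O(n + k log k).
static List<Integer> partialResort(List<Integer> a) {
    List<Integer> inPlace = new ArrayList<>();
    List<Integer> displaced = new ArrayList<>();
    for (int x : a) {
        if (inPlace.isEmpty() || inPlace.get(inPlace.size() - 1) <= x) {
            inPlace.add(x);          // still in ascending order
        } else {
            displaced.add(x);        // out of place: set it aside
        }
    }
    Collections.sort(displaced);     // k log k, small if few values moved

    // Standard merge of two sorted lists.
    List<Integer> out = new ArrayList<>(a.size());
    int i = 0, j = 0;
    while (i < inPlace.size() && j < displaced.size()) {
        out.add(inPlace.get(i) <= displaced.get(j) ? inPlace.get(i++) : displaced.get(j++));
    }
    while (i < inPlace.size()) out.add(inPlace.get(i++));
    while (j < displaced.size()) out.add(displaced.get(j++));
    return out;
}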
How about this:
Def CheckedMergeSort(L)
    Count = 0
    S(1) = 0
    For I in 2 to |L|
        If (L(I-1) < L(I))
            Count = Count + 1
        S(I) = Count

    Def MergeSort(A, B)
        If (A != B and S(B)-S(A) != B-A)
            C = (B + A) / 2
            MergeSort(A, C)
            MergeSort(C+1, B)
            InplaceMerge(L(A..C), L(C+1..B))

    MergeSort(1, |L|)
A linear-time pre-pass over the input fills in S(i), which keeps track of how many adjacent pairs before position i are in sorted order.
Then, by subtracting two prefix counts S(j) - S(i) and comparing the difference to j - i, we can determine whether any subsequence L(i..j) is in sorted order.
The merge sort can then skip, in constant time, any sorted subsequences it encounters during its recursion.
(For example, if the array is already sorted on entry, then MergeSort(1, |L|) becomes a no-op.)

How to find the largest array size for which insertion sort beats merge sort?

From the Wikipedia page on insertion sort:
Some divide-and-conquer algorithms such as quicksort and mergesort sort by recursively dividing the list into smaller sublists which are
then sorted. A useful optimization in practice for these algorithms is
to use insertion sort for sorting small sublists, where insertion sort
outperforms these more complex algorithms. The size of list for which
insertion sort has the advantage varies by environment and
implementation, but is typically between eight and twenty elements.
The quote from the wiki gives one reason: the small sublists produced by merge sort are not a worst case for insertion sort.
I want to set that reason aside.
I know that if the array size is small, insertion sort, at O(n^2), has a chance to beat merge sort, at O(n log n).
I think (though I'm not sure) this is related to the constants in T(n):
Insertion sort: T(n) = c1*n^2 + c2*n + c3
Merge sort: T(n) = n log n + c*n
Now my question is: on the same machine, in the same (worst) case, how do I find the largest array size for which insertion sort beats merge sort?
It's simple:
Take a set of sample arrays to sort, and iterate over a value k, where k is the cutoff point at which you switch from merge sort to insertion sort.
Then run something like:
for (int k = 1; k < MAX_TEST_VALUE; k++) {
    System.out.println("Results for k = " + k);
    for (int[] array : arraysToTest) {
        long then = System.currentTimeMillis();
        mergeSort(array.clone(), k); // pass k to your merge sort; clone so every run sorts the same unsorted data
        long now = System.currentTimeMillis();
        System.out.println(now - then);
    }
}
For what it's worth, the java.util.Arrays class has this to say on the matter in its internal documentation:
/**
 * Tuning parameter: list size at or below which insertion sort will be
 * used in preference to mergesort or quicksort.
 */
private static final int INSERTIONSORT_THRESHOLD = 7;

/**
 * Src is the source array that starts at index 0
 * Dest is the (possibly larger) array destination with a possible offset
 * low is the index in dest to start sorting
 * high is the end index in dest to end sorting
 * off is the offset to generate corresponding low, high in src
 */
private static void mergeSort(Object[] src,
                              Object[] dest,
                              int low,
                              int high,
                              int off) {
    int length = high - low;

    // Insertion sort on smallest arrays
    if (length < INSERTIONSORT_THRESHOLD) {
        for (int i = low; i < high; i++)
            for (int j = i; j > low &&
                     ((Comparable) dest[j-1]).compareTo(dest[j]) > 0; j--)
                swap(dest, j, j-1);
        return;
    }
For the primitive overloads it also uses 7, although not via that named constant.
Insertion sort usually beats merge sort for sorted (or almost sorted) lists of any size.
So the question "how to find the largest array size for which insertion sort beats merge sort?" does not really have a single answer.
Edit:
Just to get the downvoters off my back:
The question could be rephrased as:
"How to determine the largest array size for which, on average, insertion sort beats merge sort." This is usually measured empirically, by generating samples of small arrays and running implementations of both algorithms on them. glowcoder does that in his answer.
"What is the largest array size for which insertion sort, in the worst case, performs better than merge sort." This can be answered approximately by a simple calculation: in the worst case insertion sort performs about n*(n-1)/2 comparisons and element moves, while merge sort always performs about n*log n cell copies from one array to another. Since the break-even point is a relatively small number, it doesn't even make much sense to dwell on it.
Typically, that's done by testing with arrays of varying size. When n == 10, insertion sort is almost certainly faster. When n == 100, probably not. Test, test, test, until your results converge.
I suppose it's possible to determine the number strictly through analysis, but to do so you'd have to know exactly what code is generated by the compiler, including instruction timings, and take into account things like the cost of cache misses. All things considered, the easiest way is to derive it empirically.
Okay, so we are talking about the largest array length for which insertion sort beats merge sort. Yes, for small inputs insertion sort beats merge sort, because merge sort carries extra overhead such as its auxiliary space. Giving exact numbers is difficult because it requires experiments, and it also varies from language to language. In the lecture referenced below, merge sort implemented in Python only starts beating insertion sort implemented in C once n crosses roughly 4000 (see https://youtu.be/Kg4bqzAqRBM, around 43:00). We can reason about the threshold asymptotically, but exact figures have to be measured.
P.S.: Watch the video and most of your doubts should be cleared up (https://youtu.be/Kg4bqzAqRBM).
Also read about using insertion sort inside merge sort when the sub-arrays become sufficiently small (see Introduction to Algorithms by Cormen et al., Chapter 2, Problem 2-1). You can easily find the PDF on Google.

Tricky linked list problem

Given three lists A, B and C, each of length n: if any three numbers (one from each list) sum to zero, return true. I want to solve this with O(n) complexity. I have sorted the lists, and I can think of either building a hash map from the sums of pairs drawn from two of the lists, or comparing all three lists together, which is O(n*n*n). Can you suggest ways to improve these methods and reduce the complexity? I can't think of any. Thanks in advance.
The lists are sorted, right? Build a sorted array C' out of C in O(n) time.
For each of the n² pairs x, y in A × B, check if -(x + y) is in C' with binary search. Total time complexity is O(n² lg n), space complexity is O(n).
Building a hash table out of C brings the time complexity down further to O(n²), at the expense of belief in O(1) hash tables.
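A hedged Java sketch of the hash-table variant (mine, assuming the three lists have been copied into int arrays):

// (In Java)
import java.util.HashSet;
import java.util.Set;

// Expected O(n^2): hash every element of C, then test each pair (x, y) from A x B.
static boolean hasZeroTriple(int[] a, int[] b, int[] c) {
    Set<Integer> inC = new HashSet<>();
    for (int z : c) inC.add(z);
    for (int x : a) {
        for (int y : b) {
            if (inC.contains(-(x + y))) return true; // x + y + z == 0 for some z in C
        }
    }
    return false;
}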
I do not think it is possible in o(n²) (i.e. strictly better than n²), but it can be done in O(n²) (i.e. on the order of n²) as follows:
First of all, reverse list B to obtain B' (takes O(n) time), a list whose items are sorted in descending order. We first consider the sub-problem of finding two elements, one from A and one from B', that sum to a given number t.
We can do this as follows (Python code):
def getIndicesFromTwoArrays(A, B, t):
    # A is sorted ascending, B descending; classic two-pointer scan.
    a, b = 0, 0
    while a < len(A) and b < len(B):
        s = A[a] + B[b]
        if s == t:
            return (a, b)
        if s < t:
            a = a + 1   # need a larger contribution from A
        else:
            b = b + 1   # need a smaller contribution from B
    return (-1, -1)
Run time of the above is O(n). Additional space required is O(1) since we only have to store two pointers. Note that the above can be easily transformed such that it works with doubly linked lists.
Then, overall we just have to do the following:
def test(A, B, C):
    B.reverse()                # B is now sorted in descending order
    for c in C:
        # look for a pair from A and B summing to -c, so that the triple sums to zero
        if getIndicesFromTwoArrays(A, B, -c) != (-1, -1):
            return True
    return False
That results in running time O(n²) and additional space O(1).
No O(n) algorithm for this is known: the problem is essentially the 3SUM problem, which is conjectured to require roughly quadratic time. Check out the Wikipedia page on 3SUM for the standard O(n²) solutions.
