Tries for sorting strings in linear time? - algorithm

I am trying to sort strings alphabetically in linear time and thought about using tries for this. My question is: what is the time complexity of running a pre-order traversal on a trie? Is it O(n)?

You have to be a little careful with the way you measure complexity in this case. A lot of times, people pretend that sorting N strings with a comparison-based sort takes O(N log N) time, but that is not really true in the worst case unless the length of the strings is bounded. It is the expected time if the strings are randomized, however, so it's not a bad approximation for many use cases.
If you want to account for possible long strings with long common prefixes, then you change the meaning of N to refer to the total size of the input, including all the strings. With this new definition, you can sort a list of strings in O(N) time.
Inserting the strings into a trie, or better a radix tree (https://en.wikipedia.org/wiki/Radix_tree) and then doing a preorder traversal is one way, and yes that works in O(N) time, where N is the total size of the input.
But it's faster and easier to do a radix sort: https://en.wikipedia.org/wiki/Radix_sort The Most-Significant-Digit-First variant works best with variable-length inputs.
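The trie approach described above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: `TrieNode`, `collect`, and `trieSort` are hypothetical names, and it assumes a plain trie (not a radix tree) whose children live in a `std::map`, so each character costs O(log of the alphabet size), which is a constant for a fixed alphabet.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type for this sketch.
struct TrieNode {
    std::size_t count = 0;                               // strings ending at this node
    std::map<char, std::unique_ptr<TrieNode>> children;  // iterated in character order
};

// Pre-order walk: emit the strings ending here, then descend in key order.
void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out) {
    for (std::size_t i = 0; i < node.count; ++i) out.push_back(prefix);
    for (const auto& entry : node.children) {
        prefix.push_back(entry.first);
        collect(*entry.second, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> trieSort(const std::vector<std::string>& input) {
    TrieNode root;
    // Insertion touches each input character once.
    for (const std::string& s : input) {
        TrieNode* node = &root;
        for (char ch : s) {
            std::unique_ptr<TrieNode>& child = node->children[ch];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        ++node->count;  // duplicates are kept by counting them
    }
    std::vector<std::string> out;
    out.reserve(input.size());
    std::string prefix;
    collect(root, prefix, out);
    assert(out.size() == input.size());  // every input string is emitted once
    return out;
}
```

Both phases do work proportional to the total number of characters, which is the N used in the answer above. The recursion depth equals the length of the longest string, so very long strings would need an iterative traversal.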

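Radix sort can be applied in this case. The least-significant-digit-first version below makes one counting-sort pass per character position, so it runs in O(n) for n strings when the string length is bounded by a constant. Refer to the following code, implemented in C++: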
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

class RadixSort {
public:
    // Bucket for the character at position index: 0 if the string is too
    // short (so shorter strings sort first), otherwise 1..(upper-lower+1).
    static size_t bucketAt(const string& s, size_t index, char lower) {
        return index < s.length() ? static_cast<size_t>(s[index] - lower) + 1 : 0;
    }

    // Stable counting sort of arr by the character at position index.
    static void countingSort(vector<string>& arr, size_t index, char lower, char upper) {
        vector<size_t> countArray(static_cast<size_t>(upper - lower) + 2, 0);
        vector<string> tempArray(arr.size());
        // count how many strings fall into each bucket
        for (const string& s : arr)
            countArray[bucketAt(s, index, lower)]++;
        // prefix sums: countArray[b] becomes one past the last output slot of bucket b
        for (size_t i = 1; i < countArray.size(); i++)
            countArray[i] += countArray[i - 1];
        // fill from the back so that equal keys keep their relative order
        for (size_t i = arr.size(); i-- > 0;) {
            size_t b = bucketAt(arr[i], index, lower);
            tempArray[--countArray[b]] = arr[i];
        }
        arr.swap(tempArray);
    }

    // LSD radix sort: one counting-sort pass per character position,
    // from the last position of the longest string down to position 0.
    static void radixSort(vector<string>& arr, char lower, char upper) {
        size_t maxLength = 0;
        for (const string& s : arr)
            maxLength = max(maxLength, s.length());
        for (size_t i = maxLength; i-- > 0;)
            countingSort(arr, i, lower, upper);
    }
};

int main() {
    vector<string> arr = {"a", "aa", "aaa", "kinga", "bishoy", "computer", "az"};
    RadixSort::radixSort(arr, 'a', 'z');
    for (const string& s : arr)
        cout << s << " ";
    return 0;
}

No, it is not O(n); it is Omega(k log(k) n).
Without any other restriction, and as I understand your question this is the case, it is just a comparison-based sorting algorithm.
Sorting an array of length k is in Omega(k log(k)),
and doing it n times, without any connection between the runs, leads to
Omega(k log(k) n).
You can read more here:
https://www.geeksforgeeks.org/lower-bound-on-comparison-based-sorting-algorithms/
If you treat k as bounded, because there is no English word longer than 10^1000000 characters (a number probably larger than the number of atoms on Earth), then sorting an array of bounded length is in O(1), and doing it n times leads to O(n).
You get a lot from dealing with infinity, but sometimes you have to pay back...

Related

How to find the complexity of the given algorithm?

Consider the above procedure, which finds the location LOC1 of the largest element and the location LOC2 of the second largest element in an array DATA with n>1 elements. Let C(n) denote the number of comparisons during the execution of the procedure.
So, I was unable to find the following points related to it:
Find C(n) for the best case.
Find C(n) for the worst case.
Find C(n) for the average case for n=4, assuming all arrangements of the given elements in DATA are equally likely.
#include<iostream>
using namespace std;
void findd(int arr[],int n,int loc1,int loc2)
{
int first=arr[0],second=arr[1];
if(first<second)
{
int temp=first;
first=second;
second=temp;
loc2=0,loc1=1;
}
for(int i=2;i<n;i++)
{
if(first<arr[i])
{
second=first;
first=arr[i];
loc2=loc1;
loc1=i;
}
else if(second<arr[i])
{
second=arr[i];
loc2=i;
}
}
cout<<"index of max element "<<loc1+1<<" index of second largest element "<<loc2+1<<"\n";
}
int main()
{
int n;
cin>>n;
int arr[n];
for(int i=0;i<n;i++)
{
cin>>arr[i];
}
findd(arr,n,0,1);
}
According to the code you posted, for n elements:
Worst case: O(n). The loop visits every element once, wherever the maximum is, and each iteration may need both comparisons.
Best case: O(n). Even if the maximum is at the beginning, it still has to be compared with all the other elements in the array.
Average case: O(n).
Space complexity: O(1).
Hope this helps. :)
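To get the exact counts C(n) rather than just the order of growth, the comparisons can be counted directly. The sketch below assumes the procedure in the question matches the posted code (the original procedure is not shown here); `countComparisons` and `totalOverPermutations` are hypothetical helper names. By this count the best case is n-1 comparisons (every new element exceeds the current maximum), the worst case is 2n-3 (both comparisons on every iteration), and for n=4 the 24 arrangements sum to 106 comparisons, an average of 53/12, about 4.42.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts the element comparisons made by the posted procedure.
int countComparisons(const std::vector<int>& data) {
    assert(data.size() > 1);      // the procedure assumes n > 1
    int comparisons = 1;          // the initial test first < second
    int first = data[0], second = data[1];
    if (first < second) std::swap(first, second);
    for (std::size_t i = 2; i < data.size(); ++i) {
        ++comparisons;            // first < data[i]
        if (first < data[i]) {
            second = first;
            first = data[i];
        } else {
            ++comparisons;        // second < data[i]
            if (second < data[i]) second = data[i];
        }
    }
    return comparisons;
}

// Sums the comparison counts over all n! arrangements of 1..n.
int totalOverPermutations(int n) {
    std::vector<int> data(n);
    for (int i = 0; i < n; ++i) data[i] = i + 1;
    int total = 0;
    do {
        total += countComparisons(data);
    } while (std::next_permutation(data.begin(), data.end()));
    return total;
}
```

Dividing `totalOverPermutations(4)` by 24 gives the average for n=4 under the assumption that all arrangements are equally likely.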

Number of Binary Search Trees of a given Height

How can I find the number of BSTs up to a given height h, discarding all BSTs with height greater than h, for a given set of unique numbers?
I have worked out the code using a recursive approach:
static int bst(int h,int n){
if(h==0&&n==0)return 1;
else if(h==0&&n==1)return 1;
else if(h==0&&n>1)return 0;
else if(h>0&&n==0)return 1;
else{
int sum=0;
for(int i=1;i<=n;i++)
sum+=bst(h-1,i-1)*bst(h-1,n-i);
return sum;
}
}
You can speed it up by adding memoization, as @DavidEisenstat suggested in the comments.
You create a memoization table to store the values of already computed results.
In the example, -1 indicates the value has not been computed yet.
Example in c++
long long memo[MAX_H][MAX_N];
long long bst(int h,int n){
if(memo[h][n] == -1){
memo[h][n] = //Compute the value here using recursion
}
return memo[h][n];
}
...
int main(){
memset(memo,-1,sizeof memo);
bst(102,89);
}
The table has O(h*n) entries, and bst is computed only once for each possible pair of n and h. Since computing one entry loops over up to n roots, the total running time is O(h*n^2). Another advantage of this technique is that once the table is filled up, bst will respond in O(1) (for the values in the range of the table).
Be careful to call the function only with h < MAX_H and n < MAX_N. Also keep in mind that memoization is a memory-time tradeoff, meaning your program will run faster, but it will use more memory too.
More info: https://en.wikipedia.org/wiki/Memoization
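For completeness, here is one way the memoized version could look when written out in full. It is a sketch under assumptions: the base cases are taken from the recursive code in the question (a single node has height 0), the table is a `std::vector` sized per call instead of a global array, and `countBst` and `countBstUpToHeight` are hypothetical names. The counts grow like Catalan numbers, so `long long` overflows for larger n.

```cpp
#include <cassert>
#include <vector>

// memo[h][n] == -1 means "not computed yet".
long long countBst(int h, int n, std::vector<std::vector<long long>>& memo) {
    if (n <= 1) return 1;  // the empty tree or a single node fits any height >= 0
    if (h == 0) return 0;  // more than one node cannot fit in height 0
    long long& cached = memo[h][n];
    if (cached != -1) return cached;
    long long sum = 0;
    // Pick each of the n keys as the root; both subtrees get height h-1.
    for (int root = 1; root <= n; ++root)
        sum += countBst(h - 1, root - 1, memo) * countBst(h - 1, n - root, memo);
    cached = sum;
    return cached;
}

// Number of BSTs on n distinct keys with height at most h.
long long countBstUpToHeight(int h, int n) {
    assert(h >= 0 && n >= 0);
    std::vector<std::vector<long long>> memo(h + 1, std::vector<long long>(n + 1, -1));
    return countBst(h, n, memo);
}
```

For example, with 4 keys there are 14 BSTs in total, 8 of which are chains of height 3, so 6 remain when the height is limited to 2.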

Is there a sorting algorithm with a worst case time complexity of n^3?

I'm familiar with other sorting algorithms and the worst I've heard of in polynomial time is insertion sort or bubble sort. Excluding the truly terrible bogosort and those like it, are there any sorting algorithms with a worse polynomial time complexity than n^2?
Here's one, implemented in C#:
public void BadSort<T>(T[] arr) where T : IComparable
{
for (int i = 0; i < arr.Length; i++)
{
var shortest = i;
for (int j = i; j < arr.Length; j++)
{
bool isShortest = true;
for (int k = j + 1; k < arr.Length; k++)
{
if (arr[j].CompareTo(arr[k]) > 0)
{
isShortest = false;
break;
}
}
if(isShortest)
{
shortest = j;
break;
}
}
var tmp = arr[i];
arr[i] = arr[shortest];
arr[shortest] = tmp;
}
}
It's basically a really naive sorting algorithm, coupled with a needlessly-complex method of calculating the index with the minimum value.
The gist is this:
For each index
Find the element from this point forward which
when compared with all other elements after it, ends up being <= all of them.
swap this shortest element with the element at this index
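The innermost loop (with the comparison) will be executed O(n^3) times in the worst case. Note that a descending-sorted input is not that worst case, because the innermost loop then breaks after a single comparison; an input such as 2, 3, ..., n, 1 (ascending, with the minimum moved to the end) forces long scans before each failure. Every iteration of the outer loop will put one more element into the correct place, getting you just a bit closer to being fully sorted.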
If you work hard enough, you could probably find a sorting algorithm with just about any complexity you want. But, as the commenters pointed out, there's really no reason to seek out an algorithm with a worst-case like this. You'll hopefully never run into one in the wild. You really have to try to come up with one this bad.
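To see the cubic growth concretely, the comparisons can be counted. The following is a C++ port of the C# idea above, specialised to `int` for brevity; `badSort` returning the comparison count and the `rotatedAscending` input generator are additions made for this sketch. By my count, the input 2, 3, ..., n, 1 costs (n-1)n(n+1)/6 comparisons, while a descending input only costs a quadratic number because the innermost loop breaks immediately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts arr in place and returns the number of element comparisons made.
std::size_t badSort(std::vector<int>& arr) {
    std::size_t comparisons = 0;
    const std::size_t n = arr.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t smallest = i;
        for (std::size_t j = i; j < n; ++j) {
            // arr[j] is the minimum if nothing after it is smaller.
            bool isSmallest = true;
            for (std::size_t k = j + 1; k < n; ++k) {
                ++comparisons;
                if (arr[j] > arr[k]) {
                    isSmallest = false;
                    break;
                }
            }
            if (isSmallest) {
                smallest = j;
                break;
            }
        }
        std::swap(arr[i], arr[smallest]);
    }
    assert(std::is_sorted(arr.begin(), arr.end()));
    return comparisons;
}

// Hypothetical helper: builds the input 2, 3, ..., n, 1.
std::vector<int> rotatedAscending(int n) {
    std::vector<int> v;
    for (int value = 2; value <= n; ++value) v.push_back(value);
    v.push_back(1);
    return v;
}
```

Calling `badSort` on `rotatedAscending(n)` for growing n and comparing against the same n in descending order shows the gap between the cubic and quadratic cases.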
Here's an example of an elegant algorithm called slowsort, which runs in Ω(n^(log(n)/(2+ɛ))) for any positive ɛ:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.9158&rep=rep1&type=pdf (section 5).
Slow Sort
Returns the sorted vector after performing SlowSort.
It is a sorting algorithm that is of humorous nature and not useful.
It's based on the principle of multiply and surrender, a tongue-in-cheek joke of divide and conquer.
It was published in 1986 by Andrei Broder and Jorge Stolfi in their paper Pessimal Algorithms and Simplexity Analysis.
This algorithm multiplies a single problem into multiple subproblems.
It is interesting because it is provably the least efficient sorting algorithm that can be built asymptotically, with the restriction that such an algorithm, while being slow, must still be working towards a result at all times.
void SlowSort(vector<int> &a, int i, int j)
{
if(i>=j)
return;
int m=i+(j-i)/2;
int temp;
SlowSort(a, i, m);
SlowSort(a, m + 1, j);
if(a[j]<a[m])
{
temp=a[j];
a[j]=a[m];
a[m]=temp;
}
SlowSort(a, i, j - 1);
}

Why should Insertion Sort be used after threshold crossover in Merge Sort

I have read everywhere that for divide-and-conquer sorting algorithms like Merge-Sort and Quicksort, instead of recursing until only a single element is left, it is better to switch to Insertion-Sort when a certain threshold, say 30 elements, is reached. That is fine, but why only Insertion-Sort? Why not Bubble-Sort or Selection-Sort, both of which have similar O(N^2) performance? Insertion-Sort should come in handy only when many elements are pre-sorted (although that advantage should also come with Bubble-Sort), but otherwise, why should it be more efficient than the other two?
And secondly, at this link, in the 2nd answer and its accompanying comments, it says that O(N log N) performs poorly compared to O(N^2) up to a certain N. How come? N^2 should always perform worse than N log N, since N > log N for all N >= 2, right?
If you bail out of each branch of your divide-and-conquer Quicksort when it hits the threshold, your data looks like this:
[the least 30-ish elements, not in order] [the next 30-ish ] ... [last 30-ish]
Insertion sort has the rather pleasing property that you can call it just once on that whole array, and it performs essentially the same as it does if you call it once for each block of 30. So instead of calling it in your loop, you have the option to call it last. This might not be faster, especially since it pulls the whole data through cache an extra time, but depending how the code is structured it might be convenient.
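A rough sketch of that "call it last" arrangement is shown below. It is an illustration under assumptions, not Sedgewick's code: the cutoff of 30 comes from the question, the partition is a simple middle-pivot Hoare-style scheme, and `coarseQuicksort`, `insertionSort`, and `hybridSort` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

const std::ptrdiff_t kCutoff = 30;  // threshold taken from the question

// Partitions recursively but leaves ranges of at most kCutoff elements unsorted.
// Sorts the inclusive range [lo, hi] only "coarsely".
void coarseQuicksort(std::vector<int>& a, std::ptrdiff_t lo, std::ptrdiff_t hi) {
    while (hi - lo + 1 > kCutoff) {
        const int pivot = a[lo + (hi - lo) / 2];
        std::ptrdiff_t i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        // Now everything in [lo, j] is <= pivot and everything in [i, hi] is >= pivot.
        coarseQuicksort(a, lo, j);  // recurse on the left part
        lo = i;                     // loop on the right part
    }
}

// Plain insertion sort over the whole array, using shifts instead of swaps.
void insertionSort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        const int temp = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > temp) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = temp;
    }
}

void hybridSort(std::vector<int>& a) {
    if (a.empty()) return;
    coarseQuicksort(a, 0, static_cast<std::ptrdiff_t>(a.size()) - 1);
    insertionSort(a);  // single final pass; no element is far from its final position
    assert(std::is_sorted(a.begin(), a.end()));
}
```

After `coarseQuicksort`, every element is already inside its final block of at most 30 positions, so the final insertion pass never shifts an element further than one block.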
Neither bubble sort nor selection sort has this property, so I think the answer might quite simply be "convenience". If someone suspects selection sort might be better then the burden of proof lies on them to "prove" that it's faster.
Note that this use of insertion sort also has a drawback -- if you do it this way and there's a bug in your partition code then provided it doesn't lose any elements, just partition them incorrectly, you'll never notice.
Edit: apparently this modification is by Sedgewick, who wrote his PhD on QuickSort in 1975. It was analyzed more recently by Musser (the inventor of Introsort). Reference https://en.wikipedia.org/wiki/Introsort
Musser also considered the effect on caches of Sedgewick's delayed
small sorting, where small ranges are sorted at the end in a single
pass of insertion sort. He reported that it could double the number of
cache misses, but that its performance with double-ended queues was
significantly better and should be retained for template libraries, in
part because the gain in other cases from doing the sorts immediately
was not great.
In any case, I don't think the general advice is "whatever you do, don't use selection sort". The advice is, "insertion sort beats Quicksort for inputs up to a surprisingly non-tiny size", and this is pretty easy to prove to yourself when you're implementing a Quicksort. If you come up with another sort that demonstrably beats insertion sort on the same small arrays, none of those academic sources is telling you not to use it. I suppose the surprise is that the advice is consistently towards insertion sort, rather than each source choosing its own favorite (introductory teachers have a frankly astonishing fondness for bubble sort -- I wouldn't mind if I never hear of it again). Insertion sort is generally thought of as "the right answer" for small data. The issue isn't whether it "should be" fast, it's whether it actually is or not, and I've never particularly noticed any benchmarks dispelling this idea.
One place to look for such data would be in the development and adoption of Timsort. I'm pretty sure Tim Peters chose insertion for a reason: he wasn't offering general advice, he was optimizing a library for real use.
Insertion sort is faster in practice than bubble sort, at least. Their asymptotic running time is the same, but insertion sort has better constants (fewer/cheaper operations per iteration). Most notably, it requires only a linear number of swaps of pairs of elements, and in each inner loop it performs comparisons between each of n/2 elements and a "fixed" element that can be stored in a register (while bubble sort has to read values from memory). I.e. insertion sort does less work in its inner loop than bubble sort.
The answer claims that 10000 n lg n > 10 n² for "reasonable" n. This is true up to about 14000 elements.
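The crossover can be checked numerically. Dividing both sides by 10n reduces the inequality to 1000 lg n > n, and a brute-force scan finds the last n for which it holds. The constants 10000 and 10 are the ones quoted above; the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Returns the largest n in [2, limit] with 10000 * n * lg(n) > 10 * n^2,
// or 0 if there is none.
long largestNWhereNLogNIsSlower(long limit) {
    assert(limit >= 2);
    long last = 0;
    for (long n = 2; n <= limit; ++n) {
        const double x = static_cast<double>(n);
        const double nLogN = 10000.0 * x * std::log2(x);
        const double nSquared = 10.0 * x * x;
        if (nLogN > nSquared) last = n;
    }
    return last;
}
```

Scanning up to 100000 should land a little below 14000, consistent with the figure quoted above.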
I am surprised no-one's mentioned the simple fact that insertion sort is simply much faster for "almost" sorted data. That's the reason it's used.
The easier one first: why insertion sort over selection sort? Because insertion sort is in O(n) for optimal input sequences, i.e. if the sequence is already sorted. Selection sort is always in O(n^2).
Why insertion sort over bubble sort? Both need only a single pass for already sorted input sequences, but insertion sort degrades better. To be more specific, insertion sort usually performs better with a small number of inversions than bubble sort does (source). This can be explained because bubble sort always iterates over N-i elements in pass i, while insertion sort works more like "find" and only needs to iterate over (N-i)/2 elements on average (in pass N-i-1) to find the insertion position. So, insertion sort is expected to be about two times faster than bubble sort on average.
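The best-case difference between insertion sort and selection sort can be made concrete by counting comparisons. The sketch below uses textbook versions of both algorithms with a counter added; the function names are hypothetical. On an already sorted array of 30 elements, insertion sort makes 29 comparisons while selection sort makes 435, the same as it makes on any other input of that size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts a copy with insertion sort and returns the number of comparisons.
std::size_t insertionSortComparisons(std::vector<int> a) {
    std::size_t comparisons = 0;
    for (std::size_t i = 1; i < a.size(); ++i) {
        const int temp = a[i];
        std::size_t j = i;
        while (j > 0) {
            ++comparisons;
            if (a[j - 1] <= temp) break;  // insertion position found
            a[j] = a[j - 1];
            --j;
        }
        a[j] = temp;
    }
    assert(std::is_sorted(a.begin(), a.end()));
    return comparisons;
}

// Sorts a copy with selection sort and returns the number of comparisons.
std::size_t selectionSortComparisons(std::vector<int> a) {
    std::size_t comparisons = 0;
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        std::size_t minIndex = i;
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            ++comparisons;
            if (a[j] < a[minIndex]) minIndex = j;
        }
        std::swap(a[i], a[minIndex]);
    }
    assert(std::is_sorted(a.begin(), a.end()));
    return comparisons;
}
```

On reversed input of 30 elements insertion sort also reaches 435 comparisons, so its advantage comes from inputs that are partly in order.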
Here is an empirical proof that insertion sort is faster than bubble sort (for 30 elements, on my machine, with the attached implementation, using Java).
I ran the attached code and found that bubble sort took 6338.515 ns on average, while insertion sort took 3601.0 ns.
I used the Wilcoxon signed-rank test to check the probability that this is a mistake and they should actually be the same, but the result is below the range of numerical error (effectively P_VALUE ~= 0).
private static void swap(int[] arr, int i, int j) {
int temp = arr[i];
arr[i] = arr[j];
arr[j] = temp;
}
public static void insertionSort(int[] arr) {
for (int i = 1; i < arr.length; i++) {
int j = i;
while (j > 0 && arr[j-1] > arr[j]) {
swap(arr, j, j-1);
j--;
}
}
}
public static void bubbleSort(int[] arr) {
for (int i = 0 ; i < arr.length; i++) {
boolean bool = false;
for (int j = 0; j < arr.length - i ; j++) {
if (j + 1 < arr.length && arr[j] > arr[j+1]) {
bool = true;
swap(arr,j,j+1);
}
}
if (!bool) break;
}
}
public static void main(String... args) throws Exception {
Random r = new Random(1);
int SIZE = 30;
int N = 1000;
int[] arr = new int[SIZE];
int[] millisBubble = new int[N];
int[] millisInsertion = new int[N];
System.out.println("start");
//warm up:
for (int t = 0; t < 100; t++) {
insertionSort(arr);
}
for (int t = 0; t < N; t++) {
arr = generateRandom(r, SIZE);
int[] tempArr = Arrays.copyOf(arr, arr.length);
long start = System.nanoTime();
insertionSort(tempArr);
millisInsertion[t] = (int)(System.nanoTime()-start);
tempArr = Arrays.copyOf(arr, arr.length);
start = System.nanoTime();
bubbleSort(tempArr);
millisBubble[t] = (int)(System.nanoTime()-start);
}
int sum1 = 0;
for (int x : millisBubble) {
System.out.println(x);
sum1 += x;
}
System.out.println("end of bubble. AVG = " + ((double)sum1)/millisBubble.length);
int sum2 = 0;
for (int x : millisInsertion) {
System.out.println(x);
sum2 += x;
}
System.out.println("end of insertion. AVG = " + ((double)sum2)/millisInsertion.length);
System.out.println("bubble took " + ((double)sum1)/millisBubble.length + " while insertion took " + ((double)sum2)/millisInsertion.length);
}
private static int[] generateRandom(Random r, int size) {
int[] arr = new int[size];
for (int i = 0 ; i < size; i++)
arr[i] = r.nextInt(size);
return arr;
}
EDIT:
(1) Optimizing the bubble sort (updated above) reduced the time bubble sort takes to 6043.806, which is not enough to make a significant change. The Wilcoxon test is still conclusive: insertion sort is faster.
(2) I also added a selection sort test (code attached) and compared it against insertion. The results are: selection took 4748.35 while insertion took 3540.114.
P_VALUE for wilcoxon is still below the range of numerical error (effectively ~=0)
code for selection sort used:
public static void selectionSort(int[] arr) {
for (int i = 0; i < arr.length ; i++) {
int min = arr[i];
int minElm = i;
for (int j = i+1; j < arr.length ; j++) {
if (arr[j] < min) {
min = arr[j];
minElm = j;
}
}
swap(arr,i,minElm);
}
}
EDIT: As IVlad points out in a comment, selection sort does only n swaps (and therefore only 3n writes) for any dataset, so insertion sort is very unlikely to beat it on account of doing fewer swaps -- but it will likely do substantially fewer comparisons. The reasoning below better fits a comparison with bubble sort, which will do a similar number of comparisons but many more swaps (and thus many more writes) on average.
One reason why insertion sort tends to be faster than the other O(n^2) algorithms like bubble sort and selection sort is because in the latter algorithms, every single data movement requires a swap, which can be up to 3 times as many memory copies as are necessary if the other end of the swap needs to be swapped again later.
With insertion sort OTOH, if the next element to be inserted isn't already the largest element, it can be saved into a temporary location, and all lower elements shunted forward by starting from the right and using single data copies (i.e. without swaps). This opens up a gap to put the original element.
C code for insertion-sorting integers without using swaps:
void insertion_sort(int *v, int n) {
int i = 1;
while (i < n) {
int temp = v[i]; // Save the current element here
int j = i;
// Shunt everything forwards
while (j > 0 && v[j - 1] > temp) {
v[j] = v[j - 1]; // Look ma, no swaps! :)
--j;
}
v[j] = temp;
++i;
}
}

order of complexity of the algorithm in O notation

Can anyone tell me order of complexity of below algorithm? This algorithm is to do following:
Given an unsorted array of integers with duplicate numbers, write the most efficient code to print out unique values in the array.
I would also like to know
What are some pros and cons in the context of hardware usage of this implementation
private static void IsArrayDuplicated(int[] a)
{
int size = a.Length;
BitArray b = new BitArray(a.Max()+1);
for ( int i = 0; i < size; i++)
{
b.Set(a[i], true);
}
for (int i = 0; i < b.Count; i++)
{
if (b.Get(i))
{
System.Console.WriteLine(i.ToString());
}
}
Console.ReadLine();
}
You have two for loops, one of length a.Length and one of length (if I understand the code correctly) a.Max() + 1. So your algorithmic complexity is O(a.Length + a.Max())
The complexity of the algorithm is linear.
Finding the maximum is linear.
Setting the bits is linear.
However the algorithm is also wrong,
unless your integers can be assumed to be positive.
It also has a problem with large integers - do you really
want to allocate MAX_INT/8 bytes of memory?
The name, btw, makes me cringe. IsXYZ() should always return a bool.
I'd say, try again.
Correction - pavpanchekha has the correct answer.
O(n) is probably only possible for a finite/small domain of integers (think of bucket sort). The HashMap approach is strictly speaking not O(n) but O(n^2), since worst-case insertion into a hash map is O(n) and NOT constant.
How about sorting the list in O(n log(n)) and then going through it and printing each value once, skipping the duplicates? This results in O(n log(n)), which is probably the true complexity of the problem.
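The sort-then-scan idea could look like this in C++. It is a minimal sketch using `std::sort` and `std::unique` from the standard library; `uniqueValues` is a hypothetical name, and unlike the bit-array version in the question it handles negative and very large integers without extra memory proportional to the maximum value.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the distinct values of the input in ascending order, in O(n log n).
std::vector<int> uniqueValues(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    // std::unique moves one copy of each run of equal neighbours to the front
    // and returns the new logical end; erase drops the leftover tail.
    values.erase(std::unique(values.begin(), values.end()), values.end());
    assert(std::is_sorted(values.begin(), values.end()));
    return values;
}
```

Printing the returned vector element by element then gives each value exactly once.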
HashSet<int> mySet = new HashSet<int>( new int[] {-1, 0, -2, 2, 10, 2, 10});
foreach(var item in mySet)
{
Console.WriteLine(item);
}
// HashSet guarantees unique values without exception
You have two loops, each based on the size of n. I agree with whaley, but that should give you a good start on it.
O(n) on a.length
Complexity of your algorithm is O(N), but algorithm is not correct.
If numbers are negative it will not work
In case of large numbers you will have problems with memory
I suggest you to use this approach:
private static void IsArrayDuplicated(int[] a) {
int size = a.length;
Set<Integer> b = new HashSet<Integer>();
for (int i = 0; i < size; i++) {
b.add(a[i]);
}
Integer[] T = b.toArray(new Integer[0]);
for (int i = 0; i < T.length; i++) {
System.out.println(T[i]);
}
}

Resources