Sorting a list of names lexicographically - performance

Say you are given a list of names, S = {s1, s2 ... sn} and you want to sort them lexicographically.
How would you guarantee that the running time of the sort is O(the total sum of the lengths of all the words)?
Any useful techniques?

One simple solution would be to use MSD radix sort, assuming a constant-size alphabet. Replace "digit" by "character" while reading the algorithm description. You will also need to set aside strings shorter than i characters when you are processing character position i, otherwise you won't get the desired runtime.
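A minimal sketch of that idea, assuming a lowercase 'a'-'z' alphabet (this follows the textbook MSD formulation; the alphabet size R and the charAt mapping are assumptions you would adapt to your data):

    public class MsdStringSort {
        private static final int R = 26;                     // assumed alphabet: 'a'..'z'

        // Returns -1 once the string has run out of characters at position d,
        // so shorter strings land in a front bucket and are excluded from deeper
        // recursion -- this keeps the total work linear in the sum of the lengths.
        private static int charAt(String s, int d) {
            return d < s.length() ? s.charAt(d) - 'a' : -1;
        }

        public static void sort(String[] a) {
            sort(a, 0, a.length - 1, 0, new String[a.length]);
        }

        private static void sort(String[] a, int lo, int hi, int d, String[] aux) {
            if (hi <= lo) return;
            int[] count = new int[R + 2];                    // +2: one slot to shift counts, one for the -1 bucket
            for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++;
            for (int r = 0; r < R + 1; r++) count[r + 1] += count[r];       // counts -> bucket start offsets
            for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i];
            for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
            for (int r = 0; r < R; r++)                      // recurse per character, skipping the "ended" bucket
                sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1, aux);
        }
    }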

Related

Average case nlogn Nuts and Bolts matching

I have to make an algorithm that matches items from two arrays. We are not allowed to sort either array first; we can only compare an item from array 1 against an item from array 2 (the comparisons being <, =, >). The output is the two lists arranged in matching order. I can think of ways to solve it in n(n+1)/2 comparisons. The goal is n log(n). I have been banging my head against a wall trying to think of a way, but I can't. Can anyone give me a hint?
So, to explain: the input is two arrays, e.g. A = [1,3,6,2,5,4] and B = [4,2,3,5,1,6], and the output is the two arrays rearranged into the same order. You cannot sort the arrays individually first or compare items within the same array. You can only compare items across the lists, like so: A_1 < B_1, A_2 = B_3, A_4 < B_3.
Similar to quicksort:
Use a random A-element to partition B into smaller-B, equal-B and larger-B. Use its equal B-element to partition A. Recursively match smaller-A with smaller-B as well as larger-A with larger-B.
Just like quicksort, the expected time is O(n log n) and the worst case is O(n^2).
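A rough sketch of that scheme, with the two arrays represented as int[] purely for illustration and assuming each A-element has exactly one matching B-element (the only cross-array operation used is the three-way comparison):

    import java.util.Random;

    public class NutsAndBolts {
        private static final Random RNG = new Random();

        // After match(a, b, 0, n - 1), a[i] pairs with b[i] for every i.
        public static void match(int[] a, int[] b, int lo, int hi) {
            if (lo >= hi) return;
            int r = lo + RNG.nextInt(hi - lo + 1);
            int p = partition(b, lo, hi, a[r]);          // partition B around a random A-element
            partition(a, lo, hi, b[p]);                  // partition A around the matching B-element
            match(a, b, lo, p - 1);                      // smaller-A with smaller-B
            match(a, b, p + 1, hi);                      // larger-A with larger-B
        }

        // Partitions arr[lo..hi] around an external pivot value that occurs exactly once;
        // returns the final index of the element equal to the pivot.
        private static int partition(int[] arr, int lo, int hi, int pivot) {
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (arr[j] < pivot) {
                    swap(arr, i++, j);
                } else if (arr[j] == pivot) {
                    swap(arr, j--, hi);                  // park the match at the end, re-examine position j
                }
            }
            swap(arr, i, hi);                            // move the matching element into place
            return i;
        }

        private static void swap(int[] arr, int i, int j) {
            int t = arr[i]; arr[i] = arr[j]; arr[j] = t;
        }
    }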

How does sorting each string in an array of strings and then sorting that array come out to be O(a*s(log a + log s))?

In the image I do not understand where the extra O(s) comes from when the array of strings is being sorted. I get that sorting the array of strings will be O(a log a); I can't understand why we have to add in the O(s) as well.
In my mind, O(a log a) has taken care of sorting all of the strings in the array of strings.
Got stuck on the same example! Remember that it optimally takes n log n time to comparison-sort an array of n characters. For the final sort, if each string in the array had length 1 we would again just be sorting a characters, which gives the a log a term; however, the worst-case length of each string is s, so each of those a log a comparisons can take up to s character comparisons.
In the image you ask "why add?" Well, they are independent operations: one sorts each of the a strings, each of length s, and that's O(a * s log s); the other sorts the array of a strings, where each comparison can look at up to s characters, and that's another O(a * s log a). Independent operations means "add". Adding gives O(a * s log s + a * s log a), which becomes O(a * s (log a + log s)) once you extract out the common factors.
Understand it this way: when you're sorting an array of characters/numbers, you can sort based on simple comparisons of two elements, each of which is O(1), so the complexity is O(N log N) where N is the length of the array.
What happens when you start sorting an array of strings?
array[0] = "rahul" and array[1] = "raj"
To order these two entries lexicographically (you can't simply compare two strings like numbers), you have to compare them character by character, so a single comparison can take up to Math.max(array[0].length(), array[1].length()) steps. That is where the extra s comes from in O(s*a log(a)).
The clearest and most to-the-point explanation of when to use + (add) and when to use * (multiply).
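To see both factors in code, here is a hedged sketch (the class and method names are made up): the per-string sorts cost O(s log s) each, and the array sort performs O(a log a) string comparisons, each of which can touch up to s characters.

    import java.util.Arrays;

    public class SortStrings {
        // Total: a * O(s log s)  for sorting the characters of each string
        //      + O(a log a) comparisons * O(s) per comparison  for sorting the array
        //      = O(a * s * (log a + log s))
        public static void sortAll(String[] words) {
            for (int i = 0; i < words.length; i++) {     // a strings
                char[] chars = words[i].toCharArray();
                Arrays.sort(chars);                      // O(s log s) per string
                words[i] = new String(chars);
            }
            Arrays.sort(words);                          // O(a log a) comparisons; compareTo walks the
        }                                                // strings character by character, so each is O(s)
    }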

Find medians in multiple sub-ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub-ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub-range and running a standard median-finding algorithm on it.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5 (not 4, since we are asking for the median of the first three elements of the original list, 5, 3, 6).
INTEGER ELEMENTS:
If your elements are integers, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer appears in your input (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
create a bucket for each number that lies in any of your sub-ranges.
iterate through all elements; for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated counts stored in your buckets.
To put it another way, suppose you have a sub-range [0, 10] and you would like to compute the median. The bucket approach basically computes how many 0s there are in your inputs, how many 1s there are, and so on. Suppose n numbers lie in the range [0, 10]; then the median is the (n/2)-th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 while bucket[0] + ... + bucket[i-1] is less than n/2.
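A small sketch of that counting scheme for one value range (the names are illustrative; it returns the lower median and assumes integer elements):

    public class BucketMedian {
        // Median of the input values that fall in the value range [lo, hi].
        public static int medianInRange(int[] input, int lo, int hi) {
            int[] bucket = new int[hi - lo + 1];         // one counter per possible value
            int n = 0;
            for (int v : input) {
                if (v >= lo && v <= hi) {                // only count values inside the sub-range
                    bucket[v - lo]++;
                    n++;
                }
            }
            int half = (n + 1) / 2;                      // rank of the (lower) median
            int seen = 0;
            for (int i = 0; i < bucket.length; i++) {    // walk buckets until the cumulative
                seen += bucket[i];                       // count reaches the median's rank
                if (seen >= half) return lo + i;
            }
            throw new IllegalArgumentException("no values in range [" + lo + ", " + hi + "]");
        }
    }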
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated counts need to pass over the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K)); you then narrow down the problem space by identifying which bucket the median lies in, decrease K by 1 in the next pass, and repeat until you can pinpoint the median.
FLOATING-POINT ELEMENTS
All elements fit into memory:
If all of your elements fit into memory, first sorting the N elements and then finding the median of each sub-range is the best option. The linear-time heap solution also works well in this case if the number of sub-ranges is less than log N.
The elements cannot all fit into memory but are stored on a single machine:
Generally, an external sort requires about three disk scans. Therefore, if the number of sub-ranges is greater than or equal to 3, first sorting the N elements and then finding the median of each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking out the elements in it is better.
The elements are stored on multiple machines:
Since finding the median is a holistic operation, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on it.
I think that as the number of sub-ranges increases you will very quickly find that it is quicker to sort first and then retrieve the elements at the positions you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because since you are dealing with integers you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs, then a little bit-twiddling will allow you to use integer sort on them. From http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers: the binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign-and-magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bits differ, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal); otherwise, relative order is the same as lexicographical order, but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating-point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, then treat them as normal 2's-complement signed integers, sort, and finally reverse the process.
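In Java terms, the core of that trick might look like the sketch below (NaNs must be filtered out first, as noted; the class and method names are made up):

    public class FloatKeys {
        // Maps a non-NaN float to an int whose signed ordering matches the float
        // ordering: the raw IEEE-754 bits are sign + magnitude, so for negative
        // numbers we negate the magnitude to get a proper two's-complement key
        // (this also makes -0.0 and +0.0 compare equal).
        public static int sortableKey(float f) {
            int bits = Float.floatToIntBits(f);          // raw sign + magnitude bits
            return bits >= 0 ? bits : -(bits & 0x7FFFFFFF);
        }
    }

Integer-sort these keys, carrying the original floats alongside them, and the floats come out in order.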
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply averaging the two indices (i.e. if the range spans indices i..j in the sorted array, its median is arr[(i+j)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
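A sketch of the static version described above, reading each query [x, y] as a range of values (the helper names are made up):

    import java.util.Arrays;

    public class SortedRangeMedian {
        private final int[] sorted;

        public SortedRangeMedian(int[] data) {
            sorted = data.clone();
            Arrays.sort(sorted);                         // O(n log n) preprocessing
        }

        // Median of the stored values lying in the value range [x, y]; O(log n) per query.
        // (Assumes y < Integer.MAX_VALUE so that y + 1 does not overflow.)
        public int median(int x, int y) {
            int lo = lowerBound(x);                      // first index with value >= x
            int hi = lowerBound(y + 1) - 1;              // last index with value <= y
            if (lo > hi) throw new IllegalArgumentException("no values in range");
            return sorted[(lo + hi) / 2];
        }

        private int lowerBound(int key) {
            int lo = 0, hi = sorted.length;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (sorted[mid] < key) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    }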
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable in most of the cases you may encounter. The problem is that each is going to perform differently for different inputs: where one performs better for one class of inputs, another will perform better for a different class.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

sorting a bivalued list

If I have a list of just binary values containing 0's and 1's like the following 000111010110
and I want to sort it to the following 000000111111, what would be the most efficient way to do this if you also know the list size? Right now I am thinking of having one counter where I just count the number of 0's as I traverse the list from beginning to end. Then if I subtract numberOfZeros from the listSize I get numberOfOnes. Then I was thinking, instead of reordering the list starting with zeros, I would just create a new list. Would you agree this is the most efficient method?
Your algorithm implements the most primitive version of the classic bucket sort algorithm (its counting sort variant). It is the fastest possible way to sort numbers when their range is known and is (relatively) small. Since zeros and ones are all you have, you do not need the array of counters present in a general bucket sort: a single counter is sufficient.
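A minimal sketch of that single-counter version (done in place here; building a fresh list as you describe works the same way):

    public class BinarySort {
        // Counts the zeros in one pass, then rewrites the list as all zeros followed by all ones.
        public static void sortBits(int[] bits) {
            int zeros = 0;
            for (int b : bits) {
                if (b == 0) zeros++;                     // a single counter is enough
            }
            for (int i = 0; i < bits.length; i++) {
                bits[i] = (i < zeros) ? 0 : 1;           // zeros first, then ones
            }
        }
    }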
If you have numeric values, you can use a bit-counting instruction (POPCNT in x86 assembly; the bit-scan instruction BSF only finds the lowest set bit) to count the number of set bits n. To create the "sorted" value you would set the (n+1)-th bit, then subtract one; this sets all the bits to the right of it.
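For values that fit in one machine word, a hedged sketch of that bit manipulation in Java (using the built-in bit count rather than an explicit CPU instruction):

    public class BitWordSort {
        // "Sorts" the bits of a word: all set bits end up in the low positions.
        public static int sortBitsOfWord(int value) {
            int ones = Integer.bitCount(value);          // number of set bits (POPCNT on x86)
            return ones == 32 ? -1 : (1 << ones) - 1;    // set bit ones+1, subtract 1 -> `ones` low bits set
        }
    }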
Bucket sort is a full sorting algorithm, as it seems.
I don't think there is a need for such operations. As we know, no comparison-based sorting algorithm is faster than N log N, so by default that approach is the wrong fit here.
And all that because all you have to do is what you said in the very beginning: just traverse the list and count the zeros (or the ones), which gives you O(n) complexity. Then just create a new array with the counted zeros at the beginning followed by the ones. That is a total of N + N steps, which is still
O(n) complexity.
And that is only because you have just two values, so neither quicksort nor any other comparison sort can do this faster; there is no comparison sort faster than N log(N).

Determining if a sequence T is a sorting of a sequence S in O(n) time

I know that one can easily determine whether a sequence is sorted in O(n) time. However, how can we ensure that some sequence T is indeed a sorting of the elements of sequence S in O(n) time?
That is, someone might have an algorithm that outputs some sequence T that is indeed in sorted order, but may not contain any elements from sequence S, so how can we check that T is indeed a sorted sequence of S in O(n) time?
Get the length L of S.
Check the length of T as well. If they differ, you are done!
Let Hs be a hash map with something like 2L buckets of all elements in S.
Let Ht be a hash map (again, with 2L buckets) of all elements in T.
For each element in T, check that it exists in Hs.
For each element in S, check that it exists in Ht.
This will work if the elements are unique in each sequence. See wcdolphin's answer for the small changes needed to make it work with non-unique sequences.
I have NOT taken memory consumption into account. Creating two hash maps of double the size of each sequence may be expensive. This is the usual trade-off between speed and memory.
While Emil's answer is very good, you can do slightly better.
Fundamentally, in order for T to be a reordering of S it must contain all of the same elements. That is to say, every element must occur the same number of times in T as in S. Thus, we will:
Create a Hash table of all elements in S, mapping from the 'Element' to the number of occurrences.
Iterate through every element in T, decrementing the number of times the current element occurred.
If the number of occurrences is zero, remove it from the hash.
If the current element is not in the hash, T is not a reordering of S.
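A compact sketch of that counting check (integer elements assumed; any hashable type works the same way):

    import java.util.HashMap;
    import java.util.Map;

    public class SortingCheck {
        // True iff t is in sorted order and contains exactly the same multiset of
        // elements as s, in expected O(n) time.
        public static boolean isSortingOf(int[] s, int[] t) {
            if (s.length != t.length) return false;
            for (int i = 1; i < t.length; i++) {
                if (t[i - 1] > t[i]) return false;               // T is not sorted
            }
            Map<Integer, Integer> counts = new HashMap<>();
            for (int x : s) counts.merge(x, 1, Integer::sum);    // occurrences in S
            for (int x : t) {
                Integer c = counts.get(x);
                if (c == null) return false;                     // element missing from S or used too often
                if (c == 1) counts.remove(x); else counts.put(x, c - 1);
            }
            return true;                                         // equal lengths, so every count was consumed
        }
    }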
Create a hash map of both sequences. Use the character as key, and the count of the character as value. If a character has not been added yet add it with a count of 1. If a character has already been added increase its count by 1.
Verify that for each character in the input sequence that the hash map of the sorted sequence contains the character as key and has the same count as value.
I believe this is an O(n^2) problem because:
Assuming the data structure you use to store the elements is a linked list, for cheap removal of an element,
you will be doing an S.contains(element of T) for every element of T, plus one check that they are the same size.
You cannot assume that S is ordered, and therefore need to do an element-by-element comparison for every element.
The worst case would be if S is the reverse of T.
This would mean that for element x of T you would do (n - x) comparisons, if you remove each successfully matched element.
That results in n*(n+1)/2 operations, which is O(n^2).
There might be some cleverer algorithm out there, though.

Resources