Suppose I have an "oracle" which sorts as follows:
1, 3, 2000, 11, 17, 20
Becomes
1, 11, 17, 20, 2000, 3
(I don't know what this mechanism is called.) This is akin to UNIX's sort command (without the -n flag).
I remember Windows used to sort filenames like this prior to Windows XP.
Now, I have a bunch of numbers and this sort oracle, and I want to sort the numbers numerically. How can I pre-process the numbers so that the sort oracle returns the correct order?
So, is there a function f() which takes in these numbers such that
sort f([1,3,2000,11,17,20])
would return the correct order?
The problem is, we need to sort a bunch of numbers numerically on a system where the only sort available is the sort procedure I described above.
This is called a lexicographical sort. You can pad the numbers with 0s so they all have the same number of digits to get it to behave like a numerical sort. For instance, use a %04d format code with printf.
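For example, a quick Python sketch of the padding idea, computing the pad width from the longest number (variable names are mine):

nums = [1, 3, 2000, 11, 17, 20]
width = max(len(str(n)) for n in nums)          # widest number decides the padding
padded = ["%0*d" % (width, n) for n in nums]    # same effect as printf's %04d here
print(sorted(padded))                           # plain lexicographic sort
# ['0001', '0003', '0011', '0017', '0020', '2000'] -- numeric order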
Sort the numbers using this "oracle" sort function, resulting in a list of numbers called e.g.: L.
Remove all single-digit numbers from L, resulting in a list of numbers L1, and create a new list of numbers with the single-digit numbers in the same order as they were in L, resulting in a list of numbers S (if there are no single digit numbers in L, S is going to be empty).
Remove all double-digit numbers from L1, resulting in a list of numbers L2, and append the double-digit numbers in the same order they were in L1 to S (if there are no double-digit numbers in L1, L2 remains the same as L1 and no number is appended to S).
...
Remove all n-digit numbers from Ln-1, resulting in an empty list of numbers, and append the n-digit numbers in the same order they were in Ln-1 to S.
Now S contains the numerically sorted list of numbers, using only the "oracle" sort function and a function that returns the number of digits of a number, which hopefully your system has.
Of course, you do not need to create a new list of numbers Lk as you remove all k-digit numbers from Lk-1 if the "oracle" sorted list L is mutable.
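Here is a minimal Python sketch of the whole procedure (the names are mine; sorted(..., key=str) stands in for the oracle, and non-negative integers are assumed):

def oracle_numeric_sort(nums, oracle_sort):
    L = oracle_sort(nums)                 # lexicographic order, e.g. [1, 11, 17, 20, 2000, 3]
    S = []
    max_digits = max((len(str(n)) for n in L), default=0)
    for k in range(1, max_digits + 1):    # pull out 1-digit, then 2-digit, ... numbers
        S.extend(n for n in L if len(str(n)) == k)
    return S

print(oracle_numeric_sort([1, 3, 2000, 11, 17, 20], lambda ns: sorted(ns, key=str)))
# [1, 3, 11, 17, 20, 2000]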
I'm working on problem 24 from Project Euler which is as follows:
A permutation is an ordered arrangement of objects. For example, 3124
is one possible permutation of the digits 1, 2, 3 and 4. If all of the
permutations are listed numerically or alphabetically, we call it
lexicographic order. The lexicographic permutations of 0, 1 and 2 are:
012 021 102 120 201 210
What is the millionth lexicographic permutation of the digits 0, 1, 2,
3, 4, 5, 6, 7, 8 and 9?
I am trying to solve this using Haskell, and started with a brute force approach:
import Data.List
(sort . permutations) [0..9] !! 999999
But this takes way too long, and I'm wondering if it's because the program first gets all the permutations, then sorts them all, then finally takes the millionth element, which is a lot more work than it needs to do.
So I was thinking I could speed things up if I were to write a function that enumerates the permutations already in lexicographic order, so that we could just stop at the millionth element and have the answer.
The algorithm I have in mind is to first sort the input list x, then take the first (smallest) element and prepend it to the permutations of the remaining elements in lexicographic order. These orderings would be found by recursively calling the original function on the now already sorted tail of x (which means our original function should have a way of flagging whether or not the input list is sorted). Then we continue with the next largest element of x and so on until we have the full ordered list. Unfortunately I am still a Haskell beginner and my attempts to write this function have failed. Any hints as to how this might be done?
I have a thought that's too long for a comment, but isn't a working solution in its entirety. Still, it's a plan that should work nearly instantly.
Start with generating permutations in lexicographic order. This is easy to do with a recursive algorithm. First, select the least element available, and recursively generate permutations of the remaining elements, prepending the selected element to each permutation. Then select the second element lexicographically and continue on up.
For what it's worth, this is the standard-ish nondeterministic-select based permutation algorithm you often find in Haskell instructional materials, if the input list is sorted into increasing order. It's not the algorithm used by Data.List.permutations, which is designed to be faster and productive with infinite input.
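For illustration, here is a Python sketch of that select-based generation (a sketch, not the Haskell library code; it assumes the input list is already sorted):

def lex_perms(elems):
    if not elems:
        yield []
        return
    for i, x in enumerate(elems):                    # pick each least remaining element in turn
        for rest in lex_perms(elems[:i] + elems[i+1:]):
            yield [x] + rest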
But you can do better than this. You don't need to generate all the permutations before the target one. You can skip ahead, and it turns out to be really easy.
All you need to do is look at the index of the permutation you are targeting, let's call it k, and use it to index into the permutations directly. If the inputs are sorted lexicographically, the first element of the result is the element at index q, followed by the permutation of the remaining elements at index r, where (q, r) = divMod k (fact (n - 1)).
I'm sure there are ways to get it faster than this, but that should make it basically instant for small numbers like a million anyway.
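A Python sketch of that divMod skipping idea (function names are mine; for Project Euler 24 the millionth permutation is index 999999, counting from zero):

from math import factorial

def nth_permutation(elements, k):
    elements = sorted(elements)           # must start from lexicographic order
    result = []
    while elements:
        q, k = divmod(k, factorial(len(elements) - 1))
        result.append(elements.pop(q))    # the element at index q leads this block of permutations
    return result

print(''.join(map(str, nth_permutation(range(10), 999999))))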
Say you are given a list of names, S = {s1, s2 ... sn} and you want to sort them lexicographically.
How would you guarantee that the running time of the sort is O(the total sum of the lengths of all the words)?
Any useful techniques?
One simple solution would be to use MSD radix sort, assuming a constant-size alphabet. Replace "digit" by "character" while reading the algorithm description. You will also need to set aside strings that are too short to have a character at position i while you are processing position i, otherwise you won't get the desired runtime.
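A sketch of that in Python, treating characters as the digits (it assumes plain ASCII strings; the 256 fixed buckets are what keep each pass linear for a constant-size alphabet):

def msd_sort(strings, pos=0):
    if len(strings) <= 1:
        return strings
    done = [s for s in strings if len(s) == pos]    # strings exhausted at this position come first
    buckets = [[] for _ in range(256)]
    for s in strings:
        if len(s) > pos:
            buckets[ord(s[pos])].append(s)
    for b in buckets:
        done.extend(msd_sort(b, pos + 1))
    return done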
Given is an array of numbers:
1, 2, 8, 6, 9, 0, 4
We need to find all the groups of three numbers which sum to a value N (say 11 in this example). Here, the possible groups of three are:
{1,2,8}, {1,4,6}, {0,2,9}
The first solution I could think of was O(n^3). Later I could improve it a little (to O(n^2 log n)) with this approach:
1. Sort the array.
2. Select any two numbers and perform a binary search for the third element.
Can it be improved further with some other approaches?
You can certainly do it in O(n^2): for each i in the array, test whether two other values sum to N-i.
You can test in O(n) whether two values in a sorted array sum to k by sweeping from both ends at once. If the sum of the two elements you're on is too big, decrement the "right-to-left" index to make it smaller. If the sum is too small, increment the "left-to-right" index to make it bigger. If there's a pair that works, you'll find them, and you perform at most 2*n iterations before you run out of road at one end or the other. You might need code to ignore the value you're using as i, depends what the rules are.
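A Python sketch of the sort-plus-sweep version (no duplicate handling; each triple uses three distinct positions, with the outer element always the smallest of the three):

def three_sum(nums, target):
    nums = sorted(nums)
    found = []
    for i in range(len(nums) - 2):
        lo, hi = i + 1, len(nums) - 1          # sweep the tail from both ends at once
        while lo < hi:
            s = nums[i] + nums[lo] + nums[hi]
            if s == target:
                found.append((nums[i], nums[lo], nums[hi]))
                lo, hi = lo + 1, hi - 1
            elif s < target:
                lo += 1                        # sum too small: move the left index up
            else:
                hi -= 1                        # sum too big: move the right index down
    return found

print(three_sum([1, 2, 8, 6, 9, 0, 4], 11))    # [(0, 2, 9), (1, 2, 8), (1, 4, 6)]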
You could instead use some kind of dynamic programming, working down from N, and you probably end up with time something like O(n*N) or so. Realistically I don't think that's any better: it looks like all your numbers are non-negative, so if n is much bigger than N then before you start you can quickly throw out any large values from the array, and also any duplicates beyond 3 copies of each value (or 2 copies, as long as you check whether 3*i == N before discarding the 3rd copy of i). After that step, n is O(N).
Specifically, given two large files with 64-bit integers produce a file with integers that are present in both files and estimate the time complexity of your algorithm.
How would you solve this?
I changed my mind; I actually like @Ryan's radix sort idea, except I would adapt it a bit for this specific problem.
Let's assume there are so many numbers that they do not fit in memory, but we have all the disk we want. (Not unreasonable given how the question was phrased.)
Call the input files A and B.
So, create 512 new files; call them file A_0 through A_255 and B_0 through B_255. File A_0 gets all of the numbers from file A whose high byte is 0. File A_1 gets all of the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.
Now all possible duplicates are in (A_0, B_0), (A_1, B_1), etc., and those pairs can be analyzed independently (and, if necessary, recursively). And all disk accesses are reasonably linear, which should be fairly efficient. (If not, adjust the number of bits you use for the buckets...)
This is still O(n log n), but it does not require holding everything in memory at any time. (Here, the constant factor in the radix sort is log(2^64) or thereabouts, so it is not really linear unless you have a lot more than 2^64 numbers. Unlikely even for the largest disks.)
[edit, to elaborate]
The whole point of this approach is that you do not actually have to sort the two lists. That is, with this algorithm, at no time can you actually enumerate the elements of either list in order.
Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you simply observe that no numbers in A_0 can be the same as any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those files, append them to the output, then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, etc.
To find the common numbers between A_0 and B_0, you just recurse... Create file A_0_0 containing all elements of A_0 with second byte equal to zero. Create file A_0_1 containing all elements of A_0 with second byte equal to 1. And so forth. Once all elements of A_0 and B_0 have been bucketed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves because you do not need them anymore.
Then you recurse on A_0_0 and B_0_0 to find common elements, deleting them as soon as they are bucketed... And so on.
When you finally get down to buckets that only have one element (possibly repeated), you can immediately decide whether to append that element to the output file.
At no time does this algorithm consume more than 2+epsilon times the original space required to hold the two files, where epsilon is less than half a percent. (Proof left as an exercise for the reader.)
I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the "buckets" get small enough.)
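Here is an in-memory Python sketch of the recursion, with lists standing in for the on-disk bucket files (unsigned 64-bit values assumed; the real version would read and write files instead):

def common_ints(a, b, shift=56):
    if not a or not b:
        return []                              # one side empty: no duplicates possible
    if shift < 0:
        return [a[0]]                          # all eight bytes matched: one distinct value left
    a_buckets = [[] for _ in range(256)]
    b_buckets = [[] for _ in range(256)]
    for x in a:
        a_buckets[(x >> shift) & 0xFF].append(x)
    for x in b:
        b_buckets[(x >> shift) & 0xFF].append(x)
    out = []
    for ab, bb in zip(a_buckets, b_buckets):   # only matching buckets can share values
        out.extend(common_ints(ab, bb, shift - 8))
    return out

print(common_ints([1, 5, 2**40, 7], [5, 9, 2**40]))   # [5, 1099511627776]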
You could use a radix sort, then iterate over the sorted results keeping the matches. Radix sort is O(DN), where D is the number of digits in the numbers. The largest signed 64-bit number is 19 digits long, so the sort for 64-bit integers with a radix of 10 will run in about 19N, or O(N), and the search runs in O(N). Thus this would run in O(N) time, where N is the number of integers in both files.
Assuming the files are too large to fit into memory, use an external least-significant-digit (LSD) radix sort on each of the files, then iterate through both files to find the intersection:
external LSD sort on base N (N=10 or N=100 if the digits are in a string format, N=16/32/64 if in binary format):
Create N temporary files (0 - N-1). Iterate through the input file. For each integer, find the rightmost digit in base N, and append that integer to the temporary file corresponding to that digit.
Then create a new set of N temporary files, iterate through the previous set of temporary files, find the 2nd-to-the-rightmost digit in base N (prepending 0s where necessary), and append that integer to the new temporary file corresponding to that digit. (and delete the previous set of temporary files)
Repeat until all the digits have been covered. The last set of temporary files contains the integers in sorted order. (Merge if you like into one file, otherwise treat the temporary files as one list.)
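An in-memory Python sketch of the pass sequence, with lists standing in for each round's set of temporary files (base 256, binary format assumed):

def lsd_sort(nums):
    for byte in range(8):                       # least significant byte first
        files = [[] for _ in range(256)]        # this round's "temporary files"
        for x in nums:
            files[(x >> (8 * byte)) & 0xFF].append(x)
        nums = [x for f in files for x in f]    # concatenate in bucket order
    return nums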
Finding the intersection:
Iterate through the sorted integers in each file to produce a pair of iterators that point to the current integer in each file. For each iterator, if the numbers match, append to an output list, and advance both iterators. Otherwise, for the smaller number, throw it away and advance the iterator for that file. Stop when either iterator ends.
(This outputs duplicates where there are input duplicates. If you want to remove duplicates, then the "advance the iterator" step should advance the iterator until the next larger number appears or the file ends.)
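The intersection step, sketched in Python over two sorted lists (this is the plain variant; the duplicate-skipping change from the parenthetical above is omitted):

def intersect_sorted(xs, ys):
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])                  # match: emit and advance both sides
            i, j = i + 1, j + 1
        elif xs[i] < ys[j]:
            i += 1                             # throw away the smaller number
        else:
            j += 1
    return out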
Read the integers from both files into two sets (this will take O(N*logN) time), then iterate over the two sets and write the common elements to the output file (this will take O(N) time). Complexity summary: O(N*logN).
Note: The iteration part will perform faster if we store the integers in vectors and then sort them, but we would use much more memory if there are many duplicate integers in the files.
UPD: You can also store in memory only the distinct integers from one of the files:
Read the values from the smaller file into a set. Then read the values from the second file one by one. For each number x, check its presence in the set in O(logN). If it exists there, print it and remove it from the set to avoid printing it twice. The complexity remains O(N*logN), but you use only the memory necessary to store the distinct integers from the smaller file.
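A sketch of the UPD variant in Python (file names are hypothetical, one integer per line is assumed; note Python's set is hash-based, so lookups are O(1) on average rather than the O(logN) of a tree-based set like C++'s std::set):

def write_common(smaller_path, larger_path, out_path):
    with open(smaller_path) as f:
        seen = {int(line) for line in f}        # distinct values of the smaller file only
    with open(larger_path) as f, open(out_path, "w") as out:
        for line in f:
            x = int(line)
            if x in seen:
                out.write("%d\n" % x)
                seen.discard(x)                 # never print the same value twice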
I'm in need of some help.
I have a list of delimited integer values that I need to sort. An example:
Typical (alpha?) sort:
1.1.32.22
11.2.4
2.1.3.4
2.11.23.1.2
2.3.7
3.12.3.5
Correct (numerical) sort:
1.1.32.22
2.1.3.4
2.3.7
2.11.23.1.2
3.12.3.5
11.2.4
I'm having trouble figuring out how to set up the algorithm to do such a sort with n dot delimiters and m integer fields.
Any ideas? This has to have been done before. Let me know if you need more information.
Thanks a bunch!
-Daniel
All you really need to do is to write "compare()" and then you can plug that into any sort algorithm.
To write compare(), compare each field from left to right, and as soon as one field is higher, return that its argument is higher. If one argument is shorter, assume that the remaining fields are 0.
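In Python the same idea collapses into a sort key, since comparing lists of ints already works field by field, left to right (a shorter list that ties on every shared field sorts first, which agrees with padding the missing fields with 0 except for exact ties):

def version_key(s):
    return [int(field) for field in s.split(".")]

versions = ["1.1.32.22", "11.2.4", "2.1.3.4", "2.11.23.1.2", "2.3.7", "3.12.3.5"]
print("\n".join(sorted(versions, key=version_key)))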
Check out radix sort.
versionsort does exactly what you're looking for.
The comparison algorithm is strverscmp, here's a description of it from the man page:
What this function does is the following. If both strings are equal, return 0. Otherwise find the position between two bytes with the property that before it both strings are equal, while directly after it there is a difference. Find the largest consecutive digit strings containing (or starting at, or ending at) this position. If one or both of these is empty, then return what strcmp() would have returned (numerical ordering of byte values). Otherwise, compare both digit strings numerically, where digit strings with one or more leading zeroes are interpreted as if they have a decimal point in front (so that in particular digit strings with more leading zeroes come before digit strings with fewer leading zeroes). Thus, the ordering is 000, 00, 01, 010, 09, 0, 1, 9, 10.
It is called natural sort. See Natural Sorting algorithm.
For example, Darius Bacon's answer in Python:
import re

def sorted_nicely(strings):
    "Sort strings the way humans are said to expect."
    return sorted(strings, key=natural_sort_key)

def natural_sort_key(key):
    return [int(t) if t.isdigit() else t for t in re.split(r'(\d+)', key)]
A general solution is to convert the string into an array of bytes, then use qsort and specify a comparison function.