Short-circuiting sort

I understand that:
head (map (2**) [1..999999])
Will only actually evaluate 2**1, and none of the rest, but the book I am reading says that:
head (sort somelist)
Will only need to find the smallest item in the list, because that is all that is used. How does this work? As far as I can tell, this would be impossible with the sorting algorithms I know (like bubble sorting).
The only way I can think that this would work is if the sorting algorithm were to go through the entire list looking for the smallest item, and then recurse on the list without that item. To me, this sounds really slow.
Is this how the sort function works, or is there another sorting algorithm I don't know about that allows short-circuiting like this?

This:
Will only need to find the smallest item in the list, because that is all that is used.
... should really say that the function only needs to do the minimal amount of work that the sorting algorithm requires to find the smallest element.
For example, if we are using quicksort as our underlying sorting algorithm, then head . quicksort is equivalent to the optimal (!) selection algorithm known as 'quickselect', which runs in expected linear time. Moreover, we can implement a k-element quickselect merely by take k . quicksort.
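For intuition, here is a minimal sketch of the idea, assuming the naive quicksort often shown in tutorials rather than the actual library sort (lazyQsort is just an illustrative name):

import Data.List (partition)

lazyQsort :: Ord a => [a] -> [a]
lazyQsort []     = []
lazyQsort (x:xs) = lazyQsort smaller ++ [x] ++ lazyQsort larger
  where (smaller, larger) = partition (< x) xs

-- head (lazyQsort xs) forces only the chain of "smaller" partitions,
-- which is exactly the quickselect recursion: expected O(n) work.
-- take k (lazyQsort xs) forces only what the first k elements need:
-- expected O(n + k log k).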
Wikipedia notes in its article on selection algorithms that (my emphasis):
Because language support for sorting is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed. Indeed, for lazy languages, this simplistic approach can even get you the best complexity possible for the k smallest/greatest sorted (with maximum/minimum as a special case) if your sort is lazy enough.
Quicksort works well in this scenario, whereas the default sort in Haskell (merge sort) doesn't compose quite as well, as it does more work than strictly necessary to return each element of the sorted list. As this post on the Haskell mailing list notes:
lazy quicksort is able to produce the batch of the first k smallest elements in O(n + k log k) total time [1], while lazy mergesort needs O(n + k log n) total time [2]
For more you might like to read this blog post.

If you create a comparison function that traces its arguments, like this in GHCi's command line:
> :module + Data.List Debug.Trace
> let myCompare x y = trace ("\tCmp " ++ show x ++ " " ++ show y) $ compare x y
then you can see the behaviour yourself:
> sortBy myCompare "foobar"
" Cmp 'f' 'o'
Cmp 'o' 'b'
Cmp 'f' 'b'
Cmp 'a' 'r'
Cmp 'b' 'a'
a Cmp 'b' 'r'
b Cmp 'f' 'o'
Cmp 'f' 'r'
f Cmp 'o' 'o'
Cmp 'o' 'r'
o Cmp 'o' 'r'
or"
Haskell is evaluating the string lazily, one character at a time. The left hand column is being printed as each character is found, with the right hand column recording the comparisons required, as printed by "trace".
Note that if you compile this, especially with optimisations on, you might get a different result. The optimiser runs a strictness analyser that will probably notice that the entire string gets printed, so it would be more efficient to evaluate it eagerly.
Then try
> head $ sortBy myCompare "foobar"
Cmp 'f' 'o'
Cmp 'o' 'b'
Cmp 'f' 'b'
Cmp 'a' 'r'
Cmp 'b' 'a'
'a'
If you want to understand how this works, look up the source code for the sort function and evaluate 'sort "foobar"' manually on paper. The simpler quicksort below illustrates the same idea:
qsort [] = []
qsort (x:xs) = qsort less ++ [x] ++ qsort greater
  where (less, greater) = partition (< x) xs
So
qsort ('f':"oobar")
= qsort ('b':"a") ++ "f" ++ qsort ('o':"or")
= ("a" ++ "b") ++ "f" ++ qsort ('o':"or")
And now we have done enough to find that 'a' is the first item in the result without having to evaluate the other call to "qsort". I've omitted the actual comparison because it's hidden inside the call to "partition". Actually "partition" is also lazy, so in fact the argument to the other "qsort" hasn't been evaluated as far as I've shown it.

The algorithm you just described has a specific name: "selection sort". It's O(n²), so it's not quite the fastest thing you could do. However, if you want the first "k" elements of the sorted list, the complexity is O(kn), which is nice if "k" is small enough (as in your example).
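In a lazy language that "find the minimum, then recurse on the rest" idea can be written down directly, and demand decides how much of it actually runs. A minimal sketch (illustrative only, not the library sort):

import Data.List (delete)

selectSort :: Ord a => [a] -> [a]
selectSort [] = []
selectSort xs = m : selectSort (delete m xs)  -- one O(n) scan per element produced
  where m = minimum xs

-- head (selectSort xs) forces a single scan: O(n).
-- take k (selectSort xs) forces k scans: O(kn), as described above.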
Note that you are using a pure function in a lazy functional language. No special compiler analysis is needed here: because evaluation is driven by demand, composing head with sort means only the work required to produce the minimum element is ever performed.

Related

Performance of Longest Substring Without Repeating Characters in Haskell

Upon reading this Python question and proposing a solution, I tried to solve the same challenge in Haskell.
I've come up with the code below, which seems to work. However, since I'm pretty new to this language, I'd like some help in understanding whether the code is good performance-wise.
import Data.List (foldl')
import Data.Function (on)

lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
  where
    step ("","") c = ([c],[c])
    step (maxSubstr,current) c
      | c `elem` current = step (maxSubstr,init current) c
      | otherwise =
          let candidate = (c:current)
              longerThan = (>) `on` length
              newMaxSubstr = if maxSubstr `longerThan` candidate
                               then maxSubstr
                               else candidate
          in (newMaxSubstr, candidate)
Some points I think could be better than they are
I carry on a pair of strings (the longest tracked, and the current candidate) but I only need the former; thinking procedurally, there's no way to escape this, but maybe FP allows another approach?
I construct (c:current) but I use it only in the else; I could make a more complicated longerThan to add 1 to the length of its second argument, so that I can apply it to maxSubstr and current, and construct (c:current) in the else, without even giving it a name.
I drop the last element of current when c is in the current string, because I'm piling up the strings with :; I could instead pattern match when checking c against the string (as in c `elem` current@(a:as)), but then when adding the new character I should do current ++ [c], which I know is not as performant as c:current.
I use foldl' (as I know foldl doesn't really make sense); foldr could be an alternative, but since I don't see how laziness enters this problem, I can't tell which one would be better.
Running elem on every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running length on, in the worst case, every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running init a lot makes your algorithm Ω(n*sqrt(n)) (for strings that are sqrt(n) repetitions of a sqrt(n)-long string, with every other one reversed, and assuming an O(1) elem replacement).
A better way is to pay one O(n) cost up front to copy into a data structure with constant-time indexing, and to keep a set (or similar data structure) of seen elements rather than a flat list. Like this:
import Data.Set (Set)
import Data.Vector (Vector)
import qualified Data.Set as S
import qualified Data.Vector as V

lswrc2 :: String -> String
lswrc2 "" = ""
lswrc2 s_ = go S.empty 0 0 0 0 where
    s = V.fromList s_
    n = V.length s
    at = V.unsafeIndex s
    go seen lo hi bestLo bestHi
        | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
        -- it is probably faster (possibly asymptotically so?) to use findIndex
        -- to immediately pick the correct next value of lo
        | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
        | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
            if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi
This should have O(n*log(n)) worst-case performance (achieving that worst case on strings with no repeats). There may be ways that are better still; I haven't thought super hard about it.
On my machine, lswrc2 consistently outperforms lswrc on random strings. On the string ['\0' .. '\100000'], lswrc takes about 40s and lswrc2 takes 0.03s. lswrc2 can handle [minBound .. maxBound] in about 0.4s; I gave up after more than 20 minutes of letting lswrc chew on that list.

perl6 min and max of mixed Str and Int arguments

What type gets converted first for the min and max routines when the arguments contain a mixture of Str and Int?
> say ("9", "10").max
9
> say ("9", "10").max.WHAT
(Str)
> say (9, "10").max
9
> say (9, "10").max.WHAT
(Int) # if convert to Int first, result should be 10
> say ("9", 10).max
9
> say ("9", 10).max.WHAT
(Str) # if convert to Str first, result should be 9
> say (9, "10").min
10
> say (9, "10").min.WHAT
(Str) # does min and max convert Str or Int differently?
If min or max converts arguments to be the type of the first argument, the results here are still inconsistent.
Thank you for your enlightenment !!!
Both min and max use the cmp infix operator to do the comparisons. If the types differ, then this logic is used (rewritten slightly to be pure Perl 6, whereas the real one uses an internals shortcut):
multi sub infix:<cmp>($a, $b) {
    $a<> =:= $b<>
        ?? Same
        !! $a.Stringy cmp $b.Stringy
}
Effectively, if the two things point to the exact same object, then they are the Same, otherwise stringify both and then compare. Thus:
say 9 cmp 10; # uses the (Int, Int) candidate, giving Less
say "9" cmp "10"; # uses the (Str, Str) candidate, giving More
say 9 cmp "10"; # delegates to "9" cmp "10", giving More
say "9" cmp 10; # delegates to "9" cmp "10", giving More
The conversion to a string is done for the purpose of comparison (as an implementation detail of cmp), and so has no impact upon the value that is returned by min or max, which will be that found in the input list.
Well, jnthn has answered. His answers are always authoritative and typically wonderfully clear and succinct too. This one is no exception. :) But I'd started so I'll finish and publish...
A search for "method min" in the Rakudo sources yields 4 matches of which the most generic is a match in core/Any-iterable-methods.pm6.
It might look difficult to understand but nqp is actually essentially a simple subset of P6. The key thing is it uses cmp to compare each value that is pulled from the sequence of values being compared against the latest minimum (the $pulled cmp $min bit).
Next comes a search for "sub infix:<cmp>" in the Rakudo sources. This yields 14 matches.
These will all have to be looked at to confirm what the source code shows for comparing these various types of value. Note also that the logic is pairwise for each pair, which is slightly weird to think about. So if there are three values a, b, and c, each of a different type, then a is the initial minimum; then there'll be a b cmp a, using whatever cmp logic wins for that combination of types in that order; and then c cmp m, where m is whichever value won the b cmp a comparison, using whatever cmp logic suits that pair of types in that order.
Let's start with the first one -- the match in core/Order.pm6 -- which is presumably a catchall if none of the other matches are more specific:
If both arguments of cmp are numeric, then comparison is a suitable numeric comparison (eg if they're both Ints then comparison is of two arbitrary precision integers).
If one argument is numeric but not the other, then -Inf and Inf are sorted to the start and end but otherwise comparison is done after both arguments are coerced by .Stringyfication.
Otherwise, both arguments are coerced by .Stringyfication.
So, that's the default.
Next one would have to go through the individual overloads. For example, the next one is the cmp ops in core/allomorphs.pm6, and we see how for allomorphic types (IntStr etc.) comparison is numeric first, then string if that doesn't settle it. Note the comment:
we define cmp ops for these allomorphic types as numeric first, then Str. If you want just one half of the cmp, you'll need to coerce the args
Anyhoo, I see jnthn's posted yet another great answer so it's time to wrap this one. :)

Algorithm to find "ordered combinations"

I need an algorithm to find, what I call, "ordered combinations" (Maybe someone knows the real name for this if there is one).
Of course I already tried to come up with an algorithm on my own but I'm really stuck.
How it should work:
Given 2 lists (not sets, order is important here!) of elements that are guaranteed to contain the same elements, find all ordered combinations.
An ordered combination is a 2-tuple, 3-tuple, ... n-tuple (no limit on N) of elements that appear in the same order in both lists.
It's entirely possible that an element occurs more than once in a list.
But every element from one list is guaranteed to appear at least once in the other list.
It does not matter if the output contains a combination more than once.
I'm not really sure if that makes it clear so here are multiple examples:
(List1, List2, Expected Result, Annotation)
ASDF
ADSF
Result: AS, AD, AF, SF, DF, ASF, ADF
Note: ASD is not a valid result because there is no way to have ascending indices in the second list for this combination
ADSD
ASDD
Result: AD, AS, AD, DD, SD, ASD, ADD
Note: AD appears twice because it can be created from indices 1,2 and 1,4 in the first list, and from indices 1,3 and 1,4 in the second list. But it would also be correct if it appeared only once. Also, D appears twice in both lists in order, so ADD is a valid combination too.
SDFG
SDFG
Result: SD, SF, SG, DF, DG, FG, SDF, SFG, SDG, DFG, SDFG,
Note: Same input; all combinations are possible
ABCDEFG
GFEDCBA
Result: <empty>
Note: There are no combinations that appear in the same order in both lists
QWRRRRRRR
WRQ
Result: WR
Note: The only combination that appears in the same order in both lists is WR
Notes:
While it's a language agnostic algorithm I'd prefer answers that contain either C# or pseudo-code so I can understand them.
I realized that longer combinations are always made up from shorter combinations. Example: SDF can only be a valid result if SD and DF are possible too. Maybe this helps to make the algorithm more performant by building the longer combinations from the shorter ones.
Speed is of great importance here. This algorithm will be used in real time!
If it's not clear how the algorithm works, drop a comment. I'll add an example to clarify it.
Maybe this problem is already known and solved, but I don't know the proper name for it.
I would describe this problem as enumerating common subsequences of two strings. As a first cut, make a method like this, which chooses the first letter nondeterministically and recurses (Python, sorry).
def commonsubseqs(word1, word2, prefix=''):
    if len(prefix) >= 2:
        print(prefix)
    for letter in set(word1) & set(word2):  # set intersection
        # figure out what's left after consuming the first instance of letter
        remainder1 = word1[word1.index(letter) + 1:]
        remainder2 = word2[word2.index(letter) + 1:]
        # take letter and recurse
        commonsubseqs(remainder1, remainder2, prefix + letter)
If this simple solution is not fast enough for you, then it can be improved as follows. For each pair of suffixes of the two words, we precompute the list of recursive calls. In Python again:
def commonsubseqshelper(table, prefix, i, j):
    if len(prefix) >= 2:
        print(''.join(prefix))
    for (letter, i1, j1) in table[i][j]:
        prefix.append(letter)
        commonsubseqshelper(table, prefix, i1, j1)
        del prefix[-1]  # delete the last item

def commonsubseqs(word1, word2):
    table = [[[(letter, word1.index(letter, i) + 1, word2.index(letter, j) + 1)
               for letter in set(word1[i:]) & set(word2[j:])]
              for j in range(len(word2) + 1)]  # 0..len(word2)
             for i in range(len(word1) + 1)]   # 0..len(word1)
    commonsubseqshelper(table, [], 0, 0)
This polynomial-time preprocessing step improves the speed of enumeration to its asymptotic optimum.

Algorithm to produce all partitions of a list in order

I need a particular form of 'set' partitioning that is escaping me, as it's not quite partitioning. Or rather, it's the subset of all partitions of a particular list that maintain the original order.
I have a list of n elements [a,b,c,...,n] in a particular order.
I need to get all discrete variations of partitioning that maintains the order.
So, for four elements, the result will be:
[{a,b,c,d}]
[{a,b,c},{d}]
[{a,b},{c,d}]
[{a,b},{c},{d}]
[{a},{b,c,d}]
[{a},{b,c},{d}]
[{a},{b},{c,d}]
[{a},{b},{c},{d}]
I need this for producing all possible groupings of tokens in a list that must maintain their order, for use in a broader pattern matching algorithm.
I've found only one other question that relates to this particular issue here, but it's for Ruby. As I don't know the language, it looks to me like someone put code in a blender, and since I don't particularly feel like learning a language just for the sake of deciphering an algorithm, I feel I'm out of options.
I've tried to work it out mathematically so many times in so many ways it's getting painful. I thought I was getting closer by producing a list of partitions and iterating over it in different ways, but each number of elements required a different 'pattern' for iteration, and I had to tweak them in by hand.
I have no way of knowing just how many elements there could be, and I don't want to put an artificial cap on my processing to limit it just to the sizes I've tweaked together.
You can think of the problem as follows: each of the partitions you want is characterized by an integer between 0 and 2^(n-1) - 1. Each 1 in the binary representation of such a number corresponds to a "partition break" between two consecutive elements, e.g.
a b|c|d e|f
0 1 1 0 1
so the number 01101 corresponds to the partition {a,b},{c},{d,e},{f}. To generate the partition from a known partition number, loop through the list and slice off a new subset whenever the corresponding bit is set.
I can understand your pain reading the fashionable functional-programming-flavored Ruby example. Here's a complete example in Python if that helps.
array = ['a', 'b', 'c', 'd', 'e']
n = len(array)

for partition_index in range(2 ** (n - 1)):
    # current partition, e.g., [['a', 'b'], ['c', 'd', 'e']]
    partition = []
    # used to accumulate the subsets, e.g., ['a', 'b']
    subset = []
    for position in range(n):
        subset.append(array[position])
        # check whether to "break off" a new subset
        if 1 << position & partition_index or position == n - 1:
            partition.append(subset)
            subset = []
    print(partition)
Here's my recursive implementation of the partitioning problem in Python. For me, recursive solutions are always easier to comprehend. You can find more explanation about it here.
# Prints partitions of a set: [1,2] -> [[1],[2]], [[1,2]]
def part(lst, current=[], final=[]):
    if len(lst) == 0:
        if len(current) == 0:
            print(final)
        elif len(current) > 1:
            print([current] + final)
    else:
        part(lst[1:], current + [lst[0]], final[:])
        part(lst[1:], current[:], final + [[lst[0]]])
Since nobody has mentioned the backtracking technique yet, here is a Python solution that uses it.
def partition(num):
    def backtrack(index, chosen):
        if index == len(num):
            print(chosen)
        else:
            for i in range(index, len(num)):
                # Choose
                cur = num[index:i + 1]
                chosen.append(cur)
                # Explore
                backtrack(i + 1, chosen)
                # Unchoose
                chosen.pop()
    backtrack(0, [])
>>> partition('123')
['1', '2', '3']
['1', '23']
['12', '3']
['123']

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise comparison that determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with fewer comparisons.
The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparison function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions can be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with an n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice

def quicksort(seq, compare):
    'Stable in-place sort using a 3-or-more-way comparison function'
    # Make an n-way partition on a random pivot value
    segments = defaultdict(list)
    pivot = choice(seq)
    for x in seq:
        ranking = 0 if x is pivot else compare(x, pivot)
        segments[ranking].append(x)
    seq.clear()
    # Recursively sort each segment and store it in the sequence
    for ranking, segment in sorted(segments.items()):
        if ranking and len(segment) > 1:
            quicksort(segment, compare)
        seq += segment

if __name__ == '__main__':
    from random import randrange
    from math import log10

    def super_compare(a, b):
        'Compare with extra logarithmic near/far information'
        c = -1 if a < b else 1 if a > b else 0
        return c * (int(log10(max(abs(a - b), 1.0))) + 1)

    n = 10000
    data = [randrange(4 * n) for i in range(n)]
    goal = sorted(data)
    quicksort(data, super_compare)
    print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety of comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences of four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the coarse-grained partitions used in the beginning only have a handful of "near matches" and everything else falls into "far matches".
An improvement to the super comparison is to cover logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/-3 if within a thousand).
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.
You can use a modified quicksort. Let me explain with an example where your comparison function returns [-2, -1, 0, 1, 2]; a short sketch follows the steps below. Say you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all the elements there are equal to X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.
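A minimal sketch of those steps, with Haskell used purely for illustration (the function name and the Int-returning comparison are assumptions, not part of any library):

sortBy5 :: (a -> a -> Int) -> [a] -> [a]
sortBy5 _   []     = []
sortBy5 cmp (x:xs) =
    concatMap sortBin [-2, -1] ++ (x : equalToX) ++ concatMap sortBin [1, 2]
  where
    ranked    = [ (cmp y x, y) | y <- xs ]   -- compare every element with the pivot X
    equalToX  = [ y | (0, y) <- ranked ]     -- A0: needs no further sorting
    sortBin r = sortBy5 cmp [ y | (r', y) <- ranked, r' == r ]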
It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.
I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
11111111112
12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar than 1 and 3, and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, comparing string 1 with 2, they differ in position 8 by 'n' - 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 27)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, our comparison function will return 3×10^-8 (actually it will return the negative of that, since string 1 comes after string 2).
Using a similar process for strings 1 and 4, the comparison function returns -5×10^-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '_' - 'a' (26), which generates the digit 9 and hence gives us 9×10^-1.
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.
Considering you are looking to order a number of items based on human comparison, you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the loser by 3, +2 and -2, +1 and -1, or just 0 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.
You can use two comparisons to achieve this. Multiply the more important comparison by 2, and add them together.
Here is an example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;

my @array = (
    [a => 2],
    [b => 1],
    [a => 1],
    [c => 0]
);

say "$_->[0] => $_->[1]" for sort {
    ($a->[0] cmp $b->[0]) * 2 +
    ($a->[1] <=> $b->[1]);
} @array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.
Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data were added, it was appended and then the sort was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.
