Why doesn't sortBy take (a -> a -> Bool)?

The Haskell sortBy function takes (a -> a -> Ordering) as its first argument. Can anyone educate me as to what the reasoning is there? My background is entirely in languages that have a similar function take (a -> a -> Bool) instead, so having to write one that returns LT/GT was a bit confusing.
Is this the standard way of doing it in statically typed/pure functional languages? Is this peculiar to ML-descended languages? Is there some fundamental advantage to it that I'm not seeing, or some hidden disadvantage to using booleans instead?
Summarizing:
An Ordering is not just GT | LT; it's actually LT | EQ | GT (apparently GHC doesn't make use of this under the hood for the purposes of sorting, but still)
Returning a trichotomic value more closely models the possible outcomes of a comparison of two elements
In certain cases, using Ordering rather than a Bool will save a comparison
Using an Ordering makes it easier to implement stable sorts
Using an Ordering makes it clear to readers that a comparison between two elements is being done (a boolean doesn't inherently carry this meaning, though I get the feeling many readers will assume it)
I'm tentatively accepting Carl's answer, and posting the above summary since no single answer has hit all the points as of this edit.
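To make the trichotomy point concrete, here is a minimal sketch (sortPeople is just an illustrative name, not from any answer below): the EQ case is exactly what lets one comparison fall through to a tie-breaker, something a Bool-returning predicate cannot express directly.
import Data.List (sortBy)
import Data.Ord (comparing)

-- Illustration only: sort by age, breaking ties by name.
-- EQ from the first comparison lets the second one take over,
-- because (<>) on Ordering treats EQ as "keep going".
sortPeople :: [(String, Int)] -> [(String, Int)]
sortPeople = sortBy (\x y -> comparing snd x y <> comparing fst x y)
For example, sortPeople [("Bob",10),("Ann",10),("Zed",2)] gives [("Zed",2),("Ann",10),("Bob",10)].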

I think Boolean Blindness is the main reason. Bool is a type with no domain semantics. Its semantics in the case of a function like sortBy come entirely from convention, not from the domain the function is operating on.
This adds one level of indirection to the mental process involved in writing a comparison function. Instead of just saying "the three values I can return are less than, equal, or greater", the semantic building blocks of ordering, you say "I want to return less than, so I must convert it to a boolean." There's an extra mental conversion step that's always present. Even if you are well-versed in the convention, it still slows you down a bit. And if you're not well-versed in the convention, you are slowed down quite a bit by having to check to see what it is.
The fact that it's 3-valued instead of 2-valued means you don't need to be quite as careful in your sort implementation to get stability, either - but that's a minor implementation detail. It's not nearly as important as actually having your values have meanings. (Also, Bool is no more efficient than Ordering. It's not a primitive in Haskell. They're both algebraic data types defined in libraries.)
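For reference, both are plain algebraic data types as declared in base; neither is closer to the machine than the other:
data Bool     = False | True
data Ordering = LT | EQ | GT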

When you sort things, you put them in order; there's not a "truth" value to determine.
More to the point, what would "true" mean? That the first argument is less than the second? Greater than? Now you're overriding "true" to really mean "less than" (or "greater than", depending on how you choose to implement the function). And what if they're equal?
So why not cut out the middle man, so to speak, and return what you really mean?

There's no reason it couldn't. If you look at the GHC implementation, it only checks whether the result is GT or not. The Haskell Report version of the code uses insertBy, which likewise only checks for GT or not. You could write the following and use it without any problem:
import Data.List (sortBy)

-- Interpret the Bool predicate as "less than or equal": True becomes LT, False becomes GT.
sortByBool :: (a -> a -> Bool) -> [a] -> [a]
sortByBool lte = sortBy (\x y -> if lte x y then LT else GT)

sort' :: Ord a => [a] -> [a]
sort' = sortByBool (<=)
Some sorts could conceivably perform optimizations by knowing when elements are EQ, but the implementations currently used do not need this information.

I think there were two separate design decisions:
1) Creating the Ordering type
2) Choosing for sortBy's comparison function to return an Ordering value
The Ordering type is useful for more than just sortBy - for example, compare is the "centerpiece" of the Ord typeclass. Its type is :: Ord a => a -> a -> Ordering. Given two values, then, you can find out whether they're less than, greater than, or equal -- with any other comparison function ((<), (<=), (>), (>=)), you can only rule out one of those three possibilities.
Here's a simple example where Ordering (at least in my opinion) makes a function's intent a little clearer:
f a b =
  case compare a b of
    GT -> {- something -}
    LT -> {- something -}
    EQ -> {- something -}
Once you've decided to create the Ordering type, then I think it's natural to use it in places where that's the information you're truly looking for (like sortBy), instead of using Bool as a sort of workaround.

Three-valued Ordering is needed to save comparisons in cases where we do need to distinguish the EQ case. In a duplicates-preserving sort or merge, we ignore the EQ case, so a predicate with less-than-or-equal semantics is perfectly acceptable. But not in the case of union or nubSort, where we do want to distinguish the three outcomes of comparison.
mergeBy :: (a -> a -> Bool) -> [a] -> [a] -> [a]
mergeBy _   xs     []     = xs
mergeBy _   []     ys     = ys
mergeBy lte (x:xs) (y:ys)
  | lte y x   = y : mergeBy lte (x:xs) ys
  | otherwise = x : mergeBy lte xs (y:ys)

union :: Ord a => [a] -> [a] -> [a]
union xs     []     = xs
union []     ys     = ys
union (x:xs) (y:ys) = case compare x y of
  LT -> x : union xs (y:ys)
  EQ -> x : union xs ys
  GT -> y : union (x:xs) ys
Writing the latter one with lte predicate is unnatural.
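For comparison, here is roughly what it looks like if all you have is an lte predicate (a sketch with an illustrative name, unionByLte): equality has to be reconstructed from two calls, which costs an extra comparison and obscures the three-way case analysis.
unionByLte :: (a -> a -> Bool) -> [a] -> [a] -> [a]
unionByLte _   xs     []     = xs
unionByLte _   []     ys     = ys
unionByLte lte (x:xs) (y:ys)
  | lte x y && lte y x = x : unionByLte lte xs ys       -- the "EQ" case, needs two tests
  | lte x y            = x : unionByLte lte xs (y:ys)   -- "LT"
  | otherwise          = y : unionByLte lte (x:xs) ys   -- "GT"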

Related

Testing if one list is a sublist of another without testing equality between elements

I am wondering if one can write a functional program (as in Haskell or OCaml) that takes two lists and determines if the first is a sublist of the second, with the property that the program cannot invoke equality between elements of the list.
More generally, is there such a program that works for lists of elements of arbitrary type? That is, (in Haskell terms) the type does not have to be constrained by Eq, Ord, or something else.
The reason I ask this is that when dealing with lists of elements of arbitrary type, standard equality (as for ints, strings, etc.) is sometimes not supported for these elements. It would be helpful, however, to test for sublists.
I have been unable to think of an implementation that meets this condition. Is it possible to create one?
Without an equality, the relation is_sublist is nonsensical: [x] `is_sublist` [y] ought to be true if and only if x = y. Conversely, if such a function is_sublist existed, it would define an equality function as eq x y = [x] `is_sublist` [y].
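The same argument written as (hypothetical) Haskell, where isSublistOf is the function the question asks for, constrained by nothing:
-- Hypothetical: a sublist test with no Eq/Ord constraint on the elements.
isSublistOf :: [a] -> [a] -> Bool
isSublistOf = undefined  -- cannot actually be written: with no constraint,
                         -- there is no way to compare two elements

-- If it did exist, it would hand us an equality test for any type at all:
eq :: a -> a -> Bool
eq x y = [x] `isSublistOf` [y]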

Why is it impossible to Applicative-traverse arrays? (Or is it?)

While pondering how to best map, i.e. traverse, an a -> Maybe a Kleisli arrow over an unboxed vector, I looked for an existing implementation. Obviously U.Vector is not Traversable, but it does supply a mapM, which for Maybe of course works just fine.
But the question is: is the Monad constraint really needed? Well, it turns out that even boxed vectors cheat for the Traversable instance: they really just traverse a list, which they convert from/to:
instance Traversable.Traversable Vector where
  {-# INLINE traverse #-}
  traverse f xs = Data.Vector.fromList Applicative.<$> Traversable.traverse f (toList xs)
mono-traversable does the same thing also for unboxed vectors; here this seems even more gruesome performance-wise.
Now, I wouldn't be surprised if vector was actually able to fuse many of these hacked traversals into a far more efficient form, but still – there seems to be a fundamental problem, preventing us from implementing a traversal on an array right away. Is there any “deep reason” for this inability?
After reading through the relevant source of vector and trying to make mapM work with Applicative, I think the reason why Data.Vector.Unboxed.Vector doesn't have a traverse :: (Applicative f, Unbox a, Unbox b) => (a -> f b) -> Vector a -> f (Vector b) function and Data.Vector.Vector doesn't have a native traverse is the fusion code. The offender is the following Stream type:
-- Data/Vector/Fusion/Stream/Monadic.hs, line 137

-- | Result of taking a single step in a stream
data Step s a where
  Yield :: a -> s -> Step s a
  Skip  :: s -> Step s a
  Done  :: Step s a

-- | Monadic streams
data Stream m a = forall s. Stream (s -> m (Step s a)) s
This is used internally to implement mapM. The m will be the same as from your initial call to Data.Vector.Unboxed.mapM. But because the spine of this stream is inside the m functor, it is not possible to work with it if you only have an applicative for m.
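To see this concretely, here is a stripped-down sketch (not the actual vector source, just the same Stream shape) of draining such a stream: the next state s' only becomes available inside m, so the continuation has to depend on the result of step s, which is exactly what (>>=) provides and (<*>) does not.
{-# LANGUAGE ExistentialQuantification #-}

data Step s a = Yield a s | Skip s | Done

data Stream m a = forall s. Stream (s -> m (Step s a)) s

-- Draining the stream: each step's result determines the next call,
-- which is what Monad gives us and Applicative alone cannot.
toListM :: Monad m => Stream m a -> m [a]
toListM (Stream step s0) = go s0
  where
    go s = step s >>= \r -> case r of
      Yield x s' -> fmap (x :) (go s')
      Skip    s' -> go s'
      Done       -> return []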
See also this issue on the vector GitHub repo: Weaken constraint on mapM.
Disclaimer: I don't really know how fusion works. I don't know how vector works.

How to define set in coq without defining set as a list of elements

I am trying to define (1, 2, 3) as a set of elements in Coq. I can define it using a list as (1 :: (2 :: (3 :: nil))). Is there any way to define a set in Coq without using a list?
There are basically four possible choices to be made when defining sets in Coq, depending on your constraints on the base type of the set and your computation needs:
If the base type doesn't have decidable equality, it is common to use:
Definition set (A : Type) := A -> Prop.
Definition cup {A} (P Q : set A) : set A := fun x => P x \/ Q x.
...
basically, Coq's Ensembles. This representation cannot "compute", as we can't even decide if two elements are equal.
If the base data type has decidable equality, then there are two choices depending if extensionality is wanted:
Extensionality means that two sets are equal in Coq's logic iff they have the same elements, formally:
forall (A B : set T), (A = B) <-> (forall x, x \in A <-> x \in B).
If extensionality is wanted, sets should be represented by a canonically-sorted duplicate-free structure, usually a list. A good example is Cyril Cohen's finmap library. This representation, however, is very inefficient for computing, as re-sorting is needed every time a set is modified.
If extensionality is not needed (usually a bad idea if proofs are wanted), you can use representations based on binary trees such as Coq's MSet, which are similar to standard Functional Programming set implementations and can work efficiently.
Finally, when the base type is finite, the set of all sets is a finite type too. The best example of this approach is IMO math-comp's finset, which encodes finite sets as the space of finitely supported membership functions, which is extensional, and forms a complete lattice.
The standard library of Coq provides the following finite set modules:
Coq.MSets abstracts away the implementation details of the set. For instance, there is an implementation that uses AVL trees and another based on lists.
Coq.FSets abstracts away the implementation details of the set; it is a previous version of MSets.
Coq.Lists.ListSet is an encoding of sets as lists, which I am including for the sake of completeness
Here is an example on how to define a set with FSets:
Require Import Coq.Structures.OrderedTypeEx.
Require Import Coq.FSets.FSetAVL.
Module NSet := FSetAVL.Make Nat_as_OT.
(* Creates a set with only element 3 inside: *)
Check (NSet.add 3 NSet.empty).
There are many encodings of sets in Coq (lists, functions, trees, ...), which can be finite or not. You should have a look at Coq's standard library. For example, the 'simplest' set definition I know is this one.

Functionally comparing data sets to each other once with Haskell

After over a year of mental wrangling, I finally understand Haskell well enough to consider it my primary language for the majority of my general programming needs. I absolutely love it.
But I still struggle with doing very specific operations in a functional way.
A simplified example:
dataSet = [("Bob", 10), ("Megan", 7), ("Frank", 2), ("Jane", 11)]
I'd like to compare these entries to each other. With a language like C or Python, I'd probably create some complicated loop, but I'm not sure which approach (map, fold, list comprehension?) would be best or most efficient with a functional language.
Here's a sample of the code I started working on:
run xs = [ someAlgorithm (snd x) (snd y) | x <- xs, y <- xs, x /= y ]
The predicate keeps the list comprehension from comparing entries with themselves, but the function isn't very efficient because it compares entries that have already been compared. For example, it'll compare Bob with Megan, and then compare Megan with Bob.
Any advice on how to solve this issue would be greatly appreciated.
If you have an ordering on your data type, you can just use x < y instead of x /= y.
Another approach is to use tails to avoid comparing elements in the same position:
[ ... | (x:ys) <- tails xs, y <- ys]
This has the effect of only picking items y that occur after x in the original list. If your list contains duplicates, you'll want to combine this with the explicit filtering from before.
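Putting both suggestions together, a sketch where run takes the comparison function as a parameter (someAlgorithm is the asker's placeholder for the real comparison):
import Data.List (tails)

-- All unordered pairs of distinct positions: each pair is produced once,
-- so Bob/Megan is compared but Megan/Bob is not.
pairs :: [a] -> [(a, a)]
pairs xs = [ (x, y) | (x:ys) <- tails xs, y <- ys ]

run :: (b -> b -> c) -> [(a, b)] -> [c]
run someAlgorithm xs = [ someAlgorithm (snd x) (snd y) | (x, y) <- pairs xs ]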

Purely functional set

Is there an algorithm that implements a purely functional set?
Expected operations would be union, intersection, difference, element?, empty? and adjoin.
Those are not hard requirements though and I would be happy to learn an algorithm that only implements a subset of them.
You can use a purely functional map implementation, where you just ignore the values.
See http://hackage.haskell.org/packages/archive/containers/0.1.0.1/doc/html/Data-IntMap.html (linked to from https://cstheory.stackexchange.com/questions/1539/whats-new-in-purely-functional-data-structures-since-okasaki ).
(sidenote: For more information on functional datastructures, see http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504 )
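As a concrete sketch of the map-with-throwaway-values trick, using Data.Map from containers (the same idea works with IntMap):
import qualified Data.Map as M

-- A set is just a Map whose values carry no information.
type Set a = M.Map a ()

empty :: Set a
empty = M.empty

adjoin :: Ord a => a -> Set a -> Set a
adjoin x = M.insert x ()

member :: Ord a => a -> Set a -> Bool
member = M.member

union, intersection, difference :: Ord a => Set a -> Set a -> Set a
union        = M.union
intersection = M.intersection
difference   = M.difference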
A purely functional implementation exists for almost any data structure. In the case of sets or maps, you typically use some form of search tree, e.g. red/black trees or AVL trees. The standard reference for functional data structures is the book by Okasaki:
http://www.cambridge.org/gb/knowledge/isbn/item1161740/
Significant parts of it are available for free via his thesis:
http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
The links from the answer by @ninjagecko are good. What I've been following recently are the Persistent Data Structures used in Clojure, which are functional, immutable and persistent.
A description of the implementation of the persistent hash map can be found in this two-part blog post:
http://blog.higher-order.net/2009/09/08/understanding-clojures-persistenthashmap-deftwice/
http://blog.higher-order.net/2010/08/16/assoc-and-clojures-persistenthashmap-part-ii/
These are implementations of some of the ideas (see the first answer, first entry) found in this reference request question.
The sets that come out of these structures support the functions you need:
http://clojure.org/data_structures#Data Structures-Sets
All that's left is to browse the source code and try to wrap your head around it.
Here is an implementation of a purely functional set in OCaml (it is part of the OCaml standard library).
Is there an algorithm that implements a purely functional set?
You can implement set operations using many different purely functional data structures. Some have better complexity than others.
Examples include:
Lists
Where we have:
List Difference:
(\\) :: Eq a => [a] -> [a] -> [a]
The \\ function is list difference (non-associative). In the result of xs \\ ys, the first occurrence of each element of ys in turn (if any) has been removed from xs. Thus (xs ++ ys) \\ xs == ys.
union :: Eq a => [a] -> [a] -> [a]
The union function returns the list union of the two lists. For example,
"dog" `union` "cow" == "dogcw"
Duplicates, and elements of the first list, are removed from the second list, but if the first list contains duplicates, so will the result. It is a special case of unionBy, which allows the programmer to supply their own equality test.
intersect :: Eq a => [a] -> [a] -> [a]
The intersect function takes the list intersection of two lists. For example,
[1,2,3,4] `intersect` [2,4,6,8] == [2,4]
If the first list contains duplicates, so will the result.
Immutable Sets
More efficient data structures can be designed to improve the complexity of set operations. For example, the standard Data.Set library in Haskell implements sets as size-balanced binary trees:
Stephen Adams, "Efficient sets: a balancing act", Journal of Functional Programming 3(4):553-562, October 1993, http://www.swiss.ai.mit.edu/~adams/BB/.
Which is this data structure:
data Set a = Bin !Size !a !(Set a) !(Set a)
           | Tip

type Size = Int
Yielding complexity of:
union, intersection, difference: O(n+m)
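For completeness, a short usage sketch of Data.Set covering the operations listed in the question (union, intersection, difference, element?, empty?, adjoin):
import qualified Data.Set as S

a, b :: S.Set Int
a = S.fromList [1,2,3,4]
b = S.fromList [2,4,6,8]

main :: IO ()
main = do
  print (S.union a b)         -- fromList [1,2,3,4,6,8]
  print (S.intersection a b)  -- fromList [2,4]
  print (S.difference a b)    -- fromList [1,3]
  print (S.member 3 a)        -- True
  print (S.null S.empty)      -- True
  print (S.insert 5 a)        -- fromList [1,2,3,4,5]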

Resources