Why doesn't Haskell provide folds for one-dimensional Arrays? - higher-order-functions

Data.Array doesn't provide folds for the Array type.
In Real World Haskell (ch. 12), the reason is said to be that Arrays could be folded in different ways based on the programmer's needs:
First of all, there are several kinds of folds that make sense. We might still want to fold over single elements, but we now have the possibility of folding over rows or columns, too. On top of this, for element-at-a-time folding, there are no longer just two sequences for traversal.
Isn't this exactly true of Lists? It's very common to represent e.g. a matrix with a multidimensional List, but there are still folds defined for one-dimensional Lists.
What's the subtlety I'm missing? Is it that a multidimensional Array is sufficiently different from an Array of Arrays?
Edit: Hm, even multidimensional arrays do have folds defined, in the form of instances of Data.Foldable.[0] So how does that fit in with the Real World Haskell quote?
[0] http://hackage.haskell.org/packages/archive/base/4.6.0.0/doc/html/Data-Foldable.html

Since you mention the difference between a "multidimensional" Array and an Array of Arrays, that will illustrate the point nicely, alongside a comparison with lists.
A fold (in the Foldable class sense) is an inherently linear operation, just as lists are an inherently linear structure; a right fold fully characterizes a list by matching its constructors one-for-one with the arguments to foldr. While you can define functions like foldl as well, there's a clear choice of a standard, canonical fold.
Array has no such transparent structure that can be matched one-for-one in a fold. It's an abstract type, with access to individual elements provided by index values, which can be of any type that has an Ix instance. So not only is there no single obvious choice for implementing a fold, there also is no intrinsic linear structure. It just so happens that Ix lets you enumerate a range of indices, but this is more an implementation detail than anything else.
What about multidimensional Arrays? They don't really exist, as such. Ix defines instances for tuples of types that are also instances, and if you want to think of such tuples as an index type for a "multidimensional" Array, go ahead! But they're still just tuples. Obviously, Ix puts some linear order on those tuples, but what is it? Can you find anything in the documentation that tells you?
So, I think we can safely say that folding a multidimensional Array using the order defined by Ix is unwise unless you don't really care what order you get the elements in.
For an Array of Arrays, on the other hand, there's only one sensible way to combine them, much like nested lists: fold each inner Array separately according to their own order of elements, then fold the result of each according to the outer Array's order of elements.
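For instance, a small sketch (assuming GHC and the Foldable and Functor instances for Array that ship with base; the names are just for illustration):

import Data.Array

-- two inner arrays combined by an outer one
nested :: Array Int (Array Int Int)
nested = listArray (0, 1) [ listArray (0, 2) [1, 2, 3]
                          , listArray (0, 2) [4, 5, 6] ]

main :: IO ()
main = print (sum (fmap sum nested))
-- 21: each inner Array is folded in its own element order,
-- then the outer Array's fold combines the partial sums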
Now, you might reasonably object that since there's no type distinction between one-dimensional and multidimensional Arrays, and the former can be assumed to have a sensible fold ordering based on the Ix instance, why not just use that ordering by default? There's already a function that returns the elements of an Array in a list, after all.
As it turns out, the library itself would agree with you, because that's exactly what the Foldable instance does.
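You can check this directly; a quick sketch, again assuming the Foldable instance for Array from base:

import Data.Array
import Data.Foldable (toList)

main :: IO ()
main = do
  let a = listArray ((0, 0), (1, 2)) [1 .. 6] :: Array (Int, Int) Int
  print (elems a)   -- [1,2,3,4,5,6]
  print (toList a)  -- the same order: the Foldable instance folds in elems order
  print (sum a)     -- 21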

There is one natural way to fold lists, which is foldr. Note the types of the list constructors:
(:) :: a -> [a] -> [a]
[] :: [a]
Replacing the occurrences of [a] with b, we get these types:
f :: a -> b -> b
z :: b
And now, of course, the type of foldr is based on this principle:
foldr :: (a -> b -> b) -> b -> [a] -> b
So given the construction/observation semantics of lists, foldr is the one that's most natural. You can read the type as "tell me what to do with a (:) and what to do with a [], and I'll get rid of a list for you."
Array doesn't have this property; you build an array from an association list (Ix i => [(i,a)]), and the type doesn't really expose any recursive structure: one array is not built from other arrays through a recursive constructor as a list or tree would be.
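To make that concrete, here is a small sketch of building an Array from an association list and flattening it back out; note that nothing in it exposes one Array as being built from smaller Arrays:

import Data.Array

squares :: Array Int Int
squares = array (1, 5) [ (i, i * i) | i <- [1 .. 5] ]

main :: IO ()
main = do
  print (assocs squares)  -- [(1,1),(2,4),(3,9),(4,16),(5,25)]
  print (elems squares)   -- [1,4,9,16,25]: the only linear view is the one Ix provides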

Related

Why are Python sets not considered sequences?

In the Python documentation for versions 2.x it says explicitly that there are seven sequence data types. The docs go on to discuss sets and tuples some time later (on the same page), both of which are not included in the above seven. Does anyone know what exactly defines a sequence type? My intuited definition has sets and tuples fitting the bill quite nicely, and I haven't had any luck finding an explicit official definition.
Thanks!
The word "sequence" implies an order, but sets are not in a specific order.
Element index is a fundamental notion for Python sequences. If you look at the table of sequence operations, you'll see a few that work directly with indices:
s[i] ith item of s, origin 0 (3)
s[i:j] slice of s from i to j (3)(4)
s[i:j:k] slice of s from i to j with step k (3)(5)
s.index(i) index of the first occurrence of i in s
Sets and dictionaries have no notion of an element index, and therefore can't be considered sequences.
In mathematics, informally speaking, a sequence is an ordered list of objects (or events). Like a set, it contains members (also called elements, or terms). The number of ordered elements (possibly infinite) is called the length of the sequence. Unlike a set, order matters, and exactly the same elements can appear multiple times at different positions in the sequence. Most precisely, a sequence can be defined as a function whose domain is a countable totally ordered set, such as the natural numbers.
http://en.wikipedia.org/wiki/Sequence
;)
See the Python glossary:
Sequence
An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a __len__() method that returns the length of the sequence. Some built-in sequence types are list, str, tuple, and unicode. Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.
Tuples are sequences. Sets aren't sequences - they have no order and they can't be indexed via set[index] - they don't even have any notion of indices. (They are iterable, though - you can iterate over their items.)

Given an object A and a list of objects L, how to find which objects on L are clones of A without testing all cases?

Using JavaScript notation:
A = {color:'red',size:8,type:'circle'};
L = [{color:'gray',size:15,type:'square'},
{color:'pink',size:4,type:'triangle'},
{color:'red',size:8,type:'circle'},
{color:'red',size:12,type:'circle'},
{color:'blue',size:10,type:'rectangle'}];
The answer for this case would be 2, because L[2] is identical to A. You could find the answer in O(n) by testing each possibility. What is a representation/algorithm that allows finding that answer faster?
I would just create a HashMap and put all the objects into it. We would also need to define a hash function that is a function of the data in each object (something similar to overriding Object.hashCode() in Java).
Suppose the given array L is [B, C, D], where B, C and D are objects. Then the HashMap would be {B=>1, C=>2, D=>3}. Now suppose D is a copy of A; we would just look up A in this map and get the answer. Also, as suggested by Eric P in a comment, we would need to keep the HashMap updated with respect to any change in array L. This can also be done in O(1) per operation on array L.
The cost of looking up an object in the HashMap is O(1), so we can achieve O(1) complexity.
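Here is a rough Haskell sketch of the same idea, using Data.Map from the containers package rather than a true hash map (so lookups are O(log n) instead of expected O(1)); the Shape record and its field names are made up purely for illustration:

import qualified Data.Map as Map

-- a hypothetical record standing in for the JavaScript objects above
data Shape = Shape { color :: String, size :: Int, kind :: String }
  deriving (Eq, Ord, Show)

-- index every object by its value; clones of A all share A's key
buildIndex :: [Shape] -> Map.Map Shape [Int]
buildIndex xs = Map.fromListWith (++) [ (x, [i]) | (i, x) <- zip [0 ..] xs ]

main :: IO ()
main = do
  let a = Shape "red" 8 "circle"
      l = [ Shape "gray" 15 "square"
          , Shape "pink" 4  "triangle"
          , Shape "red"  8  "circle"
          , Shape "red"  12 "circle"
          , Shape "blue" 10 "rectangle" ]
  print (Map.findWithDefault [] a (buildIndex l))  -- [2]

Keeping the index in step with changes to L costs O(log n) per update here, which matches the spirit of the constant-time bookkeeping described above.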
I think it's not possible to do it faster than O(n) with your preconditions.
It's possible to find an element in O(log n) using binary search, but:
A) you need a single key to compare elements by, and
B) the list must be sorted by that key.
Maybe with some techniques (ordering, skip lists, etc.) you can find the answer in fewer than N iterations, but the worst case is O(n).
Since the goal is to find all objects which are clones of A, you must test every object at least once to determine whether it is a clone of A, so the minimum number of tests is N. Passing through the list once and testing each object performs exactly N tests, so this method is optimal.
First, I assume that you are talking about an array, not a list; the word 'list' is reserved for a specific kind of data structure with O(n) indexing complexity, so the mean time for any search in it is at least linear.
For an unsorted array, the only algorithm is a full scan in linear time. However, if the array is sorted, you can use binary or interpolation search to get a better time.
The problem with sorted arrays is that they have linear insert time. No good. So if you want to update your set a lot and both update and search times matter, you should look for an optimized container, which in C++ and Haskell is called a set (the std::set template in the <set> header and the Data.Set module in the containers package, respectively). I don't know whether there is one in JS.
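For reference, a tiny Haskell example of that container (Data.Set from containers), which gives logarithmic membership tests and inserts rather than the linear insert of a sorted array:

import qualified Data.Set as Set

main :: IO ()
main = do
  let s = Set.fromList [3, 1, 4, 1, 5]     -- duplicates collapse
  print (Set.member 4 s)                   -- True, O(log n)
  print (Set.toAscList (Set.insert 2 s))   -- [1,2,3,4,5]; insert is O(log n) too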

Can a set have duplicate elements?

I have been asked a question that is a little ambiguous for my coursework.
The array of strings is regarded as a set, i.e. unordered.
I'm not sure whether I need to remove duplicates from this array?
I've tried googling but one place will tell me something different to the next. Any help would be appreciated.
From Wikipedia in Set (Mathematics)
A set is a collection of well defined and distinct objects.
Perhaps the confusion derives from the fact that a set does not depend on the way its elements are displayed. A set remains the same if its elements are allegedly repeated or rearranged.
As such, the programming languages I know would not put an element into a set if the element already belongs to it, or they would replace it if it already exists, but would never allow a duplication.
Programming Language Examples
Let me offer a few examples in different programming languages.
In Python
A set in Python is defined as "an unordered collection of unique elements". And if you declare a set like a = {1,2,2,3,4} it will only add 2 once to the set.
If you do print(a) the output will be {1,2,3,4}.
Haskell
In Haskell the insert operation of sets is defined as: "[...] if the set already contains an element equal to the given value, it is replaced with the new value."
As such, if you do let a = fromList [1,2,2,3,4] (with Data.Set imported), printing a renders fromList [1,2,3,4].
Java
In Java sets are defined as: "a collection that contains no duplicate elements.". Its add operation is defined as: "adds the specified element to this set if it is not already present [...] If this set already contains the element, the call leaves the set unchanged".
Set<Integer> myInts = new HashSet<>(asList(1,2,2,3,4));
System.out.println(myInts);
This code, as in the other examples, would output [1, 2, 3, 4].
A set cannot have duplicate elements by its very definition. The correct structure to allow duplicate elements is a Multiset or Bag:
In mathematics, a multiset (or bag) is a generalization of the concept of a set that, unlike a set, allows multiple instances of the multiset's elements. For example, {a, a, b} and {a, b} are different multisets although they are the same set. However, order does not matter, so {a, a, b} and {a, b, a} are the same multiset.
A very common and useful example of a Multiset in programming is the collection of values of an object:
values({a: 1, b: 1}) //=> Multiset(1,1)
The values here are unordered, yet they cannot be reduced to Set(1); doing so would, for example, break iteration over the object's values.
Further, quoting from the linked Wikipedia article (see there for the references):
Multisets have become an important tool in databases.[18][19][20] For instance, multisets are often used to implement relations in database systems. Multisets also play an important role in computer science.
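Haskell's containers library has no multiset type, but a common sketch is to count occurrences in a Map; the MultiSet alias and the fromListMS name below are just illustrative:

import qualified Data.Map.Strict as Map

type MultiSet a = Map.Map a Int

fromListMS :: Ord a => [a] -> MultiSet a
fromListMS = Map.fromListWith (+) . map (\x -> (x, 1))

main :: IO ()
main = do
  print (fromListMS "aab")                      -- fromList [('a',2),('b',1)]
  print (fromListMS "aab" == fromListMS "aba")  -- True: order is ignored, multiplicity is not
  print (fromListMS "aab" == fromListMS "ab")   -- False: {a,a,b} and {a,b} differ as multisets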
Let A={1,2,2,3,4,5,6,7,...} and B={1,2,3,4,5,6,7,...} then any element in A is in B and any element in B is in A ==> A contains B and B contains A ==> A=B. So of course sets can have duplicate elements, it's just that the one with duplicate elements would end up being exactly the same as the one without duplicate elements.
"Sets are Iterables that contain no duplicate elements."
https://docs.scala-lang.org/overviews/collections/sets.html

Puzzle: Need an example of a "complicated" equivalence relation / partitioning that disallows sorting and/or hashing

From the question "Is partitioning easier than sorting?":
Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time. I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.
One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.
(Keep in mind the distinction between equality and equivalence.)
Clearly the equivalence relation must be considered when designing the ordering algorithm. For example, if the equivalence relation is "people born in the same year are equivalent", then sorting based on the person's name is not appropriate.
Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?
How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value.
(Note: it is OK if nonequivalent items map to the same hash value (collide) -- I'm not asking to solve the collision problem -- but on the other hand, hashFunc(item) { return 1; } is cheating.)
My suspicion is that for any datatype/equivalence pair where it is possible to define an ordering, it will also be possible to define a suitable hash function, and they will have similar algorithmic complexity. A counterexample to that conjecture would be enlightening!
The answer to questions 1 and 2 is no, in the following sense: given a computable equivalence relation ≡ on strings {0, 1}*, there exists a computable function f such that x ≡ y if and only if f(x) = f(y), which leads to an order/hash function. One definition of f(x) is simple, and very slow to compute: enumerate {0, 1}* in lexicographic order (ε, 0, 1, 00, 01, 10, 11, 000, …) and return the first string equivalent to x. We are guaranteed to terminate when we reach x, so this algorithm always halts.
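A Haskell sketch of that construction, assuming the equivalence relation is handed to us as a total, computable predicate on strings over '0' and '1' (the toy relation in main is just for demonstration):

import Control.Monad (replicateM)

-- map x to the first string, in length-then-lexicographic order, that is
-- equivalent to it; termination is guaranteed because x itself eventually appears
canonical :: (String -> String -> Bool) -> String -> String
canonical equiv x = head [ y | y <- allStrings, y `equiv` x ]
  where
    allStrings = concatMap (\n -> replicateM n "01") [0 ..]  -- "", "0", "1", "00", "01", ...

main :: IO ()
main = print (canonical (\a b -> ones a == ones b) "1011")   -- "111" under "same number of 1s"
  where
    ones = length . filter (== '1')

Comparing or hashing the canonical representatives then gives you an ordering and a hash function respectively.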
Creating a hash function and an ordering may be expensive but will usually be possible. One trick is to represent an equivalence class by a pre-arranged member of that class, for instance, the member whose serialised representation is smallest, when considered as a bit string. When somebody hands you a member of an equivalence class, map it to this canonicalised member of that class, and then hash or compare the bit string representation of that member. See e.g. http://en.wikipedia.org/wiki/Canonical#Mathematics
Examples where this is not possible or convenient include: when somebody gives you a pointer to an object that implements equals() but nothing else useful, and you do not get to break the type system to look inside the object; and when you get the results of a survey that only asks people to judge equality between objects. Also, Kruskal's algorithm uses union-find internally to process equivalence relations, so presumably for this particular application nothing more cost-effective has been found.
One example that seems to fit your request is an IEEE floating point type. In particular, a NaN doesn't compare as equivalent to anything else (nor even to itself) unless you take special steps to detect that it's a NaN, and always call that equivalent.
Likewise for hashing. If memory serves, IEEE 754 has two distinct zeros, +0.0 and -0.0: their bit patterns differ only in the sign bit, yet they compare as equal. Unless your hash function takes this into account, it will produce different hash values for numbers that really compare precisely equal.
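A quick Haskell illustration of both pitfalls (Prelude only; isNegativeZero is a RealFloat method):

main :: IO ()
main = do
  let nan = 0 / 0 :: Double
  print (nan == nan)               -- False: NaN is not even equivalent to itself
  print (0.0 == (-0.0 :: Double))  -- True: the two zeros compare equal...
  print ( isNegativeZero (-0.0 :: Double)
        , isNegativeZero (0.0 :: Double) )
  -- (True,False): ...yet their bit patterns differ, so a naive bitwise hash would split them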
As you probably know, comparison-based sorting takes at least O(n log n) time (more formally you would say it is Omega(n log n)). If you know that there are fewer than log2(n) equivalence classes, then partitioning is faster, since you only need to check equivalence with a single member of each equivalence class to determine which part in the partition you should assign a given element to.
I.e. your algorithm could be like this:
For each x in our input set X:
    For each equivalence class Y seen so far:
        Choose any member y of Y.
        If x is equivalent to y:
            Add x to Y.
            Resume the outer loop with the next x in X.
    If we get to here then x is not in any of the equiv. classes seen so far.
    Create a new equivalence class with x as its sole member.
If there are m equivalence classes, the inner loop runs at most m times, taking O(nm) time overall. As ShreetvatsaR observes in a comment, there can be at most n equivalence classes, so this is O(n^2). Note this works even if there is not a total ordering on X.
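A Haskell sketch of that scheme (quadratic in the worst case, but it needs only the equivalence test, not an ordering or a hash function); partitionBy is a made-up name:

-- group elements into equivalence classes using only the equivalence predicate
partitionBy :: (a -> a -> Bool) -> [a] -> [[a]]
partitionBy equiv = foldl insert []
  where
    insert [] x = [[x]]
    insert (cls@(y : _) : rest) x
      | x `equiv` y = (x : cls) : rest     -- x joins an existing class
      | otherwise   = cls : insert rest x  -- keep looking
    insert ([] : rest) x = insert rest x   -- empty classes never occur; kept for totality

main :: IO ()
main = print (partitionBy (\a b -> a `mod` 3 == b `mod` 3) [1 .. 10 :: Int])
-- [[10,7,4,1],[8,5,2],[9,6,3]]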
Theoretically, it is always possible (for questions 1 and 2), because of the well-ordering theorem, even when you have an uncountable number of partitions.
Even if you restrict yourself to computable functions, throwawayaccount's answer covers that.
You would need to define your question more precisely :-)
In any case, practically speaking, consider the following:
Your data type is the set of unsigned integer arrays. The ordering is lexicographic comparison.
You could consider hash(x) = x, but I suppose that is cheating too :-)
I would say (though I haven't thought hard about constructing a hash function, so I might well be wrong) that partitioning by ordering is much more practical than partitioning by hashing, as the hashing itself could become impractical (a hash function exists, no doubt).
I believe that...
1- Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?
...it's possible only for infinite (possibly only for non-countable) sets.
2- How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value?
...same as above.
EDIT: This answer is wrong
I am not going to delete it just because some of the comments below are enlightening
Not every equivalence relationship implies an order
As your equivalence relation should not induce an order, let's take an unordered distance function as the relation.
If we take the set of functions f : R -> R as our datatype, and define an equivalence relation as:
f is equivalent to g if f(g(x)) = g(f(x)) for all x (commuting operators),
then you can't sort on that relation (no injective function into the real numbers exists). You just can't find a function which maps your datatype to numbers, due to the cardinality of the function space.
Suppose that F(X) is a function which maps an element of some data type T to another of the same type, such that for any Y of type T, there is exactly one X of type T such that F(X)=Y. Suppose further that the function is chosen so that there is generally no practical way of finding the X in the above equation for a given Y.
Define F{0}(X)=X, F{1}(X)=F(X), F{2}(X)=F(F(X)), etc., so F{n}(X) = F(F{n-1}(X)).
Now define a data type Q containing a positive integer K and an object X of type T. Define an equivalence relation thus:
Q(a,X) vs Q(b,Y):
If a > b, the items are equivalent iff F{a-b}(Y) == X
If a < b, the items are equivalent iff F{b-a}(X) == Y
If a = b, the items are equivalent iff X == Y
For any given object Q(a,X) there exists exactly one Z such that F{a}(Z) == X. Two objects are equivalent iff they have the same Z. One could define an ordering or hash function based upon Z. On the other hand, if F is chosen such that its inverse cannot be practically computed, the only practical way to compare elements may be to use the equivalence function above. I know of no way to define an ordering or hash function without either knowing the largest possible 'a' value an item could have, or having a means to invert the function F.
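A small Haskell sketch of that construction; here F is just a stand-in successor function (trivially invertible), so only the shape of the equivalence test is meaningful, and the names T, Q, fN and equivQ are made up:

newtype T = T Integer deriving (Eq, Show)

-- stand-in for the hard-to-invert bijection F
f :: T -> T
f (T x) = T (x + 1)

-- F{n}, i.e. F applied n times
fN :: Int -> T -> T
fN n = foldr (.) id (replicate n f)

data Q = Q Int T deriving Show

equivQ :: Q -> Q -> Bool
equivQ (Q a x) (Q b y)
  | a > b     = fN (a - b) y == x
  | a < b     = fN (b - a) x == y
  | otherwise = x == y

main :: IO ()
main = print (equivQ (Q 3 (T 10)) (Q 1 (T 8)))  -- True: applying F twice to T 8 gives T 10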

How is memory-efficient non-destructive manipulation of collections achieved in functional programming?

I'm trying to figure out how non-destructive manipulation of large collections is implemented in functional programming, ie. how it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory. (Even if the original collection would be garbage-collected, I'd expect the memory footprint and general performance of such a collection to be awful.)
This is how far I've got until now:
Using F#, I came up with a function insert that splits a list into two pieces and introduces a new element in-between, seemingly without cloning all unchanged elements:
// return a list without its first n elements:
// (helper function)
let rec skip list n =
    if n = 0 then
        list
    else
        match list with
        | [] -> []
        | x::xs -> skip xs (n-1)

// return only the first n elements of a list:
// (helper function)
let rec take list n =
    if n = 0 then
        []
    else
        match list with
        | [] -> []
        | x::xs -> x::(take xs (n-1))

// insert a value into a list at the specified zero-based position:
let insert list position value =
    (take list position) @ [value] @ (skip list position)
I then checked whether objects from an original list are "recycled" in new lists by using .NET's Object.ReferenceEquals:
open System
let (===) x y =
    Object.ReferenceEquals(x, y)
let x = Some(42)
let L = [Some(0); x; Some(43)]
let M = Some(1) |> insert L 1
The following three expressions all evaluate to true, indicating that the value referred to by x is re-used both in lists L and M, ie. that there is only 1 copy of this value in memory:
L.[1] === x
M.[2] === x
L.[1] === M.[2]
My question:
Do functional programming languages generally re-use values instead of cloning them to a new memory location, or was I just lucky with F#'s behaviour? Assuming the former, is this how reasonably memory-efficient editing of collections can be implemented in functional programming?
(Btw.: I know about Chris Okasaki's book Purely functional data structures, but haven't yet had the time to read it thoroughly.)
I'm trying to figure out how non-destructive manipulation of large collections is implemented in functional programming, ie. how it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory.
This page has a few descriptions and implementations of data structures in F#. Most of them come from Okasaki's Purely Functional Data Structures, although the AVL tree is my own implementation since it wasn't present in the book.
Now, since you asked, about reusing unmodified nodes, let's take a simple binary tree:
type 'a tree =
    | Node of 'a tree * 'a * 'a tree
    | Nil

let rec insert v = function
    | Node(l, x, r) as node ->
        if v < x then Node(insert v l, x, r)      // reuses x and r
        elif v > x then Node(l, x, insert v r)    // reuses x and l
        else node
    | Nil -> Node(Nil, v, Nil)
Note that we re-use some of our nodes. Let's say we start with a tree xs.
When we insert an e into that tree, we get a brand new tree, with some of the nodes pointing back at our original one.
If we don't have a reference to the original xs tree, then .NET will garbage collect any nodes without live references, specifically the d, g and f nodes.
Notice that we've only modified nodes along the path of our inserted node. This is pretty typical in most immutable data structures, including lists. So, the number of nodes we create is exactly equal to the number of nodes we need to traverse in order to insert into our data structure.
Do functional programming languages generally re-use values instead of cloning them to a new memory location, or was I just lucky with F#'s behaviour? Assuming the former, is this how reasonably memory-efficient editing of collections can be implemented in functional programming?
Yes.
Lists, however, aren't a very good data structure, since most non-trivial operations on them require O(n) time.
Balanced binary trees support O(log n) inserts, meaning we create O(log n) copies on every insert. Since log2(10^15) is ~= 50, the overhead is very very tiny for these particular data structures. Even if you keep around every copy of every object after inserts/deletes, your memory usage will increase at a rate of O(n log n) -- very reasonable, in my opinion.
How it is possible to alter or remove single elements without having to create a completely new collection where all elements, even the unmodified ones, will be duplicated in memory.
This works because no matter what kind of collection, the pointers to the elements are stored separately from the elements themselves. (Exception: some compilers will optimize some of the time, but they know what they are doing.) So for example, you can have two lists that differ only in the first element and share tails:
let shared = ["two", "three", "four"]
let l = "one" :: shared
let l' = "1a" :: shared
These two lists have the shared part in common and their first elements different. What's less obvious is that each list also begins with a unique pair, often called a "cons cell":
List l begins with a pair containing a pointer to "one" and a pointer to the shared tail.
List l' begins with a pair containing a pointer to "1a" and a pointer to the shared tail.
If we had only declared l and wanted to alter or remove the first element to get l', we'd do this:
let l' =
    match l with
    | _ :: rest -> "1a" :: rest
    | [] -> raise (Failure "cannot alter 1st elem of empty list")
There is constant cost:
Split l into its head and tail by examining the cons cell.
Allocate a new cons cell pointing to "1a" and the tail.
The new cons cell becomes the value of list l'.
If you're making point-like changes in the middle of a big collection, typically you'll be using some sort of balanced tree which uses logarithmic time and space. Less frequently you may use a more sophisticated data structure:
Gerard Huet's zipper can be defined for just about any tree-like data structure and can be used to traverse and make point-like modifications at constant cost (see the sketch just below). Zippers are easy to understand.
Paterson and Hinze's finger trees offer very sophisticated representations of sequences, which among other tricks enable you to change elements in the middle efficiently—but they are hard to understand.
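To make the zipper idea concrete, here is a minimal list-zipper sketch in Haskell (Huet's original works on trees, but the list version already shows the constant-cost, structure-sharing update; all names here are made up):

-- a focus element, plus the (reversed) prefix before it and the suffix after it
data Zipper a = Zipper [a] a [a] deriving Show

fromList :: [a] -> Maybe (Zipper a)
fromList []       = Nothing
fromList (x : xs) = Just (Zipper [] x xs)

right :: Zipper a -> Maybe (Zipper a)
right (Zipper before x (y : after)) = Just (Zipper (x : before) y after)
right _                             = Nothing

modify :: (a -> a) -> Zipper a -> Zipper a
modify f (Zipper before x after) = Zipper before (f x) after  -- O(1); before and after are shared

toList :: Zipper a -> [a]
toList (Zipper before x after) = reverse before ++ x : after

main :: IO ()
main = print (fmap (toList . modify (* 10)) (right =<< fromList [1, 2, 3 :: Int]))
-- Just [1,20,3]

Moving the focus or editing it allocates only a constant number of cells; everything else is shared with the previous zipper.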
While the referenced objects are the same in your code, I believe the storage space for the references themselves, and the structure of the list, is duplicated by take. As a result, while the referenced objects are the same and the tails are shared between the two lists, the "cells" for the initial portions are duplicated.
I'm not an expert in functional programming, but maybe with some kind of tree you could achieve duplication of only log(n) elements, as you would have to recreate only the path from the root to the inserted element.
It sounds to me like your question is primarily about immutable data, not functional languages per se. Data is indeed necessarily immutable in purely functional code (cf. referential transparency), but I'm not aware of any non-toy languages that enforce absolute purity everywhere (though Haskell comes closest, if you like that sort of thing).
Roughly speaking, referential transparency means that no practical difference exists between a variable/expression and the value it holds/evaluates to. Because a piece of immutable data will (by definition) never change, it can be trivially identified with its value and should behave indistinguishably from any other data with the same value.
Therefore, by electing to draw no semantic distinction between two pieces of data with the same value, we have no reason to ever deliberately construct a duplicate value. So, in cases of obvious equality (e.g., adding something to a list, passing it as a function argument, &c.), languages where immutability guarantees are possible will generally reuse the existing reference, as you say.
Likewise, immutable data structures possess an intrinsic referential transparency of their structure (though not their contents). Assuming all contained values are also immutable, this means that pieces of the structure can safely be reused in new structures as well. For example, the tail of a cons list can often be reused; in your code, I would expect that:
(skip L 1) === (skip M 2)
...would also be true.
Reuse isn't always possible, though; the initial portion of a list removed by your skip function can't really be reused, for instance. For the same reason, appending something to the end of a cons list is an expensive operation, as it must reconstruct a whole new list, similar to the problem with concatenating null-terminated strings.
In such cases, naive approaches quickly get into the realm of awful performance you were concerned about. Often, it's necessary to substantially rethink fundamental algorithms and data structures to adapt them successfully to immutable data. Techniques include breaking structures into layered or hierarchical pieces to isolate changes, inverting parts of the structure to expose cheap updates to frequently-modified parts, or even storing the original structure alongside a collection of updates and combining the updates with the original on the fly only when the data is accessed.
Since you're using F# here I'm going to assume you're at least somewhat familiar with C#; the inestimable Eric Lippert has a slew of posts on immutable data structures in C# that will probably enlighten you well beyond what I could provide. Over the course of several posts he demonstrates (reasonably efficient) immutable implementations of a stack, binary tree, and double-ended queue, among others. Delightful reading for any .NET programmer!
You may be interested in reduction strategies of expressions in functional programming languages. A good book on the subject is The Implementation of Functional Programming Languages, by Simon Peyton Jones, one of the creators of Haskell.
Have a look especially at the following chapter Graph Reduction of Lambda Expressions since it describes the sharing of common subexpressions.
Hope it helps, but I'm afraid it applies only to lazy languages.

Resources