How to efficiently find distinct values from each column in Spark - performance

To find distinct values from each column of an RDD[Array[String]] I tried

rdd.map(_.map(Set(_))).reduce { (a, b) =>
  a.zip(b).map { case (x, y) => x ++ y }
}

which executes successfully. However, I'd like to know if there is a more efficient way of doing this than my sample code above. Thank you.

Aggregate saves a step; it might or might not be more efficient:

val z = Array.fill(5)(Set[String]()) // or whatever the number of columns is
val d = lists.aggregate(z)(
  (a, b) => a.zip(b).map { case (x, y) => x + y },
  (a, b) => a.zip(b).map { case (x, y) => x ++ y })
You could also try using mutable sets, modifying them in place rather than producing a new set at each step (which is explicitly allowed by Spark):

val z = Array.fill(5)(scala.collection.mutable.Set[String]())
val d = lists.aggregate(z)(
  (a, b) => { a.zip(b).foreach { case (x, y) => x += y }; a },
  (a, b) => { a.zip(b).foreach { case (x, y) => x ++= y }; a })
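For a concrete picture, here is a minimal self-contained sketch of the aggregate approach; the local SparkContext setup and the three-column sample data are assumptions for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("distinct-per-column"))
// Each row of the RDD is one Array of column values.
val lists = sc.parallelize(Seq(
  Array("a", "x", "1"),
  Array("b", "x", "2"),
  Array("a", "y", "1")))

val z = Array.fill(3)(Set[String]())
val distinctPerColumn = lists.aggregate(z)(
  (acc, row) => acc.zip(row).map { case (s, v) => s + v },  // fold one row into the accumulator
  (a, b) => a.zip(b).map { case (x, y) => x ++ y })         // merge per-partition accumulators
// distinctPerColumn: Array(Set(a, b), Set(x, y), Set(1, 2))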

Related

Scala permutations via flatMap of subset

I have a method which generates permutations:

def permutations[T](lst: List[T]): List[List[T]] = lst match {
  case Nil => List(Nil)
  case x :: xs => permutations(xs) flatMap { perm =>
    (0 to xs.length) map { num =>
      (perm take num) ++ List(x) ++ (perm drop num)
    }
  }
}
First, it recurses on head/tail pairs; for List("a", "b", "c") these are:

c - Nil
b - c
a - bc

and it inserts the head before and after each part. As a result, I get all permutations of the three letters. My question is: why doesn't the recursion return intermediate results like "bc", "cb", "c", rather than only the full permutations of all three letters?
Well, in each iteration you necessarily return only lists of the same size as the input list, because every result has the form (perm take num) ++ List(x) ++ (perm drop num), which always contains all of the elements of perm plus the element x.
Therefore, by induction: if each recursive call returns only lists of the same size as its input (including the base case for Nil), then the end result can only contain permutations of that same size.
To fix this, you can add perm without x to each call's result, but you'll have to add distinct to get rid of duplicates:
def permutations[T](lst: List[T]): List[List[T]] = lst match {
  case Nil => List(Nil)
  case x :: xs => permutations(xs) flatMap { perm =>
    ((0 to xs.length) flatMap { num =>
      List(perm, (perm take num) ++ List(x) ++ (perm drop num))
    }).distinct
  }
}
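For instance, the fixed version now also yields the shorter intermediate permutations:

permutations(List("a", "b"))
// => List(List(), List("a"), List("b"), List("a", "b"), List("b", "a"))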

Relational operations using only increment, loop, assign, zero

This is a follow-up question to: Subtraction operation using only increment, loop, assign, zero
We're only allowed to use the following operations:
incr(x) - Once this function is called it will assign x + 1 to x
assign(x, y) - This function will assign the value of y to x (x = y)
zero(x) - This function will assign 0 to x (x = 0)
loop X { } - operations written within brackets will be executed X times
For example, addition can be implemented as follows:
add(x, y) {
    loop x
        { y = incr(y) }
    return y
}
How do I implement the relational operators using these four operations? The relational operations are:
eq(x, y) - Is x equal to y?
lt(x, y) - Is x less than y?
gt(x, y) - Is x greater than y?
We also have their opposites:
ne(x, y) - Is x not equal to y?
gte(x, y) - Is x greater than or equal to y?
lte(x, y) - Is x less than or equal to y?
Any help will be appreciated.
The set of natural numbers N is closed under addition and subtraction:
N + N = N
N - N = N
This means that the addition or subtraction of two natural numbers is also a natural number (taking 0 - 1 to be 0 rather than -1, since there are no negative natural numbers).
However, the set of natural numbers N is not closed under relational operations:
N < N = {0, 1}
N > N = {0, 1}
This means that the result of comparing two natural numbers is either truthfulness (i.e. 1) or falsehood (i.e. 0).
So, we treat the set of booleans (i.e. {0, 1}) as a subset of the natural numbers (i.e. N).
false = 0
true = incr(false)
The first question we must answer is “how do we encode if statements, so that we may branch based on truthfulness or falsehood?” The answer is simple: we use the loop operation:
isZero(x) {
    y = true
    loop x { y = false }
    return y
}
If x is true (i.e. 1) then the loop body executes exactly once. If x is false (i.e. 0) then the loop body doesn't execute at all. We can use this to write branching code.
So, how do we define the relational operations? It turns out that everything can be defined in terms of lte:

lte(x, y) {
    z = sub(x, y)
    z = isZero(z)
    return z
}
We know that x ≥ y is the same as y ≤ x. Therefore:
gte(x, y) {
    z = lte(y, x)
    return z
}
We know that if x > y is true then x ≤ y is false. Therefore:
gt(x, y) {
    z = lte(x, y)
    z = not(z)
    return z
}
We know that x < y is the same as y > x. Therefore:
lt(x, y) {
    z = gt(y, x)
    return z
}
We know that if x ≤ y and y ≤ x then x = y. Therefore:
eq(x, y) {
    l = lte(x, y)
    r = lte(y, x)
    z = and(l, r)
    return z
}
Finally, we know that if x = y is true then x ≠ y is false. Therefore:
ne(x, y) {
    z = eq(x, y)
    z = not(z)
    return z
}
Now, all we need to do is define the remaining helper functions. The sub function is defined as follows:
sub(x, y) {
    loop y
        { x = decr(x) }
    return x
}

decr(x) {
    y = 0
    z = 0
    loop x {
        y = z
        z = incr(z)
    }
    return y
}

(Here decr counts z up from 0 to x while y lags one step behind, so y ends at x - 1, or at 0 when x = 0.)
The not function is the same as the isZero function:
not(x) {
    y = isZero(x)
    return y
}
The and function is the same as the mul function:
and(x, y) {
    z = mul(x, y)
    return z
}

mul(x, y) {
    z = 0
    loop x { z = add(y, z) }
    return z
}

add(x, y) {
    loop x
        { y = incr(y) }
    return y
}
That's all you need. Hope that helps.
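If it helps to sanity-check the construction, here is a rough Scala model of the four primitives and the derived operations; the mutable variables stand in for the machine's registers, and everything here is an illustration rather than part of the original machine:

def loop(x: Int)(body: => Unit): Unit = (1 to x).foreach(_ => body)

def add(x: Int, y0: Int): Int = { var y = y0; loop(x) { y += 1 }; y }
def decr(x: Int): Int         = { var y = 0; var z = 0; loop(x) { y = z; z += 1 }; y }
def sub(x0: Int, y: Int): Int = { var x = x0; loop(y) { x = decr(x) }; x }
def mul(x: Int, y: Int): Int  = { var z = 0; loop(x) { z = add(y, z) }; z }
def isZero(x: Int): Int       = { var y = 1; loop(x) { y = 0 }; y }

def lte(x: Int, y: Int): Int = isZero(sub(x, y))
def gte(x: Int, y: Int): Int = lte(y, x)
def gt(x: Int, y: Int): Int  = isZero(lte(x, y))
def lt(x: Int, y: Int): Int  = gt(y, x)
def eq(x: Int, y: Int): Int  = mul(lte(x, y), lte(y, x))
def ne(x: Int, y: Int): Int  = isZero(eq(x, y))

// e.g. lt(2, 3) == 1, eq(4, 4) == 1, gt(1, 5) == 0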

Better pattern for caching results

I keep running into a similar pattern, which is error-prone (typos can skip some caching) and simply doesn't look nice to me. Is there a better way of writing something like this?
sum_with_cache' result cache ((p1, p2, p3, p4):partitions) =
  let (cache_p1, sol1) = count_noncrossing' cache p1
      (cache_p2, sol2) = count_noncrossing' cache_p1 p2
      (cache_p3, sol3) = count_noncrossing' cache_p2 p3
      (cache_p4, sol4) = count_noncrossing' cache_p3 p4
  in sum_with_cache' (result + sol1 * sol2 * sol3 * sol4) cache_p4 partitions
So basically N operations which can update the cache?
I could write also something like:
process_with_cache' res cache _ [] = (cache, res)
process_with_cache' res cache f (x:xs) =
  let (new_cache, r) = f cache x
  in process_with_cache' (r:res) new_cache f xs

process_with_cache = process_with_cache' []
But that doesn't look really clean either. Is there a nicer way of writing this code?
Another similar pattern is when you request a series of named random numbers:
let (x, rng') = random rng''
    (y, rng)  = random rng'
in (x^2 + y^2, rng)
This is exactly when using a state monad is the right way to go:
import Control.Monad.State
For all random number generators of type (RandomGen g) => g there is a state monad State g, which threads the state implicitly:
do x <- state random
   y <- state random
   return (x^2 + y^2)
The state function simply takes a function of type s -> (a, s) and turns it into a computation of type State s a, in this case:
state :: (RandomGen g) => (g -> (a, g)) -> State g a
You can run a State computation by using runState, evalState or execState:
runState (liftA2 (\x y -> x^2 + y^2) (state random) (state random))
         (mkStdGen 0)
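The same idea carries over to other languages. As a point of comparison, here is a hand-rolled Scala sketch of the pattern: a minimal State monad threading a toy linear-congruential generator (the names and constants are illustrative, not a library API):

final case class State[S, A](run: S => (A, S)) {
  def map[B](f: A => B): State[S, B] =
    State { s => val (a, s1) = run(s); (f(a), s1) }
  def flatMap[B](f: A => State[S, B]): State[S, B] =
    State { s => val (a, s1) = run(s); f(a).run(s1) }
}

// One pseudo-random draw: Seed => (Int, Seed), wrapped as a State action.
type Seed = Long
val nextInt: State[Seed, Int] = State { s =>
  val s1 = s * 6364136223846793005L + 1442695040888963407L
  ((s1 >>> 16).toInt, s1)
}

// The seed is threaded implicitly, just like rng''/rng'/rng above.
val sumOfSquares: State[Seed, Int] =
  for { x <- nextInt; y <- nextInt } yield x * x + y * y

val (value, finalSeed) = sumOfSquares.run(42L)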

Weight-Biased Leftist Heaps: advantages of top-down version of merge?

I am self-studying Okasaki's Purely Functional Data Structures, now on exercise 3.4, which asks to reason about and implement a weight-biased leftist heap. This is my basic implementation:
(* 3.4 (b) *)
functor WeightBiasedLeftistHeap (Element : Ordered) : Heap =
struct
  structure Elem = Element

  datatype Heap = E | T of int * Elem.T * Heap * Heap

  fun size E = 0
    | size (T (s, _, _, _)) = s

  fun makeT (x, a, b) =
    let
      val sizet = size a + size b + 1
    in
      if size a >= size b then T (sizet, x, a, b)
      else T (sizet, x, b, a)
    end

  val empty = E
  fun isEmpty E = true | isEmpty _ = false

  fun merge (h, E) = h
    | merge (E, h) = h
    | merge (h1 as T (_, x, a1, b1), h2 as T (_, y, a2, b2)) =
        if Elem.leq (x, y) then makeT (x, a1, merge (b1, h2))
        else makeT (y, a2, merge (h1, b2))

  fun insert (x, h) = merge (T (1, x, E, E), h)

  fun findMin E = raise Empty
    | findMin (T (_, x, a, b)) = x

  fun deleteMin E = raise Empty
    | deleteMin (T (_, x, a, b)) = merge (a, b)
end
Now, in 3.4 (c) & (d), it asks:
Currently, merge operates in two passes: a top-down pass consisting of calls to merge, and a bottom-up pass consisting of calls to the helper function, makeT. Modify merge to operate in a single, top-down pass. What advantages would the top-down version of merge have in a lazy environment? In a concurrent environment?
I changed the merge function by simply inlining makeT, but I fail to see any advantages, so I think I haven't grasped the spirit of these parts of the exercise. What am I missing?
fun merge (h, E) = h
  | merge (E, h) = h
  | merge (h1 as T (s1, x, a1, b1), h2 as T (s2, y, a2, b2)) =
      let
        val st = s1 + s2
        val (v, a, b) =
          if Elem.leq (x, y) then (x, a1, merge (b1, h2))
          else (y, a2, merge (h1, b2))
      in
        if size a >= size b then T (st, v, a, b)
        else T (st, v, b, a)
      end
I think I've figured out one point with regard to lazy evaluation. If I don't use the recursive merge to calculate the size, then the recursive call won't need to be evaluated until the child is needed:
fun merge (h, E) = h
  | merge (E, h) = h
  | merge (h1 as T (s1, x, a1, b1), h2 as T (s2, y, a2, b2)) =
      let
        val st = s1 + s2
        val (v, ma, mb1, mb2) =
          if Elem.leq (x, y) then (x, a1, b1, h2)
          else (y, a2, h1, b2)
      in
        if size ma >= size mb1 + size mb2
        then T (st, v, ma, merge (mb1, mb2))
        else T (st, v, merge (mb1, mb2), ma)
      end
Is that all? I am not sure about concurrency though.
I think you've essentially got it as far as the lazy evaluation goes -- lazy evaluation isn't much help if you end up having to traverse the whole data structure to find anything out every time you do a merge...
As to the concurrency, I expect the issue is that if, while one thread is evaluating the merge, another comes along and wants to look something up, it will not be able to get anything useful done at least until the first thread completes the merge. (And it might even take longer than that.)
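To make the lazy-evaluation point concrete, here is a rough Scala sketch (not Okasaki's SML, and simplified to Int elements). Because the children of a node are suspended, the single-pass merge can hand back the root immediately, and the recursive merge is only forced when a child is actually demanded:

sealed trait Heap
case object E extends Heap
final class T(val size: Int, val min: Int, l: => Heap, r: => Heap) extends Heap {
  lazy val left: Heap = l   // forced only when this child is demanded
  lazy val right: Heap = r
}

def size(h: Heap): Int = h match { case E => 0; case t: T => t.size }

def merge(h1: Heap, h2: Heap): Heap = (h1, h2) match {
  case (h, E) => h
  case (E, h) => h
  case (a: T, b: T) =>
    val (v, keep, m1, m2) =
      if (a.min <= b.min) (a.min, a.left, a.right, h2)
      else (b.min, b.left, h1, b.right)
    val st = a.size + b.size
    // The stored sizes let us pick the heavier side without forcing merge(m1, m2).
    if (size(keep) >= size(m1) + size(m2)) new T(st, v, keep, merge(m1, m2))
    else new T(st, v, merge(m1, m2), keep)
}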
There is no benefit to the WMERGE-3-4C function in a lazy environment. It still does all the work that the original bottom-up merge did, and it seems unlikely to be any easier for the language system to memoize.
Nor is there a benefit to WMERGE-3-4C in a concurrent environment. Each call to WMERGE-3-4C does all of its work before passing the buck to another instance of WMERGE-3-4C. In fact, if we eliminated the recursion by hand, WMERGE-3-4C could be implemented as a single loop that does all the work while accumulating a stack, followed by a second loop that does the reduce work on the stack. The first loop would not be naturally parallelizable, though perhaps the reduce could operate by applying the function to pairs, in parallel, until only one element remained in the list.

Associativity in Lambda calculus

I am working through the exercises in the book The Lambda Calculus. One of the questions I am stuck on asks to prove the following:

Show that application is not associative; in fact, x(yz) ≠ (xy)z.
Here is what I have worked on so far:

Let x = λa.λb. ab
Let y = λb.λc. bc
Let z = λa.λc. ac

(xy)z => ((λa.λb. ab) (λb.λc. bc)) (λa.λc. ac)
      => (λb. (λb.λc. bc) b) (λa.λc. ac)
      => (λb.λc. bc) (λa.λc. ac)
      => λc. (λa.λc. ac) c

x(yz) => (λa.λb. ab) ((λb.λc. bc) (λa.λc. ac))
      => λb. ((λb.λc. bc) (λa.λc. ac)) b
      => λb. (λc. (λa.λc. ac) c) b
Is this correct? Please help me understand.
Your individual reduction steps are valid, but the counter-example itself doesn't quite work: x, y and z are all α-variants of the same term λa.λb. ab, and if you carry both reductions through to normal form, each side ends at λc.λc'. cc' (up to renaming of bound variables). You can get a working counter-example like this:

let x = λa.n (with n a variable) and y, z variables; then:

(xy)z => ((λa.n) y) z => n z
x(yz) => (λa.n) (y z) => n

and n z and n are distinct normal forms.
The derivations themselves seem fine, at a glance.
Conceptually, just think that x, y, and z can represent computable functions, and the result clearly depends on how the applications are grouped. Say, x is 'subtract 2', y is 'divide by 2', and z is 'double'. For this example, x(yz) = 'subtract 2' and (xy)z = 'subtract 1'.
It seems OK, but for simplicity, how about a proof by contradiction? Assume (xy)z = x(yz), and let

x = λa.λb. a   # x = const
y = λa. -a     # y = negate
z = 1

and show that ((xy)z) 0 ≠ (x(yz)) 0.
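This check can even be run mechanically. Here is a small Scala sketch that mimics the untyped setting with Any (the app helper is just a hypothetical convenience for untyped application):

def app(f: Any, x: Any): Any = f.asInstanceOf[Any => Any](x)

val x: Any => Any = a => ((b: Any) => a)       // x = const: λa.λb. a
val y: Any => Any = a => -a.asInstanceOf[Int]  // y = negate: λa. -a
val z = 1

val lhs = app(app(x, y), z)  // (xy)z reduces to y itself
val rhs = app(x, app(y, z))  // x(yz) reduces to λb. -1

println(app(lhs, 0))  // 0
println(app(rhs, 0))  // -1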
The book you mention, by Barendregt, is extremely formal and precise (a great book), so it would be nice to have the precise statement of the exercise. I guess the actual goal was to find instantiations for x, y and z such that

x(yz) reduces to the boolean true = λx.λy. x, and (xy)z reduces to the boolean false = λx.λy. y.

Then you can take e.g. x = λz.true and z = I = λz.z (y arbitrary).
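Spelling out the reductions for that choice:

x(yz) => (λz.true) (yz) => true
(xy)z => ((λz.true) y) z => true z => (λx.λy. x) I => λy. I = λy.λz. z = false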
But how can we prove that true is not convertible with false? You have no way to prove it inside the calculus, since you have no negation: you can only prove equalities, not inequalities. However, observe that if true = false, then all terms are equal.
Indeed, for any M and N, if true = false then

true M N = false M N

but true M N reduces to M, while false M N reduces to N, so

M = N

Hence, if true = false, all terms would be equal and the calculus would be trivial. Since we can find non-trivial models of the lambda calculus, no such model may equate true and false (more generally, no such model may equate terms with different normal forms; making that precise would require discussing the Böhm-out technique).
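To see the true M N => M and false M N => N step concretely, here is a tiny Scala encoding of the Church booleans (purely illustrative):

def tru[A](x: A)(y: A): A = x   // true  = λx.λy. x
def fls[A](x: A)(y: A): A = y   // false = λx.λy. y

tru("M")("N")  // "M"
fls("M")("N")  // "N"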
