How does Spark achieve sort order?

Assume I have a list of Strings. I filter and sort them, and collect the result to the driver. However, the data is distributed, and each partition of the RDD holds its own part of the original list. So how does Spark produce the final sorted order? Does it merge the results?

Sorting in Spark is a multi-phase process that requires a shuffle:
1. The input RDD is sampled, and the sample is used to compute the boundaries of each output partition (sample followed by collect).
2. The input RDD is partitioned with a RangePartitioner using the boundaries computed in the first step (partitionBy).
3. Each partition from the second step is sorted locally (mapPartitions).
When the data is collected, all that is left is to follow the order defined by the partitioner.
These steps are clearly reflected in the debug string, and a hand-written sketch of the same phases follows it:
scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...
scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 [] // Sort partitions
| ShuffledRDD[9] at sortBy at <console>:24 [] // Shuffle
+-(8) MapPartitionsRDD[6] at sortBy at <console>:24 [] // Pre-shuffle steps
| ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize
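For illustration, here is a minimal hand-written sketch of the same three phases using the public RDD API (the variable names and the partition count are mine; sortBy wires equivalent steps together internally):
import org.apache.spark.RangePartitioner

val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
val keyed = rdd.map(x => (x, ()))                        // pre-shuffle step: key each value by itself
val partitioner = new RangePartitioner(4, keyed)         // samples the RDD to compute range boundaries
val sorted = keyed
  .partitionBy(partitioner)                              // shuffle: range-partition by key
  .mapPartitions(it => it.toArray.sortBy(_._1).iterator, // sort each partition locally
                 preservesPartitioning = true)
  .keys

sorted.collect()                                         // partitions are concatenated in partitioner order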

Does an ordered Map know how to search efficiently for a key in Scala?

Imagine I have a Map:
val unorderedMap: Map[Int, String] = ...
val orderedMap: Map[Int, String] = unorderedMap.sort
Is the lookup operation for a key faster in orderedMap?
unorderedMap.get(i) // Slower???
orderedMap.get(i) // Faster???
Does the compiler know how to search efficiently?
Does the compiler perform the lookup differently in each case?
EDIT:
I have a
case class A(key: Int, value1: String, value2: String, ...)
val seqA: Seq[A] = Seq(A(1, "One", "Uno", ...), A(2, "Two", "Duo", ...), ..., A(20000, ..., ...))
I am ONLY interested in fast lookup operations on key.
Is it better to make a Map out of it, like:
val mapA: Map[Int, A] = seqA.map(a => a.key -> a)(collection.breakOut)
or is it better to leave it as a Seq (and maybe sort it)?
If I make it a Map, should I order it or not? There are around 20K-30K elements.
Sorted maps are usually(*) slower than hash maps in any language. This is because a sorted (tree-based) map has O(log n) lookup complexity, while a hash map has O(1) amortized lookup complexity.
You should have a look at the relevant wiki pages for a more in-depth explanation.
(*) This depends on many factors, such as the size of the map. For small collections, a sorted array with binary search might do better if it fits in cache.
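As a minimal sketch of the two options for the case class from the edit (the field values are made up for illustration):
case class A(key: Int, value1: String, value2: String)

val seqA: Seq[A] = (1 to 20000).map(i => A(i, s"v$i", s"w$i"))

// Hash-based Map: O(1) amortized lookup, no ordering guarantees
val hashed: Map[Int, A] = seqA.map(a => a.key -> a).toMap

// Tree-based (sorted) Map: O(log n) lookup, keys iterated in sorted order
val sorted = scala.collection.immutable.TreeMap(seqA.map(a => a.key -> a): _*)

hashed.get(12345)   // usually the faster choice for pure key lookups
sorted.get(12345)   // pays the log n cost, but supports ordered traversal and range queries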

Cartesian product using Spark

I have two sequences, A and B. I want to generate a Boolean sequence that indicates, for each element of A, whether it contains some element of B as a substring. For example:
a = ["abababab", "ccffccff", "123123", "56575656"]
b = ["ab", "55", "adfadf", "123", "5656"]
output = [True, False, True, True]
A and B do not fit in memory. One solution may be as follows:
val a = sc.parallelize(List("abababab", "ccffccff", "123123", "56575656"))
val b = sc.parallelize(List("ab", "55", "adfadf", "123", "5656"))
a.cartesian(b)
.map({case (x,y) => (x, x contains y) })
.reduceByKey(_ || _).map(w => w._1 + "," + w._2).saveAsTextFile("./output.txt")
One can see that there is no need to compute the full Cartesian product: once we find the first pair that satisfies the condition, we can stop searching. Take the first element of A. If we start iterating over B from the beginning, the first element of B is already a substring of it, so the output is True. In that case we were lucky, but in general there is no need to check all combinations.
The question is: is there any other way to optimize this computation?
I believe the short answer is 'NO' :)
I also don't think it's fair to compare what Spark does with plain iteration. Remember that Spark is meant for huge data sets where sequential processing is not an option. It runs your function in parallel, with potentially thousands of tasks executing concurrently on many different machines, and it does this to ensure that processing finishes in a reasonable time even if the first element of A only matches the very last element of B.
In contrast, iterating or looping is a sequential operation that compares two elements at a time. It is well suited to small data sets, but not to huge ones, and definitely not to distributed processing.
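That said, if b (but not a) happened to fit in memory, the early-exit idea could be kept by broadcasting b and using a short-circuiting exists, avoiding the shuffle done by cartesian. A rough sketch (the question rules this out, since B does not fit in memory, so it is only for illustration):
val bLocal = sc.broadcast(b.collect())   // only valid if b fits on the driver and the executors
val output = a.map(x => (x, bLocal.value.exists(y => x.contains(y))))   // exists stops at the first match
output.map { case (x, found) => s"$x,$found" }.saveAsTextFile("./output_broadcast")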

What is stratified bootstrap?

I have learned about the bootstrap and about stratification. But what is the stratified bootstrap, and how does it work?
Let's say we have a dataset of n instances (observations) and m classes. How should I divide the dataset, and what percentage should go to training and testing?
You split your dataset per class. Afterwards, you sample from each sub-population independently. The number of instances you sample from a sub-population should be proportional to its share of the whole dataset. In pseudocode:
for each class i:
    d(i) <- { x in data | class(x) = i }
    for j = 1 .. samplesize * (size(d(i)) / size(data)):
        sample(i) <- sample(i) ∪ { draw one element from d(i) }
sample <- ∪ over all i of sample(i)
If you sample four elements from a dataset with classes {'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'}, this procedure makes sure that at least one element of class b is contained in the stratified sample.
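A minimal runnable sketch of this pseudocode in plain Scala (the (label, value) representation and the function name are just illustrative):
import scala.util.Random

// draw `sampleSize` elements, stratified by label, with replacement
def stratifiedSample[A](data: Seq[(String, A)], sampleSize: Int): Seq[(String, A)] =
  data.groupBy(_._1).values.toSeq.flatMap { stratum =>
    // each stratum contributes in proportion to its share of the data
    val n = math.round(sampleSize.toDouble * stratum.size / data.size).toInt
    Seq.fill(n)(stratum(Random.nextInt(stratum.size)))
  }

val data = Seq("a", "a", "a", "a", "a", "a", "b", "b").map(c => (c, c))
stratifiedSample(data, sampleSize = 4)   // roughly three 'a' pairs and one 'b' pair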
I just had to implement this in Python, so I'll post my current approach here in case it is of interest to others.
I chose to iterate over all relevant strata in the original DataFrame, retrieve the index of the rows in each stratum, and randomly draw (with replacement) as many samples from each stratum as that stratum contains.
The randomly drawn indices can then be combined into one list, which in the end has the same length as the original DataFrame.
A function that builds such an index into the original DataFrame for a stratified bootstrap sample:
import pandas as pd
from random import choices

def provide_stratified_bootstrap_sample_indices(bs_sample):
    strata = bs_sample["STRATIFICATION_VARIABLE"].value_counts()
    bs_index_list_stratified = []
    for idx_stratum_var, n_stratum_var in strata.items():
        # indices of all rows that belong to this stratum
        data_index_stratum = list(bs_sample[bs_sample["STRATIFICATION_VARIABLE"] == idx_stratum_var].index)
        # draw as many indices from the stratum, with replacement, as the stratum contains
        bs_index_list_stratified.extend(choices(data_index_stratum, k=len(data_index_stratum)))
    return bs_index_list_stratified
And then the actual bootstrapping loop (say 10,000 iterations):
k = 10000
for i in range(k):
    bs_sample = DATA_original.copy()
    bs_index_list_stratified = provide_stratified_bootstrap_sample_indices(bs_sample)
    bs_sample = bs_sample.loc[bs_index_list_stratified, :]

    # process the data with whatever statistical operation is required
    # and save the results for each iteration
    RESULTS = FUNCTION_X(bs_sample)

How to carry over a calculated value within an RDD? - Apache Spark

SOLVED: There is no good solution to this problem
I am sure this is just a syntax question and that the answer is an easy one.
What I am trying to achieve is to:
- pass a variable to an RDD
- change the variable according to the RDD's data
- get the adjusted variable back
Let's say I have:
var b = 2
val x = sc.parallelize(0 to 3)
What I want is to obtain the value 2+0 + 2+0+1 + 2+0+1+2 + 2+0+1+2+3 = 18,
that is, to get the value 18 by doing something like
b = x.map(i => … b+i … ).collect
The problem is that, for each i, I need to carry over the value of b, to be incremented with the next i.
I want to use this logic to add elements to an array that is external to the RDD.
How would I do that without doing the collect first?
As mentioned in the comments, it's not possible to mutate one variable with the contents of an RDD as RDDs are distributed across potentially many different nodes while mutable variables are local to each executor (JVM).
Although not particularly performant, it is possible to implement these requirements in Spark by translating the sequential algorithm into a series of transformations that can be executed in a distributed environment.
Using the same example as in the question, this algorithm could be expressed in Spark as:
val initialOffset = 2
val rdd = sc.parallelize(0 to 3)
val halfCartesian = rdd.cartesian(rdd).filter{case (x,y) => x>=y}
val partialSums = halfCartesian.reduceByKey(_ + _)
val adjustedPartials = partialSums.map{case (k,v) => v+initialOffset}
val total = adjustedPartials.reduce(_ + _)
scala> total
res33: Int = 18
Note that cartesian is a very expensive transformation as it creates (m x n) elements, or in this case n^2.
This is just to say that it's not impossible, but probably not ideal.
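If avoiding cartesian matters, a cheaper (still distributed) translation is a two-pass prefix sum: sum each partition, collect those few per-partition sums to the driver, then compute the running totals within each partition shifted by the sums of the preceding partitions. A rough sketch under the same toy example (the variable names are mine):
val rdd = sc.parallelize(0 to 3)
val initialOffset = 2

// 1st pass: one sum per partition, collected to the driver (tiny: one value per partition)
val partSums = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.sum)))
  .collect().sortBy(_._1).map(_._2)
// the offset of partition i is the sum of all earlier partitions
val offsets = sc.broadcast(partSums.scanLeft(0)(_ + _))

// 2nd pass: running totals within each partition, shifted by that partition's offset
val runningTotals = rdd.mapPartitionsWithIndex { (i, it) =>
  it.scanLeft(offsets.value(i))(_ + _).drop(1).map(_ + initialOffset)
}
runningTotals.sum()   // 18.0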
If the amount of data to be processed sequentially fits in the memory of one machine (maybe after filtering/reducing), then Scala has a built-in collection operation that does exactly what is being asked: scan[Left|Right]
val arr = Array(0,1,2,3)
val cumulativeScan = arr.scanLeft(initialOffset)(_ + _)
// drop the head because scanLeft puts the initial element at the start of the result
val result = cumulativeScan.tail.sum
result: Int = 18

Copying the hash table to new rehashed table

I have a question about rehashing. Let us say we have a hash table of size 7, and our hash function is key % tableSize. We insert 24 into the table, and 24 ends up at index 3, since 24 % 7 = 3. Then, let us say we add more elements and now want to rehash. The new table will be twice the size of the initial table, i.e. the new table size will be 14. While copying the elements into the new hash table, will the element 24 still be at index 3, or will it be at index 24 % 14 = 10? In other words, do we use the new table size while copying the elements, or do the elements stay at their initial indexes?
Thanks
It depends on your hash function. In your case you should use key % new_table_size; otherwise the slots after index 6 would never be mapped to by the hash function. Those slots would only get occupied if you used linear probing to resolve collisions (where we look for the next empty slot). Using the new size helps reduce collisions early on; otherwise you would face many collisions even though the table has not yet reached its load factor.
An important thing about hash tables is that the order of the elements is not guaranteed; it depends on the hash function.
For your example: if you copy the data into the new table while still using 7 as the hash size, indexes 7 through 13 of the new array will be unused, because you allocated a bigger array but your hash function cannot produce a result greater than 6. Those unused slots are wasted space, so it's better to use key % 14 instead.
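As a small arithmetic sketch of where key 24 lands before and after doubling the table:
val key = 24
val oldSize = 7
val newSize = 14
key % oldSize   // 3  -> slot in the old table
key % newSize   // 10 -> slot in the new table; rehashing must use the new size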
Another interesting thing is that the internal state of a hash table depends not only on the hash function but also on the order in which the elements were inserted. For example, imagine a hash table X (implemented with an array of linked lists) of size 4, into which you insert the elements 2, 3, 6, 10 in that order:
x
{
[0] -> []
[1] -> []
[2] -> [2,6,10]
[3] -> [3]
}
Again, key % size is used as the hash function.
Now, if we insert the keys in a different order, 10, 6, 3, 2, we get:
x
{
[0] -> []
[1] -> []
[2] -> [10,6,2]
[3] -> [3]
}
I've written all of this just to show that two copies of a hash table can look different internally, depending on many factors. I think that was the concern behind your question.
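A minimal Scala sketch that reproduces the two layouts above with a tiny chained table (hash = key % 4; the helper name is mine):
def build(keys: Seq[Int], size: Int): Vector[List[Int]] = {
  val buckets = Array.fill(size)(List.empty[Int])
  for (k <- keys) buckets(k % size) = buckets(k % size) :+ k   // append to the bucket's chain
  buckets.toVector
}

build(Seq(2, 3, 6, 10), 4)   // Vector(List(), List(), List(2, 6, 10), List(3))
build(Seq(10, 6, 3, 2), 4)   // Vector(List(), List(), List(10, 6, 2), List(3))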
