Sending Items to specific partitions - hadoop

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions{
iter =>
val usefulItem = aStructure(samePartitionKey)
iter.map(//process iterator)
}
How could I go about setting up the partitioning so that the specific data structure I need is present for the mapPartitions call, without the extra overhead of sending over all values (which would happen if I were to make a broadcast variable)?
One thought I have been having is to store the objects in HDFS, but I'm not sure if that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom Partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).
thank you for your help!
edit:
Pangea makes a very good point that I should offer some more specifics. Essentially, I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a mapPartitions within the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition, and doing a join would cause me to have a lot of copies of that index.
val vectors:RDD[(Int, SparseVector)]
val invertedIndexes:RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions{
iter =>
val invIndex = invertedIndexes(samePartitionKey)
iter.map(invIndex.calculateSimilarity(_))
}

A Partitioner is a function that, given a generic element, returns the partition it belongs to. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
import org.apache.spark.Partitioner
// create example data
val a = sc.parallelize(List((1, 1), (1, 2), (2, 3), (2, 4)))
// create simple sample partitioner - 2 partitions, one for odd
// one for even key.hashCode. You should put your partitioning logic here.
// floorMod keeps the result in [0, numPartitions) even for negative hash codes
val p = new Partitioner {
def numPartitions: Int = 2
def getPartition(key: Any): Int = Math.floorMod(key.hashCode, numPartitions)
}
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map() .. // go on with your processing
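One detail worth checking in any custom partitioner: Spark expects getPartition to return a value between 0 and numPartitions - 1, while the % operator on the JVM returns a negative result for a negative hashCode. Below is a minimal plain-Scala sketch (no Spark dependency) of a non-negative modulo. The object and method names are hypothetical and only there for illustration.

```scala
// Hypothetical helper, not part of Spark: maps any key to a partition index
// in the range [0, numPartitions), even when key.hashCode is negative.
object PartitionIndex {
  def forKey(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    val raw = key.hashCode % numPartitions // may be negative on the JVM
    if (raw < 0) raw + numPartitions else raw
  }
}
```

Inside a Partitioner, getPartition could delegate to a function like this one.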


Spark: lazy action?

I am working on a complex application. From source data, we compute many statistics, e.g.:
val df1 = sourceData.filter($"col1" === "val" and ...)
.select(...)
.groupBy(...)
.min()
val df2 = sourceData.filter($"col2" === "val" and ...)
.select(...)
.groupBy(...)
.count()
As the dataframes are grouped on the same columns, the resulting dataframes are then joined together:
df1.join(df2, Seq("groupCol"), "full_outer")
.join(df3....)
.write.save(...)
(in my code this is done in a loop)
This is not performant. The problem is that each dataframe (I have about 30) ends with an action, so in my understanding each dataframe is computed and returned to the driver, which then sends data back to the executors to perform the join.
This gives me a memory error. I can increase the driver memory, but I am looking for a better way of doing it. For example, if all dataframes were computed only at the end (with the saving of the joined dataframe), I guess that everything would be managed by the cluster.
Is there a way to do a kind of lazy action? Or should I join the dataframes in another way?
Thx
First of all, the code you've shown contains only one action-like operation - DataFrameWriter.save. All other components are lazy.
But laziness doesn't really help you here. The biggest problem (assuming no ugly data skew or misconfigured broadcasting) is that the individual aggregations require separate shuffles and an expensive subsequent merge.
A naive solution would be to leverage the fact that the dataframes are grouped on the same columns, and to shuffle first:
val groupColumns: Seq[Column] = ???
val sourceDataPartitioned = sourceData.repartition(groupColumns: _*)
and use the result to compute individual aggregates
val df1 = sourceDataPartitioned
...
val df2 = sourceDataPartitioned
...
However, this approach is rather brittle and is unlikely to scale in the presence of large or skewed groups.
Therefore it would be much better to rewrite your code to perform only a single aggregation. Luckily for you, standard SQL behavior is all you need.
Let's start by structuring your code into three-element tuples with:
_1 being a predicate (the condition you use with filter).
_2 being a list of Columns for which you want to compute aggregates.
_3 being an aggregate function.
An example structure can look like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, min}
val ops: Seq[(Column, Seq[Column], Column => Column)] = Seq(
($"col1" === "a" and $"col2" === "b", Seq($"col3", $"col4"), count),
($"col2" === "b" and $"col3" === "c", Seq($"col4", $"col5"), min)
)
Now you compose aggregate expressions using the agg_function(when(predicate, column)) pattern:
import org.apache.spark.sql.functions.when
val exprs: Seq[Column] = ops.flatMap {
case (p, cols, f) => cols.map {
case c => f(when(p, c))
}
}
and use it on the sourceData
sourceData.groupBy(groupColumns: _*).agg(exprs.head, exprs.tail: _*)
Add aliases when necessary.
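To see why a single grouping is enough, here is a plain-Scala analogy on an in-memory collection (no Spark involved). The Record case class, its fields, and the two predicates are hypothetical. The only point is that each aggregate ignores the rows its own predicate rejects. That is what agg_function(when(predicate, column)) achieves: when without otherwise yields null for non-matching rows, and the built-in aggregates skip nulls.

```scala
// Hypothetical row type standing in for sourceData; field names are illustrative.
final case class Record(group: String, col1: String, col2: String, value: Int)

object ConditionalAgg {
  // A single groupBy feeds both aggregates, each with its own predicate.
  // Returns group -> (count where col1 == "a", min of value where col2 == "b").
  def aggregate(rows: Seq[Record]): Map[String, (Int, Option[Int])] =
    rows.groupBy(_.group).map { case (g, rs) =>
      val countA = rs.count(_.col1 == "a")
      // None plays the role of SQL NULL when no row in the group matches.
      val minB = rs.filter(_.col2 == "b").map(_.value).reduceOption(_ min _)
      (g, (countA, minB))
    }
}
```

A group with no matching rows still appears in the result, with a count of 0 and an empty minimum, much like the full outer join in the question would produce.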

union not happening with Spark transform

I have a Spark stream in which records are flowing in. And the interval size is 1 second.
I want to union all the data in the stream. So I have created an empty RDD, and then, using the transform method, I union each RDD in the stream with this empty RDD.
I am expecting this empty RDD to have all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
.transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
System.out.println("Aman" +record.count());
StructType schema = DataTypes.createStructType(fields);
Dataset ds = ss.createDataFrame(records, schema);
ds.createOrReplaceTempView("tempTable");
ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have transformedMessages = messages (ignoring the flatMap step, which is not relevant for the discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records is still empty. Nothing in the flow of the program changes it, so it remains empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process iteratively adds to the lineage of the 'records' RDD and also accumulates all data over time. This is not a job that can run stably for a long period of time, as eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach seems neither scalable nor sustainable.
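The behaviour described above can be reproduced without Spark, because RDDs, like Scala's immutable collections, are never modified in place: union returns a new dataset and leaves both operands untouched. Here is a minimal sketch using plain Seq values as a stand-in for RDDs. The batch contents and all names are made up for illustration.

```scala
object UnionDemo {
  // Three micro-batches standing in for the RDDs of a stream.
  val batches: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

  // Mirrors transform(rdd -> rdd.union(records)): each result is a new value,
  // and the empty `records` is never changed.
  val records: Seq[Int] = Seq.empty
  val transformed: Seq[Seq[Int]] = batches.map(batch => batch ++ records)

  // Mirrors foreachRDD { rdd => records = rdd.union(records) }: the reference
  // is reassigned on every batch, so data accumulates (and grows without bound).
  def accumulate(): Seq[Int] = {
    var acc: Seq[Int] = Seq.empty
    batches.foreach(batch => acc = batch ++ acc)
    acc
  }
}
```

In the first variant, records stays empty however many batches arrive. Only the second variant, which reassigns the reference, ends up holding all the data.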

Spark imbalanced partitions after leftOuterJoin

I have a pattern like this... pseudo-code, but I think it makes sense...
type K // key, function of records in B
class A // compact data structure
val a: RDD[(K, A)] // many records
class B { // massive data structure
def funcIter // does full O(n) scans of huge data structure
}
val b: RDD[(K,B)] // comparatively few records
val emptyB = new B("", Nil, etc.)
val C: RDD[(A,B)] = {
a
.leftOuterJoin(b, 1.5x increase in partitions)
.map{ case (k, (val_a, option_b)) => (val_a, option_b.getOrElse(emptyB)) }
.map{ case (val_a, val_b) => (val_a, val_b.funcIter(val_a.attributes)) }
}
My problem is that records in val b vary enormously in size, with some quite enormous, and since it's a leftOuterJoin, each of those records is replicated thousands or tens of thousands of times to join to val a. So it's not just that there are large values in b to handle, but that the worst-case records in b end up copied many times in one partition after the join. The worst partitions are made up almost exclusively of many copies of the worst-case values from b, so my last few partitions take ages to work through while most of my enormous cluster sits idle, draining my wallet.
Is there anything I can do to modify this pattern... try broadcasting b and joining with a in place (it's probably too big).... or split partitions after the join maybe splitting copies of the worst b vals apart into different partitions without doing another shuffle... like the opposite of a coalesce so at least multiple executors on the same core instance (I have 3 executors per core instance) can work on those records in parallel?
Thanks for any advice.

Is it possible to get the first n elements of every RDD in Spark Streaming?

When using Spark Streaming, is it possible to get the first n elements of every RDD in a DStream? In the real world, my stream consists of a number of geotagged events, and I want to take the 100 (or whatever) which are closest to a given point for further processing, but a simple example which shows what I'm trying to do is something like:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
object take {
def main(args: Array[String]) {
val data = 1 to 10
val sparkConf = new SparkConf().setAppName("Take");
val streamingContext = new StreamingContext(sparkConf, Seconds(1))
val rdd = streamingContext.sparkContext.makeRDD(data)
val stream = new ConstantInputDStream(streamingContext, rdd)
// In the real world, do a bunch of stuff which results in an ordered RDD
// This obviously doesn't work
// val filtered = stream.transform { _.take(5) }
// In the real world, do some more processing on the DStream
stream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
I understand I could pull the top n results back to the driver fairly easily, but that isn't something I want to do in this case as I need to do further processing on the RDD after having filtered it down.
Why is it not working? I think your example is fine.
1. Compute the distance for each event.
2. Sort the events by distance, with a number of partitions adapted to your amount of data.
3. Take the first 100 events from each partition (so you'll shuffle only a small part of the initial data), and make the returned collection a new RDD with sparkContext.parallelize(data).
4. Sort again with only one partition, so all the data is shuffled into the same dataset.
5. Take the first 100 events; this is your top 100.
The code for the sort is the same in step 2 and 4, you just change the number of partitions.
Step 1 is executed on the DStream, steps 2 to 5 are executed on the RDDs in a transform operation.
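Steps 2 to 5 can be illustrated with a simplified plain-Scala sketch that uses nested collections in place of an RDD and its partitions. It is not Spark code. The object name, the method name, and the use of integer distances are assumptions made for the example. The idea is the same: shrink each partition to at most n candidates, then sort the much smaller candidate set once.

```scala
object TopN {
  // `partitions` stands in for the partitions of one RDD of distances.
  def closest(partitions: Seq[Seq[Int]], n: Int): Seq[Int] = {
    // Steps 2-3: sort inside each partition and keep at most n candidates from each.
    val candidates = partitions.flatMap(_.sorted.take(n))
    // Steps 4-5: sort the candidates once and keep the overall top n.
    candidates.sorted.take(n)
  }
}
```

Because every partition contributes at most n elements, the final sort handles at most n times the number of partitions, regardless of how large the original data is.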

Hadoop Buffering vs Streaming

Could someone please explain to me what is the difference between Hadoop Streaming vs Buffering?
Here is the context, which I read in the Hive documentation:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are often tagged so that the reducer stage can identify which table they come from.
Consider a case of two tables:
On each reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList. (This is buffering.)
While the rest of the values are streamed through and values for the other tag/table are detected, the values of the first tag are fetched from the saved ArrayList. The two tagged values are joined and written to the output collector.
Contrast this with the case where the larger table's values are kept in the ArrayList: that could result in an OOM error if the ArrayList grows beyond the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // buffer for table1
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // table1 tag
    Text table1Tag = key.getSecond();
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            table1Values.add(value.getFirst());
        } else {
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, using the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join, when records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables are last: it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them together in the proper order. The reducer reads the records grouped by source table, so it sees the groups from the different tables one after another and can only start producing joined rows once it has reached the last table. The groups from the earlier tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row is processed as it is read), since the other tables' groups are already in memory and the join can start.
