Spark: lazy action? - performance

I am working on a complex application. From the source data, we compute many statistics, e.g.:
val df1 = sourceData.filter($"col1" === "val" and ...)
.select(...)
.groupBy(...)
.min()
val df2 = sourceData.filter($"col2" === "val" and ...)
.select(...)
.groupBy(...)
.count()
As the dataframes are grouped on the same columns, the resulting dataframes are then joined together:
df1.join(df2, Seq("groupCol"), "full_outer")
.join(df3....)
.write.save(...)
(in my code this is done in a loop)
This is not performant. The problem is that each dataframe (I have about 30) ends with an action, so in my understanding each dataframe is computed and returned to the driver, which then sends the data back to the executors to perform the join.
This gives me memory errors. I can increase the driver memory, but I am looking for a better way of doing it. For example, if all dataframes were computed only at the end (when saving the joined dataframe), I guess that everything would be managed by the cluster.
Is there a way to do a kind of lazy action? Or should I join the dataframes in another way?
Thx

First of all, the code you've shown contains only one action-like operation - DataFrameWriter.save. All other components are lazy.
But laziness doesn't really help you here. The biggest problem (assuming no ugly data skew or misconfigured broadcasting) is that the individual aggregations require separate shuffles and an expensive subsequent merge.
A naive solution would be to leverage the fact that
"the dataframe are grouped on the same columns"
and shuffle first:
val groupColumns: Seq[Column] = ???
val sourceDataPartitioned = sourceData.repartition(groupColumns: _*)
and use the result to compute individual aggregates
val df1 = sourceDataPartitioned
...
val df2 = sourceDataPartitioned
...
However, this approach is rather brittle and is unlikely to scale in the presence of large or skewed groups.
Therefore it would be much better to rewrite your code to perform only a single aggregation. Luckily for you, standard SQL behavior is all you need.
Let's start by structuring your code into three-element tuples with:
_1 being a predicate (the condition you use with filter).
_2 being a list of Columns for which you want to compute aggregates.
_3 being an aggregate function.
An example structure can look like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{count, min}
val ops: Seq[(Column, Seq[Column], Column => Column)] = Seq(
($"col1" === "a" and $"col2" === "b", Seq($"col3", $"col4"), count),
($"col2" === "b" and $"col3" === "c", Seq($"col4", $"col5"), min)
)
Now you compose aggregate expressions using
agg_function(when(predicate, column))
pattern
import org.apache.spark.sql.functions.when
val exprs: Seq[Column] = ops.flatMap {
case (p, cols, f) => cols.map {
case c => f(when(p, c))
}
}
and use it on the sourceData
sourceData.groupBy(groupColumns: _*).agg(exprs.head, exprs.tail: _*)
Add aliases when necessary.
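To see why a single aggregation built from agg_function(when(predicate, column)) can replace the filter, aggregate and join sequence, here is a minimal plain-Python sketch. It is not Spark code: the sample rows, the column names and the helper conditional_agg are made up for illustration, and None stands in for SQL NULL, which aggregate functions skip.

```python
from collections import defaultdict

# Hypothetical sample rows; "g" plays the role of the grouping column.
rows = [
    {"g": "x", "col1": "a", "col2": "b", "col3": 5},
    {"g": "x", "col1": "a", "col2": "z", "col3": 2},
    {"g": "y", "col1": "q", "col2": "b", "col3": 7},
    {"g": "y", "col1": "q", "col2": "b", "col3": 3},
]

def sql_count(values):
    # Like SQL COUNT(expr): NULLs (None) are not counted.
    return sum(1 for v in values if v is not None)

def sql_min(values):
    # Like SQL MIN(expr): NULLs are skipped; an all-NULL group yields NULL.
    present = [v for v in values if v is not None]
    return min(present) if present else None

# (alias, predicate, column, aggregate), mirroring the (predicate, columns, function) tuples.
ops = [
    ("count_col3", lambda r: r["col1"] == "a", "col3", sql_count),
    ("min_col3", lambda r: r["col2"] == "b", "col3", sql_min),
]

def conditional_agg(rows, key, ops):
    # Group once, then evaluate every aggregate over the same groups.
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    result = {}
    for k, members in groups.items():
        result[k] = {
            # when(predicate, column): the value if the predicate holds, else NULL.
            alias: agg([r[col] if pred(r) else None for r in members])
            for alias, pred, col, agg in ops
        }
    return result

result = conditional_agg(rows, "g", ops)
print(result)
```

One difference from the join-based version is worth keeping in mind: a group with no row matching a predicate still appears in the output, with a count of 0 and a NULL minimum, whereas a full outer join of separately filtered results would leave that cell missing.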

Related

In PySpark, does df.select(column1, column2, ...) impact performance?

For example, I have a dataframe with 10 columns, and later I need to join this dataframe with other dataframes. But only column1 and column2 of the dataframe are used; the others are not useful.
If I do this:
df1 = df.select(['column1', 'column2'])
...
...
result = df1.join(other_df)....
Is this good for the performance?
If yes, why is it good? Is there any documentation?
Thanks.
Spark is a distributed, lazily evaluated framework, which means that whether you select all columns or only some of them, they are brought into memory only when an action is applied.
So if you run
df.explain()
at any stage, it will show you which columns are projected. A column is loaded into memory only if it is required; otherwise it is not selected.
It is still better to specify the required columns explicitly: it is considered best practice and makes the logic of your code easier to follow.
To learn more about actions and transformations, see the Spark programming guide.
Especially for a join, the fewer columns you use (and therefore select), the more efficient it will be.
Of course, Spark is lazy and optimized, which means that as long as you don't call a triggering function such as show() or count() in between, the order of the operations makes no difference.
So doing :
df = df.select(["a", "b"])
df = df.join(other_df)
df.show()
OR join first and select after :
df = df.join(other_df)
df = df.select(["a", "b"])
df.show()
makes no difference, because Spark optimizes the query and applies the select first when it compiles the plan for the final count() or show().
On the other hand, and to answer your question,
doing a show() or count() in between will definitely impact performance, and the version with the fewest columns will definitely be faster.
Try comparing :
df = df.select(["a", "b"])
df.count()
df = df.join(other_df)
df.show()
and
df = df.join(other_df)
df.count()
df = df.select(["a", "b"])
df.show()
You will see the difference in time.
The difference might not be huge, but if you're using filters (e.g. df = df.filter(df["b"] == "blabla")), it can be really big, especially if you're working with joins.

Convert rank() partition by oracle query to pyspark sql [duplicate]

I'm trying to use some window functions (ntile and percentRank) on a data frame, but I don't know how to use them.
Can anyone help me with this please? In the Python API documentation there are no examples about it.
Specifically, I'm trying to get quantiles of a numeric field in my data frame.
I'm using spark 1.4.0.
To be able to use a window function you have to create a window first. The definition is pretty much the same as in normal SQL, which means you can define an order, a partition, or both. First let's create some dummy data:
import numpy as np
np.random.seed(1)
keys = ["foo"] * 10 + ["bar"] * 10
values = np.hstack([np.random.normal(0, 1, 10), np.random.normal(10, 1, 10)])
df = sqlContext.createDataFrame([
{"k": k, "v": round(float(v), 3)} for k, v in zip(keys, values)])
Make sure you're using HiveContext (Spark < 2.0 only):
from pyspark.sql import HiveContext
assert isinstance(sqlContext, HiveContext)
Create a window:
from pyspark.sql.window import Window
w = Window.partitionBy(df.k).orderBy(df.v)
which is equivalent to
(PARTITION BY k ORDER BY v)
in SQL.
As a rule of thumb, window definitions should always contain a PARTITION BY clause; otherwise Spark will move all data to a single partition. ORDER BY is required for some functions, while in other cases (typically aggregates) it may be optional.
There are also two optional clauses which can be used to define the window span: ROWS BETWEEN and RANGE BETWEEN. These won't be useful for us in this particular scenario.
Finally we can use it for a query:
# Note: percentRank was renamed to percent_rank in Spark 1.6; use that name on newer versions.
from pyspark.sql.functions import percentRank, ntile
df.select(
"k", "v",
percentRank().over(w).alias("percent_rank"),
ntile(3).over(w).alias("ntile3")
)
Note that ntile is not related in any way to the quantiles.
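To make the difference between the two functions concrete, here is a small plain-Python sketch of the usual SQL definitions, applied to one already sorted partition. It is an illustration only, not Spark code; the function names simply mirror the SQL ones and the sample values are made up.

```python
def percent_rank(sorted_values):
    # (rank - 1) / (rows - 1); tied values share the rank of their first occurrence.
    n = len(sorted_values)
    if n == 1:
        return [0.0]
    first_pos = {}
    for i, v in enumerate(sorted_values):
        first_pos.setdefault(v, i)
    return [first_pos[v] / (n - 1) for v in sorted_values]

def ntile(sorted_values, buckets):
    # Split the rows into `buckets` groups whose sizes differ by at most one;
    # the larger groups come first. Only row positions matter, not the values.
    n = len(sorted_values)
    base, extra = divmod(n, buckets)
    out = []
    for b in range(buckets):
        size = base + (1 if b < extra else 0)
        out.extend([b + 1] * size)
    return out

values = sorted([0.3, 1.6, 1.6, 2.4, 5.0])
print(percent_rank(values))  # [0.0, 0.25, 0.25, 0.75, 1.0]
print(ntile(values, 3))      # [1, 1, 2, 2, 3]
```

In this sketch the two equal values 1.6 receive the same percent_rank but land in different ntile buckets, because ntile only counts rows.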

Spark imbalanced partitions after leftOuterJoin

I have a pattern like this... pseudo-code, but I think it makes sense...
type K // key, function of records in B
class A // compact data structure
val a: RDD[(K, A)] // many records
class B { // massive data structure
def funcIter // does full O(n) scans of huge data structure
}
val b: RDD[(K,B)] // comparatively few records
val emptyB = new B("", Nil, etc.)
val C: RDD[(A,B)] = {
a
.leftOuterJoin(b, 1.5x increase in partitions)
.map{ case (k, (val_a, option_b)) => (val_a, option_b.getOrElse(emptyB)) }
.map{ case (val_a, val_b) => (val_a, val_b.funcIter(val_a.attributes)) }
}
My problem is that records in val b vary enormously in size, with some being huge, and since it's a leftOuterJoin, each of those records is replicated thousands or tens of thousands of times to join to val a. So it's not just that there are large values in b to handle, but that the worst-case records in b end up copied many times in one partition after the join. The worst partitions are almost exclusively made up of many copies of only the worst-case values from b. So my last few partitions take ages to work through while most of my enormous cluster sits idle, draining my wallet.
Is there anything I can do to modify this pattern? For example, try broadcasting b and joining with a in place (it's probably too big), or split partitions after the join, perhaps splitting copies of the worst b values apart into different partitions without doing another shuffle? Something like the opposite of a coalesce, so that at least multiple executors on the same core instance (I have 3 executors per core instance) can work on those records in parallel.
Thanks for any advice.

Sending Items to specific partitions

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions{
iter =>
val usefulItem = aStructure(samePartitionKey)
iter.map(//process iterator)
}
How could I go about setting up the partitions such that the specific data structure I need will be present for the mapPartitions, without the extra overhead of sending over all values (which would happen if I were to make a broadcast variable)?
One thought I have had is to store the objects in HDFS, but I'm not sure whether that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom Partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).
Thank you for your help!
edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a MapPartitions within the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition and doing a join would cause me to have a lot of copies of that index.
val vectors:RDD[(Int, SparseVector)]
val invertedIndexes:RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions {
iter =>
val invIndex = invertedIndexes(samePartitionKey)
iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a generic element, returns the partition it belongs to. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
import org.apache.spark.Partitioner
// create example data
val a = sc.parallelize(List((1, 1), (1, 2), (2, 3), (2, 4)))
// create simple sample partitioner - 2 partitions, one for odd
// one for even key.hashCode. You should put your partitioning logic here
// (math.abs keeps the index non-negative when hashCode is negative)
val p = new Partitioner { def numPartitions: Int = 2; def getPartition(key: Any): Int = math.abs(key.hashCode % 2) }
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map() .. // go on with your processing
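As a rough mental model of what reducing with a partitioner does, the following plain-Python sketch routes every key to a partition index and reduces values per key inside that partition. It is not Spark code; get_partition and reduce_by_key are hypothetical helpers that only imitate the behaviour described above.

```python
def get_partition(key, num_partitions=2):
    # Python's % is non-negative for a positive modulus,
    # so this is always a valid partition index.
    return hash(key) % num_partitions

def reduce_by_key(pairs, func, num_partitions=2):
    # One dict per partition; each key is reduced inside the partition it maps to.
    partitions = [dict() for _ in range(num_partitions)]
    for k, v in pairs:
        part = partitions[get_partition(k, num_partitions)]
        part[k] = func(part[k], v) if k in part else v
    return partitions

pairs = [(1, 1), (1, 2), (2, 3), (2, 4)]
parts = reduce_by_key(pairs, lambda a, b: a + b)
print(parts)       # [{2: 7}, {1: 3}]
print(len(parts))  # 2
```

Because the partition index depends only on the key, a second dataset routed with the same function would place matching keys at the same index, which is the property the question is after.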

Spark dataframe requery when converted to rdd

I have a dataframe queried as
val df1 = sqlContext.sql("select * from table1 limit 1")
df1.cache()
df1.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7698141.001,8141-11,GOOD,22.01,number,2015-10-07 11:34:37.492])
However, if I continue
val df2 = df1.rdd
df2.take(1)
scala> Array[org.apache.spark.sql.Row] = Array([10,20151100-0000,B95A,293759,0,7685751.001,5751-05,GOOD,0.0,number,2015-10-03 13:19:22.631])
The two results are totally different even though I tried to cache df1. Is there a way to make the result consistent ie. df2 is not going to requery the table again to get the value? Thank you.
With take(1) you are just taking an arbitrary value out of the RDD. No order or sorting is specified when the command is executed, and because the dataset is distributed, you are not guaranteed to get the same value every time.
You could sort or filter the RDD, e.g. based on a key (index) or a schema column. Then you should always be able to extract the same value you are looking for.
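The effect can be imitated in plain Python. In this sketch (not Spark code) the list of lists stands in for partitions that may be visited in any order; take and take_ordered are hypothetical helpers, loosely modelled on the RDD methods of similar names.

```python
import random

def take(partitions, n):
    # Return the first n rows in whatever order the partitions are visited.
    rows = [row for part in partitions for row in part]
    return rows[:n]

def take_ordered(partitions, n, key):
    # Sort on a key first, so the result no longer depends on partition order.
    rows = [row for part in partitions for row in part]
    return sorted(rows, key=key)[:n]

partitions = [[(3, "c"), (1, "a")], [(2, "b")], [(5, "e"), (4, "d")]]
rng = random.Random(0)
for _ in range(5):
    shuffled = partitions[:]
    rng.shuffle(shuffled)
    # The first value depends on the partition order; the second never changes.
    print(take(shuffled, 1), take_ordered(shuffled, 1, key=lambda r: r[0]))
```

The same reasoning applies to the original query: a limit without an explicit ordering does not pin down which row is returned.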

Resources