Spark Scala DataFrame vs Dataset - Performance analysis

Starting out with Spark 2.0.1, I have some questions. I read a lot of documentation but so far could not find sufficient answers:
What is the difference between
df.select("foo")
df.select($"foo")
Do I understand correctly that:
myDataSet.map(foo.someVal) is type-safe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)?
All the other commands, e.g. select, are just syntactic sugar. They are not type-safe and a map could be used instead. How could I make df.select("foo") type-safe without a map statement?
Why should I use a UDF / UDAF instead of a map (assuming map stays in the Dataset representation)?

The difference between df.select("foo") and df.select($"foo") is the signature. The former takes at least one String, the latter zero or more Columns. There is no practical difference beyond that.
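To make that concrete, a minimal sketch (assuming spark.implicits._ is in scope, which provides the $ interpolator):
import org.apache.spark.sql.functions.col

df.select("foo")      // String overload
df.select($"foo")     // Column, built with the $ string interpolator
df.select(col("foo")) // Column, built with functions.col - all three resolve the same column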
myDataSet.map(foo.someVal) type-checks, but since any Dataset operation uses an RDD of objects, there is a significant overhead compared to DataFrame operations. Let's take a look at a simple example:
case class FooBar(foo: Int, bar: String)
val ds = Seq(FooBar(1, "x")).toDS
ds.map(_.foo).explain
== Physical Plan ==
*SerializeFromObject [input[0, int, true] AS value#123]
+- *MapElements <function1>, obj#122: int
+- *DeserializeToObject newInstance(class $line67.$read$$iw$$iw$FooBar), obj#121: $line67.$read$$iw$$iw$FooBar
+- LocalTableScan [foo#117, bar#118]
As you can see, this execution plan requires access to all fields and has to DeserializeToObject.
No. In general, the other methods are not syntactic sugar and generate a significantly different execution plan. For example:
ds.select($"foo").explain
== Physical Plan ==
LocalTableScan [foo#117]
Compared to the plan shown before, it can access the column directly. It is not so much a limitation of the API as a result of a difference in the operational semantics.
How could I make df.select("foo") type-safe without a map statement?
There is no such option. While typed columns allow you to transform a statically typed Dataset into another statically typed Dataset:
ds.select($"bar".as[Int])
this is not type-safe. There are some other attempts to include type-safe optimized operations, like typed aggregations, but this is an experimental API.
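For illustration, here is one way to write a type-safe aggregation with the Aggregator API (a minimal sketch assuming Spark 2.x; SumFoo is a hypothetical aggregator, not part of the original answer):
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Sums the foo field of the FooBar case class defined above
object SumFoo extends Aggregator[FooBar, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, fb: FooBar): Long = acc + fb.foo
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

ds.select(SumFoo.toColumn) // Dataset[Long], input type checked at compile time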
Why should I use a UDF / UDAF instead of a map?
It is completely up to you. Each distributed data structure in Spark provides its own advantages and disadvantages (see for example Spark UDAF with ArrayType as bufferSchema performance issues).
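As a rough comparison of the two styles on the ds from above (a sketch, not from the original answer; addOne is a hypothetical function):
import org.apache.spark.sql.functions.udf

// UDF: operates on Columns, so the surrounding query stays in the columnar plan
val addOne = udf((i: Int) => i + 1)
ds.select(addOne($"foo") as "foo_plus_one").explain

// Typed map: goes through object serialization, as in the DeserializeToObject plan above
ds.map(fb => fb.foo + 1).explain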
Personally, I find the statically typed Dataset to be the least useful:
It doesn't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution plan optimizations, it doesn't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.
Typed transformations are black boxes, and effectively create an analysis barrier for the optimizer. For example, selections (filters) cannot be pushed over a typed transformation:
ds.groupBy("foo").agg(sum($"bar") as "bar").as[FooBar].filter(x => true).where($"foo" === 1).explain
== Physical Plan ==
*Filter (foo#133 = 1)
+- *Filter <function1>.apply
+- *HashAggregate(keys=[foo#133], functions=[sum(cast(bar#134 as double))])
+- Exchange hashpartitioning(foo#133, 200)
+- *HashAggregate(keys=[foo#133], functions=[partial_sum(cast(bar#134 as double))])
+- LocalTableScan [foo#133, bar#134]
Compared to:
ds.groupBy("foo").agg(sum($"bar") as "bar").as[FooBar].where($"foo" === 1).explain
== Physical Plan ==
*HashAggregate(keys=[foo#133], functions=[sum(cast(bar#134 as double))])
+- Exchange hashpartitioning(foo#133, 200)
+- *HashAggregate(keys=[foo#133], functions=[partial_sum(cast(bar#134 as double))])
+- *Filter (foo#133 = 1)
+- LocalTableScan [foo#133, bar#134]
This impacts features like predicate pushdown or projection pushdown.
It is not as flexible as RDDs, with only a small subset of types supported natively.
"Type safety" with Encoders is disputable when a Dataset is converted using the as method. Because the data shape is not encoded in the signature, the compiler can only verify the existence of an Encoder (see the sketch below).
Related questions:
Perform a typed join in Scala with Spark Datasets
Spark 2.0 DataSets groupByKey and divide operation and type safety

Spark Dataset is way more powerful than Spark DataFrame. A small example: you can only create a DataFrame of Row, Tuple or any primitive datatype, but a Dataset gives you the power to create a Dataset of any non-primitive type too, i.e. you can literally create a Dataset of your own object type.
Ex:
case class Employee(id: Int, name: String)
Dataset[Employee] // is valid
DataFrame[Employee] // is invalid
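A minimal sketch of that point, assuming spark.implicits._ is in scope:
import org.apache.spark.sql.{DataFrame, Dataset}

val ds: Dataset[Employee] = Seq(Employee(1, "A")).toDS // typed rows
val df: DataFrame = ds.toDF                            // a DataFrame is just Dataset[Row]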

DATAFRAME VS DATASET
DATAFRAME: DataFrame is an abstraction that allows a schema view of data.
scala> case class Person(name: String, age: Int, address: String)
defined class Person
scala> val df = List(Person("Sumanth", 23, "BNG")).toDF
DATASET: Dataset is an extension of the DataFrame API, the latest abstraction, which tries to provide the best of both RDD and DataFrame.
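For illustration, a small sketch (assuming spark.implicits._ is in scope) showing the typed view next to the schema view:
val people = List(Person("Sumanth", 23, "BNG"))
val ds = people.toDS // Dataset[Person]: typed rows
val df = people.toDF // DataFrame: rows described by a schema, i.e. Dataset[Row]
df.printSchema()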

Related

Faster XPath expressions to execute queries from multiple XMLs

I have the two following XMLs and the problem statement is as follows.
Parse XML 1 and, if a subnode of any node_x contains 'a' in its name (like in value_a_0) and value_a_0 contains a specific number, parse XML 2, go to node_x-1 for all abc_x, and compare the content of value_x-1_0/1/2/3 with certain entities.
If a subnode of any node_x contains 'b' in its name (like in value_b_0) and value_b_0 contains a specific number (say 'm'), parse XML 2, go to node_x+1 for all abc_x, and compare the content of value_x-1_0/1/2/3 with 'm'.
Example: for every value_a_0 in record1, check whether the value_a_0 node contains 5. If so, which is the case for node_1 and node_9, go to record2/node_0 and record2/node_8 and check whether the contents of value_0_0/1/2/3 contain 5 or not. Similarly for the rest of the cases.
I was wondering what would be the best practice to solve this. Is there any hash-table approach in XPath 3.0?
First XML
<record1>
<node_1>
<value_a_0>5</value_a_0>
<value_b_1>0</value_b_1>
<value_c_2>10</value_c_2>
<value_d_3>8</value_d_3>
</node_1>
.................................
.................................
<node_9>
<value_a_0>5</value_a_0>
<value_b_1>99</value_b_1>
<value_c_2>53</value_c_2>
<value_d_3>5</value_d_3>
</node_9>
</record1>
Second XML
<record2>
<abc_0>
<node_0>
<value_0_0>5</value_0_0>
<value_0_1>0</value_0_1>
<value_0_2>150</value_0_2>
<value_0_3>81</value_0_3>
</node_0>
<node_1>
<value_1_0>55</value_1_0>
<value_1_1>30</value_1_1>
<value_1_2>150</value_1_2>
<value_1_3>81</value_1_3>
</node_1>
.................................
.................................
<node_63>
<value_63_0>1</value_63_0>
<value_63_1>99</value_63_1>
<value_63_2>53</value_63_2>
<value_63_3>5</value_63_3>
</node_63>
</abc_0>
================================================
<abc_99>
<node_0>
<value_0_0>555</value_0_0>
<value_0_1>1810</value_0_1>
<value_0_2>140</value_0_2>
<value_0_3>80</value_0_3>
</node_0>
<node_1>
<value_1_0>555</value_1_0>
<value_1_1>1810</value_1_1>
<value_1_2>140</value_1_2>
<value_1_3>80</value_1_3>
</node_1>
<node_2>
<value_2_0>5</value_2_0>
<value_2_1>60</value_2_1>
<value_2_2>10</value_2_2>
<value_2_3>83</value_2_3>
</node_2>
.................................
.................................
<node_63>
<value_63_0>1</value_63_0>
<value_63_1>49</value_63_1>
<value_63_2>23</value_63_2>
<value_63_3>35</value_63_3>
</node_63>
</abc_99>
</record2>
First I would say that using structured element names like this is pretty poor XML design. That's relevant because when you do a join query in XPath or XQuery you're very dependent on the optimizer to find a fast execution path (e.g. a hash join), and the "weirder" your query is, the less likely the optimizer is to find a fast execution strategy.
I often start by converting "weird" XML into something more sanitary. For example, in this case I would transform <value_a_0>5</value_a_0> into <value cat="a" seq="0">5</value>. That makes it easier to write your query and easier for the optimizer to recognize it, and the transformation phase is reusable, so you can apply it before any operation on the XML, not just this one.
If you're looking for better than O(n*m) performance on a join query, you need to look at the capabilities of your chosen XPath engine. Saxon-EE for example will do such optimizations, Saxon-HE won't. You're generally more likely to find advanced optimization in an XQuery engine than an XPath engine.
As for the detail of your query, I got lost with the requirement statement when you start talking about abc_x. I'm not sure what that refers to.
It seems like a task that can be partially solved by grouping, but as in your previous examples, the poor use of XML element names that all differ by index values (which should be part of an element or attribute value, not part of the element name) makes it harder to write succinct code:
let $abc-elements := $doc2/record2/*
for $node-element in record1/*
for $index in (1 to count($node-element[1]/*))
for $index-element in $node-element/*[position() = $index]
group by $index, $group-value := $index-element
where tail($index-element)
return
<group index="{$index}" value="{$group-value}">
{
let $suffixes := $index-element/../string((xs:integer(substring-after(local-name(), '_')) - 1)),
$relevant-abc-node-elements := $abc-elements/*[substring-after(local-name(), '_') = $suffixes]
return $relevant-abc-node-elements[* = $group-value]
}
</group>
https://xqueryfiddle.liberty-development.net/nbUY4kA

What problem does a reinitializable iterator solve?

From the tf.data documentation:
A reinitializable iterator can be initialized from multiple different
Dataset objects. For example, you might have a training input pipeline
that uses random perturbations to the input images to improve
generalization, and a validation input pipeline that evaluates
predictions on unmodified data. These pipelines will typically use
different Dataset objects that have the same structure (i.e. the same
types and compatible shapes for each component).
The following example was given:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)
# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)
# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
for _ in range(20):
  # Initialize an iterator over the training dataset.
  sess.run(training_init_op)
  for _ in range(100):
    sess.run(next_element)
  # Initialize an iterator over the validation dataset.
  sess.run(validation_init_op)
  for _ in range(50):
    sess.run(next_element)
It is unclear what the benefit of this complexity is.
Why not simply create 2 different iterators?
The original motivation for reinitializable iterators was as follows:
The user's input data is in two or more tf.data.Dataset objects with the same structure but different pipeline definitions.
For example, you might have a training data pipeline with augmentations in a Dataset.map(), and an evaluation data pipeline that produced raw examples, but they would both produce batches with the same structure (in terms of the number of tensors, their element types, shapes, etc.).
The user would define a single training graph that took input from a tf.data.Iterator, created using Iterator.from_structure().
The user could then switch between the different input data sources by reinitializing the iterator from one of the datasets.
In hindsight, reinitializable iterators have turned out to be quite hard to use for their intended purpose. In TensorFlow 2.0 (or 1.x with eager execution enabled), it is much easier to create iterators over different datasets using idiomatic Python for loops and high-level training APIs:
tf.enable_eager_execution()
model = ... # A `tf.keras.Model`, or some other class exposing `fit()` and `evaluate()` methods.
train_data = ... # A `tf.data.Dataset`.
eval_data = ... # A `tf.data.Dataset`.
for i in range(NUM_EPOCHS):
  model.fit(train_data, ...)
  # Evaluate every 5 epochs.
  if i % 5 == 0:
    model.evaluate(eval_data, ...)

Spark RDD Persistence and Partitions

When a certain RDD is created in Spark for example:
lines = sc.textFile("README.md")
And then a transformation is called on this RDD:
pythonLines = lines.filter(lambda line: "Python" in line)
If you call an action on this transformed filter RDD (such as pythonLines.first), what does it mean when they say an RDD will be recomputed once again each time you run an action on it? I thought the original RDD that you created using the textFile method is not persisted after you called the filter transformation on that original RDD. So will it just recompute the most recent transformed RDD, in this case the RDD I made using the filter transformation? I don't really see why that would be necessary, if my assumption is correct.
In Spark, RDDs are lazily evaluated. This means that if you simply write
lines = sc.textFile("README.md").map(xxx)
Your program will exit without reading the file since you never used the result. If you write something like:
linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
sumLinesLength = linesLength.reduce(_ + _) // <-- scala way
maxLineLength = linesLength.max()
The computations needed to obtain linesLength will be made twice, since you are reusing it in two different places. To avoid that, you should persist the resulting RDD before using it in two different ways:
linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
linesLength.persist()
// ...
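For completeness, a minimal sketch (Scala API) of reusing the persisted RDD in the two actions from above:
val linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
linesLength.persist() // or .cache(); keeps the computed lengths around

val sumLinesLength = linesLength.reduce(_ + _) // first action: reads the file, populates the cache
val maxLineLength  = linesLength.max()         // second action: served from the cached partitions
linesLength.unpersist()                        // release the cache when done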
You can also take a look at https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. Hope my explanation isn't too confusing!

groupingBy operation in Java-8

I'm trying to rewrite the famous example of Spark's text classification (http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/) in Java 8.
I have a problem: in this code I'm doing some data preparation to get the IDFs of all words in all files:
termDocsRdd.collect().stream().flatMap(doc -> doc.getTerms().stream()
.map(term -> new ImmutableMap.Builder<String, String>()
.put(doc.getName(),term)
.build())).distinct()
And I'm stuck on the groupBy operation. (I need to group this by term, so each term must be a key and the value must be a sequence of documents).
In Scala this operation looks very simple - .groupBy(_._2).
But how can I do this in Java?
I tried to write something like:
.groupingBy(term -> term, mapping((Document) d -> d.getDocNameContainsTerm(term), toList()));
but it's incorrect...
Does somebody know how to write this in Java?
Thank you very much.
If I understand you correctly, you want to do something like this:
(import static java.util.stream.Collectors.*;)
Map<Term, Set<Document>> collect = termDocsRdd.collect().stream().flatMap(
doc -> doc.getTerms().stream().map(term -> new AbstractMap.SimpleEntry<>(doc, term)))
.collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toSet())));
The use of Map.Entry / AbstractMap.SimpleEntry is due to the absence of a standard Pair<K,V> class in Java 8. Map.Entry implementations can fulfill this role, but at the cost of having unintuitive and verbose type and method names (regarding the task of serving as a Pair implementation).
If you are using the current Eclipse version (I tested with Luna SR1, 20140925) with its limited type inference, you have to help the compiler a little bit:
Map<Term, Set<Document>> collect = termDocsRdd.collect().stream().flatMap(
doc -> doc.getTerms().stream().<Map.Entry<Document,Term>>map(term -> new AbstractMap.SimpleEntry<>(doc, term)))
.collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toSet())));

Dealing with huge data

Let's assume that I have a big file (500 GB+) and I have a data record declaration Sample which indicates a row in that file:
data Sample = Sample {
  field1 :: Int,
  field2 :: Int
}
Now what is the data structure suitable for processing (filter/map/fold) on the collection of these Sample values? Don Stewart has answered here that the Sample type should not be treated as a list [Sample] but as a Vector. My question is: how does representing it as a Vector solve the problem? Won't representing the file contents as a vector of Sample values also occupy around 500 GB?
What is the recommended method for solving these types of problems?
As far as I can see, the operations you want to use (filter, map and fold) can be done via both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).
Both libraries are perfectly capable of manipulating/folding and filtering streaming data. Depending on your scenario they might solve your actual problem.
If, however, you need to investigate values several times, you're better off loading chunks into a vector, as Don said.

Resources