Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and found the spark functions to be about 10 times faster at least in one instance. Does anyone know why this is so, and when would a udf be faster (only for instances that an identical spark function exists)?
Here is my testing code (ran on Databricks community ed):
# UDF vs Spark function
from faker import Factory
from pyspark.sql.functions import lit, concat
fake = Factory.create()
fake.seed(4321)
# Each entry consists of last_name, first_name, ssn, job, and age (at least 1)
from pyspark.sql import Row
def fake_entry():
name = fake.name().split()
return (name[1], name[0], fake.ssn(), fake.job(), abs(2016 - fake.date_time().year) + 1)
# Create a helper function to call a function repeatedly
def repeat(times, func, *args, **kwargs):
for _ in xrange(times):
yield func(*args, **kwargs)
data = list(repeat(500000, fake_entry))
print len(data)
data[0]
dataDF = sqlContext.createDataFrame(data, ('last_name', 'first_name', 'ssn', 'occupation', 'age'))
dataDF.cache()
UDF function:
concat_s = udf(lambda s: s+ 's')
udfData = dataDF.select(concat_s(dataDF.first_name).alias('name'))
udfData.count()
Spark Function:
spfData = dataDF.select(concat(dataDF.first_name, lit('s')).alias('name'))
spfData.count()
Ran both multiple times, the udf usually took about 1.1 - 1.4 s, and the Spark concat function always took under 0.15 s.
when would a udf be faster
If you ask about Python UDF the answer is probably never*. Since SQL functions are relatively simple and are not designed for complex tasks it is pretty much impossible compensate the cost of repeated serialization, deserialization and data movement between Python interpreter and JVM.
Does anyone know why this is so
The main reasons are already enumerated above and can be reduced to a simple fact that Spark DataFrame is natively a JVM structure and standard access methods are implemented by simple calls to Java API. UDF from the other hand are implemented in Python and require moving data back and forth.
While PySpark in general requires data movements between JVM and Python, in case of low level RDD API it typically doesn't require expensive serde activity. Spark SQL adds additional cost of serialization and serialization as well cost of moving data from and to unsafe representation on JVM. The later one is specific to all UDFs (Python, Scala and Java) but the former one is specific to non-native languages.
Unlike UDFs, Spark SQL functions operate directly on JVM and typically are well integrated with both Catalyst and Tungsten. It means these can be optimized in the execution plan and most of the time can benefit from codgen and other Tungsten optimizations. Moreover these can operate on data in its "native" representation.
So in a sense the problem here is that Python UDF has to bring data to the code while SQL expressions go the other way around.
* According to rough estimates PySpark window UDF can beat Scala window function.
After years, when I have a more spark knowledge and had second look on the question, just realized what #alfredox really want to ask. So I revised again, and divide the answer into two parts:
To answer Why native DF function (native Spark-SQL function) is faster:
Basically, why native Spark function is ALWAYS faster than Spark UDF, regardless your UDF is implemented in Python or Scala.
Firstly, we need to understand what Tungsten, which is firstly introduced in Spark 1.4.
It is a backend and what it focus on:
Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,
Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,
Whole-Stage Code Generation (aka CodeGen).
One of the biggest Spark performance killer is GC. The GC would pause the every threads in JVM until the GC finished. This is exactly why Off-Heap Memory Management being introduced.
When executing Spark-SQL native functions, the data will stays in tungsten backend. However, in Spark UDF scenario, the data will be moved out from tungsten into JVM (Scala scenario) or JVM and Python Process (Python) to do the actual process, and then move back into tungsten. As a result of that:
Inevitably, there would be a overhead / penalty on :
Deserialize the input from tungsten.
Serialize the output back into tungsten.
Even using Scala, the first-class citizen in Spark, it will increase the memory footprint within JVM, and which may likely involve more GC within JVM.
This issue exactly what tungsten "Off-Heap Memory Management" feature try to address.
To answer if Python would necessarily slower than Scala:
Since 30th October, 2017, Spark just introduced vectorized udfs for pyspark.
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
The reason that Python UDF is slow, is probably the PySpark UDF is not implemented in a most optimized way:
According to the paragraph from the link.
Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead.
However the newly vectorized udfs seem to be improving the performance a lot:
ranging from 3x to over 100x.
Use the higher-level standard Column-based functions with Dataset operators whenever possible before reverting to using your own custom UDF functions since UDFs are a BlackBox for Spark and so it does not even try to optimize them.
What actually happens behind the screens, is that the Catalyst can’t process and optimize UDFs at all, and it threats them as BlackBox, which results in losing many optimizations like Predicate pushdown, Constant folding and many others.
Related
I replaced CIFAR-10 preprocessing pipeline in the project with Dataset API approach and it resulted in performance decrease of about 10-20%.
Preporcessing is rather standart:
- read image from disk
- make random/crop and flip
- shuffle, batch
- feed to the model
Overall i see that batche processing is now 15% faster, but every once in a while (or, more precisely, whenever I reinitialize dataframe or expect reshuffling) the batch is being blocked for up long time (30 sec) which totals to slower epoch-per-epoch processing.
This behaviour seems to do something with internal hashing. If I reduce N in ds.shuffle(buffer_size=N) delays are shorter but proportionally more frequent. Removing shuffle at all results to delays as if buffer_size was set to dataset size.
Can somebody explain internal logic of Dataset API when it comes to reading/caching? Is there any reason at all to expect Dataset API to work faster than manually created Queues?
I am using TF 1.3.
If you implement the same pipeline using the tf.data.Dataset API and using queues, the performance of the Dataset version should be better than the queue-based version.
However, there are a few performance best practices to observe in order to get the best performance. We have collected these in a performance guide for tf.data. Here are the main issues:
Prefetching is important: the queue-based pipelines prefetch by default and the Dataset pipelines do not. Adding dataset.prefetch(1) to the end of your pipeline will give you most of the benefit of prefetching, but you might need to tune this further.
The shuffle operator has a delay at the beginning, while it fills its buffer. The queue-based pipelines shuffle a concatenation of all epochs, which means that the buffer is only filled once. In a Dataset pipeline, this would be equivalent to dataset.repeat(NUM_EPOCHS).shuffle(N). By contrast, you can also write dataset.shuffle(N).repeat(NUM_EPOCHS), but this needs to restart the shuffling in each epoch. The latter approach is slightly preferable (and truer to the definition of SGD, for example), but the difference might not be noticeable if your dataset is large.
We are adding a fused version of shuffle-and-repeat that doesn't incur the delay, and a nightly build of TensorFlow will include the custom tf.contrib.data.shuffle_and_repeat() transformation that is equivalent to dataset.shuffle(N).repeat(NUM_EPOCHS) but doesn't suffer the delay at the start of each epoch.
Having said this, if you have a pipeline that is significantly slower when using tf.data than the queues, please file a GitHub issue with the details, and we'll take a look!
Suggested things didn't solve my problem back in the days, but I would like to add a couple of recommendations for those, who don't want to learn about queues and still get the most out of TF data pipeline:
Convert your input data into TFRecord (as cumbersome as it might be)
Use recommended input pipeline format
.
files = tf.data.Dataset.list_files(data_dir)
ds = tf.data.TFRecordDataset(files, num_parallel_reads=32)
ds = (ds.shuffle(10000)
.repeat(EPOCHS)
.map(parser_fn, num_parallel_calls=64)
.batch(batch_size)
)
dataset = dataset.prefetch(2)
Where you have to pay attention to 3 main components:
num_parallel_read=32 to parallelize disk IO operations
num_parallel_calls=64 to parallelize calls to parser function
prefetch(2)
I'm playing around with some data on cluster and want to do some aggregations --- nothing too complicated, but more complicated than sum, there are few joins and count distincts. I have implemented this aggregation in Hive and Spark with Scala and want to compare the execution times.
When I submit the scripts from gateway, linux time functions gives me real time smaller than sys time, which I expected. But I'm not sure which one to pick as proper comparision. Maybe just use sys.time and run the both queries for several times? Is it acceptable or I'm complete noob in this case?
Real time. From a performance benchmark perspective, you only care about how long (human time) it takes before your query is completed and you can look at the results, not how many processes are getting spun up by the application internally.
Note, I would be very careful with performance benchmarking, as both Spark and Hive have plenty of tunable configuration knobs that greatly affect performance. See here for a few examples to alter Hive performance with vectorization, data format choices, data bucketing and data sorting.
The "general consensus" is that Spark is faster than Hive on Tez, but that Hive can handle huge data sets that don't fit in memory better. (I'm not going to cite a source since I'm lazy, do some googling)
I know that some of Spark Actions like collect() cause performance issues.
It has been quoted in documentation
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus:rdd.collect().foreach(println). This can cause the driver to run out of memory, though,
because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
And from one more related SE question: Spark runs out of memory when grouping by key
I have come to know that groupByKey(), reduceByKey() may cause out of memory if parallelism is not set properly.
I did not get enough evidence on other Transformations and Action commands, which have to be used with caution.
These three are the only commands to be tackled? I have doubts about below commands too
aggregateByKey()
sortByKey()
persist() / cache()
It would be great if you provide information on intensive commands (global across partitions instead of single partition OR low performance commands), which have to be tackled with better guarding.
You have to consider three types of operations:
transformations implemented using only mapPartitions(WithIndex) like filter, map, flatMap etc. Typically it will be the safest group. Probably the biggest possible issue you can encounter is an extensive spilling to disk.
transformations which require shuffle. It includes obvious suspects like different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join and less obvious like sortBy, distinct or repartition. Without a context (data distribution, exact function for reduction, partitioner, resources) it is hard to tell if particular transformation will be problematic. There are two main factors:
network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower.
skewed data distribution - if distribution is highly skewed shuffle can fail or subsequent operations may suffer from a suboptimal resource allocation
operations which require passing data to and from the driver. Typically it covers actions like collect or take and creating distributed data structure from a local one (parallelize).
Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. Total cost depends of course on a particular operation and the amount of data.
While some of these operations can be expensive none is particularly bad (including demonized groupByKey) by itself. Obviously it is better to avoid network traffic or additional disk IO but in practice you cannot avoid it in any complex application.
Regarding cache you may find Spark: Why do i have to explicitly tell what to cache? useful.
I would like to know what performance impact I should expect when invoking an UDF (user defined function) written in C everytime some record is created or changed (with the assumption, that the UDF code itself takes no time - I will optimize that on my own).
Let's say I have hardware capable of running an SSD-persisted namespace on 200k writes/s, can I expect atleast 50k writes/s with the UDF run everytime?
Subquestion: what might limit the UDFs performance (context switching?)
Reason for asking is that Aerospike is using those UDFs e.g. for Large Data Types, but those are not highly performant according to AS staff (compared to KVS-Ops). My usecase is to use UDFs to keep a broad range of secondary indices within a Redis Cluster up-to-date, allowing for much richer realtime queries (e.g. intersections/unions of 5-10 secondary indices).
Best thing is to run the test yourself. Its hard to predict. But I believe that you should be able to do 50k tps.
Mainly the UDF's performance is effected because of the memory allocations that happen under the hood before calling the UDF. If you are using simple datatypes like int/string/blob, then you are better off. If you use list/map in UDF, it will do more memory allocations which will impact the performance.
I'm currently investigating how to store and analyze enriched time based data with up to 1000 columns per line. At the moment Cassandra together with either Solr, Hadoop or Spark offered by Datastax Enterprise seem to fulfill my requirements on the rough. But the devil is in the detail.
Out of the 1000 columns about 60 are used for real-time-like queries (web-frontend, user sends form and expect quick response). These queries are more or less GROUPBY statements where the number or occurrences are counted.
As Cassandra itself does not provide the required analytical capabilities (no GROUPBY), I'm left these alternatives:
Roughly query via Cassandra and filter the resultset within self-written code
Index the data with Solr and run facet.pivot queries
Use either Hadoop or Spark and run the queries
The first approach seems cumbersome and prone to errors… Solr does have some anayltic features but without multifield grouping I'm stuck with pivots. I don't know whether this is a good or performant approach though… Last but not least there are Hadoop and Spark, the prior known not to be the best for real-time queries, the later pretty new and maybe not production ready.
So which way to go? There is no one-fits-all here, but before I go one way through I'd like to get some feedback. Maybe I'm thinking to complex or my expectations are too high :S
Thanks in advance,
Arman
In a place I work now we have a similar set of tech requirements and a solution is Cassandra-Solr-Spark, exactly in that order.
So if a query can be "covered" by Cassandra indices - good, if not - it's covered by Solr. For testing & less often queries - Spark (Scala, no SparkSQL due to old version of it -- it's a bank, everything should be tested and matured, from cognac to software, argh).
Generally I agree with the solution, though sometimes I have a feeling that some client's requests should NOT be taken seriously at all, saving us from loads of weird queries :)
I would recommend Spark, if you take a loot at the list of companies using it you'll such names as Amazon, eBay and Yahoo!. Also, as you noted in the comment, it's becoming a mature tool.
You've given arguments against Cassandra and Solr already, so I'll focus on explaining why Hadoop MapReduce wouldn't do as well as Spark for real-time queries.
Hadoop and MapReduce were designed to leverage hard disk under the assumption that for big data IO is negligible. As a result data are read and wrote at least twice - in map stage and in reduce stage. This allows you to recover from failures as partial result are secured but it that's not want you want when aiming for real-time queries.
Spark not only aims to fix MapReduce shortcomings, it also focuses on interactive data analysis, which is exactly what you want. This goal is achieved mainly by utilizing RAM and the results are astonishing. Spark jobs will often be 10-100 times faster than MapReduce equivalents.
The only caveat is the amount of memory you have. Most probably your data is probably going to feat in the RAM you can provide or you can rely on sampling. Usually when interactively working with data there is no real need to use MapReduce and it seems to be so in your case.