Where can I find the cost of the operations in Spark?

Let's say I have two RDDs with size M1 and M2, distributed equally into p partitions.
I'm interested in knowing (theoretically / approximately) the cost of operations such as filter, map, leftOuterJoin, ++, reduceByKey, etc.
Thanks for the help.

To measure the cost of execution, it is important to understand how Spark executes a job.
In a nutshell, when you apply a set of transformations to your RDDs, Spark creates an execution plan (a DAG) and groups the transformations into stages, which are executed once you trigger an action.
Operations like map/filter/flatMap are grouped together into one stage since they do not incur a shuffle, whereas operations like join and reduceByKey create additional stages because they require data to be moved across executors. Spark executes an action as a sequence of stages (which run sequentially, or in parallel if they are independent of each other). Each stage is executed as a number of parallel tasks, where the number of tasks running at a time depends on the number of partitions of the RDD and the resources available.
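As a rough illustration of the grouping described above, here is a minimal Spark-free sketch that cuts a list of operation names into stages at every shuffle operation. The object name, the function name and the set of shuffle operations are all assumptions made for this example; real Spark also puts the map side of a shuffle into the preceding stage and can skip the shuffle for co-partitioned inputs, which this sketch ignores.

```scala
object StageSketch {
  // Assumption: an illustrative, incomplete set of operations that trigger a shuffle.
  val shuffleOps: Set[String] =
    Set("join", "leftOuterJoin", "reduceByKey", "groupByKey", "distinct")

  // Walk the operations in order and open a new stage at every shuffle operation;
  // narrow operations are appended to the current stage.
  def splitIntoStages(ops: Seq[String]): Seq[Seq[String]] =
    ops.foldLeft(Vector.empty[Vector[String]]) { (stages, op) =>
      if (stages.isEmpty || shuffleOps.contains(op)) stages :+ Vector(op)
      else stages.init :+ (stages.last :+ op)
    }
}
```

For example, StageSketch.splitIntoStages(Seq("map", "filter", "reduceByKey", "map")) gives two stages: one with map and filter, and one starting at reduceByKey.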
The best way to measure the cost of your operations is to look at the Spark UI. Open the Spark UI (by default it is at localhost:4040 if you are running in local mode). You'll find several tabs at the top of the page; clicking any of them takes you to a page showing the corresponding metrics.
Here is what I do to measure the performance:
Cost of a Job => Sum of costs of executing all its stages.
Cost of a Stage => Mean of the cost of executing each parallel task in the stage.
Cost of a Task => By default, a task consumes one CPU core. The memory it consumes is shown in the UI and depends on the size of your partition.
It is really difficult to derive metrics for each transformation within a stage since Spark combines these transformations and executes them together on a partition of RDD.
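The roll-up above (job = sum of stages, stage = mean of its tasks) can be written down as a small sketch. It assumes you have copied per-task durations out of the Spark UI by hand; the names CostSketch, stageCost and jobCost are made up for this example and are not part of any Spark API.

```scala
object CostSketch {
  // Task durations for one stage (e.g. milliseconds), as read off the Spark UI.
  type Stage = Seq[Double]

  // Cost of a stage: mean cost of its parallel tasks (the heuristic from this answer).
  def stageCost(tasks: Stage): Double =
    if (tasks.isEmpty) 0.0 else tasks.sum / tasks.size

  // Cost of a job: sum of the costs of all its stages.
  def jobCost(stages: Seq[Stage]): Double = stages.map(stageCost).sum
}
```

Using the mean hides stragglers; if one slow task dominates a stage, taking the maximum instead of the mean may describe the wall-clock time better.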


What is the difference between parallelism and parallel computing in Flink?

I am confused about the number of tasks that can work in parallel in Flink.
Can someone explain to me:
What does the degree of parallelism mean in a distributed system, and how does it relate to Flink terminology?
In Flink, does a parallelism of 2 mean that 2 tasks work in parallel?
In Flink, if 2 operators work separately but each has a parallelism of 1, does that count as parallel computation?
Is it true that in a KeyedStream, the maximum parallelism is the number of keys?
Is the current CEP engine in Flink able to work in more than 1 task?
Thank you.
Flink uses the term parallelism in a pretty standard way -- it refers to running multiple copies of the same computation simultaneously on multiple processors, but with different data. When we speak of parallelism with respect to Flink, it can apply to an operator that has parallel instances, or it can apply to a pipeline or job (composed of several operators).
In Flink it is possible for several operators to work separately and concurrently. E.g., in this job
source ---> map ---> sink
the source, map, and sink could all be running simultaneously in separate processors, but we wouldn't call that parallel computation. (Distributed, yes.)
In a typical Flink deployment, the number of task slots equals the parallelism of the job, and each slot is executing one complete parallel slice of the application. Each parallel instance of an operator chain will correspond to a task. So in the simple example above, the source, map, and sink can all be chained together and run in a single task. If you deploy this job with a parallelism of two, then there will be two tasks. But you could disable the chaining, and run each operator in its own task, in which case you'd be using six tasks to run the job with a parallelism of two.
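The task arithmetic in the paragraph above can be checked with a tiny sketch. It assumes every operator runs with the same parallelism and models a job simply as a list of operator chains; FlinkTaskCount and taskCount are hypothetical names, not Flink API.

```scala
object FlinkTaskCount {
  // Each inner Seq is one operator chain; every chain runs as one task
  // per parallel instance, so tasks = number of chains * parallelism.
  def taskCount(chains: Seq[Seq[String]], parallelism: Int): Int =
    chains.size * parallelism
}
```

With chaining, Seq(Seq("source", "map", "sink")) at parallelism 2 gives 2 tasks; with chaining disabled, Seq(Seq("source"), Seq("map"), Seq("sink")) at parallelism 2 gives 6.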
Yes, with a KeyedStream, the number of distinct keys is an upper bound on the parallelism.
CEP can run in parallel if it is operating on a KeyedStream (in which case, the pattern matching is being done independently for each key).
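To see why the number of distinct keys bounds the useful parallelism, here is a simplified sketch that routes each key to a parallel instance by hashing. This is an assumption for illustration only: Flink's actual assignment goes through key groups, but the bound is the same because all records with one key land on one instance.

```scala
object KeyedParallelism {
  // Simplified routing (assumption): hash the key, take it modulo the parallelism.
  def subtaskFor(key: String, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Number of parallel instances that actually receive data for the given keys.
  def busySubtasks(keys: Seq[String], parallelism: Int): Int =
    keys.map(subtaskFor(_, parallelism)).distinct.size
}
```

With only the keys "a" and "b" in the stream, KeyedParallelism.busySubtasks reports at most 2 busy instances even at a parallelism of 8; the remaining instances receive nothing.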

Task scheduling with Spark

I am running a fairly large task on my 4-node cluster. I am reading around 4 GB of filtered data from a single table and running Naive Bayes training and prediction. I have an HBase region server running on a single machine that is separate from the Spark cluster, which runs in fair scheduling mode, although HDFS is running on all machines.
While executing, I am seeing a strange task distribution in terms of the number of active tasks on the cluster. I observed that only one or at most two tasks are active, on one or two machines, at any point in time while the others sit idle. My expectation was that the data in the RDD would be divided and processed on all the nodes for operations like count, distinct, etc. Why are all nodes not being used for the large tasks of a single job? Does having HBase on a separate machine have anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is at least equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the driver (master).
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.
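Once you have the per-partition element counts from the glom() check described above, a single number makes skew easier to spot. The sketch below is Spark-free and takes the counts as a plain sequence; with Spark they could come from something like rdd.glom().map(_.length).collect(). The names SkewCheck and skewRatio are made up for this example.

```scala
object SkewCheck {
  // partitionSizes: number of elements in each partition.
  // Returns the largest partition size divided by the mean partition size:
  // 1.0 means perfectly balanced, larger values mean more skew.
  def skewRatio(partitionSizes: Seq[Int]): Double = {
    val total = partitionSizes.map(_.toLong).sum
    if (total == 0L) 1.0 // no data at all: treat as balanced
    else partitionSizes.max.toDouble * partitionSizes.size / total
  }
}
```

Four partitions of 25 elements each give a ratio of 1.0, while sizes of 97, 1, 1 and 1 give a ratio close to 4, meaning one task does almost all of the work.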

Apache Tez architecture Explanation

I was trying to see what makes Apache Tez with Hive much faster than MapReduce with Hive.
I am not able to understand the DAG concept.
Does anyone have a good reference for understanding the architecture of Apache Tez?
The presentation from Hadoop Summit (slide 35) discussed how the DAG approach is optimal vs MapReduce paradigm:
Essentially, it allows higher-level tools (like Hive and Pig) to define their overall processing steps (aka workflow, aka Directed Acyclic Graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (Hive query, Pig job, etc.). Because the entire job's steps can be computed before execution time, the system can take advantage of caching intermediate job results "in memory", whereas in MapReduce all intermediate data between MapReduce phases had to be written to HDFS (disk), adding latency.
YARN also allows container reuse for Tez tasks. E.g. each server is chopped into multiple "containers" rather than "map" or "reduce" slots. For any given point in the job execution this allows Tez to use the entire cluster for the map phases or the reduce phases as needed. Whereas in Hadoop v1 prior to YARN, the number of map slots (and reduce slots) were fixed/hard coded at the platform level. Better utilization of all available cluster resources generally leads to faster
Apache Tez represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.
Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance which is managed by Tez. Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph.
… with vertices in the graph representing application logic and edges representing movement
of data. A rich dataflow definition API allows users to express complex query logic in an
intuitive manner and it is a natural fit for query plans produced by higher-level
declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as
a single Tez job that will run the entire computation. Expanding this logical graph into a
physical graph of tasks and executing it is taken care of by Tez.
The Data Processing API in Apache Tez blog post describes a simple Java API used to express a DAG of data processing. The API has three components:
• DAG: defines the overall job. The user creates a DAG object for each data processing job.
• Vertex: defines the user logic and the resources & environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
• Edge: defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately and define how to route data between the tasks. Tez also lets you define the parallelism of each vertex by specifying user guidance, data size and resources.
• Data movement: defines the routing of data between tasks.
  ◦ One-To-One: Data from the ith producer task routes to the ith consumer task.
  ◦ Broadcast: Data from a producer task routes to all consumer tasks.
  ◦ Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
• Scheduling: defines when a consumer task is scheduled.
  ◦ Sequential: Consumer task may be scheduled after a producer task completes.
  ◦ Concurrent: Consumer task must be co-scheduled with a producer task.
• Data source: defines the lifetime/reliability of a task output.
  ◦ Persisted: Output will be available after the task exits. Output may be lost later on.
  ◦ Persisted-Reliable: Output is reliably stored and will always be available.
  ◦ Ephemeral: Output is available only while the producer task is running.
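The three data movement types listed above can be modelled in a few lines. This is a conceptual sketch only, not the Tez Java API: the type and function names are invented here, and the payload is described with a plain string.

```scala
object EdgeRouting {
  sealed trait DataMovement
  case object OneToOne extends DataMovement
  case object Broadcast extends DataMovement
  case object ScatterGather extends DataMovement

  // For one producer task, list (consumer index, what that consumer receives).
  def route(movement: DataMovement, producer: Int, numConsumers: Int): Seq[(Int, String)] =
    movement match {
      // The i-th producer feeds only the i-th consumer.
      case OneToOne => Seq(producer -> "all output")
      // Every consumer receives the producer's complete output.
      case Broadcast => (0 until numConsumers).map(c => c -> "all output")
      // The producer splits its output into shards; shard c goes to consumer c.
      case ScatterGather => (0 until numConsumers).map(c => c -> s"shard $c")
    }
}
```

Scatter-Gather is the pattern behind the classic MapReduce shuffle: each consumer gathers its own shard from every producer.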
Additional details on Tez architecture are presented in this Apache Tez Design Doc.
I am not yet using Tez but I have read about it. I think the two main reasons that make Hive run faster on Tez are:
Tez shares data between MapReduce jobs in memory when possible, avoiding the overhead of writing to and reading from HDFS.
With Tez you can run multiple map/reduce DAGs defined in Hive in one Tez session, without needing to start a new application master each time.
You can find a list of links that will help you to understand Tez better here: http://hortonworks.com/hadoop/tez/
Tez is a DAG (directed acyclic graph) architecture. A typical MapReduce job has the following steps:
Read data from file --> first disk access
Run mappers
Write map output --> second disk access
Run shuffle and sort --> read map output, third disk access
Write shuffle and sort output --> write sorted data for reducers --> fourth disk access
Run reducers, which read the sorted data --> fifth disk access
Write reducer output --> sixth disk access
Tez works very similarly to Spark (Tez was created by Hortonworks):
Execute the plan but no need to read data from disk.
Once ready to do some calculations (similar to actions in spark), get the data from disk and perform all steps and produce output.
Only one read and one write.
Notice the efficiency gained by not going to disk multiple times. Intermediate results are stored in memory (not written to disk). On top of that there is vectorization (processing a batch of rows instead of one row at a time). All of this reduces query time.
References http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey
The main difference between MR and Tez is that MR writes intermediate data to local disk. In Tez, the mapper/reducer functionality executes in a single instance on each container, in memory. Moreover, Tez performs its operations much like transformations and actions in Spark.

How to decide on the number of partitions required for input data size and cluster resources?

My use case is as follows.
Read input data from the local file system using sparkContext.textFile(input path).
Partition the input data (80 million records) using RDD.coalesce(numberOfPArtitions) before submitting it to the mapper/reducer function. Without using coalesce() or repartition() on the input data, Spark runs really slowly and fails with an out-of-memory exception.
The issue I am facing is deciding the number of partitions to apply to the input data. The input data size varies every time, so hard-coding a particular value is not an option. Spark performs really well only when a certain optimum number of partitions is applied to the input data, and finding it takes a lot of iterations (trial and error), which is not an option in a production environment.
My question: Is there a rule of thumb for deciding the number of partitions required, depending on the input data size and the cluster resources available (executors, cores, etc.)? If yes, please point me in that direction. Any help is much appreciated.
I am using Spark 1.0 on YARN.
Two notes from Tuning Spark in the Spark official documentation:
1- In general, we recommend 2-3 tasks per CPU core in your cluster.
2- Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
These are two rules of thumb that help you estimate the number and size of partitions. So, it's better to have small tasks (ones that complete in a few hundred ms).
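The first rule of thumb turns into a one-line calculation. The sketch below assumes you know the number of executors and the cores per executor up front; PartitionRule and suggestedPartitions are names invented for this example, and the default of 3 tasks per core is just the upper end of the quoted 2-3 range.

```scala
object PartitionRule {
  // Rule of thumb quoted above: 2-3 tasks per CPU core in the cluster.
  // Returns at least 1 so the result is always a usable partition count.
  def suggestedPartitions(executors: Int, coresPerExecutor: Int, tasksPerCore: Int = 3): Int =
    math.max(1, executors * coresPerExecutor * tasksPerCore)
}
```

For 4 executors with 8 cores each this suggests 96 partitions (64 with tasksPerCore = 2). Treat the result as a starting point to pass to repartition() or to the minPartitions argument of textFile, then adjust if individual tasks turn out too large for memory.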
Determining the number of partitions is a bit tricky. By default, Spark will try to infer a sensible number of partitions. Note: if you are using the textFile method with compressed text, Spark will disable splitting and you will then need to repartition (it sounds like this might be what's happening?). With non-compressed data, when you are loading with sc.textFile you can also specify a minimum number of partitions (e.g. sc.textFile(path, minPartitions)).
By default, coalesce() can only reduce the number of partitions, so you should consider using the repartition() function instead.
As far as choosing a "good" number goes, you generally want at least as many partitions as the number of executors, for parallelism. There already exists some logic to try to determine a "good" amount of parallelism, and you can get this value by calling sc.defaultParallelism.
I assume you know the size of the cluster going in;
then you can essentially try to partition the data in some multiple of
that and use RangePartitioner to partition the data roughly equally. Dynamic
partitions are created based on the number of blocks on the filesystem, and hence the
task overhead of scheduling so many tasks mostly kills the performance.
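import org.apache.spark.RangePartitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed on Spark versions before 1.3
// Read the input file (sc is the SparkContext, e.g. as provided by spark-shell).
val file = sc.textFile("<my local path>")
// RangePartitioner needs a key-value RDD, so pair each line with a dummy value.
val partitionedFile = file.map(x => (x, 1))
// Range-partition by key into 3 partitions of roughly equal size.
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))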

Hadoop: tasks not local with file?

I ran a Hadoop job, and when I look at some map tasks I see they are not running where the file's blocks are. E.g., the map task runs on slave1, but the file blocks (all of them) are on slave2. The files are all gzip.
Why is that happening, and how can I resolve it?
UPDATE: note there are many pending tasks, so this is not a case of a node being idle and therefore hosting tasks that read from other nodes.
Hadoop's default (FIFO) scheduler works like this: When a node has spare capacity, it contacts the master and asks for more work. The master tries to assign a data-local task, or a rack-local task, but if it can't, it will assign any task in the queue (of waiting tasks) to that node. However, while this node was being assigned this non-local task (we'll call it task X), it is possible that another node also had spare capacity and contacted the master asking for work. Even if this node actually had a local copy of the data required by X, it will not be assigned that task because the other node was able to acquire the lock to the master slightly faster than the latter node. This results in poor data locality, but FAST task assignment.
In contrast, the Fair Scheduler uses a technique called delay scheduling that achieves higher locality by delaying non-local task assignment for a "little bit" (configurable). It achieves higher locality, but at the small cost of delaying some tasks.
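The difference between the two policies can be sketched as a single decision function. This is a simplified model, not Hadoop's scheduler code: the names are invented, and the wait is counted in skipped scheduling opportunities rather than in time. Setting maxSkips to 0 reproduces the greedy FIFO behaviour described earlier.

```scala
object DelaySchedulingSketch {
  // A pending task and the nodes that hold a local copy of its input data.
  final case class Task(id: String, localNodes: Set[String])

  // Decide what a node with a free slot should run.
  // skippedSoFar: scheduling opportunities the job has already passed up.
  // maxSkips: the configurable "little bit" of delay; 0 means never wait.
  def assign(freeNode: String, pending: Seq[Task], skippedSoFar: Int, maxSkips: Int): Option[Task] =
    pending.find(_.localNodes.contains(freeNode)) match {
      case local @ Some(_) => local                                  // data-local task: take it
      case None if skippedSoFar >= maxSkips => pending.headOption    // waited long enough: go non-local
      case None => None                                              // skip this slot and keep waiting
    }
}
```

With a task X whose data lives only on slave2, a free slot on slave1 gets X immediately when maxSkips is 0, but gets nothing when maxSkips is 3 and the job has not yet skipped any slots, which leaves X available for slave2.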
Other people are working on better schedulers, and this will likely improve in the future. For now, you can choose to use the Fair Scheduler if you wish to achieve higher data locality.
I disagree with @donald-miner's conclusion that "With a default replication factor of 3, you don't see very many tasks that are not data local." He is correct in noting that more replicas will improve your locality percentage, but the percentage of data-local tasks may still be very low. I've also run experiments myself and saw very low data locality with the FIFO scheduler. You could achieve high locality if your job is large (has many tasks), but the more common, smaller jobs suffer from a problem called "head-of-line scheduling". Quoting from this paper:
The first locality problem occurs in small jobs (jobs that
have small input files and hence have a small number of data
blocks to read). The problem is that whenever a job reaches
the head of the sorted list [...] (i.e. has the fewest
running tasks), one of its tasks is launched on the next slot
that becomes free, no matter which node this slot is on. If
the head-of-line job is small, it is unlikely to have data on
the node that is given to it. For example, a job with data on
10% of nodes will only achieve 10% locality.
That paper goes on to cite numbers from a production cluster at Facebook, and they reported observing just 5% of data locality in a large, production environment.
Final note: Should you care if you have low data locality? Not too much. The running time of your jobs may be dominated by the stragglers (tasks that take longer to complete) and the shuffle phase, so improving data locality would yield only a very modest improvement in running time (if any at all).
Unfortunately, the default scheduler isn't that smart. I'm not sure exactly what's going on, but I think it's using some sort of greedy-style scheduling where it tries to schedule what it can now for the next task, and then moves on. There could definitely be improvements made to the Hadoop scheduler, and there have been a few academic attempts at making Hadoop scheduling more optimal.
This research paper shows that the default Hadoop scheduler is not optimal. In the results, they show that increasing the replication factor to three improves data locality significantly, with diminishing returns after that.
So, why hasn't the default scheduler been improved? Here is my opinion/theory: With a default replication factor of 3, you don't see very many tasks that are not data local. By having more replicas, you give the scheduler more flexibility to fit tasks in the right spots. Basically, it's a coincidence that you have 3 replicas, and the default scheduler takes advantage of that by being implemented in a lazy manner. Since you typically have 3 replicas for redundancy's sake already... there isn't much motivation to help scheduler performance for people with a replication of 1.
If you have the space, I suggest just upping the replication factor to two or three. There really isn't much downside.
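A back-of-the-envelope estimate shows how replication and job size interact under the greedy scheduler. It rests on assumptions not stated in the thread: the next free slot appears on a uniformly random node, and a job's replicas sit on at most blocks * replication distinct nodes. The names LocalityEstimate and localityBound are invented for this sketch.

```scala
object LocalityEstimate {
  // Upper bound on the fraction of nodes that hold some input block of the job,
  // which under the assumptions above also bounds the chance that a randomly
  // freed slot is data-local for that job.
  def localityBound(blocks: Int, replication: Int, nodes: Int): Double =
    math.min(nodes.toLong, blocks.toLong * replication).toDouble / nodes
}
```

On a 100-node cluster, a 10-block job with one replica per block is bounded at about 10% locality, in line with the quoted example; three replicas raise the bound to about 30%, which is better but still far from fully local for small jobs.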
