Hadoop speculative execution testing

I am working with Hadoop 1.1.2 for my master's thesis.
I am studying a new algorithm for speculative tasks, so as a first step I am trying to apply some changes to the code.
Unfortunately, even with 2 nodes, I cannot trigger speculative execution. I added some log statements to the DefaultTaskSelector class (the class responsible for speculative tasks), but after initialization this class is never called by the FairScheduler class.
I also enabled the speculative options in the config file (mapred-site.xml), but nothing happens.
So the question is: how can I cause/force speculative execution?
Regards

Speculative execution typically happens when there are multiple mappers running and one or more of them lag the others. A good way to get it to happen:
set up hive
set up a partitioned table
make sure the data is big enough to cause many mappers to run. This means: at least a few dozen HDFS blocks worth of data
load data into the partitions: make one partition heavily skewed so that it holds much more data than the other partitions.
run a select * from the table
Now you may see speculative execution run.
If not, feel free to get back here. I can provide further suggestions (e.g. making some moderately complicated queries that would likely induce SE)
EDIT
Hive may be a bit of a stretch for you. But you can apply the "spirit" of the strategy to regular HDFS files as well. Write a map/reduce program with a custom partitioner that is intentionally skewed, i.e. it causes a single reducer to do an outsized proportion of the work.
Remember to have at least some tens of HDFS blocks to give the task trackers a decent amount of work to chew on.
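Below is a minimal sketch of such an intentionally skewed partitioner, assuming Text keys, IntWritable values and the new org.apache.hadoop.mapreduce API; the class name and the 90/10 split are illustrative choices, not anything from the original post:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Intentionally skewed partitioner: most keys are funneled into partition 0, so the
// reducer handling partition 0 gets an outsized share of the work and becomes a
// natural straggler for speculative execution to react to.
public class SkewedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;
        }
        int hash = key.hashCode() & Integer.MAX_VALUE;  // non-negative hash
        // Send roughly 90% of the key space to partition 0, spread the rest evenly.
        return (hash % 10 < 9) ? 0 : 1 + (hash % (numPartitions - 1));
    }
}
```

Register it with job.setPartitionerClass(SkewedPartitioner.class) and run with a handful of reducers; the overloaded reducer should lag the others and give the speculative-execution logic something to react to.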

You should be able to enable speculative execution with the two methods setMapSpeculativeExecution(boolean) and setReduceSpeculativeExecution(boolean), which you can call on Job, the MapReduce job configuration.
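For reference, a minimal hedged sketch of how those calls fit into a job driver (the job name and the commented-out wiring are placeholders); the equivalent Hadoop 1.x properties are noted as well:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "speculation-test");

        // Equivalent to setting these Hadoop 1.x properties in mapred-site.xml:
        //   mapred.map.tasks.speculative.execution = true
        //   mapred.reduce.tasks.speculative.execution = true
        job.setMapSpeculativeExecution(true);
        job.setReduceSpeculativeExecution(true);

        // ... set mapper/reducer classes, input/output paths, etc., then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```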

Related

Spark Jobs on Yarn | Performance Tuning & Optimization

What is the best way to optimize Spark jobs deployed on a YARN-based cluster?
I'm looking for changes at the configuration level, not the code level. My question is essentially a design-level question: what approach should be used to optimize jobs developed with either Spark Streaming or Spark SQL?
There is a myth that Big Data is magic and that your code will work like a dream once deployed to a Big Data cluster.
Every newbie shares the same belief :) There is also a misconception that the configurations given on web blogs will work fine for every problem.
There is no shortcut for optimizing or tuning jobs on Hadoop without understanding your cluster deeply.
But with the approach below, I'm confident you'll be able to optimize your job within a couple of hours.
I prefer to apply a purely scientific approach to optimizing jobs. The following steps can be followed as a baseline to start optimizing a job:
Understand the block size configured on the cluster.
Check the maximum memory limit available per container/executor.
Understand the vcores available on the cluster.
Optimize the data rate, especially for Spark Streaming real-time jobs. (This is the trickiest part in Spark Streaming.)
Consider the GC settings while optimizing.
There is always room for optimization at the code level; that needs to be considered as well.
Control the block size optimally based on the cluster configuration from step 1 and on the data rate. In Spark Streaming it can be calculated as batchInterval / blockInterval; for example, a 10 s batch with a 200 ms block interval yields 50 blocks (i.e. 50 tasks) per receiver per batch.
Now come the most important steps. The knowledge I'm sharing is more specific to real-time use cases like Spark Streaming and Spark SQL with Kafka.
First of all, you need to know at what number of messages/records your job works best. After that you can cap the rate at that particular number and start configuration-based experiments to optimize the job. That's what I did below, and I was able to resolve a performance issue at high throughput.
I read through some of the parameters in the Spark configuration documentation and checked their impact on my jobs, then built a grid and ran the same job with five different configuration versions. Within three experiments I was able to optimize my job; the configuration highlighted in green in that grid was the magic formula for my job's optimization.
The same parameters might be very helpful for similar use cases, but obviously these parameters don't cover everything.
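To make that kind of experiment grid concrete, here is a hedged sketch of a baseline Spark Streaming configuration; the property keys are standard Spark settings, but every value is a placeholder to be varied per experiment, not a recommendation:

```java
import org.apache.spark.SparkConf;

public class StreamingTuningBaseline {
    // Illustrative starting point for configuration experiments; tune the values
    // against your own cluster's block size, container memory, and vcore limits.
    public static SparkConf baseline() {
        return new SparkConf()
            .setAppName("streaming-tuning-experiment")
            .set("spark.executor.memory", "4g")                        // keep under the YARN container limit
            .set("spark.executor.cores", "4")                          // keep under the vcores available per node
            .set("spark.streaming.blockInterval", "200ms")             // batchInterval / blockInterval = tasks per batch
            .set("spark.streaming.backpressure.enabled", "true")       // let Spark throttle the ingest rate
            .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // cap records/sec per Kafka partition
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    }
}
```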
Assuming that the application works, i.e. the memory configuration is taken care of and we have at least one successful run of the application, I usually look for underutilisation of executors and try to minimise it. Here are the common questions worth asking to find opportunities for improving utilisation of the cluster/executors:
How much of the work is done in the driver vs the executors? Note that while the main Spark application thread is busy in the driver, the executors are idle.
Does your application have more tasks per stage than the number of cores? If not, some cores will be doing nothing during that stage.
Are your tasks uniform, i.e. not skewed? Since Spark moves computation from stage to stage (except for some stages that can run in parallel), it is possible for most of your tasks to complete while the stage is still running, because one skewed task is still held up.
Shameless plug (I'm the author): Sparklens https://github.com/qubole/sparklens can answer these questions for you, automatically.
Some things are not specific to the application itself. Say your application has to shuffle lots of data: pick machines with better disks and network. Partition your data to avoid full data scans. Use columnar formats like Parquet or ORC to avoid fetching data for columns you don't need all the time. The list is pretty long, and some problems are known but don't have good solutions yet.
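To make the second question above concrete, here is a small hedged sketch (the input path is a placeholder): it compares the number of tasks a stage would get with the parallelism the cluster offers, and repartitions when the stage would leave cores idle:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UtilisationCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("utilisation-check");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> input = sc.textFile("hdfs:///data/input");  // placeholder path

        // defaultParallelism roughly reflects the total cores Spark has available.
        int totalCores = sc.defaultParallelism();
        int tasksPerStage = input.getNumPartitions();

        // If a stage has fewer tasks than cores, some cores sit idle during that stage.
        JavaRDD<String> resized = (tasksPerStage < totalCores)
            ? input.repartition(totalCores)
            : input;

        System.out.println("cores=" + totalCores + ", partitions=" + resized.getNumPartitions());
        sc.stop();
    }
}
```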

Spark performance advantage vs. Hadoop MapReduce [duplicate]

I am hearing that Spark has an advantage over Hadoop due to Spark's in-memory computation. However, one obvious problem is that not all the data can fit into one computer's memory. So is Spark then limited to smaller datasets? At the same time, there is the notion of a Spark cluster. So I am not following the purported advantages of Spark over Hadoop MR.
Thanks
Hadoop MapReduce has been the mainstay on Hadoop for batch jobs for a long time. However, two very promising technologies have emerged: Apache Drill, which is a low-latency SQL engine for self-service data exploration, and Apache Spark, which is a general-purpose compute engine that allows you to run batch, interactive and streaming jobs on the cluster using the same unified framework. Let's dig a little bit more into Spark.
To understand Spark, you really have to understand three big concepts.
First is RDDs, the resilient distributed datasets. This is a representation of the data coming into your system in an object format, and it allows you to do computations on top of it. RDDs are resilient because they track their lineage: whenever there is a failure in the system, they can recompute themselves using that lineage information.
The second concept is transformations. Transformations are what you do to RDDs to get other RDDs. Examples of transformations would be things like opening a file and creating an RDD, or applying functions such as filter that then produce further RDDs.
The third and final concept is actions. These are the operations where you are actually asking the system for an answer, for instance count, or asking for the first line that has "Spark" in it. The interesting thing with Spark is that it does lazy evaluation, which means RDDs are not loaded and pushed into the system as soon as they are encountered; the work is only done when there is actually an action to be performed.
One thing that comes up with RDDs, given that they are resilient and kept in main memory, is how they compare with the distributed shared memory architectures most of us are familiar with from the past. There are a few differences; let's go through them briefly. First of all, writes in RDDs are coarse-grained: they happen at the level of a whole RDD. Writes in distributed shared memory are typically fine-grained, and reads in distributed shared memory are fine-grained as well. Reads in RDDs can be fine- or coarse-grained.
The second piece is recovery. What happens if there is a fault in the system, how do we recover? Since RDDs build a lineage graph, if something goes bad they can go back, recompute based on that graph, and regenerate the RDD. Lineage is relied on heavily in RDDs for recovery. In distributed shared memory we typically fall back to checkpointing done at intervals, or some other semantic checkpointing mechanism. Consistency is relatively trivial in RDDs because the data underneath them is assumed to be immutable; if, however, the data were changing, then consistency would be a problem here. Distributed shared memory doesn't make any assumptions about mutability and therefore leaves the consistency semantics to the application to take care of.
At last let's look at the benefits of Spark:
Spark provides full recovery using lineage.
Spark optimizes both the computations themselves and the placement of those computations using the directed acyclic graph (DAG).
Very easy programming paradigms using transformations and actions on RDDs, as well as rich library support for machine learning, graph processing, and more recently DataFrames.
At this point a question comes up. If Spark is so great, does Spark actually replace Hadoop? The answer is clearly no, because Spark provides an application framework for you to write your big data applications. However, it still needs to run on a storage system or on a NoSQL system.
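A minimal hedged sketch of those three concepts using the Java API (the log path is a placeholder): filter is a transformation that only records lineage, and nothing runs until the count and first actions are invoked:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddConceptsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-concepts-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Creating an RDD from a file: nothing is read yet (lazy evaluation).
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");  // placeholder path

        // Transformation: defines a new RDD and records its lineage; still no work is done.
        JavaRDD<String> sparkLines = lines.filter(line -> line.contains("Spark"));

        // Actions: these actually trigger computation on the cluster.
        long howMany = sparkLines.count();
        String firstOne = sparkLines.first();

        System.out.println("count=" + howMany + ", first=" + firstOne);
        sc.stop();
    }
}
```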
Spark is by no means limited to smaller datasets, and it is not always about in-memory computation. Spark also has a good number of higher-level APIs, and it can process data at GB scale and beyond as well. In my real-world experience I have used Spark to handle streaming applications where we usually get data at a rate of GB/hour, and we have used Spark in telecommunications to handle bigger datasets too. Check RDD Persistence for how to accommodate bigger datasets.
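A hedged sketch of that RDD persistence suggestion (the path and the filter are placeholders): a storage level that spills to disk lets an RDD larger than the available memory still be cached and reused across actions:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistBeyondMemory {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-beyond-memory");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> events = sc.textFile("hdfs:///telco/events/*");  // placeholder path

        // MEMORY_AND_DISK keeps what fits in memory and spills the remaining partitions
        // to local disk instead of dropping them, so a dataset larger than the cluster's
        // memory can still be reused across multiple actions.
        events.persist(StorageLevel.MEMORY_AND_DISK());

        long total = events.count();
        long errors = events.filter(line -> line.contains("ERROR")).count();
        System.out.println("total=" + total + ", errors=" + errors);

        events.unpersist();
        sc.stop();
    }
}
```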
In the case of real-world problems, we can't solve them with just one MapReduce program (a single Mapper class and Reducer class); we mostly need to build a pipeline. A pipeline consists of multiple stages, each with its own MapReduce program, and the output of one stage is fed, once or multiple times, to subsequent stages. This is a pain because of the amount of IO it involves.
In MapReduce there are the Map and Reduce tasks, after which there is a synchronization barrier, and the data must be preserved to disk. This feature of the MapReduce framework was developed so that jobs can be recovered in case of failure, but the drawback is that it does not leverage the memory of the Hadoop cluster to the maximum. This becomes worse when you have an iterative algorithm in your pipeline: every iteration causes a significant amount of disk IO.
So in order to solve the problem, Spark introduced a new data structure called the RDD: a data structure that holds the information about how the data can be read from disk and what to compute. Spark also provides an easy programming paradigm to create a pipeline (DAG) by transforming RDDs. What you get is a series of RDDs that know how to get the data and what to compute.
Finally, when an action is invoked, the Spark framework internally optimizes the pipeline, groups together the portions that can be executed together (map phases), and creates a final optimized execution plan from the logical pipeline. Then it executes it. It also gives the user the flexibility to select which data should be cached. Hence Spark is able to achieve batch processing roughly 10 to 100 times faster than MapReduce.
Spark advantages over hadoop.
Since Spark tasks across stages can be executed on the same executor nodes, the time to spawn executors is saved across multiple tasks.
Even if you have huge memory, MapReduce can never take advantage of caching data in memory and using that in-memory data for subsequent steps.
Spark, on the other hand, can cache data if a large enough JVM is available to it. Across stages, the in-memory data is reused.
In Spark, tasks run as threads on the same executor, making the per-task memory footprint light.
In MapReduce, the Map and Reduce tasks are processes, not threads.
Spark uses an efficient serialization format to store data on disk.
Follow this for a detailed understanding: http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/

Spark write the file inside the worker process

I have a Spark job that is generating a set of results with statistics. My number of work items is larger than the slave count, so I am doing more than one unit of processing per slave.
I cache the results after generating the RDD objects so that I can reuse them, as I have multiple write operations: one for the result objects and another for the statistics. Both write operations use saveAsHadoopFile.
Without caching, Spark reruns the job for each write operation, which takes a long time and redoes the same execution twice (more if I had more writes).
With caching I am hitting the memory limit. Some of the previously calculated results are lost during caching and I am seeing "CacheManager:58 - Partition rdd_1_0 not found, computing it" messages. Spark eventually goes into an infinite loop as it tries to cache more results while losing some others.
I am aware of the fact that Spark has different storage levels for caching, and using memory + disk would solve our problem. But I am wondering whether or not we can write the files directly in the worker without generating RDD objects. I am not sure if that is possible, though. Is it?
It turns out that writing files inside a Spark worker process is no different from writing a file in any Java process. The write operation just requires creating the functionality to serialize and save the files to HDFS. This question has several answers on how to do it.
saveAsHadoopFile is just a convenient way of doing it.
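A hedged sketch of what that can look like, assuming the results are a JavaRDD of strings and the output directory is a placeholder: each partition opens the HDFS FileSystem API itself and writes its own file from inside the worker, with no extra RDD or saveAsHadoopFile involved:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;

public class WriteFromWorkers {
    // Writes each partition of 'results' to its own HDFS file from inside the worker process.
    public static void writePartitions(JavaRDD<String> results, String outputDir) {
        results.foreachPartition(rows -> {
            // Created inside the closure so nothing non-serializable is captured from the driver.
            Configuration hadoopConf = new Configuration();
            FileSystem fs = FileSystem.get(hadoopConf);
            Path file = new Path(outputDir + "/part-" + UUID.randomUUID());
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(file), StandardCharsets.UTF_8))) {
                while (rows.hasNext()) {
                    writer.write(rows.next());
                    writer.newLine();
                }
            }
        });
    }
}
```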

Totally independent jobs on same data on Hadoop?

I need to optimize some hyperparameters for a machine learning problem. This involves launching many jobs on the same input data and saving their outputs, completely independently of each other. On every job distribution system that I've ever used, this is a very common use case, which is handled with a few switches on the command line and/or a job config file. Now I'm on a cluster whose job distribution system is Hadoop/Yarn, which I haven't used before. Despite much searching, the only way to do this on Hadoop seems to be to submit each run as a separate job. This would incur the job submission overhead for each run, of which there can be 1000's. Is there a simple way around that? Maybe some kind of MR job without any R? (BTW, my ML code is in C++ so I guess I need to use Hadoop Streaming.) I'll learn Java if I have to, but it seems like a disproportionate amount of effort for something so simple.

Hadoop map-reduce v/s cascading, which is better when compare on basis processing time?

I have used Cascading as well as M/R, and the Cascading jobs look slow compared to M/R, somewhere around 25% to 50% slower. Is that expected, or do I need to dig deeper into Cascading for optimization?
I can't speak to the overhead of a Cascading job compared to a hand drawn raw MapReduce job as it really depends on the workload complexity, version of Cascading, how you wrote each job, the weather inside Amazon or your network, etc.
That said, Cascading is an abstraction over MapReduce and there will be some overhead. But as an abstraction, it has opportunities to do things more efficiently (1.2 will lazily deserialize data during sorting for example, something a raw MR developer would need to code manually for each intermediate object via a Comparator implementation).
My suspicion is that you are assuming Cascading makes some sort of cluster configuration optimizations over and above the defaults. It does not. So if you run a Cascading Flow without setting any different Hadoop properties, it's likely you will only see one reducer in each job as that's the default in Hadoop (see mapred-default.xml).
Or your job is simple enough it can use 'Combiners', which Cascading does not support directly, but has a more flexible alternative using Map side partial aggregation. This is similar to combiners, but it trades memory for cpu, and they are not limited to commutative-associative operations like Combiners are. Here is a better description of partial aggregation.
I should say that if your workload is simple enough (and will stay simple), and Hadoop is really justified here, such that you can write a couple of MR jobs to satisfy it, you should probably stick with that (but see below).
But the work I do (and I'm the author of Cascading) results in dozens of, if not a hundred in some cases, MR jobs. The fact that I can actually complete my project and get results within days outweighs the minor overhead Cascading may impart in some cases. For example, Cascading has a fail-fast planner, that is, it will not run a Cascading Flow on the cluster if all the data/field dependencies are not satisfied in the Flow.
It is very unlikely you can have that feature if you are chaining raw MR jobs together. It is more likely your workload will fail hours later because of a missing dependency that can only be identified at runtime.
Or, you are passing raw typed 'business objects' around (in order to gain compiler type safety), which means you are either passing data through the cluster unnecessarily, or have dozens of intermediary objects you must manually maintain as you change the business rules of the data processing either upstream or downstream.
Another point on the number of MR jobs. The only way to decrease the cost of a workload in Hadoop is to reduce the IO between jobs in the workload. This is typically done by replacing inefficient algorithms with better ones, at the cost of adding complexity: adding more jobs to do things more intelligently. So if you think you only need a handful of MR jobs and then discover a nasty bottleneck in your data when running at scale (which is what always happens to me, at least), you may need to take a different approach to the problem, which will likely result in a couple more jobs. I know this seems counterintuitive, but it happens a lot. In such cases you will be glad you are working with an abstraction where you can keep your head in the problem domain, not the MapReduce domain.
If you really are concerned about performance, please feel free to email the Cascading mail list with your code, and I or the community would be glad to help identify any issues with it or in Cascading.

Resources