How to set configurations to make a Spark/YARN job faster?

I am new to Spark. I have been reading about Spark config and the different properties to set so that we can optimize a job. But I am not sure how to figure out what I should set.
For example, I created a cluster of r3.8xlarge instances (1 master and 10 slaves).
How do I set:
spark.executor.memory
spark.driver.memory
spark.sql.shuffle.partitions
spark.default.parallelism
spark.driver.cores
spark.executor.cores
spark.memory.fraction
spark.executor.instances
Or should I just leave the defaults? But leaving the defaults makes my job very slow. My job has 3 group-bys and 3 broadcasted maps.
Thanks

For tuning your application you need to know a few things:
1) You need to monitor your application to see whether your cluster is underutilized and how many resources are used by the application you have created.
Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.
2) Based on your observations about CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
From Spark's point of view:
In spark-defaults.conf
you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage collection algorithm.
Below are a few examples; you can tune these parameters based on your requirements:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 3g
spark.executor.extraJavaOptions -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:MaxPermSize=6G -XX:+UseG1GC
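As a rough illustration for the cluster in the question (10 r3.8xlarge workers, each with about 32 vCPUs and 244 GiB of RAM), a common rule of thumb is roughly 5 cores per executor, leaving one core and some memory on each node for the OS and Hadoop daemons. The numbers below are only a hedged starting point, not a definitive recommendation; adjust them against what Ganglia and the Spark UI show, and against the container limits YARN allows on your cluster:
spark.executor.cores 5
spark.executor.instances 59
spark.executor.memory 32g
spark.yarn.executor.memoryOverhead 4096
spark.driver.memory 10g
spark.default.parallelism 600
spark.sql.shuffle.partitions 600
The general idea: executors per node ≈ (cores per node - 1) / spark.executor.cores (about 6 here, so 60 executors minus one slot for the YARN application master), and parallelism/shuffle partitions ≈ 2-3x the total number of executor cores.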
For more details, refer to http://spark.apache.org/docs/latest/tuning.html
Hope this helps!

Related

Apache Storm 2.1.0 memory related configurations

We are in the process of migrating to 2.1.0 from 1.1.x.
In our current setup we have the following memory configurations in storm.yaml:
nimbus.childopts: -Xmx2048m
supervisor.childopts: -Xmx2048m
worker.childopts: -Xmx16384m
I see many other memory-related configs in https://github.com/apache/storm/blob/master/conf/defaults.yaml, and I have the following questions regarding them:
What is the difference between worker.childopts and topology.worker.childopts? If we are setting worker.childopts in storm.yaml, do we still have to override topology.worker.childopts?
If we are setting worker.childopts in storm.yaml, do we still have to override worker.heap.memory.mb? Is there a relationship between these two configs?
Should topology.component.resources.onheap.memory.mb be smaller than the heap set in worker.childopts? How should we decide the value of topology.component.resources.onheap.memory.mb?
I would appreciate it if someone could explain these points.
I have recently fiddled with some of these configs myself, so I am sharing my insights here:
worker.childopts vs topology.worker.childopts - the first parameter sets childopts for all workers. The second parameter can be used to override those for individual topologies, e.g. by using conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "someJvmArgsHere");
The default value for worker.childopts is "-Xmx%HEAP-MEM%m -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump" according to the Storm Git repository. Pay attention to the first argument: it includes the replacement pattern %HEAP-MEM%, which is replaced with whatever you configure for worker.heap.memory.mb. You are able to override the latter parameter from inside a topology configuration in Java, so I guess they built it that way to be able to quickly modify the Java heap for individual topologies. One thing I noticed is that, when overriding, Storm only seems to use the override value if at least one spout or bolt is configured with .setMemoryLoad(int heapSize).
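For example, assuming the following illustrative storm.yaml values, each worker JVM would be launched with -Xmx2048m after the %HEAP-MEM% substitution:
worker.heap.memory.mb: 2048
worker.childopts: "-Xmx%HEAP-MEM%m -XX:+UseG1GC"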
This highly depends on the individual topology's needs, but in general it is most likely a very good idea to have topology.component.resources.onheap.memory.mb be smaller than whatever you have configured for -Xmx in worker.childopts. Finding a good value for topology.component.resources.onheap.memory.mb comes down to testing and knowledge about the memory consumption of your topology's components. For instance, I have a topology which receives tuples from Redis and emits them. If the bolts are busy, tuples may pile up in the spout, so I configure it with some headroom in terms of memory. However, I normally do not modify topology.component.resources.onheap.memory.mb but rather use the setMemoryLoad(int heapSize) method of a topology's component, as this allows setting different values for individual components of the topology. The Storm docs for this and related topics are here.
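As a rough sketch of how the two override mechanisms fit together (MySpout and MyBolt stand in for your own components, and all numbers are illustrative placeholders rather than recommendations):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MemoryTuningTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Per-component on-heap request in MB; setting this is also what makes
        // the worker.heap.memory.mb override below take effect.
        builder.setSpout("redis-spout", new MySpout(), 1).setMemoryLoad(512);
        builder.setBolt("process-bolt", new MyBolt(), 4)
               .shuffleGrouping("redis-spout")
               .setMemoryLoad(1024);

        Config conf = new Config();
        conf.setNumWorkers(2);
        // Per-topology override of worker.childopts from storm.yaml
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-XX:+UseG1GC");
        // Per-topology override of worker.heap.memory.mb (fills %HEAP-MEM%)
        conf.put("worker.heap.memory.mb", 4096);

        StormSubmitter.submitTopology("memory-tuning-demo", conf, builder.createTopology());
    }
}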

Storing HDFS data only on specific nodes in a Hadoop Cluster

We have a 30-node production cluster. We want to add 5 data nodes for additional storage to handle an interim spike of data (around 2 TB). This data is to be stored temporarily, and we want to get rid of it after 15 days.
Is it possible to make sure that the interim data (2 TB) coming in will be stored only on the newly added data nodes?
I am looking for something similar to YARN node labelling.
Thank you in advance.
Unfortunately I don't know a simple way to achieve this in the same HDFS cluster.
But I think you can achieve this behavior by implementing a custom "Block Placement Policy".
However, performing this task can be somewhat risky and complex.
Here is the HDFS JIRA ticket where the functionality that allows you to customize this policy was added (JIRA TICKET).
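For reference, once you have written a custom policy class and put it on the NameNode classpath, it is plugged in through hdfs-site.xml; a minimal sketch (the class name is a hypothetical placeholder for your own implementation):
<property>
  <name>dfs.block.replicator.classname</name>
  <value>com.example.InterimNodesPlacementPolicy</value>
</property>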
You can read about the current behavior of choosing datanodes here, to understand it better before you customize it:
link 1
Also, here you can find a post with several useful references on how to implement a custom policy and the risks involved:
post
Other readings I recommend if you want to go this way:
link 2
post 2
This is a good paper about an experiment with a custom block placement policy that places replicas on SSDs or HDDs (hybrid cluster):
paper
I think that, if possible, it would be simpler to use a second cluster. For example, you can evaluate ViewFS, which uses namespaces to reference each cluster:
viewFs reference
link 3
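If you go the ViewFS route, the client-side mount table lives in core-site.xml; a hedged sketch, where the mount table name, paths and namenode addresses are placeholders:
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./data</name>
  <value>hdfs://main-cluster-nn:8020/data</value>
</property>
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./interim</name>
  <value>hdfs://interim-cluster-nn:8020/interim</value>
</property>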
Regards,

lazy evaluation in Apache Spark

I'm trying to understand lazy evaluation in Apache Spark.
My understanding says:
Let's say I have a text file on my hard drive.
Steps:
1) First I'll create RDD1, which is nothing but a data definition right now. (No data is loaded into memory yet.)
2) I apply some transformation logic on RDD1 and create RDD2; RDD2 is still just a data definition. (Still no data loaded into memory.)
3) Then I apply a filter on RDD2 and create RDD3. (Still no data loaded into memory; RDD3 is also just a data definition.)
4) I perform an action so that I get the RDD3 output written to a text file. The moment I perform this action, expecting some output, Spark loads the data into memory, computes RDD1, RDD2 and RDD3, and produces the output.
So the laziness of RDDs in Spark means Spark just keeps building the roadmap (the RDDs) until it gets the go-ahead (an action) to actually materialize and produce the data.
Is my understanding correct up to here?
My second question: it is said that lazy evaluation is one of the reasons Spark is more powerful than Hadoop. Could you please explain how? I am not very familiar with Hadoop. What happens in Hadoop in this scenario?
Thanks :)
Yes, your understanding is fine. A graph of operations (a DAG) is built up via transformations, and they are all computed at once when an action is invoked. This is what is meant by lazy execution.
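As a minimal sketch of the scenario described in the question (the paths and the filter are placeholders), nothing is read from disk until the final action:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyEvalDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // RDD1: just a lineage/data definition, nothing is loaded yet
        JavaRDD<String> rdd1 = sc.textFile("hdfs:///data/input.txt");
        // RDD2: a transformation, still lazy
        JavaRDD<String> rdd2 = rdd1.map(String::toUpperCase);
        // RDD3: a filter, still lazy
        JavaRDD<String> rdd3 = rdd2.filter(line -> line.contains("ERROR"));

        // The action below triggers the whole DAG to be computed
        rdd3.saveAsTextFile("hdfs:///data/output");

        sc.close();
    }
}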
Hadoop only provides a filesystem (HDFS), a resource manager (YARN), and the libraries which allow you to run MapReduce. Spark only concerns itself with being more optimal than the latter, given enough memory.
Apache Pig is another framework in the Hadoop ecosystem that allows for lazy evaluation, but it has its own scripting language compared to the wide programmability of Spark in the languages it supports. Pig supports running MapReduce, Tez, or Spark actions for computations. Spark only runs and optimizes its own code.
What happens in actual MapReduce code is that you need to procedurally write each stage's output to disk or memory in order to accomplish relatively large tasks.
Spark is not a replacement for "Hadoop"; it's a complement.

How to Throttle DataStage

I work on a project where a number of DataStage sequences can be run in parallel; one in particular performs poorly and takes a lot of resources, impacting the shared environment. A performance tuning initiative is in progress but will take time.
In the meantime I was hopeful that we could throttle DataStage to restrict the resources that could be used by this particular job/sequence - however I'm not personally experienced with DataStage specifically.
Can anyone comment on whether this facility exists in DataStage (v8.5 I believe), and point me in the direction of some further detail?
Secondly, I know that we can throttle based on the user (I think this ties into AIX 'ulimit', but I'm not sure). Is it easy/possible to run different jobs/sequences as different users?
In this type of situation, resources for a particular job can be restricted by specifying the number of nodes and resources in a config file. This is possible in 8.5, and you may find something at www.datastagetips.com
Revolution_In_Progress is right.
DataStage PX has the notion of a configuration file. That file can be specified for all the jobs you run, or it can be overridden on a job-by-job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with the performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
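For illustration, a minimal 2-node parallel configuration file might look like the sketch below (host names and paths are placeholders); you point a job at it via the APT_CONFIG_FILE environment variable, which can also be exposed as a job parameter:
{
  node "node1"
  {
    fastname "etl-host"
    pools ""
    resource disk "/ds/datasets" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl-host"
    pools ""
    resource disk "/ds/datasets" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
}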
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.parjob.tut.doc/module5/lesson5.1exploringtheconfigurationfile.html
A sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analysis to remove the jobs that need not run in parallel with critical jobs.

Streaming data and Hadoop? (not Hadoop Streaming)

I'd like to analyze a continuous stream of data (accessed over HTTP) using a MapReduce approach, so I've been looking into Apache Hadoop. Unfortunately, it appears that Hadoop expects to start a job with an input file of fixed size, rather than being able to hand off new data to consumers as it arrives. Is this actually the case, or am I missing something? Is there a different MapReduce tool that works with data being read in from an open socket? Scalability is an issue here, so I'd prefer to let the MapReducer handle the messy parallelization stuff.
I've played around with Cascading and was able to run a job on a static file accessed via HTTP, but this doesn't actually solve my problem. I could use curl as an intermediate step to dump the data somewhere on a Hadoop filesystem and write a watchdog to fire off a new job every time a new chunk of data is ready, but that's a dirty hack; there has to be some more elegant way to do this. Any ideas?
The hack you describe is more or less the standard way to do things -- Hadoop is fundamentally a batch-oriented system (for one thing, if there is no end to the data, Reducers can't ever start, as they must start after the map phase is finished).
Rotate your logs; as you rotate them out, dump them into HDFS. Have a watchdog process (possibly a distributed one, coordinated using ZooKeeper) monitor the dumping grounds and start up new processing jobs. You will want to make sure the jobs run on inputs large enough to warrant the overhead.
HBase is a BigTable clone in the Hadoop ecosystem that may be interesting to you, as it allows for a continuous stream of inserts; you will still need to run analytical queries in batch mode, however.
What about http://s4.io/? It's made for processing streaming data.
Update
A new product is rising: Storm - Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more
I think you should take a look over Esper CEP ( http://esper.codehaus.org/ ).
Yahoo S4 http://s4.io/
It provides real-time stream computing, like MapReduce.
Twitter's Storm is what you need; give it a try!
Multiple options here.
I suggest the combination of Kafka and Storm + (Hadoop or NoSql) as the solution.
We already built our big data platform using those open source tools, and it works very well.
Your use case sounds similar to the issue of writing a web crawler using Hadoop - the data streams back (slowly) from sockets opened to fetch remote pages via HTTP.
If so, then see Why fetching web pages doesn't map well to map-reduce. And you might want to check out the FetcherBuffer class in Bixo, which implements a threaded approach in a reducer (via Cascading) to solve this type of problem.
As you know, the main issues with using Hadoop for stream mining are, first, that it uses HDFS, which is disk-based, and disk operations bring latency that results in missing data in the stream; and second, that the pipeline is not parallel. MapReduce generally operates on batches of data, not on individual instances as is the case with stream data.
I recently read an article about M3, which apparently tackles the first issue by bypassing HDFS and performing in-memory computations in an object database. For the second issue, they use incremental learners, which are no longer run in batch. It is worth checking out M3: Stream Processing on Main-Memory MapReduce. I could not find the source code or API of M3 anywhere; if somebody finds it, please share the link here.
Also, Hadoop Online is another prototype that attempts to solve the same issues as M3: Hadoop Online
However, Apache Storm is the key part of the solution; it is not enough on its own, though. You need some equivalent of MapReduce, and this is why you need a library called SAMOA, which has great algorithms for online learning that Mahout somewhat lacks.
Several mature stream processing frameworks and products are available on the market. Open source frameworks are e.g. Apache Storm or Apache Spark (which can both run on top of Hadoop). You can also use products such as IBM InfoSphere Streams or TIBCO StreamBase.
Take a look at this InfoQ article, which explains stream processing and all these frameworks and products in detail: Real Time Stream Processing / Streaming Analytics in Combination with Hadoop. Besides the article also explains how this is complementary to Hadoop.
By the way: Many software vendors such as Oracle or TIBCO call this stream processing / streaming analytics approach "fast data" instead of "big data" as you have to act in real time instead of batch processing.
You should try Apache Spark Streaming.
It should work well for your purposes.
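As a minimal sketch (host, port, batch interval and the filter are placeholders), Spark Streaming consumes lines from an open socket in micro-batches instead of from a fixed-size input file:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SocketStreamDemo {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("SocketStreamDemo");
        // Process the stream in 10-second micro-batches
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Consume lines from an open TCP socket rather than a fixed-size file
        JavaDStream<String> lines = ssc.socketTextStream("stream-host", 9999);
        lines.filter(line -> line.contains("ERROR")).print();

        ssc.start();
        ssc.awaitTermination();
    }
}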
