How to run oozie workflows faster - hadoop

So I have a large Oozie workflow comprising roughly 300 actions: a few shell actions, some Sqoop actions, and a lot of Hive and map-reduce actions. There are subworkflows as well.
I have a cluster of X machines, each with decent RAM and disk space.
The total runtime in production is fine; however, for development, where I have only limited test data, the job still takes on the order of hours.
I understand that even forking one JVM takes about 1 to 3 seconds, and this alone can push my job to an hour (assuming 4 MR jobs per action on average).
However, since I know my data is small in development, I would like the execution to be much faster.
I think I should be able to run the entire Oozie workflow on a single machine (1 out of those X) and be done with the job in a few minutes.
One alternative I know of is uber tasks, which I am currently exploring. However, it seems an uber task only runs the map and reduce tasks of a single Hadoop job in the same JVM.
So if a Hive query fires 4 MR jobs, I will still need 4 JVMs.
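For reference, this is roughly how I understand uber mode is switched on (a minimal driver sketch using the standard MRv2 property names; the thresholds shown are the documented defaults and would need tuning, and the job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: ask YARN to run small jobs "uberized", i.e. all map and reduce tasks
// execute inside the ApplicationMaster's JVM instead of separate containers.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.job.ubertask.enable", true);
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);     // job qualifies only if it has <= 9 maps
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);  // and at most 1 reduce
Job job = Job.getInstance(conf, "small-dev-job");     // hypothetical job name

For Hive- or Oozie-launched jobs the same properties can presumably be set cluster-wide in mapred-site.xml rather than in a driver.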
Is it possible to reuse JVMs across MR jobs?
Any other suggestions on faster runtimes for small amount of data will be helpful.
Thanks.

Related

Why is hadoop slow for a simple hello world job

I am following the tutorial on the hadoop website: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html.
I run the following example in Pseudo-Distributed Mode.
time hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'
It takes 1:47min to complete. When I turn off the network (wifi), it finishes in approx 50 seconds.
When I run the same command using the Local (Standalone) Mode, it finishes in approx 5 seconds (on a mac).
I understand that in Pseudo-Distributed Mode there is more overhead involved and hence it will take more time, but in this case it takes way more time. The CPU is completely idle during the run.
Do you have any idea what can cause this issue?
First, I don't have an explanation for why turning off your network would result in faster times. You'd have to dig through the Hadoop logs to figure out that problem.
This is typical behavior most people encounter when running Hadoop on a single node. Effectively, you are trying to use FedEx to deliver something to your next-door neighbor: it will always be faster to walk it over, because of the inherent overhead of operating a distributed system. When you run local mode, you are only performing the map-reduce functions. When you run pseudo-distributed, it uses all the Hadoop servers (NameNode and DataNodes for data; ResourceManager and NodeManagers for compute), and what you are seeing are the latencies involved in that.
When you submit your job, the ResourceManager has to schedule it. As your cluster is not busy, it will ask for resources from the NodeManager. The NodeManager will give it a container which will run your ApplicationMaster. Typically, this loop takes about 10 seconds. Once your AM is running, it will ask the ResourceManager for resources for its map and reduce tasks. This takes another 10 seconds. There is also around a 3-second wait between submitting your job and it actually reaching the ResourceManager. So far that's 23 seconds and you haven't done any computation yet.
Once the job is running, the most likely cause of waiting is memory allocation. On smaller systems (< 32 GB of memory) the OS might take a while to allocate space. If you were to run the same thing on what is considered commodity hardware for Hadoop (16+ cores, 64+ GB of RAM), you would probably see run times closer to 25-30 seconds.
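If all you need is a fast turnaround on tiny inputs, you can also force the local runner for a particular job rather than going through YARN at all. A minimal sketch, assuming you control the driver (the same properties can also be passed on the command line with -D; the job name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: run the whole job in the client JVM against the local filesystem,
// which skips the ResourceManager/NodeManager round trips described above.
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "local");  // use the local runner instead of YARN
conf.set("fs.defaultFS", "file:///");           // read input from the local filesystem
Job job = Job.getInstance(conf, "local-debug-run");  // hypothetical job name

This is essentially what standalone mode gives you, which is why your 5-second local run is so much faster.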

How many reducers can simultaneously run?

I'm learning Big Data at uni and I'm kind of confused on the topic of MapReduce. I was wondering how many reducers can run simultaneously. For example, let's say we had 864 reducers - how many could run at the same time?
All of them can run simultaneously, depending on the state of the cluster (its health, i.e. no rogue/bad nodes), its capacity, and how free it is. If other MR jobs are running on the same cluster, then only a few of your 864 reducers will go into the running state, and once capacity frees up another set of reducers will start running.
There is also a case that sometimes happens where your reducers and mappers keep preempting each other and take up all the memory. The job fails in most of these cases. To avoid this, we generally configure fewer reducers.
The one-line answer is: all of them can run simultaneously, since each reducer performs an independent unit of work in the MapReduce framework.
How many actually run in parallel, or more precisely when each of them is scheduled to run, depends on many factors, including but not limited to resource availability, the scheduling mechanism, and cluster configuration.
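For concreteness, the requested reducer count is just a property on the job; the scheduler decides how many of those reducers actually hold containers at any moment. A minimal sketch (hypothetical job name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: request 864 reduce tasks; how many run concurrently depends on
// free cluster capacity, not on this number.
Job job = Job.getInstance(new Configuration(), "reducer-demo");
job.setNumReduceTasks(864);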

Hadoop Mapper Reducer Taking So much Time

I have MR code that takes 7 minutes for a single 15 GB file.
But for multiple files totalling 37 GB it takes far too long, consistently showing about 1% progress per minute.
Please suggest what I can do.
MapReduce was never designed for low latency. The idea of MapReduce is that you have cases where you can process all the data in parallel; the key idea is to reduce the total time through parallelism.
Take word count, for example. Let's say you want to run a word count on 50 GB. Running this on a single machine might take a long time. Parallelizing it across, say, 10 machines means 5 GB per machine, processed in parallel. That's an improvement. These are the cases MapReduce was designed for.
If you are looking for a technology that returns results fast and also supports random reads, consider a different technology. Depending on your specific needs, there are several approaches that might solve your problem better.
It was my mistake: I had put a custom logger in the code, so every time the MR job ran it was writing to the MR log file, and that is why it was taking so long.

Why does submitting a job to MapReduce take so much time in general?

So usually, on a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds, and the actual execution takes about 1 minute.
I want to understand what the bottleneck is in the job submission process, and to understand the following quote:
Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time
Some steps I'm aware of:
1. data splitting
2. jar file sharing
A few things to understand about HDFS and M/R that help explain this latency:
HDFS stores your files as data chunks (blocks) distributed across multiple machines called datanodes.
M/R runs multiple programs called mappers, one on each of the data chunks or blocks. The (key, value) output of these mappers is combined into the final result by reducers. (Think of summing partial results from multiple mappers.)
Each mapper and reducer is a full-fledged program that is spawned on this distributed system. It takes time to spawn full-fledged programs, even if, say, they did nothing (no-op map-reduce programs).
When the size of the data to be processed becomes very large, these spawn times become insignificant, and that is when Hadoop shines.
If you were to process a file with 1,000 lines of content, you would be better off using a normal read-and-process program. The Hadoop infrastructure needed to spawn processes on a distributed system would not yield any benefit; it would only contribute the additional overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting the results.
Now expand that to hundreds of petabytes of data, and those overheads look completely insignificant compared to the time it would take to process the data. Parallelization of the processors (the mappers and reducers) shows its advantage here.
So before analyzing the performance of your M/R job, you should first benchmark your cluster so that you understand the overheads better.
How much time does it take to do a no-operation map-reduce program on a cluster?
Use MRBench for this purpose:
MRBench loops a small job a number of times.
It checks whether small job runs are responsive and running efficiently on your cluster.
Its impact on the HDFS layer is very limited.
To run this program, try the following (check the correct invocation for the latest versions):
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
Surprisingly on one of our dev clusters it was 22 seconds.
Another issue is file size.
If the file sizes are smaller than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop typically tries to spawn a mapper per block (and at least one per file), so if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers even though each file is tiny. This is real waste, because the per-program overhead is significant compared to the time spent actually processing such small files.
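One common mitigation, if you cannot merge the small files up front, is an input format that packs many small files into a single split so far fewer mappers are spawned. A sketch (the job name and split size are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Sketch: group many small files into combined splits of up to ~128 MB each,
// so one mapper handles many small files instead of one mapper per file.
Job job = Job.getInstance(new Configuration(), "small-files-job");
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);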
As far as I know, there is no single bottleneck that causes the job-run latency; if there were, it would have been solved a long time ago.
There are a number of steps that take time, and there are reasons why the process is slow. I will try to list them and give estimates where I can:
Running the hadoop client. It is a Java program, and I think about 1 second of startup overhead can be assumed.
Putting the job into the queue and letting the current scheduler run it. I am not sure what the overhead is, but because of the asynchronous nature of the process some latency should exist.
Calculating splits.
Running and synchronizing tasks. Here we face the fact that TaskTrackers poll the JobTracker, not the other way around. I think this is done for scalability's sake. It means that when the JobTracker wants to execute some task, it does not call the task tracker; it waits for the appropriate tracker to ping it and pick up the work. Task trackers cannot ping the JobTracker too frequently, otherwise they would overwhelm it in large clusters.
Running the tasks. Without JVM reuse this takes about 3 seconds per task; with it, the overhead is about 1 second per task (see the config sketch after this list).
The client polls the job tracker for the results (at least I think so), and this also adds some latency before you learn that the job is finished.
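The JVM reuse mentioned above is a classic (MR1) setting; a sketch of enabling it, assuming the old mapred property name (under YARN/MRv2 this property is ignored, and uber mode is the nearest equivalent):

import org.apache.hadoop.conf.Configuration;

// Sketch: let each task JVM be reused for further tasks of the same job,
// saving the ~1-3 s JVM startup cost per task. -1 means unlimited reuse.
Configuration conf = new Configuration();
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);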
I have seen a similar issue, and the solution can be broken into the following steps:
When HDFS stores too many small files at a fixed block size, there will be efficiency issues in HDFS; the best approach is to remove all unnecessary files and small files holding little data, then try again.
Reset the data nodes and name node:
Stop all the services using stop-all.sh.
Format name-node
Reboot machine
Start all services using start-all.sh
Check data and name nodes.
Try installing a lower version of Hadoop (Hadoop 2.5.2); this worked in two cases for me, found by trial and error.

Are these Hadoop setup/cleanup/run times reasonable?

I've set up and am testing out a pseudo-distributed Hadoop cluster (with the namenode, job tracker, and task tracker/data node all on the same machine). The box I'm running on has about 4 GB of memory and 2 CPUs, is 32-bit, and runs Red Hat Linux.
I ran the sample grep programs found in the tutorials with various file sizes and numbers of files. I've found that grep takes around 45 seconds for a 1 MB file, 60 seconds for a 100 MB file, and about 2 minutes for a 1 GB file.
I also created my own Map Reduce program which cuts out all the logic entirely; the map and reduce functions are empty. This sample program took 25 seconds to run.
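For reference, such a no-op program is essentially just the stock identity Mapper and Reducer wired into a driver, something along these lines (a sketch, not the exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a "do nothing" job: the base Mapper and Reducer classes simply pass
// records through, so the measured runtime is pure framework overhead.
public class NoOpJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "no-op");
    job.setJarByClass(NoOpJob.class);
    job.setMapperClass(Mapper.class);           // identity map
    job.setReducerClass(Reducer.class);         // identity reduce
    job.setOutputKeyClass(LongWritable.class);  // key/value types of the default TextInputFormat
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}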
I have tried moving the datanode to a second machine, as well as adding a second node, but I'm only seeing changes of a few seconds. In particular, I have noticed that setup and cleanup times are always about 3 seconds, no matter what input I give it. This seems to me like a really long time just for setup.
I know that these times will vary greatly depending on my hardware, configuration, inputs, etc. but I was just wondering if anyone can let me know if these are the times I should be expecting or if with major tuning and configuration I can cut it down considerably (for example, grep taking < 5 seconds total).
So you have only 2 CPUs, and Hadoop will spawn (in pseudo-distributed mode) many JVMs: one for the NameNode, one for the DataNode, one for the TaskTracker and one for the JobTracker. For each file in your job path Hadoop sets up a mapper task, and per task it spawns a new JVM, too. So your two cores are sharing 4+n applications, and your times are not abnormal... At least, Hadoop won't be as fast for plain-text files as for sequence files. To get a real speedup you have to serialize the text into a binary format (a sequence file) and let Hadoop stream over it.
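To illustrate the sequence-file point: the idea is to serialize your records once into Hadoop's binary container format and let subsequent jobs stream over that instead of re-parsing text. A rough sketch (the path and records are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: write (LongWritable, Text) records into a SequenceFile, a splittable
// binary format that MapReduce can read without re-parsing plain text.
Configuration conf = new Configuration();
Path out = new Path("/tmp/input.seq");  // illustrative path
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
  writer.append(new LongWritable(1), new Text("first record"));
  writer.append(new LongWritable(2), new Text("second record"));
}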
A few thoughts:
There is always a fixed time cost for every Hadoop job run: calculating the splits and launching the JVMs on each node to run the map and reduce tasks.
You won't experience any real speedup over UNIX grep unless you start running on multiple nodes with lots of data. With 100 MB-1 GB files, a lot of the time will be spent setting up the jobs rather than doing actual grepping. If you don't anticipate dealing with more than a gig or two of data, it probably isn't worth using Hadoop.
