Mahout seqdirectory mapreduce just gets one map task - hadoop

I want to have more map tasks to increase parallelism of Mahout seqdirectory job. But every time I try, it creates just one map task.
Hadoop version: 1.2.1
Mahout version: 0.8 and 0.9 (tested both; neither produces more than one map task)
Scenario: lots of small files (about 566,114 files of a few KB each) stored in HDFS
Before getting this problem (always one map task), I had faced another problem (GC overhead limit exceeded). Hence I had set more memory to solve it.
When I found that only one map task was being spawned, I edited the Hadoop configuration files (mapred-site.xml, hadoop-env.sh, ...). I set mapred.map.tasks to 20 in mapred-site.xml. This did not help.
I found out that the number of map tasks is decided by the number of input chunks (splits). In my case the total size of the files is over 500 MB (bigger than the default chunk size of 64 MB), so I would expect more than one chunk. I dug into the Mahout source code (v0.9) but have not found a solution.
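From what I can tell, with the Hadoop 1.x MapReduce API the split size is normally computed as max(minSplitSize, min(maxSplitSize, blockSize)) and the number of map tasks is roughly totalInputSize / splitSize, but a combining input format can pack everything into a single split unless a maximum split size is set explicitly. A sketch of what I tried (the 16 MB value is only an example, and whether seqdirectory's input format honours this property is exactly what I am unsure about):
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Ask for ~16 MB splits so that ~500 MB of input would yield ~30 map tasks.
// "mapred.max.split.size" is the Hadoop 1.x property read by the new-API FileInputFormat;
// it can also be passed on the command line as -Dmapred.max.split.size=16777216.
conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);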
Also, since I suspected the problem could be caused by the small files, I created two big files (two 500 MB files), loaded them into HDFS, and ran the seqdirectory command on them. Still, just one map task is launched.
I have no idea how to solve this. Can someone with knowledge of Hadoop and Mahout help me?
Note: I found that someone has already reported this problem as MAHOUT-833.
I am tracing SequenceFilesFromDirectory.java in the Mahout 0.9 source code, trying to figure out why.
Here is my configuration info: the job config file and the Hadoop config files (mapred-site.xml, hadoop-env.sh).

Related

Why is my Pig UDF not faster with more machines on Amazon EMR?

I'm new to Hadoop and Big Data. We have hundreds of log files every day; each file is about 78 MB. So we thought we could benefit from a Hadoop job, writing a Pig UDF and submitting it to Amazon EMR.
We wrote a really simple Pig UDF:
public class ProcessLog extends EvalFunc<String> {
// Extract IP Address from log file line by line and convert that to JSON format.
}
It works locally with Pig and Hadoop, so we submitted it to Amazon EMR and ran it on 5 x-large instances. It took about 40 minutes to finish. We thought that if we doubled the instances (10 x-large) we would get the result faster, but it ended up slower. What factors do we need to account for when writing a Pig UDF to get results faster?
Hundreds of log files ... Each file is about ~78Mb
The problem is that you don't have "Big Data". Unless you are doing seconds of processing for each MB, it will be faster NOT to use Hadoop. (The best definition of big data is "Data so big or streaming so fast that normal tools don't work".)
Hadoop has a lot of overhead, so you should use "normal" tools when your data is that small (a few GB). Your data probably fits into RAM on my phone! Use something like GNU parallel to make sure all your cores are occupied.
You need to check the following things when you run the job:
Number of mappers used
Number of reducers used
As you are processing 7 GB of data, it should create more than 56 mappers (with a 128 MB split size). In your case you can run it as a map-only job to convert each line to JSON. If it is not a map-only job, check how many reducers are being used. If only a few reducers are being used, increasing the number of reducers for the job might help. But you could also eliminate the reducers completely.
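If you do drop the reduce phase, then in plain MapReduce (as opposed to Pig) a map-only job is simply a job with zero reduce tasks; a minimal sketch, with illustrative class names:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "log-to-json");
job.setJarByClass(LogToJson.class);          // illustrative driver class
job.setMapperClass(LogToJsonMapper.class);   // illustrative mapper that emits one JSON record per line
job.setNumReduceTasks(0);                    // zero reducers = map-only job, no shuffle at all
In Pig itself you do not set this directly; a script that only LOADs, FILTERs and FOREACH ... GENERATEs (no GROUP/JOIN/ORDER) already compiles to a map-only job.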
Please paste the progress log of the execution, including the counters; it will help in pinpointing the issue.

Out of memory error in Mapreduce shuffle phase

I am getting strange errors while running a wordcount-like MapReduce program. I have a Hadoop cluster with 20 slaves, each having 4 GB of RAM. I configured my map tasks to have a heap of 300 MB and my reduce task slots to get 1 GB. I have 2 map slots and 1 reduce slot per node. Everything goes well until the first round of map tasks finishes; then the progress remains at 100%. I suppose the copy phase is then taking place. Each map task generates something like:
Map output bytes 4,164,335,564
Map output materialized bytes 608,800,675
(I am using SnappyCodec for compression)
After stalling for about an hour, the reduce tasks crash with the following exception:
Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1703)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1563)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)
I was googling and found this link but I don't really know what to make of it:
hadoop common link
I don't understand why Hadoop would experience any problems in copying and merging if it is able to perform a terasort benchmark. It cannot be that all map output has to fit into the RAM of the reducer. So what is going on here?
In the link provided above they have a discussion about tuning the following parameters:
mapreduce.reduce.shuffle.input.buffer.percent = 0.7
mapreduce.reduce.shuffle.memory.limit.percent = 0.25
mapreduce.reduce.shuffle.parallelcopies = 5
They claim that when the product of these parameters is greater than 1, heap-size errors become possible.
EDIT: Note that 5 * 0.25 * 0.7 is still < 1, so focus on my second solution post!
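To spell that arithmetic out as I understand it (assuming the 1 GB reduce-task heap configured above): 0.7 × 1024 MB ≈ 717 MB is reserved as the shuffle buffer; each in-flight copy may hold up to 0.25 × 717 MB ≈ 179 MB in memory; with 5 parallel copies the worst case is about 5 × 179 MB ≈ 896 MB, i.e. 0.875 of the heap, which is below 1 but leaves very little room for anything else.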
Before restarting this intensive simulation I would be very happy to hear someone's opinion on the problem I am facing, since it has been bothering me for almost a week now. I also seem not to completely understand what is happening in this copy phase; I'd expect a merge sort on disk not to require much heap.
Thanks a lot in advance for any helpful comments and answers!
I think the clue is that the heap of my reduce task was needed almost completely for the reduce phase, but the shuffle phase was competing for the same heap space; the resulting conflict caused my jobs to crash. I think this explains why the job no longer crashes if I lower shuffle.input.buffer.percent.
The parameter you cite, mapred.job.shuffle.input.buffer.percent, is apparently a pre-Hadoop-2 parameter. I could find that parameter in mapred-default.xml in the 1.0.4 docs, but its name has changed to mapreduce.reduce.shuffle.input.buffer.percent in the 2.2.0 docs.
Per the docs this parameter's description is:
The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
For a complete understanding of Sort and Shuffle see Chapter 6.4 of The Hadoop Definitive Guide. That book provides an alternate definition of the parameter mapred.job.shuffle.input.buffer.percent:
The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
Since you observed that decreasing the value of mapred.job.shuffle.input.buffer.percent from its default of 0.7 to 0.2 solved your problem, it is pretty safe to say that you could also have solved it by increasing the value of the reducer's heap size.
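For completeness, a minimal sketch of how either remedy could be applied on Hadoop 1.x (use whichever property names match your version, as discussed above):
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.2f);  // down from the 0.7 default
conf.set("mapred.reduce.child.java.opts", "-Xmx1536m");          // or: give the reducers a bigger heap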
Even after changing shuffle.input.buffer.percent to 0.2 it didn't work for me, and I got the same error.
After trial and error on a single-node cluster, I found that there needs to be enough space in the / directory, as the process uses that space when it spills to disk.
The spill directory also needs to be changed.
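For reference, to my understanding the spill location on Hadoop 1.x is governed by mapred.local.dir in mapred-site.xml (by default it sits under hadoop.tmp.dir, which often ends up on the / partition); the paths below are only an example:
mapred.local.dir=/data/1/mapred/local,/data/2/mapred/local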
Related bug - https://issues.apache.org/jira/browse/MAPREDUCE-6724
Can cause a NegativeArraySizeException if the calculated maxSingleShuffleLimit > MAX_INT

On Hadoop, is it possible to increase the number of maps (not map task slots or nodes!) while running an application (like WordCount or Pi Estimator)?

I'm a beginner programmer and a Hadoop learner.
I'm testing Hadoop in fully-distributed mode using 5 PCs (each with a dual-core CPU and 2 GB of RAM).
Before starting MapReduce and HDFS, I knew that I had to configure some files (etc/hosts with IPs and hostnames, and the masters and slaves files under the Hadoop conf folder), so I finished configuring those files.
And then, I have one question:
Is it possible to increase (or add) maps (not map task slots or nodes!) while an application (like WordCount or Pi Estimator) is running?
For example, I'm running the Pi Estimator application with 10 maps and 100,000,000 iterations.
While the application is running, I feel that it's so slow that I want to increase the number of maps, even though Hadoop is already running. Is that possible? If it is, please tell me how.
Or, before running the application, can I configure the config files (like hdfs-site.xml or mapred-site.xml) so that the number of maps is increased dynamically while the application is running?
Hadoop master users of the Stack Overflow community, please tell me the details.
You can set mapred.map.tasks to give the job a hint about how many mappers you want, but it's only a hint and Hadoop won't necessarily obey it. You can set the maximum number of map tasks running concurrently on each node by setting mapred.tasktracker.map.tasks.maximum.
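A minimal sketch of that per-job hint with the old (Hadoop 1.x) API; the driver class is just illustrative:
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(WordCount.class);  // illustrative driver class
conf.setNumMapTasks(20);    // only a hint: the real count is driven by the input splits
conf.setNumReduceTasks(2);  // by contrast, the reduce count is honoured exactly
For the Pi Estimator specifically, the number of maps is just the first command-line argument (e.g. hadoop jar hadoop-examples-*.jar pi 20 100000000). As far as I know, none of this can be changed after the job has been submitted; the splits are fixed at submit time.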
Setting the number of map tasks and reduce tasks
How to increase the mappers and reducers in hadoop according to number of instances used to increase the performance?

Hadoop - Reduce the number of Spilled Records

I have an Ubuntu VM running in standalone/pseudo-distributed mode with 4 GB of RAM and 4 cores.
Everything is set to default except:
io.file.buffer.size=65536
io.sort.factor=50
io.sort.mb=500
mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4
This of course will not be a production machine, but I am fiddling with it to get to grips with fine tuning.
My problem is that when I run my benchmark Hadoop Streaming job (getting the distinct records from a 1.8 GB text file) I get quite a lot of spilled records, and the above tweaks don't seem to reduce the spills. I have also noticed, when monitoring memory usage in Ubuntu's System Monitor, that it never gets fully used and never goes above 2.2 GB.
I have looked at changing HADOOP_HEAP, mapred.map.child.java.opts and mapred.reduce.child.java.opts, but I am not sure what to set these to, as the defaults seem as though they should be enough.
Is there a setting I am missing that will allow Hadoop to utilise the remaining RAM and therefore reduce spilled records (hopefully speeding up jobs), or is this normal behaviour?
Many Thanks!
In addition to increasing memory, have you considered whether you can run a combiner for your task after the map step? It would compress and reduce the number of records that need to be kept in memory or spilled.
Unfortunately, when you are using Streaming, it seems that the combiner has to be coded in Java and can't be written in whatever language you're using; a sketch follows after the link below.
http://wiki.apache.org/hadoop/HadoopStreaming
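A minimal sketch of what such a Java combiner could look like for a "get distinct records" job (old-API classes, which is what Streaming on Hadoop 1.x uses); something like -combiner DistinctCombiner on the streaming command line should pick it up once the class is on the job's classpath:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Collapses duplicate map outputs for the same key before they are spilled or
// shuffled, so far fewer records have to be buffered or sent to the reducers.
public class DistinctCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        output.collect(key, new Text(""));  // emit each distinct key exactly once
    }
}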
The default memory assigned to a map/reduce task is 200 MB. You can increase that value with -Dmapred.child.java.opts=-Xmx512M.
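To make that concrete for the settings in the question (rough numbers only): io.sort.mb=500 means the map-side sort buffer alone needs about 500 MB inside the task JVM, which cannot fit in a 200 MB heap, so something like
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
leaves room for the sort buffer plus the task's own objects, and a larger in-memory buffer is exactly what cuts down the number of spills.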
Anyway, this is a very interesting read about Hadoop tuning: Hadoop Performance.
Hope it helps!

Ideas for balancing out an HDFS -> HBase map reduce job

For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavoured Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster to run reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. All told, there are about 500K text records that get extracted from the gzips, sanity-checked, and then pushed on to the reduce phase.
The job runs within expectations for the environment (the amount of spilled records is expected by me), but one really odd problem is that when the job runs, it runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256 GB LZO files on a physically hosted cluster.
To clarify, my question is: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by making each reducer cut down the amount of data it has to parse? Even an improvement to 4 reducers over the current 2 would be major.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys output by the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better (see the sketch after this list).
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
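If you try the custom-partitioner route, here is a minimal sketch (new-API classes; the key/value types and the idea of mixing the value's hash in are purely illustrative, and only valid if your reduce logic does not need every value for a key to arrive at the same reducer):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads records that share a hot key across reducers by letting the value's
// hash influence the partition choice. Caveat: values for the same key no
// longer all meet in one reducer, so the reduce step must tolerate that.
public class SaltedPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hash = key.hashCode() * 31 + value.hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
It is wired in with job.setPartitionerClass(SaltedPartitioner.class).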

Resources