Ideas for balancing out an HDFS -> HBase MapReduce job - performance

For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavor Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster running reasonably well for the circumstances.
Last night I ran a full test of their importer script to pull data from a specified HDFS path and push it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed on to the reducer phase.
The job runs within expectations for the environment (the number of spilled records is expected), but one really odd problem is that when the job runs, it runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration, which causes most of the data to be pushed into blocks that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256 GB LZO files on a physically hosted cluster.
To clarify my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by making each reducer cut down the amount of data it will parse? Even an improvement of 4 reducers over the current 2 would be a major improvement.

It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better (see the sketch after this list).
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
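As a rough sketch of the custom-partitioner route: the class name and the salted key format below are hypothetical, not taken from your job. The idea is that the mapper appends a small salt to each hot key and the partitioner spreads the salted variants across reducers. Note that salting changes grouping, so the reducer (or a follow-up job) has to merge the salted groups back together.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: assumes the mapper emits keys of the form
// "<naturalKey>#<salt>", where the salt is a small number appended in the mapper.
// Partitioning on the salt spreads one hot natural key across several reducers.
public class SaltedKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        int idx = k.lastIndexOf('#');
        String salt = idx >= 0 ? k.substring(idx + 1) : k;
        return (salt.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would then register it in the driver with job.setPartitionerClass(SaltedKeyPartitioner.class).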

Related

Debugging Hadoop reducer OutOfMemoryError

I'm trying to debug an OutOfMemoryError I'm getting in my Hadoop reducers. The mappers complete successfully. They generate small records that are less than 128 bytes. In my reducer, I collect records with the same key (there are around 15 possible keys) and write them to separate output files with MultipleOutputs. The distribution of records per key isn't uniform.
In the middle of the reduce phase, I start getting OutOfMemoryErrors. I've checked a lot of things:
The reducer doesn't save data; once it gets a value, it writes it out to the corresponding output
I tried different values for the number of reduce tasks. Tuning this is a bit weird in my case because more than 15 won't help, since there are only 15 keys
Instantiating MultipleOutputs and closing it in reduce(), thinking it holds onto resources for the output files. This only works because keys and output files have a one-to-one mapping.
I tried adding data to the end of the keys so the data would get distributed evenly between reduce tasks
Out of paranoia, setting mapreduce.reduce.shuffle.memory.limit.percent=0
Verified keys and values really are small
Disabled output compression, thinking there's a memory leak in the compressor
Blindly tuning things like mapreduce.reduce.shuffle.merge.percent
I'm not sure where else memory could be going other than aggressively buffering the shuffle output.
This is running on GCP Dataproc with Hadoop 3.2.2. A lot of guides recommend setting mapreduce.reduce.java.opts. I tried this unsuccessfully, but I also assume Google chose a reasonable default for the host size, and I don't have a convincing story about where the memory is going. My one other theory is that something in GoogleHadoopOutputStream, which writes to cloud storage, is buffering. I have some output files between 10 GB and 100 GB, larger than the memory of the machine.
What else should I look at? Are these other flags I should try to tune? Attaching VisualVM doesn't look easy, but would a heap dump help?
Each GoogleHadoopOutputStream consumes around 70 MiB of JVM heap because it uploads data to Google Cloud Storage in 64 MiB chunks by default. That's why, if you are writing many objects in the same MR task using MultipleOutputs, each task will need roughly (number of outputs) x 70 MiB of JVM heap.
You can reduce the memory consumed by each GoogleHadoopOutputStream via the fs.gs.outputstream.upload.chunk.size property, but this will reduce upload speed to Google Cloud Storage too, which is why a better approach is to refactor your MR job to write a single file (or fewer files) in each MR task.
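For reference, a minimal sketch of lowering that chunk size from the job driver, assuming the GCS connector version on your cluster honors fs.gs.outputstream.upload.chunk.size and that the value is specified in bytes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// ... in the job driver, after Job.getInstance(...)
Configuration conf = job.getConfiguration();
// 8 MiB upload chunks instead of the default 64 MiB (value is in bytes)
conf.setInt("fs.gs.outputstream.upload.chunk.size", 8 * 1024 * 1024);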

One mapper sometimes does not start

I am creating a Hadoop MapReduce job and I am using two Scans over one HBase table to feed my mappers. The HBase table has 10 regions. I create two scanners, call setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, tableName) on them, then I do this:
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
job.setSortComparatorClass(CompositeKeyComparator.class);
TableMapReduceUtil.initTableMapperJob(scans, FaultyRegisterReadMapper.class, MeterTimeKey.class, ReadValueTime.class, job);
For some reason, only two mappers are created most of the time. I would like there to be more mappers but that's not really a big deal.
The really bad part is that SOMETIMES it creates three mappers, and when it does, the first two mappers finish quite quickly but the third mapper doesn't even start for five minutes. It is this mapper that takes so long to start that is really bothering me. :)
This is on a cluster with some 60 nodes and it is not busy.
I suspect the number of mappers might be driven by how much data the job finds in the table, but I'm not positive about that.
Main question: Any ideas why one mapper takes so long to start?
Along with the hardware resources of your nodes, I would also check the network traffic. You might be suffering from network saturation (interface errors, framing errors, etc.).
After that I would make sure of the following things:
RegionServer hotspotting: uneven key-space distribution can lead to a huge number of requests to a single region, bombarding the RegionServer process and causing slow response times. Do you have keys consisting of time-series-like data?
Non-local data regions: perhaps your job is requesting data which is not local to the DataNode (RegionServers run on DataNodes), thus forcing HDFS to request data blocks from other servers over the network (which involves network traffic as well).
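On the mapper-count suspicion: as far as I remember, the table input formats create roughly one map task per region covered by each Scan's row range, so a quick sanity check is to list the table's region boundaries and compare them against your scans' start/stop rows. A minimal sketch, assuming a reasonably recent HBase client API and a hypothetical table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     RegionLocator locator = conn.getRegionLocator(TableName.valueOf("my_table"))) {
    // Print each region's start key; compare these against the scans' start/stop rows
    for (byte[] startKey : locator.getStartKeys()) {
        System.out.println(Bytes.toStringBinary(startKey));
    }
}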

Why Hadoop shuffling takes longer than expected

I am trying to figure out which steps take how much time in the simple Hadoop wordcount example.
In this example, 3 maps and 1 reducer are used, and each map generates ~7 MB of shuffle data. I have a cluster connected via 1 Gb switches. When I look at the job details, I see that shuffling takes ~7 sec after all map tasks are completed, which is more than I'd expect for transferring such a small amount of data. What could be the reason behind this? Thanks
Hadoop uses heartbeats to communicate with nodes. By default Hadoop uses a minimal heartbeat interval of 3 seconds. Consequently Hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
The transfer is not the only thing that happens after the map step. Each mapper sorts its output and writes it locally, partitioned by reducer. The reducer that is tasked with a particular partition then gathers its part of each mapper's output, each requiring a transfer of 7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.
I think the shuffling started after the first mapper finished, but it then had to wait for the next two mappers.
There is an option to start the reduce phase (which begins with shuffling) only after all the mappers have finished, but that doesn't really speed anything up.
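If you want to experiment with that anyway, this is the relevant knob, assuming the MRv2 property name mapreduce.job.reduce.slowstart.completedmaps (the fraction of map tasks that must complete before reducers are scheduled):

// 1.0 = wait for all maps before launching reducers; the default is much lower (around 0.05)
job.getConfiguration().setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);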
(BTW, 7 seconds is considered fast in Hadoop. Hadoop is poor in performance, especially for small files. Unless somebody else is paying for this, don't use Hadoop.)

How does Hadoop/MapReduce scale when input data is NOT stored?

Hadoop appears to be intended for cases where the input data is distributed (in HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well/quickly a Hadoop job can run.
To your question and example: if you will generate the input data, you have the choice of generating it before the first job runs, or you can generate it within the first mapper. If you generate it within the mapper, then you can figure out which node the mapper is running on and generate just the data that would be reduced in that partition (use a partitioner to direct data between mappers and reducers).
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
A thing few people realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. By having checkpointing and recovery built into the architecture, it reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving out the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you'll wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input: each mapper reads one /16 subnet and does its job on that. It's not that much fake input to generate once you realize that you don't need to generate all 2^32 IPs as input records, unless you have that many nodes in your cluster...
The number of mappers depends on the number of splits generated by the InputFormat implementation.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally, and there are many reports that it does not work as expected.
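For what it's worth, a minimal sketch of wiring that up, assuming an input file of IP ranges at a hypothetical path and one line per split:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 1);                           // one mapper per line
NLineInputFormat.addInputPath(job, new Path("/ranges/ip-ranges.txt"));  // hypothetical path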
If you really need it, you could create your own implementation of InputFormat which generates the InputSplits for your virtual data and forces as many mappers as you need.
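Here is a rough sketch of that idea, fabricating one split per /16 subnet as suggested earlier. All class names are hypothetical, error handling is omitted, and this is a sketch rather than production code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SubnetInputFormat extends InputFormat<LongWritable, Text> {

    // One "virtual" split per a.b.0.0/16 subnet; no files are read at all.
    public static class SubnetSplit extends InputSplit implements Writable {
        int a, b; // the first two octets of the subnet

        public SubnetSplit() {}                        // required for deserialization
        public SubnetSplit(int a, int b) { this.a = a; this.b = b; }

        @Override public long getLength() { return 65536; }                 // 2^16 addresses
        @Override public String[] getLocations() { return new String[0]; }  // no data locality
        @Override public void write(DataOutput out) throws IOException { out.writeInt(a); out.writeInt(b); }
        @Override public void readFields(DataInput in) throws IOException { a = in.readInt(); b = in.readInt(); }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        List<InputSplit> splits = new ArrayList<>();
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                splits.add(new SubnetSplit(a, b));     // 65536 splits -> up to 65536 mappers
        return splits;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<LongWritable, Text>() {
            private SubnetSplit s;
            private long i = -1;                       // index of the current IP within the /16

            @Override public void initialize(InputSplit sp, TaskAttemptContext c) { s = (SubnetSplit) sp; }
            @Override public boolean nextKeyValue() { return ++i < 65536; }
            @Override public LongWritable getCurrentKey() { return new LongWritable(i); }
            @Override public Text getCurrentValue() {
                return new Text(s.a + "." + s.b + "." + (i / 256) + "." + (i % 256));
            }
            @Override public float getProgress() { return Math.max(0f, i / 65536f); }
            @Override public void close() {}
        };
    }
}

You would register it with job.setInputFormatClass(SubnetInputFormat.class); each mapper then receives the 65536 IPs of one /16 as its records.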

HBase bulk load spawns a high number of reducer tasks - any workaround?

HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job will spawn a few hundred reducer tasks. This can get very slow on a small cluster.
Is there any workaround possible by using MultipleOutputFormat or something else?
Thanks
Sharding the reduce stage by region gives you a lot of long-term benefit. You get data locality once the imported data is online, and you can also determine when a region has been load balanced to another server. I wouldn't be so quick to go to a coarser granularity.
Since the reduce stage is doing a single file write, you should be able to setNumReduceTasks(# of hard drives). That might speed it up more.
It's very easy to get network bottlenecked. Make sure you're compressing your HFile & your intermediate MR data.
job.getConfiguration().setBoolean("mapred.compress.map.output", true);
job.getConfiguration().setClass("mapred.map.output.compression.codec",
        org.apache.hadoop.io.compress.GzipCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);
job.getConfiguration().set("hfile.compression",
        Compression.Algorithm.LZO.getName());
Your data import size might be small enough that you should look at using a Put-based format. This will call the normal HTable Put API and skip the reducer phase. See TableMapReduceUtil.initTableReducerJob(table, null, job).
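A minimal sketch of that Put-based path, assuming the mapper already emits (ImmutableBytesWritable, Put) pairs and using a hypothetical table name:

// Wires up TableOutputFormat so Puts go straight to the table; with zero
// reduce tasks the job is map-only and no HFile/bulk-load step runs.
TableMapReduceUtil.initTableReducerJob("my_table", null, job); // hypothetical table name
job.setNumReduceTasks(0);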
When we use HFileOutputFormat, it overrides the number of reducers whatever you set. The number of reducers is equal to the number of regions in that HBase table. So decrease the number of regions if you want to control the number of reducers.
You will find sample code here:
Hope this will be useful :)
