Writing high-volume reducer output to HBase

I have a Hadoop MapReduce job whose output is a row-id with a Put/Delete operation for that row-id. Due to the nature of the problem, the output is rather high volume. We have tried several methods to get this data back into HBase and they have all failed...
Table Reducer
This is far too slow, since it seems to do a full round trip for every row. Due to how the keys sort for our reducer step, the row-id is unlikely to be on the same node as the reducer.
completebulkload
This seems to take a long time (it never completes) and there is no real indication why. Both IO and CPU show very low usage.
Am I missing something obvious?

I saw from your self-answer that you solved your problem, but for completeness I'd mention that there's another option: writing directly to HBase. We have a setup where we stream data into HBase, and with proper key design and region splitting we get more than 15,000 1K records per second per node.
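A minimal sketch of that direct-write approach, using the HBase client's BufferedMutator so puts are batched rather than flushed one row at a time (older clients used HTable with setAutoFlush(false) for the same effect). The table name, column family, and key layout here are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DirectHBaseWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("my_table"))) {
                for (int i = 0; i < 100_000; i++) {
                    // A salted/hashed key prefix spreads writes across regions instead of
                    // hammering a single region server with monotonically increasing keys.
                    byte[] rowKey = Bytes.toBytes(String.format("%02d-%08d", i % 16, i));
                    Put put = new Put(rowKey);
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                    mutator.mutate(put); // buffered client-side, flushed in batches
                }
                mutator.flush();
            }
        }
    }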

CompleteBulkLoad was the right answer. Per @DonaldMiner I dug deeper and found that the CompleteBulkLoad process was running as "hbase", which resulted in a permission-denied error when trying to move/rename/delete the source files. The implementation appears to retry for a long time before giving an error message; up to 30 minutes in our case.
Giving the hbase user write access to the files resolved the issue.
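For reference, completebulkload (programmatically, LoadIncrementalHFiles) moves the HFiles produced by HFileOutputFormat into HBase-managed directories, which is why the HBase service user needs write access to the staging directory. A rough sketch of the programmatic equivalent using the HBase 1.x client API (exact method signatures vary by version; the path and table name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Path hfileDir = new Path("/user/me/job-output");   // output of HFileOutputFormat
            TableName table = TableName.valueOf("my_table");
            try (Connection conn = ConnectionFactory.createConnection(conf)) {
                // The load renames HFiles into HBase's data directories, so the user
                // HBase runs as must be able to move (and later delete) these files.
                new LoadIncrementalHFiles(conf).doBulkLoad(
                    hfileDir, conn.getAdmin(), conn.getTable(table), conn.getRegionLocator(table));
            }
        }
    }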

Related

Understanding Apache Spark Web UI performance metrics

I'm new to Spark and I'm trying to understand the Web UI metrics that relate to my Spark application (developed with the Dataset API). I've watched a few videos from Spark Summit and Databricks, and most of them give a general overview of the Web UI: definitions of stage/job/task, how to tell when something is not working properly (e.g. unbalanced work between executors), suggestions about things to avoid while programming, etc.
However, I couldn't find a detailed explanation of each performance metric. In particular, I'm interested in understanding the things in the following images, which relate to a query that contains a groupBy(Col1, Col2), an orderBy(Col1, Col2) and a show().
Job 0
If I understood correctly, the default max partition size is 128 MB. Since my dataset is 1378 MB, I get 11 tasks that each work on roughly 128 MB, right? And since in the first stage I did some filtering (before applying the groupBy), tasks write to memory, so Shuffle Write is 108.3 KB. But why do I get 200 tasks for the second stage?
After the groupBy I used an orderBy. Is the number of tasks related to the shape of my dataset, or to its size?
UPDATE: I found spark.sql.shuffle.partitions of 200 default partitions conundrum and some other questions, but now I'm wondering whether there is a specific reason for it to be 200.
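As far as I know there is no deep reason behind 200; it is simply the default value of spark.sql.shuffle.partitions, which sets how many partitions every shuffle introduced by groupBy/orderBy/join produces. A small sketch of how it could be overridden (the input path and column names are placeholders matching the question's shape):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ShufflePartitionsDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("shuffle-partitions-demo")
                    .getOrCreate();

            // Shuffles created by groupBy/orderBy use spark.sql.shuffle.partitions
            // partitions (default 200), independent of the input's 128 MB split size.
            spark.conf().set("spark.sql.shuffle.partitions", "48");

            Dataset<Row> ds = spark.read().parquet("/path/to/dataset");
            ds.groupBy("Col1", "Col2").count()
              .orderBy("Col1", "Col2")
              .show();
        }
    }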
Stage 0
Why do some tasks show result serialization time here? If I understood correctly, serialization relates to the output, so to actions like show(), count(), collect(), etc. But those actions are not present in this stage (it comes before the groupBy).
Stage 1
Is it normal that result serialization time is such a large share here? I called show() (which takes 20 rows by default, and there is an orderBy), so do all tasks run in parallel while only that one serializes all its records?
Why does only one task have a considerable Shuffle Read Time? I expected all of them to have at least a small amount; again, is it something related to my dataset?
Is the deserialization time related to reading my dataset file? I'm asking because I wouldn't have expected it here, since this is stage 1 and it was already present in stage 0.
Job 1 - caching
Since I'm dealing with 3 queries that start from the same dataset, I used cache() at the beginning of the first query. I was wondering why it shows 739.9 MB / 765 [input size / records]... in the first query it shows 1378.8 MB / 7647431 [input size / records].
I guess it has 11 tasks since the size of the cached dataset is still 1378 MB, but 765 is a really low number compared to the initial 7647431, so I don't think it is really related to records/rows, right?
Thanks for reading.
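A hedged note on the cached size: Spark's Dataset cache stores data in a compressed, columnar in-memory format, so the size the UI reports for cached data can be smaller than the on-disk input, and the cache is only materialized by the first action that runs over it. A sketch of the cache-once pattern the question describes, with a hypothetical path and columns:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CacheOnceDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("cache-once-demo")
                    .getOrCreate();

            // cache() is lazy: the first action populates the in-memory columnar cache;
            // later queries read the cached form instead of re-scanning the input.
            Dataset<Row> source = spark.read().parquet("/path/to/dataset").cache();

            source.groupBy("Col1", "Col2").count().orderBy("Col1", "Col2").show(); // query 1 fills the cache
            source.groupBy("Col1").count().show();                                 // query 2 reads from the cache
        }
    }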

Reducer not completing and getting stuck at 99%

I am having some issues running a MapReduce job. The mapper completes quickly; however, the reducer gets stuck at 99.33%. I can see some IO errors in the log, but isn't Hadoop itself supposed to handle IO errors? I ran the job twice and the same thing happened. Any suggestions?
How balanced are your keys? It sounds like one key has the bulk of your records, so those records can only be processed by a single reducer.
If your job is a calculation that can be divided easily into sub-calculations (like simple counts), try breaking it up into two jobs by salting your key: add a random number or string to your key in order to distribute the work to multiple reducers on the first pass, then merge those results on a second pass (a rough sketch of the first pass follows).
Hope that makes sense!
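A minimal, hypothetical sketch of that salting pass for a simple count (the key extraction and number of salts are assumptions; a second job would strip the salt prefix and sum the partial counts):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Pass 1: prefix each key with a random salt so one hot key is spread over
    // several reducers instead of piling up on a single one.
    public class SaltedCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int NUM_SALTS = 16;              // tune to the number of reducers
        private static final LongWritable ONE = new LongWritable(1);
        private final Random random = new Random();
        private final Text saltedKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split("\t", 2)[0];   // assumption: tab-separated input
            saltedKey.set(random.nextInt(NUM_SALTS) + "#" + key);
            context.write(saltedKey, ONE);
        }
    }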
Please provide some more input:
1. What kind of setup do you have: a pseudo-distributed cluster with one VM, or multiple VMs?
2. Run df on your system when you get the IOException, to confirm that you don't have a disk-space issue.
3. What do you mean by "it is getting stuck"? Reducers will eventually time out and fail, so please elaborate on what you observed.
As for your question "isn't Hadoop itself supposed to handle the IO errors?":
Yes, like any good code, Hadoop handles IOException, but it may or may not finish the job successfully after an IO error, depending on your answers to questions 1 and 2. Simply put, Hadoop can be fault tolerant if you provide enough redundancy; with less redundancy, Hadoop jobs will fail on serious issues like IOExceptions.

Hadoop reducer error: "Shuffle Error: Exceeded the abort failure limit; bailing-out"

I've got a Hadoop 0.20 map/reduce job that used to run just fine. In the last few days, it's been getting stuck in the reduce phase at 16.66%, and I'm seeing the following error when I look at the reduce task in the JobTracker:
Shuffle Error: Exceeded the abort failure limit; bailing-out.
Can anyone tell me what that means, and maybe point me in the right direction so I can figure out how to fix this?
This error corresponds to the maximum number of times a reducer tries to fetch a map output before reporting the failure, and it maps to the property mapreduce.reduce.shuffle.maxfetchfailures.
You could try increasing this property, but the default value of 10 is usually enough, so there may be something more serious.
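If you do want to raise it, a sketch of setting the property on the job configuration (the property name is the one mentioned above; treat the exact name and default as version-dependent, and the job setup here is only a skeleton):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleFetchTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow more fetch attempts per map output before bailing out (default 10).
            conf.setInt("mapreduce.reduce.shuffle.maxfetchfailures", 30);
            Job job = Job.getInstance(conf, "shuffle-fetch-tuning-example");
            // ... set mapper, reducer, input and output paths as usual, then:
            // System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }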
I remember a case where something similar with fetch failures was due to an incorrect /etc/hosts file and after googling a bit it looks like this could be the issue, so try the following:
use hostnames instead of IPs
synchronize your /etc/hosts across all nodes (easier if you use something like Puppet)
try commenting out "127.0.0.1 localhost"
restart the cluster

Unusual Hadoop error - tasks get killed on their own

When I run my hadoop job I get the following error:
Request received to kill task 'attempt_201202230353_23186_r_000004_0' by user
Task has been KILLED_UNCLEAN by the user
The logs appear to be clean. I run 28 reducers, and this doesn't happen for all of them; it happens for a select few, and those reducers then start again. I fail to understand this. Another thing I have noticed is that with a small dataset I rarely see this error!
There are three things to try:
Setting a Counter: if Hadoop sees a counter for the job progressing, it won't kill it (see Arockiaraj Durairaj's answer). This seems to be the most elegant approach, as it also gives you more insight into long-running jobs and where the hang-ups may be (a sketch follows this list).
Longer Task Timeouts: Hadoop tasks time out after 10 minutes by default. Changing the timeout is somewhat brute force, but it can work. Imagine analyzing audio files that are generally 5 MB (songs), but with a few 50 MB files (entire albums). Hadoop stores an individual file per block, so if your HDFS block size is 64 MB then a 5 MB file and a 50 MB file each require one block (64 MB) (see http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ and Small files and HDFS blocks). However, the 5 MB task would run much faster than the 50 MB one. The task timeout can be increased in the job configuration (mapred.task.timeout), per the answers to this similar question: How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."
Increase Task Attempts: configure Hadoop to make more than the default 4 attempts (see Pradeep Gollakota's answer). This is the most brute-force method of the three. Hadoop will attempt the task more times, but you could be masking an underlying issue (small servers, large data blocks, etc.).
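A minimal sketch of the counter approach (the reducer logic and counter names are hypothetical); the timeout alternative is just a configuration change, e.g. conf.setLong("mapred.task.timeout", 1800000) for 30 minutes:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Incrementing a counter (or calling context.progress()) tells the framework the
    // task is still alive, so a genuinely slow-but-working reducer is not killed.
    public class SlowButAliveReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
                context.getCounter("app", "records-processed").increment(1);
            }
            context.write(key, new LongWritable(sum));
        }
    }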
Can you try using a counter (a Hadoop counter) in your reduce logic? It looks like Hadoop is not able to determine whether your reduce program is running or hanging, so it waits a few minutes and kills the task even though your logic may still be executing.

Hadoop DistributedCache failed to report status

In a Hadoop job I am mapping several XML files and extracting an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines, 2.7 GB, each line containing just an integer ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader, and save the IDs to a HashSet.
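Roughly the pattern described, as a hypothetical sketch (the cache file name and the <id> parsing are placeholders); note the periodic context.progress() call while loading, which is one way to keep the task from being killed during a long setup():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IdFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Set<Integer> ids = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // Assumes the cached ID file was symlinked into the task's working directory.
            try (BufferedReader reader = new BufferedReader(new FileReader("ids.txt"))) {
                String line;
                long count = 0;
                while ((line = reader.readLine()) != null) {
                    ids.add(Integer.parseInt(line.trim()));
                    if (++count % 1_000_000 == 0) {
                        context.progress();   // report liveness while the big file loads
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text xmlRecord, Context context)
                throws IOException, InterruptedException {
            String record = xmlRecord.toString();
            int start = record.indexOf("<id>") + 4;            // naive parsing, for illustration
            int end = record.indexOf("</id>", start);
            int id = Integer.parseInt(record.substring(start, end));
            if (ids.contains(id)) {
                context.write(new Text(String.valueOf(id)), xmlRecord);
            }
        }
    }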
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
before any map task is executed.
The cluster consists of 40 nodes, and since the files in a DistributedCache are copied to the slave nodes before any tasks for the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000s. Of course I could raise the time even more, but this period really should suffice, shouldn't it?
Since the DistributedCache is supposed to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here, and whether there is another way to pass the relevant IDs to every map task.
Can you add some debug printlns to your setup() method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs. You can probably store these values in a 50 MB Bloom filter with a good false-positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.
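A sketch of that idea using Guava's BloomFilter (Guava is an assumption here; the file name is a placeholder). A Bloom filter gives no false negatives, so IDs it rejects can be dropped immediately, and the small fraction of false positives is what the secondary check against the real reference file cleans up:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class IdBloomFilterBuilder {
        public static void main(String[] args) throws IOException {
            // Size the filter for ~250M integer IDs at the desired false-positive rate.
            BloomFilter<Integer> filter =
                    BloomFilter.create(Funnels.integerFunnel(), 250_000_000, 0.005);
            try (BufferedReader reader = new BufferedReader(new FileReader("ids.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    filter.put(Integer.parseInt(line.trim()));
                }
            }
            // mightContain(id) == false means the ID is definitely not in the set;
            // true means "probably", which the secondary job verifies.
            System.out.println("contains 42? " + filter.mightContain(42));
        }
    }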
