In a Hadoop job i am mapping several XML-files and filtering an ID for every element (from < id>-tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, every line with just an integer as a ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
Before any map-job is executed.
The cluster consists of 40 nodes and since the files of a DistributedCache are copied to the slave nodes before any tasks for the job are executed, i assume the failure is caused by the large HashSet. I have already increased the mapred.task.timeout to 2000s. Of course I could raise the time even more, but actually this period should suffice, shouldn't it?
Since DistributedCache's are used to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here and if there is another way to pass the relevant IDs to every map-job?
Can you add some some debug printlns to your setup method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs in. You can probably store these values in a 50MB bloom filter with a good false positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.
Related
I have an Hadoop MapReduce job whose output is a row-id with a Put/Delete operation for that row-id. Due to the nature of the problem, the output is rather high volume. We have tried several method to get this data back into HBase and they have all failed...
Table Reducer
This is way to slow since it seems that it must do a full round trip for every row. Due to how the keys sort for our reducer step, the row-id is not likely to be on the same node as the reducer.
completebulkload
This seems to take a long time (never completes) and there is no real indication of why. Both IO and CPU show very low usage.
Am I missing something obvious?
I saw from your answer to self that you solved your problem but for completeness I'd mention that there's another option - writing directly to hbase. We have a set up where we stream data into HBase and with proper key and region splitting we get to more than 15,000 1K records per second per node
CompleteBulkLoad was the right answer. Per #DonaldMiner I dug deeper and found out that the CompleteBulkLoad process was running as "hbase" which resulted in a permission denied error when trying to move/rename/delete the source files. The implementation appears to retry for a long time before giving an error message; up to 30 minutes in our case.
Giving the hbase user write access to the files resolved the issue.
We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be do a multiple input job, read the directory for the files before you run the job and pass those instead of a directory to the job (then delete the files in the list after the job is done).
Based on the situation you are explaining I can suggest the following solution:-
1.The process of data monitoring I.e monitoring the directory into which the archives are landing should be done by a separate process. That separate process can use some metadata table like in mysql to put status entries based on monitoring the directories. The metadata entries can also check for duplicacy.
2. Now based on the metadata entry a separate process can handle the map reduce job triggering part. Some status could be checked in metadata for triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.
I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It will check every minute (default period is 1 minute, you can set it by "mapreduce.tasktracker.distributedcache.checkperiod").
When a Job finishes or fails, JobTracker will send a org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. Then if a TaskTracker receives a KillJobAction, it puts the action to tasksToCleanup. In the TaskTracker, there is a background Thread called taskCleanupThread which takes the action from tasksToCleanup and do the cleanup work. For a KillJobAction, it will invoke purgeJob to clean up the Job. In this method, it will decrease the reference count used by this Job (rjob.distCacheMgr.release();).
The above analysis bases on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked the hadoop-core-0.20.2-cdh3u1-sources.jar and found there was a litte difference between this two versions. For example, there was not a org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. When initializing a Job, TrackerDistributedCacheManager will check if there is enough space to put the new caches files for this Job. If not, it will delete the caches which have 0 reference count.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to let the clean up work delay. Then the probability that multiple Jobs use the same distributed cache is increased.
If you are using cdh3u1, you can increase the limitation of the cache size("local.cache.size", default is 10G) and the max directories for caches("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This can be also applied to cdh4.2.1.
If you look closely at what this book says, is that there is a limit of what can be stored in Distributed Cache. By default it's 10GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop kind of guarantees the files stay available in the cache for a single job as it is maintained by reference count done by the tasktracker for different tasks accessing the files in cache. In your case, for subsequent Jobs, the files may not be there as they are already marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via DistributedCache API instead of "-libjars"
I have a chain of Map/Reduce jobs:
Job1 takes data with a time stamp as a key and some data as value and transforms it.
For Job2 I need to pass the maximum time stamp that appears across all mappers in Job1 as a parameter. (I know how to pass parameters to Mappers/Reducers)
I can keep track of the maximum time stamp in each mapper of Job1, but how can I get the maximum across all mappers and pass it as a parameter to Job2?
I want to avoid running a Map/Reduce Job just to determine the maximum time stamp, since the size of my data set is in the terabyte+ scale.
Is there a way to accomplish this using Hadoop or maybe Zookeeper?
There is no way 2 maps can talk to each other.So a map only job( job1) can not get you global max. timestamp.However,I can think of 2 approaches as below.
I assume your job1 currently is a map only job and you are writing output from map itself.
A. Change your mapper to write the main output using MultipleOutputs and not Context or OutputCollector.Emit additional (key,value) pair as (constant,timestamp) using context.write().This way, you shuffle only the (constant,timestamp) pairs to reducer.Add a reducer that caliculates max. among the values it received.Run the job, with number of reducers set as 1.The output written from mapper will give you your original output while output written from reducer will give you global max. timestamp.
B. In job1, write the max. timestamp in each mapper as output.You can do this in cleanup().Use MultipleOutputs to write to a folder other than that of your original output.
Once job1 is done, you have 'x' part files in the output folder assuming you have 'x' mappers in job1.You can do a getmerge on this folder to get all the part files into a single local file.This file will have 'x' lines each contain a timestamp.You can read this using a stand-alone java program,find the global max. timestamp and save it in some local file.Share this file to job2 using distrib cache or pass the global max. as a parameter.
I would suggest doing the following, create a directory where you can put the maximum of each Mapper inside a file that is the mapper name+id. The idea is to have a second output directory and to avoid concurrency issues just make sure that each mapper writes to a unique file. Keep the maximum as a variable and write it to the file on each mappers cleanup method.
Once the job completes, it's trivial to iterate over secondary output directory to find the maximum.
When I run my hadoop job I get the following error:
Request received to kill task 'attempt_201202230353_23186_r_000004_0' by user
Task has been KILLED_UNCLEAN by the user
The logs appear to be clean. I run 28 reducers, and this doesnt happen for all the reducers. It happens for a selected few and the reducer starts again. I fail to understand this. Also other thing I have noticed is that for a small dataset, I rarely see this error!
There are three things to try:
Setting a CounterIf Hadoop sees a counter for the job progressing then it won't kill it (see Arockiaraj Durairaj's answer.) This seems to be the most elegant as it could allow you more insight into long running jobs and were the hangups may be.
Longer Task TimeoutsHadoop jobs timeout after 10 minutes by default. Changing the timeout is somewhat brute force, but could work. Imagine analyzing audio files that are generally 5MB files (songs), but you have a few 50MB files (entire album). Hadoop stores an individual file per block. So if your HDFS block size is 64MB then a 5MB file and a 50 MB file would both require 1 block (64MB) (see here http://blog.cloudera.com/blog/2009/02/the-small-files-problem/, and here Small files and HDFS blocks.) However, the 5MB job would run faster than the 50MB job. Task timeout can be increased in the code (mapred.task.timeout) for the job per the answers to this similar question: How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."
Increase Task AttemptsConfigure Hadoop to make more than the 4 default attempts (see Pradeep Gollakota's answer). This is the most brute force method of the three. Hadoop will attempt the job more times, but you could be masking an underlying issue (small servers, large data blocks, etc).
Can you try using counter(hadoop counter) in your reduce logic? It looks like hadoop is not able to determine whether your reduce program is running or hanging. It waits for a few minutes and kills it, even though your logic may be still executing.