Apache Nutch flushes gora record after limit - hadoop

I have configured Nutch 2.3.1 with Hadoop/Hbase ecosystem. I have not changed gora.buffer.read.limit and gora.buffer.read.limit i.e., using their default values that is 10000 in both cases. At generate phase, I set topN to 100,000. During generate job I get following information
org.apache.gora.mapreduce.GoraRecordWriter: Flushing the datastore after 60000 records
After job completion, I found that 100,000 urls are marked for fetched that I want to be. But I am confused what does above warning shows ? What is impact of gora.buffer.read.limit on my crawling ?
Can someone guide ?

That log is written here. By default, the buffer is flushed after writing 10000 records, so you must have somewhere configured gora.buffer.write.limit to 60000 (at core-site.xml or mapred-site.xml or code?).
It is not important, since it is at INFO level. It only notifies that the write buffer is going to be written into the storage.
The writing process happens each time you call store.flush(), or in gora.buffer.write.limit size batches.

Related

Spark batches does not complete when running on Yarn cluster

Setting the scene
I am working to make a Spark streaming application (Spark 2.2.1 with Scala) run on a Yarn cluster (Hadoop 2.7.4).
So far I managed to submit the application to the Yarn cluster with spark-submit. I can see that the receiver task starts up correctly and fetches a lot of records from the database (Couchbase Server 5.0) and I can also see that the records are divided into batches.
The question
When I look at the Streaming Statistics on the Spark Web UI, I can however see that my batches are never processed. I have seen batches with 0 records process and complete but when a batch with records start processing it never completes. One time it even got stuck on a batch with 0 records.
I even tried simplifying the output operations on the SteamingContext as much as possible. But still with the very simple output operation print() my batches are never processed. The logs does not show any warnings or errors.
Does anyone know what might be wrong? Any suggestions on how to solve this will be much appreciated.
More Info
The main class of the Spark application is built from this example (first one) from the Couchbase Spark Connector documentation combined with this example with checkpoint from the Spark Documentation.
Right now I have 3230 Active Batches (3229 queued and 1 processing) and 1 Completed Batch (that had 0 records) and the application has been running for 4 hours and 30 minutes... and another batch is added every 5 seconds.
If I look at the "thread dump" for the executors I see a lot of WAITING, TIMED WAITING and a few RUNNABLE threads. The list will fill up 3 screenshots, so i will only post it if needed.
Below you will find some screenshots from the Web UI
Executor Overview
Spark Jobs Overview
Node Overview with resources
Capacity Scheduler Overview
Per screenshot, you have 2 cores and 1 is being used for driver and another is being used for receiver. You don't have a core for the actual processing to happen. Please increase the number of cores and try again.
Refer: https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).

"Too many fetch-failures" while using Hive

I'm running a hive query against a hadoop cluster of 3 nodes. And I am getting an error which says "Too many fetch failures". My hive query is:
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;
that's the query im trying to run. All i want to do is transfer data from tablename2 to tablename1. Any help is appreciated.
This can be caused by various hadoop configuration issues. Here a couple to look for in particular:
DNS issue : examine your /etc/hosts
Not enough http threads on the mapper side for the reducer
Some suggested fixes (from Cloudera troubleshooting)
set mapred.reduce.slowstart.completed.maps = 0.80
tasktracker.http.threads = 80
mapred.reduce.parallel.copies = sqrt (node count) but in any case >= 10
Here is link to troubleshooting for more details
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
Update for 2020 Things have changed a lot and AWS mostly rules the roost. Here is some troubleshooting for it
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource-1.html
Too many fetch-failures
PDF
Kindle
The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.
There are several reasons the output may not be available:
The prerequisite task is still processing. This is often a map task.
The data may be unavailable due to poor network connectivity if the data is located on a different instance.
If HDFS is used to retrieve the output, there may be an issue with HDFS.
The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.
One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.
If you are seeing this issue, there are several things you can try:
Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time. For more information, see Create Bootstrap Actions to Install Additional Software.
Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
Use a combiner class code to minimize the amount of outputs that need to be fetched.
Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
Review the CPU and memory resources of the instances in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Configure Cluster Hardware and Networking.
Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.
Jetty is an open-source HTTP server that is used for machine to machine communications within a Hadoop cluster

Writing high volume reducer output to HBase

I have an Hadoop MapReduce job whose output is a row-id with a Put/Delete operation for that row-id. Due to the nature of the problem, the output is rather high volume. We have tried several method to get this data back into HBase and they have all failed...
Table Reducer
This is way to slow since it seems that it must do a full round trip for every row. Due to how the keys sort for our reducer step, the row-id is not likely to be on the same node as the reducer.
completebulkload
This seems to take a long time (never completes) and there is no real indication of why. Both IO and CPU show very low usage.
Am I missing something obvious?
I saw from your answer to self that you solved your problem but for completeness I'd mention that there's another option - writing directly to hbase. We have a set up where we stream data into HBase and with proper key and region splitting we get to more than 15,000 1K records per second per node
CompleteBulkLoad was the right answer. Per #DonaldMiner I dug deeper and found out that the CompleteBulkLoad process was running as "hbase" which resulted in a permission denied error when trying to move/rename/delete the source files. The implementation appears to retry for a long time before giving an error message; up to 30 minutes in our case.
Giving the hbase user write access to the files resolved the issue.

Re-use files in Hadoop Distributed cache

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It will check every minute (default period is 1 minute, you can set it by "mapreduce.tasktracker.distributedcache.checkperiod").
When a Job finishes or fails, JobTracker will send a org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. Then if a TaskTracker receives a KillJobAction, it puts the action to tasksToCleanup. In the TaskTracker, there is a background Thread called taskCleanupThread which takes the action from tasksToCleanup and do the cleanup work. For a KillJobAction, it will invoke purgeJob to clean up the Job. In this method, it will decrease the reference count used by this Job (rjob.distCacheMgr.release();).
The above analysis bases on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked the hadoop-core-0.20.2-cdh3u1-sources.jar and found there was a litte difference between this two versions. For example, there was not a org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. When initializing a Job, TrackerDistributedCacheManager will check if there is enough space to put the new caches files for this Job. If not, it will delete the caches which have 0 reference count.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to let the clean up work delay. Then the probability that multiple Jobs use the same distributed cache is increased.
If you are using cdh3u1, you can increase the limitation of the cache size("local.cache.size", default is 10G) and the max directories for caches("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This can be also applied to cdh4.2.1.
If you look closely at what this book says, is that there is a limit of what can be stored in Distributed Cache. By default it's 10GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop kind of guarantees the files stay available in the cache for a single job as it is maintained by reference count done by the tasktracker for different tasks accessing the files in cache. In your case, for subsequent Jobs, the files may not be there as they are already marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via DistributedCache API instead of "-libjars"

Hadoop DistributedCache failed to report status

In a Hadoop job i am mapping several XML-files and filtering an ID for every element (from < id>-tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines in 2.7 GB, every line with just an integer as a ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader and save the IDs to a HashSet.
Now when I start the job, I get countless
Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!
Before any map-job is executed.
The cluster consists of 40 nodes and since the files of a DistributedCache are copied to the slave nodes before any tasks for the job are executed, i assume the failure is caused by the large HashSet. I have already increased the mapred.task.timeout to 2000s. Of course I could raise the time even more, but actually this period should suffice, shouldn't it?
Since DistributedCache's are used to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here and if there is another way to pass the relevant IDs to every map-job?
Can you add some some debug printlns to your setup method to check that it is timing out in this method (log the entry and exit times)?
You may also want to look into using a BloomFilter to hold the IDs in. You can probably store these values in a 50MB bloom filter with a good false positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.

Resources