Hadoop Streaming MapReduce slow finding files - hadoop

I have a Hadoop job that is taking a very long time to initialize when fed a large number of input files, and I'm not sure why. The job will find all of the nodes and files within a few seconds, regardless of how many files are used, but takes significant time (minutes) to determine the number of splits if given 10,000 files. When I run the job as a different user, the job will determine the number of splits almost immediately.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D 'mapreduce.job.name=...'
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar] /tmp/streamjob4556879591438635545.jar tmpDir=null
17/08/07 22:01:40 INFO client.RMProxy: Connecting to ResourceManager at jobtracker-dev.la.prod.factual.com/10.20.103.16:8032
...
17/08/07 22:01:41 INFO security.TokenCache: Got dt for hdfs://dev; Kind: HDFS_DELEGATION_TOKEN....
17/08/07 22:01:41 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
17/08/07 22:01:41 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 674c65bbf0f779edc3e00a00c953b121f1988fe1]
17/08/07 22:01:41 INFO mapred.FileInputFormat: Total input paths to process : 10000
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/211/...
17/08/07 22:01:41 INFO net.NetworkTopology: Adding a new node: /cs3/210/...
...
<LONG PAUSE>
...
17/08/07 22:31:39 INFO mapreduce.JobSubmitter: number of splits:10000
This is not a lot of information, obviously, but does anyone have an idea what might be going on?

The time taken depends on many parameters. To start, check your cluster capacity and your YARN configuration in particular.
If you have 10k splits, that means coordination between the ApplicationMaster and the tasks will take a significant amount of time; remember that Hadoop is built for processing a small number of large files, not a large number of small files.
Also check your HDFS block size, and how much data you are loading.
If you are running in distributed mode, make sure you have set up passwordless SSH to your datanodes.
Regarding "When I run the job as a different user, the job will determine the number of splits almost immediately": this is not a Hadoop capacity issue, so check your configuration carefully. If the budget allows, consider using Ambari to manage your cluster.
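As a quick, hedged aid for the "check your HDFS block size" suggestion (a sketch, not a fix): the following Java snippet prints the block-size and split-size values the submitting client's configuration resolves to. The property names are the standard Hadoop 2.x ones; the defaults shown are only the usual defaults.

import org.apache.hadoop.conf.Configuration;

// Prints the block-size and split-size settings visible to the submitting client.
public class PrintSplitSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("dfs.blocksize = "
                + conf.getLong("dfs.blocksize", 128L * 1024 * 1024));
        System.out.println("mapreduce.input.fileinputformat.split.minsize = "
                + conf.getLong("mapreduce.input.fileinputformat.split.minsize", 1L));
        System.out.println("mapreduce.input.fileinputformat.split.maxsize = "
                + conf.getLong("mapreduce.input.fileinputformat.split.maxsize", Long.MAX_VALUE));
    }
}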

Related

Why is RandomForestClassificationModel broadcast to executors for every mini batch in Spark Streaming

Setup:
A random forest model is trained offline and stored on the file system.
The model is loaded once at the start of the Spark Streaming application using Pipeline.load.
The predict function is called for every batch (model.transform(input_data_frame)).
Observation: From the Spark UI we can see that every task of this stage spends most of its time (more than 95%) on task deserialization. Our assumption is that every task deserializes the model that was loaded initially, so we tried broadcasting the model (broadcast variables are useful when caching the data in deserialized form is important), but it still shows high task deserialization time.
Spark standalone cluster details: Spark version 2.2.1, executor cores = 4, executor memory = 4 GB, total executors = 24.
Model size: 45 MB
Spark Kafka streaming job jar size: 8 MB
1) Why is there a delay between these two steps? What is happening between them?
Attached is the Spark Kafka streaming log:
18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)), KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)), KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)), KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)), KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)), KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5)) 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
2) Why is Spark broadcasting the model to the executors for every mini batch?
18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
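For reference, here is a minimal sketch of the setup as the question describes it (load the pipeline once on the driver, call transform on the streaming DataFrame so it runs for every micro-batch). The model path, broker address, topic, and column names are hypothetical, and the feature-extraction stages are assumed to be part of the saved pipeline.

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StreamingScoringSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("rf-scoring").getOrCreate();

        // Random forest pipeline trained offline and saved to the file system;
        // loaded exactly once, on the driver (hypothetical path).
        PipelineModel model = PipelineModel.load("hdfs:///models/rf_pipeline");

        // Kafka source; broker and topic names are placeholders.
        Dataset<Row> input = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "Kafka_input_topic")
                .load()
                .selectExpr("CAST(value AS STRING) AS text");

        // transform() is applied to every micro-batch of the streaming DataFrame.
        Dataset<Row> predictions = model.transform(input);

        StreamingQuery query = predictions.writeStream().format("console").start();
        query.awaitTermination();
    }
}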

Hadoop MapReduce job I/O Exception due to premature EOF from inputStream

I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads with default value 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
Premature EOF can occur for multiple reasons, one of which is spawning a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file and binds a port to it to write to the disk. This puts a limit on the number of files that can be written to at one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node: the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory overshoot issue, nor can it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
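To make the failure mode being described concrete, here is a minimal, hedged sketch of the MultipleOutputs pattern (class and file names are hypothetical): every distinct base output path becomes a separate file, and therefore a separate writer, on the reducer node, so thousands of distinct names means thousands of concurrent writers.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CustomNameReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // One output file per key (hypothetical naming scheme): thousands of
            // distinct keys means thousands of files written from this one reducer.
            mos.write(key, value, "out-" + key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}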

Hadoop 2.4.0 creating 39063 map tasks to process a 10 MB file in local mode with an invalid InputSplit configuration

I am using hadoop-2.4.0 with all default configuration except the below:
FileInputFormat.setInputPaths(job, new Path("in")); //10mb file; just one file.
FileOutputFormat.setOutputPath(job, new Path("out"));
job.getConfiguration().set("mapred.max.split.size", "64");
job.getConfiguration().set("mapred.min.split.size", "128");
PS: I set the max split size lower than the min (I initially set it by mistake and realized it later).
And, as per the input split calculation logic:
max(minimumSize, min(maximumSize, blockSize))
max(128, min(64, 128)) --> 128 MB, which is greater than the file size, so it should create only one input split (one mapper).
I am just curious how the framework is calculating 39063 mappers each time I run this program in Eclipse.
Logs:
2015-07-15 12:02:37 DEBUG LocalJobRunner Starting mapper thread pool executor.
2015-07-15 12:02:37 DEBUG LocalJobRunner Max local threads: 1
2015-07-15 12:02:37 DEBUG LocalJobRunner Map tasks to process: 39063
2015-07-15 12:02:38 INFO LocalJobRunner Starting task:
attempt_local192734774_0001_m_000000_0
Thanks,
In your code you have specified:
job.getConfiguration().set("mapred.max.split.size", "64");
job.getConfiguration().set("mapred.min.split.size", "128");
These values are interpreted as bytes, hence you are getting a high number of mappers.
I think you should use something like this:
job.getConfiguration().set("mapred.min.split.size", "67108864");
67108864 is the value of 64 MB in bytes.
Calculation: 64 * 1024 * 1024 = 67108864
mapred.max.split.size is basically used to combine small files into a split when you are dealing with a large number of small files, and mapred.min.split.size is used to define the split size when you are dealing with large files.
If you are using YARN or MR2, then you should use mapreduce.input.fileinputformat.split.minsize.
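As a hedged illustration, here is what the split-size configuration could look like with the new-API helpers, with the values spelled out in bytes (the 64 MB / 128 MB numbers simply mirror the question; any file smaller than the minimum split size ends up in a single split):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeConfigSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SplitSizeConfigSketch.class);
        // job.setMapperClass(...), job.setReducerClass(...) etc. omitted.

        FileInputFormat.setInputPaths(job, new Path("in"));   // the 10 MB input
        FileOutputFormat.setOutputPath(job, new Path("out"));

        // Split sizes are interpreted as bytes, so express MB values explicitly.
        long minSplit = 64L * 1024 * 1024;   // 67108864 bytes = 64 MB
        long maxSplit = 128L * 1024 * 1024;  // 134217728 bytes = 128 MB

        // Equivalent to mapreduce.input.fileinputformat.split.minsize / .maxsize.
        FileInputFormat.setMinInputSplitSize(job, minSplit);
        FileInputFormat.setMaxInputSplitSize(job, maxSplit);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}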

What is Apache Spark doing before a job starts

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
This is the CPU use:
And this is the network:
Notice the gap between each column; it is almost the same size as the activity column!
At first I thought these two columns were shifted (when it was pulling from S3, it wasn't using a lot of CPU and vice-versa) but then I noticed that these two graphs actually follow each other. This makes sense since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question, what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that time. It is as if the cluster decided to take a break before each job.
Thanks.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice that at 01:13:13 it just hangs there until 02:30:45.
I found the issue. The problem was in the way I was pulling from S3.
We have our data in S3 separated by a date pattern as in s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
BUT that is bad. In retrospect, this makes sense: for each star (*), Spark needs to get all of the files in that "directory", and then get all of the files in the directory under that. A single month has about 30 entries, each day has 24, and each of those has 60. So the above pattern calls "list files" on each star AND then calls list files on the results, all the way down to the minutes! This is so that it can eventually get all of the **/actualData.txt files and then union all of their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings for all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed to textFile.
If in your case you can't determine all of the possible paths, consider either restructuring your data, building as much of the path as possible and only using * towards the end, or using the AmazonS3Client to get all the keys via the list-objects API (which lets you get ALL keys in a bucket with a prefix very quickly) and then passing them as a comma-separated string to textFile. It will still make a listStatus call for each file and it will still be serial, but there will be far fewer calls.
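A rough sketch combining those two suggestions (bucket name and date range are hypothetical): enumerate the date prefixes in code, keep a wildcard only for the trailing time components, and hand Spark one comma-separated path string.

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3PathBuilderSketch {
    // Builds one path per day and joins them; textFile accepts a comma-separated
    // list of paths, so Spark only expands the trailing hour/minute wildcards.
    public static JavaRDD<String> loadRange(JavaSparkContext sc) {
        List<String> paths = new ArrayList<>();
        LocalDate day = LocalDate.of(2015, 1, 1);
        LocalDate end = LocalDate.of(2015, 2, 1);
        while (day.isBefore(end)) {
            paths.add(String.format("s3n://bucket/%04d/%02d/%02d/*/*/actualData.txt",
                    day.getYear(), day.getMonthValue(), day.getDayOfMonth()));
            day = day.plusDays(1);
        }
        return sc.textFile(String.join(",", paths));
    }
}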
However, all of these solutions just slow down the inevitable; as more and more data gets built, more and more list status calls will be made serially. The root of the issue seems to be that sc.textFile("s3n://...") pretends that S3 is a file system, which it is not. It is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that does not assume a file system.

Lease mismatch LeaseExpiredException

I have seen some posts on this topic, but I could not figure out the fix for my problem. I am using Hadoop version 2.0.0-cdh4.2.0 and Java version 1.7.0_09-icedtea. I am running a program that uses counters to control iterations in a simple MapReduce example. I also employ sequence files for communicating data. The code is simple: it starts with a number, say 3. The mapper does not modify the number but simply transmits the value; the reducer reduces the number by 1 each time it runs. The counter is incremented if the number is greater than zero. Eventually, the number must decrease to 0, and the program should stop at that point. However, I always get the following error after the first iteration (during the second iteration):
" Running job: job_201304151408_0181
13/05/10 18:55:54 INFO mapred.JobClient: map 0% reduce 0%
13/05/10 18:56:03 INFO mapred.JobClient: map 100% reduce 0%
13/05/10 18:56:10 INFO mapred.JobClient: map 100% reduce 33%
13/05/10 18:56:11 INFO mapred.JobClient: Task Id : attempt_201304151408_0181_r_000002_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /user/harsha/iterone/import/data owned by DFSClient_NONMAPREDUCE_-592566041_1 but is accessed by DFSClient_NONMAPREDUCE_-965911637_1"
Can anyone please help? Thank you.
Regards...
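For context, here is a minimal, hedged sketch of the kind of counter-driven iteration the question describes (class, counter, and path names are hypothetical, not the asker's code); each pass reads the previous pass's output and writes to a fresh directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriverSketch {
    enum IterationCounter { KEEP_GOING }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long keepGoing = 1;
        int iteration = 0;
        String input = "iterzero/data";   // hypothetical seed data (e.g. a sequence file holding 3)

        while (keepGoing > 0) {
            String output = "iter" + (++iteration) + "/data";   // fresh directory per pass
            Job job = Job.getInstance(conf, "decrement-" + iteration);
            job.setJarByClass(IterativeDriverSketch.class);
            // job.setMapperClass(...), job.setReducerClass(...), I/O formats omitted.

            FileInputFormat.setInputPaths(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // The reducer increments this counter only while the value is still greater than zero.
            keepGoing = job.getCounters().findCounter(IterationCounter.KEEP_GOING).getValue();
            input = output;
        }
    }
}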
Usually a lease mismatch happens if we are trying to write to a file that does not exist.
Please check whether /user/harsha/iterone/import/data exists in HDFS.
Is it a file?
