I am trying to run a job with both mappers and reducers, but the mappers are running slowly.
If I disable the reducers for the same input, the mappers finish in 3 minutes, whereas for the mapper-reducer job the mappers are still not finished after 30 minutes.
I am using Hadoop 1.0.3. I tried both with and without compression of the map output. I also removed the older Hadoop 0.20.203 installation and reinstalled everything from scratch for 1.0.3.
The JobTracker logs are also filled with:
2012-10-03 10:26:20,138 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54311: readAndProcess threw exception java.lang.RuntimeException: readObject can't find class . Count of bytes read: 0
java.lang.RuntimeException: readObject can't find class
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:185)
at org.apache.hadoop.ipc.RPC$Invocation.readFields(RPC.java:102)
at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:1303)
at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1282)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1182)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:537)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:344)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:183)
Can anyone tell me what might be wrong?
If your mappers complete in 3 minutes, they are not slow by batch-processing standards. With the MapReduce version you are using, you do need to make sure you are using the correct number of reducers: if your cluster size is X, try setting the number of reducers to X-1 and see whether that helps.
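For example, you could set this explicitly in the job driver. This is only a minimal sketch with a made-up driver class and an assumed 8-node cluster; plug in your own mapper/reducer classes and cluster size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my-job");        // Hadoop 1.x constructor; use Job.getInstance(conf, ...) on 2.x+
        job.setJarByClass(MyJobDriver.class);
        // job.setMapperClass(MyMapper.class);    // your mapper/reducer classes here
        // job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        int clusterSize = 8;                      // hypothetical: X nodes in the cluster
        job.setNumReduceTasks(clusterSize - 1);   // X - 1 reducers, as suggested above

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}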
I am currently using the free version of the Teradata Hadoop connector (teradata-connector 1.3.4) to load data to Teradata, with the internal.fastload method.
The database version is 14.10.
The JDBC driver version is 15.0.
Sometimes I get a Connection refused exception while running the job, but the issue goes away after resubmitting the load job 2-3 times. It has nothing to do with load on the Teradata database, as that load is fairly normal. The exception thrown is below:
15/10/29 22:52:54 INFO mapreduce.Job: Running job: job_1445506804193_290389
com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.beginLoading(TeradataInternalFastloadOutputFormat.java:642)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.run(TeradataInternalFastloadOutputFormat.java:503)
at java.lang.Thread.run(Thread.java:745)
15/10/29 23:39:29 INFO mapreduce.Job: Job job_1445506804193_290389 running in uber mode : false
15/10/29 23:39:29 INFO mapreduce.Job: map 0% reduce 0%
15/10/29 23:40:08 INFO mapreduce.Job: Task Id : attempt_1445506804193_290389_m_000001_0, Status : FAILED
Error: com.teradata.connector.common.exception.ConnectorException: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat.getRecordWriter(TeradataInternalFastloadOutputFormat.java:301)
at com.teradata.connector.common.ConnectorOutputFormat$ConnectorFileRecordWriter.<init>(ConnectorOutputFormat.java:84)
at com.teradata.connector.common.ConnectorOutputFormat.getRecordWriter(ConnectorOutputFormat.java:33)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:624)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:744)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1591)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Any pointers in this regard will definitely help.
Thanks in advance.
Root cause: com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
This matches the "Internal fast load server socket time out" entry in the connector's troubleshooting documentation:
When running an export job using the "internal.fastload" method, the following error may occur: Internal fast load socket server time out
This error occurs because the number of currently available map tasks is less than the number of map tasks specified on the command line with the "-nummappers" parameter.
This error can occur under the following conditions:
(1) Other map/reduce jobs are running concurrently in the Hadoop cluster, so there are not enough resources to allocate the specified number of map tasks to the export job.
(2) The maximum number of map tasks in the Hadoop cluster is smaller than the number of existing map tasks plus the map tasks expected by the export job.
When this error occurs, try to increase the maximum number of map tasks in the Hadoop cluster, or decrease the number of map tasks for the export job.
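For example, if the export job asks for more mappers than the cluster can actually run at once, lowering -nummappers on resubmission usually clears the timeout. The invocation below is only an illustrative sketch: the tool class name, JDBC URL, credentials and paths are placeholders, and the exact option names can vary by TDCH version:

hadoop jar teradata-connector-1.3.4.jar com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://td-host/database=mydb \
    -username myuser -password mypass \
    -jobtype hdfs -fileformat textfile \
    -sourcepaths /user/me/export-input \
    -targettable my_table \
    -method internal.fastload \
    -nummappers 4

Keep -nummappers at or below the number of map slots the cluster can actually free up at the same time.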
There is a good troubleshooting PDF available from Teradata. If you get any errors, have a look at that PDF to get them fixed.
Also have a look at the other MapReduce properties if you need to fine-tune them.
ravindra-babu's answer is correct, since the answer is buried in the PDF docs. The support.teradata.com KB article KB0023556 also offers more detail on the why.
Resolution
All mappers should run simultaneously. If they are not running simultaneously, try to reduce the number of mappers in the TDCH job through the -nummappers argument.
Resubmit the TDCH job after changing -nummappers.
Honestly it's a very confusing error and could be displayed better.
I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated the job and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
Five seconds later, the log showed ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this, which indicates that dfs.datanode.max.xcievers may need to be increased. However, that property is deprecated; the new property is called dfs.datanode.max.transfer.threads, with a default value of 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
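For reference, I assume the transfer-threads change would go into hdfs-site.xml on the datanode, something like the snippet below (the 8192 is just an illustrative doubling of the default, not a value I know to be correct), followed by a datanode restart:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value> <!-- illustrative only: 2x the 4096 default -->
</property>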
Premature EOF can occur for multiple reasons, one of which is the spawning of a huge number of threads to write to disk on one reducer node when using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file and binds a port to it to write to disk. This puts a limit on the number of files that can be written to at one time on one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node: the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory-overshoot issue, nor can it be solved by letting the Hadoop engine spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
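To illustrate the pattern I mean, here is a minimal sketch (not my actual code; class and path names are made up) of a reducer that writes one output file per key via MultipleOutputs. With thousands of distinct keys this multiplies the number of open writers on a single reducer node:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerKeyFileReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // One base output path per distinct key: thousands of keys means
            // thousands of files (and writers) on this reducer node.
            out.write(NullWritable.get(), value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();   // flush and close all per-file writers
    }
}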
I have a MapReduce job for an HBase bulk load. The job converts the data into HFiles and loads them into HBase, but after a certain map percentage the job fails. Below is the exception I am getting:
Error: java.io.FileNotFoundException: /var/mapr/local/tm4/mapred/nodeManager/spill/job_1433110149357_0005/attempt_1433110149357_0005_m_000000_0/spill83.out.index
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:198)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:800)
at org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:156)
at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.mergeParts(MapRFsOutputBuffer.java:1382)
at org.apache.hadoop.mapred.MapRFsOutputBuffer.flush(MapRFsOutputBuffer.java:1627)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:709)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:779)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
The only thing I noticed is that the job works fine for a small set of data, but as the data grows, the job starts failing.
Let me know if anyone has faced this issue.
Thanks
This was a bug in MapR; I got a reply on the MapR forum. If anyone is facing a similar issue, refer to the link below:
http://answers.mapr.com/questions/163440/hbase-bulk-load-map-reduce-job-failing-on-mapr.html
From the error message it is quite obvious that there was a problem saving a replica of a particular block of a file. The reason might be a problem accessing a datanode to store a particular block (a replica of the block).
Please refer below for the complete log:
I found that another user, "huasanyelao" (https://stackoverflow.com/users/987275/huasanyelao), also had a similar exception/problem, but the use case was different.
Now, how do we solve these kinds of problems? I understand there is no fixed solution that handles all scenarios.
1. What is the immediate step I need to take to fix errors of this kind?
2. If there are jobs whose logs I am not monitoring at the time, what approaches should I take to fix such issues?
P.S.: Apart from fixing network or access issues, what other approaches should I follow?
Error Log:
15/04/10 11:21:13 INFO impl.TimelineClientImpl: Timeline service address: http://your-name-node/ws/v1/timeline/
15/04/10 11:21:14 INFO client.RMProxy: Connecting to ResourceManager at your-name-node/xxx.xx.xxx.xx:0000
15/04/10 11:21:34 WARN hdfs.DFSClient: DataStreamer Exception
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:29)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:512)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1516)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1318)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1272)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)
15/04/10 11:21:40 INFO hdfs.DFSClient: Could not complete /user/xxxxx/.staging/job_11111111111_1212/job.jar retrying...
15/04/10 11:21:46 INFO hdfs.DFSClient: Could not complete /user/xxxxx/.staging/job_11111111111_1212/job.jar retrying...
15/04/10 11:21:59 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/xxxxx/.staging/job_11111111111_1212
Error occured in MapReduce process:
java.io.IOException: Unable to close file because the last block does not have enough number of replicas.
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2132)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2100)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:54)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1903)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1871)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1836)
at org.apache.hadoop.mapreduce.JobSubmitter.copyJar(JobSubmitter.java:286)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:254)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:389)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1594)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.xxx.xxx.xxxx.driver.GenerateMyFormat.runMyExtract(GenerateMyFormat.java:222)
at com.xxx.xxx.xxxx.driver.GenerateMyFormat.run(GenerateMyFormat.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.xxx.xxx.xxxx.driver.GenerateMyFormat.main(GenerateMyFormat.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
We had a similar issue. It was primarily attributed to dfs.namenode.handler.count not being high enough. Increasing it may help on some small clusters, but the root problem is a DoS-like situation where the NameNode cannot keep up with the number of connections or RPC calls, and your pending deletion blocks grow huge. Check the HDFS audit logs for mass deletions or other heavy HDFS activity and match them against the jobs that might be overwhelming the NameNode. Stopping those tasks will help HDFS recover.
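For reference, the handler count lives in hdfs-site.xml on the NameNode; a sketch of the change (the value 100 is only an example, tune it for your cluster size):

<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value> <!-- example only; the default is much lower (10 on older releases) -->
</property>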
When I run my job on a larger dataset, lots of mappers/reducers fail, causing the whole job to crash. Here's the error I see on many mappers:
java.io.FileNotFoundException: File does not exist: /mnt/var/lib/hadoop/tmp/mapred/staging/hadoop/.staging/job_201405050818_0001/job.split
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1933)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1924)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:608)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:385)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Has anybody been able to solve this problem? I see another person experiencing the same pain as me (here); sadly, he could not be saved in time.
After hours of debugging, I found absolutely nothing useful in the Hadoop logs (as usual). Then I tried the following changes:
Increasing the cluster size to 10
Increasing the failure limits (one way to set them is shown after this list):
mapred.map.max.attempts=20
mapred.reduce.max.attempts=20
mapred.max.tracker.failures=20
mapred.max.map.failures.percent=20
mapred.max.reduce.failures.percent=20
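For anyone wondering where those settings go: I passed them as generic options when submitting the job, roughly as below. The jar, class and paths are placeholders, and this only works if the driver honours -D generic options (e.g. via ToolRunner); otherwise put the same properties in the job configuration or mapred-site.xml.

hadoop jar my-cascading-job.jar com.example.MyFlowDriver \
    -D mapred.map.max.attempts=20 \
    -D mapred.reduce.max.attempts=20 \
    -D mapred.max.tracker.failures=20 \
    -D mapred.max.map.failures.percent=20 \
    -D mapred.max.reduce.failures.percent=20 \
    path/to/input path/to/output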
I was subsequently able to run my Cascading job on large amounts of data. It seems to have been a problem caused by Cascading.