Hadoop Job throws java.io.IOException: Attempted read from closed stream

I'm running a simple map-reduce job. This job uses 250 files from common crawl data.
e.g. s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/
If I use 50 or 100 files, everything works OK. But with 250 files I get this error:
java.io.IOException: Attempted read from closed stream.
at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:159)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:107)
at org.jets3t.service.io.InterruptableInputStream.read(InterruptableInputStream.java:76)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.read(HttpMethodReleaseInputStream.java:136)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:111)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readByte(DataInputStream.java:248)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1707)
at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1773)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1849)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:74)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$SubMapRecordReader.nextKeyValue(MultithreadedMapper.java:180)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:268)
Any clues?

How many map slots do you have to process the input? Is it close to 100?
This is a guess, but it's possible that the connection to S3 is timing out while you process the first batch of files, and by the time slots become available to process further files the connection is no longer open. I believe timeout errors from NativeS3FileSystem show up as IOExceptions.
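If that turns out to be the cause, one workaround to experiment with (a sketch, not a verified fix) is to lower the number of map tasks that run concurrently on each tasktracker, so fewer S3 streams sit open waiting for a slot, e.g. in mapred-site.xml (Hadoop 1.x property name; the value is only illustrative):
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
Splitting the 250 input files across several smaller jobs would have a similar effect, since no connection then has to outlive a long queue of pending splits.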

Related

Do I have to hack the protobuf jar?

1. My namenode log keeps printing this error: java.io.IOException: Requested data length 113675682 is longer than maximum configured RPC length 67108864. RPC came from 172.16.xxx.xxx
The datanode log prints: Unsuccessfully sent block report 0x706cd6d00df0effe, containing 1 storage report(s), of which we sent 0. The reports had 9016550 total blocks and used 0 RPC(s). This took 1734 msec to generate and 252 msecs for RPC and NN processing. Got back no commands
2. I set ipc.maximum.data.length to 134217728, which solved that problem. Unfortunately, after changing the length I find that my HDFS client often cannot write data, although each outage only lasts a few minutes. While the client cannot write, the namenode throws a new exception for DatanodeProtocol.blockReport from 172.16.xxx.xxx:43410 Call#30074227 Retry#0:
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Referring to HDFS-5153, it says "The NameSystem write lock is held during this time."
Do I have to hack the protobuf jar and raise the size limit?
EDIT:
I found the same question asked elsewhere, but with no solution.
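For reference, ipc.maximum.data.length is a namenode-side setting and goes in core-site.xml on the namenode (the value below is simply the 128 MB figure mentioned above, not a recommendation; the namenode normally needs a restart to pick it up):
<property>
  <name>ipc.maximum.data.length</name>
  <value>134217728</value>
</property>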

Connection Refused Exception while using Teradata Hadoop Connector

I am currently using the free version of the Teradata Hadoop connector (teradata-connector 1.3.4) to load data into Teradata, using the internal.fastload method.
The database version is 14.10 and the JDBC driver version is 15.0.
Sometimes I get a Connection refused exception while running the job, but the issue goes away after resubmitting the load job 2-3 times. This has nothing to do with the load on the Teradata database, as the load is pretty normal. The exception thrown is below:
15/10/29 22:52:54 INFO mapreduce.Job: Running job: job_1445506804193_290389
com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.beginLoading(TeradataInternalFastloadOutputFormat.java:642)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat$InternalFastloadCoordinator.run(TeradataInternalFastloadOutputFormat.java:503)
at java.lang.Thread.run(Thread.java:745)
15/10/29 23:39:29 INFO mapreduce.Job: Job job_1445506804193_290389 running in uber mode : false
15/10/29 23:39:29 INFO mapreduce.Job: map 0% reduce 0%
15/10/29 23:40:08 INFO mapreduce.Job: Task Id : attempt_1445506804193_290389_m_000001_0, Status : FAILED
Error: com.teradata.connector.common.exception.ConnectorException: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.teradata.connector.teradata.TeradataInternalFastloadOutputFormat.getRecordWriter(TeradataInternalFastloadOutputFormat.java:301)
at com.teradata.connector.common.ConnectorOutputFormat$ConnectorFileRecordWriter.<init>(ConnectorOutputFormat.java:84)
at com.teradata.connector.common.ConnectorOutputFormat.getRecordWriter(ConnectorOutputFormat.java:33)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:624)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:744)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1591)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Any pointers in this regard will definitely help.
Thanks in advance.
Root cause: com.teradata.connector.common.exception.ConnectorException: Internal fast load socket server time out
Internal fast load server socket time out
When running an export job using the "internal.fastload" method, the following error may occur: Internal fast load socket server time out
This error occurs because the number of map tasks currently available is less than the number of map tasks specified on the command line with the "-nummappers" parameter.
This error can occur in the following conditions:
(1) There are other map/reduce jobs running concurrently in the Hadoop cluster, so there are not enough resources to allocate the specified number of map tasks to the export job.
(2) The cluster's maximum number of map tasks is smaller than the number of existing map tasks plus the map tasks expected by the export job.
When the above error occurs, try to increase the maximum number of map tasks in the Hadoop cluster, or decrease the number of map tasks for the export job.
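As a rough illustration only (the exact tool class and options are from memory and can differ between TDCH versions, and the connection details are made up), lowering the mapper count for an internal.fastload export looks something like this:
hadoop jar teradata-connector-1.3.4.jar com.teradata.connector.common.tool.ConnectorExportTool \
  -url jdbc:teradata://tdhost/database=mydb \
  -username myuser -password mypass \
  -jobtype hdfs \
  -method internal.fastload \
  -sourcepaths /user/loads/staging/mytable \
  -targettable mytable \
  -nummappers 4
Keep -nummappers at or below the number of map slots the cluster can actually hand out at the same time; otherwise the fastload coordinator waits for mappers that never start and times out.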
There is a good troubleshooting PDF available from Teradata. If you get any of these errors, have a look at that PDF to get them fixed.
Also have a look at the other map-reduce properties if you need to fine-tune them.
ravindra-babu's answer is correct, since the resolution is buried in the PDF docs. The support.teradata.com KB article KB0023556 also offers more details on the why.
Resolution: All mappers should run simultaneously. If they are not running simultaneously, try to reduce the number of mappers in the TDCH job through the -nummappers argument, then resubmit the TDCH job with the changed -nummappers value.
Honestly it's a very confusing error and could be displayed better.

Apache flume rolling out the HDFS files on hourly basis

I'm new to Flume and I am exploring options to roll over my HDFS files on an hourly basis using Flume.
In my project, Apache Flume reads messages from RabbitMQ and writes them to HDFS.
hdfs.rollInterval - closes the file based on a time interval measured from when the file was opened.
A new file is created only when Flume reads a message after the previous file has been closed, so this option does not solve our problem.
hdfs.path = /%y/%m/%d/%H - this option works fine and creates a folder every hour. But the problem is that the new folder is created only when a new message arrives.
For example: messages arrive until 11:59, so the file is open. The messages then stop until 12:30, but the file is still open. After 12:30 a new message comes in; because of the hdfs.path configuration, the previous file is closed and a new file is created in a new folder.
The previous file cannot be used for computation until it is closed.
We need a way to reliably close open files on an hourly basis. I'm wondering if there are any options in Flume for doing that.
hdfs.rollInterval is described as
Number of seconds to wait before rolling current file
So this line should cause each file to stay open for an hour at a time:
hdfs.rollInterval = 3600
And I would additionally ignore file size and event count, so add these as well
hdfs.rollSize = 0
hdfs.rollCount = 0
hdfs.idleTimeout
Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
For example, if you set this property to 180, a file that stops receiving events will be closed after 180 seconds of inactivity.
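Putting those pieces together, the HDFS sink section of the agent configuration might look roughly like this (agent and sink names are placeholders; the path layout mirrors the question):
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /%y/%m/%d/%H
agent1.sinks.hdfsSink.hdfs.rollInterval = 3600
agent1.sinks.hdfsSink.hdfs.rollSize = 0
agent1.sinks.hdfsSink.hdfs.rollCount = 0
agent1.sinks.hdfsSink.hdfs.idleTimeout = 180
With idleTimeout set, a file that stops receiving events just before the top of the hour is closed after 180 seconds of inactivity instead of staying open until the next message arrives.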

Hadoop MapReduce job I/O Exception due to premature EOF from inputStream

I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated the job and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
Five seconds later the log showed: ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads with default value 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
Premature EOF can occur for multiple reasons, one of which is spawning a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file and binds a port to it to write to disk. This puts a limit on the number of files that can be written to on one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node: the threads got killed and the _temporary folder was deleted, leading to a plethora of these exception messages. My guess is that this is not a memory-overshoot issue, nor can it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
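If you still want to try raising the transfer-thread limit mentioned in the question, it is a datanode-side setting in hdfs-site.xml; the value below is only an illustration (doubling the 4096 default is a common starting point), not a known fix for this particular job:
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
The open-file ulimit for the user running the datanode is raised separately in the OS (for example via /etc/security/limits.conf); values well above 1024 are commonly used on Hadoop nodes, but again that is general guidance rather than a confirmed fix here.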

Hibernate - Getting exception : maximum number of processes (550) exceeded

I am using Hibernate 3 along with Spring. My Hibernate configuration is as follows:
hibernate.dialect=org.hibernate.dialect.Oracle8iDialect
hibernate.connection.release_mode=on_close
But after starting the application, even if only one user accesses it, I still get this exception:
ORA-00020: maximum number of processes (550) exceeded
This is the stack trace:
Caused by: java.sql.SQLException: ORA-00020: maximum number of processes (550) exceeded
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:743)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:216)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:799)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1038)
at oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe(T4CPreparedStatement.java:839)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1133)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3285)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3329)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:76)
at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:208)
at org.hibernate.loader.Loader.getResultSet(Loader.java:1953)
at org.hibernate.loader.Loader.doQuery(Loader.java:802)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2037)
I have set the connection pool timeout to 5000. I have also tried to find the cause and learned that the release mode may affect how DB resources are closed, but I couldn't find an exact solution.
Please help..
Thanks in advance..
This is a database error, not an application error, so you need to go to the database to solve it. 550 processes is a lot more than it sounds, so either someone has gone insane or you have a lot of inactive processes running.
The best way to find out is to query the v$session view (or gv$session if you're using RAC) and look at the STATUS column.
Take careful note of where all these sessions are coming from; the OSUSER, TERMINAL and PROGRAM columns will probably be the most useful. It might almost be worth creating a temporary table with this information, as proof and a record for afterwards. Then, after checking that you're not going to break anything, and with your DBAs if you have any, kill all the inactive sessions, either simultaneously or one at a time.
That'll remove the error but if it's occurred once it can occur again, so you need to solve it. Either:
You've got a lot of people using the database.
There is an application or program somewhere that is not closing its sessions after it has finished.
Someone is connecting in the middle of a loop.
Whichever reason it is, you need to track it down and correct it. I'd start with the program or terminal from v$session that has the largest number of sessions.
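A sketch of the sort of query meant here, using the standard v$session columns named above (filter and group however suits your audit):
SELECT osuser, terminal, program, status, COUNT(*) AS session_count
  FROM v$session
 GROUP BY osuser, terminal, program, status
 ORDER BY session_count DESC;
Individual offenders can then be removed with ALTER SYSTEM KILL SESSION 'sid,serial#' once you are sure nothing depends on them.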
