Must I hack the protobuf jar? - hadoop

1. My namenode log keeps printing this error: java.io.IOException: Requested data length 113675682 is longer than maximum configured RPC length 67108864. RPC came from 172.16.xxx.xxx
and the datanode prints: Unsuccessfully sent block report 0x706cd6d00df0effe, containing 1 storage report(s), of which we sent 0. The reports had 9016550 total blocks and used 0 RPC(s). This took 1734 msec to generate and 252 msecs for RPC and NN processing. Got back no commands
2. I set ipc.maximum.data.length to 134217728, which solved that problem. Unfortunately, after changing this setting my HDFS client often cannot write data, though each outage only lasts a few minutes. When the client cannot write, the namenode throws a new exception: DatanodeProtocol.blockReport from 172.16.xxx.xxx:43410 Call#30074227 Retry#0
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
Referring to HDFS-5153, it says "The NameSystem write lock is held during this time."
Do I have to hack the protobuf jar and raise its size limit?
EDIT:
I found the same question asked elsewhere, but with no solution.
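For reference, this is roughly how the setting was applied (a minimal sketch, assuming ipc.maximum.data.length belongs in core-site.xml on the namenode, which is restarted afterwards):
<property>
  <name>ipc.maximum.data.length</name>
  <value>134217728</value>
</property>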

Related

IBM MQ version 7.5 error AMQ7472: Object %CHLBATCH.706, type scratchpad damaged

We are currently having an issue with an MQ cluster where a CLUSSDR channel is going into retry because the receiving MQ object is showing as damaged.
The configuration is many QMGRs (STAT00-11) sending messages to a cluster of 4 QMGRs: 2 full repositories (HUB01-02) and 2 partial repositories (HUB03-04).
The problem is that on the STAT02 QMGR the CLUSSDR channel to HUB01 is in a retry state
with the MQ log error:
AMQ9506: Message receipt confirmation failed.
and on HUB01 the MQ log shows these errors:
AMQ7472: Object %CHLBATCH.706, type scratchpad damaged. (many)
AMQ9999: Channel 'TO_HUB01' to host 'server02 (n.n.n.n)' ended abnormally.
AMQ9588: Program cannot update queue manager object. (single instance)
AMQ9587: Program cannot open queue manager object (many)
I have now stopped the CLUSSDR on STAT02 to HUB01 and there are no longer any log entries; however, as the QMGRs use linear logging, the log files are not being released on the HUB01 QMGR.
This has introduced a new error,
AMQ7084: Object syncfile, type syncfile damaged.
which is filling up the disk.
So far I have tried to recover the damaged object; the command used on the HUB01 QMGR was
rcrmqobj -m HUB01 -t channel TO_STAT02
which returned AMQ7085: Object TO_STAT02, type channel not found., although the following results contradict this:
DIS CLUSQMGR(STAT*) CHANNEL
outputs a list of all the STAT* QMGRs, which includes the TO_STAT02 channel,
and the channel status
DIS CHS(TO_STAT*) STATUS
shows all the channels in a RUNNING state, including the supposedly non-existent TO_STAT02.
Has anyone had similar issues? Note that this is the second occurrence we have had in the last month, on different clusters, and last time we had to take the drastic action of rebuilding the QMGR once the disk space was exhausted and the QMGR crashed.
rcrmqobj -m HUB01 -t syncfile
is the correct way to rebuild a corrupt syncfile, and if you are using linear logging this will also repair any damaged scratchpad objects. Damaged scratchpad objects should only ever occur through operational or filesystem error, for example if files were deleted or partially restored from backup, so if you are seeing a large number of them you should try to identify the root cause.
rcrmqobj -t channel will be able to recover damage to channel object definitions, but here it is the synchronization data and its index (syncfile) that is damaged/missing. TO_STAT02 sounds like a cluster sender that MQ clustering maintains automatically from information shared within the cluster - you can check whether a cluster channel has a local channel definition by looking at DEFTYPE on DISPLAY CLUSQMGR.
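As a sketch of that check (run in runmqsc on HUB01; DEFTYPE is a standard DISPLAY CLUSQMGR attribute, and a value of CLUSSDRA would indicate an auto-defined cluster sender with no local channel object for rcrmqobj to recover):
DIS CLUSQMGR(STAT02) CHANNEL DEFTYPE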

ActiveMQ warning: Frame size of 1 GB larger than max allowed 100 MB

I'm trying to switch from a legacy jms broker to ActiveMQ.
One thing I cannot figure out is a warning in the logs once per hour:
WARN | Transport Connection to: tcp://127.0.0.1:38542 failed: java.io.IOException:
Frame size of 1 GB larger than max allowed 100 MB | ...
It's obviously some scheduled job in ActiveMQ that outputs this warning,
because it comes at the same minute every hour,
regardless of whether any messages are sent or not.
But what exactly does "frame size" mean here?
We are not sending any JMS messages larger than a few kilobytes or so...
I read that you can increase this maxFrameSize on the connector, but that doesn't help either.
When I try setting it to 1 GB (1073741824) or higher:
<transportConnector name="openwire"
uri="tcp://0.0.0.0:61616?maximumConnections=100&amp;wireFormat.maxFrameSize=1073741824"/>
I still see the (now absurd) warning message:
WARN | Transport Connection to: tcp://127.0.0.1:42256 failed: java.io.IOException:
Frame size of 1 GB larger than max allowed 1 GB
What is ActiveMQ actually complaining about?
And how can I fix it?
ActiveMQ 5 only logs this message if someone is sending your broker a message that is encoded to a size larger than the configured limit. Since it happens at a regular interval, I'd look for some external resource that is doing something silly, like telnetting into the broker's OpenWire port to check liveness and sending some garbage string. The broker would not be logging that error unless something was inbound, so you need to start looking for the source of the errant sender.
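One way to hunt for the errant sender (a sketch, assuming a Linux broker host with the ss utility installed; 61616 is the OpenWire port from the question) is to list what is connected to that port when the warning appears and match it against the remote address in the log line:
ss -tnp | grep :61616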

CoGroupByKey always failed on big data (PythonSDK)

I have about 4,000 input files (roughly 7 MB each on average).
My pipeline always fails on the CoGroupByKey step when the data size reaches about 4 GB.
If I limit the input to only 300 files, it runs just fine.
When it fails, the logs on GCP Dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I dug through all the logs in Logs Explorer. Nothing else indicates an error other than the above, not even my logging.info calls or my try...except code.
I think this is related to the memory of the instances, but I haven't dug in that direction, because that is the kind of thing I don't want to have to worry about when I am using GCP services.
Thanks.
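If memory pressure really is the cause, one thing that could be tried (a sketch only; the machine type, disk size, project, and script name below are illustrative placeholders, not values from the question) is running the job on larger workers via the standard Dataflow worker options:
python my_pipeline.py --runner=DataflowRunner --project=my-project --region=us-central1 --machine_type=n1-highmem-8 --disk_size_gb=200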

Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected

Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
We have a scenario where multiple HDFS files are being written (on the order of 500-1000 files, with at most 10-40 written concurrently) -- we don't call close on each file immediately after every write, but keep writing until the end and then call close.
It seems that sometimes we get the above error and the write fails. We have set the HDFS retries to 10, but that does not seem to help.
We also increased dfs.datanode.handler.count to 200 - that sometimes helped, but not always.
a) Would increasing dfs.datanode.handler.count help here, even if only 10 files are written concurrently?
b) What should be done so that we don't get an error at the application level? The Hadoop monitoring page indicates that the disks are healthy, but from the warning message it does seem that the disks were sometimes unavailable -- org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
Assuming the above happens only when there are disk failures, we also tried setting dfs.client.block.write.replace-datanode-on-failure.enable to false, so that we don't get errors for temporary failures. But that does not seem to help either.
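For reference, the two settings above were applied roughly like this (a sketch; both are standard hdfs-site.xml properties, and the values are simply the ones we tried):
<property>
  <name>dfs.datanode.handler.count</name>
  <value>200</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>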
Any further suggestions here?
In my case this was fixed by opening firewall port 50010 for the datanodes (on Docker).
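In a Docker setup that might look like this (a sketch; the container and image names are placeholders, and 50010 is the pre-Hadoop-3 default dfs.datanode.address port):
docker run -d --name datanode -p 50010:50010 my-datanode-image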

Hadoop MapReduce job I/O Exception due to premature EOF from inputStream

I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated the job and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this, which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads, with a default value of 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
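In case it helps to see where the two knobs mentioned above would go (a sketch; 8192 and 65536 are commonly suggested values, not ones I have verified, and the datanode is assumed to run as the hdfs user), in hdfs-site.xml:
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
and in /etc/security/limits.conf (log in again or restart the datanode afterwards):
hdfs - nofile 65536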
Premature EOF can occur for multiple reasons, one of which is the spawning of a huge number of threads to write to disk on one reducer node when using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names, and to accomplish that it spawns one thread per file and binds a port to it to write to the disk. This puts a limit on the number of files that can be written to at one reducer node. I encountered this error when the number of files crossed roughly 12,000 on one reducer node, as the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory overshoot issue, nor could it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem - either by reducing the actual number of files being written, or by increasing the number of reducer nodes.
