CoGroupByKey always failed on big data (PythonSDK) - debugging

I have about 4000 files (avg ~7MB each) input.
My pipeline always failed on the step CoGroupByKey when the data size reach about 4GB.
I tried to limit only use 300 file then it run just fine.
In case of fail, the logs on GCP dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I digging through all logs in Logs Explorer. Nothing else indicate error other than the above, even my logging.info and try...except code.
Think this relate to the memory of the instances but I didn't digging into that direction. Because it kindna what I don't want to worry about when I am using GCP services.
Thanks.

Related

IBM MQ version 7.5 error AMQ7472: Object %CHLBATCH.706, type scratchpad damaged

We are currently having an issue with an MQ Cluster were a CLUSSDR channel is going into retry as the receiving MQ object is showing as damaged.
Configuration is many QMGR's (STAT00-11) sending messages to the Cluster of 4 QMGR's, 2 FullRepos (HUB01-02 and 2PartialRepos HUB03-04)
Problem is that on the STAT02 QMGR the CLUSSDR channel to HUB01 is in a retry state
with the MQ log error;
AMQ9506: Message receipt confirmation failed.
and on HUB01 the MQ log errors;
AMQ7472: Object %CHLBATCH.706, type scratchpad damaged. (many)
AMQ9999: Channel 'TO_HUB01' to host 'server02 (n.n.n.n)' ended abnormally.
AMQ9588: Program cannot update queue manager object. (single instance)
AMQ9587: Program cannot open queue manager object (many)
I have now stopped the CLUSSDR on STAT02 to HUB01 and there is no longer any log entries, however as the QMGR's have linear logging the log files are not being released on the HUB01 QMGR
this has introduced a new error
AMQ7084: Object syncfile, type syncfile damaged.
which is filling up the disk.
I have so far tried to recover the damaged object, the command used was on the HUB01 QMGR
rcrmqobj -m HUB01 -t channel TO_STAT02
and this returned the result, AMQ7085: Object TO_STAT02, type channel not found., although the following results contradict this;
DIS CLUSQMGR(STAT*) CHANNEL
outputs a list of all the STAT* QMGR's which includes the TO_STAT02 channel
and the channel status
DIS CHS(TO_STAT*) STATUS
shows all the channels in a RUNNING state, including the supposed non-existent TO_STAT02
Anyone had similar issues please, note that this is the second occurrence we have had in the last month to different clusters and last time we had to take the drastic action of rebuilding the QMGR once the disk space was exhausted and the QMGR crashed
rcrmqobj -m HUB01 -t syncfile
is the correct way to rebuild a corrupt syncfile and if using linear logging this will also repair any damaged scratchpad objects. Damaged scratchpad objects should only ever occur through operational or filesystem error, for example if files were deleted or partially restored from backup and so having a large number is something that you should try and identify the root cause.
rcrmqobj -t channel will be able to recover damage to channel object definitions, but it is the synchronization data and its index (syncfile) that is damaged/missing. TO_STAT02 sounds like it is a cluster sender that MQ clustering maintains from information shared within the cluster - you can check on whether a cluster channel has a local channel definition using DEFTYPE on DISPLAY CLUSQMGR.

I have a separate machine for elasticsearch. It is 500GB, but logs are consuming full memory in 24 hours. How do I compress it and free memory?

[2019-08-01T13:20:48,015][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({“type”=>“cluster_block_exception”, “reason”=>“index [metricbea...delete (api)];“})
Your log message is cut off. Is it by any chance actually this one or close to it?
[logstash.outputs.elasticsearch] retrying failed action with response code: 403 ({"type"=>"cluster_block_exception", "reason"=>"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"})
That would mean your disk is full (hit the floodstage watermark, which is at 95% by default). I can't really see anything related to memory in your log message.
To clear the floodstage: Add disk space (or delete old data) and then you will need unlock all affected indices with something like this:
PUT /_all/_settings
{
"index.blocks.read_only_allow_delete": null
}

Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected

Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
We have a scenario where there are multiple hdfs files being written (order of 500-1000 files - at most 10-40 such files written concurrently) -- we don't call close immediately on each file for every write -- but keep writing till end and then call close.
It seems that sometimes we get the above error - and the write fails. We have set hdfs retries to 10 - but that does not seem to help.
We also increased dfs.datanode.handler.count to 200 - that did sometime helped but not always.
a) Would increasing dfs.datanode.handler.count help here? even if 10 are written concurrently..
b) What should be done so that we don't get error at application level -- as such hadoop monitoring page indicates that disks are healthy - but from the warning message, it did seemed that sometimes disks were not available -- org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy
Assuming that above happens only when we find failures to disks -- we also tried setting dfs.client.block.write.replace-datanode-on-failure.enable to false, so that for temporary failures, we don't get errors. But it does not seem to help either.
Any further suggestions here?
In my case this was fixed by opening the firewall port 50010 for the datanodes (on Docker)

Must hacking protobuf jar?

1.My namenode log always prints error log java.io.IOException: Requested data length 113675682 is longer than maximum configured RPC length 67108864. RPC came from 172.16.xxx.xxx
And datanode prints Unsuccessfully sent block report 0x706cd6d00df0effe, containing 1 storage report(s), of which we sent 0. The reports had 9016550 total blocks and used 0 RPC(s). This took 1734 msec to generate and 252 msecs for RPC and NN processing. Got back no commands
2.I set ipc.maximum.data.length to 134217728 and solved the problem,But unfortunately,i find after set length,my hdfs client often can't to write data,but just take a few minutes every time.Then i find the namenode throw a new exception,when client can't write,DatanodeProtocol.blockReport from 172.16.xxx.xxx:43410 Call#30074227 Retry#0
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
like Referring HDFS-5153,it says The NameSystem write lock is held during this time.`
I must hacking protobuf jar and set the limit?
EDIT:
I find a Same question,but no solution

Hibernate - Getting exception : maximum number of processes (550) exceeded

I am using hibernate 3 along with spring.My Hibernate configurations are as under :
hibernate.dialect=org.hibernate.dialect.Oracle8iDialect
hibernate.connection.release_mode=on_close
But after starting application, even if only one user accesses it then also I am getting this exception :
ORA-00020: maximum number of processes (550) exceeded
This is stacktrace:
Caused by: java.sql.SQLException: ORA-00020: maximum number of processes (550) exceeded
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:743)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:216)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:799)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1038)
at oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe(T4CPreparedStatement.java:839)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1133)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3285)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3329)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:76)
at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:208)
at org.hibernate.loader.Loader.getResultSet(Loader.java:1953)
at org.hibernate.loader.Loader.doQuery(Loader.java:802)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2037)
I have kept connection pool time out = 5000. I have also tried to found the cause and got that release mode may affect the mechanism of closing DB resources. But I couldn't find exact solution for that.
Please help..
Thanks in advance..
This is a database error not an application error so you need to go to the database to solve it. 550 processes is a lot more than it sounds so either someone has gone insane or you have a lot of inactive processes running.
The best way to find out is to query the v$session view or Gv$session if you're using a RAC, look at the STATUS column.
Take careful not of where all these sessions are coming from; the OSUSER, TERMINAL and PROGRAM will probably be the most useful. It might almost be worth creating a temporary table with this information - proof and a record afterwards. Then after checking that you're not going to break anything, and with your DBAs if you have any, kill all the inactive sessions simultaneously or one at a time.
That'll remove the error but if it's occurred once it can occur again, so you need to solve it. Either:
You've got a lot of people using the database.
There is an application / program somewhere that is not closing it's
sessions after it's finished.
Someone is connecting in the middle of a loop.
Whichever reason it is you need to track it down and correct it. I'd start with the program or terminal from v$session that had the most number of processes.

Resources