Kafka does not start on Windows - key not found: \tmp\kafka-logs

I have put some effort into getting Kafka to run on 32-bit Windows (company-issued laptop, certainly not my choice).
I was able to create a handful of topics, but after stopping and restarting Kafka it is unable to re-read them. Here are the startup logs:
[2014-05-29 12:26:23,097] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions [vip_ips_alerts,0],[calls,0],[dropped_calls,0],[calls_online,0],[calls_no_phone,0] (kafka.server.ReplicaFetcherManager)
[2014-05-29 12:26:23,106] ERROR [KafkaApi-0] error when handling request Name:LeaderAndIsrRequest;Version:0;Controller:0;ControllerEpoch:4;CorrelationId:5;ClientId:id_0-host_null-port_9092;Leaders:id:0,host:S80035683-SC01.mycompany.com,port:9092;PartitionState:(vip_ips_alerts,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:3,ControllerEpoch:4),ReplicationFactor:1),AllReplicas:0),(calls,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:1,ControllerEpoch:4),ReplicationFactor:1),AllReplicas:0),(dropped_calls,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:3,ControllerEpoch:4),ReplicationFactor:1),AllReplicas:0),(calls_online,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:3,ControllerEpoch:4),ReplicationFactor:1),AllReplicas:0),(calls_no_phone,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:3,ControllerEpoch:4),ReplicationFactor:1),AllReplicas:0) (kafka.server.KafkaApis)
java.util.NoSuchElementException: key not found: \tmp\kafka-logs
at scala.collection.MapLike$class.default(MapLike.scala:225)
at scala.collection.immutable.Map$Map1.default(Map.scala:107)
at scala.collection.MapLike$class.apply(MapLike.scala:135)
at scala.collection.immutable.Map$Map1.apply(Map.scala:107)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:91)
at kafka.cluster.Partition$$anonfun$makeLeader$2.apply(Partition.scala:175)
at kafka.cluster.Partition$$anonfun$makeLeader$2.apply(Partition.scala:175)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:86)
at kafka.cluster.Partition.makeLeader(Partition.scala:175)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$5.apply(ReplicaManager.scala:305)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$5.apply(ReplicaManager.scala:304)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
at kafka.server.ReplicaManager.makeLeaders(ReplicaManager.scala:304)
at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:258)
at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:100)
at kafka.server.KafkaApis.handle(KafkaApis.scala:72)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
at java.lang.Thread.run(Thread.java:744)
Now I am OK to drop and re-create the topics; I have actually done so several times as part of my investigation (e.g. to ensure there is no ZooKeeper corruption). Any tips on how to get a Kafka server up on this out-of-date OS would be appreciated.

Misinterpretation of log.dir is a huge source of pain in Kafka, on both Unix and Windows.
It seems that the exception was caused by the following statement in Partition:
replicaManager.highWatermarkCheckpoints(log.dir.getParent)
It tries to look up the key "\kafka8-tmp\kafka-logs" in a map of highWatermarkCheckpoint files, but it doesn't exist. The keys are registered using the property value in log.dirs.
source
Make sure you don't have trailing slashes and that new java.io.File("\tmp\kafka-logs").getParent is not distorted (I don't have a Windows machine next to me to work out all the forward/backward slash behaviour myself).
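A quick way to see what key the broker will look for on your machine is to print what java.io.File does with your configured path. A rough sketch (the sample paths are only illustrative, use whatever you actually have in log.dirs; the doubled backslashes are just escapes in the string literals):

import java.io.File

object LogDirCheck {
  def main(args: Array[String]): Unit = {
    // Candidate values you might have in log.dirs; replace with your real setting.
    val candidates = Seq("\\tmp\\kafka-logs", "/tmp/kafka-logs", "C:\\tmp\\kafka-logs\\")
    candidates.foreach { p =>
      val f = new File(p)
      // Kafka looks up the checkpoint map with log.dir.getParent, while the map
      // keys come from the raw log.dirs value, so the two forms must match exactly.
      println(s"$p -> parent=${f.getParent}, canonical=${f.getCanonicalPath}")
    }
  }
}

On Windows, an absolute log.dirs value with a drive letter and no trailing slash (for example C:/tmp/kafka-logs) is commonly suggested as a way to keep the two forms consistent.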

Related

IBM MQ version 7.5 error AMQ7472: Object %CHLBATCH.706, type scratchpad damaged

We are currently having an issue with an MQ cluster where a CLUSSDR channel is going into retry because the receiving MQ object is showing as damaged.
The configuration is many QMGRs (STAT00-11) sending messages to a cluster of 4 QMGRs: 2 full repositories (HUB01-02) and 2 partial repositories (HUB03-04).
The problem is that on the STAT02 QMGR the CLUSSDR channel to HUB01 is in a retry state with the MQ log error:
AMQ9506: Message receipt confirmation failed.
and on HUB01 the MQ log errors:
AMQ7472: Object %CHLBATCH.706, type scratchpad damaged. (many)
AMQ9999: Channel 'TO_HUB01' to host 'server02 (n.n.n.n)' ended abnormally.
AMQ9588: Program cannot update queue manager object. (single instance)
AMQ9587: Program cannot open queue manager object (many)
I have now stopped the CLUSSDR on STAT02 to HUB01 and there are no longer any log entries; however, as the QMGRs use linear logging, the log files are not being released on the HUB01 QMGR.
This has introduced a new error:
AMQ7084: Object syncfile, type syncfile damaged.
which is filling up the disk.
So far I have tried to recover the damaged object; the command used on the HUB01 QMGR was
rcrmqobj -m HUB01 -t channel TO_STAT02
and this returned AMQ7085: Object TO_STAT02, type channel not found., although the following results contradict this:
DIS CLUSQMGR(STAT*) CHANNEL
outputs a list of all the STAT* QMGRs, which includes the TO_STAT02 channel,
and the channel status
DIS CHS(TO_STAT*) STATUS
shows all the channels in a RUNNING state, including the supposedly non-existent TO_STAT02.
Has anyone had similar issues? Note that this is the second occurrence we have had in the last month, on different clusters, and last time we had to take the drastic action of rebuilding the QMGR once the disk space was exhausted and the QMGR crashed.
rcrmqobj -m HUB01 -t syncfile
is the correct way to rebuild a corrupt syncfile and, if you are using linear logging, it will also repair any damaged scratchpad objects. Damaged scratchpad objects should only ever occur through operational or filesystem error, for example if files were deleted or partially restored from backup, so if you have a large number of them you should try to identify the root cause.
rcrmqobj -t channel will be able to recover damage to channel object definitions, but here it is the synchronization data and its index (the syncfile) that is damaged/missing. TO_STAT02 sounds like a cluster-sender channel that MQ clustering maintains from information shared within the cluster; you can check whether a cluster channel has a local channel definition using DEFTYPE on DISPLAY CLUSQMGR.
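For example, running something like the following in runmqsc on HUB01 (channel and queue manager names are the ones from the question) will show the DEFTYPE; CLUSSDRA or CLUSSDRB indicates an auto-defined cluster sender, while CLUSSDR indicates an explicit local definition:
DIS CLUSQMGR(STAT02) CHANNEL DEFTYPE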

CoGroupByKey always failed on big data (PythonSDK)

I have about 4,000 input files (~7 MB each on average).
My pipeline always fails at the CoGroupByKey step when the data size reaches about 4 GB.
If I limit the input to only 300 files, it runs just fine.
When it fails, the logs on GCP Dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I dug through all the logs in Logs Explorer. Nothing else indicates an error other than the above, not even output from my logging.info and try...except code.
I think this is related to the memory of the instances, but I haven't dug in that direction, because that is exactly the kind of thing I don't want to have to worry about when using GCP services.
Thanks.

Issues with mkdbfile in a simple "read a file > Create a hashfile job"

Hello DataStage-savvy people.
Two days in a row, the same single DataStage job has failed (not stopping at all).
The job tries to create a hashfile using the command /logiciel/iis9.1/Server/DSEngine/bin/mkdbfile /[path to hashfile]/[name of hashfile] 30 1 4 20 50 80 1628
(last trace in the log)
Something to consider (or maybe not?):
The [name of hashfile] directory exists (and was last modified at the time of execution), but the file D_[name of hashfile] does not.
I am trying to understand what happened so I can prevent the same incident from happening on the next run (tonight).
Before this, the job had been in production for a long time and we had never had an issue with it.
Using DataStage 9.1.0.1.
Did you check the job log to see if it captured an error? When a DataStage job executes any system command via a command execution stage or similar method, the stdout of the called command is captured and then added to a message in the job log. Thus, if the mkdbfile command produces any output (success messages, errors, etc.), it should be captured and logged. The event may not be flagged as an error in the job log, depending on the return code, but the output should be there.
If there is no logged message revealing why the file was not created, a couple of things to check are:
-- Was the target directory on a disk that was possibly out of space at that time? (A quick check is shown after this list.)
-- Do you have any antivirus software that scans directories on your system at certain times? If so, it can interfere with I/O. If it ran a scan at the same time you had the problem, you may wish to update the AV software settings to exclude the directory you were writing the dbfile to.
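For the disk-space check, something as simple as the following, run on the DataStage server, shows the free space on the filesystem holding the hashfile directory (the path placeholder is the one from the question; df options vary slightly by platform):
df -k /[path to hashfile]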

Spark Streaming and ElasticSearch - Could not write all entries

I'm currently writing a Scala application made up of a Producer and a Consumer. The Producer gets some data from an external source and writes it into Kafka. The Consumer reads from Kafka and writes to Elasticsearch.
The Consumer is based on Spark Streaming, and every 5 seconds it fetches new messages from Kafka and writes them to Elasticsearch. The problem is that I'm not able to write to ES because I get a lot of errors like the one below:
[ERROR] [2015-04-24 11:21:14,734] [org.apache.spark.TaskContextImpl]:
Error in TaskCompletionListener
org.elasticsearch.hadoop.EsHadoopException: Could not write all
entries [3/26560] (maybe ES was overloaded?). Bailing out... at
org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:225)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:236)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:125)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply$mcV$sp(EsRDDWriter.scala:33)
~[elasticsearch-spark_2.10-2.1.0.Beta3.jar:2.1.0.Beta3] at
org.apache.spark.TaskContextImpl$$anon$2.onTaskCompletion(TaskContextImpl.scala:57)
~[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[na:na] at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[na:na] at
org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.scheduler.Task.run(Task.scala:58)
[spark-core_2.10-1.2.1.jar:1.2.1] at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
[spark-core_2.10-1.2.1.jar:1.2.1] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_65] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
Consider that the producer is writing 6 messages every 15 seconds, so I really don't understand how this "overload" can possibly happen (I even cleaned the topic and flushed all old messages; I thought it was related to an offset issue). The task executed by Spark Streaming every 5 seconds can be summarized by the following code:
val result = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, Map("wasp.raw" -> 1), StorageLevel.MEMORY_ONLY_SER_2)
val convertedResult = result.map(k => (k._1, AvroToJsonUtil.avroToJson(k._2)))
// TODO: remove the hardcoded resource ("yahoo/yahoo") parameter
log.info(s"*** EXECUTING SPARK STREAMING TASK + ${java.lang.System.currentTimeMillis()}***")
convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs("yahoo/yahoo", Map("es.input.json" -> "true"))
})
If I try to print the messages instead of sending to ES, everything is fine and I actually see only 6 messages. Why can't I write to ES?
For the sake of completeness, I'm using this library to write to ES: elasticsearch-spark_2.10, with the latest beta version.
After many retries, I found a way to write to Elasticsearch without getting any errors. Basically, passing the parameter "es.batch.size.entries" -> "1" to the saveToEs method solved the problem. I don't understand why using the default or any other batch size leads to the aforementioned error, considering that I would expect an error message if I were trying to write more than the allowed maximum batch size, not less.
Moreover, I've noticed that I actually was writing to ES, just not all of my messages; I was losing between 1 and 3 messages per batch.
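For reference, wiring that option into the config map passed to saveToEs looks roughly like this (a sketch based on the snippet in the question; the index name and the other option are the ones already shown there):

convertedResult.foreachRDD(rdd => {
  rdd.map(data => data._2).saveToEs(
    "yahoo/yahoo",
    Map("es.input.json" -> "true", "es.batch.size.entries" -> "1"))
})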
When I pushed a DataFrame to ES from Spark, I had the same error message. Even with the "es.batch.size.entries" -> "1" configuration, I had the same error.
Once I increased the thread pool in ES, I was able to get past this issue.
For example, the bulk pool:
threadpool.bulk.type: fixed
threadpool.bulk.size: 600
threadpool.bulk.queue_size: 30000
As already mentioned here, this is a document write conflict.
Your convertedResult data stream contains multiple records with the same id. When these are written to Elasticsearch as part of the same batch, they produce the error above.
Possible solutions:
Generate a unique id for each record. Depending on your use case, this can be done in a few different ways. As an example, one common solution is to create a new field by combining the id and lastModifiedDate fields and to use that field as the id when writing to Elasticsearch.
Perform de-duplication of records based on id: select only one record with a particular id and discard the other duplicates. Depending on your use case, this could be the most recent record (based on a timestamp field), the most complete (most of the fields contain data), etc.
Solution #1 will store all records that you receive in the stream.
Solution #2 will store only the unique records for a specific id, based on your de-duplication logic. The result would be the same as setting "es.batch.size.entries" -> "1", except you will not limit performance by writing one record at a time.
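A sketch of solution #1 might look like the following. The Event fields and the uid format are hypothetical; es.mapping.id is the elasticsearch-hadoop setting that tells the connector which field to use as the document _id:

import org.apache.spark.SparkContext
import org.elasticsearch.spark._  // adds saveToEs to RDDs

// Hypothetical record type; uid combines id and lastModifiedDate so that
// no two records in the same batch collide on the same document _id.
case class Event(id: String, lastModifiedDate: Long, payload: String)

def indexWithUniqueIds(sc: SparkContext, events: Seq[Event]): Unit = {
  val docs = sc.parallelize(events).map { e =>
    Map(
      "uid"              -> s"${e.id}-${e.lastModifiedDate}",
      "id"               -> e.id,
      "lastModifiedDate" -> e.lastModifiedDate,
      "payload"          -> e.payload)
  }
  // es.mapping.id makes the connector use the uid field as the document _id
  docs.saveToEs("yahoo/yahoo", Map("es.mapping.id" -> "uid"))
}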
One possibility is the cluster/shard status being RED. Please address this issue, which may be due to unassigned replicas. Once the status turned GREEN, the API call succeeded just fine.
This is a document write conflict.
For example:
Multiple documents specify the same _id for Elasticsearch to use.
These documents are located in different partitions.
Spark writes multiple partitions to ES simultaneously.
The result is Elasticsearch receiving multiple updates for a single document at once, from multiple sources / through multiple nodes / containing different data.
"I was losing between 1 and 3 messages per batch."
Fluctuating number of failures when batch size > 1
Success if batch write size "1"
Just adding another potential reason for this error, hopefully it helps someone.
If your Elasticsearch index has child documents then:
If you are using a custom routing field (not _id), then according to the documentation the uniqueness of the documents is not guaranteed. This might cause issues while updating from Spark.
If you are using the standard _id, uniqueness will be preserved; however, you need to make sure the following options are provided while writing from Spark to Elasticsearch:
es.mapping.join
es.mapping.routing
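Passing those options would look roughly like this (a sketch; the index name is the one from the question, while the join and routing field names are hypothetical placeholders for whatever your mapping uses):

import org.apache.spark.rdd.RDD
import org.elasticsearch.spark._

// docs is an RDD of JSON strings whose documents already contain the
// (hypothetical) joinField and routingField values
def saveChildDocs(docs: RDD[String]): Unit =
  docs.saveToEs("yahoo/yahoo", Map(
    "es.input.json" -> "true",
    "es.mapping.join" -> "joinField",
    "es.mapping.routing" -> "routingField"))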

Hibernate - Getting exception : maximum number of processes (550) exceeded

I am using Hibernate 3 along with Spring. My Hibernate configuration is as follows:
hibernate.dialect=org.hibernate.dialect.Oracle8iDialect
hibernate.connection.release_mode=on_close
But after starting the application, even if only one user accesses it, I am getting this exception:
ORA-00020: maximum number of processes (550) exceeded
This is stacktrace:
Caused by: java.sql.SQLException: ORA-00020: maximum number of processes (550) exceeded
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:743)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:216)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:799)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1038)
at oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe(T4CPreparedStatement.java:839)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1133)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3285)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3329)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:76)
at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:208)
at org.hibernate.loader.Loader.getResultSet(Loader.java:1953)
at org.hibernate.loader.Loader.doQuery(Loader.java:802)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.loadEntity(Loader.java:2037)
I have set the connection pool timeout to 5000. I have also tried to find the cause and learned that the release mode may affect the mechanism for closing DB resources, but I couldn't find an exact solution for that.
Please help.
Thanks in advance.
This is a database error, not an application error, so you need to go to the database to solve it. 550 processes is a lot more than it sounds, so either someone has gone insane or you have a lot of inactive processes running.
The best way to find out is to query the v$session view (or gv$session if you're using a RAC) and look at the STATUS column.
Take careful note of where all these sessions are coming from; the OSUSER, TERMINAL and PROGRAM columns will probably be the most useful. It might almost be worth creating a temporary table with this information, as proof and a record afterwards. Then, after checking that you're not going to break anything, and with your DBAs if you have any, kill all the inactive sessions simultaneously or one at a time.
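A query along these lines (plain SQL against v$session; use gv$session on a RAC) gives that breakdown, with a session count per source:

select osuser, terminal, program, status, count(*) as sessions
  from v$session
 group by osuser, terminal, program, status
 order by count(*) desc;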
That'll remove the error but if it's occurred once it can occur again, so you need to solve it. Either:
You've got a lot of people using the database.
There is an application / program somewhere that is not closing its sessions after it has finished.
Someone is connecting in the middle of a loop.
Whichever reason it is, you need to track it down and correct it. I'd start with the program or terminal from v$session that has the largest number of processes.
