Go Kafka `ProduceChannel()` Fills Up and Hangs - go

I have a server-side app written in Go producing Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then hits a sporadic problem, where all Kafka message sending stops, and the server app needs to be manually restarted for Kafka messages to resume sending.
I've included a screenshot of the metric graphs when the incident started. To annotate what I see happening:
For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to kafkaProducer.Events(). You can see that num queued = num delivered.
10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.
10:52: kafkaProducer.ProduceChannel() is filled up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message again until it is manually restarted.
17:55: I manually restarted the application. Kafka message queueing/delivery resumes, and kafka_produce_attempts drops back to zero.
The one and only place my Go code sends Kafka messages is here:
recordChannelGauge.Inc()
kafkaProducer.ProduceChannel() <- &msg
recordChannelGauge.Dec()
In the metric screenshot, note that recordChannelGauge normally stays at zero because sending the message to the Kafka ProduceChannel() doesn't block, and each Inc() is immediately followed by a matching Dec(). However, when the ProduceChannel() is filled up, the goroutine blocks, recordChannelGauge stays at 1, and it never unblocks until the app is manually restarted.
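One mitigation I'm considering (a sketch only, not the code currently in the app; the helper name and the 5-second limit are arbitrary choices) is to guard the send with a timeout so a full ProduceChannel() shows up as an error instead of a permanently blocked goroutine:

package kafkaproduce

import (
	"fmt"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// produceWithTimeout is an illustrative sketch, not the production code.
// It tries to queue msg on the producer's channel, but gives up after a
// fixed wait instead of blocking the calling goroutine forever.
func produceWithTimeout(p *kafka.Producer, msg *kafka.Message) error {
	select {
	case p.ProduceChannel() <- msg:
		return nil // queued; the delivery report will arrive on p.Events()
	case <-time.After(5 * time.Second):
		// p.Len() reports how many messages/requests librdkafka still has queued.
		return fmt.Errorf("produce channel full for 5s (%d messages queued internally)", p.Len())
	}
}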
FYI, my environment details:
Go server binary built with golang 1.10.x
Latest version of github.com/confluentinc/confluent-kafka-go/kafka. This library doesn't use version tags; it's pinned to the latest git commit, which as of this writing is 2 months old, so I'm sure I'm using the latest version.
Server OS Ubuntu 16.04.5
librdkafka1 version 0.11.6~1confluent5.0.1-
I suspect this is due to some internal problem in the confluentinc go client, where it doesn't handle some error scenario appropriately.
Also, I see no relevant log output around the time of the problem. I do see sporadic Kafka broker disconnect and timeout errors in the logs before the problem happened, but they don't seem serious. These log messages had appeared every few hours or so for days without serious consequence.
Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
(Screenshot: zoomed in to the problem occurrence)
(Screenshot: zoomed out to show before and after)

I have a similar problem to yours, and I found an article that might explain the cause.
When there are no messages on the blocked topic, then after a certain period of time you will get a timeout error like the one below:
%5|1598190018.518|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_ssl://abcd....confluent.cloud:xxxx/2]: sasl_ssl://abcd....confluent.cloud:xxxx/2: Timed out FetchRequest in flight (after 359947ms, timeout #0)
%4|1598190018.840|REQTMOUT|rdkafka#consumer-1| [thrd:sasl_ssl://abcd.confluent.cloud:xxxx/2]: sasl_ssl://abcd.xxxxx.confluent.cloud:xxxx/2: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
Link to the article: https://www.thecodebuzz.com/apache-kafka-net-client-producer-consumer-csharp-confluent-examples-ii/
I hope it helps.

Related

While using Instaloader via command line, how can I force 429 errors to cause requests to be retried after a longer period of time?

I am using Instaloader via command line on Windows 11, with the following command:
.\instaloader --login=MYUSERNAME :saved --dirname-pattern="Saved_Posts\{profile}" --filename-pattern="{profile}-{shortcode}" --no-resume --no-metadata-json --slide 1 --no-captions --no-video-thumbnails --no-iphone
This attempts to download approximately 12,000 saved posts from a profile. Instaloader behaves as expected for several thousand posts, occasionally giving the following error:
Too many queries in the last time. Need to wait 15 seconds, until 13:19.
The process then resumes successfully for several hundred more posts. Eventually, however, I start encountering 429 errors:
JSON Query to graphql/query: 429 Too Many Requests [retrying; skip with ^C]
Number of requests within last 10/11/20/22/30/60 minutes grouped by type:
d6f4427fbe92d846298cf93df0b937d3: 0 0 0 0 0 0
f883d95537fbcd400f466f63d42bd8a1: 0 0 0 1 1 11
* 2b0673e0dc4580674a88d426fe00ea90: 59 64 121 134 191 709
Instagram responded with HTTP error "429 - Too Many Requests". Please
do not run multiple instances of Instaloader in parallel or within
short sequence. Also, do not use any Instagram App while Instaloader
is running.
The request will be retried in 7 seconds, at 14:01.
This error then repeats over and over again, I believe until the default maximum connection attempts limit is reached and it moves on to the next post — which also receives the same error. Importantly, this error does not go away after several hours of these 'slower' requests being made; it seems to persist as long as Instaloader stays open. I have seen these 429 errors with very few requests in the last 60 minutes (i.e. <100), which makes me think I am hitting quite a long shadowban.
I have tried setting the maximum connection attempts to 0 (i.e. retry indefinitely), but this time limit appears to be capped at 666 seconds, or 11 minutes. The error does not seem to clear even leaving Instaloader to send requests every 11 minutes in this way; it is as though each individual request 'resets' the ban or something.
I am looking for a way of resolving this issue, which could include:
Adding a command to force 429 errors to be retried after subsequently longer periods of time (instead of the number of seconds being capped at 666 seconds)
Adding a command that 'preserves' wait times after each 429 error. e.g. if downloading Post 456 fails and retries after 5, then 10, then 15 seconds before successfully downloading, and then downloading Post 457 immediately fails... start the wait for a retry on Post 457 at LEAST 15 seconds, rather than going back to 5! (See the sketch after this list for the behaviour I mean.)
Avoiding the 429 errors in the first place, if there appears to be an issue with my command line prompt
Breaking down the request into 'batches' and running one of those prompts every few days. e.g. is there a way to download Saved Posts 1-500, then 500-1000, and so on? (The Saved Posts are not necessarily in chronological order of the post date, which is what I've tried so far)
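To illustrate the second point above, here is a rough sketch (in Go, purely to describe the behaviour; this is not Instaloader code and all names are made up) of a retry policy whose wait time carries over from one download to the next instead of resetting:

package backoff

import "time"

// Backoff is an illustrative retry policy: the wait time grows on every
// failure and is NOT reset when an individual item finally succeeds, so a
// ban that spans several items keeps being respected.
type Backoff struct {
	wait time.Duration
}

// Next sleeps for the current wait time and then doubles it, up to a cap.
func (b *Backoff) Next() {
	if b.wait < 5*time.Second {
		b.wait = 5 * time.Second
	}
	time.Sleep(b.wait)
	b.wait *= 2
	if b.wait > time.Hour {
		b.wait = time.Hour
	}
}

// downloadAll retries each item with one shared Backoff; fetch stands in for
// downloading a single post and returns false on a 429-style failure.
func downloadAll(items []string, fetch func(string) bool) {
	var b Backoff
	for _, it := range items {
		for !fetch(it) {
			b.Next() // the wait carries over between items; it never snaps back to 5s
		}
	}
}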
I have looked at several other posts on 429 errors but the general theme seems to be either:
Wait some time for the issue to clear — have tried this for up to 48 hours, but running the command again starts from post #1 and never makes it to the latter half of posts
Disable iPhone API requests — already done, which helps but does not solve the issue
The 429 errors simply should not be encountered during normal behaviour – well, they are!

How to list the messages of a nats jetstream and know if they were acknowledged?

I need to list the messages that were posted in a NATS stream to know which ones were not acknowledged.
I have tried looking at the admin API that NATS suggests in its documentation, but it does not specify whether this can be done or not.
I have also looked at the JetStream library for Go; with it I can get general information about the streams and their consumers, but not the unacknowledged messages, and I don't see any functions that give me what I need.
Has anyone already done this, no matter the programming language?
Acknowledgements are tied to a specific consumer, not a stream.
You can derive the state of acknowledgements from the consumer info, specifically the acknowledgement floor:
nats consumer info
State:
  Last Delivered Message: Consumer sequence: 8  Stream sequence: 158  Last delivery: 13m59s ago
  Acknowledgment floor: Consumer sequence: 4  Stream sequence: 154  Last Ack: 13m59s ago
  Outstanding Acks: 2 out of maximum 1,000
  Redelivered Messages: 0
  Unprocessed Messages: 42
  Waiting Pulls: 0 of maximum 512
Which is available in NATS CLI and most client libraries.
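For example, with the Go client (a minimal sketch; the stream name ORDERS and the durable consumer name worker are placeholders for your own names):

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder stream and consumer names.
	ci, err := js.ConsumerInfo("ORDERS", "worker")
	if err != nil {
		log.Fatal(err)
	}

	// Everything up to and including the ack floor has been acknowledged;
	// messages between the ack floor and the last delivered sequence are
	// delivered but not yet acked (NumAckPending of them).
	fmt.Println("last delivered stream seq:", ci.Delivered.Stream)
	fmt.Println("ack floor stream seq:", ci.AckFloor.Stream)
	fmt.Println("outstanding acks:", ci.NumAckPending)
	fmt.Println("unprocessed messages:", ci.NumPending)
}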
There is no way to directly see the list of acknowledged messages.

FB Messenger API - Receiving double requests

I have a working FB Bot built with Ruby which allows players to play a scavenger hunt.
Sometimes though, when I have multiple players in a team, FB is sending me a player's 'Answer' webhook twice. I have looked into it and at first thought it was to do with the 20 second timeout if FB gets no 200 OK response (docs here). After checking the logs though, I am receiving the second webhook from FB only 14 seconds later. See below:
# Webhook #1
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153642358, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
# Webhook #2 (14 seconds later)
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153656901, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
Notice both are exactly the same apart from the first "time" attribute (14 secs later).
Due to a number of methods and calls that I process after receiving the first webhook, the 200 OK response is only being sent back to FB once I have finished sending my messages in response (hence the 14 second delay).
So I have two questions:
Is the 14 second delay too long, and is that why FB is resending? If so, how can I send a 200 OK response straight away (head :ok)?
Is it another issue entirely?
Also ensure that "Echo" is disabled: go to Settings > Webhooks and edit the events.
An asynchronous approach such as NodeJS is recommended. In my case I work with AWS SQS: I have workers that process the requests without blocking (they don't wait), and I return 200 "ok" to FB right away so that FB doesn't send the message to my webhook again.
Another approach might be to store the mid in a database and check on each request whether the mid already exists; if it does, don't process the message. I used DynamoDB (AWS) with TTL enabled, so the database cleans itself every hour, erasing old requests.
I think it is the 15 second wait before replying; this was also happening to me, as Facebook auto-retries when you don't reply fast enough. Te EEe Te's idea is solid: write some mechanism to cache mids and check whether a message is a duplicate before processing it.
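To make the mid-caching idea concrete, here is a rough sketch in Go (the bot in the question is Ruby, so treat this purely as an illustration; all names and the payload parsing are made up): the handler acknowledges the webhook with 200 immediately, hands the slow work to a goroutine, and skips any mid it has already seen recently.

package main

import (
	"net/http"
	"sync"
	"time"
)

// seenMids remembers recently processed message IDs so a redelivered
// webhook carrying the same mid is ignored.
var (
	mu       sync.Mutex
	seenMids = map[string]time.Time{}
)

func alreadySeen(mid string) bool {
	mu.Lock()
	defer mu.Unlock()
	// Drop entries older than an hour (a crude stand-in for a TTL).
	for id, t := range seenMids {
		if time.Since(t) > time.Hour {
			delete(seenMids, id)
		}
	}
	if _, ok := seenMids[mid]; ok {
		return true
	}
	seenMids[mid] = time.Now()
	return false
}

func webhook(w http.ResponseWriter, r *http.Request) {
	mid := extractMid(r) // parsing of the real FB payload omitted
	if !alreadySeen(mid) {
		go handleAnswer(mid) // do the slow work off the request path
	}
	w.WriteHeader(http.StatusOK) // acknowledge immediately so FB does not retry
}

func extractMid(r *http.Request) string { return r.URL.Query().Get("mid") } // placeholder
func handleAnswer(mid string)           {}                                  // placeholder

func main() {
	http.HandleFunc("/webhook", webhook)
	http.ListenAndServe(":8080", nil)
}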

Timeout performing SET {Key}, inst: 0, mgr: Inactive, queue: 2, qu=1, qs=1, qc=0, wr=1/1, in=0/0

I am trying to save a 90 KB PDF file into Azure Redis Cache using the StackExchange.Redis client. I converted the file into a byte array and tried to save it using the StringSet method, and received the error below.
Code:
byte[] bytes = File.ReadAllBytes("ABC.pdf");
cache.StringSet(info.Name, bytes); --> This Line throws exception "Timeout performing SET {Key}, inst: 0, mgr: Inactive, queue: 2, qu=1, qs=1, qc=0, wr=1/1, in=0/0".
Kindly Help.
Timeout performing SET {Key}, inst: 0, mgr: Inactive, queue: 2, qu=1, qs=1, qc=0, wr=1/1, in=0/0
means it has sent one request (qs=1), another request is sitting in the unsent queue (qu=1), and there is nothing to be read from the network (in=0). There is an active writer (wr=1), meaning the unsent request is not being ignored. Basically, one request has been sent and is waiting for its response to come back.
A few questions:
1. Is your client running in the same region as the cache? Running it from your dev box would introduce additional latency and cause timeouts.
2. How often do you get the exception? Does it succeed any time?
3. You can also contact azurecache@microsoft.com with your cache name, the time range (with time zone) in which you see the timeouts, and, if possible, a console app that helps repro the issue.
Hope this helps,
Deepak
Details about the error codes, from this thread: #83
inst: in the last time slice: 0 commands have been issued
mgr: the socket manager is performing "socket.select", which means it is asking the OS to indicate a socket that has something to do; basically: the reader is not actively reading from the network because it doesn't think there is anything to do
queue: there are 73 total in-progress operations
qu: 6 of those are in unsent queue: they have not yet been written to the outbound network
qs: 67 of those have been sent and are awaiting responses from the server
qc: 0 of those have seen replies but have not yet been marked as complete due to waiting on the completion loop
wr: there is an active writer (meaning - those 6 unsent are not being ignored)
in: there are no active readers and zero bytes are available to be read on the NIC

Redis sync fails. Redis copy keys and values works

I have two redis instances, both running on the same machine on win64. The version is the one from https://github.com/MSOpenTech/redis with no amendments, and the binaries are running as downloaded from GitHub (i.e. version 2.6.12).
I would like to create a slave and sync it to the master. I am doing this on the same machine to ensure it works before creating a slave on a WAN located machine which will take around an hour to transfer the data that exists in the primary.
However, I get the following error:
[4100] 15 May 18:54:04.620 * Connecting to MASTER...
[4100] 15 May 18:54:04.620 * MASTER <-> SLAVE sync started
[4100] 15 May 18:54:04.620 * Non blocking connect for SYNC fired the event.
[4100] 15 May 18:54:04.620 * Master replied to PING, replication can continue...
[4100] 15 May 18:54:28.364 * MASTER <-> SLAVE sync: receiving 2147483647 bytes from master
[4100] 15 May 18:55:05.772 * MASTER <-> SLAVE sync: Loading DB in memory
[4100] 15 May 18:55:14.508 # Short read or OOM loading DB. Unrecoverable error, aborting now.
The only way I can sync up is via a mini script, something along the lines of:
import orm.model

if __name__ == "__main__":
    src = orm.model.caching.Redis(**{"host": "source_host", "port": 6379})
    dest = orm.model.caching.Redis(**{"host": "source_host", "port": 7777})
    ks = src.handle.keys()
    for i, k in enumerate(ks):
        if i % 1000 == 0:
            print i, "%2.1f %%" % ((i * 100.0) / len(ks))
        dest.handle.set(k, src.handle.get(k))
where orm.model.caching.* are my middleware cache implementation bits (which for redis is just creating a self.handle instance variable).
Firstly, I am very suspicious of the number in the receiving bytes, as 2147483647 is 2^31-1 .. a very strange coincidence. Secondly, OOM can mean out of memory, yet I can fire up a 2nd process and sync that via the script, but doing this via redis --slaveof fails with what appears to be out of memory. Surely this can't be right?
redis-check-dump does not run as this is the windows implementation.
Unfortunately there is sensitive data in the keys I am syncing so I can't offer it to anybody to investigate. Sorry about that.
I am definitely running the 64 bit version as it states this upon startup in the header.
I don't mind syncing via my mini script and then just enabling slave mode, but I don't think that is possible as the moment slaveof is executed, it drops all known data and resyncs from scratch (and then fails).
Any ideas ??
I have also seen this error earlier, but the latest bits from 2.8.4 seem to have resolved it: https://github.com/MSOpenTech/redis/tree/2.8.4_msopen
