Performance issue on creating new topic - rocketmq

Currently, we have a VM configured with 18 GB of RAM and an 8-core CPU.
We are running the broker and the name server on the same machine.
As of now, we have around 3563 topics, and the name server and broker together consume 13 GB of the 18 GB.
I am facing a latency issue when creating a new topic: creating one takes around 13 to 15 seconds.
I would like to create a topic in a fraction of a second.
What is the reason for this latency?
Quick note: we are looking to create millions of topics in RocketMQ.
We understand that this needs sufficient RAM and cores.
Is RocketMQ able to handle millions of topics?

I have encountered this problem as well. When I create a topic, the command line times out after 3000 ms. Lower versions have this problem, for the following reason:
1. The client sends the new topic config to each broker, then waits for the broker to register all of its topic configs with the namesrv.
2. If you have a large number of topics, this process takes more than 3000 ms and results in a timeout.
In higher versions the process was changed:
1. The client sends the new topic config to each broker; the broker returns success at once and registers all of its topic configs with the namesrv asynchronously.
By the way, although the timeout occurs, the topic is still created successfully; you can verify this with the topicRoute command. If you want to reduce the latency of topic creation, you will need to work out a mechanism that does not register the whole set of topic configs with the namesrv.
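Since a client-side timeout does not necessarily mean the topic was not created, one option is to treat the timeout as "outcome unknown" and then check. The following is only a minimal sketch of that idea: `create` and `exists` are hypothetical callbacks that would wrap whatever admin call and route lookup you use, and none of the names below come from the RocketMQ API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; these names are illustrative, not RocketMQ APIs. */
final class TopicCreateVerifier {

    /**
     * Runs the create call. If it times out, polls the existence check
     * a few times instead of reporting a failure straight away.
     */
    static boolean createAndVerify(Callable<Void> create,
                                   BooleanSupplier exists,
                                   int maxChecks,
                                   long pauseMillis) {
        try {
            create.call();
            return true; // the broker answered in time
        } catch (TimeoutException e) {
            // Outcome unknown: the broker may still have created the topic.
        } catch (Exception e) {
            throw new IllegalStateException("topic creation failed", e);
        }
        for (int i = 0; i < maxChecks; i++) {
            if (exists.getAsBoolean()) {
                return true; // the topic is visible after all
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // still not visible after all checks
    }
}
```

This does not make creation faster; it only avoids reporting a failure for a topic that was in fact created.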

Related

MassTransit consumers didn't acknowledge some messages

I have a question about some strange consumer behaviour.
Recently we had a strange situation in our production environment. Two consumers in two different microservices were stuck on some messages. The first one was holding 20 messages from a RabbitMQ queue and the second one 2 messages, and they weren't processing them. These messages were visible as Unacked in RabbitMQ for two days. They went back to the Ready state only when those two microservices were restarted. At the time the consumers took these messages, the whole system was processing thousands of messages per hour, so basically our Saga and all consumers were working. When these messages went back to the Ready state they were processed within one second, so I don't think the problem is with the messages themselves.
The messages are published by the Saga to an exchange, and besides these two stuck consumers we also have an EventLogger consumer subscribed to all messages; this EventLogger processed these 22 messages normally without any problems (from its own queue). We also have Application Insights connected to the consumers, and there is no record of these two consumers receiving these 22 messages (there is a record of the EventLogger receiving them).
The other day we had the same issue with one message in the test environment.
Recently we updated MassTransit in our project from version 6.2.0 to 7.1.6. Before that we didn't notice any similar issues with consumers, but maybe it's just a coincidence. We also have retry, redelivery, circuit breaker and in-memory outbox mechanisms, but I don't think the problem is with them because the consumer didn't even start to process these 22 messages.
Do you have any suggestions as to what could have happened to these consumers?
Usually when a consumer doesn't even start to consume the message once it has been delivered to MassTransit by RabbitMQ, it could be an issue resolving the consumer from the container, such as a dependency to another backing service (database, log server, file, network connection, device, etc.).
The message remains unacknowledged on the broker because the transport/delivery mechanism to the consumer is waiting for a resource to become available. If there isn't anything in the logs for that time period indicating an issue with a resource, it's hard to know what could have blocked those messages from being consumed. The fact that they were ultimately consumed once the services were restarted seems to indicate the message content itself was fine.
Monitoring the lack of message consumption (and likely an associated queue depth increase) would give an indication that the situation has occurred. If it happens again, I'd increase the logging detail levels to see if the issue occurs again and can then be identified.
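The monitoring idea above (no consumption while the queue still holds messages) is language-agnostic. Here is a minimal sketch in Java, assuming you can feed it the current time and a queue depth from your own metrics; the class and method names are hypothetical and are not part of MassTransit or RabbitMQ.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog; not part of MassTransit or RabbitMQ. */
final class ConsumptionWatchdog {
    private final long maxIdleMillis;
    private final AtomicLong lastConsumedAt;

    ConsumptionWatchdog(long maxIdleMillis, long startMillis) {
        this.maxIdleMillis = maxIdleMillis;
        this.lastConsumedAt = new AtomicLong(startMillis);
    }

    /** Call from the consumer every time a message has been handled. */
    void messageConsumed(long nowMillis) {
        lastConsumedAt.set(nowMillis);
    }

    /**
     * True when messages are waiting but nothing has been consumed
     * for longer than the allowed idle time.
     */
    boolean isStalled(long nowMillis, long queueDepth) {
        return queueDepth > 0 && nowMillis - lastConsumedAt.get() > maxIdleMillis;
    }
}
```

An empty queue is deliberately not reported as a stall, so quiet periods do not raise alerts.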

Kafka cluster on AWS crash

I've been having a recurring issue with a kafka cluster running on AWS EC2 instances.
Description
Kafka cluster version 0.10.1.0
3 brokers cluster
topics have 6 partitions per broker
Instance type is m4.xlarge
Symptoms
The following will happen at random intervals, on random brokers
From the logs here is the information I could gather :
Shrinking Intra-cluster replication on a random broker
(I suppose it could be a temporary network failure but couldn't produce evidence of it)
System starts showing close to no activity at 02:27:20 (note that it's not load related, as it happens at very quiet times)
From there, this Kafka broker doesn't process messages, which is expected IMO as it dropped out of the cluster replication.
Now the real issue appears: the number of connections in CLOSE_WAIT is constantly increasing until it reaches the configured ulimit of the system/process, which ends up crashing the Kafka process.
Now, I've been changing limits to see if Kafka would eventually rejoin the ISR before crashing, but even with a very high limit, Kafka just seems stuck in a weird state and never recovers.
Note that between the time when the faulty broker is on its own and the time it crashes, kafka is listening and kafka producer.
For this single crash, I could see 320 errors like this from the producers :
java.util.concurrent.ExecutionException: org.springframework.kafka.core.KafkaProducerException: Failed to send; nested exception is org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
The configuration being the default one and the use being quite standard, I'm wondering if I missed something.
I put in place a script that checks the number of Kafka file descriptors and restarts the service when it gets abnormally high, which does the trick for now, but I still lose messages when it crashes.
Any help to get to the bottom of this would be appreciated.
Turns out there was a deadlock in the version I was using.
Upgrading fixed the issue.
See ticket about the issue :
https://issues.apache.org/jira/browse/KAFKA-5721
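For anyone who needs a stopgap like the watchdog script mentioned in the question until they can upgrade, here is a rough sketch of the counting part. It assumes Linux `netstat -tan` style output, where the connection state is the sixth column; the class name and the threshold are made up for illustration.

```java
import java.util.List;

/** Hypothetical helper for a watchdog; assumes Linux `netstat -tan` output. */
final class CloseWaitCounter {

    /** Counts TCP connections whose state column is CLOSE_WAIT. */
    static long countCloseWait(List<String> netstatLines) {
        return netstatLines.stream()
                .map(line -> line.trim().split("\\s+"))
                // Expected columns: Proto Recv-Q Send-Q Local Foreign State
                .filter(cols -> cols.length >= 6 && cols[0].startsWith("tcp"))
                .filter(cols -> cols[5].equals("CLOSE_WAIT"))
                .count();
    }

    /** The threshold is something you would tune against your ulimit. */
    static boolean shouldRestart(long closeWaitCount, long threshold) {
        return closeWaitCount >= threshold;
    }
}
```

Restarting only masks the symptom; as noted above, the actual fix was upgrading.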

ActiveMQ non-persistent delivery mode limitations?

I am using ActiveMQ with the following requirements:
To have very fast consumers, as my producers are already very fast
Need to process at least 2K messages per second
No need to process/consume messages again in case of a server crash or other failure; I can trigger the whole process again.
Needs to run on a very ordinary server configuration: 4 GiB RAM
I have configured ActiveMQ as given below:
Using non-persistent delivery mode (vm://localhost)(http://activemq.apache.org/what-is-the-difference-between-persistent-and-non-persistent-delivery.html)
Using Spring Integration to put messages into and fetch them from the queue/channel.
Using max-concurrent-consumers with 10 threads
Assume all other configs are the defaults for ActiveMQ and Spring Integration.
Problems/Questions
I am not sure how ActiveMQ stores messages in non-persistent delivery mode. Is it possible that my process will fail with out-of-memory errors once my queue size exceeds some limit? I am asking because it's very difficult for me to test the whole process, so I need to be aware of the limitations before I trigger it.
If non-persistent delivery mode is not sufficient for the requirements above, are there any performance tuning tips with which I can meet them using persistent delivery mode (tcp://)? I have already tested this mode, but the consumers seem to be very slow. I have also tried DUPS_OK_ACKNOWLEDGE to make my consumers faster with persistent delivery mode, but no luck.
NOTE : I am using latest ActiveMQ version 5.14
"I am not sure how ActiveMQ stores messages in case of non-persistent delivery mode"
ActiveMQ stores messages in memory at first, and it will also swap them to disk (there is a tmp_storage folder in ActiveMQ's data path).
"is it possible that my process will fail with out of memory errors once my queue size exceed some limit"
I have never run into an out-of-memory error in ActiveMQ, even with about one million messages.
You can also guard against it with producer flow control (http://activemq.apache.org/producer-flow-control.html).
It can make the producer block when there are too many unconsumed messages.
As for the performance of persistent delivery, I have no good suggestions either.
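To illustrate what producer flow control means in practice, here is an in-process analogy using a bounded queue from the Java standard library. This is not ActiveMQ's API (the real feature is configured on the broker, see the link above); the class name and the limit are invented for the sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** In-process analogy of producer flow control; not ActiveMQ code. */
final class FlowControlledQueue {
    private final BlockingQueue<String> pending;

    FlowControlledQueue(int limit) {
        this.pending = new ArrayBlockingQueue<>(limit);
    }

    /**
     * Waits up to waitMillis for free space. Returns false if the queue
     * stayed full, which is the point where a producer is held back.
     */
    boolean send(String message, long waitMillis) {
        try {
            return pending.offer(message, waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns the oldest pending message, or null if there is none. */
    String receive() {
        return pending.poll();
    }

    int depth() {
        return pending.size();
    }
}
```

Because the number of pending messages is capped, memory use stays bounded no matter how much faster the producer is than the consumers.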

Azure Queue delayed message

I am seeing some strange behaviour in our production deployment with Azure queue messages:
Some of the messages in the queues appear with a big delay: minutes, and sometimes 10 minutes.
Before you ask about setting delayTimeout when we put a message on the queue: we do not set delayTimeout for these messages, so a message should appear almost immediately after it is placed in the queue.
At those moments we do not have a big load, so my instances have no workload and are able to process messages fast, but the messages just don't appear.
Our service processes millions of messages per month, and we were able to identify 10-50 messages that were processed with a very big delay; because of that we fail our SLA with our customers.
Does anyone have any idea what the reason could be?
How can we overcome it?
Has anyone faced similar issues?
Some general ideas for troubleshooting:
Are you certain that the message was queued up for processing - ie the queue.addmessage operation returned successfully and then you are waiting 10 minutes - meaning you can rule out any client side retry policies etc as being the cause of the problem.
Is there any chance that the time calculation could be subject to some kind of clock skew problem? E.g. if one of the worker roles pulling messages has its clock out of sync with the other worker roles, you could see this.
Is it possible that in the situations where the message is appearing to be delayed that a worker role responsible for pulling the messages is actually failing or crashing. If the client calls GetMessage but does not respond with an appropriate acknowledgement within the time specified by the invisibilityTimeout setting then the message will become visible again as the Queue Service assumes the client did not process the message. You could tell if this was a contributing factor by looking at the dequeue count on these messages that are taking longer. More information can be found here: http://msdn.microsoft.com/en-us/library/dd179474.aspx.
Is it possible that the number of workers you have pulling items from the queue is insufficient at certain times of the day and the delays are simply caused by the queue being populated faster than you can pull messages from the queue.
Have you enabled logging for queues and then looked to see if you can find the specific operations (look at e2elatency and serverlatency).
http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/. You should also enable client logging and try to determine if the client is having connectivity problems and the retry logic is possibly kicking in.
And finally if none of these appear to help can you please send me the server logs (and ideally the client side logs as well) along with your account information (no passwords) to JAHOGG at Microsoft dot com.
Jason
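The invisibility timeout and dequeue count behaviour described above can be shown with a small simulation. This is a toy model written for illustration only, not the Azure Storage client; all names in it are hypothetical and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Toy model of a queue with an invisibility timeout; not the Azure SDK. */
final class VisibilityQueueSim {

    static final class Message {
        final String body;
        int dequeueCount;
        long visibleAt;

        Message(String body, long visibleAt) {
            this.body = body;
            this.visibleAt = visibleAt;
        }
    }

    private final List<Message> messages = new ArrayList<>();
    private final long invisibilityMillis;

    VisibilityQueueSim(long invisibilityMillis) {
        this.invisibilityMillis = invisibilityMillis;
    }

    void add(String body, long now) {
        messages.add(new Message(body, now));
    }

    /** Hands out the first visible message and hides it for the timeout. */
    Optional<Message> get(long now) {
        for (Message m : messages) {
            if (m.visibleAt <= now) {
                m.dequeueCount++;
                m.visibleAt = now + invisibilityMillis;
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }

    /** Acknowledges a message; without this it becomes visible again. */
    void delete(Message m) {
        messages.remove(m);
    }
}
```

If a worker calls `get` and crashes before `delete`, the message only reappears after the timeout and its dequeue count is then greater than 1, which is why a dequeue count above 1 on the delayed messages points to failing workers rather than to the queue.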
Azure Service Bus has a property in the BrokeredMessage class called ScheduledEnqueueTimeUtc; it allows you to set a time for when the message is added to the queue (effectively creating a delay).
Are you sure that your code is not setting this property? That might be the cause of the delay.
You can find more info on this at this url: https://www.amido.com/azure-service-bus-how-to-delay-a-message-being-sent-to-the-queue/
If you are using WebJobs to process messages from the queue, it can be due to WebJobs configuration.
From an MSDN forum post by pranav rastogi:
Starting with 0.4.0-beta, the (WebJobs) SDK implements a random exponential back-off algorithm. As a result of this if there are no messages on the queue, the SDK will back off and start polling less frequently.
The following setting allows you to configure this behavior.
MaxPollingInterval for when a queue remains empty, the longest period of time to wait before checking for a message to. Default is 10min.
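using System;
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        JobHostConfiguration config = new JobHostConfiguration();
        // Cap the back-off so an empty queue is polled at least once a minute.
        config.Queues.MaxPollingInterval = TimeSpan.FromMinutes(1);
        JobHost host = new JobHost(config);
        host.RunAndBlock();
    }
}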
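To see why a message can sit in the queue for minutes under this kind of polling, here is a rough sketch of a random exponential back-off with a cap. It is written in Java for illustration and is not the WebJobs SDK's actual algorithm; the class name and the doubling rule are assumptions.

```java
import java.util.Random;

/** Illustrative random exponential back-off; not the WebJobs SDK's code. */
final class PollingBackoff {
    private final long minMillis;
    private final long maxMillis;
    private final Random random;
    private long ceiling;

    PollingBackoff(long minMillis, long maxMillis, Random random) {
        this.minMillis = minMillis;
        this.maxMillis = maxMillis;
        this.random = random;
        this.ceiling = minMillis;
    }

    /**
     * Call after a poll that found no message. Returns a delay picked at
     * random from [ceiling/2, ceiling), then doubles the ceiling up to the cap.
     */
    long nextDelayAfterEmptyPoll() {
        long half = ceiling / 2;
        long delay = half + (long) (random.nextDouble() * (ceiling - half));
        ceiling = Math.min(maxMillis, ceiling * 2);
        return delay;
    }

    /** Call after a poll that found a message: go back to fast polling. */
    void reset() {
        ceiling = minMillis;
    }
}
```

With a 10 minute cap, a queue that has been empty for a while is checked only every 5 to 10 minutes in this model, so a new message can wait that long before it is picked up. Lowering the cap trades more polling requests for lower latency.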

IBM MQ Message Throttling

We are using IBM MQ and we are facing some serious problems controlling its asynchronous delivery to the recipient. We have some Java listeners configured, and the problem is that we need to control the flow of messages towards the listeners: the messages coming to the server number in the millions, and the server machine doesn't have the capacity to process that many threads at a time. Is there any way to throttle on the IBM MQ side, where we can configure a prefetch limit like Apache ActiveMQ does?
Or is there any other way to achieve this?
Currently we are closing the connection to IBM MQ when some limit X has been reached on the listener, but that doesn't seem to be an efficient way.
Please help us solve this issue.
Generally with message queueing technologies like MQ the point of the queue is that the sender is decoupled from the receiver. If you're having trouble with message volumes then the answer is to let them queue up on the receiver queue and process them as best you can, not to throttle the sender.
The obvious answer is to limit the maximum number of threads that your listeners are allowed to take up. I'm assuming you're using some sort of MQ threadpool? What platform are you using that provides unlimited listener threads?
From your description, it almost sounds like you have some process running that - as soon as it detects a message in the queue - it reads the message, starts up a new thread and goes back and looks at the queue again. This is the WRONG approach.
You should have a defined number of process threads running (start with one and scale up as required, and within limits of your server) which read from the queue themselves. They would each open the queue in shared mode and either get-with-wait or do immediate get with a sleep if you get a MQRC 2033 (no messages in queue).
Hope that helps.
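A minimal sketch of the "fixed number of worker threads" approach, using only the Java standard library. A `BlockingQueue` stands in for the MQ queue here: `poll` with a timeout plays the role of a get-with-wait, and a `null` result plays the role of MQRC 2033. The class name and parameters are hypothetical; a real listener would make the equivalent MQ or JMS calls instead.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in sketch: a BlockingQueue plays the role of the MQ queue. */
final class FixedWorkerPool {

    /** Drains the queue with a fixed number of workers; returns the count processed. */
    static int drain(BlockingQueue<String> queue, int workers, long waitMillis) {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    String msg;
                    // Each worker reads from the queue itself, waiting for work.
                    while ((msg = queue.poll(waitMillis, TimeUnit.MILLISECONDS)) != null) {
                        processed.incrementAndGet(); // real processing would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

In this sketch the workers stop at the first empty wait so that the example terminates; a long-running listener would keep looping. The key property is that concurrency is fixed by `workers`, not by the number of messages waiting.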
If you are running in the application server environment, then the maxPoolDepth property on the activationSpec will define the maximum ServerSessionPool size for the MDB; decreasing this will throttle the number of messages being delivered concurrently.
Of course, if your MDB (or javax.jms.MessageListener in the JSE environment) does nothing but hand the message to something else (or, worse, just spawn an unmanaged Thread and start it) onMessage will spin rapidly and you can still encounter problems. So in that case you need to limit other resources too, e.g. via threadpool configuration.
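One way to limit the hand-off from onMessage, sketched with the standard `ThreadPoolExecutor`: a fixed number of threads, a bounded work queue, and `CallerRunsPolicy` so that when both are full the submitting thread runs the task itself and is slowed down. The class name and sizes are hypothetical, and whether running work on the listener thread is acceptable depends on your container.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch of a bounded hand-off executor with back-pressure. */
final class BoundedHandoff {

    /** Fixed thread count, bounded queue, and the caller runs overflow tasks. */
    static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Stops accepting work and waits for queued tasks to finish. */
    static boolean shutdownAndWait(ExecutorService executor, long timeoutMillis) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because the overflow runs on the submitting thread, onMessage cannot return faster than the pool can absorb work, which keeps the number of in-flight messages bounded.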
Closing the connection to the QM is never an efficient way, as the MQCONN/MQDISC cycle is expensive.
