I am new to Golang and Kafka, and I am using segmentio kafka-go to connect to a Kafka server from Go. For now I want to push every user event to Kafka individually, so I want to push a single message at a time (not a batch), but since the write operation provided by this library takes the same time for a batch as for a single message, it is taking a lot of time. Is there any way to write a single message fast so that I can push a million events to Kafka in less time?
I have tested it with a single message and with a batch of messages, and both take the same time (the minimum was 10 ms).
I think your problem is just the WriterConfig.
For example, if your config looks like the example in the segmentio/kafka-go docs:
w := kafka.NewWriter(kafka.WriterConfig{
    Brokers:  []string{"localhost:9092"},
    Topic:    "topic-A",
    Balancer: &kafka.LeastBytes{},
})
You could try setting batch size and batch timeout:
w := kafka.NewWriter(kafka.WriterConfig{
    Brokers:      []string{"localhost:9092"},
    Topic:        "topic-A",
    Balancer:     &kafka.LeastBytes{},
    BatchSize:    1,
    BatchTimeout: 10 * time.Millisecond,
})
This happens because, by default, kafka-go waits up to 1 second for a batch to reach its maximum size, which defaults to 100 messages, as we can see in the code.
Hope it helps you.
Update: Be aware that sending the messages one by one slows the process down.
For example: sending 100 messages in one batch took 0.0107 s on my computer; sending the same 100 messages one by one took 0.0244 s.
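For reference, a minimal sketch of both calls against the writer configured above (the keys and payloads are placeholder data, and it assumes "context", "log", and the kafka-go package are imported):

// One message per WriteMessages call: each call blocks for a full produce round trip.
err := w.WriteMessages(context.Background(),
    kafka.Message{Key: []byte("user-1"), Value: []byte("event payload")},
)
if err != nil {
    log.Fatal(err)
}

// Several messages per call: the round trip is amortized over the whole batch.
err = w.WriteMessages(context.Background(),
    kafka.Message{Key: []byte("user-1"), Value: []byte("event 1")},
    kafka.Message{Key: []byte("user-2"), Value: []byte("event 2")},
    kafka.Message{Key: []byte("user-3"), Value: []byte("event 3")},
)
if err != nil {
    log.Fatal(err)
}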
I don't know much about Golang, but the Writer.WriteMessages function provides a synchronous send option.
Writing fast with a synchronous send actually depends on your network round-trip time, i.e. the time taken to put the message into Kafka plus the time taken to get the acknowledgement back from Kafka.
If you are using a sync send, your send will block until the acknowledgement is received.
So, to make it fast, one way is to reduce the acknowledgements (acks). Setting it to 1 means the leader has written the message to its log but has not waited for it to be replicated to the followers; this is faster, but it can cause message loss if the leader goes down before the message is replicated.
If you need durability instead, you can set acks=all and set min.insync.replicas=2 on the topic. The lower the acks value, the faster your send() returns and the faster it can push the next message to Kafka.
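For the kafka-go writer used above, the acks level is exposed as the RequiredAcks field of WriterConfig; a sketch with the acknowledgement reduced to the leader only (the value is illustrative, pick it based on your durability needs):

w := kafka.NewWriter(kafka.WriterConfig{
    Brokers:      []string{"localhost:9092"},
    Topic:        "topic-A",
    Balancer:     &kafka.LeastBytes{},
    BatchSize:    1,
    BatchTimeout: 10 * time.Millisecond,
    RequiredAcks: 1, // 1 = leader only, 0 = no acks, -1 = all in-sync replicas
})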
I'm running a Go service that uses the Paho Go MQTT client to subscribe to a topic. The clients that produce the MQTT messages (also Paho, but on Android devices) log when they produce, and my service logs when it receives. As you can see from this graph, there seems to be a pretty consistent "cap" just below 36,000 messages per day on the receiving side. The graphs follow each other almost perfectly up to the cap, but then the Go service seems to cap out at slightly below 600 messages per minute, which is around 10 messages per second.
Where should I look for the solution to this? I cannot find any setting (options) that could explain this cap.
As per the comments, paho.mqtt.golang defaults to ordered delivery of messages (the MQTT spec provides some guarantees regarding message ordering, and calling handlers in a goroutine may break this). The upshot is that messages are delivered one by one and, if your handler is not keeping up, a queue may form (at QOS 1+ the broker needs to retain messages because it may need to resend them).
Some brokers limit the number of messages queued for a client; for example the max_queued_messages option in Mosquitto defaults to 1000 (this default was lower in Mosquitto 1.X) and, if the queue exceeds the limit, "messages will be silently dropped".
This is what appears to have been happening here; the application was not keeping up with incoming messages so the broker began dropping messages when the queue exceeded a limit.
In many cases using the paho.mqtt.golang option ClientOptions.SetOrderMatters(false) will help; with this option set, the message handler is called in a separate goroutine (so the handler must be thread-safe). Alternatively, start a goroutine within the handler, but note that this approach results in the ACK being sent before the handler completes (which may result in message loss if your application terminates unexpectedly).
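A minimal sketch of that option with the Paho Go client (the broker address, topic, and process() function are placeholders):

// assumes: mqtt "github.com/eclipse/paho.mqtt.golang" is imported
opts := mqtt.NewClientOptions().
    AddBroker("tcp://broker.example.com:1883").
    SetClientID("my-subscriber").
    SetOrderMatters(false) // handlers now run in their own goroutine, so they must be thread-safe

client := mqtt.NewClient(opts)
if token := client.Connect(); token.Wait() && token.Error() != nil {
    panic(token.Error())
}

// The handler may be invoked concurrently for different messages.
client.Subscribe("some/topic", 1, func(c mqtt.Client, m mqtt.Message) {
    process(m.Payload()) // placeholder for the slow per-message work
})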
I am implementing Kafka with Spring Batch. I developed a Spring Boot application, and my Kafka producer is continuously producing messages. I want to process these messages in batches, but when I trigger the job, the job runs continuously. So I decided to add pollTimeout to the KafkaItemReader; this way I'm able to stop my job. But I'm unable to find out how many messages will come into the KafkaItemReader while the job is running: if I set pollTimeout to 1000 ms, how many messages will it read?
A hint would be helpful.
@Bean
KafkaItemReader<String, String> item() {
    return new KafkaItemReaderBuilder<String, String>().partitions(0).consumerProperties(prop).name("reader").saveState(true).topic(name).pollTimeout(Duration.ofMillis(1000)).build();
}
Batch processing is about fixed data sets. If your topic is a continuous stream of events, then a Spring Batch job is not a good choice for you; a streaming solution is more appropriate. Spring Batch expects your ItemReader to return null when the data source is exhausted, but in your case the data source is never exhausted, and that's why your job never finishes.
The timeout property will actually make the reader return null if no messages are received during that period.
The property is a timeout, not a record limit.
You can do some math against max.poll.records and the period of time between starting and stopping the consumer, but it'll only be an estimate, not an exact number, because the poll timeout is only an upper bound on how long each poll waits to fill up to the max poll record count.
If you want to programmatically calculate number of processed messages, I'd suggest grabbing the offset difference or summing the consumed record count.
I am trying to control the number of messages which are consumed by the KStream, and I am not very successful.
I am using:
max.poll.interval.ms=100
and
max.poll.records=20
to get like 200 messages per second.
But it does not seem to work very well, as I still see around 500 messages per second in my statistics.
What else shall I set on the side of the stream consumer?
I am using: max.poll.interval.ms=100 and max.poll.records=20 to get like 200 messages per second.
max.poll.interval.ms and max.poll.records properties do not work this way.
max.poll.interval.ms indicates the maximum delay, in milliseconds, allowed between consumer polls of the topic; if the consumer does not poll again within that interval, it is considered failed.
max.poll.records indicates the maximum number of records the consumer can consume during each consumer poll of the topic.
The interval between each poll is not controlled by the above two properties but by the time taken by your consumer to process the fetched records.
For example, let's say a topic X exists with 1000 records in it, and the time taken by the consumer to process the fetched records is 20 ms. With max.poll.interval.ms=100 and max.poll.records=20, the consumer will poll the Kafka topic roughly every 20 ms, and in every poll a maximum of 20 records will be fetched. If the time taken to process the fetched records exceeds max.poll.interval.ms, the poll is considered failed, the consumer is removed from the group, and that particular batch will be polled again from the Kafka topic.
A KafkaConsumer (including the one that is used internally by Kafka Streams) reads records as fast as possible.
The parameters you mention can have an impact on performance, but you cannot control the actual data rate with them. Also note that max.poll.records only configures how many records poll() returns, but it has no impact on client-broker communication. A KafkaConsumer can fetch more records when talking to the broker and then return buffered messages on poll() as long as records are in the buffer (i.e., in this case poll() is a client-side operation that only ensures you don't time out via max.poll.interval.ms). Thus, you might be more interested in fetch.max.bytes, which determines the number of bytes fetched from the broker in a single request. If you reduce this parameter, the consumer becomes less efficient and thus throughput should decrease (it's not recommended though).
Another way to configure throughput is quotas (https://kafka.apache.org/documentation/#design_quotas). It's a broker-side configuration that allows you to limit the amount of data a client can read and/or write.
The best thing to do in Kafka Streams (and also when using a plain KafkaConsumer) is to throttle calls to poll() manually. For Kafka Streams, you can add a Thread.sleep() into any UDF. If you don't want to piggyback this onto an existing operator, you can add a foreach() with ephemeral state (i.e., a class member variable) to track the throughput and compute how long you need to sleep to throttle the throughput accordingly.
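The Kafka Streams version of this would be Java; purely as an illustration of the same manual-throttling idea in Go, using the segmentio/kafka-go reader from earlier in this thread (the target of 200 messages per second, the group id, and the handle() function are assumptions):

// assumes: "context", "time", and "github.com/segmentio/kafka-go" are imported
r := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"localhost:9092"},
    Topic:   "topic-A",
    GroupID: "throttled-group",
})
defer r.Close()

interval := time.Second / 200 // target ~200 messages per second (assumption)
for {
    start := time.Now()
    m, err := r.ReadMessage(context.Background())
    if err != nil {
        break
    }
    handle(m) // placeholder for your actual processing
    if d := interval - time.Since(start); d > 0 {
        time.Sleep(d) // sleep away the remainder of the per-message time budget
    }
}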
You can use something like akka-stream-kafka (aka reactive-kafka) on the consumer side. akka-streams has nice throttling capabilities which will come in handy here:
http://doc.akka.io/docs/akka/snapshot/java/stream/stream-quickstart.html#time-based-processing
In Kafka there is the new concept of Kafka quotas.
All the details are here: Kafka -> 4.9 Quotas.
I want to read messages from a JMS MQ or an in-memory message store based on a count.
For example, I want to start reading the messages only when the message count reaches 10; until then I want the message processor to be idle.
I want to do this using WSO2 ESB.
Can someone please help me?
Thanks.
I'm not familiar with wso2, but from an MQ perspective, the way to do this would be to trigger the application to run once there are 10 messages on the queue. There are trigger settings for this, specifically TRIGTYPE(DEPTH).
To expand on Morag's answer, I doubt that WSO2 has built-in triggers that would monitor the queue for depth before reading messages. I suspect it just listens on a queue and processes messages as they arrive. I also doubt that you can use MQ's triggering mechanism to directly execute the flow based on depth. So although triggering is a great answer, you need a bit of glue code to make it work.
Conveniently, there's a tutorial that provides almost all the information necessary to do this. Please see Mission:Messaging: Easing administration and debugging with circular queues for details. That article has the scripts necessary to make the Q program work with MQ triggering. You just need to make a couple changes:
Instead of sending a command to Q to delete messages, send a command to move them.
Ditch the math that calculates how many messages to delete and either move them in batches of 10, or else move all messages until the queue drains. In the latter case, make sure to tell Q to wait for any stragglers.
Here's what it looks like when completed: the incoming messages land on some queue other than the WSO2 input queue. That queue is triggered based on depth so that the Q program (SupportPac MA01) copies the messages to the real WSO2 input queue. After the messages are copied, the glue code resets the trigger. This continues until there are fewer than 10 messages on the queue, at which point the cycle idles.
I got it working by pushing the messages to a database and fetching them according to the required count; take a look at my answer.
Concerning ActiveMQ: I have a scenario where I have one producer which sends small (around 10KB) files to the consumers. Although the files are small, the consumers need around 10 seconds to analyze them and return the result to the producer. I've researched a lot, but I still cannot find answers to the following questions:
How do I make the broker store the files (completely) in a queue?
Should I use ObjectMessage (because the files are small) or blob messages?
Because the consumers are slow at processing, should I lower their prefetchLimit or use a round-robin dispatch policy? Which one is better?
And finally, in the ActiveMQ FAQ, I read this - "If a consumer receives a message and does not acknowledge it before closing then the message will be redelivered to another consumer.". So my question here is, does ActiveMQ guarantee that only 1 consumer will process the message (and therefore there will be only 1 answer to the producer), or not? When does the consumer acknowledge a message (in the default, automatic acknowledge settings) - when receiving the message and storing it in a session, or when the onMessage handler finishes? And also, because the consumers are so slow in processing, should I change some "timeout limit" so the broker knows how much to wait before giving the work to another consumer (this is kind of related to my previous questions)?
Not sure about others, but here are some thoughts.
First: I am not sure what your exact concern is. ActiveMQ does store messages in a data store; all data need NOT reside in memory in any single place (either broker or client). So you should actually be fine in that regard; earlier versions did require that all message ids fit in memory (not sure if that has been resolved), but even that memory usage would be low enough unless you had tens of millions of in-queue messages.
As to ObjectMessage vs. blob: a raw byte array (blob) should be the most compact representation, but since all of these get serialized for storage, it only affects memory usage on the client. Prefetch mostly helps with access latency, but given that your consumers are slow to process, you probably don't need much prefetching; so yes, either set it to 1 or 2, or disable it altogether.
As to guarantees: the best that distributed message queues can guarantee is either at-least-once (with possible duplicates) or at-most-once (no duplicates, but messages can be lost). It is usually better to take at-least-once and have clients de-duplicate using client-provided ids. How acknowledgement is sent is defined by the JMS specification, so you can read more about JMS; this is not ActiveMQ specific.
And yes, you should set the timeout high enough that a worker can typically finish the work, including all network latencies. This can slow down re-delivery of dropped messages (if a worker dies), but that is probably not a problem for you.