I'm running the example from the segmentio/kafka-go package documentation, but it gives me one message at a time.
Is there any way to read all the messages that have accumulated in Kafka in one go and parse them straight into []MyType?
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/segmentio/kafka-go"
)

func main() {
    // to consume messages
    kafkaBrokerUrl := "localhost:9092"
    topic := "test"
    conn, err := kafka.DialLeader(context.Background(), "tcp", kafkaBrokerUrl, topic, 0)
    if err != nil {
        log.Fatal("failed to dial leader:", err)
    }
    batch := conn.ReadBatch(10e3, 1e6) // fetch 10KB min, 1MB max
    b := make([]byte, 10e3)            // 10KB max per message
    for {
        n, err := batch.Read(b)
        if err != nil {
            break
        }
        fmt.Println(string(b[:n]))
    }
    batch.Close()
    conn.Close()
}
Kafka itself doesn't help with batch processing. If what you want is something like a sliding-window mechanism that collects a batch of the stream by time or size, you're better off pairing Apache Kafka with Apache Flink via the Kafka connector: Flink provides sliding windows, HDFS storage, and much better fault tolerance through checkpoints. Unfortunately, the available Flink Go SDK is unstable.
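That said, nothing stops you from draining a batch client-side and unmarshalling into a slice yourself. A minimal sketch with kafka-go, assuming the values are JSON-encoded and MyType is a hypothetical struct matching your payload (needs encoding/json and time imported alongside kafka-go):

type MyType struct {
    Name  string `json:"name"` // hypothetical fields; adjust to your payload
    Value int    `json:"value"`
}

func readAccumulated(conn *kafka.Conn) ([]MyType, error) {
    conn.SetReadDeadline(time.Now().Add(10 * time.Second))
    batch := conn.ReadBatch(1, 1e6) // 1 byte min so the call returns even when little is buffered
    defer batch.Close()

    var out []MyType
    for {
        msg, err := batch.ReadMessage()
        if err != nil {
            break // io.EOF or the read deadline ends the batch
        }
        var m MyType
        if err := json.Unmarshal(msg.Value, &m); err != nil {
            return out, err
        }
        out = append(out, m)
    }
    return out, nil
}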
I was trying to create a simple Kafka producer/consumer pair. The producer works fine, following the examples on Confluent's GitHub page, but I've had trouble implementing the consumer. I use a cloud Kafka broker, CloudKarafka. The consumer.go code is below:
func main() {
    config := &kafka.ConfigMap{
        "metadata.broker.list":            "XXXXXXX", // 3 hosts Cloudkarafka provides to me
        "security.protocol":               "SASL_SSL",
        "sasl.mechanisms":                 "SCRAM-SHA-256",
        "sasl.username":                   "XXXXXXXX", // My username provided by Cloudkarafka
        "sasl.password":                   "XXXXXXXX", // My password provided by Cloudkarafka
        "group.id":                        "cloudkarafka-example",
        "go.events.channel.enable":        true,
        "go.application.rebalance.enable": true,
        "default.topic.config":            kafka.ConfigMap{"auto.offset.reset": "earliest"},
        //"debug": "generic,broker,security",
    }
    topic := "XXXXXX" + "A" // username + "A"
    consumer, err := kafka.NewConsumer(config)
    if err != nil {
        panic(fmt.Sprintf("Failed to create consumer: %s", err))
    }
    topics := []string{topic}
    err = consumer.SubscribeTopics(topics, nil)
    run := true
    for run == true {
        ev := consumer.Poll(0)
        switch e := ev.(type) {
        case *kafka.Message:
            fmt.Printf("%% Message on %s:\n%s\n",
                e.TopicPartition, string(e.Value))
        case kafka.PartitionEOF:
            fmt.Printf("%% Reached %v\n", e)
        case kafka.Error:
            fmt.Fprintf(os.Stderr, "%% Error: %v\n", e)
            run = false
        default:
            fmt.Printf("Ignored %v\n", e)
        }
    }
    consumer.Close()
}
The problem is that even though I produce messages to the same topic, the consumer always lands in the default case and constantly prints "Ignored <nil>". Since I'm a beginner with these topics, any help and suggestions would be appreciated.
PS: I'm on Windows 11. The documentation says "confluent-kafka-go is not supported on Windows", but the code runs; it just stays in the default case. The producer part works fine.
producer.go:
config := &kafka.ConfigMap{
    "metadata.broker.list": "XXXXXXXXXX",
    "security.protocol":    "SASL_SSL",
    "sasl.mechanisms":      "SCRAM-SHA-256",
    "sasl.username":        "XXXXXXXXX",
    "sasl.password":        "XXXXXXXXX",
    "group.id":             "cloudkarafka-example",
    "default.topic.config": kafka.ConfigMap{"auto.offset.reset": "earliest"},
    //"debug": "generic,broker,security",
}
topic := "XXXXX-" + "A"
p, err := kafka.NewProducer(config)
if err != nil {
    fmt.Printf("Failed to create producer: %s\n", err)
    os.Exit(1)
}
fmt.Printf("Created Producer %v\n", p)
deliveryChan := make(chan kafka.Event)
for i := 0; i < 10; i++ {
    value := fmt.Sprintf("[%d] Hello Go!", i+1)
    err = p.Produce(&kafka.Message{TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny}, Value: []byte(value)}, deliveryChan)
    e := <-deliveryChan
    m := e.(*kafka.Message)
    if m.TopicPartition.Error != nil {
        fmt.Printf("Delivery failed: %v\n", m.TopicPartition.Error)
    } else {
        fmt.Printf("Delivered message to topic %s [%d] at offset %v\n",
            *m.TopicPartition.Topic, m.TopicPartition.Partition, m.TopicPartition.Offset)
    }
}
close(deliveryChan)
Poll() will return nil on timeout. Since you are specifying a timeout of 0ms, I suspect that what you are seeing is the behaviour of a consumer with no messages to consume.
In other words, you are asking it to wait 0ms for new messages; there are never any new messages, so the Poll() call immediately returns nil every time, all the time. Without a specific nil case, those nil events are handled by your default case.
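For illustration, a nonzero timeout plus an explicit nil case makes that idle behaviour visible; a minimal sketch reusing the consumer loop from the question:

for run {
    ev := consumer.Poll(100) // wait up to 100ms instead of returning immediately
    if ev == nil {
        continue // poll timed out: no event arrived, nothing to log
    }
    switch e := ev.(type) {
    case *kafka.Message:
        fmt.Printf("%% Message on %s:\n%s\n", e.TopicPartition, string(e.Value))
    case kafka.Error:
        fmt.Fprintf(os.Stderr, "%% Error: %v\n", e)
        run = false
    default:
        fmt.Printf("Ignored %v\n", e)
    }
}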
Are you SURE you are producing messages to the same topic your consumer is subscribed to? As Andrey pointed out in his comment, either your topic IDs are different or you have obfuscated them differently in your consumer vs. producer example code. It may be more helpful to first reproduce your problem with a configuration that does not require obfuscation, to avoid uncertainty on such points.
Are you getting any error from Subscribe()? (Why aren't you checking this?)
How long have you waited to see messages consumed? It can take a few seconds for a broker to accept a new consumer into a group; with a 0ms timeout, you may see lots of "no message" events before you eventually start receiving any waiting messages.
For a minimal, working example, I'd suggest keeping things as simple as possible:
You don't need to configure go.events.channel.enable if you are using Poll() to read messages.
You don't need to configure go.application.rebalance.enable if you aren't interested in, and don't need to modify, initial offsets.
If you aren't interested in events such as PartitionEOF (and you likely aren't), consider using the higher-level consumer.ReadMessage() function rather than Poll(); ReadMessage returns only messages or errors and ignores all other events.
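A minimal sketch of such a loop, assuming the consumer object from the question (per the library's documented behaviour, a timeout surfaces as a kafka.Error with code ErrTimedOut):

for {
    msg, err := consumer.ReadMessage(5 * time.Second)
    if err != nil {
        if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrTimedOut {
            continue // nothing to read yet; keep waiting
        }
        fmt.Fprintf(os.Stderr, "consumer error: %v\n", err)
        break
    }
    fmt.Printf("%% Message on %s:\n%s\n", msg.TopicPartition, string(msg.Value))
}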
I've never used Kafka before. I have two test Go programs accessing a local Kafka instance: a reader and a writer. I'm trying to tweak my producer, consumer, and Kafka server settings to get a particular behavior.
My writer:
package main

import (
    "fmt"
    "math/rand"
    "strconv"
    "time"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    rand.Seed(time.Now().UnixNano())
    topics := []string{
        "policymanager-100",
        "policymanager-200",
        "policymanager-300",
    }
    progress := make(map[string]int)
    for _, t := range topics {
        progress[t] = 0
    }
    producer, err := kafka.NewProducer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost",
        "group.id":          "0",
    })
    if err != nil {
        panic(err)
    }
    defer producer.Close()
    fmt.Println("producing messages...")
    for i := 0; i < 30; i++ {
        index := rand.Intn(len(topics))
        topic := topics[index]
        num := progress[topic]
        num++
        fmt.Printf("%s => %d\n", topic, num)
        msg := &kafka.Message{
            Value: []byte(strconv.Itoa(num)),
            TopicPartition: kafka.TopicPartition{
                Topic: &topic,
            },
        }
        err = producer.Produce(msg, nil)
        if err != nil {
            panic(err)
        }
        progress[topic] = num
        time.Sleep(time.Millisecond * 100)
    }
    fmt.Println("DONE")
}
Three topics exist on my local Kafka: policymanager-100, policymanager-200, policymanager-300. Each has only 1 partition, to ensure all messages are ordered by the time Kafka receives them. My writer randomly picks one of those topics and issues a message consisting of a number that increments solely for that topic. When it's done running, I expect the queues to look something like this (topic names shortened for legibility):
100: 1 2 3 4 5 6 7 8 9 10 11
200: 1 2 3 4 5 6 7
300: 1 2 3 4 5 6 7 8 9 10 11 12
So far so good. I'm trying to configure things so that any number of consumers can be spun up and consume these messages in order. By "in-order" I mean that no consumer should get message 2 for topic 100 until message 1 is COMPLETED (not just started). If message 1 for topic 100 is being worked on, consumers are free to consume from other topics that currently don't have a message being processed. If a message from a topic has been sent to a consumer, that entire topic should become "locked" until either a timeout assumes the consumer failed or the consumer commits the message; then the topic is "unlocked" and its next message is made available to be consumed.
My reader:
package main

import (
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
    count := 2
    for i := 0; i < count; i++ {
        go consumer(i + 1)
    }
    fmt.Println("consuming...")
    // hold this thread open indefinitely
    select {}
}

func consumer(id int) {
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers":  "localhost",
        "group.id":           "0", // strconv.Itoa(id),
        "enable.auto.commit": "false",
    })
    if err != nil {
        panic(err)
    }
    c.SubscribeTopics([]string{`^policymanager-.+$`}, nil)
    for {
        msg, err := c.ReadMessage(-1)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d) Message on %s: %s\n", id, msg.TopicPartition, string(msg.Value))
        time.Sleep(time.Second)
        _, err = c.CommitMessage(msg)
        if err != nil {
            fmt.Printf("ERROR committing: %+v\n", err)
        }
    }
}
From my current understanding, the way I'm likely to achieve this is by setting up my consumer properly. I've tried many different variations of this program: having all my goroutines share the same consumer, and giving each goroutine a different group.id. None of these was the right configuration to get the behavior I'm after.
What the posted code does is empty out one topic at a time. Despite having multiple goroutines, the process reads all of 100, then moves to 200, then 300, and only one goroutine actually does the reading. When I give each goroutine a different group.id, messages get read by multiple goroutines, which I would like to prevent.
My example consumer simply breaks things up with goroutines, but when I work this project into my use case at work, I'll need it to run across multiple Kubernetes instances that won't be talking to each other, so anything that coordinates between goroutines will stop working as soon as there are 2 instances on 2 pods. That's why I'm hoping to make Kafka do the gatekeeping I want.
Generally speaking, you cannot. Even if you had a single consumer that consumed all the partitions for the topic, the partitions would be consumed in a non-deterministic order and your total ordering across all partitions would not be guaranteed.
Try keyed messages; I think you may find them a good fit for your use case.
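For illustration, a key is just a field on the message in confluent-kafka-go: all messages that share a key hash to the same partition and therefore keep their relative order. A sketch reusing producer and num from the writer above, with a hypothetical single topic in place of the three:

// Messages that share a key land in the same partition, so Kafka
// preserves their order relative to each other.
topic := "policymanager" // hypothetical combined topic
msg := &kafka.Message{
    TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
    Key:            []byte("policymanager-100"), // one key per logical stream
    Value:          []byte(strconv.Itoa(num)),
}
if err := producer.Produce(msg, nil); err != nil {
    panic(err)
}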
I am using segmentio/kafka-go to connect to Kafka.
// to produce messages
topic := "my-topic"
partition := 0
conn, _ := kafka.DialLeader(context.Background(), "tcp", "localhost:9092", topic, partition)
conn.SetWriteDeadline(time.Now().Add(10*time.Second))
conn.WriteMessages(
    kafka.Message{Value: []byte("one!")},
    kafka.Message{Value: []byte("two!")},
    kafka.Message{Value: []byte("three!")},
)
conn.Close()
I am able to produce into my Kafka server using this code.
// to consume messages
topic := "my-topic"
partition := 0
conn, _ := kafka.DialLeader(context.Background(), "tcp", "localhost:9092", topic, partition)
conn.SetReadDeadline(time.Now().Add(10 * time.Second))
batch := conn.ReadBatch(10e3, 1e6) // fetch 10KB min, 1MB max
b := make([]byte, 10e3)            // 10KB max per message
for {
    n, err := batch.Read(b)
    if err != nil {
        // err -> "invalid codec"
        break
    }
    fmt.Println(string(b[:n]))
}
batch.Close()
conn.Close()
But I am unable to consume using the above code; I get the error invalid codec. What could be the reason?
In case relevant, I tweaked the minimum batch size to 1 so that it tries to consume something.
Just a guess: try adding an import to load compression codecs, in case your topics use compression.
import _ "github.com/segmentio/kafka-go/snappy"
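In context, the blank import sits alongside the consumer's other imports; its init function registers the codec with kafka-go (snappy shown here; whether other codecs live in analogous subpackages depends on your kafka-go version):

import (
    "context"
    "fmt"
    "time"

    "github.com/segmentio/kafka-go"
    _ "github.com/segmentio/kafka-go/snappy" // registers the snappy codec on import
)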
I'm trying to find a good method to consume asynchronously from an input queue, process the content using several workers, and then publish to an output queue. So far I've tried a number of examples, most recently using the code from here and here as inspiration.
However, my current code doesn't appear to be doing what it should: increasing the number of workers doesn't increase performance (messages/s consumed or published), and the number of goroutines remains fairly static while running.
main:
func main() {
    maxWorkers := 10
    // channel for jobs
    in := make(chan []byte)
    out := make(chan []byte)
    // start workers
    wg := &sync.WaitGroup{}
    wg.Add(maxWorkers)
    for i := 1; i <= maxWorkers; i++ {
        log.Println(i)
        defer wg.Done()
        go processor(in, out)
    }
    // add jobs
    go collector(in)
    go sender(out)
    // wait for workers to complete
    wg.Wait()
}
The collector is basically the example from the RabbitMQ site with a goroutine that collects messages from the queue and places them on the 'in' channel:
forever := make(chan bool)
go func() {
    for d := range msgs {
        in <- d.Body
        d.Ack(false)
    }
}()
log.Printf("[*] Waiting for messages. To exit press CTRL+C")
<-forever
The processor receives an 'in' and 'out' channel, unmarshals JSON, performs a series of regexes and then places the output into the 'out' channel:
func processor(in chan []byte, out chan []byte) {
    var (
    // list of regexes declared here
    )
    for {
        body := <-in
        jsonIn := &Data{}
        err := json.Unmarshal(body, jsonIn)
        if err != nil {
            log.Fatalln("Failed to decode:", err)
        }
        content := jsonIn.Content
        // process regexes using:
        // jsonIn.a = r1.FindAllString(content, -1)
        _ = content // content is consumed by the regex lines above
        jsonOut, _ := json.Marshal(jsonIn)
        out <- jsonOut
    }
}
And finally the sender is simply the code from the RabbitMQ site, setting up a connection, reading from the 'out' channel and then publishing to a RMQ queue:
for {
    jsonOut := <-out
    err = ch.Publish(
        "",     // exchange
        q.Name, // routing key
        false,  // mandatory
        false,  // immediate
        amqp.Publishing{
            DeliveryMode: amqp.Persistent,
            ContentType:  "text/json",
            Body:         []byte(jsonOut),
        })
    failOnError(err, "Failed to publish a message")
}
This is a pattern that I'll be using quite a lot, so I'm spending a lot of time trying to find something that works correctly (and performs well). Any advice or help would be appreciated (and in case it isn't obvious, I'm new to Go).
There are a couple of things that jump out:
Done within main function
wg.Add(maxWorkers)
for i := 1; i <= maxWorkers; i++ {
    log.Println(i)
    defer wg.Done()
    go processor(in, out)
}
The defer here is executed when main returns so it's not actually indicating when processing is complete. I don't think this'll have an effect on the performance profile of your program though.
To address this, you could pass wg *sync.WaitGroup into your processor so the processor itself can indicate when it's done.
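A sketch of that fix, keeping the names from the original (collector and sender elided; this shows the structural change, not a drop-in program):

package main

import (
    "log"
    "sync"
)

// processor receives the WaitGroup so the worker itself signals completion;
// ranging over in lets it exit cleanly when the channel is closed.
func processor(wg *sync.WaitGroup, in <-chan []byte, out chan<- []byte) {
    defer wg.Done()
    for body := range in {
        // ... unmarshal, run regexes, marshal (elided) ...
        out <- body
    }
}

func main() {
    maxWorkers := 10
    in := make(chan []byte)
    out := make(chan []byte)

    wg := &sync.WaitGroup{}
    wg.Add(maxWorkers)
    for i := 1; i <= maxWorkers; i++ {
        go processor(wg, in, out)
    }

    // ... start collector(in) and sender(out) as before ...
    log.Println("workers started")
    wg.Wait() // returns once in is closed and every worker drains
}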
CPU Bound Processing
Parsing messages and running regexes is a CPU-intensive workload. How many cores does your machine have? How is throughput affected if you run your program on two separate machines; does it double? What if you double your number of cores? What about running your program with 1 processor worker vs. 2; does that double throughput? Are you maxing out your local RabbitMQ instance? Is it the bottleneck?
Setting up benchmarking and load-testing harnesses should let you run experiments to see where your bottlenecks are :)
For queue-based services it's pretty easy to set up a test harness that fills RabbitMQ with a set backlog and benchmarks how fast you can process those messages, or a load generator that sends x messages/second to RabbitMQ while you observe whether you can keep up.
Does RabbitMQ give you good visibility into message-processing throughput? If not, I frequently add a counter to the Go code and log the overall averaged throughput on an interval to get a rough idea of performance:
start := time.Now()
updateInterval := time.Tick(1 * time.Second)
numIn := 0
for {
    select {
    case <-updateInterval:
        log.Printf("IN - Count: %d", numIn)
        log.Printf("IN - Throughput: %.0f events/second",
            float64(numIn)/time.Since(start).Seconds())
    case d := <-msgs:
        numIn++
        in <- d.Body
        d.Ack(false)
    }
}
So I am trying to use Kafka for my application, which has a producer logging actions into the Kafka MQ and a consumer that reads them back off the MQ. Since my application is in Go, I am using Shopify's Sarama to make this possible.
Right now, I'm able to read off the MQ and print the message contents using fmt.Printf. However, I would really like the error handling to be better than console printing, and I am willing to go the extra mile.
Code right now for consumer connection:
mqCfg := sarama.NewConfig()
master, err := sarama.NewConsumer([]string{brokerConnect}, mqCfg)
if err != nil {
    panic(err) // Don't want to panic when an error occurs; handle it instead
}
And the processing of messages:
go func() {
    defer wg.Done()
    for message := range consumer.Messages() {
        var msgContent Message
        _ = json.Unmarshal(message.Value, &msgContent)
        fmt.Printf("Reading message of type %s with id : %d\n", msgContent.Type, msgContent.ContentId) // Don't want to print it
    }
}()
My questions (I am new to testing Kafka and new to kafka in general):
Where could errors occur in the above program, so that I can handle them? Any sample code would be great for me to start with. The error conditions I can think of are when msgContent doesn't actually contain the Type or ContentId fields in the JSON.
In Kafka, are there situations when the consumer is trying to read at the current offset but, for some reason, is unable to (even when the JSON is well formed)? Is it possible for my consumer to backtrack, say, x steps before the failed offset and reprocess the offsets from there? Or is there a better way to do this? Again, what could these situations be?
I'm open to reading and trying things.
Regarding 1): check where I log error messages below; this is more or less what I would do.
Regarding 2): I don't know about trying to walk backwards in a topic. It's very much possible by just creating a consumer over and over with its starting offset minus one each time, but I wouldn't advise it, as you'll most likely end up replaying the same message over and over. I do advise saving your offset often so you can recover if things go south.
Below is code that I believe addresses most of your questions. I haven't tried compiling it, and the Sarama API has been changing lately, so the current API may differ a bit.
func StartKafkaReader(wg *sync.WaitGroup, lastgoodoff int64, out chan<- *Message) error {
    wg.Add(1)
    go func() {
        defer wg.Done()
        // Track the last known good offset we processed, which is
        // updated after each successfully processed event.
        saveprogress := func(off int64) {
            // Save the offset somewhere... a file...
            // I've also used Kafka to store progress,
            // using a special topic as a WAL.
        }
        // Wrap in a closure so the current value of lastgoodoff is saved,
        // not the value it had when the defer was registered.
        defer func() { saveprogress(lastgoodoff) }()
        client, err := sarama.NewClient("clientId", brokers, sarama.NewClientConfig())
        if err != nil {
            log.Error(err)
            return
        }
        defer client.Close()
        consumerConfig := sarama.NewConsumerConfig()
        consumerConfig.OffsetMethod = sarama.OffsetMethodManual
        consumerConfig.OffsetValue = lastgoodoff
        consumer, err := sarama.NewConsumer(client, topic, partition, "consumerId", consumerConfig)
        if err != nil {
            log.Error(err)
            return
        }
        defer consumer.Close()
        for {
            select {
            case event := <-consumer.Events():
                if event.Err != nil {
                    log.Error(event.Err)
                    return
                }
                msgContent := &Message{}
                err = json.Unmarshal(event.Value, msgContent)
                if err != nil {
                    log.Error(err)
                    continue // continue to skip this message, or return to stop without updating the offset
                }
                // Send the message on to be processed.
                out <- msgContent
                lastgoodoff = event.Offset
            }
        }
    }()
    return nil
}