How to paginate Apache Kafka messages - bash

A pagination system needs the total number of records and an offset to start paginating from.
How can I get the total record count and the offset in Kafka from the console?

Kafka doesn't paginate. A topic is a sequential log of events.
However, your consumer group has an initial or stored offset, and on the next poll it will read up to max.poll.records records, which form the next "page" after that offset.
If you want to count the number of records in a non-compacted topic, you can use the GetOffsetShell tool to query the earliest and latest offsets and take the difference. For a compacted topic, there are gaps in those numbers, and the only reasonable way to count records is to consume the entire topic.
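As a minimal sketch from the console, assuming a broker on localhost:9092 and a topic named my-topic (both placeholders; flag names vary slightly between Kafka versions, and newer releases accept --bootstrap-server for GetOffsetShell):

# Latest offset per partition (--time -1 means "latest")
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -1

# Earliest offset per partition (--time -2 means "earliest")
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic my-topic --time -2

# Subtract earliest from latest per partition and sum the results to get the
# record count (only meaningful for non-compacted topics).

# Read one "page" of 50 records starting at offset 100 of partition 0
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --partition 0 --offset 100 --max-messages 50

The console consumer's --offset/--max-messages pair is the closest thing to offset-based pagination from the shell; a consumer group with max.poll.records does the equivalent programmatically.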

Related

Prometheus PromQL is not aggregating counts over a period of time correctly

I'm collecting number of messages being consumed with Prometheus.
For each message that is consumed, I increment Counter once.
The Prometheus server is configured to scrape them every 30s.
I use this PromQL query to get the number of messages consumed per minute:
sum(rate(consumer_count[1m]))
I use sum() because I have 4 workers. What I get is an incorrect number: in my tests I also log a message for each consumption, and counting those logs gives me nearly 20k messages per minute, while Prometheus shows nearly 1k messages per minute (except the first minute, where it shows 20k).
I am also running this query in a Grafana panel, if that helps.

What is a reasonable value for StreamsConfig.COMMIT_INTERVAL_MS_CONFIG for Kafka Streams

I was looking at some Confluent examples for Kafka Streams, and the different values for the configuration 'StreamsConfig.COMMIT_INTERVAL_MS_CONFIG' confused me a little bit.
For example, in the microservices example:
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1); //commit as fast as possible
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/microservices/util/MicroserviceUtils.java
Another one,
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/WordCountLambdaExample.java
Another one,
// Set the commit interval to 500ms so that any changes are flushed frequently and the top five
// charts are updated with low latency.
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/interactivequeries/kafkamusic/KafkaMusicExample.java
In the examples the interval ranges from 1ms to 10000ms. What I am really interested in is the 1ms value: in a system that is under heavy load all the time, could it be dangerous to go with a 1ms commit interval?
Thanks for any answers.
Well, it depends on how frequently you want to commit your records. It actually relates to record caching in memory:
https://kafka.apache.org/21/documentation/streams/developer-guide/memory-mgmt.html#record-caches-in-the-dsl
If you want to see each record as output, you can set it to the lowest value. In some scenarios you want output for every event, and there the lowest value makes sense. But in scenarios where it is okay to consolidate events and produce fewer outputs, you can set it to a higher value.
Also be aware that record caching is affected by these two configurations:
commit.interval.ms and cache.max.bytes.buffering
The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earliest of commit.interval.ms or cache.max.bytes.buffering (cache pressure) hits.

Kinesis triggers lambda with small batch size

I have a Lambda which is configured as a consumer of a Kinesis data stream, with a batch size of 10,000 (the maximum).
The Lambda parses the given records and inserts them into Aurora PostgreSQL (using an INSERT command).
Somehow, I see that the Lambda is invoked most of the time with a relatively small number of records (fewer than 200), although the 'IteratorAge' is constantly high (about 60 seconds). The records are put in the stream with a random partition key (generated as uuid4), and of size
How can that be explained? As I understand it, if the shard is not empty, all the currently available records, up to the configured batch size, should be polled.
I assume that if the Lambda were invoked with bigger batches, this delay could be prevented.
Note: There is also a Kinesis Firehose configured as a consumer (doesn't seem to have any issue).
It turns out that the iterator age of the Kinesis stream was 0ms, so this behavior makes sense.
The iterator age of the Lambda is a bit different:
Measures the age of the last record for each batch of records processed. Age is the difference between the time Lambda received the batch, and the time the last record in the batch was written to the stream.
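For reference, a rough way to compare the two iterator-age metrics from the console, assuming a configured AWS CLI, GNU date, and the placeholder names my-stream / my-function:

# Iterator age reported by the Kinesis stream (GetRecords calls)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=my-stream \
  --statistics Maximum --period 60 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Iterator age reported by the Lambda event source mapping
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=my-function \
  --statistics Maximum --period 60 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Comparing the two shows whether the stream itself is backed up (the Kinesis metric) or only the Lambda-side measurement is high, as described above.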

Measuring Requests per second using statsd

I want to measure requests per second using StatsD.
Currently I am using an increment counter, so whenever a new request comes in, the counter is incremented by 1.
In this case, I end up capturing cumulative data rather than the request count per second.
So what I need is to flush the counter data every second and reset that counter, so that we get data for that second only.
Is it possible to do so in Statsd?
According to the metric types docs, Statsd will send both the rate as well as the count at each flush. Also, according to the code comments, this rate is per second, so it's exactly what you're looking for.
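As a side note, a minimal sketch of sending such a counter from the shell, assuming a StatsD daemon listening on the default UDP port 8125 on localhost and a netcat variant that supports these flags (the metric name "requests" is a placeholder):

# Increment the "requests" counter by 1 via StatsD's plain-text UDP protocol
echo -n "requests:1|c" | nc -u -w0 127.0.0.1 8125

At every flush interval (10 seconds by default) StatsD emits both the accumulated count and the per-second rate for that counter and then resets it, which is the behavior described in the answer above.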

Cassandra deletion best practice

We have real-time data coming into our system, and we have online queries which we need to serve. In order to serve these online queries faster, we are doing some pre-processing of the data.
Now my question is how to preprocess the online real-time data. There should be a way for me to figure out whether the data has already been processed or not. To make this distinction, I have the following approaches:
I can have a flag which says whether the data is processed or unprocessed, based on which I can decide whether to process it
I can have a column family where I insert the data with a TTL, plus a topic in a message bus like Kafka which gives me the row identifier in Cassandra so that I can process that row in Cassandra
I can have a column family per day, plus a topic in a message bus like Kafka which gives me the row identifier in the corresponding column family
I can have a keyspace per day, plus a topic in a message bus like Kafka which gives me the row identifier of the corresponding column family
I read somewhere that as the number of deletions increases, the number of tombstones increases and results in slow query times. Now I am confused about which of the above four approaches to choose, or whether there is a better way to solve this.
According to the DataStax blog, the third option might be a better fit:
Cassandra Anti-patterns
