Rotating Kafka Garbage Collection Logs

I have Kafka with Garbage Collection logging enabled, writing to
/opt/kafka/logs/kafkaServer-gc.log
All of my logs rotate except the Garbage Collection logs.
For example, here is an appender configured in /opt/kafka/config/log4j.properties:
log4j.appender.authorizerAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.authorizerAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.authorizerAppender.File=${kafka.logs.dir}/kafka-authorizer.log
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
How can I configure rotation for the Garbage Collection logs?

The GC logs are not configured by log4j but instead by JVM arguments.
Since Kafka 0.11, by default, Kafka keeps up to 10 GC log files of 100 MB each. See https://github.com/apache/kafka/blob/trunk/bin/kafka-run-class.sh#L244-L257
If you want different settings, you can export KAFKA_GC_LOG_OPTS with the desired configuration.
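For example, something along these lines (a sketch only; these are the pre-Java 9 GC flags that kafka-run-class.sh itself uses, with smaller sizes, and on Java 9+ the equivalent -Xlog:gc* syntax would be needed instead):
# keep 5 rotated GC log files of up to 50 MB each
export KAFKA_GC_LOG_OPTS="-Xloggc:/opt/kafka/logs/kafkaServer-gc.log -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=50M"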

Related

List error logs from Stackdriver matching a pattern

I am evaluating approaches for a scenario where I need to fetch a list of logs from Stackdriver. There can be multiple filter criteria (e.g. the payload contains the word 'retry' and the logs are of type 'Warning' ...)
With the help of the GCP SDK I was able to query Stackdriver, but I am not sure how efficient this approach is. Kindly suggest other approaches where I can use an Elasticsearch client to query Stackdriver and list matching logs.
It looks like you have multiple sets of logs that you wish to consume separately and each of those log sets can be described with a Stackdriver filter. This is a good start since running filters against Stackdriver is an effective way to sort your data. And you are right that running the same filter against Stackdriver over and over again would be pretty inefficient.
The following approach uses Stackdriver log sinks and this is how we manage logs on our GCP account. Our monitoring team is pretty happy with it and it's easy to maintain.
You can read up on log sinks here and aggregated log sinks here.
The general idea is to have Google automatically filter and export the logs for you using multiple log sinks (one sink per filter). The export destination can be Google Storage, BigQuery, or Pub/Sub. Each sink should export to a different location and will do so continuously as long as the sink exists. Also, log sinks can be set up per project or at the organization level (where it can inherit all projects underneath).
For example, let's say you want to set up three log sinks. Each sink uses a different filter and different export location (but all to the same bucket):
Log Sink 1 (compute logs) -> gs://my-example-log-dump-bucket/compute-logs/
Log Sink 2 (network logs) -> gs://my-example-log-dump-bucket/network-logs/
Log Sink 3 (linux logs) -> gs://my-example-log-dump-bucket/linux-logs/
Once this is set up, your code's SDK can just access each location based on what logs it currently needs. This eliminates the need for your code to do the filtering since Google has already handled it for you in the background.
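As a rough sketch, a sink like the first one above could be created with the gcloud CLI (the sink name, bucket, and filter here are only examples, and gcloud expects the Storage destination in storage.googleapis.com/BUCKET form):
# create a sink that continuously exports matching logs to a Storage bucket
gcloud logging sinks create compute-log-sink \
  storage.googleapis.com/my-example-log-dump-bucket \
  --log-filter='resource.type="gce_instance"'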
One thing to note: log exports to BigQuery and Pub/Sub are instant, but exports to Google Storage occur at the top of every hour. So if you need a fast turnaround on the logs, avoid Google Storage and go with either BigQuery or Pub/Sub.
Hope this helps!

Is there an added-value to log JSON data to Elasticsearch via Logstash?

I am reviewing my logging strategy (logs from the OS and applications which log to syslog, and from my own applications where I can freely decide what and where to log). Not having a lot of experience with Logstash, I was wondering whether there is added value in logging JSON data through it (as opposed to sending it directly to Elasticsearch).
The only advantage I could think of is that logging could consistently go to stdout (and then be picked up by syslog) and be consistently sent to Logstash (as syslog) to be analyzed there (Logstash would know that data from the application myapp.py is raw JSON, for instance).
Are there other advantages to using Logstash as an intermediary? (Security aspects are not important in this context.)
There are a couple of advantages to using Logstash even when your data is already a JSON object.
For example:
HTTP compression: when Logstash outputs to Elasticsearch, you have the option to use HTTP compression, which greatly reduces the size of the requests and the bandwidth used.
Persistent queue: Logstash allows you to have a queue, in memory or persisted on disk, to hold events when it cannot connect to Elasticsearch for some reason.
Data manipulation: you can use filters to change and enrich your data; for example, you can remove and add fields, rename fields, use a geoip filter on an IP field, etc.
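Putting these together, a minimal pipeline sketch might look like this (host, port, and field names are hypothetical; the persistent queue itself is enabled in logstash.yml rather than in the pipeline):
# logstash.yml
queue.type: persisted
# pipeline.conf
input {
  beats { port => 5044 }
}
filter {
  geoip  { source => "client_ip" }                      # enrich events from an IP field
  mutate { rename => { "appname" => "application" } }   # rename a field
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    http_compression => true                            # compress requests to Elasticsearch
  }
}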

Re-queue a Logstash event

Is it possible to re-queue a Logstash event to be processed again in a bit of time?
In order to illustrate what I mean, I will explain my use case: I have a custom Logstash filter that extracts the application version from logs at the start of an application, and then appends the correct version to every log event. However, in the very beginning, race conditions can occur where an application version has not yet been written to a file, yet the Logstash filter tries to read in the data anyway (since it is processing log lines concurrently). This results in an application version that is null. In case it matters, Logstash gets its input from Filebeat.
I would like to re-queue these events to be re-processed entirely a couple seconds (or milliseconds) from now, when the application version has been saved to the disk.
However this leads me to a broader question, which is, can you tell a Logstash event to be re-queued, or is there an alternative solution to this scenario?
Thanks for the help!
Process the data and append it to a new file, then use that file to process the data further:
Logstash processor 1: get the data, process it, and append it to a file.
Logstash processor 2: get the data from that second file and do whatever you want to do.
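A rough sketch of that two-stage idea (paths, port, and hosts are hypothetical):
# pipeline 1: receive from Filebeat, process, and append to an intermediate file
input  { beats { port => 5044 } }
output { file { path => "/var/log/logstash/stage1.jsonl" codec => json_lines } }
# pipeline 2: pick up the intermediate file once the version is available and ship it on
input {
  file {
    path => "/var/log/logstash/stage1.jsonl"
    codec => json_lines
    start_position => "beginning"
  }
}
output { elasticsearch { hosts => ["http://localhost:9200"] } }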

How can I reset Kafka state to "start of universe"?

I'm still working on a Kafka Streams application that I described in
Why isn't Kafka consumer producing results?. In that posting, I asked why setting
kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
doesn't appear to reset the state of Kafka to "start of the universe" before any data are pushed to any topic. I am now encountering a variant of that issue:
My application consists of a producer program that pushes data to a Kafka stream and a consumer program that groups the data, aggregates the groups, and then converts the resulting KTable back into a stream, which I print out.
The aggregation step is essentially adding up all the values, then putting those sums into the output stream as new data. What I observe, though, is that every time I run the program, the resulting aggregated values get bigger and bigger, almost as if Kafka is somehow retaining the previous results and including those in the aggregation.
In order to try fixing this, I deleted all my topics (except for __consumer_offsets, which Kafka would not allow), then re-ran my application, but the aggregated values continue to grow, as if Kafka were retaining the result of previous computations even though I thought that deleting the intermediate topics would fix things. I even tried stopping and restarting the Kafka server, to no avail.
What's going on here and, more to the point, how can I fix this? I've tried various suggestions about setting AUTO_OFFSET_RESET_CONFIG, also with no effect. I should mention that one aspect of my application is that my original producer creates its own Kafka timestamps in the Producer.send call, although disabling that also seemed to have no effect.
Thanks in advance, -- Mark
AUTO_OFFSET_RESET_CONFIG only takes effect if there are no committed offsets: when an application starts, it first looks for committed offsets and applies the reset policy only if no valid offsets are found.
Furthermore, for a Kafka Streams application, resetting offsets is not sufficient; you should use the reset tool bin/kafka-streams-application-reset.sh -- this blog post explains the tool in detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
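For example (a sketch; the application id, broker address, and topic name are placeholders for your own setup, and depending on your Kafka version the broker flag may be --bootstrap-server instead):
bin/kafka-streams-application-reset.sh \
  --application-id my-streams-app \
  --bootstrap-servers localhost:9092 \
  --input-topics my-input-topic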

Monitoring a Kafka Spout with the KafkaOffsetMonitor tool

I am using the kafkaSpout that came with the storm-0.9.2 distribution for my project. I want to monitor the throughput of this spout. I tried using KafkaOffsetMonitor, but it does not show any consumers reading from my topic.
I suspect this is because I have specified the root path in Zookeeper for the spout to store the consumer offsets. How will KafkaOffsetMonitor know where to look for data about my kafkaSpout instance?
Can someone explain exactly where Zookeeper stores data about Kafka topics and consumers? Zookeeper is a filesystem, so how does it arrange the data for different topics and their partitions? What is a consumer group id, and how is it interpreted by Zookeeper when storing consumer offsets?
If anyone has ever used KafkaOffsetMonitor to monitor the throughput of a kafkaSpout, please tell me how I can get the tool to find my spout.
Thanks a lot,
Palak Shah
Kafka-Spout maintains its offset in its own znode rather than under the znode where Kafka stores the offsets for regular consumers. We had a similar need where we had to monitor the offsets of both the kafka-spout consumers and regular Kafka consumers, so we ended up writing our own tool. You can get the tool from here:
https://github.com/Symantec/kafka-monitoring-tool
I have never used KafkaOffsetMonitor, but I can answer the other part.
zookeeper.connect is the property where you can specify the znode for Kafka; By default it keeps all data at '/'.
You can access the zookeeper filesystem using zkCli.sh, the zookeeper command line.
You should look at /consumers and /brokers; the following would give you the offset:
get /consumers/my_test_group/offsets/my_topic/0
You can poll this offset continuously to know the rate of consumption at spout.
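For example, a short zkCli.sh session (the group, topic, and partition here are only placeholders):
# connect to Zookeeper and browse the znodes
bin/zkCli.sh -server localhost:2181
ls /consumers
ls /brokers/topics
get /consumers/my_test_group/offsets/my_topic/0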
