Strange 'emitted' numbers behavior / zero stat numbers in Topology stats (Storm 1.0.3) - java-8

This is what my storm UI stat looks like.
The problem is that I have no idea where those numbers (of emitted tuples) are coming from.
My topology is pretty simple: Kafka spout -> bolt (persisting data into HBase).
The topology works: when I put data into the Kafka topic, it gets processed by the bolt and persisted in HBase, which I then verify with a scan in the HBase shell (so new records are being inserted).
However, each time I submit a new message into Kafka and it is persisted by the bolt, my topology doesn't increase the number of emitted tuples by 1.
Periodically all the numbers increase by 20 without any new messages being sent to Kafka. I.e. my Kafka topic gets no messages for hours, but the number of emitted tuples always increases in chunks of 20 over time. I still get the same number of records in HBase.
I get no exceptions/errors anywhere in the Apache Storm logs.
I'm not calling ack() or fail() on any of my tuples in my bolt implementation (which is a BasicBolt, so acking is done automatically).
The capacity and latency in my bolt metrics always stay at zero, even when I load a lot of messages into Kafka.
My Kafka offset log ($KAFKA/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker) shows that all the messages are processed and the Kafka lag for the given topic/group is 0.
So my questions:
What are those 'stealth' tuples that increase 'emitted' in both the spout and the bolt by 20 at a time?
Is it possible to enable 'debugging' in the Storm UI to see what those tuples are?
Why is capacity/latency in the bolt metrics always zero while the bolt is confirmed to persist data?
Environment details
I’m using Java 8 + Apache Storm 1.0.3
[devops@storm-wk1-prod]~/storm/supervisor/stormdist% storm version
Running: /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -client -Ddaemon.name= -Dstorm.options= -Dstorm.home=/opt/apache-storm-1.0.3 -Dstorm.log.dir=/opt/apache-storm-1.0.3/logs -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /opt/apache-storm-1.0.3/lib/storm-core-1.0.3.jar:/opt/apache-storm-1.0.3/lib/kryo-3.0.3.jar:/opt/apache-storm-1.0.3/lib/reflectasm-1.10.1.jar:/opt/apache-storm-1.0.3/lib/asm-5.0.3.jar:/opt/apache-storm-1.0.3/lib/minlog-1.3.0.jar:/opt/apache-storm-1.0.3/lib/objenesis-2.1.jar:/opt/apache-storm-1.0.3/lib/clojure-1.7.0.jar:/opt/apache-storm-1.0.3/lib/disruptor-3.3.2.jar:/opt/apache-storm-1.0.3/lib/log4j-api-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-core-2.1.jar:/opt/apache-storm-1.0.3/lib/log4j-slf4j-impl-2.1.jar:/opt/apache-storm-1.0.3/lib/slf4j-api-1.7.7.jar:/opt/apache-storm-1.0.3/lib/log4j-over-slf4j-1.6.6.jar:/opt/apache-storm-1.0.3/lib/servlet-api-2.5.jar:/opt/apache-storm-1.0.3/lib/storm-rename-hack-1.0.3.jar:/opt/storm/conf org.apache.storm.utils.VersionInfo
Storm 1.0.3
URL https://git-wip-us.apache.org/repos/asf/storm.git -r eac433b0beb3798c4723deb39b3c4fad446378f4
Branch (no branch)
Compiled by ptgoetz on 2017-02-07T20:22Z
From source with checksum c78e52de4b8a22d99551d45dfe9c1a4b
My storm.yaml:
I'm running 2 instances with storm supervisor, each having the following config:
storm.zookeeper.servers:
- "10.138.0.8"
- "10.138.0.9"
- "10.138.0.16"
storm.zookeeper.port: 2181
nimbus.seeds: ["10.138.0.10"]
storm.local.dir: "/var/log/storm"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
topology.yaml
nimbus.host: "10.138.0.10"
# In Storm 0.7.x, this is necessary in order to give workers time to
# initialize. In Storm 0.8.0 and later, it may not be necessary because Storm
# has added a separate, longer timeout for the initial launch of a worker.
supervisor.worker.timeout.secs: 60
topology.workers: 1
topology
import tbolts
import tspouts


def create(builder):
    """Create topology through the Petrel library."""
    # Spout getting data from the Kafka instance;
    # we run 2 tasks of the Kafka spout.
    builder.setSpout("kafka",
                     tspouts.KafkaSpout(), 2)
    # Persistence bolt;
    # we run 4 tasks of the persistence bolt.
    builder.setBolt("persistence",
                    tbolts.PersistenceBolt(), 4).shuffleGrouping("kafka")

Your emit count jumps by 20 because Storm, by default, samples only one in every 20 tuples when updating its metrics, and each sampled tuple adds 20 to the count. The sampling rate is controlled by the topology.stats.sample.rate config variable and can be changed per topology. You could set it to 1.0 (the default is 0.05) to get an accurate emit count; however, this would introduce significant processing overhead and may overload your Acker and/or metrics consumer instances. Use with caution.
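To make the effect of sampling concrete, here is a minimal Python sketch of a sampled counter. It is a simplified model rather than Storm's actual code: the function names `make_even_sampler` and `simulate_emitted` are hypothetical, and the model assumes that exactly one tuple is picked at random in each window of 1/rate tuples and that each pick adds 1/rate to the displayed count.

```python
import random


def make_even_sampler(rate, rng):
    """Return (sampled, freq); sampled() is True once per window of freq calls.

    Simplified model (an assumption, not Storm's code): one random position
    is chosen in every window of freq = 1/rate calls.
    """
    freq = int(round(1 / rate))
    state = {"curr": -1, "target": rng.randrange(freq)}

    def sampled():
        state["curr"] += 1
        if state["curr"] >= freq:
            # Start a new window and pick a new random position in it.
            state["curr"] = 0
            state["target"] = rng.randrange(freq)
        return state["curr"] == state["target"]

    return sampled, freq


def simulate_emitted(num_tuples, rate=0.05, seed=0):
    """Return the displayed 'emitted' value after each of num_tuples tuples."""
    rng = random.Random(seed)
    sampled, freq = make_even_sampler(rate, rng)
    displayed = 0
    history = []
    for _ in range(num_tuples):
        if sampled():
            # A sampled tuple stands in for freq real tuples.
            displayed += freq
        history.append(displayed)
    return history


if __name__ == "__main__":
    history = simulate_emitted(100)
    print(sorted(set(history)))            # values the counter passed through
    print(simulate_emitted(100, 1.0)[-1])  # rate 1.0 counts every tuple
```

In this model the displayed value only ever moves in steps of 20 at the default rate, so a single message rarely changes it, while a rate of 1.0 makes the counter advance by 1 per tuple.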

Related

Kafka streams keep logging 'Discovered transaction coordinator' after a node crash (with config StreamsConfig.EXACTLY_ONCE_V2)

I have a Kafka (kafka_2.13-2.8.0) cluster with 3 partitions and 3 replicas distributed across 3 nodes.
A producer cluster is sending messages to the topic.
I also have a consumer cluster using Kafka streams to consume messages from the topic.
To test fault tolerance, I killed a node. After that, all consumers got stuck and kept logging the message below:
[read-1-producer] o.a.k.c.p.internals.TransactionManager : [Producer clientId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-StreamThread-1-producer, transactionalId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-1] Discovered transaction coordinator myhost:9092 (id: 3 rack: null)
What I have found so far is that this is related to the StreamsConfig.EXACTLY_ONCE_V2 setting, because if I change it to StreamsConfig.AT_LEAST_ONCE the consumers work as expected.
To keep exactly-once (EOS) consumption, did I miss any configuration for the producer/cluster/consumer?

Messages are dropping because too many are queued in AlertManager

I have a single-instance AlertManager cluster and I see this warning in the AlertManager container:
level=warn ts=2021-11-03T08:50:44.528Z caller=delegate.go:272 component=cluster msg="dropping messages because too many are queued" current=4125 limit=4096
Alert Manager Version information:
Version Information
Branch: HEAD
BuildDate: 20190708-14:31:49
BuildUser: root@868685ed3ed0
GoVersion: go1.12.6
Revision: 1ace0f76b7101cccc149d7298022df36039858ca
Version: 0.18.0
AlertManager metrics
# HELP alertmanager_cluster_members Number indicating current number of members in cluster.
# TYPE alertmanager_cluster_members gauge
alertmanager_cluster_members 1
# HELP alertmanager_cluster_messages_pruned_total Total number of cluster messages pruned.
# TYPE alertmanager_cluster_messages_pruned_total counter
alertmanager_cluster_messages_pruned_total 23020
# HELP alertmanager_cluster_messages_queued Number of cluster messages which are queued.
# TYPE alertmanager_cluster_messages_queued gauge
alertmanager_cluster_messages_queued 4125
How do we see those queued messages in AlertManager?
Do we lose alerts when messages are dropped because too many are queued?
Why are messages queued even though there is logic to prune messages at a regular interval, i.e. every 15 minutes?
Do we lose alerts when AlertManager prunes messages at that regular interval?
I am new to alerting. Could you please answer the above questions?

MetricBeat - Kafka's consumergroup metricset doesn't send any data?

I have running ZooKeeper and single Kafka broker and I want to get metrics with MetricBeat, index it with ElasticSearch and display with Kibana.
However, MetricBeat can only get data from partition metricset and nothing comes from consumergroup metricset.
Since the kafka module is defined as periodic in metricbeat.yml, shouldn't it send some data on its own, rather than waiting for user interaction (e.g. a write to a topic)?
To make sure, I created a consumer group and wrote to and consumed from a topic, but still no data was collected by the consumergroup metricset.
consumergroup is defined in both metricbeat.template.json and metricbeat.template-es2x.json.
While metricbeat.full.yml is completely commented out, this is my kafka module definition in metricbeat.yml:
- module: kafka
  metricsets: ["partition", "consumergroup"]
  enabled: true
  period: 10s
  hosts: ["localhost:9092"]
  client_id: metricbeat1
  retries: 3
  backoff: 250ms
  topics: []
In the /logs directory of MetricBeat, lines like these show up:
INFO Non-zero metrics in the last 30s:
libbeat.es.published_and_acked_events=109
libbeat.es.publish.write_bytes=88050
libbeat.publisher.messages_in_worker_queues=109
libbeat.es.call_count.PublishEvents=5
fetches.kafka-partition.events=106
fetches.kafka-consumergroup.success=2
libbeat.publisher.published_events=109
libbeat.es.publish.read_bytes=2701
fetches.kafka-partition.success=2
fetches.zookeeper-mntr.events=3
fetches.zookeeper-mntr.success=3
With ZooKeeper's mntr and Kafka's partition, I can see events= and success= values, but for consumergroup there is only success=. It looks like no events are fired.
partition and mntr data are properly visible in Kibana, while consumergroup is missing.
The data stored in ElasticSearch is not human-readable: internal strings are used for directory names, and the logs do not contain any useful information.
Can anybody help me understand what is going on and fix it (probably in MetricBeat) so that the data is sent to ElasticSearch? Thanks :)
You need an active consumer consuming from the topics for the consumergroup metricset to generate events.

Apache Storm Worker Process dies

I have installed storm-0.9.2 in a 5-node cluster. I have a simple topology with 1 spout and varying number of bolts (4, 9, 22, 31). For each configuration I have configured (#bolts + 1) workers. Thus for 4 bolts, I have 5 workers, 22 bolts with 23 workers, etc.
I have observed failed worker processes in the worker log files, with a corresponding EndOfStream exception in the zookeeper.out log file. When I do get a clean test run, the tuples processed by each bolt are evenly distributed across the workers. On a non-clean test run, the workers that failed attempt to reconnect; however, since the number of tuples is finite, there are no more tuples to process.
What are the possible causes for a worker process to die?
Excerpt from zookeeper.out log file:
2014-10-27 17:40:33,198 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1495431347c001e, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:744)
2014-10-27 17:40:33,201 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.0.1:45693 which had sessionid 0x1495431347c001e
Cluster Environment:
Storm 0.9.2
Zookeeper 3.4.6
Ubuntu 13.10
To me, it looks like a problem with your Zookeeper. There are a couple of possibilities:
Your Zookeeper timeout configuration is too small.
Your Zookeeper instance doesn't have enough nodes to handle your workload.
To diagnose, start by increasing the default timeout for your Zookeeper instance. If that does not work, try expanding your Zookeeper cluster.
You can consult the Zookeeper documentation. Please let us know if that solves your problem.

Kafka Storm spout changing topology and consuming from the old offset

I am using the Kafka spout to consume messages. If I have to change the topology and upload it again, will it resume from the old message or start from the new messages? The Kafka spout lets us specify the timestamp from which to consume, but how will I know the timestamp?
spoutConfig.forceStartOffsetTime(-1);
It will choose the latest offset written around that timestamp to start consuming. You can
force the spout to always start from the latest offset by passing in -1, and you can force
it to start from the earliest offset by passing in -2.
references
If you are using KafkaSpout ensure the following:
In your SpoutConfig, "id" and "zkroot" do NOT change after redeploying the new version of the topology. Storm uses "zkroot" and "id" to store the topic offset in zookeeper.
KafkaConfig.forceFromStart is set to false.
KafkaSpout stores the offsets in zookeeper. Be very careful during re-deployment: if forceFromStart is set to true in the KafkaConfig of the KafkaSpout (which can be the case when you first deploy the topology), it will ignore the offsets stored in zookeeper. Make sure you set it to false.
Consider writing your topology so that the KafkaConfig.forceFromStart value is read from a properties file when your Topology’s main() method executes. This will allow your administrators to control whether the Kafka messages are replayed or not.
Basically the sequence of events will be:
First time start the topology by reading from beginning with below properties:
forceFromStart = true
startOffsetTime = -2
The above properties will force it to start from the beginning of the topic. Remember to set both properties, because forceFromStart tells storm to read the startOffsetTime property, use its value to determine where to start reading, and ignore the zookeeper offset.
From now on your topology will run and zookeeper will maintain the offset. If your worker dies, it will be restarted by the supervisor and start reading from the offset in zookeeper.
Now if you want to restart your topology and read from where it left off before shutdown, use the property below and restart the topology:
forceFromStart = false
With the above property, you are telling storm not to read the startOffsetTime value and instead to use the zookeeper offset that was maintained before you shut down your topology.
From now on, every time you restart the topology, it will read from where it left off.
If you want to restart your topology and read from the head/top of the topic, use the properties below and restart the topology:
forceFromStart = true
startOffsetTime = -1
With the above properties, you are telling storm to ignore the zookeeper offset and start from the latest offset, that is, the tip of the topic.
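The rules above can be summarised as a small decision function. The Python sketch below only models the behaviour described in this answer; it does not call the storm-kafka API. The names `resolve_start`, `parse_properties` and `start_from_properties` are hypothetical, and the fallback used when nothing is stored in zookeeper is an assumption that the answer does not cover. It also follows the earlier suggestion of reading forceFromStart from a properties file.

```python
EARLIEST = -2  # startOffsetTime value meaning "beginning of the topic"
LATEST = -1    # startOffsetTime value meaning "tip of the topic"


def resolve_start(force_from_start, start_offset_time, zk_offset):
    """Describe where the spout begins reading, per the rules in the text.

    zk_offset is the offset stored in zookeeper, or None if none is stored.
    """
    # ASSUMPTION: with no stored offset we fall back to startOffsetTime;
    # the text above does not describe this case.
    if force_from_start or zk_offset is None:
        if start_offset_time == EARLIEST:
            return ("earliest", None)
        if start_offset_time == LATEST:
            return ("latest", None)
        return ("timestamp", start_offset_time)
    # forceFromStart = false: resume from the offset kept in zookeeper.
    return ("stored", zk_offset)


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def start_from_properties(text, zk_offset):
    """Apply resolve_start to settings read from properties-file text."""
    props = parse_properties(text)
    force = props.get("forceFromStart", "false").lower() == "true"
    start_time = int(props.get("startOffsetTime", EARLIEST))
    return resolve_start(force, start_time, zk_offset)


if __name__ == "__main__":
    first_deploy = "forceFromStart = true\nstartOffsetTime = -2"
    redeploy = "forceFromStart = false"
    print(start_from_properties(first_deploy, zk_offset=None))
    print(start_from_properties(redeploy, zk_offset=500))
```

Keeping the flag in a properties file means a redeploy that should resume from the stored offset needs only forceFromStart = false, with no code change.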
