apache storm reliablity timeout configuration - apache-storm

I have a nodejs->kafka>storm->Mongo deployed in Linux Ubuntu. Everything is normal originally. Then I changed the method in storm worker which makes storm worker process message very slow, around 1 minute per message, I notice the message is sent again and again from storm. I revert back to original method, everything is fine. (original method process time is 90ms per message).
I guess this is Storm reliability come into player. When message is not acknowledged, or time out, it sends message again.
If my guess is right, how to configure this timeout?
If my guess is wrong, why same message is sent twice or three times?

You can set the timeout via configuration parameter Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS. See https://storm.apache.org/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS
The default value is 30 seconds, see defaults.yaml here: https://github.com/apache/storm/blob/master/conf/defaults.yaml
# maximum amount of time a message has to complete before it's considered failed
topology.message.timeout.secs: 30
When a tuple fails, it should show up in Storm UI and should be logged, too (maybe you need to adjust log level). So you can double check if a tuple times out or not.

Related

New Relic not dispatching NRQL alert condition for process when errors are triggered

I'm creating a monitoring for a process using New Relic. The process itself is an AWS Lambda that finishes running in around 15 seconds. Any time this process fails, I want to an alert to be triggered and an email to be sent to me per the policy I've configured.
For testing purposes I'm causing the lambda to fail in a QA environment multiple times in a row to see what gets picked up by New Relic, although in production the failure would only occur a couple (less than 3) times per week, potentially a few days apart.
Here is the chart that depicts all of the failures, the NRQL query, and the thresholds. As we can see, the summed errors are well above the threshold but for some reason the alert email is not being dispatched. Any ideas?
Try increasing your evaluation offset in Condition Settings -> Advanced Settings > Evaluation offset
New Relic polls for Lambda metrics every 5 minutes so if your offset is lower than this you may find that the alert doesn't fire.
In reality I've found this quite unreliable and I'd suggest setting quite a high offset initially to test the alert - maybe 20 or 30 minutes.
According to me the red highlighted area is the timeframe where the alert condition is being violated. Alert should had been triggered, check your notification channel and try sending test notification.

Kafka consumer not committing offset correctly

I had a Kafka consumer defined with the following properties :
session.timeout.ms = 60000
heartbeat.interval.ms = 6000
We noticed a lag of ~2000 messages and saw that the same message is being consumed multiple times by the consumer (via our app logs). Also, noticed that some of the messages were taking ~10 seconds to be completely processed. Our suspicion was that the consumer was not committing the offset properly (or was committing the same old offset repeatedly), because of which the same message was being picked up by the consumer.
To fix this, we introduced a few more properties :
auto.commit.interval.ms=20000 //To ensure that commit is happening only after processing of message is completed
max.poll.records=10 //To make the consumer pick only 10 messages in one go
And, we set the concurrency to 1.
This fixed our issue. The lag started to reduce and ultimately came to 0.
But, I am still unclear why the problem occurred in the first place.
As I understand, by default :
enable.auto.commit = true
auto.commit.interval.ms=5000
So, ideally the consumer should have been committing every 5 seconds. If the message was not completely processed within this timeframe, what happens? What offset is being committed by the consumer? Did the problem occur due to large poll record size (which is 500 by default)
Also, about the poll() method, I read that :
The poll() call is issued in the background at the set auto.commit.interval.ms.
So, originally if the poll() was earlier taking place in every 5 seconds (default auto.commit.interval), why was not it committing the latest offset? Because the consumer was still not done processing it? Then, it should have committed that offset at the next 5th second.
Can someone please answer these queries and explain why the original problem occurred?
If you are using Spring for Apache Kafka, we recommend setting enable.auto.commit to false so that the container will commit the offsets in a more deterministic fashion (either after each record, or each batch of records - the default).
Most likely, the problem was max.poll.interval.ms which is 5 minutes by default. If your batch of messages take longer than this you would have seen that behavior. You can either increase max.poll.interval.ms or, as you have done, reduce max.poll.records.
The key is that you MUST process the records returned by the poll in less than max.poll.interval.ms.
Also, about the poll() method, I read that :
The poll() call is issued in the background at the set auto.commit.interval.ms.
That is incorrect; poll() is NOT called in the background; heartbeats are sent in the background since KIP-62.

Why does Nifi PutParquet processor create so many tasks?

The Nifi PutParquet Processor with timer driven run schedule of 0 sec with previous processor in stopped status shows ~3000 Tasks for the last 5 minutes.
We are on Nifi 1.9.2.
My expectation would be that this processor only creates tasks if data is in the incoming queue for the processor. Is this some misconfiguration or a bug in the implementation?
The processor is annotated with #TriggerWhenEmpty which lets it execute all the time regardless of data in the incoming queue. The reason for this is because in a kerberized environment, the processor needs a chance to refresh the credentials. It was a common problem with other processors where no data comes in for a long time, say over a weekend, and during that time the kerberos ticket expired, and then when data starts coming in Monday everything fails.
These empty executions shouldn't have a big impact on the system. When the processor executes and no data is available, it just calls yield and returns. The default yield duration is 1 second, but is controllable through the UI.

How would you emit storm data after a period of time has lapsed?

For example, lets say you were using storm to aggregate web visit start and end dates. A session starts with the first visit from a user and ends after 30 minutes of inactivity from that same user. This data is being streamed into storm in realtime as its collected. How would you tell storm to emit data after that 30 minutes of inactivity?
I am not sure but you can look for TOPOLOGY_TICK_TUPLE_FREQ_SECS properties in storm. As found in this article
Tick tuples: It’s common to require a bolt to “do something” at a fixed interval, like flush writes to a database. Many people have been using variants of a ClockSpout to send these ticks. The problem with a ClockSpout is that you can’t internalize the need for ticks within your bolt, so if you forget to set up your bolt correctly within your topology it won’t work correctly. 0.8.0 introduces a new “tick tuple” config that lets you specify the frequency at which you want to receive tick tuples via the “topology.tick.tuple.freq.secs” component-specific config, and then your bolt will receive a tuple from the __system component and __tick stream at that frequency.
You can also found the sample code to configure spouts or bolt to receive the tick tuple with a specific interval.

Laravel 4 Queues, how to do multithreading with queue:listen?

I need asynchronous, quick processing of everything in the queue. Jobs consist of CURL requests so it takes forever doing them 1 by 1 (They're basically the same as sleep(3)). I'd like all messages in the queue to run at the same time, or at least set a limit like 50. The reason I'm using a queue for this and not just running them instantly is because I need to make sure that if anything fails, it tries again.
Use the queue with iron.io ironMQ push, the queue shouldn't fail but in the unlikely even it does there is a log.
See this link for reference http://blog.iron.io/2013/05/laravel-4-ironmq-push-queues-insane.html
From memory you get 10 million requests free per month with ironMQ

Resources