My question: I want to monitor the processing duration of each piece of data in Storm. How should I do this? Are there any demos? So far I can only measure the processing duration of each bolt.
In both Apache Storm and JStorm you can get the monitoring data and export it to wherever you want.
In Storm, use a Metrics Consumer (see http://storm.apache.org/releases/1.0.6/Metrics.html).
In JStorm, use the metrics uploader (see http://jstorm.io/Maintenance/JStormMetrics.html).
There is a JStorm metricsUploader example here: https://github.com/lcy362/StormTrooper/blob/master/src/main/java/com/trooper/storm/monitor/MetricUploaderTest.java
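For example, in Storm you can register the built-in LoggingMetricsConsumer on the topology configuration; the sampled built-in metrics (including per-bolt execute latency) are then written to each worker's metrics log. A minimal sketch, where the parallelism hint of 1 is just an example:
import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

Config conf = new Config();
// Register Storm's built-in LoggingMetricsConsumer with a parallelism hint of 1.
// Each worker will write the sampled built-in metrics (execute latency, process
// latency, ack counts, ...) to its metrics log file.
conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);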
How do I implement a sliding window aggregation (or transformation) with a fixed-size count-based window?
For example, if I have stream data like the following:
input stream = 1,2,3,4,5,6,7,8...
Assume that time is not relevant here. Say my aggregate function is AVERAGE and the window size is fixed at 3 records (not 3 milliseconds, 3 seconds, 3 hours, etc.). I would like my output stream to be:
output stream = avg(1,2,3), avg(2,3,4), avg(3,4,5), avg(4,5,6), avg(5,6,7)... = 2,3,4,5,6...
The windows documented for Kafka Streams are time-based. Even the constructor of the base class Window has the following signature:
Window(long startMs, long endMs)
So I am not sure whether it is the right tool for non-time-based windowed aggregation.
Apache Flink supports count-based sliding and tumbling windows. That's exactly what I need, but I'm looking for a similar feature in Kafka Streams.
If time-ordering is no concern for you, you can implement a custom Transformer with attached state.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(...); // add a KeyValueStore here
KStream result = builder.stream("topic").transform(...); // pass in the name of your KeyValueStore, too
For your custom Transformer you can maintain a List per key, with the list being your window: as long as the list is smaller than your window size, you append the new record to the list; if it reaches exactly the window size, you trigger the computation; if it exceeds the size, you trim it first and trigger the computation afterwards.
See the docs for more details: https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html (Note that a Processor and a Transformer are basically the same thing.)
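For illustration, here is a minimal sketch of such a Transformer, assuming String keys, Integer values, a window size of 3, and a state store registered under the (hypothetical) name "window-store"; the window is kept as a comma-separated string purely to avoid writing a custom Serde:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountWindowAvgTransformer implements Transformer<String, Integer, KeyValue<String, Double>> {

    private static final int WINDOW_SIZE = 3;
    private KeyValueStore<String, String> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("window-store");
    }

    @Override
    public KeyValue<String, Double> transform(String key, Integer value) {
        String serialized = store.get(key);
        List<Integer> window = serialized == null
                ? new ArrayList<>()
                : Arrays.stream(serialized.split(",")).map(Integer::valueOf)
                        .collect(Collectors.toCollection(ArrayList::new));
        window.add(value);
        if (window.size() > WINDOW_SIZE) {
            window.remove(0); // slide: drop the oldest record
        }
        store.put(key, window.stream().map(String::valueOf).collect(Collectors.joining(",")));
        if (window.size() == WINDOW_SIZE) {
            double avg = window.stream().mapToInt(Integer::intValue).average().orElse(0.0);
            return KeyValue.pair(key, avg); // emit one average per full window
        }
        return null; // window not full yet -> emit nothing
    }

    // Older Kafka Streams versions (pre-2.0) also require the deprecated punctuate()
    // method on Transformer; left as a no-op here.
    public KeyValue<String, Double> punctuate(long timestamp) {
        return null;
    }

    @Override
    public void close() { }
}
The store can then be registered and wired in roughly like this (with org.apache.kafka.streams.state.Stores and org.apache.kafka.common.serialization.Serdes imported; topic names are placeholders):
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("window-store"), Serdes.String(), Serdes.String()));
builder.<String, Integer>stream("input-topic")
       .transform(CountWindowAvgTransformer::new, "window-store")
       .to("output-topic");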
If you are willing to use Apache Storm, which is also a streaming engine, Kafka can be connected to it as a data source. Recent Storm versions provide a concept called a tumbling window, which delivers an exact number of tuples to your topology. This can easily be used to solve your problem.
For more details, have a look at https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_storm-component-guide/content/storm-windowing-concepts.html
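As a rough sketch (the bolt and component names are hypothetical), a tumbling window of exactly 3 tuples, or the count-based sliding window of 3 tuples advancing by 1 tuple from the question, can be declared like this:
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt.Count;

TopologyBuilder builder = new TopologyBuilder();
// AvgWindowBolt is a hypothetical bolt extending BaseWindowedBolt that computes the
// average over TupleWindow#get() in its execute() method; "kafka-spout" is the id of
// the Kafka spout feeding it.
builder.setBolt("avg-tumbling", new AvgWindowBolt().withTumblingWindow(Count.of(3)))
       .shuffleGrouping("kafka-spout");
// Sliding-window variant: window of 3 records, sliding by 1 record.
builder.setBolt("avg-sliding", new AvgWindowBolt().withWindow(Count.of(3), Count.of(1)))
       .shuffleGrouping("kafka-spout");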
How do I get, in my program (which runs the Spark Streaming job), the time taken by each RDD job?
For example:
val streamrdd = KafkaUtils.createDirectStream[String, String, StringDecoder,StringDecoder](ssc, kafkaParams, topicsSet)
val processrdd = streamrdd.map(some operations...).savetoxyz
In the above code, for each micro-batch RDD a job is run for the map and save operations.
I want to get the time taken by each streaming job. I can see the jobs in the UI on port 4040, but I want to get this in the Spark code itself.
Pardon if my question is not clear.
You can use a StreamingListener in your Spark app. This interface provides a method onBatchCompleted that gives you the total time taken by a batch's jobs.
context.addStreamingListener(new StatusListenerImpl());
StatusListenerImpl is the implementation class that you have to write by implementing StreamingListener.
There are other methods available in the listener as well; you should explore them too.
I didn't find any documentation related to NiFi counters. I want to know how to use this feature and what its purpose is.
Counters are a way for a processor to track how many times some event occurred, mostly for monitoring purposes. There is a method in the ProcessSession:
void adjustCounter(String name, long delta, boolean immediate);
So calling this method with ("myCounter", 1, true) would increment the count of "myCounter" by 1, or create the counter if it didn't exist. Counters are not persistent and will be reset on restart. An example is in the syslog processors which increment a counter for each syslog message received.
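As a rough sketch, inside a processor extending AbstractProcessor the onTrigger method might use it like this (the counter name and the REL_SUCCESS relationship are just placeholders):
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    // Increment (or lazily create) the counter "messages.received" by 1; passing
    // immediate=true makes the adjustment visible right away rather than only when
    // the session is committed.
    session.adjustCounter("messages.received", 1, true);
    session.transfer(flowFile, REL_SUCCESS);
}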
See discussion here:
https://community.hortonworks.com/questions/50622/apache-nifi-what-are-counters-in-nifi.html
I have configured my monitoring system as the following chain:
my_app -> pystatsd -> statsdaemon -> carbon-tagger -> graphite (via carbon-cache) -> graph-explorer
But it looks like carbon-tagger only dumps metrics to Elasticsearch, not to Graphite. At the same time, carbon-tagger successfully sends its own internal metrics to carbon-cache, and they appear in Graph Explorer just fine. I have looked at the source code of carbon-tagger and could not find a place where it sends the metrics received from statsdaemon to Graphite. So now I'm confused! How should I configure my monitoring system to dump metrics both to Elasticsearch and to Graphite?
In a nutshell, the correct configuration of the described system should look like this:
my_app -> pystatsd -> statsdaemon -> carbon-relay (or carbon-relay-ng) -> carbon-cache -> graphite -> graph-explorer
                                                                       -> carbon-tagger -> ElasticSearch
That is, statsd/statsdaemon should pass its data to carbon-relay (or carbon-relay-ng), not to carbon-cache directly, and carbon-relay will broadcast the data to both carbon-tagger and carbon-cache. Also, don't forget that carbon-tagger does not understand the pickle format, while the original carbon-relay forwards data only via the pickle protocol.
Stream definition:
"http --port=9400 | throughput-sampler"
I've sent the START and END/STOP payloads, but I'm not sure how to read out the throughput values.
They don't show up in the logs, so I'm wondering how to access them. There seems to be no documentation detailing this either.
Appreciate any help!
This looks like a bug; add
log4j.logger.org.springframework.xd.integration.throughput=INFO
to the
xd/config/xd-singlenode-logger.properties
or
xd/config/xd-container-logger.properties
depending on which topology you are using, and you'll see...
09:56:16,358 1.1.0.SNAP INFO pool-10-thread-10 throughput.ThroughputSamplerMessageHandler - Throughput sampled for 2 items: 0/s in 5584ms elapsed time.
The default endMessage is END.