Confusing Storm-UI - hadoop

I am confused by the Storm-UI statistics.
For example:
Topology Stats show a number of 69 million
kafka-spout shows a number of 34 million
__acker is at 17 million
es-bolt shows 17 million too
My topology is kafka-spout --> es-bolt and I am not sure how the numbers above add up.
If kafka-spout is emitting only 34 million, why do the topology stats show 69 million?
And again, if kafka-spout emitted 34 million, why does es-bolt say 17 million?
I see a pattern of tuples being halved from top to bottom, but I am not sure I understand why. Is it because of ack tuples or heartbeat bolts?
Are they always half of the upstream spout?

You can turn off system stats, then the numbers will make sense.
There's a button for this at the bottom of the Storm UI stats page.

Topology Stats show a number of 69 million
This is the sum of all your spouts and bolts: 34.x + 17.x + 17.x
And the emit counts of the spout and the bolt are not necessarily related to each other. They depend on your code.

Keep in mind that those metrics are sampled at the rate of topology.stats.sample.rate, 0.05 by default. If you turn it up to 1.0 you'll see full resolution, though at the price of more time spent collecting metrics.
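To make the effect of sampling concrete, here is a minimal sketch in Python. It is a simplified model, not Storm's actual code: it assumes the sampler measures one tuple in every 1/rate tuples and lets that tuple stand in for the whole group. The function name sampled_count is made up for illustration.

```python
def sampled_count(n_tuples, sample_rate=0.05):
    """Count reported when only every (1/sample_rate)-th tuple is measured.

    Simplified model (assumption): Storm's real sampler may choose the
    measured tuple differently.
    """
    step = round(1 / sample_rate)   # 20 for the default rate of 0.05
    sampled = n_tuples // step      # tuples that were actually measured
    return sampled * step           # each one stands in for `step` tuples


print(sampled_count(1010))        # 1000: the last 10 tuples fall between samples
print(sampled_count(1010, 1.0))   # 1010: every tuple is counted
```

Under this model the reported counts are estimates that move in steps of 20 at the default rate, so small differences between components should not be read as lost tuples.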


NiFi MergeRecords leaving out one file

I'm using NiFi to take in some user data and combine all the JSONs into one record. The MergeRecord processor is working just like I need, except it always leaves out one record (usually the same one every time). The processor is set to run every 60 seconds. I can't understand why, because there are only 56 records to merge. I've included images below for any help y'all may have.
Firstly, you have 56 FlowFiles, that does not necessarily mean 56 Records unless you have 1 Record per FlowFile.
You are using MergeRecord which counts Records, not files.
Your current config is set to Min 50 - Max 1000 Records
If you have 56 files with 1 Record in each, then merging 50 files is enough to meet the Minimum condition and release the bucket.
You also say Merge is set to run every 60 seconds, and perhaps this is not doing what you think it is. In almost all cases, Merge should be left to the default 0 sec schedule.
NiFi has no idea what "all" means. It takes an input and works on it; it does not know if or when the next input will come.
If every FlowFile is 1 Record, and it is categorically always 56 and that will never change, then your setting could be Min 56 - Max 56 and that will always merge exactly 56 Records.
However, that is very inflexible to change - if it suddenly changed to 57, you need to modify the flow.
Instead, you could set the Min-Max to very high numbers, say 10,000-20,000 and then set a Max Bin Age to 60 seconds (and the processor scheduling back to 0 sec). This would have the effect of merging every Record that enters the processor until A) 10-20k Records have been merged, or B) 60 seconds expire.
Example scenarios:
A) All 56 arrive within the first 2 seconds of the flow starting
All 56 are merged into 1 file after 60 seconds of the first file arriving
B) 53 arrive within the first 60 seconds, 3 arrive in the second 60 seconds
The first 53 are merged into 1 file 60 seconds after the first file arrives, and the last 3 are merged into another file 60 seconds after the first of the 3 arrives
C) 10,000 arrive in the first 5 seconds
All 10k will merge immediately into 1 file, they will not wait for 60 seconds
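The three scenarios can be reproduced with a small simulation. This is a simplified model in Python, not NiFi's implementation: it assumes one Record per FlowFile, a single bin, release as soon as the minimum is reached or the bin age expires, and it ignores the Max Records setting. The function name simulate_merge is hypothetical.

```python
def simulate_merge(arrival_times, min_records, max_bin_age):
    """Return a list of (release_time, record_count), one per merged bin."""
    bins = []
    bin_start = None
    count = 0
    for t in sorted(arrival_times):
        # The open bin ages out before this later record can join it.
        if count and t - bin_start >= max_bin_age:
            bins.append((bin_start + max_bin_age, count))
            count = 0
        if count == 0:
            bin_start = t           # first record opens a new bin
        count += 1
        if count >= min_records:    # minimum reached: release immediately
            bins.append((t, count))
            count = 0
    if count:                       # leftover bin is released when it ages out
        bins.append((bin_start + max_bin_age, count))
    return bins


# A) 56 records within the first 2 seconds
scenario_a = simulate_merge([i * 2 / 56 for i in range(56)], 10_000, 60)
# B) 53 records in the first minute, 3 more in the second minute
scenario_b = simulate_merge(list(range(53)) + [70, 71, 72], 10_000, 60)
# C) 10,000 records within the first 5 seconds
scenario_c = simulate_merge([i * 5 / 10_000 for i in range(10_000)], 10_000, 60)

print(scenario_a)  # [(60.0, 56)]
print(scenario_b)  # [(60, 53), (130, 3)]
print(scenario_c[0][1], scenario_c[0][0] < 5)  # 10000 True
```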

Throughput in Apache storm

I want to know exactly what throughput means in Apache Storm. Is it the number of tuples processed / total time?
If so, what is the total number of tuples emitted? I do not understand the exact significance of total tuples emitted / time. Please let me know.
You need to look at the execute count of the sink bolts (the end ones in your topology that don't connect to any other bolts). This is the throughput and is reported for the last 10 mins, 3 hrs, 1 day and all time. Dividing the values by the time period in seconds will give you the throughput in tuples per second.
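As a quick worked example of that division, here is a minimal sketch in Python. The execute count is a made-up number, and the window labels are only illustrative names for the periods mentioned above.

```python
# Length of each reporting window in seconds.
WINDOW_SECONDS = {"10m": 10 * 60, "3h": 3 * 3600, "1d": 24 * 3600}


def throughput(executed, window):
    """Tuples per second for a sink bolt, given its execute count in a window."""
    return executed / WINDOW_SECONDS[window]


# Hypothetical execute count of 1,200,000 over the last 10 minutes.
print(throughput(1_200_000, "10m"))  # 2000.0
```

The "all time" figure has no fixed window, so it would have to be divided by the topology uptime instead.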

Evenly schedule timed items into fixed size containers

I got a task at work to evenly schedule commercial timed items into pre-defined commercial breaks (containers).
Each campaign has a set of commercials with or without a spreading order. I need to allow users to choose multiple campaigns and distribute all the commercials to best fit the breaks within a time window.
Example of Campaign A:
Item | Duration | # of times to schedule | Order
A1 | 15 sec | 2 | 1
A2 | 25 sec | 2 | 2
A3 | 30 sec | 2 | 3
Required outcome:
Each item should appear only once in a break, no repeating.
If there is a specific order, try to best fit by keeping the order. If there is no order, shuffle it.
At the end of the process the breaks should contain an even amount of commercial time.
The ideal spread would fully fit all desired campaigns into the breaks.
For example: Campaign {Item,Duration,#ofTimes,Order}
Campaign A which has set {A1,15,2,1},{A2,25,2,2},{A3,10,1,3}
Campaign B which has set {B1,20,2,2},{B2,35,3,1}
Campaign C which has set {C1,10,1,1},{C2,15,2,3},{C3,15,1,2},{C4,40,1,4}
A client will choose to schedule those campaigns on a specific date that holds 5 breaks of 60 seconds each.
A good outcome would result in:
Container 1: {A1,15}{B2,35}{C1,10} total of 60 sec
Container 2: {C3,15}{A2,25}{B1,20} total of 60 sec
Container 3: {A3,10}{C2,15}{B2,35} total of 60 sec
Container 4: {C4,40}{B1,20} total of 60 sec
Container 5: {C2,15}{A3,10}{B2,35} total of 60 sec
Of course it's rare that everything will fit so perfectly in real-life examples.
There are so many combinations with a large number of items and I'm not sure how to go about it. The order of items inside a break needs to be dynamically calculated so that the end result would best fit all the items into the breaks.
If the architecture is poor and someone has a better idea (like giving priority to items over order and scheduling based on priority or such), I'll be glad to hear it.
It seems like simulated annealing might be a good way to approach this problem. Just incorporate your constraints (keeping order, even spreading, and fitting into the 60 sec frame) into the scoring function. Your random neighbor function might just swap 2 items with each other or move an item to a different frame.
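A minimal sketch of that idea in Python follows. It is only one possible setup under stated assumptions: the item durations are taken from the sample outcome in the question, the ordering constraint is left out for brevity, and the penalty weights, cooling schedule and function names are all arbitrary choices made for illustration.

```python
import math
import random

CAPACITY = 60   # seconds per break
N_BREAKS = 5
ITEMS = [("A1", 15), ("B2", 35), ("C1", 10), ("C3", 15), ("A2", 25),
         ("B1", 20), ("A3", 10), ("C2", 15), ("B2", 35), ("C4", 40),
         ("B1", 20), ("C2", 15), ("A3", 10), ("B2", 35)]
TARGET = sum(d for _, d in ITEMS) / N_BREAKS   # even share per break


def cost(assign):
    """Lower is better: penalise uneven fill, overflow and repeated items."""
    total = 0.0
    for b in range(N_BREAKS):
        members = [ITEMS[i] for i, brk in enumerate(assign) if brk == b]
        names = [name for name, _ in members]
        load = sum(duration for _, duration in members)
        total += abs(load - TARGET)                     # even spreading
        total += 10 * max(0, load - CAPACITY)           # stay within 60 sec
        total += 100 * (len(names) - len(set(names)))   # no repeats in a break
    return total


def neighbour(assign, rng):
    """Swap the breaks of two items, or move one item to a random break."""
    new = list(assign)
    if rng.random() < 0.5:
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    else:
        new[rng.randrange(len(new))] = rng.randrange(N_BREAKS)
    return new


def anneal(steps=5000, t_start=50.0, t_end=0.1, seed=0):
    rng = random.Random(seed)
    current = [rng.randrange(N_BREAKS) for _ in ITEMS]
    current_cost = initial_cost = cost(current)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        cand = neighbour(current, rng)
        delta = cost(cand) - current_cost
        # Always accept improvements; accept worse states with falling probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost, initial_cost


best, best_cost, initial_cost = anneal()
for b in range(N_BREAKS):
    print(b + 1, [ITEMS[i] for i, brk in enumerate(best) if brk == b])
```

The returned list maps each item to a break index. Keeping the order constraint would mean adding one more penalty term to cost, for example for each pair of items of a campaign that ends up out of order.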

Log data reduction for variable bandwidth data link

I have an embedded system which generates samples (16-bit numbers) at 1 millisecond intervals. The variable uplink bandwidth can at best transfer a sample every 5ms, so I am looking for ways to adaptively reduce the data rate while minimizing the loss of important information -- in this case the minimum and maximum values in a time interval.
A scheme which I think should work involves sparse coding and a variation of lossy compression. Like this:
The system will internally store the min and max values during a 10ms interval.
The system will internally queue a limited number (say 50) of these data pairs.
No loss of min or max values is allowed but the time interval in which they occur may vary.
When the queue gets full, neighboring data pairs will be combined starting at the end of the queue so that the converted min/max pairs now represent 20ms intervals.
The scheme should be iterative so that further interval combining to 40ms, 80ms etc is done when necessary.
The scheme should be linearly weighted across the length of the queue so that there is no combining for the newest data and maximum necessary combining of the oldest data.
For example with a queue of length 6, successive data reduction should cause the data pairs to cover these intervals:
initial: 10 10 10 10 10 10 (60ms, queue full)
70ms: 10 10 10 10 10 20
80ms: 10 10 10 10 20 20
90ms: 10 10 20 20 20 20
100ms: 10 10 20 20 20 40
110ms: 10 10 20 20 40 40
120ms: 10 20 20 20 40 40
130ms: 10 20 20 40 40 40
140ms: 10 20 20 40 40 80
New samples are added on the left, data is read out from the right.
This idea obviously falls into the categories of lossy-compression and sparse-coding.
I assume this is a problem that must occur often in data logging applications with limited uplink bandwidth therefore some "standard" solution might have emerged.
I have deliberately simplified and left out other issues such as time stamping.
Questions:
Are there already algorithms which do this kind of data logging? I am not looking for the standard, lossy picture or video compression algos but something more specific to data logging as described above.
What would be the most appropriate implementation for the queue? Linked list? Tree?
The term you are looking for is "lossy compression" (See: http://en.wikipedia.org/wiki/Lossy_compression ). The optimal compression method depends on various aspects such as the distribution of your data.
As I understand it, you want to transmit the min() and max() of all samples in a time period,
e.g. you want to transmit min/max every 10ms while taking a sample every 1ms?
If you do not need the individual samples, you can simply compare after each sample:
i = 0; min = TYPE_MAX; max = TYPE_MIN  // first sample will always overwrite the initial values
while true do
    sample = getSample()
    if min > sample then
        min = sample
    if max < sample then
        max = sample
    i = i + 1  // count the sample so that a pair is sent after every 10th one
    if i % 10 == 0 then
        send(min, max)
        // if each period should be handled separately, reset here: min = TYPE_MAX; max = TYPE_MIN
done
You can also save bandwidth by sending data only when it changes (this depends on the sample data: if the values don't change very quickly you will save a lot).
Define a combination cost function that matches your needs, e.g. (len(i) + len(i+1)) / i^2, then iterate the array to find the "cheapest" pair to replace.
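A minimal sketch of that approach in Python follows. It rests on assumptions not stated in the answer: queue entries are (min, max, length_ms) tuples, index 0 is the newest entry, and i in the cost formula is the 1-based position counted from the newest end, so that older and shorter pairs are cheaper to merge. The function names are made up for illustration.

```python
def merge_cheapest(queue):
    """Merge the adjacent pair with the lowest cost; queue[0] is the newest entry."""
    def cost(i):
        # i is the 1-based position of the newer entry of the pair (assumption).
        return (queue[i - 1][2] + queue[i][2]) / i ** 2

    i = min(range(1, len(queue)), key=cost)
    newer, older = queue[i - 1], queue[i]
    merged = (min(newer[0], older[0]),   # no minimum is lost
              max(newer[1], older[1]),   # no maximum is lost
              newer[2] + older[2])       # the interval just gets longer
    queue[i - 1:i + 1] = [merged]


def push(queue, entry, capacity):
    """Add the newest entry on the left and merge one pair if the queue overflows."""
    queue.insert(0, entry)
    if len(queue) > capacity:
        merge_cheapest(queue)


queue = []
for k in range(8):
    push(queue, (100 - k, 200 + k, 10), capacity=6)
print([length for _, _, length in queue])  # [10, 10, 10, 10, 20, 20]
```

With this particular cost function the first two overflows reproduce the 70ms and 80ms rows of the question's table. Later steps can differ, so the weighting would need tuning to get the exact progression shown there. A plain list is enough for a queue of about 50 entries, since each overflow only needs one linear scan.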

Is there a limit to the number of indexes an ElasticSearch alias can point to?

I have this alias that I want to point to 60 indexes. At 21 indexes, I start getting Execution Rejected exceptions.
Is this because of a 20 index limit in the alias API?
Assuming that you have 5 shards per index, a request against 21 indices may generate about 105 shard requests. 32 of these requests are sent to threads in the pool and 73 requests go into the queue. At this moment you have only about 27 free slots left in the queue. So, if another request against 6 or more indices (30 shards) arrives, some shard requests are going to be rejected with an Execution Rejected exception. I am oversimplifying the situation here quite a bit, and the actual number of threads used depends on many factors including where the shards are located, search settings, etc. However, I hope you can see the main idea here: if you want to search against a large number of shards, you need to make sure you have enough capacity in the thread pool to handle the peak load.
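The arithmetic above can be replayed with a short Python sketch. It uses the same oversimplified model as the explanation: 32 search threads as stated, and a queue size of 100, which is an assumption inferred from the 73 queued plus 27 free slots. The function name enqueue is hypothetical and is not part of any Elasticsearch API.

```python
SHARDS_PER_INDEX = 5
SEARCH_THREADS = 32
QUEUE_SIZE = 100   # assumption: 73 queued + 27 free slots in the text


def enqueue(indices, free_threads, free_queue):
    """Return (free_threads, free_queue, rejected) after one search request."""
    shard_requests = indices * SHARDS_PER_INDEX
    running = min(shard_requests, free_threads)
    queued = min(shard_requests - running, free_queue)
    rejected = shard_requests - running - queued
    return free_threads - running, free_queue - queued, rejected


# First request: an alias pointing to 21 indices.
threads, queue, rejected = enqueue(21, SEARCH_THREADS, QUEUE_SIZE)
print(threads, queue, rejected)   # 0 27 0

# Second request against 6 indices arrives before anything has completed.
threads, queue, rejected = enqueue(6, threads, queue)
print(threads, queue, rejected)   # 0 0 3
```

In this model nothing finishes between the two requests, which is the worst case. In practice threads free up continuously, so the rejections depend on how the load arrives rather than on a fixed number of indices.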
