NiFi MergeRecords leaving out one file - apache-nifi

I'm using NiFi to take in some user data and combine all the JSONs into one record. The MergeRecord processor is working just like I need, except it always leaves out one record (usually the same one every time). The processor is set to run every 60 seconds. I can't understand why, because there are only 56 records to merge. I've included images below for any help y'all may have.

Firstly, you have 56 FlowFiles; that does not necessarily mean 56 Records unless you have 1 Record per FlowFile.
You are using MergeRecord, which counts Records, not files.
Your current config is set to Min 50 - Max 1000 Records
If you have 56 files with 1 Record in each, then merging 50 files is enough to meet the Minimum condition and release the bin.
You also say Merge is set to run every 60 seconds, and perhaps this is not doing what you think it is. In almost all cases, Merge should be left to the default 0 sec schedule.
NiFi has no idea what 'all' means; it takes an input and works on it - it does not know if or when the next input will come.
If every FlowFile is 1 Record, and it is categorically always 56 and that will never change, then your setting could be Min 56 - Max 56, and that will always merge 56 at a time.
However, that is very inflexible - if the count suddenly changed to 57, you would need to modify the flow.
Instead, you could set the Min-Max to very high numbers, say 10,000-20,000, and then set a Max Bin Age of 60 seconds (and the processor scheduling back to 0 sec); a sketch of these settings follows the scenarios below. This would have the effect of merging every Record that enters the processor until A) 10-20k Records have been merged, or B) 60 seconds expire.
Example scenarios:
A) All 56 arrive within the first 2 seconds of the flow starting
All 56 are merged into 1 file, 60 seconds after the first file arrived
B) 53 arrive within the first 60 seconds, 3 arrive in the second 60 seconds
The first 53 are merged into 1 file 60 seconds after the first file arrived; the last 3 are merged into another file 60 seconds after the first of those 3 arrived
C) 10,000 arrive in the first 5 seconds
All 10k will merge immediately into 1 file; they will not wait for 60 seconds
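If it helps, here is a sketch of those settings using the property names as they appear on the MergeRecord processor (the exact values are placeholders to adjust for your flow):
Minimum Number of Records : 10000
Maximum Number of Records : 20000
Max Bin Age               : 60 sec
Run Schedule (Scheduling tab) : 0 sec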

Related

How many users will be generated in Concurrency Thread Group

Target Concurrency 12
Ramp Up Time(Sec) 48
Ramp up step count 2
Hold Target Rate(sec) 48
Thread Iteration limit is set to 1
I am expecting the total number of users to be 24, i.e.
6 users (as part of the 1st step-up count), the next 6 users (as part of the 2nd step-up count),
then the full 12 users (as part of the Hold Target Rate time).
But it is not happening as per the above expectation.

Kafka Windowed State Stores not cleaning up after retention

For some reason my old state stores are not cleaning up after the retention period expires. I am testing locally, so I am just sending a single test message every 5 minutes or so. I have the retention durations set low just for testing: retentionPeriod = 120, retentionWindowSize = 15, and I assume retain duplicates should be false. When should that be true?
Stores.persistentWindowStore(storeName,
        Duration.of(retentionPeriod, ChronoUnit.SECONDS),     // retention period: minimum time records are kept
        Duration.of(retentionWindowSize, ChronoUnit.SECONDS), // window size
        false)                                                // retainDuplicates
When I ls the state store directory I see the old stores well after the retention period has expired, for example store.1554238740000 (assuming the number is epoch ms). I am well past the 2 minute retention time and that directory is still there.
What am I missing?
Note, it does eventually clean up just a lot later than I was expecting. What triggers the clean up?
Retention time is a minimum guarantee for how long data is stored. To make expiration efficient, so-called segments are used to partition the timeline into "buckets". A segment is only dropped after all the data in it can be expired. By default, Kafka Streams uses 3 segments. Thus, for your example with a retention time of 120 seconds, each segment will be 60 seconds long (not 40 seconds). The reason is that the oldest segment can only be deleted if all data in it has passed the retention time. If the segment size were only 40 seconds, 4 segments would be required to achieve this:
S1 [0,40) -- S2 [40,80) -- S3 [80,120)
If a record with timestamp 121 should be stored, S1 cannot be deleted yet, because it contains data for timestamps 1 to 39 that have not yet passed the retention period. Thus, a new segment S4 would be required. For segment size 60, 3 segments are sufficient:
S1 [0,60) -- S2 [60,120) -- S3 [120,180)
In this case, if a record with timestamp 181 arrives, all data in the first segment is past the retention cutoff of 181 - 120 = 61, and thus S1 can be deleted before S4 is created.
Note that since Kafka 2.1 the internal mechanism is still the same; however, Kafka Streams enforces the retention period at the application level in a strict manner, i.e., writes are dropped and reads return null for all data past the retention period (even if the data is still there, because the segment is still in use).
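To make that arithmetic concrete, here is a small illustrative sketch in plain Java (not Kafka Streams internals; the 3-segment default and the expiry rule are simply the ones described above):
public class SegmentSketch {
    public static void main(String[] args) {
        long retentionMs  = 120_000L;                        // retention period from the question (120 s)
        int  numSegments  = 3;                               // default number of segments described above
        long segmentMs    = retentionMs / (numSegments - 1); // 120 s / 2 = 60 s per segment

        // A segment [start, start + segmentMs) can only be dropped once ALL of its data
        // has passed retention, i.e. once streamTime - retentionMs >= start + segmentMs.
        long streamTimeMs = 181_000L;                        // a record with timestamp 181 s arrives
        long s1StartMs    = 0L;
        boolean dropS1 = streamTimeMs - retentionMs >= s1StartMs + segmentMs;
        System.out.println("segment size = " + segmentMs + " ms, S1 can be dropped: " + dropS1);
    }
}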

Apache Storm UI window

In the Apache Storm UI, Window specifies the past period of time for which the statistics apply, so it may be 10 min, 3 h, or 1 day. But when a topology is actually running, is the number of tuples emitted/transferred computed over this window time? What confuses me is that the window shows 10 min statistics before 10 minutes have actually elapsed, which doesn't make sense to me.
For example: emitted = 1764260 tuples, so would the rate of tuple emission be 1764260/600 ≈ 2940 tuples/sec?
It does not display the average; it displays the total number of tuples emitted in the last period of time (10 min, 3 h or 1 day).
Therefore, if you started the application 2 minutes ago, it will display all tuples emitted in the last two minutes, and you'll see that number increase until you get to 10 minutes.
After 10 minutes, it will only show the number of tuples emitted in the last 10 minutes, not an average of the tuples emitted. So if, for example, you started the application 30 minutes ago, it will display the number of tuples emitted between minutes 20 and 30.

Log data reduction for variable bandwidth data link

I have an embedded system which generates samples (16-bit numbers) at 1 millisecond intervals. The variable uplink bandwidth can at best transfer a sample every 5 ms, so I am looking for ways to adaptively reduce the data rate while minimizing the loss of important information -- in this case the minimum and maximum values in a time interval.
A scheme which I think should work involves sparse coding and a variation of lossy compression. Like this:
The system will internally store the min and max values during a 10ms interval.
The system will internally queue a limited number (say 50) of these data pairs.
No loss of min or max values is allowed but the time interval in which they occur may vary.
When the queue gets full, neighboring data pairs will be combined starting at the end of the queue so that the converted min/max pairs now represent 20ms intervals.
The scheme should be iterative so that further interval combining to 40ms, 80ms etc is done when necessary.
The scheme should be linearly weighted across the length of the queue so that there is no combining for the newest data and maximum necessary combining of the oldest data.
For example with a queue of length 6, successive data reduction should cause the data pairs to cover these intervals:
initial: 10 10 10 10 10 10 (60ms, queue full)
70ms: 10 10 10 10 10 20
80ms: 10 10 10 10 20 20
90ms: 10 10 20 20 20 20
100ms: 10 10 20 20 20 40
110ms: 10 10 20 20 40 40
120ms: 10 20 20 20 40 40
130ms: 10 20 20 40 40 40
140ms: 10 20 20 40 40 80
New samples are added on the left, data is read out from the right.
This idea obviously falls into the categories of lossy-compression and sparse-coding.
I assume this is a problem that must occur often in data logging applications with limited uplink bandwidth, so some "standard" solution might have emerged.
I have deliberately simplified and left out other issues such as time stamping.
Questions:
Are there already algorithms which do this kind of data logging? I am not looking for standard lossy picture or video compression algorithms, but something more specific to data logging as described above.
What would be the most appropriate implementation for the queue? Linked list? Tree?
The term you are looking for is "lossy compression" (see: http://en.wikipedia.org/wiki/Lossy_compression). The optimal compression method depends on various aspects, such as the distribution of your data.
As I understand it, you want to transmit the min() and max() of all samples in a time period,
e.g. transmit min/max every 10 ms while taking samples every 1 ms?
If you do not need the individual samples, you can simply compare them after each sample is taken, for example (in C):
#include <stdint.h>

/* getSample() and send() are assumed to be provided elsewhere; one sample arrives per 1 ms tick. */
extern uint16_t getSample(void);
extern void send(uint16_t min, uint16_t max);

void min_max_loop(void)
{
    uint16_t min = UINT16_MAX, max = 0;   /* first sample will always overwrite the initial values */
    unsigned i = 0;
    for (;;) {
        uint16_t sample = getSample();
        if (sample < min) min = sample;
        if (sample > max) max = sample;
        if (++i % 10 == 0) {              /* every 10th sample, i.e. every 10 ms */
            send(min, max);
            /* if each period should be handled separately: min = UINT16_MAX; max = 0; */
        }
    }
}
You can also save bandwidth by sending data only on changes (this depends on the sample data: if the values don't change very quickly, you will save a lot).
Define a combination cost function that matches your needs, e.g. (len(i) + len(i+1)) / i^2, then iterate the array to find the "cheapest" pair to replace.
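As a rough illustration of that idea (the newest-first indexing and all names here are my assumptions, not part of the answer), a scan for the cheapest neighbouring pair could look like this:
public class BinMerge {
    // binLengthsMs[i] holds the interval length (ms) of bin i; i = 0 is the newest, larger i is older.
    static int cheapestPairToMerge(int[] binLengthsMs) {
        int best = -1;
        double bestCost = Double.MAX_VALUE;
        for (int i = 1; i < binLengthsMs.length - 1; i++) {      // i >= 1 avoids dividing by zero
            double cost = (binLengthsMs[i] + binLengthsMs[i + 1]) / (double) (i * i);
            if (cost < bestCost) { bestCost = cost; best = i; }
        }
        return best;  // merge bins best and best+1: keep min of mins, max of maxes, sum of lengths
    }

    public static void main(String[] args) {
        int[] bins = {10, 10, 10, 10, 10, 10};                   // the "initial" row from the example above
        System.out.println(cheapestPairToMerge(bins));           // prints 4: combine the two oldest bins
    }
}
For a queue of around 50 pairs, a plain array or ring buffer, as used in this sketch, would likely be sufficient.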

Algorithm for randomly selecting object

I want to implement a simulation: there are 1000 objects; during a period of 1800 seconds, each object is randomly selected (or whatever the action is); the number of selected objects over time follows a rough distribution: 30% will be selected within 60 seconds, 40% after 60 seconds but within 300 seconds, 20% after 300 seconds but within 600 seconds, and 10% after 600 seconds.
So what is the probability for each object being selected every second?
This might be more appropriate to the Programmers section of StackExchange here: Programmers Exchange
But just taking a quick swipe at this, you select 300 objects in the first 60 seconds, 400 objects in the next 240 seconds, 200 objects in the next 300 seconds, and 100 objects in the last 1200 seconds. That gives you a sense of objects per second for each second of your simulation.
So, for example, you select 5 objects per second for the first 60 seconds, so there is a 5/1000 or 0.5% probability of selecting any specific object in each second of those first 60 seconds.
I think that should lead you to the answer if I understand your question correctly.
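A small sketch of that arithmetic (the per-phase counts and durations are the numbers above; everything else is illustrative):
public class SelectionRates {
    public static void main(String[] args) {
        int totalObjects = 1000;
        // { objects selected in the phase, phase length in seconds }
        int[][] phases = { {300, 60}, {400, 240}, {200, 300}, {100, 1200} };
        for (int[] p : phases) {
            double perSecond   = (double) p[0] / p[1];     // objects selected per second in this phase
            double probability = perSecond / totalObjects; // chance a given object is picked in any one second
            System.out.printf("%d objects over %d s -> %.2f obj/s, p = %.4f per object per second%n",
                    p[0], p[1], perSecond, probability);
        }
    }
}
The first line of output gives 5.00 obj/s and p = 0.0050, i.e. the 0.5% per second mentioned above.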
