Say I have a log file looking like this:
# time, count
2016-09-07 23:00:00, 1108731
2016-09-07 23:00:02, 1108733
2016-09-07 23:00:03, 1108734
Each row contains the cumulative count of all events that have occurred so far. I would like to use this in Kibana, and the natural way would be to have the count as a delta.
So I expect output like:
# time, count, deltaCount
2016-09-07 23:00:00, 1108731, 0
2016-09-07 23:00:02, 1108733, 2
2016-09-07 23:00:03, 1108734, 1
How can I achieve this in Logstash? I know I could edit these files beforehand.
Thanks!
Solution #1: Write your plugin
One way to do it would be to create a plugin. The same problem is solved here. However, the filter that is posted there is not publicly available and, what is worse, it is really just 5 lines of code.
Solution #2: Ruby code snippet
I found a solution in this thread on the Elastic forums: Keeping global variables in LS?!. The title says it all.
To cut a long story short, the solution goes as follows:
filter {
  ...
  ruby {
    init => "@@previous_count = -1"
    code => "
      if (@@previous_count == -1)
        delta = 0
      else
        delta = event.get('count') - @@previous_count
      end
      event.set('deltaCount', delta)
      # remember this event's count for next time
      @@previous_count = event.get('count')
    "
  }
}
Was not that hard after all.
Related
I have an issue where I'm trying to work out whether a certain alert on a webpage is calculating sums correctly. I'm using Capybara and Cucumber.
I have an alert that calculates records that expire within 30 days. When selecting this alert, the records are listed in a table and the date is presented in the following format: "1 feb 2016".
What I want to do is somehow take today's date, compare it to the date returned in the table and ensure that it's >= 30 days from the date in the alert.
I'm able to set today's date in the same format using Time.strftime etc.
When I try things like:
And(/^I can see record "([\d]*)" MOT is calculated due within 30 days$/) do |selection1|
  today = Time.now.strftime('%l %b %Y')
  thirty_days = (today + 30)
  first_30day_mot = first('#clickable-rows > tbody > tr:nth-child(' + selection1 + ') > td:nth-child(3)')
  if today + first_30day_mot <= thirty_days
    puts 'alert correct'
  else
    (error handler here)
  end
end
As you can see, this is quite a mess.
I keep getting the error TypeError: no implicit conversion of Fixnum into String
If anyone can think of a neater way to do this, please put me out of my misery.
Thanks
There are at least a couple of things wrong with your attempt.
You're converting dates to strings and then trying to compare lengths of time with strings. You should be converting strings to dates and then comparing the dates.
#first returns the element in the page, not the contents of the element.
It's not 100% clear from your code what you're trying to do, but from the test name I think you just want to make sure the date in the 3rd td cell (which is in the "1 feb 2016" format) of a given row is less than 30 days from now. If so, the following should do what you want:
require 'date'  # makes Date.parse / Date.today available if not already loaded

mot_element = first("#clickable-rows > tbody > tr:nth-child(#{selection1}) > td:nth-child(3)")
date_of_mot = Date.parse(mot_element.text)

if (date_of_mot - Date.today) < 30
  puts 'alert correct'
else
  # error handler
end
Beyond that, I'm not sure why you're using #first with that selector, since it seems like it should only ever match one element on the page. You might want to swap it for #find instead, which would get you the benefits of Capybara's waiting behaviour. If you do actually need #first, consider passing the minimum: 1 option to make sure it waits a bit for the matching element to appear on the page (if this is the first step after clicking a button to go to a new page, for instance).
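For illustration only, here is a minimal sketch of the two waiting-friendly alternatives, assuming the same selector and selection1 variable from the question:

# #find waits for exactly one matching element and raises if none (or more than one) is found
mot_element = find("#clickable-rows > tbody > tr:nth-child(#{selection1}) > td:nth-child(3)")

# #first with minimum: 1 also waits until at least one match is present before returning
mot_element = first("#clickable-rows > tbody > tr:nth-child(#{selection1}) > td:nth-child(3)", minimum: 1)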
Convert selection1 to a string explicitly (or, better, use string interpolation):
first_30day_mot = first("#clickable-rows > tbody > tr:nth-child(#{selection1}) > td:nth-child(3)")
Also, I suspect that the value one line below it should be converted to an integer before the comparison:
first_30day_mot.to_i <= 30
UPD: OK, I finally got time to take a more thorough look at this. You do not need all this voodoo magic with date calculations:
# today = Time.now.strftime('%l %b %Y')  # today will be a string " 3 Feb 2016"
# thirty_days = (today + 30)             # this was causing the error
# correct:
# today = DateTime.now                   # correct, but not needed
# plus_30_days = today + 30.days         # correct, but not needed (requires ActiveSupport)
first_30day_mot = first("#clickable-rows > tbody > tr:nth-child(#{selection1}) > td:nth-child(3)")
if 30 > first_30day_mot.text.to_i
  ...
Hope it helps.
I'd strongly recommend not using Cucumber to do this sort of test. You'll find that it:
Is quite hard to set up
Has a high runtime cost
Doesn't give enough benefit to justify the setup/runtime costs
Instead consider writing a unit test of the thing that provides the date. Generally a good unit test can easily run 10 to 100 times faster than a scenario.
Whilst with a single scenario you won't experience that much pain, once you have a lot of scenarios like this the pain will accumulate. Part of the art of using Cucumber is to get plenty of bang for each scenario you write.
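To make this concrete, here is a hypothetical sketch of such a unit test; it assumes the expiry check has been extracted into a pure helper method (due_within_30_days? is an invented name, used purely for illustration):

require 'date'
require 'minitest/autorun'

# Hypothetical helper extracted from the application code (invented for this sketch)
def due_within_30_days?(date)
  (date - Date.today) < 30
end

class DueWithin30DaysTest < Minitest::Test
  def test_date_inside_the_window_is_flagged
    assert due_within_30_days?(Date.today + 10)
  end

  def test_date_outside_the_window_is_not_flagged
    refute due_within_30_days?(Date.today + 45)
  end
end

A test like this exercises the date logic directly, without launching a browser or rendering the page.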
It runs correctly, but it should have around 500 matches and it only has around 50, and I don't know why!
This is a problem for my comp-sci class that I am having issues with.
We had to make a function that checks a list for duplicates. I got that part, but then we had to apply it to the birthday paradox (more info here: http://en.wikipedia.org/wiki/Birthday_problem). That's where I am running into a problem, because my teacher said that the total number of matches should be around 500, or 50%, but for me it is only around 50-70 times, or 5%.
duplicateNumber=0
import random

def has_duplicates(listToCheck):
    for i in listToCheck:
        x=listToCheck.index(i)
        del listToCheck[x]
        if i in listToCheck:
            return True
        else:
            return False

listA=[1,2,3,4]
listB=[1,2,3,1]
#print has_duplicates(listA)
#print has_duplicates(listB)

for i in range(0,1000):
    birthdayList=[]
    for i in range(0,23):
        birthday=random.randint(1,365)
        birthdayList.append(birthday)
    x= has_duplicates(birthdayList)
    if x==True:
        duplicateNumber+=1
    else:
        pass

print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with at least one match. The approximate probability is", round(((duplicateNumber/1000)*100),3),"%"
This code gave me a result in line with what you were expecting:
import random

duplicateNumber=0

def has_duplicates(listToCheck):
    number_set = set(listToCheck)
    if len(number_set) != len(listToCheck):
        return True
    else:
        return False

for i in range(0,1000):
    birthdayList=[]
    for j in range(0,23):
        birthday=random.randint(1,365)
        birthdayList.append(birthday)
    x = has_duplicates(birthdayList)
    if x==True:
        duplicateNumber+=1

print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with at least one match. The approximate probability is", round(((duplicateNumber/1000.0)*100),3),"%"
The first change I made was tidying up the indices you were using in those nested for loops. You'll see I changed the second one to j, as they were previously both i.
The big one, though, was to the has_duplicates function. The basic principle here is that creating a set out of the incoming list gets the unique values in the list. By comparing the number of items in the number_set to the number in listToCheck we can judge whether there are any duplicates or not.
Here is what you are looking for. As it is not standard practice to just throw code at a new user, I apologize if this offends any other users. However, I believe showing the OP a correct, well-documented way to write a program does us all a favor, since carrying a lack of documentation further into a career helps no one.
Thus, please take a careful look at the code and fill in the blanks. Look up the Python documentation (as dry as it is), and try to understand the things that you don't get right away. Even if you understand something just by its name, it is still wise to see what is actually happening when a built-in method is used.
Last, but not least, take a look at this code, and take a look at your code. Note the differences, and keep trying to write your code from scratch (without looking at mine), and if it messes up, see where you went wrong, and start over. This sort of practice is key if you wish to succeed later on in programming!
def same_birthdays():
    import random
    '''
    This is a program that does ________. It is really important
    that we tell readers of this code what it does, so that the
    reader doesn't have to piece all of the puzzles together,
    while the key is right there, in the mind of the programmer.
    '''
    count = 0
    #Count is going to store the number of times that we have the same birthdays
    timesToRun = 1000 #timesToRun should probably be a parameter
    #timesToRun is clearly defined in its name as well. Further elaboration
    #on its purpose is not necessary.
    for i in range(0,timesToRun):
        birthdayList = []
        for j in range(0,23):
            random_birthday = random.randint(1,365)
            birthdayList.append(random_birthday)
        birthdayList = sorted(birthdayList) #sorting for easier matching
        #If we really wanted to, we could add a check in the above nested
        #for loop to spot a duplicate right away.
        #But again, we are here
        for j in range(0, len(birthdayList)-1):
            if (birthdayList[j] == birthdayList[j+1]):
                count+=1
                break #leaving this nested for-loop
    return count
If you wish to find the percentage instead, get rid of the above return statement and add:
    return 100.0 * count / timesToRun
Here's a solution that doesn't use set(). It also takes a different approach with the array, so that each index represents a day of the year. I also removed the has_duplicates() function.
import random

sim_total=0
birthdayList=[]

#initialize an array of 0's representing each calendar day
for i in range(365):
    birthdayList.append(0)

for i in range(0,1000):
    first_dup=True
    for n in range(365):
        birthdayList[n]=0
    for b in range(0, 23):
        r = random.randint(0,364)
        birthdayList[r]+=1
        if (birthdayList[r] > 1) and (first_dup==True):
            sim_total+=1
            first_dup=False

avg = float(sim_total) / 1000 * 100
print "after 1000 simulations with 23 students there were", sim_total,"simulations with at least one duplicate. The approximate probability is", round(avg,3),"%"
I'm running some Ruby scripts concurrently using Grosser/Parallel.
During each concurrent test I want to add up the number of times a particular thing has happened, then display that number.
Let's say:
def main
  $this_happened = 0
  do_this_in_parallel
  puts $this_happened
end

def do_this_in_parallel
  Parallel.each(...) {
    $this_happened += 1
  }
end
The final value after do_this_in_parallel has finished will always be 0.
I'd like to know why this happens.
How can I get the desired result, which would be $this_happened > 0?
Thanks.
This doesn't work because separate processes have separate memory spaces: setting variables etc. in one process has no effect on what happens in the other process.
However, you can return a result from your block (because under the hood Parallel sets up pipes so that the processes can be fed input and return results). For example, you could do this:
counts = Parallel.map(...) do
  # the return value of the block should
  # be the number of times the event occurred
end
Then just sum the counts to get your total count (e.g. counts.reduce(:+)). You might also want to read up on map-reduce for more information about this way of parallelising work.
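For illustration, here is a minimal runnable sketch of that map-and-sum pattern; the items list and the even-number check are placeholders standing in for whatever event you are actually counting:

require 'parallel'

items = (1..100).to_a

# Each worker returns 1 when the event happened for its item, 0 otherwise;
# the per-item results are sent back to the parent process over pipes.
counts = Parallel.map(items) do |item|
  item.even? ? 1 : 0
end

total = counts.reduce(0, :+)
puts total  # => 50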
I have never used Parallel, but the documentation seems to suggest that something like this might work:
Parallel.each(..., :finish => lambda {|*_| $this_happened += 1}) { do_work }
Update: I've provided a brief analysis of the three answers at the bottom of the question text and explained my choices.
My Question: What is the most efficient method of building a fixed interval dataset from a random interval dataset using stale data?
Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring, say, every 5 minutes. Call it Output. One of the most common methods to build this dataset is using stale data, i.e. setting each observation in Output equal to the most recently occurring observation in Input.
So, here is some code to build example datasets:
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];
Both datasets start at close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals while the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answers.
Currently, I solve the problem like this:
sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
    if Input(t, 1) > Output(s, 1)
        %#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
        Output(s, 2:end) = Input(t-1, 2:end);
        s = s + 1;
        %#Check if we've filled out all observations in output
        if s > sMax
            break
        end
        %#This step is necessary in case we need to use the same input observation twice in a row
        t = t - 1;
    end
    t = t + 1;
    if t > tMax
        %#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output
        Output(s:end, 2:end) = Input(end, 2:end);
        break
    end
end
Surely there is a more efficient, or at least, more elegant way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some in-built function I'm not aware of? Any help would be much appreciated as I use this routine a LOT for some large datasets.
THE ANSWERS: Hi all, I've analyzed the three answers, and as they stand, Angainor's is the best.
ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside of the speed test. I'm guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the intervening time. Also, this method needs an additional line for the case where Output ends more than one observation after Input.
Rody's answer runs only slightly slower than Angainor's (unsurprising given they both employ the histc approach), however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem which I think stems from having InputTimeStamp as the first input to histc instead of the OutputTimeStamp adopted by Angainor. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.
Angainor's appears robust to everything I threw at it, plus it was the fastest.
I did a lot of speed tests for different input specifications - the following numbers are fairly representative:
My naive loop: Elapsed time is 8.579535 seconds.
Angainor: Elapsed time is 0.661756 seconds.
Rody: Elapsed time is 0.913304 seconds.
ChthonicDaemon: Elapsed time is 22.916844 seconds.
I'm +1-ing Angainor's solution and marking the question solved.
This "stale data" approach is known as a zero order hold in signal and timeseries fields. Searching for this quickly brings up many solutions. If you have Matlab 2012b, this is all built in to the timeseries class by using the resample function, so you would simply do
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
Here is my take on the problem. histc is the way to go:
% find Output timestamps in Input bins
N = histc(Output(:,1), Input(:,1));
% find counts in the non-empty bins
counts = N(find(N));
% find Input signal value associated with every bin
val = Input(find(N),2);
% now, replicate every entry in val
% as many times as specified in counts
index = zeros(1,sum(counts));
index(cumsum([1 counts(1:end-1)'])) = 1;
index = cumsum(index);
val_rep = val(index);
% finish the signal with last entry from Input, as needed
val_rep(end+1:size(Output,1)) = Input(end,2);
% done
Output(:,2) = val_rep;
I checked against your procedure for a few different input models (I changed the number of Output timestamps) and the results are the same. However, I am still not sure I understood your problem, so if something is wrong here let me know.
looper = (0..3).cycle
20.times { puts looper.next }
Can I somehow find the next value after 3? I mean, can I get .next of any particular element at any given time, not just run a loop that starts with the first element?
UPDATE
Of course I went through the Ruby docs before posting my question, but I did not find the answer there...
UPDATE2
input
looper = (0..max_cycle).cycle
max_cycle = a variable that can be different every time the script runs
looper = a variable that always ranges over the interval (0..max_cycle), but its current value when the script starts could be anything; it is based on Time.now.hour
output
I want to know the .next value of looper at any time while the script is running.
Your question is not very clear. Maybe you want something like this (a small sketch follows the table below)?
(current_value + 1) % (max_cycle + 1)
If, for example, max_cycle = 3 you will have the following output:
current_value returns
0 1
1 2
2 3
3 0
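As a minimal sketch of that formula (the variable names follow the question, and deriving the starting value from Time.now.hour is just one possible choice):

max_cycle = 3
current_value = Time.now.hour % (max_cycle + 1)  # current position, seeded from the hour as described in the question
next_value = (current_value + 1) % (max_cycle + 1)
puts next_value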
http://ruby-doc.org/core-1.9/classes/Enumerable.html#M003074