Timeseries transformations in Ruby, Yahoo! Pipes-style - ruby

I'm trying to build a system for programmatically filtering timeseries data and wonder if this problem has been solved, or at least hacked at, before. It seems that it's a perfect opportunity to do some Ruby block magic, given the scope and passing abilities; however, I'm still a bit short of fully grokking how to take advantage of blocks.
To wit:
Pulling data from my database, I can create either a hash or an array, let's use array:
data = [[timestamp0, value0],[timestamp1,value1], … [timestampN, valueN]]
Then I can add a method to array, maybe something like:
class Array
def filter &block
…
self.each_with_index do |v, i|
…
# Always call with timestep, value, index
block.call(v[0], v[1], i)
…
end
end
end
I understand that one powers of Ruby blocks is that the passed block of code happens within the scope of the closure. So somehow calling data.filter should allow me to work with that scope. I can only figure out how to do that without taking advantage of the scope. To wit:
# average if we have a single null value, assumes data is correctly ordered
data.filter do |t, v, i|
# Of course, we do some error checking…
(data[i-1] + data[i+1]) / 2 if v.nil?
end
What I want to do is actually is (allow the user to) build up mathematical filters programmatically, but taking it one step at a time, we'll build some functions:
def average_single_values(args)
#average over single null values
#return filterable array
end
def filter_by_std(args)
#limit results to those within N standard deviations
#return filterable array
end
def pull_bad_values(args)
#delete or replace values seen as "bad"
#return filterable array
end
my_filters == [average_single_values, filter_by_std, pull_bad_values]
Then, having a list of filters, I figure (somehow) I should be able to do:
data.filter do |t, v, i|
my_filters.each do |f|
f.call t, v, i
end
end
or, assuming a different filter implementation:
filtered_data = data.filter my_filters
which would probably be a better way to design it, as it returns a new array and is non-destructive
The result being an array that has been run through all of the filters. The eventual goal, is to be able to have static data arrays that can be run through arbitrary filters, and filters that can be passed (and shared) as objects the way that Yahoo! Pipes does so with feeds. I'm not looking for too generalized a solution right now, I can make the format of the array/returns strict.
Has anyone seen something similar in Ruby? Or have some basic pointers?

The first half of your question about working in the scope of the array seems unnecessary and irrelevant to your problem. As for creating operations to manipulate data with blocks, you can use Proc instances ("procs"), which essentially are blocks stored in an object. For example, if you want to store them with names, you can create a hash of filters:
my_filters = {}
my_filters[:filter_name] = lambda do |*args|
# filter body here...
end
You do not need to name them, of course, and can use arrays. Then, to run some data through an ordered series of filters, use the helpful Enumerable#inject method:
my_filters.inject(data) do |result, filter|
filter.call result
end
It uses no monkeypatching too!

Related

Loop method until it returns falsey

I was trying to make my bubble sort shorter and I came up with this
class Array
def bubble_sort!(&block)
block = Proc.new { |a, b| a <=> b } unless block_given?
sorted = each_index.each_cons(2).none? do |i, next_i|
if block.call(self[i], self[next_i]) == 1
self[i], self[next_i] = self[next_i], self[i]
end
end until sorted
self
end
def bubble_sort(&prc)
self.dup.bubble_sort!(&prc)
end
end
I don't particularly like the thing with sorted = --sort code-- until sorted.
I just want to run the each_index.each_cons(s).none? code until it returns true. It's a weird situation that I use until, but the condition is a code I want to run. Any way, my try seems awkward, and ruby usually has a nice concise way of putting things. Is there a better way to do this?
This is just my opinion
have you ever read the ruby source code of each and map to understand what they do?
No, because they have a clear task expressed from the method name and if you test them, they will take an object, some parameters and then return a value to you.
For example if I want to test the String method split()
s = "a new string"
s.split("new")
=> ["a ", " string"]
Do you know if .split() takes a block?
It is one of the core ruby methods, but to call it I don't pass a block 90% of the times, I can understand what it does from the name .split() and from the return value
Focus on the objects you are using, the task the methods should accomplish and their return values.
I read your code and I can not refactor it, I hardly can understand what the code does.
I decided to write down some points, with possibility to follow up:
1) do not use the proc for now, first get the Object Oriented code clean.
2) split bubble_sort! into several methods, each one with a clear task
def ordered_inverted! (bubble_sort!), def invert_values, maybe perform a invert_values until sorted, check if existing methods already perform this sorting functionality
3) write specs for those methods, tdd will push you to keep methods simple and easy to test
4) If those methods do not belong to the Array class, include them in the appropriate class, sometimes overly complicated methods are just performing simple String operations.
5) Reading books about refactoring may actually help more then trying to force the usage of proc and functional programming when not necessary.
After looking into it further I'm fairly sure the best solution is
loop do
break if condition
end
Either that or the way I have it in the question, but I think the loop do version is clearer.
Edit:
Ha, a couple weeks later after I settled for the loop do solution, I stumbled into a better one. You can just use a while or until loop with an empty block like this:
while condition; end
until condition; end
So the bubble sort example in the question can be written like this
class Array
def bubble_sort!(&block)
block = Proc.new { |a, b| a <=> b } unless block_given?
until (each_index.each_cons(2).none? do |i, next_i|
if block.call(self[i], self[next_i]) == 1
self[i], self[next_i] = self[next_i], self[i]
end
end); end
self
end
def bubble_sort(&prc)
self.dup.bubble_sort!(&prc)
end
end

How does a code block in Ruby know what variable belongs to an aspect of an object?

Consider the following:
(1..10).inject{|memo, n| memo + n}
Question:
How does n know that it is supposed to store all the values from 1..10? I'm confused how Ruby is able to understand that n can automatically be associated with (1..10) right away, and memo is just memo.
I know Ruby code blocks aren't the same as the C or Java code blocks--Ruby code blocks work a bit differently. I'm confused as to how variables that are in between the upright pipes '|' will automatically be assigned to parts of an object. For example:
hash1 = {"a" => 111, "b" => 222}
hash2 = {"b" => 333, "c" => 444}
hash1.merge(hash2) {|key, old, new| old}
How do '|key, old, new|' automatically assign themselves in such a way such that when I type 'old' in the code block, it is automatically aware that 'old' refers to the older hash value? I never assigned 'old' to anything, just declared it. Can someone explain how this works?
The parameters for the block are determined by the method definition. The definition for reduce/inject is overloaded (docs) and defined in C, but if you wanted to define it, you could do it like so (note, this doesn't cover all the overloaded cases for the actual reduce definition):
module Enumerable
def my_reduce(memo=nil, &blk)
# if a starting memo is not given, it defaults to the first element
# in the list and that element is skipped for iteration
elements = memo ? self : self[1..-1]
memo ||= self[0]
elements.each { |element| memo = blk.call(memo, element) }
memo
end
end
This method definition determines what values to use for memo and element and calls the blk variable (a block passed to the method) with them in a specific order.
Note, however, that blocks are not like regular methods, because they don't check the number of arguments. For example: (note, this example shows the usage of yield which is another way to pass a block parameter)
def foo
yield 1
end
# The b and c variables here will be nil
foo { |a, b, c| [a,b,c].compact.sum } # => 1
You can also use deconstruction to define variables at the time you run the block, for example if you wanted to reduce over a hash you could do something like this:
# this just copies the hash
{a: 1}.reduce({}) { |memo, (key, val)| memo[key] = val; memo }
How this works is, calling reduce on a hash implicitly calls to_a, which converts it to a list of tuples (e.g. {a: 1}.to_a = [[:a, 1]]). reduce passes each tuple as the second argument to the block. In the place where the block is called, the tuple is deconstructed into separate key and value variables.
A code block is just a function with no name. Like any other function, it can be called multiple times with different arguments. If you have a method
def add(a, b)
a + b
end
How does add know that sometimes a is 5 and sometimes a is 7?
Enumerable#inject simply calls the function once for each element, passing the element as an argument.
It looks a bit like this:
module Enumerable
def inject(memo)
each do |el|
memo = yield memo, el
end
memo
end
end
And memo is just memo
what do you mean, "just memo"? memo and n take whatever values inject passes. And it is implemented to pass accumulator/memo as first argument and current collection element as second argument.
How do '|key, old, new|' automatically assign themselves
They don't "assign themselves". merge assigns them. Or rather, passes those values (key, old value, new value) in that order as block parameters.
If you instead write
hash1.merge(hash2) {|foo, bar, baz| bar}
It'll still work exactly as before. Parameter names mean nothing [here]. It's actual values that matter.
Just to simplify some of the other good answers here:
If you are struggling understanding blocks, an easy way to think of them is as a primitive and temporary method that you are creating and executing in place, and the values between the pipe characters |memo| is simply the argument signature.
There is no special special concept behind the arguments, they are simply there for the method you are invoking to pass a variable to, like calling any other method with an argument. Similar to a method, the arguments are "local" variables within the scope of the block (there are some nuances to this depending on the syntax you use to call the block, but I digress, that is another matter).
The method you pass the block to simply invokes this "temporary method" and passes the arguments to it that it is designed to do. Just like calling a method normally, with some slight differences, such as there are no "required" arguments. If you do not define any arguments to receive, it will happily just not pass them instead of raising an ArgumentError. Likewise, if you define too many arguments for the block to receive, they will simply be nil within the block, no errors for not being defined.

Chaining partition, keep_if etc

[1,2,3].partition.inject(0) do |acc, x|
x>2 # this line is intended to be used by `partition`
acc+=x # this line is intended to be used by `inject`
end
I know that I can write above stanza using different methods but this is not important here.
What I want to ask why somebody want to use partition (or other methods like keep_if, delete_if) at the beginning of the "chain"?
In my example, after I chained inject I couldn't use partition. I can write above stanza using each:
[1,2,3].each.inject(0) do |acc, x|
x>2 # this line is intended to be used by `partition`
acc+=x # this line is intended to be used by `inject`
end
and it will be the same, right?
I know that x>2 will be discarded (and not used) by partition. Only acc+=x will do the job (sum all elements in this case).
I only wrote that to show my "intention": I want to use partition in the chain like this [].partition.inject(0).
I know that above code won't work as I intended and I know that I can chain after block( }.map as mentioned by Neil Slater).
I wanted to know why, and when partition (and other methods like keep_if, delete_if etc) becomes each (just return elements of the array as partition do in the above cases).
In my example, partition.inject, partition became each because partition cannot take condition (x>2).
However partition.with_index (as mentioned by Boris Stitnicky) works (I can partition array and use index for whatever I want):
shuffled_array
.partition
.with_index { |element, index|
element > index
}
ps. This is not question about how to get sum of elements that are bigger than 2.
This is an interesting situation. Looking at your code examples, you are obviously new to Ruby and perhaps also to programming. Yet you managed to ask a very difficult question that basically concerns the Enumerator class, one of the least publicly understood classes, especially since Enumerator::Lazy was introduced. To me, your question is difficult enough that I am not able to provide a comprehensive answer. Yet the remarks about your code would not fit into a comment under the OP. That's why I'm adding this non-answer.
First of all, let us notice a few awful things in your code:
Useless lines. In both blocks, x>2 line is useless, because its return value is discarded.
[1,2,3].partition.inject(0) do |x, acc|
x>2 # <---- return value of this line is never used
acc+=x
end
[1,2,3].each.inject(0) do |x, acc|
x>2 # <---- return value of this line is never used
acc+=x
end
I will ignore this useless line when discussing your code examples further.
Useless #each method. It is useless to write
[1,2,3].each.inject(0) do |x, acc|
acc+=x
end
This is enough:
[1,2,3].inject(0) do |x, acc|
acc+=x
end
Useless use of #partition method. Instead of:
[1,2,3].partition.inject(0) do |x, acc|
acc+=x
end
You can just write this:
[1,2,3].inject(0) do |x, acc|
acc+=x
end
Or, as I would write it, this:
[ 1, 2, 3 ].inject :+
But then, you ask a deep question about using #partition method in the enumerator mode. Having discussed the trivial newbie problems of your code, we are left with the question how exactly the enumerator-returning versions of the #partition, #keep_if etc. should be used, or rather, what are the interesting way of using them, because everyone knows that we can use them for chaining:
array = [ *1..6 ]
shuffled_arrray = array.shuffle # randomly shuffles the array elements
shuffled_array
.partition # partition enumerator comes into play
.with_index { |element, index| # method Enumerator#with_index comes into play
element > index # and partitions elements into those greater
} # than their index, and those smaller
And also like this:
e = partition_enumerator_of_array = array.partition
# And then, we can partition the array in many ways:
e.each &:even? # partitions into odd / even numbers
e.each { rand() > 0.5 } # partitions the array randomly
# etc.
An easily understood advantage is that instead of writing longer:
array.partition &:even?
You can write shorter:
e.each &:even?
But I am basically sure that enumerators provide more power to the programmer than just chaining collection methods and shortening code a little bit. Because different enumerators do very different things. Some, such as #map! or #reject!, can even modify the collection on which they operate. In this case, it is imaginable that one could combine different enumerators with the same block to do different things. This ability to vary not just the blocks, but also the enumerators to which they are passed, gives combinatorial power, which can very likely be used to make some otherwise lengthy code very concise. But I am unable to provide a very useful concrete example of this.
In sum, Enumerator class is here mainly for chaining, and to use chaining, programmers do not really need to undestand Enumerator in detail. But I suspect that the correct habits regarding the use of Enumerator might be as difficult to learn as, for instance, correct habits of parametrized subclassing. I suspect I have not grasped the most powerful ways to use enumerators yet.
I think that the result [3, 3] is what you are looking for here - partitioning the array into smaller and larger numbers then summing each group. You seem to be confused about how you give the block "rules" to the two different methods, and have merged what should be two blocks into one.
If you need the net effects of many methods that each take a block, then you can chain after any block, by adding the .method after the close of the block like this: }.each or end.each
Also note that if you create partitions, you are probably wanting to sum over each partition separately. To do that you will need an extra link in the chain (in this case a map):
[1,2,3].partition {|x| x > 2}.map do |part|
part.inject(0) do |acc, x|
x + acc
end
end
# => [3, 3]
(You also got the accumulator and current value wrong way around in the inject, and there is no need to assign to the accumulator, Ruby does that for you).
The .inject is no longer in a method chain, instead it is inside a block. There is no problem with blocks inside other blocks, in fact you will see this very often in Ruby code.
I have chained .partition and .map in the above example. You could also write the above like this:
[1,2,3].partition do
|x| x > 2
end.map do |part|
part.inject(0) do |acc, x|
x + acc
end
end
. . . although when chaining with short blocks, I personally find it easier to use the { } syntax instead of do end, especially at the start of a chain.
If it all starts to look complex, there is not usually a high cost to assigning the results of the first part of a chain to a local variable, in which case there is no chain at all.
parts = [1,2,3].partition {|x| x > 2}
parts.map do |part|
part.inject(0) do |acc, x|
x + acc
end
end

undefined method `assoc' for #<Hash:0x10f591518> (NoMethodError)

I'm trying to return a list of values based on user defined arguments, from hashes defined in the local environment.
def my_method *args
#initialize accumulator
accumulator = Hash.new(0)
#define hashes in local environment
foo=Hash["key1"=>["var1","var2"],"key2"=>["var3","var4","var5"]]
bar=Hash["key3"=>["var6"],"key4"=>["var7","var8","var9"],"key5"=>["var10","var11","var12"]]
baz=Hash["key6"=>["var13","var14","var15","var16"]]
#iterate over args and build accumulator
args.each do |x|
if foo.has_key?(x)
accumulator=foo.assoc(x)
elsif bar.has_key?(x)
accumulator=bar.assoc(x)
elsif baz.has_key?(x)
accumulator=baz.assoc(x)
else
puts "invalid input"
end
end
#convert accumulator to list, and return value
return accumulator = accumulator.to_a {|k,v| [k].product(v).flatten}
end
The user is to call the method with arguments that are keywords, and the function to return a list of values associated with each keyword received.
For instance
> my_method(key5,key6,key1)
=> ["var10","var11","var12","var13","var14","var15","var16","var1","var2"]
The output can be in any order. I received the following error when I tried to run the code:
undefined method `assoc' for #<Hash:0x10f591518> (NoMethodError)
Please would you point me how to troubleshoot this? In Terminal assoc performs exactly how I expect it to:
> foo.assoc("key1")
=> ["var1","var2"]
I'm guessing you're coming to Ruby from some other language, as there is a lot of unnecessary cruft in this method. Furthermore, it won't return what you expect for a variety of reasons.
`accumulator = Hash.new(0)`
This is unnecessary, as (1), you're expecting an array to be returned, and (2), you don't need to pre-initialize variables in ruby.
The Hash[...] syntax is unconventional in this context, and is typically used to convert some other enumerable (usually an array) into a hash, as in Hash[1,2,3,4] #=> { 1 => 2, 3 => 4}. When you're defining a hash, you can just use the curly brackets { ... }.
For every iteration of args, you're assigning accumulator to the result of the hash lookup instead of accumulating values (which, based on your example output, is what you need to do). Instead, you should be looking at various array concatenation methods like push, +=, <<, etc.
As it looks like you don't need the keys in the result, assoc is probably overkill. You would be better served with fetch or simple bracket lookup (hash[key]).
Finally, while you can call any method in Ruby with a block, as you've done with to_a, unless the method specifically yields a value to the block, Ruby will ignore it, so [k].product(v).flatten isn't actually doing anything.
I don't mean to be too critical - Ruby's syntax is extremely flexible but also relatively compact compared to other languages, which means it's easy to take it too far and end up with hard to understand and hard to maintain methods.
There is another side effect of how your method is constructed wherein the accumulator will only collect the values from the first hash that has a particular key, even if more than one hash has that key. Since I don't know if that's intentional or not, I'll preserve this functionality.
Here is a version of your method that returns what you expect:
def my_method(*args)
foo = { "key1"=>["var1","var2"],"key2"=>["var3","var4","var5"] }
bar = { "key3"=>["var6"],"key4"=>["var7","var8","var9"],"key5"=>["var10","var11","var12"] }
baz = { "key6"=>["var13","var14","var15","var16"] }
merged = [foo, bar, baz].reverse.inject({}, :merge)
args.inject([]) do |array, key|
array += Array(merged[key])
end
end
In general, I wouldn't define a method with built-in data, but I'm going to leave it in to be closer to your original method. Hash#merge combines two hashes and overwrites any duplicate keys in the original hash with those in the argument hash. The Array() call coerces an array even when the key is not present, so you don't need to explicitly handle that error.
I would encourage you to look up the inject method - it's quite versatile and is useful in many situations. inject uses its own accumulator variable (optionally defined as an argument) which is yielded to the block as the first block parameter.

Cannot understand what the following code does

Can somebody explain to me what the below code is doing. before and after are hashes.
def differences(before, after)
before.diff(after).keys.sort.inject([]) do |diffs, k|
diff = { :attribute => k, :before => before[k], :after => after[k] }
diffs << diff; diffs
end
end
It is from the papertrail differ gem.
It's dense code, no question. So, as you say before and after are hash(-like?) objects that are handed into the method as parameters. Calling before.diff(after) returns another hash, which then immediately has .keys called on it. That returns all the keys in the hash that diff returned. The keys are returned as an array, which is then immediately sorted.
Then we get to the most complex/dense bit. Using inject on that sorted array of keys, the method builds up an array (called diffs inside the inject block) which will be the return value of the inject method.
That array is made up of records of differences. Each record is a hash - built up by taking one key from the sorted array of keys from the before.diff(after) return value. These hashes store the attribute that's being diffed, what it looked like in the before hash and what it looks like in the after hash.
So, in a nutshell, the method gets a bunch of differences between two hashes and collects them up in an array of hashes. That array of hashes is the final return value of the method.
Note: inject can be, and often is, much, much simpler than this. Usually it's used to simply reduce a collection of values to one result, by applying one operation over and over again and storing the results in an accumlator. You may know inject as reduce from other languages; reduce is an alias for inject in Ruby. Here's a much simpler example of inject:
[1,2,3,4].inject(0) do |sum, number|
sum + number
end
# => 10
0 is the accumulator - the initial value. In the pair |sum, number|, sum will be the accumulator and number will be each number in the array, one after the other. What inject does is add 1 to 0, store the result in sum for the next round, add 2 to sum, store the result in sum again and so on. The single final value of the accumulator sum will be the return value. Here 10. The added complexity in your example is that the accumulator is different in kind from the values inside the block. That's less common, but not bad or unidiomatic. (Edit: Andrew Marshall makes the good point that maybe it is bad. See his comment on the original question. And #tokland points out that the inject here is just a very over-complex alternative for map. It is bad.) See the article I linked to in the comments to your question for more examples of inject.
Edit: As #tokland points out in a few comments, the code seems to need just a straightforward map. It would read much easier then.
def differences(before, after)
before.diff(after).keys.sort.map do |k|
{ :attribute => k, :before => before[k], :after => after[k] }
end
end
I was too focused on explaining what the code was doing. I didn't even think of how to simplify it.
It finds the entries in before and after that differ according to the underlying objects, then builds up a list of those differences in a more convenient format.
before.diff(after) finds the entries that differ.
keys.sort gives you the keys (of the map of differences) in sorted order
inject([]) is like map, but starts with diffs initialized to an empty array.
The block creates a diff line (a hash) for each of these differences, and then appends it to diffs.

Resources