Idiomatic lazy sorting by multiple criteria - ruby

In Ruby, the most common way to sort by multiple criteria is to use sort_by with the sorting function returning an array of the values corresponding to each sorting criterion, in order of decreasing importance, e.g.:
Dir["*"].sort_by { |f| [test(?s, f) || 0, test(?M, f), f] }
will sort the directory entries by size, then by mtime, then finally by the filename. This is efficient to the extent that it uses a Schwartzian transform to only calculate the size and mtime of each file once, not once per comparison. However, it is not truly lazy: it calculates the mtime for every single file, even though if (say) every file in the directory had a different size, no mtimes would need to be calculated at all.
This is not a big problem in this case, since looking up the mtime immediately after looking up the size should be efficient due to caching at the kernel level (e.g. IIRC on Linux they both come from a stat(2) syscall), and I wouldn't be surprised if Ruby has its own optimizations too. But imagine if the second criterion was not the mtime, but (say) the number of occurrences of a string within the file, and the files in question are huge. In this case you'd really want lazy evaluation, to avoid reading the whole of these huge files if sorting by size is sufficient.
At the time of writing, the Wikibooks entry for Algorithm Implementation/Sorting/Schwartzian transform suggests this solution:
sorted_files =
Dir["*"]. # Get all files
# compute tuples of name, size, modtime
collect{|f| [f, test(?s, f), test(?M, f)]}.
sort {|a, b| # sort
a[1] <=> b[1] or # -- by increasing size
b[2] <=> a[2] or # -- by age descending
a[0] <=> b[0] # -- by name
}.collect{|a| a[0]} # extract original name
This kind of approach is copied from Perl, where
sort {
$a->[1] <=> $b->[1] # sort first numerically by size (smallest first)
or $b->[2] <=> $a->[2] # then numerically descending by modtime age (oldest first)
or $a->[0] cmp $b->[0] # then stringwise by original name
}
works beautifully because Perl has a quirk where 0 or $foo evaluates to $foo. But in Ruby, it's broken because 0 or foo evaluates to 0. So in effect, the Wikibooks implementation totally ignores mtimes and filenames, and only sorts by size. I've dusted off my Wikibooks account so that I can fix this, but I'm wondering: what is the cleanest way of combining the results of multiple <=> spaceship operator comparisons in Ruby?
I'll give a concrete-ish example to clarify the question. Let's assume we have two types of evaluation which may be required as criteria during the sort. The first is relatively cheap:
def size(a)
# get the size of file `a`, and if we're feeling keen,
# memoize the results
...
end
The second is expensive:
def matches(a)
# count the number of occurrences of a string
# in file `a`, which could be a large file, and
# memoize the results
...
end
And we want to sort first by size ascending, then descending by number of matches. We can't use a Schwartzian transform, because that would non-lazily call matches() on every item.
We could define a helper like
def nil_if_equal(result)
result == 0 ? nil : result
end
and then do:
sort {|a, b|
nil_if_equal(size(a) <=> size(b)) or
matches(b) <=> matches(a)
}
If there are n criteria to sort by then you'd need n-1 invocations of nil_if_equal here, since only the last sorting criterion doesn't require it.
So is there a more idiomatic way than this which can avoid the need for nil_if_equal?
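One candidate worth noting is Numeric#nonzero?, which returns the receiver when it is non-zero and nil otherwise, i.e. the same behaviour as nil_if_equal. The following is only a sketch: the FILE_INFO table and the MATCH_CALLS counter are hypothetical stand-ins for real files, used to show that the expensive criterion is skipped whenever the sizes already differ. It also assumes every comparison returns an Integer; if <=> returned nil for incomparable values, nonzero? would raise.

```ruby
# Hypothetical data: name => [size, match_count]. Stands in for real files.
FILE_INFO = { "a.log" => [10, 3], "b.log" => [20, 1], "c.log" => [10, 7] }
MATCH_CALLS = Hash.new(0) # records how often the expensive criterion runs

def size(name)
  FILE_INFO.fetch(name)[0]
end

def matches(name)
  MATCH_CALLS[name] += 1
  FILE_INFO.fetch(name)[1]
end

sorted = FILE_INFO.keys.sort do |a, b|
  # nonzero? gives nil on a tie, so || falls through to the next criterion
  (size(a) <=> size(b)).nonzero? || (matches(b) <=> matches(a))
end

p sorted # ["c.log", "a.log", "b.log"]
```

Here matches is never called for "b.log", because its size is unique. Unlike the sort_by version, nothing is cached between comparisons, so memoizing size and matches (as suggested above) still matters.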

No idea how idiomatic it is, but here's a way to use sort_by again. Instead of
for example
['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, count_a(s), count_b(s)] }
do this to make count_a(s) and count_b(s) lazy and memoized:
['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, lazy{count_a(s)}, lazy{count_b(s)}] }
My lazy makes the block act like a lazy and memoizing version of the value it yields.
Demo output, showing we only count what's necessary (i.e., don't count in 'so' since it has a unique size and don't count 'b' in 'foo' since its 'a'-count is unique among the size-3 strings):
Counting 'a' in 'bab'.
Counting 'a' in 'foo'.
Counting 'a' in 'bar'.
Counting 'b' in 'bab'.
Counting 'b' in 'bar'.
["so", "foo", "bar", "bab"]
Demo code:
def lazy(&block)
def block.value
(@value ||= [self.yield])[0]
end
def block.<=>(other)
value <=> other.value
end
block
end
def count_a(s)
puts "Counting 'a' in '#{s}'."
s.count('a')
end
def count_b(s)
puts "Counting 'b' in '#{s}'."
s.count('b')
end
p ['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, lazy{count_a(s)}, lazy{count_b(s)}] }
A different way to make value memoizing: If it ever gets called, it immediately replaces itself with a method just returning the stored value:
def block.value
def self.value; @value end
@value = self.yield
end
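The same idea can also be packaged as a small class rather than as singleton methods on each Proc. This is only a sketch under that assumption; LazyKey, the count lambda and the calls log are hypothetical names introduced for illustration, not part of the answer above.

```ruby
class LazyKey
  include Comparable

  def initialize(&block)
    @block = block
  end

  # Runs the block on first use and caches the result (nil and false included).
  def value
    @value = @block.call unless defined?(@value)
    @value
  end

  def <=>(other)
    value <=> other.value
  end
end

calls = [] # log of [char, string] pairs that were actually counted
count = ->(s, ch) { calls << [ch, s]; s.count(ch) }

result = %w[bab foo so bar].sort_by do |s|
  [s.size, LazyKey.new { count.(s, 'a') }, LazyKey.new { count.(s, 'b') }]
end

p result # ["so", "foo", "bar", "bab"]
```

This relies on Array#<=> comparing element by element and stopping at the first non-zero result, so a later LazyKey is only evaluated when all earlier criteria tie.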

Related

Ruby: Understanding .to_enum better

I have been reading this:
https://docs.ruby-lang.org/en/2.4.0/Enumerator.html
I am trying to understand why someone would use .to_enum, I mean how is that different than just an array? I see :scan was passed into it, but what other arguments can you pass into it?
Why not just use .scan in the case below? Any advice on how to understand .to_enum better?
"Hello, world!".scan(/\w+/) #=> ["Hello", "world"]
"Hello, world!".to_enum(:scan, /\w+/).to_a #=> ["Hello", "world"]
"Hello, world!".to_enum(:scan).each(/\w+/).to_a #=> ["Hello", "world"]
Arrays are, necessarily, constructs that are in memory. An array with a lot of entries takes up a lot of memory.
To put this in context, here's an example, finding all the "palindromic" numbers between 1 and 1,000,000:
# Create a large array of the numbers to search through
numbers = (1..1000000).to_a
# Filter to find palindromes
numbers.select do |i|
is = i.to_s
is == is.reverse
end
Even though there are only 1998 such numbers, the entire array of a million needs to be created, then sifted through, then kept around until garbage collected.
An enumerator doesn't necessarily take up any memory at all, not in a consequential way. This is way more efficient:
# Uses an enumerator instead
numbers = (1..1000000).to_enum
# Filtering code looks identical, but behaves differently
numbers.select do |i|
is = i.to_s
is == is.reverse
end
You can even take this a step further by making a custom Enumerator:
palindromes = Enumerator.new do |y|
1000000.times do |i|
is = (i + 1).to_s
y << (i + 1) if (is == is.reverse)
end
end
This one doesn't even bother with filtering, it just emits only palindromic numbers.
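A short usage sketch for such a generator, assuming it yields the number that was actually tested. It is written with 1.upto rather than times purely to avoid the off-by-one bookkeeping; the variable names are illustrative.

```ruby
palindromes = Enumerator.new do |y|
  1.upto(1_000_000) do |i|
    s = i.to_s
    y << i if s == s.reverse
  end
end

# first stops the generator as soon as it has enough values,
# so the remaining numbers are never examined.
first_twelve = palindromes.first(12)
p first_twelve # [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 22, 33]
```

Each call such as first or count restarts the block from the beginning, since nothing is stored between calls.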
Enumerators can also do other things like be infinite in length, whereas arrays are necessarily finite. An infinite enumerator can be useful when you want to filter and take the first N matching entries, like in this case:
# Open-ended range, new in Ruby 2.6. Don't call .to_a on this!
numbers = (1..).to_enum
numbers.lazy.select do |i|
is = i.to_s
is == is.reverse
end.take(1000).to_a
Using .lazy here means it does the select, then filters through take with each entry until the take method is happy. If you remove the lazy it will try and evaluate each stage of this to completion, which on an infinite enumerator never happens.
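On the question of which arguments to_enum accepts: the first argument names any method that yields, and the remaining arguments are passed through to that method. A minimal sketch follows; the Countdown class is a hypothetical example of the common pattern of returning an enumerator when no block is given.

```ruby
# Extra arguments after the method name are forwarded to it.
pairs = [1, 2, 3, 4].to_enum(:each_slice, 2).to_a
p pairs # [[1, 2], [3, 4]]

# Wrapping scan lets you chain Enumerator helpers such as with_index.
words = "Hello, world!".to_enum(:scan, /\w+/).with_index.map { |w, i| "#{i}:#{w}" }
p words # ["0:Hello", "1:world"]

# Hypothetical class: each returns an Enumerator when called without a block.
class Countdown
  def initialize(from)
    @from = from
  end

  def each
    return to_enum(:each) unless block_given?
    @from.downto(1) { |n| yield n }
  end
end

first_two = Countdown.new(5).each.first(2)
p first_two # [5, 4]
```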

Given a string, how do I compare the characters to see if there are duplicates?

I'm trying to compare characters in a given string to see if there are duplicates, and if there are, I want to remove the two characters to reduce the string to as small as possible, e.g. ("ttyyzx") would become ("zx").
I've tried converting the characters in an array and then using an #each_with_index to iterate over the characters.
arr = ("xxyz").split("")
arr.each_with_index do |idx1, idx2|
if idx1[idx2] == idx1[idx2 + 1]
p idx1[idx2]
p idx1[idx2 + 1]
end
end
At this point I just want to be able to print the next character in the array within the loop so I know I can move on to the next step, but no matter what code I use it will only print out the first character "x".
To only keep the unique characters (ggorlen's answer is "b"): count all characters, find only those that appear once. We rely on Ruby's Hash producing keys in insertion order.
def keep_unique_chars(str)
str.each_char.
with_object(Hash.new(0)) { |element, counts| counts[element] += 1 }.
select { |_, count| count == 1 }.
keys.
join
end
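On Ruby 2.7 or newer, the counting step can probably be shortened with Enumerable#tally. This is a sketch of the same idea, not a different algorithm; the method name keep_unique_chars_tally is made up for illustration.

```ruby
# Requires Ruby 2.7+ for Enumerable#tally.
def keep_unique_chars_tally(str)
  str.each_char.tally.select { |_, count| count == 1 }.keys.join
end

puts keep_unique_chars_tally("ttyyzx") # zx
puts keep_unique_chars_tally("xxyz")   # yz
```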
To remove adjacent dupes only (ggorlen's answer is "aba"): a regular expression replacing adjacent repetitions is probably the go-to method.
def remove_adjacent_dupes(str)
str.gsub(/(.)\1+/, '')
end
Without regular expressions, we can use slice_when to cut the array when the character changes, then drop the groups that are too long. One might think a flatten would be required before join, but join doesn't care:
def remove_adjacent_dupes_without_regexp(str)
str.each_char.
slice_when { |prev, curr| prev != curr }.
select { |group| group.size == 1 }.
join
end
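Both versions above make a single pass, so removing one run can leave a new adjacent pair behind ("abba" becomes "aa"). If the goal is instead to keep cancelling adjacent pairs until none remain, which is only one possible reading of "reduce the string to as small as possible", a stack-based sketch would look like this. The method name is hypothetical.

```ruby
# Assumption: adjacent equal characters cancel in pairs, repeatedly.
def cancel_adjacent_pairs(str)
  stack = []
  str.each_char do |ch|
    if stack.last == ch
      stack.pop      # ch cancels the character before it
    else
      stack.push(ch)
    end
  end
  stack.join
end

puts cancel_adjacent_pairs("ttyyzx") # zx
puts cancel_adjacent_pairs("abba")   # (empty string)
puts "abba".gsub(/(.)\1+/, '')       # aa
```

Note the different treatment of odd runs: this version reduces "aaa" to "a", whereas the regular expression removes the whole run.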
While amadan's and user's solutions definitely solve the problem, I felt like writing a solution closer to the OP's attempt:
def clean(string)
return string if string.length == 1
array = string.split('')
array.select.with_index do |value, index|
array[index - 1] != value && array[index + 1] != value
end.join
end
Here are a few examples:
puts clean("aaaaabccccdeeeeeefgggg")
#-> bdf
puts clean("m")
#-> m
puts clean("ttyyzx")
#-> zx
puts clean("aab")
#-> b
The method makes use of the fact that the characters are sorted, so any duplicates sit immediately before or after the character being checked by select. The method is slower than the solutions posted above, but as the OP mentioned he does not work with hashes yet, I thought this might be useful.
If speed is not an issue,
require 'set'
...
Set.new(("xxyz").split("")).to_a.join # => xyz
Making it a Set removes duplicates.
The OP does not want to remove duplicates and keep a single copy, but to remove every character that occurs more than once. So here is a new approach, again compact, but not fast:
"xxyz".split('').sort.join.gsub(/(.)\1+/,'')
The idea is to sort the letters; hence, identical letters will be joined together. The regexp /(.)\1+/ describes a repetition of a letter.

Ruby Nokogiri parsing omit duplicates

I'm parsing XML files and want to keep duplicate values from being added to my Array. As it stands, the XML looks like this:
<vulnerable-software-list>
<product>cpe:/a:octopus:octopus_deploy:3.0.0</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.1</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.2</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.3</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.4</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.5</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.6</product>
</vulnerable-software-list>
document.xpath("//entry[
number(substring(translate(last-modified-datetime,'-.T:',''), 1, 12)) > #{last_imported_at} and
cvss/base_metrics/access-vector = 'NETWORK'
]").each do |entry|
product = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':')[-2] }
effected_versions = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':').last }
puts product
end
However, because of the XML input, that produces quite a few duplicates, so I end up with an array like ['Redhat','Redhat','Redhat','Fedora'].
I already have the effected_versions taken care of, since those values don't duplicate.
Is there a method of .map to only add unique values?
If you need to get an array of unique values, then just call uniq method to get the unique values:
product =
entry.xpath('vulnerable-software-list/product').map do |product|
product.content.split(':')[-2]
end.uniq
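To see the effect without Nokogiri, here is a minimal sketch that applies the same split and uniq to plain strings. The cpes array simply copies three of the product values from the XML above; it stands in for the result of the xpath call.

```ruby
# Stand-in for entry.xpath('vulnerable-software-list/product').map(&:content)
cpes = [
  "cpe:/a:octopus:octopus_deploy:3.0.0",
  "cpe:/a:octopus:octopus_deploy:3.0.1",
  "cpe:/a:octopus:octopus_deploy:3.0.2"
]

products = cpes.map { |cpe| cpe.split(':')[-2] }.uniq
versions = cpes.map { |cpe| cpe.split(':').last }

p products # ["octopus_deploy"]
p versions # ["3.0.0", "3.0.1", "3.0.2"]
```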
There are many ways to do this:
input = ['Redhat','Redhat','Redhat','Fedora']
# approach 1
# self explanatory
result = input.uniq
# approach 2
# iterate through vals, and build a hash with the vals as keys
# since hashes cannot have duplicate keys, it provides a 'unique' check
result = input.each_with_object({}) { |val, memo| memo[val] = true }.keys
# approach 3
# Similar to the previous, we iterate through vals and add them to a Set.
# Adding a duplicate value to a set has no effect, and we can convert it to array
# (needs require 'set' on Ruby versions older than 3.2)
result = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a
If you're not familiar with each_with_object, it's very similar to reduce
Regarding performance, you can find some info if you search for it, for example What is the fastest way to make a uniq array?
From a quick test, I see these performing in increasing time. uniq is 5 times faster than each_with_object, which is 25% slower than the Set.new approach. Probably because uniq is implemented in C. I only tested with one arbitrary input though, so it might not be true for all cases.
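If you want to repeat such a measurement yourself, a sketch with the standard Benchmark module could look like the following. The input array is hypothetical and the timings will vary by machine and Ruby version, so treat any ratio as indicative only.

```ruby
require 'benchmark'
require 'set'

# Hypothetical input: 100,000 strings drawn from 50 distinct values.
input = Array.new(100_000) { |i| "vendor#{i % 50}" }

approaches = {
  'uniq'             => -> { input.uniq },
  'each_with_object' => -> { input.each_with_object({}) { |val, memo| memo[val] = true }.keys },
  'Set'              => -> { input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a }
}

# Prints user/system/total/real times for ten runs of each approach.
Benchmark.bm(18) do |x|
  approaches.each do |label, code|
    x.report(label) { 10.times { code.call } }
  end
end
```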

How to use less memory generating Array permutation?

So I need to get all possible permutations of a string.
What I have now is this:
def uniq_permutations string
string.split(//).permutation.map(&:join).uniq
end
Ok, now what is my problem: This method works fine for small strings but I want to be able to use it with strings with something like size of 15 or maybe even 20. And with this method it uses a lot of memory (>1gb) and my question is what could I change not to use that much memory?
Is there a better way to generate permutation? Should I persist them at the filesystem and retrieve when I need them (I hope not because this might make my method slow)?
What can I do?
Update:
I actually don't need to save the result anywhere I just need to lookup for each in a table to see if it exists.
Just to reiterate what Sawa said. You do understand the scope? The number of permutations for any n elements is n!. It's about the most aggressive mathematical progression operation you can get. The results for n between 1-20 are:
[1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600,
6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000,
6402373705728000, 121645100408832000, 2432902008176640000]
Where the last number is approximately 2 quintillion, which is 2 billion billion.
Even at a single byte per permutation, that is roughly 2265820000 gigabytes.
You can save the results to disk all day long - unless you own all the Google datacenters in the world you're going to be pretty much out of luck here :)
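The figures above can be reproduced with a few lines. This is a back-of-the-envelope sketch that assumes one byte per stored permutation, which understates the real cost since each permutation of n characters needs at least n bytes.

```ruby
# n! for n = 1..20
factorials = (1..20).map { |n| (1..n).reduce(1, :*) }

total = factorials.last
gib   = total / 2**30 # assumption: one byte per permutation

puts "20! = #{total}"
puts "about #{gib} GiB at one byte per permutation"
```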
Your call to map(&:join) is what is creating the array in memory, as map in effect turns an Enumerator into an array. Depending on what you want to do, you could avoid creating the array with something like this:
def each_permutation(string)
string.split(//).permutation do |permutation|
yield permutation.join
end
end
Then use this method like this:
each_permutation(my_string) do |s|
lookup_string(s) #or whatever you need to do for each string here
end
This doesn’t check for duplicates (no call to uniq), but avoids creating the array. This will still likely take quite a long time for large strings.
However I suspect in your case there is a better way of solving your problem.
I actually don't need to save the result anywhere I just need to lookup for each in a table to see if it exists.
It looks like you’re looking for possible anagrams of a string in an existing word list. If you take any two anagrams and sort the characters in them, the resulting two strings will be the same. Could you perhaps change your data structures so that you have a hash, with keys being the sorted string and the values being a list of words that are anagrams of that string? Then instead of checking all permutations of a new string against a list, you just need to sort the characters in the string, and use that as the key to look up the list of all strings that are permutations of that string.
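A minimal sketch of that lookup structure, assuming the table is a plain array of words. The word list and the helper name anagram_key are made up for illustration.

```ruby
# Two strings are anagrams exactly when their sorted characters match.
def anagram_key(word)
  word.chars.sort.join
end

words = %w[listen silent enlist google tinsel banana] # hypothetical word list

# Build the index once: sorted letters => all words with those letters.
index = words.group_by { |word| anagram_key(word) }

# One sort and one hash lookup replace the n! permutation checks.
hits = index.fetch(anagram_key("inlets"), [])
p hits # ["listen", "silent", "enlist", "tinsel"]
```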
Perhaps you don't need to generate all elements of the set, but rather only a random or constrained subset. I have written an algorithm to generate the m-th permutation in O(n) time.
First convert the key to a list representation of itself in the factorial number system. Then iteratively pull out the item at each index specified by the new list from the old list.
module Factorial
def factorial num; (2..num).inject(:*) || 1; end
def factorial_floor num
tmp_1 = 0
1.upto(1.0/0.0) do |counter|
break [tmp_1, counter - 1] if (tmp_2 = factorial counter) > num
tmp_1 = tmp_2 #####
end # #
end # #
end # returns [factorial, integer that generates it]
# for the factorial closest to without going over num
class Array; include Factorial
def generate_swap_list key
swap_list = []
key -= (swap_list << (factorial_floor key)).last[0] while key > 0
swap_list
end
def reduce_swap_list swap_list
swap_list = swap_list.map { |x| x[1] }
((length - 1).downto 0).map { |element| swap_list.count element }
end
def keyed_permute key
apply_swaps reduce_swap_list generate_swap_list key
end
def apply_swaps swap_list
swap_list.map { |index| delete_at index }
end
end
Now, if you want to randomly sample some permutations, Ruby comes with Array#shuffle!, but this will let you copy and save permutations or iterate through the permutohedral space. Or maybe there's a way to constrain the permutation space for your purposes.
constrained_generator_thing do |val|
Array.new(sample_size) {array_to_permute.keyed_permute val}
end
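For comparison, here is a compact sketch of the same factorial-number-system idea written as standalone methods. It is not the code above: the names are hypothetical, indices are zero-based, and because it recomputes factorials and uses delete_at it is simpler rather than O(n).

```ruby
def factorial(n)
  (1..n).reduce(1, :*)
end

# Returns the m-th permutation (zero-based, in index order) of items.
def nth_permutation(items, m)
  unless (0...factorial(items.size)).cover?(m)
    raise ArgumentError, "index out of range"
  end
  pool = items.dup
  (pool.size - 1).downto(0).map do |k|
    # The next factorial digit says which remaining item to take.
    idx, m = m.divmod(factorial(k))
    pool.delete_at(idx)
  end
end

p nth_permutation(%w[a b c], 0) # ["a", "b", "c"]
p nth_permutation(%w[a b c], 1) # ["a", "c", "b"]
p nth_permutation(%w[a b c], 5) # ["c", "b", "a"]
```

Because any single permutation can be produced from its index, you can sample or step through the space without ever holding more than one permutation in memory.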
Perhaps I am missing the obvious, but why not do
['a','a','b'].permutation.to_a.uniq!

Ruby find in array with offset

I'm looking for a way to do the following in Ruby in a cleaner way:
class Array
def find_index_with_offset(offset, &block)
self[offset..-1].find(&block)
end
end
offset = array.find_index {|element| element.meets_some_criterion?}
the_object_I_want =
array.find_index_with_offset(offset+1) {|element| element.meets_another_criterion?}
So I'm searching a Ruby array for the index of some object and then I do a follow-up search to find the first object that matches some other criterion and has a higher index in the array. Is there a better way to do this?
What do I mean by cleaner: something that doesn't involve explicitly slicing the array. When you do this a couple of times, calculating the slicing indices gets messy fast. I'd like to keep operating on the original array. It's easier to understand and less error-prone.
NB. In my actual code I haven't monkey-patched Array, but I want to draw attention to the fact that I expect I'm duplicating existing functionality of Array/Enumerable
Edits
Fixed location of offset + 1 as per Mladen Jablanović's comment; rewrite error
Added explanation of 'cleaner' as per Mladen Jablanović's comment
"Cleaner" is obviously a subjective matter here. If you aim for short, I don't think you could do better than that. If you want to be able to chain multiple such finds, or you are bothered by slicing, you can do something like this:
module Enumerable
def find_multi *procs
return nil if procs.empty?
find do |e|
if procs.first.call(e)
procs.shift
next true if procs.empty?
end
false
end
end
end
a = (1..10).to_a
p a.find_multi(lambda{|e| e % 5 == 0}, lambda{|e| e % 3 == 0}, lambda{|e| e % 4 == 0})
#=> 8
Edit: And if you're not concerned with the performance you could do something like:
array.drop_while{|element|
!element.meets_some_criterion?
}.drop(1).find{|element|
element.meets_another_criterion?
}
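Two more variations that avoid copying a slice, offered as a sketch. The helper names find_after and find_index_after and the sample data are hypothetical. The first uses Enumerator::Lazy#drop so that skipped elements are not copied into a new array; the second searches over indices, which is handy when you need the position rather than the element.

```ruby
# Returns the first element matching `second` that comes after the
# first element matching `first`, or nil if there is none.
def find_after(array, first, second)
  offset = array.find_index(&first)
  return nil if offset.nil?
  array.lazy.drop(offset + 1).find(&second)
end

# Same search, but returns the index in the original array.
def find_index_after(array, first, second)
  offset = array.find_index(&first)
  return nil if offset.nil?
  ((offset + 1)...array.size).find { |i| second.call(array[i]) }
end

array = [3, 8, 5, 12, 7, 10] # hypothetical data
even  = ->(e) { e.even? }
big   = ->(e) { e > 9 }

p find_after(array, even, big)       # 12
p find_index_after(array, even, big) # 3
```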
