Trying to understand how array.sort works with custom comparer block - ruby

I am new to ruby and doing a RubyMonk tutorial. One of the problems is the following. Can someone please enlighten me because I am not understanding the suggested solution?
Problem Statement
Create a method named 'sort_string' which accepts a String and rearranges all the words in ascending order, by length. Let's not treat the punctuation marks any different than other characters and assume that we will always have single space to separate the words.
Example: Given a string "Sort words in a sentence", it should return "a in Sort words sentence".
Suggested Solution:
def sort_string(string)
  string.split(' ').sort { |x, y| x.length <=> y.length }.join(' ')
end
My questions are;
1) Why are there two block variables being passed through? Should there only be one, because you are going through every element of the sentence one at a time?
2) I looked up the <=> operator and it states,"Combined comparison operator. Returns 0 if first operand equals second, 1 if first operand is greater than the second and -1 if first operand is less than the second." So aren't we essentially sorting by -1, 0, and 1 then, not the words?
Thank you very much in advance for your help!

1) Why are there two block variables being passed through? Should there only be one, because you are going through every element of the sentence one at a time?
Because that's how the sort method works. It compares two elements at a time, and the block tells it how to compare them. There is also a method called sort_by, whose block takes a single element and returns the key to sort on, which could be used in this case:
def sort_string(string)
  string.split(' ').sort_by { |x| x.length }.join(' ')
end
Or even shorter:
def sort_string(string)
  string.split(' ').sort_by(&:length).join(' ')
end
2) I looked up the <=> operator and it states,"Combined comparison operator. Returns 0 if first operand equals second, 1 if first operand is greater than the second and -1 if first operand is less than the second." So aren't we essentially sorting by -1, 0, and 1 then, not the words?
Again, this is how sorting works. Sort looks at the result and, depending upon the value -1, 0, or 1 will order the original data accordingly. It's not ordering the results of <=> directly. If you've done any C programming and used strcmp, think about how you would use that function. It's based upon the same concept.
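To see what sort does with the block, one option is to record every pair it passes in. This is only an illustrative sketch (the comparisons array is a name made up for this example); which pairs get compared, and in what order, depends on Ruby's internal sort algorithm.

```ruby
words = "Sort words in a sentence".split(' ')
comparisons = []  # hypothetical log of [left, right, result] triples

sorted = words.sort do |x, y|
  result = x.length <=> y.length   # always -1, 0 or 1
  comparisons << [x, y, result]    # remember which pair sort asked about
  result                           # sort only ever sees this number
end

comparisons.each { |x, y, r| puts "#{x.inspect} vs #{y.inspect} gives #{r}" }
puts sorted.join(' ')
```

The block is called with pairs of words, and the words themselves are what end up reordered; the -1, 0 and 1 are just the answers sort gets back for each pair.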

For the first question, if you look at the documentation for the sort method its block form takes two variables
http://www.ruby-doc.org/core-2.0.0/Array.html#method-i-sort
For the second question, the spaceship operator does a comparison between the two operands and then returns -1, 0, or 1, and then you're sorting on the results. Yes, you're sorting on -1, 0, and 1, but those values are obtained from the comparison.

There are two block variables because to sort you need two items - you can't compare one item against nothing or itself.
You are sorting the words, using the -1, 0 or 1 that the block returns for each pair of words.
Both of these questions are related to the sort method - here's an example which might make it clearer:
(1..10).sort { |a, b| b <=> a } #=> [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
For the numbers 1 to 10, sort passes pairs of elements to the block as 'a' and 'b'. The block compares b with a instead of a with b, which tells sort to place the larger number first, and that is what it does.

The sort function by default sorts the items in some order.
However, if it is passed a block, it uses that block to compare the elements of the array, so that you can define a different, custom order of the elements.
That is, the block has to compare the elements. A minimalistic working version of such comparison is to compare two elements, just to know which one is "greater-or-equal".
This is why the custom block takes two parameters: those are the elements to compare. You don't actually know which one. The sort will perform some sorting algorithm and depending on the internals, it will pick some pairs of elements, compare them using your block, and then, well, it will use that knowledge to reorder the elements in order.
As you provide a block that 'compares', it'd be not very efficient to just return BOOL that says "greater or not". A better and often a bit faster way is to determine if the elements are equal, less, or greater. At once.
arr.sort { |item1, item2|
  if item1 < item2 return :lesser
  if item1 == item2 return :equal
  if item1 > item2 return :greater
}
This is just pseudocode.
With numbers, it is very easy: just subtract them. If you get less than zero, you know that the first was lesser. If you get more than zero, the first was bigger. If you get zero, they were equal. So, over time, this became the 'standardized' way of describing a three-way comparison result to sorting algorithms.
arr.sort { |item1, item2|
  if item1 < item2 then -1
  elsif item1 == item2 then 0
  else 1
  end
}
or just
arr.sort { |item1, item2| item1 - item2 }
Not all types can be subtracted though. Ruby went somewhat further and defined "comparison operator". Instead of just separate </>/==/<=/>=, it provides you with <=> that returns numeric values. -1 meaning that left was lesser and so on.
arr.sort { |item1, item2| item1 <=> item2 }
Now, if you provide a <=> operator for MyClass, you will be able to sort its instances easily, even though plain item1 - item2 cannot work on the non-numeric MyClass.
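As a rough sketch of that last point, here is what giving a class its own <=> might look like. MyClass, priority and name are made-up names for illustration only; including Comparable is optional for sorting, but it derives <, > and == from <=> for free.

```ruby
# MyClass is a hypothetical example type, not something from a library.
class MyClass
  include Comparable            # derives <, >, ==, between? from <=>
  attr_reader :priority, :name

  def initialize(priority, name)
    @priority = priority
    @name = name
  end

  # Three-way comparison: order by priority first, then by name.
  def <=>(other)
    [priority, name] <=> [other.priority, other.name]
  end
end

items = [MyClass.new(2, "b"), MyClass.new(1, "z"), MyClass.new(2, "a")]
sorted = items.sort             # no block needed: sort calls <=> itself
puts sorted.map { |i| "#{i.priority}#{i.name}" }.join(' ')
```

Subtraction would make no sense for these objects, but sort works because it only needs the sign that <=> returns.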

Related

What is the purpose of passing a block to enum_for in Ruby?

According to the official Ruby docs, there are two ways of calling enum_for:
enum_for(method = :each, *args) → enum
enum_for(method = :each, *args){|*args| block} → enum
I feel like I have a pretty good understanding of Ruby's enumerators in general. And so I understand what calling enum_for without a block does. But I am confused as to what the purpose of calling it with a block is.
The description in the docs, "If a block is given, it will be used to calculate the size of the enumerator without the need to iterate it", isn't very helpful.
In the example in the documentation, they've set up a method, repeat, which will take an enumerable and yield its values n times. So for the array:
array = [1, 2, 3, 4, 5]
calling repeat(3) will yield:
1
1
1
2
2
2
...
So, I have array and call enum = array.repeat(3), and now I want to know how many elements are in the enum. I could go through and set up a counter:
counter = 0; array.repeat(3) { |element| counter += 1 }
and know that there are 15 elements in enum, or being a not-computer and having prior knowledge about what repeat is doing, I could just say array.size * 3 and get 15.
But that's all overkill, because Enumerator has a size method, so it'd be nice if I could just say enum.size. However, size just returns nil, because it can't calculate the size lazily, so I'd have to use count which is just going to iterate through and count the elements like our naive solution. I can tell size how to lazily calculate the size, though, by passing the block to enum_for (clipped from the docs)
to_enum(__method__, n) do # __method__ is :repeat here
  sz = size     # Call size and multiply by n...
  sz * n if sz  # but return nil if size itself is nil
end
and now it behaves like our second method. For a use case like this, it may seem like only a marginal gain, but what if I'm repeating a 1,000 element array 1 million times? Or what if my enum is hitting an external service, or doing any number of expensive computations? This small optimization could save quite a few cycles and avoid unnecessary work when all you wanted was a total.
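For completeness, here is a self-contained sketch of the whole repeat method, closely following the example in the Ruby documentation that the question refers to. It reopens Enumerable, which is fine for a demonstration but probably not something to do casually in real code.

```ruby
module Enumerable
  # Yields every value of the receiver n times in a row.
  def repeat(n)
    raise ArgumentError, "#{n} is negative!" if n < 0
    unless block_given?
      # The block passed to to_enum is only used to answer Enumerator#size.
      return to_enum(__method__, n) do
        sz = size      # size of the receiver, if it has a cheap one
        sz * n if sz   # nil when the receiver's size is unknown
      end
    end
    each do |*val|
      n.times { yield(*val) }
    end
  end
end

array = [1, 2, 3, 4, 5]
enum = array.repeat(3)
puts enum.size        # computed from the block, without iterating
puts enum.count       # iterates all elements to count them
```

Both calls report 15 here, but only count has to walk through every element to get there.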

What does the Ruby sort function do, exactly?

Let me preface this by saying I'm a newbie to Ruby (pretty obvious). I'm learning Ruby on Codecademy and I'm confused by the sort function. To use as an example:
list = [3,2,1,4]
list.sort { |a,b| b <=> a }
I know that this will return the array in descending order - [4, 3, 2, 1]. What I don't understand is why, exactly. I know that when the sort function is called, the numbers from the array are passed into the function and compared, which then returns either -1, 0, or 1 - but then what? For instance, I'm guessing this is what would be compared first:
[3 <=> 2] = 1
But what does it do with the 1 that is returned? And what would the array look like after it gets the 1?
I'm confused because I don't understand how reversing the comparison (a <=> b vs. b <=> a) changes the direction in which the array is sorted. Unless I'm mistaken, doesn't "1 <=> 2" essentially return "1 comes before 2", whereas "2 <=> 1" returns "2 comes after 1"? Which is more or less the same thing, yet the results are obviously different.
The "spaceship" operator, <=> doesn't return something so English as "a comes before b". It returns what sort needs to know: where two elements are in relation to each other. Specifically, it returns the -1, 0, or 1 value you mentioned.
In a <=> b, if a is less than b (via whatever comparison method is used for the class of which a is an instance), the return is -1. If they're equal, return is 0; and if a is greater than b, the return is 1.
When you do b <=> a, the returned value is based on b rather than a, so if a is smaller, you'll get 1, whereas you got -1 when doing a <=> b.
So while the English meaning is the same, the devil is in the details: that -1, 0, or 1 return value. That value tells Ruby precisely how two elements fit into a sorted array.
The magic with those three numbers is in the quicksort algorithm used by Ruby. It's out of scope to try and explain precisely how that algorithm works, but you can basically look at it as a simple comparison on many values. For each item in an array, <=> is called with another item in the array in order to determine where the two items fall relative to each other. Once enough comparisons have been made, the positions of all those individual items are known and the sorting is done.
As a simple (and not really technically accurate, but close enough) example, consider the array [3, 2, 7, 1]. You can grab a value to compare others to in order to start the sorting. We'll pick 3. Running a comparison of 3 with all other numbers gives us:
3 <=> 2 == 1: 3 is greater than 2, so 2 must be to the left of 3. Our array might look like this now: [2, 3, 7, 1]
3 <=> 7 == -1: 3 is less than 7, so 7 must be to the right of 3. Our array continues to look as it did before, as the 7 was already on the right.
3 <=> 1 == 1: 3 is greater than 1, so the 1 must be on the left of 3. Our array looks like this now: [2, 1, 3, 7]
We know the 7 must be correct since it's the only element on the "greater than 3" side. So we just need to figure out the sort order for everything before the 3: 1 and 2. Running a similar comparison as above, we obviously swap the 1 and 2 to get [1, 2, 3, 7].
I hope this helps!
The comparison gets two arguments and returns -1 if the first argument is less than the second argument, 0 if the two arguments are equal, and 1 if the first argument is greater than the second argument. When you swap the two, it inverts the result. <=> doesn’t care about where its operands came from, so although the change doesn’t add any extra information about the relationship between a and b, it does invert the result of <=>, and that inverts the sorting order.
(1 <=> 2) == -1
(2 <=> 1) == 1
As the sorting function, you don’t get 1 <=> 2 or 2 <=> 1; you get -1 or 1. From whichever number, you decide which argument you passed to the comparison should come later in the result.
Unless I'm mistaken, doesn't "1 <=> 2" essentially return "1 comes before 2", whereas "2 <=> 1" returns "2 comes after 1"? Which is more or less the same thing, yet the results are obviously different.
No and yes. The question that is asked of the block is: "does the left element come before or after the right?" And by swapping left and right, you swap the order.
So, the answer is: you aren't reversing the comparison per se, but you are reversing the sort method's idea of which is left and which is right.
The return value of the block is interpreted by sort like this:
0: order doesn't matter
-1: the left element comes before the right one (the elements are in the right order)
1: the left element comes after the right one (the elements are in the wrong order)
By swapping left and right, you swap whether the block tells sort that the elements are in the right or wrong order.
Note that Quicksort is completely irrelevant here. What matters is the contract of the comparator block. Whether that block is then used by Quicksort, Shellsort, Insertion Sort, Bubblesort, Bogosort, Timsort or whatever other comparison-based sort doesn't really matter.
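To make the contract concrete, here is a deliberately simple insertion sort that relies on nothing but the sign of the block's return value. This is a sketch for illustration only (insertion_sort is a made-up helper, and it is not how Ruby's own sort is implemented), but any comparison-based sort uses the block in the same spirit.

```ruby
# Hypothetical helper: sorts items using only the sign of compare.call(a, b).
def insertion_sort(items, &compare)
  result = []
  items.each do |item|
    # A negative answer means "item goes before existing":
    # insert in front of the first element for which that holds.
    index = result.index { |existing| compare.call(item, existing) < 0 }
    result.insert(index || result.length, item)
  end
  result
end

list = [3, 2, 1, 4]
ascending  = insertion_sort(list) { |a, b| a <=> b }
descending = insertion_sort(list) { |a, b| b <=> a }
puts ascending.inspect
puts descending.inspect
```

The sorting code is identical in both calls; swapping a and b in the block flips the sign of every answer, and that alone reverses the result.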

How to loop through loop in Ruby

I am trying to loop the numbers 1 to 1000 in such a way that I have all possible pairs, e.g., 1 and 1, 1 and 2, 1 and 3, ..., but also 2 and 1, 2 and 2, 2 and 3, et cetera, and so on.
In this case I have a condition (amicable_pair) that returns true if two numbers are an amicable pair. I want to check all numbers from 1 to n against each other and add all amicable pairs to a total. The first value will be added to the total if it is part of an amicable pair (not the second value of the pair, since we'll find that later in the loop). To do this I wrote the following "Java-like" code:
def add_amicable_pairs(n)
  amicable_values = []
  for i in 1..n
    for j in 1..n
      if (amicable_pair?(i,j))
        amicable_values.push(i)
        puts "added #{i} from amicable pair #{i}, #{j}"
      end
    end
  end
  return amicable_values.inject(:+)
end
Two issues with this: (1) it is really slow. (2) In Ruby you should not use for-loops.
This is why I am wondering how this can be accomplished in a faster and more Ruby-like way. Any help would be greatly appreciated.
Your code has O(n^2) runtime, so if n gets moderately large then it will naturally be slow. Brute-force algorithms are always slow if the search space is large. To avoid this, is there some way you can directly find the "amicable pairs" rather than looping through all possible combinations and checking one by one?
As far as how to write the loops in a more elegant way, I would probably rewrite your code as:
(1..n).to_a.product((1..n).to_a).select { |a,b| amicable_pair?(a,b) }.map(&:first).reduce(0, &:+)
(1..1000).to_a.repeated_permutation(2).select{|pair| amicable_pair?(*pair)}
.map(&:first).inject(:+)
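One way to avoid checking every pair is to compute each number's only possible partner directly. The sketch below assumes the usual definition, which the question does not show: a and b are amicable when they are different and the sum of the proper divisors of each equals the other. The helper names are made up for this example.

```ruby
# Sum of the divisors of n that are smaller than n (hypothetical helper).
def sum_of_proper_divisors(n)
  return 0 if n < 2
  total = 1
  (2..Integer.sqrt(n)).each do |d|
    next unless (n % d).zero?
    total += d
    other = n / d
    total += other unless other == d   # do not count a square root twice
  end
  total
end

# Adds up every a in 1..limit whose amicable partner is also in 1..limit.
def add_amicable_values(limit)
  (1..limit).select do |a|
    b = sum_of_proper_divisors(a)      # the only candidate partner for a
    b != a && b <= limit && sum_of_proper_divisors(b) == a
  end.sum
end

puts add_amicable_values(1000)
```

Each number now needs two divisor sums instead of n pair checks. Under the assumed definition, the only pair below 1000 is 220 and 284, so this prints 504.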

Calculate Median in An Array - Can someone tell me what is going on in this line of code?

This is a solution for calculating the median value in an array. I get the first three lines, duh ;), but the third line is where the magic is happening. Can someone explain how the 'sorted' variable is being used and why it's next to brackets, and why the other variable 'len' is enclosed in those parentheses and then brackets? It's almost like sorted is all of a sudden being used as an array? Thanks!
def median(array)
  sorted = array.sort
  len = sorted.length
  return ((sorted[(len - 1) / 2] + sorted[len / 2]) / 2.0).to_f
end
puts median([3,2,3,8,91])
puts median([2,8,3,11,-5])
puts median([4,3,8,11])
Consider this:
[1,2,2,3,4] and [1,2,3,4]. Both arrays are sorted, but have odd and even numbers of elements respectively. So, that piece of code is taking into account these 2 cases.
sorted is indeed an array. You sort [2,3,1,4] and you get back [1,2,3,4]. Then you calculate the middle index (len - 1) / 2 and len / 2 for even / odd number of elements, and find the average of them.
Yes, array.sort is returning an array and it is assigned to sorted. You can then access it via array indices.
If you have an odd number of elements, say 5 elements as in the example, the indices come out to be:
(len-1)/2=(5-1)/2=2
len/2=5/2=2 --- (remember this is integer division, so the decimal gets truncated)
So you take the value at index 2 twice, add the two copies, and divide by 2, which gives back the value at index 2.
If you have an even number of elements, say 4,
(len-1)/2=(4-1)/2=1 --- (remember this is integer division, so the decimal gets truncated)
len/2=4/2=2
So in this case, you are effectively averaging the two middle elements, at indices 1 and 2, which is the definition of the median when you have an even number of elements.
It's almost like sorted is all of a sudden being used as an array?
Yes, it is. On line 2 it's being initialized as being an array with the same elements as the input, but in ascending order (default sort is ascending). On line 3 you have len which is initialized with the length of the sorted array, so yeah, sorted is being used as an array since then, because that's what it is.
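If it helps, the two index calculations can be pulled out and inspected on their own. This is just a restatement of the code from the question with the steps named; middle_indices is a made-up helper for illustration.

```ruby
# Hypothetical helper: the two indices the median is averaged from.
def middle_indices(len)
  [(len - 1) / 2, len / 2]   # integer division truncates
end

def median(array)
  sorted = array.sort                      # sorted is a new Array
  lower, upper = middle_indices(sorted.length)
  (sorted[lower] + sorted[upper]) / 2.0    # same element twice when odd
end

puts middle_indices(5).inspect   # odd length: both indices are the same
puts middle_indices(4).inspect   # even length: the two middle indices
puts median([3, 2, 3, 8, 91])
puts median([4, 3, 8, 11])
```

For an odd length both indices point at the single middle element, and for an even length they point at the two elements around the middle.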

sorting algorithm where pairwise-comparison can return more information than -1, 0, +1

Most sort algorithms rely on a pairwise comparison that determines whether A < B, A = B or A > B.
I'm looking for algorithms (and for bonus points, code in Python) that take advantage of a pairwise-comparison function that can distinguish a lot less from a little less or a lot more from a little more. So perhaps instead of returning {-1, 0, 1} the comparison function returns {-2, -1, 0, 1, 2} or {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5} or even a real number on the interval (-1, 1).
For some applications (such as near sorting or approximate sorting) this would enable a reasonable sort to be determined with fewer comparisons.
The extra information can indeed be used to minimize the total number of comparisons. Calls to the super_comparison function can be used to make deductions equivalent to a great number of calls to a regular comparison function. For example, a much-less-than b and c little-less-than b implies a < c < b.
The deductions can be organized into bins or partitions which can each be sorted separately. Effectively, this is equivalent to QuickSort with n-way partition. Here's an implementation in Python:
from collections import defaultdict
from random import choice
def quicksort(seq, compare):
    'Stable in-place sort using a 3-or-more-way comparison function'
    # Make an n-way partition on a random pivot value
    segments = defaultdict(list)
    pivot = choice(seq)
    for x in seq:
        ranking = 0 if x is pivot else compare(x, pivot)
        segments[ranking].append(x)
    seq.clear()
    # Recursively sort each segment and store it in the sequence
    for ranking, segment in sorted(segments.items()):
        if ranking and len(segment) > 1:
            quicksort(segment, compare)
        seq += segment
if __name__ == '__main__':
    from random import randrange
    from math import log10
    def super_compare(a, b):
        'Compare with extra logarithmic near/far information'
        c = -1 if a < b else 1 if a > b else 0
        return c * (int(log10(max(abs(a - b), 1.0))) + 1)
    n = 10000
    data = [randrange(4*n) for i in range(n)]
    goal = sorted(data)
    quicksort(data, super_compare)
    print(data == goal)
By instrumenting this code with the trace module, it is possible to measure the performance gain. In the above code, a regular three-way compare uses 133,000 comparisons while a super comparison function reduces the number of calls to 85,000.
The code also makes it easy to experiment with a variety of comparison functions. This will show that naïve n-way comparison functions do very little to help the sort. For example, if the comparison function returns +/-2 for differences greater than four and +/-1 for differences four or less, there is only a modest 5% reduction in the number of comparisons. The root cause is that the coarse-grained partitions used in the beginning only have a handful of "near matches" and everything else falls in "far matches".
An improvement to the super comparison is to cover logarithmic ranges (i.e. +/-1 if within ten, +/-2 if within a hundred, +/-3 if within a thousand).
An ideal comparison function would be adaptive. For any given sequence size, the comparison function should strive to subdivide the sequence into partitions of roughly equal size. Information theory tells us that this will maximize the number of bits of information per comparison.
The adaptive approach makes good intuitive sense as well. People should first be partitioned into love vs like before making more refined distinctions such as love-a-lot vs love-a-little. Further partitioning passes should each make finer and finer distinctions.
You can use a modified quick sort. Let me explain with an example where your comparison function returns [-2, -1, 0, 1, 2]. Say, you have an array A to sort.
Create 5 empty arrays - Aminus2, Aminus1, A0, Aplus1, Aplus2.
Pick an arbitrary element of A, X.
For each element of the array, compare it with X.
Depending on the result, place the element in one of the Aminus2, Aminus1, A0, Aplus1, Aplus2 arrays.
Apply the same sort recursively to Aminus2, Aminus1, Aplus1, Aplus2 (note: you don't need to sort A0, as all the elements there are equal to X).
Concatenate the arrays to get the final result: A = Aminus2 + Aminus1 + A0 + Aplus1 + Aplus2.
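The steps above can be sketched in a few lines of Ruby. This is only a rough illustration under some assumptions: bucket_quicksort and rich_compare are made-up names, the first element is used as X, and the comparison is assumed to return 0 only for equal elements and a larger magnitude for a larger difference.

```ruby
# Hypothetical sketch of the bucket steps for a multi-valued comparison.
def bucket_quicksort(items, &compare)
  return items if items.length <= 1
  pivot = items.first                                  # the element X
  # One bucket per distinct comparison result (Aminus2 ... Aplus2).
  buckets = items.group_by { |item| compare.call(item, pivot) }
  buckets.keys.sort.flat_map do |rank|
    # The 0 bucket holds elements equal to X and needs no sorting.
    rank.zero? ? buckets[rank] : bucket_quicksort(buckets[rank], &compare)
  end
end

# Example comparison: +/-2 when the values differ by more than 10, else +/-1.
rich_compare = lambda do |a, b|
  sign = a <=> b
  (a - b).abs > 10 ? sign * 2 : sign
end

data = [35, 3, 50, 12, 7, 35, 41, 1, 20]
sorted = bucket_quicksort(data, &rich_compare)
puts sorted.inspect
```

Using group_by means the number of buckets is not fixed at five; whatever values the comparison returns become the buckets, which are then concatenated in ascending order of those values.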
It seems like using raindog's modified quicksort would let you stream out results sooner and perhaps page into them faster.
Maybe those features are already available from a carefully-controlled qsort operation? I haven't thought much about it.
This also sounds kind of like radix sort except instead of looking at each digit (or other kind of bucket rule), you're making up buckets from the rich comparisons. I have a hard time thinking of a case where rich comparisons are available but digits (or something like them) aren't.
I can't think of any situation in which this would be really useful. Even if I could, I suspect the added CPU cycles needed to sort fuzzy values would be more than those "extra comparisons" you allude to. But I'll still offer a suggestion.
Consider this possibility (all strings use the 27 characters a-z and _):
            11111111112
   12345678901234567890
1/ now_is_the_time
2/ now_is_never
3/ now_we_have_to_go
4/ aaa
5/ ___
Obviously strings 1 and 2 are more similar than 1 and 3 and much more similar than 1 and 4.
One approach is to scale the difference value for each identical character position and use the first different character to set the last position.
Putting aside signs for the moment, comparing string 1 with 2, they differ in position 8 by 'n' - 't'. That's a difference of 6. In order to turn that into a single digit 1-9, we use the formula:
digit = ceiling(9 * abs(diff) / 26)
since the maximum difference is 26. The minimum difference of 1 becomes the digit 1. The maximum difference of 26 becomes the digit 9. Our difference of 6 becomes 3.
And because the difference is in position 8, our comparison function will return 3x10^-8 (actually it will return the negative of that, since string 1 comes after string 2).
Using a similar process for strings 1 and 4, the comparison function returns -5x10^-1. The highest possible return (strings 4 and 5) has a difference in position 1 of '_' - 'a' (26), which generates the digit 9 and hence gives us 9x10^-1.
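Here is a small Ruby sketch of the position-and-digit step, leaving signs aside as the explanation does. Several things are assumptions made for illustration: the name difference_digit, the alphabet order with "_" after "z", and dividing by the maximum difference of 26, which is the divisor that reproduces the digits 3, 5 and 9 worked out above.

```ruby
# Assumed alphabet: a-z followed by "_", so that "_" minus "a" is 26.
ALPHABET = ("a".."z").to_a.join + "_"

# Hypothetical helper: returns [position, digit] for the first difference,
# with positions counted from 1, or nil if no compared position differs.
def difference_digit(first, second)
  length = [first.length, second.length].min
  index = (0...length).find { |i| first[i] != second[i] }
  return nil if index.nil?
  diff = (ALPHABET.index(first[index]) - ALPHABET.index(second[index])).abs
  digit = Rational(9 * diff, 26).ceil   # scale 1..26 onto the digits 1..9
  [index + 1, digit]
end

puts difference_digit("now_is_the_time", "now_is_never").inspect
puts difference_digit("now_is_the_time", "aaa").inspect
puts difference_digit("aaa", "___").inspect
```

A full comparison function would then combine the pair into digit x 10^-position and attach a sign; what to do when one string is a prefix of the other is left open here, as it is in the description.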
Take these suggestions and use them as you see fit. I'd be interested in knowing how your fuzzy comparison code ends up working out.
Considering you are looking to order a number of items based on human comparison, you might want to approach this problem like a sports tournament. You might allow each human vote to increase the score of the winner by 3 and decrease the loser's by 3, or +2 and -2, or +1 and -1, or just 0 and 0 for a draw.
Then you just do a regular sort based on the scores.
Another alternative would be a single or double elimination tournament structure.
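A minimal sketch of the scoring idea, assuming each vote is recorded as a [winner, loser, margin] triple with a margin from 0 to 3. The method name, the vote format and the sample items are all made up for illustration.

```ruby
# Hypothetical helper: turns margin-weighted votes into a ranking.
def rank_by_votes(votes)
  scores = Hash.new(0)
  votes.each do |winner, loser, margin|
    scores[winner] += margin   # the winner gains the margin
    scores[loser]  -= margin   # the loser gives up the same amount
  end
  # A regular sort on the scores, highest first; names break ties.
  scores.sort_by { |name, score| [-score, name] }.map(&:first)
end

votes = [
  ["tea", "coffee", 3],    # strong preference
  ["tea", "water", 1],     # slight preference
  ["water", "coffee", 2],
  ["coffee", "tea", 0]     # a draw changes nothing
]
ranking = rank_by_votes(votes)
puts ranking.inspect
```

The fine-grained comparison results are folded into one score per item, so the final step is an ordinary sort that needs no special comparison function.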
You can use two comparisons, to achieve this. Multiply the more important comparison by 2, and add them together.
Here is an example of what I mean in Perl.
It compares two array references by the first element, then by the second element.
use strict;
use warnings;
use 5.010;
my @array = (
  [a => 2],
  [b => 1],
  [a => 1],
  [c => 0]
);
say "$_->[0] => $_->[1]" for sort {
  ($a->[0] cmp $b->[0]) * 2 +
  ($a->[1] <=> $b->[1]);
} @array;
a => 1
a => 2
b => 1
c => 0
You could extend this to any number of comparisons very easily.
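The same trick carries over to Ruby more or less directly, since sort only looks at the sign of the block's result. This is a sketch of the idea rather than a recommendation; with three keys the weights would be 4, 2 and 1.

```ruby
pairs = [["a", 2], ["b", 1], ["a", 1], ["c", 0]]

sorted = pairs.sort do |x, y|
  # The letter comparison is weighted by 2, so the number comparison
  # (at most 1 in magnitude) can only decide ties between letters.
  (x[0] <=> y[0]) * 2 + (x[1] <=> y[1])
end

sorted.each { |letter, number| puts "#{letter} => #{number}" }
```

In everyday Ruby the more common spelling is probably pairs.sort_by { |letter, number| [letter, number] }, which gives the same order because arrays compare element by element.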
Perhaps there's a good reason to do this but I don't think it beats the alternatives for any given situation and certainly isn't good for general cases. The reason? Unless you know something about the domain of the input data and about the distribution of values you can't really improve over, say, quicksort. And if you do know those things, there are often ways that would be much more effective.
Anti-example: suppose your comparison returns a value of "huge difference" for numbers differing by more than 1000, and that the input is {0, 10000, 20000, 30000, ...}
Anti-example: same as above but with input {0, 10000, 10001, 10002, 20000, 20001, ...}
But, you say, I know my inputs don't look like that! Well, in that case tell us what your inputs really look like, in detail. Then someone might be able to really help.
For instance, once I needed to sort historical data. The data was kept sorted. When new data were added it was appended, then the list was run again. I did not have the information of where the new data was appended. I designed a hybrid sort for this situation that handily beat qsort and others by picking a sort that was quick on already sorted data and tweaking it to be fast (essentially switching to qsort) when it encountered unsorted data.
The only way you're going to improve over the general purpose sorts is to know your data. And if you want answers you're going to have to communicate that here very well.

Resources