Get index of array element faster than O(n) - ruby

Given a HUGE array and a value from it, I want to get the index of that value in the array. Is there any way to do this other than calling Array#index? The problem comes from the need to keep a really huge array and call Array#index an enormous number of times.
After a couple of tries I found that caching indexes inside the elements, by storing structs with (value, index) fields instead of the value itself, gives a huge boost in performance (a 20x win).
Still, I wonder if there's a more convenient way of finding the index of an element without caching (or whether there's a good caching technique that will boost performance).

Why not use index or rindex?
array = %w( a b c d e)
# get FIRST index of element searched
puts array.index('a')
# get LAST index of element searched
puts array.rindex('a')

Convert the array into a hash. Then look for the key.
array = ['a', 'b', 'c']
hash = Hash[array.each_with_index.to_a] # => {"a"=>0, "b"=>1, "c"=>2}
hash['b'] # => 1

Other answers don't take into account the possibility of an entry listed multiple times in an array. This will return a hash where each key is a unique object in the array and each value is an array of indices that corresponds to where the object lives:
a = [1, 2, 3, 1, 2, 3, 4]
=> [1, 2, 3, 1, 2, 3, 4]
indices = a.each_with_index.inject(Hash.new { Array.new }) do |hash, (obj, i)|
  hash[obj] += [i]
  hash
end
=> { 1 => [0, 3], 2 => [1, 4], 3 => [2, 5], 4 => [6] }
This allows for a quick search for duplicate entries:
indices.select { |k, v| v.size > 1 }
=> { 1 => [0, 3], 2 => [1, 4], 3 => [2, 5] }
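A shorter way to build the same kind of lookup, offered as a sketch rather than a drop-in replacement, is to group the indices by the element they point to. The variable names below are illustrative only.

```ruby
a = [1, 2, 3, 1, 2, 3, 4]

# Group every index by the element stored at that index.
indices = a.each_index.group_by { |i| a[i] }
# indices => {1=>[0, 3], 2=>[1, 4], 3=>[2, 5], 4=>[6]}

# Keep only the elements that occur more than once.
duplicates = indices.select { |_, positions| positions.size > 1 }
# duplicates => {1=>[0, 3], 2=>[1, 4], 3=>[2, 5]}
```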

Is there a good reason not to use a hash? Lookups are O(1) vs. O(n) for the array.
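To make the trade-off concrete, here is a minimal sketch that builds the lookup table once and then answers many lookups from it. It assumes the elements are unique and that the array does not change after the table is built; index_of is a name invented for this example.

```ruby
array = %w[a b c d e]

# One O(n) pass to build the value => index table.
index_of = array.each_with_index.to_h

# Each lookup afterwards is a hash access instead of a linear scan.
index_of['d']    # => 3
index_of['zzz']  # => nil, the same result Array#index gives for a missing value
```

The table costs extra memory proportional to the array, so it only pays off when the number of lookups is large.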

If your array has a natural order use binary search.
Use binary search.
Binary search has O(log n) access time.
Here are the steps for using binary search:
What is the ordering of your array? For example, is it sorted by name?
Use bsearch to find elements or indices.
Code example:
# assume array is sorted by name!
array.bsearch { |each| "Jamie" <=> each.name } # returns element
(0...array.size).bsearch { |n| "Jamie" <=> array[n].name } # returns index
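For a version that can be run as-is, the sketch below fills in the pieces the answer leaves implicit. The Person struct and the sample names are assumptions made for illustration; the only requirement is that the array is already sorted by the key being searched.

```ruby
Person = Struct.new(:name)

# Must be sorted by name for binary search to work.
people = %w[Alice Bob Jamie Zed].map { |n| Person.new(n) }

# Find-any mode: the block returns 0 on a match, a positive number to
# search to the right, and a negative number to search to the left.
found = people.bsearch { |person| "Jamie" <=> person.name }

# Searching the index range instead returns the position.
# Note the exclusive range, so people[i] is never nil.
index = (0...people.size).bsearch { |i| "Jamie" <=> people[i].name }

# A name that is not present yields nil.
missing = (0...people.size).bsearch { |i| "Nobody" <=> people[i].name }
```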

If it's a sorted array you could use a Binary search algorithm (O(log n)). For example, extending the Array-class with this functionality:
class Array
  # Returns the index of value in a sorted array, or nil if it is absent.
  def b_search(value, lower_index = 0, upper_index = length - 1)
    return if lower_index > upper_index
    midpoint_index = (lower_index + upper_index) / 2
    return midpoint_index if self[midpoint_index] == value
    if value < self[midpoint_index]
      b_search(value, lower_index, midpoint_index - 1)
    else
      b_search(value, midpoint_index + 1, upper_index)
    end
  end
end

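Taking a combination of @sawa's answer and the comment listed there, you could implement a "quick" index and rindex on the Array class. Note that both methods rebuild the hash on every call, so they only pay off if the hash is cached and reused.
class Array
  def quick_index(el)
    # pairs are reversed so the first occurrence of a duplicate wins
    hash = Hash[each_with_index.to_a.reverse]
    hash[el]
  end
  def quick_rindex(el)
    # later pairs overwrite earlier ones, so the last occurrence wins
    hash = Hash[each_with_index.to_a]
    hash[el]
  end
end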

"Still, I wonder if there's a more convenient way of finding the index of an element without caching (or whether there's a good caching technique that will boost performance)."
You can use binary search (if your array is ordered and the values you store in the array are comparable in some way). For that to work you need to be able to tell the binary search whether it should be looking "to the left" or "to the right" of the current element. But I believe there is nothing wrong with storing the index at insertion time and then using it if you are getting the element from the same array.
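Storing the index at insertion time could look roughly like the sketch below. IndexedList is a hypothetical wrapper written for this example, not an existing class, and it assumes the list is append-only: deleting or inserting in the middle would invalidate the recorded positions.

```ruby
# Hypothetical append-only list that remembers where each value was stored.
class IndexedList
  def initialize
    @items = []
    @positions = {} # value => index of its first occurrence
  end

  # Append a value and record its index at insertion time.
  def <<(value)
    @positions[value] = @items.length unless @positions.key?(value)
    @items << value
    self
  end

  # Hash lookup instead of the linear scan done by Array#index.
  def index(value)
    @positions[value]
  end

  def [](i)
    @items[i]
  end
end

list = IndexedList.new
list << 'x' << 'y' << 'x'
list.index('y') # => 1
list.index('x') # => 0 (first occurrence, like Array#index)
```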


Find indices of array elements that fulfill a condition

I have an array, and I need an array of subscripts of the original array's elements that satisfy a certain condition.
map doesn't work because it yields an array of the same size. select doesn't work because it yields references to the individual array elements, not their indices. I came up with the following solution:
my_array.each_with_index.map {|elem,i| cond(elem) ? i : nil}.compact
If the array is large and only a few elements fulfill the condition, another possibility would be
index_array = []
my_array.each_with_index {|elem,i| index_array << i if cond(elem)}
Both work, but isn't there a simpler way?
Nope, there is nothing built in or much simpler than what you already have.
my_array.each_with_index.with_object([]) do |(elem, idx), indices|
  indices << idx if cond(elem)
end
Another possible alternative:
my_array.each_with_index.select {|elem, _| cond(elem) }.map(&:last)
You can use Array#each_index with select:
arr = [1, 2, 3, 4]
arr.each_index.select {|i| arr[i].odd? }
# => [0, 2]
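If this pattern comes up often, it can be wrapped in a small helper. The sketch below is one possible shape; indices_where is a made-up name, and the condition is passed as a block rather than through a cond method.

```ruby
# Hypothetical helper: indices of the elements for which the block is truthy.
def indices_where(array)
  array.each_index.select { |i| yield array[i] }
end

odd_positions  = indices_where([1, 2, 3, 4], &:odd?)
# odd_positions => [0, 2]

long_positions = indices_where(%w[apple fig kiwi]) { |s| s.size > 3 }
# long_positions => [0, 2]
```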

Merge sort algorithm using recursion

I'm doing The Odin Project. The practice problem is: create a merge sort algorithm using recursion. The following is modified from someone's solution:
def merge_sort(arry)
  # kick out the odds or kick out of the recursive splitting?
  # I wasn't able to get the recombination to work within the same method.
  return arry if arry.length == 1
  arry1 = merge_sort(arry[0...arry.length/2])
  arry2 = merge_sort(arry[arry.length/2..-1])
  f_arry = []
  index1 = 0 # placekeeper for iterating through arry1
  index2 = 0 # placekeeper for iterating through arry2
  # stops when f_arry is as long as combined subarrays
  while f_arry.length < (arry1.length + arry2.length)
    if index1 == arry1.length
      # pushes remainder of arry2 to f_arry
      # not sure why it needs to be flatten(ed)!
      (f_arry << arry2[index2..-1]).flatten!
    elsif index2 == arry2.length
      (f_arry << arry1[index1..-1]).flatten!
    elsif arry1[index1] <= arry2[index2]
      f_arry << arry1[index1]
      index1 += 1
    else
      f_arry << arry2[index2]
      index2 += 1
    end
  end
  return f_arry
end
Is the first line return arry if arry.length == 1 kicking it out of the recursive splitting of the array(s) and then bypassing the recursive splitting part of the method to go back to the recombination section? It seems like it should then just keep resplitting it once it gets back to that section as it recurses through.
Why must it be flatten-ed?
The easiest way to understand the first line is to understand that the only contract that merge_sort is bound to is to "return a sorted array" - if the array has only one element (arry.length == 1) it is already sorted - so nothing needs to be done! Simply return the array itself.
In recursion, this is known as a "Stop condition". If you don't provide a stop condition - the recursion will never end (since it will always call itself - and never return)!
The reason you need to flatten your result is that you are pushing an array as an element into your resulting array:
arr = [1]
arr << [2, 3]
# => [1, [2, 3]]
If you try to flatten the resulting array only at the end of the iteration, and not as you are adding the elements, you'll have a problem, since its length will be skewed:
arr = [1, [2, 3]]
arr.length
# => 2
Although arr contains three numbers it has only two elements - and that will break your solution.
You want all the elements in your array to be numbers, not arrays. flatten! makes sure that all elements in your array are atoms, and if they are not, it adds the child array's elements to itself instead of the child array:
arr.flatten!
# => [1, 2, 3]
Another option you might want to consider (and which will be more efficient) is to use concat instead:
arr = [1]
arr.concat([2, 3])
# => [1, 2, 3]
This method adds all the elements of the array passed as a parameter to the array it is called on.
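Putting the stop condition and concat together, a merge sort might look like the sketch below. It is one possible rewrite, not the original poster's solution: merge_sort_concat is a name made up here, and the stop condition is widened to length <= 1 so that an empty array also returns immediately.

```ruby
def merge_sort_concat(arry)
  return arry if arry.length <= 1 # stop condition: nothing left to sort

  mid   = arry.length / 2
  left  = merge_sort_concat(arry[0...mid])
  right = merge_sort_concat(arry[mid..-1])

  merged = []
  i = j = 0
  # Take the smaller head element until one side runs out.
  while i < left.length && j < right.length
    if left[i] <= right[j]
      merged << left[i]
      i += 1
    else
      merged << right[j]
      j += 1
    end
  end
  # Append the leftovers; concat adds the elements, not a nested array.
  merged.concat(left[i..-1]).concat(right[j..-1])
end

sorted = merge_sort_concat([5, 3, 1, 4, 2])
# sorted => [1, 2, 3, 4, 5]
```

Because concat never nests arrays, the length of the result is always the number of values in it, so no flatten! is needed.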

Efficient way of removing similar arrays in an array of arrays

I am trying to analyze some documents and find similarities in them. After analysis, I have an array, the elements of which are arrays of data from documents considered similar. But sometimes I have two nearly identical elements, and naturally I want to keep the biggest of them. For simplification:
data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6]...]
How do I efficiently process the data so that I get:
data = [[1,2,3,4,5,6], [7,8,9,10]...]
I suppose I could intersect every pair of arrays, and if the intersection matches one of the original arrays, I ignore that array. Here is some quick code I wrote:
data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6], [7,9,10]]
cleaned = []
data.each_index do |i|
  similar = false
  data.each_index do |j|
    if i == j
      next
    elsif data[i] & data[j] == data[i]
      similar = true
      break
    end
  end
  unless similar
    cleaned << data[i]
  end
end
puts cleaned.inspect
Is this an efficient way to go? Also, the current behaviour only leaves out arrays that are a few elements short, and I might want to merge similar arrays if they occur:
[[1,2,3,4,5], [1,3,4,5,6]] => [[1,2,3,4,5,6]]
You can delete any element in the list if it is fully contained in another element:
data.delete_if do |arr|
  data.any? { |a2| !a2.equal?(arr) && arr - a2 == [] }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
This is a bit more efficient than your suggestion since once you decide that an element should be removed, you don't check against it in the next iterations.
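For larger inputs, a variation worth considering is to compare sets instead of subtracting arrays, and to visit the largest arrays first so that each candidate is only checked against arrays already kept. The sketch below is an assumption-laden illustration: drop_contained is a hypothetical name, it does not mutate its input, and it returns the survivors ordered by size rather than in their original order.

```ruby
require 'set'

# Hypothetical helper: drop every array whose elements all appear in a larger kept array.
def drop_contained(arrays)
  kept = [] # pairs of [original array, its Set]
  arrays.sort_by { |a| -a.size }.each do |arr|
    set = arr.to_set
    # Keep arr only if no already-kept array contains all of its elements.
    kept << [arr, set] unless kept.any? { |_, bigger| set.subset?(bigger) }
  end
  kept.map(&:first)
end

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10], [1, 2, 3, 5, 6], [7, 9, 10]]
cleaned = drop_contained(data)
# cleaned => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
```

This still compares each array against the kept ones, so the worst case remains quadratic in the number of arrays; the gain is that each comparison is a set membership test.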

Can you search an array by keyword?

How do you take an array like ["foo",1,2,3] and turn it into something that can quickly be searched by keyword "foo"?
I'm trying to take a csv file, and sort/filter it based on a condition. For example, given the following csv and criteria:
@criteria = ["foobar", "foo"]
the output should be the following (order is important):
I'm using a nested loop to check every item in @criteria against the first element of every column of the csv.
require 'csv'
@criteria = ["foobar", "foo"]
@newcsv = []
csv = CSV.read("./foo.csv", headers: true, return_headers: false)
csv = csv.to_a.transpose
@criteria.each do |n|
  csv.each do |i|
    if i[0] == n
      @newcsv << i
    end
  end
end
@newcsv = @newcsv.transpose
CSV.open("./transpose.csv", "wb") do |lines|
  @newcsv.each { |line| lines << line }
end
It works on small matrices, but I'm sure it won't scale. I'm wondering if a hash might give me better performance. How can I only get the rows in @criteria without using a nested loop?
So this answer was posted by another user and later deleted because he or she "hated it", but I think it at least adds some useful information to the original poster, so I'm reposting it here.
Note that I'm not sure if this code has asymptotic performance that's faster than O(n^2) for an n * n matrix, but the original author disagreed with me. Here, at least, is my reasoning:
If you have an n * n matrix, and you have n - 1 criteria, then wouldn't creating indices take in the worst-case n-1 + n-2 + .. + 2 + 1 = O(n^2) steps, depending on how the criteria and columns of the matrix are sorted?
And then you still end up needing to collect n(n - 1) cells, even if it is by constant-time array index access.
That was my reasoning at least. Maybe I am wrong. If I am, please explain how so, and what the correct asymptotic runtime complexity of the code below is!
Answer from Original Author
Scanning an array for an element is inefficient, but once you have an index, looking up the element at that index is fast.
Given the header line header = ["foo", "bar", "foobar"] and @criteria = ["foobar", "foo"], you can convert them into indices:
indices = @criteria.map{|column| header.index(column)}
# => [2, 0]
Then, using indices, you can map the rows:
[
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
].map{|row| row.values_at(*indices)}
which gives:
[
  [3, 1],
  [6, 4],
  [9, 7],
]
This way, the majority of the computational cost lies in creating indices, which is done only once, so the time spent on it is negligible. All the rest is element lookup by index, whose cost is small, contrary to what a commenter suggests.
Here is some example code using the above methods:
require 'csv'
@criteria = ['foobar', 'foo']
table = CSV.read('./foo.csv', headers: true)
indices = @criteria.map { |column| table.headers.index(column) }
table.map { |row| row.values_at(*indices) }
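To try the approach without a file on disk, the sketch below parses an inline string instead. The CSV text is sample data assumed for illustration, chosen to match the header and rows used in the explanation above; note that parsed fields come back as strings.

```ruby
require 'csv'

# Sample data assumed for this example; the real input would be ./foo.csv.
csv_text = "foo,bar,foobar\n1,2,3\n4,5,6\n7,8,9\n"
criteria = %w[foobar foo]

table = CSV.parse(csv_text, headers: true)

# Resolve each wanted column name to its position once.
indices = criteria.map { |column| table.headers.index(column) }

# Then pick those positions out of every row, in the order of criteria.
rows = table.map { |row| row.values_at(*indices) }
# rows => [["3", "1"], ["6", "4"], ["9", "7"]]
```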

Find the largest value for an array of hashes with common keys?

I have two arrays, each containing any number of hashes with identical keys but differing values:
ArrayA = [{value: "abcd", value_length: 4, type: 0},{value: "abcdefgh", value_length: 8, type: 1}]
ArrayB = [{value: "ab", value_length: 2, type: 0},{value: "abc", value_length: 3, type: 1}]
Although each array can hold any number of hashes, both arrays will always hold the same number.
How could I find the largest :value_length for every hash whose value is of a certain type?
For instance, the largest :value_length for a hash with a :type of 0 would be 4. The largest :value_length for a hash with a :type of 1 would be 8.
I just can't get my head around this problem.
A simple way:
all = ArrayA + ArrayB # Add them together if you want to search both arrays.
all.select{|x| x[:type] == 0}
   .max_by{|x| x[:value_length]}
And if you wanna reuse it just create a function:
def find_max_of_my_array(arr,type)
  arr.select{|x| x[:type] == type}
     .max_by{|x| x[:value_length]}
end
p find_max_of_my_array(ArrayA, 0) # => {:value=>"abcd", :value_length=>4, :type=>0}
I'm not totally sure what output you want, but try this. I assume the arrays are ordered so that ArrayA[x][:type] == ArrayB[x][:type] and that you are looking for the max between (ArrayA[x], ArrayB[x]), not across the whole array. If that is not the case, then the other solutions that concatenate the two arrays first will work great.
filtered_by_type = ArrayA.zip(ArrayB).select{|x| x[0][:type] == type }
filtered_by_type.map {|a| a.max_by {|x| x[:value_length] } }
Here's how I approached it: You're looking for the maximum of something, so the Array#max method will probably be useful. You want the actual value itself, not the containing hash, so that gives us some flexibility. Getting comfortable with the functional programming style helps here. In my mind, I can see how select, map, and max fit together. Here's my solution which, as specified, returns the number itself, the maximum value:
def largest_value_length(type, hashes)
  # Taking it slowly
  right_type_hashes = hashes.select{|h| h[:type] == type}
  value_lengths = right_type_hashes.map{|h| h[:value_length]}
  maximum = value_lengths.max
  # Or, in one line:
  # hashes.select{|h| h[:type] == type}.map{|h| h[:value_length]}.max
end
puts largest_value_length(1, ArrayA + ArrayB)
=> 8
You can also sort after filtering by type. That way you can get smallest, second largest etc.
all = ArrayA + ArrayB
all = all.select { |element| element[:type] == 1 }
         .sort_by { |k| k[:value_length] }.reverse
puts all[0][:value_length]
puts all[all.length-1][:value_length]
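If the maximum is needed for every type at once, one option is to group the hashes by :type and reduce each group. This is a sketch using the sample data from the question; the lowercase variable names are chosen here to avoid redefining the constants, and transform_values requires Ruby 2.4 or later.

```ruby
array_a = [{ value: "abcd", value_length: 4, type: 0 }, { value: "abcdefgh", value_length: 8, type: 1 }]
array_b = [{ value: "ab", value_length: 2, type: 0 }, { value: "abc", value_length: 3, type: 1 }]

# Group all hashes by type, then keep the largest value_length per group.
max_length_by_type = (array_a + array_b)
  .group_by { |h| h[:type] }
  .transform_values { |hashes| hashes.map { |h| h[:value_length] }.max }
# max_length_by_type => {0=>4, 1=>8}
```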
