Most efficient way to compile unique values in a massive text file?

I have a set of large text files that in total contain about 3 million rows.
What I want to do is pluck a value from a given column from each row and add it to an array in memory. If the value already exists in the array, then ignore it.
I'm assuming the fastest way is NOT:
Read a value
If it doesn't already exist (checked with Array's native index or a similar method), push it to the array
Should I be inserting the value in alphabetical order to speed up the match/search?
OR should I keep multiple arrays...for example, one for each letter of the alphabet?

Use Set:
Set implements a collection of unordered values with no duplicates. This is a hybrid of Array's intuitive inter-operation facilities and Hash's fast lookup.
Example usage:
require 'set'
set = Set.new
set << 1 << 2 << 3 # => #<Set: {1, 2, 3}>
set << 2 # => #<Set: {1, 2, 3}>
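Applied to the question, a minimal sketch could look like the following. The helper name unique_column_values, the comma separator, and the zero-based column index are assumptions for illustration; the question does not describe the file format.

```ruby
require 'set'

# Hypothetical helper: collects the distinct values found in one column.
# `lines` is any enumerable of strings (for example File.foreach(path)),
# `column` is a zero-based index, and `separator` is an assumed delimiter.
def unique_column_values(lines, column:, separator: ',')
  seen = Set.new
  lines.each do |line|
    value = line.chomp.split(separator)[column]
    seen << value unless value.nil? # Set#<< ignores values it already holds
  end
  seen
end

rows = ["1,apple,red\n", "2,pear,green\n", "3,apple,green\n"]
p unique_column_values(rows, column: 1).to_a # prints ["apple", "pear"]
```

For files on disk, passing File.foreach(path) as lines streams one line at a time, so only the unique values stay in memory.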

You could add the values as keys to a hash map, that would take care of removing duplicates automatically. You could even count the number of times each value occurs this way (with the hash value).
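A rough sketch of that idea, assuming comma-separated rows and a zero-based column index (the helper name column_value_counts is made up for this example):

```ruby
# Hypothetical helper: counts how often each value appears in one column.
# `lines` is any enumerable of strings; `separator` is an assumed delimiter.
def column_value_counts(lines, column:, separator: ',')
  counts = Hash.new(0) # missing keys start at zero
  lines.each do |line|
    value = line.chomp.split(separator)[column]
    counts[value] += 1 unless value.nil?
  end
  counts
end

rows = ["1,apple,red\n", "2,pear,green\n", "3,apple,green\n"]
counts = column_value_counts(rows, column: 1)
p counts.keys        # prints ["apple", "pear"]
puts counts['apple'] # prints 2
```

The keys are the unique values, and the hash values carry the occurrence counts.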

Related

Maintaining Ruby Set by object ID

I'm developing an algorithm in Ruby with the following properties:
It works on two objects of type Set, where each element is an Array, where all elements are of type String
Each Array involved has the same number of elements
No two arrays have the same content (when compared with ==)
The algorithm involves many operations of moving an array from one Set to the other (or back), storing references to certain Arrays, and testing whether or not that reference is part of a Set
There is no duplication of the Arrays; every Array keeps its object ID the whole time.
A naive implementation would do something like this (to give you the idea); in practice, the arrays here have longer strings and more elements:
# Set up all Arrays involved
master=[
%w(a b c d),
%w(a b c x),
%w(u v w y),
# .... and so on
]
# Create initial sets.
x=Set.new
y=Set.new
# ....
x.add(master[0])
x.add(master[2])
y.add(master[1])
# ....
# Operating on the sets.
i=1
# ...
arr=master[i]
# Move element arr from y to x, if it is in y
if y.member?(arr)
y.delete(arr)
x.add(arr)
end
# Do something with the sets
x.each { |arr| pp arr }
This would indeed work, simply because the arrays all differ in content. However, y.member?(arr) tests whether the Set already holds an object with the same array content as arr, while it would be sufficient to test whether it already holds an element with the same object_id, so I'm worried about performance. From my understanding, finding the object ID of an object is cheap, and since it is just a number, maintaining a set of numbers is more performant than maintaining a set of arrays of strings.
Therefore I could try to define my two sets as sets of object_id, and membership test would be faster. However when iterating over a Set, using the object_id to find the array itself is expensive (I would have to search ObjectSpace).
Another possibility would be to not maintain the set of arrays, but the set of indexes into my master array. My code would then be, for example,
x.add(0) # instead of x.add(master[0])
and iterating over a Set would be, e.g.
x.each { |i| pp master[i] }
I wonder whether there is a better way - for instance that we can somehow "teach" Set.new to use object identity for maintaining its members, instead of equality.

I think you’re looking for Set#compare_by_identity, which makes the set compare its elements by object identity (i.e. object ID) rather than by equality.
x = Set.new
x.compare_by_identity
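To illustrate the difference, here is a small sketch comparing a default Set with an identity-based one. The variable names are only for illustration.

```ruby
require 'set'

a = %w[a b c d]
b = %w[a b c d] # equal content, but a different object

# Default behaviour: membership is decided by the content of the array.
by_value = Set.new
by_value << a
puts by_value.include?(b) # prints true

# Identity-based: only the very same object counts as a member.
by_identity = Set.new.compare_by_identity
by_identity << a
puts by_identity.include?(a) # prints true
puts by_identity.include?(b) # prints false
```

Because an identity-based set never has to hash or compare the array contents, membership tests do not depend on how long the arrays or their strings are.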

How can I return hash pairs of keys that sum up to less than a maximum value?

Given this hash:
numsHash = {5=>10, 3=>9, 4=>7, 2=>5, 20=>4}
How can I return the key-value pair of this hash if and when the sum of its keys would be under or equal to a maximum value such as 10?
The expected result would be something like:
newHash = { 5=>10, 3=>9, 2=>5 }
because the sum of these keys equals 10.
I've been obsessing with this for hours now and can't find anything that leads up to a solution.

Summary
In the first section, I provide some context and a well-commented working example of how to solve the defined knapsack problem in a matter of microseconds using a little brute force and some Ruby core classes.
In the second section, I refactor and expand on the code to demonstrate the conversion of the knapsack solution into output similar to what you want, although (as explained and demonstrated in the answer below) the correct output when there are multiple results must be a collection of Hash objects rather than a single Hash unless there are additional selection criteria not included in your original post.
Please note that this answer uses syntax and classes from Ruby 3.0, and was specifically tested against Ruby 3.0.3. While it should work on Ruby 2.7.3+ without changes, and with most currently-supported Ruby 2.x versions with some minor refactoring, your mileage may vary.
Solving the Knapsack Problem with Ruby Core Methods
This seems to be a variant of the knapsack problem, where you're trying to optimize filling a container of a given size. This is actually a complex problem that is NP-complete, so a real-world application of this type will have many different solutions and possible algorithmic approaches.
I do not claim that the following solution is optimal or suitable for general purpose solutions to this class of problem. However, it works very quickly given the provided input data from your original post.
Its suitability is primarily based on the fact that you have a fairly small number of Hash keys, and the built-in Ruby 3.0.3 core methods of Array#permutation and Enumerable#sum are fast enough to solve this particular problem in anywhere from 44-189 microseconds on my particular machine. That seems more than sufficiently fast for the problem as currently defined, but your mileage and real objectives may vary.
# This is the size of your knapsack.
MAX_VALUE = 10
# It's unclear why you need a Hash or what you plan to do with the values of the
# Hash, but that's irrelevant to the problem. For now, just grab the keys.
#
# NB: You have to use hash rockets or the parser complains about using an
# Integer as a Symbol using the colon notation and raises SyntaxError.
nums_hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
keys = nums_hash.keys
# Any individual element above MAX_VALUE won't fit in the knapsack anyway, so
# discard it before permutation.
keys.reject! { _1 > MAX_VALUE }
# Brute force it by evaluating all possible permutations of your array, dropping
# elements from the end of each sub-array until all remaining elements fit.
keys.permutation.map do |permuted_array|
loop { permuted_array.sum > MAX_VALUE ? permuted_array.pop : break }
permuted_array
end
Returning an Array of Matching Hashes
The code above just returns the list of keys that will fit into your knapsack, but per your original post you then want to return a Hash of matching key/value pairs. The problem here is that you actually have more than one set of Hash objects that will fit the criteria, so your collection should actually be an Array rather than a single Hash. Returning only a single Hash would basically return the original Hash minus any keys that exceed your MAX_VALUE, and that's unlikely to be what's intended.
Instead, now that you have a list of keys that fit into your knapsack, you can iterate through your original Hash and use Hash#select to return an Array of unique Hash objects with the appropriate key/value pairs. One way to do this is to use Enumerable#reduce to call Hash#merge on each Hash element in the subarrays to convert the final result to an Array of Hash objects. Next, you should call Array#uniq to remove any Hash that is equivalent except for its internal ordering.
For example, consider this redesigned code:
MAX_VALUE = 10
def possible_knapsack_contents hash
hash.keys.reject { _1 > MAX_VALUE }.permutation.map do |a|
loop { a.sum > MAX_VALUE ? a.pop : break }; a
end.sort
end
def matching_elements_from hash
possible_knapsack_contents(hash).map do |subarray|
subarray.map { |i| hash.select { |k, _| k == i } }.
reduce({}) { _1.merge _2 }
end.uniq
end
hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
matching_elements_from hash
Given the defined input, this would yield 24 hashes if you didn't address the uniqueness issue. However, by calling #uniq on the final Array of Hash objects, this will correctly yield the 7 unique hashes that fit your defined criteria if not necessarily the single Hash you seem to expect:
[{2=>5, 3=>9, 4=>7},
{2=>5, 3=>9, 5=>10},
{2=>5, 4=>7},
{2=>5, 5=>10},
{3=>9, 4=>7},
{3=>9, 5=>10},
{4=>7, 5=>10}]
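If the missing selection criterion turns out to be "the largest key sum that still fits", a different technique gets to the single Hash from the question more directly. The sketch below is not the permutation approach above: it enumerates subsets with Array#combination and keeps only those with the best sum. The helper name best_fitting_pairs is hypothetical, and it assumes positive Integer keys.

```ruby
# Hypothetical helper: returns the key/value pairs of every key subset whose
# sum is the largest one that does not exceed max. Assumes positive Integer keys.
def best_fitting_pairs(hash, max)
  keys = hash.keys.select { |k| k <= max }
  # Every non-empty subset of the remaining keys, smallest subsets first.
  subsets = (1..keys.size).flat_map { |n| keys.combination(n).to_a }
  fitting = subsets.select { |subset| subset.sum <= max }
  best = fitting.map(&:sum).max
  fitting.select { |subset| subset.sum == best }
         .map { |subset| hash.slice(*subset) }
end

nums_hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
result = best_fitting_pairs(nums_hash, 10)
puts result.size            # prints 1
p result.first.keys.sort    # prints [2, 3, 5]
```

Like the permutation version, this is brute force and only reasonable for a small number of keys, since the number of subsets doubles with every key added.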

Passing array of integers to loop, modify the array, and store results in new array. Project Euler #8 in Ruby

I'm working through problem 8 on project Euler and have looked through a bunch of resources. Here is the problem:
"#8 - Find the greatest product of five consecutive digits in the 1000-digit number."
I split the 1000-digit number into an array of strings and converted that to an array of integers.
number = "73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450"
digits = number.split('').reject!{|i| (i=="\n")}
integer_digits = digits.map {|i| i.to_i}
From here, I want to take the first five values, multiply them, and add the resulting value to a new array named "products". I'm trying to remove the first value of the integer_digits array with the .shift method, start the loop over with the second value of the array, store the next product of values [1..5] in the products array, and so on...
getproduct=1
products=[]
loop do
products << integer_digits[0..4].map {|x| (getproduct*=x) }.max
integer_digits.shift
break if integer_digits.length < 5
end
puts products.max
Once the loop went through all the digits, I hoped that I could display the greatest value using the .max method. The code I have returns an empty array...
My question: How do I keep adding the resulting value of the loop to the products array until there are fewer than five integer_digits values left? And will the .max method work once this is done?

This line:
products << integer_digits[0..4].map {|x| (getproduct*=x) }.max
makes very little sense. What you need is:
products << integer_digits.first(5).inject(:*)
However you shouldn't store all the results, you only need the biggest one:
max = 0
while integer_digits.length >= 5
product = integer_digits.first(5).inject(:*)
max = product if product > max
integer_digits.shift
end
puts max #=> 40824
UPDATE:
The reason you are getting an empty array is most likely that you ran the loop twice without regenerating the integer_digits array (which has only 4 elements left after the loop).
Also, as suggested by @MarkThomas, you can use the each_cons method:
integer_digits.each_cons(5).inject(0) {|max, ary| [max, ary.inject(:*)].max }
This has the advantage that it does not modify integer_digits, so you can run it multiple times over the same set of digits.
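As a quick check of that claim, here is the same idea on the first ten digits only. The short digit string is just a stand-in for the full 1000-digit number.

```ruby
# First ten digits of the number from the question, used as a small sample.
digits = '7316717653'.chars.map(&:to_i)

# Product of every window of five consecutive digits, then the largest one.
best = digits.each_cons(5).map { |window| window.reduce(:*) }.max

puts best          # prints 1764
puts digits.length # prints 10
```

The digits array still has all ten elements afterwards, whereas the shift-based loop would have consumed it.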

How to custom sort with ruby

I've got an array of objects which I pull from the database. But I can sort them only in ascending or descending order from database, however I need them in custom order.
Let's say I have an array of objects from db :
arr = [obj1,obj2,obj3]
where obj1 has id 1, obj2 has id 2 and obj3 has id 3
but my sort order would be 3,1,2, or I'd have some array of ids which would dictate the order, e.g. [3,1,2]
So the order of custom sorting would be :
arr = [obj3,obj1,obj2]
I've tried :
arr.sort_by{|a,b| [3,1,2]}
I've been reading some tutorials and links about sorting and it's mostly simple sorting. So how would one achieve the custom sorting described above?

You're close. [3,1,2] specifies an ordering, but it doesn't tell the block how to relate it to your objects. You want something like:
arr.sort_by {|obj| [3,1,2].index(obj.id) }
So the comparison will order your objects sequentially by the position of their id in the array.
Or, to use the more explicit sort (which you seem to have sort_by slightly confused with):
arr.sort do |a,b|
ordering = [3,1,2]
ordering.index(a.id) <=> ordering.index(b.id)
end

This is like @Chuck's answer, but with O(n log n) performance.
# the fixed ordering
ordering = [3, 1, 2]
# a map from each id to its position in the ordering
ordering_index = ordering.each_with_index.to_h
# a fast version of the block
arr.sort_by{|obj| ordering_index[obj.id]}
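A self-contained sketch of the lookup-table version follows. The Item struct is a hypothetical stand-in for the database objects, and sending ids that are missing from the ordering to the end is an assumption, not something the question asks for.

```ruby
# Hypothetical stand-in for the objects loaded from the database.
Item = Struct.new(:id, :name)
arr = [Item.new(1, 'one'), Item.new(2, 'two'), Item.new(3, 'three')]

ordering = [3, 1, 2]
# Maps each id to its position: 3 to 0, 1 to 1, 2 to 2.
position = ordering.each_with_index.to_h

# Ids that are missing from the ordering sort after everything else.
sorted = arr.sort_by { |obj| position.fetch(obj.id, ordering.size) }
p sorted.map(&:id) # prints [3, 1, 2]
```

Building the hash once makes each lookup constant time, instead of scanning the ordering array with index for every element.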

Can't sort table with associative indexes

Why can't I use table.sort to sort tables with associative indexes?

In general, Lua tables are pure associative arrays. There is no "natural" order other than as a side effect of the particular hash table implementation used in the Lua core. This makes sense because values of any Lua data type (other than nil) can be used as both keys and values; but only strings and numbers have any kind of sensible ordering, and then only between values of like type.
For example, what should the sorted order of this table be:
unsortable = {
answer=42,
[true]="Beauty",
[function() return 17 end] = function() return 42 end,
[math.pi] = "pi",
[ {} ] = {},
12, 11, 10, 9, 8
}
It has one string key, one boolean key, one function key, one non-integral key, one table key, and five integer keys. Should the function sort ahead of the string? How do you compare the string to a number? Where should the table sort? And what about userdata and thread values which don't happen to appear in this table?
By convention, values indexed by sequential integers beginning with 1 are commonly used as lists. Several functions and common idioms follow this convention, and table.sort is one example. Functions that operate over lists usually ignore any values stored at keys that are not part of the list. Again, table.sort is an example: it sorts only those elements that are stored at keys that are part of the list.
Another example is the # operator. For the above table, #unsortable is 5 because unsortable[5] ~= nil and unsortable[6] == nil. Notice that the value stored at the numeric index math.pi is not counted even though pi is between 3 and 4 because it is not an integer. Furthermore, none of the other non-integer keys are counted either. This means that a simple for loop can iterate over the entire list:
for i = 1,#unsortable do
print(i,unsortable[i])
end
Although that is often written as
for i,v in ipairs(unsortable) do
print(i,v)
end
In short, Lua tables are unordered collections of values, each indexed by a key; but there is a special convention for sequential integer keys beginning at 1.
Edit: For the special case of non-integral keys with a suitable partial ordering, there is a work-around involving a separate index table. The described content of tables keyed by string values is a suitable example for this trick.
First, collect the keys in a new table, in the form of a list. That is, make a table indexed by consecutive integers beginning at 1 with keys as values and sort that. Then, use that index to iterate over the original table in the desired order.
For example, here is foreachinorder(), which uses this technique to iterate over all values of a table, calling a function for each key/value pair, in an order determined by a comparison function.
function foreachinorder(t, f, cmp)
-- first extract a list of the keys from t
local keys = {}
for k,_ in pairs(t) do
keys[#keys+1] = k
end
-- sort the keys according to the function cmp. If cmp
-- is omitted, table.sort() defaults to the < operator
table.sort(keys,cmp)
-- finally, loop over the keys in sorted order, and operate
-- on elements of t
for _,k in ipairs(keys) do
f(k,t[k])
end
end
It constructs an index, sorts it with table.sort(), then loops over each element in the sorted index and calls the function f for each one. The function f is passed the key and value. The sort order is determined by an optional comparison function which is passed to table.sort. It is called with two elements to compare (the keys to the table t in this case) and must return true if the first is less than the second. If omitted, table.sort uses the built-in < operator.
For example, given the following table:
t1 = {
a = 1,
b = 2,
c = 3,
}
then foreachinorder(t1,print) prints:
a 1
b 2
c 3
and foreachinorder(t1,print,function(a,b) return a>b end) prints:
c 3
b 2
a 1

You can only sort tables with consecutive integer keys starting at 1, i.e., lists. If you have another table of key-value pairs, you can make a list of pairs and sort that:
function sortpairs(t, lt)
local u = { }
for k, v in pairs(t) do table.insert(u, { key = k, value = v }) end
table.sort(u, lt)
return u
end
Of course this is useful only if you provide a custom ordering (lt) which expects as arguments key/value pairs.
This issue is discussed at greater length in a related question about sorting Lua tables.

Because they don't have any order in the first place. It's like trying to sort a garbage bag full of bananas.
