So far, I have this code that reads a file and sorts it using Ruby. But this doesn't sort the numbers correctly and I think it will be inefficient, given that the file can be as big as 200GB and contains a number on each line. Can you suggest what else to do?
File.open("topN.txt", "w") do |file|
File.readlines("N.txt").sort.reverse.each do |line|
file.write(line.chomp<<"\n")
end
end
With everyone's help here, this is how my code looks so far:
begin
puts "What is the file name?"
file = gets.chomp
puts "What is the N number?"
myN = Integer(gets.chomp)
rescue ArgumentError
puts "That's not a number, try again"
retry
end
topN = File.open(file).each_line.max(myN){|a,b| a.to_i <=> b.to_i}
puts topN
Sorting 200GB of data in memory will not be very performant. I would write a little helper class which only remembers the N biggest elements added so far.
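class SortedList
  attr_reader :list

  def initialize(size)
    @list = []
    @size = size
  end

  def add(element)
    # Skip the element only once the list is full and it cannot displace anything.
    return if list.size >= @size && @min > element
    list.push(element)
    reorganize_list
  end

  private

  # Keep the list sorted in descending order and trimmed to @size elements.
  def reorganize_list
    @list = list.sort.reverse.first(@size)
    @min = list.last
  end
end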
Initialize an instance with the required N and then just add the values parsed from each line to this instance.
sorted_list = SortedList.new(n)
File.foreach("N.txt") do |line| # streams the file line by line instead of loading it all
sorted_list.add(line.to_i)
end
puts sorted_list.list
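The helper above re-sorts its list on every add. A minimal sketch of a variation that instead inserts each value at its sorted position with a binary search is shown here. The class name TopN is hypothetical, and StringIO stands in for the real file, which is an assumption made only to keep the example self-contained.

```ruby
require 'stringio'

# Hypothetical helper: keeps the n largest integers seen so far in
# descending order, inserting each new value with a binary search.
class TopN
  attr_reader :list

  def initialize(size)
    @size = size
    @list = []
  end

  def add(value)
    # Once the list is full, anything not above the current minimum is ignored.
    return if @list.size >= @size && value <= @list.last

    # First position whose element is smaller than value (the list is descending).
    idx = @list.bsearch_index { |x| x < value } || @list.size
    @list.insert(idx, value)
    @list.pop if @list.size > @size
  end
end

# StringIO stands in for the real file; File.foreach("N.txt") would
# stream a file line by line in the same way.
input = StringIO.new("117\n106\n143\n147\n63\n118\n146\n93\n")
top = TopN.new(3)
input.each_line { |line| top.add(line.to_i) }
p top.list
```

Only N values are held in memory at any time, regardless of the size of the input.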
Suppose
str = File.read(in_filename)
#=> "117\n106\n143\n147\n63\n118\n146\n93\n"
You could convert that string to an enumerator that enumerates lines, use Enumerable#sort_by to sort those lines in descending order, join the resulting lines (that end in newlines) to form a string that can be written to file:
str.each_line.sort_by { |line| -line.to_i }.join
#=> "147\n146\n143\n118\n117\n106\n93\n63\n"
Another way is to convert the string to array of integers, sort the array using Array#sort, reverse the resulting array and then join the elements of the array back into a string that can be written to file:
str.each_line.map(&:to_i).sort.reverse.join("\n") << "\n"
#=> "147\n146\n143\n118\n117\n106\n93\n63\n"
Let's do a quick benchmark.
require 'benchmark/ips'
(str = 1_000_000.times.map { rand(10_000) }.join("\n") << "\n").size
Benchmark.ips do |x|
x.report("sort_by") { str.each_line.sort_by { |line| -line.to_i }.join }
x.report("sort") { str.each_line.map(&:to_i).sort.reverse.join("\n") << "\n" }
x.compare!
end
Comparison:
sort: 0.4 i/s
sort_by: 0.3 i/s - 1.30x slower
The mighty sort wins again!
You left this comment on your question:
"Write a program, topN, that given a number N and an arbitrarily large file that contains individual numbers on each line (e.g. 200Gb file), will output the largest N numbers, highest first."
That problem seems to me somewhat different from the one described in the question, and it is also a more interesting problem. I have addressed that problem in this answer.
Code
def topN(fname, n, m=n)
raise ArgumentError, "m cannot be smaller than n" if m < n
f = File.open(fname)
best = Array.new(n)
n.times do |i|
break best.replace(best[0,i]) if f.eof?
best[i] = f.readline.to_i
end
best.sort!.reverse!
return best if f.eof?
new_best = Array.new(n)
cand = Array.new(m)
until f.eof?
rd(f, cand)
merge_arrays(best, new_best, cand)
end
f.close
best
end
def rd(f, cand)
cand.size.times { |i| cand[i] = (f.eof? ? -Float::INFINITY : f.readline.to_i) }
cand.sort!.reverse!
end
def merge_arrays(best, new_best, cand)
cand_largest = cand.first
best_idx = best.bsearch_index { |n| cand_largest > n }
return if best_idx.nil?
bi = best_idx
cand_idx = 0
nbr_to_compare = best.size-best_idx
nbr_to_compare.times do |i|
if cand[cand_idx] > best[bi]
new_best[i] = cand[cand_idx]
cand_idx += 1
else
new_best[i] = best[bi]
bi += 1
end
end
best[best_idx..-1] = new_best[0, nbr_to_compare]
end
Examples
Let's create a file with 10 million representations of integers, one per line.
require 'time'
FName = 'test'
(s = 10_000_000.times.with_object('') { |_,s| s << rand(100_000_000).to_s << "\n" }).size
s[0,27]
#=> "86752031\n84524374\n29347072\n"
File.write(FName, s)
#=> 88_888_701
Next, create a simple method to invoke topN with different arguments and to also show execution times.
def try_one(n, m=n)
t = Time.now
a = topN(FName, n, m)
puts "#{(Time.new-t).round(2)} seconds"
puts "top 5: #{a.first(5)}"
puts "bot 5: #{a[n-5..n-1]}"
end
In testing I found that setting m less than n was never desirable in terms of computational time. Requiring that m >= n allowed a small simplification to the code and a small efficiency improvement. I therefore made m >= n a requirement.
try_one 100, 100
9.44 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]
try_one 100, 1000
9.53 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]
try_one 100, 10_000
9.95 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99999136, 99999127, 99999125, 99999109, 99999078]
Here I've tested the case of producing the 100 largest values with different values of m, the number of lines read from the file at a time. As seen, the method is insensitive to m. As expected, the largest 5 values and the smallest 5 values (of the 100 returned) are the same in all cases.
try_one 1_000
9.31 seconds
top 5: [99999993, 99999993, 99999991, 99999971, 99999964]
bot 5: [99990425, 99990423, 99990415, 99990406, 99990399]
try_one 1000, 10_000
9.24 seconds
The time required to return the 1,000 largest values is, in fact, slightly less than the times for returning the largest 100. I expect that's not reproducible. The top 5 are of course the same as when returning the largest 100 values. I therefore will not display that line below. The smallest 5 values of the 1000 returned are of course smaller than when the largest 100 values are returned.
try_one 10_000
12.15 seconds
bot 5: [99898951, 99898950, 99898946, 99898932, 99898922]
try_one 100_000
13.2 seconds
bot 5: [98995266, 98995259, 98995258, 98995254, 98995252]
try_one 1_000_000
14.34 seconds
bot 5: [89999305, 89999302, 89999301, 89999301, 89999287]
Explanation
Notice that I reuse three arrays, best, cand and new_best. Specifically, I replace the contents of these arrays many times rather than continually creating new (potentially very large) arrays, leaving orphaned arrays to be garbage-collected. A little testing showed this approach improved performance.
We can create a small example and then step through the calculations.
fname = 'temp'
File.write(fname, 20.times.map { rand(100) }.join("\n") << "\n")
#=> 58
This file contains representations of integers in the following array.
arr = File.read(fname).lines.map(&:to_i)
#=> [9, 66, 80, 64, 67, 67, 89, 10, 62, 94, 41, 16, 0, 22, 68, 72, 41, 64, 87, 24]
Sorted, this is:
arr.sort_by! { |n| -n }
#=> [94, 89, 87, 80, 72, 68, 67, 67, 66, 64, 64, 62, 41, 41, 24, 22, 16, 10, 9, 0]
Let's assume we want the 5 largest values.
arr[0,5]
#=> [94, 89, 87, 80, 72]
First, set the two parameters: n, the number of largest values to return, and m, the number of lines to read from the file at a time.
n = 5
m = 5
The calculations follow.
m < n
#=> false, so do not raise ArgumentError
f = File.open(fname)
#=> #<File:temp>
best = Array.new(n)
#=> [nil, nil, nil, nil, nil]
n.times { |i| f.eof? ? (return best.replace(best[0,i])) : best[i] = f.readline.to_i }
best
#=> [9, 66, 80, 64, 67]
best.sort!.reverse!
#=> [80, 67, 66, 64, 9]
f.eof?
#=> false, so do not return
new_best = Array.new(n)
#=> [nil, nil, nil, nil, nil]
cand = Array.new(m)
#=> [nil, nil, nil, nil, nil]
puts "best=#{best}".rjust(52)
until f.eof?
rd(f, cand)
merge_arrays(best, new_best, cand)
puts "cand=#{cand}, best=#{best}"
end
f.close
best
#=> [94, 89, 87, 80, 72]
The following is displayed.
best=[80, 67, 66, 64, 9]
cand=[94, 89, 67, 62, 10], best=[94, 89, 80, 67, 67]
cand=[68, 41, 22, 16, 0], best=[94, 89, 80, 68, 67]
cand=[87, 72, 64, 41, 24], best=[94, 89, 87, 80, 72]
Enumerable#max takes an argument which specifies how many elements will be returned, and a block which specifies how elements are compared:
N = 5
p File.open("test.txt").each_line.max(N){|a,b| a.to_i <=> b.to_i}
This does not read the entire file in memory; the file is read line by line.
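A small sketch of the same call on in-memory data, to show what comes back. StringIO is used here as a stand-in for File.open("test.txt") (an assumption for the sake of a self-contained example); both respond to each_line.

```ruby
require 'stringio'

# StringIO stands in for the opened file; each_line yields one line at a time.
io = StringIO.new("9\n100\n25\n7\n31\n")

# The block compares lines numerically; the results are still the raw lines.
top = io.each_line.max(2) { |a, b| a.to_i <=> b.to_i }
p top              # the two largest lines, largest first
p top.map(&:to_i)  # the same values converted to integers
```

Note that the returned elements are strings that still carry their trailing newlines, so a final map(&:to_i) or chomp may be wanted before printing.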
I know I can write it this way successfully:
def test_find_first_multiple_of_3
numbers = [2, 8, 9, 27, 24, 5]
found = nil
numbers.each do |number|
if number % 3 == 0
found = number
break
end
end
assert_equal 9, found
end
Is there any way to do this within the block? What am I missing? Or is it just not possible?
numbers.each { |n| n % 3 == 0 ? (found = n then break) : nil }
def test_find_first_multiple_of_3
numbers = [2, 8, 9, 27, 24, 5]
found = nil
numbers.each { |n| n % 3 == 0 ? (found = n then break) : nil }
assert_equal 9, found
end
As pointed out in other answers, there are other Ruby ways to accomplish your goal, such as using the find method:
found = numbers.find { |n| (n % 3).zero? }
This way, you don't need to break your loop.
But, specifically answering your question, there are some ways to break the loop on the same line, if you want to:
Use ; (the statement separator):
numbers.each { |n| n % 3 == 0 ? (found = n; break) : nil }
Or put your assignment after break, which works too:
numbers.each { |n| n % 3 == 0 ? (break found = n) : nil }
I just used your code in the example but, again, that's not good practice because, as @the Tin Man rightly pointed out, it "hurts readability and maintenance".
Also, as pointed out by @akuhn, you don't need a ternary here. You can simply use:
numbers.each { |n| break found = n if n % 3 == 0 }
** EDITED to include suggestions from @the Tin Man, @akuhn and @Eric Duminil, in order to warn the OP that there are alternatives for this task that don't need to break the loop. The original answer was written just to answer the OP's specific question (a one-line loop break), without regard to code structure.
With common Ruby idioms you can write:
def test_find_first_multiple_of_3
numbers = [2, 8, 9, 27, 24, 5]
found = numbers.find { |n| (n % 3).zero? }
assert_equal 9, found
end
Yes, both break and next take an argument.
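A short sketch of what those arguments do, using the array from the question (the variable name labels is only for illustration):

```ruby
numbers = [2, 8, 9, 27, 24, 5]

# break's argument becomes the return value of the each call itself.
# (If no element matched, each would return the array instead.)
found = numbers.each { |n| break n if n % 3 == 0 }

# next's argument becomes the value of the block for that one element.
labels = numbers.map { |n| next "multiple" if n % 3 == 0; "other" }

p found
p labels
```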
For your example though, best use find
found = numbers.find { |n| n % 3 == 0 }
Generally in Ruby there is rarely a reason to break out of a loop.
You can typically use find or any of the other functions provided by the Enumerable module, like take_while and drop_while…
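For instance, a minimal sketch of take_while and drop_while on the array from the question (the variable names before and rest are only for illustration):

```ruby
numbers = [2, 8, 9, 27, 24, 5]

# Elements up to, but not including, the first multiple of 3.
before = numbers.take_while { |n| n % 3 != 0 }

# The first multiple of 3 and everything after it.
rest = numbers.drop_while { |n| n % 3 != 0 }

p before
p rest
p rest.first
```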
You can use the enumerable method find to find the first item that matches. Usually you will want to use enumerable methods like cycle, detect, each, reject, and others to make the code more compact while remaining understandable:
def test_find_first_multiple_of_3
numbers = [2, 8, 9, 27, 24, 5]
found = numbers.find { |number| number % 3 == 0 }
assert_equal 9, found
end
As an exercise to better understand Ruby Fibers and Enumerators, I wrote a small program to generate the first 10 prime-number palindromes. (I didn't implement a sieve of Eratosthenes or use Ruby's Prime module; I wanted to focus on generators and filters.)
My first version used a Fiber to generate primes, and an Enumerator to filter for palindromes:
module PrimeSeries
extend self
def generator
Fiber.new do
Fiber.yield(2)
3.step(by: 2) { |n| Fiber.yield(n) if prime?(n) }
end
end
def prime?(n)
return false if n < 2
return false if n > 2 && n.even?
sqrt = Math.sqrt(n).ceil
(3..sqrt).step(2).none? { |i| n % i == 0 }
end
end
module Palindrome
extend self
def filter(generator)
Enumerator.new do |yielder|
loop do
n = generator.resume
yielder.yield(n) if palindrome?(n)
end
end
end
def palindrome?(n)
str = n.to_s
str == str.reverse
end
end
primes = PrimeSeries.generator
filter = Palindrome.filter(primes)
p filter.take(10)
# => [2, 3, 5, 7, 11, 101, 131, 151, 181, 191]
What I really wanted, though, was to ignore trivial one-digit palindromes. In order to call it like this,
filter.select { |n| n.to_s.size > 1 }.take(10)
I had to change my generator to an Enumerator in order to take advantage of Enumerator::Lazy.
module PrimeSeries
# ...
def generator
Enumerator.new do |yielder|
yielder.yield(2)
3.step(by: 2) { |n| yielder.yield(n) if prime?(n) }
end
end
# ...
end
module Palindrome
# changed "generator.resume" to "generator.next"
# otherwise the same as previous version
end
primes = PrimeSeries.generator
filter = Palindrome.filter(primes).lazy
p filter.select { |n| n.to_s.size > 1 }.take(10).force
# => [11, 101, 131, 151, 181, 191, 313, 353, 373, 383]
At first, I didn't have .force at the end of my call, and I just got a lazy enumerator back (the result of chaining lazy enumerators for primes, filter, select, and take).
While I think I understand the need for it, I can't find any documentation either for the force method (seems like it's an alias for to_a?) or the cases in which you do or don't have to use it. I'm curious whether there are cases where you don't need to call force in order to evaluate your final results.
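For reference, a small sketch of how the pieces behave on a simpler lazy chain (even numbers rather than prime palindromes, which is a substitution made only to keep the example short): take(n) on a lazy enumerator stays lazy, force and to_a both produce an array, and first(n) returns an array directly.

```ruby
evens = (1..Float::INFINITY).lazy.select(&:even?)

lazy_result = evens.take(3)        # still an Enumerator::Lazy; nothing computed yet
forced      = evens.take(3).force  # evaluates the chain into an Array
via_to_a    = evens.take(3).to_a   # same result as force
via_first   = evens.first(3)       # first(n) returns an Array without force

p lazy_result.class
p forced
p via_to_a
p via_first
```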
I'm writing a method - prime_numbers - that, when passed a number n, returns an n number of primes. It should not rely on Ruby's Prime class. It should behave like so:
prime_numbers 3
=> [2, 3, 5]
prime_numbers 5
=> [2, 3, 5, 7, 11]
My first attempt at this method is as follows:
def prime_numbers(n)
primes = []
i = 2
while primes.length < n do
divisors = (2..9).to_a.select { |x| x != i }
primes << i if divisors.all? { |x| i % x != 0 }
i += 1
end
primes
end
Edit: As pointed out, the current method is at fault because it only considers divisors up to 9. As a result, any composite number whose prime factors are all greater than 9 (for example 121 = 11 × 11 or 143 = 11 × 13) is treated as a prime itself.
If anyone has input or tips they can share on better ways to approach this, it would be greatly appreciated.
Note that if the number $n$ is composite it must have a divisor less than or equal to $\sqrt{n}$. So you really only have to check up to $\sqrt{n}$ to find a divisor.
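A minimal sketch of that idea, assuming Ruby 2.5 or later for Integer.sqrt. The helper name prime? is hypothetical; prime_numbers keeps the signature from the question.

```ruby
# Hypothetical helper: trial division that stops at the integer square root.
def prime?(n)
  return false if n < 2
  (2..Integer.sqrt(n)).none? { |d| (n % d).zero? }
end

# Collects the first `count` primes by testing successive integers.
def prime_numbers(count)
  primes = []
  i = 2
  while primes.length < count
    primes << i if prime?(i)
    i += 1
  end
  primes
end

p prime_numbers(5)
p prime?(121)
```

Because the divisor range grows with the number being tested, 121 is correctly rejected here.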
Got a good idea for your implementation:
@primes = []

def prime_numbers(n)
  i = 2
  while @primes.size < n do
    @primes << i if is_prime?(i)
    i += 1
  end
  @primes
end

def is_prime?(n)
  @primes.each { |prime| return false if n % prime == 0 }
  true
end
This is based on the idea that non-prime numbers have prime factors :)
In Ruby 1.9 there is a Prime class you can use to generate prime numbers, or to test if a number is prime:
require 'prime'
Prime.take(10) #=> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
Prime.take_while {|p| p < 10 } #=> [2, 3, 5, 7]
Prime.prime?(19) #=> true
Prime implements the each method and includes the Enumerable module, so you can do all sorts of fun stuff like filtering, mapping, and so on.
I need a loop in this pattern. I need an infinite loop producing numbers that start with 1.
1,10,11,12..19,100,101,102..199,1000,1001.......
def numbers_that_start_with_1
return enum_for(:numbers_that_start_with_1) unless block_given?
infty = 1.0 / 0.0
(0..infty).each do |i|
(0 .. 10**i - 1).each do |j|
yield(10**i + j)
end
end
end
numbers_that_start_with_1.first(20)
#=> [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100, 101, 102, 103, 104, 105, 106, 107, 108]
INFINITY = 1.0 / 0.0
0.upto(INFINITY) do |i|
((10**i)...(2*10**i)).each{|e| puts e }
end
Of course, I wouldn't run this code.
i = 1
loop do
for j in 0...i
puts i+j
end
i *= 10
end
Enumerators are good for things like this. Though, I was lazy and just decided to see if the string representation starts with 1, and iterate through 1 at a time. This means it'll be slow, and it'll have huge pauses while it jumps from things like 1,999,999 to 10,000,000.
#!/usr/bin/env ruby
start_with_1 = Enumerator.new do|y|
number = 1
loop do
while number.to_s[0] != '1'
number += 1
end
y.yield number
number += 1
end
end
start_with_1.each do|n|
puts n
end
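The pauses mentioned above come from stepping through every integer. As a sketch of one way to avoid them while still using an Enumerator, the generator can jump straight from one block of numbers to the next (1, then 10..19, then 100..199, and so on). The variable names here are only for illustration.

```ruby
start_with_1 = Enumerator.new do |y|
  low = 1
  loop do
    # Every number in low...2*low has the same digit count as low and starts with 1.
    (low...2 * low).each { |n| y << n }
    low *= 10
  end
end

p start_with_1.first(12)
```

Since the enumerator is infinite, it should be consumed with first, take or a lazy chain rather than a bare each.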
Not better, just different...
def nbrs_starting_with_one(nbr=1)
(nbr...2*nbr).each {|i| puts i}
nbrs_starting_with_one(10*nbr)
end