Looking for direction on how to chunk and compare two large text files using Ruby, something like 100 lines at a time. Any help is appreciated.
I tried:
File.foreach(file1).each_slice(100) do |lines|
  pp lines
end
but I'm getting confused about how to include the second file in this loop.
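For reference, one way to pair the files up is to walk two external enumerators in lockstep. This is a minimal sketch under the question's assumptions (same_by_slices? is a hypothetical helper name; the 100-line slice size comes from the question):
# Sketch: compare two files 100 lines at a time via external enumerators.
def same_by_slices?(path1, path2, slice_size = 100)
  return false unless File.size(path1) == File.size(path2)
  e1 = File.foreach(path1).each_slice(slice_size)
  e2 = File.foreach(path2).each_slice(slice_size)
  loop do
    # Enumerator#next raises StopIteration at EOF, which Kernel#loop rescues.
    return false unless e1.next == e2.next
  end
  true
end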
CHUNK_SIZE = 256 # bytes

def same? path1, path2
  return false unless [path1, path2].map { |f| File.size f }.reduce &:==

  f1, f2 = [path1, path2].map { |f| File.new f }
  loop do
    s1, s2 = [f1, f2].map { |f| f.read(CHUNK_SIZE) }
    break false if s1 != s2
    break true if s1.nil? || s1.length < CHUNK_SIZE
  end
ensure
  [f1, f2].each &:close
end
UPD: credit for the typo fix and the file size comparison goes to @tadman.
Just "Process two files at the same time in Ruby" and compare by chunks, like this:
f1 = File.open('file1.txt', 'r')
f2 = File.open('file2.txt', 'r')

f1.each_slice(10).zip(f2.each_slice(10)).each do |line1, line2|
  return false unless line1 == line2
end
return true
Or, as suggested by @meagar (in this case line by line):
f1.each_line.zip(f2.each_line).all? { |a,b| a == b }
This will return true if the files are identical.
Just compare those files line by line:
def same_file?(path1, path2)
  file1 = File.open(path1)
  file2 = File.open(path2)
  return true if File.absolute_path(path1) == File.absolute_path(path2)
  return false unless file1.size == file2.size

  enum1 = file1.each
  enum2 = file2.each
  loop do
    # It's a mystery that the loop really ends
    # when any of the 2 files has nothing to read
    return false unless enum1.next == enum2.next
  end
  return true
ensure
  file1.close
  file2.close
end
I did my homework and found in the Kernel#loop documentation:
StopIteration raised in the block breaks the loop. In this case, loop returns the "result" value stored in the exception.
And, in the Enumerator#next documentation:
When the position reached at the end, StopIteration is raised.
So the mystery is no longer a mystery for me.
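A tiny illustration of that interplay (a sketch, not from the documentation):
e = [1, 2, 3].each # an external Enumerator
loop do
  p e.next         # prints 1, 2 and 3
end                # the 4th call to e.next raises StopIteration; loop rescues it
puts "done"        # and execution simply continues here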
Here's another one; the approach is similar to mudasobwa's answer:
def same?(file_1, file_2)
  return true if File.identical?(file_1, file_2)
  return false unless File.size(file_1) == File.size(file_2)

  buf_size = 2**15 # 32 KB
  buf_1 = ''
  buf_2 = ''
  File.open(file_1) do |f1|
    File.open(file_2) do |f2|
      while f1.read(buf_size, buf_1) && f2.read(buf_size, buf_2)
        return false unless buf_1 == buf_2
      end
    end
  end
  true
end
The first two lines perform quick checks for identical files (e.g. hard and soft links) and equal file size, using File.identical? and File.size.
File.open opens each file in read-only mode. The while loop then keeps calling read to read 32K chunks from each file into the buffers buf_1 and buf_2 until EOF. If the buffers differ, false is returned. Otherwise, i.e. without encountering any differences, true is returned.
To determine whether two files have exactly the same content without directly comparing corresponding chunks of each file, you can use a checksum function that turns the data into a hash string in a deterministic way. You still have to read the contents to checksum them, but you can compute a checksum for each slice and end up with an array of checksums for each file.
You can then compare the collection of checksums. If the two files have the exact same content, the two collections will be equal.
require 'digest/md5'

# Note: each slice is an array of lines, so join it before hashing.
hashes1 = File.foreach('./path_to_file').each_slice(100).map do |slice|
  Digest::MD5.hexdigest(slice.join)
end

hashes2 = File.foreach('./path_to_duplicate').each_slice(100).map do |slice|
  Digest::MD5.hexdigest(slice.join)
end
hashes1.join == hashes2.join
#=> true, meaning the two files contain the same content
Benchmark time
(Matt's answer is not included because I couldn't get it working)
Results 1 KB file size (N = 10000)
user system total real
aetherus 0.510000 0.300000 0.810000 ( 0.823201)
meagar 0.350000 0.160000 0.510000 ( 0.512755)
mudasobwa 0.290000 0.200000 0.490000 ( 0.500831)
stefan 0.150000 0.160000 0.310000 ( 0.312743)
yevgeniy_anfilofyev 0.320000 0.170000 0.490000 ( 0.497157)
Results 1 MB file size (N = 100)
user system total real
aetherus 1.540000 0.110000 1.650000 ( 1.667937)
meagar 1.170000 0.130000 1.300000 ( 1.310278)
mudasobwa 1.470000 0.830000 2.300000 ( 2.313481)
stefan 0.010000 0.030000 0.040000 ( 0.045577)
yevgeniy_anfilofyev 0.570000 0.100000 0.670000 ( 0.677226)
Results 1 GB file size (N = 1)
user system total real
aetherus 15.570000 0.920000 16.490000 ( 16.525826)
meagar 24.170000 1.910000 26.080000 ( 26.190057)
mudasobwa 16.260000 8.160000 24.420000 ( 24.471977)
stefan 0.120000 0.330000 0.450000 ( 0.443074)
yevgeniy_anfilofyev 12.940000 1.310000 14.250000 ( 14.295736)
Notes
mudasobwa's code runs significantly faster with larger CHUNK_SIZE
with identical chunk sizes, stefan's code seems to be ~2x faster than mudasobwa's code
"fastest" chunk size is somewhere between 16 K and 512 K
I couldn't use fruity because the 1 GB test would have taken too long
Code
def aetherus_same?(f1, f2)
  enum1 = f1.each
  enum2 = f2.each
  loop do
    return false unless enum1.next == enum2.next
  end
  return true
end

def meagar_same?(f1, f2)
  f1.each_line.zip(f2.each_line).all? { |a, b| a == b }
end

CHUNK_SIZE = 256 # bytes

def mudasobwa_same?(f1, f2)
  loop do
    s1, s2 = [f1, f2].map { |f| f.read(CHUNK_SIZE) }
    break false if s1 != s2
    break true if s1.nil? || s1.length < CHUNK_SIZE
  end
end

def stefan_same?(f1, f2)
  buf_size = 2**15 # 32 KB
  buf_1 = ''
  buf_2 = ''
  while f1.read(buf_size, buf_1) && f2.read(buf_size, buf_2)
    return false unless buf_1 == buf_2
  end
  true
end

def yevgeniy_anfilofyev_same?(f1, f2)
  f1.each_slice(10).zip(f2.each_slice(10)).each do |line1, line2|
    return false unless line1 == line2
  end
  return true
end

FILE1 = ARGV[0]
FILE2 = ARGV[1]
N = ARGV[2].to_i

def with_files
  File.open(FILE1) { |f1| File.open(FILE2) { |f2| yield f1, f2 } }
end

require 'benchmark'

Benchmark.bm(19) do |x|
  x.report('aetherus')            { N.times { with_files { |f1, f2| aetherus_same?(f1, f2) } } }
  x.report('meagar')              { N.times { with_files { |f1, f2| meagar_same?(f1, f2) } } }
  x.report('mudasobwa')           { N.times { with_files { |f1, f2| mudasobwa_same?(f1, f2) } } }
  x.report('stefan')              { N.times { with_files { |f1, f2| stefan_same?(f1, f2) } } }
  x.report('yevgeniy_anfilofyev') { N.times { with_files { |f1, f2| yevgeniy_anfilofyev_same?(f1, f2) } } }
end
Related
My goal is to find the word with greatest number of repeated letters in a given string. For example, "aabcc ddeeteefef iijjfff" would return "ddeeteefef" because "e" is repeated five times in this word and that is more than all other repeating characters.
So far this is what I got, but it has many problems and is not complete:
def LetterCountI(str)
  s = str.split(" ")
  i = 0
  result = []
  t = s[i].scan(/((.)\2+)/).map(&:max)
  u = t.max { |a, b| a.length <=> b.length }
  return u.split(//).count
end
The code I have only finds consecutive patterns; if the pattern is interrupted (as with "aabaaa"), it counts "a" three times instead of five.
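To see the problem (a snippet based on the question's regex):
"aabaaa".scan(/((.)\2+)/) #=> [["aa", "a"], ["aaa", "a"]]
# The regex only captures consecutive runs ("aa" and "aaa"), so the
# longest run is 3, never the total of five "a"s in the word.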
str.scan(/\w+/).max_by{ |w| w.chars.group_by(&:to_s).values.map(&:size).max }
scan(/\w+/) — create an array of all sequences of 'word' characters
max_by{ … } — find the word that gives the largest value inside this block
chars — split the string into characters
group_by(&:to_s) — create a hash mapping each character to an array of all the occurrences
values — just get all the arrays of the occurrences
map(&:size) — convert each array to the number of characters in that array
max — find the largest count and use it as the result for max_by to examine
Edit: Written less compactly:
str.scan(/\w+/).max_by do |word|
  word.chars
      .group_by { |char| char }
      .map { |char, array| array.size }
      .max
end
Written less functionally and with fewer Ruby-isms (to make it look more like "other" languages):
words_by_most_repeated = []
str.split(" ").each do |word|
  count_by_char = {} # hash mapping character to count of occurrences
  word.chars.each do |char|
    count_by_char[char] = 0 unless count_by_char[char]
    count_by_char[char] += 1
  end
  maximum_count = 0
  count_by_char.each do |char, count|
    if count > maximum_count then
      maximum_count = count
    end
  end
  # Index the array by repeat count, so the entry at the highest index
  # is a word with the most repeated letters.
  words_by_most_repeated[maximum_count] = word
end
most_repeated = words_by_most_repeated.last
I'd do it as below:
s = "aabcc ddeeteefef iijjfff"
# intermediate calculation that's happening in the final code
s.split(" ").map { |w| w.chars.max_by { |e| w.count(e) } }
# => ["a", "e", "f"] # getting the max count character from each word
s.split(" ").map { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => [2, 5, 3] # getting the max count character's count from each word
# final code
s.split(" ").max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => "ddeeteefef"
update
each_with_object gives a better result than the group_by method.
require 'benchmark'

s = "aabcc ddeeteefef iijjfff"

def phrogz(s)
  s.scan(/\w+/).max_by { |word| word.chars.group_by(&:to_s).values.map(&:size).max }
end

def arup_v1(s)
  max_string = s.split.max_by do |w|
    h = w.chars.each_with_object(Hash.new(0)) do |e, hsh|
      hsh[e] += 1
    end
    h.values.max
  end
end

def arup_v2(s)
  s.split.max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
end

n = 100_000

Benchmark.bm do |x|
  x.report("Phrogz:")  { n.times { |i| phrogz s } }
  x.report("arup_v2:") { n.times { |i| arup_v2 s } }
  x.report("arup_v1:") { n.times { |i| arup_v1 s } }
end
output
user system total real
Phrogz: 1.981000 0.000000 1.981000 ( 1.979198)
arup_v2: 0.874000 0.000000 0.874000 ( 0.878088)
arup_v1: 1.684000 0.000000 1.684000 ( 1.685168)
Similar to sawa's answer:
"aabcc ddeeteefef iijjfff".split.max_by{|w| w.length - w.chars.uniq.length}
=> "ddeeteefef"
In Ruby 2.x this works as-is because String#chars returns an array. In earlier versions of Ruby, String#chars yields an enumerator, so you need to add .to_a before applying uniq. I did my testing in Ruby 2.0 and overlooked this until it was pointed out by Stephens.
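For an older Ruby, the adjusted call might look like this (a sketch, assuming Ruby 1.9):
"aabcc ddeeteefef iijjfff".split.max_by { |w| w.length - w.chars.to_a.uniq.length }
#=> "ddeeteefef"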
I believe this is valid, since the question was "greatest number of repeated letters in a given string" rather than greatest number of repeats for a single letter in a given string.
"aabcc ddeeteefef iijjfff"
.split.max_by{|w| w.chars.sort.chunk{|e| e}.map{|e| e.last.length}.max}
# => "ddeeteefef"
I have two strings, a and b, in Ruby.
a="scar"
b="cars"
What is the easiest way in Ruby to find whether a and b contain the same characters?
UPDATE
I am building an anagram game, so "scar" is an anagram of "cars". I want a way to compare a and b and conclude that they are anagrams.
So c="carcass" should not be a match.
You could do it like this:
a = 'scar'
b = 'cars'
a.chars.sort == b.chars.sort
# => true
a = 'cars'
b = 'carcass'
a.chars.sort == b.chars.sort
# => false
Just testing arrays vs. strings vs. delete comparison, assuming we compare strings of equal length.
In a real anagram search you need to sort the first word a only once, and then compare it to a bunch of b's.
a="scar"
b="cars"
require 'benchmark'
n = 1000000
Benchmark.bm do |x|
  x.report('string') { a = a.chars.sort.join; n.times do ; a == b.chars.sort.join ; end }
  x.report('arrays') { a = a.chars.sort; n.times do ; a == b.chars.sort ; end }
end
The result:
user system total real
string 6.030000 0.010000 6.040000 ( 6.061088)
arrays 6.420000 0.010000 6.430000 ( 6.473158)
But, if you sort a each time (for delete we don't need to sort any word):
x.report('string') { n.times do ; a.chars.sort.join == b.chars.sort.join ; end }
x.report('arrays') { n.times do ; a.chars.sort == b.chars.sort ; end }
x.report('delete') { n.times do ; a.delete(b).empty? ; end }
The result is:
user system total real
string 11.800000 0.020000 11.820000 ( 11.989071)
arrays 11.210000 0.020000 11.230000 ( 11.263627)
delete 1.680000 0.000000 1.680000 ( 1.673979)
What is the easiest way in Ruby to find whether a and b contain the same characters?
As per the definition of an anagram, the code below should work:
a="scar"
b="cars"
a.size == b.size && a.delete(b).empty?
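A quick check of why the size guard matters (a sketch, not part of the original answer):
"scar".size == "cars".size && "scar".delete("cars").empty?       #=> true
# String#delete removes every character that appears in its argument,
# so "cars".delete("carcass") is empty too; only the size check rules
# out the non-anagram "carcass":
"cars".size == "carcass".size && "cars".delete("carcass").empty? #=> false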
require 'set'
Set.new(a.chars) == Set.new(b.chars)
Updated to take into account the comment from sawa.
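Note that a Set ignores how many times each character occurs, so for the anagram use-case from the question's update this can give a false positive:
require 'set'

Set.new('carcass'.chars) == Set.new('cars'.chars) #=> true
# Both collapse to {"c", "a", "r", "s"}; this answers "same characters?"
# but not "is it an anagram?".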
What is the most efficient way to check whether two hashes h1 and h2 have the same set of keys, disregarding their order? Could it be made faster, or more concise with comparable efficiency, than the answer I post below?
Alright, let's break all rules of savoir vivre and portability. MRI's C API comes into play.
/* Name this file superhash.c. An appropriate Makefile is attached below. */
#include <ruby/ruby.h>

static int key_is_in_other(VALUE key, VALUE val, VALUE data) {
    struct st_table *other = ((struct st_table**) data)[0];
    if (st_lookup(other, key, 0)) {
        return ST_CONTINUE;
    } else {
        int *failed = ((int**) data)[1];
        *failed = 1;
        return ST_STOP;
    }
}

static VALUE hash_size(VALUE hash) {
    if (!RHASH(hash)->ntbl)
        return INT2FIX(0);
    return INT2FIX(RHASH(hash)->ntbl->num_entries);
}

static VALUE same_keys(VALUE self, VALUE other) {
    if (CLASS_OF(other) != rb_cHash)
        rb_raise(rb_eArgError, "argument needs to be a hash");
    if (hash_size(self) != hash_size(other))
        return Qfalse;
    if (!RHASH(self)->ntbl && !RHASH(other)->ntbl)
        return Qtrue;
    int failed = 0;
    void *data[2] = { RHASH(other)->ntbl, &failed };
    rb_hash_foreach(self, key_is_in_other, (VALUE) data);
    return failed ? Qfalse : Qtrue;
}

void Init_superhash(void) {
    rb_define_method(rb_cHash, "same_keys?", same_keys, 1);
}
Here's a Makefile.
CFLAGS=-std=c99 -O2 -Wall -fPIC $(shell pkg-config ruby-1.9 --cflags)
LDFLAGS=-Wl,-O1,--as-needed $(shell pkg-config ruby-1.9 --libs)

superhash.so: superhash.o
	$(LINK.c) -shared $^ -o $@
An artificial, synthetic and simplistic benchmark shows what follows.
require 'superhash'
require 'benchmark'

n = 100_000
h1 = h2 = { a: 5, b: 8, c: 1, d: 9 }

Benchmark.bm do |b|
  # freemasonjson's state of the art.
  b.report { n.times { h1.size == h2.size and h1.keys.all? { |key| !!h2[key] } } }

  # This solution.
  b.report { n.times { h1.same_keys? h2 } }
end

#       user     system      total        real
#   0.310000   0.000000   0.310000 (  0.312249)
#   0.050000   0.000000   0.050000 (  0.051807)
Combining freemasonjson's and sawa's ideas:
h1.size == h2.size and (h1.keys - h2.keys).empty?
Try:
# Check that both hashes have the same number of entries before anything else
if h1.size == h2.size
  # Breaks from the iteration and returns false as soon as there is a
  # mismatched key, otherwise returns true
  h1.keys.all? { |key| !!h2[key] }
end
Enumerable#all?
Worst case scenario, you'd only be iterating through the keys once.
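A small demonstration of that short-circuiting (a sketch):
%w[a b c].all? { |k| puts "checking #{k}"; k != "b" }
# prints "checking a" and "checking b", then returns false
# without ever looking at "c"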
Just for the sake of having at least a benchmark on this question...
require 'securerandom'
require 'benchmark'

a = {}
b = {}

# Use UUIDs to get unique random keys
(0..1_000).each do |i|
  key = SecureRandom.uuid
  a[key] = i
  b[key] = i
end

Benchmark.bmbm do |x|
  x.report("#-") do
    1_000.times do
      (a.keys - b.keys).empty? and (b.keys - a.keys).empty?
    end
  end
  x.report("#&") do
    1_000.times do
      computed = a.keys & b.keys
      computed.size == a.size
    end
  end
  x.report("#all?") do
    1_000.times do
      a.keys.all? { |key| !!b[key] }
    end
  end
  x.report("#sort") do
    1_000.times do
      a_sorted = a.keys.sort
      b_sorted = b.keys.sort
      a_sorted == b_sorted
    end
  end
end
Results are:
Rehearsal -----------------------------------------
#- 1.000000 0.000000 1.000000 ( 1.001348)
#& 0.560000 0.000000 0.560000 ( 0.563523)
#all? 0.240000 0.000000 0.240000 ( 0.239058)
#sort 0.850000 0.010000 0.860000 ( 0.854839)
-------------------------------- total: 2.660000sec
user system total real
#- 0.980000 0.000000 0.980000 ( 0.976698)
#& 0.560000 0.000000 0.560000 ( 0.559592)
#all? 0.250000 0.000000 0.250000 ( 0.251128)
#sort 0.860000 0.000000 0.860000 ( 0.862857)
I have to agree with @akuhn that this would be a better benchmark if we had more information on the dataset you are using. But that being said, I believe this question really needed some hard facts.
It depends on your data.
There is no general case, really. For example, retrieving the entire keyset at once is generally faster than checking the inclusion of each key separately. However, if in your dataset the keysets differ more often than not, then a slower solution that fails faster might win. For example:
h1.size == h2.size and h1.keys.all?{|k|h2.include?(k)}
Another factor to consider is the size of your hashes. If they are big, a solution with a higher setup cost, like calling Set.new, might pay off; if they are small, it won't:
h1.size == h2.size and Set.new(h1.keys) == Set.new(h2.keys)
And if you happen to compare the same immutable hashes over and over again, it would definitely pay off to cache the results.
Eventually only a benchmark will tell, but, to write a benchmark, we'd need to know more about your use case. For sure, testing a solution with synthetic data (as for example, randomly generated keys) will not be representative.
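A rough memoization sketch for the caching idea above (same_keys_cached? and SAME_KEYS_CACHE are hypothetical names, and it assumes the hashes are never mutated):
SAME_KEYS_CACHE = {}

def same_keys_cached?(h1, h2)
  key = [h1.object_id, h2.object_id]
  SAME_KEYS_CACHE.fetch(key) do
    SAME_KEYS_CACHE[key] = h1.size == h2.size && h1.keys.all? { |k| h2.include?(k) }
  end
end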
This is my try:
(h1.keys - h2.keys).empty? and (h2.keys - h1.keys).empty?
Here is my solution:
class Hash
  # Doesn't check recursively
  def same_keys?(compare)
    if compare.class == Hash
      if self.size == compare.size
        self.keys.all? { |s| compare.key?(s) }
      else
        return false
      end
    else
      nil
    end
  end
end
a = c = { a: nil, b: "whatever1", c: 1.14, d: true }
b = { a: "foo", b: "whatever2", c: 2.14, "d": false }
d = { a: "bar", b: "whatever3", c: 3.14, }
puts a.same_keys?(b) # => true
puts a.same_keys?(c) # => true
puts a.same_keys?(d) # => false
puts a.same_keys?(false).inspect # => nil
puts a.same_keys?("jack").inspect # => nil
puts a.same_keys?({}).inspect # => false
I believe I have a good answer to this issue, but I wanted to make sure ruby-philes didn't have a much better way to do this.
Basically, given an input string, I would like to convert the string to an integer, where appropriate, or a float, where appropriate. Otherwise, just return the string.
I'll post my answer below, but I'd like to know if there is a better way out there.
Ex:
to_f_or_i_or_s("0523.49") #=> 523.49
to_f_or_i_or_s("0000029") #=> 29
to_f_or_i_or_s("kittens") #=> "kittens"
I would avoid using regexes in Ruby whenever possible; they're notoriously slow.
def to_f_or_i_or_s(v)
  ((float = Float(v)) && (float % 1.0 == 0) ? float.to_i : float) rescue v
end
# Proof of Ruby's slow regex
require 'benchmark'

def regex_float_detection(input)
  input.match('\.')
end

def math_float_detection(input)
  input % 1.0 == 0
end

n = 100_000
Benchmark.bm(30) do |x|
  x.report("Regex") { n.times { regex_float_detection("1.1") } }
  x.report("Math")  { n.times { math_float_detection(1.1) } }
end

#                                user     system      total        real
# Regex                      0.180000   0.000000   0.180000 (  0.181268)
# Math                       0.050000   0.000000   0.050000 (  0.048692)
A more comprehensive benchmark:
require 'benchmark'

def wattsinabox(input)
  input.match('\.').nil? ? Integer(input) : Float(input) rescue input.to_s
end

def jaredonline(input)
  ((float = Float(input)) && (float % 1.0 == 0) ? float.to_i : float) rescue input
end

def muistooshort(input)
  case input
  when /\A\s*[+-]?\d+\.\d+\z/
    input.to_f
  when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
    input.to_f
  when /\A\s*[+-]?\d+\z/
    input.to_i
  else
    input
  end
end

n = 1_000_000
Benchmark.bm(30) do |x|
  x.report("wattsinabox")  { n.times { wattsinabox("1.1") } }
  x.report("jaredonline")  { n.times { jaredonline("1.1") } }
  x.report("muistooshort") { n.times { muistooshort("1.1") } }
end

#                                user     system      total        real
# wattsinabox                3.600000   0.020000   3.620000 (  3.647055)
# jaredonline                1.400000   0.000000   1.400000 (  1.413660)
# muistooshort               2.790000   0.010000   2.800000 (  2.803939)
def to_f_or_i_or_s(v)
  v.match('\.').nil? ? Integer(v) : Float(v) rescue v.to_s
end
A pile of regexes might be a good idea if you want to handle numbers in scientific notation (which String#to_f does):
def to_f_or_i_or_s(v)
  case v
  when /\A\s*[+-]?\d+\.\d+\z/
    v.to_f
  when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
    v.to_f
  when /\A\s*[+-]?\d+\z/
    v.to_i
  else
    v
  end
end
You could mash both to_f cases into one regex if you wanted.
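That merged version might look like this (an assumption, not spelled out in the answer):
def to_f_or_i_or_s(v)
  case v
  when /\A\s*[+-]?\d+(\.\d+([eE]\d+)?|[eE]\d+)\z/ # both float cases merged
    v.to_f
  when /\A\s*[+-]?\d+\z/
    v.to_i
  else
    v
  end
end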
This will, of course, fail when fed '3,14159' in a locale that uses a comma as a decimal separator.
Depends on security requirements.
def to_f_or_i_or_s(s)
  eval(s) rescue s
end
I used this method:
def to_f_or_i_or_s(value)
  return value if value[/[a-zA-Z]/]
  i = value.to_i
  f = value.to_f
  i == f ? i : f
end
CSV has converters which do this.
require "csv"
strings = ["0523.49", "29","kittens"]
strings.each{|s|p s.parse_csv(converters: :numeric).first}
#523.49
#29
#"kittens"
However, for some reason it converts "00029" to a float.
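This is likely because Kernel#Integer treats a leading 0 as an octal prefix, so the integer converter fails on "00029" (9 is not an octal digit) and CSV falls back to the float converter:
Integer("00029") rescue $! #=> #<ArgumentError: invalid value for Integer(): "00029">
Float("00029")             #=> 29.0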
What is the fastest way to read a number of 1,000,000 characters (digits) from STDIN and split it into an array of one-character integers (not strings)?
"123456" -> [1, 2, 3, 4, 5, 6]
The quickest method I have found so far is as follows:
gets.unpack("c*").map { |c| c-48}
Here are some results from benchmarking most of the provided solutions. These tests were run with a 100,000 digit file but with 10 reps for each test.
user system total real
each_char_full_array: 1.780000 0.010000 1.790000 ( 1.788893)
each_char_empty_array: 1.560000 0.010000 1.570000 ( 1.572162)
map_byte: 0.760000 0.010000 0.770000 ( 0.773848)
gets_scan 2.220000 0.030000 2.250000 ( 2.250076)
unpack: 0.510000 0.020000 0.530000 ( 0.529376)
And here is the code that produced them
#!/usr/bin/env ruby

require "benchmark"

MAX_ITERATIONS = 100_000
FILE_NAME = "1_million_digits"

def build_test_file
  File.open(FILE_NAME, "w") do |f|
    MAX_ITERATIONS.times { |x| f.syswrite rand(10) }
  end
end

def each_char_empty_array
  STDIN.reopen(FILE_NAME)
  a = []
  STDIN.each_char do |c|
    a << c.to_i
  end
  a
end

def each_char_full_array
  STDIN.reopen(FILE_NAME)
  a = Array.new(MAX_ITERATIONS)
  idx = 0
  STDIN.each_char do |c|
    a[idx] = c.to_i
    idx += 1
  end
  a
end

def map_byte
  STDIN.reopen(FILE_NAME)
  a = STDIN.bytes.map { |c| c - 48 }
  a[-1] == -38 && a.pop # drop a trailing newline (10 - 48 == -38)
  a
end

def gets_scan
  STDIN.reopen(FILE_NAME)
  gets.scan(/\d/).map(&:to_i)
end

def unpack
  STDIN.reopen(FILE_NAME)
  gets.unpack("c*").map { |c| c - 48 }
end

reps = 10
build_test_file

Benchmark.bm(10) do |x|
  x.report("each_char_full_array: ") { reps.times { |y| each_char_full_array } }
  x.report("each_char_empty_array:") { reps.times { |y| each_char_empty_array } }
  x.report("map_byte: ")             { reps.times { |y| map_byte } }
  x.report("gets_scan ")             { reps.times { |y| gets_scan } }
  x.report("unpack: ")               { reps.times { |y| unpack } }
end
This should be reasonably fast:
a = []
STDIN.each_char do |c|
  a << c.to_i
end
although some rough benchmarking shows this hackish version is considerably faster:
a = STDIN.bytes.map { |c| c-48 }
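For clarity, the c - 48 maps ASCII byte values back to digits (48 is the byte value of "0"):
"123".bytes.map { |c| c - 48 } #=> [1, 2, 3]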
scan(/\d/).map(&:to_i)
This will split any string into an array of integers, ignoring any non-numeric characters. If you want to grab user input from STDIN, add gets:
gets.scan(/\d/).map(&:to_i)