I'm writing a small script that checks for repeated files inside a folder. I did it with an array, and it worked. The problem is that I also want to store the folder location, so I can see where the duplicated files are.
My first thought was to use a Hash. But since there will be many files in the same folder, I can't do hash[folder] = file. The reverse is also impossible, because if I have repeated files, their entries will be overwritten (hash[file] = folder).
So what is the best approach?
My code:
class FilesList
  attr_accessor :elements
  def initialize(path)
    @elements = Hash.new
    @path = path
    printDirectory(@path)
  end
  def printDirectory(folderPath)
    entries = Dir.entries(folderPath) - [".", "..", "repeat.rb"]
    entries.each do |single|
      if File.directory?("#{folderPath}/#{single}")
        printDirectory("#{folderPath}/#{single}")
      else
        @elements[single] = folderPath
      end
    end
  end
  def printArray
    puts @elements
  end
  def each()
    @elements.each do |x, y|
      yield x, y
    end
  end
  def checkRepeated
    if @elements.length == @elements.keys.uniq.length
      puts "No repeated Files"
    else
      counts = Hash.new(0)
      @elements.each do |key,val|
        counts[val] += 1
      end
      repeateds = counts.reject{|val,count|count==1}.keys
      puts repeateds
    end
  end
end
array = FilesList.new(Dir.pwd)
array.printArray
You can store arrays (or sets) of file names (or folder paths) as hash values.
For example, in your code you could change @elements[single] = folderPath to:
@elements[single] ||= []
@elements[single] << folderPath
Later, each val will be an array of the folders where that file was found.
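A minimal sketch of that idea, assuming you already have a flat list of file paths (for instance from Dir.glob). The helper names folders_by_name and repeated_files are made up for illustration and are not part of the original script:

```ruby
# Hypothetical helper: maps each file name to every folder it was seen in.
def folders_by_name(paths)
  # The block form gives every new key its own empty array.
  index = Hash.new { |hash, name| hash[name] = [] }
  paths.each { |path| index[File.basename(path)] << File.dirname(path) }
  index
end

# Hypothetical helper: keeps only the names that occur in more than one folder.
def repeated_files(paths)
  folders_by_name(paths).select { |_name, folders| folders.size > 1 }
end

paths = ['/path1/awesome_file.rb', '/path1/beautiful.js',
         '/path2/beautiful.js', '/path2/coffee.rb']
p repeated_files(paths)
```

Keying by file name means the duplicate check is just a filter on the size of each array, and the folders are already attached to the result.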
Similar to the above, but don't use the file as the key.
# Use the block form: Hash.new([]) would share one array between all folders
@elements = Hash.new { |hash, key| hash[key] = [] }
entries.each do |single|
  if File.directory?("#{folderPath}/#{single}")
    printDirectory("#{folderPath}/#{single}")
  else
    @elements[folderPath] << single
  end
end
Then you'll get something that looks like this:
{ '/path1' => ['awesome_file.rb', 'beautiful.js'],
'/path2' => ['beautiful.js', 'coffee.rb'] }
And then, if I understand you correctly, you can find duplicate files like so:
files = @elements.values.flatten
repeateds = files.select { |file| files.count(file) > 1 }
This will return an array, ["beautiful.js", "beautiful.js"], which you can call .uniq on to get just ["beautiful.js"], or count, or map into another hash that tells you how often each file is repeated, and so on.
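If you want the "how often is it repeated" hash directly, here is a rough sketch using Enumerable#tally. This assumes Ruby 2.7 or newer, and repeat_counts is a made-up name:

```ruby
# Hypothetical helper: file name => number of occurrences, for repeated files only.
def repeat_counts(elements)
  elements.values.flatten.tally.select { |_file, count| count > 1 }
end

elements = {
  '/path1' => ['awesome_file.rb', 'beautiful.js'],
  '/path2' => ['beautiful.js', 'coffee.rb']
}
p repeat_counts(elements)
```

tally walks the list once, whereas calling files.count inside select rescans the whole list for every file.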
Related
I want to take a file, read the file into my program and split it into characters, split the resulting character array into a multidimensional array of 5,000 characters each, then write each separate array into a file found in the same location.
I have taken a file, read it, and created the multidimensional array. Now I want to write each separate single dimension array into separate files.
The file is obtained via user input. Then I created a chain helper method that stores the file to an array in the first mixin, this is then passed to another method that breaks it down into a multidimensional array, which finally hands it off to the end of the chain which currently is setup to make a new directory for which I will put these files.
require 'benchmark/ips'
file = "C:\\test.php"
class String
def file_to_array
file = self
return_file = File.open(file) do |line|
line.each_char.to_a
end
return return_file
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
read_file = File.read("I:/file_to_array/tmp.txt")
else
Dir.mkdir("I:\\file_to_array")
end
end
end
class Array
def file_divider
file_to_divide = self
file_to_separate = []
count = 0
while count != file_to_divide.length
separator = count % 5000
if separator == 0
start = count - 5000
stop = count
file_to_separate << file_to_divide[start..stop]
end
count = count + 1
end
return file_to_separate
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
else
Dir.mkdir("I:\\file_to_array")
end
end
end
Benchmark.ips do |result|
result.report { file.file_to_array.file_divider.file_write }
end
Test.php
<?php
echo "hello world"
?>
This untested code is where I'd start to split text into chunks and save it:
str = "I want to take a file"
str_array = str.scan(/.{1,10}/) # => ["I want to ", "take a fil", "e"]
str_array.each.with_index(1) do |str_chunk, i|
File.write("output#{i}", str_chunk)
end
This doesn't honor word-boundaries.
Reading a separate input file is easy: you can use read if you KNOW the input will never exceed the available memory and you don't care about performance.
Thinking about it further, if you want to read a text file and break its contents into smaller files, then read it in chunks:
input = File.open('input.txt', 'r')
i = 1
until input.eof? do
chunk = input.read(10)
File.write("output#{i}", chunk)
i += 1
end
input.close
Or, even better, because it automatically closes the input:
File.open('input.txt', 'r') do |input|
  i = 1
  until input.eof? do
    chunk = input.read(10)
    File.write("output#{i}", chunk)
    i += 1
  end
end
Those are not tested, but they look about right.
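To try the chunked-read loop without touching the disk, here is a small sketch that reads from any IO object. each_chunk is a made-up helper, and StringIO stands in for the real file:

```ruby
require 'stringio'

# Hypothetical helper: yields successive chunks of at most `size` bytes.
# Returns an Enumerator when no block is given.
def each_chunk(io, size)
  return enum_for(:each_chunk, io, size) unless block_given?

  # IO#read(size) returns nil at end of input, which ends the loop.
  while (chunk = io.read(size))
    yield chunk
  end
end

input = StringIO.new('I want to take a file')
each_chunk(input, 10).with_index(1) do |chunk, i|
  # With a real file you would call File.write("output#{i}", chunk) here.
  puts "output#{i}: #{chunk.inspect}"
end
```

Passing the block form of File.open as the io argument would work the same way, since File responds to read just like StringIO does.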
Use the standard File API and serialization. Marshal produces binary data, so write it in binary mode:
File.binwrite('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))
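A round-trip sketch of that suggestion, written to a temporary directory so it can be run anywhere. save_object and load_object are made-up names, and binary mode is used here on the assumption that the script runs on Windows, where text mode may rewrite newline bytes:

```ruby
require 'tmpdir'

# Hypothetical helper: serializes any marshalable object to a file.
def save_object(path, object)
  File.binwrite(path, Marshal.dump(object))
end

# Hypothetical helper: restores the object. Only load data you wrote yourself,
# because Marshal.load on untrusted input is unsafe.
def load_object(path)
  Marshal.load(File.binread(path))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'chunks.bin')
  save_object(path, [%w[a b], %w[c d]])
  p load_object(path)
end
```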
folder_to_analyze = ARGV.first
folder_path = File.join(Dir.pwd, folder_to_analyze)
unless File.directory?(folder_path)
  puts "Error: #{folder_path} is not a valid folder."
  exit
end
def get_csv_file_paths(path)
files = []
Dir.glob(path + '/**/*.csv').each do |f|
files << f
end
return files
end
def get_xlsx_file_path(path)
files = []
Dir.glob(path + '/**/*.xls').each do |f|
files << f
end
return files
end
files_to_process = []
files_to_process << get_csv_file_paths(folder_path)
files_to_process << get_xlsx_file_path(folder_path)
puts files_to_process[1].length # Not what I want, I want:
# puts files_to_process.length
I'm trying to make a simple script in Ruby that allows me to call it from the command line, like ruby counter.rb mailing_list1 and it goes to the folder and counts all .csv and .xls files.
I intend to operate on each file, getting a row count, etc.
Currently the files_to_process array is actually an array of arrays, which I don't want. I want a single array containing both the .csv and .xls files.
Since I don't know how to yield from the Dir.glob call, I added them to an array and returned that.
How can I accomplish this using a single array?
Just stick the file extensions together into one group:
Dir[path + "/**/*.{csv,xls}"]
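A runnable sketch of the brace pattern, using a throwaway directory tree. spreadsheet_paths is a made-up name; the only assumption is that the root path itself contains no glob metacharacters:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: every .csv and .xls file below root, in one flat array.
def spreadsheet_paths(root)
  Dir.glob(File.join(root, '**', '*.{csv,xls}')).sort
end

# Throwaway demo tree so the sketch runs anywhere.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'archive'))
  %w[list.csv notes.txt archive/old.xls].each do |name|
    FileUtils.touch(File.join(root, name))
  end
  puts spreadsheet_paths(root).length # prints 2
end
```

Because the result is already one flat array, files_to_process.length gives the total directly.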
Well, yielding is simple. Just yield.
def get_csv_file_paths(path)
Dir.glob(path + '/**/*.csv').each do |f|
yield f
end
end
def get_xlsx_file_path(path)
Dir.glob(path + '/**/*.xls').each do |f|
yield f
end
end
files_to_process = []
get_csv_file_paths(folder_path) {|f| files_to_process << f }
get_xlsx_file_path(folder_path) {|f| files_to_process << f }
puts files_to_process.length
Every method in Ruby can be passed a block, and the yield keyword sends data to that block. If the block may or may not be provided, yield is usually used together with block_given?.
yield f if block_given?
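A common companion to block_given? is returning an Enumerator when no block is passed, so the caller can chain to_a, count and friends. A small sketch, where each_with_extension is a made-up method working on a plain list of paths:

```ruby
# Hypothetical helper: yields each path with the given extension,
# or returns an Enumerator when called without a block.
def each_with_extension(paths, extension)
  return enum_for(:each_with_extension, paths, extension) unless block_given?

  paths.each { |path| yield path if File.extname(path) == extension }
end

paths = %w[a.csv b.xls c.csv]
each_with_extension(paths, '.csv') { |path| puts path }
puts each_with_extension(paths, '.xls').count
```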
Update
The code can be further simplified by passing your block directly to glob.each:
def get_csv_file_paths(path, &block)
  Dir.glob(path + '/**/*.csv').each(&block)
end
def get_xlsx_file_path(path, &block)
  Dir.glob(path + '/**/*.xls').each(&block)
end
That said, this block/proc conversion is a slightly advanced topic.
def get_folder_paths(root_path)
  Dir.glob(File.join(root_path, '**/*.csv')) + Dir.glob(File.join(root_path, '**/*.xls'))
end
folder_path = File.join(Dir.pwd, ARGV.first || '')
raise "#{folder_path} is not a valid folder" unless File.directory?(folder_path)
puts get_folder_paths(folder_path).length
The get_folder_paths method returns an array of CSV and XLS files. Building an array of file names may not be what you really want, especially if there are a lot of them. Passing a block to Dir.glob, so that each path is handled as it is yielded, would be more appropriate in that case if you did not need the file count first.
I am processing documents in ruby.
I have a document I am extracting specific strings from using a regexp and then adding them to another file. When added to the destination file they must be made unique, so if a string already exists in the destination file I add a simple suffix, e.g. <word>_1. Eventually I want to reference the strings by name, so random number generation or a string built from the date is no good.
At present I store each added word in an array, and every time I add a word I check that the string doesn't already exist in the array. That is fine if there is only one duplicate, but there might be two or more, so I need to check for the initial string and then loop, incrementing the suffix until the result doesn't exist. (I have simplified my code, so there may be bugs.)
def add_word(word)
  if @added_words.include? word
    suffix = 1
    suffixed_word = word
    while @added_words.include? suffixed_word
      suffixed_word = word + "_" + suffix.to_s
      suffix += 1
    end
    word = suffixed_word
  end
  @added_words << word
end
It looks messy, is there a better algorithm or ruby way of doing this?
Make @added_words a Set (don't forget to require 'set'). This makes for faster lookup, as sets are implemented with hashes, while still using include? to check for set membership. It's also easy to extract the highest used suffix:
>> s << 'foo'
#=> #<Set: {"foo"}>
>> s << 'foo_1'
#=> #<Set: {"foo", "foo_1"}>
>> word = 'foo'
#=> "foo"
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }
#=> "foo_1"
>> s << 'foo_12'
#=> #<Set: {"foo", "foo_1", "foo_12"}>
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }
#=> "foo_12"
Now to get the next value you can insert, you could just do the following (imagine you already had 12 foos, so the next should be a foo_13):
>> s << s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }.next
#=> #<Set: {"foo", "foo_1", "foo_12", "foo_13"}>
Sorry if the examples are a bit confused, I had anesthesia earlier today. It should be enough to give you an idea of how sets could potentially help you though (most of it would work with array too, but sets have faster lookup).
Change @added_words to a Hash with a default of zero. Then you can do:
@added_words = Hash.new(0)
def add_word( word)
  @added_words[word] += 1
end
# put it to work:
list = %w(test foo bar test bar bar)
names = list.map do |w|
  "#{w}_#{add_word(w)}"
end
p @added_words
#=> {"test"=>2, "foo"=>1, "bar"=>3}
p names
#=>["test_1", "foo_1", "bar_1", "test_2", "bar_2", "bar_3"]
In that case, I'd probably use a set or hash:
#in your class:
require 'set'
require 'forwardable'
extend Forwardable #I'm just including this to keep your previous api
#elsewhere you're setting up your instance_var, it's probably [] at the moment
def initialize
  @added_words = Set.new
end
#then instead of `def add_word(word); @added_words.add(word); end`:
def_delegator :@added_words, :add, :add_word
#or just change whatever loop to use @added_words.add('word') rather than self#add_word('word')
#@added_words.add('word') does nothing if 'word' already exists in the set.
If you've got some attributes that you're grouping via these sections, then a hash might be better:
#elsewhere you're setting up your instance_var, it's probably [] at the moment
def initialize
  @added_words = {}
end
def add_word(word, attrs={})
  @added_words[word] ||= []
  @added_words[word].push(attrs)
end
Doing it the "wrong way", but in slightly nicer code:
def add_word(word)
  if @added_words.include? word
    suffixed_word = 1.upto(1.0/0.0) do |suffix|
      candidate = [word, suffix].join("_")
      break candidate unless @added_words.include?(candidate)
    end
    word = suffixed_word
  end
  @added_words << word
end
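For comparison, here is a sketch that combines the Set and counter ideas from this thread: the first occurrence keeps its plain name, and later ones get the next free suffix. WordRegistry and its internals are made up for illustration and are not the asker's class:

```ruby
require 'set'

# Hypothetical class combining a Set for membership with a per-word counter.
class WordRegistry
  def initialize
    @added_words = Set.new
    @next_suffix = Hash.new(1) # next suffix to try for each base word
  end

  # Adds the word, suffixing it if needed, and returns the name actually stored.
  def add_word(word)
    candidate = word
    # Set#add? returns nil when the element is already present.
    until @added_words.add?(candidate)
      candidate = "#{word}_#{@next_suffix[word]}"
      @next_suffix[word] += 1
    end
    candidate
  end
end

registry = WordRegistry.new
p %w[test foo test test].map { |w| registry.add_word(w) }
```

The counter means repeated words do not rescan suffixes from 1 each time, and the add? check still guards against a literal word such as foo_2 having been added separately.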
The title really doesn't explain things. My situation is that I would like to read a file and put its contents into a hash. Now I want to make it clever: I want to create a loop that opens every file in a directory and puts each one into a hash. The problem is that I don't know how to assign a name based on the file name. For example:
hash={}
Dir.glob(path + "*") do |datafile|
file = File.open(datafile)
file.each do |line|
key, value = line.chomp("\t")
# Problem here is that I wish to have a different
# hash name for every file I loop through
hash[key]=value
end
file.close
end
Is this possible?
Why don't you use a hash whose keys are the file names (in your case "datafile") and whose value are hashes in which you insert your data?
hash = Hash.new { |h, key| h[key] = Hash.new }
Dir.glob(path + '*') do |datafile|
next unless File.stat(datafile).file?
File.open(datafile) do |file|
file.each do |line|
key, value = line.split("\t")
puts key, value
# Different hash name for every file is now hash[datafile]
hash[datafile][key]=value
end
end
end
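The same nested-hash idea can be split into two small methods, which makes the line parsing easy to check on its own. This is only a sketch: parse_pairs and load_directory are made-up names, and it assumes each line holds one key and one value separated by a tab:

```ruby
# Hypothetical helper: turns "key<TAB>value" lines into a hash.
def parse_pairs(lines)
  lines.each_with_object({}) do |line, pairs|
    # chomp drops the trailing newline; the limit of 2 keeps extra tabs in the value.
    key, value = line.chomp.split("\t", 2)
    pairs[key] = value unless key.nil? # skip blank lines
  end
end

# Hypothetical helper: file name => hash of that file's pairs.
def load_directory(dir)
  files = Dir.glob(File.join(dir, '*')).select { |path| File.file?(path) }
  files.each_with_object({}) do |path, result|
    result[File.basename(path)] = parse_pairs(File.foreach(path))
  end
end

p parse_pairs(["name\tAlice\n", "city\tParis\n"])
```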
You want to dynamically create variables with the names of the files you process?
try this:
Dir.glob(path + "*") do |fileName|
  File.open(fileName) { |file|
    # the variable `hash` and a variable named fileName will be
    # pointing to the same object...
    hash = eval("#{fileName} = Hash.new")
    file.each do |line|
      key, value = line.chomp.split("\t")
      hash[key]=value
    end
  }
end
Of course, you would have to make sure you rubify the filename first. A variable named "bla.txt" wouldn't be valid in Ruby, and neither would "path/to/bla.csv".
If you want to create a dynamic variable, you can also use #instance_variable_set (assuming that instance variables are also OK).
Dir.glob(path + "*") do |datafile|
  file = File.open(datafile)
  hash = {}
  file.each do |line|
    key, value = line.chomp.split("\t")
    hash[key] = value
  end
  file.close
  instance_variable_set("@file_#{File.basename(datafile)}", hash)
end
This only works when the filename is a valid Ruby variable name. Otherwise you would need some transformation.
Can't you just do the following?
filehash = {} # after the File.open line
...
# instead of hash[key] = value, next two lines
hash[datafile] = filehash
filehash[key] = value
You may want to use something like this:
hash[file] = {}
hash[file][key] = value
Two levels of hashes are enough now:
fileHash -> lineHash -> content.
I have a configuration class in Ruby that used to have keys like "core.username" and "core.servers", which was stored in a YAML file just like that.
Now I'm trying to change it to be nested, but without having to change all the places that refer to keys in the old way. I've managed it with the reader-method:
def [](key)
  namespace, *rest = key.split(".")
  target = @config[namespace]
  rest.each do |k|
    return nil unless target[k]
    target = target[k]
  end
  target
end
But when I tried the same with the writer method, it runs, but the value isn't set in the @config hash. @config is set with just a call to YAML.load_file.
I managed to get it working with eval, but that is not something I would like to keep for long.
def []=(key, value)
  namespace, *rest = key.split(".")
  target = "@config[\"#{namespace}\"]"
  rest.each do |key|
    target += "[\"#{key}\"]"
  end
  eval "#{target} = value"
  self[key]
end
Is there any decent way to achieve this, preferably without changing plugins and code throughout?
def []=(key, value)
  subkeys = key.split(".")
  lastkey = subkeys.pop
  subhash = subkeys.inject(@config) do |hash, k|
    hash[k]
  end
  subhash[lastkey] = value
end
Edit: Fixed the split.
PS: You can also replace the inject with an each-loop like in the [] method if you prefer. The important thing is that you do not call [] with the last key, but instead []= to set the value.
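One gap in the writer above is that it fails when an intermediate namespace does not exist yet. Here is a sketch that creates missing levels on the way down. The Config class is a made-up stand-in for the asker's class, and it assumes intermediate values are always hashes:

```ruby
# Hypothetical stand-in for the configuration class in the question.
class Config
  def initialize(data = {})
    @config = data
  end

  # Reads "a.b.c" by walking one level per key part; nil if any level is missing.
  def [](key)
    key.split('.').reduce(@config) do |target, k|
      return nil unless target.is_a?(Hash)

      target[k]
    end
  end

  # Writes "a.b.c", creating empty hashes for missing parent levels.
  def []=(key, value)
    *parents, last = key.split('.')
    target = parents.reduce(@config) { |hash, k| hash[k] ||= {} }
    target[last] = value
  end
end

config = Config.new('core' => { 'username' => 'alice' })
config['core.username'] = 'bob'
config['plugins.mail.enabled'] = true
p config['core.username'], config['plugins.mail.enabled']
```

As in the answer, the last key is set with []= on the parent hash, so the change lands inside @config rather than on a copy.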
I used recursion:
def change(hash)
  if hash.is_a? Hash
    hash.inject({}) do |acc, kv|
      acc[change(kv.first)] = change(kv.last)
      acc
    end
  else
    hash.to_s.split('.').map(&:strip) # Do your fancy stuff here
  end
end