Dir.glob to get all csv and xls files in folder - ruby

folder_to_analyze = ARGV.first
folder_path = File.join(Dir.pwd, folder_to_analyze)
unless File.directory?(folder_path)
  puts "Error: #{folder_path} is not a valid folder."
  exit
end
def get_csv_file_paths(path)
  files = []
  Dir.glob(path + '/**/*.csv').each do |f|
    files << f
  end
  return files
end

def get_xlsx_file_path(path)
  files = []
  Dir.glob(path + '/**/*.xls').each do |f|
    files << f
  end
  return files
end
files_to_process = []
files_to_process << get_csv_file_paths(folder_path)
files_to_process << get_xlsx_file_path(folder_path)
puts files_to_process[1].length # Not what I want, I want:
# puts files_to_process.length
I'm trying to make a simple script in Ruby that allows me to call it from the command line, like ruby counter.rb mailing_list1 and it goes to the folder and counts all .csv and .xls files.
I intend to operate on each file, getting a row count, etc.
Currently the files_to_process array is actually an array of arrays; I don't want that. I want to have a single array of both .csv and .xls files.
Since I don't know how to yield from the Dir.glob call, I added them to an array and returned that.
How can I accomplish this using a single array?

Just stick the file extensions together into one group:
Dir[path + "/**/*.{csv,xls}"]
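For example, assuming the folder_path from the question, the combined pattern gives one flat array, which is what files_to_process was meant to be:
files_to_process = Dir[folder_path + "/**/*.{csv,xls}"]
puts files_to_process.length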

Well, yielding is simple. Just yield.
def get_csv_file_paths(path)
  Dir.glob(path + '/**/*.csv').each do |f|
    yield f
  end
end

def get_xlsx_file_path(path)
  Dir.glob(path + '/**/*.xls').each do |f|
    yield f
  end
end
files_to_process = []
get_csv_file_paths(folder_path) {|f| files_to_process << f }
get_xlsx_file_path(folder_path) {|f| files_to_process << f }
puts files_to_process.length
Every method in Ruby can be passed a block, and the yield keyword sends data to that block. If the block may or may not be provided, yield is usually combined with block_given?:
yield f if block_given?
Update
The code can be further simplified by passing your block directly to glob.each:
def get_csv_file_paths(path, &block)
  Dir.glob(path + '/**/*.csv').each(&block)
end

def get_xlsx_file_path(path, &block)
  Dir.glob(path + '/**/*.xls').each(&block)
end
Although this block/proc conversion is a bit of an advanced topic.

def get_folder_paths(root_path)
  Dir.glob(File.join(root_path, '**', '*.csv')) +
    Dir.glob(File.join(root_path, '**', '*.xls'))
end
folder_path = File.join(Dir.pwd, ARGV.first || '')
raise "#{folder_path} is not a valid folder" unless File.directory?(folder_path)
puts get_folder_paths(folder_path).length
The get_folder_paths method returns an array of CSV and XLS file paths. Building an array of file names may not be what you really want, especially if there are a lot of them. Passing a block to Dir.glob, so each path is handled as it is found rather than collected into an array first, would be more appropriate in that case if you did not need the file count first.
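A minimal sketch of that block-based approach, reusing folder_path from above (the per-file work is a placeholder):
count = 0
Dir.glob(File.join(folder_path, '**', '*.{csv,xls}')) do |path|
  count += 1
  # per-file work (row counting, etc.) would go here
end
puts count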

How to write a multidimensional array into separate files and then read from them in order in Ruby

I want to take a file, read the file into my program and split it into characters, split the resulting character array into a multidimensional array of 5,000 characters each, then write each separate array into a file found in the same location.
I have taken a file, read it, and created the multidimensional array. Now I want to write each separate single dimension array into separate files.
The file is obtained via user input. I then created a chain of helper methods: the first (mixed into String) stores the file's contents in an array, that is passed to another method that breaks it down into a multidimensional array, and the end of the chain is currently set up to make a new directory in which I will put these files.
require 'benchmark/ips'
file = "C:\\test.php"
class String
  def file_to_array
    file = self
    return_file = File.open(file) do |line|
      line.each_char.to_a
    end
    return return_file
  end

  def file_write
    file_to_write = self
    if Dir.exist?("I:\\file_to_array")
      File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
      read_file = File.read("I:/file_to_array/tmp.txt")
    else
      Dir.mkdir("I:\\file_to_array")
    end
  end
end
class Array
  def file_divider
    file_to_divide = self
    file_to_separate = []
    count = 0
    while count != file_to_divide.length
      separator = count % 5000
      if separator == 0
        start = count - 5000
        stop = count
        file_to_separate << file_to_divide[start..stop]
      end
      count = count + 1
    end
    return file_to_separate
  end

  def file_write
    file_to_write = self
    if Dir.exist?("I:\\file_to_array")
      File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
    else
      Dir.mkdir("I:\\file_to_array")
    end
  end
end

Benchmark.ips do |result|
  result.report { file.file_to_array.file_divider.file_write }
end
Test.php
<?php
echo "hello world"
?>
This untested code is where I'd start to split text into chunks and save it:
str = "I want to take a file"
str_array = str.scan(/.{1,10}/) # => ["I want to ", "take a fil", "e"]
str_array.each.with_index(1) do |str_chunk, i|
File.write("output#{i}", str_chunk)
end
This doesn't honor word-boundaries.
Reading a separate input file is easy; you can use read if you KNOW the input will never exceed the available memory and you don't care about performance.
Thinking about it further, if you want to read a text file and break its contents into smaller files, then read it in chunks:
input = File.open('input.txt', 'r')
i = 1
until input.eof? do
  chunk = input.read(10)
  File.write("output#{i}", chunk)
  i += 1
end
input.close
Or even better because it automatically closes the input:
File.open('input.txt', 'r') do |input|
  i = 1
  until input.eof? do
    chunk = input.read(10)
    File.write("output#{i}", chunk)
    i += 1
  end
end
Those are not tested, but they look about right.
Use the standard File API and serialisation.
File.write('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))
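Reading the data back is the mirror image (a sketch; the path is the same hypothetical one as above, and binary mode is the safer choice for Marshal data):
File.binwrite('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))
data = Marshal.load(File.binread('path/to/yourfile.txt'))
p data # => [1, 2, 3]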

Hash with duplicate key values

I'm writing a small script to check for repeated files inside a folder. I did it with an array and was successful. The problem is that I also want to store the folder location, so I can see where the duplicated files are.
My first thought was using a Hash. But since you will have a lot of files in the same folder, I can't do hash[folder] = file. The reverse is also impossible, because if I have repeated files they will be overwritten (hash[file] = folder).
So what is the best approach to do that?
My code:
class FilesList
  attr_accessor :elements

  def initialize(path)
    @elements = Hash.new
    @path = path
    printDirectory(@path)
  end

  def printDirectory(folderPath)
    entries = Dir.entries(folderPath) - [".", "..", "repeat.rb"]
    entries.each do |single|
      if File.directory?("#{folderPath}/#{single}")
        printDirectory("#{folderPath}/#{single}")
      else
        @elements[single] = folderPath
      end
    end
  end

  def printArray
    puts @elements
  end

  def each()
    @elements.each do |x, y|
      yield x, y
    end
  end

  def checkRepeated
    if @elements.length == @elements.keys.uniq.length
      puts "No repeated Files"
    else
      counts = Hash.new(0)
      @elements.each do |key, val|
        counts[val] += 1
      end
      repeateds = counts.reject { |val, count| count == 1 }.keys
      puts repeateds
    end
  end
end
array = FilesList.new(Dir.pwd)
array.printArray
You may store arrays (or sets) of file names (or folder paths) as hash values.
For example, in your code you may change @elements[single] = folderPath to:
@elements[single] ||= []
@elements[single] << folderPath
And then later, your values will be arrays of the folders where each file was found.
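With that structure in place, checkRepeated can simply pick out the file names seen in more than one folder (a sketch assuming @elements is filled as above):
repeated = @elements.select { |file, folders| folders.length > 1 }
repeated.each do |file, folders|
  puts "#{file} appears in: #{folders.join(', ')}"
end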
Similar to the above, but don't use the file as the key.
@elements = Hash.new { |hash, key| hash[key] = [] } # block form, so each key gets its own array
entries.each do |single|
  if File.directory?("#{folderPath}/#{single}")
    printDirectory("#{folderPath}/#{single}")
  else
    @elements[folderPath] << single
  end
end
Then you'll get something that looks like this:
{ '/path1' => ['awesome_file.rb', 'beautiful.js'],
'/path2' => ['beautiful.js', 'coffee.rb'] }
And then, if I understand you correctly, you can find duplicate files like so:
files = @elements.values.flatten
repeateds = files.select{ |file| files.count(file) > 1 }
This will return an array: ["beautiful.js", "beautiful.js"], which you can call .uniq on to get just the one ["beautiful.js"], or do a count if you like, or map the results to another hash that tells you how often each file is repeated, etc.
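For instance, a hash of how often each duplicate occurs is one short step away (a sketch; tally needs Ruby 2.7+, otherwise group_by works the same way):
files = @elements.values.flatten
counts = files.tally.select { |_file, count| count > 1 }
# => {"beautiful.js"=>2}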

How do I detect end of file in Ruby?

I wrote the following script to read a CSV file:
f = File.open("aFile.csv")
text = f.read
text.each_line do |line|
if (f.eof?)
puts "End of file reached"
else
line_num +=1
if(line_num < 6) then
puts "____SKIPPED LINE____"
next
end
end
arr = line.split(",")
puts "line number = #{line_num}"
end
This code runs fine if I take out the lines:
if (f.eof?)
puts "End of file reached"
With these lines in I get an exception.
I was wondering how I can detect the end of file in the code above.
Try this short example:
f = File.open(__FILE__)
text = f.read
p f.eof? # -> true
p text.class #-> String
With f.read you read the whole file into text and reach EOF.
(Remark: __FILE__ is the script file itself. You may use your CSV file.)
In your code you use text.each_line. This executes each_line for the string text. It has no effect on f.
You could use File#each_line without using a variable text. The test for EOF is not necessary. each_line loops on each line and detects EOF on its own.
f = File.open(__FILE__)
line_num = 0
f.each_line do |line|
  line_num += 1
  if (line_num < 6)
    puts "____SKIPPED LINE____"
    next
  end
  arr = line.split(",")
  puts "line number = #{line_num}"
end
f.close
You should close the file after reading it. Using a block for this is more Ruby-like:
line_num = 0
File.open(__FILE__) do |f|
  f.each_line do |line|
    line_num += 1
    if (line_num < 6)
      puts "____SKIPPED LINE____"
      next
    end
    arr = line.split(",")
    puts "line number = #{line_num}"
  end
end
One general remark: There is a CSV library in Ruby. Normally it is better to use that.
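For example, a minimal sketch with the standard CSV library, using the hypothetical aFile.csv from the question; it handles the line loop and the splitting for you:
require 'csv'

CSV.foreach("aFile.csv").with_index(1) do |row, line_num|
  next if line_num < 6 # skip the first five lines, as in the question
  puts "line number = #{line_num}: #{row.inspect}"
end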
https://www.ruby-forum.com/topic/218093#946117 talks about this.
content = File.read("file.txt")
content = File.readlines("file.txt")
The above 'slurps' the entire file into memory.
content = []
File.foreach("file.txt") { |line| content << line }
You can also use IO#each_line. These last two options do not read the entire file into memory. The use of the block also closes your IO object automatically. There are other ways as well; the IO and File classes are pretty feature-rich!
I refer to IO objects, as File is a subclass of IO. I tend to use IO when I don't really need the added methods from the File class for the object.
This way you don't need to deal with EOF; Ruby will do it for you.
Sometimes the best handling is not to handle it at all, when you really don't need to.
Of course, Ruby has a method for this.
Without testing this, it seems you should perform a rescue rather than checking.
http://www.ruby-doc.org/core-2.0/EOFError.html
file = File.open("aFile.csv")
begin
  loop do
    some_line = file.readline
    # some stuff
  end
rescue EOFError
  # You've reached the end. Handle it.
end

Ruby : How can I detect/intelligently guess the delimiter used in a CSV file?

I need to be able to figure out which delimiter is being used in a CSV file (comma, space or semicolon) in my Ruby project. I know there is a Sniffer class in Python's csv module that can be used to guess a given file's delimiter. Is there anything similar to this in Ruby? Any kind of help or idea is greatly appreciated.
It looks like the Python implementation just checks a few dialects: excel or excel_tab. So, a simple implementation that just checks for "," or "\t" is:
COMMON_DELIMITERS = ['","',"\"\t\""].freeze
def sniff(path)
  first_line = File.open(path).first
  return unless first_line
  snif = {}
  COMMON_DELIMITERS.each do |delim|
    snif[delim] = first_line.count(delim)
  end
  snif = snif.sort { |a, b| b[1] <=> a[1] }
  snif[0][0] if snif.size > 0
end
Note: that would return the full delimiter it finds, e.g. ",", so to get , you could change the snif[0][0] to snif[0][0][1].
Also, I'm using count(delim) because it is a little faster, but if you added a delimiter composed of two (or more) characters of the same type, like --, then it would count each occurrence twice (or more) when weighing the type, so in that case it may be better to use scan(delim).length.
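Hypothetical usage of the sniff method above (the file name is made up):
delim = sniff('mailing_list.csv')
if delim
  puts "Detected column separator: #{delim[1]}" # e.g. "," from "\",\""
else
  puts "Could not read a first line from the file"
end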
Here is Gary S. Weaver's answer as we are using it in production. It is a good solution that works well.
class ColSepSniffer
  NoColumnSeparatorFound = Class.new(StandardError)
  EmptyFile = Class.new(StandardError)

  COMMON_DELIMITERS = [
    '","',
    '"|"',
    '";"'
  ].freeze

  def initialize(path:)
    @path = path
  end

  def self.find(path)
    new(path: path).find
  end

  def find
    fail EmptyFile unless first
    if valid?
      delimiters[0][0][1]
    else
      fail NoColumnSeparatorFound
    end
  end

  private

  def valid?
    !delimiters.collect(&:last).reduce(:+).zero?
  end

  # delimiters          #=> [["\"|\"", 54], ["\",\"", 0], ["\";\"", 0]]
  # delimiters[0]       #=> ["\"|\"", 54]
  # delimiters[0][0]    #=> "\"|\""
  # delimiters[0][0][1] #=> "|"
  def delimiters
    @delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
  end

  def most_found
    ->(a, b) { b[1] <=> a[1] }
  end

  def count
    ->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
  end

  def first
    @first ||= file.first
  end

  def file
    @file ||= File.open(@path)
  end
end
Spec
require "spec_helper"
describe ColSepSniffer do
describe ".find" do
subject(:find) { described_class.find(path) }
let(:path) { "./spec/fixtures/google/products.csv" }
context "when , delimiter" do
it "returns separator" do
expect(find).to eq(',')
end
end
context "when ; delimiter" do
let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }
it "returns separator" do
expect(find).to eq(';')
end
end
context "when | delimiter" do
let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }
it "returns separator" do
expect(find).to eq('|')
end
end
context "when empty file" do
it "raises error" do
expect(File).to receive(:open) { [] }
expect { find }.to raise_error(described_class::EmptyFile)
end
end
context "when no column separator is found" do
it "raises error" do
expect(File).to receive(:open) { [''] }
expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
end
end
end
end
I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.
One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.
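A rough sketch of that idea (untested; the candidate separators and the sample size are arbitrary choices, not part of any library):
require 'csv'

def guess_col_sep(path, candidates = [',', ';', "\t", '|'], sample_lines = 20)
  sample = File.foreach(path).first(sample_lines).join
  scores = candidates.map do |sep|
    begin
      widths = CSV.parse(sample, col_sep: sep).map(&:length)
      # A good separator splits every sampled row into the same number of columns (> 1).
      score = (widths.uniq.length == 1 && widths.first > 1) ? widths.first : 0
    rescue CSV::MalformedCSVError
      score = 0
    end
    [sep, score]
  end
  best = scores.max_by { |_sep, score| score }
  best && best[1] > 0 ? best[0] : nil
end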
You may want to check out the smarter_csv gem with the col_sep: :auto option.
SmarterCSV.process(filename, {col_sep: :auto}) do |chunk|
end

recursive file list in Ruby

I'm new to Ruby (being a Java dev) and trying to implement a method (oh, sorry, a function) that would retrieve and yield all files in the subdirectories recursively.
I've implemented it as:
def file_list_recurse(dir)
  Dir.foreach(dir) do |f|
    next if f == '.' or f == '..'
    f = dir + '/' + f
    if File.directory? f
      file_list_recurse(File.absolute_path f) { |x| yield x }
    else
      file = File.new(f)
      yield file
    end
  end
end
My questions are:
Does File.new really OPEN a file? In Java, new File("xxx") doesn't... If I need to yield some structure from which I could query file info (ctime, size, etc.), what would it be in Ruby?
{ |x| yield x } looks a little strange to me. Is it OK to yield from recursive functions like that, or is there some way to avoid it?
Is there any way to avoid checking for '.' and '..' on each iteration?
Is there a better way to implement this?
Thanks
PS:
the sample usage of my method is something like this:
curr_file = nil
file_list_recurse('.') do |file|
curr_file = file if curr_file == nil or curr_file.ctime > file.ctime
end
puts curr_file.to_path + ' ' + curr_file.ctime.to_s
(that would get you the oldest file from the tree)
==========
So, thanks to @buruzaemon I found out about the great Dir.glob function, which saved me a couple of lines of code.
Also, thanks to @Casper I found out about the File.stat method, which made my function run twice as fast as with File.new.
In the end my code is looking something like this:
i=0
curr_file = nil
Dir.glob('**/*', File::FNM_DOTMATCH) do |f|
file = File.stat(f)
next unless file.file?
i += 1
curr_file = [f, file] if curr_file == nil or curr_file[1].ctime > file.ctime
end
puts curr_file[0] + ' ' + curr_file[1].ctime.to_s
puts "total files #{i}"
=====
By default Dir.glob ignores file names starting with a dot (considered to be 'hidden' in *nix), so it's very important to add the second argument File::FNM_DOTMATCH
How about this?
puts Dir['**/*.*']
According to the docs, File.new does open the file. You might want to use File.stat instead, which gathers file-related stats into a queryable object. But note that the stats are gathered at the point of creation, not when you call the query methods like ctime.
Example:
Dir['**/*'].select { |f| File.file?(f) }.map { |f| File.stat(f) }
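Since File::Stat does not remember the path it came from, pairing each name with its stat and using min_by gives the oldest file in one pass (a sketch along the same lines):
pairs = Dir['**/*'].select { |f| File.file?(f) }.map { |f| [f, File.stat(f)] }
oldest_path, oldest_stat = pairs.min_by { |_path, stat| stat.ctime }
puts "#{oldest_path} #{oldest_stat.ctime}" if oldest_path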
The site tells me to consider accepting an answer; I hope it doesn't mind me answering my own question:
i=0
curr_file = nil
Dir.glob('**/*', File::FNM_DOTMATCH) do |f|
file = File.stat(f)
next unless file.file?
i += 1
curr_file = [f, file] if curr_file == nil or curr_file[1].ctime > file.ctime
end
puts curr_file[0] + ' ' + curr_file[1].ctime.to_s
puts "total files #{i}"
You could use the built-in Find module's find method.
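A minimal sketch with Find (it walks the tree recursively and yields every path, so the same oldest-file search looks like this):
require 'find'

oldest = nil
Find.find('.') do |path|
  next unless File.file?(path)
  stat = File.stat(path)
  oldest = [path, stat] if oldest.nil? || stat.ctime < oldest[1].ctime
end
puts "#{oldest[0]} #{oldest[1].ctime}" if oldest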
If you are on Windows, see my answer below for a much faster (~26 times) way than the standard Ruby Dir. If you use mtime it is still going to be far faster.
If you use another OS you could use the same technique; I'm curious whether the gain would be as big, but I'm almost certain it would be.