recursive file list in Ruby - ruby

I'm new to Ruby (being a Java dev) and trying to implement a method (oh, sorry, a function) that would retrieve and yield all files in the subdirectories recursively.
I've implemented it as:
def file_list_recurse(dir)
Dir.foreach(dir) do |f|
next if f == '.' or f == '..'
f = dir + '/' + f
if File.directory? f
file_list_recurse(File.absolute_path f) { |x| yield x }
else
file = File.new(f)
yield file
end
end
end
My questions are:
Does File.new really OPEN a file? In Java new File("xxx") doesn't... If I need to yield some structure that I could query file info (ctime, size etc) from what would it be in Ruby?
{ |x| yield x } looks a little strange to me, is this OK to do yields from recursive functions like that, or is there some way to avoid it?
Is there any way to avoid checking for '.' and '..' on each iteration?
Is there a better way to implement this?
Thanks
PS:
the sample usage of my method is something like this:
curr_file = nil
file_list_recurse('.') do |file|
curr_file = file if curr_file == nil or curr_file.ctime > file.ctime
end
puts curr_file.to_path + ' ' + curr_file.ctime.to_s
(that would get you the oldest file from the tree)
==========
So, thanks to @buruzaemon I found out about the great Dir.glob function, which saved me a couple of lines of code.
Also, thanks to @Casper I found out about the File.stat method, which made my function run twice as fast as with File.new.
In the end my code is looking something like this:
i=0
curr_file = nil
Dir.glob('**/*', File::FNM_DOTMATCH) do |f|
file = File.stat(f)
next unless file.file?
i += 1
curr_file = [f, file] if curr_file == nil or curr_file[1].ctime > file.ctime
end
puts curr_file[0] + ' ' + curr_file[1].ctime.to_s
puts "total files #{i}"
=====
By default Dir.glob ignores file names starting with a dot (considered to be 'hidden' in *nix), so it's very important to add the second argument File::FNM_DOTMATCH

How about this?
puts Dir['**/*.*']

According to the docs, File.new does open the file. You might want to use File.stat instead, which gathers file-related stats into a queryable object. Note that the stats are gathered when the object is created, not when you call query methods like ctime.
Example:
Dir['**/*'].select { |f| File.file?(f) }.map { |f| File.stat(f) }
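As a rough sketch of how the recursive method from the question could yield File::Stat objects instead of open File handles: the version below forwards the caller's block with &block rather than wrapping it in { |x| yield x }. The name each_file_stat is made up for this illustration, and Dir.each_child (which never yields '.' or '..') assumes Ruby 2.5 or newer.

```ruby
# Hypothetical helper: walks dir recursively and calls the block with
# (path, File::Stat) for every regular file. Assumes Ruby 2.5+ (Dir.each_child).
def each_file_stat(dir, &block)
  return enum_for(:each_file_stat, dir) unless block

  Dir.each_child(dir) do |name|      # each_child skips '.' and '..'
    path = File.join(dir, name)
    stat = File.lstat(path)          # lstat does not follow symlinks
    if stat.directory?
      each_file_stat(path, &block)   # pass the same block down the tree
    elsif stat.file?
      block.call(path, stat)
    end
  end
end
```

Usage would look like each_file_stat('.') { |path, stat| puts "#{path} #{stat.size}" }. No file is opened, only stat'ed.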

This site tells me to consider accepting an answer; I hope it won't mind me answering it myself:
i=0
curr_file = nil
Dir.glob('**/*', File::FNM_DOTMATCH) do |f|
file = File.stat(f)
next unless file.file?
i += 1
curr_file = [f, file] if curr_file == nil or curr_file[1].ctime > file.ctime
end
puts curr_file[0] + ' ' + curr_file[1].ctime.to_s
puts "total files #{i}"

You could use the built-in Find module's find method.
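A minimal sketch of what that might look like for the "oldest file" use case from the question. The method name oldest_file is an assumption for illustration; Find.find itself is part of the standard library and visits dot-files as well.

```ruby
require 'find'

# Hypothetical helper: returns [path, File::Stat] for the regular file with
# the earliest ctime under root, or nil when the tree contains no files.
def oldest_file(root)
  oldest = nil
  Find.find(root) do |path|
    stat = File.lstat(path)
    next unless stat.file?           # skip directories, symlinks, etc.
    oldest = [path, stat] if oldest.nil? || stat.ctime < oldest[1].ctime
  end
  oldest
end
```

Calling path, stat = oldest_file('.') then gives both the name and the stat object, much like the [f, file] pair in the asker's final code.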

If you are on Windows, see my answer linked below for a much faster (~26 times) way than the standard Ruby Dir. If you use mtime it's still going to be way faster.
If you use another OS you could use the same technique. I'm curious whether the gain would be that big, but I'm almost certain it would be.
How to find the file path of file that is not the current file in ruby

Related

Puts arrays in file using ruby

This is a part of my file:
project(':facebook-android-sdk-3-6-0').projectDir = new File('facebook-android-sdk-3-6-0/facebook-android-sdk-3.6.0/facebook')
project(':Forecast-master').projectDir = new File('forecast-master/Forecast-master/Forecast')
project(':headerListView').projectDir = new File('headerlistview/headerListView')
project(':library-sliding-menu').projectDir = new File('library-sliding-menu/library-sliding-menu')
I need to extract the names of the libs. This is my ruby function:
def GetArray
out_file = File.new("./out.txt", "w")
File.foreach("./file.txt") do |line|
l=line.scan(/project\(\'\:(.*)\'\).projectDir/)
File.open(out_file, "w") do |f|
l.each do |ch|
f.write("#{ch}\n")
end
end
puts "#{l} "
end
end
My function returns this:
[]
[["CoverFlowLibrary"]]
[["Android-RSS-Reader-Library-master"]]
[["library"]]
[["facebook-android-sdk-3-6-0"]]
[["Forecast-master"]]
My problem is that I find nothing in out_file. How can I write to a file? Otherwise, I only need to get the name of the libs in the file.
Meditate on this:
"project(':facebook-android-sdk-3-6-0').projectDir'".scan(/project\(\'\:(.*)\'\).projectDir/)
# => [["facebook-android-sdk-3-6-0"]]
When scan sees the capturing (...), it will create a sub-array. That's not what you want. The knee-jerk reaction is to flatten the resulting array of arrays but that's really just a band-aid on the code because you chose the wrong method.
Instead consider this:
"project(':facebook-android-sdk-3-6-0').projectDir'"[/':([^']+)'/, 1]
# => "facebook-android-sdk-3-6-0"
This is using String's [] method to apply a regular expression with a capture and return that captured text. No sub-arrays are created.
scan is powerful and definitely has its place, but not for this sort of "find one thing" parsing.
Regarding your code, I'd do something like this untested code:
def get_array
File.open('./out.txt', 'w') do |out_file| # File.open yields the file to the block; File.new ignores a block
File.foreach('./file.txt') do |line|
l = line[/':([^']+)'/, 1]
out_file.puts l
puts l
end
end
end
Methods in Ruby are NOT camelCase, they're snake_case. Constants, like classes, start with a capital letter and are CamelCase. Don't go all Java on us, especially if you want to write code for a living. So GetArray should be get_array. Also, don't start methods with "get_", and don't call it array; Use to_a to be idiomatic.
When building a regular expression start simple and do your best to keep it simple. It's a maintainability thing and helps to reduce insanity. /':([^']+)'/ is a lot easier to read and understand, and accomplishes the same as your much-too-complex pattern. Regular expression engines are greedy and lazy and want to do as little work as possible, which is sometimes totally evil, but once you understand what they're doing it's possible to write very small/succinct patterns to accomplish big things.
Breaking it down, it basically says "find the first ': and then capture text until the next '", which is what you're looking for. project( can be ignored, as can ).projectDir.
And actually,
/':([^']+)'/
could really be written
/:([^']+)'/
but I felt generous and looked for the leading ' too.
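To see the simpler pattern applied to several lines at once, here is a small sketch. The helper name library_names and the comment line in the sample data are made up for illustration; the two project lines are shortened from the question's file.

```ruby
# Hypothetical helper: returns the library names found in the given lines.
def library_names(lines)
  # String#[] with a capture index returns nil when a line does not match,
  # so compact drops those lines.
  lines.map { |line| line[/':([^']+)'/, 1] }.compact
end

lines = [
  "project(':Forecast-master').projectDir = new File('forecast-master/Forecast-master/Forecast')",
  "// a line without a project name",
  "project(':headerListView').projectDir = new File('headerlistview/headerListView')"
]
names = library_names(lines)
puts names
```

The result is a flat array of strings, so no flatten is needed before writing it to a file.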
The problem is that you're opening the file twice: once in:
out_file = File.new("./out.txt", "w")
and then once for each line:
File.open(out_file, "w") do |f| ...
Try this instead:
def GetArray
File.open("./out.txt", "w") do |f|
File.foreach("./file.txt") do |line|
l=line.scan(/project\(\'\:(.*)\'\).projectDir/)
l.each do |ch|
f.write("#{ch}\n")
end # l.each
end # File.foreach
end # File.open
end # def GetArray

Functionally find mapping of first value that passes a test

In Ruby, I have an array of simple values (possible encodings):
encodings = %w[ utf-8 iso-8859-1 macroman ]
I want to keep reading a file from disk until the results are valid. I could do this:
good = encodings.find{ |enc| IO.read(file, "r:#{enc}").valid_encoding? }
contents = IO.read(file, "r:#{good}")
...but of course this is dumb, since it reads the file twice for the good encoding. I could program it in gross procedural style like so:
contents = nil
encodings.each do |enc|
if (s=IO.read(file, "r:#{enc}")).valid_encoding?
contents = s
break
end
end
But I want a functional solution. I could do it functionally like so:
contents = encodings.map{|e| IO.read(f, "r:#{e}")}.find{|s| s.valid_encoding? }
…but of course that keeps reading files for every encoding, even if the first was valid.
Is there a simple pattern that is functional, but does not keep reading the file after the first success is found?
If you sprinkle a lazy in there, map will only consume those elements of the array that are used by find - i.e. once find stops, map stops as well. So this will do what you want:
possible_reads = encodings.lazy.map {|e| IO.read(f, "r:#{e}")}
contents = possible_reads.find {|s| s.valid_encoding? }
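Wrapped up as a method, the same idea might look like the sketch below. The method name is hypothetical, and it passes the encoding through File.read's encoding: option rather than a mode string, which is an assumption on my part and not what the question uses. Enumerable#lazy needs Ruby 2.0 or newer.

```ruby
# Hypothetical helper: returns the file's contents read with the first
# encoding that yields a valid string, or nil if none of them does.
def read_with_first_valid_encoding(path, encodings)
  encodings.lazy
           .map { |enc| File.read(path, encoding: enc) }  # runs only on demand
           .find(&:valid_encoding?)                       # stops at the first hit
end
```

Because find pulls elements one at a time, the file is read once per encoding tried and never after a valid one is found.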
Hopping on sepp2k's answer: If you can't use 2.0, lazy enums can be easily implemented in 1.9:
class Enumerator
def lazy_find
self.class.new do |yielder|
self.each do |element|
if yield(element)
yielder.yield(element)
break
end
end
end
end
end
a = (1..100).to_enum
p a.lazy_find { |i| i.even? }.first
# => 2
You want to use the break statement:
contents = encodings.each do |e|
s = IO.read( f, "r:#{e}" )
s.valid_encoding? and break s
end
The best I can come up with is with our good friend inject:
contents = encodings.inject(nil) do |s, enc|
  s || ((c = IO.read(f, mode: "r:#{enc}")).valid_encoding? && c)
end
This is still sub-optimal because it continues to loop through encodings after finding a match, though it doesn't do anything with them, so it's a minor ugliness. Most of the ugliness comes from...well, the code itself. :/

Dir.glob to get all csv and xls files in folder

folder_to_analyze = ARGV.first
folder_path = File.join(Dir.pwd, folder_to_analyze)
unless File.directory?(folder_path)
puts "Error: #{folder_path} is not a valid folder."
exit
end
def get_csv_file_paths(path)
files = []
Dir.glob(path + '/**/*.csv').each do |f|
files << f
end
return files
end
def get_xlsx_file_path(path)
files = []
Dir.glob(path + '/**/*.xls').each do |f|
files << f
end
return files
end
files_to_process = []
files_to_process << get_csv_file_paths(folder_path)
files_to_process << get_xlsx_file_path(folder_path)
puts files_to_process[1].length # Not what I want, I want:
# puts files_to_process.length
I'm trying to make a simple script in Ruby that allows me to call it from the command line, like ruby counter.rb mailing_list1 and it goes to the folder and counts all .csv and .xls files.
I intend to operate on each file, getting a row count, etc.
Currently the files_to_process array is actually an array of arrays, which I don't want. I want a single array containing both the .csv and .xls files.
Since I don't know how to yield from the Dir.glob call, I added them to an array and returned that.
How can I accomplish this using a single array?
Just stick the file extensions together into one group:
Dir[path + "/**/*.{csv,xls}"]
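A short sketch of that brace pattern inside a method, so the result is one flat array. The method name spreadsheet_paths is an assumption for illustration.

```ruby
# Hypothetical helper: returns every .csv and .xls path under root
# in a single flat array.
def spreadsheet_paths(root)
  Dir.glob(File.join(root, '**', '*.{csv,xls}'))
end
```

With this, puts spreadsheet_paths(folder_path).length prints the combined count. Note that the pattern has to match the whole file name, so .xlsx files are not included.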
Well, yielding is simple. Just yield.
def get_csv_file_paths(path)
Dir.glob(path + '/**/*.csv').each do |f|
yield f
end
end
def get_xlsx_file_path(path)
Dir.glob(path + '/**/*.xls').each do |f|
yield f
end
end
files_to_process = []
get_csv_file_paths(folder_path) {|f| files_to_process << f }
get_xlsx_file_path(folder_path) {|f| files_to_process << f }
puts files_to_process.length
Every method in Ruby can be passed a block, and the yield keyword sends data to that block. If the block may or may not be provided, yield is usually combined with block_given?:
yield f if block_given?
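A common way to handle the "no block given" case is to return an Enumerator instead of silently doing nothing. The sketch below shows that pattern; the method name each_spreadsheet is hypothetical.

```ruby
# Hypothetical helper: yields each matching path, or returns an Enumerator
# when it is called without a block.
def each_spreadsheet(root)
  return enum_for(:each_spreadsheet, root) unless block_given?

  Dir.glob(File.join(root, '**', '*.{csv,xls}')) { |path| yield path }
end
```

Callers can then write either each_spreadsheet(dir) { |f| ... } or each_spreadsheet(dir).count.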
Update
The code can be further simplified by passing your block directly to glob.each:
def get_csv_file_paths(path, &block)
Dir.glob(path + '/**/*.csv').each(&block)
end
def get_xlsx_file_path(path, &block)
Dir.glob(path + '/**/*.xls').each(&block)
end
This block/proc conversion is a slightly advanced topic, though.
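def get_folder_paths(root_path)
  # Search under root_path rather than the current working directory.
  Dir.glob(File.join(root_path, '**/*.csv')) + Dir.glob(File.join(root_path, '**/*.xls'))
end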
folder_path = File.join(Dir.pwd, ARGV.first || '')
raise "#{folder_path} is not a valid folder" unless File.directory?(folder_path)
puts get_folder_paths(folder_path).length
The get_folder_paths method returns an array of CSV and XLS files. Building an array of file names may not be what you really want, especially if there are a lot of them. If you do not need the file count first, passing a block to Dir.glob, which then yields each path to the block instead of returning an array, would be more appropriate.
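As an illustration of handling each file as it comes rather than collecting names first, the sketch below tallies matches per extension by passing a block to Dir.glob. The method name count_by_extension is made up for this example.

```ruby
# Hypothetical helper: counts .csv and .xls files under root without
# building an intermediate array of paths.
def count_by_extension(root)
  counts = Hash.new(0)
  Dir.glob(File.join(root, '**', '*.{csv,xls}')) do |path|
    counts[File.extname(path)] += 1   # per-file work would go here
  end
  counts
end
```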

A twist on directory walking in Ruby

I'd like to do the following:
Given a directory tree:
Root
|_dirA
|_dirB
|_file1
|_file2
|_dirC
|_dirD
|_dirE
|_file3
|_file4
|_dirF
|_dirG
|_file5
|_file6
|_file7
... I'd like to walk the directory tree and build an array that contains the path to the first file in each directory that has at least one file. The overall structure may be quite large with many more files than directories, so I'd like to capture just the path to the first file without iterating through all the files in a given directory. One file is enough. For the above tree, the result should look like an array that contains only:
root/dirB/file1
root/dirC/dirD/dirE/file3
root/dirF/dirG/file5
I've played with the Dir and Find options in ruby, but my approach feels too brute-force-ish.
Is there an efficient way to code this functionality? It feels like I am missing some ruby trick here.
Many thanks!
Here's my approach:
root="/home/subtest/tsttree/"
Dir.chdir(root)
dir_list=Dir.glob("**/*/") #this invokes recursion
result=Array.new
dir_list.each do |d|
Dir.chdir(root + d)
Dir.open(Dir.pwd).each do |filename|
next if File.directory? filename #some directories may contain only other directories so exclude them
result.push(d + filename)
break
end
end
puts result
Works, but seems messy.
require 'find'
require 'pathname'
# My answer to stackoverflow question posted here:
# http://stackoverflow.com/questions/12684736/a-twist-on-directory-walking-in-ruby
class ShallowFinder
  def initialize(root)
    @matches = {}
    @root = Pathname(root)
  end
  def matches
    while match = next_file
      @matches[match.parent.to_s] = match
    end
    @matches.values
  end
  private
  def next_file
    @root.find do |entry|
      Find.prune if previously_matched?(entry)
      return entry if entry.file?
    end
    nil
  end
  def previously_matched?(entry)
    return unless entry.directory?
    @matches.key?(entry.to_s)
  end
end
puts ShallowFinder.new('Root').matches
Outputs:
Root/B/file1
Root/C/D/E/file3
Root/F/G/file5

Increment part of a string in Ruby

I have a method in a Ruby script that is attempting to rename files before they are saved. It looks like this:
def increment (path)
if path[-3,2] == "_#"
print " Incremented file with that name already exists, renaming\n"
count = path[-1].chr.to_i + 1
return path.chop! << count.to_s
else
print " A file with that name already exists, renaming\n"
return path << "_#1"
end
end
Say you have 3 files with the same name being saved to a directory, we'll say the file is called example.mp3. The idea is that the first will be saved as example.mp3 (since it won't be caught by if File.exists?("#{file_path}.mp3") elsewhere in the script), the second will be saved as example_#1.mp3 (since it is caught by the else part of the above method) and the third as example_#2.mp3 (since it is caught by the if part of the above method).
The problem I have is twofold.
1) if path[-3,2] == "_#" won't work for files with an integer of more than one digit (example_#11.mp3 for example) since the character placement will be wrong (you'd need it to be path[-4,2] but then that doesn't cope with 3 digit numbers etc).
2) I'm never reaching problem 1) since the method doesn't reliably catch file names. At the moment it will rename the first to example_#1.mp3 but the second gets renamed to the same thing (causing it to overwrite the previously saved file).
This is possibly too vague for Stack Overflow but I can't find anything that addresses the issue of incrementing a certain part of a string.
Thanks in advance!
Edit/update:
Wayne's method below seems to work on its own but not when included as part of the whole script: it can increment a file once (from example.mp3 to example_#1.mp3) but doesn't cope with taking example_#1.mp3 and incrementing it to example_#2.mp3. To provide a little more context, when the script finds a file to save it currently passes the name to Wayne's method like this:
file_name = increment(image_name)
File.open("images/#{file_name}.jpeg", 'w') do |output|
open(image_url) do |input|
output << input.read
end
end
I've edited Wayne's script a little so now it looks like this:
def increment (name)
name = name.gsub(/\s{2,}|(http:\/\/)|(www.)/i, '')
if File.exists?("images/#{name}.jpeg")
_, filename, count, extension = *name.match(/(\A.*?)(?:_#(\d+))?(\.[^.]*)?\Z/)
count = (count || '0').to_i + 1
"#{name}_##{count}#{extension}"
else
return name
end
end
Where am I going wrong? Again, thanks in advance.
A regular expression will git 'er done:
#!/usr/bin/ruby1.8
def increment(path)
_, filename, count, extension = *path.match(/(\A.*?)(?:_#(\d+))?(\.[^.]*)?\Z/)
count = (count || '0').to_i + 1
"#{filename}_##{count}#{extension}"
end
p increment('example') # => "example_#1"
p increment('example.') # => "example_#1."
p increment('example.mp3') # => "example_#1.mp3"
p increment('example_#1.mp3') # => "example_#2.mp3"
p increment('example_#2.mp3') # => "example_#3.mp3"
This probably doesn't matter for the code you're writing, but if you ever may have multiple threads or processes using this algorithm on the same files, there's a race condition when checking for existence before saving: Two writers can both find the same filename unused and write to it. If that matters to you, then open the file in a mode that fails if it exists, rescuing the exception. When the exception occurs, pick a different name. Roughly:
loop do
begin
File.open(filename, File::CREAT | File::EXCL | File::WRONLY) do |file|
file.puts "Your content goes here"
end
break
rescue Errno::EEXIST
filename = increment(filename)
redo
end
end
Here's a variation that doesn't accept a file name with an existing count:
def non_colliding_filename( filename )
  # File.exist? replaces File.exists?, which was removed in Ruby 3.2
  if File.exist?(filename)
    base, ext = /\A(.+?)(\.[^.]+)?\Z/.match( filename ).to_a[1..-1]
    i = 1
    i += 1 while File.exist?( filename = "#{base}_##{i}#{ext}" )
  end
  filename
end
Proof:
%w[ foo bar.mp3 jim.bob.mp3 ].each do |desired|
3.times{
file = non_colliding_filename( desired )
p file
File.open( file, 'w' ){ |f| f << "tmp" }
}
end
#=> "foo"
#=> "foo_#1"
#=> "foo_#2"
#=> "bar.mp3"
#=> "bar_#1.mp3"
#=> "bar_#2.mp3"
#=> "jim.bob.mp3"
#=> "jim.bob_#1.mp3"
#=> "jim.bob_#2.mp3"
