Why does Ruby seem to access files in a directory randomly? - ruby

Is this by design?
Here's the code:
class FileRenamer
  def RenameFiles(folder_path)
    files = Dir.glob(folder_path + "/*")
  end
end
puts "Renaming files..."
renamer = FileRenamer.new()
files = renamer.RenameFiles("/home/papuccino1/Desktop/Test")
puts files
puts "Renaming complete."
It seems to be fetching the files in random order, not as they are displayed in Nautilus.
Is this by design? I'm just curious.

The order should be the same every time on a particular OS, however it is different across operating systems.
The behaviour of Dir.glob cannot be relied upon to be the same across different OSs. I'm not sure this is by design; it looks more like an artefact of the underlying filesystems.
On Windows and Linux the results are sorted by hierarchy, and then alphabetically; on Mac OS X the results are sorted alphabetically.
You could mitigate the effect by calling sort on your results e.g.:
files = Dir.glob("./*").sort
or if you wanted it case insensitive, perhaps:
files = Dir.glob("./*").sort {|a,b| a.upcase <=> b.upcase}
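As a minimal sketch of the sorting advice above: the helper name sorted_entries and the temporary directory are assumptions made for illustration, not part of the original question. It sorts case-insensitively and uses the raw path as a tie-breaker, so the result does not depend on the order the OS returns.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not from the original answer): list a folder's entries
# in a stable order, independent of the order the OS returns them in.
def sorted_entries(folder_path)
  # Sort case-insensitively, falling back to the raw path to break ties.
  Dir.glob(File.join(folder_path, "*")).sort_by { |path| [path.downcase, path] }
end

# Demo in a throwaway directory so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  %w[b.txt A.txt c.txt].each { |name| FileUtils.touch(File.join(dir, name)) }
  puts sorted_entries(dir).map { |path| File.basename(path) }
end
```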

The answer from Scott is out of date. I ran Dir.glob on macOS 10.15.6 Catalina, and the files were not returned in alphabetical order. According to the Ruby docs, the ordering is determined by the OS.
https://ruby-doc.org/core-2.5.1/Dir.html
Note that the pattern is not a regexp, it's closer to a shell glob. See File.fnmatch for the meaning of the flags parameter. Case sensitivity depends on your system (File::FNM_CASEFOLD is ignored), as does the order in which the results are returned.

Related

How to check for multiple words inside a folder

I have words in a text file called words.txt, and I need to check if any of those words are in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
  array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select { |f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any? { |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
  array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union; however, it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
(?-mix: will cause you problems so don't do that. Instead use:
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use, searching becomes a simpler matter. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
# Both word boundaries sit outside the group so they apply to every alternative.
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select { |f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
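A line-by-line variant along the lines recommended above might look like the following sketch. The helper names build_word_regex and file_mentions_any? and the sample file are hypothetical; only Regexp.union, the word-boundary pattern, and File.foreach come from the answer. It assumes the files are text in a valid encoding.

```ruby
require "tmpdir"

# Hypothetical helper: builds one whole-word pattern from a list of words.
def build_word_regex(words)
  /\b(?:#{Regexp.union(words).source})\b/
end

# Hypothetical helper: true if any line of the file contains one of the words.
# File.foreach reads one line at a time, and any? stops at the first hit.
def file_mentions_any?(filepath, word_re)
  File.foreach(filepath).any? { |line| line.match?(word_re) }
end

# Demo on a throwaway file so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  sample = File.join(dir, "sample.txt")
  File.write(sample, "nothing foolish here\nbut a bar appears\n")
  word_re = build_word_regex(%w[foo bar])
  puts "#{sample}: #{file_mentions_any?(sample, word_re)}"
end
```

Because the file is never read in one piece, memory use stays roughly proportional to the longest line rather than to the file size.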
Recursively searches directory for any of the words contained in words.txt
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
Dir['Source/**/*.{cpp,txt,html}'].select { |f| File.file?(f) }.each do |filepath|
  puts filepath
  # Pass the open mode as a keyword; a bare second argument is treated as the line separator.
  puts File.readlines(filepath, mode: "r:ascii").grep(re).any?
end

Getting a huge list of files in ruby

I'm trying to do some operations on a directory which contains nearly 20 million files. I tried Dir.glob, Dir.foreach and Dir.entries without success.
Is there anything similar to C#'s Directory.EnumerateFiles in Ruby which can enumerate a huge list of files?
Dir#read might do the trick.
dir = Dir.new(path)
while entry = dir.read
  puts entry
end
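The same Dir#read loop can be wrapped so the directory handle is always closed and the "." and ".." entries are skipped. This is only a sketch: the helper name each_entry_streaming and the demo directory are assumptions, and whether it copes with 20 million files on a given system is untested.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: yields entry names one at a time via Dir#read,
# so the full listing is never built up as an array in memory.
def each_entry_streaming(path)
  # The block form of Dir.open closes the handle even if the caller raises.
  Dir.open(path) do |dir|
    while (entry = dir.read)
      next if entry == "." || entry == ".."
      yield entry
    end
  end
end

# Demo in a throwaway directory so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  %w[one.txt two.txt three.txt].each { |name| FileUtils.touch(File.join(dir, name)) }
  count = 0
  each_entry_streaming(dir) { |_entry| count += 1 }
  puts "entries seen: #{count}"
end
```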

How can I find just the PDF files under folders "Pricelist" and "Price List"?

I have a task to find out all the PDF files under several price list folders using JRuby on Windows 7. The folder structure is as follows:
WorkSpace/Data/2015/city1/A/...
WorkSpace/Data/2015/city1/B/...
WorkSpace/Data/2015/city1/Pricelist/...
WorkSpace/Data/2015/city1/...
WorkSpace/Data/2015/city1/Price List/.....
WorkSpace/Data/2015/city2/A/...
WorkSpace/Data/2015/city2/C/...
WorkSpace/Data/2015/city2/Pricelist/...
WorkSpace/Data/2015/city2/D/...
WorkSpace/Data/2015/city2/Price List/.....
WorkSpace/Data/2016/city1/folder1/...
WorkSpace/Data/2016/city1/folder2/...
WorkSpace/Data/2016/city1/Pricelist/...
WorkSpace/Data/2016/city1/folder3/...
WorkSpace/Data/2016/city1/folder4/Price List/...
WorkSpace/Data/2016/city2/folder1/...
WorkSpace/Data/2016/city2/folder2/...
WorkSpace/Data/2016/city2/Pricelist/...
WorkSpace/Data/2016/city2/folder3/...
WorkSpace/Data/2016/city2/folder4/Price List/...
... represents all kinds of files under their corresponding folder.
I only want to find the PDF files under folder Pricelist and Price List. How can I do this?
I read Searching a folder and all of its subfolders for files of a certain type. This is an answer which I think is helpful, but how can I modify the expression /.*\.pdf$/ to achieve my goal?
Use a Recursive Glob
All you need to find your files is Dir#glob and Enumerable#grep. For example:
Dir.glob('WorkSpace/Data/**/*.pdf').grep(/Price List|Pricelist/)
This will collect all the PDF files using a recursive glob pattern that descends into all subdirectories starting at WorkSpace/Data (adjust the path to this starting directory as needed), and then returns only the results that match the directories you're grepping for. In this case, we're using a regular expression pattern with alternation to find either of the two directories you're looking for, without regard to how deeply nested the desired directories might be.
There may be more efficient ways to do this, or you may need to tweak the regex if it's too permissive for you, but this certainly solves the problem without needing to know much more than the root of the directory tree you want to search.
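If the plain alternation turns out to be too permissive (it would also match a folder such as Pricelist_old, or a PDF whose file name contains "Pricelist"), one possible tightening is to require the folder name to be a complete path segment. The helper name price_list_pdfs and the demo tree below are hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: PDFs that sit somewhere below a folder whose name is
# exactly one of folder_names (matched as a whole path segment).
def price_list_pdfs(root, folder_names = ["Pricelist", "Price List"])
  segment = %r{/(?:#{Regexp.union(folder_names).source})/}
  Dir.glob(File.join(root, "**", "*.pdf")).grep(segment).sort
end

# Demo on a throwaway tree so the sketch runs anywhere.
Dir.mktmpdir do |root|
  [
    "2015/city1/Pricelist/a.pdf",
    "2015/city1/Price List/sub/b.pdf",
    "2015/city1/A/c.pdf",
    "2015/city1/Pricelist_old/d.pdf"
  ].each do |relative|
    full = File.join(root, relative)
    FileUtils.mkdir_p(File.dirname(full))
    FileUtils.touch(full)
  end
  puts price_list_pdfs(root).map { |path| path.sub("#{root}/", "") }
end
```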
You'll probably want to look at the Find module. The code would be something like this:
require 'find'

results = []
directory_list = []
# First pass: collect the price list directories without descending into them.
Find.find('WorkSpace/Data') do |path|
  if FileTest.directory?(path)
    fn = File.basename(path)
    if fn == 'Pricelist' || fn == 'Price List'
      directory_list << path
      Find.prune
    end
  end
end
# Second pass: collect the PDF files below each directory found above.
directory_list.each do |starting_path|
  Find.find(starting_path) do |path|
    if File.extname(path) == '.pdf'
      results << path
    end
  end
end
The first loop scans and finds all the directories that match the directory name condition, skipping scanning below them because that will happen in the second loop. The second loop takes each of the directories found by the first loop and scans them for files ending in the '.pdf' extension, adding each one to the results list.
You can hoist the second loop's body up into the first loop in place of directory_list << path, but the resulting code would be harder to read and wouldn't gain any performance improvement.

Has Directory Content Changed?

How can I check a directory to see if its contents have changed since a given point in time?
I don't need to be informed when it changes, or what has changed. I just need a way to check if it has changed.
Create a file at the point in time you wish to start monitoring, using any method you like, e.g.:
touch time_marker
Then, when you want to check if anything has been added, use "find" like this:
find . -newer time_marker
This will only tell you files that have been modified or added since time_marker was created - it won't tell you if anything has been deleted. If you want to look again at a future point, "touch" time_marker again to create a new reference point.
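A rough Ruby counterpart to find -newer, for cases where shelling out is not convenient, could look like this. The helper name modified_since is an assumption; like the find approach, it reports added or modified entries only, never deletions, and Dir.glob skips dotfiles by default.

```ruby
require "tmpdir"

# Hypothetical helper: paths under root whose modification time is later
# than reference_time. Deleted files cannot be detected this way.
def modified_since(root, reference_time)
  Dir.glob(File.join(root, "**", "*"))
     .select { |path| File.lstat(path).mtime > reference_time }
     .sort
end

# Demo on a throwaway directory with explicitly set timestamps.
Dir.mktmpdir do |dir|
  marker = Time.now
  old_file = File.join(dir, "old.txt")
  new_file = File.join(dir, "new.txt")
  File.write(old_file, "old")
  File.write(new_file, "new")
  File.utime(marker - 100, marker - 100, old_file)
  File.utime(marker + 100, marker + 100, new_file)
  puts modified_since(dir, marker).map { |path| File.basename(path) }
end
```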
If you just need to know if names have changed or files have been added/removed, you can try this:
Dir.glob('some_directory/**/*').hash
Just store and compare the hash values. You can obviously go further by getting more information out of a call to ls, for example, or out of File objects that represent each of the files in your directory structure, and hashing that.
Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.hash
Um, actually, I'm being dumb: hash is only consistent within a single run of the Ruby interpreter, so the value can't be stored and compared later. Let's use Zlib::crc32 from the standard library instead (it needs require 'zlib'), e.g.
Zlib::crc32(Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.to_s)
My concern is that this approach will be memory-hungry and slow if you're checking a very large filesystem. Perhaps globbing the entire structure and mapping it isn't the way--if you have a lot of subdirectories you could walk them recursively and calculate a checksum for each, then combine the checksums.
This might be better for larger directories:
Dir.glob('some_directory/**/*').map do |name|
  s = [name, File.mtime(name)].to_s
  [Zlib::crc32(s), s.length]
end.inject(Zlib::crc32('')) do |combined, x|
  Zlib::crc32_combine(combined, x[0], x[1])
end
This would be less prone to collisions:
Dir.glob('some_directory/**/*').map do |name|
  [name, File.mtime(name)].to_s
end.inject(Digest::SHA512.new) do |digest, x|
  digest.update x
end.to_s
I've amended this to include timestamp and file size.
dir_checksum = Zlib::crc32(
  Dir.glob(File.join(dispatch, '/**/*')).map { |path|
    path.to_s + "_" + File.mtime(path).to_s + "_" + File.size(path).to_s
  }.to_s
)
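Pulling the ideas from this thread together, a sketch of a storable directory fingerprint might look like the following. The helper name directory_fingerprint is hypothetical, and SHA-256 is swapped in here as one reasonable digest choice. The glob results are sorted first because Dir.glob's ordering is OS-dependent, and an unsorted listing could change the digest without any real change on disk.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: a hex digest over every path, mtime and size under root.
# Two calls return the same string only if none of those have changed.
def directory_fingerprint(root)
  digest = Digest::SHA256.new
  # Sort so the result does not depend on the OS-specific glob order.
  Dir.glob(File.join(root, "**", "*")).sort.each do |path|
    stat = File.lstat(path) # lstat avoids raising on dangling symlinks
    digest.update("#{path}\0#{stat.mtime.to_f}\0#{stat.size}\n")
  end
  digest.hexdigest
end

# Demo on a throwaway directory so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.txt"), "hello")
  before = directory_fingerprint(dir)
  File.write(File.join(dir, "b.txt"), "world")
  after = directory_fingerprint(dir)
  puts "changed: #{before != after}"
end
```

Hashing file by file keeps memory use flat, since no array of name and timestamp pairs has to be converted to one large string first.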

How to compare data in two CSV files

I have two CSV files which have the same structure and ideally should have the same data.
I want to compare the data in them using Ruby and wanted to know if we already have a Ruby function for the same.
If you want to check whether files are identical you can simply use identical? which is an alias for compare_file:
FileUtils.identical?('file1.csv', 'file2.csv')
If you want to see the differences you might want to use diffy:
gem install diffy
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files')
It produces diff-like output which can be nicely formatted as HTML:
puts Diffy::Diff.new('file1.csv', 'file2.csv', :source => 'files').to_s(:html_simple)
As Summea commented, look at the CSV class.
Then use:
require 'csv'

# Will store each line of each file as an array of fields (so an array of arrays).
file1_lines = CSV.read("file1.csv")
file2_lines = CSV.read("file2.csv")
# The exclusive range (...) stops at the last valid index, size - 1.
for i in 0...file1_lines.size
  if file1_lines[i] == file2_lines[i]
    puts "Same #{file1_lines[i]}"
  else
    puts "#{file1_lines[i]} != #{file2_lines[i]}"
  end
end
Note that using for in Ruby is quite rare. You normally iterate using each on a collection, but there are two of them here.
Also, be aware that one of the lists may be longer than the other, but this should get you started.
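To handle files of different lengths, one option is to iterate up to the longer row count and treat a missing row as nil. The helper name csv_differences and the sample data are assumptions for illustration; line numbers in the result count parsed rows, which differ from physical lines if a quoted field contains a newline.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: returns [row_number, row_from_a, row_from_b] for every
# row that differs. A row missing from the shorter file shows up as nil.
def csv_differences(path_a, path_b)
  rows_a = CSV.read(path_a)
  rows_b = CSV.read(path_b)
  row_count = [rows_a.size, rows_b.size].max

  (0...row_count).each_with_object([]) do |i, diffs|
    diffs << [i + 1, rows_a[i], rows_b[i]] unless rows_a[i] == rows_b[i]
  end
end

# Demo on two throwaway files so the sketch runs anywhere.
Dir.mktmpdir do |dir|
  file1 = File.join(dir, "file1.csv")
  file2 = File.join(dir, "file2.csv")
  File.write(file1, "id,name\n1,foo\n2,bar\n")
  File.write(file2, "id,name\n1,foo\n2,baz\n3,qux\n")
  csv_differences(file1, file2).each do |row_number, a, b|
    puts "row #{row_number}: #{a.inspect} != #{b.inspect}"
  end
end
```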

Resources