I'm currently working on a project in Ruby, and I've hit a wall on how to proceed. In the project I'm using Dir.glob to search a directory and all of its subdirectories for certain file types and placing them into arrays. The files I'm working with share the same file names and are differentiated by their extensions. For example,
txt_files = Dir.glob("**/*.txt")
doc_files = Dir.glob("**/*.doc")
rtf_files = Dir.glob("**/*.rtf")
Would return something similar to,
FILECON.txt
ASSORTED.txt
FIRST.txt
FILECON.doc
ASSORTED.doc
FIRST.doc
FILECON.rtf
ASSORTED.rtf
FIRST.rtf
So, my question is how I could break down these arrays efficiently (I'm dealing with thousands of files) and place all files with the same file name into one array. The new array would look like,
FILECON.txt
FILECON.doc
FILECON.rtf
ASSORTED.txt
ASSORTED.doc
ASSORTED.rtf
etc. etc.
I'm not even sure if glob would be the correct way to do this (all the files with the same file name are in the same folders). Any help would be greatly appreciated!
Get all your files into a single array with Dir.glob("**/*.{txt,doc,rtf}")
Don't forget that all the filenames have the directory too, so if you want to sort by the basename, then
files = Dir.glob("**/*.{txt,doc,rtf}").sort_by {|f| File.basename f}
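If the goal is one array per shared file name rather than a single sorted list, a possible sketch is to group the paths by their basename with the extension stripped. The helper name group_by_stem is hypothetical, and the sample paths are invented for illustration; in practice they would come from the Dir.glob call above.

```ruby
# Hypothetical helper: groups paths by file name without its extension.
def group_by_stem(paths)
  paths.group_by { |path| File.basename(path, ".*") }
end

# Sample paths invented for illustration; in practice they would come from
# Dir.glob("**/*.{txt,doc,rtf}").
sample = ["a/FILECON.txt", "a/FILECON.doc", "b/FIRST.rtf", "b/FIRST.txt"]
grouped = group_by_stem(sample)
grouped.each { |stem, files| puts "#{stem}: #{files.join(', ')}" }
```

This merges files that share a name even when they live in different directories. If that matters, grouping by the pair [File.dirname(path), File.basename(path, ".*")] would keep them apart.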
Not sure if this is exactly what you need, but you can try to
# first get all files
all_files = Dir.glob('**/*')
# then you can group them by name
by_name = all_files.group_by{|f| m = f.match(/([^\/]+)\.[^.\/]+$/); m[1] if m}
# and by extension
by_ext = all_files.group_by{|f| m = f.match(/[^\/]+\.([^.\/]+)$/); m[1] if m}
BTW, I don't see any relation of the question with sorting.
I'm using Photoshop script. I get files from folders. My problem is that when I get the files and place them in an array the array contains hidden files that are in the folder for example ".DS_Store". I can get around this by using:
if (folders[i] != "~/Downloads/start/.DS_Store"){}
But I would like to use something better as I sometimes look in lots of folders and don't know the "~/Downloads/start/" part.
I tried to use indexOf but Photoshop script does not allow indexOf. Does anybody know of a way to check if ".DS_Store" is in the string "~/Downloads/start/.DS_Store" that works in Photoshop script?
I see this answer but I don't know how to use it to test: Photoshop script to ignore .ds_store
For anyone else looking for a solution to this problem, rather than explicitly trying to skip hidden files like .DS_Store, you can use the Folder Object's getFiles() method and pass an expression to build an array of file types you actually want to open. A simple way to use this method is as follows:
// this expression will match strings that end with .jpg, .tif, or .psd and ignore the case
var fileTypes = new RegExp(/\.(jpg|tif|psd)$/i);
// declare our path
var myFolder = new Folder("~/Downloads/start/");
// create array of files utilizing the expression to filter file types
var myFiles = myFolder.getFiles(fileTypes);
// loop through all the files in our array and do something
for (var i = 0; i < myFiles.length; i++) {
    var fileToOpen = myFiles[i];
    open(fileToOpen);
    // do stuff...
}
For anybody looking, I used the polyfill found here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/indexOf
indexOf() was added to the ECMA-262 standard in the 5th edition; as
such it may not be present in all browsers. You can work around this
by utilizing the following code at the beginning of your scripts. This
will allow you to use indexOf() when there is still no native support.
This algorithm matches the one specified in ECMA-262, 5th edition,
assuming TypeError and Math.abs() have their original values.
I have words in a text file called words.txt, and I need to check whether any of those words appear in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
  array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any?{ |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
  array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union; however, it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
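The embedded (?-mix: group switches the i, m, and x flags off for that part of the pattern regardless of the flags on the outer expression, which can cause surprises, so don't do that. Instead use: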
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
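To pull the steps above together, here is a minimal sketch that builds the whole-word pattern from a list of words. The helper name whole_word_regex is hypothetical. It assumes every word begins and ends with a word character, since \b only behaves as a word boundary in that case.

```ruby
# Hypothetical helper: builds a whole-word pattern from a list of words.
# Regexp.union escapes any regex metacharacters inside the words.
def whole_word_regex(words)
  /\b(?:#{Regexp.union(words).source})\b/
end

re = whole_word_regex(%w[foo bar])
puts "foolish"[re].inspect     # "foo" is only a substring here, so no match
puts "a bar here"[re].inspect  # "bar" stands alone, so it matches
```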
Once you have a pattern you want to use then it becomes a simpler matter to search. Ruby has the Find module (loaded with require 'find'), which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
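As a rough sketch of how Find and foreach might fit together, the helper below walks a directory tree and reads each file one line at a time. The name files_matching is hypothetical, and the sketch assumes the files are text that is valid in Ruby's default encoding; binary files may need extra handling.

```ruby
require 'find'

# Hypothetical helper: returns the paths under `root` that contain at least
# one line matching `pattern`.
def files_matching(root, pattern)
  hits = []
  Find.find(root) do |path|
    next unless File.file?(path)
    # File.foreach reads one line at a time instead of slurping the file,
    # and any? stops at the first matching line.
    hits << path if File.foreach(path).any? { |line| line.match?(pattern) }
  end
  hits
end

# Example call, only run when a "Source" directory actually exists.
puts files_matching("Source", /\b(?:foo|bar)\b/) if Dir.exist?("Source")
```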
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
# The closing \b goes outside the group so it applies to every word, not only the last one.
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
Recursively searches directory for any of the words contained in words.txt
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
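Dir['Source/**/*.{cpp,txt,html}'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  # The encoding is passed as a keyword; a bare second string argument would be treated as the line separator.
  puts File.readlines(filepath, encoding: 'ascii').grep(re).any?
end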
I'm trying to do some operations on a directory which contains nearly 20 million files. I tried Dir.glob, Dir.foreach and Dir.entries without success.
Is there anything similar to C#'s Directory.EnumerateFiles in Ruby which can enumerate a huge list of files?
Dir#read might do the trick.
dir = Dir.new(path)
while entry = dir.read
puts entry
end
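A slightly fuller sketch of the same idea wraps the Dir#read loop in the block form of Dir.open, so the directory handle is closed afterwards, and skips the "." and ".." entries. The helper name each_entry_streaming is hypothetical, and whether this copes with 20 million entries on a given system is an assumption that would need testing.

```ruby
# Hypothetical helper: yields each entry name in `path` one at a time,
# without first building an array of every name.
def each_entry_streaming(path)
  Dir.open(path) do |dir|  # the block form closes the handle on exit
    while (entry = dir.read)
      next if entry == "." || entry == ".."
      yield entry
    end
  end
end

count = 0
each_entry_streaming(".") { |_name| count += 1 }
puts "entries in current directory: #{count}"
```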
I have a task to find out all the PDF files under several price list folders using JRuby on Windows 7. The folder structure is as follows:
WorkSpace/Data/2015/city1/A/...
WorkSpace/Data/2015/city1/B/...
WorkSpace/Data/2015/city1/Pricelist/...
WorkSpace/Data/2015/city1/...
WorkSpace/Data/2015/city1/Price List/.....
WorkSpace/Data/2015/city2/A/...
WorkSpace/Data/2015/city2/C/...
WorkSpace/Data/2015/city2/Pricelist/...
WorkSpace/Data/2015/city2/D/...
WorkSpace/Data/2015/city2/Price List/.....
WorkSpace/Data/2016/city1/folder1/...
WorkSpace/Data/2016/city1/folder2/...
WorkSpace/Data/2016/city1/Pricelist/...
WorkSpace/Data/2016/city1/folder3/...
WorkSpace/Data/2016/city1/folder4/Price List/...
WorkSpace/Data/2016/city2/folder1/...
WorkSpace/Data/2016/city2/folder2/...
WorkSpace/Data/2016/city2/Pricelist/...
WorkSpace/Data/2016/city2/folder3/...
WorkSpace/Data/2016/city2/folder4/Price List/...
... represents all kinds of files under their corresponding folder.
I only want to find the PDF files under folder Pricelist and Price List. How can I do this?
I read Searching a folder and all of its subfolders for files of a certain type. This is an answer which I think is helpful, but how can I modify the expression /.*\.pdf$/ to achieve my goal?
Use a Recursive Glob
All you need to find your files is Dir#glob and Enumerable#grep. For example:
Dir.glob('WorkSpace/Data/**/*.pdf').grep(/Price List|Pricelist/)
This will collect all the PDF files using a recursive glob pattern that descends into all subdirectories starting at WorkSpace/Data (adjust the path to this starting directory as needed), and then return only the results that match the directories you're grepping for. In this case, we're using a regular expression pattern with alternation to find either of the two directories you're looking for, without regard to how deeply nested the desired directories might be.
There may be more efficient ways to do this, or you may need to tweak the regex if it's too permissive for you, but this certainly solves the problem without needing to know much more than the root of the directory tree you want to search.
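One way the regex could be tightened, as a sketch: require that "Pricelist" or "Price List" is a complete directory name in the path rather than any substring of it. The constant PRICE_DIR and the helper price_list_pdfs are hypothetical names, and the sample paths are invented for illustration.

```ruby
# Matches only when "Pricelist" or "Price List" is a whole directory name.
PRICE_DIR = %r{(?:\A|/)(?:Pricelist|Price List)/}

# Hypothetical helper: keeps the paths that sit under a price-list directory.
def price_list_pdfs(paths)
  paths.grep(PRICE_DIR)
end

# Sample paths invented for illustration; in practice they would come from
# Dir.glob('WorkSpace/Data/**/*.pdf').
sample = [
  "WorkSpace/Data/2015/city1/Pricelist/a.pdf",
  "WorkSpace/Data/2016/city1/folder4/Price List/b.pdf",
  "WorkSpace/Data/2015/city1/A/Old Pricelist notes.pdf",
]
puts price_list_pdfs(sample)
```

The third sample path is dropped because "Pricelist" appears only inside a file name, which the looser pattern would have accepted.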
You'll probably want to look at the Find module. The code would be something like this:
require 'find'

results = []
directory_list = []
Find.find('WorkSpace/Data') do |path|
  if FileTest.directory?(path)
    fn = File.basename(path)
    if fn == 'Pricelist' || fn == 'Price List'
      directory_list << path
      Find.prune
    end
  end
end
directory_list.each do |starting_path|
  Find.find(starting_path) do |path|
    if File.extname(path) == '.pdf'
      results << path
    end
  end
end
The first loop scans and finds all the directories that match the directory name condition, skipping scanning below them because that will happen in the second loop. The second loop takes each of the directories found by the first loop and scans them for files ending in the '.pdf' extension, adding each one to the results list.
You can hoist the second loop's body up into the first loop in place of directory_list << path, but the resulting code would be harder to read and wouldn't gain any performance improvement.
How can I check a directory to see if its contents has changed since a given point in time?
I don't need to be informed when it changes, or what has changed. I just need a way to check if it has changed.
Create a file at the point in time you wish to start monitoring, using any method you like, e.g.:
touch time_marker
Then, when you want to check if anything has been added, use "find" like this:
find . -newer time_marker
This will only tell you files that have been modified or added since time_marker was created - it won't tell you if anything has been deleted. If you want to look again at a future point, "touch" time_marker again to create a new reference point.
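For anyone who would rather stay inside Ruby, a rough equivalent of the find -newer check might look like the sketch below. The helper name changed_since? is hypothetical. Like the shell version it does not notice deletions, and Dir.glob with this pattern skips hidden files.

```ruby
# Hypothetical helper: true if any file under `dir` was modified after the
# marker file was last touched.
def changed_since?(dir, marker)
  cutoff = File.mtime(marker)
  Dir.glob(File.join(dir, "**", "*")).any? do |path|
    File.file?(path) && File.mtime(path) > cutoff
  end
end

# Example call, only run when the marker file from the answer above exists.
puts changed_since?(".", "time_marker") if File.exist?("time_marker")
```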
If you just need to know if names have changed or files have been added/removed, you can try this:
Dir.glob('some_directory/**/*').hash
Just store and compare the hash values. You can obviously go further by getting more information out of a call to ls, for example, or out of File objects that represent each of the files in your directory structure, and hashing that.
Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.hash
Correction: hash is only consistent within a single run of the Ruby interpreter, so the value can't be stored and compared later. Let's use the standard Zlib::crc32 instead (after require 'zlib'), e.g.
Zlib::crc32(Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.to_s)
My concern is that this approach will be memory-hungry and slow if you're checking a very large filesystem. Perhaps globbing the entire structure and mapping it isn't the way--if you have a lot of subdirectories you could walk them recursively and calculate a checksum for each, then combine the checksums.
This might be better for larger directories:
Dir.glob('some_directory/**/*').map do |name|
  s = [name, File.mtime(name)].to_s
  [Zlib::crc32(s), s.length]
end.inject(Zlib::crc32('')) do |combined, x|
  Zlib::crc32_combine(combined, x[0], x[1])
end
This would be less prone to collisions:
require 'digest'

Dir.glob('some_directory/**/*').map do |name|
  [name, File.mtime(name)].to_s
end.inject(Digest::SHA512.new) do |digest, x|
  digest.update x
end.to_s
I've amended this to include timestamp and file size.
dir_checksum = Zlib::crc32(Dir.glob(
File.join(dispatch, '/**/*')).map { |path|
path.to_s + "_" + File.mtime(path).to_s + "_" + File.size(path).to_s
}.to_s)
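Combining the ideas in this thread, here is one possible sketch that feeds each entry's name, modification time, and size into a digest one at a time instead of building one large string. The helper name dir_fingerprint is hypothetical, and SHA-256 is swapped in as an arbitrary choice of digest. The entries are sorted because the order Dir.glob returns is not guaranteed on every Ruby version.

```ruby
require 'digest'

# Hypothetical helper: digest over the name, mtime, and size of every entry
# under `dir`. Two calls return the same string only if none of those changed.
def dir_fingerprint(dir)
  digest = Digest::SHA256.new
  # Sort so the result does not depend on the order the glob returns.
  Dir.glob(File.join(dir, "**", "*")).sort.each do |path|
    stat = File.lstat(path)  # lstat so a dangling symlink does not raise
    digest.update("#{path}\0#{stat.mtime.to_f}\0#{stat.size}\n")
  end
  digest.hexdigest
end

# Example call, only run when the directory actually exists.
puts dir_fingerprint("some_directory") if Dir.exist?("some_directory")
```

Storing the returned string and comparing it against a later call is enough to tell whether anything was added, removed, renamed, or modified.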