How can I check a directory to see if its contents have changed since a given point in time?
I don't need to be informed when it changes, or what has changed. I just need a way to check if it has changed.
Create a file at the point in time you wish to start monitoring, using any method you like, e.g.:
touch time_marker
Then, when you want to check if anything has been added, use "find" like this:
find . -newer time_marker
This will only tell you files that have been modified or added since time_marker was created - it won't tell you if anything has been deleted. If you want to look again at a future point, "touch" time_marker again to create a new reference point.
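If you'd rather stay in Ruby, the same "anything newer than the marker?" check can be sketched roughly like this. The helper name changed_since? is made up for illustration, and it compares against a Time you saved earlier rather than a marker file. Like find -newer, it can't see deletions directly, although on most filesystems removing an entry updates the mtime of its parent directory, which this walk does look at.

```ruby
require 'find'

# Hypothetical helper (not from the original answer): returns true if any
# entry under `dir`, including `dir` itself, has an mtime later than `since`.
def changed_since?(dir, since)
  Find.find(dir) do |path|
    begin
      return true if File.lstat(path).mtime > since
    rescue Errno::ENOENT
      next # the entry vanished while we were walking; skip it
    end
  end
  false
end

# Usage sketch:
#   marker = Time.now
#   ... later ...
#   puts changed_since?('some_directory', marker)
```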
If you just need to know if names have changed or files have been added/removed, you can try this:
Dir.glob('some_directory/**/*').hash
Just store and compare the hash values. You can obviously go further by getting more information out of a call to ls, for example, or out of File objects that represent each of the files in your directory structure, and hashing that.
Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.hash
Correction: hash is only consistent within a single run of the Ruby interpreter, so its value can't be stored and compared in a later run. Let's use the standard Zlib::crc32 instead (after require 'zlib'), e.g.
Zlib::crc32(Dir.glob('some_directory/**/*').map { |name| [name, File.mtime(name)] }.to_s)
My concern is that this approach will be memory-hungry and slow if you're checking a very large filesystem. Perhaps globbing the entire structure and mapping it isn't the way--if you have a lot of subdirectories you could walk them recursively and calculate a checksum for each, then combine the checksums.
This might be better for larger directories:
Dir.glob('some_directory/**/*').map do |name|
  s = [name, File.mtime(name)].to_s
  # crc32_combine needs the length in bytes, not characters
  [Zlib::crc32(s), s.bytesize]
end.inject(Zlib::crc32('')) do |combined, x|
  Zlib::crc32_combine(combined, x[0], x[1])
end
This would be less prone to collisions:
Dir.glob('some_directory/**/*').map do |name|
  [name, File.mtime(name)].to_s
end.inject(Digest::SHA512.new) do |digest, x|
  digest.update x
end.to_s
I've amended this to include timestamp and file size.
dir_checksum = Zlib::crc32(Dir.glob(
  File.join(dispatch, '/**/*')).map { |path|
    path.to_s + "_" + File.mtime(path).to_s + "_" + File.size(path).to_s
  }.to_s)
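Pulling the ideas above together, here is a minimal sketch that feeds path, mtime and size into one digest, so you can store the result and compare it later. The helper name dir_fingerprint and the choice of SHA-256 are my assumptions, not part of the answers above. It needs Ruby 2.5+ for the base: option, and like the globs above it skips dotfiles.

```ruby
require 'digest'

# Hypothetical helper: fingerprint a directory tree from each entry's
# relative path, mtime and size. A different return value on a later call
# means something was added, removed, renamed or modified.
def dir_fingerprint(dir)
  digest = Digest::SHA256.new
  # Sort so the result does not depend on the order the OS returns entries.
  Dir.glob('**/*', base: dir).sort.each do |rel|
    stat = File.lstat(File.join(dir, rel))
    digest.update("#{rel}\0#{stat.mtime.to_f}\0#{stat.size}\n")
  end
  digest.hexdigest
end

# Usage sketch:
#   before = dir_fingerprint('some_directory')
#   ... later ...
#   puts 'changed' if dir_fingerprint('some_directory') != before
```

Because the paths are relative to dir, the fingerprint stays the same if the whole directory is moved somewhere else.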
I have words in a text file called words.txt, and I need to check if any of those words are in my Source folder, which also contains sub-folders and files.
I was able to get all of the words into an array using this code:
array_of_words = []
File.readlines('words.txt').map do |word|
  array_of_words << word
end
And I also have (kinda) figured out how to search through the whole Source folder including the sub-folders and sub-files for a specific word using:
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  puts File.readlines(filepath).any?{ |l| l['api'] }
end
Instead of searching for one word like api, I want to search the Source folder for the whole array of words (if that is possible).
Consider this:
File.readlines('words.txt').map do |word|
  array_of_words << word
end
will read the entire file into memory, then convert it into individual elements in an array. You could accomplish the same thing using:
array_of_words = File.readlines('words.txt')
A potential problem is that it's not scalable. If "words.txt" is larger than the available memory, your code will have problems, so be careful.
Searching a file for an array of words can be done a number of ways, but I've always found it easiest to use a regular expression. Perl has a great module called Regexp::Assemble that makes it easy to convert a list of words into a very efficient pattern, but Ruby is missing that sort of functionality. See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for one solution I put together in the past to help with that.
Ruby does have Regexp.union; however, it's only a partial help.
words = %w(foo bar)
re = Regexp.union(words) # => /foo|bar/
The pattern generated has flags for the expression so you have to be careful with interpolating it into another pattern:
/#{re}/ # => /(?-mix:foo|bar)/
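The embedded (?-mix:...) group carries its own flags, which override those of the surrounding pattern (an outer /i would not apply inside it, for example), so don't do that. Instead use: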
/#{re.source}/ # => /foo|bar/
which will generate the pattern and behave like we expect.
Unfortunately, that's not a complete solution either, because the words could be found as sub-strings in other words:
'foolish'[/#{re.source}/] # => "foo"
The way to work around that is to set word-boundaries around the pattern:
/\b(?:#{re.source})\b/ # => /\b(?:foo|bar)\b/
which then looks for whole words:
'foolish'[/\b(?:#{re.source})\b/] # => nil
More information is available in Ruby's Regexp documentation.
Once you have a pattern you want to use, searching becomes a simpler matter. Ruby has the Find module, which makes it easy to recursively search directories for files. The documentation covers how to use it.
Alternately, you can cobble your own method using the Dir class. Again, it has examples in the documentation to use it, but I usually go with Find.
When reading the files you're scanning I'd recommend using foreach to read the files line-by-line. File.read and File.readlines are not scalable and can make your program behave erratically as Ruby tries to read a big file into memory. Instead, foreach will result in very scalable code that runs more quickly. See "Why is "slurping" a file not a good practice?" for more information.
Using the links above you should be able to put something together quickly that'll run efficiently and be flexible.
This untested code should get you started:
WORD_ARRAY = File.readlines('words.txt').map(&:chomp)
WORD_RE = /\b(?:#{Regexp.union(WORD_ARRAY).source})\b/
Dir['Source/**/*'].select{|f| File.file?(f) }.each do |filepath|
  puts "#{filepath}: #{!!File.read(filepath)[WORD_RE]}"
end
It will output the file it's reading, and "true" or "false" whether there is a hit finding one of the words in the list.
It's not scalable because of readlines and read and could suffer serious slowdown if any of the files are huge. Again, see the caveats in the "slurp" link above.
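For comparison, a line-by-line variant using Find and File.foreach might look roughly like the sketch below. The helper names word_pattern and files_matching are hypothetical, and the scrub call is my own addition so that files with invalid byte sequences don't raise during matching.

```ruby
require 'find'

# Hypothetical helper: one whole-word pattern for a list of words.
# Regexp.union escapes each word; blank entries are dropped because an empty
# alternative would match at every word boundary.
def word_pattern(words)
  cleaned = words.map(&:strip).reject(&:empty?)
  /\b(?:#{Regexp.union(cleaned).source})\b/
end

# Hypothetical helper: returns the regular files under `root` that contain
# at least one match.
def files_matching(root, pattern)
  Find.find(root).select do |path|
    next false unless File.file?(path)
    # File.foreach reads one line at a time; any? stops at the first hit.
    File.foreach(path).any? { |line| line.scrub.match?(pattern) }
  end
end

# Usage sketch:
#   words = File.foreach('words.txt').to_a
#   puts files_matching('Source', word_pattern(words))
```

Only one line of each file is held in memory at a time, and a file is abandoned as soon as the first matching line is found.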
Recursively searches directory for any of the words contained in words.txt
re = /#{File.readlines('words.txt').map { |word| Regexp.quote(word.strip) }.join('|')}/
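Dir['Source/**/*.{cpp,txt,html}'].select{|f| File.file?(f) }.each do |filepath|
  puts filepath
  # the second positional argument of readlines is the line separator, so pass the mode as a keyword
  puts File.readlines(filepath, mode: 'r:ascii').grep(re).any?
end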
When creating a Tempfile in ruby, it takes the basename you pass it, and then it appends a random string to the end.
From the docs: http://ruby-doc.org/stdlib-1.9.3/libdoc/tempfile/rdoc/Tempfile.html
file = Tempfile.new('hello')
file.path # => something like: "/tmp/hello2843-8392-92849382--0"
You can see it starts with hello and then adds 2843-8392-92849382--0, though this ending will change every time you create an instance.
This makes it difficult (at least for me) to look up in the directory it's saved in.
Question:
Is there any method (like file.fullName) that could be run on the instance to just get the hello2843-8392-92849382--0, in order to look it up in the directory where it's saved?
Thoughts:
You could take the path and parse it but that seems excessive.
Basically you're asking for:
File.basename(file.path)
There's rarely a reason to need that exposed as a method, but if you want you could subclass Tempfile to add it in:
require 'tempfile'
class SuperTempfile < Tempfile
  def basename
    File.basename(path)
  end
end
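To tie this back to the "look it up in the directory" part of the question, here is a small sketch. The helper name listed_in_own_directory? is invented for illustration, and Dir.children needs Ruby 2.5+.

```ruby
require 'tempfile'

# Hypothetical helper: checks that the temp file's basename really shows up
# when listing the directory it was created in.
def listed_in_own_directory?(tempfile)
  name = File.basename(tempfile.path)   # e.g. "hello" plus a generated suffix
  Dir.children(File.dirname(tempfile.path)).include?(name)
end

# Usage sketch:
#   file = Tempfile.new('hello')
#   puts File.basename(file.path)
#   puts listed_in_own_directory?(file)
#   file.close
#   file.unlink
```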
I have a task to find out all the PDF files under several price list folders using JRuby on Windows 7. The folder structure is as follows:
WorkSpace/Data/2015/city1/A/...
WorkSpace/Data/2015/city1/B/...
WorkSpace/Data/2015/city1/Pricelist/...
WorkSpace/Data/2015/city1/...
WorkSpace/Data/2015/city1/Price List/.....
WorkSpace/Data/2015/city2/A/...
WorkSpace/Data/2015/city2/C/...
WorkSpace/Data/2015/city2/Pricelist/...
WorkSpace/Data/2015/city2/D/...
WorkSpace/Data/2015/city2/Price List/.....
WorkSpace/Data/2016/city1/folder1/...
WorkSpace/Data/2016/city1/folder2/...
WorkSpace/Data/2016/city1/Pricelist/...
WorkSpace/Data/2016/city1/folder3/...
WorkSpace/Data/2016/city1/folder4/Price List/...
WorkSpace/Data/2016/city2/folder1/...
WorkSpace/Data/2016/city2/folder2/...
WorkSpace/Data/2016/city2/Pricelist/...
WorkSpace/Data/2016/city2/folder3/...
WorkSpace/Data/2016/city2/folder4/Price List/...
... represents all kinds of files under their corresponding folder.
I only want to find the PDF files under folder Pricelist and Price List. How can I do this?
I read Searching a folder and all of its subfolders for files of a certain type. This is an answer which I think is helpful, but how can I modify the expression /.*\.pdf$/ to achieve my goal?
Use a Recursive Glob
All you need to find your files is Dir#glob and Enumerable#grep. For example:
Dir.glob('WorkSpace/Data/**/*.pdf').grep /Price List|Pricelist/
This will collect all the PDF files using a recursive glob pattern that descends into all subdirectories starting at WorkSpace/Data (adjust the path to this starting directory as needed), and then return only the results that match the directories you're grepping for. In this case, we're using a regular expression pattern with alternation to find either of the two directories you're looking for, without regard to how deeply nested the desired directories might be.
There may be more efficient ways to do this, or you may need to tweak the regex if it's too permissive for you, but this certainly solves the problem without needing to know much more than the root of the directory tree you want to search.
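If the plain alternation turns out to be too permissive (it would also match a folder like "Old Pricelist Backup"), one possible tweak is to require the name to be a whole directory component. This is only a sketch; the constant PRICE_DIR and the helper price_list_pdfs are names I made up.

```ruby
# Matches paths that contain a directory named exactly "Pricelist" or
# "Price List" (Dir.glob returns paths with forward slashes).
PRICE_DIR = %r{(?:\A|/)(?:Pricelist|Price List)/}

# Hypothetical helper: all PDFs under `root` that sit somewhere below one of
# the price list directories.
def price_list_pdfs(root)
  Dir.glob(File.join(root, '**', '*.pdf')).grep(PRICE_DIR)
end

# Usage sketch:
#   puts price_list_pdfs('WorkSpace/Data')
```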
You'll probably want to look at the Find module. The code would be something like this:
require 'find'
results = []
directory_list = []
Find.find('WorkSpace/Data') do |path|
  if FileTest.directory?(path)
    fn = File.basename(path)
    if fn == 'Pricelist' || fn == 'Price List'
      directory_list << path
      Find.prune
    end
  end
end
directory_list.each do |starting_path|
  Find.find(starting_path) do |path|
    if File.extname(path) == '.pdf'
      results << path
    end
  end
end
The first loop scans and finds all the directories that match the directory name condition, skipping scanning below them because that will happen in the second loop. The second loop takes each of the directories found by the first loop and scans them for files ending in the '.pdf' extension, adding each one to the results list.
You can hoist the second loop's body up into the first loop in place of directory_list << path, but the resulting code would be harder to read and wouldn't gain any performance improvement.
I am following the book Wicked Cool Ruby Scripts.
There are two files, file_output = file_list.txt and oldfile_output = file_list.old. These two files contain the list of all files the program has gone through and is going to go through.
If a 'file_list.txt' file exists, it is renamed to the old file.
After that, I am not able to understand the code.
Apparently every line of the file is read and stored in the oldfile hash.
Can someone explain it from the 4th line on?
Also, why is gets used here? Why can't an .each method be used to read through every line?
if File.exist?(file_output)
  File.rename(file_output, oldfile_output)
  File.open(oldfile_output, 'rb') do |infile|
    while (temp = infile.gets)
      line = /(.+)\s{5,5}(\w{32,32})/.match(temp)
      puts "#{line[1]} ---> #{line[2]}"
      oldfile_hash[line[1]] = line[2]
    end
  end
end
Judging from the redundant use of quantifiers ({5,5} and {32,32}) in the regex (which would be better written as {5}, {32}), it looks like the person who wrote that code is not a professional Ruby programmer. So you can assume that the choice taken in the code is not necessarily the best.
As you pointed out, the code could have used each instead of while with gets. The latter approach is sort of an old-school Ruby way of doing it. There is nothing wrong with using it. Until the end of the file is reached, gets returns a string; when it reaches the end of the file, gets returns nil, so the while loop works the same as when you use each: in each iteration, it reads the next line.
It looks like each line is supposed to represent a key-value pair. The regex assumes that the key is not an empty string, that the key and the value are separated by five whitespace characters, and that the value consists of thirty-two word characters (letters, digits, or underscores). Each key-value pair is printed (perhaps for monitoring the progress) and stored in oldfile_hash, which is most likely a hash.
So the point of using .gets is to tell when the file is finished being read. Essentially, it's tied to the
while (condition)
....
end
block. So gets serves as a little method that will keep giving Ruby the next line of the file until there are no more lines to give.
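For what it's worth, the each-style equivalent of that loop could look something like the sketch below. The names LINE_RE and load_old_hashes are hypothetical, and the nil check is my addition: the book's version would raise on any line that doesn't fit the expected layout, because match returns nil there.

```ruby
# Same pattern as in the book, with the quantifiers written as {5} and {32}.
LINE_RE = /(.+)\s{5}(\w{32})/

# Hypothetical helper: reads "name, five whitespace characters, 32 word
# characters" lines into a hash, using File.foreach instead of while/gets.
def load_old_hashes(path)
  result = {}
  File.foreach(path) do |line|
    m = LINE_RE.match(line)
    next unless m # skip lines that do not fit the expected layout
    result[m[1]] = m[2]
  end
  result
end

# Usage sketch:
#   oldfile_hash = load_old_hashes('file_list.old')
```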
I'm currently working on a project in Ruby, and I hit a wall on how I should proceed. In the project I'm using Dir.glob to search a directory and all of its subdirectories for certain file types and placing them into arrays. The files I'm working with all have the same file name and are differentiated by their extensions. For example,
txt_files = Dir.glob("**/*.txt")
doc_files = Dir.glob("**/*.doc")
rtf_files = Dir.glob("**/*.rtf")
Would return something similar to,
FILECON.txt
ASSORTED.txt
FIRST.txt
FILECON.doc
ASSORTED.doc
FIRST.doc
FILECON.rtf
ASSORTED.rtf
FIRST.rtf
So, the question I have is how I could break down these arrays efficiently (dealing with thousands of files) and place all files with the same filename into an array. The new array would look like,
FILECON.txt
FILECON.doc
FILECON.rtf
ASSORTED.txt
ASSORTED.doc
ASSORTED.rtf
etc. etc.
I'm not even sure if glob would be the correct way to do this (all the files with the same file name are in the same folders). Any help would be greatly appreciated!
Get all your files into a single array with Dir.glob("**/*.{txt,doc,rtf}")
Don't forget that all the filenames have the directory too, so if you want to sort by the basename, then
files = Dir.glob("**/*.{txt,doc,rtf}").sort_by {|f| File.basename f}
Not sure if this is exactly what you need, but you can try to
# first get all files
all_files = Dir.glob('**/*')
# then you can group them by name
by_name = all_files.group_by{|f| m = f.match(/([^\/]+)\.[^.\/]+$/); m[1] if m}
# and by extension
by_ext = all_files.group_by{|f| m = f.match(/[^\/]+\.([^.\/]+)$/); m[1] if m}
BTW, I don't see any relation of the question with sorting.
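Since the question says files with the same name live in the same folder, another way to sketch the grouping is with File.basename and its '.*' suffix argument instead of a regex. The helper name group_by_stem is made up for this example.

```ruby
# Hypothetical helper: groups paths by directory plus file name without the
# extension, so a.txt, a.doc and a.rtf from the same folder end up together.
def group_by_stem(paths)
  paths.group_by do |path|
    File.join(File.dirname(path), File.basename(path, '.*'))
  end
end

# Usage sketch:
#   groups = group_by_stem(Dir.glob('**/*.{txt,doc,rtf}'))
#   groups.each_value { |same_name| puts same_name.inspect }
```

group_by makes a single pass over the list, so it should cope with thousands of files without trouble.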