Most efficient way of matching file names with Ruby regex - ruby

The method Dir.glob is used to obtain file names that match a certain pattern, but its argument uses a Unix shell-like syntax (e.g., * and ** as wildcards with particular meanings). Instead, I want to use a Ruby (Onigmo) regex as the matching pattern to do the same thing (using its quantifiers, anchors, escaped characters, etc.). What is the best way to do this?
One simple way that comes to mind is to use Dir.glob to get the list of all existing files in all directories and then filter them with the regex, but that does not look efficient. Or is it? Is there a better way?

You could try the Find module in Ruby's standard library.
require 'find'
Find.find(path).grep(/regex/)
The find method returns every path that exists within the path you provide as an argument recursively, pretty much like what you mentioned with Dir.glob. You can then use the built-in grep method to filter the results with a regex.
This may not be the most efficient method though, since Dir.glob is written in C while the Find module is written in Ruby. I did a test on my home directory and it took Find a little longer to get the result than Dir.glob, but you can also use the Find module's prune method in order to not descend into particular folders, which could help make things more efficient using Find.
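As a rough sketch of the prune idea above: the helper name regex_find, the skip_dirs parameter and the demo file names are all made up for illustration, and the regex is matched against the full path rather than just the base name.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: walk `root`, skip directories whose base name matches
# `skip_dirs`, and return the sorted file paths that match `pattern`.
def regex_find(root, pattern, skip_dirs: /\A\.git\z/)
  matches = []
  Find.find(root) do |path|
    if File.directory?(path)
      # Find.prune stops Find from descending into this directory.
      Find.prune if path != root && File.basename(path).match?(skip_dirs)
    elsif path.match?(pattern)
      matches << path
    end
  end
  matches.sort
end

# Demo on a throwaway directory tree (file names are invented).
found = nil
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'src'))
  FileUtils.mkdir_p(File.join(root, '.git'))
  FileUtils.touch(File.join(root, 'src', 'report_2021.txt'))
  FileUtils.touch(File.join(root, 'src', 'notes.md'))
  FileUtils.touch(File.join(root, '.git', 'report_1999.txt'))
  found = regex_find(root, /report_\d{4}\.txt\z/)
          .map { |path| path.delete_prefix("#{root}/") }
end
puts found # prints src/report_2021.txt
```

Whether this beats Dir.glob followed by grep depends on how much of the tree can be pruned, so it is worth timing both on your own data.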

Related

Multiple Pattern Match Algorithm

I have a lot of logs, and every record contains a URL. I also have more than 2000 URL patterns for filtering the log. Some of the patterns are regular expressions with capturing groups. For each record I want to get the URL, the pattern it matched and, if possible, the captured groups. Is there a Java library that can help me, or an algorithm that solves this problem, or anything else related to it? Thanks a lot.
Take a look at Java's regular expression library (java.util.regex).
You can construct a single large pattern by concatenating your original patterns with | between them. Wrap each original pattern in () so that the | applies to the whole pattern and not just to the characters next to it.
Compile the combined expression once and reuse it for every record instead of recompiling it each time. Note that java.util.regex is a backtracking engine rather than a true finite automaton, so it is worth measuring how a 2000-way alternation performs on your data.
It will handle extracting groups, but you need to handle the groups in a generic way (since any group can be matched). If it helps, consider using named groups so you can tell which pattern matched.
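The alternation-with-named-groups idea can be sketched as follows. This is written in Ruby to match the rest of this page, not in Java; the pattern table, the group names and the classify helper are invented for illustration. In Ruby, plain () groups stop capturing once named groups are present, so the inner groups are named as well.

```ruby
# Hypothetical pattern table: pattern name => regex source.
# Every name (outer and inner) must be unique across the table.
PATTERNS = {
  'user'    => '/users/(?<user_id>\d+)',
  'article' => '/articles/(?<slug>[a-z-]+)'
}.freeze

# Wrap each pattern in a named group and join them with |.
# The combined regex is compiled once and reused for every record.
COMBINED = Regexp.new(
  PATTERNS.map { |name, source| "(?<#{name}>#{source})" }.join('|')
)

# Returns [pattern_name, inner_captures], or nil when nothing matches.
def classify(url)
  match = COMBINED.match(url)
  return nil unless match

  captures = match.named_captures.compact # drop groups that did not take part
  name = PATTERNS.keys.find { |key| captures.key?(key) }
  [name, captures.reject { |key, _| key == name }]
end

p classify('http://example.com/users/42')
p classify('http://example.com/about')
```

With a few thousand patterns the alternation may get slow, so treat this as a starting point to benchmark rather than a tuned solution.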

FindFirstFile Multiple file types

Is it possible to use Windows API function FindFirstFile to search for multiple file types, e.g *.txt and *.doc at the same time?
I tried to separate patterns with '\0' but it does not work - it searches only the first pattern (I guess, that's because it thinks that '\0' is the end of string).
Of course, I can call FindFirstFile with the *.* pattern and then check my patterns, or call it once for every pattern, but I don't like this idea. I will use it only if there are no other solutions.
This is not supported. Run it twice with different wildcards, or use *.* and filter the result. Filtering is definitely the better choice, since wildcards are ambiguous anyway due to support for legacy MS-DOS 8.3 filenames. A wildcard like *.doc will find both .doc and .docx files, for example, because a filename like longfilename.docx also creates an entry named LONGFI~1.DOC.
The MSDN docs mention nothing about FindFirstFile allowing multiple search patterns, hence it doesn't exist.
In this case your best bet is to scan using an open selection (like C:\\some directory\* or *) and then filter based on WIN32_FIND_DATA's cFileName member, using strrchr (or the appropriate Unicode variant) to find the extension. It should run pretty fast for the small set of characters that make up a file extension.
If you know that all the extensions are, say, 3 characters long, you should be able to mask them as *.??? to speed things up.

systematically changing filenames in a directory w/ Ruby

I'd like to grab all the files in a particular directory, and then apply a gsub(/abc/,'z') to all the filenames and essentially resave the files under the new filenames, how do I do that?
I've been looking at File but I don't seem to have any of the parameters that it requires, aka the filename, etc.
File.rename(from, to) along with Dir.entries (or Dir.foreach)?
Dave's answer is right on. Here's an example:
Dir.glob("*.rb").each do |fname|
  # Anchor the match at the end so only the extension is changed
  File.rename(fname, fname.sub(/\.rb\z/, ".rbb"))
end
Dir.glob allows you to select files based on some given criteria, but like Dave says, you could also use Dir.entries or Dir.foreach.
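For the original question (applying gsub(/abc/, 'z') to every file name in a given directory), a minimal sketch with Dir.entries might look like this. The helper name rename_matching and the demo file names are assumptions, and skipping a rename when the target already exists is a design choice made here, not something the answers above require.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: rename every regular file in `dir` whose name changes
# under gsub(pattern, replacement). Returns the [old_name, new_name] pairs.
def rename_matching(dir, pattern, replacement)
  renamed = []
  # Dir.entries returns bare names (including "." and ".."), so each name is
  # joined with the directory before it is passed to File.rename.
  Dir.entries(dir).sort.each do |name|
    old_path = File.join(dir, name)
    next unless File.file?(old_path) # skips ".", ".." and subdirectories

    new_name = name.gsub(pattern, replacement)
    next if new_name == name

    new_path = File.join(dir, new_name)
    next if File.exist?(new_path) # assumption: never overwrite existing files

    File.rename(old_path, new_path)
    renamed << [name, new_name]
  end
  renamed
end

# Demo on a throwaway directory (file names are invented).
renamed = after = nil
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'abc_one.txt'))
  FileUtils.touch(File.join(dir, 'two.txt'))
  renamed = rename_matching(dir, /abc/, 'z')
  after = Dir.children(dir).sort
end
p renamed # prints [["abc_one.txt", "z_one.txt"]]
p after   # prints ["two.txt", "z_one.txt"]
```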

How do you filter Ruby Find.find() results?

Find.find("d") {|path| puts path}
I want to exclude certain types of files, say *.gif, as well as directories.
PS: I can always add code inside my block to check for the file name and directory type, but I want find itself to filter files for me.
I don't think you can tell find to do that. You could try using Dir#[], which accepts file globs. If you are looking for particular types of files, or files that can be filtered with the file glob pattern language, it may be a better fit.
For example,
Dir["d/**/*.{xml,png,css,html}"]
would find all the xml, png, css, and html files under the directory d.
Check out the docs for more info.
You can't make find do it, but Find may help: in the block, you need to check whether the current path is one of those you'd like to exclude or not; if so, then call Find#prune. This seems to be the standard idiom when using Find.
If you decide to use Dir#[] instead, you may call reject on its result, passing a block to exclude certain types of files. However, note that, as far as I understand, Dir#[] reads all the contents of your d directory before you can filter, while Find#prune guarantees not to read the contents of pruned subdirectories if you call it within the block passed to Find#find.
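The two approaches can be compared side by side in a small sketch. The helper names find_kept and glob_kept, the pruned directory name cache and the demo files are all made up for illustration.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Find-based: prune one directory by name, never report directories,
# and drop files with the unwanted extension.
def find_kept(root, skip_dir:, skip_ext:)
  kept = []
  Find.find(root) do |path|
    if File.directory?(path)
      Find.prune if File.basename(path) == skip_dir # do not descend
      next
    end
    kept << path unless File.extname(path).casecmp?(skip_ext)
  end
  kept.sort
end

# Glob-based: list everything first, then reject what is not wanted.
def glob_kept(root, skip_ext:)
  Dir[File.join(root, '**', '*')]
    .reject { |path| File.directory?(path) || File.extname(path).casecmp?(skip_ext) }
    .sort
end

# Demo on a throwaway directory tree (names are invented).
find_result = glob_result = nil
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'cache'))
  FileUtils.mkdir_p(File.join(root, 'sub'))
  %w[a.xml b.gif cache/c.xml sub/d.png].each do |rel|
    FileUtils.touch(File.join(root, rel))
  end
  strip = ->(paths) { paths.map { |path| path.delete_prefix("#{root}/") } }
  find_result = strip.call(find_kept(root, skip_dir: 'cache', skip_ext: '.gif'))
  glob_result = strip.call(glob_kept(root, skip_ext: '.gif'))
end
p find_result # prints ["a.xml", "sub/d.png"]
p glob_result # prints ["a.xml", "cache/c.xml", "sub/d.png"]
```

The glob version still visits cache and only filters afterwards, which is the difference described above.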

ruby - get a file from directory without listing all contents

I'm using the Linux split command to split huge xml files into node-sized ones. The problem is that I now have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. See the Dir.glob documentation for details, but basically you pass it a pattern, as in Dir.glob 'dir/*.rb', and get back the filenames matching that pattern. I assume it's done in a reasonably good way, but it will depend on your platform and implementation.
As to Dir.foreach, this should be efficient too. The concern would be if it had to process the entire directory on every pass around the loop, but that would be an awful implementation, and it is not the case.
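A minimal sketch of taking a single file without building the full listing, assuming Ruby 2.5 or later for Dir.each_child. The helper name next_file and the demo file names are invented. Ruby stops asking for entries once the block returns, though how the operating system buffers directory reads underneath is platform-dependent.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: return the path of one regular file in `dir`,
# or nil if there is none. Entries arrive in no particular order.
def next_file(dir)
  Dir.each_child(dir) do |name|
    path = File.join(dir, name)
    # Returning from the block ends the iteration early.
    return path if File.file?(path)
  end
  nil
end

# Demo on throwaway directories (file names are invented).
picked = nil
Dir.mktmpdir do |dir|
  %w[a.xml b.xml c.xml].each { |name| FileUtils.touch(File.join(dir, name)) }
  picked = File.basename(next_file(dir))
end
empty_result = Dir.mktmpdir { |dir| next_file(dir) }

puts picked # one of a.xml, b.xml or c.xml
```

A consumer could call this in a loop, moving or deleting each file after it has been imported.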
