I am extracting files from a zip archive in Ruby using RubyZip, and I need to label files based on characteristics of their filenames:
Example:
I have the following hash:
labels = {
:data_file=>/.\.dat/i,
:metadata=>/.\.xml/i,
:text_location=>/.\.txt/i
}
So, I have the file name of each file in the zip, let's say an example is
filename = 382582941917841df.xml
Assume that each file will match only one Regex in the labels hash, and if not it doesn't matter, just choose the first match. (In this case the regular expressions are all for detecting extensions, but it could be to detect any filename mask like DSC****.jpg for example.
I am doing this now:
label_match =~ labels.find {|key,value| filename =~ value}
---> label_match = [:metadata, /.\.xml/]
label_sym = label_match.nil? ? nil: label_match.first
So this works fine, however doesn't seem very Ruby-like. Is there something I am missing to clean this up nicely?
A case when does this effortlessly:
filename = "382582941917841df.xml"
category = case filename
when /.\.dat/i ; :data_file
when /.\.xml/i ; :metadata
when /.\.txt/i ; :text_location
end
p category # => :metadata ; nil if nothing matched
I think you're doing it backwards and the hard way. Ruby makes it easy to get the extension of a file, which then makes it easy to map it to something.
Starting with something like:
FILENAMES = %w[ foo.bar foo.baz 382582941917841df.xml DSC****.jpg]
FILETYPES = {
'.bar' => 'bar',
'.baz' => 'baz',
'.xml' => 'metadata',
'.dat' => 'data',
'.jpg' => 'image'
}
FILENAMES.each do |fn|
puts "#{ fn } is a #{ FILETYPES[File.extname(fn)] } file"
end
# >> foo.bar is a bar file
# >> foo.baz is a baz file
# >> 382582941917841df.xml is a metadata file
# >> DSC****.jpg is a image file
File.extname is built into Ruby. The File class contains many similar methods useful for finding out things about files known by the OS and/or tearing apart file paths and file names so it's a really good thing to become very familiar with.
It's also important to understand that an improperly written regexp, such as /.\.dat/i can be the source of a lot of pain. Consider these:
'foo.xml.dat'[/.\.dat/] # => "l.dat"
'foo.database.20010101.csv'[/.\.dat/] # => "o.dat"
Are the files really "data" files?
Why is the character in front of the delimiting . important or necessary?
Do you really want to slow your code using unanchored regexp patterns when a method, such as extname will be faster and less maintenance?
Those are things to consider when writing code.
Rather than using nil to indicate the label when there is no match, consider using another symbol like :unknown.
Then you can do:
labels = {
:data_file=>/.\.dat/i,
:metadata=>/.\.xml/i,
:text_location=>/.\.txt/i,
:unknown=>/.*/
}
label = labels.find {|key,value| filename =~ value}.first
Related
How do I print/display just the part of a regular expression that is between the slashes?
irb> re = /\Ahello\z/
irb> puts "re is /#{re}/"
The result is:
re is /(?-mix:\Ahello\z)/
Whereas I want:
re is /\Ahello\z/
...but not by doing this craziness:
puts "re is /#{re.to_s.gsub( /.*:(.*)\)/, '\1' )}/"
If you want to see the original pattern between the delimiters, use source:
IP_PATTERN = /(?:\d{1,3}\.){3}\d{1,3}/
IP_PATTERN # => /(?:\d{1,3}\.){3}\d{1,3}/
IP_PATTERN.inspect # => "/(?:\\d{1,3}\\.){3}\\d{1,3}/"
IP_PATTERN.to_s # => "(?-mix:(?:\\d{1,3}\\.){3}\\d{1,3})"
Here's what source shows:
IP_PATTERN.source # => "(?:\\d{1,3}\\.){3}\\d{1,3}"
From the documentation:
Returns the original string of the pattern.
/ab+c/ix.source #=> "ab+c"
Note that escape sequences are retained as is.
/\x20\+/.source #=> "\\x20\\+"
NOTE:
It's common to build a complex pattern from small patterns, and it's tempting to use interpolation to insert the simple ones, but that doesn't work as most people think it will. Consider this:
foo = /foo/
bar = /bar/imx
foo_bar = /#{ foo }_#{ bar }/
foo_bar # => /(?-mix:foo)_(?mix:bar)/
Notice that foo_bar has the pattern flags for each of the sub-patterns. Those can REALLY mess you up when trying to match things if you're not aware of their existence. Inside the (?-...) block the pattern can have totally different settings for i, m or x in relation to the outer pattern. Debugging that can make you nuts, worse than trying to debug a complex pattern normally would. How do I know this? I'm a veteran of that particular war.
This is why source is important. It injects the exact pattern, without the flags:
foo_bar = /#{ foo.source}_#{ bar.source}/
foo_bar # => /foo_bar/
Use .inspect instead of .to_s:
> puts "re is #{re.inspect}"
re is /\Ahello\z/
How can I get the filename without the extensions? For example, input of "/dir1/dir2/test.html.erb" should return "test".
In actual code I will passing in __FILE__ instead of "/dir1/dir2/test.html.erb".
Read documentation:
basename(file_name [, suffix] ) → base_name
Returns the last component of the filename given in file_name, which
can be formed using both File::SEPARATOR and File::ALT_SEPARATOR as
the separator when File::ALT_SEPARATOR is not nil. If suffix is given
and present at the end of file_name, it is removed.
=> File.basename('public/500.html', '.html')
=> "500"
in you case:
=> File.basename("test.html.erb", ".html.erb")
=> "test"
How about this
File.basename(f, File.extname(f))
returns the file name without the extension.. works for filenames with multiple '.' in it.
In case you don't know the extension you can combine File.basename with File.extname:
filepath = "dir/dir/filename.extension"
File.basename(filepath, File.extname(filepath)) #=> "filename"
Pathname provides a convenient object-oriented interface for dealing with file names.
One method lets you replace the existing extension with a new one, and that method accepts the empty string as an argument:
>> Pathname('foo.bar').sub_ext ''
=> #<Pathname:foo>
>> Pathname('foo.bar.baz').sub_ext ''
=> #<Pathname:foo.bar>
>> Pathname('foo').sub_ext ''
=> #<Pathname:foo>
This is a convenient way to get the filename stripped of its extension, if there is one.
But if you want to get rid of all extensions, you can use a regex:
>> "foo.bar.baz".sub(/(?<=.)\..*/, '')
=> "foo"
Note that this only works on bare filenames, not paths like foo.bar/pepe.baz. For that, you might as well use a function:
def without_extensions(path)
p = Pathname(path)
p.parent / p.basename.sub(
/
(?<=.) # look-behind: ensure some character, e.g., for ‘.foo’
\. # literal ‘.’
.* # extensions
/x, '')
end
Split by dot and the first part is what you want.
filename = 'test.html.erb'
result = filename.split('.')[0]
Considering the premise, the most appropriate answer for this case (and similar cases with other extensions) would be something such as this:
__FILE__.split('.')[0...-1].join('.')
Which will only remove the extension (not the other parts of the name: myfile.html.erb here becomes myfile.html, rather than just myfile.
Thanks to #xdazz and #Monk_Code for their ideas. In case others are looking, the final code I'm using is:
File.basename(__FILE__, ".*").split('.')[0]
This generically allows you to remove the full path in the front and the extensions in the back of the file, giving only the name of the file without any dots or slashes.
name = "filename.100.jpg"
puts "#{name.split('.')[-1]}"
Yet understanding it's not a multiplatform solution, it'd work for unixes:
def without_extensions(path)
lastSlash = path.rindex('/')
if lastSlash.nil?
theFile = path
else
theFile = path[lastSlash+1..-1]
end
# not an easy thing to define
# what an extension is
theFile[0...theFile.index('.')]
end
puts without_extensions("test.html.erb")
puts without_extensions("/test.html.erb")
puts without_extensions("a.b/test.html.erb")
puts without_extensions("/a.b/test.html.erb")
puts without_extensions("c.d/a.b/test.html.erb")
I'd like to switch the extension of a file. For example:
test_dir/test_file.jpg to .txt should give test_dir/test_file.txt.
I also want the solution to work on a file with two extensions.
test_dir/test_file.ext1.jpg to .txt should should give test_dir/test_file.ext1.txt
Similarly, on a file with no extension it should just add the extension.
test_dir/test_file to .txt should give test_dir/test_file.txt
I feel like this should be simple, but I haven't found a simple solution. Here is what I have right now. I think it is really ugly, but it does seem to work.
def switch_ext(f, new_ext)
File.join(File.dirname(f), File.basename(f, File.extname(f))) + new_ext
end
Do you have any more elegant ways to do this? I've looked on the internet, but I'm guessing that I'm missing something obvious. Are there any gotcha's to be aware of? I prefer a solution that doesn't use a regular expression.
Your example method isn't that ugly. Please do continue to use file naming semantic aware methods over string regexp. You could try the Pathname stdlib which might make it a little cleaner:
require 'pathname'
def switch_ext(f, new_ext)
p = Pathname.new f
p.dirname + "#{ p.basename('.*') }#{ new_ext }"
end
>> puts %w{ test_dir/test_file.jpg test_dir/test_file.ext1.jpg testfile .vimrc }.
| map{|f| switch_ext f, '.txt' }
test_dir/test_file.txt
test_dir/test_file.ext1.txt
testfile.txt
.vimrc.txt
def switch_ext(f, new_ext)
(n = f.rindex('.')) == 0 ? nil : (f[0..n] + new_ext)
end
It will find the most right occurrence of '.' if it is not the first character.
Regular expressions were invented for this sort of task.
def switch_ext f, new_ext
f.sub(/((?<!\A)\.[^.]+)?\Z/, new_ext)
end
puts switch_ext 'test_dir/test_file.jpg', '.txt'
puts switch_ext 'test_dir/test_file.ext1.jpg', '.txt'
puts switch_ext 'testfile', '.txt'
puts switch_ext '.vimrc', '.txt'
Output:
test_dir/test_file.txt
test_dir/test_file.ext1.txt
testfile.txt
.vimrc.txt
def switch_ext(filename, new_ext)
filename.chomp( File.extname(filename)) + new_ext
end
I just found this answer to my own question here at the bottom of this long discussion.
http://www.ruby-forum.com/topic/179524
I personally think it is the best one I've seen. I definitely want to avoid a regular expression, because they are hard for me to remember and therefore error prone.
For dotfiles this function just adds the extension onto the file. This behaviour seems sensible to me.
switch_ext('.vimrc', '.txt') # => ".vimrc.txt"
Please continue to post better answers if there are any, and post comments to let me know if you see any deficiencies in this answer. I'll leave the question open for now.
You can use regular expressions, or you can use things like the built-in filename manipulation tools in File:
%w[
test_dir/test_file.jpg
test_dir/test_file.ext1.jpg
test_dir/test_file
].each do |fn|
puts File.join(
File.dirname(fn),
File.basename(fn, File.extname(fn)) + '.txt'
)
end
Which outputs:
test_dir/test_file.txt
test_dir/test_file.ext1.txt
test_dir/test_file.txt
I personally use the File methods. They're aware of different OS's needs for filename separators so porting to another OS is a no brainer. In your use-case it's not a big deal. Mix in path manipulations and it becomes more important.
def switch_ext f, new_ext
"#{f.sub(/\.[^.]+\z/, "")}.#{new_ext}"
end
def switch_ext(filename, ext)
begin
filename[/\.\w+$/] = ext
rescue
filename << ext
end
filename
end
Usage
>> switch_ext('test_dir/test_file.jpeg', '.txt')
=> "test_dir/test_file.txt"
>> switch_ext('test_dir/no_ext_file', '.txt')
=> "test_dir/no_ext_file.txt"
Hope this help.
Since Ruby 1.9.1 the easiest answer is Pathname::sub_ext(replacement) which strips off the extension and replaces it with the given replacement (which can be an empty string ''):
Pathname.new('test_dir/test_file.jpg').sub_ext('.txt')
=> #<Pathname:test_dir/test_file.txt>
Pathname.new('test_dir/test_file.ext1.jpg').sub_ext('.txt')
=> #<Pathname:test_dir/test_file.ext1.txt>
Pathname.new('test_dir/test_file').sub_ext('.txt')
=> #<Pathname:test_dir/test_file.txt>
Pathname.new('test_dir/test_file.txt').sub_ext('')
=> #<Pathname:test_dir/test_file>
One thing to watch out for is that you need to have a leading . in the replacement:
Pathname.new('test_dir/test_file.jpg').sub_ext('txt')
=> #<Pathname:test_dir/test_filetxt>
Is there a way to open a file case-insensitively in Ruby under Linux? For example, given the string foo.txt, can I open the file FOO.txt?
One possible way would be reading all the filenames in the directory and manually search the list for the required file, but I'm looking for a more direct method.
One approach would be to write a little method to build a case insensitive glob for a given filename:
def ci_glob(filename)
glob = ''
filename.each_char do |c|
glob += c.downcase != c.upcase ? "[#{c.downcase}#{c.upcase}]" : c
end
glob
end
irb(main):024:0> ci_glob('foo.txt')
=> "[fF][oO][oO].[tT][xX][tT]"
and then you can do:
filename = Dir.glob(ci_glob('foo.txt')).first
Alternatively, you can write the directory search you suggested quite concisely. e.g.
filename = Dir.glob('*').find { |f| f.downcase == 'foo.txt' }
Prior to Ruby 3.1 it was possible to use the FNM_CASEFOLD option to make glob case insensitive e.g.
filename = Dir.glob('foo.txt', File::FNM_CASEFOLD).first
if filename
# use filename here
else
# no matching file
end
The documentation suggested FNM_CASEFOLD couldn't be used with glob but it did actually work in older Ruby versions. However, as mentioned by lildude in the comments, the behaviour has now been brought inline with the documentation and so this approach shouldn't be used.
You can use Dir.glob with the FNM_CASEFOLD flag to get a list of all filenames that match the given name except for case. You can then just use first on the resulting array to get any result back or use min_by to get the one that matches the case of the orignial most closely.
def find_file(f)
Dir.glob(f, File::FNM_CASEFOLD).min_by do |f2|
f.chars.zip(f2.chars).count {|c1,c2| c1 != c2}
end
end
system "touch foo.bar"
system "touch Foo.Bar"
Dir.glob("FOO.BAR", File::FNM_CASEFOLD) #=> ["foo.bar", "Foo.Bar"]
find_file("FOO.BAR") #=> ["Foo.Bar"]
Is there any way to create the regex /func:\[sync\] displayPTS/ from string func:[sync] displayPTS?
The story behind this question is that I have serval string pattens to search against in a text file and I don't want to write the same thing again and again.
File.open($f).readlines.reject {|l| not l =~ /"#{string1}"/}
File.open($f).readlines.reject {|l| not l =~ /"#{string2}"/}
Instead , I want to have a function to do the job:
def filter string
#build the reg pattern from string
File.open($f).readlines.reject {|l| not l =~ pattern}
end
filter string1
filter string2
s = "func:[sync] displayPTS"
# => "func:[sync] displayPTS"
r = Regexp.new(s)
# => /func:[sync] displayPTS/
r = Regexp.new(Regexp.escape(s))
# => /func:\[sync\]\ displayPTS/
I like Bob's answer, but just to save the time on your keyboard:
string = 'func:\[sync] displayPTS'
/#{string}/
If the strings are just strings, you can combine them into one regular expression, like so:
targets = [
"string1",
"string2",
].collect do |s|
Regexp.escape(s)
end.join('|')
targets = Regexp.new(targets)
And then:
lines = File.readlines('/tmp/bar').reject do |line|
line !~ target
end
s !~ regexp is equivalent to not s =~ regexp, but easier to read.
Avoid using File.open without closing the file. The file will remain open until the discarded file object is garbage collected, which could be long enough that your program will run out of file handles. If you need to do more than just read the lines, then:
File.open(path) do |file|
# do stuff with file
end
Ruby will close the file at the end of the block.
You might also consider whether using find_all and a positive match would be easier to read than reject and a negative match. The fewer negatives the reader's mind has to go through, the clearer the code:
lines = File.readlines('/tmp/bar').find_all do |line|
line =~ target
end
How about using %r{}:
my_regex = "func:[sync] displayPTS"
File.open($f).readlines.reject { |l| not l =~ %r{#{my_regex}} }