Check if string is a glob pattern - ruby

My input is a string that can be either a plain path (e.g. /home/user/1.txt) or a glob pattern (e.g. /home/user/*.txt).
If the string is a glob pattern, I want an array of its matches; if it is just a plain path, I want an array with a single element, that path.
So I need to check whether the string contains unescaped glob symbols: if it does, call Pathname.glob() to get the matches, otherwise just return an array containing the string.
How can I check whether a string is a glob pattern?
UPDATE
I ran into this question while implementing glob pattern support for the zap stanza in Homebrew Cask.
The solution I ended up with was a small refactoring that avoids the need to check whether a string is a glob pattern.

Next I want to get array of matches if string is glob pattern and in case when string is just plain path I want to get array with single element - this path.
They're both valid glob patterns. One contains a wildcard, one does not. Run them both through Pathname.glob() and you'll always get an array back. Bonus, it'll check if it matches anything.
$ irb
2.3.3 :001 > require "pathname"
=> true
2.3.3 :002 > Pathname.glob("test.data")
=> [#<Pathname:test.data>]
2.3.3 :003 > Pathname.glob("test.*")
=> [#<Pathname:test.asm>, #<Pathname:test.c>, #<Pathname:test.cpp>, #<Pathname:test.csv>, #<Pathname:test.data>, #<Pathname:test.dSYM>, #<Pathname:test.html>, #<Pathname:test.out>, #<Pathname:test.php>, #<Pathname:test.pl>, #<Pathname:test.py>, #<Pathname:test.rb>, #<Pathname:test.s>, #<Pathname:test.sh>]
2.3.3 :004 > Pathname.glob("doesnotexist")
=> []
This is a great way to normalize and validate your data early, so the rest of the program doesn't have to.
If you really want to figure out if something is a literal path or a glob, you could try scanning for any special glob characters, but that rapidly gets complicated and error prone. It requires knowing how glob works in detail and remembering to check for quoting and escaping. foo* has a glob pattern. foo\* does not. foo[123] does. foo\[123] does not. And I'm not sure what foo[123\] is doing, I think it counts as a non-terminated set.
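To make the escaping problem concrete, here is a minimal sketch of such a scanner. The method name glob_chars? is hypothetical, and the sketch assumes a backslash always escapes the character after it, which may not hold on every platform. It drops each escaped pair first and then looks for the remaining special characters, so it covers the examples above but makes no attempt to validate unterminated sets such as foo[123\].

```ruby
# Hypothetical helper, not part of Pathname or Dir.
# Assumption: a backslash always escapes the character that follows it.
def glob_chars?(pattern)
  # Drop every escaped pair (backslash plus one character) first...
  unescaped = pattern.gsub(/\\./m, "")
  # ...then look for the characters that start a glob construct.
  unescaped.match?(/[*?\[{]/)
end

puts glob_chars?("foo*")       # true
puts glob_chars?('foo\*')      # false
puts glob_chars?("foo[123]")   # true
puts glob_chars?('foo\[123]')  # false
```

Even this small version has to know how escaping works, which is the maintenance burden the answer warns about.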
In general, you want to avoid writing code that has to reproduce the inner workings of another piece of code. If there was a Pathname.has_glob_chars you could use that, but there isn't such a thing.
Pathname.glob uses File.fnmatch to do the globbing and you can use that without touching the filesystem. You might be able to come up with something using that, but I can't make it work. I thought maybe only a literal path will match itself, but foo* defeats that.
Instead, check if it exists.
Pathname.new(path).exist?
If it exists, it was a real path to a real file. If it didn't exist, it might have been a real path, or it might be a glob. That's probably good enough.
You can also check by looking to see if Pathname.glob(path) returned a single element that matches the original path. Note that when matching paths it's important to normalize both sides with cleanpath.
paths = Pathname.glob(path)
if paths.size == 1 && paths[0].cleanpath == Pathname.new(path).cleanpath
  puts "#{path} is a literal path"
elsif paths.size == 0
  puts "#{path} matched nothing"
else
  puts "#{path} was a glob"
end

Related

How to list files given path with poorly escaped Windows separator

I'm attempting to do this:
Dir["c:\temp\*.*"]
but that is failing. I understand why, but I seem to lack the Ruby prowess to work around it.
I am given the path in a variable and otherwise have no control over it. Nor do I know the contents ahead of time.
Is there a way to make Dir function with double quoted strings that are poorly escaped? Alternatively, how does one take a variable with the apparent contents
"c:\temp\*.*"
and convert it into
'c:/temp/*.*'
At its core, this problem seems to be how to escape a string that should have been escaped but was not.
The end result is that I am not able to use the given string for something as conceptually simple as puts() or Dir[].
If given 'c:\temp\*.*' then I have no problem. I can fix that:
foo = 'c:\temp\*.*'.gsub('\\', '/')
If given "c:\\temp\\*.*" then I have no problem. I can fix that:
foo = "c:\\temp\\*.*".gsub("\\", "/")
However, I am passed neither of those, but rather "c:\temp\*.*". This string contains a TAB and a second, undefined escape. It is this that I can't fix in a general way.
Even if I knew the contents ahead of time I am stumped on how to properly escape and transform this. I should add that I am not a ruby programmer at the moment so maybe there is some simple method to deal with this that I am not aware of.
I tried a bunch of stuff like:
"c:\temp\*.*".gsub("\t", "/t")
which gets me part of the way, but since the actual contents of the string are not known to me ahead of time this is a little wonky. Further, if the escape sequence is not a valid one, as in \*, then I am also in a jam. So this also fails:
"c:\temp\*.*".gsub("\t", "/t").gsub("\*", "/*")
Is there a way to make Dir function with double quoted strings that are poorly escaped?
No.
Garbage in, garbage out. There is no Rumpelstiltskin routine that returns gold when given trash.
Ruby auto-converts forward-slashes in filenames/paths to reverse-slashes when running on Windows. Simply make it a habit of using forward, *nix-style, slashes and you'll be fine.
From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
I don't have "c:\\temp", I have "c:\temp" as input
In a properly defined Windows path you should see:
'c:' + '\temp' + '\*.*' # => "c:\\temp\\*.*"
Note that the single-quotes are treating "\t" as an escaped-escape + "t". Your source for the variable is creating the string improperly by using double-quotes:
'c:' + "\temp" + "\*.*" # => "c:\temp*.*"
If you have "\t", you have a TAB character. It's possible to change it to an escaped-T using:
"c:\temp" # => "c:\temp"
"c:\temp"[2] # => "\t"
"c:\temp"[2].ord # => 9
'\t' # => "\\t"
"c:\temp".sub("\t", '\t') # => "c:\\temp"
The next problem is what to do when you have a String containing "*" to convert it to "\*". There's no way to search for "\*" because that's the same as "*" as seen above:
"\*.*" # => "*.*"
But, since "*.*" is a fairly specific "anything" wildcard, maybe simply searching for and replacing that pattern would work:
"c:\temp\*.*".gsub('*.*', '\\*.*') # => "c:\temp\\*.*"
or:
"c:\temp\*.*".gsub('*.*', '/*.*') # => "c:\temp/*.*"
Back to dealing with "\t" and putting it all together... I'd start with:
"c:\temp\*.*".gsub("\t", '\t').gsub('*.*', '/*.*') # => "c:\\temp/*.*"
"c:\temp\*.*".gsub("\t", '/t').gsub('*.*', '/*.*') # => "c:/temp/*.*"
You'll have to figure out what to do if you have something like:
c:/dir/file*.*
where they mean they want all files starting with file. Since you're seeing ambiguous inputs it seems the input routine needs to be more rigorous to not allow reversed-slashes.
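As a rough sketch of the substitutions above, generalized a little: each control character that a double-quoted Ruby literal can produce from a backslash escape is mapped back to a forward slash plus its letter, and any surviving backslashes become forward slashes. The names ESCAPE_LETTERS and recover_separators are hypothetical. The approach stays lossy: a backslash swallowed by an undefined escape such as \* leaves no trace in the string, so it cannot be restored, and a path that legitimately contains one of these control characters would be mangled.

```ruby
# Control characters produced by double-quoted escapes, mapped back to the
# letter that followed the backslash in the source text.
ESCAPE_LETTERS = {
  "\a" => "a", "\b" => "b", "\t" => "t", "\n" => "n",
  "\v" => "v", "\f" => "f", "\r" => "r", "\e" => "e"
}.freeze

CONTROL_CHARS = Regexp.union(ESCAPE_LETTERS.keys)

# Hypothetical helper: best-effort repair of a mangled Windows path.
def recover_separators(mangled)
  mangled
    .gsub(CONTROL_CHARS) { |char| "/" + ESCAPE_LETTERS[char] }
    .gsub("\\", "/")
end

puts recover_separators("c:\temp\\*.*")  # c:/temp/*.*
puts recover_separators("c:\temp\*.*")   # c:/temp*.*  (the backslash before * is already gone)
```

The second call shows why the answer is still "No" in the general case: the information needed to rebuild the path is no longer in the string.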

Regex match anything except ending string

I'm trying to make a regex that matches anything except an exact ending string, in this case, the extension '.exe'.
Examples for a file named:
'foo' (no extension) I want to get 'foo'
'foo.bar' I want to get 'foo.bar'
'foo.exe.bar' I want to get 'foo.exe.bar'
'foo.exe1' I want to get 'foo.exe1'
'foo.bar.exe' I want to get 'foo.bar'
'foo.exe' I want to get 'foo'
So far I created the regex /.*\.(?!exe$)[^.]*/
but it doesn't work for cases 1 and 6.
You can use a positive lookahead.
^.+?(?=\.exe$|$)
^ start of string
.+? non greedily match one or more characters...
(?=\.exe$|$) until literal .exe occurs at end. If not, match end.
See demo at Rubular.com
Wouldn't a simple replacement work?
string.sub(/\.exe\z/, "")
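Both suggestions so far can be checked against the six file names from the question with a short script. This is only a sketch; the constant names CASES and LOOKAHEAD are made up for the illustration.

```ruby
# File names from the question mapped to the result the asker wants.
CASES = {
  "foo"         => "foo",
  "foo.bar"     => "foo.bar",
  "foo.exe.bar" => "foo.exe.bar",
  "foo.exe1"    => "foo.exe1",
  "foo.bar.exe" => "foo.bar",
  "foo.exe"     => "foo"
}.freeze

LOOKAHEAD = /^.+?(?=\.exe$|$)/

CASES.each do |name, expected|
  by_match = name[LOOKAHEAD]             # text matched by the lookahead regex
  by_sub   = name.sub(/\.exe\z/, "")     # name with a trailing .exe removed
  status   = by_match == expected && by_sub == expected ? "ok" : "MISMATCH"
  puts format("%-12s match=%-12s sub=%-12s %s", name, by_match, by_sub, status)
end
```

For these inputs the two approaches agree; the replacement is arguably easier to read when the goal is simply to strip the extension.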
Do you mean regex matching or capturing?
There may be a regex only answer, but it currently eludes me. Based on your test data and what you want to match, doing something like the following would cover both what you want to match and capture:
name = 'foo.bar.exe'
match = /(.*)\.exe$/.match(name)
if match.nil?
  # no .exe extension, so the whole name is the result
  print name
else
  # otherwise match[1] is the capture: the filename without the .exe extension
  print match[1]
end
string pattern = @" (?x) (.* (?= \.exe$ )) | ((?=.*\.exe).*)";
The first alternative uses a positive look-ahead to check whether your string ends with .exe. The condition is not included in the match.
The second alternative uses a positive look-ahead whose condition is included in the match. It only checks whether you have something followed by .exe.
(?x) means that white space inside the pattern string is ignored.
Alternatively, don't use (?x) and just delete all the white space.
It works for all the 6 scenarios provided.

What does this variable assignment do?

I'm having to code a subversion hook script, and I found a few examples online, mostly python and perl. I found one or two shell scripts (bash) as well. I am confused by a line and am sorry this is so basic a question.
FILTER=".(sh|SH|exe|EXE|bat|BAT)$"
The script later uses this to perform a test, such as (assume EXT=ex):
if [[ "$FILTER" == *"$EXT"* ]]; then blah
My problem is the above test is true. However, I'm not asking you to assist in writing the script, just explaining the initial assignment of FILTER. I don't understand that line.
Editing in a closer example FILTER line. Of course the script, as written, does not work, because 'ex' returns true, and not just 'exe'. My problem here, however, is only that I don't understand the layout of the variable assignment itself.
Why is there a period at the beginning? ".(sh..."
Why is there a dollar sign at the end? "...BAT)$"
Why are there pipes between each pattern? "sh|SH|exe"
You are probably looking for something like the following:
FILTER="\.(sh|SH|exe|EXE|bat|BAT)$"
for EXT
do
    if [[ "$EXT" =~ $FILTER ]]
    then
        echo "$EXT extension disallowed"
    else
        echo "$EXT is allowed"
    fi
done
Save it as myscript.sh and run it as
myscript.sh bash ba.sh
and you will get
bash is allowed
ba.sh extension disallowed
If you don't escape the dot, e.g. with FILTER=".(sh|SH|exe|EXE|bat|BAT)$", you will get
bash extension disallowed
ba.sh extension disallowed
which is (of course) wrong.
For the questions:
Why is there a period at the beginning? ".(sh..."
Because you want to match .sh (as an extension) and not, for example, bash (without the dot). That is also why the . must be escaped as \. because an unescaped . in a regex means "any character".
Why is there a dollar sign at the end? "...BAT)$"
The $ means "end of string". You want to match file.sh and not file.sh.jpg, so the .sh has to be at the end of the string.
Why are there pipes between each pattern? "sh|SH|exe"
In a regex, the (...|...|...) construct separates the alternatives, as you have surely guessed.
You really need to read a regex tutorial. The topic is more complicated than this and can't be explained in one answer.
PS: NEVER use UPPERCASE variable names, they can collide with environment variables.
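The effect of each piece can also be tried outside the shell. The sketch below uses Ruby's regex engine instead of bash's =~ operator; for the dot, the alternation, and the end anchor used here the two behave alike. The constant names and the sample file names other than bash and ba.sh are made up for the illustration.

```ruby
ESCAPED   = /\.(sh|SH|exe|EXE|bat|BAT)$/  # literal dot, then one of the alternatives, at the end
UNESCAPED = /.(sh|SH|exe|EXE|bat|BAT)$/   # any character in place of the dot

%w[bash ba.sh file.sh.jpg setup.EXE].each do |name|
  puts "#{name}: escaped=#{ESCAPED.match?(name)} unescaped=#{UNESCAPED.match?(name)}"
end
# bash: escaped=false unescaped=true
# ba.sh: escaped=true unescaped=true
# file.sh.jpg: escaped=false unescaped=false
# setup.EXE: escaped=true unescaped=true
```

Only the first line differs between the two patterns, which is exactly the bash versus ba.sh case described above.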
This just assigns a string to FILTER; the contents of that string have no special meaning. When you try to match it against the pattern *ex*, the result is true if the value of $FILTER contains the string ex with anything on either side. That is the case here: ex is a substring of exe.
FILTER=".(sh|SH|exe|EXE|bat|BAT)$"
                ^^
                |
                +---- here is the "ex" from the pattern.
As far as I can tell, this is similar to a regular expression pattern.
In regular expressions the start of a string is marked with ^; in this case the leading . seems to be playing a similar role.
Inside the parentheses are the exact strings to match, i.e. the file extensions, combined with 'or' by the '|'.
The '$' at the end means the expression should match only up to the end of the string and no further.
I would guess that is how the original author looked at it when writing it.

Linux: shell builtin string matching

I am trying to become more familiar with the builtin string matching features available in Linux shells. I came across this guy's post, where he showed an example:
a="abc|def"
echo ${a#*|} # will yield "def"
echo ${a%|*} # will yield "abc"
I tried it out and it does what it's advertised to do, but I don't understand what the $, {}, #, *, and | are doing. I tried looking for a reference online or in the manuals but couldn't find anything. Can anyone explain to me what's going on here?
This article in the Linux Journal says that the # operator deletes the shortest possible match on the left, while the % operator deletes the shortest possible match on the right.
So ${a#*|} returns everything after the |, and ${a%|*} returns everything before the |.
If you had a situation that called for greedy matching, you'd use ## or %%.
Take a look at this.
${string%substring}: deletes shortest match of $substring from back of $string.
${string#substring}: deletes shortest match of $substring from front of $string.
EDIT:
I don't understand what the $,{},#,*,| are doing
I recommend reading this
Typically, ${somename} will substitute the contents of a defined parameter:
mystring="1234567"
echo ${mystring} # produces '1234567'
The % and # symbols let you add operations that modify this default behavior.
The asterisk '*' is a wildcard, while the pipe '|' is simply a character to match. Let me do the same thing using '4' as the matching character.
mystring="1234567"
echo ${mystring#*4} # produces '567'
Those features and other similarly useful ones are documented in the Shell Parameter Expansion section of the Bash Reference Manual. Here's another really good reference.

Regular expression to match only the first file in a RAR file set

To see what file to invoke the unrar command on, one needs to determine which file is the first in the file set.
Here are some sample file names, of which - naturally - only the first group should be matched:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
One (limited) way to do it with PCRE compatible regexps is this:
.*(?:(?<!part\d\d\d|part\d\d|\d)\.rar|\.part0*1\.rar)
This did not work in Ruby when I tested it at Rejax however.
How would you write one Ruby compatible regular expression to match only the first file in a set of RAR files?
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first one in the volume set, assuming the archive was created with a somewhat recent version of RAR.
HEAD_FLAGS Bit flags:
2 bytes
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt. I have done my own tests with spanning RAR archives and their headers are correct according to the link above.
This is a much, much safer way of determining which file is first in a set like this.
The short answer is that it's not possible to construct a single regex like this in Ruby 1.8. Its regex engine does not support lookbehind assertions (the (?<! part of your example regex), which is why your regex doesn't work. This leaves you with two options.
1) Use more than one regex to do it.
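def is_first_rar(filename)
  if (filename =~ /part(\d+)\.rar$/).nil?
    # No "partN" suffix: any plain .rar file counts as the first volume.
    !(filename =~ /\.rar$/).nil?
  else
    # Numbered volume: only part 1 (with any zero padding) is the first.
    $1.to_i == 1
  end
end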
2) Use Oniguruma, the regex engine of Ruby 1.9. It supports lookbehind assertions, and you can install it as a gem for Ruby 1.8. After that, you can do something like this:
def is_first_rar(filename)
  reg = Oniguruma::ORegexp.new('.*(?:(?<!part\d\d\d|part\d\d|\d)\.rar|\.part0*1\.rar)')
  !reg.match(filename).nil?
end
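On Ruby 1.9 and later the built-in regex engine accepts this kind of lookbehind, so the gem should not be needed there. A minimal sketch, checked only against the sample names from the question; the constant name FIRST_RAR, the method name first_rar? and the \A and \z anchors are additions for the illustration, not part of the original pattern.

```ruby
# Pattern from the question, anchored to the whole string.
FIRST_RAR = /\A.*(?:(?<!part\d\d\d|part\d\d|\d)\.rar|\.part0*1\.rar)\z/

def first_rar?(filename)
  FIRST_RAR.match?(filename)
end

%w[yes.rar yes.part1.rar yes.part01.rar yes.part001.rar
   no.part2.rar no.part02.rar no.part002.rar no.part011.rar].each do |name|
  puts "#{name}: #{first_rar?(name)}"
end
```

The names starting with "yes" print true and the ones starting with "no" print false. The caveat from the other answer still applies: naming conventions can have edge cases that a pattern like this does not cover.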
Personally I wouldn't use (extended) regular expressions in this case (or at least not just one to do it all). What's wrong with coding this in, for example, a few ifs?
I am no regex expert but here is my attempt
^(yes|no)\.(rar|part0*1\.rar)$
Replace "yes|no" with the actual file name. I matched it against your examples to see if it would only match the first set hence the "yes|no" in the regex.
UPDATE: fixed as per the comment. Not sure why the user would not know the filename, so I did not fix that part...

Resources