Find filenames with regexp group capture - ruby

I'm looking for a way to find matching files by regexp that also supports groups in the regexp. For example:
match_files('/home/(*)/**/(*).txt')
would return something like:
[ ['/home/bob/docs/abc.txt', 'bob', 'abc'], ['/home/sue/archive/docs/def.txt', 'sue', 'def'] ]
Guard does something like this. I'm not looking to match this specific regex; rather to match any arbitrary regex input that might be provided.
Dir.glob() normally returns a flat array and doesn't support groups. I'm trying to locate a library or some technique that would support this kind of thing, for a DSL.

I'm trying to locate a library or trick that would support this kind of thing, for a DSL.
So your question seems to be off topic, because you are asking for a recommendation of a tool or library to solve your problem.
Also, your question should include valid code examples:
['/home/bob/docs/abc.txt', 'bob', 'readme']
I guess it's supposed to mean
['/home/bob/docs/abc.txt', 'bob', 'abc']
Anyway, I think the question is quite interesting, but I don't think you can solve it with the standard library alone.
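Dir.glob takes a shell-style pattern, not a regexp. The File.fnmatch documentation describes this pattern style: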
Returns true if path matches against pattern. The pattern is not a
regular expression; instead it follows rules similar to shell filename
globbing. It may contain the following metacharacters...
The only reasonable thing to do is to allow special characters, parse the string, extract the matches, create a glob and then apply matching to the filenames.
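The steps above could be sketched roughly like this. The helper name glob_with_groups_to_parts is hypothetical, the "(*)" capture marker is taken from the question, and only "*", "(*)" and "**/" are handled, so treat it as a starting point rather than a complete glob parser.

```ruby
# Sketch: translate a glob containing "(*)" capture markers into a plain
# glob (for Dir.glob) and a Regexp (for extracting the captured segments).
# Only "*", "(*)" and "**/" are understood; everything else is literal.
def glob_with_groups_to_parts(pattern)
  glob = pattern.gsub('(*)', '*')
  source = pattern.scan(%r{\(\*\)|\*\*/|\*|[^*(]+|.}m).map do |token|
    case token
    when '(*)' then '([^/]*)'       # captured single path segment
    when '**/' then '(?:[^/]+/)*'   # zero or more directories
    when '*'   then '[^/]*'         # uncaptured single path segment
    else Regexp.escape(token)       # literal text
    end
  end.join
  [glob, Regexp.new("\\A#{source}\\z")]
end

# Returns one array per matching file: [path, capture1, capture2, ...]
def match_files(pattern)
  glob, regex = glob_with_groups_to_parts(pattern)
  Dir.glob(glob).filter_map { |path| regex.match(path)&.to_a }
end
```

With this, match_files('/home/(*)/**/(*).txt') globs for '/home/*/**/*.txt' and then applies the generated regex to each result. It needs Ruby 2.7 or later for filter_map.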

How about this.
regex = %r{/home/([^/]+)/.*/([^/]+)\.txt}
`find .`.split("\n").grep(regex).map { |path| path.match(regex).to_a }
Could certainly be improved.
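One possible improvement is to avoid shelling out and walk the tree with the standard library's Find module instead. This is only a sketch; the method name find_with_captures is made up for illustration.

```ruby
require 'find'

# Walk +root+ and return [full_path, capture1, capture2, ...] for every
# path the regex matches. Find.find visits files and directories alike,
# and returns an enumerator when called without a block.
def find_with_captures(root, regex)
  Find.find(root).filter_map { |path| regex.match(path)&.to_a }
end
```

Because the paths never pass through a shell, file names containing spaces or newlines are handled without extra quoting.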

Related

ParseGlob: What is the pattern to parse all templates recursively within a directory?

Template.ParseGlob("*.html") //fetches all html files from current directory.
Template.ParseGlob("**/*.html") //Seems to only fetch at one level depth
I'm not looking for a "Walk" solution; I just want to know if this is possible. I don't quite understand what "pattern" this expects, so an explanation of the pattern used by ParseGlob would be great too.
The code text/template/helper.go mentions
// The pattern is processed by filepath.Glob and must match at least one file.
filepath.Glob() says that "the syntax of patterns is the same as in Match"
Match returns true if name matches the shell file name pattern.
The implementation of Match() doesn't seem to treat '**' differently, and only considers '*' as matching any sequence of non-Separator characters.
That would mean '**' is equivalent to '*', which in turn would explain why the match works at one level depth only.
So, since ParseGlob can't load templates recursively, we have to use the path/filepath.Walk function. This approach also offers more flexibility.
https://gist.github.com/logrusorgru/abd846adb521a6fb39c7405f32fec0cf

Tokenize (lex? parse?) a regular expression

Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
I have a JavaScript project on GitHub called: Dynamic (?:Regex Highlighting)++ with Javascript! you may want to look at. It parses PCRE compatible regular expressions written in both free-spacing and non-free-spacing modes. Since the regexes are written in the less-feature-rich JavaScript syntax, these regexes could be easily converted to Ruby.
Note that regular expressions may contain arbitrarily nested parentheses structures and JavaScript has no recursive regex features, so the code must parse the tree of nested parens from the inside out. It's a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.
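For the specific substitution described in the question, a full parser may not be needed. A small hand-written scanner over the regex source that tracks escapes, character classes and group nesting might be enough. The sketch below is an assumption-laden illustration, not an existing library: fill_captures is a made-up name, any group starting with "(?" is treated as non-capturing (so named captures are not handled), and text outside the captures is copied as raw regex source, escapes included.

```ruby
# Sketch: replace each outermost capturing group in a regex's source with
# the corresponding entry of +parts+. Nested captures still consume a
# number, but only the outermost one is substituted.
def fill_captures(regex, parts)
  src = regex.source
  out = +''
  open_groups = []   # capture number for capturing groups, nil otherwise
  class_depth = 0    # > 0 while inside a [...] character class
  captures = 0
  i = 0
  while i < src.length
    token = src[i]
    token = src[i, 2] if token == '\\'   # keep an escape pair together
    in_capture = open_groups.any?        # inside some capturing group?
    if class_depth.zero? && token == '('
      capturing = src[i + 1] != '?'
      captures += 1 if capturing
      open_groups.push(capturing ? captures : nil)
      out << token unless in_capture || capturing
    elsif class_depth.zero? && token == ')'
      number = open_groups.pop
      # Emit only when no enclosing capture is hiding this text.
      out << (number ? parts[number - 1].to_s : token) if open_groups.none?
    else
      class_depth += 1 if token == '['
      class_depth -= 1 if token == ']' && class_depth.positive?
      out << token unless in_capture
    end
    i += token.length
  end
  out
end
```

Called as fill_captures(/var (\w+) = '([^']+)';/, ["foo", "bar"]), this yields "var foo = 'bar';". Unlike the gsub approach, a parenthesis inside a character class or behind a backslash is not mistaken for a group boundary.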

Ruby regex: extract a list of urls from a string

I have a string of images' URLs and I need to convert it into an array.
http://rubular.com/r/E2a5v2hYnJ
How do I do this?
URI.extract(your_string)
That's all you need if you already have it in a string. You will need to put require 'uri' in there first. Gotta love that standard library!
Here's the link to the docs URI#extract
Scan returns an array
myarray = mystring.scan(/regex/)
See here on regular-expressions.info
The best answer will depend very much on exactly what input string you expect.
If your test string is accurate then I would not use a regex, do this instead (as suggested by Marnen Laibow-Koser):
mystring.split('?v=3')
If you really don't have constant fluff between your useful strings then regex might be better. Your regex is greedy. This will get you part way:
mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)
Note the '?' after the '*' in the part capturing the server and path pieces of the URL, this makes the regex non-greedy.
The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.
Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
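One detail worth keeping in mind with String#scan: when the pattern contains a capturing group, scan returns the captures rather than the whole match, so the extension alternation should be non-capturing if you want full URLs back. A minimal sketch, where the constant and method names and the example.com URLs are made up for illustration:

```ruby
# Non-greedy match up to the first image extension. The extension group is
# non-capturing so that String#scan returns whole URLs, not just extensions.
IMAGE_URL = %r{https?://[\w./-]*?\.(?:jpe?g|gif|png)}

def image_urls(text)
  text.scan(IMAGE_URL)
end

sample = 'http://example.com/a.jpg?v=3http://example.com/b/c.png?v=3'
p image_urls(sample)
# => ["http://example.com/a.jpg", "http://example.com/b/c.png"]
```

With a plain (jpe?g|gif|png) group the same call would return [["jpg"], ["png"]] instead.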
Use String#split (see the docs for details).
Part of the problem is that in Rubular you are using https instead of http. This gets you closer to what you want if the other answers don't work for you:
http://rubular.com/r/cIjmjxIfz5

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal with parsing the URL, \h in the regex, and scan to pull out the GUIDs:
require 'uri'
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL then you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
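As one way of tightening things up, the path could be required to consist of exactly two GUID segments. This is a sketch under that assumption; the GUID constant and the guid_pair method are illustrative names, and the GUID values in the usage line are made up.

```ruby
require 'uri'

GUID = /\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/

# Returns [first_guid, second_guid] when the URL path is exactly
# "/GUID/GUID/", or nil otherwise.
def guid_pair(url)
  path = URI.parse(url).path
  match = %r{\A/(#{GUID})/(#{GUID})/\z}.match(path)
  match && match.captures
end

url = 'http://my.test.website.com/123e4567-e89b-12d3-a456-426614174000/00112233-4455-6677-8899-aabbccddeeff/'
p guid_pair(url)
```

The host, scheme and other components remain available from URI.parse(url) if they also need checking.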
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.

Regexp in ruby - can I use parenthesis without grouping?

I have a regexp of the form:
/(something complex and boring)?(something complex and interesting)/
I'm interested in the contents of the second parenthesis; the first ones are there only to ensure a correct match (since the boring part might or might not be present but if it is, I'll match it by accident with the regexp for the interesting part).
So I can access the second match using $2. However, for uniformity with other regexps I'm using, I want $1 to somehow contain the contents of the second parenthesis. Is it possible?
Use a non-capturing group:
r = /(?:ab)?(cd)/
This is a general regexp feature, not one specific to Ruby. Use /(?:something complex and boring)?(something complex and interesting)/ (note the ?:) to achieve this.
By the way, in Ruby 1.9, you can do /(something complex and boring)?(?<interesting>something complex and interesting)/ and access the group with $~[:interesting] ;)
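Both variants can be tried out with short stand-in patterns. In this sketch "ab" plays the boring part and "cd" the interesting one; the constant and method names are made up for illustration.

```ruby
NON_CAPTURING = /(?:ab)?(cd)/
NAMED         = /(ab)?(?<interesting>cd)/

# With the non-capturing prefix, the interesting part is group 1, which is
# the value $1 would hold immediately after the match.
def interesting_part(text)
  match = NON_CAPTURING.match(text)
  match && match[1]
end

# With a named group, the interesting part is looked up by name.
def interesting_part_by_name(text)
  match = NAMED.match(text)
  match && match[:interesting]
end

p interesting_part('abcd')          # => "cd"
p interesting_part('cd')            # => "cd"
p interesting_part_by_name('abcd')  # => "cd"
```

Note that once a Ruby regexp contains a named group, plain parentheses in the same pattern no longer capture, so the named group is also the only numbered one.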
Yup, use the ?: syntax:
/(?:something complex and boring)?(something complex and interesting)/
I'm not a Ruby developer, but I know other regex flavors, so I bet you can use a non-capturing group:
/(?:something complex and boring)?(something complex and interesting)/
There is only one capturing group, hence $1
HTH
Not really, no. But you can use a named group for uniformity, like this:
/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/
You can change the names (the text in the angle brackets) for the uniformity that you want to achieve. You can then access the groups like this:
string.match(/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/) do |m|
  # Do something with the match; m['group2'] (or m[:group2]) accesses the interesting group
end
