Negating strings in Ruby regular expressions - ruby

I'm looking for a way to extract LinkedIn profile pages from lists of URLs using Ruby. Currently I am looping over the URLs and matching them against this regex:
/^http:\/\/.+\.linkedin.com\/(pub|in)/
However, the URLs of LinkedIn profile directory pages are as follows:
http://www.linkedin.com/pub/dir
, so I'm looking to avoid any links that have the pub/dir path in them. I know it's possible to negate character classes in Ruby regexes, such as [^abc] matching any character that isn't a, b, or c. Is there a way to do the same with strings? I.e. matching any sequence of characters besides "dir"?

You can use a negative lookahead. Something like
(pub(?!\/dir)|in)
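To see how the lookahead behaves, here is a minimal sketch that plugs it into the regex from the question. The constant name and the sample URLs are made up for illustration, and the dots in the host name are escaped, which the original pattern did not do.

```ruby
# Profile URLs live under /pub or /in, but /pub/dir is a directory page.
# (?!/dir) is a negative lookahead: it succeeds only if "/dir" does NOT follow "pub".
PROFILE_URL = %r{\Ahttp://.+\.linkedin\.com/(pub(?!/dir)|in)}

# Hypothetical sample data, not taken from the question.
urls = [
  "http://www.linkedin.com/pub/jane-doe/1/2/3",
  "http://www.linkedin.com/in/janedoe",
  "http://www.linkedin.com/pub/dir/Jane/Doe",
]

# Array#grep keeps only the elements that match the pattern.
profiles = urls.grep(PROFILE_URL)
puts profiles
```

Note that the lookahead also rejects any path that merely starts with /pub/dir, such as /pub/director. If that matters, something like (?!/dir(?:/|\z)) would be a tighter check.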

Related

Find filenames with regexp group capture

I'm looking for a way to find matching files by regexp that also supports groups in the regexp. Like:
match_files('/home/(*)/**/(*).txt')
would return something like:
[ ['/home/bob/docs/abc.txt', 'bob', 'abc'], ['/home/sue/archive/docs/def.txt', 'sue', 'def'] ]
Guard does something like this. I'm not looking to match this specific regex; rather to match any arbitrary regex input that might be provided.
Dir.glob() normally returns a flat array and doesn't support groups. I'm trying to locate a library or some technique that would support this kind of thing, for a DSL.
I'm trying to locate a library or trick that would support this kind of thing, for a DSL.
So your question seems to be off topic, because you are asking for a recommendation of a tool or library to solve your problem.
Also, your question should include valid code examples:
['/home/bob/docs/abc.txt', 'bob', 'readme']
I guess it's supposed to mean
['/home/bob/docs/abc.txt', 'bob', 'abc']
Anyway... I think the question is quite interesting, but I don't think you can solve it with the standard library alone.
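Dir.glob patterns are not regular expressions. The File.fnmatch documentation describes this kind of pattern: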
Returns true if path matches against pattern. The pattern is not a
regular expression; instead it follows rules similar to shell filename
globbing. It may contain the following metacharacters...
The only reasonable thing to do is to allow special characters, parse the string, extract the matches, create a glob and then apply matching to the filenames.
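A rough sketch of that idea, assuming the (*) group syntax from the question. The helper name compile_group_glob is hypothetical, match_files is the name the question uses, and only *, **/ and parentheses are handled (no ?, [] or {} metacharacters).

```ruby
# Translate a glob with (*) groups into a plain glob plus a Regexp with captures.
def compile_group_glob(pattern)
  glob = pattern.delete("()") # Dir.glob has no grouping, so drop the parentheses
  source = pattern.scan(%r{\*\*/|\*|[()]|[^*()]+}).map do |token|
    case token
    when "**/"    then "(?:.*/)?"           # zero or more directories
    when "*"      then "[^/]*"              # anything inside one path segment
    when "(", ")" then token                # parentheses become capture groups
    else               Regexp.escape(token) # literal text
    end
  end.join
  [glob, Regexp.new("\\A#{source}\\z")]
end

# Let Dir.glob find candidate files, then let the regex extract the groups.
def match_files(pattern)
  glob, regex = compile_group_glob(pattern)
  Dir.glob(glob).map do |path|
    m = regex.match(path)
    [path, *m.captures] if m
  end.compact
end
```

With this sketch, match_files('/home/(*)/**/(*).txt') would return one array per matching file: the path followed by the captured segments.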
How about this.
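# The dot before "txt" is escaped so it only matches a literal "."
regex = %r{/home/([^/]+)/.*/([^/]+)\.txt}
# Split on newlines rather than whitespace so paths containing spaces survive
`find .`.split("\n").grep(regex).map { |l| l.match(regex) }.map(&:to_a)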
Could certainly be improved.

How to discover a date or a number near a word - only with regex within regex

I am still learning the intricacies of regex, and am wondering if it is possible with a single regex to find a number that is at a given distance from a word.
Consider the following text
DateClient
15-01-20130060 15-01-20140010 15-01-20150020
I want my regex to match just 15-01-2013.
I know I can match the full DateClient 15-01-2013 with DateClient\W+\d{2}-\d{2}-\d{4}, and then apply a second regex afterwards, but I'm trying to build a configurable, agnostic system that gives power to the user, so I would like a single regular expression that matches just 15-01-2013.
Is this even feasible?
Any suggestions?
You can use a capturing group:
DateClient\W+(\d{2}-\d{2}-\d{4})
Example in JavaScript (you didn't specify a language):
var str = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020";
var date = str.match(/DateClient\W+(\d{2}-\d{2}-\d{4})/)[1];
EDIT (following the addition of the Ruby tag):
In Ruby you can use
(?<=DateClient\W)(\d{2}-\d{2}-\d{4})
Check out lookbehind for matching only the date. However, lookbehind support in your environment may be limited.
Or you could just use a capturing group, which you will be able to extract from the match result.
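For comparison, here is a small Ruby sketch of both suggestions on the sample text from the question. The variable names are illustrative. The third variant uses \K, which is not mentioned in the answers above; it is an alternative Ruby supports that discards everything matched before it, so it works with a variable-length \W+ where a lookbehind would not.

```ruby
text = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020"

# 1. Capturing group: String#[] with an index returns just that group.
date_from_group = text[/DateClient\W+(\d{2}-\d{2}-\d{4})/, 1]

# 2. Lookbehind: the whole match is the date, but the lookbehind must be
#    fixed-width, so only a single \W is allowed before the date.
date_from_lookbehind = text[/(?<=DateClient\W)\d{2}-\d{2}-\d{4}/]

# 3. \K: keeps only what follows it in the reported match.
date_from_keep = text[/DateClient\W+\K\d{2}-\d{2}-\d{4}/]

puts date_from_group, date_from_lookbehind, date_from_keep
```

All three yield 15-01-2013 for this input; only the second and third satisfy the "single regex whose whole match is the date" requirement.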

Ruby Regular Expressions: Matching if substring doesn't exist

I'm having an issue trying to capture a group on a string:
"type=gist\nYou need to gist this though\nbecause its awesome\nright now\n</code></p>\n\n<script src=\"https://gist.github.com/3931634.js\"> </script>\n\n\n<p><code>Not code</code></p>\n"
My regex currently looks like this:
/<code>([\s\S]*)<\/code>/
My goal is to get everything in between the code tags. Unfortunately, it's matching up to the second closing code tag. Is there a way to match everything inside the code tags up until the first occurrence of the closing code tag?
All repetition quantifiers in regular expressions are greedy by default (matching as many characters as possible). Make the * ungreedy, like this:
/<code>([\s\S]*?)<\/code>/
But please consider using a DOM parser instead. Regex is just not the right tool to parse HTML.
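The difference between the greedy and the ungreedy quantifier is easy to see on a small input. This is only a sketch; the html string is a made-up example, not the string from the question.

```ruby
# Hypothetical input with two separate <code> elements.
html = "<p><code>first</code></p>\n<p><code>second</code></p>"

# Greedy: * grabs as much as possible, so the match runs to the LAST </code>.
greedy = html[/<code>([\s\S]*)<\/code>/, 1]

# Ungreedy: *? grabs as little as possible, so the match stops at the FIRST </code>.
lazy = html[/<code>([\s\S]*?)<\/code>/, 1]

puts greedy.inspect
puts lazy.inspect
```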
And I just learned that for going through multiple parts, calling scan on the string:
string.scan(/<code>(.*?)<\/code>/m) {
  puts $1
}
is a very nice way of going through all occurrences of code (the m flag lets . match newlines too) - but yes, getting a proper parser is better...

Tokenize (lex? parse?) a regular expression

Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
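To make the failure mode concrete, here is the naive approach wrapped in a method (the name naive_fill is just for illustration) and run against one simple pattern and one pattern with a non-capturing group:

```ruby
# Naive substitution from above: replace every "(...)" in the source with the next part.
def naive_fill(regex, parts)
  i = -1
  regex.source.gsub(/\([^)]+\)/) { parts[i += 1] }
end

# Works when every parenthesised chunk is a flat capture group.
simple = naive_fill(/var (\w+) = '([^']+)';/, ["foo", "bar"])

# Fails here: the non-capturing group (?:var|let) is treated like a capture,
# so it consumes "foo" and the real capture receives "bar".
broken = naive_fill(/(?:var|let) (\w+);/, ["foo", "bar"])

puts simple
puts broken
```

The first call produces the desired string, while the second replaces the (?:var|let) group itself instead of leaving it alone.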
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
I have a JavaScript project on GitHub called "Dynamic (?:Regex Highlighting)++ with Javascript!" that you may want to look at. It parses PCRE-compatible regular expressions written in both free-spacing and non-free-spacing modes. Since the regexes are written in the less feature-rich JavaScript syntax, they could easily be converted to Ruby.
Note that regular expressions may contain arbitrarily nested parenthesis structures and JavaScript has no recursive regex features, so the code must parse the tree of nested parens from the inside out. It's a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.

What is the best way to match id's against a regular expression in Hpricot?

Using Hpricot, it is pretty easy to see how I can extract all elements with a given id or class using a CSS selector. Is it possible to extract elements from a document based on whether some attribute of those elements matches some regular expression?
If you mean do something like:
doc.search("//div[@id=/regex/]")
then I don't think it can be done. The alternative is to find all elements and then iterate through the results deleting those that don't match a regex.
result = doc.search("//div")
result.delete_if { |x| x.to_s !~ /regex/ }
There are lots of alternative approaches. This thread has two other suggestions: Hpricot and Regular Expression.
Note, depending on exactly what it is you are trying to match, you may be able to use the "Supported, but different" syntaxes available on the Hpricot Wiki, e.g.:
E[@foo$="bar"]
Matches an E element whose "foo" attribute value ends exactly with the string "bar"
