extracting link from text [duplicate] - ruby

This question already has answers here:
How to extract URLs from text
(6 answers)
Closed 8 years ago.
I am tring to extract a link from a phrase and it could be any where last, first or middle so I am usig this regex
link=text.scan(/(^| )(http.*)($| )/)
but the problem is when the link is in the middle it gets the whole phrase until the end.
What should I do ?

It's because .* next to http is greedy. I suggest you to use lookarounds.
link=text.scan(/(?<!\S)(http\S+)(?!\S)/)
OR
link=text.scan(/(?<!\S)(http\S+)/)
Example:
> "http://bar.com foo http://bar.com bar http://bar.com".scan(/(?<!\S)http\S+(?!\S)/)
=> ["http://bar.com", "http://bar.com", "http://bar.com"]
DEMO
(?<!\S) Negative lookbehind which asserts that the match won't be preceeded by a non-space character.
http\S+ Matches the substring http plus the following one or more non-space characters.

Do all the links you are trying to match follow some simple pattern? We'd need to see more context to confidently provide a good solution to your problem.
For example, the regex:
link=text.scan(/http.*\.com/)
...might be good enough for the job (this assumes all links end in ".com"), but I can't say for sure without more information.
Or again, for example, perhaps you could use something like:
link=text.scan(/http[a-z./:]*) - this assumes all links contain only lower case letters, ".", "/" and ":".

Related

How to get first/shorter match with regex in ruby? [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Ruby Group regular expression to match only first occurence [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Ruby regex include all alphabets, numbers and / [duplicate]

This question already has answers here:
How to escape all characters with special meaning in Regex
(2 answers)
Closed 7 years ago.
I know this might be asked time and again. But I'm really stuck with this. I've got it to work for including numbers and alphabets but I have no idea on how to include "/" also.
This is what I have,
name.gsub!(/[^0-9A-Za-z]/, '')
So if name is "Cool Stuff *(#/" it returns "CoolStuff". I'd just like it to return "CoolStuff/".
The / is a special character that must be 'escaped' (meaning to take the / literally, and not for a switch or special meaning). So you have:
name.gsub!(/[^0-9A-Za-z]/, '')
But also realize you could shorten your RegEx statement by making it case insensitive by adding an 'i' after the ending slash and therefore allowing you to drop either the [A-Z] or the [a-z] part. So you could have instead:
name.gsub!(/[^0-9A-Z\/]/i, '')
Hope this helps!

ruby regex: match URL recurring pattern

I want to be able to match all the following cases below using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches upto the first set of numbers. So for example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.
\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more numbers or commas"
The reason \d* doesn't work is it means "find zero or more numbers". As soon as the pattern search runs into a comma it stops. You have to tell the engine that it's OK to find numbers and commas.

Ruby Regular expression to match a url [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Regex to match URL
regex to remove the webpage part of a url in ruby
I am in search of a regular expression for parsing all the urls in a file.
i tried many of the regular expression i got after googling but it fails in one or the other case . my idea is to write one which checks the presense of http or https at the begening and it will match everything untill it sees a blank space .
any ideas ?
NOTE : i dont need to parse the url but erase all the urls from a file or atleast make it unreadable .
The standard URI library provides URI.regexp which is the regular expression for url string.
require 'uri'
string.scan(URI.regexp)
http://ruby-doc.org/stdlib/libdoc/uri/rdoc/index.html
You can try this:
/https?:\/\/[\S]+/
The \S means any non-whitespace character.
(Rubular)

Resources