Parsing URL string in Ruby

I have a pretty simple string I want to parse in Ruby, and I'm trying to find the most elegant solution. The string is of the format
/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla
What I would like to have is:
string1: /xyz/mov/exdaf/daeed.mov
string2: arg1=blabla&arg2=3bla3bla
So basically I want to tokenise on ?, but I can't find a good example.
Any help would be appreciated.

I think the best solution would be to use the URI module. (You can do things like URI.parse('your_uri_string').query to get the part to the right of the ?.) See http://www.ruby-doc.org/stdlib/libdoc/uri/rdoc/
Example:
002:0> require 'uri' # or even 'net/http'
true
003:0> URI
URI
004:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf')
#<URI::Generic:0xb7c0a190 URL:/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf>
005:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf').query
"arg1=bla&arg2=asdf"
006:0> URI.parse('/xyz/mov/exdaf/daeed.mov?arg1=bla&arg2=asdf').path
"/xyz/mov/exdaf/daeed.mov"
Otherwise, you can capture on a regex: /^(.*?)\?(.*?)$/. Then $1 and $2 are what you want. (URI makes more sense in this case though.)
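If you do go the regex route, a minimal sketch using the string from the question might be:
str = '/xyz/mov/exdaf/daeed.mov?arg1=blabla&arg2=3bla3bla'
if str =~ /^(.*?)\?(.*?)$/
  path, query = $1, $2
end
path   # => "/xyz/mov/exdaf/daeed.mov"
query  # => "arg1=blabla&arg2=3bla3bla"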

Split the initial string on question marks.
str.split("?")
=> ["/xyz/mov/exdaf/daeed.mov", "arg1=blabla&arg2=3bla3bla"]

This seems to be what you're looking for: String's built-in split function:
"abc?def".split("?") => ["abc", "def"]
Edit: Bah, too slow ;)

Related

How do I best dump a Psych::Nodes::Document back to a YAML string?

This does what I want, but going via to_ruby seems unnecessary:
doc = Psych.parse("foo: 123")
doc.to_ruby.to_yaml
# => "---\nfoo: 123\n"
When I try to do this, I get an error:
DEV 16:49:08 >> Psych.parse("foo: 123").to_yaml
RuntimeError: expected STREAM-START
from /opt/…/lib/ruby/2.5.0/psych/visitors/emitter.rb:42:in `start_mapping'
I get the impression that the input needs to be a stream of some sort, but I don't quite get what incantation I need. Any ideas?
(The problem I'm trying to solve here, by the way (in case you know of a better way) is to fix some YAML that can't be deserialised into Ruby, because it references classes that don't exist. The YAML is quite complex, so I don't want to just search-and-replace in the YAML string. My thinking was that I could use Psych.parse to get a syntax tree, modify that tree, then dump it back into a YAML string.)
Figured out the incantation after finding the higher-level docs at https://ruby-doc.org//stdlib-2.3.0_preview1/libdoc/psych/rdoc/Psych/Nodes.html, though please let me know if there's a better way:
doc = Psych.parse("foo: 123")
stream = Psych::Nodes::Stream.new
stream.children << doc
stream.to_yaml
# => "foo: 123\n"

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else and tried to edit it to work, and I'm failing. One of the big problems I'm having is that the original regex I took had a few "#"'s in it, and I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
Is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend this one, Rubular.com is good for experimenting with these.
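To plug that into the method from the question, Enumerable#grep takes a Regexp directly. A sketch, assuming uris is the array your parse_html returns, with an end anchor and case-insensitive flag added to the pattern above:
image_regex = /^https?:\/\/.*\.(jpeg|jpg|gif|png)$/i
image_uris  = uris.grep(image_regex)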
Regexps are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
# ...
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]

def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
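One caveat worth flagging (this goes beyond the question, so treat it as an aside): end_with? will miss image URLs that carry a query string. If your crawl hits those, you could check the path component instead, for example:
require 'uri'

uri  = 'http://example.com/photo.png?size=large'  # hypothetical URL with a query string
path = URI.parse(uri).path                        # => "/photo.png"
IMAGE_EXTS.any? { |ext| path.end_with?(ext) }     # => true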

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to divide it into logical pieces, and then put those pieces into an array so I end up with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use a regex I'd use the Nokogiri gem (a gem for parsing HTML written by Aaron Patterson, a contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
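If you do go the SAX route, here's a rough sketch of what that could look like (the collector class name is made up for illustration; start_element, characters, and end_element are Nokogiri's SAX document callbacks):
require 'nokogiri'

# Collects opening tags, text, and closing tags in document order.
class PieceCollector < Nokogiri::XML::SAX::Document
  attr_reader :pieces

  def initialize
    @pieces = []
  end

  def start_element(name, attrs = [])
    @pieces << "<#{name}>"
  end

  def characters(string)
    @pieces << string unless string.strip.empty?
  end

  def end_element(name)
    @pieces << "</#{name}>"
  end
end

collector = PieceCollector.new
Nokogiri::HTML::SAX::Parser.new(collector).parse('<div><b>This is some BOLD text</b></div>')
collector.pieces
# => ["<div>", "<b>", "This is some BOLD text", "</b>", "</div>"]
# (the HTML parser may also report the implicit <html>/<body> wrappers it adds)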
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}

Ruby regexp: capture the path of url

From any URL I want to extract its path.
For example:
URL: https://stackoverflow.com/questions/ask
Path: questions/ask
It shouldn't be difficult:
url[/(?:\w{2,}\/).+/]
But I think I'm using the wrong pattern for 'ignore this' ('?:' doesn't work). What is the right way?
I would suggest you don't do this with a regular expression, and instead use the built in URI lib:
require 'uri'
uri = URI::parse('http://stackoverflow.com/questions/ask')
puts uri.path # results in: /questions/ask
It has a leading slash, but that's easy to deal with =)
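For instance:
uri.path.sub(%r{\A/}, '')  # strip the leading slash
# => "questions/ask"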
You can use regex in this case, which is faster than URI.parse:
s = 'http://stackoverflow.com/questions/ask'
s[s[/.*?\/\/[^\/]*\//].size..-1]
# => "questions/ask" (6,8 times faster)
s[/\/(?!.*\.).*/]
# => "/questions/ask" (9,9 times faster, but with an extra slash)
But if you don't care about speed, using URI, as ctcherry showed, is more readable.
The approach presented by ctcherry is perfectly correct, but I prefer to use request.fullpath instead of including the URI library in the code. Just call request.fullpath in your views or controllers. But be careful: if you have any GET parameters in the URL, they will be caught too; in that case I use split('?').first.
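In other words, something along these lines (Rails-only; request is available in controllers and views):
# request.fullpath includes the query string, e.g. "/questions/ask?foo=bar"
request.fullpath.split('?').first
# => "/questions/ask"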

Find email addresses in large data stream

STILL NOT RESOLVED :( [Feb 11th]
I have a large text file full of random data and want to pull out all the email addresses from it.
I would like to do this in Ruby, with pseudo code like this:
monster_data_string = "asfsfsdfsdfsf sfda **joe@example.com** sdfdsf"
monster_data_string.match(EMAIL_REGEX)
Does anyone know what Ruby email regular expression I would use to accomplish this?
Please keep in mind that I'm looking for a Ruby answer to this. I have already tried numerous regexes found by googling, but most of them cause Ruby runtime errors stating that characters like "+" and "*" are invalid/unrecognized.
What I have already tried is:
monster_data_string.match(/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
but I receive Ruby errors stating that "+" is an invalid character
Thanks in advance
Watch this...
require 'yaml'

f = File.open("content.txt")
content = f.read
r = Regexp.new(/\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
If you're getting an error message about + or * being invalid in regexes, you're doing something very wrong. This is a valid regex in Ruby, although it's not the one you want:
/^([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})$/i
For one thing, you don't want to anchor the regex to the start and end of lines (^ and $) if you're trying to pluck the addresses from "random" text. But once you've gotten rid of the anchors, your regex will match **joe@example.com in your test string, which I presume you don't want. This regex from Regular-Expressions.info does a better job, but read that page for tips on tweaking it to meet your particular needs.
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
Finally (and you may already know this), you won't want to use the match() method because that will only find the first match. Try scan() instead.
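For example, with the pattern above (the sample string just extends the one from the question):
text = 'asfsfsdfsdfsf sfda **joe@example.com** sdfdsf mary@example.org'
text.scan(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i)
# => ["joe@example.com", "mary@example.org"]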
Given that it is not possible to parse every valid email address using a regexp, you are left with two choices:
Make a regexp that matches as many valid email addresses as possible, and live with the fact that some valid but rarely used forms of email address might get overlooked.
or
Make a regexp that matches anything that "might be" an email address, and then live with the false positives.
I use the second approach to weed out obviously wrong email addresses when validating user sign-up email addresses on a web page.
Gleaned from Ruby Cookbook which has a very good section on email address validation:
valid = '[^ @]+'
/^#{valid}@#{valid}\.#{valid}/
Apparently there is a 6343 character Perl regexp written by Paul Warren that does a very good job and also works in Ruby, but even that is not foolproof (I think it might also have some performance implications).
What kind of runtime error messages are you getting? Is it regarding the regexps as invalid, or is it breaking due to the target string being too large?
To try and help you get there (though not very elegantly, I admit):
I think the start and end anchors (^ and $) aren't helping. You may also want to filter out the asterisks:
irb(main):001:0> mds = "asfsfsdfsdfsf sfda **joe#example.com** sdfdsf"
=> "asfsfsdfsdfsf sfda **joe#example.com** sdfdsf"
irb(main):003:0> mds.match(/^([^#\s]+)#((?:[-a-z0-9]+\.)+[a-z]{2,})$/i)
=> nil
irb(main):004:0> mds.match(/([^#\s]+)#((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "**joe#example.com" 1:"**joe" 2:"example.com">
irb(main):005:0> mds.match(/([^#\s*]+)#((?:[-a-z0-9]+\.)+[a-z]{2,})/i)
=> #<MatchData "joe#example.com" 1:"joe" 2:"example.com">
Even better,
require 'yaml'
content = "asfsfsdfsdfsf sfda **joe#example.com.au** sdfdsf cool_me#example.com.fr"
r = Regexp.new(/\b([a-zA-Z0-9._%+-]+)#([a-zA-Z0-9.-]+?)(\.[a-zA-Z.]*)\b/)
emails = content.scan(r).uniq
puts YAML.dump(emails)
will give you
---
- - joe
  - example
  - .com.au
- - cool_me
  - example
  - .com.fr
