Regex in Ruby for a URL that is an image - ruby

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
html_doc = Nokogiri::HTML(html)
nodes = html_doc.xpath("//a[#href]")
nodes.inject([]) do |uris, node|
uris << node.attr('href').strip
end.uniq
end
I am current getting a bunch of links, most of which are images, but not all. I want to narrow down the links before downloading with a regex. So far, I haven't been able to come up with a Ruby-Friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else, and tried to edit it to work and I'm failing. One of the big problems I'm having is the original Regex I took had a few "#"'s in it, which I don't know if that is a character I can escape, or if Ruby is just going to stop reading at that point. Help much appreciated.

I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][#href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/#href").map(&:value)

As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)
Is a pretty simple one that will grab anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend this one, Rubular.com is good for experimenting with these.

Regexp is a very powerful tool but - compared to simple string comparisons - they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
# ...
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
uris = []
Nokogiri::HTML(html).xpath("//a[#href]").each do |node|
uri = node.attr('href').strip
uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
end
uris.uniq
end

Related

Delete Duplicate Lines Ruby

I working on a json file, I think. But Regardless, I'm working with a lot of different hashes and fetching different values and etc. This is
{"notification_rule"=>
{"id"=>"0000000",
"contact_method"=>
{"id"=>"000000",
"address"=>"cod.lew#gmail.com",}
{"notification_rule"=>
{"id"=>"000000"
"contact_method"=>
{"id"=>"PO0JGV7",
"address"=>"cod.lew#gmail.com",}
Essential, this is the type of hash I'm currently working with. With my code:
I wanted to stop duplicates of the same thing in the text file. Because whenever I run this code it brings both the address of both these hashes. And I understand why, because its looping over again, but I thought this code that I added would help resolve that issue:
Final UPDATE
if jdoc["notification_rule"]["contact_method"]["address"].to_s.include?(".com")
numbers.print "Employee Name: "
numbers.puts jdoc["notification_rule"]["contact_method"]["address"].gsub(/#target.com/, '').gsub(/\w+/, &:capitalize)
file_names = ['Employee_Information.txt']
file_names.each do |file_name|
text = File.read(file_name)
lines = text.split("\n")
new_contents = lines.uniq.join("\n")
File.open(file_name, "w") { |file| file.puts new_contents }
end
else
nil
end
This code looks really confused and lacking a specific purpose. Generally Ruby that's this tangled up is on the wrong track, as with Ruby there's usually a simple way of expressing something simple, and testing for duplicated addresses is one of those things that shouldn't be hard.
One of the biggest sources of confusion is the responsibility of a chunk of code. In that example you're not only trying to import data, loop over documents, clean up email addresses, and test for duplicates, but somehow facilitate printing out the results. That's a lot of things going on all at once, and they all have to work perfectly for that chunk of code to be fully operational. There's no way of getting it partially working, and no way of knowing if you're even on the right track.
Always try and break down complex problems into a few simple stages, then chain those stages together as necessary.
Here's how you can define a method to clean up your email addresses:
def address_scrub(address)
address.gsub(/\#target.com/, '').gsub(/\w+/, &:capitalize)
end
Where that can be adjusted as necessary, and presumably tested to ensure it's working correctly, which you can now do indepenedently of the other code.
As for the rest, it looks like this:
require 'set'
# Read in duplicated addresses from a file, clean up with chomp, using a Set
# for fast lookups.
duplicates = Set.new(
File.open("Employee_Information.txt", "r").readlines.map(&:chomp)
)
# Extract addresses from jdoc document array
filtered = jdocs.map do |jdoc|
# Convert to jdoc/address pair
[ jdoc, address_scrub(jdoc["notification_rule"]["contact_method"]["address"]) ]
end.reject do |jdoc, address|
# Remove any that are already in the duplicates list
duplicates.include?(address)
end.map do |jdoc, _|
# Return only the document
jdoc
end
Where that processes jdocs, an array of jdoc structures, and removes duplicates in a series of simple steps.
With the chaining approach you can see what's happening before you add on the next "link", so you can work incrementally towards a solution, adjusting as you go. Any mistakes are fairly easy to catch because you're able to, at any time, inspect the intermediate products of those stages.

writing a short script to process markdown links and handling multiple scans

I'd like to process just links written in markdown. I've looked at redcarpet which I'd be ok with using but I really want to support just links and it doesn't look like you can use it that way. So I think I'm going to write a little method using regex but....
assuming I have something like this:
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
tmp=str.scan(/\[.*\]\(.*\)/)
or if there is some way I could just gsub in place [hope](http://www.github.com) -> <a href='http://www.github.com'>hope</a>
How would I get an array of the matched phrases? I was thinking once I get an array, I could just do a replace on the original string. Are there better / easier ways of achieving the same result?
I would actually stick with redcarpet. It includes a StripDown render class that will eliminate any markdown markup (essentially, rendering markdown as plain text). You can subclass it to reactivate the link method:
require 'redcarpet'
require 'redcarpet/render_strip'
module Redcarpet
module Render
class LinksOnly < StripDown
def link(link, title, content)
%{#{content}}
end
end
end
end
str="here is my thing [hope](http://www.github.com) and after [hxxx](http://www.some.com)"
md = Redcarpet::Markdown.new(Redcarpet::Render::LinksOnly)
puts md.render(str)
# => here is my thing hope and ...
This has the added benefits of being able to easily implement a few additional tags (say, if you decide you want paragraph tags to be inserted for line breaks).
You could just do a replace.
Match this:
\[([^[]\n]+)\]\(([^()[]\s"'<>]+)\)
Replace with:
\1
In Ruby it should be something like:
str.gsub(/\[([^[]\n]+)\]\(([^()[]\s"'<>]+)\)/, '\1')

DRY search every page of a site with nokogiri

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
visited_links = [] #new array of what is visted
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visted_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
files = %w[png jpeg jpg gif svg txt js css zip gz]
starting_uri = URI.parse(starting_at)
seen_pages = Set.new # Keep track of what we've seen
crawl_page = ->(page_uri) do # A re-usable mini-function
unless seen_pages.include?(page_uri)
seen_pages << page_uri # Record that we've seen this
begin
doc = Nokogiri.HTML(open(page_uri)) # Get the page
each_page.call(doc,page_uri) # Yield page and URI to the block
# Find all the links on the page
hrefs = doc.css('a[href]').map{ |a| a['href'] }
# Make these URIs, throwing out problem ones like mailto:
uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact
# Pare it down to only those pages that are on the same site
uris.select!{ |uri| uri.host == starting_uri.host }
# Throw out links to files (this could be more efficient with regex)
uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
# Remove #foo fragments so that sub-page links aren't differentiated
uris.each{ |uri| uri.fragment = nil }
# Recursively crawl the child URIs
uris.each{ |uri| crawl_page.call(uri) }
rescue OpenURI::HTTPError # Guard against 404s
warn "Skipping invalid link #{page_uri}"
end
end
end
crawl_page.call( starting_uri ) # Kick it all off!
end
crawl_site('http://phrogz.net/') do |page,uri|
# page here is a Nokogiri HTML document
# uri is a URI instance with the address of the page
puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
Use recursion to keep crawling every link on every page, but bailing out if you've already seen the page.
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search throughout your document and find tags being used as buttons, then parse out the URL.
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
links.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward. The conditional in the map results in nil entries in the resulting array, so don't do that. Use select or reject to reduce the set of links that meet your criteria, and then use map to transform them. In your use here, pre-filtering using ^= in the CSS makes it even easier.
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link different if you start, then stop, then start again, or if you're not returning the cookies received from the site, so you have to return cookies, and figure out which query parameters are significant, and which are going to throw off your code. Keep the first and throw away the second when you store the links for later parsing.
Use a HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel, and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed then banned. Your site will go to zero throughput at that point.
It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using windows (like me) you might want to look into Anemone.

Get full path in Sinatra route including everything after question mark

I have the following path:
http://192.168.56.10:4567/browse/foo/bar?x=100&y=200
I want absolutely everything that comes after "http://192.168.56.10:4567/browse/" in a string.
Using a splat doesn't work (only catches "foo/bar"):
get '/browse/*' do
Neither does the regular expression (also only catches "foo/bar"):
get %r{/browse/(.*)} do
The x and y params are all accessible in the params hash, but doing a .map on the ones I want seems unreasonable and un-ruby-like (also, this is just an example.. my params are actually very dynamic and numerous). Is there a better way to do this?
More info: my path looks this way because it is communicating with an API and I use the route to determine the API call I will make. I need the string to look this way.
If you are willing to ignore hash tag in path param this should work(BTW browser would ignore anything after hash in URL)
updated answer
get "/browse/*" do
p "#{request.path}?#{request.query_string}".split("browse/")[1]
end
Or even simpler
request.fullpath.split("browse/")[1]
get "/browse/*" do
a = "#{params[:splat]}?#{request.env['rack.request.query_string']}"
"Got #{a}"
end

What would be the best way to take a string of html, chop it up, and put each piece into an array?

I have a general idea of how I can do this, but can't pinpoint how exactly to get it done. I am sure it can be done with a regex of some sort. Wondering if anyone here can point me in the right direction.
If I have a string of html such as this
some_html = '<div><b>This is some BOLD text</b></div>'
I want to to divide it into logical pieces, and then put those pieces into an array so I end with a result like this
html_array = ["<div>", "<b>", "This is some BOLD text", "</b>","</div>" ]
Rather than use regex I'd use the nokogiri gem (a gem for parsing html written by Aaron Patterson - contributor to Rails and Ruby). Here's a sample of how to use it:
html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
You can then call html_doc.children to get a nodeset and work your way from there
html_doc.children # returns a nodeset
Use an HTML parser, for instance, Nokogiri. Using SAX you can add tags/elements to the array as events are triggered.
It's not a good idea to try to regex HTML, unless you're planning to treat only a small determined subset of it.
some_html.split(/(<[^>]*>)/).reject{|x| '' == x}

Resources