Parsing website with Hpricot - ruby

I'm trying to parse images from reddit using Ruby and the Hpricot gem.
Using Chrome, I got the XPath to the div holding the link to the image. I then use doc.search to find it, but the results come up empty.
doc.search('//*[@id="siteTable"]/div[1]/a').each do |r|
  puts r.attributes['href']
end
Any ideas?

Related

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
It has nothing to do with your ruby code.
If you run in your terminal something like
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
You can see that the output HTML does not have the images you are looking for. You can see them in your browser because, after that initial load, some JavaScript runs and loads more resources.
The ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.
Nokogiri does not execute JavaScript; however, the link has to be in the page somewhere, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent. In this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Search for something you know about the file; I searched for ce7ed2a83f, part of the image URL, and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n",'').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']
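The regex step in the snippet above can be tried in isolation, without a network call. This is only a minimal sketch: the sample script body, its comment wrapper, and the helper name extract_json_object are assumptions made for illustration, not what the site necessarily serves.

```ruby
require 'json'

# Hypothetical sample of what a JSON-LD <script> body might contain.
# The comment wrapper around the JSON is an assumption for illustration.
SAMPLE_SCRIPT_BODY = <<~TEXT
  /* <![CDATA[ */
  {"image":"https://example.com/poster.jpg","name":"Example Film"}
  /* ]]> */
TEXT

# Pull the span from the first "{" to the last "}" out of the text and
# parse it as JSON. Returns nil when no such span is found.
def extract_json_object(text)
  match = text.match(/\{.*\}/m) # /m lets "." span newlines
  match && JSON.parse(match[0])
end

data = extract_json_object(SAMPLE_SCRIPT_BODY)
puts data['image'] # prints https://example.com/poster.jpg
```

Using the /m flag avoids having to strip newlines first, which is what the gsub call in the answer does.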

How to get the position of text using 'PDF-Reader' gem in Ruby

I am new to Ruby and we are using Ruby Selenium framework for automating the PDF verification testing.
I want to verify the content of PDF, like text and also get the position of the text. Along with that I also need to get the text at a given position.
Something like this maybe
require 'pdf-reader'
require 'open-uri'
reader = PDF::Reader.new(URI.open("SAMPLE_URL")) # my resume pdf
page = reader.pages.first
lines = page.text.split("\n") # page.text returns the extracted text of the page
text_match_line_numbers = (0...lines.length).select do |i|
  lines[i].include? "text"
end
Look at their docs here; there are more advanced options for navigating the PDF page.
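The line-matching idea can be sketched on a plain string, which is handy for testing without a PDF. This assumes "position" means the line and column within the extracted text rather than coordinates on the page; the sample text and the helper names text_positions and line_at are hypothetical.

```ruby
# Hypothetical stand-in for the string that page.text would return.
SAMPLE_PAGE_TEXT = "Invoice 42\nTotal due: 100\nThank you"

# Return [line_index, column_index] pairs (both zero-based) for every
# line that contains needle. Only the first match per line is reported.
def text_positions(page_text, needle)
  page_text.split("\n").each_with_index.filter_map do |line, i|
    col = line.index(needle)
    [i, col] if col
  end
end

# Return the line at a zero-based index, or nil if it is out of range.
def line_at(page_text, line_index)
  page_text.split("\n")[line_index]
end

p text_positions(SAMPLE_PAGE_TEXT, 'Total') # => [[1, 0]]
p line_at(SAMPLE_PAGE_TEXT, 2)              # => "Thank you"
```

With a real PDF, the same helpers could be fed the text of each page in turn. filter_map needs Ruby 2.7 or newer.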

Watir-webdriver is progressing through script before Nokogiri finishes scraping

There are three forms on the page. All forms default to "today" for their date ranges. Each form is iteratively submitted with a date from a range (1/1/2013 - 1/3/2013, for example) and the resulting table is scraped.
The script then submits the date to the next form in line and again, the table is scraped. However, the scraping is occurring before the dates are submitted.
I tried adding sleep 2 in between scrapes to no avail.
The script is here: https://gist.github.com/hnanon/de4801e460a31d93bbdc
The script appears to assume that Nokogiri and Watir will always be in sync. This is not correct.
When you do:
page = Nokogiri::HTML.parse(browser.html)
Nokogiri gets the browser HTML at that one specific point in time. If Watir makes a change to the browser (i.e. changes the HTML), Nokogiri will not know about it.
Each time you want to parse the html with Nokogiri, you need to create a new Nokogiri object using the browser's latest html.
An example to illustrate:
require 'watir-webdriver'
require 'nokogiri'
b = Watir::Browser.new
b.goto 'www.google.ca'
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will be the Google page
b.goto 'www.yahoo.ca'
p page
#=> This will still be the Google page
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will now be the Yahoo page

click on xpath link with Mechanize

I want to click a link with Mechanize that I select with xpath (nokogiri).
How is that possible?
next_page = page.search "//div[@class='grid-dataset-pager']/span[@class='currentPage']/following-sibling::a[starts-with(@class, 'page')][1]"
next_page.click
The problem is that nokogiri element doesn't have click function.
I can't read the href (URL) and send get request because the link has onclick function defined (no href attribute).
If that's not possible, what are the alternatives?
Use page.at instead of page.search when you're trying to find only one element.
You can make your selector simpler (shorter) by using CSS selector syntax:
next_page = page.at('div.grid-dataset-pager > span.currentPage + a[class^="page"]')
You can construct your own Link instance if you have the Nokogiri element, page, and mechanize object to feed the constructor:
next_link = Mechanize::Page::Link.new( next_page, mech, page )
next_link.click
However, you might not need that, because Mechanize#click lets you supply a string with the text of the anchor/button to click on.
# Assuming this link text is unique on the page, which I suspect it is
mech.click next_page.text
Edit after re-reading the question completely: However, none of this is going to help you, because Mechanize is not a web browser! It does not have a JavaScript engine, and thus won't (can't) execute your onclick for you. For this you'll need to use Ruby to control a real web browser, e.g. using Watir or Selenium or Celerity or the like.
In general you would do:
page.link_with(:node => next_link).click
However, as Phrogz says, this won't really do what you want.
Why don't you use a hpricot element instead? Mechanize can click on a hpricot element as long as the link has a 'src' or 'href' attribute. Try something along these lines:
page = agent.get("http://www.example.com")
next_page = agent.click((page/"//your/xpath/a"))
Edit: After reading Phrogz's answer I also realized that this won't really do it. Mechanize doesn't support JavaScript yet. With this in mind, you have 3 options:
1. Use a library that controls a real web browser. See @Phrogz's answer.
2. Use Capybara, which is an integration testing library but can also be used as a standalone crawler. I've done this successfully with HTMLUnit, which is also an integration testing library, in Java. Capybara comes with Selenium support by default, though it also supports WebKit via an external gem. Capybara interprets JavaScript out of the box. This blog post might help.
3. Grok the page that you intend to crawl, use something like HTTPFox to monitor what the onclick JavaScript function does, and replicate this in your Mechanize script.
Good luck.

Using a regular expression to find a link with a certain anchor when using mechanize

I'm trying to use the Mechanize link_with(:href => 'anchor here') in order to find pages that have links with a certain string in the href. For example I want it so that I can spit out into a text file all sites that have a link where the anchor contains "index.php?user"
How would I go about this?
Thanks all for your answers, I ended up going with page.link_with(:href => /(.*)?user$/)
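require 'mechanize'
agent = Mechanize.new
urls = ['http://www.google.com/', 'http://www.foo.com/', 'http://www.bar.com/']
File.open('output.txt', 'w') do |out|
  urls.each do |url|
    # Escape the dot so it matches a literal "." and write one URL per line
    out.puts url if agent.get(url).link_with(:href => /index\.php\?user/)
  end
end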
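The pattern itself can be tried on a plain list of href strings before pointing it at live pages. This is a small sketch; the sample hrefs and the helper name matching_hrefs are made up for illustration.

```ruby
# The dot is escaped so it matches a literal "." rather than any character.
HREF_PATTERN = /index\.php\?user/

# Hypothetical hrefs; nil stands for an anchor without an href attribute.
SAMPLE_HREFS = [
  '/index.php?user=42',
  '/indexXphp?user=1',
  '/about',
  nil,
  '/forum/index.php?user'
].freeze

# Return only the hrefs that match pattern, skipping missing ones.
def matching_hrefs(hrefs, pattern = HREF_PATTERN)
  hrefs.compact.select { |href| href.match?(pattern) }
end

p matching_hrefs(SAMPLE_HREFS)
# => ["/index.php?user=42", "/forum/index.php?user"]
```

With an unescaped dot, '/indexXphp?user=1' would match as well, which is probably not what is wanted.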
I would suggest you look into XPath selectors:
jQuery Xpath selector to select an element which id contains 'sometext'
An example on how to use XPath with mechanize can be found here:
extract single string from HTML using Ruby/Mechanize (and Nokogiri)
