Trying to scrape an image using Nokogiri but it returns a link that I was not expecting - ruby

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
  title: title,
  overview: overview,
  poster_url: poster,
}

It has nothing to do with your Ruby code.
If you run in your terminal something like
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
You can see that the output HTML does not have the images you are looking for. You can see them in your browser because, after that initial load, some JavaScript runs and loads more resources.
The ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.

Nokogiri does not execute JavaScript. However, the link has to be there, or at least there has to be a link to some API that returns it.
The first place I would search for it is the data attributes of the image element or its parent; in this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know is in the file; I searched for the ce7ed2a83f part of the image URL and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)

# The JSON-LD script content is not clean JSON on its own, so strip newlines
# and grab everything from the first "{" to the last "}" before parsing
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n", '').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']

Related

How can I click on a specific link with Nokogori or Mechanize?

I know how to find an element using Nokogiri. I know how to click a link using Mechanize. But I can't figure out how to find a specific link and click it. This seems like it should be really easy, but for some reason I can't find a solution.
Let's say I'm just trying to click on the first result on a Google search. I can't just click the first link with Mechanize, because the Google page has a bunch of other links, like Settings. The search result links themselves don't seem to have class names, but they're enveloped in <h3 class="r"></h3>.
I could just use Nokogiri to follow the href value of the link like so:
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').children.first['href']
new_document = open(href)
# href is equal to "/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F"
but it's not a direct url, and going to that url gives an error. The data-href value is a direct url, but I can't figure out how to get that value - doing the same thing except with ...first['data-href'] returns nil.
Anyone know how I can just find the first .r element on the page and click the link inside it?
Here's the start to my action:
require 'open-uri'
require 'nokogiri'
require 'mechanize'
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
Here's the .r element on the Google search results page:
<h3 class="r">
Stack Overflow
</h3>
You should make sure the code in your question is correct; it looks like it is not, because you don't surround the URL in quotes, and the CSS selector should be .r a, not r. You use .r a because you want to access the link inside elements with the r class.
Anyway, you can use the approach detailed here like so:
require 'open-uri'
require 'nokogiri'
require 'uri'

base_url = "https://www.google.com/search?q=stackoverflow"
document = URI.open(base_url)
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').first.children.first['href']

# Resolve the relative href against the page it came from
new_url = URI.join(base_url, href)
new_document = URI.open(new_url)
I tested this and following new_url does redirect to StackOverflow as expected.
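The key step is URI.join, which resolves a root-relative href against the page it came from. A quick illustration using the href from the question:

```ruby
require 'uri'

base_url = "https://www.google.com/search?q=stackoverflow"
href = "/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F"

# An absolute-path reference replaces the base URL's path and query
new_url = URI.join(base_url, href)
puts new_url
# => https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F
```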

How to get the position of text using 'PDF-Reader' gem in Ruby

I am new to Ruby, and we are using a Ruby Selenium framework to automate PDF verification testing.
I want to verify the content of a PDF, like text, and also get the position of the text. Along with that, I also need to get the text at a given position.
Something like this, maybe:
require 'pdf-reader'
require 'open-uri'

reader = PDF::Reader.new(URI.open("SAMPLE_URL")) # my resume pdf
page = reader.pages.first
lines = page.text.split("\n")
text_match_line_numbers = (0...lines.length).select do |i|
  lines[i].include? "text"
end
Look at their docs here; there are more advanced options for navigating the PDF page.
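The line-matching part is plain Ruby and works on any array of lines, so it can be sketched and checked without a real PDF. Note that SAMPLE_URL above is a placeholder; with a real document you would build the lines array from page.text as shown there. The sample lines below are invented for illustration:

```ruby
# Return the indices of every line containing the needle
def matching_line_numbers(lines, needle)
  (0...lines.length).select { |i| lines[i].include?(needle) }
end

# Invented sample data standing in for page.text.split("\n")
lines = [
  "John Doe",
  "Experience: some text here",
  "More text on another line"
]

p matching_line_numbers(lines, "text")
# => [1, 2]
```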

How do I select a table row with Nokogiri?

I'm having trouble scraping the rows from "List of Nobel laureates" in Nokogiri.
I believe my CSS selector is correct, but it's returning empty.
The original tutorial is "Writing a Web Crawler".
require 'rubygems'
require 'nokogiri'
require 'open-uri'
BASE_WIKIPEDIA_URL = 'http://en.wikipedia.org/'
LIST_URL = "#{BASE_WIKIPEDIA_URL}/wiki/List_of_Nobel_laureates"
page = Nokogiri::HTML(open(LIST_URL))
rows = page.css('div#content.mw-body div#bodyContent div#mw-content-text.mw-content-ltr table.wikitable.sortable.jquery-tablesorter tr')
puts "length : #{rows.size}"
There are two problems:
You have a double slash in the URL you are building, so you're not actually looking at the page you think you're looking at. This is the URL you are using: http://en.wikipedia.org//wiki/List_of_Nobel_laureates. If you follow that link, you'll see that it redirects to the Wikipedia homepage.
Your CSS selector is far too specific and includes some information that won't be present in the raw page source. You should try a simpler selector:
rows = page.css('table.wikitable tr')
Specifically, you are including the jquery-tablesorter class in your selector. This class is added by JavaScript, but the tools you're using don't execute the page's JavaScript, so the class won't be present and you can't use it to find table rows.
If you use "view source" instead of your browser's DOM inspector tool, you will see the raw source code without any JavaScript applied.
I can see that you're expecting a table with the class jquery-tablesorter. That's because you're inspecting the table in your browser, and it has that class there. The problem is that jQuery adds that class after the page loads; since open-uri doesn't process JavaScript, the class never gets added to the table that Nokogiri sees.
Long story short, you probably want to go with just:
page.css('table.wikitable tr')

Watir-webdriver is progressing through script before Nokogiri finishes scraping

There are three forms on the page. All forms default to "today" for their date ranges. Each form is iteratively submitted with a date from a range (1/1/2013 - 1/3/2013, for example) and the resulting table is scraped.
The script then submits the date to the next form in line and, again, the table is scraped. However, the scraping occurs before the dates are submitted.
I tried adding sleep 2 in between scrapes to no avail.
The script is here: https://gist.github.com/hnanon/de4801e460a31d93bbdc
The script appears to assume that Nokogiri and Watir will always be in sync. This is not correct.
When you do:
page = Nokogiri::HTML.parse(browser.html)
Nokogiri gets the browser's HTML at that one specific point in time. If Watir makes a change to the browser (i.e. changes the HTML), Nokogiri will not know about it.
Each time you want to parse the HTML with Nokogiri, you need to create a new Nokogiri object from the browser's latest HTML.
An example to illustrate:
require 'watir-webdriver'
require 'nokogiri'
b = Watir::Browser.new
b.goto 'www.google.ca'
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will be the Google page
b.goto 'www.yahoo.ca'
p page
#=> This will still be the Google page
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will now be the Yahoo page

How to implement Watir classes (e.g. PageContainer)?

I'm writing a sample test with Watir where I navigate around a site with the IE class, issue queries, etc.
That works perfectly.
I want to continue by using PageContainer's methods on the last page I landed on.
For instance, using its HTML method on that page.
Now I'm new to Ruby and just started learning it for Watir.
I tried asking this question on OpenQA, but for some reason the Watir section is restricted to normal members.
Thanks for looking at my question.
edit: here is a simple example
require "rubygems"
require "watir"
test_site = "http://wiki.openqa.org/"
browser = Watir::IE.new
browser.goto(test_site)
# now if I want to get the HTML source of this page, I can't use the IE class
# because it doesn't have a method which supports that
# the PageContainer class, does have a method that supports that
# I'll continue what I want to do in pseudo code
Store HTML source in text file
# I know how to write to a file, so that's not a problem;
# retrieving the HTML is the problem.
# more specifically, using another Watir class is the problem.
Close browser
# end
Currently, the best place to get answers to your Watir questions is the Watir-General email list.
For this question, it would be nice to see more code. Is the application under test (AUT) opening a new window/tab that you're having trouble getting to, and you therefore wanted to try PageContainer, or is it just navigating to a second page?
If it is the first one, you want to look at #attach, if it is the second, then I would recommend reading the quick start tutorial.
Edit after code added above:
What I think you missed is that Watir::IE includes the Watir::PageContainer module. So you can call browser.html to get the html displayed on the page to which you've navigated.
I agree. It seems to me that browser.html is what you want.
