Scraping iframe data using Nokogiri and Ruby

This is my script written to scrape data inside the <iframe> tag using Nokogiri:
require 'nokogiri'
require 'restclient'
doc = Nokogiri::HTML(RestClient.get("http://www.sample_site.com/"))
doc.xpath('//iframe[@width="1001" and @height="973"]').children
This is what I get:
=> [#<Nokogiri::XML::Text:0x1913970 "\r\nYour browser does not support inline frames\r\n">]
Can anyone tell me why?

An iframe is used to embed another document within the current HTML document. That means the iframe loads its content from an external source, which is specified in the src attribute.
So, if you want to scrape an iframe's content, you have to send a request to the external source it loads its content from.
# The iframe (notice the 'src' attribute)
<iframe src="iframe_source_url" height="973" width="1001">
# iframe content
</iframe>
# Code to do the scraping
doc = RestClient.get('iframe_source_url')
parsed_doc = Nokogiri::HTML(doc)
parsed_doc.css('#yourSelectorHere') # or parsed_doc.xpath('...')
Note (about the error)
When you scrape, the HTTP client you use acts as your browser (yours is restclient). The text "Your browser does not support inline frames" is the fallback content of the <iframe> tag: restclient does not render inline frames, so that fallback text is all Nokogiri ever sees in place of the frame's content.
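Putting the two steps together, here is a minimal sketch that pulls the src out of the original page and then fetches the framed document itself. The width/height filter is taken from the question, and the site URL is the question's placeholder:
require 'nokogiri'
require 'restclient'
require 'uri'

# Fetch the outer page and locate the iframe from the question.
page = Nokogiri::HTML(RestClient.get('http://www.sample_site.com/'))
src  = page.at_xpath('//iframe[@width="1001" and @height="973"]')['src']

# The src may be relative, so resolve it against the page URL before fetching.
frame_url = URI.join('http://www.sample_site.com/', src).to_s
frame     = Nokogiri::HTML(RestClient.get(frame_url))
# Now query the framed document with the selectors you actually need.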

The issue is with RestClient, not with Nokogiri.
RestClient does not retrieve the content of iframes. Examine the content of RestClient.get("http://www.sample_site.com/") and you will find a string like:
<iframe src="page-1.htm" name="test" height="120" width="600">
You need a Frames Capable browser to view this content.
</iframe>
Nokogiri deals with this just fine: it returns the content of the iframe node, which is apparently a single text node containing the string you got as a result.

Related

Give HTTPS Proxies

OK, as the admins requested, I have refined my question:
I am searching for local proxy solutions that can tweak HTTPS / HSTS websites. They should be able to tweak both the content and the headers of a site. Do you know of such proxies? I would prefer Python solutions because they are hackable.
Yes, there are solutions that work with browser plugins, and I have posted an answer containing an example using Yarip, but the problem is: as soon as the browser developers decide to remove APIs, not that anyone would do that, the plugin stops working.
Therefore I want a solution that works at the protocol level. So, which proxies can do that, i.e. tweak HTTPS / HSTS websites? I don't care about performance; my internet is slow anyway and I'm not in a hurry. Please also give a small example of how to tweak the content of a website and a small example of how to tweak a header, using your solution.
Hopefully my question is clear now.
Ok, here is my solution to this problem. AJAX with bottle.py and the jQuery Docs were helpful.
Use Firefox (Developer Edition), version 52. Later versions of Firefox do not support Yarip, which is my means of injecting JavaScript into the document and of tweaking the Content-Security-Policy response header, if there is one. I am interested in solutions for later Firefoxes/Chromes/whatever.
Stop Firefox from blocking mixed content (loading http resources from https sites), which is nonsense on localhost. According to this discussion and this wiki entry, starting with Firefox 55 localhost is finally whitelisted by default. However, as I need Yarip and therefore cannot use Firefox 55, I still need to disable this policy manually.
This can be done globally by setting about:config -> security.mixed_content.block_active_content to false, which is lazy and very dangerous as it affects every website, or by doing it temporarily per page, which is not so lazy but a little less dangerous.
Install Python 3
pip install bottle
Create a file server.py with the following contents:
import os, json
from bottle import request, response, route, static_file, debug, run

@route('/inc')  # this handles the ajax calls
def inc():
    # only needed when you pass parameters to the ajax call.
    response.set_header('Access-Control-Allow-Origin', '*')
    number = request.params.get('number', 0, type=int)
    return json.dumps({'increased_number': number + 1})

@route('/static/<filename:path>')  # this serves the two static javascript files
def send_static(filename):
    return static_file(filename, root=os.path.dirname(__file__))

debug(True)
run(port=9030, reloader=True)
Run it
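Once the server is running, here is a quick sanity check of the /inc endpoint, done from the Ruby side for consistency with the rest of this page. A sketch; it assumes the server above is listening on port 9030:
require 'restclient'
require 'json'

# GET /inc?number=5 should answer with the number increased by one.
res = RestClient.get('http://localhost:9030/inc', params: { number: 5 })
puts JSON.parse(res.body)['increased_number'] # => 6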
Put a copy of jquery.js into the same directory
Create a file logic.js in the same directory with the following contents:
// http://api.jquery.com/jQuery.noConflict/
var my = {};
my.$ = $.noConflict(true);

// http://api.jquery.com/ready/
my.$(function() {
    var target = my.$(
        '<div id="my-ajax-result" style="position:absolute; padding:1em; background:'
        + 'white; cursor:pointer; border:3px solid blue; z-index:999;">0</div>'
    );
    my.$('body').prepend(target);

    function ajaxcall() {
        // http://api.jquery.com/jQuery.getJSON/
        my.$.getJSON(
            "http://localhost:9030/inc",
            {
                number: target.text() // parameters
            },
            function(result) {
                target.text(result.increased_number);
            }
        );
    }

    // http://api.jquery.com/click/
    target.click(function(event) {
        ajaxcall();
        return false;
    });

    ajaxcall();
});
Install Yarip.
Save the following XML to a file and import it from Yarip's manage-pages dialog. This should create a new rule for en.wikipedia.org. I'm too lazy to explain here how Yarip works, but it is worth learning. This rule will inject jquery.js and logic.js at the end of <body> and will tweak the Content-Security-Policy response header, if there is one.
<?xml version="1.0" encoding="UTF-8"?>
<yarip version="0.3.5">
  <page id="{56b07a5d-e2df-41f2-9ca8-34b4ecb04af8}" name="wikipedia.org" allowScript="true" created="1496845936929">
    <page>
      <header>
        <response>
          <item created="1496845998973">
            <regexp flags="i"><![CDATA[.*wikipedia\.org.*]]></regexp>
            <name><![CDATA[Content-Security-Policy]]></name>
            <script><![CDATA[function (value) {
  return "Content-Security-Policy: connect-src *";
}]]></script>
          </item>
        </response>
      </header>
      <stream>
        <item created="1496845985382">
          <regexp flags="i"><![CDATA[.*wikipedia\.org.*]]></regexp>
          <stream_regexp flags="gim"><![CDATA[</body>]]></stream_regexp>
          <script><![CDATA[function (match, p1, offset, string) {
  return '<script type="text/javascript" src="http://localhost:9030/static/jquery.js"></script><script type="text/javascript" src="http://localhost:9030/static/logic.js"></script></body>';
}]]></script>
        </item>
      </stream>
    </page>
  </page>
</yarip>
Make sure that Yarip is enabled.
Navigate to en.wikipedia.org. You should see a blue rectangle at the top left with a number in it. If you click on it, an AJAX call to localhost is made and the contents of the blue rectangle are replaced with the result of that call: the number increased by 1.
Play around with this and tweak the web the way you want, including HTTPS sites. Have read/write access to your computer using Python. Eat this, Firefox nanny devs.

Web scraping from youtube with nokogiri

I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(open(url))

doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
  name = comment.css("#comment-section-renderer-items .g-hovercard").text
  puts name
end
But it's not working: I'm not getting any output, and no error either.
I won't be able to give you a solution, but at least I can give you a couple of hints that may help you move forward.
The code you have is not working because the comments section is loaded via an AJAX call after the page is loaded. If you do a hard reload in your browser, you will see a spinner icon and a "Loading..." text in the comments section while that content is fetched. When Nokogiri gets the page via the HTTP request, it gets the HTML content as it is before the comments are loaded. The place where the comments will later be added looks like this:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is the reason why you won't find the divs you are looking for: they aren't part of the HTML you have.
Looking at the network console in the browser, it seems that the AJAX request to get the comments data is sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see, the v parameter is the video id; however, there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form
'COMMENTS_TOKEN': "<token>".
However, you also need to send a session_token as form data in the body of the AJAX request (which is a POST), and I don't know where that one comes from :(.
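For what it's worth, here is a rough sketch of pulling that token out of the watch page with a regex. The 'COMMENTS_TOKEN' key is the one mentioned above; YouTube's markup changes often, so treat this as illustrative only:
require 'open-uri'

html = open('https://www.youtube.com/watch?v=tntOCGkgt98').read
if (m = html.match(/'COMMENTS_TOKEN':\s*"([^"]+)"/))
  puts "ctoken: #{m[1]}"
else
  puts 'token not found; the page layout may have changed'
end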
I think you would be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow AJAX requests or handle JavaScript. Maybe the Ruby Selenium driver is better suited for this.
HTH
I think you need name.css("#comment-section...": the each statement iterates over the matched elements, binding each one to the block variable name.
You may want to use node instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
I wrote this Rails app using Nokogiri to see all the tags that a page has before any JavaScript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
It can easily tell you whether the particular tag element you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that executing JS while scraping content is not a trivial task.
YouTube is a dynamically rendered JavaScript website, though it can be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scrolling to the comments section, and looking at which request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the output in the "Preview" tab.
Note: since this answer currently brings very little value, it will be updated with the attached code once there is an available solution.

How to extract HTML from updated DOM using Capybara Webkit driver?

I have a page that injects some text into the DOM: 'Success!'.
The JavaScript code works because I see the expected text in the screenshot, and the spec passes:
page.visit '/'
save_and_open_screenshot
expect(page).to have_content 'Success!'
puts page.html
However, the page.html is not updated. It does not have the injected text.
How do I get the HTML for the updated DOM?
EDIT: I found that the issue is caused by an iframe. The iframe is not added to the page.html, but it is added to the page.
EDIT #2: It turns out that the 'Success!' content is not in the iframe. So maybe the context is switching to the iframe.
I found one workaround that is OK:
html = page.evaluate_script('document.documentElement.innerHTML')
I guess one could use JS or jQuery finder to find the expected <div>.
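For instance, something along these lines, where '.my-div' is a made-up selector:
# Read a single element's markup straight from the live DOM.
div_html = page.evaluate_script("document.querySelector('.my-div').outerHTML")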
For the entire page body you can do this:
page.body
For any element in particular
page.find(".my-div").base.inner_html
Check out the full API here: https://github.com/thoughtbot/capybara-webkit/blob/master/lib/capybara/webkit/node.rb
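If the 'Success!' text really does live inside the iframe mentioned in the question's edits, Capybara can also switch context into the frame explicitly. A minimal sketch, where 'my_frame' is a hypothetical frame name:
# Everything inside the block runs against the iframe's document.
page.within_frame('my_frame') do
  expect(page).to have_content 'Success!'
  puts page.html # the iframe's HTML, not the top-level page's
end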

How to fix an "undefined method" when trying to scrape a website with Nokogiri

I want to get some data from the H&M website, using this scraper:
require 'nokogiri'
require 'open-uri'
require 'rmagick'
require 'mechanize'
product = "http://www2.hm.com/es_es/productpage.0250933004.html"
web = Nokogiri::HTML(open(product))
puts web.at_css('.product-item-headline').text
Nokogiri returns nil for each selector, which raises undefined method for nil:NilClass. I don't know if this particular website has something that prevents scraping.
In the page's DOM, I can see there is a .product-item-headline class, and I can fetch the info in the JavaScript console, but I can't with Nokogiri.
I tried targeting the whole body text, and this is the only thing I get printed:
var callcoremetrix = function(){cmSetClientID(getCoremetricsClientId(), true, "msp.hm.com", "hm.com");};
Maybe some JavaScript is ruining my scrape?
One idea is to use IRB and go step by step:
irb
> require 'open-uri'
> html = open(product).read
Does the HTML contain the class name text?
> html =~ /product-item-headline/
=> 56099
Yes it does, and here's the line:
<h1 class="product-item-headline">
So try Nokogiri:
> require 'nokogiri'
> web = Nokogiri::HTML(html)
=> success
Read the HTML text and try increasingly-broad queries related to your issue that take you nearer the top of the HTML, and see if they find results:
web.css("h1") # on line 2217 of the HTML
=> []
web.css(".product-detail-meta") # on line 2215
=> []
web.css(".wrapper") # on line 86
=> []
web.css("body") # on line 84
=> [#<Nokogiri::XML::Element …
This shows you there's a problem in the HTML: the parsing is disrupted somewhere between lines 84 and 86.
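As an aside, Nokogiri keeps a list of the problems it hit while parsing, which can be quicker than bisecting the file by hand. A small sketch, reusing the web document from above:
# Each entry is a Nokogiri::XML::SyntaxError with a line number and a message.
web.errors.first(10).each { |e| puts "line #{e.line}: #{e.message}" }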
Let's guess that line 85 may be the issue: it is a <header> tag, and we happen to know that doesn't contain your target, so we can delete it. Save the HTML to a file, then use any text editor to delete the tag and all its contents, then re-parse.
Does it work now?
web.css("h1") # on line 359 of the HTML
=> []
Nope. So we repeat this process, cutting down the HTML.
I also like to cut down the HTML by removing pieces that I know don't contain my target, such as the <head> area, <footer> areas, <script> areas etc.
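That trimming can also be scripted. A crude, regex-based sketch; good enough as a debugging aid, though not a general HTML transformation:
# Drop <script> and <head> areas that cannot contain the target <h1>.
slim = html.gsub(%r{<script\b.*?</script>}mi, '')
           .sub(%r{<head\b.*?</head>}mi, '')
File.write('slim.html', slim)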
You may like to use an auto-indenting editor, because it can quickly show you that something is unbalanced with the HTML.
Eventually we find that the HTML has many incorrect tags, such as unclosed section tags.
You can solve this a variety of ways:
The pure way is to fix the unclosed section tags, any way you want.
The hack way is to narrow the HTML to the area you know you need, which is in the h1 tag.
Here's the hack way:
area = html.match(/<h1 class="product-item-headline\b.*?<\/h1>/m)[0]
web = Nokogiri::HTML(area)
puts web.at_css(".product-item-headline").text.strip
=> "Funda de cojín de jacquard"
Heads up that the hack way isn't truly HTML-savvy, and you can see that it will fail if the HTML page author changes to use a different tag, or uses another class name before the class name you want, etc.
The best long-term solution is to contact the author of the HTML page and show him how to validate the HTML. A good site for this is http://validator.w3.org/ -- when you validate your URL, the site shows 100 errors and 6 warnings, and explains each one and how to solve it.

How to get a mail address from HTML code with Nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking of a regex, but I don't know if that's the best solution.
Example code:
<html>
  <title>Example</title>
  <body>
    This is an example text.
    <a href="mailto:example@example.com">Mail to me</a>
  </body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using XPath.
The selector //a will select any a tags on the page, and you can specify the href attribute using @ syntax, so //a/@href will give you the hrefs of all a tags on the page.
If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector
//a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags that have an href attribute starting with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'

selector = "//a[starts-with(@href, \"mailto:\")]/@href"
doc = Nokogiri::HTML.parse File.read 'my_file.html'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
With a test file that looks like this:
<html>
  <title>Example</title>
  <body>
    This is an example text.
    <a href="mailto:example@example.com">Mail to me</a>
    <a href="http://example.com">A Web link</a>
    <a>An empty anchor.</a>
  </body>
</html>
this code outputs the desired example@example.com. addresses is an array of all the email addresses in mailto links in the document.
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML:Document for the page we’re interested in...
doc = Nokogiri::HTML(open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
  # assuming you have more than one, do something with all your email fields here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) match text at the beginning of an attribute value:
require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<a href="mailto:example@example.com">blah</a>
<a href="http://example.com">blah</a>
EOT

doc.at('a[href^="mailto:"]').to_html
# => "<a href=\"mailto:example@example.com\">blah</a>"
Nokogiri tries to track the jQuery extensions to CSS selectors. I used to have a link to a change notice or a message from one of the maintainers talking about it, but I no longer have it at hand.
See "CSS Attribute Selectors" for more information.
Try fetching the whole HTML page and using regular expressions.
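If you do go that route, here is a rough sketch; the pattern is deliberately loose, so it will both miss some exotic addresses and match some junk:
# Scan the raw HTML for things that look like email addresses.
html = File.read('my_file.html')
emails = html.scan(/[\w.+-]+@[\w-]+\.[\w.-]+/).uniq
puts emails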
