Web scraping from youtube with nokogiri - ruby

I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(URI.open(url))
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
name = comment.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
But it's not working: I get no output and no error either.

I won't be able to give you a solution, but at least I can give you a couple of hints that may help you to move forward.
The code you have is not working because the comments section is loaded via an AJAX call after the page is loaded. If you do a hard reload in your browser, you will see a spinner icon and a "Loading..." message in the comments section while it waits for the content to arrive. When Nokogiri gets the page via the HTTP request, it receives the HTML as it is before the comments are loaded. In fact, the place where the comments will later be added looks like this:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is the reason why you won't find the divs you are looking for, because they aren't part of the html you have.
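A quick way to confirm this is to check the raw response for the class names the selector relies on before writing any Nokogiri code. The sketch below is only a rough check: it runs on a hard-coded sample that mimics the placeholder markup above, and the helper name classes_in is made up for illustration.

```ruby
# Trimmed copy of the placeholder markup shown above, standing in for the
# body returned by the plain HTTP request (an assumption, not a live fetch).
raw_html = <<~HTML
  <div id="watch-discussion" class="branded-page-box yt-card">
    <div id="comment-section-renderer" class="comment-section-renderer vve-check">
      <div class="action-panel-loading">
        <span class="yt-spinner-message">Loading...</span>
      </div>
    </div>
  </div>
HTML

# Hypothetical helper: collect every class name found in class="..." attributes.
# A regex is enough for a quick sanity check; it is not a real HTML parser.
def classes_in(html)
  html.scan(/class="([^"]*)"/).flatten.flat_map(&:split).uniq
end

found = classes_in(raw_html)
puts found.include?('comment-renderer')    # the class the selector needs
puts found.include?('yt-spinner-message')  # the loading placeholder
```

On this sample the first check prints false and the second prints true, which is the same situation the question's selector runs into.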
Looking at the network console in the browser, it seems that the ajax request to get the comments data is being sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see the v parameter is the video id, however there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form of
'COMMENTS_TOKEN': "<token>".
However, you still need to send a session_token as form data in the body of the AJAX request (which is a POST). I don't know where that token comes from :(.
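As a rough sketch of the first caveat, the token can be pulled out of the script text with a regex and the fragment URL rebuilt from it. The sample script content, the token value, and both helper names are assumptions for illustration; the session_token part of the POST is not covered here, so this alone will not fetch the comments.

```ruby
require 'uri'

# Hypothetical stand-in for the inline <script> content of the watch page.
sample_html = <<~HTML
  <script>
    var config = {'COMMENTS_TOKEN': "EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%3D"};
  </script>
HTML

# Returns the token string, or nil when the pattern is not found.
def extract_comments_token(html)
  match = html.match(/'COMMENTS_TOKEN':\s*"([^"]+)"/)
  match && match[1]
end

# Builds the fragment URL with the same query parameters seen in the
# network console; encode_www_form takes care of escaping the values.
def comments_fragment_url(video_id, ctoken)
  query = URI.encode_www_form(
    v: video_id, tr: 'time', distiller: 1,
    ctoken: ctoken, frags: 'comments', spf: 'load'
  )
  "https://www.youtube.com/watch_fragments_ajax?#{query}"
end

token = extract_comments_token(sample_html)
url = comments_fragment_url('tntOCGkgt98', token)
puts url
```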
I think you will be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow AJAX requests or handle JavaScript. Maybe the Ruby Selenium driver is better suited for this.
HTH

I think you need name.css("#comment-section..."
The each statement will iterate over the elements, using the variable name.
You may want to use node instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end

I wrote this rails app using nokogiri to see all the tags that a page has before any javascript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
That can easily tell you if the particular tag element that you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that it's not a trivial task to execute JS when scraping content.

YouTube is a dynamically rendered JavaScript website, though it can be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scroll to the comment section, and see what request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the response in the "Preview" tab.
(The original answer showed two screenshots here: the output in the "Preview" tab and the comment on the page that it corresponds to.)
Note: Since this comment brings very little value, this answer will be updated with the attached code once there will be an available solution.

Related

How to extract HTML from updated DOM using Capybara Webkit driver?

I have a page that injects some text into the DOM: 'Success!'.
The Javascript code works because I see the expected text in the screenshot, and the spec passes:
page.visit '/'
save_and_open_screenshot
expect(page).to have_content 'Success!'
puts page.html
However, the page.html is not updated. It does not have the injected text.
How do I get the HTML for the updated DOM?
EDIT: I found that the issue is caused by an iframe. The iframe is not added to the page.html, but it is added to the page.
EDIT #2: It turns out that the 'Success!' content is not in the iframe. So maybe the context is switching to the iframe.
Found one workaround which is OK:
html = page.evaluate_script( 'document.documentElement.innerHTML' )
I guess one could use JS or jQuery finder to find the expected <div>.
For the entire page body you can do this:
page.body
For any element in particular
page.find(".my-div").base.inner_html
Check out the full API here: https://github.com/thoughtbot/capybara-webkit/blob/master/lib/capybara/webkit/node.rb

Scraping iframe data using Nokogiri and Ruby [closed]

This is my script written to scrape data inside the <iframe> tag using Nokogiri:
require 'nokogiri'
require 'restclient'
doc = Nokogiri::HTML(RestClient.get("http://www.sample_site.com/"))
doc.xpath('//iframe[@width="1001" and @height="973"]').children
I am getting like this:
=> [#<Nokogiri::XML::Text:0x1913970 "\r\nYour browser does not support inline frames\r\n">]
Can anyone tell me why?
An iframe is used to embed another document within the current HTML document. This means the iframe loads its content from an external source, which is specified in the src attribute.
So, if you want to scrape an iframe's content, you should send a request to the external source from which it loads that content.
# The iframe (notice the 'src' attribute)
<iframe src="iframe_source_url" height="973" width="1001">
# iframe content
</iframe>
# Code to do the scraping
doc = RestClient.get('iframe_source_url')
parsed_doc = Nokogiri::HTML(doc)
parsed_doc.css('#yourSelectorHere') # or parsed_doc.xpath('...')
Note (about the error)
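When you do scraping, the HTTP client you use acts as your browser (yours is restclient). The text "Your browser does not support inline frames" is not an error raised by restclient; it is the fallback content placed inside the <iframe> tag for clients that do not render frames. restclient only fetches the outer document and never loads the frame's src, which is why that fallback text is all you get.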
The issue is to be addressed to RestClient, not to Nokogiri.
RestClient does not retrieve the content of iframes. If you examine the content of RestClient.get("http://www.sample_site.com/"), you will find a string like:
<iframe src="page-1.htm" name="test" height="120" width="600">
You need a Frames Capable browser to view this content.
</iframe>
Nokogiri is handling this correctly: it returns the children of the iframe node, which here is a single text node containing the fallback string you got as a result.
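Both answers come down to the same two steps: read the iframe's src from the outer page, then request that URL separately. Here is a minimal sketch of the first step, assuming a made-up page URL and sample markup. With Nokogiri the src would normally come from doc.at_css('iframe')['src']; a regex is used here only to keep the sketch free of dependencies.

```ruby
require 'uri'

# Hypothetical page URL and markup, standing in for the real site.
page_url = 'http://www.example.com/pages/index.html'
outer_html = <<~HTML
  <iframe src="page-1.htm" name="test" height="120" width="600">
  You need a Frames Capable browser to view this content.
  </iframe>
HTML

# Grab the src attribute of the first iframe tag.
src = outer_html[/<iframe[^>]*\ssrc="([^"]+)"/, 1]

# The src is often relative, so resolve it against the outer page's URL
# before requesting it.
frame_url = URI.join(page_url, src).to_s
puts frame_url
```

The resulting URL is what you would pass to RestClient.get and then to Nokogiri::HTML, as in the scraping code above.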

Dynamic display in Grails

I have this problem I'm facing. I have been working on a project using Grails based on the advice from a friend. I'm still a novice in using Grails, so any down to earth explanation would be highly welcomed.
My project is a web application which scans for broken or dead links and displays them on a screen. The main application is written in Java, and it displays the output (good links, bad links, pages scanned) continuously on the system console as the scan goes on. I've finished implementing my UI, controllers, views, and database using Grails. Now I would like a section of my GSP page, say forager.gsp, to show live the current link being scanned, the current number of bad links found, and the current page being scanned.
To implement this live display, I have tried storing the output my application prints to the console in a table in my database. This table has a single row which is constantly updated as the current page scanned, the number of good links found, and the number of bad links found change. Since this table is being updated constantly, I've written an action in my controller which reads this single row and renders the result to my UI. The problem I'm now facing is that I need a way to refresh the displayed result in my UI at a regular interval. I want the final output to look like
scanning: This page, Bad links: 8, good links: 200
So basically here is my controller action which reads the table from the database
import groovy.sql.Sql
class PHPController {
def index() {}
def dataSource
def ajax = {
def sql = new Sql(dataSource)
def errors = sql.rows("SELECT *from links")
render (view: 'index', template:'test', model:[errors:errors])
}
}
Here is the template I render test.gsp
<table border="0">
<g:each in="${ errors }" var="error">
<tr><td>${ error.address }</td><td>${ error.error}</td><td>${ error.pageLink}</td></tr>
</g:each>
</table>
For now I'm working with a test UI, which means this is not my UI but one I use for testing purposes, say index.gsp
<html>
<body>
<div><p>Pleaseeee, update only the ones below</p></div>
<script type="text/javascript">
function ClickMe(){
setInterval('document.getElementById("auto").click()',5000);
alert("Function works");
}
</script>
<div id="dont't touch">
<g:formRemote url="[controller:'PHP', action:'ajax']" update="ajaxDiv"
asynchronous="true" name="Form" onComplete="ClickMe()" after="ClickMe()">
<div>
<input id="auto" type="button" value="Click" />
</div>
</g:formRemote>
<div id="ajaxDiv">
<g:render template="/PHP/test"/>
</div>
</body>
</html>
The div I'm trying to update is "ajaxDiv". Anyone trying to answer this question can just assume that I don't have an index.gsp and can propose a solution from scratch. This is the first time I'm using Grails, and also the first time I'm dealing with AJAX in any form. The aim is to dynamically fetch data from my database and display the result. Or, if someone knows how to directly mirror output from the system console onto the UI, that would also be great.
It sounds like a form would be appropriate for your needs. Check out the Grails documentation on forms. You should be able to render a form with the values you would like without too much trouble. Be sure to pay attention to your mapping and let me know if you have any questions after you have set index.gsp up to render a form for your values.

Ruby Watir - how to select <a onclick="new Ajax.Request

Hi I'm trying to select an edit button and I am having difficulty selecting it.
<td>
<a onclick="new Ajax.Request('/media/remote/edit_source/3', {asynchronous:true, evalScripts:true}); return false;" href="#">
<img title="Edit" src="/media/images/edit.gif?1258500617" alt="Edit">
</a>
The number at the end of '/media/remote/edit_source/3' changes, and I have stored it in the @rep_id variable.
I can't use xpath because the table changes often. Any suggestions? Any help is greatly appreciated. Below is what I have tried and fails. I am fairly new to watir and love it, but occasionally I run into things like this and get stumped.
browser.a(:text, "/media/remote/edit_source/#{@rep_id}").when_present.click
The line:
browser.a(:text, "/media/remote/edit_source/#{@rep_id}").when_present.click
fails because:
The content you are looking for is in the onclick attribute (rather than the text)
The locator is passed a string for the second parameter. This means that it is looking for something that exactly matches that. Given that you are only using part of the text/attribute, you need to use a regexp.
If you are using watir-webdriver, there is support for locating an element by its :onclick attribute. You can use a regexp to partially match the :onclick attribute.
browser.link(:onclick => /#{Regexp.escape("/media/remote/edit_source/#{@rep_id}")}/).when_present.click
If you are also using watir-classic (for IE testing), the above will not work. Instead, you can check the html of the link. Checking the html also works in watir-webdriver, but could be less robust than using :onclick.
browser.link(:html => /#{Regexp.escape("/media/remote/edit_source/#{@rep_id}")}/).when_present.click
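One thing to watch with a partial match: a pattern built from id 3 will also match ids such as 30 or 31. The sketch below uses plain strings in place of the real onclick attributes (no browser involved) to show the difference when the closing quote is added to the pattern. Whether the quote appears literally in the matched text depends on the driver and locator, so treat that detail as an assumption to verify.

```ruby
rep_id = 3 # stands in for @rep_id

# Sample onclick values for two different rows of the table.
onclick_for_3  = "new Ajax.Request('/media/remote/edit_source/3', {asynchronous:true, evalScripts:true}); return false;"
onclick_for_30 = "new Ajax.Request('/media/remote/edit_source/30', {asynchronous:true, evalScripts:true}); return false;"

path = "/media/remote/edit_source/#{rep_id}"

# Loose pattern: matches any id that merely starts with rep_id.
loose = /#{Regexp.escape(path)}/
# Stricter pattern: requires the closing quote right after the id.
strict = /#{Regexp.escape(path)}'/

puts loose.match?(onclick_for_30)   # also matches the row for id 30
puts strict.match?(onclick_for_30)  # rejects the row for id 30
puts strict.match?(onclick_for_3)   # still matches the intended row
```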
From your example, it looks like you are using the URL from the onclick event handler as a :text locator, which I'd expect to fail unless that text does exist.
You could potentially click on the img. Examples:
browser.image(:title, "Edit").click
browser.image(:src, "/media/images/edit.gif?1258500617").click
browser.image(:src, /edit\.gif\?\d{10}/).click # regex the src
Otherwise, you might need to use the fire_event method to trigger the event handler, which looks like this:
browser.link(:id, "foo").fire_event "onclick"
These are the links to the fire_event docs for watir and watir-webdriver for reference.

Groovy htmlunit getByXPath

I'm currently using HtmlUnit to attempt to grab an href out of a page and am having some trouble.
The XPath is:
/html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a
On the webpage it looks like:
<a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a>
In my code I am doing:
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
However, this is returning everything in there instead of just the url that I want.
Can someone explain what I must add to get the href? (also it doesn't end with .html)
You are selecting the a. You want to select the a/@href.
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']/@href")
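The difference between selecting the element and selecting its attribute is plain XPath and not specific to HtmlUnit. Here is a small sketch in Ruby using REXML on a made-up fragment, with a shorter relative path instead of the long absolute one. Note that an attribute XPath yields attribute nodes, so the value still has to be read off each one; in HtmlUnit you should likewise expect attribute objects rather than plain strings.

```ruby
require 'rexml/document'

# Made-up fragment containing the link from the question plus a decoy.
xml = <<~XML
  <span>
    <a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a>
    <a class="other" href="http://domain.com/ignored">Other</a>
  </span>
XML

doc = REXML::Document.new(xml)

# Selecting the element returns the whole <a> node.
elements = REXML::XPath.match(doc, "//a[@class='t']")

# Appending /@href selects the attribute node; read its value explicitly.
hrefs = REXML::XPath.match(doc, "//a[@class='t']/@href").map(&:value)

puts elements.first.text
puts hrefs.inspect
```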

Resources