Groovy HtmlUnit getByXPath - XPath

I'm currently using HtmlUnit to attempt to grab an href out of a page and am having some trouble.
The XPath is:
/html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a
On the webpage it looks like:
<a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a>
In my code I am doing:
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
However, this is returning the whole anchor element instead of just the URL that I want.
Can someone explain what I must add to get the href? (Also, it doesn't end with .html.)

You are selecting the a. You want to select the a/@href.
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']/@href")
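For illustration, here is a minimal sketch in Ruby/Nokogiri (not HtmlUnit, but the XPath semantics are the same; the markup is the anchor from the question) showing the difference between selecting the element and selecting its @href attribute:
require 'nokogiri'
html = '<span><a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a></span>'
doc = Nokogiri::HTML(html)
doc.xpath("//a[@class='t']").each { |a| puts a.to_html }            # the whole <a> element
doc.xpath("//a[@class='t']/@href").each { |attr| puts attr.value }  # just the href value
In HtmlUnit the /@href form should likewise give you attribute objects rather than anchor elements, so you read their value instead of the whole tag.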

Related

Web scraping from youtube with nokogiri

I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(open(url))
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
name = comment.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
But it's not working: I'm not getting any output, and no error either.
I won't be able to give you a solution, but at least I can give you a couple of hints that may help you to move forward.
The code you have is not working because the comments section is loaded via an ajax call after the page is loaded. If you do a hard reload in your browser, you will see a spinner icon and a Loading... text in the comments section while the content is being fetched. When Nokogiri gets the page via the HTTP request, it gets the HTML content that you see before the comments are loaded. As a matter of fact, the place where the contents will later be added looks like:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is why you won't find the divs you are looking for: they aren't part of the HTML you have.
Looking at the network console in the browser, it seems that the ajax request to get the comments data is being sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see, the v parameter is the video id; however, there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form of
'COMMENTS_TOKEN': "<token>".
However, you still need to send a session_token as form data in the body of the AJAX request (which is a POST). Where that comes from, I don't know :(.
I think you will be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow ajax requests or handle JavaScript. Maybe the Ruby Selenium driver is better suited for this.
HTH
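If you do go the Selenium route, here is a minimal sketch of how it could be wired together (it assumes the selenium-webdriver gem with a local Firefox, uses a crude sleep instead of a proper wait, and reuses the selectors from the question, which may no longer match YouTube's markup):
require 'nokogiri'
require 'selenium-webdriver'
driver = Selenium::WebDriver.for :firefox
driver.navigate.to "https://www.youtube.com/watch?v=tntOCGkgt98"
sleep 5 # crude wait for the ajax-loaded comments; a real script would scroll and poll
doc = Nokogiri::HTML(driver.page_source)
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
  puts comment.css("#comment-section-renderer-items .g-hovercard").text
end
driver.quit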
I think you need name.css("#comment-section..."
The each statement will iterate over the elements, using the variable name.
You may want to use node instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
I wrote this rails app using nokogiri to see all the tags that a page has before any javascript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
That can easily tell you if the particular tag element that you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that it's not a trivial task to execute JS when scraping content.
YouTube is a dynamically rendered JavaScript website, though it can be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scroll to the comment section, and see what request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the output in the "Preview" tab.
(Screenshots of the preview output and the matching on-page comment were attached here.)
Note: since this comment brings very little value, this answer will be updated with the attached code once a working solution is available.

XPATH - Ignore 1 page element, grab the rest

I can't seem to figure this out due to the square brackets issue. I have an HTML page full of h3 tags with hrefs that I need to grab, but one has a class on it that I don't want. Example:
I want all H3s hrefs but not this one:
<h3 class="leave_this">Leave me alone!
To grab all the hrefs I am using this:
//h3/a/@href
Tried a few variations but no luck.
REMOVED CONFUSING EXAMPLE, APOLOGIES
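One way to express this, sketched here with Nokogiri (page_html is a stand-in for however you fetch the page; the class name is the one from the question), is to filter the h3 elements with a not() predicate before stepping down to the href:
require 'nokogiri'
doc = Nokogiri::HTML(page_html) # page_html holds the fetched page source
doc.xpath("//h3[not(contains(@class, 'leave_this'))]/a/@href").each do |href|
  puts href.value
end
If the unwanted h3 always carries exactly that class, not(@class='leave_this') works as well; contains() is just more tolerant of extra classes.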

Ruby Watir - how to select <a onclick="new Ajax.Request

Hi, I'm trying to select an edit button and I am having difficulty selecting it.
<td>
<a onclick="new Ajax.Request('/media/remote/edit_source/3', {asynchronous:true, evalScripts:true}); return false;" href="#">
<img title="Edit" src="/media/images/edit.gif?1258500617" alt="Edit">
</a>
The number at the end ('/media/remote/edit_source/3') changes, and I have stored it in the @rep_id variable.
I can't use xpath because the table changes often. Any suggestions? Any help is greatly appreciated. Below is what I have tried and fails. I am fairly new to watir and love it, but occasionally I run into things like this and get stumped.
browser.a(:text, "/media/remote/edit_source/#{@rep_id}").when_present.click
The line:
browser.a(:text, "/media/remote/edit_source/#{@rep_id}").when_present.click
fails because:
The content you are looking for is in the onclick attribute (rather than the text)
The locator is passed a string for the second parameter. This means that it is looking for something that exactly matches that. Given that you are only using part of the text/attribute, you need to use a regexp.
If you are using watir-webdriver, there is support for locating an element by its :onclick attribute. You can use a regexp to partially match the :onclick attribute.
browser.link(:onclick => /#{Regexp.escape("/media/remote/edit_source/#{@rep_id}")}/).when_present.click
If you are also using watir-classic (for IE testing), the above will not work. Instead, you can check the html of the link. Checking the html also works in watir-webdriver, but could be less robust than using :onclick.
browser.link(:html => /#{Regexp.escape("/media/remote/edit_source/#{@rep_id}")}/).when_present.click
From your example, it looks like you are using the URL from the onclick event handler as a :text locator, which I'd expect to fail unless that text does exist.
You could potentially click on the img. Examples:
browser.image(:title, "Edit").click
browser.image(:src, "/media/images/edit.gif?1258500617").click
browser.image(:src, /edit\.gif\?\d{10}/).click # regex the src
Otherwise, you might need to use the fire_event method to trigger the event handler, which looks like this:
browser.link(:id, "foo").fire_event "onclick"
See the fire_event docs for watir and watir-webdriver for reference.
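Putting the two suggestions together (a watir-webdriver sketch, assuming @rep_id holds the record id as in the question), you could locate the link by its onclick content and fire the handler directly:
edit_link = browser.link(:onclick => /#{Regexp.escape("/media/remote/edit_source/#{@rep_id}")}/)
edit_link.when_present.fire_event("onclick")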

How does HtmlAgilityPack extract text from an HTML node whose class attribute is appended dynamically

Dear friends, I want to extract the text 平均3.6 星 from this code segment excerpted from amazon.cn.
<div class="content"><ul>
<li><b>用户评分:</b>
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" ref="dp_db_cm_cr_acr_pop_" name="B004GUSIKO">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
My question is that the span class value "s_star_3_5 " varies with each customer's rating level and is appended dynamically. So I attempted to use doc.DocumentNode.SelectSingleNode("//span[@class='swSprite']").InnerText or //span[@class='swSprite s_star_3_5 '], but the result is either an error or not what I want!
Any suggestions?
First of all, I suggest saving the value of doc.DocumentNode.OuterHtml to a local .html file and checking whether the code you're obtaining is that code. The thing is, sometimes you start parsing a website using HtmlAgilityPack, but the very first problem is that you're not getting valid HTML back. Maybe you're getting a 404 error, or a redirection, etc.
I'm suggesting this because I tested //span[@class='swSprite s_star_3_5 '] and it worked correctly.
That was the issue in the following questions:
Selecting nodes that have an attribute with spaces using HTMLAgilityPack
XPath Query Problem using HTML Agility Pack
If that doesn't help, post the HTML code and I'll help you ;)
This works for me:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtml);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'swSprite')]");
Console.WriteLine("Text=" + node.InnerText.Trim());
and outputs
平均3.6 星
Note that I use the XPath starts-with function.
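If the swSprite token might not always appear first in the class attribute, a contains() test on @class is a common, slightly looser alternative to starts-with:
//span[contains(@class, 'swSprite')]
Either way you select the outer span, and InnerText then gives you the nested 平均3.6 星 text.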

selenium cannot find element with class in IE

I'm using selenium_client with cucumber, webrat + IE
As you'd expect, Firefox works fine. I've tried the following:
selenium.is_visible("css=#flash .flash_notice")
selenium.is_visible("xpath=//*[#id='flash']/*[#class='flash_notice]")
selenium.is_visible("xpath=//*[#id='flash']/*[contains(#class,'flash_notice]')")
None of these can find the element.
I think it must be something to do with IE. Looking closer at the HTML Selenium returns from IE, it looks like this:
<UL id=flash>
<LI className=flash_notice>Deleted</LI>
</UL>
Notice that IE returns the class attribute as className. Is this confusing Selenium? How can I get around this so that I can use the same statement for Selenium with both IE and Firefox?
Just to confuse us even more, this example works, confirming it's something to do with checking the class attribute:
selenium.is_visible("xpath=//*[#id='flash']/*[. =\'Deleted\']")
It appears that your XPATH expressions are malformed.
The first XPATH is missing the single quote ' at the end of flash_notice.
It should be:
selenium.is_visible("xpath=//*[#id='flash']/*[#class='flash_notice']")
The second XPATH has the ' ] and ) out of order, which messes up the expression.
It should be:
selenium.is_visible("xpath=//*[#id='flash']/*[contains(#class,'flash_notice')]")
