Selecting with Xpath in Scrapy

I'm using Scrapy to scrape some data from a site, and I need help using XPath to select the "data" values from the following markup.
<span class="result_item"><span class="text3"><span class="header_text3">**data**</span><br />
**data**<br />
**data**</span> <span class="phone_button_out"><span class="phone_button" style="margin-top: 0"
onclick="pageTracker._trackEvent('USDSearch','Call Now!F');phone_win.open('name','**data**',27101650,0)">
Call Now!<br />
</span></span>
What statements can I use to select the necessary data? I hope this isn't a stupid question. If it is, please point me in the right direction.

There are multiple data elements to extract from the posted HTML. Assuming that <span class="result_item"> contains the items, you can try the following:
To get the header:
//span[@class='result_item']//span[@class='header_text3']/text()
To get the anchor link data:
//span[@class='result_item']/a/text()
Also, to help with XPath expressions, install the Firebug add-on in Firefox, then the FirePath add-on for Firebug. Pointing at an element gives you an autogenerated XPath (good for beginners, though it sometimes needs tuning).
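
As a rough illustration of the selection above, here is a minimal Python sketch that runs against a copy of the posted snippet. It uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) as a stand-in for Scrapy. The closing </span> for result_item and the sample values that replace **data** are assumptions made for the example. In a Scrapy callback the header lookup would typically be written as response.xpath("//span[@class='header_text3']/text()").get().

```python
import re
import xml.etree.ElementTree as ET

# Copy of the posted markup. The sample values and the final </span>
# that closes result_item are hypothetical, added so the snippet parses.
SNIPPET = """
<span class="result_item"><span class="text3"><span class="header_text3">Acme Ltd</span><br />
12 Main St<br />
Springfield</span> <span class="phone_button_out"><span class="phone_button" style="margin-top: 0"
onclick="pageTracker._trackEvent('USDSearch','Call Now!F');phone_win.open('name','555-0100',27101650,0)">
Call Now!<br />
</span></span></span>
"""

root = ET.fromstring(SNIPPET.strip())  # root is the result_item span

# Header: header_text3 is nested inside text3, so search all descendants.
header = root.find(".//span[@class='header_text3']").text

# Remaining lines: every text fragment inside text3, minus the header.
text3 = root.find("span[@class='text3']")
lines = [t.strip() for t in text3.itertext() if t.strip()]
address_lines = lines[1:]

# Phone number: it only appears inside the onclick attribute, so read the
# attribute and pull the second argument of phone_win.open() with a regex.
onclick = root.find(".//span[@class='phone_button']").get("onclick")
phone = re.search(r"phone_win\.open\('[^']*','([^']*)'", onclick).group(1)

print(header, address_lines, phone)
```

The phone value is the one piece that XPath alone cannot isolate here, since it is embedded in a JavaScript call; XPath can select the onclick attribute, and a regular expression does the rest.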

Related

Extract ratings with XPath

How to extract all the ratings(numbers) with XPath from the following page? Thank you.
Top 50 best films of 2018

If you search only by the class of the <span> that contains the rating, you will also match a lot of other items. So I selected the parent <div> by class and then the <span> by class. It seems to work fine.
//div[@class="ipl-rating-star small"]/span[@class="ipl-rating-star__rating"]/text()
Useful info:
XPath tutorial
How to use XPath in Chrome DevTools or else
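
To show why the parent <div> matters, here is a small Python sketch using the standard library's xml.etree.ElementTree. The markup is a hypothetical reduction of the page (the real structure may differ), with one extra widget that reuses the same <span> class.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two real ratings plus an unrelated widget that
# reuses the same span class under a differently classed div.
PAGE = """
<div>
  <div class="ipl-rating-star small">
    <span class="ipl-rating-star__rating">8.4</span>
  </div>
  <div class="ipl-rating-star ipl-rating-interactive">
    <span class="ipl-rating-star__rating">Rate</span>
  </div>
  <div class="ipl-rating-star small">
    <span class="ipl-rating-star__rating">7.9</span>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

# Matching on the span class alone also picks up the "Rate" widget.
by_span_only = [
    s.text for s in root.findall(".//span[@class='ipl-rating-star__rating']")
]

# Constraining the parent div first keeps only the numeric ratings.
ratings = [
    float(s.text)
    for s in root.findall(
        ".//div[@class='ipl-rating-star small']"
        "/span[@class='ipl-rating-star__rating']"
    )
]

print(by_span_only)
print(ratings)
```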

click on a image link with Capybara

I am trying to click on an image link in a Capybara / RSpec test, with very little success so far.
I am trying to select the link with href "/post/3" (knowing that there are other links before it). I have tried many XPath combinations without success. The only one that worked was
page.first(:xpath, "//a").click
however, once I changed the file and added more links above it, my Capybara test broke.
<div class='row'>
<img id="imagen3" src="/system/posts/images/000/000/003/original/frankie-mannings-102nd-birthday-5160522641047552-hp.gif?1464448829" alt="Frankie mannings 102nd birthday 5160522641047552 hp" />
<p>caption</p>
</div>
How can I select that link, and click it?

OK, I got it:
find(:xpath, "//a[contains(@href,'posts/3')]").click

//a[contains(@href, 'posts/3')]
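
One thing to keep in mind is that contains() is a substring test, so //a[contains(@href,'posts/3')] may also match links such as /posts/30. The Python sketch below illustrates the difference on hypothetical markup. xml.etree.ElementTree has no contains() function, so the substring test is emulated with Python's in operator, while the exact match uses a real attribute predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with three image links; the hrefs are made up.
PAGE = """
<div>
  <a href="/posts/1"><img src="one.gif" alt="one" /></a>
  <a href="/posts/3"><img src="three.gif" alt="three" /></a>
  <a href="/posts/30"><img src="thirty.gif" alt="thirty" /></a>
</div>
"""

root = ET.fromstring(PAGE)

# Emulates //a[contains(@href, 'posts/3')]: a plain substring check.
substring_hits = [
    a.get("href") for a in root.findall(".//a") if "posts/3" in a.get("href", "")
]

# Equivalent of //a[@href='/posts/3']: the attribute must match exactly.
exact_hits = [a.get("href") for a in root.findall(".//a[@href='/posts/3']")]

print(substring_hits)
print(exact_hits)
```

If the full href is known, the exact form //a[@href='/posts/3'] is the stricter choice.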

HtmlUnit - getTextContent()

I'm working with HtmlUnit, and I need to get the text content of an HtmlAnchor, but only its own text, without the text of any nested HTML tags.
<a class="subjectPrice" href="http://www.terra.es/?ca=28_s&st=a&c=4" title="Opel Zafira Tourer 2.0 Cdti 165 Cv Excellence 5p. -12">
<span class="old_price">32.679€</span>
24.395€
If I call htmlAnchor.getTextContent(), it returns 32.679€ 24.395€, but I only need 24.395€.
Can anybody help me? Thanks.

Just use XPath to get the appropriate DomText node. Evaluating ./text() relative to the HtmlAnchor should be enough, because it selects only the text nodes that are direct children of the anchor.
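
The difference between "all descendant text" and "direct text nodes only" can be sketched in Python with xml.etree.ElementTree. This is an illustration of the idea, not HtmlUnit code. The closing </a>, the escaped & characters in the href, and the omitted title attribute are adjustments assumed so the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# The anchor from the question, made well-formed for the XML parser.
ANCHOR = """<a class="subjectPrice" href="http://www.terra.es/?ca=28_s&amp;st=a&amp;c=4">
<span class="old_price">32.679€</span>
24.395€
</a>"""

a = ET.fromstring(ANCHOR)

# Like getTextContent(): every text fragment below the anchor.
all_text = " ".join("".join(a.itertext()).split())

# Like ./text(): only text that is a direct child of the anchor, i.e. the
# anchor's leading text plus the tail that follows each child element.
direct = [a.text or ""] + [child.tail or "" for child in a]
own_text = "".join(direct).strip()

print(all_text)
print(own_text)
```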

discover a certain part of a page with selenium

I have a webpage that looks something like this:
<html>
...
<div id="menu">
...
<ul id="listOfItems">
<!--- repeated block start -->
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title</span></span>
...
</li>
<!-- repeated block end-->
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title something</span></span>
...
</li>
<li id="item" class="itemClass">
...
<span class="spanClass"><span class="title">title other thing</span></span>
...
</li>
</ul>
...
</div>
...
</html>
I would like to know the XPath of the titles ("title", "title something", "title other thing"). The point is that the order of the <li> elements is not specified; it could be different after every page load. Is there any way to discover a certain structure of the page with XPath? I have an idea of how to solve this, but before I start writing iterations in C# to explore the page, I thought I would ask.
Thanks in advance!

First of all, ids should be unique, so the webpage as you describe it would not work well when it comes to testing.
I did test it, however, and got some XPath locators to work for selecting specific titles (although I recommend you fix your webpage instead of actually using these):
//li[@id='item']/span/span
//li[@id='item'][1]/span/span
//li[@id='item'][3]/span/span
If you're after all three titles, you could try Dimitre Novatchev's suggestion:
//span[@class='title']
This should get all titles on the page.
One more thing: if you're getting into Selenium, I recommend you download the Selenium IDE extension for Firefox. It's a great tool for beginners. It helps you build your Selenium tests by recording your clicks on a website, and it also helps you auto-generate and test your XPath locators and other locators.
And again: I urge you not to make a website with duplicate element ids :-)

Does Selenium support XPath expressions like:
//span[@class='title']
If yes, then use the above XPath expression. It selects every span element in the XML document whose class attribute has the string value "title".
I recommend using a tool like the XPath Visualizer to play with different XPath expressions and see the selected nodes highlighted in the source XML document.
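
A quick way to convince yourself that //span[@class='title'] does not depend on the order of the <li> elements is the Python sketch below. It uses xml.etree.ElementTree rather than Selenium, and the page() and titles() helpers are hypothetical names that rebuild a reduced version of the markup from the question.

```python
import xml.etree.ElementTree as ET


def page(*names):
    """Build a reduced copy of the question's page with the given titles."""
    items = "".join(
        '<li id="item" class="itemClass">'
        f'<span class="spanClass"><span class="title">{name}</span></span>'
        "</li>"
        for name in names
    )
    return f'<html><div id="menu"><ul id="listOfItems">{items}</ul></div></html>'


def titles(markup):
    """Return the text of every span whose class attribute is 'title'."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(".//span[@class='title']")]


first_load = titles(page("title", "title something", "title other thing"))
second_load = titles(page("title other thing", "title", "title something"))

print(first_load)
print(second_load)
```

Both loads yield the same set of titles; only the order of the returned list follows the document order, so compare them as sets (or sorted lists) if the order is not meaningful.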

xpath locator works in FF3, but won't work in IE7

After switching from Firefox testing to Internet Explorer testing, some elements could no longer be found by Selenium.
I tracked down one locator:
xpath=(//a[@class='someclass'])[2]
While it works as it should in Firefox, it cannot find this element in IE.
What alternatives do I have now? JS DOM? A CSS selector? What would this locator look like?
Update:
I will provide an example to make my point:
<ul>
<li>
<a class='someClass' href="http://www.google.com">BARF</a>
</li>
<li>
<a class='someClass' href="http://www.google.de">BARF2</a>
</li>
</ul>
<div>
<a class='someClass' href="http://www.google.ch">BARF3</a>
</div>
The following XPath won't work:
//a[@class='someclass'][2]
As I understand it, this should be the same as:
//a[@class='someclass' and position()=2]
and I don't have any links that are the second child of any node. All I want is to address one link from the set of links of class 'someClass'.

Without knowing the rest of your HTML source, it's difficult to give you alternatives that are guaranteed to work. Hopefully the following suggestions will help point you in the right direction:
//a[@class='someClass'][2] This is like your example, but without the parentheses.
//a[contains(@class, 'someClass')][2] This will work even if the link has other classes.
css=a.someClass:nth-child(2) This will only work if the link is the 2nd child element of its parent.
Update
Based on your update, try the following: //body/descendant::a[@class='someClass'][2]
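
The reason the position matters can be sketched in Python. In //a[...][2] the position is counted among the siblings matched at that step, whereas (//a[...])[2] and //body/descendant::a[...][2] count within one document-wide node set. xml.etree.ElementTree does not support parenthesized expressions or the descendant:: axis, so the sketch below emulates the document-wide index with ordinary list indexing. The <body> wrapper around the posted markup is an assumption.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a <body> element so it parses.
PAGE = """
<body>
  <ul>
    <li><a class="someClass" href="http://www.google.com">BARF</a></li>
    <li><a class="someClass" href="http://www.google.de">BARF2</a></li>
  </ul>
  <div><a class="someClass" href="http://www.google.ch">BARF3</a></div>
</body>
"""

root = ET.fromstring(PAGE)

# Per-parent position: "the second <a> among its siblings". Every link here
# is the only <a> under its parent, so nothing matches.
per_parent = root.findall(".//a[2]")

# Document-wide position: collect all matches in document order, then take
# the second one (index 1), which is what (//a[@class='someClass'])[2] means.
matches = root.findall(".//a[@class='someClass']")
second = matches[1]

print(len(per_parent))
print(second.text, second.get("href"))
```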
