I'm new to scrapy and am experimenting with the shell, attempting to retrieve products from the following URL: https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7
The following is my code - I'm not sure why the final query is coming back empty:
$ scrapy shell
>>> fetch("https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7")
>>> div_product_lists = response.xpath('//div[@id="product-lists"]')
>>> ul_product_list_main = div_product_lists.xpath('//ul[@id="product-list-main"]')
>>> for li_tile in ul_product_list_main.xpath('//li[@class="tile"]'):
...     print(li_tile.xpath('//div[@class="product"]').extract())
...
[]
[]
If I inspect the page using a property inspector, then I am seeing data for the div (with class product), so am not sure why this is coming back empty. Any help would be appreciated.
The problem here is that xpath doesn't understand that class="product product-tile " means "This element has 2 classes, product and product-tile".
In an xpath selector, the class attribute is just a string, like any other.
Knowing that, you can search for the entire class string:
>>> li_tile.xpath('.//div[@class="product product-tile "]')
[<Selector xpath='.//div[@class="product product-tile "]' data='<div class="product product-tile " tabin'>]
If you want to find all the elements that have the "product" class, the easiest way to do so is using a css selector:
>>> li_tile.css('div.product')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' product ')]" data='<div class="product product-tile " tabin'>]
As you can see by looking at the resulting Selector, this is a bit more complicated to achieve using only xpath.
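To make the difference concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (an assumption made so the example runs without Scrapy; the fragment is a made-up simplification of the page). It contrasts an exact match on the whole class string with a match on a single class token.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one product tile on the page.
FRAGMENT = """
<ul id="product-list-main">
  <li class="tile">
    <div class="product product-tile " tabindex="0">Leadville v3</div>
  </li>
</ul>
"""

root = ET.fromstring(FRAGMENT.strip())
li_tile = root.find("./li[@class='tile']")

# Exact string comparison: "product" != "product product-tile ", so no match.
exact = li_tile.findall(".//div[@class='product']")

# Token comparison: split the attribute on whitespace and test membership,
# which is roughly what the CSS selector div.product does.
by_token = [
    div for div in li_tile.iter("div")
    if "product" in div.get("class", "").split()
]

print(len(exact))     # number of exact-string matches
print(len(by_token))  # number of token matches
```

Note the leading `.` in the relative query: it keeps the search inside `li_tile` instead of restarting from the document root.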
The data you want to extract is more readily available in another div, the one with class product-top-spacer.
For instance, you can get all divs having class="product-top-spacer" with the following:
ts = response.xpath('//div[@class="product-top-spacer"]')
Then check the name and price of the first extracted div:
ts[0].xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
>> 'Leadville v3'
ts[0].xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
>> '$260.00'
All items can be viewed by iterating over ts:
for t in ts:
    itname = t.xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
    itprice = t.xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
    itprice = ' '.join(itprice.split())  # collapse runs of whitespace
    print(itname + ", " + itprice)
<div class="col-sm-3">
<span>Annuitant:</span>
</div>
<div class="col-sm-3">
<span id="annuitant">
RPD
</span>
</div>
XPath code that I used previously:
findXpath=page.find('label', text: workbook.cell(j,k), :match => :prefer_exact).path
splitXpath=(findXpath.split("/")) #splitting xpath
##Xpath manipulation to get the xpath of "RPD"
count1=splitXpath.count
value1=splitXpath.at(count1-3)
value=splitXpath.at(count1-2)
labelNum=value1.match(/(\d+)/)
i=0
elementNum=labelNum[1].to_i+1
for maxnum in 1..splitXpath.count-4
  elementXpath=elementXpath + "/" + splitXpath[maxnum]
end
elementXpath=elementXpath + "/div[" + elementNum.to_s + "]" + "/"+ value
elementXpath=elementXpath + "/" + splitXpath.at(count1-1)
finalElementXpath=elementXpath.sub("label","span")# obtained the xpath of RPD
if (workbook.cell(j+1,k) == (find(:xpath, finalElementXpath).native.text)) # verifying the value RPD is present
Can I use the parent class to verify that "Annuitant" is present and also that its value is "RPD"? Please help me write the code for this in Ruby with Capybara.
Use assert_selector to check if the selector has the text you want. See below:
page.assert_selector('#annuitant', :text => 'RPD', :visible => true)
You can scope Capybara's finders/matchers to any element by either calling them on an element or using within(element) ...
In this case you'd want to scope at least one level higher in your HTML document, so that both elements you are interested in are contained by the element you're scoping to. The class 'col-sm-3' would be a bad choice because it is not going to be unique to these elements. It also comes down to how rigorous your check needs to be: do you actually need to check the structure of the elements, or do you just need to verify that the texts appear next to each other on the page? If the latter, something like this would do:
element = find('<selector for parent/grandparent of both elements>') # could also just be `page` if the text is unique
expect(element).to have_text('Annuitant: RPD')
If you do actually need to verify the structure, things get more complicated and you would need to use XPath:
expect(element).to have_selector(:xpath, './/div[./span[text()="Annuitant:"]]/following-sibling::div[1][./span[normalize-space(text())="RPD"]]')
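The logic of that structural check can also be illustrated outside Capybara. The sketch below uses Python's standard-library xml.etree.ElementTree purely for illustration (it is not Capybara, and the wrapping div with class "row" is an assumption, since the question does not show the parent element). The helper name value_next_to is hypothetical. It finds the column holding the label and returns the text of the span in the column that immediately follows it.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a hypothetical parent element.
HTML = """
<div class="row">
  <div class="col-sm-3">
    <span>Annuitant:</span>
  </div>
  <div class="col-sm-3">
    <span id="annuitant">
      RPD
    </span>
  </div>
</div>
"""


def value_next_to(parent, label):
    """Return the stripped span text of the div that follows the div labelled `label`."""
    columns = parent.findall("./div")
    for index, column in enumerate(columns[:-1]):
        span = column.find("./span")
        if span is not None and (span.text or "").strip() == label:
            next_span = columns[index + 1].find("./span")
            if next_span is not None:
                return (next_span.text or "").strip()
    return None


row = ET.fromstring(HTML.strip())
print(value_next_to(row, "Annuitant:") == "RPD")
```

Stripping the text before comparing plays the same role as normalize-space() in the XPath expression, since the value span contains surrounding whitespace.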
Here is the XPath query I'm trying to convert to jsoup:
//div[@id='ad-display']/descendant-or-self::img[contains(@class, 'absmiddle')]/@src
I can't find any documentation that talks about descendants in jsoup. I know it talks about child elements, but apparently I'm not good enough to find the correlation between the two.
jsoup uses CSS selectors. Selecting a descendant in CSS is easy: put the descendant element after the ancestor, separated by a space.
Selecting by id is done with '#', and selecting by class with '.'.
Putting all together:
Document doc = Jsoup.parse("<div id='ad-display'><div><div>" +
"<img class='2'></img>" +
"<img class='absmiddle' src='src1'></img>" +
"<img class='dummy'></img>" +
"<img class='absmiddle' src='src2'></img>" +
"</div></div></div>");
Elements select = doc.select("div#ad-display img.absmiddle");
for (Element elem : select) {
    System.out.println(elem.attr("src"));
}
I added a mini HTML document as an example. Note that the imgs are inside a div, inside a div, inside the ancestor div (the one with id 'ad-display').
The output would be:
src1
src2
As expected.
I hope it will help.
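The same ancestor/descendant idea can be sketched in Python with the standard-library xml.etree.ElementTree rather than jsoup (an assumption made only to keep the example dependency-free). Here `.//img` plays the role of the space in the CSS selector: it matches img elements at any depth below the div.

```python
import xml.etree.ElementTree as ET

# Same mini-HTML as in the Java example, wrapped in a <body> so the div is not the root.
HTML = (
    "<body><div id='ad-display'><div><div>"
    "<img class='2'></img>"
    "<img class='absmiddle' src='src1'></img>"
    "<img class='dummy'></img>"
    "<img class='absmiddle' src='src2'></img>"
    "</div></div></div></body>"
)

body = ET.fromstring(HTML)
ad_display = body.find(".//div[@id='ad-display']")

# Descendant search at any depth, then filter on the class token.
sources = [
    img.get("src")
    for img in ad_display.findall(".//img")
    if "absmiddle" in img.get("class", "").split()
]

print(sources)
```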
I'd like to use xquery (I believe) to output the text from the title attribute of an html element.
Example:
<div class="rating" title="1.0 stars">...</div>
I can use xpath to select the element, but it tries to output the info between the div tags. I think I need to use xquery to output the "1.0 stars" text from the title attribute.
There's gotta be a way to do this. My Google skills are proving ineffective in coming up with an answer.
Thanks.
XPath: //div[@class='rating']/@title
This will give you the title text for every div with a class of "rating".
Addendum (following from comments below):
If the class has other, additional text in it, in addition to "rating", then you can use something like this:
//div[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]/@title
(Hat tip to How can I match on an attribute that contains a certain string?).
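For a quick check of the attribute-versus-content distinction, here is a small Python sketch using the standard-library xml.etree.ElementTree. Its XPath subset does not support an attribute step as the final part of the path, so the sketch selects the elements and reads the attribute with get(). The sample markup, including the "rating large" and "ratings-summary" classes, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one plain rating, one with an extra class, one look-alike.
HTML = """
<body>
  <div class="rating" title="1.0 stars">...</div>
  <div class="rating large" title="2.0 stars">...</div>
  <div class="ratings-summary" title="not a rating">...</div>
</body>
"""

body = ET.fromstring(HTML.strip())

# Exact class string: only the div whose class is exactly "rating".
exact_titles = [div.get("title") for div in body.findall(".//div[@class='rating']")]

# Whole-token match: any div whose class list contains the token "rating".
token_titles = [
    div.get("title")
    for div in body.iter("div")
    if "rating" in div.get("class", "").split()
]

print(exact_titles)
print(token_titles)
```

The token match picks up "rating large" but not "ratings-summary", which is what padding the class with spaces achieves in the XPath version.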
You should use:
let $XML := <p><div class="rating" title="2.0 stars">sdfd</div><div class="rating" title="1.0 stars">sdfd</div></p>
for $title in $XML//@title
return
    <p>{data($title)}</p>
to get output:
<p>2.0 stars</p>
<p>1.0 stars</p>
I'm using the HtmlAgilityPack to parse links in an html file. The links look like this:
<h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
where
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
};
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and url separately and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could lead me in the right direction.
Thanks,
Miles
How do I target the following...
Every href attribute, inside a li, inside a ul, inside the FIRST div with the class of "listing".
There are many div.listing divs on the page, and inside each there are like 1000 ul>li>a[href="http://whatever.com"]
I want to fetch all the http://whatever.com in only the first div.
I know I can use div[1] and I can use div[@class="listing"], but really I need to find out how to combine them, and I also need to know how to fetch the attribute and not the text.
//*[contains(concat( " ", @class, " " ), concat( " ", "listing", " " )) and (((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a/@href
I got this by marking up your question in HTML and using SelectorGadget
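Note that the generated expression tests whether the element is the first child of its parent, so it matches the first div.listing only when that div also happens to be a first child. A commonly used alternative is (//div[@class="listing"])[1]//ul/li/a/@href, where the parentheses apply [1] to the whole result set. The Python sketch below shows the same "take the first match, then descend" idea with the standard-library xml.etree.ElementTree; the markup and URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up page: the first div.listing is NOT the first child of its parent.
HTML = """
<body>
  <h1>Listings</h1>
  <div class="listing">
    <ul>
      <li><a href="http://whatever.com/a">A</a></li>
      <li><a href="http://whatever.com/b">B</a></li>
    </ul>
  </div>
  <div class="listing">
    <ul>
      <li><a href="http://whatever.com/c">C</a></li>
    </ul>
  </div>
</body>
"""

body = ET.fromstring(HTML.strip())

# find() returns the first match in document order, wherever it sits among its siblings.
first_listing = body.find(".//div[@class='listing']")

# Read the href attribute (not the link text) of every ul > li > a inside it.
hrefs = [a.get("href") for a in first_listing.findall(".//ul/li/a")]

print(hrefs)
```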