How do I target this with nokogiri? - ruby

How do I target the following?
Every href attribute inside an li, inside a ul, inside the FIRST div with the class "listing".
There are many div.listing divs on the page, and each one contains roughly 1000 ul>li>a[href="http://whatever.com"] elements.
I want to fetch every http://whatever.com URL from the first div only.
I know I can use div[1] and I can use div[@class="listing"], but I need to find out how to combine them, and I also need to know how to fetch the attribute rather than the text.

//*[contains(concat( " ", @class, " " ), concat( " ", "listing", " " )) and (((count(preceding-sibling::*) + 1) = 1) and parent::*)]//a/@href
I got this by marking up your question in HTML and using SelectorGadget
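The same idea (take the first div.listing, then read the href attribute of each link below it) can be sketched in Python with only the standard library's xml.etree.ElementTree instead of Nokogiri. The sample markup and the example.com URLs are invented for illustration, and the sketch assumes the page parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: two div.listing blocks, each with its own links.
SAMPLE = """
<html><body>
  <div class="listing">
    <ul>
      <li><a href="http://example.com/a">A</a></li>
      <li><a href="http://example.com/b">B</a></li>
    </ul>
  </div>
  <div class="listing">
    <ul><li><a href="http://example.com/c">C</a></li></ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# find() returns the first match in document order, i.e. the first div.listing.
first_listing = root.find(".//div[@class='listing']")

# Read the attribute value (not the link text) from every ul > li > a inside it.
hrefs = [a.get("href") for a in first_listing.findall("./ul/li/a")]
print(hrefs)
```

In a full XPath 1.0 engine such as the one Nokogiri uses, a shorter expression along the lines of (//div[@class="listing"])[1]/ul/li/a/@href should express the same selection, assuming the class attribute is exactly "listing".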

Related

Xpath insert "," after every li

I have a piece of HTML and an XPath expression to extract data from it. The HTML is:
<div class="container-fluid">
<ul class="tags expandable">
<li><a class="search__link" href="domain.com">office</a></li>
<li><a class="search__link" href="domain.com">space</a></li>
</ul>
</div>
and the XPath is:
//ul[contains(concat (" ", normalize-space(@class), " "), " tags expandable ")]
This XPath returns the data like this: "office space"
but I want to insert "," after each li, so that the output looks like this: "office, space,"
You should use something like this with XPath 1.0:
translate(normalize-space(//ul)," ",",")
returns "office,space".
To add the trailing comma:
concat(translate(normalize-space(//ul)," ",","),",")
returns "office,space,".
EDIT: With the website link you posted, use this one-liner to get what you want:
translate(normalize-space(//div[@class="container-fluid"]/ul)," ",",")
Output:
People,Space,Women,Friends,Communication,Group,Support,Community,Beautiful,Unity,Gender,Movement,Gathering,Copy,Rights,Feminist,Empower,Supporting,Copy,space,Empowering
It still needs a correction: the unwanted "," between "Copy" and "space" has to be removed. If you need something fully automatic, and since you have to use XPath 1.0 (which has no replace function), you can try:
translate(normalize-space(translate(//div[@class="container-fluid"]/ul," ",""))," ",",")
Output ("Copy space" is now merged into "Copyspace"):
People,Space,Women,Friends,Communication,Group,Support,Community,Beautiful,Unity,Gender,Movement,Gathering,Copy,Rights,Feminist,Empower,Supporting,Copyspace,Empowering
Otherwise, just use:
//ul[@class="tags expandable"]//a/text()
and add the commas in the programming language of your choice.
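As a minimal sketch of that last suggestion, the snippet below selects the link texts and adds the commas in Python. It uses the standard library's xml.etree.ElementTree on the markup from the question, and assumes the fragment is well-formed XML; a real page would likely need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question.
HTML = """
<div class="container-fluid">
<ul class="tags expandable">
<li><a class="search__link" href="domain.com">office</a></li>
<li><a class="search__link" href="domain.com">space</a></li>
</ul>
</div>
"""

root = ET.fromstring(HTML)

# Text of each link under the ul whose class attribute is exactly "tags expandable".
texts = [a.text.strip() for a in root.findall(".//ul[@class='tags expandable']//a")]

# Put a comma after every item and separate the items with a space.
result = " ".join(t + "," for t in texts)
print(result)  # office, space,
```

Because the commas are added per element rather than per space, a multi-word tag such as "Copy space" would stay intact with this approach.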

Unsure why xpath query coming back empty

I'm new to scrapy and am experimenting with the shell, attempting to retrieve products from the following URL: https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7
The following is my code - I'm not sure why the final query is coming back empty:
$ scrapy shell
fetch("https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7")
div_product_lists = response.xpath('//div[@id="product-lists"]')
ul_product_list_main = div_product_lists.xpath('//ul[@id="product-list-main"]')
for li_tile in ul_product_list_main.xpath('//li[@class="tile"]'):
...     print(li_tile.xpath('//div[@class="product"]').extract())
...
[]
[]
If I inspect the page with a property inspector, I can see data for the div (with class product), so I'm not sure why this is coming back empty. Any help would be appreciated.
The problem here is that xpath doesn't understand that class="product product-tile " means "This element has 2 classes, product and product-tile".
In an xpath selector, the class attribute is just a string, like any other.
Knowing that, you can search for the entire class string (the leading . keeps the search relative to li_tile instead of starting from the document root):
>>> li_tile.xpath('.//div[@class="product product-tile "]')
[<Selector xpath='.//div[@class="product product-tile "]' data='<div class="product product-tile " tabin'>]
If you want to find all the elements that have the "product" class, the easiest way to do so is using a css selector:
>>> li_tile.css('div.product')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' product ')]" data='<div class="product product-tile " tabin'>]
As you can see by looking at the resulting Selector, this is a bit more complicated to achieve using only xpath.
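The difference between comparing the whole class string and matching a single class token can be illustrated outside Scrapy. The sketch below uses Python's standard xml.etree.ElementTree; the sample markup and the has_class helper are hypothetical and only loosely modelled on the page in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one div with two classes, one with a similar-looking class.
SAMPLE = """
<ul>
  <li class="tile"><div class="product product-tile ">Leadville v3</div></li>
  <li class="tile"><div class="product-top-spacer">spacer</div></li>
</ul>
"""


def has_class(element, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


root = ET.fromstring(SAMPLE)

# An exact string comparison finds nothing: the attribute value is
# "product product-tile ", not "product".
exact = root.findall(".//div[@class='product']")

# Token-based matching finds the div and ignores "product-top-spacer".
by_token = [d for d in root.iter("div") if has_class(d, "product")]

print(len(exact), [d.text for d in by_token])
```

Splitting on whitespace plays the same role as the concat/normalize-space/contains construction in the XPath that the CSS selector is translated to.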
The data you want to extract is more readily available in another div, the one with the class product-top-spacer.
For instance, you can get all divs with class="product-top-spacer" as follows:
ts = response.xpath('//div[@class="product-top-spacer"]')
and check the name and price of the item in the first extracted div:
ts[0].xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
>> 'Leadville v3'
ts[0].xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
>> '$260.00'
All items can be viewed by iterating over ts:
for t in ts:
    itname = t.xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
    itprice = t.xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
    itprice = ' '.join(itprice.split())  # collapse runs of whitespace
    print(itname + ", " + itprice)

Ruby/Selenium can't receive text

I'm using the "firepath" Firefox extension to test my xpaths.
Running this xpath:
driver.find_elements(:xpath, "//path/a[@class='anytext']").map{|el| el.text}
against this anchor:
<a class="anytext" href="/any/path/" title="Search for skill">text</a>
I received the text of all matching elements on the page as [string1, string2, ...]
With this xpath:
driver.find_elements(:xpath, "//path/a[@class='anytext and']").map{|el| el.text}
and this anchor:
<a class="anytext andmore" href="/any/path/" title="Search for skill" aria-describedby="tooltip">text</a>
I received an array [" ", " ", ...] without any text.
I understand that the problem has to do with "aria-describedby", but I don't know what to try next. I have tried different methods but I'm not getting what I need.
The element has multiple classes, so try combining contains() with the and operator in your XPath instead:
driver.find_elements(:xpath, "//path/a[contains(@class, 'anytext') and contains(@class, 'and')]").map{|el| el.text}
Alternatively, you could search with CSS selectors:
driver.find_elements(:css, "path a.anytext.and").map{|el| el.text}
I solved the problem using Firebug.
The elements are hidden, so their text has to be read through the 'innerHTML' or 'textContent' attribute,
e.g.:
driver.find_elements(:css, "path to elements").map{|el| el.attribute('textContent')}

Perform an Xpath descendant search in jsoup

Here is the XPath query I'm trying to convert to jsoup:
//div[@id='ad-display']/descendant-or-self::img[contains(@class, 'absmiddle')]/@src
I can't find any documentation that talks about descendants in jsoup. I know it talks about child elements, but apparently I'm not good enough to find the correlation between the two.
jsoup uses CSS selectors. Selecting a descendant in CSS is easy: put the descendant element after the ancestor, separated by a space.
Selecting by id is done with '#', and selecting by class with '.'.
Putting it all together:
Document doc = Jsoup.parse("<div id='ad-display'><div><div>" +
        "<img class='2'></img>" +
        "<img class='absmiddle' src='src1'></img>" +
        "<img class='dummy'></img>" +
        "<img class='absmiddle' src='src2'></img>" +
        "</div></div></div>");
Elements select = doc.select("div#ad-display img.absmiddle");
for (Element elem : select) {
    System.out.println(elem.attr("src"));
}
I added a mini-html as an example. Note imgs are inside a div inside a div inside the ancestor div (with id 'ad-display')
The output would be:
src1
src2
As expected.
I hope it will help.

Grab content of Javascript from head tag

I need to grab specific content from this JavaScript tag, which is in the <head> section of an HTML doc:
<script type="text/javascript">
var sbc = "<a href='http://www.test.com/Default.aspx' style='color:#e16a58;'>Home</a> / Men's Bikes";
</script>
Namely the 'Men's Bikes' text. Anyone know how I can do this?
I tried this, which gets me all the <script> tags:
/html/head/script[@type='text/javascript']
But I'm not sure how I can narrow it down to just that one, since there are numerous <script> tags in the <head>.
If you apply this XPath to your input XML:
/html/head/script[@type='text/javascript']/text()
we get
var sbc = " / Men's Bikes";
Using substring-after and substring-before, you can narrow the output down further (note that using a function call as the final path step, as done here, requires XPath 2.0 or later), e.g.
/html/head/script[@type='text/javascript']/substring-after(text()[last()], '/ ')
outputs:
Men's Bikes";
ultimately:
/html/head/script[@type='text/javascript']/substring-before(substring-after(text()[last()], '/ '), '";')
outputs
Men's Bikes
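If the XPath engine at hand only supports XPath 1.0, the same two-step trimming can be done on the extracted script text in the host language. The sketch below mirrors substring-after and substring-before in Python; the two helper functions are hypothetical names written for this example, and the script text is copied from the question.

```python
# Script text as it appears in the question.
SCRIPT_TEXT = (
    "var sbc = \"<a href='http://www.test.com/Default.aspx' "
    "style='color:#e16a58;'>Home</a> / Men's Bikes\";"
)


def substring_after(text, marker):
    """Text following the first occurrence of a non-empty marker, or '' if absent."""
    _, found, tail = text.partition(marker)
    return tail if found else ""


def substring_before(text, marker):
    """Text preceding the first occurrence of a non-empty marker, or '' if absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""


# Keep what follows "/ ", then cut at the closing quote and semicolon.
label = substring_before(substring_after(SCRIPT_TEXT, "/ "), '";')
print(label)  # Men's Bikes
```

This relies on "/ " (slash followed by a space) first appearing right before the wanted text, which holds for the sample but is an assumption about other pages.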
