Perform an XPath descendant search in jsoup

Here is the XPath query I'm trying to convert to jsoup:
//div[@id='ad-display']/descendant-or-self::img[contains(@class, 'absmiddle')]/@src
I can't find any documentation that talks about descendants in jsoup. I know it talks about child elements, but apparently I'm not good enough to find the correlation between the two.

jsoup uses CSS selectors, and selecting a descendant in CSS is easy: put the descendant element after the ancestor, separated by a space.
Selecting by id is done with '#', and selecting by class with '.'.
Putting it all together:
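import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Sample HTML: the imgs sit two divs below the ancestor div with id 'ad-display'.
Document doc = Jsoup.parse("<div id='ad-display'><div><div>" +
        "<img class='2'></img>" +
        "<img class='absmiddle' src='src1'></img>" +
        "<img class='dummy'></img>" +
        "<img class='absmiddle' src='src2'></img>" +
        "</div></div></div>");
// "ancestor descendant": every img with class absmiddle at any depth under div#ad-display.
Elements select = doc.select("div#ad-display img.absmiddle");
for (Element elem : select) {
    System.out.println(elem.attr("src"));
}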
I added a mini HTML document as an example. Note that the imgs are inside a div, inside another div, inside the ancestor div (the one with id 'ad-display').
The output is:
src1
src2
As expected.
I hope it will help.

Related

XPath - Nested path scraping

I'm trying to scrape a web page. I'd like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath:
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation:
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector so that it references a child element with the same name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child element <a> whose class attribute value equals 'cover-wrapper'
/img/@alt : then from each such <a> element, find the child element <img> and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
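As a minimal sketch of the path above in Python: the question's HTML is not reproduced here, so the markup below is a hypothetical stand-in that only assumes the div > a > img nesting described in the explanation (the alt and src values are made up). It uses the standard library's xml.etree.ElementTree rather than lxml. ElementTree supports only a subset of XPath and cannot return attributes directly, so the alt values are read with get() after the img elements are selected.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the div > a > img nesting comes from the answer.
SAMPLE = """
<body>
  <div class="showcase-wrapper" id="slide-1">
    <a class="cover-wrapper" href="#"><img alt="first cover" src="1.png"/></a>
  </div>
  <div class="showcase-wrapper" id="slide-2">
    <a class="cover-wrapper" href="#"><img alt="second cover" src="2.png"/></a>
  </div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# Counterpart of //div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
imgs = root.findall(".//div[@id='slide-1']/a[@class='cover-wrapper']/img")
slide_1_alts = [img.get("alt") for img in imgs]

# Counterpart of //a[@class='cover-wrapper']/img/@alt (every slide)
all_alts = [
    img.get("alt")
    for img in root.findall(".//a[@class='cover-wrapper']/img")
]

print(slide_1_alts)
print(all_alts)
```

With lxml, as in the question, the full expressions ending in /@alt can be passed to tree.xpath() directly.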
Try this:
//a[@class="cover-wrapper"]/img/@alt
Here I first select the a node whose class is cover-wrapper, then the img node, and then the alt attribute of that img.
To find the whole image element :
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

How to find an element that has a specific text without other inside elements

I have an HTML string like this.
html = '<div>outer<div>inner</div></div>'
I want to get text from only inside of a div element.
doc = Nokogiri::HTML(html)
doc.xpath('//div[contains(.,"inner")]')
But this code returns not only the inner element but also the outer element, because the outer element also contains the text inner.
How can I find an element that contains a specific text directly, rather than through a nested tag?
I can easily get the inner element in this case with doc.css('div > div'), but in the real case I don't know how many div tags exist. The inner text may also include more than just inner, for example:
html = '<div>outer<div>inner text</div></div>'

Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search engine using HtmlAgilityPack and XPath with C# (.NET 4).
I want to find every node containing a user-defined search word, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to get me in the right direction would be very appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
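To illustrate what "has a text-node child which contains the string" means, here is a rough Python sketch. It does not run the XPath itself (the standard library's ElementTree has no text() or contains() support). Instead it emulates the expression by hand on the question's sample document, and contrasts it with matching on the whole string value of each element. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<HTML><BODY>"
    "<H1>Mr T for president</H1>"
    "<div>We believe the new president should be</div>"
    "<div>the awsome Mr T</div>"
    "<div><H2>Mr T replies:</H2>"
    "<p>I pity the fool who doesn't vote</p>"
    "<p>for Mr T</p></div>"
    "</BODY></HTML>"
)


def own_text_nodes(elem):
    """Yield the text nodes that are direct children of elem."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tail:
            yield child.tail


def elements_with_text(root, word):
    """Emulate //*[text()[contains(., word)]] in document order."""
    return [
        e for e in root.iter()
        if any(word in t for t in own_text_nodes(e))
    ]


root = ET.fromstring(DOC)

# Only elements that hold the word in one of their own text nodes.
matches = [e.tag for e in elements_with_text(root, "Mr T")]
print(matches)  # ['H1', 'div', 'H2', 'p']

# Matching on the full string value also drags in every ancestor.
by_string_value = [
    e.tag for e in root.iter() if "Mr T" in "".join(e.itertext())
]
print(by_string_value)  # ['HTML', 'BODY', 'H1', 'div', 'div', 'H2', 'p']
```

The first list is the set of nodes the question asks for: the H1, the second div, the H2 and the second p.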
In XPath, to find elements containing a specific keyword, use the following form (where "keyword" is the word you want to search for):
//*[text()[contains(., 'keyword')]]
The same form applies in C#, where keyword is a string variable:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword.
Case-insensitive solution:
var xpathForFindText =
"//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result=doc.DocumentNode.SelectNodes(xpathForFindText);
Note:
Be careful: lowerFocusKwd must not contain the single-quote character ('), because it would end the string literal early and make the XPath malformed.
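One common workaround for the quote problem is to build the XPath string literal with a small helper. XPath 1.0 has no escape character inside string literals, so the sketch below picks whichever quote style the keyword does not contain, and falls back to concat() when it contains both. It is written in Python for brevity. The function name xpath_literal is hypothetical, and the same logic would have to be ported to C# for use with HtmlAgilityPack.

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal, whatever quotes it holds."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote kinds present: split on ' and glue the pieces back
    # together with concat(), passing each ' as a double-quoted literal.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


keyword = "fool's"
expr = "//*[text()[contains(., " + xpath_literal(keyword) + ")]]"
print(expr)  # //*[text()[contains(., "fool's")]]
```

The helper only protects the string literal; the rest of the expression is still assembled by plain concatenation.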

use YQL with substring-before in xpath

I am trying to get the string before '--' within a paragraph of an HTML page using XPath, and send the query through YQL.
For example, I want to get the date from the following article:
<div>
<p>Date --- the body of the article</p>
</div>
I tried this query in yql:
select * from html where url="article url" and xpath="//div/p/text()/[substring-before(.,'--')]"
but it does not work.
How can I get the date of the article, which comes before the '--'?
You can simply use:
substring-before(//div/p,'--')
Use:
substring-before(/div/p/text(), '--')
This XPath expression evaluates to the string immediately preceding '--' in the first text node in the XML document, that is a child of a p that is a child of the div top element.
In case you want to get this value for every such text node, you have to use an expression like:
substring-before((//div/p/text())[$k], '--')
and evaluate this expression $N times, for $k = 1,2, ..., $N
where $N is count(//div/p/text())
Note: avoid the // XPath pseudo-operator whenever the structure of the XML document is statically known. Using // usually results in significant inefficiency (O(N^2)), which is especially painful on big XML documents.
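The "evaluate once per node" idea can be sketched outside XPath as well. The Python below is an emulation under stated assumptions, not a YQL query: it uses the standard library's ElementTree, a hypothetical second paragraph added to the sample, and a hand-written substring_before helper that mirrors the XPath function (an empty string when the separator is absent).

```python
import xml.etree.ElementTree as ET


def substring_before(text, sep):
    """Mirror XPath substring-before(): '' when sep does not occur."""
    head, found, _ = text.partition(sep)
    return head if found else ""


root = ET.fromstring(
    "<div>"
    "<p>Date --- the body of the article</p>"
    "<p>no separator here</p>"  # hypothetical extra paragraph
    "</div>"
)

# One result per p child of the top-level div, instead of only the first.
dates = [substring_before(p.text or "", "--") for p in root.findall("p")]
print(dates)  # ['Date ', '']
```

Note the trailing space in 'Date ': as in XPath, everything before the separator is returned untrimmed, so normalize-space() (or strip() here) may be wanted on top.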

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
You can use CSS selectors:
elements = @doc.css "[blah]"
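The attribute-existence predicate is not specific to Nokogiri. As a rough cross-check, here is the same [@blah] test in Python with the standard library's ElementTree, which supports this part of XPath. The document is the one from the question, rewritten as a single well-formed XML string.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<body>"
    '<h1 blah="afadf">Three\'s Company</h1>'
    "<div>A love triangle.</div>"
    '<b blah="adfadf">test test test</b>'
    "</body>"
)

# [@blah] keeps elements that have a blah attribute, whatever its value.
with_blah = root.findall(".//*[@blah]")
print([(e.tag, e.get("blah")) for e in with_blah])
# [('h1', 'afadf'), ('b', 'adfadf')]
```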

Resources