How to extract inner text of multiple Paragraph tags which are nested within an anchor tag - xpath

Here is the code:
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
Using XPath, I want to extract the inner text of all the paragraph tags contained inside the anchor tag as a single text entity.
The final result I want is
string letterBody = document.DocumentNode.SelectSingleNode("//XPATH QUERY").InnerText;
where letterBody = "Dear Sir, This is with...................Regards."

You just need to select the <a> element; its inner text includes all the text nodes beneath it.
So your XPath would be //a[@id='Letter1'], or just //a if it is the only anchor in the document.
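As a rough illustration of why selecting the anchor is enough, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree as a stand-in for the question's C# parser. The wrapping <div> and the variable names are assumptions made for the example; the whitespace is collapsed in Python, not by XPath.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <div> (an assumption)
# so the anchor is a descendant of the parsed root.
html = """<div>
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
</div>"""

root = ET.fromstring(html)

# Select only the anchor; no need to address each <p> separately.
anchor = root.find(".//a[@id='Letter1']")

# itertext() walks every text node under the anchor, in document order.
raw_text = "".join(anchor.itertext())

# Collapse runs of whitespace (newlines between paragraphs) into single spaces.
letter_body = " ".join(raw_text.split())
print(letter_body)
```

The same idea applies to the original parser: the inner text of one element already contains the text of all its descendants.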

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
if omitting the condition normalize-space() = "Text" is a clue.
normalize-space() returns ... Text Here ... for the first li
in the posted XML and ... Text ... for the second (or some other
content in place of ... from div/svg or div/small) causing
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
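The explanation above can be checked with a small sketch. It uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath, so normalize-space() is approximated by a hypothetical helper function written in Python; the helper name and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

markup = """<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>"""


def normalize_space(element):
    """Approximate XPath normalize-space() on an element's string value."""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(markup)
items = root.findall(".//li[@data-testid='menu-item']")

# The string value of each <li> includes the svg and small text as well,
# so neither value is exactly "Text".
li_values = [normalize_space(li) for li in items]
print(li_values)

# Testing the descendant <strong> instead gives an exact match.
matches = [li for li in items if normalize_space(li.find("div/strong")) == "Text"]
print(len(matches))
```

Only the second li survives the filter, which mirrors the behaviour of the div/strong predicate.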
data-testid="menu-item" matches both of the outer li elements, while the text content you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and its inner strong element's text value, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested, both expressions are working correctly

How can I xpath target text() "only" directly under a html tag, instead of the text contained under "other html child tags"

Consider
<li class="one">
<label class="two">
<span class="two-one">Unwanted text</span>
Wanted text only directly under under label (not under span)
</label>
Two options :
If you have multiple lines to fetch :
//label/text()[normalize-space()]
If you have just one line to fetch, use position. For your sample data:
//label/text()[last()]
[last()] could be replaced with [1],[2],[3],... to specify the position of the text you're trying to get.
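For comparison, here is a minimal sketch of the same idea in Python's standard xml.etree.ElementTree, which has no text() step. There, text sitting directly under an element is exposed as the element's own .text plus the .tail of each child. The closing </li> is added so the sample parses, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with a closing </li> added (an assumption).
markup = """<li class="one">
<label class="two">
<span class="two-one">Unwanted text</span>
Wanted text only directly under under label (not under span)
</label>
</li>"""

root = ET.fromstring(markup)
label = root.find("label")

# Text nodes that are direct children of <label>:
#   label.text -> text before the first child element
#   child.tail -> text that follows each child element
direct_text = [label.text] + [child.tail for child in label]

# Drop whitespace-only nodes, like the [normalize-space()] predicate does.
wanted = [t.strip() for t in direct_text if t and t.strip()]
print(wanted)
```

The span's own text never appears in the result because it belongs to the span, not to the label.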

What is Valid Xpath for link extract by div class name?

Here is html code:
<div class="poster">
<a href="/title/tt2091935/mediaviewer/rm4278707200?ref_=tt_ov_i"> <img alt="Mr. Right Poster" title="Mr. Right Poster" src="http://ia.media-imdb.com/images/M/MV5BOTcxNjUyOTMwOV5BMl5BanBnXkFtZTgwMzUxMDk4NzE@._V1_UX182_CR0,0,182,268_AL_.jpg" itemprop="image">
</a> </div>
I want to know the exact XPath that returns the href link.
I tried //a/@href[@class='poster'] but it doesn't work.
The <div> contains the <a> so you can use that to navigate:
//div[@class='poster']/a/@href
Remember that the "poster" class is defined on the <div> not on the <a> so that's where you need to apply the predicate.
//div returns all <div> elements
[@class='poster'] is a predicate that filters by class
/a returns all <a> elements that are children of those <div>s
/@href gives us the attribute we want
Depending on the system you're using, you might need to wrap the whole expression in string(), as in string(//div[@class='poster']/a/@href), in order to bring back the attribute value rather than the attribute node.
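The navigation described above can be sketched in Python with the standard xml.etree.ElementTree. Its XPath subset has no attribute step, so the sketch selects the <a> elements and reads href in Python. The markup is simplified (the img is self-closed and trimmed) so that it parses as XML, and the <body> wrapper is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified version of the question's markup, wrapped in <body>.
markup = """<body>
<div class="poster">
<a href="/title/tt2091935/mediaviewer/rm4278707200?ref_=tt_ov_i"><img alt="Mr. Right Poster"/></a>
</div>
<div class="other">
<a href="/ignored">not a poster link</a>
</div>
</body>"""

root = ET.fromstring(markup)

# The predicate goes on the <div>, because that is where class="poster" lives.
poster_links = root.findall(".//div[@class='poster']/a")

# Read the attribute value from each matching <a>.
hrefs = [a.get("href") for a in poster_links]
print(hrefs)
```

Putting the predicate on the <a> instead would match nothing here, since the anchors carry no class attribute.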

Xpath get text of nested item not working but css does

I'm making a crawler with Scrapy and wondering why my xpath doesn't work when my CSS selector does? I want to get the number of commits from this html:
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
Xpath:
response.xpath('//li[@class="commits"]//a//span[@class="text-emphasized"]//text()').extract()
CSS:
response.css('li.commits a span.text-emphasized').css('::text').extract()
CSS returns the number (unescaped), but XPath returns nothing. Am I using the // for nested elements correctly?
You're not matching all values in the class attribute of the span tag, so use the contains function to check if only text-emphasized is present:
response.xpath('//li[@class="commits"]//a//span[contains(@class, "text-emphasized")]//text()').extract()[0].strip()
Otherwise also include num:
response.xpath('//li[@class="commits"]//a//span[@class="num text-emphasized"]//text()').extract()[0].strip()
Also, I use [0] to retrieve the first element returned by XPath and strip() to remove all whitespace, resulting in just the number.
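To see why the exact attribute comparison fails, here is a minimal sketch using Python's standard xml.etree.ElementTree instead of Scrapy. ElementTree has no contains(), so the class check is done in Python on the whitespace-separated tokens, which is slightly stricter than a substring test. Variable names are assumptions.

```python
import xml.etree.ElementTree as ET

markup = """<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>"""

root = ET.fromstring(markup)

# @class='...' compares the whole attribute string, so one class name alone fails.
exact_single = root.findall(".//span[@class='text-emphasized']")

# Spelling out the full attribute value matches.
exact_full = root.findall(".//span[@class='num text-emphasized']")

# Token-based check, similar in spirit to the CSS selector span.text-emphasized.
by_token = [
    span for span in root.iter("span")
    if "text-emphasized" in span.get("class", "").split()
]

commits = by_token[0].text.strip()
print(len(exact_single), len(exact_full), commits)
```

The CSS selector in the question works because CSS class selectors match individual tokens, while the XPath equality test compares the entire attribute value.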

Nokogiri grab text with formatting and link tags, <em>,<strong>, <a>, etc

How can I recursively capture all the text with formatting tags using Nokogiri?
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
For example, I would like to capture:
"This is text in the TD with <strong> strong </strong> tags"
"This is a child node. with <b> bold </b> tags"
"another line of text to a link "
"This is text inside a div <em>inside<em> another div inside a paragraph tag"
I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.
ADDED DETAIL: Sanitize looks like an interesting gem, I'm reading about it now. However, here is some added info that might clarify what I need to do.
I need to traverse each node, get the text, process it, and put it back. So I would grab the first text, "This is text in the TD with strong tags", and modify it to something like "This is the modified text in the TD with strong tags". Then go to the next tag in div 1 and get its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags.", and put it back. Then go to div#2, grab the text "another line of text to a link ", modify it to "another line of modified text to a link ", and put it back. Finally, go to the next node in div#2, the paragraph tag, and grab and modify its text: "This is modified text inside a div inside another div inside a paragraph tag".
So after everything is processed, the new HTML should look like this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a modified child node. with <b> bold </b> tags</p>
<div id=2>
"another line of modified text to a link "
<p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
My quasi-code is below, but I'm really stuck on two parts. First, grabbing just the text with formatting: Sanitize helps, but it grabs all tags, and I need to preserve the formatting of just the text, including spaces, without grabbing the unrelated child tags. Second, traversing down only the children that directly hold full text.
#Quasi-code
doc = Nokogiri.HTML(html)
kids=doc.at('div#1')
text_kids=kids.descendant_elements
text_kids.each do |i|
#grab full text (full sentences and paragraphs) with formatting tags
#currently, I have no way to grab just the text with formatting and not the other tags
modified_text=processing_code(i.full_text_w_formating())
i.full_text_w_formating=modified_text
end
def processing_code(string)
#code to process string (not relevant for this example)
return modified_string
end
# Recursive 1
class Nokogiri::XML::Node
def descendant_elements
#This is flawed because it grabs every child and even
#splits it based on any tag.
# I need to traverse down only the text related children.
element_children.map{ |kid|
[kid, kid.descendant_elements]
}.flatten
end
end
I'd use two tactics, Nokogiri to extract the content you want, then a blacklist/whitelist program to strip tags you don't want or keep the ones you want.
require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
'
doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html
will capture the contents of <div id="1"> as an HTML string:
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id="2">
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
</div>
</strong></strong>
The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.
Passing html_fragment to the Sanitize gem:
doc = Sanitize.clean(
html_fragment,
:elements => %w[ a b em strong ],
:attributes => {
'a' => %w[ href ],
},
)
The returned text looks like:
This is text in the TD with <strong> strong <strong> tags
This is a child node. with <b> bold </b> tags
"another line of text to a link "
This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>
</strong></strong>
Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.
