How would one use XPath to select the strong tag that follows the text baz, for example?
<p>
<br>foo<strong>this foo</strong>
<br>bar<strong>this bar</strong>
<br>baz<strong>this baz</strong>
<br>qux<strong>this qux</strong></p>
Obviously the following does not work....
//p[text() = 'baz']/following-sibling::select[1]
Try this
//p/text()[. = 'baz']/following-sibling::strong[1]
Demo here - http://www.xpathtester.com/obj/b67bad4d-4d38-4e2d-a3df-b7e5a2e9f286
This solution relies on there being no whitespace around your text nodes. If you start using indentation or other whitespace characters, switch to the following:
//p/text()[normalize-space(.) = 'baz']/following-sibling::strong[1]
Related
Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have more complicated HTML, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
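One caveat with the chomp approach: chomp only removes its argument when it sits at the very end of the string. The sketch below uses plain strings in place of what Watir's text calls might return (the sample values are assumptions, no browser is involved) and shows a sub-based fallback for nested text that is not at the end.

```ruby
# Stand-ins for h1.text and h1.span.text; the values are hypothetical.
h1_text   = 'This is the text I want I do not want this text'
span_text = 'I do not want this text'

# Works because the span text is the trailing part of the h1 text.
own_text = h1_text.chomp(span_text).strip
puts own_text # This is the text I want

# If the nested text sits in the middle, chomp leaves the string untouched.
mixed     = 'Before nested after'
unchanged = mixed.chomp('nested')

# sub removes the first occurrence wherever it is; squeeze tidies the gap.
cleaned = mixed.sub('nested', '').squeeze(' ')
puts cleaned # Before after
```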
This is useful for top level text.
If there is only one h1, you can omit the id. Note that String#remove is provided by ActiveSupport, not core Ruby.
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more:
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  el.text.chomp(el.children.collect(&:text).join(' '))
end
I have this HTML fragment:
<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>
I need to wrap the word plane in a link, but only where it appears outside of an <a></a> anchor tag and outside of a heading (<h1> to <h6>) tag.
This is what I've tried:
require 'nokogiri'
h = '<p>Yes. No. Both. Maybe a plane?</p><h2 id="2-is-it-a-plane">2. Is it a plane?</h2><p>Yes. No. Both.</p><h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2><p>Is it a bird? Is it a plane? No, it’s Superman.</p>'
doc = Nokogiri::HTML::DocumentFragment.parse(h)
# Try 1: This outputs all content, but I need to avoid <a>/<h#>
doc.content
# Try 2: The below line removes headings permanently - I need them to remain
# doc.search(".//h2").remove
# Try 3: This just comes out empty - why?
# doc.xpath('text()')
# doc.xpath('//text()')
# then,
# code to replace `plane` is here ...
# this part is not needed
# then,
doc.to_html
I tried various other variations of xpath to no avail. What am I doing wrong?
After some playing around, it appears you needed to use the XPath selector p/text(). Things then got more complicated because you're trying to replace normal text with a link element.
When I just tried using gsub, Nokogiri was escaping the new link, so I needed to split the text element into multiple sibling elements where I could replace some of the siblings with link elements instead of text nodes.
doc.xpath('p/text()').grep(/plane/) do |node|
  node_content, *remaining_texts = node.content.split(/(plane)/)
  node.content = node_content
  remaining_texts.each do |text|
    if text == 'plane'
      node = node.add_next_sibling('plane').last
    else
      node = node.add_next_sibling(text).last
    end
  end
end
puts doc
# <p>Yes. No. Both. Maybe a plane?</p>
# <h2 id="2-is-it-a-plane">2. Is it a plane?</h2>
# <p>Yes. No. Both.</p>
# <h2 id="3-what-is-superman-anyway">3. What is Superman, anyway?</h2>
# <p>Is it a bird? Is it a plane? No, it’s Superman.</p>
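The code above depends on how String#split treats a capture group: the matched delimiter is kept in the result instead of being discarded. Here is a small plain-Ruby sketch of just that step. The sample sentence and the :link / :text labels are illustrative assumptions, not part of the answer.

```ruby
text = 'Is it a plane? No, a plane!'

# With a capture group, the delimiter itself is included in the result.
parts = text.split(/(plane)/)
p parts # ["Is it a ", "plane", "? No, a ", "plane", "!"]

# Label each piece so the caller knows which ones to turn into links.
tagged = parts.map { |part| part == 'plane' ? [:link, part] : [:text, part] }
p tagged.count { |kind, _| kind == :link } # 2
```

The first piece stays in the original text node and every later piece becomes a new sibling, which is why the answer destructures the result into node_content and remaining_texts.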
A more general purpose XPath selector for all elements, except headings and links, might be:
*[not(name()='a')][not(name()='h1')][not(name()='h2')][not(name()='h3')][not(name()='h4')][not(name()='h5')][not(name()='h6')]/text()
You may need to tweak this some as I'm not an XML or Nokogiri expert, but it appears to me to be working for the provided example, at least, so it should get you going.
CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I do getText(), I get the complete text, like Text-text. Is it possible to exclude the text of the muted span?
I tried cssSelector = "a:not([span='muted'])", but it doesn't work.
xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [objectText]. It should be an element.
AFAIK this cannot be done with a CSS selector alone. You can try using JavaScriptExecutor to get the required text.
Since you didn't mention which programming language you use, here is an example in Python:
link = driver.find_element_by_css_selector('a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
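This returns just the text of the first child node, "Text", without " –text". Given the markup above, the value may include surrounding whitespace, which you can strip.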
You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[@href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[@href='link']/span[@class='muted']")).getText();
// Remove the span text from the link text and trim any whitespace.
// Strings are immutable, so the result has to be assigned.
String ownText = linkText.replace(spanText, "").trim();
I've tried everything below, and none of it works for me. I am trying to avoid using "contains".
//p[text()[contains(.,'text1')]][text()[contains(.,'text2')]]
Here is my html p element
<p>
text1
<br></br>
text2
</p>
Here's what I've tried so far:
"//p[normalize-space(text()) = 'text1text2']]"
"//p[normalize-space() = 'text1text2']]")
"//p[text()[normalize-space() ='text1text2']]"
"//p[text()[normalize-space().,'text1text2']]"
@"//p[text() = ""text1\r\ntext2""]"
I had to do this:
"//p[.='text1text2']"
But I would still like to see if I could verify the newline in my xpath somehow.
I have asked a similar question before but this one is slightly different
I have content with this sort of links in:
Professor Steve Jackson
[UPDATE]
And this is how I read it:
content = doc.xpath("/wcm:root/wcm:element[@name='Body']").inner_text
The link has two pairs of double quotes after href=.
I am trying to strip out the tag and retrieve only the text like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
.each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try to do the same for the link that has two pairs of double quotes, it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
.each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Anyone know how I can overcome this issue?
I can suggest two ways to do it, depending on whether every <a> tag has an href enclosed in two pairs of double quotes or only the ones with ssLINK do.
Assume
output = []
input_text = 'Professor Steve Jackson'
1) If only the <a> tags with ssLINK have an href wrapped in "", then just do
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
  output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the <a> tags have an href wrapped in "", then you can try this
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
  output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then also the output will be ["Professor Steve Jackson"]
Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.
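As a rough illustration of the regex route, here is a plain-Ruby sketch that unwraps such links and keeps only their text. It rests on assumptions the thread does not confirm: the sample href value is hypothetical (the original markup is not shown), the href is assumed to start with ssLINK, and neither the href nor the link text is assumed to contain a ">" or a nested tag.

```ruby
# Hypothetical input; the real link markup is not shown in the question.
content = 'Contact <a href=""ssLINK/steve-jackson"">Professor Steve Jackson</a> today.'

# Matches <a href=""ssLINK..."">text</a> and captures the link text.
DOUBLED_QUOTE_LINK = /<a\s+href=""ssLINK[^>]*"">(.*?)<\/a>/m

stripped = content.gsub(DOUBLED_QUOTE_LINK, '\1')
puts stripped # Contact Professor Steve Jackson today.

# A normally quoted link is left alone by this pattern.
normal = '<a href="ssLINK/world">World</a>'
puts normal.gsub(DOUBLED_QUOTE_LINK, '\1') == normal # true
```

The pattern deliberately requires the doubled quotes on both sides of the href value, so it does not address the ambiguity raised above with an attribute such as sometext="".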