xpath text() returns "None" when the tag is @href

I'm trying to extract the text contained within HTML tags in order to build a Python defaultdict.
To accomplish this I need to strip out all XPath and/or HTML data and get just the text, which I can do with /text(), unless the element is a link with an href.
How I scrape the items:
for item in response.xpath(
"//*[self::h3 or self::p or self::strong or self::a[@href]]"):
How it looks if I print the above, without extraction attempts:
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>
I want to extract "Some text here" and "https://some.url.com"
How I try to extract the text:
item = item.xpath("./text()").get()
print(item)
The result:
Some text here
None
"None" is where I would expect to see https://some.url.com. I have tried various methods suggested online, but I cannot get this to work.

Try this line to extract either the text or the @href:
item = item.xpath("./text() | ./@href").get()
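For readers without Scrapy at hand, the same fallback can be sketched with Python's standard-library ElementTree. This is only an illustration, not the Scrapy API: the sample markup, SAMPLE, and the helper text_or_href are made up for the example, and ElementTree cannot select text() or attribute nodes, so the fallback is written in plain Python. It assumes the link has no direct text child (here it wraps an image), which is one situation in which ./text() yields None.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: the link wraps an image, so it has no direct text.
SAMPLE = """<div>
  <h3>Some text here</h3>
  <a href="https://some.url.com"><img src="logo.png"/></a>
</div>"""

WANTED_TAGS = {"h3", "p", "strong"}


def text_or_href(el):
    """Return the href attribute if present, otherwise the element's direct text."""
    href = el.get("href")
    if href:
        return href
    text = (el.text or "").strip()
    return text or None


root = ET.fromstring(SAMPLE)
values = [
    text_or_href(el)
    for el in root.iter()
    if el.tag in WANTED_TAGS or (el.tag == "a" and "href" in el.attrib)
]
print(values)
```

The helper checks the href first because, as far as I can tell, the union in the XPath version behaves the same way: attribute nodes precede child nodes in document order, and .get() returns the first match.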

Related

Excluding contents of <span> from text using Watir

Watir
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have a more complicated HTML, I find it can be easier to deal with this using Nokogiri as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
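The same idea, keeping only the text nodes that are direct children of the element, can be sketched in Python with the standard library. This is an illustration in a different language, not Watir or Nokogiri code: own_text is a hypothetical helper, and it assumes the markup is well-formed enough for ElementTree to parse.

```python
import xml.etree.ElementTree as ET

HTML = """<h1>
This is the text I want
<span> I do not want this text </span>
</h1>"""


def own_text(el):
    """Join the text nodes that are direct children of el, skipping child elements."""
    # ElementTree stores direct text as el.text (before the first child)
    # and as each child's .tail (the text right after that child).
    parts = [el.text or ""]
    parts.extend(child.tail or "" for child in el)
    return "".join(parts).strip()


h1 = ET.fromstring(HTML)
print(own_text(h1))
```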
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
If there is only one h1, you can omit the id (note that String#remove comes from ActiveSupport):
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more:
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  el.text.chomp(el.children.collect(&:text).join(' '))
end

css/xpath selector to exclude the child node in the element when using selenium webdriver (java)

CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I call getText(), I get the complete text, i.e. "Text –text". Is it possible to exclude the text of the muted span?
I tried cssSelector = "a:not([span='muted'])", but it doesn't work.
xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [objectText]. It should be an element.
AFAIK this cannot be done with CSS selector only. You can try to use JavaScriptExecutor to get required text.
As you didn't mention which programming language you use, here is an example in Python:
from selenium.webdriver.common.by import By
link = driver.find_element(By.CSS_SELECTOR, 'a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
This returns just "Text" (with any surrounding whitespace), without " –text".
You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[@href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[@href='link']/span[@class='muted']")).getText();
// Remove the span text from the link text and trim any whitespace.
// Strings are immutable, so the result must be assigned back.
linkText = linkText.replace(spanText, "").trim();

get Correct value using xpath

I'm trying to use xpath to get the raw value of an element. The element is a description and it can contain raw text or xhtml.
So it can be as follows:
<description>asdasdasd <a>Item1</a> asd <a> Price </a></description>
Based on the above XML, I just need this:
asdasdasd Item1 asd Price
I've tried //description/text(), //description/descendant::*/text() and some others with no success. Any suggestion?
Just use:
//description
The string value of an element is the concatenation of all its descendant text nodes.
Or, if the result must be a string and there is just one such element:
string(//description)
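To see what the string value looks like in practice, here is a small standard-library Python sketch. ElementTree has no string() function, so the hypothetical helper string_value approximates it by joining itertext(); the final whitespace collapse is an extra step added to match the output the question asks for.

```python
import xml.etree.ElementTree as ET

XML = "<description>asdasdasd <a>Item1</a> asd <a> Price </a></description>"


def string_value(el):
    """Concatenate every descendant text node, like XPath's string(el)."""
    return "".join(el.itertext())


description = ET.fromstring(XML)
raw = string_value(description)
print(repr(raw))             # keeps the original spacing between fragments
print(" ".join(raw.split())) # collapses runs of whitespace
```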

xpath: Picking tag after text

How would one, via xpath, select the strong tag after baz text for example?
<p>
<br>foo<strong>this foo</strong>
<br>bar<strong>this bar</strong>
<br>baz<strong>this baz</strong>
<br>qux<strong>this qux</strong></p>
Obviously the following does not work....
//p[text() = 'baz']/following-sibling::select[1]
Try this
//p/text()[. = 'baz']/following-sibling::strong[1]
Demo here - http://www.xpathtester.com/obj/b67bad4d-4d38-4e2d-a3df-b7e5a2e9f286
This solution relies on there being no whitespace around your text nodes. If you start using indentation or other whitespace characters, switch to the following:
//p/text()[normalize-space(.) = 'baz']/following-sibling::strong[1]
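The logic of that expression (find a text node whose normalized value is 'baz', then take its first following strong sibling) can be traced in a standard-library Python sketch. ElementTree does not support text() or following-sibling, so the hypothetical helper strong_after_text walks the children by hand; the sample uses <br/> instead of <br> because ElementTree needs well-formed XML.

```python
import xml.etree.ElementTree as ET

HTML = """<p>
<br/>foo<strong>this foo</strong>
<br/>bar<strong>this bar</strong>
<br/>baz<strong>this baz</strong>
<br/>qux<strong>this qux</strong></p>"""


def strong_after_text(p, wanted):
    """Return the first <strong> sibling that follows a text node equal to wanted."""
    children = list(p)
    # Text nodes of p: p.text comes before children[0],
    # and children[i - 1].tail comes before children[i].
    texts = [p.text] + [child.tail for child in children]
    for i, text in enumerate(texts):
        # " ".join(text.split()) mimics normalize-space()
        if text is not None and " ".join(text.split()) == wanted:
            for sibling in children[i:]:
                if sibling.tag == "strong":
                    return sibling
    return None


p = ET.fromstring(HTML)
match = strong_after_text(p, "baz")
print(match.text)
```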

Hpricot: How to extract inner text without other html subelements

I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner HTML of the <dl> using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.
I ended up figuring out a route by myself, by manually parsing the children:
(@context/"dl").each do |dl|
  dl.children.each do |child|
    if child.is_a?(Hpricot::Elem) && child.name == 'dd'
      # do stuff with the element
    elsif child.is_a?(Hpricot::Text)
      # bare text directly inside the <dl>, e.g. the puts output
      text = child.to_s.strip
      puts text unless text.empty?
    end
  end
end
Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted:
(Hpricot.parse(doc)/:dl).first.inner_html.sub(%r{<dt[^>]*>.+?</dt>}m, '')
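As an illustration of "the text node following the <dt>", here is a standard-library Python sketch rather than Hpricot or Nokogiri code. It assumes the snippet parses as well-formed XML; in ElementTree the text right after an element is exposed as that element's tail.

```python
import xml.etree.ElementTree as ET

DOC = """<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>"""

dl = ET.fromstring(DOC)
dt = dl.find("dt")
# dt.tail is the text node that immediately follows the closing </dt>
output = (dt.tail or "").strip()
print(output)
```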
