xpath expression to remove whitespace - xpath

I have this HTML:
<tr class="even expanded first>
<td class="score-time status">
<a href="/matches/2012/08/02/europe/uefa-cup/">
16 : 00
</a>
</td>
</tr>
I want to extract the (16 : 00) string without the extra whitespace. Is this possible?

I. Use this single XPath expression:
translate(normalize-space(/tr/td/a), ' ', '')
Explanation:
normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.
translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
II. Alternatively:
translate(/tr/td/a, '
&#13', '')

Please try the below xpath expression :
//td[#class='score-time status']/a[normalize-space() = '16 : 00']

You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]

I came across this thread when I was having my own issue similar to above.
HTML
<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
<a href="/nsomar/OAStackView/releases/tag/1.0.1">
1.0.1
</a>
XPath start command
tree.xpath('//div[#class="d-flex"]/h4/a/text()')
However this grabbed random whitespace and gave me the output of:
['\n ', '\n 1.0.1\n ']
Using normalize-space, it removed the first blank space node and left me with just what I wanted
tree.xpath('//div[#class="d-flex"]/h4/a/text()[normalize-space()]')
['\n 1.0.1\n ']
I could then grab the first element of the list, and use strip() to remove any further whitespace
XPath final command
tree.xpath('//div[#class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()
Which left me with exactly what I required:
1.0.1

you can check if text() nodes are empty.
/path/text()[not(.='')]
it may be useful with axes like following-sibling:: if these are no containers, or with child::.
you can use string() or the regex() function of xpath 2.
NOTE: some comments say that xpath cannot do string manipulation... even if it's not really designed for that you can do basic things: contains(), starts-with(), replace().
if you want to check whitespace nodes it's much harder, as you will generally have a nodelist result set, and most xpath functions, like match or replace, only operate one node.
you can separate node and string manipulation
So you may use xpath to retrieve a container, or a list of text nodes, and then process it with another language. (java, php, python, perl for instance).

Related

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('#class','something')/text()
and
//div[contains('#class','something')/*/text()
and
//div[contains('#class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(#class,'something')]//text()").extract()) where sel is selector in your case may be response.
Use the XPath expression
//div[contains(#class,'something')]//text()
to get a concatenation of the text of all the text() nodes in the chosen div element.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what and how you want to obtain. Anyway, there are couple of problems with what you tried:
You are missing closing bracket (]) after contains in the XPath expression.
#class should not be enclosed in (single) quotes when used inside contains.
If you want to get all the text of div element as one string, you might use
normalize-space(//div[contains(#class,'something')])

Replacing <a> tags that have two pairs of double quotes

I have asked a similar question before but this one is slightly different
I have content with this sort of links in:
Professor Steve Jackson
[UPDATE]
And this is how i read it:
content = doc.xpath("/wcm:root/wcm:element[#name='Body']").inner_text
The links has two pairs of double quotes after the href=.
I am trying to strip out the tag and retrieve only the text like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
.each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try To do the same for the link that has two pairs of double quotes it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
.each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Anyone know how I can overcome this issue?
I can suggest you two ways to do it, but it depends on whether : every <a> tag has href's with two "" enclosing them or its just the one with ssLINK
Assume
output = []
input_text = 'Professor Steve Jackson'
1) If a tags has href with "" only with ssLink then just do
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the a tags has href with ""then you can try this
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then also the output will be ["Professor Steve Jackson"]
Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.

XPath: How to grab multiple strings when doing a string, substring, or another function on text() nodes

I want to use XPath to grab a list of modified strings via the text() function
Example code:
<div>
<p>
Monday 2/4/13
</p>
<p>
Tuesday 2/5/13
</p>
</div>
Now in this example, if I wanted to grab an array of the text between the markups, I'd write an expression such as .//div/p/text(). However, if I wanted to only grab the dates, I could use a substring-after function, but the code substring-after(.//div/p/text(), ' ') only grabs one element. How does I write this expression to grab all the text elements?
In XPath 2.0, you can use the function directly in the text():
//div/p/substring-after(text(), ' ')
In XPath 1.0, that cannot be achieved with only one expression because:
the substring-after() function takes a string as first parameter, not a node-set
a function cannot be specified as a location step (as the 2.0 example above does).
So, in 1.0, your best bet is something like (which you'd have to repeat for each node - notice also it returns just a string):
concat(substring-after(//div/p[1]/text(), ' '),
' ',
substring-after(//div/p[2]/text(), ' '))

XPath expression for selecting all text in a given node, and the text of its chldren

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[#id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a nodeset to normalize-space(), the nodeset will first be converted to it's string-value. If no arguments are passed to normalize-space it will use the context node.
normalize-space(//div[#id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want use a more efficient way of selecting the context node than the example XPath I have been using. eg, the following Javascript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null ).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[#id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[#id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[#id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[#id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using scrapy in python, you can use descendant-or-self::*/text(). Full example:
txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join( _ for _ in all_txt).strip()
print(final_txt) # 'This is an example bolded text'
How about this :
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmmss I am not sure about the last part though. You might have to play with that.
normal code
//div[#id='theNode']
to get all text but if they become split then
//div[#id='theNode']/text()
Not sure but if you provide me the link I will try

Xquery to extract text in html

I am working on extracting text out of html documents and storing in database. I am using webharvest tool for extracting the content. However I kind of stuck at a point. Inside webharvest I use XQuery expression inorder to extract the data. The html document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw :=data($item//a[#name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello World". Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world 2 that is in betweeb hw2 and hw3. I would not want to use text()[3] but is there some way I could extract the text out between /a[#name='hw2'] and /a[#name='hw3'].
Your xpath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[#name='hw']/text()
Change it to this:
$item[a/#name='hw']/text()
Update (following comments and update to question):
This xpath selects the second text node from $item that have an a tag containing a name attribute set to hw:
$item[a/#name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[#name='hw2'] and
/a[#name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[#name='hw3']/preceding::text()[1]
If there are more than one text nodes between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for intersection of two nodesets (aka Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace in the above expression $ns1 with:
/a[#name='hw2']/following-sibling::text()
and $ns2 with:
/a[#name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[#name='hw2']/following-sibling::text()
intersect
/a[#name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/#name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".

Resources