There is a page like the following:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p></br>5-8</br></p>
<p> </br>5-8 </br></p>
</body>
</html>
The goal is to extract the text of each p, without the line breaks and surrounding whitespace.
How to achieve that?
Thanks in advance! Best Wishes!
Update 1
Another post suggested using normalize-space(). I tried it, and it does remove the spaces. However, only one node is left. How can I get the text of all 30 nodes without the unwanted spaces? Thanks in advance and best wishes!
It's not possible to achieve what you want entirely in XPath 1.0, but in XPath 2.0 or later it is possible.
You don't say which XPath interpreter you have available, but you mention Chrome's XPath Helper, which relies on Chrome's built-in XPath interpreter. That interpreter supports only XPath 1.0, as is the norm for web browsers.
But it's possible you are just using Chrome to examine the data and have another, more modern XPath interpreter, such as Saxon. If so, an XPath 2.0 solution will work for you, though you won't be able to use it in Chrome.
I've tidied up your XML example:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p><br/>5-8<br/></p>
<p> <br/>5-8 <br/></p>
</body>
</html>
NB those are non-breaking spaces there.
In XPath 2.0:
for $paragraph in //p
return normalize-space(
translate($paragraph, codepoints-to-string(160), ' ')
)
NB this uses the translate function to convert non-breaking spaces (the char with Unicode codepoint 160) to a space, and then uses normalize-space to trim leading and trailing whitespace (I'm not sure what you would want to do if there were whitespace in the middle of the para, instead of just at the start or end; this will convert any such sequence of whitespace to a single space character). You might think normalize-space would be enough, but in fact a non-breaking-space doesn't fall into normalize-space's category of "white space" so it would not be trimmed.
In XPath 1.0 it is not exactly possible to do what you want. You could use an XPath expression that returns each p element to your host language, and then iterate over those p elements, executing a second XPath expression for each one, with that p as the context. Essentially this means moving the for ... in ... return iterator from XPath into your host language. To select the paragraphs:
//p
... and then for each one:
normalize-space(
translate(., ' ', ' ')
)
NB in that expression, the first string literal is a non-breaking-space character, and the second is a space. XPath 1.0 doesn't have the codepoints-to-string function or I'd have used that, for clarity.
The . passed as the first argument to translate represents the context node (the current node). When you execute this XPath expression in your host language, you need to pass one of the p elements as the context node. You don't say which host language you're using, but in JavaScript, for instance, you could call document.evaluate to execute the first XPath, receiving an iterator of p elements. Then, for each element, you'd call document.evaluate again with the second XPath, passing that p element as the contextNode argument, so that it becomes the context node (i.e. the . in the expression).
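As a minimal sketch of moving the loop into the host language, the following uses Python with the standard library's xml.etree.ElementTree. Python is an assumption here, since the question does not name a host language. ElementTree's XPath subset has no translate or normalize-space functions, so the helper normalize_space (a hypothetical name) reimplements them in Python, and the sample document writes the non-breaking spaces as escapes.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # non-breaking space, Unicode code point 160

# The tidied example from above, with the non-breaking spaces made explicit.
SOURCE = (
    "<html><head></head><body>"
    "<p>\u00a05-8\u00a0</p>"
    "<p><br/>5-8<br/></p>"
    "<p>\u00a0<br/>5-8\u00a0<br/></p>"
    "</body></html>"
)


def normalize_space(text):
    """Mimic XPath normalize-space(): collapse runs of space, tab, CR, LF; trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def paragraph_texts(xml_source):
    root = ET.fromstring(xml_source)
    results = []
    for p in root.iter("p"):               # plays the role of //p
        value = "".join(p.itertext())      # string-value of the context node
        value = value.replace(NBSP, " ")   # plays the role of translate()
        results.append(normalize_space(value))
    return results


print(paragraph_texts(SOURCE))
```

With a full XPath 1.0 engine the two expressions could be passed through unchanged; only the loop would live in the host language.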
Related
Can anyone please help me here?
I want to run two XPath expressions together and store the values; I am not sure if that is possible.
One XPath fetches the city and the other fetches the state:
//div[(text()='city')]/following-sibling::div
//div[contains(text(),'state')]/following-sibling::div
As the XPaths show, the names of the city and state are in the div that follows the "city" and "state" label divs. I want to run both and capture the output as a string.
On a side note: both XPaths work fine for me.
<div>
<div>City</div>
<div>London</div>
</div>
<!-- In between: some other elements such as p, section, and other divs -->
<div>
<div>state</div>
<div>England</div>
</div>
It sounds like you want to convert the results of the two XPath expressions to strings, and concatenate those strings. The expression below concatenates them (with a single space between) using the XPath concat function.
concat(
//div[(text()='city')]/following-sibling::div,
' ',
//div[contains(text(),'state')]/following-sibling::div
)
One other thing: note that in your example XML the text of the first div is "City" rather than "city". Make sure the strings in your XPath expression match the text exactly, because the expression 'City'='city' evaluates to false.
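The same idea can be sketched in a host language. The example below is an illustration in Python using only the standard library; the wrapping root element, the sample data layout and the helper name value_after_label are assumptions. ElementTree does not support the following-sibling axis, so the helper walks the siblings by hand, and it uses an exact, case-sensitive comparison for both labels. Like concat over an empty node-set, it yields an empty string when nothing matches.

```python
import xml.etree.ElementTree as ET

SOURCE = """<root>
  <div><div>City</div><div>London</div></div>
  <p>other content</p>
  <div><div>state</div><div>England</div></div>
</root>"""


def value_after_label(root, label):
    """Text of the first <div> following a sibling <div> whose text is exactly label."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "div" and child.text == label:
                for sibling in children[index + 1:]:
                    if sibling.tag == "div":
                        return "".join(sibling.itertext())
    return ""  # no match, like the string value of an empty node-set


root = ET.fromstring(SOURCE)
combined = value_after_label(root, "City") + " " + value_after_label(root, "state")
print(combined)
```

Looking up "city" instead of "City" returns an empty string here, which is the case-sensitivity pitfall mentioned above.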
To automatically replace keywords with links based on a list of keyword-link pairs, I need to select text inside paragraphs (p) and list items (li) that is not already linked, not in a script, and not manually excluded. This is for use in Drupal's Alinks module.
I modified the existing XPath selector as follows and would like feedback on whether it is efficient or could be improved:
//*[p or li]//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
The xpath is meant to work with any html5 content, also with self closing tags (not well-formed xml) -- that's the way the module was designed, and it works quite well.
To select the text node descendants of p or li elements that are not descendants of a or script elements, you can use this XPath 1.0 expression:
//*[self::p|self::li]
//text()[
not(ancestor::a|ancestor::script|ancestor::*[@data-alink-ignore])
]
Your XPath expression is invalid. You are missing a / before text(). So a valid expression would be
//*[p or li]/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
But without an XML source file it is impossible to tell if this expression would match your desired node.
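To make the selection logic concrete, here is a rough sketch of the same filter written as a tree walk in Python's standard library. It is not how the Alinks module works; it assumes well-formed input (ElementTree cannot parse tag-soup HTML), and the function name linkable_text and the sample markup are made up for illustration. ElementTree has no ancestor axis, so the walk prunes excluded subtrees instead of testing ancestors.

```python
import xml.etree.ElementTree as ET

SKIP_TAGS = {"a", "script"}        # subtrees whose text must not be linked
TARGET_TAGS = {"p", "li"}          # text is collected only inside these
IGNORE_ATTR = "data-alink-ignore"  # manual exclusion marker


def linkable_text(elem, inside_target=False, found=None):
    """Collect text nodes under p/li that have no a, script or ignored ancestor."""
    if found is None:
        found = []
    if elem.tag in SKIP_TAGS or IGNORE_ATTR in elem.attrib:
        return found  # prune the whole subtree
    inside = inside_target or elem.tag in TARGET_TAGS
    if inside and elem.text:
        found.append(elem.text)
    for child in elem:
        linkable_text(child, inside, found)
        # child.tail is a text node of elem, so it is kept even if child was pruned
        if inside and child.tail:
            found.append(child.tail)
    return found


SAMPLE = (
    '<div><p>Buy <a href="#">linked</a> apples '
    '<span data-alink-ignore="">skip</span> now</p>'
    "<ul><li>pears</li></ul><script>var x;</script><div>outside</div></div>"
)
print(linkable_text(ET.fromstring(SAMPLE)))
```

Pruning at the excluded element visits each node at most once, whereas the ancestor tests in the XPath are evaluated per text node.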
I think the answer is "no" but I'll ask anyway. Can you find the last occurrence of a newline in text node using XPath 1.0?
E.g. Given the following XML I want to find the last newline (immediately after "second") in order to get the text "third".
<element> first
second
third </element>
If I knew the position of the last newline it would be trivial to get the text after it. I don't actually want to return the value, just test against it.
As far as I can tell XPath 1.0 doesn't have any of:
reverse text functions
loops
character axis/node
regex
string split
Any of the above would be enough to solve this problem!
Can you find the last occurrence of a newline in text node using XPath 1.0?
No. XPath generally has not been designed to do string processing.
Of course, in XPath 2.0 you can do it by tokenizing the input into a sequence and then getting the last item from that sequence. But strictly speaking that does not qualify as text processing; it's sequence handling. In other words, it won't actually give you the position of that last newline character either.
With XPath 1.0 you will have to do this bit of work in the host language.
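As an illustration of doing this step in a host language, the sketch below uses Python, which is an assumption since the question names no host language. It parses the example element with the standard library and uses str.rfind to locate the last newline.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<element> first\nsecond\nthird </element>")
text = element.text

# Index of the last newline in the text node, or -1 if there is none.
last_newline = text.rfind("\n")

# Everything after the last newline; the whole text if no newline was found.
after_last_newline = text[last_newline + 1:]

print(repr(after_last_newline.strip()))
```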
I have the following element.
driver = Selenium::WebDriver.for :phantomjs
driver.xpath("/html/body/form/table/tbody/tr[14]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/font").text
=> "unique\ntext"
But I don't want to rely on unstable table layout, so I decided to use text() function in xpath like:
driver.xpath("//font[text()='unique\ntext']")
=> nil
But as you see, I couldn't find the element by the text() function. The original text is unique<br>text.
How can I match the <br> tag by using XPath?
There is no id or name attributes that I can use.
The text() test selects any text nodes. In this example there are two such nodes, before and after the <br>. It is not the same as the text method or the string value of the parent node.
One way of selecting what you want could be like this:
driver.xpath("//font[ . ='unique\ntext']")
You might need to add extra newlines before or after the text. Note that this relies on Ruby converting \n into an actual newline character before passing the query to the XPath processor, so you need to be careful about getting your quotes right. This compares the string-value of the node, which for an element is the concatenation of all the descendant text nodes, which is what you want.
A better solution might be to use the normalize-space() function here (as long as the unique aspect of the text doesn’t depend on the newlines).
Try:
driver.xpath("//font[normalize-space()='unique text']")
Note that all leading and trailing whitespace in the target text has been removed, and any internal whitespace is changed to a single space character.
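The difference between the text nodes and the string-value can be seen in a small Python sketch using the standard library. The markup below is an assumption: it supposes the page source has a newline character next to the br, which is what the expressions above rely on. The last line approximates normalize-space() with str.split, which treats a slightly wider set of characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup: a newline character sits just before the <br/>.
font = ET.fromstring("<font>unique\n<br/>text</font>")

# text() in XPath selects these two separate text nodes.
text_nodes = list(font.itertext())

# The string-value of the element is the concatenation of its text nodes.
string_value = "".join(text_nodes)

# Rough equivalent of normalize-space(): collapse whitespace runs and trim.
normalized = " ".join(string_value.split())

print(text_nodes, repr(string_value), repr(normalized))
```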
I need to locate the node within an xml file by its value using XPath.
The problem arises when the node to find contains a value with whitespace inside.
For example:
<Root>
<Child>value</Child>
<Child>value with spaces</Child>
</Root>
I can not construct the XPath locating the second Child node.
Simple XPath /Root/Child perfectly works for both children, but /Root[Child=value with spaces] returns an empty collection.
I have already tried masking spaces with %20, &#20;, and &nbsp;, and using single and double quotes.
Still no luck.
Does anybody have an idea?
Depending on your exact situation, there are different XPath expressions that will select the node, whose value contains some whitespace.
First, let us recall that any one of these characters is "whitespace":
&#x9; -- the tab
&#xA; -- newline
&#xD; -- carriage return
' ' or &#x20; -- the space
If you know the exact value of the node, say it is "Hello World" with a space, then the most direct XPath expression:
/top/aChild[. = 'Hello World']
will select this node.
The difficulties with specifying a value that contains whitespace, however, come from the fact that we see all whitespace characters just as ... well, whitespace, and don't know whether it is a group of spaces or a single tab.
In XPath 2.0 one may use regular expressions and they provide a simple and convenient solution. Thus we can use an XPath 2.0 expression as the one below:
/*/aChild[matches(., "Hello\sWorld")]
to select any child of the top node, whose value is the string "Hello" followed by whitespace followed by the string "World". Note the use of the matches() function and of the "\s" pattern that matches whitespace.
In XPath 1.0 a convenient test if a given string contains any whitespace characters is:
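not(string-length(.) = string-length(translate(., ' &#x9;&#xA;&#xD;', '')))
Here we use the translate() function to remove any of the four whitespace characters (written as character references, as they would be when the expression is embedded in XML, for example in an XSLT stylesheet), and compare the length of the resulting string to that of the original string.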
So, if in a text editor a node's value is displayed as
"Hello World",
we can safely select this node with the XPath expression:
/*/aChild[translate(., ' &#x9;&#xA;&#xD;', '') = 'HelloWorld']
In many cases we can also use the XPath function normalize-space(), which produces from its string argument another string in which leading and trailing whitespace is removed and every run of whitespace within the string is replaced by a single space.
In the above case, we will simply use the following XPath expression:
/*/aChild[normalize-space() = 'Hello World']
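The translate-based checks above can be mirrored in a few lines of Python, shown here only as an illustration of the technique. The names XPATH_WHITESPACE, strip_whitespace and contains_whitespace are hypothetical, and str.translate with a deletion table stands in for XPath's translate() with an empty third argument.

```python
# The four characters that XPath 1.0 treats as whitespace.
XPATH_WHITESPACE = " \t\n\r"

# Translation table that deletes every whitespace character.
DELETE_WHITESPACE = str.maketrans("", "", XPATH_WHITESPACE)


def strip_whitespace(value):
    """Equivalent of translate(., whitespace, ''): drop all whitespace characters."""
    return value.translate(DELETE_WHITESPACE)


def contains_whitespace(value):
    """Equivalent of the length comparison: true if anything was removed."""
    return len(value) != len(strip_whitespace(value))


print(contains_whitespace("Hello\tWorld"), strip_whitespace("Hello \t\nWorld"))
```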
Try either this:
/Root/Child[normalize-space(text())='value without spaces']
or
/Root/Child[contains(text(),'value without spaces')]
or (since it looks like your test value may be the issue)
/Root/Child[normalize-space(text())=normalize-space('value with spaces')]
Haven't actually executed any of these so the syntax may be wonky.
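For what it's worth, a quick check with Python's standard library suggests that quoting the value is the key point. This is a sketch under assumptions: it uses ElementTree's limited XPath subset rather than a full XPath engine, and the [.='text'] predicate it relies on needs Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Root><Child>value</Child><Child>value with spaces</Child></Root>"
)

# The value is written as a quoted string literal; no escaping of spaces is needed.
matches = root.findall("Child[.='value with spaces']")

print([child.text for child in matches])
```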
Locating the Attribute by value containing whitespaces using XPath
I have an input element whose value contains whitespace.
eg:
<input type="button" value="Import Selected File">
I solved this by using this xpath expression.
//input[contains(@value,'Import') and contains(@value,'Selected') and contains(@value,'File')]
Hope this will help you guys.
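As a side note, an exact comparison against the quoted attribute value may also be enough in many engines. The sketch below checks this with Python's standard library under assumptions: the input is wrapped in a hypothetical form element and written as a self-closing tag so that it parses as XML, and ElementTree's XPath subset is used rather than a browser's engine.

```python
import xml.etree.ElementTree as ET

form = ET.fromstring(
    '<form><input type="button" value="Import Selected File"/>'
    '<input type="button" value="Cancel"/></form>'
)

# Spaces inside the quoted literal are matched as ordinary characters.
buttons = form.findall(".//input[@value='Import Selected File']")

print([button.get("value") for button in buttons])
```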
"x0020" worked for me on a Jackrabbit-based CQ5/AEM repository in which the property names had spaces. The following would work for a property named "Record ID":
[(jcr:contains(jcr:content/@Record_x0020_ID, 'test'))]
Did you try &#x20;?
I googled this, and the second result suggests the following:
try to replace the space using "x0020"
That seems to have worked for the person who posted it.
None of the above solutions really worked for me.
However, there's a much simpler solution.
When you create the XmlDocument, make sure you set its PreserveWhitespace property to true:
XmlDocument xmldoc = new XmlDocument();
xmldoc.PreserveWhitespace = true;
xmldoc.Load(xmlCollection);