Using a regex to get a Nokogiri node - ruby

I'm parsing an XML file with Nokogiri.
Currently, I'm using the following to get the value I need (the document includes multiple Phase nodes):
xml.xpath("//Phase[@text=' = STER P=P(T) ']")
But now, the uploaded XML file can have a text attribute with a different value. Thus, I'm trying to update my code using a regular expression since the value always contains STER.
After looking at a few questions on SO, I tried
xml.xpath("//Phase[@text~=/STER/]")
However, when I run it, I get
ERROR: Invalid predicate: //Phase[@text~=/STER/] (Nokogiri::XML::XPath::SyntaxError)
What am I missing here?
Alternatively, is there an XPath function similar to starts-with that looks for the substring anywhere in the value and not just at the beginning of it?

There are two problems with your code: first off, there is no =~ operator in XPath. The way to test whether text matches a regex is using the matches function:
//Phase[matches(@text, 'STER')]
Secondly, regex matching is a feature of XPath 2.0, but Nokogiri implements XPath 1.0.
Luckily, you are not actually using any regex features; you are simply checking for a fixed string, which can be done in XPath 1.0 with the contains function:
//Phase[contains(@text, 'STER')]
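As a minimal sketch of the contains approach, the snippet below runs that XPath 1.0 expression against a small made-up document. It assumes REXML, which ships with standard Ruby distributions, so that it runs without extra gems; the sample XML and the variable names are illustrative only. The same XPath string can be passed to Nokogiri's xpath method.

```ruby
require 'rexml/document'

# Hypothetical sample data; only the Phase/@text shape comes from the question.
xml = REXML::Document.new(<<~XML)
  <Phases>
    <Phase text=" = STER P=P(T) ">first</Phase>
    <Phase text=" = LIQ P=P(T) ">second</Phase>
    <Phase text="STER_2 P=P(T)">third</Phase>
  </Phases>
XML

# contains() is part of XPath 1.0, so no regex support is needed.
ster_phases = REXML::XPath.match(xml, "//Phase[contains(@text, 'STER')]")
ster_names = ster_phases.map(&:text)

puts ster_names.inspect
```

Because contains does a plain substring test, it matches STER anywhere in the attribute value, which is what the question asked for as an alternative to starts-with.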

Related

XDMP-REGEX: (err:FORX0002) - String transformation with Regular expressions

I am working on an XQuery requirement to identify the XML tag name() in an XML document using a regex; later I will transform the data. The query searches the entire document and, if it finds a match, does a string replace using XQuery/XPath.
Here is some sample code showing what I am looking for.
let $full-doc := fn:doc($uri)
if(fn:matches($full-doc,"<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9]
[12][890][0-9]{2})</Hyperlink>"))
then $full-doc
else "regex is not working"
I am getting the following Error.
regex-match :
[1.0-ml] XDMP-REGEX: (err:FORX0002) fn:matches(fn:doc("44215.xml"), "
<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2}...") -
- Invalid regular expression
Could someone please explain why my regex is not working?
Looking at your requirement:
I am working on xquery requirement to identify the xml tag name() from the XML document using the regex.
You are going about this entirely the wrong way. XQuery doesn't see the lexical XML, it sees a tree of nodes. To find the name of an element, use an XPath expression to find the element, then use the name() function to get its name.
If you want to find an element whose name matches a regex, use //*[matches(name(), $regex)]
The word boundary code \b is not supported in XQuery (see https://www.w3.org/TR/xpath-functions-31/#regex-syntax).
But I guess you are looking for Hyperlink elements, not for a <Hyperlink> substring, so you should use a path expression:
let $doc := fn:doc($uri)
where $doc//Hyperlink[matches(., '([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2})')]
return $doc

I am trying to use XPath function contains() that has a string in 2 parts but it is throwing an invalid xpath error

I am trying to use XPath function contains() that has a string in 2 parts but it is throwing an "invalid xpath expression" error upon evaluation.
Here is what I am trying to achieve:
Normal working xpath:
//*[contains(text(),'some_text')]
Now I want to break it up in 2 parts as some random text is populating in between:
//*[contains(text(),'some'+ +'text')]
What I have done is use '+' '+' to concatenate the strings in the expression, as we do in Java. Please suggest how I can get this to work.
You can combine 2 contains() in one predicate expression to check if a text node contains 2 specific substrings :
//*[text()[contains(.,'some') and contains(.,'text')]]
If you need to be more specific by making sure that 'text' comes somewhere after 'some' in the text node, then you can use combination of substring-after() and contains() as shown below :
//*[text()[contains(substring-after(.,'some'),'text')]]
If each target element always contains exactly one text node, or if only the first text node needs to be considered when an element has several, then the above XPath can be simplified a bit as follows:
//*[contains(substring-after(text(),'some'),'text')]
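To make the difference between "both substrings present" and "'text' after 'some'" concrete, here is a small sketch. It assumes REXML from the standard Ruby distribution and a made-up document in which every element has a single text node, so the simplified text() forms apply; the first expression is the single-text-node equivalent of the two-contains predicate above, not the exact expression from the answer.

```ruby
require 'rexml/document'

# Hypothetical sample elements, one text node each.
doc = REXML::Document.new(<<~XML)
  <root>
    <a>some_text</a>
    <b>some random text</b>
    <c>text before some</c>
    <d>unrelated</d>
  </root>
XML

# Both substrings present, in any order.
either_order = REXML::XPath.match(
  doc, "//*[contains(text(), 'some') and contains(text(), 'text')]"
).map(&:name)

# 'text' must appear somewhere after 'some'.
in_order = REXML::XPath.match(
  doc, "//*[contains(substring-after(text(), 'some'), 'text')]"
).map(&:name)

puts either_order.inspect
puts in_order.inspect
```

Element c holds both substrings but in the wrong order, so only the first expression selects it.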

How do I use the XPath tokenizer function in Nokogiri?

I am attempting to extract information from the following HTML using Nokogiri and XPath.
<p>Friday, February 1<br><strong>Apple <br> Orange</strong></p>
e.xpath('./text()[following-sibling::br]')
Gives me the date just fine. I want to then grab the text inside the strong node and split on br. There may be many fruits separated by br or there may just be one with no br. I would ideally like to accomplish this in xpath instead of code since I'm essentially defining a bunch of parsers via JSON.
Right now I'm thinking that I should use the tokenizer function and pass the text in the strong tag. I thought that should look like this:
e.xpath('./strong[fn::tokenize(.,"<br>")]')
and have also tried
e.xpath('fn::tokenize(./strong,"<br>")')
but I am getting:
.../gems/nokogiri-1.5.6/lib/nokogiri/xml/node.rb:159:in `evaluate': Invalid expression: ./strong/text()[fn::tokenize(.,"br")] (Nokogiri::XML::XPath::SyntaxError)
I'm modeling my usage after the documentation for the method that the error occurs in (line 139):
node.xpath('.//title[regex(., "\w+")]',...

xpath to check '#' present

I want to write an XPath expression that checks whether a node contains '#'.
<node1>
<node11>Some text</node11>
<node11>#2o11 PickMe</node12>
</node1>
I want to write an XPath like "//node11[contains(,'#\d+')]". What's the correct way to check for #?
The correct XPath expression is:
//node11[contains(., '#')]
In your XML, the closing tag of the second subnode should be </node11> instead of </node12>.
If you are using xpath 2.0 you should be able to use something like:
"//node11[matches(.,'#\d+')]"
However, if you aren't using 2.0 you won't have regex support directly. If you are using 1.0 then you won't be able to match using \d+. But this will work:
"//node11[contains(.,'#')]"
Or even:
"//node11[starts-with(.,'#')]"
Use:
/*/node11[contains(., '#')]
Note: It is recommended to avoid using the // pseudo-operator because this most often leads to very slow evaluation of the XPath expression.
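A short sketch of the two XPath 1.0 options, contains and starts-with, side by side. It assumes REXML from the standard Ruby distribution and uses the question's sample with the closing tag corrected, plus one hypothetical extra node11 added to show where the two functions differ.

```ruby
require 'rexml/document'

# The question's sample with </node11> fixed; the third node11 is a made-up addition.
doc = REXML::Document.new(<<~XML)
  <node1>
    <node11>Some text</node11>
    <node11>#2o11 PickMe</node11>
    <node11>Issue #7</node11>
  </node1>
XML

# '#' anywhere in the element's string value.
anywhere = REXML::XPath.match(doc, "//node11[contains(., '#')]").map(&:text)

# '#' only as the first character.
at_start = REXML::XPath.match(doc, "//node11[starts-with(., '#')]").map(&:text)

puts anywhere.inspect
puts at_start.inspect
```

Neither function can express the \d+ part of the original pattern; that still needs matches from XPath 2.0 or a check in the host language.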

Trouble using Xpath "starts with" to parse xhtml

I'm trying to parse a webpage to get posts from a forum.
The start of each message starts with the following format
<div id="post_message_somenumber">
and I only want to get the first one
I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success
I'm still learning this, anyone have suggestions
I think I have a solution that does not require dealing with namespaces.
Here is one that selects all matching div's:
//div[@id[starts-with(.,"post_message")]]
But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:
(//div[@id[starts-with(.,"post_message")]])[1]
These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.
It works great for me in PowerShell:
# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'
# Run the xpath selection of all matching div's
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')
Result:
id
--
post_message_somenumber
post_message_somenumber2
Or, for just the first match:
# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')
Result:
id
--
post_message_somenumber
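For comparison, a rough Ruby sketch of the same selection, assuming REXML from the standard Ruby distribution and the sample document from the PowerShell example. It uses the plainer starts-with(@id, ...) form of the predicate and takes the first hit with REXML::XPath.first rather than the ( ... )[1] wrapper; the variable names are illustrative.

```ruby
require 'rexml/document'

# Same sample document as the PowerShell example above.
doc = REXML::Document.new(
  '<root><div id="post_message_somenumber"/><div id="not_post_message"/>' \
  '<div id="post_message_somenumber2"/></root>'
)

xpath = "//div[starts-with(@id, 'post_message')]"

# All matching divs.
all_ids = REXML::XPath.match(doc, xpath).map { |div| div.attributes['id'] }

# Only the first matching div.
first_id = REXML::XPath.first(doc, xpath).attributes['id']

puts all_ids.inspect
puts first_id
```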
I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success I'm still learning this, anyone have suggestions
If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.
Specifying names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or elsewhere on the internet, you'll find many sources with the correct solution.
Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".
Here is a good answer how to do this in C#.
Read the documentation of your language or XPath engine to see how something similar should be done in your specific environment.
@FindBy(xpath = "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]")
private WebElementFacade ListOfExpiredUsersDetails;
This matches every div on the page whose id starts with expiredUserDetails and whose text contains Details.
