Scrapy: Invalid XPath - xpath

hxs.select("//h:h2[re:test(., 'a', 'i')]").extract()
Undefined namespace prefix
xmlXPathEval: evaluation failed
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/libxml2sel.py", line 44, in select
raise ValueError("Invalid XPath: %s" % xpath)
ValueError: Invalid XPath: //h:h2[re:test(., 'a', 'i')]
I'm new to XPath and Scrapy.
What's wrong with it? (I'm trying to select nodes that contain the word "a").

According to the traceback, you're using an undefined namespace prefix re. I'm not familiar with Scrapy, but it seems you have to define the namespace prefix somewhere.
BTW, isn't the function you're trying to use called matches?
You could call it like this: //h:h2[matches(., 'a', 'i')]
An alternative would be
//h:h2[contains(lower-case(.),'a')]
Also, your description ("I'm trying to select nodes that contain the word "a"") contradicts the function's semantics. Your snippet actually looks for a string that contains the letter a, not for a as a word.
If a is the only text in your element, you could also try using:
//h:h2[lower-case(.)='a']
Or if you're looking for a as a word in a longer text, you can combine the use of matches with XPath regular expressions.
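To make the letter-versus-word distinction concrete, here is a minimal sketch in plain Python (not Scrapy or XPath). The helper names and sample headings are hypothetical, and the word pattern uses (^|\W) and (\W|$) as an assumed stand-in for a word boundary:

```python
import re


def contains_letter(text, letter):
    # Case-insensitive substring test, like contains(lower-case(.), 'a').
    return letter.lower() in text.lower()


def contains_word(text, word):
    # Case-insensitive whole-word test: the word must be delimited by
    # the start/end of the string or by a non-word character.
    pattern = r"(^|\W)" + re.escape(word) + r"(\W|$)"
    return re.search(pattern, text, re.IGNORECASE) is not None


# Hypothetical h2 texts, used only for illustration.
headings = ["Banana bread", "Buy a car", "A", "Hello"]
for heading in headings:
    print(heading, contains_letter(heading, "a"), contains_word(heading, "a"))
```

"Banana bread" contains the letter but not the word, while "Buy a car" and "A" contain both.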

Related

freemarker - Retrieve value from sequences

Hoping this issue is easy enough to resolve.
I am trying to retrieve a single value from a sequence using FreeMarker via the advanced form PDF functionality in NetSuite.
Here is a snippet of code:
<#assign getOps>
<#list record.item as assembly>
{item: ${assembly.item}, op: ${assembly.operationsequencenumber}}
</#list>
</#assign>
Number of words: ${getOps?word_list?size}
${getOps}
When I print the above, the following is printed:
I want to be able to capture single values from this sequence, using something similar to ${getOps.item} but an error is fired:
For "." left-hand operand: Expected a hash, but this has evaluated to
a string (wrapper: f.t.SimpleScalar):
==> getOps[2] [in template "template" at line 126, column 3]
---- FTL stack trace ("~" means nesting-related):
Failed at: ${getOps[2].item} [in template "template" at line 126, column 1]
Can you identify the issue here?
Any help is appreciated.
Thanks
You are capturing the output into a single string there, so it is unstructured: a flat string that you can no longer traverse. If you really need to transform the original list, use ?map (see the FreeMarker Manual). However, NetSuite uses a FreeMarker fork, and I'm not sure whether it supports ?map.
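As an analogy only (this is Python, not FreeMarker), the difference between captured output and a transformed list looks roughly like this. The sample records are hypothetical; the field names mirror the template in the question:

```python
# Hypothetical stand-in for record.item.
items = [
    {"item": "Widget", "operationsequencenumber": 10},
    {"item": "Gadget", "operationsequencenumber": 20},
]

# Capturing rendered output gives one flat string, much like the
# <#assign getOps>...</#assign> block in the question.
captured = "".join(
    "{item: %s, op: %s}\n" % (a["item"], a["operationsequencenumber"])
    for a in items
)

# Transforming the list keeps the structure (roughly the idea behind ?map).
ops = [{"item": a["item"], "op": a["operationsequencenumber"]} for a in items]

print(type(captured).__name__)  # a string: indexing it yields characters
print(ops[1]["item"])           # a list of mappings: fields stay reachable
```

Indexing into captured can only ever yield characters, which is why a field lookup on it fails.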

Using a regex to get a Nokogiri node

I'm parsing an XML file with Nokogiri.
Currently, I'm using the following to get the value I need (the document includes multiple Phase nodes):
xml.xpath("//Phase[@text=' = STER P=P(T) ']")
But now, the uploaded XML file can have a text attribute with a different value. Thus, I'm trying to update my code to use a regular expression, since the value always contains STER.
After looking at a few questions on SO, I tried
xml.xpath("//Phase[@text~=/STER/]")
However, when I run it, I get
ERROR: Invalid predicate: //Phase[@text~=/STER/] (Nokogiri::XML::XPath::SyntaxError)
What am I missing here?
Alternatively, is there an XPath function similar to starts-with that looks for the substring within the entire value and not just at the beginning of it?
There are two problems with your code. First, there is no =~ (or ~=) operator in XPath. The way to test whether text matches a regex is the matches function:
//Phase[matches(@text, 'STER')]
Second, regex matching is a feature of XPath 2.0, but Nokogiri implements XPath 1.0.
Luckily, you are not actually using any regex features; you are simply checking for a fixed string, which can be done in XPath 1.0 with the contains function:
//Phase[contains(@text, 'STER')]
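The substring test can be sketched in Python with the standard library's ElementTree instead of Nokogiri. ElementTree's XPath subset has no contains(), so the filter is written in Python here. The sample document is hypothetical; only the Phase element and its text attribute come from the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical document with several Phase nodes.
doc = ET.fromstring(
    "<Root>"
    "<Phase text=' = STER P=P(T) '/>"
    "<Phase text=' = STER P=P(T,V) '/>"
    "<Phase text=' = OTHER '/>"
    "</Root>"
)

# Mirrors the intent of //Phase[contains(@text, 'STER')]:
# keep every Phase whose text attribute contains the fixed substring.
matches = [p for p in doc.iter("Phase") if "STER" in p.get("text", "")]

print([p.get("text") for p in matches])
```

Both STER variants are selected, regardless of what follows the substring.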

XDMP-REGEX: (err:FORX0002) - String transformation with Regular expressions

I am working on an XQuery requirement to identify an XML tag name() in an XML document using a regex, and later to transform the data. The code searches the entire document and, if it finds a match, does a string replace using XQuery/XPath.
Here is some sample code showing what I am trying to do.
let $full-doc := fn:doc($uri)
return
if (fn:matches($full-doc, "<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2})</Hyperlink>"))
then $full-doc
else "regex is not working"
I am getting the following Error.
regex-match :
[1.0-ml] XDMP-REGEX: (err:FORX0002) fn:matches(fn:doc("44215.xml"), "
<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2}...") -
- Invalid regular expression
Could someone please explain why my regex is not working?
Looking at your requirement:
I am working on xquery requirement to identify the xml tag name() from the XML document using the regex.
You are going about this entirely the wrong way. XQuery doesn't see the lexical XML, it sees a tree of nodes. To find the name of an element, use an XPath expression to find the element, then use the name() function to get its name.
If you want to find an element whose name matches a regex, use //*[matches(name(), $regex)]
The word boundary code \b is not supported in XQuery (see https://www.w3.org/TR/xpath-functions-31/#regex-syntax).
But I guess you are looking for Hyperlink elements, not for a <Hyperlink> substring, so you should use a path expression:
let $doc := fn:doc($uri)
where $doc//Hyperlink[matches(., '([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2})')]
return $doc
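The same idea (select nodes from the tree first, then apply the regex to a name or to text content) can be sketched in Python with the standard library. This is not XQuery or MarkLogic; the sample document is hypothetical, and the date pattern is taken from the question:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document.
doc = ET.fromstring(
    "<Doc>"
    "<Hyperlink href='a'>Jan 12 1999</Hyperlink>"
    "<Hyperlink href='b'>see note</Hyperlink>"
    "<Para>Feb 3 2008</Para>"
    "</Doc>"
)

DATE = re.compile(r"[A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2}")

# Select by element name, then test the element's text content,
# similar in spirit to $doc//Hyperlink[matches(., '...')].
dated_links = [
    el for el in doc.iter("Hyperlink") if DATE.search("".join(el.itertext()))
]

# Elements whose *name* matches a regex, like //*[matches(name(), $regex)].
named = [el.tag for el in doc.iter() if re.search(r"^Hyper", el.tag)]

print([el.get("href") for el in dated_links])
print(named)
```

No angle brackets appear in either pattern, because the regex never sees the serialized markup.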

I am trying to use XPath function contains() that has a string in 2 parts but it is throwing an invalid xpath error

I am trying to use XPath function contains() that has a string in 2 parts but it is throwing an "invalid xpath expression" error upon evaluation.
Here is what I am trying to achieve:
Normal working xpath:
//*[contains(text(),'some_text')]
Now I want to break it up in 2 parts as some random text is populating in between:
//*[contains(text(),'some'+ +'text')]
What I have done is use '+' '+' to concatenate the strings in the expression, as we do in Java. Please suggest how I can get this to work.
You can combine two contains() calls in one predicate to check whether a text node contains both substrings:
//*[text()[contains(.,'some') and contains(.,'text')]]
If you need to be more specific and make sure that 'text' comes somewhere after 'some' in the text node, you can combine substring-after() and contains():
//*[text()[contains(substring-after(.,'some'),'text')]]
If each target element always contains exactly one text node, or if only the first text node needs to be considered when an element has several, the above XPath can be simplified a bit:
//*[contains(substring-after(text(),'some'),'text')]
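The difference between the two predicates is easiest to see on plain strings. Here is a minimal Python sketch of their semantics; the helper names and sample strings are hypothetical:

```python
def contains_both(text, first, second):
    # contains(., first) and contains(., second): order does not matter.
    return first in text and second in text


def contains_in_order(text, first, second):
    # contains(substring-after(., first), second).
    # partition() yields "" for the tail when `first` is absent, which
    # matches substring-after() returning an empty string in that case.
    return second in text.partition(first)[2]


samples = ["some_random_text", "text before some", "sometext", "nothing"]
for s in samples:
    print(s, contains_both(s, "some", "text"), contains_in_order(s, "some", "text"))
```

"text before some" passes the first check but fails the second, because nothing follows 'some'.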

Getting "XPath error: Invalid predicate", while trying to use an Xpath that contains greek letters

From this webpage, I am trying to build a crawler that extracts "Μακεδονία > Ν. Ημαθίας > Δ. Δοβρά" from the "Περιοχή:" field.
To do this, I intend to use XPath to locate "Περιοχή:" and then use the following-sibling axis to access and extract the text "Μακεδονία > Ν. Ημαθίας > Δ. Δοβρά", because the td that contains it can be in a different location on other webpages (but always after the tr with the text "Περιοχή:") or even missing.
In scrapy shell I am testing the following:
x = response.xpath(u"//th[@text()=u'Περιοχή:']/text()").extract()
expecting to get x = [u"Περιοχή:"]
but instead I am getting an error:
ValueError: XPath error: Invalid predicate in //th[@text()=u'\u03a0\u03b5\u03c1\u03b9\u03bf\u03c7\u03ae:']/text()
What am I doing wrong?
Thanks in advance.
You are specifying Unicode twice. You shouldn't add the u prefix inside the XPath expression, since the expression is already a Unicode string.
i.e.
# this:
u"//th[@text()=u'Περιοχή:']/text()"
# should be this:
u"//th[text()='Περιοχή:']/text()"
Notice that there is no u before the text, and you don't need @ before text() either, because text() is an XPath node test, not an attribute.
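A minimal sketch of matching the label with a plain quoted literal, using Python 3's standard ElementTree rather than Scrapy. The table markup and the "Τιμή:" row are hypothetical; the "Περιοχή:" label and its value come from the question. ElementTree supports only a subset of XPath, so the row is selected through a child predicate instead of following-sibling:

```python
import xml.etree.ElementTree as ET

# Hypothetical table fragment.
table = ET.fromstring(
    "<table>"
    "<tr><th>Τιμή:</th><td>100</td></tr>"
    "<tr><th>Περιοχή:</th>"
    "<td>Μακεδονία &gt; Ν. Ημαθίας &gt; Δ. Δοβρά</td></tr>"
    "</table>"
)

# The literal inside the expression is just quoted text: no u prefix, no @.
label = table.findall(".//th[.='Περιοχή:']")

# Select the td in the row whose th holds the label.
value = table.findall(".//tr[th='Περιοχή:']/td")

print(label[0].text)
print(value[0].text)
```

If the row is missing, both findall() calls simply return an empty list.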

Resources