Search/Parse XML and exclude certain nodes without removing them? - ruby

The command below allows me to parse the text in all nodes except for nodes 'wp14:sizeRelH' & 'wp14:sizeRelV'
XML.search('//wp14:sizeRelH', '//wp14:sizeRelV').remove.search('//text()')
I would like to do the same thing but I do not want to remove nodes 'wp14:sizeRelH' and 'wp14:sizeRelV' from the XML.
This way I can parse through the XML tree and make changes to the text in each node without affecting nodes 'wp14:sizeRelH' and 'wp14:sizeRelV'
EDIT: It appears that if nodes '//wp14:sizeRelH' or '//wp14:sizeRelV' are not in the XML, my command returns nothing at all, which is not good :(

Looks like I found the answer. I used //text()[not(...)], but I had to find the names of the ancestors of the text I didn't want to include:
XML.search('//text()[not(ancestor::wp14:pctHeight or ancestor::wp14:pctWidth or ancestor::wp:posOffset)]')
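For readers outside Nokogiri, the same "skip but don't remove" idea can be sketched in Python with the standard library's xml.etree.ElementTree. That module has no ancestor axis, so this sketch walks the tree and stops descending at excluded elements instead. The namespace URIs, the sample document and the helper name editable_elements are made up for illustration, and only each element's leading text (.text) is handled, not tail text.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and sample document, for illustration only.
NS_WP = "urn:example:wp"
NS_WP14 = "urn:example:wp14"

SAMPLE = f"""
<doc xmlns:wp="{NS_WP}" xmlns:wp14="{NS_WP14}">
  <p>keep me</p>
  <wp14:sizeRelH><wp14:pctWidth>do not touch</wp14:pctWidth></wp14:sizeRelH>
  <wp:posOffset>left</wp:posOffset>
  <p>keep me too</p>
</doc>
"""

# Elements whose text (and whose descendants' text) must be left alone.
EXCLUDED = {
    f"{{{NS_WP14}}}pctHeight",
    f"{{{NS_WP14}}}pctWidth",
    f"{{{NS_WP}}}posOffset",
}


def editable_elements(elem):
    """Yield elements with non-blank text that are not inside an excluded subtree."""
    if elem.tag in EXCLUDED:
        return  # skip this element and everything below it, but leave it in the tree
    if elem.text and elem.text.strip():
        yield elem
    for child in elem:
        yield from editable_elements(child)


root = ET.fromstring(SAMPLE)
for elem in editable_elements(root):
    elem.text = elem.text.upper()  # change the text in place

result = ET.tostring(root, encoding="unicode")
```

The excluded elements stay in the serialized result with their original text; only the text outside them is changed.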

Related

What does Camel Splitter actually do with XML Document when splitting with xpath?

I have a document with an order and a number of lines. I need to break the order into lines, so I have a Camel splitter set to xpath with the order line as its value. This works fine.
However, what I get going forward is an element for the order line, which is what I want, but when converting it I need information from the order element - but if I try to get the parent element via xpath following the split, this doesn't work.
Does Camel create copies of the nodes returned by the xpath expression, or return a list of nodes within the parent document? If the former, can I make it the latter? If the latter, any ideas why a "../*" expression would return nothing?
Thanks!
Screwtape.
Look at the split options that are available when using a Tokenizer:
http://camel.apache.org/splitter.html
There are four different modes (i, w, u, t), and the 'w' mode keeps the ancestor context. In that case, the parent node (apparently the thing you need) is repeated in each sub-message.
Default:
<m:order><id>123</id><date>2014-02-25</date></m:order>
'w' mode:
<m:orders>
<m:order><id>123</id><date>2014-02-25</date>...</m:order>
</m:orders>

Ruby LibXML skip large nodes

I have an xml file that has a very large text node (>10 MB). While reading the file, is it possible to skip (ignore) this node?
I tried the following:
reader = XML::Reader.io(path)
while reader.read do
  next if reader.name.eql?('huge-node')
end
But this still results in the error parser error : xmlSAX2Characters: huge text node
The only other solution I can think of is to first read the file as a string and remove the huge node through a gsub, and then parse the file. However, this method seems very inefficient.
That's probably because by the time you are trying to skip it, it's already read the node. According to the documentation for the #read method:
reader.read -> nil|true|false
Causes the reader to move to the next node in the stream, exposing its properties.
Returns true if a node was successfully read or false if there are no more nodes to read. On errors, an exception is raised.
You would need to skip the node prior to calling the #read method on it. I'm sure there are many ways you could do that but it doesn't look like this library supports XPath expressions, or I would suggest something like that.
EDIT: The question was clarified so that the SAX parser is a required part of the solution. I have removed links that would not be helpful given this constraint.
You don't have to skip the node. The cause is that since version 2.7.3 libxml limits the maximum size of a single text node to 10MB.
This limit can be removed with a new option, XML_PARSE_HUGE.
Below is an example (in PHP):
# Reads entire file into a string
$result = file_get_contents("https://www.ncbi.nlm.nih.gov/gene/68943?report=xml&format=text");
# Returns the xml string into an object
$xml = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);

xpath - matching value of child in current node with value of element in parent

Edit: I think I found the answer, but I'll leave the question open for a bit to see if someone has a correction/improvement.
I'm using xpath in Talend's etl tool. I have xml like this:
<root>
<employee>
<benefits>
<benefit>
<benefitname>CDE</benefitname>
<benefit_start>2/3/2004</benefit_start>
</benefit>
<benefit>
<benefitname>ABC</benefitname>
<benefit_start>1/1/2001</benefit_start>
</benefit>
</benefits>
<dependent>
<benefits>
<benefit>
<benefitname>ABC</benefitname>
</benefit>
</dependent>
When parsing benefits for dependents, I want to get elements present in the employee's
benefit element. So in the example above, I want to get 1/1/2001 for the dependent's
start date. I want 1/1/2001, not 2/3/2004, because the dependent's benefit has benefitname ABC, matching the employee's benefit with the same benefitname.
What xpath, relative to /root/employee/dependent/benefits/benefit, will yield the value of
benefit_start for the benefit under parent employee that has the same benefit name as the
dependent benefit name? (Note that I don't know ahead of time what the literal value will be. I can't just look for 'ABC'; I have to match whatever value is in the dependent's benefitname element.)
I'm trying:
../../../benefits/benefit[benefitname=??what??]/benefit_start
I don't know how to refer to the current node's ancestor in the middle of
the xpath (since I think "." at the point where I have ??what?? will refer to
the benefit node of the employee/benefits).
EDIT: I think what I want is "current()/benefitname" where the ??what?? is. Seems to work with saxon, I haven't tried it in the etl tool yet.
Your XML is malformed, and I don't think you've described your situation very well (the XPath you're trying has a bunch of ../../s at the beginning, but you haven't said what the context node is, whether you're iterating through certain nodes, or what).
Supposing the current context node were an employee element, you could select benefit_starts that match dependent benefits with
benefits/benefit[benefitname = ../../dependent/benefits/benefit/benefitname]
/benefit_start
If the current context node is a benefit element in a dependents section, and you want to get the corresponding benefit_start for just the current benefit element, you can do:
../../../benefits/benefit[benefitname = current()/benefitname]/benefit_start
Which is what I think you've already discovered.
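Where current() is not available, the same match can be done in two steps in code. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes a well-formed version of the sample, with the dependent's benefits element closed and dependent nested under employee; that structure is a guess based on the XPaths in the question, not something the original XML confirms.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the sample document (structure is a guess).
SAMPLE = """
<root>
  <employee>
    <benefits>
      <benefit><benefitname>CDE</benefitname><benefit_start>2/3/2004</benefit_start></benefit>
      <benefit><benefitname>ABC</benefitname><benefit_start>1/1/2001</benefit_start></benefit>
    </benefits>
    <dependent>
      <benefits>
        <benefit><benefitname>ABC</benefitname></benefit>
      </benefits>
    </dependent>
  </employee>
</root>
"""

root = ET.fromstring(SAMPLE)
matches = []
for employee in root.findall("employee"):
    # Map each of the employee's own benefit names to its start date.
    starts = {
        b.findtext("benefitname"): b.findtext("benefit_start")
        for b in employee.findall("benefits/benefit")
    }
    # For every dependent benefit, look up the employee's benefit with the same name.
    for dep_benefit in employee.findall("dependent/benefits/benefit"):
        name = dep_benefit.findtext("benefitname")
        matches.append((name, starts.get(name)))
```

Here matches holds one (benefitname, benefit_start) pair per dependent benefit, with None as the start date when the employee has no benefit of that name.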

Retrieve an xpath text contains using text()

I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table, and a section in this table contains a list of users (assignees) that are assigned to a particular unit. There are nearly always multiple users assigned to a unit, and I need to make sure a particular user is assigned to any of the units in the table. I've used XPath for nearly all of my selectors and I'm halfway there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span [text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
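The contains(., 'testuser') check compares against the element's string value, which is the concatenation of all text below it. A rough Python equivalent, using only the standard library's xml.etree.ElementTree, is sketched here. The sample markup and user names are invented, and it assumes the table is available as well-formed XML; a real HTML page would likely need an HTML parser first.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real page structure is not shown in the question.
SAMPLE = """
<table>
  <tr><td id="unit"><span>alice (Primary); bob; testuser</span></td></tr>
  <tr><td id="unit"><span>carol (Primary); <b>dave</b></span></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# Select the same spans the XPath targets, then test each one's full text.
spans = root.findall(".//td[@id='unit']/span")
has_user = ["testuser" in "".join(span.itertext()) for span in spans]

# True if the user is assigned to at least one unit.
assigned_somewhere = any(has_user)
```

Joining itertext() mirrors the string value of ".", so text nested inside child elements of the span is included in the check.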

Get node text only if contain an attribute?

XPath problem.
I have these nodes:
[...]
<videos>
<video timestamp="201204271112">myVideo.avi</video>
<video>myVideo.avi</video>
<video timestamp="201204271113">myVideo.avi</video>
<video>myVideo.avi</video>
<video>myVideo.avi</video>
</videos>
<photos>
<photo timestamp="201204271112">myphoto.avi</photo>
<photo>myphoto.avi</photo>
<photo timestamp="201204271113">aphoto.avi</photo>
<photo>myphoto.avi</photo>
<photo>myphoto.avi</photo>
</photos>
[...]
How can i get only node text that contains timestamp attribute?
I tried
//@timestamp
It returns only the timestamp attributes. What about the text?
How can I make a query that includes both conditions (an AND condition)?
Something like this:
//@text and //@timestamps
to get only
201204271112 - myVideo.avi
201204271113 - myVideo.avi
201204271113 - aphoto.avi
excluding other ones?
thanks.
How can i get only node text that contains timestamp attribute?
Could you mean //*[@timestamp]/text()? That selects all text nodes whose parents have the timestamp attribute.
Conditions can be combined inside an XPath predicate, too (e.g. //video[@timestamp and text()] selects all video nodes that have both a timestamp attribute and at least one text node).
What you probably meant is a node-set union, written with the | symbol. To get both the timestamps and the text nodes, you'll need two queries unioned together: //@timestamp | //*[@timestamp]/text() gets all timestamps and all their text nodes. However, I don't think you can get them nicely paired up, since the results come back in document order as separate attribute and text nodes.
You can either iterate one by one with some kind of for loop and get both the timestamp and the text node via position, or you can just get all nodes that have a timestamp and dig their text out of them later (which is the preferred way).
The spec is a surprisingly good read on this.
You can match on attributes:
//video[@timestamp]/text()
//video = matches a node with name video anywhere in the tree
[@timestamp] is a predicate, meaning the node has to have this attribute
text() selects all text node children of the current node
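The "select the elements that have a timestamp, then read the text from each" approach can be sketched in Python with the standard library's xml.etree.ElementTree, whose limited XPath subset supports the [@timestamp] predicate. The media wrapper element is hypothetical (the question elides the surrounding document), and the sample is a shortened, well-formed version of the nodes shown.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <media> root element is a hypothetical wrapper.
SAMPLE = """
<media>
  <videos>
    <video timestamp="201204271112">myVideo.avi</video>
    <video>myVideo.avi</video>
    <video timestamp="201204271113">myVideo.avi</video>
  </videos>
  <photos>
    <photo timestamp="201204271113">aphoto.avi</photo>
    <photo>myphoto.avi</photo>
  </photos>
</media>
"""

root = ET.fromstring(SAMPLE)

# Select every element that carries a timestamp attribute, in document order,
# and read the attribute and the text from the same element so they stay paired.
pairs = [(elem.get("timestamp"), elem.text) for elem in root.findall(".//*[@timestamp]")]
lines = [f"{timestamp} - {text}" for timestamp, text in pairs]
```

Reading both values from the same element avoids the alignment problem of a union query, because each timestamp never gets separated from its own text.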

Resources