I need to merge two XML files. I saw this question before, but that poster wanted to simply concatenate the two files. I want to merge based on a specific child element, in this case, id.
I have two XML files that have the following structure:
File #1:
<document>
<row>
<id>1</id>
<data_field1>aaaa</data_field1>
<data_field2>bbbb</data_field2>
</row>
</document>
File #2:
<document>
<row>
<id>1</id>
<data_field3>cccc</data_field3>
</row>
</document>
And I want them to be merged into File #3:
<document>
<row>
<id>1</id>
<data_field1>aaaa</data_field1>
<data_field2>bbbb</data_field2>
<data_field3>cccc</data_field3>
</row>
</document>
The id element should be used to join the rows from the two files.
The code below does this using XML::Twig.
It works with more than two documents, and it works even when some ids are missing from one of them. It does load every document into memory, though; handling documents too big to fit in memory would make the code somewhat more complex. Rows appear in the order of the first document, followed by the rows that appear only in the second.
Since it is written as a test, you can make the test case more complex or add more tests, which would probably be a good idea.
#!/usr/bin/perl
use strict;
use warnings;
use Test::More;
use XML::Twig;
# normally you would read the documents from file,
# but it's easier to write a self-contained test
my $d1='
<document>
<row>
<id>1</id>
<data_field1>aaaa</data_field1>
<data_field2>bbbb</data_field2>
</row>
</document>
';
my $d2='
<document>
<row>
<id>1</id>
<data_field3>cccc</data_field3>
</row>
</document>
';
my $merged=
'<document>
<row>
<id>1</id>
<data_field1>aaaa</data_field1>
<data_field2>bbbb</data_field2>
<data_field3>cccc</data_field3>
</row>
</document>
';
$merged=~ s{\n}{}g; # remove \n's,
# if you want the result indented, look at the pretty_print option
is( merged( $d1, $d2), $merged, 'one test to rule them all');
done_testing();
sub merged
{
my @docs= map { XML::Twig->new->parse( $_) } @_;
my $merged= XML::Twig->new->parse( '<document></document>');
my %row_id; # hash id => row_element
foreach my $doc (@docs)
{ foreach my $row ($doc->root->children( 'row'))
{ my $eid= $row->first_child( 'id');
my $id= $eid->text;
# if the row hasn't been created in the merged doc, do it
if( ! $row_id{$id})
{ $row_id{$id}= $merged->root->insert_new_elt( last_child => 'row');
$row_id{$id}->insert_new_elt( last_child => id => $id);
}
# move the data fields to the end of the row
foreach my $data_field ($eid->next_siblings)
{ $data_field->move( last_child => $row_id{$id}); }
}
}
return $merged->sprint;
}
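For comparison, the same join-on-id idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a minimal sketch, not a port of the XML::Twig code: the function name merge_by_id is made up for illustration, and unlike the Perl version it copies every non-id child of a row, not just the siblings that follow the id element.

```python
import xml.etree.ElementTree as ET


def merge_by_id(*xml_docs):
    """Join <row> elements from several documents on their <id> child.

    merge_by_id is a hypothetical helper name used for this sketch only.
    """
    merged = ET.Element("document")
    rows_by_id = {}  # id text -> <row> element in the merged tree
    for xml_doc in xml_docs:
        for row in ET.fromstring(xml_doc).findall("row"):
            row_id = row.findtext("id")
            if row_id not in rows_by_id:
                # first time this id is seen: create the row and its <id>
                target = ET.SubElement(merged, "row")
                ET.SubElement(target, "id").text = row_id
                rows_by_id[row_id] = target
            for field in row:
                if field.tag != "id":
                    field.tail = None  # drop inter-element whitespace
                    rows_by_id[row_id].append(field)
    return ET.tostring(merged, encoding="unicode")


d1 = ("<document><row><id>1</id><data_field1>aaaa</data_field1>"
      "<data_field2>bbbb</data_field2></row></document>")
d2 = "<document><row><id>1</id><data_field3>cccc</data_field3></row></document>"
print(merge_by_id(d1, d2))
```

As in the Perl version, all documents are held in memory, and rows keep the order in which their ids are first seen.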
I have an XML document,
<resultsets>
<row>
<first_name>Georgi</first_name>
<last_name>Facello</last_name>
</row>
<row>
<first_name>Bezalel</first_name>
<last_name>Simmel</last_name>
</row>
<row>
<first_name>Bezalel</first_name>
<last_name>Hass</last_name>
</row>
</resultsets>
I want to sort first names and remove duplicated first names to produce this:
<resultsets>
<row>
<first_name>Bezalel</first_name>
<last_name>Simmel</last_name>
</row>
<row>
<first_name>Georgi</first_name>
<last_name>Facello</last_name>
</row>
</resultsets>
Here is the code I wrote:
for $last_name at $count1 in doc("employees.xml")//last_name,
$first_name at $count2 in doc("employees.xml")//first_name
let $f := $first_name
where ( $count1=$count2 )
group by $f
order by $f
return
<row>
{$f}
{$last_name}
</row>
However, this code sorts the XML document by first name but fails to remove the duplicated first name ('Bezalel'). It returns:
<resultsets>
<row>
<first_name>Bezalel</first_name>
<last_name>Simmel</last_name>
</row>
<row>
<first_name>Bezalel</first_name>
<last_name>Hass</last_name>
</row>
<row>
<first_name>Georgi</first_name>
<last_name>Facello</last_name>
</row>
</resultsets>
I know how to solve this using two FLWOR expressions. The group by behavior seems weird; could you please explain why it does not remove the duplicates?
Is there any way to solve this problem using ONE FLWOR loop and ONLY the two variables $first_name and $last_name? Thanks,
I would simply group the row elements by the first_name child and then output the first item in each group to ensure you don't get duplicates:
<resultsets>
{
for $row in resultsets/row
group by $fname := $row/first_name
order by $fname
return
$row[1]
}
</resultsets>
http://xqueryfiddle.liberty-development.net/jyyiVhf
As to how the group by clause works, see https://www.w3.org/TR/xquery-31/#id-group-by which says:
The group by clause assigns each pre-grouping tuple to a group, and
generates one post-grouping tuple for each group. In the post-grouping
tuple for a group, each grouping key is represented by a variable that
was specified in a GroupingSpec, and every variable that appears in
the pre-grouping tuples that were assigned to that group is
represented by a variable of the same name, bound to a sequence of all
values bound to the variable in any of these pre-grouping tuples.
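To make the quoted rule concrete, here is a rough Python analogy (a sketch only, not XQuery): each input row plays the role of a pre-grouping tuple, each dictionary entry plays the role of a post-grouping tuple, and the non-key variable ends up bound to the whole list of rows in that group. The names group_rows and first_row_per_group are made up for illustration.

```python
rows = [
    ("Georgi", "Facello"),
    ("Bezalel", "Simmel"),
    ("Bezalel", "Hass"),
]


def group_rows(rows):
    """Map each grouping key (first name) to the list of rows in its group."""
    groups = {}
    for first_name, last_name in rows:
        groups.setdefault(first_name, []).append((first_name, last_name))
    return groups


def first_row_per_group(rows):
    """Mimic 'group by ... order by ... return $row[1]'."""
    groups = group_rows(rows)
    # one result per key, keys sorted, first row of each group kept
    return [groups[key][0] for key in sorted(groups)]


print(group_rows(rows)["Bezalel"])   # both Bezalel rows live in one group
print(first_row_per_group(rows))
```

Grouping by itself only collects the rows that share a key; the duplicates disappear once a single item is picked from each group.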
Is there any way that I can use an expression inside another expression in FreeMarker?
Example:
XML file:
<Document>
<Row>
<item_date>01/01/2015</item_date>
</Row>
<Row>
<item_date>02/01/2015</item_date>
</Row>
</Document>
<#list 0..1 as i>
${Document.Row[${i}].item_date}
</#list>
I want to print as below
01/01/2015
02/01/2015
Any idea?
Thanks in Advance
Like this:
${Document.Row[i].item_date}
Note that if you are using an up-to-date version, you get this error message, which explains why:
You can't use "${" here as you are already in
FreeMarker-expression-mode. Thus, instead of ${myExpression}, just
write myExpression. (${...} is only needed where otherwise static text
is expected, i.e, outside FreeMarker tags and ${...}-s.)
I'm trying to store in an array all the unique XPaths of the lowest-level elements in the XML below, but the way I'm doing it, array a ends up holding all the XML rather than just the XPaths. The elements sit at different depths: some child elements have only two ancestors and others have more.
This is the code I have.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<items>
<item>
<name>Cake</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
<batter>Chocolate</batter>
<batter>Blueberry</batter>
<batter>Devil's Food</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Powdered Sugar</topping>
<topping>Chocolate with Sprinkles</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
<item>
<name>Raised</name>
<ppu>0.55</ppu>
<batters>
<batter>Regular</batter>
</batters>
<topping>None</topping>
<topping>Glazed</topping>
<topping>Sugar</topping>
<topping>Chocolate</topping>
<topping>Maple</topping>
</item>
</items>
EOT
a = []
a = doc.xpath("//*")
puts a
I'd like to store in array "a" only the unique xpaths as below:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
Maybe somebody could help me in how to do this.
Thanks for the help.
What you want to select is the "leaf" nodes. You can do it like so:
doc.xpath("//*[not(*)]")
This means "select all elements that don't contain elements".
If you want the XPaths, you'll need to call .path on each node. But the paths provided by Nokogiri have explicit positions (e.g. /items/item[2]/topping[4]), so you'll have to apply a regex to remove them, then remove duplicates with uniq:
doc.xpath("//*[not(*)]").map {|leaf| leaf.path.gsub(/\[.*?\]/, '') }.uniq
Output:
/items/item/name
/items/item/ppu
/items/item/batters/batter
/items/item/topping
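The same idea (visit the leaf elements, build a position-free path for each, keep the unique ones) can also be sketched without Nokogiri. The following is a minimal Python version using only the standard library; unique_leaf_paths is a hypothetical helper name, and the sample document is a shortened version of the one in the question.

```python
import xml.etree.ElementTree as ET


def unique_leaf_paths(xml_text):
    """Return the distinct position-free paths of all leaf elements,
    in the order in which they first appear in the document."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        children = list(element)
        if not children and path not in paths:
            paths.append(path)  # leaf: no child elements
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


sample = """<items>
  <item>
    <name>Cake</name><ppu>0.55</ppu>
    <batters><batter>Regular</batter><batter>Chocolate</batter></batters>
    <topping>None</topping><topping>Glazed</topping>
  </item>
  <item>
    <name>Raised</name><ppu>0.55</ppu>
    <batters><batter>Regular</batter></batters>
    <topping>None</topping>
  </item>
</items>"""

for path in unique_leaf_paths(sample):
    print(path)
```

Because the path is built from tag names while walking the tree, there are no positional predicates to strip afterwards.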
How can you ignore nodes that have a certain inner text when you don't know the inner text of the other nodes?
<row>
<column>test</column>
</row>
<row>
<column>???</column>
</row>
This is what I tried, but it didn't work:
row/column[not(.='test')]
row/column[.!='test']
row/column[not(text()='test')]
row/column[text()!='test']
row[column[text()!='test']]/column
This will get you the rows where the first <column> is not test.
//row[column[1][. != 'test']]
See http://www.xpathtester.com/obj/1ddc1930-ad7f-424c-9800-85df95fe6af3
(hit "Test!") to run it
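If the filtering happens in application code rather than in XPath, the same test can be written directly. The sketch below uses Python's standard library and assumes the rows sit under a single wrapper element (the table element here is invented so that the sample is well-formed); rows_without_value is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<table>"
    "<row><column>test</column></row>"
    "<row><column>other</column></row>"
    "</table>"
)


def rows_without_value(xml_text, unwanted="test"):
    """Return the <row> elements whose first <column> has a text value
    different from `unwanted`."""
    root = ET.fromstring(xml_text)
    kept = []
    for row in root.iter("row"):
        first_column = row.find("column")
        if first_column is None:
            continue  # a row without columns is skipped, as in the XPath
        # itertext() concatenates all nested text, like the XPath string value
        if "".join(first_column.itertext()) != unwanted:
            kept.append(row)
    return kept


for row in rows_without_value(xml_text):
    print(row.findtext("column"))
```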
I am navigating this Office Open XML file using XPath 1.0 (extract):
<sheetData ref="A1:XFD108">
<row spans="1:3" r="1">
<c t="s" r="A1">
<is>
<t>FirstCell</t>
</is>
</c>
<c t="s" r="C1">
<is>
<t>SecondCell</t>
</is>
</c>
</row>
<row spans="1:3" r="2">
<c t="s" r="A2">
<is>
<t>ThirdCell</t>
</is>
</c>
<c t="s" r="C2">
<is>
<t>[persons.ID]</t>
</is>
</c>
</row>
</sheetData>
I need to find the cell that says "[persons.ID]", which is a variable. Technically, I need to find the first <row> containing a descendant::t that starts with [ and closes with ]. I currently have:
.//row//t[starts-with(text(), '[') and
substring(text(), string-length(text())) = ']']/ancestor::row
So I filter and then go up again. It works, but I'd like to understand XPath better here: I found no way to do the filtering inside a predicate on row. Can you point me to a valid equivalent of something like .//row[descendant::t[starts-with()...]]?
Any help is greatly appreciated.
Technically, I need to find the first <row>
containing a descendant::t that
starts with [ and closes with ].
/sheetData/row[c/is/t[starts-with(.,'[')]
[substring(.,string-length(.))=']']]
[1]
or
/sheetData/row[.//t[starts-with(.,'[') and
substring(.,string-length(.))=']']][1]
or
(//row[.//t[starts-with(.,'[') and
substring(.,string-length(.))=']']])[1]
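For readers who want to check the logic outside an XPath engine, the predicate can be spelled out step by step. The following Python sketch uses only the standard library and assumes, like the extract above, that the elements carry no namespace (real Office Open XML files do, which would require qualified tag names); first_row_with_placeholder is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

sheet = """<sheetData ref="A1:XFD108">
  <row spans="1:3" r="1">
    <c t="s" r="A1"><is><t>FirstCell</t></is></c>
    <c t="s" r="C1"><is><t>SecondCell</t></is></c>
  </row>
  <row spans="1:3" r="2">
    <c t="s" r="A2"><is><t>ThirdCell</t></is></c>
    <c t="s" r="C2"><is><t>[persons.ID]</t></is></c>
  </row>
</sheetData>"""


def first_row_with_placeholder(xml_text):
    """Return the first <row> that has a descendant <t> whose text starts
    with '[' and ends with ']', or None if there is no such row."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        for t in row.iter("t"):
            text = "".join(t.itertext())
            # both conditions are checked on the same <t> element
            if text.startswith("[") and text.endswith("]"):
                return row
    return None


row = first_row_with_placeholder(sheet)
print(row.get("r") if row is not None else "no match")
```

The inner loop corresponds to the nested predicate on t, and returning at the first hit corresponds to the trailing [1].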
One option:
.//row[starts-with(descendant::t/text(),'[') and substring(descendant::t/text(), string-length(descendant::t/text())) = ']' ]
This returns a row, but it has one significant limitation: in XPath 1.0, a string function that is handed a node-set uses only the first node in document order, so only the first t in each row is tested. In the extract above, row 2 would not match, because its first t is ThirdCell.
Obviously, what you have doesn't have this problem.
Another option: use translate
.//row[translate(descendant::t/text(),"0123456789","") = "[]"]
That strips the digits and then simply compares what is left with []. Note that it only matches when everything between the brackets is numeric, so it would not match [persons.ID].