I'm trying to convert an XML document, organized in divisions and paragraphs, with page breaks and line breaks as milestones, into an XML document that wraps pages and lines in page and line elements.
To do this I'm trying to use util:get-fragment-between: first to get all the lines on a page into a fragment, and then to turn each line into a fragment of its own.
The first step works, but in the second step I get the following error, which I do not understand: org.exist.dom.memtree.ElementImpl cannot be cast to org.exist.dom.persistent.StoredNode
Below is the XQuery file, and below that an excerpt of the XML file I'm trying to convert.
xquery version "3.1";
let $doc := doc($docpath)
(: Build the first fragment, containing only the lines on one page :)
let $begp-node := $doc//tei:pb[@n="15-v"]
let $endp-node := $doc//tei:pb[@n="16-r"]
let $p-fragment := util:get-fragment-between($begp-node, $endp-node, $make-fragment, $display-root-namespace)
let $p-node := util:parse($p-fragment)
(: so far so good, print out of p-node gives me an xml document with just the text on page 15-v :)
(: next step. here I attempt to build a fragment for each line in the newly created page fragment :)
let $lines := $p-node//tei:lb
for $line at $pos in $lines
let $make-fragment1 := true()
let $display-root-namespace1 := true()
let $beginning-node := $line
let $ending-node := $line/following::tei:lb[1]
let $fragment := util:get-fragment-between($beginning-node, $ending-node, $make-fragment1, $display-root-namespace1)
let $node := util:parse($fragment)
return $node
I would expect $node to be a new xml document that just contains the line fragment. But instead I get the error:
org.exist.dom.memtree.ElementImpl cannot be cast to org.exist.dom.persistent.StoredNode
Here is an excerpt of the original document:
<p>
<lb ed="#L"/>dilectio <choice>
<orig>dependant</orig>
<reg>dependant</reg>
</choice> causaliter a cognitione tamen quaelibet obiecti apprehensio vel cognitio
<lb ed="#L"/>cum voluntatis libertate sufficit dilectionem causare <g ref="#slash"/> prima
probatur quia si non sequitur quod dilec
<lb ed="#L"/>tio
<lb ed="#L"/>posset poni seu elici naturaliter a voluntate seclusa omni cognitione consequens
est falsum
<pb ed="#L" n="15-v"/>
<lb ed="#L" n="1"/> quia tunc voluntas posset diligere in infinitum contra <ref>
<name ref="#Augustine">augustinum</name> in libro 8 2 10 <title ref="#deTrinitate">de
trinitate</title>
</ref> patet consequentia quia positis omnibus causis ad productionem <sic>ad productionem</sic>
alicuius effectus re
<lb ed="#L" n="2"/>quisitis
<lb ed="#L" n="3"/>omni alio secluso talis effectus posset naturaliter poni in esse <g
ref="#slash"/>2a pars probatur quia
<lb ed="#L" n="4"/>quia si sola obiecti cognitio etc sequitur quod stante iudicio vel
apprehensione alicuius
<lb ed="#L" n="5"/>obiecti sub ratione <corr>
<del rend="strikethrough">boni</del>
<add place="inLine">mali</add>
</corr> seclusa omnia existentia vel apparentia bonitatis
<lb ed="#L" n="6"/>voluntas posset tale obiectum velle vel diligere consequentia nota sed
consequens est contra <ref>
<name ref="#Aristotle">philosophum</name>
</ref> et <ref>
<name ref="#Averroes">commentatorem</name>
<lb ed="#L" n="7"/>primo <name ref="#Ethics">ethicorum</name>
</ref> quia omnia bonum appetunt
<p xml:id="pgb1q2-d1e3692">
<g ref="#pilcrow"/>primum corollarium
<lb ed="#L" n="8"/>
Any advice is much appreciated.
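The cast error suggests that util:get-fragment-between only accepts nodes stored in the database, whereas util:parse returns an in-memory tree, which is why the second call fails. This algorithm, although 3 times slower than the Java code, works in memory: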
(:~ trim the XML from $nodes $start to $end
: The algorithm is
: 1) find all the ancestors of the start node - $startParents
: 2) find all the ancestors of the end node- $endParents
: 3) recursively, starting with the common top we create a new element which is a copy of the element being trimmed by
: 3.1 copying all attributes
: 3.2 there are four cases depending on the node and the start and end edge nodes of the tree
: a) left and right nodes are the same - nothing else to copy
: b) both nodes are in the node's children - trim the start one, copy the intervening children and trim the end one
: c) only the start node is in the node's children - trim this node and copy the following siblings
: d) only the end node is in the node's children - copy the preceding siblings and trim the node
: attributes (currently in the fb namespace since it's not a TEI attribute) are added to trimmed nodes
: @param start - the element bounding the start of the subtree
: @param end - the element bounding the end of the subtree
:)
declare function fb:trim-node($start as node() ,$end as node()) {
let $startParents := $start/ancestor-or-self::*
let $endParents := $end/ancestor-or-self::*
let $top := $startParents[1]
return
fb:trim-node($top,subsequence($startParents,2),subsequence($endParents,2))
};
declare function fb:trim-node($node as node(), $start as node()*, $end as node()*) {
if (empty($start) and empty($end))
then $node (: leaf is untrimmed :)
else
let $startNode := $start[1]
let $endNode:= $end[1]
let $children := $node/node()
return
element {QName (namespace-uri($node), name($node))} { (: preserve the namespace :)
$node/@* , (: copy all the attributes :)
if ($startNode is $endNode) (: edge node is common :)
then fb:trim-node($startNode, subsequence($start,2),subsequence($end,2))
else
if ($startNode = $children and $endNode = $children) (: both in same subtree :)
then (fb:trim-node($startNode, subsequence($start,2),()), (: first the trimmed start node :)
(: then the siblings between start and end nodes :)
$startNode/following-sibling::node()
except $endNode/following-sibling::node()
except $endNode,
fb:trim-node($endNode, (), subsequence($end,2)) (: then the trimmed end node :)
)
else if ($startNode = $children) (: start node is in the children :)
then
( fb:trim-node($startNode, subsequence($start,2),()), (: first the trimmed start node :)
$startNode/following-sibling::node() (: then the following siblings :)
)
else if ($endNode = $children) (: end node is in the children :)
then
( $endNode/preceding-sibling::node(), (: the preceding siblings :)
fb:trim-node($endNode, (), subsequence($end,2)) (: then the trimmed end node :)
)
else ()
}
};
There is a comparison of four algorithms, including the Java one, using a demo application originally by joewiz, here: http://kitwallace.co.uk/Book/set/fragment-between/page
I wonder if there is a good way to speed up plotly rendering in Shiny. Currently I am trying to isolate the data for the plots and leave only two inputs to reactively change the plotly content.
What confuses me is that when I tested these plotly plots with fixed data (I did use isolate in Shiny), the plots updated quickly when I changed the two inputs. However, when I use the same logic in Shiny with isolate() and the two inputs, the plots are extremely slow.
Please help.
Here is my current code. Thanks.
observe({
##### Visualization - Daily Pattern #####
output$TTPlotB<-renderPlotly({
#input$file_load
#data_daily<-
#data_daily2<<-data_daily
date1<-input$dateRange[1]
date2<-input$dateRange[2]
journey1<-input$journey[1]
journey2<-input$journey[2]
plot_ly(isolate({Alloc()})%>%
filter(date_range>= date1 & date_range<=date2 )%>%
filter(Days_in_journey %in% ( journey1:journey2 ))%>%
group_by(date_range,OEM,daily_available)%>%
summarise(ngm=min(daily_avail_ngm), ### use the min based on the selected journey
product=min(daily_avail_product), ### use the min based on the selected journey
lp=min(daily_avail_lp), ### use the min based on the selected journey
cart=min(daily_avail_cart), ### use the min based on the selected journey
email=min(daily_avail_email))%>%
gather(channel,channel_daily_avail,-c(date_range,OEM,daily_available))%>%
select(-c(daily_available))%>%
spread(OEM,channel_daily_avail)%>%
group_by(date_range,channel)%>%
mutate(Daily_tot=sum(HP,Dell,Lenovo,Acer,Asus,Others))%>%
mutate(D_HP_avai=round(100*HP/(input$OEM.HP.Aud.wk*1000/7),0))%>%
mutate(D_Dell_avai=round(100*Dell/(input$OEM.Dell.Aud.wk*1000/7),0))%>%
mutate(D_Leno_avai=round(100*Lenovo/(input$OEM.Lenovo.Aud.wk*1000/7),0))%>%
mutate(D_Acer_avai=round(100*Acer/(input$OEM.Acer.Aud.wk*1000/7),0))%>%
mutate(D_Asus_avai=round(100*Asus/(input$OEM.Asus.Aud.wk*1000/7),0))%>%
mutate(D_Oth_avai=round(100*Others/(input$OEM.Others.Aud.wk*1000/7),0))%>%as.data.table()
,
height = 500, width = 1450,
x=~date_range,
y=~Daily_tot,
type="scatter",
color=~channel,
mode="lines+markers",
hoverinfo = 'text',
text = ~paste('</br> <B> Channel:</B>', channel
,
'</br> <B> Total Available Machines:</B>', formatC(Daily_tot, format="d", big.mark=","),
'</br> <B> Selected Date:</B>',date_range,
'</br> <B> HP: </B>',formatC(HP, format="d", big.mark=","),',',round(100*HP/Daily_tot,0),'%',
' <B> HP Daily:</B>',D_HP_avai,'%' ,
'</br> <B> Dell: </B>',formatC(Dell, format="d", big.mark=","),',',round(100*Dell/Daily_tot,0),'%',
' <B> Dell Daily:</B>',D_Dell_avai ,'%',
'</br> <B> Lenovo: </B>',formatC(Lenovo, format="d", big.mark=","),',',round(100*Lenovo/Daily_tot,0),'%',
' <B> Lenovo Daily:</B>',D_Leno_avai ,'%',
'</br> <B> Acer: </B>',formatC(Acer, format="d", big.mark=","),',',round(100*Acer/Daily_tot,0),'%',
' <B> Acer Daily:</B>', D_Acer_avai,'%',
'</br> <B> Asus: </B>',formatC(Asus, format="d", big.mark=","),',',round(100*Asus/Daily_tot,0),'%',
' <B> Asus Daily:</B>',D_Asus_avai ,'%',
'</br> <B> Others: </B>',formatC(Others, format="d", big.mark=","),',',round(100*Others/Daily_tot,0),'%',
' <B> Other Daily:</B>',D_Oth_avai ,'%'
)
)%>%
layout(xaxis = list(title = 'Dates'),
yaxis = list(title = 'Available Machines')
)
})
})
I am stuck at a place in my XQuery. In the response I want to get back the phone numbers for each type, i.e. HOME and CELL, but only the ones which have the highest sequence number. If there are two rows with PhoneType="HOME", I want back the phone with the higher Sequence of the two, i.e. in my case Sequence="3".
With my XQuery I am able to get back the first row with PhoneType="HOME", but I am not able to add a condition that checks the Sequence as well. Where and how can I add it? Please suggest. Thanks in advance.
Part of my XQuery:
<acc:phones>
{
for $PersonPhonesRow in $PersonMaintenanceResponse/ns2:CMPersonService/ns2:CMPersonDetails/ns2:PersonPhones/ns2:PersonPhonesRow[@PhoneType="HOME"][1]
return
if(fn:data($PersonMaintenanceResponse/ns2:CMPersonService/ns2:CMPersonDetails/ns2:PersonPhones/ns2:PersonPhonesRow/@PhoneType="HOME"[1]))
then
<com:phone>
<com:type>{'HOME'}</com:type>
<com:phoneNumber>{fn:data($PersonMaintenanceResponse/ns2:CMPersonService/ns2:CMPersonDetails/ns2:PersonPhones/ns2:PersonPhonesRow[@PhoneType="HOME"][1]/@PhoneNumber)}</com:phoneNumber>
<com:carrier>{fn:data($PersonMaintenanceResponse/ns2:CMPersonService/ns2:CMPersonDetails/ns2:PersonPhones/ns2:PersonPhonesRow[@PhoneType="HOME"][1]/@Extension)}</com:carrier>
</com:phone>
else
()
}
</acc:phones>
Request:
<CMPerson xmlns="http://splwg.com/CMPerson.xsd">
<CMPersonService>
<CMPersonDetails>
<PersonPhones>
<PersonPhonesHeader PersonID="1234567890" LastSequenceNumber="9"/>
<PersonPhonesRow PersonID="1234567890" Sequence="1" PhoneType="HOME" IntlPrefix="" PhoneNumber="(850) 123-0000" Extension="" Version="12" PhoneAlgorithmParamValue="(999) 999-9999"/>
<PersonPhonesRow PersonID="1234567890" Sequence="2" PhoneType="CELL" IntlPrefix="" PhoneNumber="(850) 000-0000" Extension="" Version="3" PhoneAlgorithmParamValue="(999) 999-9999"/>
<PersonPhonesRow PersonID="1234567890" Sequence="3" PhoneType="HOME" IntlPrefix="" PhoneNumber="(850) 123-1111" Extension="ATT" Version="1" PhoneAlgorithmParamValue="(999) 999-9999"/>
<PersonPhonesRow PersonID="1234567890" Sequence="4" PhoneType="BUSN" IntlPrefix="" PhoneNumber="(904) 111-1111" Extension="" Version="3" PhoneAlgorithmParamValue="(999) 999-9999"/>
</PersonPhones>
</CMPersonDetails>
</CMPersonService>
</CMPerson>
Response required:
<acc:phones>
<com:phone xmlns:com="******">
<com:type>HOME</com:type>
<com:phoneNumber>(850) 123-1111</com:phoneNumber>
<com:carrier>ATT</com:carrier>
</com:phone>
<com:phone xmlns:com="******">
<com:type>CELL</com:type>
<com:phoneNumber>(904) 111-1111</com:phoneNumber>
<com:carrier></com:carrier>
</com:phone>
</acc:phones>
Query to be used:
(: placeholder: bind this to your request document :)
let $PersonMaintenanceResponse := 'Your request'
let $rows := $PersonMaintenanceResponse/ns2:CMPersonService/ns2:CMPersonDetails/ns2:PersonPhones/ns2:PersonPhonesRow
return
<acc:phones>
{
(: one com:phone per distinct phone type, all inside a single acc:phones :)
for $each in distinct-values($rows/@PhoneType)
let $seq := $rows[@PhoneType = $each]
let $maxValue := max($seq/@Sequence)
(: "=" compares numerically here; "eq" would raise a type error on untyped attributes :)
let $maxrow := $seq[@Sequence = $maxValue]
return
<com:phone xmlns:com="*******">
<com:type>{$each}</com:type>
<com:phoneNumber>{data($maxrow/@PhoneNumber)}</com:phoneNumber>
<com:carrier>{data($maxrow/@Extension)}</com:carrier>
</com:phone>
}
</acc:phones>
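For comparison, the same "highest Sequence per PhoneType" selection can be sketched outside XQuery. This is only an illustration in Python with the standard library: the function name `latest_phones`, the `wanted` parameter, and the trimmed-down sample document are assumptions made for the example, not part of the original service.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the request shown in the question.
NS = {"p": "http://splwg.com/CMPerson.xsd"}

# Trimmed copy of the request: only the attributes the sketch needs.
SAMPLE = """<CMPerson xmlns="http://splwg.com/CMPerson.xsd">
  <CMPersonService><CMPersonDetails><PersonPhones>
    <PersonPhonesRow Sequence="1" PhoneType="HOME" PhoneNumber="(850) 123-0000" Extension=""/>
    <PersonPhonesRow Sequence="2" PhoneType="CELL" PhoneNumber="(850) 000-0000" Extension=""/>
    <PersonPhonesRow Sequence="3" PhoneType="HOME" PhoneNumber="(850) 123-1111" Extension="ATT"/>
    <PersonPhonesRow Sequence="4" PhoneType="BUSN" PhoneNumber="(904) 111-1111" Extension=""/>
  </PersonPhones></CMPersonDetails></CMPersonService>
</CMPerson>"""


def latest_phones(xml_text, wanted=("HOME", "CELL")):
    """Return {phone type: row attributes} for the highest Sequence per type."""
    root = ET.fromstring(xml_text)
    best = {}
    for row in root.iterfind(".//p:PersonPhonesRow", NS):
        kind = row.get("PhoneType")
        if kind not in wanted:
            continue
        # Compare numerically so that Sequence "10" beats "9".
        if kind not in best or int(row.get("Sequence")) > int(best[kind].get("Sequence")):
            best[kind] = row
    return {kind: dict(row.attrib) for kind, row in best.items()}


phones = latest_phones(SAMPLE)
print(phones["HOME"]["PhoneNumber"], phones["HOME"]["Extension"])
```

The numeric comparison is the important detail: comparing Sequence values as strings would rank "9" above "10".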
I have to take the right part and clean it, then compare it with the middle part and save it if they are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'
# Extending String class for blank? method
class String
def blank?
self.strip.empty?
end
end
# In
lines = CSV.read('entrada.csv')
lines.each do |line|
id = line[0].upcase.strip
left = line[1].upcase.strip
right = line[2].upcase.strip
eduardo = line[2].upcase.split(' ','de')
line[0] = id
line[1] = left
line[2] = right
line[4] = eduardo[0]+eduardo[1]
distance = Levenshtein.distance left, right
line << 99 if (left.blank? or right.blank?)
line << distance unless (left.blank? or right.blank?)
end
# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
lines.each do |line|
# counter = counter + 1 if line[3] <= 3
csv << line
end
end
# p counter
The middle column is the correct one; the right column is the one I should correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda |De | Leon -> Marquilla Petronilda
First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
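The joining step could look roughly like the sketch below. It is written in Python purely for illustration; the function name `join_composites` and the `LINKERS` set are assumptions, and the rule for a linker that appears right after the forename is just one possible heuristic.

```python
LINKERS = {"de", "y", "e"}  # the linking words named in the rules above


def join_composites(parts):
    """Merge each linking word and the word after it into a single field.

    With at least two fields already collected, the linker extends the
    previous field. With fewer, it starts a new surname instead, which
    covers the 'Angel De Leon' case.
    """
    merged = []
    i = 0
    while i < len(parts):
        word = parts[i]
        is_linker = word.lower() in LINKERS and i + 1 < len(parts)
        if is_linker and len(merged) > 1:
            # Attach "<linker> <next word>" to the previous field.
            merged[-1] = f"{merged[-1]} {word} {parts[i + 1]}"
            i += 2
        elif is_linker:
            # Only the forename (or nothing) precedes: start a new surname.
            merged.append(f"{word} {parts[i + 1]}")
            i += 2
        else:
            merged.append(word)
            i += 1
    return merged


print(join_composites(["Marquilla", "Gutierrez", "Petronilda", "De", "Leon"]))
print(join_composites(["Angel", "De", "Leon"]))
```

Names such as Maira Angelina de Leon remain ambiguous under this heuristic, since Angelina could be a second forename rather than the start of a surname.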
Once that's done, figuring out which part is which becomes easier.
name = {}
if parts.length == 1
  # A single part cannot be split into forename and surname.
  raise ArgumentError, "cannot parse a one-part name: #{parts.inspect}"
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
  name = {
    forename: parts[0],
    paternal_surname: parts[1]
  }
# forename paternal_surname maternal_surname
else
  # The forename can have multiple parts, so work from the
  # end and whatever's left is their forename.
  name[:maternal_surname] = parts.pop
  name[:paternal_surname] = parts.pop
  name[:forename] = parts.join(" ")
end
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. In particular, I'm pretty sure the handling of de is not that simple. For example...
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'], which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' }, or 'Marquilla Gutierrez Petronilda', who is married to someone whose paternal surname is Leon.
Good luck.
I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!
I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re

dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
current_dt = ""
current_dd = ""

with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            # split() without arguments also drops the trailing newline
            words = line[4:].split()
            tags = [w[1:] for w in words if w.startswith('#')]
            # Add the tags as a TAGS attribute on the <A> element
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join(tags) + '">', current_dt)
            # Keep the non-tag words as the description; drop an empty <DD>
            description = ' '.join(w for w in words if not w.startswith('#'))
            current_dd = '<DD>' + description if description else ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
    print(current_dt)
    print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
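The tag/description split can also be looked at in isolation. The helper below is a minimal sketch, not part of the script above: the name `split_tags` is made up for the example, and it assumes that tags always form the leading run of '#'-prefixed words on the <DD> line, as in the samples from the question.

```python
def split_tags(dd_text):
    """Split '#tag1 #tag2 Some description' into (tags, description).

    Only the leading run of '#'-prefixed words counts as tags, so a '#'
    that appears later in the description is left alone.
    """
    words = dd_text.split()
    tags = []
    while words and words[0].startswith("#") and len(words[0]) > 1:
        tags.append(words.pop(0)[1:])
    return tags, " ".join(words)


tags, description = split_tags("#iceland #tour #car #drive #self Self-driving tour of Iceland")
print('TAGS="{}"'.format(",".join(tags)))
print(description)
```

Note that joining the remaining words with single spaces collapses any repeated whitespace in the description.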
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descriptor: the first match of ">"
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
I'm parsing a simple XML file to create a flat text file from it. The desired outcome is shown below the sample XML. The XML has sort of a header-detail structure (Assembly_Info and Part respectively), with a unique header node followed by any number of detail record nodes, all of which are siblings. After digging into the elements under the header, I can't then find a way back 'up' to then pick up all the sibling detail nodes.
XML file looks like this:
<?xml version="1.0" standalone="yes" ?>
<Wrapper>
<Record>
<Product>
<prodid>4094</prodid>
</Product>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0000</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0455</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>045A</dev_name>
</Part>
</Assembly>
<Assembly>
<Assembly_Info>
<id>DF-7A</id>
<interface>C</interface>
</Assembly_Info>
<Part>
<status>N/A</status>
<dev_name>0002</dev_name>
</Part>
<Part>
<status>Ready</status>
<dev_name>0457</dev_name>
</Part>
</Assembly>
</Record>
</Wrapper>
For each Assembly I need to read the values of the two elements in Assembly_Info, which I do successfully. But I then want to read each of the Part records that are associated with the Assembly. The objective is to 'flatten' the file into this:
prodid id interface status dev_name
4094 DF-7A C N/A 0000
4094 DF-7A C Ready 0455
4094 DF-7A C Ready 045A
4094 DF-7A C N/A 0002
4094 DF-7A C Ready 0457
I'm attempting to use findnodes() to do this, as that's about the only tool I thought I understood. My code unfortunately reads all of the Part records from the entire file foreach Assembly--since the only way I've been able to find the Part nodes is to start at the root. I don't know how to change 'where I am', if you will; to tell findnodes to begin at current parent. Code looks like this:
my $parser = XML::LibXML -> new();
my $tree = $parser -> parse_file ('DEMO.XML');
for my $product ($tree->findnodes ('/Wrapper/Record/Product/prodid')) {
$prodid = $product->textContent();
}
foreach my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly')){
$assemblies++;
$parts = 0;
for my $assembly ($tree->findnodes ('/Wrapper/Record/Assembly/Assembly_Info')) {
$id = $assembly->findvalue('id');
$interface = $assembly->findvalue('interface');
}
foreach my $part ($tree->findnodes ('/Wrapper/Record/Assembly/Part')) {
$parts++;
$status = $part->findvalue('status');
$dev_name = $part->findvalue('dev_name');
}
print "Assembly No: ", $assemblies, " Parts: ",$parts, "\n";
}
How do I get just the Part nodes for a given Assembly, after I've gone down to the Assembly_Info depths? There is quite a bit I'm not getting, and I think a problem may be that I'm thinking of this as 'navigating' or moving a cursor, if you will. Examples of XPath path expressions have not helped me.
Instead of always using $tree as the starting point for the findnodes method, you can call it on any other node, including the child nodes you have already found. From there you can use a relative XPath expression, which is evaluated against that node only. For example:
for my $record ($tree->findnodes('/Wrapper/Record')) {
    for my $assembly ($record->findnodes('./Assembly')) {
        for my $part ($assembly->findnodes('./Part')) {
        }
    }
}
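The same nesting idea, searching relative to the node you are currently on, is sketched below end to end. It is written in Python with xml.etree.ElementTree rather than Perl, only to show the whole flattening in one runnable piece; the function name `flatten` and the shortened sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample file from the question.
SAMPLE = """<Wrapper><Record>
  <Product><prodid>4094</prodid></Product>
  <Assembly>
    <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
    <Part><status>N/A</status><dev_name>0000</dev_name></Part>
    <Part><status>Ready</status><dev_name>0455</dev_name></Part>
  </Assembly>
  <Assembly>
    <Assembly_Info><id>DF-7A</id><interface>C</interface></Assembly_Info>
    <Part><status>N/A</status><dev_name>0002</dev_name></Part>
  </Assembly>
</Record></Wrapper>"""


def flatten(xml_text):
    """Return one (prodid, id, interface, status, dev_name) tuple per Part."""
    rows = []
    root = ET.fromstring(xml_text)  # root is the <Wrapper> element
    for record in root.findall("Record"):
        prodid = record.findtext("Product/prodid")
        for assembly in record.findall("Assembly"):
            # These paths are relative to the current <Assembly> only.
            asm_id = assembly.findtext("Assembly_Info/id")
            interface = assembly.findtext("Assembly_Info/interface")
            for part in assembly.findall("Part"):
                rows.append((prodid, asm_id, interface,
                             part.findtext("status"),
                             part.findtext("dev_name")))
    return rows


for row in flatten(SAMPLE):
    print(" ".join(row))
```

Because each inner search starts at the current Assembly, a Part is only ever paired with the Assembly_Info of its own parent.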