To add values in cumulative format - xpath

I have an XML structure as follows:
<bookstore>
<book>
<name>story</name>
<price>50.00</price>
<author>smith</author>
</book>
<book>
<name>history</name>
<price>150.00</price>
<author>kelly</author>
</book>
<book>
<name>epic</name>
<price>300.00</price>
<author>jones</author>
</book>
</bookstore>
In the above example I want to add up the prices cumulatively:
The first book's price should be returned as is.
The second book's price should be added to the first book's price: 50.00 + 150.00 = 200.00
The third book's price should be added to the first and second books' prices: 50.00 + 150.00 + 300.00 = 500.00
The prices should then be returned as below:
<pricelist>
<price>50.00</price>
<price>200.00</price>
<price>500.00</price>
</pricelist>
Can anyone help me with this?
Thanks.

There are two ways to solve your problem. The first is to sum the prices of all preceding sibling books. This is easy to read and write, but it has O(n^2) complexity, so it does not scale to large inputs; it will be fine for rather small sets. (Complexity might be even worse, depending on how your XQuery processor resolves the preceding siblings.)
for $book in /bookstore/book
return
<price>{ sum(($book/price, $book/preceding-sibling::book/price)) }</price>
As a declarative programming language, XQuery has no mutable variables (ones you could update in a loop, for example). As an alternative, write a recursive function that carries the running total along and calculates the sums in O(n).
declare function local:sum($books, $sum) {
if ($books) then
let $price := $sum + $books[1]/price
return
(
<price>{ $price }</price>,
local:sum($books[position() > 1], $price)
)
else
()
};
local:sum(/bookstore/book, 0)
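For comparison, here is a minimal Python sketch of the same single-pass running total. It is not part of the XQuery answer; the inline BOOKSTORE string and the function name cumulative_prices are assumptions made for illustration, and Decimal is used only so that the two decimal places of the prices survive the addition.

```python
from decimal import Decimal
from itertools import accumulate
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the bookstore document from the question.
BOOKSTORE = """
<bookstore>
  <book><name>story</name><price>50.00</price><author>smith</author></book>
  <book><name>history</name><price>150.00</price><author>kelly</author></book>
  <book><name>epic</name><price>300.00</price><author>jones</author></book>
</bookstore>
"""


def cumulative_prices(xml_text):
    """Return the running totals of all book prices, in document order."""
    root = ET.fromstring(xml_text)
    prices = [Decimal(price.text) for price in root.findall("book/price")]
    # accumulate() carries the running sum forward, so each price is read once.
    return [str(total) for total in accumulate(prices)]


print(cumulative_prices(BOOKSTORE))  # ['50.00', '200.00', '500.00']
```

Like the recursive function, this visits each book once, so the cost grows linearly with the number of books.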

Related

XQuery: look for node with descendants in a certain order

I have an XML file that represents the syntax trees of all the sentences in a book:
<book>
<sentence>
<w class="pronoun" role="subject">
I
</w>
<wg type="verb phrase">
<w class="verb" role="verb">
like
</w>
<wg type="noun phrase" role="object">
<w class="adj">
green
</w>
<w class="noun">
eggs
</w>
</wg>
</wg>
</sentence>
<sentence>
...
</sentence>
...
</book>
This example is fake, but the point is that the actual words (the <w> elements) are nested in unpredictable ways based on syntactic relationships.
What I'm trying to do is find <sentence> nodes with <w> descendants matching particular criteria in a certain order. For example, I may be looking for a sentence with a w[@class='pronoun'] descendant followed by a w[@class='verb'] descendant.
It's easy to find sentences that just contain both descendants, without caring about ordering:
//sentence[descendant::w[criteria1] and descendant::w[criteria2]]
I did manage to figure out this query that does what I want, which looks for a <w> with a following <w> matching the criteria with the same closest <sentence> ancestor:
for $sentence in //sentence
where $sentence[descendant::w[criteria1 and
following::w[(ancestor::sentence[1] = $sentence) and criteria2]]]
return ...
...but unfortunately it's very slow, and I'm not sure why.
Is there a non-slow way to search for a node that contains descendants matching criteria in a certain order? I'm using XQuery 3.1 with BaseX. If I can't find a reasonable way to do this with XQuery, plan B is to do post-processing with Python.
The following axis is indeed expensive, as it spans all subsequent nodes of a document that are neither descendants nor ancestors.
The node comparison operators (<<, >>, is) may help you here. The code example below checks whether there is at least one verb that is followed by a noun:
for $sentence in //sentence
let $words1 := $sentence//w[@class = 'verb']
let $words2 := $sentence//w[@class = 'noun']
where some $w1 in $words1 satisfies
some $w2 in $words2 satisfies $w1 << $w2
return $sentence
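Since the question mentions Python post-processing as plan B, here is a rough sketch of the same document-order check using only the standard library. The helper name has_ordered_pair and the shortened BOOK sample are assumptions for illustration; the sketch relies on Element.iter() walking descendants in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the shape described in the question.
BOOK = """
<book>
  <sentence>
    <w class="pronoun" role="subject">I</w>
    <wg type="verb phrase">
      <w class="verb" role="verb">like</w>
      <wg type="noun phrase" role="object">
        <w class="adj">green</w>
        <w class="noun">eggs</w>
      </wg>
    </wg>
  </sentence>
  <sentence>
    <w class="noun">eggs</w>
    <w class="verb">matter</w>
  </sentence>
</book>
"""


def has_ordered_pair(sentence, first_class, second_class):
    """True if some <w class=first_class> precedes some <w class=second_class>."""
    seen_first = False
    for w in sentence.iter("w"):  # descendants, in document order
        # Check the second criterion before updating the flag, so one
        # element can never count as both halves of the pair.
        if seen_first and w.get("class") == second_class:
            return True
        if w.get("class") == first_class:
            seen_first = True
    return False


matches = [s for s in ET.fromstring(BOOK).iter("sentence")
           if has_ordered_pair(s, "verb", "noun")]
print(len(matches))  # 1: only the first sentence has a verb before a noun
```

Each sentence is scanned once, regardless of how deeply the words are nested.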

Select all nodes until a specific given node/tag

Given the following markup:
<div id="about">
<dl>
<dt>Date</dt>
<dd>1872</dd>
<dt>Names</dt>
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
<dt>Status</dt>
<dd>on</dd>
<dt>Another Field</dt>
<dd>X</dd>
<dd>Y</dd>
</dl>
</div>
I'm trying to extract all the <dd> nodes following <dt>Names</dt> but only until another <dt> starts. In this case, I'm after the following nodes:
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
I'm trying the following XPath code, but it's not working as intended.
xpath("//div[@id='about']/dl/dt[contains(text(),'Names')]/following-sibling::dd[not(following-sibling::dt)]/text()")
Any thoughts on how to fix it?
Many thanks.
Update: much simpler solution
Your situation has a useful property: the anchor item is always the first preceding sibling dt of the nodes you want. Because of that, the complex expression below can be written much more simply:
/div/dl/dd[preceding-sibling::dt[1][. = 'Names']]
In other words:
select any dd
that has a first preceding sibling dt (the preceding sibling axis counts backwards)
that itself has a value of "Names"
Tested in oXygen, it selects the nodes you wanted to select (and if you change "Names" to "Status" or "Another Field", it selects only the dd elements that follow that dt, up to the next dt).
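The same "nearest preceding dt" idea can be sketched in Python for cases where no full XPath engine is at hand (the standard library's ElementTree does not support the preceding-sibling axis). The function name dd_values_for and the inline ABOUT string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the markup from the question.
ABOUT = """
<div id="about">
  <dl>
    <dt>Date</dt><dd>1872</dd>
    <dt>Names</dt><dd>A</dd><dd>B</dd><dd>C</dd>
    <dt>Status</dt><dd>on</dd>
    <dt>Another Field</dt><dd>X</dd><dd>Y</dd>
  </dl>
</div>
"""


def dd_values_for(dl, label):
    """Collect the text of every <dd> whose nearest preceding <dt> is label."""
    values = []
    current_label = None
    for child in dl:  # direct children, in document order
        if child.tag == "dt":
            current_label = (child.text or "").strip()
        elif child.tag == "dd" and current_label == label:
            values.append(child.text)
    return values


dl = ET.fromstring(ABOUT).find("dl")
print(dd_values_for(dl, "Names"))  # ['A', 'B', 'C']
```

Remembering the most recent dt while walking the children plays the role of preceding-sibling::dt[1] in the XPath expression.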
Original complex solution (leaving in for reference)
This is far easier in XPath 2.0, but let's assume you can only use XPath 1.0. The trick is to count the number of preceding siblings from your anchor element (the one with "Names" in it), and disregard any that have the wrong count (i.e., when we cross over <dt>Status</dt>, the number of preceding siblings has increased).
For XPath 1.0, remove the comments between (: and :). Whitespace is insignificant in XPath, so you can spread the expression over multiple lines for readability, but comments are not available in 1.0.
/div/dl/dd
(: any dd having a dt before it with "Names" :)
[preceding-sibling::dt[. = 'Names']]
(: count the preceding siblings up to dt with "Names", add one to include 'self' :)
[count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1
=
(: compare with count of all preceding siblings :)
count(preceding-sibling::dt)]
As a one-liner:
/div/dl/dd[preceding-sibling::dt[. = 'Names']][count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1 = count(preceding-sibling::dt)]
How about this:
//dd[preceding-sibling::dt[contains(., 'Names')]][following-sibling::dt]

Xquery not returning desired values

I am trying to return a certain set of values, but the query is not quite returning what I would like. I would like to return records by the author "Hennie J. Steenhagen", grouped by year. However, it returns every record from any year in which Hennie has a record, not only Hennie's records.
For example, given the records <www><author>Hennie*</author><year>1990</year></www> and <www><author>Derpie</author><year>1990</year></www>, the query returns both records grouped under the year 1990; I would like only Hennie's to be returned.
for $y in /*/*/year where $y/../author ="Hennie J. Steenhagen" return <year-Pub>{$y}{/*/*[year = $y]}</year-Pub>
Your question is quite difficult to understand because your XPath addresses a larger XML node tree than the example XML you have provided. For the example, however, I will assume that your records are wrapped in an element named record. The output of your query also does not make a lot of sense to me, but I will assume that you know what you want!
Given the XML:
<record>
<www>
<author>Hennie J. Steenhagen</author>
<year>1990</year>
</www>
<www>
<author>Derpie</author>
<year>1990</year>
</www>
</record>
If you have an XQuery 3.0 processor, you could use the following:
/record/www[author = "Hennie J. Steenhagen"] ! <year-Pub>{year}{.}</year-Pub>
If you only have access to an XQuery 1.0 processor, then you could fall-back to the following:
for $w in /record/www[author = "Hennie J. Steenhagen"]
return
<year-Pub>{$w/year}{$w}</year-Pub>
Both of my examples use a single predicate, which filters the data only once, whereas your self-found solution uses both a predicate and a where clause, and so has to filter the data twice.
Fixed it:
for $y in /*/*/year where $y/../author ="Hennie J. Steenhagen" and /*/*[year=$y] return <year-Pub>{$y/../*}</year-Pub>
Thanks to anyone who spent time looking.

What would you use for `n to n` relations in python?

After fiddling around with dictionaries, I came to the conclusion that I need a data structure that allows an n-to-n lookup. One example would be: a course can be attended by several students, and each student can attend several courses.
What would be the most Pythonic way to achieve this? To stay with the example, there won't be more than 500 students and 100 courses, so I would like to avoid using real database software.
Thanks!
Since your working set is small, I don't think it is a problem to just store the student IDs as lists in the Course class. Finding students in a class would be as simple as doing
course.studentIDs
To find courses a student is in, just iterate over the courses and find the ID:
studentIDToGet = "johnsmith001"
studentsCourses = list()
for course in courses:
    if studentIDToGet in course.studentIDs:
        studentsCourses.append(course.id)
There are other ways you could do it. You could have a dictionary mapping studentIDs to courseIDs, or two dictionaries (one mapping studentIDs to courseIDs and another mapping courseIDs to studentIDs) that are kept in sync whenever either is updated.
The implementation I wrote out would probably be the slowest, which is why I mentioned that your working set is small enough for that not to be a problem. The other implementations I mentioned would need more code to work, which isn't worth the effort here.
It depends completely on what operations you want the structure to be able to carry out quickly.
If you want to be able to quickly look up properties related to both a course and a student, for example how many hours a student has spent on studies for a specific course, or what grade the student has in the course if he has finished it, and if he has finished it etc. a vector containing n*m elements is probably what you need, where n is the number of students and m is the number of courses.
If, on the other hand, the average number of courses a student has taken is much smaller than the total number of courses (which it probably is in a real-world scenario), and you want to quickly look up all the courses a student has taken, you probably want an array of n lists. Whether those are linked lists, resizable vectors or something similar depends on what you want to be able to do with the lists: quickly remove elements from the middle, or quickly access an element at a random location. If you want both fast removal from the middle and fast random access, then some kind of tree structure may suit you best.
Most tree data structures carry out all basic operations in time logarithmic in the number of elements in the tree. Beware that some tree data structures have a worst-case time for these operations that is linear in the number of elements, even though the average time for a randomly constructed tree is logarithmic. A typical example is an unbalanced binary search tree built by inserting elements in increasing order. Don't do that. Either shuffle the elements before building the tree, or use divide and conquer: split the sorted list into a left part, a pivot element and a right part, make the pivot the root, then recursively build trees from the left and right parts in the same way and attach them as the root's left and right children.
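To make the difference concrete, here is a small Python sketch (the answer itself is language-agnostic; Node, insert, build_balanced and height are names chosen for illustration). It compares a plain binary search tree filled with increasing keys against one built by the divide-and-conquer method from the same sorted keys.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key):
    """Plain (unbalanced) binary search tree insertion."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node


def build_balanced(sorted_keys):
    """Divide and conquer: the middle key becomes the root of each subtree."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return Node(sorted_keys[mid],
                build_balanced(sorted_keys[:mid]),
                build_balanced(sorted_keys[mid + 1:]))


def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


keys = list(range(1, 16))  # 15 keys, already in increasing order

degenerate = None
for key in keys:
    degenerate = insert(degenerate, key)

print(height(degenerate))            # 15: the tree is effectively a linked list
print(height(build_balanced(keys)))  # 4: logarithmic in the number of keys
```

A lookup in the first tree may have to walk all 15 nodes, while the second never needs more than 4 steps.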
I'm sorry, I don't know Python, so I don't know which data structures are part of the language and which you have to create yourself.
I assume you want to index both the Students and Courses. Otherwise you can easily make a list of tuples to store all Student,Course combinations: [ (St1, Crs1), (St1, Crs2) .. (St2, Crs1) ... (Sti, Crsi) ... ] and then do a linear lookup every time you need to. For up to 500 students this isn't bad either.
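A minimal sketch of that list-of-tuples approach, with made-up IDs and hypothetical helper names (courses_of, students_in):

```python
# Every enrollment is one (student, course) pair; the IDs are made up.
enrollments = [
    ("st1", "crs1"),
    ("st1", "crs2"),
    ("st2", "crs1"),
]


def courses_of(student, pairs):
    """Linear scan: all courses the given student is enrolled in."""
    return [c for s, c in pairs if s == student]


def students_in(course, pairs):
    """Linear scan: all students enrolled in the given course."""
    return [s for s, c in pairs if c == course]


print(courses_of("st1", enrollments))    # ['crs1', 'crs2']
print(students_in("crs1", enrollments))  # ['st1', 'st2']
```

Every lookup walks the whole list, which is acceptable at the sizes mentioned in the question.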
However, if you'd like a quick lookup either way, there is no built-in data structure for it. You can simply use two dictionaries:
courses = { crs1: [ st1, st2, st3 ], crs2: [ st_i, st_j, st_k] ... }
students = { st1: [ crs1, crs2, crs3 ], st2: [ crs_i, crs_j, crs_k] ... }
For a given student s, looking up courses is now students[s]; and for a given course c, looking up students is courses[c].
For something simple like what you want to do, you could create a simple class with data members and methods to maintain them and keep them consistent with each other. For this problem two dictionaries would be needed. One keyed by student name (or id) that keeps track of the courses each is taking, and another that keeps track of which students are in each class.
defaultdicts from the 'collections' module could be used instead of plain dicts to make things more convenient. Here's what I mean:
from collections import defaultdict


class Enrollment:
    def __init__(self):
        self.students = defaultdict(set)  # student -> set of courses
        self.courses = defaultdict(set)   # course -> set of students

    def clear(self):
        self.students.clear()
        self.courses.clear()

    def enroll(self, student, course):
        if student not in self.courses[course]:
            self.students[student].add(course)
            self.courses[course].add(student)

    def drop(self, course, student):
        # note: the argument order is (course, student), the reverse of enroll()
        if student in self.courses[course]:
            self.students[student].remove(course)
            self.courses[course].remove(student)
            # remove student if they are not taking any other courses
            if not self.students[student]:
                del self.students[student]

    def display_course_enrollments(self):
        print("Class Enrollments:")
        for course in self.courses:
            # sorted() makes the output independent of set ordering
            print('  course:', course, sorted(self.courses[course]))

    def display_student_enrollments(self):
        print("Student Enrollments:")
        for student in self.students:
            print('  student', student, sorted(self.students[student]))


if __name__ == '__main__':
    school = Enrollment()
    school.enroll('john smith', 'biology 101')
    school.enroll('mary brown', 'biology 101')
    school.enroll('bob jones', 'calculus 202')
    school.display_course_enrollments()
    print()
    school.display_student_enrollments()
    school.drop('biology 101', 'mary brown')
    print()
    print('After mary brown drops biology 101:')
    print()
    school.display_course_enrollments()
    print()
    school.display_student_enrollments()
When run with Python 3, this produces the following output:
Class Enrollments:
  course: biology 101 ['john smith', 'mary brown']
  course: calculus 202 ['bob jones']

Student Enrollments:
  student john smith ['biology 101']
  student mary brown ['biology 101']
  student bob jones ['calculus 202']

After mary brown drops biology 101:

Class Enrollments:
  course: biology 101 ['john smith']
  course: calculus 202 ['bob jones']

Student Enrollments:
  student john smith ['biology 101']
  student bob jones ['calculus 202']

Word comparison algorithm

I am doing a CSV Import tool for the project I'm working on.
The client needs to be able to enter the data in excel, export them as CSV and upload them to the database.
For example I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting.
I plan to do this by comparing the company names in the database with the company names in the CSV.
The comparison should return 0 if the strings are exactly the same, and a value that grows as the strings become more different. strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but
"Acme Company" and "Cmea Mpnyaco" should have a very big difference index
Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different.
Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt him to choose the name he most probably meant.
Is there a known algorithm to do this, or maybe we can invent one? :)
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
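For readers unfamiliar with it, here is a minimal Python sketch of the Levenshtein distance using the usual row-by-row dynamic programming. It is only meant to show what the number means; PHP, which the asker turns out to be using, ships its own levenshtein() function.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("Acme Company", "Acme Company"))  # 0
print(levenshtein("Acme Company", "Acme Comapny"))  # 2 (a swap counts as two edits)
print(levenshtein("kitten", "sitting"))             # 3
```

Note that swapping two whole words costs far more than swapping two letters, so plain edit distance on the full string does not give 0 for "Company Acme".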
I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.
I've had some success with the Levenshtein distance algorithm; there is also Soundex.
What language are you implementing this in? We may be able to point to specific examples.
I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).
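One simple way to neutralize word reordering is to sort the words before comparing. This is not the modification described above, just a rough Python sketch under stated assumptions: the helper names are made up, and the standard library's difflib.SequenceMatcher stands in for an edit-distance function, so the result is a similarity between 0 and 1 rather than a distance.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, split on whitespace and sort the words (hypothetical helper)."""
    return " ".join(sorted(name.lower().split()))


def order_insensitive_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical once word order is ignored."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(order_insensitive_similarity("Acme Company", "Company Acme"))  # 1.0
typo = order_insensitive_similarity("Acme Company", "Acme Comapny")
scrambled = order_insensitive_similarity("Acme Company", "Cmea Mpnyaco")
print(typo > scrambled)  # True
```

Sorting is cheap, but it only helps when the words themselves are spelled similarly enough to sort into the same positions; a word-to-word matching fallback is more robust when they are not.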
I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Entire blog post here.
There are multiple algorithms that do just that, and most databases even include one by default. It is actually quite a common concern.
If it's just about English words, SQL Server, for example, includes SOUNDEX, which can be used to compare words by how they sound.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
I'm implementing it in PHP, and I am now writing a piece of code that will break two strings up into words and compare each word of the first string with the words of the second string using levenshtein(), accepting the lowest possible values. I'll post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
    // prepare the words
    $words1 = explode( " ", preg_replace( "/\s+/", " ", trim( $str1 ) ) );
    $words2 = explode( " ", preg_replace( "/\s+/", " ", trim( $str2 ) ) );
    $found = array(); // indexes of words in $words2 that are already matched, so we don't check them again
    $score = 0; // total score

    // In my case, strings that have a different number of words can be good matches too
    // For example, Acme Company and International Acme Company Ltd. are the same thing
    // I will just add the word count difference to the total score, and weigh it more later if needed
    // abs() keeps the penalty positive whichever string is longer
    $wordDiff = abs( count( $words1 ) - count( $words2 ) );

    foreach( $words1 as $word1 )
    {
        $minlevIndex = -1;
        $minlev = PHP_INT_MAX;
        foreach( $words2 as $index => $word2 )
        {
            if( in_array( $index, $found ) )
                continue;
            $lev = levenshtein( $word1, $word2 );
            if( $lev < $minlev )
            {
                $minlev = $lev;
                $minlevIndex = $index;
            }
        }
        // stop when every word of $words2 is already matched;
        // the leftover words are covered by $wordDiff
        if( $minlevIndex < 0 )
            break;
        $score += $minlev;
        $found[] = $minlevIndex;
    }
    return $score + $wordDiff;
}

Resources