Help with XPath query (count)

I want to count all descendant-or-self nodes of a certain node, but only the descendants down to a certain level, counting from 0. Do you have any suggestions?
Basically, something like:
count(//fstructure/node()) + count(//fstructure/node()/node()) + count(//fstructure/node()/node()/node()) + 1
works for 3 levels and the (element) node "fstructure". It's not really nice, but I just need it for debugging.
best regards,
Johannes

This XPath expression:
count(
ExprForYourNode//*
[not(count(ancestor::* )
>
count(ExprForYourNode/ancestor::*) + 2
)
]
)
counts all descendant elements of the element selected by the expression ExprForYourNode whose depth relative to it is at most 2 (zero-based).
If you want to count all descendant nodes (elements, text nodes, comment nodes and processing-instruction nodes), use:
count(
ExprForYourNode//node()
[not(count(ancestor::* )
>
count(ExprForYourNode/ancestor::*) + 2
)
]
)
For example with this document:
<t>
<a>
<b>
<c>
<d/>
</c>
</b>
</a>
</t>
this expression:
count(
/*/a//*
[not(count(ancestor::* )
>
count(/*/a/ancestor::*) + 2
)
]
)
produces:
2
This is the number of elements (b and c, but not d) that are descendants of a at depth 2 or less relative to a.
Similarly, the evaluation of this expression:
count(
/*/a//node()
[not(count(ancestor::* )
>
count(/*/a/ancestor::*) + 2
)
]
)
produces:
6
That is the number of elements (as before) plus the number of (white-space-only) text nodes at depth up to 2 relative to the element a.
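If it helps to see the expression in action, here is a small Python sketch (assuming the lxml library is available; variable names are mine) that evaluates the element-counting variant against the sample document:
from lxml import etree

doc = etree.fromstring(
    "<t>\n <a>\n  <b>\n   <c>\n    <d/>\n   </c>\n  </b>\n </a>\n</t>")

# Count descendant elements of <a> whose depth relative to <a> is at most 2.
expr = ("count(/*/a//*"
        "[not(count(ancestor::*) > count(/*/a/ancestor::*) + 2)])")
print(doc.xpath(expr))  # 2.0 (the elements b and c, but not d)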


Generating correct phrases from PEG grammars

I wrote a PEG parser generator just for fun (I will publish it on NPM some time), and thought it would be easy to add a randomised phrase generator on top of it. The idea is to automatically get correct phrases, given a grammar. So I set the following rules to generate strings from each type of parser:
Sequence p1 p2 ... pn : Generate a phrase for each subparser and return the concatenation.
Alternative p1 | p2 | ... | pn : Randomly pick a subparser and generate a phrase with it.
Repetition p{n, m} : Pick a number x in [n, m] (or [n, n+2] if m === Infinity) and return a concatenation of x generated phrases from p.
Terminal : Just return the terminal literal.
When I take the following grammar:
S: NP VP
PP: P NP
NP: Det N | Det N PP | 'I'
VP: V NP | VP PP
V: 'shot' | 'killed' | 'wounded'
Det: 'an' | 'my'
N: 'elephant' | 'pajamas' | 'cat' | 'dog'
P: 'in' | 'outside'
It works great. Some examples:
my pajamas killed my elephant
an pajamas wounded my pajamas in my pajamas
an dog in I wounded my cat in I outside my elephant in my elephant in an pajamas outside an cat
I wounded my pajamas in my dog
This grammar has a recursion (PP: P NP > NP: Det N PP). When I take this other recursive grammar, for math expressions this time:
expr: term (('+' | '-') term)*
term: fact (('*' | '/') fact)*
fact: '1' | '(' expr ')'
Roughly one time in two, I get a "Maximum call stack size exceeded" error (in NodeJS). The other half of the time, I get correct expressions:
( 1 ) * 1 + 1
( ( 1 ) / ( 1 + 1 ) - ( 1 / ( 1 * 1 ) ) / ( 1 / 1 - 1 ) ) * 1
( ( ( 1 ) ) )
1
1 / 1
I guess the recursive production for fact gets called too often, too deep in the call stack, and this makes the whole thing blow up.
How can I make my approach less naive in order to avoid those cases that explode the call stack? Thank you.
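For reference, here is a minimal Python sketch of the naive random generator described above (the grammar encoding and the function name generate are made up for illustration); with the recursive fact production it can recurse arbitrarily deep and raise a RecursionError, the Python analogue of Node's "Maximum call stack size exceeded":
import random

# Hypothetical grammar encoding: each non-terminal maps to a list of
# alternatives; an alternative is a sequence of terminals or non-terminals.
grammar = {
    '<expr>': [['<term>'], ['<term>', '+', '<expr>'], ['<term>', '-', '<expr>']],
    '<term>': [['<fact>'], ['<fact>', '*', '<term>'], ['<fact>', '/', '<term>']],
    '<fact>': [['1'], ['(', '<expr>', ')']],
}

def generate(symbol):
    if symbol not in grammar:
        return symbol                              # terminal: return the literal
    alternative = random.choice(grammar[symbol])   # alternative: pick one branch
    return ''.join(generate(part) for part in alternative)   # sequence: concatenate

print(generate('<expr>'))  # may recurse very deep and raise RecursionError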
Of course if a grammar describes arbitrarily long inputs, you can easily end up in a very deep recursion. A simple way to avoid this trap is to keep a priority queue of partially expanded sentential forms where the key is length. Remove the shortest and replace each non-terminal in each possible way, emitting those that are now all terminals and adding the rest back onto the queue. You might also want to maintain an "already emitted" set to avoid emitting duplicates. If the grammar doesn't have anything like epsilon productions where a sentential form derives a shorter string, then this method produces all strings described by the grammar in non-decreasing length order. That is, once you've seen an output of length N, all strings of length N-1 and shorter have already appeared.
Since OP asked about details, here's an implementation for the expression grammar. It's simplified by rewriting the PEG as a CFG.
import heapq

def run():
    g = {
        '<expr>': [
            ['<term>'],
            ['<term>', '+', '<expr>'],
            ['<term>', '-', '<expr>'],
        ],
        '<term>': [
            ['<fact>'],
            ['<fact>', '*', '<term>'],
            ['<fact>', '/', '<term>'],
        ],
        '<fact>': [
            ['1'],
            ['(', '<expr>', ')'],
        ],
    }
    gen(g)

def is_terminal(s):
    # A sentential form is terminal when it contains no '<...>' symbols.
    for sym in s:
        if sym.startswith('<'):
            return False
    return True

def gen(g, lim=10000):
    # Priority queue of partially expanded sentential forms, keyed by length.
    q = [(1, ['<expr>'])]
    n = 0
    while n < lim:
        _, s = heapq.heappop(q)
        # print("pop: " + ''.join(s))
        a = []          # symbols already scanned, left of the expansion point
        b = s.copy()    # symbols still to scan
        while b:
            sym = b.pop(0)
            if sym.startswith('<'):
                for rhs in g[sym]:
                    s_new = a.copy()
                    s_new.extend(rhs)
                    s_new.extend(b)
                    if is_terminal(s_new):
                        print(''.join(s_new))
                        n += 1
                    else:
                        # print("push: " + ''.join(s_new))
                        heapq.heappush(q, (len(s_new), s_new))
                break  # only generate leftmost derivations
            a.append(sym)

run()
Uncomment the extra print()s to see heap activity. Some example output:
1
(1)
1*1
1/1
1+1
1-1
((1))
(1*1)
(1/1)
(1)*1
(1)+1
(1)-1
(1)/1
(1+1)
(1-1)
1*(1)
1*1*1
1*1/1
1+(1)
1+1*1
1+1/1
1+1+1
1+1-1
1-(1)
1-1*1
1-1/1
1-1+1
1-1-1
1/(1)
1/1*1
1/1/1
1*1+1
1*1-1
1/1+1
1/1-1
(((1)))
((1*1))
((1/1))
((1))*1
((1))+1
((1))-1
((1))/1
((1)*1)
((1)+1)
((1)-1)
((1)/1)
((1+1))
((1-1))
(1)*(1)
(1)*1*1
(1)*1/1
(1)+(1)
(1)+1*1

Neo4j: why is the allShortestPaths function so slow?

I am using Neo4j version 'neo4j-community-2.3.0-RC1'.
In my database there are just 1054 nodes.
When I do a path query with the 'allShortestPaths' function, why is it so slow?
It takes more than 1 second; the following is the unit test result:
√ search optimalPath Path (192ms)
√ search optimal Path by Lat Lng (1131ms)
Should I optimize the queries? The following are the queries for 'optimalPath' and 'optimal Path by Lat Lng'.
optimalPath query:
MATCH path=allShortestPaths((start:潍坊_STATION )-[rels*..50]->(end:潍坊_STATION {name:"火车站"}))
RETURN NODES(path) AS stations,relationships(path) AS path,length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1) WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
order by transfer_count,walk_count,stop_count
optimal Path by Lat Lng query:
MATCH path=allShortestPaths((start:潍坊_STATION {name:"公交总公司"})-[rels*..50]->(end:潍坊_STATION {name:"火车站"}))
WHERE
round(
6378.137 *1000*2*
asin(sqrt(
sin((radians(start.lat)-radians(36.714))/2)^2+cos(radians(start.lat))*cos(radians(36.714))*
sin((radians(start.lng)-radians(119.1268))/2)^2
))
)/1000 < 0.5 // this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
RETURN NODES(path) AS stations,relationships(path) AS path,length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1) WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
order by transfer_count,walk_count,stop_count
You can download the database here: https://www.dropbox.com/s/zamkyh2aaw3voe6/data.rar?dl=0
I will be very grateful if anybody can help me. Thanks.
In general, without knowing more, I would pull out the predicates and expressions that can be computed before all the paths are expanded, i.e. before the match.
And since your geo-filter depends only on your parameters and the start node, you can do:
MATCH (start:潍坊_STATION {name:"公交总公司"})
WHERE
// this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
round(6378.137 *1000*2*
asin(sqrt(sin((radians(start.lat)-radians({lat}))/2)^2
+cos(radians(start.lat))*cos(radians({lat}))*
sin((radians(start.lng)-radians({lng}))/2)^2)))/1000
< 0.5
MATCH (end:潍坊_STATION {name:"火车站"})
MATCH path=allShortestPaths((start)-[rels*..50]->(end))
RETURN NODES(path) AS stations,
relationships(path) AS path,
length(path) AS stop_count,
length(FILTER(index IN RANGE(1, length(rels)-1)
WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
ORDER BY transfer_count,walk_count,stop_count;
see this test (but the other query is equally fast):
neo4j-sh (?)$ MATCH (start:潍坊_STATION {name:"公交总公司"})
>
> WHERE
> // this formula is used to calculate the distance between two GEO coordinate (latitude\longitude)
> round(6378.137 *1000*2*
> asin(sqrt(sin((radians(start.lat)-radians({lat}))/2)^2
> +cos(radians(start.lat))*cos(radians({lat}))*
> sin((radians(start.lng)-radians({lng}))/2)^2)))/1000
> < 0.5
>
> MATCH (end:潍坊_STATION {name:"火车站"})
> MATCH path=allShortestPaths((start)-[rels*..50]->(end))
> WITH NODES(path) AS stations,
> relationships(path) AS path,
> length(path) AS stop_count,
> length(FILTER(index IN RANGE(1, length(rels)-1)
> WHERE (rels[index]).bus <> (rels[index - 1]).bus)) AS transfer_count,
> length(FILTER( rel IN rels WHERE type(rel)="WALK" )) AS walk_count
>
> ORDER BY transfer_count,walk_count,stop_count
> RETURN count(*);
+----------+
| count(*) |
+----------+
| 320 |
+----------+
1 row
10 ms
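As an aside, the distance filter in the WHERE clause is just the haversine formula. Here is a small Python sketch of the same computation for checking the numbers (the function name and the example coordinates are mine; the Earth radius 6378.137 km is taken from the query):
import math

def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance in km between two latitude/longitude points,
    # using the same Earth radius (6378.137 km) as the Cypher WHERE clause.
    dlat = math.radians(lat1) - math.radians(lat2)
    dlng = math.radians(lng1) - math.radians(lng2)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlng / 2) ** 2)
    return 6378.137 * 2 * math.asin(math.sqrt(a))

# A station passes the filter when it lies within 0.5 km of the query point
# (made-up nearby coordinates, purely for illustration).
print(haversine_km(36.714, 119.1268, 36.7142, 119.1300) < 0.5)  # True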

Google codejam APAC Test practice round: Parentheses Order

I spent one day on this problem and couldn't find a solution that passes the large dataset.
Problem
An n parentheses sequence consists of n "("s and n ")"s.
Now, we have all valid n parentheses sequences. Find the k-th smallest sequence in lexicographical order.
For example, here are all valid 3 parentheses sequences in lexicographical order:
((()))
(()())
(())()
()(())
()()()
Given n and k, write an algorithm to give the k-th smallest sequence in lexicographical order.
For the large dataset: 1 ≤ n ≤ 100 and 1 ≤ k ≤ 10^18
This problem can be solved using dynamic programming.
Let dp[n][m] = the number of valid parenthesis sequences that can be formed with n open brackets and m close brackets remaining.
Base case:
dp[0][a] = 1 (a >=0)
Fill in the matrix using the base case:
dp[n][m] = dp[n - 1][m] + (n < m ? dp[n][m - 1]:0 );
Then we can build the k-th sequence step by step.
Start with a = n open brackets and b = n close brackets remaining, and an empty current result.
while (a > 0 or b > 0):
    If dp[a][b] >= k:
        If a > 0 and dp[a - 1][b] >= k:
            * Append an open bracket '(' to the current result
            * Decrease a
        Else:
            // skip the dp[a - 1][b] smaller sequences that start with '(' (treat this as 0 when a == 0)
            * Adjust the value of k: k -= dp[a - 1][b]
            * Append a close bracket ')'
            * Decrease b
    Else: k is invalid (there are fewer than k valid sequences)
Notice that an open bracket is lexicographically smaller than a close bracket, so we always try to add an open bracket first.
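Here is one possible Python implementation of the approach described above (the function and variable names are mine); Python's arbitrary-precision integers handle n up to 100 and k up to 10^18 without trouble:
def kth_sequence(n, k):
    # dp[a][b] = number of valid ways to finish a sequence with a '(' and b ')'
    # still to be placed; a ')' may only be placed while b > a.
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for b in range(n + 1):
        dp[0][b] = 1                       # only ')' remain: exactly one way
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            dp[a][b] = dp[a - 1][b] + (dp[a][b - 1] if a < b else 0)
    if k > dp[n][n]:
        return None                        # fewer than k valid sequences exist
    a, b, out = n, n, []
    while a > 0 or b > 0:
        opens = dp[a - 1][b] if a > 0 else 0   # sequences continuing with '('
        if opens >= k:
            out.append('(')
            a -= 1
        else:
            k -= opens                     # skip all sequences starting with '('
            out.append(')')
            b -= 1
    return ''.join(out)

print(kth_sequence(3, 3))  # (())()
For n = 3 this reproduces the listing above: k = 1 gives ((())) and k = 5 gives ()()().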
Let S be any valid sequence of parentheses formed from n '(' and n ')'.
Now any valid sequence S can be written as S = X + Y where
X = a valid prefix, i.e. traversing X from left to right, at any point of time the number of '(' >= the number of ')'
Y = a valid suffix, i.e. traversing Y from right to left, at any point of time the number of '(' <= the number of ')'
For any S, many pairs of X and Y are possible.
For our example: ()(())
`()(())` =`empty_string + ()(())`
= `( + )(())`
= `() + (())`
= `()( + ())`
= `()(( + ))`
= `()(() + )`
= `()(()) + empty_string`
Note that when X = empty_string, the number of valid S from n '(' and n ')' = the number of valid suffixes Y from n '(' and n ')'.
Now, the algorithm goes like this:
We will start with X = empty_string and recursively grow X until X = S. At any point of time we have two options to grow X: either append '(' or append ')'.
Let dp[a][b] = the number of valid suffixes using a '(' and b ')' given X.
nop = number of open parentheses left, ncp = number of close parentheses left
calculate(nop, ncp)
{
    if dp[nop][ncp] is not known
    {
        if (nop == 0)
            dp[nop][ncp] = 1; // base case: only ')' remain, exactly one valid suffix
        else
        {
            i1 = calculate(nop-1, ncp); // Case 1: X = X + "("
            i2 = ((nop < ncp) ? calculate(nop, ncp-1) : 0);
            /* Case 2: X = X + ")" is only possible while nop < ncp;
               otherwise there can be no valid suffix */
            dp[nop][ncp] = i1 + i2;
        }
    }
    return dp[nop][ncp];
}
Let's take an example: n = 3, i.e. 3 '(' and 3 ')'.
Now at the very start X = empty_string, therefore
dp[3][3] = number of valid sequences S using 3 '(' and 3 ')'
         = number of valid suffixes Y from 3 '(' and 3 ')'

Optimization of a function which look for combination - out-of-memory trouble + speed

Below is a function that creates all possible combinations of splitting the elements of x into n groups (all groups have the same number of elements).
Function:
perm.groups <- function(x,n){
nx <- length(x)
ning <- nx/n
group1 <-
rbind(
matrix(rep(x[1],choose(nx-1,ning-1)),nrow=1),
combn(x[-1],ning-1)
)
ng <- ncol(group1)
if(n > 2){
out <- vector('list',ng)
for(i in seq_len(ng)){
other <- perm.groups(setdiff(x,group1[,i]),n=n-1)
out[[i]] <- lapply(seq_along(other),
function(j) cbind(group1[,i],other[[j]])
)
}
out <- unlist(out,recursive=FALSE)
} else {
other <- lapply(seq_len(ng),function(i)
matrix(setdiff(x,group1[,i]),ncol=1)
)
out <- lapply(seq_len(ng),
function(i) cbind(group1[,i],other[[i]])
)
}
out
}
Pseudo-code (explanation)
nb = number of groups
ning = number of elements in every group
if(nb == 2)
1. take first element, and add it to every possible
combination of ning-1 elements of x[-1]
2. make the difference for each group defined in step 1 and x
to get the related second group
3. combine the groups from step 2 with the related groups from step 1
if(nb > 2)
1. take first element, and add it to every possible
combination of ning-1 elements of x[-1]
2. to define the other groups belonging to the first groups obtained like this,
apply the algorithm on the other elements of x, but for nb-1 groups
3. combine all possible other groups from step 2
with the related first groups from step 1
This function (and pseudo-code) was first created by Joris Meys in this previous post:
Find all possible ways to split a list of elements into a given number of groups of the same size
Is there a way to create a function that returns a given number of randomly chosen combinations?
Such a function would take a third argument, either percentage.possibilities or number.possibilities, which fixes the number of random different combinations the function returns.
Something like:
new.perm.groups(x=1:12,n=3,number.possibilities=50)
Building on @JackManey's suggestion, you can sample one permutation group in an equiprobable fashion using
sample.perm.group <- function(ning, ngrp)
{
if( ngrp==1 ) return(seq_len(ning))
g1 <- 1+sample(ning*ngrp-1, size=ning-1)
g1 <- c(1, g1[order(g1)])
remaining <- seq_len(ning*ngrp)[-g1]
cbind(g1, matrix(remaining[sample.perm.group(ning, ngrp-1)], nrow=ning), deparse.level=0)
}
where ning is the number of elements per group and ngrp is the number of groups.
It returns indices, so if you have an arbitrary vector you can use it as a permutation:
> ind <- sample.perm.group(3,3)
> ind
[,1] [,2] [,3]
[1,] 1 2 5
[2,] 3 7 6
[3,] 4 8 9
> LETTERS[1:9][ind]
[1] "A" "C" "D" "B" "G" "H" "E" "F" "I"
To generate a sample of permutations of size N, you have two options: If you allow repetitions, i.e., a sample with replacement, all you have to do is run the preceding function N times. OTOH, if your sample is to be taken without replacement, then you can use a rejection mechanism:
sample.perm.groups <- function(ning, ngrp, N)
{
result <- list(sample.perm.group(ning, ngrp))
for( i in seq_len(N-1) )
{
repeat
{
y <- sample.perm.group(ning, ngrp)
if( all(vapply(result, function(x)any(x!=y), logical(1))) ) break
}
result[[i+1]] <- y
}
result
}
This is clearly an equiprobable sampling design, and it is unlikely to be inefficient, since the number of possible combinations is usually much larger than N.

How can you compare to what extent two lists are in the same order?

I have two arrays containing the same elements, but in different orders, and I want to know the extent to which their orders differ.
The method I tried didn't work. It was as follows:
For each list I built a matrix which recorded, for each pair of elements, whether they were above or below each other in the list. I then calculated a Pearson correlation coefficient of these two matrices. This worked extremely badly. Here's a trivial example:
list 1:
1
2
3
4
list 2:
1
3
2
4
The method I described above produced matrices like this (where 1 means the row number is higher than the column, and 0 vice-versa):
list 1:
1 2 3 4
1 1 1 1
2 1 1
3 1
4
list 2:
1 2 3 4
1 1 1 1
2 0 1
3 1
4
Since the only difference is the order of elements 2 and 3, these should be deemed to be very similar. The Pearson Correlation Coefficient for those two matrices is 0, suggesting they are not correlated at all. I guess the problem is that what I'm looking for is not really a correlation coefficient, but some other kind of similarity measure. Edit distance, perhaps?
Can anyone suggest anything better?
Mean square of differences of indices of each element.
List 1: A B C D E
List 2: A D C B E
Indices of each element of List 1 in List 2 (zero based)
A B C D E
0 3 2 1 4
Indices of each element of List 1 in List 1 (zero based)
A B C D E
0 1 2 3 4
Differences:
A B C D E
0 -2 0 2 0
Square of differences:
A B C D E
0 4 0 4 0
Average differentness = 8 / 5.
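A small Python sketch of this measure (the function name is mine), assuming both lists contain the same elements exactly once:
def mean_squared_index_difference(list1, list2):
    # Average of the squared differences between each element's position
    # in the two lists (zero-based indices, as in the example above).
    pos2 = {item: i for i, item in enumerate(list2)}
    return sum((i - pos2[item]) ** 2 for i, item in enumerate(list1)) / len(list1)

print(mean_squared_index_difference(list("ABCDE"), list("ADCBE")))  # 1.6 (= 8 / 5)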
Just an idea, but is there any mileage in adapting a standard sort algorithm to count the number of swap operations needed to transform list1 into list2?
I think that defining the compare function may be difficult though (perhaps even just as difficult as the original problem!), and this may be inefficient.
edit: thinking about this a bit more, the compare function would essentially be defined by the target list itself. So for example if list 2 is:
1 4 6 5 3
...then the compare function should result in 1 < 4 < 6 < 5 < 3 (and return equality where entries are equal).
Then the swap function just needs to be extended to count the swap operations.
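A rough Python sketch of that idea (the names are mine): use each element's position in list2 as the sort key and count the adjacent swaps a bubble sort needs to bring list1 into that order:
def swap_count(list1, list2):
    # Compare elements by their position in list2 and count the adjacent
    # swaps a bubble sort performs to turn list1 into list2's order.
    rank = {item: i for i, item in enumerate(list2)}
    keys = [rank[item] for item in list1]
    swaps = 0
    for i in range(len(keys)):
        for j in range(len(keys) - 1 - i):
            if keys[j] > keys[j + 1]:
                keys[j], keys[j + 1] = keys[j + 1], keys[j]
                swaps += 1
    return swaps

print(swap_count([1, 2, 3, 4], [1, 3, 2, 4]))  # 1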
A bit late for the party here, but just for the record, I think Ben almost had it... if you'd looked further into correlation coefficients, I think you'd have found that Spearman's rank correlation coefficient might have been the way to go.
Interestingly, jamesh seems to have derived a similar measure, but not normalized.
See this recent SO answer.
You might consider how many changes it takes to transform one string into another (which I guess it was you were getting at when you mentioned edit distance).
See: http://en.wikipedia.org/wiki/Levenshtein_distance
Although I don't think l-distance takes into account rotation. If you allow rotation as an operation then:
1, 2, 3, 4
and
2, 3, 4, 1
Are pretty similar.
There is a branch-and-bound algorithm that should work for any set of operators you like. It may not be real fast. The pseudocode goes something like this:
bool bounded_recursive_compare_routine(int* a, int* b, int level, int bound){
    if (level > bound) return false;
    // if at end of a and b, return true
    // apply rule 0, like no-change
    if (*a == *b){
        bounded_recursive_compare_routine(a+1, b+1, level+0, bound);
        // if it returns true, return true;
    }
    // if can apply rule 1, like rotation, to b, try that and recur
    bounded_recursive_compare_routine(a+1, b+1, level+cost_of_rotation, bound);
    // if it returns true, return true;
    ...
    return false;
}
int get_minimum_cost(int* a, int* b){
    int bound;
    for (bound=0; ; bound++){
        if (bounded_recursive_compare_routine(a, b, 0, bound)) break;
    }
    return bound;
}
The time it takes is roughly exponential in the answer, because it is dominated by the last bound that works.
Added: This can be extended to find the nearest-matching string stored in a trie. I did that years ago in a spelling-correction algorithm.
I'm not sure exactly what formula it uses under the hood, but difflib.SequenceMatcher.ratio() does exactly this:
ratio(self) method of difflib.SequenceMatcher instance:
Return a measure of the sequences' similarity (float in [0,1]).
Code example:
from difflib import SequenceMatcher
sm = SequenceMatcher(None, '1234', '1324')
print sm.ratio()
>>> 0.75
Another approach that is based on a little bit of mathematics is to count the number of inversions needed to convert one of the arrays into the other one. An inversion is the exchange of two neighboring array elements. In Ruby it is done like this:
# extend class Array by a new method
class Array
  def dist(other)
    raise 'can calculate distance only to array with same length' if length != other.length
    # initialize count of inversions to 0
    count = 0
    # loop over all pairs of indices i, j with i < j
    length.times do |i|
      (i + 1).upto(length - 1) do |j|
        # increase count if the i-th and j-th elements are in a different order
        count += 1 if (self[i] <=> self[j]) != (other[i] <=> other[j])
      end
    end
    return count
  end
end

l1 = [1, 2, 3, 4]
l2 = [1, 3, 2, 4]
# try an example (prints 1)
puts l1.dist(l2)
The distance between two arrays of length n can be between 0 (they are the same) and n*(n-1)/2 (one array is the reverse of the other). If you prefer to have distances always between 0 and 1, to be able to compare distances of pairs of arrays of different length, just divide by n*(n-1)/2.
A disadvantage of this algorithm is its running time of O(n^2). It also assumes that the arrays don't have duplicate entries, but it could be adapted.
A remark about the code line "count += 1 if ...": the count is increased only if either the i-th element of the first list is smaller than its j-th element while the i-th element of the second list is bigger than its j-th element, or vice versa. In short: (l1[i] < l1[j] and l2[i] > l2[j]) or (l1[i] > l1[j] and l2[i] < l2[j]).
If one has two orderings, one should look at two important rank correlation coefficients:
Spearman's rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
This is almost the same as jamesh's answer, but scaled to the range -1 to 1.
It is defined as:
1 - (6 * sum_of_squared_distances) / (n_samples * (n_samples**2 - 1))
Kendall's tau: https://nl.wikipedia.org/wiki/Kendalls_tau
When using Python one could use:
from scipy import stats
order1 = [ 1, 2, 3, 4]
order2 = [ 1, 3, 2, 4]
print stats.spearmanr(order1, order2)[0]
>> 0.8000
print stats.kendalltau(order1, order2)[0]
>> 0.6667
If anyone is using the R language, I've implemented a function that computes the Spearman rank correlation coefficient using the method described above by @bubake here:
get_spearman_coef <- function(objectA, objectB) {
#getting the spearman rho rank test
spearman_data <- data.frame(listA = objectA, listB = objectB)
spearman_data$rankA <- 1:nrow(spearman_data)
rankB <- c()
for (index_valueA in 1:nrow(spearman_data)) {
for (index_valueB in 1:nrow(spearman_data)) {
if (spearman_data$listA[index_valueA] == spearman_data$listB[index_valueB]) {
rankB <- append(rankB, index_valueB)
}
}
}
spearman_data$rankB <- rankB
spearman_data$distance <-(spearman_data$rankA - spearman_data$rankB)**2
spearman <- 1 - ( (6 * sum(spearman_data$distance)) / (nrow(spearman_data) * ( nrow(spearman_data)**2 -1) ) )
print(paste("spearman's rank correlation coefficient"))
return( spearman)
}
Results:
get_spearman_coef(c("a","b","c","d","e"), c("a","b","c","d","e"))
spearman's rank correlation coefficient: 1
get_spearman_coef(c("a","b","c","d","e"), c("b","a","d","c","e"))
spearman's rank correlation coefficient: 0.8
