Is there something wrong with this XPath, or is MarkLogic misbehaving?

I've got a weird problem with what should be a simple XPath not returning data when I'm sure the data is present in MarkLogic. I can see the data in question through a more general XPath, but I can't get at it specifically.
xpath that works:
/log:record[log:changes/log:change/@field = "moduleCodes"]
=> a long sequence of log:record elements with "moduleCodes" field changes
xpath that doesn't:
/log:record/log:changes/log:change[@field = "moduleCodes"]
=> empty sequence
(I've omitted the log namespace definition for brevity.)
Trying to figure out what's going on, I started with the first, working, xpath and built on it:
/log:record[log:changes/log:change/@field = "moduleCodes"]/log:changes/log:change
=> sequence of log:change elements including some with @field = "moduleCodes"
/log:record[log:changes/log:change/@field = "moduleCodes"]/log:changes/log:change[@field]
=> sequence of log:change elements including some with @field = "moduleCodes"
/log:record[log:changes/log:change/@field = "moduleCodes"]/log:changes/log:change[@field = "moduleCodes"]
=> empty sequence
Am I misunderstanding something fundamental? I can't see any reason why the XPaths putting the predicate on the log:change would return an empty sequence when everything else works as I expect. This feels to me like MarkLogic getting confused somehow, but I want to make sure I'm not just missing a subtlety of XPath before I start talking like that.
I just tried the paths with a different field-name. It works as I expect with (at least some) other values.
I just restarted the ML cluster; no change.
Edit:
All of the xpaths above work fine in Oxygen, so it seems to be just ML that's behaving like this. I tried adding fn:doc() to the front of all the paths, in case that helped, but it made no difference.
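For reference, the fn:doc() variant of the failing path looked something like this (reconstructed from the description above; the namespace URI is anonymised as elsewhere):
declare namespace log="some/namespace/definition";
fn:doc()/log:record/log:changes/log:change[@field = "moduleCodes"]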
Here's an (anonymised) record that I believe should match all the xpaths:
<log:record id="00000001" date="2013-04-14T01:42:02.922+01:00" type="change" xmlns:log="some/namespace/definition">
<log:head>
<some-header-info/>
</log:head>
<log:changes>
<log:change field="dateModified">
<log:old-value>2012-11-06T00:00:00.0000000</log:old-value>
<log:new-value>2013-03-20T00:00:00.0000000</log:new-value>
</log:change>
<log:change field="moduleCodes">
<log:old-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
</log:moduleCodes>
</log:old-value>
<log:new-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
<log:moduleCodes-value code="BBB"/>
</log:moduleCodes>
</log:new-value>
</log:change>
</log:changes>
</log:record>

As best I can recreate your test with 6.0-2.3, this works for me.
When debugging database queries, one useful technique is to move things into memory. If the query still doesn't work in memory, that throws suspicion on the query itself; if it does work, the database indexes become the prime suspect. When I try that using 6.0-2.3, the results seem to be correct.
declare namespace log="some/namespace/definition" ;
document {
<log:record id="00000001" date="2013-04-14T01:42:02.922+01:00" type="change" xmlns:log="some/namespace/definition">
<log:head>
<some-header-info/>
</log:head>
<log:changes>
<log:change field="dateModified">
<log:old-value>2012-11-06T00:00:00.0000000</log:old-value>
<log:new-value>2013-03-20T00:00:00.0000000</log:new-value>
</log:change>
<log:change field="moduleCodes">
<log:old-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
</log:moduleCodes>
</log:old-value>
<log:new-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
<log:moduleCodes-value code="BBB"/>
</log:moduleCodes>
</log:new-value>
</log:change>
</log:changes>
</log:record> }
/log:record[log:changes/log:change/@field = "moduleCodes"]/xdmp:path(.)
=>
/log:record
declare namespace log="some/namespace/definition" ;
document {
<log:record id="00000001" date="2013-04-14T01:42:02.922+01:00" type="change" xmlns:log="some/namespace/definition">
<log:head>
<some-header-info/>
</log:head>
<log:changes>
<log:change field="dateModified">
<log:old-value>2012-11-06T00:00:00.0000000</log:old-value>
<log:new-value>2013-03-20T00:00:00.0000000</log:new-value>
</log:change>
<log:change field="moduleCodes">
<log:old-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
</log:moduleCodes>
</log:old-value>
<log:new-value>
<log:moduleCodes>
<log:moduleCodes-value code="AAA"/>
<log:moduleCodes-value code="BBB"/>
</log:moduleCodes>
</log:new-value>
</log:change>
</log:changes>
</log:record> }
/log:record/log:changes/log:change[@field = "moduleCodes"]/xdmp:path(.)
=>
/log:record/log:changes/log:change[2]
So the implication is that the problem is in the index or the way the index is queried. You can try to debug that using xdmp:query-trace(true()) at the start of your query. For example:
declare namespace log="some/namespace/definition" ;
xdmp:query-trace(true()),
/log:record[log:changes/log:change/@field = "moduleCodes"]/xdmp:describe(.),
/log:record/log:changes/log:change[@field = "moduleCodes"]/xdmp:describe(.)
With 6.0-2.3 these both return the expected results for me.
fn:doc("test")/log:record
fn:doc("test")/log:record/log:changes/log:change[2]
Here are the traces, from the ErrorLog.txt file:
Analyzing path: fn:collection()/log:record[log:changes/log:change/@field = "moduleCodes"]/xdmp:describe(.)
Step 1 is searchable: fn:collection()
Step 2 is searchable: log:record[log:changes/log:change/@field = "moduleCodes"]
Step 3 is unsearchable: xdmp:describe(.)
First 2 steps of path are searchable: fn:collection()/log:record[log:changes/log:change/@field = "moduleCodes"]
Gathering constraints.
Comparison contributed hash value constraint: log:change/@field = "moduleCodes"
Step 2 predicate 1 contributed 3 constraints: log:changes/log:change/@field = "moduleCodes"
Comparison contributed hash value constraint: log:change/@field = "moduleCodes"
Step 2 predicate 1 contributed 1 constraint: log:changes/log:change/@field = "moduleCodes"
Step 2 contributed 4 constraints: log:record[log:changes/log:change/@field = "moduleCodes"]
Executing search.
Selected 1 fragment to filter
xdmp:eval("declare namespace log="some/namespace/definition" ;&#1...", (), <options xmlns="xdmp:eval"><database>598453498912235799</database><root>/tmp</root><isolati...</options>)
Analyzing path: fn:collection()/log:record/log:changes/log:change[@field = "moduleCodes"]/xdmp:describe(.)
Step 1 is searchable: fn:collection()
Step 2 is searchable: log:record
Step 3 is searchable: log:changes
Step 4 is searchable: log:change[@field = "moduleCodes"]
Step 5 is unsearchable: xdmp:describe(.)
First 4 steps of path are searchable: fn:collection()/log:record/log:changes/log:change[@field = "moduleCodes"]
Gathering constraints.
Step 2 contributed 1 constraint: log:record
Comparison contributed hash value constraint: log:change/@field = "moduleCodes"
Step 4 predicate 1 contributed 1 constraint: @field = "moduleCodes"
Step 4 contributed 1 constraint: log:change[@field = "moduleCodes"]
Comparison contributed hash value constraint: log:change/@field = "moduleCodes"
Step 4 predicate 1 contributed 1 constraint: @field = "moduleCodes"
Comparison contributed hash value constraint: log:change/@field = "moduleCodes"
Step 4 predicate 1 contributed 1 constraint: @field = "moduleCodes"
Step 4 contributed 1 constraint: log:change[@field = "moduleCodes"]
Step 3 contributed 1 constraint: log:changes
Executing search.
Selected 1 fragment to filter

Related

R psych::statsBy() error: "'x' must be numeric"

I'm trying to do a multilevel factor analysis using the "psych" package. The recommended first step is to use the statsBy() function to get the correlation data:
statsBy(study2, group = "ID")
However, it gives this "Error in FUN(data[x, , drop = FALSE], ...) : 'x' must be numeric".
For the dataset, I only included a grouping variable "ID" and two other numeric variables. I ran the following line to check whether the variables are numeric.
sapply(study2, is.numeric)
ID v1 V2
FALSE TRUE TRUE
Here is the code in the traceback of the error. But I don't know what 'x' refers to here, and I noticed that in lines 8 and 9 the X is capitalized, while it is lowercase in line 10.
10.
FUN(data[x, , drop = FALSE], ...)
9.
FUN(X[[i]], ...)
8.
lapply(X = ans[index], FUN = FUN, ...)
7.
tapply(seq_len(728L), list(z = c("5edfa35e60122c277654d35b", "5ed69fbc0a53140e516ad4ed", "5d52e8160ebbe900196e252e", "5efa3da57a38f213146c7352", "5ef98f3df4d541726b1bcc48", "5debb7511e806c2a59cad664", "5c28a4530091e40001ca4d00", "5872a0d958ca4c00018ce4fe", "5c87868eddda2d00012add18", "5e80b7427567f07891655e7e", ...
6.
eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)), data)
5.
eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)), data)
4.
structure(eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)), data), call = match.call(), class = "by")
3.
by.data.frame(data, z, colMeans, na.rm = na.rm)
2.
by(data, z, colMeans, na.rm = na.rm)
1.
statsBy(study2, group = "ID")
The dataset has 728 rows and those like "5edfa35e60122c277654d35b" are IDs. Could anyone help explain what might have gone wrong?
I had the same error; the only way around it was to convert the group variable to the numeric class.
Try:
study2$ID<-as.numeric(study2$ID)
statsBy(study2, group = "ID")
If study2$ID is of class character:
study2$ID<-as.numeric(as.factor(study2$ID))
statsBy(study2, group = "ID")

Multi-level counter iteration

I'm stuck on creating an algorithm as follows. I know this shouldn't be too difficult, but I simply can't get my head around it, and can't find the right description of this kind of pattern.
Basically I need a multi-level counter, where, when a combination exists in the database, the next value is tried by incrementing from the right.
1 1 1 - Start position. Does this exist in database? YES -> Extract this and go to next
1 1 2 - Does this exist in database? YES -> Extract this and go to next
1 1 3 - Does this exist in database? YES -> Extract this and go to next
1 1 4 - Does this exist in database? NO -> Reset level 1, move to level 2
1 2 1 - Does this exist in database? YES -> Extract this and go to next
1 2 2 - Does this exist in database? NO -> Reset level 2 and 1, move to level 3
2 1 1 - Does this exist in database? YES -> Extract this and go to next
2 1 2 - Does this exist in database? YES -> Extract this and go to next
2 1 3 - Does this exist in database? NO -> Reset level 1 and increment level 2
2 2 1 - Does this exist in database? YES -> Extract this and go to next
2 2 2 - Does this exist in database? YES -> Extract this and go to next
2 2 3 - Does this exist in database? YES -> Extract this and go to next
2 3 1 - Does this exist in database? NO -> Reset levels 1 and 2, move to level 3
3 1 1 - Does this exist in database? NO -> Reset level 1, increment level 2
3 2 1 - Does this exist in database? NO -> End, as all increments tried
There could be more than three levels, though.
In practice, each value like 1, 2, etc is actually a $value1, $value2, etc. containing a runtime string being matched against an XML document. So it's not just a case of pulling out every combination already existing in the database.
Assuming the length of the DB key is known upfront, here's one way it can be implemented. I'm using TypeScript, but similar code can be written in your favorite language.
First, I declare some type definitions for convenience.
export type Digits = number[];
export type DbRecord = number;
Then I initialize the fakeDb object, which works as a mock data source. The function I wrote will work against this object. Its keys represent the database records' keys (of type string). The values are simple numbers (intentionally sequential); they represent the database records themselves.
export const fakeDb: { [ dbRecordKey: string ]: DbRecord } = {
    '111': 1,
    '112': 2,
    '113': 3,
    '211': 4,
    '212': 5,
    '221': 6,
    '311': 7,
};
Next, you can see the fun part: the function that uses the counterDigits array of "digits" and increments it depending on whether the record is present or absent.
Please do NOT think this is production-ready code! A) There are unnecessary console.log() invocations which only exist for demo purposes. B) It's a good idea not to read a whole lot of DbRecords from the database into memory, but rather to use yield/return or some kind of buffer or stream.
export function readDbRecordsViaCounter(): DbRecord[] {
    const foundDbRecords: DbRecord[] = [];
    const counterDigits: Digits = [1, 1, 1];
    let currentDigitIndex = counterDigits.length - 1;
    do {
        console.log(`-------`);
        if (recordExistsFor(counterDigits)) {
            foundDbRecords.push(extract(counterDigits));
            currentDigitIndex = counterDigits.length - 1;
            counterDigits[currentDigitIndex] += 1;
        } else {
            currentDigitIndex--;
            for (let priorDigitIndex = currentDigitIndex + 1; priorDigitIndex < counterDigits.length; priorDigitIndex++) {
                counterDigits[priorDigitIndex] = 1;
            }
            if (currentDigitIndex < 0) {
                console.log(`------- (no more records expected -- ran out of counter's range)`);
                return foundDbRecords;
            }
            counterDigits[currentDigitIndex] += 1;
        }
        console.log(`next key to try: ${ getKey(counterDigits) }`);
    } while (true);
}
The remaining pieces are some "helper" functions for constructing a string key from a digits array and accessing the fake database.
export function recordExistsFor(digits: Digits): boolean {
    const keyToSearch = getKey(digits);
    const result = Object.getOwnPropertyNames(fakeDb).some(key => key === keyToSearch);
    console.log(`key=${ keyToSearch } => recordExists=${ result }`);
    return result;
}

export function extract(digits: Digits): DbRecord {
    const keyToSearch = getKey(digits);
    const result = fakeDb[keyToSearch];
    console.log(`key=${ keyToSearch } => extractedValue=${ result }`);
    return result;
}

export function getKey(digits: Digits): string {
    return digits.join('');
}
Now, if you run the function like this:
const dbRecords = readDbRecordsViaCounter();
console.log(`\n\nDb Record List: ${ dbRecords }`);
you should see the following output, which shows the iteration steps as well as the final result at the very end.
-------
key=111 => recordExists=true
key=111 => extractedValue=1
next key to try: 112
-------
key=112 => recordExists=true
key=112 => extractedValue=2
next key to try: 113
-------
key=113 => recordExists=true
key=113 => extractedValue=3
next key to try: 114
-------
key=114 => recordExists=false
next key to try: 121
-------
key=121 => recordExists=false
next key to try: 211
-------
key=211 => recordExists=true
key=211 => extractedValue=4
next key to try: 212
-------
key=212 => recordExists=true
key=212 => extractedValue=5
next key to try: 213
-------
key=213 => recordExists=false
next key to try: 221
-------
key=221 => recordExists=true
key=221 => extractedValue=6
next key to try: 222
-------
key=222 => recordExists=false
next key to try: 231
-------
key=231 => recordExists=false
next key to try: 311
-------
key=311 => recordExists=true
key=311 => extractedValue=7
next key to try: 312
-------
key=312 => recordExists=false
next key to try: 321
-------
key=321 => recordExists=false
next key to try: 411
-------
key=411 => recordExists=false
------- (no more records expected -- ran out of counter's range)
Db Record List: 1,2,3,4,5,6,7
I strongly recommend reading through the code. If you want me to describe the approach or any specific detail(s), let me know. Hope it helps.
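On point B above (not reading everything into memory), a generator-based variant of the same loop could be sketched like this; it is untested and reuses the same helpers, but hands records to the caller one at a time instead of collecting them into an array:
export function* readDbRecordsLazily(): Generator<DbRecord> {
    const counterDigits: Digits = [1, 1, 1];
    let currentDigitIndex = counterDigits.length - 1;
    while (true) {
        if (recordExistsFor(counterDigits)) {
            yield extract(counterDigits);      // hand one record to the caller at a time
            currentDigitIndex = counterDigits.length - 1;
            counterDigits[currentDigitIndex] += 1;
        } else {
            currentDigitIndex--;
            for (let i = currentDigitIndex + 1; i < counterDigits.length; i++) {
                counterDigits[i] = 1;          // reset all digits to the right
            }
            if (currentDigitIndex < 0) {
                return;                        // counter exhausted
            }
            counterDigits[currentDigitIndex] += 1;
        }
    }
}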

How to improve running time of my binary search code in peripheral parts?

I am studying for this great Coursera course https://www.coursera.org/learn/algorithmic-toolbox . On the fourth week, we have an assignment related to binary trees.
I think I did a good job. I created binary search code that solves this problem using recursion in Python 3. Here's my code:
#python3
data_in_sequence = list(map(int,(input().split())))
data_in_keys = list(map(int,(input()).split()))
original_array = data_in_sequence[1:]
data_in_sequence = data_in_sequence[1:]
data_in_keys = data_in_keys[1:]

def binary_search(data_in_sequence, target):
    answer = 0
    sub_array = data_in_sequence
    #print("sub_array",sub_array)
    if not sub_array:
        # print("sub_array",sub_array)
        answer = -1
        return answer
    #print("target",target)
    mid_point_index = (len(sub_array)//2)
    #print("mid_point", sub_array[mid_point_index])
    beg_point_index = 0
    #print("beg_point_index",beg_point_index)
    end_point_index = len(sub_array)-1
    #print("end_point_index",end_point_index)
    if sub_array[mid_point_index]==target:
        #print ("final midpoint, ", sub_array[mid_point_index])
        #print ("original_array",original_array)
        #print("sub_array[mid_point_index]",sub_array[mid_point_index])
        #print ("answer",answer)
        answer = original_array.index(sub_array[mid_point_index])
        return answer
    elif target>sub_array[mid_point_index]:
        #print("target num higher than current midpoint")
        beg_point_index = mid_point_index+1
        sub_array = sub_array[beg_point_index:]
        end_point_index = len(sub_array)-1
        #print("sub_array",sub_array)
        return binary_search(sub_array, target)
    elif target<sub_array[mid_point_index]:
        #print("target num smaller than current midpoint")
        sub_array = sub_array[:mid_point_index]
        return binary_search(sub_array, target)
    else:
        return None

def bin_search_over_seq(data_in_sequence, data_in_keys):
    final_output = ""
    for key in data_in_keys:
        final_output = final_output + " " + str(binary_search(data_in_sequence, key))
    return final_output

print(bin_search_over_seq(data_in_sequence, data_in_keys))
I usually get the correct output. For instance, if I input:
5 1 5 8 12 13
5 8 1 23 1 11
I get the correct indexes of the sequences or (-1) if the term is not in sequence (first line):
2 0 -1 0 -1
However, my code does not pass within the expected running time:
Failed case #4/22: time limit exceeded (Time used: 13.47/10.00, memory used: 36696064/536870912.)
I don't think this happens due to the implementation of my binary search (I think it is right). Rather, I think it is due to some inefficiency in a peripheral part of the code, like the way I am building the final output. However, the way I am presenting the final answer does not seem to be really "heavy"... I am lost.
Am I not seeing something? Is there another inefficiency I am not seeing? How can I solve this? Just trying to present the final result in a faster way?
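One likely source of the slowdown, sketched below rather than verified against the grader: each recursive call slices the list (which copies it), and original_array.index() rescans the whole list on every hit, so each lookup costs closer to O(n) than O(log n). An index-based variant that avoids both, plus joining the output instead of repeated string concatenation, could look like this:
def binary_search(seq, target):
    # Move low/high bounds instead of copying sub-lists on every call.
    low, high = 0, len(seq) - 1
    while low <= high:
        mid = (low + high) // 2
        if seq[mid] == target:
            return mid            # already an index into the original list
        elif seq[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

def bin_search_over_seq(seq, keys):
    # str.join avoids building many intermediate strings in a loop.
    return " ".join(str(binary_search(seq, key)) for key in keys)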

setTargetReturn in fPortfolio package

I have a three-asset portfolio. I need to set the target return for my second asset.
Whenever I try, I get this error:
asset.ts <- as.timeSeries(asset.ret)
spec <- portfolioSpec()
setSolver(spec) <- "solveRshortExact"
constraints <- c("Short")
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))
efficientPortfolio(asset.ts, spec, constraints)
Error: is.numeric(targetReturn) is not TRUE
Title:
MV Efficient Portfolio
Estimator: covEstimator
Solver: solveRquadprog
Optimize: minRisk
Constraints: Short
Portfolio Weights:
MSFT AAPL NORD
0 0 0
Covariance Risk Budgets:
MSFT AAPL NORD
Target Return and Risks:
mean mu Cov Sigma CVaR VaR
0 0 0 0 0 0
Description:
Sat Apr 19 15:03:24 2014 by user: Usuario
I have tried and I have searched the web, but I have no idea how to set the target return to a specific expected return from the data set. I could copy the mean of my second asset, but I think that due to decimals it could affect the answer.
I ran into this error when using 2 assets.
It appears to be a bug in the PortOpt methods: when there are 2 assets, it runs .mvSolveTwoAssets, which looks for the TargetReturn in the portfolio spec. But, as you know, targetReturn isn't always needed.
In your code, though, you have 2 separate variables for the spec: 'spec' and 'Spec'. Assuming 'Spec' is a typo, this line needs to be changed:
setTargetReturn(Spec) = mean(colMeans(asset.ts[,2]))
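If 'Spec' is indeed the typo, the corrected call would be something like the following (a sketch, reusing asset.ts, spec and constraints exactly as defined in the question):
# assumes asset.ts, spec and constraints exist as created above
setTargetReturn(spec) <- mean(colMeans(asset.ts[, 2]))
efficientPortfolio(asset.ts, spec, constraints)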

idata.frame: Why error "is.data.frame(df) is not TRUE"?

I'm working with a large data frame called exp (file here) in R. In the interests of performance, it was suggested that I check out the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, Error: is.data.frame(df) is not TRUE
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So I thought perhaps it was my data, and tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
Error: is.data.frame(df) is not TRUE
Perhaps my error is in how I'm specifying the groupings? Anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Given you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.
Specifically, the lapply(.SD, FUN) idiom and the .SDcols argument, combined with by.
Set up the data.table:
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Which columns are numeric
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
some benchmarking
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median), by = list(groupname, starttime,
fPhase, fCycle), .SDcols = numeric_columns],
plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
idataframe = ddply(iexp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(inadist = median(x$inadist),
smldist = median(x$smldist), lardist = median(x$lardist), inadur = median(x$inadur),
smldur = median(x$smldur), lardur = median(x$lardur), emptyct = median(x$emptyct),
entct = median(x$entct), inact = median(x$inact), smlct = median(x$smlct),
larct = median(x$larct), na.rm = TRUE)),
aggregate = aggregate(exp[, numeric_columns],
exp[, c("groupname", "starttime", "fPhase", "fCycle")],
median),
replications = 5)
## test replications elapsed relative user.self
## 4 aggregate 5 5.42 1.789 5.30
## 1 data.table 5 3.03 1.000 3.03
## 3 idataframe 5 11.81 3.898 11.77
## 2 plyr 5 9.47 3.125 9.45
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
system.time(df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), function(x) data.frame(
inadist=median(x$inadist),
smldist=median(x$smldist),
lardist=median(x$lardist),
inadur=median(x$inadur),
smldur=median(x$smldur),
lardur=median(x$lardur),
emptyct=median(x$emptyct),
entct=median(x$entct),
inact=median(x$inact),
smlct=median(x$smlct),
larct=median(x$larct),
na.rm=TRUE))
)
Shane asked you in another post if you could cache the results of your script. I don't really have an idea of your workflow, but it may be best to set up a cron job to run this and store the results, daily/hourly or whatever suits.
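As a rough illustration of the caching idea (the file name here is made up, and exp is the data frame from the question), a scheduled script could do something like:
library(plyr)
cache_file <- "df_median_cache.rds"   # hypothetical location for the cached result
if (file.exists(cache_file)) {
  df.median <- readRDS(cache_file)    # reuse the previously stored result
} else {
  df.median <- ddply(exp, .(groupname, starttime, fPhase, fCycle),
                     numcolwise(median), na.rm = TRUE)
  saveRDS(df.median, cache_file)      # store for the next run
}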
