JSONata query ... feedback requested

I am investigating JSONata as part of my quest to find a declarative syntax for describing transformations from hierarchical JSON data to traditional, normalized, relational tables.
This example uses the try.jsonata.org Invoice data and shows transformations to three RDBMS tables ... AccountTable, AccountOrderTable, AccountOrderProductTable.
Observe that I am iterating through the $.Account.Order array twice ... once to generate records for the AccountOrderTable and again to generate records for the AccountOrderProductTable. The closure captures $order so that the product entries have access to the parent $order.OrderID.
While I'm OK with iterating over the $.Account.Order array twice, I am wondering if there is a clean way to construct the JSONata query so that the $map operation only gets applied once to each array.
Q: How could I construct this query (while generating the same results) so that I only $map the $.Account.Order array one time?

I think this produces the same result, but uses a temporary binding to eliminate the double loop round the Order array. Of course, you could argue that being a declarative language, the implementation should be responsible for optimising the evaluation. But I suspect the implementor might not get round to this for a while.
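A sketch of that temporary-binding approach (this is not the answer's actual query; it assumes the try.jsonata.org Invoice structure, and the table and column names are only illustrative):

(
  /* Sketch: one pass over Account.Order; $order is bound once, and both the
     order rows and the product rows are built inside the same $map. */
  $acct := Account;
  $rows := $map($acct.Order, function($order) {
    {
      "order": {
        "AccountName": $acct.`Account Name`,
        "OrderID": $order.OrderID
      },
      "products": $map($order.Product, function($product) {
        {
          "OrderID": $order.OrderID,
          "ProductID": $product.ProductID,
          "Price": $product.Price,
          "Quantity": $product.Quantity
        }
      })
    }
  });
  {
    "AccountTable": [{ "AccountName": $acct.`Account Name` }],
    "AccountOrderTable": $rows.order,
    "AccountOrderProductTable": $rows.products
  }
)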


Is there a way to tag a Snowflake view as "safe" for result reuse?

Reading the Snowflake documentation for Using Persisted Query Results, one of the conditions that must be met for result reuse is the following:
The query does not include user-defined functions (UDFs) or external functions.
After some experiments with this trivial UDF and view:
create function trivial_function(x number)
returns number
as
$$
x
$$
;
create view redo as select trivial_function($1) as col1, $2 as col2
from values (1,2), (3,4);
I've verified that this condition also applies to queries over views that use UDFs in their definitions.
The thing is, I have a complex view that would benefit greatly from result reuse, but it employs a lot of UDFs for the sake of clarity. Some of the UDFs could be inlined, but that would make the view much more difficult to read. And some UDFs (those written in JavaScript) are impossible to inline.
And yet, I know that all the UDFs are "pure" in the functional programming sense: for the same inputs, they always return the same outputs. They don't check the current timestamp, generate random values, or reference some other table that might change between invocations.
Is there some way to "convince" the query planner that a view is safe for result reuse, despite the presence of UDFs?
There is a parameter called IMMUTABLE:
CREATE FUNCTION
VOLATILE | IMMUTABLE
Specifies the behavior of the UDF when returning results:
VOLATILE: UDF might return different values for different rows, even for the same input (e.g. due to non-determinism and statefulness).
IMMUTABLE: UDF assumes that the function, when called with the same inputs, will always return the same result. This guarantee is not checked. Specifying IMMUTABLE for a UDF that returns different values for the same input will result in undefined behavior.
Default: VOLATILE
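For example, the trivial UDF from the question could be redeclared as immutable (assuming it really is deterministic; as the docs note, the guarantee is not checked):

create or replace function trivial_function(x number)
returns number
immutable
as
$$
x
$$
;

Whether that is enough for the result cache to kick in for a view built on it is worth re-verifying with the same experiment as in the question.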

MarkLogic - get list of all unique document structures in a MarkLogic database

I want to get a list of all distinct document structures with a count in a Marklogic database.
e.g. a database with these 3 documents:
1) <document><name>Robert</name></document>
2) <document><name>Mark</name></document>
3) <document><fname>Robert</fname><lname>Smith</lname></document>
Would return that there are two unique document structures in the database, one used by 2 documents, and the other used by 1 document.
I am using this xquery and am getting back the list of unique sequence of elements correctly:
for $i in distinct-values(
  for $document in doc()
  return <div>{distinct-values(
      for $element in $document//*/*/name()
      return <div>{$element}</div>)} </div>)
return $i
I appreciate that this code will not handle duplicate element names but that is OK for now.
My questions are:
1) Is there a better/more efficient way to do this? I am assuming yes.
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
If you have a finite list of elements for which you need to do this, consider co-occurrence or other similar solutions: https://docs.marklogic.com/cts:value-co-occurrences
This requires a range index on each element in question.
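For instance, with range indexes on the fname and lname elements from the example, a co-occurrence query might look like this (a sketch only; it covers pairs of known elements, not arbitrary structures):

cts:value-co-occurrences(
  cts:element-reference(xs:QName("fname")),
  cts:element-reference(xs:QName("lname")),
  "frequency-order")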
MarkLogic works best when it can use indexes. The other solution I can think of is to create a hash/checksum of the values of the target content for each document in question and store this with the document (or in a triple, if you happen to have a licence for semantics). Then you would already have a key for the unique combinations.
1) Is there a better/more efficient way to do this? I am assuming yes.
If it were up to me, I would build up the document structure in a consistent fashion (like you're doing), then hash it, and attach the hash to each document as a collection. Then I could count the docs in each collection. I can't see any efficient way (using indexes) to get the counts without first writing to the document content or metadata (a collection is a type of metadata) and then querying against the indexes.
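A rough sketch of that idea (untested; the structure string, hash, and collection-name prefix are all choices you would make yourself, and this must run as an update, ideally batched for a large database):

xquery version "1.0-ml";
for $doc in doc()
(: build a consistent structure string, then hash it :)
let $structure := string-join(for $e in $doc//* return $e/name(), "/")
let $hash := xdmp:md5($structure)
return xdmp:document-add-collections(
  xdmp:node-uri($doc),
  fn:concat("structure-", $hash))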
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
After you get the counts for each collection, you could retrieve one doc from each collection and walk through it to build an empty XML structure. XSLT would probably be a good way to do this if you already know XSLT.
3) What is the best way to return the count of each distinct structure e.g. 2 and 1 and in the above example
Turn on the collection lexicon on your database. Then do something like the following:
for $collection in cts:collections()
return ($collection, cts:frequency($collection))
Not sure I follow exactly what you are after, but I am wondering if this is more what you are looking for: functx:distinct-element-paths($doc)
http://www.xqueryfunctions.com/xq/functx_distinct-element-paths.html
Here's a quick example:
xquery version "1.0-ml";
import module namespace functx = "http://www.functx.com" at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";
let $doc := <document><fname>Robert</fname><lname>Smith</lname></document>
return
functx:distinct-element-paths($doc)
Outputs the following strings (which could be parsed, of course):
document
document/fname
document/lname
There are existing 3rd-party tools that may work, depending on the size of the data and the coverage required (is 100% sampling needed?).
Search for "Generate Schema from XML" --
Such tools will look at a sample set and infer a schema (xsd, dtd, rng etc).
They do an accurate job, but not always in the same way a human would.
If they do not have native ML integration then you need to expose a service or export the data for analysis.
Once you HAVE a schema, load it into MarkLogic, and you can query the schema (and elements validated by it) directly and programmatically in ML.
If you find a 'generate schema' tool that is implemented in XSLT, XQuery, or JavaScript you may be able to import and execute it in-server.

CouchDB filter and sort in one view

I'm new to CouchDB.
I have to filter records by date (date must be between two values) and to sort the data by the name or by the date etc (it depends on user's selection in the table).
In MySQL it looks like
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out if I can use one view for all these issues.
Here is my map example:
function(doc) {
  emit(
    [doc.date, doc.name, doc.email],
    {
      email: doc.email,
      name: doc.name,
      date: doc.date
    }
  );
}
I try to filter data using startkey and endkey, but I'm not sure how to sort data in this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or do I have to create several views, with the key order depending on the current sort field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email], etc.?
Thanks for your help!
As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing. Its query optimizer will pick an index into your table, it will scan a range from that index, load what it needs into memory, and execute query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
1. view sorts by date, query selects the date range, and a list function loads everything into memory and sorts by email/etc.
2. views sort by email/etc., and a list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
If you need this done server side, you'll need your list function to load every matching document into an array, and then sort it and JSON-serialize it. This obviously falls to pieces if there are so many matching documents that they don't fit into memory or take too long to sort.
Option 2 scans through preordered documents and only sends those matching the dates. Done right, this avoids loading everything into memory. OTOH it might scan way too many documents, thrashing your disk IO.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.
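A rough sketch of option 1 as a list function (untested; the view is assumed to be keyed by date as in the question, and the sort query parameter is a convention defined here, not a CouchDB built-in):

function (head, req) {
  provides('json', function () {
    // Pull every row the view emitted for the requested date range.
    var rows = [], row;
    while ((row = getRow())) {
      rows.push(row.value);
    }
    // e.g. ?sort=name or ?sort=email; defaults to date.
    var field = req.query.sort || 'date';
    rows.sort(function (a, b) {
      return a[field] < b[field] ? -1 : (a[field] > b[field] ? 1 : 0);
    });
    send(JSON.stringify(rows));
  });
}

It would be called like the view itself, e.g. /db/_design/app/_list/sorted/by_date?startkey=["2015-01-01"]&endkey=["2015-08-01"]&sort=name (the design doc, list, and view names here are made up).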
you COULD use a list function for that, in two ways:
1.) Couch-View is ordered by dates and you sort by e-mail => but please be aware that you'd have to have ALL items in memory to do this sort by e-mail (i.e. you can do this only when your result set is small)
2.) Couch-View is ordered by e-mail and a list function drops all outside the date range (you can only do that when the overall list is small - so this one is most probably bad)
possibly #1 can help you

Query core data store based on a transient calculated value

I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the Core Data entity) that is the sum of the "id" field and a random number, so each row in the tableview would contain 4 fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to calculate based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is to stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case, it is important because Core Data supports arbitrarily complex fetches but the SQLite store does not. So, if you use a SQLite store, your fetches are restricted to those supported by SQLite.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute isPrime and then modify the setter for your id attribute to calculate whether the set id value is prime or not and set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime == YES && ((first <= %@) && (second >= %@))
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
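A minimal sketch of both pieces in Swift (the entity name Record, its attributes, and the fetch helper are hypothetical, not taken from the question's model):

import CoreData

class Record: NSManagedObject {
    @NSManaged var id: Int64
    @NSManaged var first: Int64
    @NSManaged var second: Int64
    @NSManaged var isPrime: Bool   // precomputed whenever id is set, then persisted

    // Derived value: computed on demand in memory, never stored.
    var derivedValue: Int64 {
        return id + Int64.random(in: 0..<100)
    }
}

// Fetch using only what the SQLite store can evaluate: the stored flag
// plus the two user-supplied bounds.
func fetchPrimes(between lower: Int64, and upper: Int64,
                 in context: NSManagedObjectContext) throws -> [Record] {
    let request = NSFetchRequest<Record>(entityName: "Record")
    request.predicate = NSPredicate(
        format: "isPrime == YES AND id >= %@ AND id <= %@",
        NSNumber(value: lower), NSNumber(value: upper))
    return try context.fetch(request)
}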
One often overlooked option is to not use an sqlite store but to use an XML store instead. If the amount of data is relatively small e.g. a few thousand text attributes with a total memory footprint of a few dozen meg, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It is useful for large data sets and low memory, but with memory becoming ever more plentiful, it's losing its edge. I find myself using it less these days. You should consider whether you need SQLite in this particular case.

Data structure to hold HQL or EJB QL

We need to produce a fairly complex dynamic query builder for retrieving reports on the fly. We're scratching our heads a little on what sort of data structure would be best.
It's really nothing more than holding a list of selectParts, a list of fromParts, a list of where criteria, order by, group by, that sort of thing, for persistence. When we start thinking about joins, especially outer joins, having clauses, and aggregate functions, things start getting a little fuzzy.
We're building it up interfaces first for now and trying to think as far ahead as we can, but definitely will go through a series of refactorings when we discover limitations with our structures.
I'm posting this question here in the hopes that someone has already come up with something that we can base it on. Or know of some library or some such. It would be nice to get some tips or heads-up on potential issues before we dive into implementations next week.
I've done something similar a couple of times in the past. A couple of the bigger things spring to mind:
The where clause is the hardest to get right. If you divide things up into what I would call "expressions" and "predicates" it makes it easier.
Expressions - column references, parameters, literals, functions, aggregates (count/sum)
Predicates - comparisons, like, between, in, is null (predicates have expressions as children, e.g. expr1 = expr2). Then you also have composites such as and/or/not.
The whole where clause, as you can imagine, is a tree with a predicate at the root, with maybe sub-predicates underneath eventually terminating with expressions at the leaves.
To construct the HQL you walk the model (depth first usually). I used a visitor as I need to walk my models for other reasons, but if you don't have multiple purposes you can build the rendering code right into the model.
e.g. If you had
"where upper(column1) = :param1 AND ( column2 is null OR column3 between :param2 and :param3)"
Then the tree is
Root
- AND
  - Equal
    - Function(upper)
      - ColumnReference(column1)
    - Parameter(param1)
  - OR
    - IsNull
      - ColumnReference(column2)
    - Between
      - ColumnReference(column3)
      - Parameter(param2)
      - Parameter(param3)
Then you walk the tree depth first and merge rendered bits of HQL on the way back up. The upper function, for example, would expect one piece of child HQL to be rendered, and it would then generate
"upper( " + childHql + " )"
and pass that up to its parent. Something like Between expects three child HQL pieces.
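A skeletal illustration of that render-on-the-way-up idea (Java; all class names here are made up, and the poster's real model used a visitor rather than a render() method on each node):

// Hypothetical sketch of an expression/predicate tree that renders HQL depth first.
public class HqlTreeSketch {

    interface Node { String render(); }

    record ColumnReference(String name) implements Node {
        public String render() { return name; }
    }
    record Parameter(String name) implements Node {
        public String render() { return ":" + name; }
    }
    record Function(String fn, Node arg) implements Node {
        // Wraps the already-rendered child HQL, as described above.
        public String render() { return fn + "(" + arg.render() + ")"; }
    }
    record Comparison(Node left, String op, Node right) implements Node {
        public String render() { return left.render() + " " + op + " " + right.render(); }
    }
    record And(Node left, Node right) implements Node {
        public String render() { return "(" + left.render() + " AND " + right.render() + ")"; }
    }

    public static void main(String[] args) {
        Node where = new Comparison(
                new Function("upper", new ColumnReference("column1")), "=",
                new Parameter("param1"));
        System.out.println(where.render());   // upper(column1) = :param1
    }
}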
You can then re-use the expression model in the select/group by/order by clauses.
You can skip storing the group by if you wish, by just storing the select and, before query construction, scanning for aggregates. If there are one or more, then just copy all the non-aggregate select expressions into the group by.
From clause is just a list of table reference + zero or more join clauses. Each join clause has a type (inner/left/right) and a table reference. Table reference is a table name + optional alias.
Plus, if you ever get into wanting to parse a query language (or anything really) then I can highly recommend ANTLR. Learning curve is quite steep but there are plenty of example grammars to look at.
HTH.
If you need an EJB-QL parser and data structures, EclipseLink (well, several of its internal classes) has a good one:
JPQLParseTree tree = org.eclipse.persistence.internal.jpa.parsing.jpql.JPQLParser.buildParserFor(" _the_ejb_ql_string_ ").parse();
JPQLParseTree contains all the data.
But generating EJB-QL back from a modified JPQLParseTree is something you have to do yourself.
