JSONata prevent array flattening - jsonata

Q: How do I prevent JSONata from "auto-flattening" arrays in an array constructor?
Given JSON data:
{
"w" : true,
"x":["a", "b"],
"y":[1, 2, 3],
"z": 9
}
the JSONata query seems to select 4 values:
[$.w, $.x, $.y, $.z]
The nested arrays at $.x and $.y are getting flattened/inlined into my outer wrapper, resulting in more than 4 values:
[ true, "a", "b", 1, 2, 3, 9 ]
The results I would like to achieve are
[ true, ["a", "b"], [1, 2, 3], 9 ]
I can achieve this by using
[$.w, [$.x], [$.y], $.z]
But this requires me to know a priori that $.x and $.y are arrays.
I would like to select 4 values and have the resulting array contain exactly 4 values, independent of the types of values that are selected.
There are clearly some things about the interactions between JSONata sequences and arrays that I can't get my head around.

In common with XPath/XQuery sequences, it will flatten the results of a path expression into the output array. It is possible to avoid this in your example by using the $each higher-order function to iterate over an object's key/value pairs. The following expression will get what you want without any flattening of results:
$each($, function($v) {
$v
})
This just returns the value for each property in the object.
UPDATE: Extending this answer for your updated question:
I think this is related to a previous github question on how to combine several independent queries into the same question. This uses an object to hold all the queries in a similar manner to the one you arrived at. Perhaps a slightly clearer expression would be this:
{
"1": t,
"2": u.i,
"3": u.j,
"4": u.k,
"5": u.l,
"6": v
} ~> $each(λ($v){$v})
The λ is just a shorthand for function, if you can find it on your keyboard (F12 in the JSONata Exerciser).

I am struggling to rephrase my question in such as way as to describe the difficulties I am having with JSONata's sequence-like treatment of arrays.
I need to run several queries to extract several values from the same JSON tree. I would like to construct one JSONata query expression which extracts n data items (or runs n subqueries) and returns exactly n values in an ordered array.
This example seems to query request 6 values, but because of array flattening the result array does not have 6 values.
This example explicitly wraps each query in an array constructor so that the result has 6 values. However, the values which are not arrays are wrapped in an extraneous & undesirable array. In addition one cannot determine what the original type was ...
This example shows the result that I am trying to accomplish ... I asked for 6 things and I got 6 values back. However, I must know the datatypes of the values I am fetching and explicitly wrap the arrays in an array constructor to work-around the sequence flattening.
This example shows what I want. I queried 6 things and got back 6 answers without knowing the datatypes. But I have to introduce an object as a temporary container in order to work around the array flattening behavior.
I have not found any predicates that allow me to test the type of a value in a query ... which might have let me use the ?: operator to dynamically decide whether or not to wrap arrays in an array constructor. e.g. $isArray($.foo) ? [$.foo] : $.foo
Q: Is there an easier way for me to (effectively) submit 6 "path" queries and get back 6 values in an ordered array without knowing the data types of the values I am querying?

Building on the example from Acoleman, here is a way to pass in n "query" strings (that represent paths):
(['t', 'u.i', 'u.j', 'u.k', 'u.l', 'v'] {
$: $eval('$$.' & $)
}).$each(function($o) {$o})
and get back an array ofn results with their original data format:
[
12345,
[
"i",
"ii",
"iii"
],
[],
"K",
{
"L": "LL"
},
null
]
It seems that using $each is the only way to avoid any flattening...
Granted, probably not the most efficient of expressions, since each has to be evaluated from a path string starting at the root of the data structure -- but there ya go.

Related

Performance: Replacing Series values with keys from a Dictionary in Python

I have a data series that contains various names of the same organizations. I want harmonize these names into a given standard using a mapping dictionary. I am currently using a nested for loop to iterate through each series element and if it is within the dictionary's values, I update the series value with the dictionary key.
# For example, corporation_series is:
0 'Corp1'
1 'Corp-1'
2 'Corp 1'
3 'Corp2'
4 'Corp--2'
dtype: object
# Dictionary is:
mapping_dict = {
'Corporation_1': ['Corp1', 'Corp-1', 'Corp 1'],
'Corporation_2': ['Corp2', 'Corp--2'],
}
# I use this logic to replace the values in the series
for index, value in corporation_series.items():
for key, list in mapping_dict.items():
if value in list:
corporation_series = corporation_series.replace(value, key)
So, if the series has a value of 'Corp1', and it exists in the dictionary's values, the logic replaces it with the corresponding key of corporations. However, it is an extremely expensive method. Could someone recommend me a better way of doing this operation? Much appreciated.
I found a solution by using python's .map function. In order to use .map, I had to invert my dictionary:
# Inverted Dict:
mapping_dict = {
'Corp1': ['Corporation_1'],
'Corp-1': ['Corporation_1'],
'Corp 1': ['Corporation_1'],
'Corp2': ['Corporation_2'],
'Corp--2':['Corporation_2'],
}
# use .map
corporation_series.map(newdict)
Instead of 5 minutes of processing, took around 5s. While this is works, I sure there are better solutions out there. Any suggestions would be most welcome.

How to get documents that contain sub-string in FaunaDB

I'm trying to retrieve all the tasks documents that have the string first in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
f.Map(
f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
f.Lambda("ref", f.Get(f.Var("ref"))),
),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up with the pages
FaunaDB provides a lot of constructs, this makes it powerful but you have a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
Not,
Abort,
...
} = q
You do have to be careful to export Map like that since it will conflict with JavaScripts map. In that case, you could just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs
ContainsStr('Fauna', 'a')
Of course, this works on a specific value so in order to make it work you need Filter and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
Paginate(Documents(Collection('tasks'))),
Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently since one get === one read and we don't need the docs, we'll be filtering out a lot of them. It's interesting to know that one index page is also one read so we can define an index as follows:
{
name: "tasks_name_and_ref",
unique: false,
serialized: true,
source: "tasks",
terms: [],
values: [
{
field: ["data", "name"]
},
{
field: ["ref"]
}
]
}
And since we added name and ref to the values, the index will return pages of name and ref which we can then use to filter. We can, for example, do something similar with indexes, map over them and this will return us an array of booleans.
Map(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can actually simple replace Map with filter. We'll also add a to lowercase to ignore casing and we have what we need:
Filter(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
"data": [
[
"Firstly, we'll have to go and refactor this!",
Ref(Collection("tasks"), "267120709035098631")
],
[
"go to a big rock-concert abroad, but let's not dive in headfirst",
Ref(Collection("tasks"), "267120846106001926")
],
[
"The first thing to do is dance!",
Ref(Collection("tasks"), "267120677201379847")
]
]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want since it also means that if you request pages of 500 in size, they might be filtered out and you might end up with a page of size 3, then one of 7. You might think, why can't I just get my filtered elements in pages? Well, it's a good idea for performance reasons since it basically checks each value. Imagine you have a massive collection and filter out 99.99 percent. You might have to loop over many elements to get to 500 which all cost reads. We want pricing to be predictable :).
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
In Index bindings, you can transform the attributes of your document and in our first attempt we will split the string into words (I'll implement multiple since I'm not entirely sure which kind of matching you want)
We do not have a string split function but since FQL is easily extended, we can write it ourselves bind to a variable in our host language (in this case javascript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string: ExprArg, delimiter = " "){
return If(
Not(IsString(string)),
Abort("SplitString only accept strings"),
q.Map(
FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
Lambda("res", LowerCase(Select(["data"], Var("res"))))
)
)
)
And use it in our binding.
CreateIndex({
name: 'tasks_by_words',
source: [
{
collection: Collection('tasks'),
fields: {
words: Query(Lambda('task', StringSplit(Select(['data', 'name']))))
}
}
],
terms: [
{
binding: 'words'
}
]
})
Hint, if you are not sure whether you have got it right, you can always throw the binding in values instead of terms and then you'll see in the fauna dashboard whether your index actually contains values:
What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexes separately yet point all to the same document which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
Paginate(Match(Index('tasks_by_words'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact words, so how do we do that?
Option 3: indexes and Ngram (exact contains matching)
To get exact contains matching efficient, you need to use a (still undocumented function since we'll make it easier in the future) function called 'NGram'. Dividing a string in ngrams is a search technique that is often used underneath the hood in other search engines. In FaunaDB we can easily apply it as due to the power of the indexes and bindings. The Fwitter example has an example in it's source code that does autocompletion. This example won't work for your use-case but I do reference it for other users since it's meant for autocompleting short strings, not to search a short string in a longer string like a task.
We'll adapt it though for your use-case. When it comes to searching it's all a tradeoff of performance and storage and in FaunaDB users can choose their tradeoff. Note that in the previous approach, we stored each word separately, with Ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).
What NGram essentially does is get substrings of a string of a certain length.
For example:
NGram('lalala', 3, 3)
Will return:
If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff, increasing the size will increase the storage requirements but allow you to do query for longer strings), you can write the following Ngram generator.
function GenerateNgrams(Phrase) {
return Distinct(
Union(
Let(
{
// Reduce this array if you want less ngrams per word.
indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
indexesFiltered: Filter(
Var('indexes'),
// filter out the ones below 0
Lambda('l', GT(Var('l'), 0))
),
ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Var('Phrase')), Var('l'), Var('l'))))
},
Var('ngramsArray')
)
)
)
}
You can then write your index as followed:
CreateIndex({
name: 'tasks_by_ngrams_exact',
// we actually want to sort to get the shortest word that matches first
source: [
{
// If your collections have the same property tht you want to access you can pass a list to the collection
collection: [Collection('tasks')],
fields: {
wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'wordparts'
}
]
})
And you have an index backed search where your pages are the size you requested.
q.Map(
Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)
If you want fuzzy searching, often trigrams are used, in this case our index will be easy so we're not going to use an external function.
CreateIndex({
name: 'tasks_by_ngrams',
source: {
collection: Collection('tasks'),
fields: {
ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
}
},
terms: [
{
binding: 'ngrams'
}
]
})
If we would place the binding in values again to see what comes out we'll see something like this:
In this approach, we use both trigrams on the indexing side as on the querying side. On the querying side, that means that the 'first' word which we search for will also be divided in Trigrams as follows:
For example, we can now do a fuzzy search as follows:
q.Map(
Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
Lambda('ref', Get(Var('ref')))
)
In this case, we do actually 3 searches, we are searching for all of the trigrams and union the results. Which will return us all sentences that contain first.
But if we would have miss-spelled it and would have written frst we would still match all three since there is a trigram (rst) that matches.

JSONata query to flatten array of arrays

The JSONata doc "top-level-arrays-nested-arrays-and-array-flattening" covers the "flatten" case of an array of objects, each of which contains a property that contains an array value.
However, I have not been able to figure out how to flatten an array of arrays.
Q: What is the JSONata query to flatten an array of arrays?
input
[ [1,2], [], [3] ]
desired
[ 1, 2, 3 ]
I have figured out that flattening an array of arrays can be accomplished by using the $reduce function to iteratively apply the $append function.
$reduce($, $append)
for this simple test case:
$reduce( [ [1,2], [], [3] ], $append)
Q: Are there other ways to flatten an array of arrays in JSONata?
In JSONata, iterating over all the elements of an array returns a flattened array of the elements appended together... So it's really as simple as:
$.*
Almost looks like an emoji! ;*)
Technically, you don't even need the $. prefix -- but just using the expression * doesn't look right to me...

Associatively sorting a table by value in Lua

I have a key => value table I'd like to sort in Lua. The keys are all integers, but aren't consecutive (and have meaning). Lua's only sort function appears to be table.sort, which treats tables as simple arrays, discarding the original keys and their association with particular items. Instead, I'd essentially like to be able to use PHP's asort() function.
What I have:
items = {
[1004] = "foo",
[1234] = "bar",
[3188] = "baz",
[7007] = "quux",
}
What I want after the sort operation:
items = {
[1234] = "bar",
[3188] = "baz",
[1004] = "foo",
[7007] = "quux",
}
Any ideas?
Edit: Based on answers, I'm going to assume that it's simply an odd quirk of the particular embedded Lua interpreter I'm working with, but in all of my tests, pairs() always returns table items in the order in which they were added to the table. (i.e. the two above declarations would iterate differently).
Unfortunately, because that isn't normal behavior, it looks like I can't get what I need; Lua doesn't have the necessary tools built-in (of course) and the embedded environment is too limited for me to work around it.
Still, thanks for your help, all!
You seem to misunderstand something. What you have here is a associative array. Associative arrays have no explicit order on them, e.g. it's only the internal representation (usually sorted) that orders them.
In short -- in Lua, both of the arrays you posted are the same.
What you would want instead, is such a representation:
items = {
{1004, "foo"},
{1234, "bar"},
{3188, "baz"},
{7007, "quux"},
}
While you can't get them by index now (they are indexed 1, 2, 3, 4, but you can create another index array), you can sort them using table.sort.
A sorting function would be then:
function compare(a,b)
return a[1] < b[1]
end
table.sort(items, compare)
As Komel said, you're dealing with associative arrays, which have no guaranteed ordering.
If you want key ordering based on its associated value while also preserving associative array functionality, you can do something like this:
function getKeysSortedByValue(tbl, sortFunction)
local keys = {}
for key in pairs(tbl) do
table.insert(keys, key)
end
table.sort(keys, function(a, b)
return sortFunction(tbl[a], tbl[b])
end)
return keys
end
items = {
[1004] = "foo",
[1234] = "bar",
[3188] = "baz",
[7007] = "quux",
}
local sortedKeys = getKeysSortedByValue(items, function(a, b) return a < b end)
sortedKeys is {1234,3188,1004,7007}, and you can access your data like so:
for _, key in ipairs(sortedKeys) do
print(key, items[key])
end
result:
1234 bar
3188 baz
1004 foo
7007 quux
hmm, missed the part about not being able to control the iteration. there
But in lua there is usually always a way.
http://lua-users.org/wiki/OrderedAssociativeTable
Thats a start. Now you would need to replace the pairs() that the library uses. That could be a simples as pairs=my_pairs. You could then use the solution in the link above
PHP arrays are different from Lua tables.
A PHP array may have an ordered list of key-value pairs.
A Lua table always contains an unordered set of key-value pairs.
A Lua table acts as an array when a programmer chooses to use integers 1, 2, 3, ... as keys. The language syntax and standard library functions, like table.sort offer special support for tables with consecutive-integer keys.
So, if you want to emulate a PHP array, you'll have to represent it using list of key-value pairs, which is really a table of tables, but it's more helpful to think of it as a list of key-value pairs. Pass a custom "less-than" function to table.sort and you'll be all set.
N.B. Lua allows you to mix consecutive-integer keys with any other kinds of keys in the same table—and the representation is efficient. I use this feature sometimes, usually to tag an array with a few pieces of metadata.
Coming to this a few months later, with the same query. The recommended answer seemed to pinpoint the gap between what was required and how this looks in LUA, but it didn't get me what I was after exactly :- which was a Hash sorted by Key.
The first three functions on this page DID however : http://lua-users.org/wiki/SortedIteration
I did a brief bit of Lua coding a couple of years ago but I'm no longer fluent in it.
When faced with a similar problem, I copied my array to another array with keys and values reversed, then used sort on the new array.
I wasn't aware of a possibility to sort the array using the method Kornel Kisielewicz recommends.
The proposed compare function works but only if the values in the first column are unique.
Here is a bit enhanced compare function to ensure, if the values of a actual column equals, it takes values from next column to evaluate...
With {1234, "baam"} < {1234, "bar"} to be true the items the array containing "baam" will be inserted before the array containing the "bar".
local items = {
{1004, "foo"},
{1234, "bar"},
{1234, "baam"},
{3188, "baz"},
{7007, "quux"},
}
local function compare(a, b)
for inx = 1, #a do
-- print("A " .. inx .. " " .. a[inx])
-- print("B " .. inx .. " " .. b[inx])
if a[inx] == b[inx] and a[inx + 1] < b[inx + 1] then
return true
elseif a[inx] ~= b[inx] and a[inx] < b[inx] == true then
return true
else
return false
end
end
return false
end
table.sort(items,compare)

Are curly brackets used in Lua?

If curly brackets ('{' and '}') are used in Lua, what are they used for?
Table literals.
The table is the central type in Lua, and can be treated as either an associative array (hash table or dictionary) or as an ordinary array. The keys can be values of any Lua type except nil, and the elements of a table can hold any value except nil.
Array member access is made more efficient than hash key access behind the scenes, but the details don't usually matter. That actually makes handling sparse arrays handy since storage only need be allocated for those cells that contain a value at all.
This does lead to a universal 1-based array idiom that feels a little strange to a C programmer.
For example
a = { 1, 2, 3 }
creates an array stored in the variable a with three elements that (coincidentally) have the same values as their indices. Because the elements are stored at sequential indices beginning with 1, the length of a (given by #a or table.getn(a)) is 3.
Initializing a table with non-integer keys can be done like this:
b = { one=1, pi=3.14, ["half pi"]=1.57, [function() return 17 end]=42 }
where b will have entries named "one", "pi", "half pi", and an anonymous function. Of course, looking up that last element without iterating the table might be tricky unless a copy of that very function is stored in some other variable.
Another place that curly braces appear is really the same semantic meaning, but it is concealed (for a new user of Lua) behind some syntactic sugar. It is common to write functions that take a single argument that should be a table. In that case, calling the function does not require use of parenthesis. This results in code that seems to contain a mix of () and {} both apparently used as a function call operator.
btn = iup.button{title="ok"}
is equivalent to
btn = iup.button({title="ok"})
but is also less hard on the eyes. Incidentally, calling a single-argument function with a literal value also works for string literals.
list/ditionary constructor (i.e. table type constructor).
They are not used for code blocks if that's what you mean. For that Lua just uses the end keyword to end the block.
See here
They're used for table literals as you would use in C :
t = {'a', 'b', 'c'}
That's the only common case. They're not used for block delimiters. In a lua table, you can put values of different types :
t={"foo", 'b', 3}
You can also use them as dictionnaries, à la Python :
t={name="foo", age=32}

Resources