DocumentDB find by dynamic subdocument's value (Spring Data Mongo)

I have a collection with multi-million documents. Each contains a dynamic subdocument (let's call it context).
I need to query for all documents whose context contains a given value.
Example documents:
{
  "name": "test",
  [...]
  "context": {
    "key": "some value",
    "another key": "some other value",
    [...]
  }
},
{
  "name": "joe doe",
  [...]
  "context": {
    "just another key": "a value here",
    "last example key": "and its value",
    [...]
  }
}
What I want is to find all documents whose context contains a given value, for example "a value here".
Using aggregation, I was able to flatten all the values from context into a single field and check whether it contains that value. This works fine, but it is slow as hell.
import org.springframework.data.mongodb.core.aggregation.Aggregation
import org.springframework.data.mongodb.core.aggregation.AggregationOperation
import org.springframework.data.mongodb.core.aggregation.ObjectOperators.ObjectToArray
import org.springframework.data.mongodb.core.query.Criteria

// FilterCondition, FilterOperation and the filter parameter are my own classes.
fun buildPipeline(filter: Filter): List<AggregationOperation> {
    // Only consider documents that actually have a context subdocument.
    val filterExistsStage = Aggregation.match(Criteria("context").exists(true))
    // $objectToArray turns context into an array of {k, v} pairs.
    val flatVariablesStage = Aggregation.addFields()
        .addField("text").withValue(ObjectToArray.valueOfToArray("context")).build()
    // Keep only the values ("$text.v").
    val extractValuesStage = Aggregation.addFields()
        .addField("text").withValue("\$text.v").build()
    val condition = FilterCondition("text", filter.context, FilterOperation.CONTAINS)
    val criteria = CRITERIA[condition.operator]!!(condition)
    val matchStage = Aggregation.match(criteria)
    return listOf(filterExistsStage, flatVariablesStage, extractValuesStage, matchStage)
}

val CRITERIA = mapOf<FilterOperation, (condition: FilterCondition) -> Criteria>(
    FilterOperation.CONTAINS to { Criteria.where(it.field).regex(Regex.escape(it.value as String)) },
)
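For reference, the pipeline these stages build looks roughly like this in shell syntax (a sketch; the regex value comes from the CONTAINS condition):

db.collection.aggregate([
  { $match: { context: { $exists: true } } },
  { $addFields: { text: { $objectToArray: "$context" } } },
  { $addFields: { text: "$text.v" } },
  { $match: { text: { $regex: "a value here" } } }
])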
Another solution would be the so-called wildcard text indexes. This is really fast, but unfortunately DocumentDB (compatible with MongoDB 3.6 and 4.0) does not support it; it is only available from MongoDB v4.2.
mongoTemplate.indexOps(MyEntity::class.java)
    .ensureIndex(TextIndexDefinition.forAllFields())

val textCriteria = TextCriteria.forDefaultLanguage().matching(filter.context)
val textMatchStage = Aggregation.match(textCriteria)
I also thought about saving all these values into a single text field and doing a simple text search on it (a sketch of this idea is below). But I try to avoid this kind of optimization if there are better solutions.
I wonder if there is any solution for this in DocumentDB, or whether it is better to use some kind of text search engine, like ElasticSearch?
It has to be fast on millions of documents.
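A minimal sketch of that single-field idea, assuming context is a Map<String, String> and contextText is a hypothetical denormalized field kept in sync by a Spring Data lifecycle listener:

import org.springframework.data.mongodb.core.mapping.event.AbstractMongoEventListener
import org.springframework.data.mongodb.core.mapping.event.BeforeConvertEvent

data class MyEntity(
    var name: String? = null,
    var context: Map<String, String> = emptyMap(),
    var contextText: String? = null // hypothetical denormalized field
)

class ContextTextListener : AbstractMongoEventListener<MyEntity>() {
    override fun onBeforeConvert(event: BeforeConvertEvent<MyEntity>) {
        // Concatenate all context values into one searchable string on every save.
        event.source.contextText = event.source.context.values.joinToString(" ")
    }
}

The contextText field could then be searched directly instead of running the aggregation above.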

Related

Filtering JSON based on sub array in a Power Automate Flow

I have some json data that I would like to filter in a Power Automate Flow.
A simplified version of the json is as follows:
[
  {
    "ItemId": "1",
    "Blah": "test1",
    "CustomFieldArray": [
      {
        "Name": "Code",
        "Value": "A"
      },
      {
        "Name": "Category",
        "Value": "Test"
      }
    ]
  },
  {
    "ItemId": "2",
    "Blah": "test2",
    "CustomFieldArray": [
      {
        "Name": "Code",
        "Value": "B"
      },
      {
        "Name": "Category",
        "Value": "Test"
      }
    ]
  }
]
For example, I wish to filter items based on Name = "Code" and Value = "A". I should be left with the item with ItemId 1 in that case.
I can't figure out how to do this in Power Automate. It would be nice to change the data structure, but this is the way the data is, and I'm trying to work out if this is possible in Power Automate without changing the data itself.
Firstly, I had to fix your JSON; it wasn't complete.
Secondly, filtering on sub-array information isn't what I'd call easy. However, to get around the limitations, you can perform a bit of trickery.
First, I create a variable of type Array (I called it Array) containing the source data, and feed it into a Filter array action.
In that action's condition, the left hand side expression is ...
string(item()?['CustomFieldArray'])
... and the right hand side of the contains comparison is simply a string with the appropriate filter value ...
{"Name":"Code","Value":"A"}
... it's not an expression or a proper object, just a string.
If you need to enhance it to cater for case sensitive values, just set everything to lower case using the toLower expression on the left.
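Putting both sides together, the whole condition amounts to something like this in the action's advanced mode (a sketch; note the right hand side is a plain string literal):

@contains(string(item()?['CustomFieldArray']), '{"Name":"Code","Value":"A"}')

... and the case-insensitive variant:

@contains(toLower(string(item()?['CustomFieldArray'])), toLower('{"Name":"Code","Value":"A"}'))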
That will produce your desired result: the filter reduces the array down to just the item with ItemId 1.

Add an object value to a field in ElasticSearch during ingest and drop empty-valued fields during ingest

I am ingesting CSV data into ElasticSearch using the append processor. I already have two fields that are objects (object1 and object2), and I want to append them both into an array on a different field (mainlist). So it would come out as mainlist: [ {object1}, {object2} ]. I have tried the set processor with the copy_from parameter, but I am getting an error that I am missing the required property name "value", even though the ElasticSearch documentation clearly doesn't use the "value" property when it uses "copy_from":
{"set": {"field": "mainlist", "copy_from": ["object1", "object2"]}}
My syntax is even copied exactly from the documentation. Please help.
Furthermore, I need to drop empty fields at the ingest level so they are not returned; I don't wish to have "fieldname": "" returned to the user. What is the best way to do that? I am new to ElasticSearch and it has not been going well.
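As to the copy_from error: copy_from takes a single source field name (a string, not an array), and it is a relatively recent addition to the set processor (added around Elasticsearch 7.11), so an older cluster will reject it and ask for value. One alternative is a script processor that builds the array itself -- a sketch, assuming the two fields are literally named object1 and object2, and build_mainlist is a hypothetical pipeline name:

PUT _ingest/pipeline/build_mainlist
{
  "processors": [
    {
      "script": {
        "source": "ctx.mainlist = [ctx.object1, ctx.object2]"
      }
    }
  ]
}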
As to dropping the empty fields at ingest level -- set up a pipeline:
PUT _ingest/pipeline/no_empty_fields
{
  "description": "Removes empty-ish fields from a doc",
  "processors": [
    {
      "script": {
        "source": """
          def keys_to_remove = ctx.keySet()
            .stream()
            .filter(field -> ctx[field] == null || ctx[field] == "")
            .collect(Collectors.toList());
          for (key in keys_to_remove) {
            ctx.remove(key);
          }
        """
      }
    }
  ]
}
and apply it upon indexing
POST myindex/_doc?pipeline=no_empty_fields
{
  "fieldname23": 123,
  "fieldname": null,
  "fieldname123": ""
}
You can of course extend the conditions to ditch other fields such as "undefined", "Infinity" and others.
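For example, swapping the filter line in the script above for something like this (a sketch):

.filter(field -> ctx[field] == null || ctx[field] == "" ||
                 ctx[field] == "undefined" || ctx[field] == "Infinity")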

Filtering Field with multiple values

How would I approach the following problem:
I want to filter on a field which contains multiple values (e.g. ["value1", "value2", "value3"]).
The filter would also contain multiple values (e.g. ["value1", "value2"]).
I want to get back only the items whose field value matches the filter exactly, e.g. the field is ["value1", "value2"] and the filter is also ["value1", "value2"].
Any help would be greatly appreciated
I think the somewhat-recently added (v6.1) terms_set query (which Val references on the question he linked in his comment) is what you want.
terms_set, unlike a regular terms, has a parameter to specify a minimum number of matches that must exist between the search terms and the terms contained in the field.
Given:
PUT my_index/_doc/1
{
  "values": ["living", "in a van", "down by the river"]
}

PUT my_index/_doc/2
{
  "values": ["living", "in a house", "down by the river"]
}
A terms query for ["living", "in a van", "down by the river"] will return you both docs: no good. A terms_set configured to require all three matching terms (the script params.num_terms evaluates to 3) can give you just the matching one:
GET my_index/_search
{
  "query": {
    "terms_set": {
      "values": {
        "terms": ["living", "in a van", "down by the river"],
        "minimum_should_match_script": {
          "source": "params.num_terms"
        }
      }
    }
  }
}
NOTE: While I used minimum_should_match_script in the above example, it isn't a very efficient pattern. The alternative minimum_should_match_field is the better approach, but using it in the example would have meant a couple more PUTs to add the necessary field to the documents, so I went with brevity.
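For reference, the field-based variant would look roughly like this (a sketch; required_matches is a field you add to each document, holding how many of the search terms must match):

PUT my_index/_doc/1
{
  "values": ["living", "in a van", "down by the river"],
  "required_matches": 3
}

GET my_index/_search
{
  "query": {
    "terms_set": {
      "values": {
        "terms": ["living", "in a van", "down by the river"],
        "minimum_should_match_field": "required_matches"
      }
    }
  }
}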

how to use Elastic Search nested queries by object key instead of object property

Following the Elastic Search example in this article for a nested query, I noticed that it assumes the nested objects are inside an ARRAY and that queries are based on some object PROPERTY:
{
  nested_objects: [                  <== array
    { name: "x", value: 123 },
    { name: "y", value: 456 }        <== "name" property searchable
  ]
}
But what if I want nested objects to be arranged in key-value structure that gets updated with new objects, and I want to search by the KEY? example:
{
  nested_objects: {                  <== key-value, not array
    "x": { value: 123 },
    "y": { value: 456 },             <== how can I search by "x" and "y" keys?
    "..."                            <== more arbitrary keys are added now and then
  }
}
Thank you!
You can try to do this using the query_string query, like this:
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "nested_objects.\\*.value:123"
    }
  }
}
It will try to match the value field of any sub-field of nested_objects.
Ok, so my final solution after some ES insights is as follows:
1. The fact that my object keys "x", "y", ... are arbitrary causes a mess in my index mapping. So generally speaking, it's not a good ES practice to plan this kind of structure... So for the sake of mappings, I resort to the structure described in the "Weighted tags" article:
{ "name":"x", "value":123 },
{ "name":"y", "value":456 },
...
This means that, when it's time to update the value of the sub-object named "x", I have a harder (and slower) time finding it: I first need to fetch the entire top-level object, traverse the sub-objects until I find the one named "x", update its value, and then write the whole sub-object array back to ES.
The above approach also causes concurrency issues in case I have multiple processes updating the same index. ES has optimistic locking I can use to retry when needed, or I can queue updates and handle them serially.
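With that array structure (and assuming nested_objects is mapped as a nested type with name as a keyword), searching by what used to be the key becomes a nested query on the name property, roughly:

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "nested_objects",
      "query": {
        "term": { "nested_objects.name": "x" }
      }
    }
  }
}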

Couchbase full-text search and compound keys

I have the following data in Couchbase:
Document 06001:
{
  "type": "box",
  "name": "lxpag",
  "number": "06001",
  "materials": [
    {
      "type": "material",
      "number": "070006",
      "name": "hosepipe"
    },
    {
      "type": "material",
      "number": "080006",
      "name": "Philips screw 4mm"
    }
  ]
}
Document 12345:
{
  "type": "material",
  "number": "12345",
  "name": "Another screw"
}
Now I want to be able to query by type and name or number: for a given query type only the documents with the respective type property shall be returned. Furthermore, a second query string specifies which kinds of materials should be searched for. If a material's id or name contains (not starts with) the search term, it shall be included. If one of the materials inside a box matches the term accordingly, the whole box shall be included.
What I have come up with is:
function (doc, meta) {
  if (doc.type === 'box' && Array.isArray(doc.materials)) {
    var queryString = "";
    for (var i = 0; i < doc.materials.length; ++i) {
      var material = doc.materials[i];
      if (material.name && material.number) {
        queryString += " " + material.name + " " + material.number;
      }
    }
    emit([doc.type, queryString], doc);
  } else if (doc.type === 'material') {
    var queryString = doc.name + " " + doc.number;
    emit([doc.type, queryString], doc);
  }
}
I see that this view might not be fit for substring searches (Do I need ElasticSearch for this?). Nevertheless, when I use the following query parameters:
startKey=["box","pag"]&endKey=["box\u02ad","pag\u02ad"]
...not only do I get the box but also all other documents that are returned by the view. Thus, with these keys, nothing is filtered. On the other hand, searching by key works.
How is this possible?
There is no good way of doing substring search with view keys. Your options are either integrating with ElasticSearch, or using N1QL, which lets you do wildcard string matches: "SELECT * FROM bucket WHERE type = 'material' and name LIKE '%screw%'"
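For the box documents, where the matching material sits in a nested array, the N1QL equivalent would be roughly (a sketch using ANY ... SATISFIES, assuming a bucket named bucket):

SELECT *
FROM bucket
WHERE type = 'box'
  AND ANY m IN materials SATISFIES
        m.name LIKE '%screw%' OR m.number LIKE '%screw%'
      END;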
I just saw the flaw in the queries: the parameters must be written in lowercase, otherwise they are not recognized by Couchbase and ignored (it would be really helpful if I got an error message here instead of the usual result list...). So instead, I have to query with
startkey=["box","pag"]&endkey=["box\u02ad","pag\u02ad"]
What I have not precisely found out so far is how to manage the substring search. Since pag is only a substring of lxpag, the above query does not return any results. Any ideas on this matter?
