Couchbase full-text search and compound keys - elasticsearch

I have the following data in Couchbase:
Document 06001:
{
  "type": "box",
  "name": "lxpag",
  "number": "06001",
  "materials": [
    {
      "type": "material",
      "number": "070006",
      "name": "hosepipe"
    },
    {
      "type": "material",
      "number": "080006",
      "name": "Philips screw 4mm"
    }
  ]
}
Document 12345:
{
  "type": "material",
  "number": "12345",
  "name": "Another screw"
}
Now I want to be able to query by type and by name or number: for a given query type, only documents with the matching type property shall be returned. Furthermore, a second query string specifies which materials should be searched for: if a material's number or name contains (not merely starts with) the search term, it shall be included, and if one of the materials inside a box matches the term, the whole box shall be included.
What I have come up with is:
function (doc, meta) {
  if (doc.type === 'box' && Array.isArray(doc.materials)) {
    var queryString = "";
    for (var i = 0; i < doc.materials.length; ++i) {
      var material = doc.materials[i];
      if (material.name && material.number) {
        queryString += " " + material.name + " " + material.number;
      }
    }
    emit([doc.type, queryString], doc);
  } else if (doc.type === 'material') {
    var queryString = doc.name + " " + doc.number;
    emit([doc.type, queryString], doc);
  }
}
I see that this view might not be fit for substring searches (Do I need ElasticSearch for this?). Nevertheless, when I use the following query parameters:
startKey=["box","pag"]&endKey=["box\u02ad","pag\u02ad"]
...not only do I get the box but also all other documents that are returned by the view. Thus, with these keys, nothing is filtered. On the other hand, searching by key works.
How is this possible?

There is no good way of doing substring search with view keys. Your options are either integrating with ElasticSearch, or using N1QL, which lets you do wildcard string matches: "SELECT * FROM bucket WHERE type = 'material' and name LIKE '%screw%'"
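To see why a view key range cannot express a substring match, here is a minimal Python sketch (an emulation of the lexicographic startkey/endkey semantics, not actual Couchbase code): the range selects one contiguous slice of the sorted keys, so a range built from "pag" can never reach keys that merely contain "pag" in the middle.

```python
# Emulation of view key-range semantics (not Couchbase code): startkey/endkey
# select one contiguous slice of the lexicographically sorted keys.
keys = sorted(["hosepipe", "lxpag", "pagoda", "screw"])

def range_query(keys, startkey, endkey):
    # Inclusive lexicographic slice, like startkey/endkey on a view
    return [k for k in keys if startkey <= k <= endkey]

# A prefix range for "pag" finds keys *starting* with "pag" ("pagoda"),
# but can never reach "lxpag", which only *contains* "pag".
hits = range_query(keys, "pag", "pag\uffff")
```

Since "lxpag" sorts before "pag", it falls entirely outside the slice, which is exactly why prefix-style key ranges cannot emulate a contains-search.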

I just saw the flaw in my queries: the parameter names must be written in lowercase, otherwise Couchbase does not recognize them and silently ignores them (an error message here instead of the usual result list would be really helpful...). So instead, I have to query with
startkey=["box","pag"]&endkey=["box\u02ad","pag\u02ad"]
What I have not figured out yet is how to manage the substring search: since pag is only a substring of lxpag, the above query does not return any results. Any ideas on this matter?

Related

DocumentDB find by dynamic subdocument's value (Spring Data Mongo)

I have a collection with millions of documents. Each contains a dynamic subdocument (let's call it context).
I need to query for all of those which contain a given value.
Example documents:
{
  "name": "test",
  [...]
  "context": {
    "key": "some value",
    "another key": "some other value",
    [...]
  }
},
{
  "name": "joe doe",
  [...]
  "context": {
    "just another key": "a value here",
    "last example key": "and its value",
    [...]
  }
}
What I want is to find all documents whose context contains a given value, for example "a value here".
Using aggregation, I was able to transform all the values from the context into a single text and check if it contains that value. This just works fine, but it is slow as hell.
val filterExistsStage = Aggregation.match(Criteria("context").exists(true))
val flatVariablesStage = Aggregation.addFields().addField("text")
    .withValue(ObjectToArray.valueOfToArray("context")).build()
val extractValuesStage = Aggregation.addFields().addField("text")
    .withValue("\$text.v").build()
val condition = FilterCondition("text", filter.context, FilterOperation.CONTAINS)
val criteria = CRITERIA[condition.operator]!!(condition)
val matchStage = Aggregation.match(criteria)
return listOf(
    filterExistsStage,
    flatVariablesStage,
    extractValuesStage,
    matchStage
)

val CRITERIA = mapOf<FilterOperation, (condition: FilterCondition) -> Criteria>(
    FilterOperation.CONTAINS to { Criteria.where(it.field).regex(Regex.escape(it.value as String)) }
)
Another solution would be the so-called wildcard text indexes. These are really fast, but unfortunately DocumentDB (compatible with MongoDB 3.6 and 4.0) does not support them; they are only available from MongoDB v4.2.
mongoTemplate.indexOps(MyEntity::class.java)
    .ensureIndex(TextIndexDefinition.forAllFields())

val textCriteria = TextCriteria.forDefaultLanguage().matching(filter.context)
val textMatchStage = Aggregation.match(textCriteria)
I also thought of saving all these values into a single text field and doing a simple text search, but I'd rather avoid that kind of workaround if there are better solutions.
I wonder if there is any solution for this in DocumentDB, or is it better to use some kind of text search engine, like ElasticSearch?
It has to be fast on millions of documents.
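For clarity, here is a pure-Python sketch of what the aggregation above effectively computes (the sample docs and the search term are taken from the question; this only mirrors the $objectToArray-then-regex logic and is not DocumentDB code):

```python
import re

# The two example documents from the question, with dynamic "context" keys.
docs = [
    {"name": "test",
     "context": {"key": "some value", "another key": "some other value"}},
    {"name": "joe doe",
     "context": {"just another key": "a value here", "last example key": "and its value"}},
]

def contains_value(doc, term):
    # $objectToArray + "$text.v": flatten context to its values only
    values = list(doc.get("context", {}).values())
    # Regex.escape(...) + CONTAINS: a literal substring match on any value
    pattern = re.escape(term)
    return any(re.search(pattern, v) for v in values)

matches = [d["name"] for d in docs if contains_value(d, "a value here")]
```

This also makes the performance problem visible: without an index, every value of every document must be scanned, which is why the pipeline is slow on millions of documents.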

Terms Set Query's minimum_should_match_field does not behave as expected when the provided field has value zero

I am wondering why, in a "terms set" query, a field referenced by minimum_should_match_field that has the value 0 behaves as if it had the value 1.
To replicate the problem, I take the example from the Elasticsearch docs and construct the three steps below.
Step 1:
Create a new index
PUT /job-candidates
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "programming_languages": {
        "type": "keyword"
      },
      "required_matches": {
        "type": "long"
      }
    }
  }
}
Step 2:
Create two docs with required_matches set to zero
PUT /job-candidates/_doc/1?refresh
{
  "name": "Jane",
  "programming_languages": [ "c++", "java" ],
  "required_matches": 0
}
and also
PUT /job-candidates/_doc/2?refresh
{
  "name": "Ben",
  "programming_languages": [ "python" ],
  "required_matches": 0
}
Step 3:
Search for docs with the following search
GET /job-candidates/_search
{
  "query": {
    "terms_set": {
      "programming_languages": {
        "terms": [ "c++", "java" ],
        "minimum_should_match_field": "required_matches"
      }
    }
  }
}
Expected result: I expect step 3 to return both docs, "Jane" and "Ben".
Actual result: it only returns the doc "Jane".
I don't understand. If minimum_should_match is 0, doesn't that mean a returned doc does not need to match any terms, so the "Ben" doc should also be returned?
Some links I found, but they still don't answer my question:
minimum_should_match
It looks like minimum_should_match can't be zero, but it does not say how the search behaves if it is indeed zero, or greater than the number of optional clauses.
A discussion of the default value for minimum_should_match
But they didn't discuss the "terms set" query in particular.
Any clarification will be appreciated! Thanks.
When looking at the terms_set source code, we can see that the underlying Lucene query being used is called CoveringQuery.
So the explanation can be found in Lucene's source code of CoveringQuery, whose documentation says
Per-document long value that records how many queries should match. Values that are less than 1 are treated like 1: only documents that have at least one matching clause will be considered matches. Documents that do not have a value for minimumNumberMatch do not match.
And a little further, the code that sets minimumNumberMatch is pretty self-explanatory:
final long minimumNumberMatch = Math.max(1, minMatchValues.longValue());
We can simply sum it up by stating that it doesn't really make sense to send a terms_set query with minimum_should_match: 0 as it would be equivalent to a match_all query.
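A small Python sketch of that clamping behavior (matches_terms_set is a hypothetical helper that just mirrors the Math.max(1, ...) line from CoveringQuery) makes the observed results reproducible:

```python
# Sketch of CoveringQuery's clamping: minimum_should_match values below 1
# are treated as 1, so a doc matching zero terms can never be returned.
def matches_terms_set(doc_terms, query_terms, required_matches):
    minimum = max(1, required_matches)  # the Math.max(1, ...) clamp
    hits = len(set(doc_terms) & set(query_terms))
    return hits >= minimum

# Jane matches 2 of ["c++", "java"], Ben matches 0: with required_matches=0
# the clamp makes the effective minimum 1, so only Jane is returned.
jane = matches_terms_set(["c++", "java"], ["c++", "java"], 0)
ben = matches_terms_set(["python"], ["c++", "java"], 0)
```

This reproduces exactly the asymmetry from the question: "Jane" is returned, "Ben" is not.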

Match keys with sibling object JSONATA

I have a JSON object with the structure below. When looping over key_two I want to create a new object that I will return. The returned object should contain a title with the value of key_one's name, where key_one's id matches the node of the key_two entry currently being looped over.
Both objects contain other keys that will also be included, but the first step I can't figure out is how to grab data from a sibling object while looping and match it against the current value.
{
  "key_one": [
    {
      "name": "some_cool_title",
      "id": "value_one",
      ...
    }
  ],
  "key_two": [
    {
      "node": "value_one",
      ...
    }
  ]
}
This is a good example of a 'join' operation (in SQL terms). JSONata supports this in a path expression. See https://docs.jsonata.org/path-operators#-context-variable-binding
So in your example, you could write:
key_one#$k1.key_two[node = $k1.id].{
  "title": $k1.name
}
You can then add extra fields into the resulting object by referencing items from either of the original objects. E.g.:
key_one#$k1.key_two[node = $k1.id].{
  "title": $k1.name,
  "other_one": $k1.other_data,
  "other_two": other_data
}
See https://try.jsonata.org/--2aRZvSL
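If it helps to see the join spelled out imperatively, here is a plain-Python equivalent of the context-binding expression above (illustrative field values only; other_data stands in for the extra fields mentioned in the question):

```python
# Plain-Python sketch of the JSONata context-binding join: for each entry
# in key_one, pick the key_two entries whose "node" equals its "id".
data = {
    "key_one": [{"name": "some_cool_title", "id": "value_one", "other_data": "x"}],
    "key_two": [{"node": "value_one", "other_data": "y"}],
}

result = [
    {"title": k1["name"], "other_one": k1["other_data"], "other_two": k2["other_data"]}
    for k1 in data["key_one"]
    for k2 in data["key_two"]
    if k2["node"] == k1["id"]  # the [node = $k1.id] predicate
]
```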
I seem to have found a solution for this.
[key_two].$filter($$.key_one, function($v, $k){
  $v.id = node
}).{"title": name ? name : id}
Gives:
[
  {
    "title": "value_one"
  },
  {
    "title": "value_two"
  },
  {
    "title": "value_three"
  }
]
Leaving this here in case someone has a similar issue in the future.

couchDB- complex query on a view

I am using cloudantDB and want to query a view which looks like this
function (doc) {
  if (doc.name !== undefined) {
    emit([doc.name, doc.age], doc);
  }
}
What would be the correct way to get results if I have a list of names (I will use the keys=[] option for it) and a range of ages (for which startkey and endkey should be used)?
Example: I want to get persons named "john", "mark", "joseph" or "santosh" whose age lies between 20 and 30.
If I go for the list of names alone, the query would be keys=["john", ...], and if I go for the age alone, the query would use startkey and endkey. I want to do both :)
Thanks
Unfortunately, you can't do that. The keys parameter queries documents whose keys exactly match the specified values; it cannot be combined with a range. For example, sending keys=["John","Mark"]&startkey=[null,20]&endkey=[{},30] would only and ONLY return the documents named John or Mark with a null age.
In your question you specified CouchDB, but if you are using Cloudant, a query index might be interesting for you.
You could have something like this:
{
  "selector": {
    "$and": [
      {
        "name": {
          "$in": ["Mark", "John"]
        }
      },
      {
        "age": {
          "$gt": 20,
          "$lt": 30
        }
      }
    ]
  },
  "fields": [
    "name",
    "age"
  ]
}
As for CouchDB, you need to either split your request (one request for the age range and one for the names) or do the filtering locally.
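A sketch of that CouchDB fallback in Python (the rows below are hypothetical and stand in for the view's [name, age] keys): query by the list of names, then filter the age range client-side:

```python
# Hypothetical view rows, shaped like CouchDB view output with [name, age] keys.
rows = [
    {"key": ["john", 25], "value": {}},
    {"key": ["mark", 40], "value": {}},
    {"key": ["joseph", 22], "value": {}},
    {"key": ["alice", 28], "value": {}},
]

# Server side: keys=["john","mark","joseph","santosh"]; client side: age filter.
names = {"john", "mark", "joseph", "santosh"}
filtered = [r for r in rows
            if r["key"][0] in names and 20 <= r["key"][1] <= 30]
matched_names = [r["key"][0] for r in filtered]
```

mark is dropped by the age filter and alice by the name filter, leaving only the rows that satisfy both conditions.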

How to remove field from document which matches a pattern in elasticsearch using Java?

I have crawled a few documents and created an index in Elasticsearch. I am using Sense to query.
This is my Elasticsearch query:
POST /index/_update_by_query
{
  "script": {
    "inline": "ctx._source.remove(\"home\")"
  },
  "query": {
    "wildcard": {
      "url": {
        "value": "http://search.com/*"
      }
    }
  }
}
This is my Java program:
Client client = TransportClient.builder().addPlugin(ReindexPlugin.class)
        .build().addTransportAddress(new InetSocketTransportAddress(
                InetAddress.getByName("127.0.0.1"), 9300));

UpdateByQueryRequestBuilder ubqrb = UpdateByQueryAction.INSTANCE
        .newRequestBuilder(client);

Script script1 = new Script("ctx._source.remove" + FieldName);
BulkIndexByScrollResponse r = ubqrb.source("index").script(script1)
        .filter(wildcardQuery("url", patternvalue)).get();
FieldName (holding the string "home") is the name of the field I want to remove from my documents, and patternvalue holds the pattern "http://search.com/*". When I run this Java program, it doesn't remove the home field from my documents; instead it adds a new field called remove. I might be missing something. Any help would be appreciated.
If FieldName is the string home, then the expression "ctx._source.remove" + FieldName evaluates to "ctx._source.removehome", which is not a valid script. The correct code for that line is:
Script script1 = new Script("ctx._source.remove(\"" + FieldName + "\")");
This way the script will be:
ctx._source.remove("home")
That is the same as you wrote in json in:
"inline": "ctx._source.remove(\"home\")"
(\" in that json is just a " escaped in the json syntax)
