Indexing/search algorithm stability between versions - elasticsearch

I'm migrating from Elasticsearch 1.5 to 7.10. There are multiple required changes; the most relevant one is the removal of the document type concept in version 6. To deal with it, I introduced a new field, doc_type, and I match on it when I search.
My question is: when I run the same search query (or an equivalent one, because there are some changes), should I expect exactly the same result set? I'm seeing some differences, so I would like to figure out whether I broke something in the new mappings or in the search query.
Thank you in advance
Edit after first question:
In general: I have a service that communicates with ES 1.5 and I have to migrate it to ES 7.10 keeping the external API as stable as possible.
I'm not using scoring.
Previously I had document types A and B, and I made queries like this, for example: host/indexname/A,B/_search. After the migration I keep A or B in doc_type, and the query becomes host/indexname/_search with a "bool":{"should":[{"terms":{"doc_type":["A"],"boost":1.0}},{"terms":{"doc_type":["B"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0} in the body. If I put A and B in different indexes and the user wants to match in both of them, I'll have to "merge" the search responses of both queries, and I don't know which strategy I should follow for that. So, keeping it all together, I get a response with mixed (doc_type) results from ES. I followed this specific approach: https://www.elastic.co/blog/removal-of-mapping-types-elasticsearch#custom-type-field
The differences are not so big. It's difficult to show a concrete example because it's a complex data/doc structure, but the idea is this: for a given query, 1.5 returns, for example:
[a, b, c, d, e, f, g, h, i, j] (where each one may have any of types A or B)
With 7.10 I'm having responses like:
[a, b, e, c, d, f, g, h, i, j] or [a, b, c, d, e, g, i, j, k]
Second edit:
This query was generated by the Java client.
{
"from":0,
"size":100,
"query":{
"bool":{
"must":[
{
"query_string":{
"query":"mark_deleted:false",
"fields":[
],
"type":"best_fields",
"default_operator":"or",
"max_determinized_states":10000,
"enable_position_increments":true,
"fuzziness":"AUTO",
"fuzzy_prefix_length":0,
"fuzzy_max_expansions":50,
"phrase_slop":0,
"escape":false,
"auto_generate_synonyms_phrase_query":true,
"fuzzy_transpositions":true,
"boost":1.0
}
},
{
"bool":{
"should":[
{
"terms":{
"type":[
"A"
],
"boost":1.0
}
},
{
"terms":{
"type":[
"B"
],
"boost":1.0
}
},
{
"terms":{
"type":[
"D"
],
"boost":1.0
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
},
"post_filter":{
"term":{
"mark_deleted":{
"value":false,
"boost":1.0
}
}
},
"sort":[
{
"a_specific_date":{
"order":"desc"
}
}
],
"highlight":{
"pre_tags":[
"<b>"
],
"post_tags":[
"</b>"
],
"no_match_size":120,
"fields":{
"body":{
"fragment_size":120,
"number_of_fragments":1
}
}
}
}

First, since you don't care about scoring, you should use bool/filter instead of bool/must at the top level. Otherwise your results are sorted by _score by default, and between 1.5 and 7.10 there have been so many changes that this would explain the differences you get. So you're better off simply sorting the results on any field other than _score.
Second, instead of the bool/should on type you can use a simple terms query, which does exactly the same job, yet in a simpler way:
{
"from": 0,
"size": 100,
"query": {
"bool": {
"filter": [
{
"query_string": {
"query": "mark_deleted:false",
"fields": [],
"type": "best_fields",
"default_operator": "or",
"max_determinized_states": 10000,
"enable_position_increments": true,
"fuzziness": "AUTO",
"fuzzy_prefix_length": 0,
"fuzzy_max_expansions": 50,
"phrase_slop": 0,
"escape": false,
"auto_generate_synonyms_phrase_query": true,
"fuzzy_transpositions": true,
"boost": 1
}
},
{
"terms": {
"type": [
"A",
"B",
"D"
]
}
}
]
}
},
"post_filter": {
"term": {
"mark_deleted": {
"value": false,
"boost": 1
}
}
},
"sort": [
{
"a_specific_date": {
"order": "desc"
}
}
],
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"no_match_size": 120,
"fields": {
"body": {
"fragment_size": 120,
"number_of_fragments": 1
}
}
}
}
Finally, I'm not sure why you're using a query_string query to do an exact match on mark_deleted:false; it doesn't make sense to me. A simple term query would be a better fit here.
It's also not clear why your post_filter keeps only results that have mark_deleted:false, since that's the same condition as in your query_string constraint.
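To make the suggestions above concrete, here is a minimal sketch in Python that builds the simplified request body as a plain dict: a term filter on mark_deleted, a single terms filter on the type field, everything in filter context, and an explicit sort. The function name build_search_body and its parameters are made up for illustration; the field names come from the question. The highlight section is left out for brevity.

```python
import json


def build_search_body(doc_types, type_field="type", size=100):
    """Build a search body that filters without scoring and sorts by date.

    build_search_body is a hypothetical helper, not part of any client library.
    """
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                # Clauses in filter context do not contribute to _score.
                "filter": [
                    {"term": {"mark_deleted": False}},
                    {"terms": {type_field: list(doc_types)}},
                ]
            }
        },
        # Explicit sort, so the order does not depend on scoring.
        "sort": [{"a_specific_date": {"order": "desc"}}],
    }


body = build_search_body(["A", "B", "D"])
print(json.dumps(body, indent=2))
```

Documents that share the same a_specific_date can still come back in a different relative order, so that is worth checking when comparing the two clusters.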

Related

Cardinality aggregation is returning an invalid total count when executed on a larger amount of data

The cardinality aggregation returns an incorrect result when the index holds more data. I used the query below to find the distinct values of txnids. In one day we receive more than 60000 txns, with duplicate entries. I went through a few other threads, but I haven't found a correct solution. I also tried precision_threshold, but I'm still not getting accurate distinct counts. Is it possible to get an exact distinct count? I have read in some documentation that ES only does an approximate distinct count. I used the query below:
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"range": {
"dateRange": {
"from": "2022-03-14",
"to": "2022-03-14",
"include_lower": true,
"include_upper": true,
"boost": 1.0
}
}
},
{
"terms": {
"status.keyword": [
"Success"
],
"boost": 1.0
}
}
],
"must_not": [],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"sort": [
{
"id.keyword": {
"order": "asc"
}
}
],
"search_after": [
""
],
"aggregations": {
"distinctRecords": {
"cardinality": {
"field": "recID.keyword",
"precision_threshold": 40000
}
}
}
}
My java code for cardinality
CardinalityAggregationBuilder distinctCount = AggregationBuilders.cardinality("distinctRecords")
.field("recID.Keyword").precisionThreshold(40000);
You can use a Scripted Metric Aggregation.
Considering you are using RestHighLevelClient, use ScriptedMetricAggregationBuilder instead of the cardinality aggregation builder.
You can copy the script from either of the links provided or write your own, and put it wherever "copy the script here" appears.
Map<String,Object> params = new HashMap<>();
params.put("fieldName","recID.keyword");
ScriptedMetricAggregationBuilder scriptedMetricAggregationBuilder = AggregationBuilders.scriptedMetric("distinct_count")
.params(params)
.initScript(new Script("copy the script here"))
.mapScript(new Script("copy the script here"))
.combineScript(new Script("copy the script here"))
.reduceScript(new Script("copy the script here"));
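To show what such a script has to compute, here is a small pure-Python simulation of the scripted metric phases. It is not the Painless script itself and it does not talk to Elasticsearch; the function names simply mirror the init, map, combine and reduce phases, and the sample shards are invented for illustration.

```python
def init_script():
    # init: each shard starts with an empty set of seen values.
    return set()


def map_script(state, doc, field_name):
    # map: called once per matching document on the shard.
    value = doc.get(field_name)
    if value is not None:
        state.add(value)


def combine_script(state):
    # combine: what each shard sends back to the coordinating node.
    return state


def reduce_script(states):
    # reduce: merge the per-shard sets and count the union.
    merged = set()
    for shard_state in states:
        merged |= shard_state
    return len(merged)


def exact_distinct_count(shards, field_name):
    states = []
    for docs in shards:
        state = init_script()
        for doc in docs:
            map_script(state, doc, field_name)
        states.append(combine_script(state))
    return reduce_script(states)


# Hypothetical data: two shards, one value duplicated across shards.
shards = [
    [{"recID": "t1"}, {"recID": "t2"}, {"recID": "t1"}],
    [{"recID": "t2"}, {"recID": "t3"}],
]
print(exact_distinct_count(shards, "recID"))  # 3
```

The count is exact because every distinct value is kept in memory and shipped to the coordinating node, which is also why this approach gets expensive as the number of distinct values grows.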

Elasticsearch ordering by field value which is not in the filter

Can somebody please help me write a query that orders the result items by the value of a field, even though that field is not part of the query in the request? I have this query:
{
"_source": [
"ico",
"name",
"city",
"status"
],
"sort": {
"_score": "desc",
"status": "asc"
},
"size": 20,
"query": {
"bool": {
"should": [
{
"match": {
"normalized": {
"query": "idona",
"analyzer": "standard",
"boost": 3
}
}
},
{
"term": {
"normalized2": {
"value": "idona",
"boost": 2
}
}
},
{
"match": {
"normalized": "idona"
}
}
]
}
}
}
The result is sorted by the status field alphabetically, ascending. Status contains a few values like [active, canceled, old....], and I need something like a boost for each possible value in the query, e.g. active boost 5, canceled boost 4, old boost 3, and so on. Is it possible to do this? Thanks.
You would need a custom sort using a script to achieve what you want.
I've used a generic match_all query here; you can replace it with your own query logic. The part you are looking for is the sort section of the query below.
Make sure that status is of type keyword.
Custom Sorting Based on Values
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{ "_score": "desc" },
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"if(params.scores.containsKey(doc['status'].value)) { return params.scores[doc['status'].value];} return 100000;",
"params":{
"scores":{
"active":5,
"old":4,
"cancelled":3
}
}
},
"order":"desc"
}
}
]
}
In the above query, go ahead and add your values to the scores section. For example, if you have a value new and you want it to rank at, say, 2, then your scores would be as below:
{
"scores":{
"active":5,
"old":4,
"cancelled":3,
"new":2
}
}
So basically the documents are first sorted by _score, and the script sort is then applied to documents that have the same _score.
Note that the script sort is desc, as I understand that you want to show active documents at the top, followed by the other values. Feel free to play around with it.
Hope this helps!
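For anyone who wants to check the resulting order without a cluster, here is a rough Python sketch that reproduces the two sort keys locally: _score descending first, then the status weight descending, with 100000 as the fallback for statuses missing from the map, as in the script above. The sample hits and the sort_hits helper are invented for illustration.

```python
def sort_hits(hits, scores, default=100000):
    """Order hits by _score desc, then by status weight desc."""

    def key(hit):
        weight = scores.get(hit["status"], default)
        # Negate both values so that an ascending sort yields desc order.
        return (-hit["_score"], -weight)

    return sorted(hits, key=key)


# Hypothetical hits; three of them share the same _score.
hits = [
    {"name": "a", "_score": 1.0, "status": "old"},
    {"name": "b", "_score": 1.0, "status": "active"},
    {"name": "c", "_score": 2.0, "status": "cancelled"},
    {"name": "d", "_score": 1.0, "status": "unknown"},
]
scores = {"active": 5, "old": 4, "cancelled": 3}

ordered = [hit["name"] for hit in sort_hits(hits, scores)]
print(ordered)  # ['c', 'd', 'b', 'a']
```

Two things show up here: the status weight only matters among hits with an identical _score, and a status that is not in the map sorts ahead of active because of the 100000 fallback. Returning a small fallback instead would push unknown statuses to the end.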

Is it possible to sort by a range in Elasticsearch?

When I execute the following query:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}
I get the error:
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is not supported on field [my_range] of type [long_range]"
}
How can I enable a range datatype to be sortable? Is this possible?
Elasticsearch version: 5.4, but I am wondering if this is possible with ANY version.
More context
Not all documents in the alias/index have the range field. However, the query filters to only include documents with that field.
It is not straightforward to sort on a field of a range data type. Still, you can use script-based sorting to some extent to get the expected result.
For example, to keep the script simple, I'm assuming that for all your docs the my_range field has values for gt and lte only, and that you want to sort on the minimum of the two. You can then add the following sort:
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"range": {
"my_range": {
"gt": 0,
"lte": 200
}
}
}
]
}
},
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"inline": "Math.min(params['_source']['my_range']['gt'], params['_source']['my_range']['lte'])"
},
"order": "asc"
}
}
}
You can modify the script as per your needs for complex data involving combination of all lt, gt, lte, gte.
Updates (Scripts for other different use cases):
1. Sort by difference
"Math.abs(params['_source']['my_range']['gt'] - params['_source']['my_range']['lte'])"
2. Sort by gt
"params['_source']['my_range']['gt']"
3. Sort by lte
"params['_source']['my_range']['lte']"
4. Sorting if query returns few docs which don't have range field
"if(params['_source']['my_range'] != null) { <sorting logic> } else { return 0; }"
Replace <sorting logic> with the required sorting logic (one of the three above or the one in the query).
return 0 can be replaced by return -1 or any other number, depending on the sorting needs.
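As an illustration of how these variants fit together, here is a small Python sketch that assembles the sort clause from a chosen sorting expression and a fallback value for documents without the range field. build_range_sort is a made-up helper, not a client API. It uses the inline key as in the query above; newer Elasticsearch versions name that key source instead.

```python
import json


def build_range_sort(sort_logic, fallback=0, order="asc"):
    """Wrap a Painless expression in the null check for my_range."""
    script = (
        "if (params['_source']['my_range'] != null) "
        "{ return " + sort_logic + "; } "
        "else { return " + str(fallback) + "; }"
    )
    return {
        "_script": {
            "type": "number",
            "script": {"lang": "painless", "inline": script},
            "order": order,
        }
    }


# Variant 2 from the list above (sort by gt), with -1 for missing ranges.
sort_clause = build_range_sort("params['_source']['my_range']['gt']", fallback=-1)
print(json.dumps(sort_clause, indent=2))
```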
I think what you are looking for is a sort based on the difference of the range, because I'm not sure that simply sorting on one of the range values would make sense.
For example, if the range for one document is 100 to 300 and for another 200 to 600, you would want the narrower range, i.e. 300 - 100 = 200, to appear at the top.
If so, the painless script below implements script-based sorting for that.
Sorting based on difference in Range
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte-params._source.my_range.gte"
},
"order":"asc"
}
}
}
Note that in this case, sort won't be based on any of the field values of my_range but only on their differences. If you want to further sort based on the fields like lte, lt, gte or gt you can have your sort implemented with multiple script as below:
Sorting based on difference in Range + Range Field (my_range.lte)
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte - params._source.my_range.gte"
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
]
}
So in this case, even if the range differences of two documents are the same, the one with the smaller my_range.lte shows up first.
Sort based on range field
However if you simply want to sort based on one of the range values, you can make use of below query.
POST <your_index_name>/_search
{
"query":{
"match_all":{
}
},
"sort":{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"params._source.my_range.lte"
},
"order":"asc"
}
}
}
Updated Answer to manage documents without range
This is for the scenario: sort based on the difference in the range, then on Range.lte or Range.lt, whichever is present.
The code below does the following:
1. Checks whether the document has a my_range field.
2. If it doesn't, it returns Long.MAX_VALUE by default. This means that if you sort by asc, this document should be returned last.
3. Checks whether the document has lte or lt and uses that value as high. Note that the default value of high is Long.MAX_VALUE.
4. Similarly, checks whether the document has gte or gt and uses that value as low. The default value of low is 0.
5. Calculates high - low, the value on which the sort is applied.
Updated Query
POST <your_index_name>/_search
{
"size":100,
"query":{
"match_all":{
}
},
"sort":[
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
} else {
long high = Long.MAX_VALUE;
long low = 0L;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
if(params._source.my_range.gte!=null){
low = params._source.my_range.gte;
} else if (params._source.my_range.gt!=null){
low = params._source.my_range.gt;
}
return high - low;
}
"""
},
"order":"asc"
}
},
{
"_script":{
"type":"number",
"script":{
"lang":"painless",
"inline":"""
if(params._source.my_range==null){
return Long.MAX_VALUE;
}
long high = Long.MAX_VALUE;
if(params._source.my_range.lte!=null){
high = params._source.my_range.lte;
} else if(params._source.my_range.lt!=null){
high = params._source.my_range.lt;
}
return high;"""
},
"order":"asc"
}
}
]
}
This should work with ES 5.4. Hope it helps!
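To sanity-check the ordering this produces, here is a Python sketch that mirrors the logic described in the steps above (use lte or lt as high when present, gte or gt as low when present) and sorts a few invented documents locally. It only simulates the two sort keys; it does not run Painless, and range_sort_keys is a name made up for the example.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def range_sort_keys(source):
    """Return (high - low, high), the two sort keys described above."""
    rng = source.get("my_range")
    if rng is None:
        # Documents without the field sort last in ascending order.
        return (LONG_MAX, LONG_MAX)

    high = LONG_MAX
    if rng.get("lte") is not None:
        high = rng["lte"]
    elif rng.get("lt") is not None:
        high = rng["lt"]

    low = 0
    if rng.get("gte") is not None:
        low = rng["gte"]
    elif rng.get("gt") is not None:
        low = rng["gt"]

    return (high - low, high)


# Hypothetical documents.
docs = [
    {"id": 1, "my_range": {"gte": 200, "lte": 600}},
    {"id": 2, "my_range": {"gt": 100, "lt": 300}},
    {"id": 3},
    {"id": 4, "my_range": {"gte": 0, "lte": 200}},
]

ordered = [doc["id"] for doc in sorted(docs, key=range_sort_keys)]
print(ordered)  # [4, 2, 1, 3]
```

Documents 4 and 2 have the same difference (200), so the second key decides between them: the one with the smaller upper bound comes first.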
This can be resolved easily by using the regexp interval operator. From the Elastic docs:
Interval: the interval option enables the use of numeric ranges, enclosed by angle brackets "<>". For the string "foo80":
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the INTERVAL or ALL flags.
{
"query": {
"bool": {
"filter": [
{
"match": {
"my_value": "hi"
}
},
{
"regexp": {
"my_range": {
"value": "<0-200>"
}
}
}
]
}
},
"sort": {
"my_range": {
"order": "asc",
"mode": "min"
}
}
}

Elasticsearch Sorting by Likes and Dislikes

I've been struggling to express the current logic problem I'm trying to solve with Elasticsearch, and I think I have a good way to represent it.
Let's say I'm building out an API to sort Mario Kart characters in order of the user's preference. The user can list characters they like, and those they dislike. Here is the data set:
{character: {name: "Mario", weight: "Light"}},
{character: {name: "Luigi", weight: "Medium"}},
{character: {name: "Peach", weight: "Light"}},
{character: {name: "Bowser", weight: "Heavy"}},
{character: {name: "Toad", weight: "Light"}},
{character: {name: "Koopa", weight: "Medium"}}
The user inputs that they like Mario and Luigi and do not like Bowser. With Elasticsearch, how could I go about sorting this data for the user so the list is returned like so:
[Mario (+), Luigi (+), Peach, Toad, Koopa, Bowser (-)]
*Pluses and minuses in there for legibility.
This would return the user's top choices in front, the ones they are OK with in the middle, and the ones they don't prefer at the end. Having to use nested queries really trips me up here.
Evolving the query, let's say there's a team mode where each team is a pair of characters, with the pairs determined by the game as follows:
[Luigi (+), Bowser (-)]
[Mario (+), Peach]
[Toad, Koopa]
How do I ensure that I don't filter out teams that contain Bowser, yet still weight the results so that the order is like so:
[Mario (+), Peach]
[Toad, Koopa]
[Luigi (+), Bowser (-)]
Or, should [Luigi, Bowser] actually rank second?
I'm very confused about building complex queries like these in Elasticsearch and would appreciate any help.
Depending on your mapping, something along the lines of
GET /characters/_search
{
"sort":[
"_score"
],
"query":{
"bool":{
"should":[
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Mario"
}
},
"boost":2.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Luigi"
}
},
"boost":2.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Peach"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Toad"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Koopa"
}
},
"boost":1.0
}
},
{
"constant_score":{
"filter":{
"term":{
"name.keyword":"Bowser"
}
},
"boost":0
}
}
]
}
}
}
should work.
PS: If you have a nested mapping, then wrap the bool query in a nested query clause and adjust the field name paths. To return only the name field, add a _source clause before the query with the path to name as its value.
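Writing one constant_score clause per character by hand gets tedious, so here is a minimal Python sketch that generates the should clauses from the user's likes and dislikes. The boosts (2.0 liked, 1.0 neutral, 0.0 disliked) follow the query above; build_preference_query is a hypothetical helper and name.keyword is assumed to be the exact-match field.

```python
import json


def build_preference_query(names, likes, dislikes, field="name.keyword"):
    """Build a bool/should query with one constant_score clause per name."""
    likes, dislikes = set(likes), set(dislikes)
    clauses = []
    for name in names:
        if name in likes:
            boost = 2.0
        elif name in dislikes:
            boost = 0.0
        else:
            boost = 1.0
        clauses.append(
            {"constant_score": {"filter": {"term": {field: name}}, "boost": boost}}
        )
    return {"sort": ["_score"], "query": {"bool": {"should": clauses}}}


names = ["Mario", "Luigi", "Peach", "Bowser", "Toad", "Koopa"]
query = build_preference_query(names, likes=["Mario", "Luigi"], dislikes=["Bowser"])
print(json.dumps(query, indent=2))
```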
First off I gotta say - IMHO using Elasticsearch for this is major overkill. You should probably go with a much simpler in memory data structure for this calculation.
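As a rough idea of what the in-memory version could look like, here is a short Python sketch. The bucket order (liked, neutral, disliked) follows the question; the team weights (+1 for a liked member, -2 for a disliked one) are an assumption picked so that a dislike outweighs a like, and should be tuned to taste.

```python
def rank_characters(names, likes, dislikes):
    """Liked characters first, neutral ones next, disliked ones last."""
    likes, dislikes = set(likes), set(dislikes)

    def bucket(name):
        if name in likes:
            return 0
        if name in dislikes:
            return 2
        return 1

    # sorted() is stable, so the original order is kept inside each bucket.
    return sorted(names, key=bucket)


def rank_teams(teams, likes, dislikes, like_weight=1, dislike_weight=-2):
    """Rank teams by the sum of their members' weights, best first."""
    likes, dislikes = set(likes), set(dislikes)

    def team_score(team):
        total = 0
        for member in team:
            if member in likes:
                total += like_weight
            elif member in dislikes:
                total += dislike_weight
        return total

    return sorted(teams, key=team_score, reverse=True)


names = ["Mario", "Luigi", "Peach", "Bowser", "Toad", "Koopa"]
likes, dislikes = ["Mario", "Luigi"], ["Bowser"]
teams = [["Luigi", "Bowser"], ["Mario", "Peach"], ["Toad", "Koopa"]]

print(rank_characters(names, likes, dislikes))
print(rank_teams(teams, likes, dislikes))
```

With these weights the team containing Bowser ends up last even though it also contains a liked character. With equal weights for likes and dislikes it would tie with the neutral team instead.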
Assuming you do decide to implement this with Elasticsearch, I would do the following thing:
1) Represent each character as a document using this mapping -
PUT game/characters/_mapping
{
"properties": {
"name":{
"type": "keyword"
},
"weight": {
"type": "keyword"
}
}
}
2) Each character will look like so:
PUT game/characters/bowser
{
"name": "Bowser",
"weight": "Heavy"
}
3) And then you can fetch them ordered by likes, similarly to how @sramalingam24 suggested. Note that boosts must be non-negative, so you'd need to "normalize" the likeability of the characters to a non-negative range:
GET game/characters/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"constant_score": {
"filter": {
"term": {
"name": "Peach"
}
},
"boost": 2
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Mario"
}
},
"boost": 2
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Toad"
}
},
"boost": 1
}
},{
"constant_score": {
"filter": {
"term": {
"name": "Bowser"
}
},
"boost": 0
}
}
]
}
}
}
Good luck!

N1QL vs ElasticSearch Join

My documents are:
Iwp::1::Porcentaje::Period::1
{
"id": null,
"period": 1,
"type": "IwpCumulative",
"category": "Porcentaje",
"sumEarn": 0,
"sumActual": 0.2248520710059172,
"sumForecast": 0,
"sumPlanned": 0,
"sumValue": 0,
"parent": "Iwp::1"
}
Iwp::1
{
"name": "Iwp 1",
"description": "Iwp 1 Description",
"manyPeriods": 50,
"type": "Iwp",
"countCC": 0,
"costCode": [
"CostCode::3",
"CostCode::4"
],
"iwpCumulatives": [
"Iwp::1::Porcentaje::Period::1",
.......
"Iwp::1::Porcentaje::Period::50",
"Iwp::1::Qty::Period::1",
........
"Iwp::1::Qty::Period::50",
]
}
How could I do this query in Elasticsearch?
N1QL:
select
t.category,
t.period,
sum(t.sumActual)
from
default as q
inner join default as p on keys q.parent
inner join default as t on keys p.iwpCumulatives
where
q.type = 'IwpCumulative'
and q.period = 50
and q.sumActual > 0
and q.category = 'Porcentaje'
group by t.category,t.period
order by t.period,t.category;
I have these queries in Elasticsearch:
{
"query":{
"filtered":{
"query":{
"bool":{
"must":[
{"term":{"period":"5"}},
{"term":{"type":"iwpcumulative"}},
{"range":{"sumActual":{"gt":"0"}}},
{"term":{"category":"porcentaje"}}
]
}
}
}
}
}
and this:
{
"size":0,
"aggs":{
"group_by_state":{
"terms":{
"field":"category"
},
"aggs":{
"costars":{
"terms":{
"field":"period"
},
"aggs":{
"Suma":{
"sum":{
"field":"earn"
}
}
}
}
}
}
}
}
Now I need to take the IDs from the first result and use them in the second query.
Thanks in advance.
Because your select only uses results from one table (a semi join), you can use the siren-join plugin for Elasticsearch:
Look at this:
SIREn Plugin to add relational join capabilities to Elasticsearch
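If installing a plugin is not an option, the two-step approach described in the question can also be done in application code. The sketch below is only an outline under several assumptions: the first query returns the matching IwpCumulative documents, every document listed in iwpCumulatives stores its parent key in the parent field, and parent is indexed as an exact (not analyzed) value. The helper names and the sample response are invented; only the field names come from the question.

```python
import json


def collect_parent_ids(first_response):
    """Extract the distinct parent keys from the first query's hits."""
    hits = first_response["hits"]["hits"]
    return sorted({hit["_source"]["parent"] for hit in hits})


def build_aggregation_body(parent_ids):
    """Second query: restrict to those parents, then group and sum."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [{"terms": {"parent": parent_ids}}]}},
        "aggs": {
            "by_category": {
                "terms": {"field": "category"},
                "aggs": {
                    "by_period": {
                        "terms": {"field": "period"},
                        "aggs": {"sum_actual": {"sum": {"field": "sumActual"}}},
                    }
                },
            }
        },
    }


# Hypothetical, trimmed-down response of the first query.
first_response = {
    "hits": {
        "hits": [
            {"_source": {"parent": "Iwp::1"}},
            {"_source": {"parent": "Iwp::2"}},
            {"_source": {"parent": "Iwp::1"}},
        ]
    }
}

parent_ids = collect_parent_ids(first_response)
second_body = build_aggregation_body(parent_ids)
print(json.dumps(second_body, indent=2))
```

The first query has to return all matching documents (or be paged through) for the parent list to be complete, so this only scales to a moderate number of parents.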

Resources