Elastic Query Filters Challenge - elasticsearch

I have the following query, generating a top 100 sellers list for a given supplier ID. It runs against a sales index and looks up the product SKUs for the given supplier in a product_skus index. This works well.
query = {
  size: 0,
  query: {
    bool: {
      filter: [
        {
          constant_score: {
            filter: {
              terms: {
                sku: {
                  index: "product_skus",
                  type: "product",
                  id: supplier_id,
                  path: "skus"
                }
              }
            }
          }
        }
      ],
      must_not: []
    }
  },
  aggs: {
    unit_sum: {
      terms: {
        field: "sku",
        size: 100,
        order: {
          one: "desc"
        }
      },
      aggs: {
        one: {
          sum: {
            field: "units"
          }
        }
      }
    }
  }
}
Now I have a scenario where a given user needs their access restricted to a subset of the supplier's SKUs. I am trying to work out the best way to tackle this. I am leaning towards having another index of the SKUs a user can access and doing a second lookup, but I can't quite get my head around the query logic.
In simple terms, for example: if, in the above query, supplier 1 returns products [A,B,C,D,E],
and user John should only see results based on products [A,C,E],
how would I go about writing the query to do this? Is it as simple as adding a should clause after the filter inside the bool?
Thanks in advance!

Routing is probably what you need in this situation, as your scenario allows you to use routing per user. As an additional bonus of organizing your data into separate shards, it can improve performance when routing is used at query time. Why? Because with routing, the request is sent only to the shard containing the relevant data instead of to every node across the cluster.
What would it look like in your case? Let's have a look with a simple mapping and a product that should only be accessed with the id 123.
The mapping of product_skus (modify as needed):
PUT product_skus
{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "product": {
      "_routing": {
        "required": true
      },
      "properties": {
        "supplierId": {
          "type": "integer"
        },
        "path": {
          "type": "string"
        }
      }
    }
  }
}
Now let's index a product in the product type (notice the routing):
POST product_skus/product?routing=123
{
  "supplierId": 123,
  "path": "some/path"
}
And finally two requests and their output using the routing:
GET product_skus/_search?routing=123
{
  "query": {
    "match_all": {}
  }
}
Output:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "product_skus",
        "_type": "product",
        "_id": "AVrMHzgx28yun46LEMYm",
        "_score": 1,
        "_routing": "123",
        "_source": {
          "supplierId": 123,
          "path": "some/path"
        }
      }
    ]
  }
}
Second query:
GET product_skus/_search?routing=124
{
  "query": {
    "match_all": {}
  }
}
Output:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
This is only a simple example; you might want to check the documentation for more information:
The _routing field
Another routing example
An example of routing with fields of the type
In addition, the following shows that only one shard is used when routing is given:
GET product_skus/_search_shards?routing=123
Output:
{
  "nodes": {
    "1sMKtN6aQ9yyOsTjknWyQA": {
      "name": "1sMKtN6",
      "ephemeral_id": "X-V2QGTwTmqUFQb1B6KIUw",
      "transport_address": "127.0.0.1:9300",
      "attributes": {}
    }
  },
  "shards": [
    [
      {
        "state": "STARTED",
        "primary": true,
        "node": "1sMKtN6aQ9yyOsTjknWyQA",
        "relocating_node": null,
        "shard": 0,
        "index": "product_skus",
        "allocation_id": {
          "id": "1MMkFaALRxm1N-x8J8AGhg"
        }
      }
    ]
  ]
}
See the search shards API for more details.
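For completeness, the two-lookup approach the question leans towards would also work without routing: a bool.filter array ANDs its clauses together, so adding a second terms-lookup filter against a separate per-user index narrows the results to the intersection ([A,C,E] in the example), whereas a should clause would only widen the match. A sketch only; the user_skus index name, types, and ids below are assumptions:

```json
GET sales/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "sku": {
              "index": "product_skus",
              "type": "product",
              "id": "1",
              "path": "skus"
            }
          }
        },
        {
          "terms": {
            "sku": {
              "index": "user_skus",
              "type": "user",
              "id": "john",
              "path": "skus"
            }
          }
        }
      ]
    }
  }
}
```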

Change field type in index without reindex

First, I had this index template:
GET localhost:9200/_index_template/document
And this is the output:
{
  "index_templates": [
    {
      "name": "document",
      "index_template": {
        "index_patterns": [
          "v*-documents-*"
        ],
        "template": {
          "settings": {
            "index": {
              "number_of_shards": "1"
            }
          },
          "mappings": {
            "properties": {
              "firstOperationAtUtc": {
                "format": "epoch_millis",
                "ignore_malformed": true,
                "type": "date"
              },
              "firstOperationAtUtcDate": {
                "ignore_malformed": true,
                "type": "date"
              }
            }
          },
          "aliases": {
            "documents-": {}
          }
        },
        "composed_of": [],
        "priority": 501,
        "version": 1
      }
    }
  ]
}
And my data is indexed, for example
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "exists": {
            "field": "firstOperationAtUtc"
          }
        }
      ]
    }
  }
}
Output is
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "v2-documents-2021-11-20",
        "_type": "_doc",
        "_id": "9b46d6fe78735274342d1bc539b084510000000455",
        "_score": 1.0,
        "_source": {
          "firstOperationAtUtc": 1556868952000,
          "firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
        }
      }
    ]
  }
}
Next, I need to update the mapping for the field firstOperationAtUtc and remove the epoch_millis format:
PUT localhost:9200/_template/document
{
  "index_patterns": [
    "v*-documents-*"
  ],
  "template": {
    "settings": {
      "index": {
        "number_of_shards": "1"
      }
    },
    "mappings": {
      "properties": {
        "firstOperationAtUtc": {
          "ignore_malformed": true,
          "type": "date"
        },
        "firstOperationAtUtcDate": {
          "ignore_malformed": true,
          "type": "date"
        }
      }
    },
    "aliases": {
      "documents-": {}
    }
  },
  "version": 1
}
After that, if I run the previous search request, I still get the indexed data.
But now I need to update the field firstOperationAtUtc and set its value from firstOperationAtUtcDate:
POST localhost:9200/v2-documents-2021-11-20/_update_by_query
{
  "script": {
    "source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtc = ctx._source.firstOperationAtUtcDate }",
    "lang": "painless"
  },
  "query": {
    "match": {
      "_id": "9b46d6fe78735274342d1bc539b084510000000455"
    }
  }
}
After that, if I run the previous request:
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "exists": {
            "field": "firstOperationAtUtc"
          }
        }
      ]
    }
  }
}
I get no results:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
But if I search by id, I do get the document with the modified data, though the field is listed as ignored:
GET localhost:9200/v2-documents-2021-11-20/_search
{
  "query": {
    "terms": {
      "_id": [ "9b46d6fe78735274342d1bc539b084510000000455" ]
    }
  }
}
Output is
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "v2-documents-2021-11-20",
        "_type": "_doc",
        "_id": "9b46d6fe78735274342d1bc539b084510000000455",
        "_score": 1.0,
        "_ignored": [
          "firstOperationAtUtc"
        ],
        "_source": {
          "firstOperationAtUtc": "2019-05-03T13:35:52.000Z",
          "firstOperationAtUtcDate": "2019-05-03T13:35:52.000Z"
        }
      }
    ]
  }
}
How can I index this data without a reindex? I have billions of documents in the index, and reindexing could cause huge downtime in production.
What you changed is the index template, but not your index mapping. The index template is only used when a new index matching the name pattern is created.
What you want to do is modify the actual mapping of your index, like this:
PUT test/_mapping
{
  "properties": {
    "firstOperationAtUtc": {
      "ignore_malformed": true,
      "type": "date"
    }
  }
}
However, this won't be possible, and you will get the following error, which makes sense, as you cannot modify an existing field mapping:
Mapper for [firstOperationAtUtc] conflicts with existing mapper:
Cannot update parameter [format] from [epoch_millis] to [strict_date_optional_time||epoch_millis]
The only reason why your update by query seemed to work is because you have "ignore_malformed": true in your mapping. Because if you remove that parameter and try to run your update by query again, you'd see the following error:
"type" : "mapper_parsing_exception",
"reason" : "failed to parse field [firstOperationAtUtc] of type [date] in document with id '2'. Preview of field's value: '2019-05-03T13:35:52.000Z'",
"caused_by" : {
  "type" : "illegal_argument_exception",
  "reason" : "failed to parse date field [2019-05-03T13:35:52.000Z] with format [epoch_millis]",
  "caused_by" : {
    "type" : "date_time_parse_exception",
    "reason" : "date_time_parse_exception: Failed to parse with all enclosed parsers"
  }
}
So, to wrap it up, you have two options:
Create a new index with the right mapping and reindex your old index into it, but that doesn't seem like an option for you.
Create a new field in your existing index mapping (e.g. firstOperationAtUtcTime) and discard the use of firstOperationAtUtc
The steps would be:
Modify the index template to add the new field
Modify the actual index mapping to add the new field
Run your update by query by modifying the script to write your new field
In short:
# 1. Modify your index template
# 2. Modify your actual index mapping
PUT v2-documents-2021-11-20/_mapping
{
  "properties": {
    "firstOperationAtUtcTime": {
      "ignore_malformed": true,
      "type": "date"
    }
  }
}
# 3. Run the update by query again
POST v2-documents-2021-11-20/_update_by_query
{
  "script": {
    "source": "if (ctx._source.firstOperationAtUtcDate != null) { ctx._source.firstOperationAtUtcTime = ctx._source.firstOperationAtUtcDate; ctx._source.remove('firstOperationAtUtc') }",
    "lang": "painless"
  },
  "query": {
    "match": {
      "_id": "9b46d6fe78735274342d1bc539b084510000000455"
    }
  }
}
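If the steps above work as described, the original exists query, adapted to the new field name, should find the document again. A quick sanity check might look like this:

```json
GET v2-documents-2021-11-20/_search
{
  "query": {
    "exists": {
      "field": "firstOperationAtUtcTime"
    }
  }
}
```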

Elasticsearch query with fuzziness AUTO not working as expected

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may optionally be provided as AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6, which makes for lengths:
0..2: must match exactly
3..5: one edit allowed
>5: two edits allowed
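These thresholds can be sketched as a small helper function; this is an illustration of the rule above, not the actual Lucene implementation:

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Maximum edit distance Elasticsearch allows for AUTO:[low],[high]."""
    n = len(term)
    if n < low:
        return 0  # must match exactly
    if n < high:
        return 1  # one edit allowed
    return 2  # two edits allowed

# Defaults (AUTO:3,6): "hi" -> 0, "hello" -> 1, "worlds" -> 2
print(auto_fuzziness("helqo", 7, 10))  # 0: a 5-character term under AUTO:7,10 must match exactly
```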
However, when I am trying to specify low and high distance arguments in the search query the result is not what I am expecting.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
  "fuzzy_test": {
    "mappings": {
      "_doc": {
        "properties": {
          "description": {
            "type": "text"
          },
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Inserting a simple document:
{
  "id": "1",
  "description": "hello world"
}
And the following search query:
{
  "size": 10,
  "timeout": "30s",
  "query": {
    "match": {
      "description": {
        "query": "helqo",
        "fuzziness": "AUTO:7,10"
      }
    }
  }
}
I assumed that fuzziness AUTO:7,10 would mean that, for an input term of length <= 6, only documents with an exact match would be returned. However, here is the result of my query:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.23014566,
    "hits": [
      {
        "_index": "fuzzy_test",
        "_type": "_doc",
        "_id": "OQtUu2oBABnEwrgM3Ejr",
        "_score": 0.23014566,
        "_source": {
          "id": "1",
          "description": "hello world"
        }
      }
    ]
  }
}
This is strange, but it seems the bug exists only in Elasticsearch 6.6.0. I tried 6.4.2 and 6.6.2, and both of them work just fine.

How to make a proper query to select by ID and later update using Elasticsearch?

I am very new to ES and I am trying to figure out some things.
I did a basic query this way:
GET _search
{
  "query": {
    "match_all": {}
  }
}
and I got this...
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 768,
    "successful": 768,
    "failed": 0
  },
  "hits": {
    "total": 456,
    "max_score": 1,
    "hits": [
      {
        "_index": "sometype_1",
        "_type": "sometype",
        "_id": "12312321312312",
        "_score": 1,
        "_source": {
          "readModel": {
            "id": "asdfqwerzcxv",
            "status": "active",
            "hidden": false
          },
          "model": {
            "id": "asdfqwerzcxv",
            "content": {
              "objectId": "421421312312",
              "message": "hello world",
              ..... //the rest of the object...
So right now I want to get the object with id asdfqwerzcxv and I did this:
GET _search
{
  "query": {
    "match": {
      "id": "asdfqwerzcxv"
    }
  }
}
But of course it's not working... I also tried specifying the whole path, like:
GET _search
{
  "query": {
    "match": {
      "_source": {
        "readModel": {
          "id": "asdfqwerzcxv"
        }
      }
    }
  }
}
But no luck...
Is there a way to do this? Could someone help me?
Thanks
You need to use the fully-qualified field name, i.e. readModel.id instead of just id. Try this:
GET _search
{
  "query": {
    "match": {
      "readModel.id": "asdfqwerzcxv"
    }
  }
}
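The question title also asks about updating afterwards. Once a search hit gives you the document's _index, _type, and _id, a partial update can be sketched with the update API; the field and value below are hypothetical:

```json
POST sometype_1/sometype/12312321312312/_update
{
  "doc": {
    "readModel": {
      "status": "inactive"
    }
  }
}
```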

Elasticsearch wildcard case-sensitive

How to make wildcard case-insensitive?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html
Since version 7.10, the wildcard query supports the special parameter case_insensitive (boolean).
Example of case-insensitive search:
GET /_search
{
  "query": {
    "wildcard": {
      "my_field": {
        "value": "ki*y",
        "case_insensitive": true
      }
    }
  }
}
Wildcard queries are not analyzed. The behaviour depends on the analyzers you've provided for the field you're searching, but if you're using the default analyzer, a wildcard query will return case-insensitive results.
Example: post two names into a sample index, one "Sid" and the other "sid".
POST sample/sample
{
  "name": "sid"
}
POST sample/sample
{
  "name": "Sid"
}
Then perform a wildcard query:
GET sample/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "s*"
      }
    }
  }
}
This returns both documents:
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "sample",
        "_type": "sample",
        "_id": "AWRPM87Wb6oopELrnEKE",
        "_score": 1,
        "_source": {
          "name": "Sid"
        }
      },
      {
        "_index": "sample",
        "_type": "sample",
        "_id": "AWRPM9tpb6oopELrnEKF",
        "_score": 1,
        "_source": {
          "name": "sid"
        }
      }
    ]
  }
}
But if you perform a wildcard query with "S*", it will return nothing, because the default analyzer stores terms in lowercase: the term "Sid" is stored as "sid" in the inverted index.
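That behaviour can be simulated in a few lines of Python: lowercase terms at index time (as the standard analyzer does) and match the wildcard pattern verbatim at query time, since wildcard patterns are not analyzed. This is only an illustration, not the actual Lucene matching code:

```python
from fnmatch import fnmatchcase

# Index time: the standard analyzer lowercases terms before storing them.
stored_terms = [name.lower() for name in ["Sid", "sid"]]

# Query time: the wildcard pattern is matched as-is, case preserved.
def wildcard(pattern):
    return [t for t in stored_terms if fnmatchcase(t, pattern)]

print(wildcard("s*"))  # ['sid', 'sid'] -- both documents match
print(wildcard("S*"))  # [] -- the uppercase pattern matches nothing
```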
In my case this is not true; it is case-sensitive by default. I am using ES 7.2.
In your sample the type of the field is "text", not "keyword".
I was looking for the same option for the Node.js client, so I came across this question and am posting this as an answer in case it helps someone else.
I had to convert the term to lowercase, and that worked for me: *${term.toLowerCase()}*
Here is the complete function:
async searchUsers(term, from, limit) {
  // `blacklist` is assumed to be defined in the surrounding scope
  let users = await EsClient.search({
    index: 'users',
    type: 'users',
    body: {
      from,
      size: limit,
      query: {
        bool: {
          should: [
            {
              wildcard: {
                email: {
                  value: `*${term.toLowerCase()}*`
                }
              }
            },
            {
              wildcard: {
                'name.keyword': {
                  value: `*${term.toLowerCase()}*`
                }
              }
            }
          ],
          must_not: {
            terms: { _id: blacklist }
          }
        }
      }
    }
  });
  return users;
}

How to get distinct values of a field in ES?

I am trying to calculate distinct values of a field in ES. For example, if I have an index containing documents like:
{
  "NAME": "XYZ",
  "TITLE": "ABC"
}
{
  "NAME": "RTY",
  "TITLE": "BNM"
}
I want to have an output like this:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 10000,
    "max_score": 1,
    "hits": [
      {
        "_index": "record_new",
        "_type": "record_new",
        "_id": "AWChga1952qKS23vpN8J",
        "_score": 1,
        "_source": {
          "TITLE": {
            "ABC",
            "BNM"
          }
        }
      }
    ]
  }
}
How can I get the distinct values of the "TITLE" field in this format? I have tried using an aggregation, but the output is very weird. Please help.
Test Data:
PUT http://localhost:9200/stackoverflow/os/1
{
  "NAME": "XYZ",
  "TITLE": "LINUX OS"
}
PUT http://localhost:9200/stackoverflow/os/2
{
  "NAME": "XYZ",
  "TITLE": "WINDOWS SERVER"
}
First Query Attempt:
Note I have used POST here instead of GET since most REST clients do not support payload with GET.
POST http://localhost:9200/stackoverflow/_search
{
  "size": 0,
  "aggs": {
    "uniq_soft_tags": {
      "terms": {
        "field": "TITLE"
      }
    }
  }
}
If you did not provide a mapping for your data and you run the above, you will most probably end up with the error below.
Fielddata is disabled on text fields by default. Set fielddata=true on [TITLE] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory
Read more about that here.
Adding the mapping to enable Fielddata:
PUT http://localhost:9200/stackoverflow/_mapping/os/
{
  "properties": {
    "TITLE": {
      "type": "text",
      "fielddata": true
    }
  }
}
Second Query Attempt:
POST http://localhost:9200/stackoverflow/_search
{
  "size": 0,
  "aggs": {
    "uniq_soft_tags": {
      "terms": {
        "field": "TITLE"
      }
    }
  }
}
Results:
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "uniq_soft_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "linux",
          "doc_count": 1
        },
        {
          "key": "os",
          "doc_count": 1
        },
        {
          "key": "server",
          "doc_count": 1
        },
        {
          "key": "windows",
          "doc_count": 1
        }
      ]
    }
  }
}
Note that the doc_counts here are approximate.
Make sure to read the following section in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html#before-enabling-fielddata
It explains how you get that splitting behaviour.
Before enabling fielddata
Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn't make sense to do so.
A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.
UPDATE:
To prevent the splitting behaviour, you have to provide a mapping as follows. Note that with this you would not need the previous mapping where we set fielddata to true.
PUT http://localhost:9200/stackoverflow/_mapping/os/
{
  "properties": {
    "TITLE": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}
Now we can use:
the TITLE field for searches;
the TITLE.keyword field for aggregations, sorting, or in scripts.
Third Query Attempt:
POST http://localhost:9200/stackoverflow/_search
{
  "size": 0,
  "aggs": {
    "uniq_soft_tags": {
      "terms": {
        "field": "TITLE.keyword"
      }
    }
  }
}
Results:
{
  "took": 59,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "uniq_soft_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "LINUX OS",
          "doc_count": 1
        },
        {
          "key": "WINDOWS SERVER",
          "doc_count": 1
        }
      ]
    }
  }
}
