ElasticSearch source filtering array of objects - elasticsearch

Here is a document
{
"Id": "1",
"Name": "Thing",
"Prices": [
{"CompanyId": "1", "Price": "11.11"},
{"CompanyId": "2", "Price": "12.12"},
{"CompanyId": "3", "Price": "13.13"}
]
}
And here is the associated ElasticSearch schema:
"Prices" : {
"type" : "nested",
"properties" : {
"CompanyId": {
"type" : "integer"
},
"Price" : {
"type" : "scaled_float",
"scaling_factor" : 100
}
}
}
If a user is buying for CompanyId = 3, then the supplier doesn't want them to be able to see the preferential pricing for, say, CompanyId = 1.
Therefore I need to use a source filter to remove all prices for which the CompanyId is not 3.
I have found that this works.
"_source":{
"excludes": ["Prices.companyId.CompanyId"]
}
But I don't understand how or why.
It can't possibly work because the required CompanyId is not mentioned anywhere in the whole ElasticSearch search JSON.
Adding a full search JSON:
{
"query":{
"bool":{
"must":[
{
"match_all":{
}
}
],
"filter":{
"match":{
"PurchasingViews":6060
}
}
}
},
"size":20,
"aggs":{
"CompanyName.raw":{
"terms":{
"field":"CompanyName.raw",
"size":20,
"order":{
"_count":"desc"
}
}
}
},
"_source":{
"excludes":[
"PurchasingViews",
"ContractFilters",
"SearchField*",
"Keywords*",
"Menus*",
"Prices.companyId.CompanyId"
]
}
}
Result:
{
"took":224,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"skipped":0,
"failed":0
},
"hits":{
"total":1173525,
"max_score":1.0,
"hits":[
{
"_index":"products_purchasing",
"_type":"product_purchasing",
"_id":"12787114",
"_score":1.0,
"_source":{
"CompanyName":"...",
"Prices":[
{
"CompanyId":1474,
"Price":697.3
}
],
"CompanyId":571057,
"PartNumber":"...",
"LongDescription_en":"...",
"Name_en":"...",
"DescriptionSnippet_en":"...",
"ProductId":9605985,
"Id":12787114
}
}
]
},
"aggregations":{
"CompanyName.raw":{
"doc_count_error_upper_bound":84,
"sum_other_doc_count":21078,
"buckets":[
{
"key":"...",
"doc_count":534039
}
]
}
}
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
I believe the way you have defined the mapping, with Prices as a "nested" type, is what creates the path you are questioning.
Also, I would suggest framing the query as looking for CompanyId = 3 only, rather than excluding everything except 3.
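For example, a minimal sketch of that approach: a nested query on Prices with inner_hits, so only the price object for the buying company comes back, while the full Prices array is excluded from the hit's _source (field names follow the mapping above; the rest of your bool query would stay as it is):
{
  "_source": {
    "excludes": ["Prices"]
  },
  "query": {
    "nested": {
      "path": "Prices",
      "query": {
        "term": { "Prices.CompanyId": 3 }
      },
      "inner_hits": {}
    }
  }
}
Each hit then carries its normal fields in _source and only the CompanyId = 3 price inside inner_hits. Note that, framed this way, products that have no price for CompanyId = 3 are not returned at all.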

Related

Elasticsearch: When doing an "inner_hit" on nested documents, return all fields of matched offset in the hierarchy

Mapping for document:
{
"mappings": {
"properties": {
"client_classes": {
"type": "nested",
"properties": {
"members": {
"type": "nested",
"properties": {
"phone_nos": {
"type": "nested"
}
}
}
}
}
}
}
}
Data in Document:
{
"client_name":"client1",
"client_classes":[
{
"class_name":"class1",
"members":[
{
"name":"name1",
"phone_nos":[
{
"ext":"91",
"number":"99119XXXX"
},
{
"ext":"04",
"number":"99885XXXX"
}
]
},
{
"name":"name2",
"phone_nos":[
{
"ext":"03",
"number":"99887XXXX"
}
]
}
]
}
]
}
I query for "number" with value "99119XXXX"
{
"query":{
"nested":{
"path":"client_classes.members.phone_nos",
"query":{
"match":{
"client_classes.members.phone_nos.number":"99119XXXX"
}
},
"inner_hits":{}
}
}
}
Result from inner hits:
"inner_hits":{
"client_classes.members.phone_nos":{
"hits":{
"total":{
"value":1,
"relation":"eq"
},
"max_score":0.9808291,
"hits":[
{
"_index":"clients",
"_type":"_doc",
"_id":"1",
"_nested":{
"field":"client_classes",
"offset":0,
"_nested":{
"field":"members",
"offset":0,
"_nested":{
"field":"phone_nos",
"offset":0
}
}
},
"_score":0.9808291,
"_source":{
"ext":"91",
"number":"99119XXXX"
}
}
]
}
}
}
I get the desired matched hierarchy of all the nested objects in the inner hit, but I only receive the "offset" and "field" values from these objects. I need the full object at the corresponding offset.
Something like this:
{
"client_name":"client1",
"client_classes":[
{
"class_name":"class1",
"members":[
{
"name":"name1",
"phone_nos":[
{
"ext":"91",
"number":"99119XXXX"
}
]
}
]
}
]
}
I understand that with inner_hits I also get the complete root document, from which I could apply the offset values in the inner_hits object. But fetching the entire root document could be expensive for our memory, so I only need the result I have shared above.
Is there any such possibility as of now?
I am using elasticsearch 7.7
UPDATE: Added the mapping, the result, and a slight fix to the document
Yes, just add "_source": false at the top level and you'll only get the nested inner hits:
{
"_source": false, <--- add this
"query":{
"nested":{
"path":"client_classes.members.phone_nos",
"query":{
"match":{
"client_classes.members.phone_nos.number":"99119XXXX"
}
},
"inner_hits":{}
}
}
}
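If you need more control over those inner hits, inner_hits also accepts a few options of its own, for example a name and a size limit (a small sketch; the name here is arbitrary):
"inner_hits": {
  "name": "matched_phone_nos",
  "size": 3
}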

Elasticsearch query_string filter with Fields when not empty string

I'm trying to build a query_string query with the Elasticsearch DSL; in SQL style, my query looks like this:
SELECT NAME, DESCRIPTION, URL, FACEBOOK_URL, YEAR_CREATION FROM MY_INDEX WHERE FACEBOOK_URL <> '' AND (MATCH('NAME: sometext OR DESCRIPTION: sometext')) AND YEAR_CREATION > 2000
I don't know how to include a filter that requires FACEBOOK_URL to be non-empty.
Thanks for help...
@Kamal's point is exactly right: you should examine the type of your "FACEBOOK" field, which must be of keyword type, not text.
Please see the mapping, sample documents, request query, and response below.
Note that I have not added all the fields, only the ones concerned, so as to mirror the query you've added.
Mapping:
PUT facebook
{
"mappings": {
"properties": {
"name":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"description":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"facebook_url":{
"type": "keyword"
},
"year_creation":{
"type": "date"
}
}
}
}
Sample Docs:
Of the 4 documents below, only the 3rd is one you would want returned.
Docs 1 and 2 have empty values for facebook_url, while doc 4 does not have the field at all.
POST facebook/_doc/1
{
"name": "sometext",
"description": "sometext",
"facebook_url": "",
"year_creation": "2019-01-01"
}
POST facebook/_doc/2
{
"name": "sometext",
"description": "sometext",
"facebook_url": "",
"year_creation": "2019-01-01"
}
POST facebook/_doc/3
{
"name" : "sometext",
"description" : "sometext",
"facebook_url" : "http://mytest.fb.link",
"year_creation" : "2019-01-01"
}
POST facebook/_doc/4
{
"name": "sometext",
"description": "sometext",
"year_creation": "2019-01-01"
}
Request Query:
POST facebook/_search
{
"_source": ["name", "description","facebook_url","year_creation"],
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"match": {
"name": "sometext"
}
},
{
"match": {
"description": "sometext"
}
}
]
}
},
{
"exists": {
"field": "facebook_url"
}
},
{
"range": {
"year_creation": {
"gte": "2000-01-01"
}
}
}
],
"must_not": [
{
"term": {
"facebook_url": {
"value": ""
}
}
}
]
}
}
}
I think the query is self-explanatory.
I have added an exists query so that documents missing the field do not appear in the results; for empty values I've added a clause in must_not.
Notice that in my design I've used facebook_url as a keyword type, as it makes no sense to have it as text. For that reason I've used a term query on it.
Also note that for date filtering I've made use of a range query. Do go through the links for more detail, as it is important to understand how each of these queries works.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.148216,
"hits" : [
{
"_index" : "facebook",
"_type" : "_doc",
"_id" : "3",
"_score" : 2.148216,
"_source" : {
"facebook_url" : "http://mytest.fb.link",
"year_creation" : "2019-01-01",
"name" : "sometext",
"description" : "sometext"
}
}
]
}
}
Updated Answer:
Change the type of ANNEE_CREATION from integer to date, as that is the correct type for date fields.
Based on the query in your question, you have not applied the range query to that date field.
Note that the must_not clause must be applied to the keyword sub-field of FACEBOOK, not to the text field.
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":" Bordeaux",
"fields":[
"VILLE",
"ADRESSE",
"FACEBOOK"
]
}
},
{
"exists":{
"field":"FACEBOOK"
}
}
],
"must_not":[
{
"term":{
"FACEBOOK.keyword":{ <------ Make sure this is a keyword field
"value":""
}
}
}
],
"filter":[
{
"range":{
"FONDS_LEVEES_TOTAL":{
"gt":0
}
}
},
{
"range":{ <----- Apply the range query here based on what you've mentioned in question
"ANNEE_CREATION":{ <----- Make sure this is the date field
"gte": "2015" <----- Make sure you apply correct query parameter in range query
}
}
}
]
}
},
"track_total_hits":true,
"from":0,
"size":8,
"_source":[
"FACEBOOK",
"NOM",
"ANNEE_CREATION",
"FONDS_LEVEES_TOTAL"
]
}
As expected, only the document with Id 3 is returned as the result.
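Since an existing field's type cannot be changed in place, the ANNEE_CREATION change means creating a new index with the corrected mapping and reindexing into it. A minimal sketch of what that mapping could look like (the index name, the "yyyy" date format, and the types of the other fields are assumptions, not taken from the question):
PUT my_new_index
{
  "mappings": {
    "properties": {
      "NOM": { "type": "text" },
      "FACEBOOK": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "ANNEE_CREATION": { "type": "date", "format": "yyyy" },
      "FONDS_LEVEES_TOTAL": { "type": "double" }
    }
  }
}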

Elasticsearch- sum specific field in specific datetime range?

I have a requirement to find the sum of one field in a single query, restricted to a specific datetime range.
I tried to use an aggregation query to get the field's total within that datetime range, but the total value comes back as 0.
My document JSON looks like the following:
{
"_index": "flow-2018.02.01",
"_type": "doc",
"_source": {
"#timestamp": "2018-02-01T01:02:40.701Z",
"dest": {
"ip": "120.119.37.237",
"mac": "d4:6d:50:21:f8:44",
"port": 3280
},
"final": true,
"flow_id": "EQQA////DP//////FP8BAAFw5CIXZxTUbVAh+ERn/7FMeHcl7S6z0Aw",
"last_time": "2018-02-01T01:01:48.349Z",
"source": {
"ip": "100.255.177.76",
"mac": "70:e4:30:15:67:14",
"port": 45870,
"stats": {
"bytes_total": 60,
"packets_total": 1
}
},
"start_time": "2018-02-01T01:01:48.349Z",
"transport": "tcp",
"type": "flow"
},
"fields": {
"start_time": [
1517446908349
],
"#timestamp": [
1517446960701
],
"last_time": [
1517446908349
]
},
"sort": [
1517446960701
]
}
My search query :
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"_source.#timestamp":{
"gte": "2018-02-01T01:00:00.000Z",
"lte": "2018-02-01T01:05:00.000Z"
}
}
}
]
}
},
"aggs":{
"total":{
"sum":{
"field":"stats.packets_total "
}
}
}
}
Please help me to solve this issue. Thank you
You're almost there: stats.packets_total should be source.stats.packets_total (and make sure to remove the trailing space in the field name), and the range filter should target #timestamp rather than _source.#timestamp, like this:
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"#timestamp":{
"gte": "2018-02-01T01:00:00.000Z",
"lte": "2018-02-01T01:05:00.000Z"
}
}
}
]
}
},
"aggs":{
"total":{
"sum":{
"field":"source.stats.packets_total"
}
}
}
}
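As a small extension, if you also need the byte total over the same time range, a second sum aggregation can sit alongside the first in the same request (a sketch reusing the bytes_total field shown in the document above):
"aggs": {
  "total_packets": {
    "sum": { "field": "source.stats.packets_total" }
  },
  "total_bytes": {
    "sum": { "field": "source.stats.bytes_total" }
  }
}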

Elastic search, having should inside of should

I am searching through logs with wildcards on multiple fields. These wildcard queries sit inside a should clause. Besides the fields I search with wildcards, every log has an "action_id" field. I want to return only the logs that match one or more of the wildcards and also have one or more of the action_ids I want (A or B in the example).
What doesn't work:
q={
"size" : 20,
"sort" : [
{ "clsfd_id" : {"order" : "asc"}},
],
"fields":
"query":{
"filtered":{
"query":{
"match_all":{
}
},
"filter":{
"bool":{
"should":[
{
"query":{
"term":{
"id":"*12*"
}
}} #gets filled dynamically with wildcard queries
],
"should":[
{"query":{
"term":{
"action_id":"A"
}
}
},
{"query":{
"term":{
"action_id":"B"
}
}
},
]
}
}
}
}
}
Example doc:
{
"_index": "logs",
"_type": "log",
"_id": "AVdPQBuRkYFjD-WA9CiE",
"_score": null,
"_source": {
"username": "Ιδιώτης",
"user_id": null,
"action_on": "Part",
"ip": "sensitive info",
"idents": [
"sensiti",
"sensitive info"
],
"time": "2016-09-21T19:18:11.184576",
"id": 5993765,
"changes": "bla bla bla",
"action_id": "A"
}
This is Elasticsearch 1.7, by the way.
"I want to return only the logs that are matching the wildcards"
Shouldn't it then be in a must clause instead of should? Also, a terms filter can take multiple term candidates, so you don't need to create separate filters for action_id A and B. When reading the docs, note that the query DSL changed quite a lot between the 1.x and 2.x versions of Elasticsearch; filters and queries have been more or less "merged" together.
Edit: Based on the comment, this should work assuming the original query was functional (sorry too lazy to test it):
{
"size": 20,
"sort": [{"clsfd_id" : {"order" : "asc"}}],
"query":{
"filtered":{
"query":{"match_all":{}},
"filter":{
"bool":{
"should":[
{
"query":{
"term": {"id":"*12*"}
}
},
{
"query":{
"term": {"id":"*23*"}
}
}
],
"must":[
{
"query":{
"term": {"action_id": ["A", "B"]}
}
}
]
}
}
}
}
}
The more verbose option is to create a new "query => filtered => filter => should => terms" instead of the multi-term filter.
Does this work?
{
"size" : 20,
"sort" : [{
"date" : {
"order" : "asc"
}
}
],
"fields" : [],
"query" : {
"filtered" : {
"query" : {
"wildcard" : {
"id" : {
"value" : "*12*"
}
}
},
"filter" : {
"terms" : {
"action_id" : ["A", "B"]
}
}
}
} }

How to select fields after aggregation in Elastic Search 2.3

I have the following schema for an index:
PUT
"mappings": {
"event": {
"properties": {
"#timestamp": { "type": "date", "doc_values": true},
"partner_id": { "type": "integer", "doc_values": true},
"event_id": { "type": "integer", "doc_values": true},
"count": { "type": "integer", "doc_values": true, "index": "no" },
"device_id": { "type": "string", "index":"not_analyzed","doc_values":true }
"product_id": { "type": "integer", "doc_values": true},
}
}
}
I need a result equivalent to the following query:
SELECT product_id, device_id, sum(count) FROM index WHERE partner_id=5 AND timestamp<=end_date AND timestamp>=start_date GROUP BY device_id,product_id having sum(count)>1;
I am able to achieve the result with the following Elasticsearch query:
GET
{
"store": true,
"size":0,
"aggs":{
"matching_events":{
"filter":{
"bool":{
"must":[
{
"term":{
"partner_id":5
}
},
{
"range":{
"#timestamp":{
"from":1470904000,
"to":1470904999
}
}
}
]
}
},
"aggs":{
"group_by_productid": {
"terms":{
"field":"product_id"
},
"aggs":{
"group_by_device_id":{
"terms":{
"field":"device_id"
},
"aggs":{
"total_count":{
"sum":{
"field":"count"
}
},
"sales_bucket_filter":{
"bucket_selector":{
"buckets_path":{
"totalCount":"total_count"
},
"script": {"inline": "totalCount > 1"}
}
}
}
}}
}
}
}
}
}
However, for the case where count <= 1 the query returns empty buckets keyed by product_id. Out of 40 million groups only about 100k satisfy the condition, so I get back a huge result set, most of which is useless. How can I select only particular fields after aggregation? I tried this, but it does not work: `"fields": ["aggregations.matching_events.group_by_productid.group_by_device_id.buckets.key"]`
Edit:
I have the following set of data:
device_id                  Partner Id  Count
db63te2bd38672921ffw27t82  367         3
db63te2bd38672921ffw27t82  272         1
I got this output:
{
"took":6,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":7,
"max_score":0.0,
"hits":[
]
},
"aggregations":{
"matching_events":{
"doc_count":5,
"group_by_productid":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":367,
"doc_count":3,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":"db63te2bd38672921ffw27t82",
"doc_count":3,
"total_count":{
"value":3.0
}
}
]
}
},
{
"key":272,
"doc_count":1,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
]
}
}
]
}
}
}
}
As you can see, the bucket with key 272 is empty, which makes sense, but shouldn't this bucket be removed from the result set altogether?
I've just found out that there is a fairly recent issue and PR that adds a _bucket_count path to the buckets_path option, so that an aggregation can filter a parent bucket based on the number of buckets another aggregation produced. In other words, if _bucket_count is 0 for a parent bucket_selector, the bucket should be removed.
This is the github issue: https://github.com/elastic/elasticsearch/issues/19553
