Range in elasticsearch not really working

I run this query:
curl -X GET "localhost:9200/mydocs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool" : { "must" : [{"wildcard": {"guid": "14744*"}}, {"range": {"availability.start": {"lt": "now"}}}] }
}
}
'
I then get this response:
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "14744",
"_score" : 2.0,
"_source" : {
"guid" : "14744",
"availability" : {
"start" : "2021-03-28T22:00:00.000Z",
"end" : "2021-12-31T22:59:00.000Z"
},
"title" : "Some title"
}
}
]
What I actually want is results where today is in the range for the availability's start and end.
The above result says the document is available between
2021-03-28T22:00:00.000Z
and
2021-12-31T22:59:00.000Z
Today is 2021-04-15T15:00:00.000Z
So, what I should do is add:
{"range": {"availability.end": {"gt": "now"}}}
Isn't that correct? But when I run:
curl -X GET "localhost:9200/mydocs/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool" : { "must" : [{"wildcard": {"guid": "14744*"}}, {"range": {"availability.start": {"lt": "now"}}}, {"range": {"availability.end": {"gt": "now"}}}] }
}
}
'
I got an empty hits list.
Partial mapping:
{
  mappings: {
    _doc: {
      properties: {
        availability: {
          properties: {
            end: {
              type: "keyword"
            },
            start: {
              type: "keyword"
            }
          }
        },
        guid: {
          type: "keyword"
        }
      }
    }
  }
}

Your query is perfectly correct! Good job with that!
The problem is that the availability.* fields are defined as keyword.
They MUST be of type date in order for range queries on date values to deliver accurate results; otherwise the range queries just perform a lexical (i.e. string) comparison of "now" against the date values expressed as strings. That is exactly what you are seeing: lexically, "2021-03-28T22:00:00.000Z" sorts before "now" (digits sort before letters), so the first query matched, but "2021-12-31T22:59:00.000Z" does not sort after "now", so adding the second range matched nothing:
availability: {
properties: {
end: {
type: "date" <--- change this
},
start: {
type: "date" <--- and this
}
}
},
You can't change the mapping of existing fields, but you can always create new fields. So, you can change your mapping to create new date sub-fields for both start and end, like this:
PUT mydocs/_mapping
{
"properties": {
"availability": {
"properties": {
"end": {
"type": "keyword",
"fields": {
"date": {
"type": "date"
}
}
},
"start": {
"type": "keyword",
"fields": {
"date": {
"type": "date"
}
}
}
}
}
}
}
Then you simply need to run the following command in order to update your index:
POST mydocs/_update_by_query
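If you are running the commands with curl like the rest of your calls, that would be (assuming the same localhost:9200 endpoint; calling _update_by_query with no body simply reindexes the existing documents in place so the new date sub-fields get populated):
curl -X POST "localhost:9200/mydocs/_update_by_query?pretty"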
And then modify your query to use the new sub-fields and that will work:
POST mydocs/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"guid": "14744*"
}
},
{
"range": {
"availability.start.date": {
"lt": "now"
}
}
},
{
"range": {
"availability.end.date": {
"gt": "now"
}
}
}
]
}
}
}

Related

Place an Analyzer on a specific array item in a nested object

I have the following mapping
"mappings":{
"properties":{
"name": {
"type": "text"
},
"age": {
"type": "integer"
},
"customProps":{
"type" : "nested",
"properties": {
"key":{
"type": "keyword"
},
"value": {
"type" : "keyword"
}
}
}
}
}
Example data:
{
"name" : "person1",
"age" : 10,
"customProps":[
{"hairColor":"blue"},
{"height":"120"}
]
},
{
"name" : "person2",
"age" : 30,
"customProps":[
{"jobTitle" : "software engineer"},
{"salaryAccount" : "AvGhj90AAb"}
]
}
So I want to be able to search for documents by salary account, case-insensitively. I am also searching using a wildcard.
Example query:
{
  "query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "customProps",
            "query": {
              "bool": {
                "must": [
                  { "match": { "customProps.key": "salaryAccount" } },
                  { "wildcard": { "customProps.value": "*AvG*" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
I tried adding an analyzer with PUT using the following syntax:
{
"settings":{
"index":{
"analysis":{
"analyzer":{
"analyzer_case_insensitive" : {
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
}
},
"mappings":{
"people":{
"properties":{
"customProps":{
"properties":{
"value":{
"type": "keyword",
"analyzer": "analyzer_case_insensitive"
}
}
}
}
}
}
}
I'm getting the following error:
"type" : "mapper_parsing_exception",
"reason" : "Root mapping definition has unsupported parameters: [people: {properties={customProps={properties={value={analyzer=analyzer_case_insensitive, type=keyword}}}}}]"
Any idea how to apply an analyzer to the salaryAccount object in the array when it exists?
Your use case is quite clear: you want to search on the value of salaryAccount only when this key exists in the customProps array.
There are some issues with your mapping definition:
You cannot define a custom analyzer for a keyword type field; instead you can use a normalizer.
Based on the mapping definition you added at the beginning of the question, it seems that you are using Elasticsearch version 7.x. But in the second mapping definition that you provided, you have added a mapping type (i.e. people), which is deprecated in 7.x.
There is no need to add the key and value fields in the index mapping.
Adding a working example with index mapping, search query, and search result
Index Mapping:
PUT myidx
{
"mappings": {
"properties": {
"customProps": {
"type": "nested"
}
}
}
}
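To reproduce this, you can index the two example documents from the question as-is (the document ids here are arbitrary):
PUT myidx/_doc/1
{
  "name": "person1",
  "age": 10,
  "customProps": [
    { "hairColor": "blue" },
    { "height": "120" }
  ]
}
PUT myidx/_doc/2
{
  "name": "person2",
  "age": 30,
  "customProps": [
    { "jobTitle": "software engineer" },
    { "salaryAccount": "AvGhj90AAb" }
  ]
}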
Search Query:
You need to use an exists query to check whether a field exists or not. The case_insensitive param in the wildcard query is available since Elasticsearch version 7.10; if you are using a version below that, you need to use a normalizer to achieve case-insensitive matching (see the sketch after the search result below).
POST myidx/_search
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "customProps",
"query": {
"bool": {
"must": [
{
"exists": {
"field": "customProps.salaryAccount"
}
},
{
"wildcard": {
"customProps.salaryAccount.keyword": {
"value": "*aVg*",
"case_insensitive": true
}
}
}
]
}
}
}
}
]
}
}
}
Search Result:
"hits" : [
{
"_index" : "myidx",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.0,
"_source" : {
"name" : "person2",
"age" : 30,
"customProps" : [
{
"jobTitle" : "software engineer"
},
{
"salaryAccount" : "AvGhj90AAb"
}
]
}
}
]
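For completeness, if you are on a version below 7.10, a minimal sketch of the normalizer-based alternative mentioned above could look like this (the normalizer name and the explicit salaryAccount mapping are illustrative, and the wildcard pattern is lowercased to match the normalized values):
PUT myidx
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "customProps": {
        "type": "nested",
        "properties": {
          "salaryAccount": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
POST myidx/_search
{
  "query": {
    "nested": {
      "path": "customProps",
      "query": {
        "wildcard": {
          "customProps.salaryAccount": {
            "value": "*avg*"
          }
        }
      }
    }
  }
}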

ElasticSearch: Querying if today is between a series of start and end dates in a list

If I have some data with a field containing multiple sets of start/end dates, for example:
{
id: 1,
title: "My Title",
availability: [{start: 01-01-2020, end: 01-05-2020}, {start: 02-01-2020, end: 02-22-2020}]
}
Is it possible in elasticsearch to build a query to check if today (or any given date) falls within any of the start/end date combinations in the list?
Or would I need to structure my data differently to make this work?
Previously, I was dealing with just one start and one end date and could store them as their own fields and do a gte, lte combination to check.
Update:
If I add them as nested fields, e.g.:
"avails" : {
"type" : "nested",
"properties" : {
"availStart" : { "type" : "date" },
"availEnd" : { "type" : "date" }
}
}
If I do my search like this:
{
"query": {
"nested" : {
"path" : "avails",
"query" : {
"term" : {
{ "range" : {"avails.start" : {"lte": "now"}}},
{ "range" : {"avails.end" : {"gt" : "now"}}}
}
}
}
}
}
will it evaluate this for each nested record and return any parent record with a child record that matches?
It's good that you've chosen nested fields. Now you just need to make sure the mappings, field names, and the query are all consistent.
The date mapping including the format:
PUT myindex
{
"mappings": {
"properties": {
"avails": {
"type": "nested",
"properties": {
"start": { "type": "date", "format": "MM-dd-yyyy" },
"end": { "type": "date", "format": "MM-dd-yyyy" }
}
}
}
}
}
Syncing your doc
POST myindex/_doc
{
"id": 1,
"title": "My Title",
"avails": [
{
"start":"01-01-2020",
"end": "01-05-2020"
},
{
"start": "02-01-2020",
"end": "02-22-2020"
}
]
}
And finally the query. Yours was malformed -- if you want a logical AND, you'll need to wrap the range queries in a bool + must:
POST myindex/_search
{
"query": {
"nested": {
"path": "avails",
"query": {
"bool": {
"must": [
{ "range" : {"avails.start" : {"lte": "now"}}},
{ "range" : {"avails.end" : {"gt" : "02-01-2020"}}}
]
}
}
}
}
}
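As for your question: yes, the nested query is evaluated against each nested record, and the parent document is returned if at least one child matches. If you also want to see which availability window matched, you can add inner_hits to the nested query (a minimal sketch based on the query above):
POST myindex/_search
{
  "query": {
    "nested": {
      "path": "avails",
      "inner_hits": {},
      "query": {
        "bool": {
          "must": [
            { "range" : {"avails.start" : {"lte": "now"}}},
            { "range" : {"avails.end" : {"gt" : "02-01-2020"}}}
          ]
        }
      }
    }
  }
}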

How to query an elasticsearch index with nested and non-nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name is like '%warb%' AND email is like '%gmail.com%' AND the test with id 587 has a score > 5, etc. The high level of what is needed can be put something like below; I don't know what the actual query would be, apologies for this messy query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can pass in dynamic test ids and their respective score filters, along with the fields outside the nested field, like age, name, and status.
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You should make use of the Ngram Tokenizer, as wildcard search should be avoided for performance reasons and I wouldn't recommend using it.
Change your mapping to the one below, where you can create your own analyzer, which I've done.
The way Elasticsearch (or rather Lucene) indexes a statement is: first it breaks the statement or paragraph into words or tokens, then it indexes these words in the inverted index for that particular field. This process is called analysis, and it only applies to the text datatype.
So you only get documents back if the tokens you search for are available in the inverted index.
By default, the standard analyzer would be applied. What I've done is create my own analyzer using the Ngram Tokenizer, which creates many more tokens than just whole words.
The default analyzer on "Life is beautiful" would produce life, is, beautiful.
However, using ngrams, the tokens for "Life" would be lif, ife & life.
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
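If you want to verify what the custom analyzer produces, the _analyze API is a quick sanity check (this assumes the index above has been created). For the sample name "Schwarb" you should see the 3- and 4-character grams, including war, which is why the match query below finds the document:
POST student_detail/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Schwarb"
}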
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of term queries on the keyword fields, while for normal searches (LIKE in SQL) I would make use of simple match queries on the text fields, provided they use the Ngram Tokenizer; a small exact-match example follows the next note.
Also note that for >= and <= you would need to make use of Range Query.
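For example, an exact match on the keyword sibling field would be a plain term query (a small sketch using the name from the sample document):
POST student_detail/_search
{
  "query": {
    "term": {
      "name.keyword": "Schwarb"
    }
  }
}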
Response to the main query above:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document you've mentioned in your question shows up in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Join of reverse_nested aggregations in Elasticsearch

Please help me find a mechanism to aggregate over the following domain, or confirm that one doesn't exist in the current API.
curl -XDELETE 127.0.0.1:9200/test_index
curl -XPUT 127.0.0.1:9200/test_index -d '{
"mappings": {
"contact": {
"properties": {
"facebook_profile": {
"type": "nested",
"properties": {
"education": {
"type": "string"
},
"year": {
"type": "integer"
}
}
},
"google_profile": {
"type": "nested",
"properties": {
"education": {
"type": "string"
},
"year": {
"type": "integer"
}
}
}
}
}
}
}'
curl -XPUT 127.0.0.1:9200/test_index/contact/contact1 -d '{
"google_profile": {
"education": "stanford", "year": 1990
}
}'
curl -XPUT 127.0.0.1:9200/test_index/contact/contact2 -d '
{
"facebook_profile": {
"education": "stanford", "year": 1990
}
}'
How can one query ES to find statistics about how many contacts graduated from particular universities?
I found one possibility, but it doesn't give me the desired result, since it can't answer the question above with respect to contacts, only with respect to their particular profiles (nested docs):
curl -XPOST '127.0.0.1:9200/test_index/_search?search_type=count&pretty=true' -d '{
"aggs": {
"facebook_educations": {
"aggs": {
"field": {
"terms": {
"field": "contact.facebook_profile.education"
},
"aggs": {
"reverse": {
"reverse_nested": {
}
}
}
}
},
"nested": {
"path": "contact.facebook_profile"
}
},
"google_educations": {
"aggs": {
"field": {
"terms": {
"field": "contact.google_profile.education"
},
"aggs": {
"reverse": {
"reverse_nested": {
}
}
}
}
},
"nested": {
"path": "contact.google_profile"
}
}
}
}'
Which gives me:
"aggregations" : {
"facebook_educations" : {
"doc_count" : 1,
"field" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "stanford",
"doc_count" : 1,
"reverse" : {
"doc_count" : 1
}
} ]
}
},
"google_educations" : {
"doc_count" : 1,
"field" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "stanford",
"doc_count" : 1,
"reverse" : {
"doc_count" : 1
}
} ]
}
}
}
But here I can't be sure whether the one contact found in each bucket is the same or a different (parent) document, so I can't answer my initial question.
Thank you for any advice.
It sounds like you are trying to aggregate on multiple fields. This is not directly supported in Elasticsearch, but there are ways to work around this and get the results you are looking for.
Have a look at the discussion on Github, and also in the documentation.
If I'm understanding correctly, whether "stanford" appears in facebook_profile.education or google_profile.education, you would like the contact to be counted only once in the aggregation.
You should be able to do this in one of two ways:
Use a script to concatenate the values stored in the fields:
{
"aggs": {
"by_education": {
"terms": {
"script": "doc['contact.facebook_profile.education'].values + doc['contact.google_profile.education'].values"
}
}
}
}
You can create a new dedicated field at index time which contains the values from both fields, using the copy_to option, then aggregate on that single field. For example, you could copy the contents of both fields to a new field called education_combined.
{
"mappings":{
"contact":{
"properties":{
"facebook_profile":{
"type":"nested",
"properties":{
"education":{
"type":"string",
"copy_to":"education_combined"
},
"year":{
"type":"integer"
}
}
},
"google_profile":{
"type":"nested",
"properties":{
"education":{
"type":"string",
"copy_to":"education_combined"
},
"year":{
"type":"integer"
}
}
},
"education_combined":{
"type":"string"
}
}
}
}
}
Then, simply aggregate on education_combined:
{
"aggs": {
"by_education": {
"terms": { "field": "education_combined" }
}
}
}
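In the same curl style as the question, that aggregation request would be (assuming the copy_to mapping above is in place and the documents have been reindexed):
curl -XPOST '127.0.0.1:9200/test_index/_search?search_type=count&pretty=true' -d '{
  "aggs": {
    "by_education": {
      "terms": { "field": "education_combined" }
    }
  }
}'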

Using a custom_score to sort by a nested child's timestamp

I'm pretty new to elasticsearch and have been banging my head trying to get this sorting to work. The general idea is to search email message threads with nested messages and nested participants. The goal is to display search results at the thread level, sorting by the participant who is doing the search and either the last_received_at or last_sent_at column depending on which mailbox they are in.
My understanding is that you can't sort by a single child's value among many nested children. So in order to do this I saw a couple of suggestions for using a custom_score query with a script, then sorting on the score. My plan is to dynamically change the sort column and then run a nested custom_score query that returns the date of one of the participants as the score. I've been noticing some issues with the score format being strange (e.g. it always has 4 zeros at the end), and it may not be returning the date I was expecting.
Below are simplified versions of the index and the query in question. If anyone has any suggestions, I'd be very grateful. (FYI - I am using elasticsearch version 0.20.6.)
Index:
mappings: {
message_thread: {
properties: {
id: {
type: long
}
subject: {
dynamic: true
properties: {
id: {
type: long
}
name: {
type: string
}
}
}
participants: {
dynamic: true
properties: {
id: {
type: long
}
name: {
type: string
}
last_sent_at: {
format: dateOptionalTime
type: date
}
last_received_at: {
format: dateOptionalTime
type: date
}
}
}
messages: {
dynamic: true
properties: {
sender: {
dynamic: true
properties: {
id: {
type: long
}
}
}
id: {
type: long
}
body: {
type: string
}
created_at: {
format: dateOptionalTime
type: date
}
recipient: {
dynamic: true
properties: {
id: {
type: long
}
}
}
}
}
version: {
type: long
}
}
}
}
Query:
{
"query": {
"bool": {
"must": [
{
"term": { "participants.id": 3785 }
},
{
"custom_score": {
"query": {
"filtered": {
"query": { "match_all": {} },
"filter": {
"term": { "participants.id": 3785 }
}
}
},
"params": { "sort_column": "participants.last_received_at" },
"script": "doc[sort_column].value"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": { "messages.recipient.id": 3785 }
}
]
}
},
"sort": [ "_score" ]
}
Solution:
Thanks to @imotov, here is the final result. The participants were not properly nested in the index (while the messages didn't need to be). In addition, include_in_root was used for the participants to simplify the query (participants are small records and not a real size issue, although @imotov also provided an example without it). He then restructured the JSON request to use a dis_max query.
curl -XDELETE "localhost:9200/test-idx"
curl -XPUT "localhost:9200/test-idx" -d '{
"mappings": {
"message_thread": {
"properties": {
"id": {
"type": "long"
},
"messages": {
"properties": {
"body": {
"type": "string",
"analyzer": "standard"
},
"created_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"id": {
"type": "long"
},
"recipient": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
}
}
},
"sender": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
}
}
}
}
},
"messages_count": {
"type": "long"
},
"participants": {
"type": "nested",
"include_in_root": true,
"properties": {
"id": {
"type": "long"
},
"last_received_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"last_sent_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"name": {
"type": "string",
"analyzer": "standard"
}
}
},
"subject": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string"
}
}
}
}
}
}
}'
curl -XPUT "localhost:9200/test-idx/message_thread/1" -d '{
"id" : 1,
"subject" : {"name": "Test Thread"},
"participants" : [
{"id" : 87793, "name" : "John Smith", "last_received_at" : null, "last_sent_at" : "2010-10-27T17:26:58Z"},
{"id" : 3785, "name" : "David Jones", "last_received_at" : "2010-10-27T17:26:58Z", "last_sent_at" : null}
],
"messages" : [{
"id" : 1,
"body" : "This is a test.",
"sender" : { "id" : 87793 },
"recipient" : { "id" : 3785},
"created_at" : "2010-10-27T17:26:58Z"
}]
}'
curl -XPUT "localhost:9200/test-idx/message_thread/2" -d '{
"id" : 2,
"subject" : {"name": "Elastic"},
"participants" : [
{"id" : 57834, "name" : "Paul Johnson", "last_received_at" : "2010-11-25T17:26:58Z", "last_sent_at" : "2010-10-25T17:26:58Z"},
{"id" : 3785, "name" : "David Jones", "last_received_at" : "2010-10-25T17:26:58Z", "last_sent_at" : "2010-11-25T17:26:58Z"}
],
"messages" : [{
"id" : 2,
"body" : "More testing of elasticsearch.",
"sender" : { "id" : 57834 },
"recipient" : { "id" : 3785},
"created_at" : "2010-10-25T17:26:58Z"
},{
"id" : 3,
"body" : "Reply message.",
"sender" : { "id" : 3785 },
"recipient" : { "id" : 57834},
"created_at" : "2010-11-25T17:26:58Z"
}]
}'
curl -XPOST localhost:9200/test-idx/_refresh
echo
# Using include in root
curl "localhost:9200/test-idx/message_thread/_search?pretty=true" -d '{
"query": {
"filtered": {
"query": {
"nested": {
"path": "participants",
"score_mode": "max",
"query": {
"custom_score": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"participants.id": 3785
}
}
}
},
"params": {
"sort_column": "participants.last_received_at"
},
"script": "doc[sort_column].value"
}
}
}
},
"filter": {
"query": {
"multi_match": {
"query": "test",
"fields": ["subject.name", "participants.name", "messages.body"],
"operator": "and",
"use_dis_max": true
}
}
}
}
},
"sort": ["_score"],
"fields": []
}
'
# Not using include in root
curl "localhost:9200/test-idx/message_thread/_search?pretty=true" -d '{
"query": {
"filtered": {
"query": {
"nested": {
"path": "participants",
"score_mode": "max",
"query": {
"custom_score": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"participants.id": 3785
}
}
}
},
"params": {
"sort_column": "participants.last_received_at"
},
"script": "doc[sort_column].value"
}
}
}
},
"filter": {
"query": {
"bool": {
"should": [{
"match": {
"subject.name":"test"
}
}, {
"nested" : {
"path": "participants",
"query": {
"match": {
"name":"test"
}
}
}
}, {
"match": {
"messages.body":"test"
}
}
]
}
}
}
}
},
"sort": ["_score"],
"fields": []
}
'
There are a couple of issues here. You are asking about nested objects, but participants are not defined in your mapping as nested objects. The second possible issue is that the score has type float, so it might not have enough precision to represent a timestamp as-is. If you can figure out how to fit this value into a float, you can take a look at this example: Elastic search - tagging strength (nested/child document boosting). However, if you are developing a new system, it might be prudent to upgrade to 0.90.0.Beta1, which supports sorting on nested fields; a sketch of that kind of sort is shown below.
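For reference, once participants are mapped as nested and you are on 0.90 or later, the nested-field sort mentioned above would look roughly like this (a sketch using the nested_path / nested_filter sort options; the participant id is taken from the question, and the query is reduced to match_all for brevity):
{
  "query": { "match_all": {} },
  "sort": [
    {
      "participants.last_received_at": {
        "order": "desc",
        "nested_path": "participants",
        "nested_filter": {
          "term": { "participants.id": 3785 }
        }
      }
    }
  ]
}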
