Elasticsearch: Query the most recent document that doesn't contain the field 'X'

I have the following search query:
{
"query": {
"match": {
"name": "testlib"
}
}
}
When I run this query I get the three results below. I now want to return only one result: the document with the newest #timestamp that doesn't contain version_pre. In this case, that is AV6qvDXDyHw9vNh6Wlpl.
[
{
"_index": "testsoftware",
"_type": "software",
"_id": "AV6qvDXDyHw9vNh6Wlpl",
"_score": 0.2876821,
"_source": {
"#timestamp": "2017-09-21T11:02:15-04:00",
"name": "testlib",
"version_major": 1,
"version_minor": 0,
"version_patch": 1
}
},
{
"_index": "testsoftware",
"_type": "software",
"_id": "AV6qvDF5MtcMTuGknsVs",
"_score": 0.18232156,
"_source": {
"#timestamp": "2017-09-20T17:21:35-04:00",
"name": "testlib",
"version_major": 1,
"version_minor": 0,
"version_patch": 0
}
},
{
"_index": "testsoftware",
"_type": "software",
"_id": "AV6qvDnVyHw9vNh6Wlpn",
"_score": 0.18232156,
"_source": {
"#timestamp": "2017-09-22T13:56:55-04:00",
"name": "testlib",
"version_major": 1,
"version_minor": 0,
"version_patch": 2,
"version_pre": 0
}
}
]

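Use sort (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html) together with a must_not exists query (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-exists-query.html). Sort in descending order so the newest document comes first, and keep the match on name so that only testlib documents are considered:
{
"size": 1,
"sort": [{ "#timestamp": { "order": "desc" } }],
"query": {
"bool": {
"must": {
"match": {
"name": "testlib"
}
},
"must_not": {
"exists": {
"field": "version_pre"
}
}
}
}
}
Or even, via query string (the # in the field name has to be URL-encoded as %23, and newer Elasticsearch versions may no longer accept _missing_, in which case NOT _exists_:version_pre expresses the same condition):
/_search?sort=%23timestamp:desc&size=1&q=_missing_:version_pre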
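As a rough sketch of the intended behaviour, the Python below builds a request body that keeps the name match, excludes documents that have version_pre, and sorts newest first. It then mimics that selection locally on the three sample documents. The helper names (build_query, pick_newest_without) are made up for this illustration, and the local filter is only a stand-in for what Elasticsearch does server-side.

```python
from datetime import datetime

def build_query(name, missing_field, ts_field="#timestamp"):
    """Request body: match on name, exclude docs having missing_field, newest first."""
    return {
        "size": 1,
        "sort": [{ts_field: {"order": "desc"}}],
        "query": {
            "bool": {
                "must": {"match": {"name": name}},
                "must_not": {"exists": {"field": missing_field}},
            }
        },
    }

# The three sample documents from the question (only the relevant fields).
docs = [
    {"_id": "AV6qvDXDyHw9vNh6Wlpl", "#timestamp": "2017-09-21T11:02:15-04:00", "name": "testlib"},
    {"_id": "AV6qvDF5MtcMTuGknsVs", "#timestamp": "2017-09-20T17:21:35-04:00", "name": "testlib"},
    {"_id": "AV6qvDnVyHw9vNh6Wlpn", "#timestamp": "2017-09-22T13:56:55-04:00", "name": "testlib",
     "version_pre": 0},
]

def pick_newest_without(docs, name, missing_field, ts_field="#timestamp"):
    """Local stand-in for the query: filter, then take the latest timestamp."""
    candidates = [d for d in docs if d.get("name") == name and missing_field not in d]
    return max(candidates, key=lambda d: datetime.fromisoformat(d[ts_field]), default=None)

body = build_query("testlib", "version_pre")
newest = pick_newest_without(docs, "testlib", "version_pre")
print(newest["_id"])
```

The dictionary returned by build_query can be serialized with json.dumps and sent as the body of a _search request.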

Related

Return Multi-Term Distinct Values

Within an Elasticsearch index I am trying to query by two distinct top-level fields, companyName and productName, ordered by a generatedDate field, and to include the domainModelId field in the results.
The following SQL query shows all existing values; I've highlighted the two unique document rows (in this case) by generatedDate:
{
"query": "SELECT companyName, productName, generatedDate FROM nextware_domain_metaservices_domainmodel ORDER BY generatedDate DESC"
}
The response is as follows:
I tried the following:
{
"size":0,
"aggs":
{
"companies":
{
"terms":
{
"field": "companyName.keyword"
},
"aggs":
{
"products":
{
"terms":
{
"field": "productName.keyword"
}
}
}
}
}
}
This returns the correct buckets, as follows:
"aggregations": {
"companies": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "NextWare",
"doc_count": 18,
"products": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ProductPortal",
"doc_count": 16
},
{
"key": "Domain",
"doc_count": 2
}
]
}
}
]
}
}
How can I include the value of the domainModelId.Id field without a second query?
To include the value of domainModelId.Id, use a top_hits aggregation nested under the products terms aggregation.
Here is a working example with index data, search query, and search result.
Index Data:
{
"companyName":"NextWare",
"productName":"Domain",
"domainModelId.Id":"i"
}
{
"companyName":"NextWare",
"productName":"Domain",
"domainModelId.Id":"c"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"a"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"b"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"d"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"e"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"f"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"g"
}
{
"companyName":"NextWare",
"productName":"ProductPortal",
"domainModelId.Id":"h"
}
Search Query:
{
"size": 0,
"aggs": {
"companies": {
"terms": {
"field": "companyName.keyword"
},
"aggs": {
"products": {
"terms": {
"field": "productName.keyword"
},
"aggs": {
"top_ids": {
"top_hits": {
"_source": {
"includes": [
"domainModelId.Id"
]
},
"size": 10
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"companies": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "NextWare",
"doc_count": 9,
"products": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "ProductPortal",
"doc_count": 7,
"top_ids": {
"hits": {
"total": {
"value": 7,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "67049816",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"domainModelId.Id": "a"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"domainModelId.Id": "b"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "4",
"_score": 1.0,
"_source": {
"domainModelId.Id": "d"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "5",
"_score": 1.0,
"_source": {
"domainModelId.Id": "e"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "6",
"_score": 1.0,
"_source": {
"domainModelId.Id": "f"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "7",
"_score": 1.0,
"_source": {
"domainModelId.Id": "g"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "8",
"_score": 1.0,
"_source": {
"domainModelId.Id": "h"
}
}
]
}
}
},
{
"key": "Domain",
"doc_count": 2,
"top_ids": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "67049816",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"domainModelId.Id": "c"
}
},
{
"_index": "67049816",
"_type": "_doc",
"_id": "9",
"_score": 1.0,
"_source": {
"domainModelId.Id": "i"
}
}
]
}
}
}
]
}
}
]
}
}
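To use the ids on the client side, the nested buckets have to be walked. The sketch below is one way to flatten a response of the shape shown above into a mapping from (company, product) to the list of ids. The function name ids_per_product is hypothetical, the sample is a trimmed copy of the response, and it assumes the id is stored under the literal key "domainModelId.Id" as in the example index data.

```python
def ids_per_product(aggregations, id_field="domainModelId.Id"):
    """Map (company, product) -> list of id values found in the top_hits buckets."""
    result = {}
    for company in aggregations["companies"]["buckets"]:
        for product in company["products"]["buckets"]:
            hits = product["top_ids"]["hits"]["hits"]
            result[(company["key"], product["key"])] = [
                h["_source"][id_field] for h in hits
            ]
    return result

# Trimmed-down copy of the response above (two hits kept per bucket).
sample = {
    "companies": {"buckets": [{
        "key": "NextWare",
        "products": {"buckets": [
            {"key": "ProductPortal", "top_ids": {"hits": {"hits": [
                {"_source": {"domainModelId.Id": "a"}},
                {"_source": {"domainModelId.Id": "b"}},
            ]}}},
            {"key": "Domain", "top_ids": {"hits": {"hits": [
                {"_source": {"domainModelId.Id": "c"}},
                {"_source": {"domainModelId.Id": "i"}},
            ]}}},
        ]},
    }]}
}

ids = ids_per_product(sample)
print(ids)
```

If only the most recent document per company and product is wanted, top_hits also accepts a sort and a size, so adding a descending sort on generatedDate with size 1 should return just that document in each bucket.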

Find documents which are older than a specified date range based on a list of ids

After searching for some time and reading several answers, I still can't figure out the query for my requirement.
I have a list of document ids, and I need to find which of those documents are older than a specified range.
The scenario I am trying:
There are 10 documents in total, with ids ranging from 1 to 10.
I want to fetch documents 1, 2 and 3 only if they are more than 7 days old.
If only documents 1 and 2 are more than 7 days old, the query should return documents 1 and 2 and ignore document 3. Other documents that are more than 7 days old but are not in the id list (1, 2, 3) must not be returned, since I am passing the ids in the query.
Documents in the index:
{
"took": 391,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "user",
"_type": "test",
"_id": "1",
"_score": 1.0,
"_source": {
"title": "string1",
"publishedDate": "2020-11-13T19:11:13.654Z"
}
},
{
"_index": "user",
"_type": "test",
"_id": "2",
"_score": 1.0,
"_source": {
"title": "string2",
"publishedDate": "2020-08-13T19:11:13.654Z"
}
},
{
"_index": "user",
"_type": "test",
"_id": "3",
"_score": 1.0,
"_source": {
"title": "string3",
"publishedDate": "2020-11-09T19:11:13.654Z"
}
},
{
"_index": "user",
"_type": "test",
"_id": "4",
"_score": 1.0,
"_source": {
"title": "string4",
"publishedDate": "2020-11-02T19:11:13.654Z"
}
}
]
}
}
Below is the query I am trying:
{
"query": {
"bool" : {
"must" : [
{"term" : {"_id" : {"value" : "1"}}},
{"term" : {"_id" : {"value" : "2"}}},
{"term" : {"_id" : {"value" : "3"}}}
],
"filter" : [
{"range" : {"publishedDate" : {"from" : "now-7d","to" : "now",
"include_lower" : true,"include_upper" : true,"boost" : 1.0
}
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
}
Ideally it should return documents 1 and 2, as only those two match, but the above query doesn't return any results.
I think I am doing something wrong in the query.
Can someone please help me with this?
Thanks in advance.
Your query returns nothing because every clause in must has to match, and no document can have the _id 1, 2 and 3 at the same time; use a single terms query to match any of the ids instead. Also note that the range from now-7d to now selects documents published within the last 7 days, so with your sample data it should return only document 1, as document 2 is older than 7 days.
Here is a working example with search query and search result.
Search Query:
{
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
"1",
"2",
"3"
]
}
}
],
"filter": [
{
"range": {
"publishedDate": {
"from": "now-7d",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1.0
}
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "64906019",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"title": "string1",
"publishedDate": "2020-11-13T19:11:13.654Z"
}
}
]
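Update 1:
Your original query also works if you replace the must clause with a should clause. Because the bool query also contains a filter clause, set minimum_should_match to 1; otherwise the should clauses are optional and every document in the date range is returned.
{
"query": {
"bool": {
"should": [
{
"term": {
"_id": "1"
}
},
{
"term": {
"_id": "2"
}
},
{
"term": {
"_id": "3"
}
}
],
"minimum_should_match": 1,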
"filter": [
{
"range": {
"publishedDate": {
"from": "now-7d",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1.0
}
}
}
]
}
}
}
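The question asks for documents that are older than seven days, while the range from now-7d to now selects those published within the last seven days. As a hedged sketch of the "older than" variant, the Python below builds a body that restricts the search to the given ids with a terms query and uses lt with now-7d for the date. The helper names are hypothetical, and the local check uses an assumed fixed "now" so that the result is repeatable.

```python
from datetime import datetime, timedelta, timezone

def older_than_query(ids, days, date_field="publishedDate"):
    """Body selecting only the given ids whose date is more than `days` days old."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"_id": ids}},  # any of the ids (OR), not all of them
                    {"range": {date_field: {"lt": f"now-{days}d"}}},
                ]
            }
        }
    }

# Sample documents from the question: _id -> publishedDate.
published = {
    "1": "2020-11-13T19:11:13.654Z",
    "2": "2020-08-13T19:11:13.654Z",
    "3": "2020-11-09T19:11:13.654Z",
    "4": "2020-11-02T19:11:13.654Z",
}

def parse(ts):
    # Swap the trailing "Z" for an explicit offset so older Pythons can parse it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def local_older_than(ids, days, now):
    """Local stand-in for the query, with an explicit `now` for repeatability."""
    cutoff = now - timedelta(days=days)
    return sorted(i for i in ids if i in published and parse(published[i]) < cutoff)

now = datetime(2020, 11, 18, tzinfo=timezone.utc)  # assumed "current" date
body = older_than_query(["1", "2", "3"], 7)
older = local_older_than(["1", "2", "3"], 7, now)
print(older)
```

With the assumed date, documents 2 and 3 are older than seven days; document 4 is also older but is left out because its id is not in the list.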

How to make a flattened sub-field in a nested field in Elasticsearch?

Here, I have an indexed document like:
doc = {
"id": 1,
"content": [
{
"txt": "I",
"time": 0
},
{
"txt": "have",
"time": 1
},
{
"txt": "a book",
"time": 2
},
{
"txt": "do not match this block",
"time": 3
}
]
}
I want to match "I have a book" and return the matched times: 0, 1, 2. Does anyone know how to build the index and the query for this situation?
I think "content.txt" should be flattened but "content.time" should be nested?
You want to match "I have a book" and return the matched times: 0, 1, 2.
Here is a working example with index mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"content": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"nested": {
"path": "content",
"query": {
"bool": {
"must": [
{
"match": {
"content.txt": "I have a book"
}
}
]
}
},
"inner_hits": {}
}
}
}
Search Result:
"inner_hits": {
"content": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 2.5226097,
"hits": [
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 2
},
"_score": 2.5226097,
"_source": {
"txt": "a book",
"time": 2
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 0
},
"_score": 1.5580825,
"_source": {
"txt": "I",
"time": 0
}
},
{
"_index": "64752029",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "content",
"offset": 1
},
"_score": 1.5580825,
"_source": {
"txt": "have",
"time": 1
}
}
]
}
}
}
}
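The inner hits come back ordered by score, not by position in the content array, so the times have to be re-ordered on the client. A minimal sketch, assuming a hit shaped like the result above (matched_times is a hypothetical helper, not part of any client library):

```python
def matched_times(hit, path="content", field="time"):
    """Collect `field` from a hit's inner_hits for `path`, in original array order."""
    inner = hit["inner_hits"][path]["hits"]["hits"]
    # Inner hits are sorted by score; _nested.offset restores the array order.
    ordered = sorted(inner, key=lambda h: h["_nested"]["offset"])
    return [h["_source"][field] for h in ordered]

# Trimmed copy of the search result above.
hit = {"inner_hits": {"content": {"hits": {"hits": [
    {"_nested": {"field": "content", "offset": 2}, "_score": 2.5226097,
     "_source": {"txt": "a book", "time": 2}},
    {"_nested": {"field": "content", "offset": 0}, "_score": 1.5580825,
     "_source": {"txt": "I", "time": 0}},
    {"_nested": {"field": "content", "offset": 1}, "_score": 1.5580825,
     "_source": {"txt": "have", "time": 1}},
]}}}}

times = matched_times(hit)
print(times)
```

Two caveats: a match query matches any nested block that shares at least one token with the search text, so it does not check that the blocks form the phrase in order, and inner_hits returns only three hits by default, so size needs to be raised for longer phrases.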

Boolean similarity - is there a way to remove duplicates

Given the following index
PUT /test_index
{
"mappings": {
"properties": {
"field1": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
},
"field2": {
"type": "text",
"analyzer": "whitespace",
"similarity": "boolean"
}
}
}
}
and the following data
POST /test_index/_bulk?refresh=true
{ "index" : {} }
{ "field1": "foo", "field2": "bar"}
{ "index" : {} }
{ "field1": "foo1 foo2", "field2": "bar1 bar2"}
{ "index" : {} }
{ "field1": "foo1 foo2 foo3", "field2": "bar1 bar2 bar3"}
for the given Boolean similarity query
POST /test_index/_search
{
"size": 10,
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy":{
"field1":{
"value":"foo",
"fuzziness":"AUTO",
"boost": 1
}
}
},
{
"fuzzy":{
"field2":{
"value":"bar",
"fuzziness":"AUTO",
"boost": 1
}
}
}
]
}
}
}
}
}
I always get ["foo1 foo2 foo3", "bar1 bar2 bar3"] as the top hit, even though there is an exact match in the index (the first document):
{
"took": 114,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 3.9999998,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "bXw8eXUBCTtfNv84bNPr",
"_score": 3.9999998,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "bHw8eXUBCTtfNv84bNPr",
"_score": 2.6666665,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "a3w8eXUBCTtfNv84bNPr",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
}
]
}
}
I'm aware that Boolean similarity works this way, matching as many terms as possible, and I know I could rescore here, but that is not an option since I don't know how many top N results to fetch.
Are there any other options? Maybe I could create my own similarity plugin based on Boolean similarity that removes duplicates and keeps only the best-matching token, but I don't know where to start; I only see samples for script and rescore.
Update: based on the clarification in the comments on my earlier answer, I am updating the answer.
The query below returns the expected results. In addition to the fuzzy clause, it adds a term clause that matches only the exact term and gives it a higher boost (1.5), so that exact matches are ranked first:
{
"min_score": 0.4,
"size": 10,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": {
"field1": {
"value": "foo",
"boost": 1.5
}
}
}
]
}
}
}
}
}
And search results
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 2.0,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
Another query, without an explicit boost on the exact term, also returns the expected results. Note that the term clause has no boost here:
{
"min_score": 0.4,
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"fuzzy": {
"field1": {
"value": "foo",
"fuzziness": "AUTO",
"boost": 0.5
}
}
},
{
"term": {
"field1": {
"value": "foo"
}
}
}
]
}
}
}
}
}
And search result
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "zdMEvHUBlo4-1mHbtvNH",
"_score": 1.5,
"_source": {
"field1": "foo",
"field2": "bar"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "z9MEvHUBlo4-1mHbtvNH",
"_score": 0.99999994,
"_source": {
"field1": "foo1 foo2 foo3",
"field2": "bar1 bar2 bar3"
}
},
{
"_index": "test_index",
"_type": "_doc",
"_id": "ztMEvHUBlo4-1mHbtvNH",
"_score": 0.6666666,
"_source": {
"field1": "foo1 foo2",
"field2": "bar1 bar2"
}
}
]
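The scores in this thread can be reproduced by hand, which shows why near-duplicate tokens outrank the exact match in the original query. The sketch below assumes that, under boolean similarity, each matching term contributes its boost, and that a fuzzy match is scaled by 1 - edit_distance / min(len(query), len(term)). That scaling is inferred from the scores shown here, not taken from the original answer, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_weight(query, term, max_edits=1):
    """Assumed scaling of a fuzzy match; 0.0 when the term is too far away."""
    d = edit_distance(query, term)
    if d > max_edits:
        return 0.0
    return 1.0 - d / min(len(query), len(term))

def score(text, query, fuzzy_boost, term_boost=None):
    """Sum per-term contributions for one whitespace-analyzed field."""
    tokens = text.split()
    total = sum(fuzzy_boost * fuzzy_weight(query, t) for t in tokens)
    if term_boost is not None and query in tokens:
        total += term_boost  # exact term clause
    return total

for text in ["foo", "foo1 foo2", "foo1 foo2 foo3"]:
    print(text, round(score(text, "foo", 0.5, 1.5), 4))
```

Under these assumptions each near-duplicate adds about 0.33 with a fuzzy boost of 0.5, so a field with more than six such tokens would again overtake the exact match scored at 2.0. The boosts may need tuning if fields can contain that many close variants.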

Elasticsearch query starting from a particular value

Is there a way to query starting from a particular value and get the next n records in Elasticsearch?
For example, I want to get 10 records starting from employee id "ABC_123".
The below query gives an error saying
[terms] query does not support [empId]
GET /_search
{
"from": 0, "size": 10,
"query" : {
"terms" : {
"empId" : "ABC_123"
}
}
}
What can I do about this?
You can use the prefix query. You can also read more about autocomplete on my blog, which discusses four approaches and their trade-offs.
I used a prefix query on your sample data and got the expected output; below is a step-by-step guide.
Index mapping (empId is mapped as a keyword field)
{
"mappings": {
"properties": {
"empId": {
"type": "keyword"
}
}
}
}
Index sample docs
{
"empId" : "ABC_1231"
}
{
"empId" : "ABC_1232"
}
{
"empId" : "ABC_1233"
}
{
"empId" : "ABC_1234"
}
and so on
Prefix Search query
{
"from": 0,
"size": 10,
"query": {
"prefix": {
"empId": "ABC_123"
}
}
}
Search result
"hits": [
{
"_index": "so_prefix",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"empId": "ABC_1231"
}
},
{
"_index": "so_prefix",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"empId": "ABC_1232"
}
},
{
"_index": "so_prefix",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"empId": "ABC_1233"
}
},
{
"_index": "so_prefix",
"_type": "_doc",
"_id": "4",
"_score": 1.0,
"_source": {
"empId": "ABC_1234"
}
}
]
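The error in the question arises because the terms query expects an array of values rather than a single string, and it only does exact matching anyway. For the prefix approach, a small sketch of a request builder follows. The function names are hypothetical, the sort on the field is an addition of this sketch (to keep paging order stable) rather than part of the answer above, and the local filter is only a stand-in for the server-side query.

```python
def prefix_page(field, prefix, offset=0, size=10):
    """Body for one page of documents whose `field` starts with `prefix`."""
    return {
        "from": offset,
        "size": size,
        "sort": [{field: {"order": "asc"}}],  # assumption: stable order for paging
        "query": {"prefix": {field: prefix}},
    }

# Sample ids from the answer, plus one made-up non-matching value.
sample_ids = ["ABC_1233", "ABC_1231", "XYZ_0001", "ABC_1234", "ABC_1232"]

def local_prefix(values, prefix, offset=0, size=10):
    """Local stand-in: sorted values starting with prefix, then one page of them."""
    matching = sorted(v for v in values if v.startswith(prefix))
    return matching[offset:offset + size]

body = prefix_page("empId", "ABC_123")
page = local_prefix(sample_ids, "ABC_123")
print(page)
```

Sorting on empId works here because the field is mapped as keyword; for the next page, the same body can be rebuilt with a larger offset.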

Resources