Elasticsearch - Find documents missing two fields

I'm trying to write a query that returns how many documents have no data for two fields (date.new and date.old). I tried the query below, but it behaves like OR logic: documents missing either date.new or date.old are counted. Does anyone know how I can make this return only the documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}

Aggregations are not the right feature for this. Use the exists query wrapped in a bool/must_not query, like this (the _count API takes the condition under "query" and does not accept "size"):
GET index/_count
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "date.new"
          }
        },
        {
          "exists": {
            "field": "date.old"
          }
        }
      ]
    }
  }
}

In a search response, hits.total.value is the number of documents that match the request, and hits.total.relation indicates whether that value is exact (eq) or a lower bound (gte).
Index Data:
{
"date": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}
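To make the must_not semantics concrete, here is a minimal Python sketch that builds the same request body and mirrors its logic on the four sample documents locally. It does not talk to a cluster; the helper names (has_field, missing_all) are hypothetical, and it assumes the first sample document's top-level key is date.

```python
# Request body for the _count (or _search) API: a document matches only
# when NONE of the must_not clauses match, i.e. both fields are absent.
count_body = {
    "query": {
        "bool": {
            "must_not": [
                {"exists": {"field": "date.new"}},
                {"exists": {"field": "date.old"}},
            ]
        }
    }
}


def has_field(doc, dotted_path):
    """Return True if the dotted path resolves to a non-null value."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current is not None


def missing_all(doc, fields):
    """Local stand-in for bool/must_not over several exists clauses."""
    return not any(has_field(doc, field) for field in fields)


# The four sample documents (first one assumed to use the key "date").
sample_docs = [
    {"date": {"new": 1501, "old": 10}},
    {"title": "elasticsearch"},
    {"title": "elasticsearch-query"},
    {"date": {"new": 1400}},
]

fields = [clause["exists"]["field"]
          for clause in count_body["query"]["bool"]["must_not"]]
missing_both = [doc for doc in sample_docs if missing_all(doc, fields)]
print(len(missing_both))
```

The document that has only date.new is excluded, because one of the two exists clauses matches it.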

Related

Remove results with same id from Elasticsearch search result

Let's assume we have a search result with 3 documents. Two of them share a key attribute (product-ID or similar).
Is it possible to remove duplicates from the search result by using Elasticsearch, so that only 2 documents would be returned in that case? I don't want to implement this in application logic as I would still like to use pagination, aggregation, etc. It does not matter which of the two documents with the same id is removed.
Thanks,
Philipp
Edit:
This would be the example in Elasticsearch:
PUT /tmp_pd_articles
{
"mappings": {
"properties": {
"name": { "type": "text" },
"articleNumber": { "type": "keyword" }
}
}
}
PUT /tmp_pd_articles/_doc/1
{
"name": "My Book 1",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/2
{
"name": "My Book 1 (with some other title)",
"articleNumber": "A9781"
}
PUT /tmp_pd_articles/_doc/3
{
"name": "My Book 2",
"articleNumber": "A9782"
}
GET /tmp_pd_articles/_search
{
"query": { "match_all": {} }
}
The goal is to write a query that returns only two articles instead of all three:
#1 ("A9781", "My Book 1") OR #2 ("A9781", "My Book 1 (with some other title)") AND
#3 ("A9782", "My Book 2")
This reduction should be applied because #1 and #2 share the same articleNumber "A9781". I wonder whether there is an Elasticsearch query that accomplishes this.
Yes, it's possible using the top_hits aggregation. Use the query below to filter the data. Note that I tested it on your mapping and sample data, and it returns the expected result.
{
"size": 0, --> returns only aggregate data, if you want to include all 3 documents remove this size param.
"aggs": {
"dedup": {
"terms": {
"field": "articleNumber"
},
"aggs": {
"dedup_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
And Search result
"aggregations": {
"dedup": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "A9781",
"doc_count": 2,
"dedup_docs": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "My Book 1",
"articleNumber": "A9781"
}
}
]
}
}
},
{
"key": "A9782",
"doc_count": 1,
"dedup_docs": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "tmp_pd_articles",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"name": "My Book 2",
"articleNumber": "A9782"
}
}
]
}
}
}
]
}
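On the client side, the deduplicated articles are the single top hit of each bucket. A minimal Python sketch of that extraction, run against a trimmed copy of the response above (the function name first_hit_per_bucket is hypothetical):

```python
def first_hit_per_bucket(response, agg_name="dedup", top_hits_name="dedup_docs"):
    """Return the _source of the single top hit in every terms bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket[top_hits_name]["hits"]["hits"][0]["_source"]
            for bucket in buckets]


# Trimmed copy of the aggregation response shown above.
response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "A9781",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"hits": [
                        {"_id": "1",
                         "_source": {"name": "My Book 1",
                                     "articleNumber": "A9781"}}
                    ]}},
                },
                {
                    "key": "A9782",
                    "doc_count": 1,
                    "dedup_docs": {"hits": {"hits": [
                        {"_id": "3",
                         "_source": {"name": "My Book 2",
                                     "articleNumber": "A9782"}}
                    ]}},
                },
            ]
        }
    }
}

articles = first_hit_per_bucket(response)
print([article["name"] for article in articles])
```

Two caveats: a terms aggregation returns only 10 buckets by default, so more distinct article numbers need a larger "size" on the terms aggregation, and buckets cannot be paged with from/size the way hits can. If pagination matters, field collapsing (the collapse search parameter) may be worth evaluating as an alternative.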

How to select buckets of aggregation results based on top hit document attribute?

I ran the following Elasticsearch query and got the response shown below. Now I want to select the buckets based on the top hit document field "source".
POST /data/_search?size=0
{
"aggs":{
"by_partyIds":{
"terms":{
"field":"id.keyword"
},
"aggs":{
"oldest_record":{
"top_hits":{
"sort":[
{
"createdate.keyword":{
"order":"asc"
}
}
],
"_source":[
"source"
],
"size":1
}
}
}
}
}
}
Response :
{
"aggregations": {
"by_partyIds": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "1",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "DcagSm4B9WnM0Ke-MgGk",
"_score": null,
"_source": {
"source": "US"
},
"sort": [
"20-09-18 05:45:26.000000000AM"
]
}
]
}
}
},
{
"key": "2",
"doc_count": 3,
"oldest_record": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "data",
"_type": "osr",
"_id": "7caiSm4B9WnM0Ke-HwGx",
"_score": null,
"_source": {
"source": "UK"
},
"sort": [
"22-09-18 05:45:26.000000000AM"
]
}
]
}
}
}
]
}
}
}
Now I want to get only the buckets whose top hit has US as its source (and count them). Can we write a query for that?
I tried the bucket_selector aggregation, a parent pipeline aggregation that executes a script to determine whether the current bucket is retained in the parent multi-bucket aggregation. However, the specified metric must be numeric and the script must return a boolean value. If the script language is expression, a numeric return value is permitted; in that case 0.0 evaluates to false and all other values evaluate to true.
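Because the pipeline aggregation only sees numeric metrics, it cannot inspect the string field of a top hit. One fallback, sketched here in Python under the assumption that filtering in client code is acceptable, is to select the buckets from the response after the fact. The function name buckets_with_source is hypothetical, and the response is a trimmed copy of the one above.

```python
def buckets_with_source(response, wanted_source,
                        agg_name="by_partyIds", top_hits_name="oldest_record"):
    """Keep only the buckets whose top hit has the wanted 'source' value."""
    selected = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[top_hits_name]["hits"]["hits"]
        if hits and hits[0]["_source"].get("source") == wanted_source:
            selected.append(bucket)
    return selected


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "by_partyIds": {
            "buckets": [
                {"key": "1", "doc_count": 3,
                 "oldest_record": {"hits": {"hits": [
                     {"_id": "DcagSm4B9WnM0Ke-MgGk",
                      "_source": {"source": "US"}}]}}},
                {"key": "2", "doc_count": 3,
                 "oldest_record": {"hits": {"hits": [
                     {"_id": "7caiSm4B9WnM0Ke-HwGx",
                      "_source": {"source": "UK"}}]}}},
            ]
        }
    }
}

us_buckets = buckets_with_source(response, "US")
print(len(us_buckets), [bucket["key"] for bucket in us_buckets])
```

This only sees the buckets the terms aggregation actually returned, so the terms "size" has to be large enough to cover every id of interest.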

ElasticSearch : How can I boost score depending on field value?

I am trying to get rid of sorting in elasticsearch by boosting the _score based on field value. Here is my scenario:
I have a field in my document: applicationDate. This is the time elapsed since the epoch. I want records with a greater (more recent) applicationDate to have a higher score.
If the scores of two documents are the same, I want to order them by another field of type String. Say "status" is another field that can have the values Available, In Progress, or Closed. Documents with the same applicationDate should then get a _score based on status.
Available should score highest, In Progress lower, and Closed lowest. That way I won't have to sort the documents after getting the results.
Please give me some pointers.
You should be able to achieve this using Function Score.
Depending on your requirements, it could be as simple as the following.
Example:
PUT test/test/1
{
"applicationDate" : "2015-12-02",
"status" : "available"
}
PUT test/test/2
{
"applicationDate" : "2015-12-02",
"status" : "progress"
}
PUT test/test/3
{
"applicationDate" : "2016-03-02",
"status" : "progress"
}
POST test/_search
{
"query": {
"function_score": {
"functions": [
{
"field_value_factor" : {
"field" : "applicationDate",
"factor" : 0.001
}
},
{
"filter": {
"term": {
"status": "available"
}
},
"weight": 360
},
{
"filter": {
"term": {
"status": "progress"
}
},
"weight": 180
}
],
"boost_mode": "multiply",
"score_mode": "sum"
}
}
}
**Results:**
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 1456877060,
"_source": {
"applicationDate": "2016-03-02",
"status": "progress"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1449014780,
"_source": {
"applicationDate": "2015-12-02",
"status": "available"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 1449014660,
"_source": {
"applicationDate": "2015-12-02",
"status": "progress"
}
}
]
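The ranking above can be reproduced approximately by hand. The Python sketch below assumes that the date field is read as epoch milliseconds in UTC, that the implicit match_all query scores 1.0, and that a status with no matching filter contributes nothing. With "score_mode": "sum" the function values are added, and with "boost_mode": "multiply" the sum is multiplied by the query score. The helper names are hypothetical.

```python
from datetime import datetime, timezone

FACTOR = 0.001                                   # field_value_factor.factor
STATUS_WEIGHTS = {"available": 360, "progress": 180}


def epoch_millis(date_string):
    """Milliseconds since the epoch for a yyyy-MM-dd date, assumed UTC."""
    day = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp() * 1000)


def approximate_score(doc, query_score=1.0):
    """score_mode=sum over the functions, then boost_mode=multiply."""
    date_part = FACTOR * epoch_millis(doc["applicationDate"])
    status_part = STATUS_WEIGHTS.get(doc["status"], 0)
    return query_score * (date_part + status_part)


docs = {
    "1": {"applicationDate": "2015-12-02", "status": "available"},
    "2": {"applicationDate": "2015-12-02", "status": "progress"},
    "3": {"applicationDate": "2016-03-02", "status": "progress"},
}

ranking = sorted(docs, key=lambda doc_id: approximate_score(docs[doc_id]),
                 reverse=True)
for doc_id in ranking:
    print(doc_id, round(approximate_score(docs[doc_id])))
```

The values differ slightly from the _score figures in the response, most likely because Elasticsearch computes and stores scores as 32-bit floats, but the order is the same. Note that the status weights only break ties as long as they stay smaller than the scaled gap between two dates: with a factor of 0.001, one second of date difference already adds 1 to the score.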
Have you looked at function scores?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
Specifically look at decay functions in the above documentation.
There is also a newer field type called rank_feature that can be useful for this use case:
https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-feature.html

Specifying total size of results to return for ElasticSearch query when using inner_hits

Elasticsearch lets inner_hits specify 'from' and 'size' parameters, just as the outer request body of a search does.
As an example, assume my index contains 25 books, each having fewer than 50 chapters. The snippet below would return all chapters across all books, because a 'size' of 100 books covers all 25 books and a 'size' of 50 chapters covers all of the "fewer than 50" chapters:
"index": 'books',
"type": 'book',
"body": {
"from" : 0, "size" : 100, // outer hits, or books
"query": {
"filtered": {
"filter": {
"nested": {
"inner_hits": {
"size": 50 // inner hits, or chapters
},
"path": "chapter",
"query": { "match_all": { } },
}
}
}
},
.
.
.
Now, I'd like to implement paging with a scenario like this. My question is, how?
In this case, do I have to return the above maximum of 100 * 50 = 5000 documents from the search query and implement paging at the application level by displaying only the slice I am interested in? Or is there a way to specify the total number of hits to return in the search query itself, independent of the inner/outer size?
I consume the response as follows, and would like this data to be paginated:
response.hits.hits.forEach(function(book) {
chapters = book.inner_hits.chapters.hits.hits;
chapters.forEach(function(chapter) {
// ... this is one displayed result ...
});
});
I don't think this is possible with Elasticsearch and nested fields. The way you read the results is correct: ES paginates and returns books, and it doesn't look inside the nested inner_hits when doing so. You need to handle the pagination manually in your code.
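A minimal Python sketch of such manual pagination, assuming the response has the shape used in the question (inner hits under the name chapters) and that flattening all returned chapters in memory is acceptable. The function name page_of_chapters and the tiny response are made up for illustration.

```python
def page_of_chapters(response, page, page_size, inner_hits_name="chapters"):
    """Flatten book -> chapter inner hits, then return one zero-based page."""
    chapters = [
        chapter
        for book in response["hits"]["hits"]
        for chapter in book["inner_hits"][inner_hits_name]["hits"]["hits"]
    ]
    start = page * page_size
    return chapters[start:start + page_size]


# Made-up response: two books with three and two chapters.
response = {
    "hits": {"hits": [
        {"_id": "book1", "inner_hits": {"chapters": {"hits": {"hits": [
            {"_id": "c1"}, {"_id": "c2"}, {"_id": "c3"}]}}}},
        {"_id": "book2", "inner_hits": {"chapters": {"hits": {"hits": [
            {"_id": "c4"}, {"_id": "c5"}]}}}},
    ]}
}

second_page = page_of_chapters(response, page=1, page_size=2)
print([chapter["_id"] for chapter in second_page])
```

The cost is that every request still fetches all books and chapters up to the configured sizes, which is the overhead the question was hoping to avoid.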
There is another option, but it requires a parent/child relationship instead of nested documents.
You can then query the children (the chapters) and paginate those results directly, using inner_hits to return the parent (the book itself) alongside each chapter.
PUT /library
{
"mappings": {
"book": {
"properties": {
"name": {
"type": "string"
}
}
},
"chapter": {
"_parent": {
"type": "book"
},
"properties": {
"title": {
"type": "string"
}
}
}
}
}
The query:
GET /library/chapter/_search
{
"size": 5,
"query": {
"has_parent": {
"type": "book",
"query": {
"match_all": {}
},
"inner_hits" : {}
}
}
}
And a sample output (trimmed):
"hits": [
{
"_index": "library",
"_type": "chapter",
"_id": "1",
"_score": 1,
"_source": {
"title": "chap1"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
},
{
"_index": "library",
"_type": "chapter",
"_id": "2",
"_score": 1,
"_source": {
"title": "chap2"
},
"inner_hits": {
"book": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "library",
"_type": "book",
"_id": "book1",
"_score": 1,
"_source": {
"name": "book1"
}
}
]
}
}
}
}
The search API accepts certain standard parameters, listed in the docs at: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference-2-0.html#api-search-2-0
According to the doc:
size Number — Number of hits to return (default: 10)
Which would make your request something like:
"size": 5000,
"index": 'books',
"type": 'book',
"body": {

Elastic Search- Fetch Distinct Tags

I have documents of the following format:
{
_id :"1",
tags:["guava","apple","mango", "banana", "gulmohar"]
}
{
_id:"2",
tags: ["orange","guava", "mango shakes", "apple pie", "grammar"]
}
{
_id:"3",
tags: ["apple","grapes", "water", "gulmohar","water-melon", "green"]
}
Now I want to fetch the unique tag values, across the tags field of all documents, that start with the prefix g, so that they can be displayed by a tag suggester (the Stack Overflow site is an example).
For example, whenever the user types 'g':
"guava", "gulmohar", "grammar", "grapes" and "green" should be returned as a result.
i.e. the query should return the distinct tags with prefix g*.
I looked everywhere, browsed the whole documentation, and searched the ES forum, but didn't find any clue, much to my dismay.
I tried aggregations, but they return distinct counts for every word/token in the tags field, not the unique list of tags starting with 'g'.
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"allow_leading_wildcard": false,
"fields": [
"tags"
],
"query": "g*",
"fuzziness":0
}
}
]
}
},
"filter": {
//some condition on other field...
}
}
},
"aggs": {
"distinct_tags": {
"terms": {
"field": "tags",
"size": 10
}
}
},
result of above: guava(w), apple(q), mango(1),...
Can someone please suggest the correct way to fetch all the distinct tags with the prefix input_prefix*?
It's a bit of a hack, but this seems to accomplish what you want.
I created an index and added your docs:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"tags":["guava","apple","mango", "banana", "gulmohar"]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"tags": ["orange","guava", "mango shakes", "apple pie", "grammar"]}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"tags": ["guava","apple","grapes", "water", "grammar","gulmohar","water-melon", "green"]}
Then I used a combination of prefix query and highlighting as follows:
POST /test_index/_search
{
"query": {
"prefix": {
"tags": {
"value": "g"
}
}
},
"fields": [ ],
"highlight": {
"pre_tags": [""],
"post_tags": [""],
"fields": {
"tags": {}
}
}
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"highlight": {
"tags": [
"guava",
"gulmohar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grammar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grapes",
"grammar",
"gulmohar",
"green"
]
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/c14675ee8bd3934389a6cb0c85ff57621a17bf11
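The highlight response still repeats tags across hits, so the distinct list has to be assembled on the client. A minimal Python sketch, run against a trimmed copy of the response above (the function name distinct_highlighted_tags is hypothetical):

```python
def distinct_highlighted_tags(response, field="tags"):
    """Collect highlighted tags across all hits, keeping first-seen order."""
    seen = set()
    ordered = []
    for hit in response["hits"]["hits"]:
        for tag in hit.get("highlight", {}).get(field, []):
            if tag not in seen:
                seen.add(tag)
                ordered.append(tag)
    return ordered


# Trimmed copy of the search response shown above.
response = {
    "hits": {"hits": [
        {"_id": "1", "highlight": {"tags": ["guava", "gulmohar"]}},
        {"_id": "2", "highlight": {"tags": ["guava", "grammar"]}},
        {"_id": "3", "highlight": {"tags": ["guava", "grapes", "grammar",
                                            "gulmohar", "green"]}},
    ]}
}

print(distinct_highlighted_tags(response))
```

Only the hits returned by the search are covered, so the request "size" bounds how many documents contribute tags.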
What you're trying to do amounts to autocomplete, of course, and there are perhaps better ways of going about that than what I posted above (though they are a bit more involved). Here are a couple of blog posts we did about ways to set up autocomplete:
http://blog.qbox.io/quick-and-dirty-autocomplete-with-elasticsearch-completion-suggest
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
Following @Sloan Ahrens's advice, I did the following:
Updated the mapping:
"tags": {
"type": "completion",
"context": {
"filter_color": {
"type": "category",
"default": "",
"path": "fruits.color"
},
"filter_type": {
"type": "category",
"default": "",
"path": "fruits.type"
}
}
}
Reference: ES API Guide
Indexed these documents:
{
_id: "1",
tags: {"input": ["guava","apple","mango", "banana", "gulmohar"]},
fruits: {color: 'bar', type: 'alice'}
}
{
_id: "2",
tags: {"input": ["orange","guava", "mango shakes", "apple pie", "grammar"]},
fruits: {color: 'foo', type: 'bob'}
}
{
_id: "3",
tags: {"input": ["apple","grapes", "water", "gulmohar","water-melon", "green"]},
fruits: {color: 'foo', type: 'alice'}
}
I didn't need to modify my original documents much; I just added input before the tags array.
POST rescu1/_suggest?pretty
{
"suggest": {
"text": "g",
"completion": {
"field": "tags",
"size": 10,
"context": {
"filter_color": "bar",
"filter_type": "alice"
}
}
}
}
gave me the desired output.
I accepted @Sloan Ahrens's answer, as his suggestions worked like a charm for me and pointed me in the right direction.

Resources