Elasticsearch: Get distinct field values from multi_match

My query with multiple multi_match clauses looks as follows:
"query": {
"bool": {
"should" : [
{"multi_match" : {
"query": "test",
"fields": ["field1^15", "field2^8"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}},
{"multi_match" : {
"query": "test2",
"fields": ["field1^15", "field2^8"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}
}
]
}
}
I want to get all distinct field1 values which match the query. How can I achieve that?
EDIT:
Mapping:
"field1": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "nGram_analyzer"
}
This is what I tried so far (I still get multiple identical field1 values):
"query": {
"bool": {
"should" : [
{"multi_match" : {
"query": "test",
"fields": ["field1^15", "field2^8"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}},
{"multi_match" : {
"query": "test2",
"fields": ["field1^15", "field2^8"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}
}
]
}
},
"aggs": {
"field1": {
"terms": {
"field": "field1.keyword",
"size": 100 //1
}
}
}
UPDATE:
The query
GET /test/test/_search
{
"_source": ["field1"],
"size": 10000,
"query": {
"multi_match" : {
"query": "test",
"fields": ["field1^15", "field2^8"],
"tie_breaker": 0.2,
"minimum_should_match": "50%"
}
},
"aggs": {
"field1": {
"terms": {
"field": "field1.keyword",
"size": 1
}
}
}
}
results in
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 35,
"max_score": 110.26815,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "AVzz99c4X4ZbfhscNES7",
"_score": 110.26815,
"_source": {
"field1": "test-hier"
}
},
{
"_index": "test",
"_type": "test",
"_id": "AVzz8JWGX4ZbfhscMwe_",
"_score": 107.45808,
"_source": {
"field1": "test-hier"
}
},
{
"_index": "test",
"_type": "test",
"_id": "AVzz8JWGX4ZbfhscMwe_",
"_score": 107.45808,
"_source": {
"field1": "test-da"
}
},
...
So actually there should only be one "test-hier".

You can add a terms aggregation on the field1.keyword field and you'll get all distinct values (you can change size to any other value that better matches the cardinality of your field). Note that the distinct values appear in the aggregations section of the response, not in hits; the duplicates you saw in your update are ordinary search hits, which is why size is set to 0 here:
{
"size": 0,
"query": {...},
"aggs": {
"field1": {
"terms": {
"field": "field1.keyword",
"size": 100
},
"aggs": {
"single_hit": {
"top_hits": {
"size": 1
}
}
}
}
}
}
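With this, you read the distinct values from the buckets of the field1 aggregation instead of from hits. The response then looks roughly like the sketch below (the bucket keys come from your update; the counts are illustrative):
"aggregations": {
  "field1": {
    "buckets": [
      { "key": "test-hier", "doc_count": 2, "single_hit": { "hits": { ... } } },
      { "key": "test-da", "doc_count": 1, "single_hit": { "hits": { ... } } }
    ]
  }
}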

Related

Elasticsearch returns NullPointerException during inner_hits query

I have an index which stores nested documents. I want to see these nested documents, so I used 'inner_hits' in the request, but Elasticsearch returns a NullPointerException. Has anyone encountered this problem?
Request to Elasticsearch using Postman:
GET http://localhost/my-index/_search
{
"query": {
"nested": {
"path": "address_object",
"query": {
"bool": {
"must": {
"term": {"address_object.city": "Paris"}
}
}
},
"inner_hits" : {}
}
}
}
Response with status code 200:
{
"took": 161,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 1,
"skipped": 0,
"failed": 1,
"failures": [
{
"shard": 0,
"index": "my-index",
"node": "DWdD83KaTmUiodENQkGDww",
"reason": {
"type": "null_pointer_exception",
"reason": null
}
}
]
},
"hits": {
"total": 6500039,
"max_score": 2.1761138,
"hits": []
}
}
Elasticsearch version: 6.2.4
Lucene version: 7.2.1
Update:
Mapping:
{
"my-index": {
"mappings": {
"mytype": {
"dynamic": "false",
"_source": {
"enabled": false
},
"properties": {
"adverts_count": {
"type": "integer",
"store": true
},
...
"address_object": {
"type": "nested",
"properties": {
"adverts_count": {
"type": "integer",
"store": true
},
"city": {
"type": "keyword",
"store": true
}
}
},
...
Sample document:
{
"_index": "my-index",
"_type": "mytype",
"_id": "XDWrGncBdwNBWGEagAM2",
"_score": 2.1587489,
"fields": {
"is_target_page_shown": [
0
],
"updated_at": [
1612264276
],
"is_shown": [
0
],
"nb_queries": [
1
],
"search_query": [
"phone"
],
"target_category": [
15
],
"adverts_count": [
1
]
}
}
Extra information:
If I remove the "inner_hits": {} from search request, elastic returns nested documents(_index, _type, _id, _score), but ain't other fields(e.g city)
Also, as suggested in the comments, I tried setting to true ignore_unmapped, but it doesn't helped. The same nullPointerException.
I tried reproducing your issue, but since you have not provided proper sample documents (the one you provided doesn't have the address_object properties), I used your mapping and the sample documents below.
PUT index-name/_doc/1
{
"address_object" :{
"adverts_count" : 1,
"city": "paris"
}
}
PUT index-name/_doc/2
{
"address_object" :{
"adverts_count" : 1,
"city": "blr"
}
}
And when I use the same search you provided:
POST 71907588/_search
{
"query": {
"nested": {
"path": "address_object",
"query": {
"bool": {
"must": {
"term": {
"address_object.city": "paris"
}
}
}
},
"inner_hits": {}
}
}
}
I get a proper response, matching paris as the city, as shown in the search response below. One difference worth checking: your mapping disables _source ("enabled": false), and inner_hits relies on the document source to build its response, so that may be what triggers the NullPointerException in your environment.
"hits": [
{
"_index": "71907588",
"_id": "1",
"_score": 0.6931471,
"_source": {
"address_object": {
"adverts_count": 1,
"city": "paris"
}
},
"inner_hits": {
"address_object": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.6931471,
"hits": [
{
"_index": "71907588",
"_id": "1",
"_nested": {
"field": "address_object",
"offset": 0
},
"_score": 0.6931471,
"_source": {
"city": "paris",
"adverts_count": 1
}
}
]
}
}
}
}
]

How can I fetch only inner fields from _source in Elasticsearch?

I have an index structure like this:
{
"id" : 42,
"Person" : {
"contracts" : [
{
"contractID" : "000000000000102"
}
],
"Ids" : [
3,
387,
100,
500,
274,
283,
328,
400,
600
]
},
"dateUpdate" : "2020-12-07T13:15:00.408Z"
}
I need a search query that will fetch only the inner "Ids" field from _source and nothing more. How can I do this?
You can use _source filtering in inner_hits, in the following way.
Index Mapping:
{
"mappings": {
"properties": {
"Person": {
"type": "nested"
}
}
}
}
Search Query:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "Person",
"query": {
"match_all": {}
},
"inner_hits": {
"_source": {
"includes": [
"Person.Ids"
]
}
}
}
}
]
}
}
}
Search Result:
"inner_hits": {
"Person": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "65237264",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "Person",
"offset": 0
},
"_score": 1.0,
"_source": {
"Ids": [
3,
387,
100,
500,
274,
283,
328,
400,
600
]
}
}
]
}
}
}
You can also combine nested inner_hits with docvalue_fields and disable _source, in the following way (the values then come back under a fields section instead of _source):
{
"query": {
"nested": {
"path": "Person",
"query": {
"match_all": {}
},
"inner_hits": {
"_source" : false,
"docvalue_fields" : [
{
"field": "Person.Ids",
"format": "use_field_mapping"
}
]
}
}
}
}
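The inner hits then carry doc values instead of the source, roughly like this (a sketch based on the sample document above):
"hits": [
  {
    "_index": "65237264",
    "_id": "1",
    "_nested": { "field": "Person", "offset": 0 },
    "_score": 1.0,
    "fields": {
      "Person.Ids": [3, 387, 100, 500, 274, 283, 328, 400, 600]
    }
  }
]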

Kibana filter for equality on a field returns two documents with different field values

While using Kibana Discover mode, we found a concerning result.
For a given index, over a specific time range, we found a case where filtering on the field "time_stamp" (mapped as long) being equal to a specific value (1545287341) returned two documents: one with the exact value and another whose value was merely close.
How is this possible? Shouldn't the only document returned be the one with the specified value? What could cause this inaccurate response from Elasticsearch? I would appreciate help as this is very puzzling.
I am capturing the query sent by Kibana here:
{
"version": true,
"size": 500,
"sort": [{
"#timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}],
"_source": {
"excludes": []
},
"aggs": {
"2": {
"date_histogram": {
"field": "#timestamp",
"interval": "3h",
"time_zone": "Etc/UTC",
"min_doc_count": 1
}
}
},
"stored_fields": ["*"],
"script_fields": {},
"docvalue_fields": ["#timestamp", "day"],
"query": {
"bool": {
"must": [{
"match_all": {}
}, {
"match_phrase": {
"dev_id.keyword": {
"query": "22170821152"
}
}
}, {
"match_phrase": {
"time_stamp": {
"query": 1545287341
}
}
}, {
"range": {
"#timestamp": {
"gte": 1544659200000,
"lte": 1545350399999,
"format": "epoch_millis"
}
}
}],
"filter": [],
"should": [],
"must_not": []
}
},
"highlight": {
"pre_tags": ["#kibana-highlighted-field#"],
"post_tags": ["#/kibana-highlighted-field#"],
"fields": {
"*": {}
},
"fragment_size": 2147483647
}
}
The (redacted) response showing the close-but-not-exact response is here as well:
{
"responses": [{
"took": 2,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": null,
"hits": [{
"_index": "pkt-2018-12",
"_type": "doc",
"_id": "CzvHahOE1jrv+tFWGorFH4gV6cs=",
"_version": 1,
"_score": null,
"_source": {
"time_stamp": 1.545287341E9,
"#timestamp": "2018-12-20T06:29:01.000Z",
},
"fields": {
"#timestamp": ["2018-12-20T06:29:01.000Z"]
},
"highlight": {
"dev_id.keyword": ["#kibana-highlighted-field#22170821152#/kibana-highlighted-field#"]
},
"sort": [1545287341000]
}, {
"_index": "pkt-2018-12",
"_type": "doc",
"_id": "PbeMWFMNpvwrjnZpBJtexDwfE9k=",
"_version": 1,
"_score": null,
"_source": {
"time_stamp": 1.545287281E9,
"#timestamp": "2018-12-20T06:28:01.000
},
"fields": {
"#timestamp": ["2018-12-20T06:28:01.000Z"]
},
"highlight": {
"dev_id.keyword": ["#kibana-highlighted-field#22170821152#/kibana-highlighted-field#"]
},
"sort": [1545287281000]
}]
},
"aggregations": {
"2": {
"buckets": [{
"key_as_string": "2018-12-20T06:00:00.000Z",
"key": 1545285600000,
"doc_count": 2
}]
}
},
"status": 200
}]
}

Elastic bool query must match not getting considered

I am basically trying to write a query that should return the document where
school is "holy international" AND grade is "second".
The issue with the current query is that it is not considering the must match part, i.e. even though I don't specify the school it still gives me this document, whereas it should not be a match.
The query is giving me all the documents where the grade is "second".
I want only documents where school is "holy international" AND grade is "second".
Also, I have not specified anything in the match query for "schools.school", but it's still giving me results.
mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase1": {
"tokenizer": "keyword",
"filter": ["lowercase", "my_pattern_replace1", "trim"]
},
"my_keyword_lowercase2": {
"tokenizer": "standard",
"filter": ["lowercase", "trim"]
}
},
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
}
},
"mappings": {
"test_data": {
"properties": {
"schools": {
"type": "nested",
"properties": {
"school": {
"type": "string",
"analyzer": "my_keyword_lowercase1"
},
"grade": {
"type": "string",
"analyzer": "my_keyword_lowercase2"
}
}
}
}
}
}
}
data
{
"_index": "data_index",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_version": 1,
"found": true,
"_source": {
"summary": null,
"schools": [{
"school": "little flower",
"grade": "first",
"date": "2007-06-01",
},
{
"school": "holy international",
"grade": "second",
"date": "2007-06-01",
},
],
"first_name": "Adam",
"location": "Kansas City",
"last_name": "Roger",
"country": "US",
"name": "Adam Roger",
}
}
query
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "" <-----X didnt specify anything
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second",
"operator": "and",
"minimum_should_match": "100%"
}
}
}
}
}
}
}
}
result
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "data_test",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_score": 0.2876821,
"_source": {
"first_name": "Adam"
},
"inner_hits": {
"schools": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_nested": {
"field": "schools",
"offset": 0
},
"_score": 0.2876821,
"_source": {
"schools": {
"school": "holy international",
"grade": "second"
}
}
}
]
}
}
}
}
]
}
}
So basically your problem is the analysis step; when I loaded everything and checked, it became very clear.
This filter completely wipes all strings from the schools.school field:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
That's happening because . is a regex that matches any character, so every character in the input gets replaced with an empty string. When I checked it:
POST /_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
That's why you always get a match: every string you pass at indexing time and at search time becomes "". Some additional info from the Elastic documentation: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-pattern_replace-tokenfilter.html
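If the intent was to strip literal dots from the values, the dot needs to be escaped, since pattern is a regular expression; a minimal sketch of the corrected filter, assuming that was the goal:
"filter": {
  "my_pattern_replace1": {
    "type": "pattern_replace",
    "pattern": "\\.",
    "replacement": ""
  }
}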
After I removed the pattern replace filter, this query returned everything as expected:
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "holy international"
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second"
}
}
}
}
}
}
}
}

How do I perform an "OR" filter on an aggregate?

I am trying to grab the first 10 documents grouped by domain. These 10 documents need to have a value for "crawl_date" that haven't been crawled for a while or haven't been crawled at all (eg a blank value). I have:
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '
{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"or":[
"term": {"crawl_date": ""},
"term": {"crawl_date": ""} // how do I put a range here? e.g. <= '2014-12-31'
]
},
"terms": {
"field": "domain"
}
}
}
}'
I am new to ES and using version 2.2. Since the documentation isn't fully updated I am struggling.
EDIT:
To clarify, I need 10 URLs that haven't been crawled or haven't been crawled for a while. Each of those 10 URLs has to come from a unique domain so that when I crawl them I don't overload someone's server.
Another Edit:
So, I need something like this (1 link for each of 10 unique domains):
1. www.domain1.com/page
2. www.domain2.com/url
etc...
Instead, I am getting just the domain and the number of pages:
"buckets": [
{
"key": "http://www.dailymail.co.uk",
"doc_count": 212
},
{
"key": "https://sedo.com",
"doc_count": 196
},
{
"key": "http://www.foxnews.com",
"doc_count": 118
},
{
"key": "http://data.worldbank.org",
"doc_count": 117
},
{
"key": "http://detail.1688.com",
"doc_count": 117
},
{
"key": "https://twitter.com",
"doc_count": 112
},
{
"key": "http://search.rakuten.co.jp",
"doc_count": 104
},
{
"key": "https://in.1688.com",
"doc_count": 92
},
{
"key": "http://www.abc.net.au",
"doc_count": 87
},
{
"key": "http://sport.lemonde.fr",
"doc_count": 85
}
]
The "hits" returns multiple pages for just 1 domain:
"hits": [
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/w/at-the-edge-of-the-orchard-tracy-chevalier/1121908441?ean=9780525953005",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
etc....
Barnes and Noble will quickly ban my UA if I try to crawl that many of its pages at the same time.
I need something like this:
1. "http://www.dailymail.co.uk/page/text.html",
2. "https://sedo.com/another/page"
3. "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb"
4. "http://www.starbucks.com/homepage/"
etc.
Using Aggregations
If you want to use aggregations, I'd suggest using the terms aggregation to remove the duplicates from your result set and, as a sub-aggregation, the top_hits aggregation, which gives you the best hit from the aggregated documents of each domain (by default the score for each document within a domain should be the same).
Consequently, the query will look like this:
POST sites/page/_search
{
"size": 0,
"aggs": {
"filtered_domains": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain",
"size": 10
},
"aggs": {
"pages": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This gives you results like the following:
"aggregations": {
"filtered_domains": {
"doc_count": 3,
"domains": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "barnesandnoble.com",
"doc_count": 2,
"pages": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "barnesandnoble.com/test2.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "barnesandnoble.com"
}
}
]
}
}
},
{
"key": "starbucks.com",
"doc_count": 1,
"pages": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "starbucks.com"
}
}
]
}
}
}
]
}
}
Using Parent/Child Aggregations
If you can change your index structure, I'd suggest creating an index with either a parent/child relationship or nested documents.
If you do so, you can select 10 distinct domains and retrieve one (or more) specific pages for each of them.
Let me show you an example with parent/child (if you use Sense, you should be able to just copy-paste):
First generate the mappings for the documents:
PUT /sites
{
"mappings": {
"domain": {},
"page": {
"_parent": {
"type": "domain"
},
"properties": {
"crawl_date": {
"type": "date"
}
}
}
}
}
Insert some documents
PUT sites/domain/barnesandnoble.com
{}
PUT sites/domain/starbucks.com
{}
PUT sites/domain/dailymail.co.uk
{}
POST /sites/page/_bulk
{ "index": { "_id": "barnesandnoble.com/test.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "barnesandnoble.com/test2.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "starbucks.com/index.html", "parent": "starbucks.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "dailymail.co.uk/index.html", "parent": "dailymail.co.uk" }}
{}
Search for the urls to crawl
POST /sites/domain/_search
{
"query": {
"has_child": {
"type": "page",
"query": {
"bool": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}]
}
}
}
},
"inner_hits": {
"size": 1
}
}
}
}
We do a has_child query on the parent type and therefore receive only distinct URLs of the parent type. To get the specific pages, we have to add an inner_hits query, which gives us the child documents leading to the hits in the parent type.
If you set inner_hits size to 1, you get only one page per domain.
You can even add sorting in the inner_hits query; for example, you can sort by crawl_date (see the sketch after the result below). ;)
The above search gives you the following result:
"hits": [
{
"_index": "sites",
"_type": "domain",
"_id": "starbucks.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1.9664046,
"_routing": "starbucks.com",
"_parent": "starbucks.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "dailymail.co.uk",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "dailymail.co.uk/index.html",
"_score": 1.9664046,
"_routing": "dailymail.co.uk",
"_parent": "dailymail.co.uk",
"_source": {}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "barnesandnoble.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 2,
"max_score": 1.4142135,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "barnesandnoble.com/test.html",
"_score": 1.4142135,
"_routing": "barnesandnoble.com",
"_parent": "barnesandnoble.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
}
]
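As mentioned above, here is a minimal sketch of sorting the inner hits by crawl_date; only the inner_hits part of the query changes, and "missing": "_first" is an assumption that never-crawled pages (no crawl_date) should come back first:
"inner_hits": {
  "size": 1,
  "sort": [
    { "crawl_date": { "order": "asc", "missing": "_first" } }
  ]
}
Sorting ascending with missing values first means each domain returns its never-crawled or stalest page.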
Finally, let me note one thing: the parent/child relationship comes with a small cost at query time. If this isn't a problem for your use case, I'd go for this solution.
I suggest you use the exists filter instead of trying to match an empty term (the missing filter is deprecated in 2.2). Then, the range filter will help you filter out the documents you don't need.
Finally, since you have used the absolute URL as id, make sure to aggregate on the _uid field and not the domain field, that way you'll get unique counts per exact page.
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2014-12-31T00:00:00.000"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "_uid"
}
}
}
}
}
}'
You have to use a filter aggregation and then a sub-aggregation:
{
"size": 10,
"aggs": {
"filter_date": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": [
{
"exists": {
"field": "crawl_date"
}
}
]
}
},
{
"range": {
"crawl_date": {
"from": "now-100d"
}
}
}
]
}
},
"aggs": {
"group_by_domain": {
"terms": {
"field": "domain"
}
}
}
}
}
}
