How do I get a terms aggregation to respect the pipeline above it?

(I've attached test data below the question)
I would like to know why a terms aggregation that uses min_doc_count and sits inside another aggregation does not respect the results of the aggregation above it.
Here's my query:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "facets",
"query": {
"bool": {
"filter": [
{
"term": {
"facets.name": "brand"
}
},
{
"term": {
"facets.value": "hokey"
}
}
]
}
}
}
}
]
}
},
"aggs": {
"facets": {
"nested": {
"path": "facets"
},
"aggs": {
"names": {
"terms": {
"field": "facets.name"
},
"aggs": {
"values": {
"terms": {
"field": "facets.value",
"min_doc_count": 0
}
}
}
}
}
}
}
}
Looking at the above, with min_doc_count set to 0 on the facets.value terms aggregation, why am I seeing buckets for every possible facets.value, even those that don't belong to the facets.name bucket above?
Surely the aggs should be hierarchical and respect the higher levels? Do I need to run a script filter or something?
Please experiment with the data below to see what I mean. I don't want to run multiple queries for our search filtering: without min_doc_count, lots of facets.value entries are filtered out, but with it, we get too many irrelevant results in the lowest aggregation.
Mapping:
{
"mappings": {
"properties": {
"facets": {
"type": "nested",
"properties": {
"name": { "type": "keyword"},
"value": { "type": "keyword"}
}
}
}
}
}
Bulk documents:
{ "index": { "_index": "product-facets", "_id": 1} }
{"facets":[{"name":"brand","value":"ubest"},{"name":"color","value":"green"},{"name":"department","value":"soccer"}]}
{ "index": { "_index": "product-facets", "_id": 2} }
{"facets":[{"name":"brand","value":"ubest"},{"name":"color","value":"green"},{"name":"department","value":"adventure"}]}
{ "index": { "_index": "product-facets", "_id": 3} }
{"facets":[{"name":"brand","value":"beert"},{"name":"color","value":"white"},{"name":"department","value":"soccer"}]}
{ "index": { "_index": "product-facets", "_id": 4} }
{"facets":[{"name":"brand","value":"ubest"},{"name":"color","value":"yellow"},{"name":"department","value":"adventure"}]}
{ "index": { "_index": "product-facets", "_id": 5} }
{"facets":[{"name":"brand","value":"hokey"},{"name":"color","value":"yellow"},{"name":"department","value":"adventure"}]}
{ "index": { "_index": "product-facets", "_id": 6} }
{"facets":[{"name":"brand","value":"beert"},{"name":"color","value":"black"},{"name":"department","value":"casual"}]}
{ "index": { "_index": "product-facets", "_id": 7} }
{"facets":[{"name":"brand","value":"hokey"},{"name":"color","value":"white"},{"name":"department","value":"adventure"}]}
{ "index": { "_index": "product-facets", "_id": 8} }
{"facets":[{"name":"brand","value":"ubest"},{"name":"color","value":"black"},{"name":"department","value":"casual"}]}
{ "index": { "_index": "product-facets", "_id": 9} }
{"facets":[{"name":"brand","value":"hokey"},{"name":"color","value":"white"},{"name":"department","value":"soccer"}]}
{ "index": { "_index": "product-facets", "_id": 10} }
{"facets":[{"name":"brand","value":"hokey"},{"name":"color","value":"white"},{"name":"department","value":"adventure"}]}
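One possible client-side approach, sketched in plain Python with no Elasticsearch call: with min_doc_count set to 0, the terms aggregation also returns buckets for terms that did not match any hit, so the zero-count values are not limited to the parent facets.name bucket. The sketch builds a name-to-values catalog from all documents, counts facets in the documents matching brand=hokey, and zero-fills only the values known for each name. The helper names (build_catalog, count_facets, zero_fill, has_facet) are hypothetical, and in a real setup the catalog would presumably come from a separate unfiltered aggregation.

```python
from collections import defaultdict

# The ten sample documents above, reduced to (brand, color, department).
ROWS = [
    ("ubest", "green", "soccer"), ("ubest", "green", "adventure"),
    ("beert", "white", "soccer"), ("ubest", "yellow", "adventure"),
    ("hokey", "yellow", "adventure"), ("beert", "black", "casual"),
    ("hokey", "white", "adventure"), ("ubest", "black", "casual"),
    ("hokey", "white", "soccer"), ("hokey", "white", "adventure"),
]
NAMES = ("brand", "color", "department")
DOCS = [
    {"facets": [{"name": n, "value": v} for n, v in zip(NAMES, row)]}
    for row in ROWS
]


def build_catalog(docs):
    """Map each facet name to every value it takes anywhere in the index."""
    catalog = defaultdict(set)
    for doc in docs:
        for facet in doc["facets"]:
            catalog[facet["name"]].add(facet["value"])
    return catalog


def count_facets(docs):
    """Count (name, value) pairs over the given documents only."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for facet in doc["facets"]:
            counts[facet["name"]][facet["value"]] += 1
    return counts


def zero_fill(counts, catalog):
    """Add a zero count for each catalog value missing under its own name."""
    return {
        name: {value: counts.get(name, {}).get(value, 0) for value in sorted(values)}
        for name, values in sorted(catalog.items())
    }


def has_facet(doc, name, value):
    return any(f["name"] == name and f["value"] == value for f in doc["facets"])


hokey_docs = [d for d in DOCS if has_facet(d, "brand", "hokey")]
result = zero_fill(count_facets(hokey_docs), build_catalog(DOCS))
print(result["color"])
```

Under each name, only that name's own values appear, so a color such as green never shows up in the brand bucket.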

Related

elk's elastic search dsl case sensitive

I'm doing an Elasticsearch Query DSL query on ELK such as:
{
"query": {
"wildcard": {
"url.path": {
"value": "*download*",
"boost": 1,
"rewrite": "constant_score"
}
}
}
}
but it seems to be case sensitive (it only returns documents with "download", not "Download" or "DOWNLOAD").
Can I disable this and search case-insensitively?
Version used: 7.9.1
The query below performs a case-insensitive search: it fetches results for *download, *Download and *DOWNLOAD. Replace <my-index> with your index and <field1> with the field you would like to search.
Search Query
GET /<my-index>/_search
{
"query" : {
"bool" : {
"must" : {
"query_string" : {
"query" : "*download",
"fields": ["<field1>"]
}
}
}
}
}
If you wish to perform the same search on multiple fields, add them to the fields list.
Search on multiple fields
GET /<my-index>/_search
{
"query" : {
"bool" : {
"must" : {
"query_string" : {
"query" : "*download",
"fields": ["<field1>","<field2>","<field3>"]
}
}
}
}
}
There is a case_insensitive parameter available for wildcard query, but it was introduced in Elasticsearch 7.10.0, so you need to upgrade if you are still on 7.9.1.
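As a minimal sketch of that version dependency, the Python helper below builds the wildcard query body and adds case_insensitive only when the target version is 7.10.0 or later. The function names (parse_version, build_wildcard_query) are hypothetical, and the result is just a dictionary to send as the search request body. On older versions the parameter is left out, so the search stays case sensitive unless the field is indexed in lowercase.

```python
def parse_version(version):
    """Turn a version string such as '7.9.1' into a comparable tuple (7, 9, 1)."""
    return tuple(int(part) for part in version.split("."))


def build_wildcard_query(field, pattern, es_version):
    """Build a wildcard query body, enabling case_insensitive when supported."""
    params = {"value": pattern}
    if parse_version(es_version) >= (7, 10, 0):
        # Per the answer above, the parameter only exists from 7.10.0 onward.
        params["case_insensitive"] = True
    return {"query": {"wildcard": {field: params}}}


old_body = build_wildcard_query("url.path", "*download*", "7.9.1")
new_body = build_wildcard_query("url.path", "*download*", "7.10.0")
print(new_body)
```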
If you can upgrade to 7.10.0 or higher:
Ideally, the field should use the wildcard type in the index mapping:
{
"mappings": {
"properties": {
"url.path": {
"type": "wildcard"
}
}
}
}
Then a wildcard query with case insensitivity enabled will find all the variants ("download", "Download", "DOWNLOAD", etc.):
{
"query": {
"wildcard": {
"url.path": {
"value": "*download*",
"boost": 1,
"rewrite": "constant_score",
"case_insensitive": true
}
}
}
}
If you must remain at 7.9.1:
Define your mapping so that Elasticsearch indexes the field contents as lowercase. The following mimics the wildcard type (the keyword tokenizer emits a single token) and indexes the value as lowercase.
{
"mappings": {
"properties": {
"url": {
"type": "text",
"analyzer": "lowercase-keyword"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"lowercase-keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
}
The query, without the case_insensitive parameter which is unsupported in this version:
{
"query": {
"wildcard": {
"url": {
"value": "*download*",
"boost": 1,
"rewrite": "constant_score"
}
}
}
}
Example results (note that searching for "*download*" and "*DoWnLoAd*" both work in the same way):
{
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "PtbQe3wByTvslqtrs7Cn",
"_score": 1.0,
"_source": {
"url": "http://example.com/download"
}
},
{
"_index": "my-index",
"_type": "_doc",
"_id": "P9bQe3wByTvslqtrvbDt",
"_score": 1.0,
"_source": {
"url": "http://example.com/Download"
}
},
{
"_index": "my-index",
"_type": "_doc",
"_id": "QNbQe3wByTvslqtrzbDw",
"_score": 1.0,
"_source": {
"url": "http://example.com/DOWNLOAD"
}
}
]
}
}
You can use the case_insensitive parameter of the wildcard query. This parameter was introduced in version 7.10.0.
Here is a working example with index data, mapping, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"url": {
"properties": {
"path": {
"type": "wildcard"
}
}
}
}
}
}
Index Data:
{
"url":{
"path":"xx/download"
}
}
Search Query:
{
"query": {
"wildcard": {
"url.path": {
"value": "*Download*",
"boost": 1,
"rewrite": "constant_score",
"case_insensitive": false
}
}
}
}
Search Result:
No results are returned when you search for *Download* or *DOWNLOAD* with case_insensitive set to false.
Update:
You can use the wildcard query with the "case_insensitive": true parameter.
Here is sample index data, a search query, and the search result.
Index Data:
{
"url": {
"path": "download"
}
}
{
"url": {
"path": "DOWNLOAD"
}
}
{
"url": {
"path": "Download"
}
}
Search Query:
{
"query": {
"wildcard": {
"url.path": {
"value": "*DOWNLOAD*",
"boost": 1,
"rewrite": "constant_score",
"case_insensitive": true
}
}
}
}
Search Result:
"hits": [
{
"_index": "67210888",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"url": {
"path": "download"
}
}
},
{
"_index": "67210888",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"url": {
"path": "Download"
}
}
},
{
"_index": "67210888",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"url": {
"path": "DOWNLOAD"
}
}
}
]

Elasticsearch range query with multiple condition

I have to fetch records from Elasticsearch based on when they were updated and created. I have two fields, updatedDate and createdDate, and the conditions are:
Fetch records whose updatedDate is within the past 3 years.
If updatedDate is null, fetch records whose createdDate is within the past 3 years.
I have written the query in Java to fetch records based on createdDate:
.must(QueryBuilders.rangeQuery("createdDate").from(startDate,true).to(endDate,true));
startDate and endDate hold the date range.
I am new to Elasticsearch and don't know how to implement the above condition.
Since you have not provided any index data, here is a working example with sample index data, mapping, search query and search result that satisfies the conditions of your use case.
Index Mapping:
{
"mappings": {
"properties": {
"createdDate": {
"format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
"type": "date"
},
"updatedDate": {
"format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
"type": "date"
}
}
}
}
Index Data:
{
"createdDate": "2020-08-15T00:00:00Z"
}
{
"createdDate": "2019-08-15T00:00:00Z"
}
{
"createdDate": "2010-08-15T00:00:00Z"
}
{
"updatedDate": "2021-08-15T00:00:00Z",
"createdDate": "2002-08-15T00:00:00Z"
}
{
"updatedDate": "2018-08-15T00:00:00Z",
"createdDate": "2020-09-15T00:00:00Z"
}
{
"updatedDate": "2000-08-15T00:00:00Z",
"createdDate": "2020-09-15T00:00:00Z"
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"bool": {
"filter": {
"range": {
"createdDate": {
"gte": "now-3y",
"lte": "now"
}
}
},
"must_not": {
"exists": {
"field": "updatedDate"
}
}
}
}
]
}
},
{
"bool": {
"filter": {
"range": {
"updatedDate": {
"gte": "now-3y",
"lte": "now"
}
}
}
}
}
],
"minimum_should_match": 1
}
}
}
Search Result:
"hits": [
{
"_index": "64965551",
"_type": "_doc",
"_id": "1",
"_score": 0.0,
"_source": {
"createdDate": "2020-08-15T00:00:00Z"
}
},
{
"_index": "64965551",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"createdDate": "2019-08-15T00:00:00Z"
}
},
{
"_index": "64965551",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"updatedDate": "2018-08-15T00:00:00Z",
"createdDate": "2020-09-15T00:00:00Z"
}
}
]
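The condition encoded by the bool query above can be checked in plain Python against the sample index data. This is only a sketch of the logic, not an Elasticsearch call: the matches function is a hypothetical helper, and the fixed range end of 2020-11-23 stands in for "now" and is an assumption chosen so the result lines up with the search result shown.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # same layout as the mapping's date format


def parse(ts):
    return datetime.strptime(ts, FORMAT).replace(tzinfo=timezone.utc)


def matches(doc, start, end):
    """updatedDate in range, or updatedDate missing and createdDate in range."""
    updated = doc.get("updatedDate")
    if updated is not None:
        return start <= parse(updated) <= end
    created = doc.get("createdDate")
    return created is not None and start <= parse(created) <= end


# Sample index data from above, in the same order (ids 1 to 6).
DOCS = [
    {"createdDate": "2020-08-15T00:00:00Z"},
    {"createdDate": "2019-08-15T00:00:00Z"},
    {"createdDate": "2010-08-15T00:00:00Z"},
    {"updatedDate": "2021-08-15T00:00:00Z", "createdDate": "2002-08-15T00:00:00Z"},
    {"updatedDate": "2018-08-15T00:00:00Z", "createdDate": "2020-09-15T00:00:00Z"},
    {"updatedDate": "2000-08-15T00:00:00Z", "createdDate": "2020-09-15T00:00:00Z"},
]

END = parse("2020-11-23T00:00:00Z")    # assumed value of "now"
START = parse("2017-11-23T00:00:00Z")  # "now-3y" for that assumed date

matched_ids = [i for i, doc in enumerate(DOCS, start=1) if matches(doc, START, END)]
print(matched_ids)  # [1, 2, 5]
```

Document 4 is excluded because its updatedDate lies after the assumed "now", and document 6 because an existing updatedDate outside the range is never rescued by createdDate.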

Filter elastic data on array count

How can we fetch candidates which have at least one phone number from the below index data along with other conditions like must and should?
Using elastic version 6.*
{
"_index": "test",
"_type": "docs",
"_id": "1271",
"_score": 1.518617,
"_source": {
"record": {
"createdDate": "2020-10-16T10:49:51.53",
"phoneNumbers": [
{
"type": "Cell",
"id": 0,
"countryCode": "+1",
"phoneNumber": "7845200448",
"extension": "",
"typeId": 700
}
]
},
"entityType": "Candidate",
"dbId": "1271",
"id": "1271"
}
}
You can use the terms query, which returns documents that contain one or more exact terms in a provided field.
Search Query:
{
"query": {
"bool": {
"must": [
{
"terms": {
"record.phoneNumbers.phoneNumber.keyword": [
"7845200448"
]
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64388591",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"record": {
"createdDate": "2020-10-16T10:49:51.53",
"phoneNumbers": [
{
"type": "Cell",
"id": 0,
"countryCode": "+1",
"phoneNumber": "7845200448",
"extension": "",
"typeId": 700
}
]
},
"entityType": "Candidate",
"dbId": "1271",
"id": "1271"
}
}
]
Update 1: For version 7.*
You need to use a script query to filter documents based on the provided script.
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "doc['record.phoneNumbers.phoneNumber.keyword'].length > 0",
"lang": "painless"
}
}
}
}
}
}
For version 6.*
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "doc['record.phoneNumbers.phoneNumber.keyword'].values.length > 0",
"lang": "painless"
}
}
}
}
}
}
You can use an exists query for this purpose, as shown below; it is lightweight compared with scripts:
{
"query": {
"exists": {
"field": "record.phoneNumbers.phoneNumber"
}
}
}
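Since the question also asks about combining this with must and should conditions, here is a small Python sketch that places the exists clause in the filter section of a bool query next to caller-supplied clauses. The function name with_phone_number and the example match clause on entityType are hypothetical, and the sketch assumes phoneNumbers is mapped as a plain object field rather than as nested.

```python
def with_phone_number(must=None, should=None):
    """Build a bool query that requires at least one phone number value."""
    bool_query = {
        # The exists clause runs in filter context, so it does not affect scoring.
        "filter": [{"exists": {"field": "record.phoneNumbers.phoneNumber"}}]
    }
    if must:
        bool_query["must"] = list(must)
    if should:
        bool_query["should"] = list(should)
    return {"query": {"bool": bool_query}}


# Hypothetical extra condition, for illustration only.
body = with_phone_number(must=[{"match": {"entityType": "Candidate"}}])
print(body)
```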

What is wrong in this elastic search query?

I can't understand why I get no results. Using ES 2.
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"technical.techcolor": "red"
}
}
}
}
And here is the info from db that I am searching against.
{"technical":
[{
"techname22": "test",
"techcolor":"red",
"techlocation": "usa"
}],
"audio":
{
"someAudioMetadata": "test"
}
}
Since you have not specified your mapping, I am assuming the following mapping.
Mapping:
{
"mappings": {
"company": {
"properties": {
"technical": {
"type": "nested"
}
}
}
}
}
Search Query:
{
"query": {
"filtered": {
"query": {
"match_all": {
}
},
"filter": {
"nested": {
"path": "technical",
"filter": {
"term": {
"technical.techcolor": "red"
}
}
}
}
}
}
}
Search Result:
"hits": {
"total": 1,
"max_score": 1.0,
"hits": [
{
"_index": "demos",
"_type": "company",
"_id": "1",
"_score": 1.0,
"_source": {
"technical": [
{
"techname22": "test",
"techcolor": "red",
"techlocation": "usa"
}
],
"audio": {
"someAudioMetadata": "test"
}
}
}
]
}
To learn more about the nested datatype and about query and filter context, refer to the official Elasticsearch documentation.

How do I perform an "OR" filter on an aggregate?

I am trying to grab the first 10 documents grouped by domain. These 10 documents need a "crawl_date" value showing that they haven't been crawled for a while or haven't been crawled at all (e.g. a blank value). I have:
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '
{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"or":[
"term": {"crawl_date": ""},
"term": {"crawl_date": ""} // how do I put a range here? e.g. <= '2014-12-31'
]
},
"terms": {
"field": "domain"
}
}
}
}'
I am new to ES and using version 2.2. Since the documentation isn't fully updated I am struggling.
EDIT:
To clarify, I need 10 urls that haven't been crawled or haven't been crawled for a while. Each of those 10 urls has to come from a unique domain so that when I crawl them I don't overload someone's server.
Another Edit:
So, I need something like this (1 link for each of 10 unique domains):
1. www.domain1.com/page
2. www.domain2.com/url
etc...
Instead, I am getting just the domain and the number of pages:
"buckets": [
{
"key": "http://www.dailymail.co.uk",
"doc_count": 212
},
{
"key": "https://sedo.com",
"doc_count": 196
},
{
"key": "http://www.foxnews.com",
"doc_count": 118
},
{
"key": "http://data.worldbank.org",
"doc_count": 117
},
{
"key": "http://detail.1688.com",
"doc_count": 117
},
{
"key": "https://twitter.com",
"doc_count": 112
},
{
"key": "http://search.rakuten.co.jp",
"doc_count": 104
},
{
"key": "https://in.1688.com",
"doc_count": 92
},
{
"key": "http://www.abc.net.au",
"doc_count": 87
},
{
"key": "http://sport.lemonde.fr",
"doc_count": 85
}
]
The "hits" returns multiple pages for just 1 domain:
"hits": [
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/w/at-the-edge-of-the-orchard-tracy-chevalier/1121908441?ean=9780525953005",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
{
"_index": "tester",
"_type": "test",
"_id": "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb",
"_score": 1,
"_source": {
"domain": "http://www.barnesandnoble.com",
"crawl_date": "0001-01-01T00:00:00Z"
}
},
etc....
Barnes and Noble will quickly ban my UA if I try to crawl that many pages from their domain at the same time.
I need something like this:
1. "http://www.dailymail.co.uk/page/text.html",
2. "https://sedo.com/another/page"
3. "http://www.barnesandnoble.com/b/bargain-books/_/N-8qb"
4. "http://www.starbucks.com/homepage/"
etc.
Using Aggregations
If you want to use aggregations, I'd suggest using the terms aggregation to remove the duplicates from your result set and, as a sub-aggregation, the top_hits aggregation, which gives you the best hit among the aggregated documents of each domain (by default, the score of each document within a domain should be the same).
Consequently, the query looks like this:
POST sites/page/_search
{
"size": 0,
"aggs": {
"filtered_domains": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "domain",
"size": 10
},
"aggs": {
"pages": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This gives you results like these:
"aggregations": {
"filtered_domains": {
"doc_count": 3,
"domains": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "barnesandnoble.com",
"doc_count": 2,
"pages": {
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "barnesandnoble.com/test2.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "barnesandnoble.com"
}
}
]
}
}
},
{
"key": "starbucks.com",
"doc_count": 1,
"pages": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1,
"_source": {
"crawl_date": "1982-05-16",
"domain": "starbucks.com"
}
}
]
}
}
}
]
}
}
}
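To turn a response of this shape into the flat list of one URL per domain that the question asks for, something like the following Python sketch could be used. The function name first_page_per_domain is hypothetical, the aggregation names match the query above, and the sample response is trimmed to the keys the function reads.

```python
def first_page_per_domain(aggregations):
    """Return {domain: page id} using the first top_hits entry of each bucket."""
    buckets = aggregations["filtered_domains"]["domains"]["buckets"]
    pages = {}
    for bucket in buckets:
        hits = bucket["pages"]["hits"]["hits"]
        if hits:  # a bucket without hits is skipped
            pages[bucket["key"]] = hits[0]["_id"]
    return pages


# Trimmed copy of the aggregation result shown above.
sample = {
    "filtered_domains": {
        "doc_count": 3,
        "domains": {
            "buckets": [
                {
                    "key": "barnesandnoble.com",
                    "doc_count": 2,
                    "pages": {"hits": {"hits": [{"_id": "barnesandnoble.com/test2.html"}]}},
                },
                {
                    "key": "starbucks.com",
                    "doc_count": 1,
                    "pages": {"hits": {"hits": [{"_id": "starbucks.com/index.html"}]}},
                },
            ]
        },
    }
}

urls = first_page_per_domain(sample)
print(list(urls.values()))
```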
Using Parent/Child Aggregations
If you can change your index structure, I'd suggest creating an index with either a parent/child relationship or nested documents.
If you do so, you can select 10 distinct domains and retrieve one (or more) specific pages of each domain.
Let me show you an example with parent/child (if you use Sense, you should be able to just copy and paste):
First create the mappings for the documents:
PUT /sites
{
"mappings": {
"domain": {},
"page": {
"_parent": {
"type": "domain"
},
"properties": {
"crawl_date": {
"type": "date"
}
}
}
}
}
Insert some documents
PUT sites/domain/barnesandnoble.com
{}
PUT sites/domain/starbucks.com
{}
PUT sites/domain/dailymail.co.uk
{}
POST /sites/page/_bulk
{ "index": { "_id": "barnesandnoble.com/test.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "barnesandnoble.com/test2.html", "parent": "barnesandnoble.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "starbucks.com/index.html", "parent": "starbucks.com" }}
{ "crawl_date": "1982-05-16" }
{ "index": { "_id": "dailymail.co.uk/index.html", "parent": "dailymail.co.uk" }}
{}
Search for the urls to crawl
POST /sites/domain/_search
{
"query": {
"has_child": {
"type": "page",
"query": {
"bool": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2016-01-01"
}
}
}]
}
}
}
},
"inner_hits": {
"size": 1
}
}
}
}
We do a has_child query on the parent type and therefore receive only distinct domains of the parent type. To get the specific pages, we have to add an inner_hits section, which gives us the child documents that led to the hits in the parent type.
If you set the inner_hits size to 1, you get only one page per domain.
You can even add sorting in the inner_hits section... For example, you can sort by crawl_date. ;)
The above search gives you the following result:
"hits": [
{
"_index": "sites",
"_type": "domain",
"_id": "starbucks.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "starbucks.com/index.html",
"_score": 1.9664046,
"_routing": "starbucks.com",
"_parent": "starbucks.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "dailymail.co.uk",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 1,
"max_score": 1.9664046,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "dailymail.co.uk/index.html",
"_score": 1.9664046,
"_routing": "dailymail.co.uk",
"_parent": "dailymail.co.uk",
"_source": {}
}
]
}
}
}
},
{
"_index": "sites",
"_type": "domain",
"_id": "barnesandnoble.com",
"_score": 1,
"_source": {},
"inner_hits": {
"page": {
"hits": {
"total": 2,
"max_score": 1.4142135,
"hits": [
{
"_index": "sites",
"_type": "page",
"_id": "barnesandnoble.com/test.html",
"_score": 1.4142135,
"_routing": "barnesandnoble.com",
"_parent": "barnesandnoble.com",
"_source": {
"crawl_date": "1982-05-16"
}
}
]
}
}
}
}
]
Finally, let me note one thing. Parent/child relationship comes with small costs at query time. If this isn't a problem for your use case, I'd go for this solution.
I suggest you use the exists filter instead of trying to match an empty term (the missing filter is deprecated in 2.2). Then, the range filter will help you filter out the documents you don't need.
Finally, since you have used the absolute URL as id, make sure to aggregate on the _uid field and not the domain field, that way you'll get unique counts per exact page.
curl -XPOST 'http://localhost:9200/tester/test/_search' -d '{
"size": 10,
"aggs": {
"group_by_domain": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "crawl_date"
}
}
}
},
{
"range": {
"crawl_date": {
"lte": "2014-12-31T00:00:00.000"
}
}
}
]
}
},
"aggs": {
"domains": {
"terms": {
"field": "_uid"
}
}
}
}
}
}'
You have to use a filter aggregation and then a sub-aggregation:
{
"size": 10,
"aggs": {
"filter_date": {
"filter": {
"bool": {
"should": [
{
"bool": {
"must_not": [
{
"exists": {
"field": "crawl_date"
}
}
]
}
},
{
"range": {
"crawl_date": {
"lte": "now-100d"
}
}
}
]
}
},
"aggs": {
"group_by_domain": {
"terms": {
"field": "domain"
}
}
}
}
}
}
