Difference between a "plain" terms query and a terms query using a filter - elasticsearch

I am trying to understand what the difference is between:
a "plain" elasticsearch query that is going to match a terms query and return a certain number of hits.
and a filtered query (therefore using a filter) that is going to return the same number of hits.
Here is the terms query:
GET _search
{
"query": {
"terms": {
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"minimum_match": 3
}
}
}
Here is the filtered version:
GET _search
{
"query": {
"filtered": {
"filter": {
"terms": {
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"execution": "and"
}
}
}
}
}
Both return a total hits of 8000 (against my index).
Here is the result from the "plain" terms query:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8000,
"max_score": 5.134171,
"hits": [
{
"_index": "bignibou",
"_type": "advertisement",
"_id": "AUs2T2lt3L5LNr7nkot2",
"_score": 5.134171,
"_source": {
"childcareWorkerType": "AUXILIAIRE_PARENTALE",
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"address": {
"latitude": 48.8532558,
"longitude": 2.36584
},
"giveBath": "EMPTY"
}
},
...
Here is the result from the "filtered" query:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8000,
"max_score": 1,
"hits": [
{
"_index": "bignibou",
"_type": "advertisement",
"_id": "AUs2T2lt3L5LNr7nkot2",
"_score": 1,
"_source": {
"childcareWorkerType": "AUXILIAIRE_PARENTALE",
"childcareTypes": [
"SOLE_CHARGE",
"OUT_OF_SCHOOL",
"BABY_SITTING"
],
"address": {
"latitude": 48.8532558,
"longitude": 2.36584
},
"giveBath": "EMPTY"
}
},
....
Then what are the differences between the two?

This is related to the differences between queries and filters (more information here).
In your case, unlike terms query, terms filter :
is cached
doesn't compute the score : all matching documents have the same _score of 1 (look at your results)
Consequently, the biggest difference is that the filtered query will be faster than a 'plain' terms query.

Related

ElasticSearch: Exact match on Keyword datatype field with array of values

In ElasticSearch, I have a mapping for an email field and title field as given below:
{
"person": {
"mappings": {
"_doc": {
"email": {
"type": "keyword",
"boost": 80
},
"title": {
"type": "text",
"boost": 70
}
}
}
}
Each person can have more than one email address and title. So, I'm storing the values in arrays.
I use query_string to search for persons with an email address and/or title. Email address needs to match exactly.
I have indexed a document with the following data. Calling GET person/_search in Kibana will yield the following document in the result.
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "person",
"_type": "_doc",
"_id": "101",
"_score": 1,
"_source": {
"title": """["Actor", "Hero", "Model"]""",
"email": """["jdepp#hotmail.com", "johnny#hollywood.com", "jdepp#gmail.com", "johnny.depp#yahoo.com"]""",
"SEARCH_ENTITY": "PERSON"
}
}
]
}
}
Now when I add some email search parameter I don't get the document back in the result. Remember email is of type keyword.
Request:
GET person/_search
{
"query" : {
"query_string" : {
"query" : "SEARCH_ENTITY:PERSON AND (email: (johnny.depp#yahoo.com))"
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
But the same kind of query works for title field which is of type text.
Request:
GET person/_search
{
"query" : {
"query_string" : {
"query" : "SEARCH_ENTITY:PERSON AND (title: ((actor)))"
}
}
}
Response:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 20.137747,
"hits": [
{
"_index": "person",
"_type": "_doc",
"_id": "101",
"_score": 20.137747,
"_source": {
"ID": "101",
"title": """["Actor", "Hero", "Model"]""",
"email": """["jdepp#hotmail.com", "johnny#hollywood.com", "jdepp#gmail.com", "johnny.depp#yahoo.com"]"""
}
}
]
}
}
Can someone tell me what I need to do to make this work for email field which is of keyword type?
Note: If I store only one email address without using an array, it works fine.
Thanks.
Make sure you parse the json array strings in title and email like so before you index your docs:
POST person/_doc/101
{
"title": [
"Actor",
"Hero",
"Model"
],
"email": [
"jdepp#hotmail.com",
"johnny#hollywood.com",
"jdepp#gmail.com",
"johnny.depp#yahoo.com"
],
"SEARCH_ENTITY": "PERSON"
}
Nothing needs to be changed about the mapping -- just the field values.

Function score ignored

I have two nearly identical documents, one of which has the fields CONSTRUCTION: 1 and EDUCATION: 0.1, the other with CONSTRUCTION: 0.1 and EDUCATION: 1. I want to be able to sort results by the value of either the CONSTRUCTION or EDUCATION field
GET /objects/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "Monkeys"
}
}
},
"field_value_factor": {
"field" : "CONSTRUCTION",
"missing": 1
}
}
},
"_source": ["name", "CONSTRUCTION", "EDUCATION"]
}
Returns the incorrect results:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.7622693,
"hits": [
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:12",
"_score": 1.7622693,
"_source": {
"CONSTRUCTION": 0.1,
"name": "Space Monkeys - education",
"EDUCATION": 1
}
},
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:11",
"_score": 1.0226655,
"_source": {
"CONSTRUCTION": 1,
"name": "Space Monkeys - construction",
"EDUCATION": 0.1
}
}
]
}
}
This only always returns the same results. Indeed if you misspell the field_value_factor field, you get the same score "field_value_factor": { "field" : "WHATEVER",... }. This suggests the field simply isn't being read.
Dynamic mapping was turned off. The EDUCATION and CONSTRUCTION fields were not mapped. Mystery solved!

I want to use a wildcard query for url in elasticsearch. I am using elasticsearch 2.3.0

My index looks like this:
GET pibtest1/_search
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11,
"max_score": 1,
"hits": [
{
"_index": "pibtest1",
"_type": "SearchTech",
"_id": "_update",
"_score": 1,
"_source": {
"script": "ctx._source.remove(\"wiki_collection\")"
}
},
{
"_index": "pibtest1",
"_type": "SearchTech",
"_id": "http://www.searchtechnologies.com/bundles/jquery?v=gOdOgfykTFJnypePAvGweyMPwl-krhx8ntIhefPKelg1",
"_score": 1,
"_source": {
"extension": {
"X-Parsed-By": "org.apache.tika.parser.DefaultParser",
"Content-Encoding": "ISO-8859-1",
"resourceName": "http://www.searchtechnologies.com/bundles/jquery?v=gOdOgfykTFJnypePAvGweyMPwl-krhx8ntIhefPKelg1"
},
"keywords": "keywords-NOT-PROVIDED",
"default_collection": true,
"wiki_collection": false,
"description": "description-NOT-PROVIDED",
"connectorSpecific": {
"discoveredBy": "http://www.searchtechnologies.com/",
"xslt": "false",
"pathFromSeed": "E",
"md5": "OKTGVLEWTE5V4PWXUBM2RK3KMQ"
},
"title": "Title-NOT-PROVIDED",
"url": "http://www.searchtechnologies.com/bundles/jquery?v=gOdOgfykTFJnypePAvGweyMPwl-krhx8ntIhefPKelg1",
"remove": "wiki_collection",
"UD": "http://www.searchtechnologies.com/bundles/jquery?v=gOdOgfykTFJnypePAvGweyMPwl-krhx8ntIhefPKelg1",
Now I want to use a wildcard query to search for few url which includes some pattern(for eg. http://www.searchtechnologies.com/bundles)
This is my wildcard query:
GET pibtest1/_search
{
"query": {
"wildcard": {
"url": {
"value": "http://www.searchtechnologies.com/bundles*"
}
}
}
}
I am using "*" wildcard which matches any character sequence. But I am not getting any results. My output looks like this:
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
I want my results to include those url which matches this "http://www.searchtechnologies.com/bundles" pattern. Any help would be appreciated.
Based on comments your url field is an analyzed field. So when you insert data the data will be tokenized as ["www.searchtechnologies.com", "v", "jquery", "gOdOgfykTFJnypePAvGweyMPwl", ...]. So your query wont match this field.
You should delete your index.
Insert a mapping and specify url field as not analyzed {"index":"not_analyzed"}
Insert your data.
Run wildcard query.
If you dont want to delete your index because a downtime check: https://www.elastic.co/blog/changing-mapping-with-zero-downtime

How to filter out elements from an array that doesn’t match the query?

stackoverflow won't let me write that much example code so I put it on gist.
So I have this index
with this mapping
here is a sample document I insert into newly created mapping
this is my query
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
this is the unwanted result I get from previous query
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
and finally the wanted result, how I want the query result to look like
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
]
}
}
]
}
}
How should the query look like to achieve the wanted result with filtered array field which matches the query? In other words, all other non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify result terms matching the query, or possibly, roll-your-own client-side masking using info returned from setting "explain": true in your query.

How can an aggregation be greater than the total number of hits?

I have records of the type
{
"_index": "constant",
"_type": "host",
"_id": "AU7TX249tNLhGJRMfUXb",
"_score": 1,
"_source": {
"private": true,
"host-ip": "172.22.69.64",
}
}
If I look for aggregates of private and host-ip via
POST constant/host/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs":{
"test":{
"cardinality":{
"field": "host-ip"
}
},
"test2":{
"cardinality":{
"field": "private"
}
}
}
}
I get as a result
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 7730,
"max_score": 0,
"hits": []
},
"aggregations": {
"test": {
"value": 7860
},
"test2": {
"value": 2
}
}
}
My understanding of the result above is the following:
there is a total of 7730 documents of type host in the index constant
there are two different values for private (this is expected)
What I do not understand is how it is possible to have 7860 distinct values of host-ip when the total number of documents in the index is 7730?
Is my understanding of total in hits correct?
Cardinality aggregation is not exact. As the doc says:
A single-value metrics aggregation that calculates an approximate count of distinct values
So, that's the reason behind the greater number.
You can play with the option precision_threshold to make results more accurate, but it will consume more resources.

Resources