I don't get any documents back from my Elasticsearch query. Can someone point out my mistake?

I thought I had figured out Elasticsearch but I suspect I have failed to grok something, and hence this problem:
I am indexing products, which have a huge number of fields, but the ones in question are:
{
"show_in_catalogue": {
"type": "boolean",
"index": "no"
},
"prices": {
"type": "object",
"dynamic": false,
"properties": {
"site_id": {
"type": "integer",
"index": "no"
},
"currency": {
"type": "string",
"index": "not_analyzed"
},
"value": {
"type": "float"
},
"gross_tax": {
"type": "integer",
"index": "no"
}
}
}
}
I am trying to return all documents where "show_in_catalogue" is true, and there is a price with site_id 1:
{
"filter": {
"term": {
"prices.site_id": "1",
"show_in_catalogue": true
}
},
"query": {
"match_all": {}
}
}
This returns zero results. I also tried an "and" filter with two separate terms - no luck.
A subset of one of the documents returned if I have no filters looks like:
{
"prices": [
{
"site_id": 1,
"currency": "GBP",
"value": 595,
"gross_tax": 1
},
{
"site_id": 2,
"currency": "USD",
"value": 745,
"gross_tax": 0
}
]
}
I hope it is OK to omit so much of the document here; I don't believe the rest is relevant, but I cannot be certain, of course.
Have I missed a vital piece of knowledge, or have I done something terminally thick? Either way, I would be grateful for an expert's knowledge at this point. Thanks!
Edit:
At the suggestion of J.T. I also tried reindexing the documents so that prices.site_id was indexed - no change. Also tried the bool/must filter below to no avail.
To clarify, the reason I'm using an empty query is that the web interface may supply a query string, but the same code is used to simply filter all products. Hence I left in the query, but empty, since that's what Elastica seems to produce with no query string.
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}
}
}

You have site_id set as {"index": "no"}. This tells Elasticsearch to exclude the field from the index, which makes it impossible to query or filter on that field. The data will still be stored. Conversely, you can set a field to be in the index and searchable, but not stored. Note that show_in_catalogue is mapped with "index": "no" as well, so the same applies to it.
I'm new to Elasticsearch as well and can't always grok the questions! I'm actually confused by your query. If you are going to "just filter", then you don't need a query. What I don't understand is your use of two fields inside the term filter. I've never done this. I guess it acts as an OR? Also, if nothing matches, it seems to return everything. If you want a query whose results are then filtered, you would want to use a filtered query:
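To make the point about "index": "no" concrete, here is a minimal Python sketch that walks the mapping from the question and lists every field that cannot be queried or filtered. The helper name unsearchable_fields is hypothetical and not part of any Elasticsearch client; it only inspects a mapping held as a plain dict.

```python
def unsearchable_fields(properties, prefix=""):
    """Return dotted names of fields mapped with "index": "no"."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("index") == "no":
            found.append(path)
        # Object fields carry their sub-fields under "properties".
        found.extend(unsearchable_fields(spec.get("properties", {}), path + "."))
    return found


# The mapping from the question, written as a Python dict.
mapping = {
    "show_in_catalogue": {"type": "boolean", "index": "no"},
    "prices": {
        "type": "object",
        "dynamic": False,
        "properties": {
            "site_id": {"type": "integer", "index": "no"},
            "currency": {"type": "string", "index": "not_analyzed"},
            "value": {"type": "float"},
            "gross_tax": {"type": "integer", "index": "no"},
        },
    },
}

print(unsearchable_fields(mapping))
# ['show_in_catalogue', 'prices.site_id', 'prices.gross_tax']
```

Both fields used in the question's filters show up in that list, which would explain why reindexing prices.site_id alone made no difference.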
-d '{
"query": {
"filtered": {
"query": {},
"filter": {}
}
}
}'
If you just want to apply filters, this is the filter that should work without any "query" being necessary:
-d '{
"filter": {
"bool": {
"must": [
{
"term": {
"show_in_catalogue": true
}
},
{
"term": {
"prices.site_id": 1
}
}
]
}
}
}'
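Since the question mentions that the same code path serves both "query string plus filters" and "filters only", the two shapes above could be produced by one small builder. This is only a sketch in Python: build_search_body is a hypothetical helper, it emits the filtered syntax used in this thread, and it assumes every filter is a simple term filter on an indexed field.

```python
def build_search_body(filters, query_string=None):
    """Build a filtered query; fall back to match_all without a query string."""
    # One term filter per field, all of which must match.
    must = [{"term": {field: value}} for field, value in filters.items()]
    if query_string:
        query = {"query_string": {"query": query_string}}
    else:
        query = {"match_all": {}}
    return {
        "query": {
            "filtered": {
                "query": query,
                "filter": {"bool": {"must": must}},
            }
        }
    }


body = build_search_body({"show_in_catalogue": True, "prices.site_id": 1})
```

With no query string, body has the same structure as the bool/must request shown in the question's edit.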

Related

ElasticSearch - score boosting using scripting

We have a specific use-case for our ElasticSearch instance: we store documents which contain proper names, dates of birth, addresses, ID numbers, and other related info.
We use a name-matching plugin which overrides the default scoring of ES and assigns a relevancy score between 0 and 1 based on how closely the name matches.
What we need to do is boost that score by a certain amount if other fields match. I have started to read up on ES scripting to achieve this. I need assistance on the script part of the query. Right now, our query looks like this:
{
"size":100,
"query":{
"bool":{
"should":[
{"match":{"Name":"John Smith"}}
]
}
},
"rescore":{
"window_size":100,
"query":{
"rescore_query":{
"function_score":{
"doc_score":{
"fields":{
"Name":{"query_value":"John Smith"},
"DOB":{
"function":{
"function_score":{
"script_score":{
"script":{
"lang":"painless",
"params":{
"query_value":"01-01-1999"
},
"inline":"if **<HERE'S WHERE I NEED ASSISTANCE>**"
}
}
}
}
}
}
}
}
},
"query_weight":0.0,
"rescore_query_weight":1.0
}
}
The Name field will always be required in a query and is the basis for the score, which is returned in the default _score field; for ease of demonstration, we'll just add one additional field, DOB, which if matched, should boost the score by 0.1. I believe I'm looking for something along the lines of if(query_value == doc['DOB'].value add 0.1 to _score), or something along these lines.
So, what would be the correct syntax to be entered into the inline row to achieve this? Or, if the query requires other syntax revision, please advise.
EDIT #1 - it's important to highlight that our DOB field is a text field, not a date field.
Splitting to a separate answer as this solves the problem differently (i.e. - by using script_score as OP proposed instead of trying to rewrite away from scripts).
Assuming the same mapping and data as the previous answer, a scripted version of the query might look like the following:
POST /employee/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"rescore": {
"window_size": 100,
"query": {
"rescore_query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"functions": [
{
"script_score": {
"script": {
"source": "double boost = 0.0; if (params['_source']['State'] == 'FL') { boost += 0.1; } if (params['_source']['DOB'] == '1965-05-24') { boost += 0.3; } return boost;",
"lang": "painless"
}
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
}
Two notes about the script:
The script uses params['_source'][field_name] to access the document, which is the only way to get access to text fields. This is significantly slower, as it requires reading documents directly from disk, though the penalty might not be too bad in the context of a rescore. You could instead use doc[field_name].value if the field were an aggregatable type, such as keyword, date, or something numeric.
DOB here is compared directly to a string. This is possible because we're using the _source field, and the JSON for the documents has the dates specified as strings. This is somewhat brittle, but will likely do the trick.
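To check the arithmetic of the scripted rescore by hand, the following Python sketch mirrors the Painless script and the way the scores are combined. It is a simulation, not Elasticsearch code: the function names are made up, and the name-match score of 0.5 is an assumed stand-in for whatever the name-matching plugin returns.

```python
def script_boost(source):
    """Python mirror of the Painless script in the query above."""
    boost = 0.0
    if source.get("State") == "FL":
        boost += 0.1
    if source.get("DOB") == "1965-05-24":
        boost += 0.3
    return boost


def rescored(original_score, name_score, source,
             query_weight=0.0, rescore_query_weight=1.0):
    # function_score with boost_mode "sum": query score plus function score.
    function_score = name_score + script_boost(source)
    # The rescorer's default score_mode ("total") adds the two weighted scores.
    return query_weight * original_score + rescore_query_weight * function_score


docs = [
    {"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"},
    {"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"},
    {"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"},
]
NAME_SCORE = 0.5  # assumption: identical name score for every document

for doc in docs:
    print(doc["Name"], round(rescored(NAME_SCORE, NAME_SCORE, doc), 2))
```

With query_weight set to 0, the original score drops out and the final score is simply the name score plus whatever boosts the script awarded.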
Assuming static weights per additional field, you can accomplish this without using scripting (though you may need to use script_score for any more complex weighting). To solve your issue of directly adding to a document's original score, your rescoring query will need to be a function score query that:
Composes queries for additional fields in a should clause for the function score's main query (i.e. - will only produce scores for documents matching at least one additional field)
Uses one function per additional field, with the filter set to select documents with some value for that field, and a weight to specify how much the score should increase (or some other scoring function if desired)
Mapping (as template)
Adding a State and DOB field for sake of example (making sure multiple additional fields contribute to the score correctly)
PUT _template/employee_template
{
"index_patterns": ["employee"],
"settings": {
"number_of_shards": 1
},
"mappings": {
"_doc": {
"properties": {
"Name": {
"type": "text"
},
"State": {
"type": "keyword"
},
"DOB": {
"type": "date"
}
}
}
}
}
Sample data
POST /employee/_doc/_bulk
{"index":{}}
{"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"}
{"index":{}}
{"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"}
{"index":{}}
{"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"}
Query
EDIT: Updated the query to include the original query in the new function score in an attempt to compensate for custom scoring plugins.
A few notes about the query below:
Setting the rescorer's score_mode: max is effectively a replace here, since the newly computed function score should only be greater than or equal to the original score
query_weight and rescore_query_weight are both set to 1 such that they are compared on equal scales during score_mode: max comparison
In the function_score query:
score_mode: sum will add together all the scores from functions
boost_mode: sum will add the sum of the functions to the score of the query
POST /employee/_search
{
"size": 100,
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
]
}
},
"rescore": {
"window_size": 100,
"query": {
"rescore_query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"Name": "John"
}
},
{
"match": {
"Name": "Will"
}
}
],
"filter": {
"bool": {
"should": [
{
"term": {
"State": "CA"
}
},
{
"range": {
"DOB": {
"lte": "1968-01-01"
}
}
}
]
}
}
}
},
"functions": [
{
"filter": {
"term": {
"State": "CA"
}
},
"weight": 0.1
},
{
"filter": {
"range": {
"DOB": {
"lte": "1968-01-01"
}
}
},
"weight": 0.3
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"score_mode": "max",
"query_weight": 1,
"rescore_query_weight": 1
}
}
}
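For the script-free query above, the same kind of hand check can be done for the filter/weight functions and the max comparison. This Python sketch is an approximation under stated assumptions: the names are hypothetical, the name-match score inside the rescore query is assumed to equal the original score, and query_weight and rescore_query_weight are both taken as 1, as in the query.

```python
from datetime import date

# (filter predicate, weight) pairs mirroring the two entries in "functions".
FUNCTIONS = [
    (lambda d: d["State"] == "CA", 0.1),
    (lambda d: date.fromisoformat(d["DOB"]) <= date(1968, 1, 1), 0.3),
]


def rescore(original_score, doc):
    weights = [weight for matches, weight in FUNCTIONS if matches(doc)]
    if not weights:
        # The document fails the filter in the rescore query, so it is not
        # rescored and keeps its original score (times query_weight, here 1).
        return original_score
    # score_mode "sum" adds the weights; boost_mode "sum" adds the query score.
    function_score = original_score + sum(weights)
    # Rescorer score_mode "max" keeps the larger of the two scores.
    return max(original_score, function_score)


docs = [
    {"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"},
    {"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"},
    {"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"},
]

for doc in docs:
    print(doc["Name"], round(rescore(0.5, doc), 2))
```

John C. Reilly matches both functions and gains 0.4, Will Ferrell matches only the DOB range and gains 0.3, and John Smith matches neither and keeps the original score.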

promote results in Elasticsearch

I searched the documentation for a way to promote Elasticsearch results when a specific field has a certain value, but I didn't find any good practice. For example, say I have a user who lives in Paris: when that user runs a search, I want the documents that are relevant to Paris to appear first, or at least to be promoted.
There is a lot to this but you want to research "boosting". This can be done at the mapping level or the query level.
Mapping example:
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "keyword",
"boost": 2 <--- 2x boost to the final score
}
}
}
}
}
Query Example:
GET /_search
{
"query": {
"bool": {
"must": {
"match": {
"content": {
"query": "full text search",
"operator": "and"
}
}
},
"should": [
{ "term": {
"location": {
"value": "xxx",
"boost": 3 <--- 3x boost if the location matches
}
}}
]
}
}
}
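Applied to the Paris example from the question, the query-level boost could be assembled per user along these lines. A rough Python sketch: promoted_query is a hypothetical helper, and the field names content and location are taken from the examples above rather than from the asker's real mapping.

```python
def promoted_query(text, field, value, boost=3):
    """Full-text match that must hit, plus an optional boosted term match."""
    return {
        "query": {
            "bool": {
                # Every result has to match the search text.
                "must": {
                    "match": {"content": {"query": text, "operator": "and"}}
                },
                # Matching the user's value is optional but raises the score.
                "should": [
                    {"term": {field: {"value": value, "boost": boost}}}
                ],
            }
        }
    }


body = promoted_query("full text search", "location", "Paris")
```

Because the term clause sits in should, documents outside Paris still match; they just score lower than otherwise comparable documents from Paris.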

Elasticsearch nested significant terms aggregation with background filter

I am having a hard time applying a background filter to a nested significant terms aggregation: the bg_count is always 0.
I'm indexing article views that have ids and timestamps, and have multiple applications on a single index. I want the foreground and background sets to relate to the same application, so I'm trying to apply a term filter on the app_id field both in the bool query and in the background filter. article_views is a nested object since I also want to be able to query on views with a range filter on timestamp, but I haven't got to that yet.
Mapping:
{
"article_views": {
"type": "nested",
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
},
"app_id": {
"type": "string",
"index": "not_analyzed"
}
}
Query:
{
"aggregations": {
"articles": {
"nested": {
"path": "article_views"
},
"aggs": {
"articles": {
"significant_terms": {
"field": "article_views.id",
"size": 5,
"background_filter": {
"term": {
"app_id": "17"
}
}
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"app_id": "17"
}
},
{
"nested": {
"path": "article_views",
"query": {
"terms": {
"article_views.id": [
"1",
"2"
]
}
}
}
}
]
}
}
}
As I said, in my result, the bg_count is always 0, which had me worried. If the significant terms is on other fields which are not nested the background_filter works fine.
Elasticsearch version is 2.2.
Thanks
You seem to be hitting the following issue where in your background filter you'd need to "go back" to the parent context in order to define your background filter based on a field of the parent document.
You'd need a reverse_nested query at that point, but that doesn't exist.
One way to circumvent this is to add the app_id field to your nested documents so that you can simply use it in the background filter context.
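The workaround could look roughly like this at indexing time. A minimal Python sketch: copy_app_id_into_views is a hypothetical helper, and it assumes the article_views mapping gains a not_analyzed app_id property so that the background filter can address article_views.app_id.

```python
import copy


def copy_app_id_into_views(doc):
    """Return a copy of doc with the parent's app_id added to each nested view."""
    out = copy.deepcopy(doc)
    for view in out.get("article_views", []):
        view["app_id"] = out["app_id"]
    return out


doc = {
    "app_id": "17",
    "article_views": [
        {"id": "1", "timestamp": "2016-03-01T10:00:00Z"},
        {"id": "2", "timestamp": "2016-03-02T11:30:00Z"},
    ],
}
indexed = copy_app_id_into_views(doc)

# The background filter can then stay inside the nested context.
background_filter = {"term": {"article_views.app_id": "17"}}
```

The cost is some duplicated data in every nested document, in exchange for a background set that can be restricted without leaving the nested scope.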

elasticsearch aggregation on field containing spaces

I have a field that contains spaces called "CompanyName". The CompanyName field contains things like, "ABC Client", "BCD CLIENT 123", "EFG CLIENT HIJ"
When I index the data I set the mapping to "index" : "not_analyzed". When I run an aggregation, without any other queries, it appears to work fine.
The issue I have is that if I want to first run another query and then get an aggregation of those results, the aggregation then breaks because it interprets the spaces in the company names, so it looks like the aggregation is run over the output of the first query and not over the field that I setup.
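To illustrate the symptom, this Python sketch contrasts bucket counts over whole values with bucket counts over individual tokens. It is only an illustration: the lowercase-and-split function is a crude stand-in for an analyzed string field, not Elasticsearch's actual analyzer.

```python
from collections import Counter

names = ["ABC Client", "BCD CLIENT 123", "EFG CLIENT HIJ"]


def analyzed_terms(value):
    # Crude stand-in for analysis: lowercase, then split on whitespace.
    return value.lower().split()


# What a terms aggregation sees when the field is not_analyzed.
not_analyzed = Counter(names)

# What it sees when the field is analyzed into separate tokens.
analyzed = Counter(term for name in names for term in analyzed_terms(name))

print(not_analyzed.most_common())
print(analyzed.most_common())
```

If the buckets look like the second counter (for example a "client" bucket with a count of 3), the field being aggregated is probably analyzed in the index that the query is hitting.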
The query:
{
"size": 0,
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"Stuff": "1"
}
},
{
"term": {
"filename": "FileOfData.sourcedata"
}
}
]
}
}
}
},
"aggs": {
"users": {
"terms": {
"field": "CompanyName"
}
}
}
}
I have also tried adding a custom analyzer using:
"analysis": {
"analyzer": {
"companynamestring": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
And it is still not working. Does anyone know how I can run a query and then get an aggregation that returns only the full CompanyName field and is not tokenized?
Thanks!

Elasticsearch outputs the score of 1.0 for all results when searching for a single "starred" term

We are using Elasticsearch to search for the most relevant companies in a specific catalog. When we use the normal search term like lettering we get reasonable scores and can sort the results according to the score.
However, when we modify the search term before querying and make the "starred" version of it (e.g., *lettering*) to be able to search for substrings we get a score of 1.0 for every result. The search for substrings is a requirement in the project.
Any ideas on what could cause this relevance computation? The problem occurs only when a single term is used. We get comprehensible scores when we use two starred terms in combination (e.g., *lettering* *digital*).
EDIT 1:
Example mapping (YAML; other properties are mapped in the same way, except for boost, which is different for each property):
elasticSearchMapping:
  type: object
  include_in_all: true
  enabled: true
  properties:
    'keywords':
      type: string
      include_in_all: true
      boost: 50
Query:
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [{
"match_all": []
}, {
"query_string": {
"query": "*lettering*"
}
}]
}
},
"filter": {
"bool": {
"must": [{
"term": {
"__parentPath": "/sites/industrycatalog"
}
}, {
"terms": {
"__workspace": ["live"]
}
}, {
"term": {
"__dimensionCombinationHash": "d751713988987e9331980363e24189ce"
}
}, {
"term": {
"__typeAndSupertypes": "IndustryCatalog:Entry"
}
}],
"should": [],
"must_not": [{
"term": {
"_hidden": true
}
}, {
"range": {
"_hiddenBeforeDateTime": {
"gt": "now"
}
}
}, {
"range": {
"_hiddenAfterDateTime": {
"lt": "now"
}
}
}]
}
}
}
},
"fields": ["__path"],
"script_fields": {
"distance": {
"script": "doc['coordinates'].distanceInKm(51.75631079999999,14.332867899999997)"
}
},
"sort": [{
"customer.featureFlags.industrycatalog": {
"order": "asc"
}
}, {
"_geo_distance": {
"coordinates": {
"lat": "51.75631079999999",
"lon": "14.332867899999997"
},
"order": "asc",
"unit": "km",
"distance_type": "plane"
}
}],
"size": 999999
}
What you are doing is a wildcard query. Wildcard queries fall under term-level queries, and by default a constant score is applied to them.
Check the Lucene documentation: WildcardQuery extends MultiTermQuery.
You can also verify this with the help of the explain API; you will see something like this:
"_explanation": {
"value": 1,
"description": "ConstantScore(company:lettering), product of:",
"details": [{
"value": 1,
"description": "boost"
}, {
"value": 1,
"description": "queryNorm"
}]
}
You can change this behavior with the rewrite parameter.
Try this; rewrite also works with the query string query:
{
"query": {
"wildcard": {
"company": {
"value": "digital*",
"rewrite": "scoring_boolean"
}
}
}
}
It has various options for scoring, see what fits your requirement.
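Since the original request uses query_string rather than wildcard, the same rewrite setting could be applied there. A small Python sketch, where starred_query is a hypothetical helper that only builds the request body; note that scoring_boolean expands the wildcard into one clause per matching term, so it can run into the boolean clause limit on large indices.

```python
def starred_query(term, rewrite="scoring_boolean"):
    """Wrap a term in wildcards and ask for a scoring rewrite."""
    return {
        "query": {
            "query_string": {
                "query": "*" + term + "*",
                # Replaces the default constant-score rewrite.
                "rewrite": rewrite,
            }
        }
    }


body = starred_query("lettering")
```

The resulting query_string clause can replace the one inside the must array of the original filtered query.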
EDIT 1: the reason you see scores other than 1 for *lettering* *digital* is queryNorm; you can again check this with the explain API. If you look closely, all documents matching both terms will share one score, and all documents matching a single term will share another.
P.S.: a leading wildcard is not recommended at all. You will get performance issues, since it has to be checked against every single term in the inverted index. You might want to look into the edge ngram or ngram token filter instead.
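To show why n-grams can stand in for wildcards, here is a plain Python sketch of the two tokenization ideas. It illustrates the concept only and is not Elasticsearch's implementation; the function names and gram sizes are arbitrary choices.

```python
def ngrams(term, min_gram, max_gram):
    """All substrings of term with length between min_gram and max_gram."""
    return [
        term[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(term) - n + 1)
    ]


def edge_ngrams(term, min_gram, max_gram):
    """Prefixes of term with length between min_gram and max_gram."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


print(ngrams("lettering", 3, 3))
# ['let', 'ett', 'tte', 'ter', 'eri', 'rin', 'ing']
print(edge_ngrams("lettering", 2, 4))
# ['le', 'let', 'lett']
```

With such tokens stored in the index, a substring search becomes an ordinary term lookup that is scored normally, instead of a scan over every term. Edge n-grams only cover prefixes, so matching in the middle of a word needs the full n-gram variant.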
Hope this helps!
