Elasticsearch filtered query with script for term frequency

I'm using the attachment plugin: https://github.com/elastic/elasticsearch-mapper-attachments
I'm able to find documents with a specific word in one or more fields, but I can't filter out documents in which that word occurs fewer times than a given threshold.
This works:
POST /crm/employee/_search
{
"query": {"filtered": {
"query": {"match": {
"employee.cv.content": "transitie"
}},
"filter": {
"bool": {
"should": [
{"terms": {
"employee.listEmployeeType.id": [
2
]
}}
]
}
}
}},
"highlight": {"fields": {"employee.cv.content" : {}}}
}
After a long search, I found the following script filter:
"script": {
"script": "crm['employee.cv.content'][lookup].tf() > occurrence",
"params": {
"lookup": "transitie",
"occurrence": 1
}
},
Unfortunately, I can't work out how to fit it into the query. I hope I've explained the issue well enough for someone to give me a push in the right direction!

{
"query": {
"filtered": {
"query": {
"match": {
"employee.cv.content": "transitie"
}
},
"filter": {
"bool": {
"should": [
{
"terms": {
"employee.listEmployeeType.id": [
2
]
}
}
],
"must": [
{
"script": {
"script": "_index['employee.cv.content'][lookup].tf() > occurrence",
"params": {
"lookup": "transitie",
"occurrence": 1
}
}
}
]
}
}
}
},
"highlight": {
"fields": {
"employee.cv.content": {}
}
}
}
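Compared with the question, the query above makes two changes: the script becomes a must clause inside the bool filter, and it reads term statistics through _index rather than crm. A minimal Python sketch of assembling that same body follows. The helper name build_tf_query is hypothetical, and the body assumes the older query syntax used in this post (filtered query, _index scripting), which may not exist in newer Elasticsearch versions.

```python
import json


def build_tf_query(field, term, min_occurrence, type_ids):
    """Build the search body from the answer above (hypothetical helper)."""
    # Script filter: keep only documents where `term` occurs more than
    # `min_occurrence` times in `field`.
    script_filter = {
        "script": {
            "script": "_index['%s'][lookup].tf() > occurrence" % field,
            "params": {"lookup": term, "occurrence": min_occurrence},
        }
    }
    return {
        "query": {
            "filtered": {
                "query": {"match": {field: term}},
                "filter": {
                    "bool": {
                        "should": [
                            {"terms": {"employee.listEmployeeType.id": type_ids}}
                        ],
                        "must": [script_filter],
                    }
                },
            }
        },
        "highlight": {"fields": {field: {}}},
    }


body = build_tf_query("employee.cv.content", "transitie", 1, [2])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST /crm/employee/_search.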

Related

How can I put `must_not` under `filter` in Elasticsearch?

I'd like to use a "not equal" condition in my filter, but it fails with the error: [13:24] [bool] failed to parse field [filter]
"query": {
"bool": {
"filter": [
{
"must_not" : {
"term" : {
"status" : "DECLINED"
}
}
},
{
"term": { "type": "ORDER"}
}
]
}
}
It works if I put the must_not directly under the query's bool, like below. How can I express "not equal" inside the filter?
"query": {
"bool": {
"must_not": {
"term": {
"status": "DECLINED"
}
},
"filter": ...
You probably need one more bool inside the filter:
/_search
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"term": { "type": "ORDER"}
}
],
"must_not": [
{
"term": {
"status": "DECLINED"
}
}
]
}
}
}
}
}
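The pattern in this answer is that filter accepts query clauses, not must_not directly, so the exclusion has to live in a bool query nested inside the filter. A small Python sketch of building that shape, assuming simple term conditions only; the helper name filter_with_exclusions is made up for illustration.

```python
import json


def filter_with_exclusions(required, excluded):
    """Wrap term conditions in a bool query placed inside the filter context.

    `required` and `excluded` are lists of (field, value) pairs.
    """
    def as_terms(pairs):
        return [{"term": {field: value}} for field, value in pairs]

    return {
        "query": {
            "bool": {
                "filter": {
                    "bool": {
                        "must": as_terms(required),
                        "must_not": as_terms(excluded),
                    }
                }
            }
        }
    }


body = filter_with_exclusions([("type", "ORDER")], [("status", "DECLINED")])
print(json.dumps(body, indent=2))
```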

Limit the size per index when searching multiple index in Elastic

I have been following the guidelines from this post:
Full text Search with Multiple index in Elastic Search using NEST C#
I can get the desired output, but how can I limit the number of results for each index within the same DSL query?
POST http://localhost:9200/componenttypeindex%2Cprojecttypeindex/Componenttype%2CProjecttype/_search?pretty=true&typed_keys=true
{
"query": {
"bool": {
"should": [
{
"bool": {
"filter": [
{
"term": {
"_index": {
"value": "componenttypeindex"
}
}
}
],
"must": [
{
"multi_match": {
"fields": [
"Componentname",
"Summary^1.1"
],
"operator": "or",
"query": "test"
}
}
]
}
},
{
"bool": {
"filter": [
{
"term": {
"_index": {
"value": "projecttypeindex"
}
}
}
],
"must": [
{
"multi_match": {
"fields": [
"Projectname",
"Summary^0.3"
],
"operator": "or",
"query": "test"
}
}
]
}
}
]
}
}
}
With your given query, you could use aggregations to group the hits by index and limit the number of hits per index (in this case, to 5):
{
"size": 0,
"query": {
... Same query as above ...
},
"aggs": {
"index_agg": {
"terms": {
"field": "_index",
"size": 20
},
"aggs": {
"hits_per_index": {
"top_hits": {
"size": 5
}
}
}
}
}
}
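Because the request sets size to 0, the documents come back inside the aggregation buckets rather than in the top-level hits. A rough Python sketch of reading them out is below. It assumes the response is already parsed into a dict and that typed_keys is not set (with typed_keys=true, as in the question's URL, the aggregation names in the response carry a type prefix). The sample response is made up for illustration.

```python
def hits_by_index(response):
    """Map each index name to the _source of its top hits."""
    buckets = response["aggregations"]["index_agg"]["buckets"]
    return {
        bucket["key"]: [
            hit["_source"] for hit in bucket["hits_per_index"]["hits"]["hits"]
        ]
        for bucket in buckets
    }


# Hand-written sample in the shape of a terms + top_hits response.
sample_response = {
    "aggregations": {
        "index_agg": {
            "buckets": [
                {
                    "key": "componenttypeindex",
                    "doc_count": 12,
                    "hits_per_index": {"hits": {"hits": [
                        {"_index": "componenttypeindex",
                         "_source": {"Componentname": "test pump"}},
                    ]}},
                },
                {
                    "key": "projecttypeindex",
                    "doc_count": 3,
                    "hits_per_index": {"hits": {"hits": [
                        {"_index": "projecttypeindex",
                         "_source": {"Projectname": "test rig"}},
                    ]}},
                },
            ]
        }
    }
}

grouped = hits_by_index(sample_response)
for index_name, sources in grouped.items():
    print(index_name, sources)
```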

ElasticSearch should with nested and bool must_not exists

With the following mapping:
"categories": {
"type": "nested",
"properties": {
"category": {
"type": "integer"
},
"score": {
"type": "float"
}
}
},
I want to use the categories field to return documents that either:
have a score above a threshold in a given category, or
do not have the categories field
This is my query:
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "categories",
"query": {
"bool": {
"must": [
{
"terms": {
"categories.category": [
<id>
]
}
},
{
"range": {
"categories.score": {
"gte": 0.5
}
}
}
]
}
}
}
},
{
"bool": {
"must_not": [
{
"exists": {
"field": "categories"
}
}
]
}
}
],
"minimum_should_match": 1
}
}
}
It correctly returns documents both with and without the categories field, and orders the results so the ones I want come first, but it doesn't filter out the results with a score below the 0.5 threshold.
Great question.
That is because categories itself is not a field from Elasticsearch's point of view (that is, a field on which an inverted index is created and used for querying/searching), but categories.category and categories.score are.
As a result, the exists query on categories matches no document, so the must_not clause holds for every document, and you get the result you observe.
Modify the query as below, with the exists checks wrapped in a nested query inside must_not, and you should see your use case working correctly.
POST <your_index_name>/_search
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "categories",
"query": {
"bool": {
"must": [
{
"terms": {
"categories.category": [
"100"
]
}
},
{
"range": {
"categories.score": {
"gte": 0.5
}
}
}
]
}
}
}
},
{
"bool": {
"must_not": [ <----- Note this
{
"nested": {
"path": "categories",
"query": {
"bool": {
"must": [
{
"exists": {
"field": "categories.category"
}
},
{
"exists": {
"field": "categories.score"
}
}
]
}
}
}
}
]
}
}
],
"minimum_should_match": 1
}
}
}
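To cross-check what the corrected query is meant to return, the intended rule can be written as a plain Python predicate over a document's _source. This is only a sketch of the logic, not of how Elasticsearch evaluates it, and it assumes every nested object carries both category and score. The function name and sample documents are made up.

```python
def should_return(doc, category_id, threshold=0.5):
    """True if the doc has no categories, or has the given category with
    a score at or above the threshold."""
    categories = doc.get("categories") or []
    if not categories:
        # Second should clause: no nested categories objects at all.
        return True
    # First should clause: one nested object satisfies both conditions.
    return any(
        c["category"] == category_id and c["score"] >= threshold
        for c in categories
    )


docs = [
    {"id": "a", "categories": [{"category": 100, "score": 0.9}]},
    {"id": "b", "categories": [{"category": 100, "score": 0.2}]},
    {"id": "c"},
    {"id": "d", "categories": [{"category": 7, "score": 0.8}]},
]
kept = [d["id"] for d in docs if should_return(d, 100)]
print(kept)
```

Note that both conditions must hold on the same nested object, which is what the nested query guarantees; a high score in a different category does not count.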

"update by query" not working as expected with straight calls

I have a script that calls Elasticsearch with several update_by_query requests.
Here I update the item with id=299966 and change the trash flag, trash=0:
_update_by_query
{
"query": {
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
299966
]
}
}
],
"should": [
]
}
}
},
"script": {
"inline": "ctx._source.trash=0"
}
}
Then I set the item with id=299966 (same item as above) to trash=1:
_update_by_query
{
"query": {
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
299966
]
}
}
],
"should": [
]
}
}
},
"script": {
"inline": "ctx._source.trash=1"
}
}
The thing is that after doing these two operations, if I search for the item with id=299966, I get trash=0, when it's supposed to be trash=1, since that was the last one executed. I always maintain the order, and my own log shows that the one with trash=0 is executed first, and then the one with trash=1.
Is there anything in the update_by_query logic that prevents making two calls in a row? Do I have to wait a few seconds or something before making the second update_by_query?
PS: Never mind the doubled query key in the code. It's working OK.
Thanks in advance.
The solution I found is to use _flush after every _update or every _update_by_query.
myindex/_update_by_query
{
"query": {
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
299966
]
}
}
],
"should": [
]
}
}
},
"script": {
"inline": "ctx._source.trash=0"
}
}
myindex/_flush
myindex/_update_by_query
{
"query": {
"query": {
"bool": {
"must": [
{
"terms": {
"_id": [
299966
]
}
}
],
"should": [
]
}
}
},
"script": {
"inline": "ctx._source.trash=1"
}
}
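The call order described above (an update, then a flush, before the next update) could be generated from a script along these lines. This is only a sketch of the request sequence: the helper name is hypothetical, the body uses a single query key rather than the doubled one in the post, and actually sending the requests is left to whatever HTTP client the script already uses.

```python
import json


def trash_update_requests(index, doc_id, flags):
    """Return (method, path, body) tuples: one _update_by_query per flag,
    each followed by a _flush on the same index."""
    requests = []
    for flag in flags:
        body = {
            "query": {"terms": {"_id": [doc_id]}},
            "script": {"inline": "ctx._source.trash=%d" % flag},
        }
        requests.append(("POST", "/%s/_update_by_query" % index, body))
        requests.append(("POST", "/%s/_flush" % index, None))
    return requests


sequence = trash_update_requests("myindex", 299966, [0, 1])
for method, path, body in sequence:
    print(method, path, json.dumps(body) if body is not None else "")
```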

Elasticsearch boost score with nested query

I have the following query in Elasticsearch version 1.3.4:
{
"filtered": {
"query": {
"bool": {
"should": [
{
"bool": {
"should": [
{
"match_phrase": {
"_all": "java"
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"_all": "adobe creative suite"
}
}
]
}
}
]
}
},
{
"bool": {
"should": [
{
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"term": {
"skills.name.original": "java"
}
},
{
"bool": {
"should": [
{
"match": {
"skills.source": {
"query": "linkedin",
"boost": 5
}
}
},
{
"match": {
"skills.source": {
"query": "meetup",
"boost": 5
}
}
}
]
}
}
],
"minimum_should_match": "100%"
}
}
}
}
]
}
}
],
"minimum_should_match": "100%"
}
},
"filter": {
"and": [
{
"bool": {
"should": [
{
"term": {
"skills.name.original": "java"
}
}
]
}
},
{
"bool": {
"should": [
{
"term": {
"skills.name.original": "ajax"
}
},
{
"term": {
"skills.name.original": "html"
}
}
]
}
}
]
}
}
}
Mappings look like this:
skills: {
type: "nested",
include_in_parent: true,
properties: {
name: {
type: "multi_field",
fields: {
name: {type: "string"},
original: {type : "string", analyzer : "string_lowercase"}
}
}
}
}
Finally, the document structure for skills (other parts excluded) looks like this:
"skills":
[
{
"name": "java",
"source": [
"linkedin",
"facebook"
]
},
{
"name": "html",
"source": [
"meetup"
]
}
]
My goal with this query is to first filter out some irrelevant hits with the filters (bottom of the query), then score a person by searching the whole document for the match_phrase "java", with an extra boost if it also contains the match_phrase "adobe creative suite", and then check the nested value where we get a hit in "skills" to see what kind of "source(s)" the skill came from. The query should then get a boost based on which source or sources the nested object has.
This kind of works, or at least I don't get any errors, but the final score is odd and it's hard to see whether it's working. If I give a small boost, let's say 2, the score goes DOWN slightly: my top hit at the moment has a score of 32.176407 with boost = 1. With a boost of 5 it goes down to 31.637703. I would expect it to go up, not down. With a boost of 1000, the score goes down to 2.433376.
Is this the right way to do this, or is there a better/easier way? I could change the structure and mappings etc. And why is my score decreasing?
Edit: I have simplified the query a little, only dealing with one "skill":
{
"filtered": {
"query": {
"bool": {
"must": [
{
"bool": {
"must": [
{
"bool": {
"should": [
{
"match_phrase": {
"_all": "java"
}
}
],
"minimum_should_match": 1
}
}
]
}
}
],
"should": [
{
"nested": {
"path": "skills",
"score_mode": "avg",
"query": {
"bool": {
"must": [
{
"term": {
"skills.name.original": "java"
}
}
],
"should": [
{
"match": {
"skills.source": {
"query": "linkedin",
"boost": 1.2
}
}
},
{
"match": {
"skills.source": {
"query": "meetup",
"boost": 1.2
}
}
}
]
}
}
}
}
]
}
},
"filter": {
"and": [
{
"bool": {
"should": [
{
"term": {
"skills.name.original": "java"
}
}
]
}
}
]
}
}
}
The problem now concerns two similar documents whose only difference is the "source" value on the skill "java": "linkedin" and "meetup" respectively. In my new query, they both get the same boost, so I expect similar scores, but the final _score is very different for the two documents.
From the query explanation for doc 1:
"value": 3.82485,
"description": "Score based on child doc range from 0 to 125"
and for doc two:
"value": 2.1993546,
"description": "Score based on child doc range from 0 to 125"
These values are the only ones that differ, and I can't see why.
I can't answer the question regarding the boost, but how many shards do you have on the index?
TF and IDF are calculated per shard, not per index, and this could be creating the difference in score.
https://groups.google.com/forum/#!topic/elasticsearch/FK-PYb43zcQ.
If you reindex with only 1 shard, does that change the outcome?
Edit: Also, the doc range is the range of docs for each document in the shard and you can use this to calculate IDF for each doc to verify scores.
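To illustrate the per-shard effect, here is a small Python sketch. It assumes the classic Lucene TF/IDF similarity that was the default in Elasticsearch 1.x, where idf = 1 + ln(numDocs / (docFreq + 1)). The shard sizes and document frequencies below are made up; the real values appear in the explain output.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as defined by Lucene's classic TF/IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# Two shards of equal size where the same term is rare on one shard
# and common on the other (hypothetical numbers).
idf_rare_shard = classic_idf(doc_freq=10, num_docs=125)
idf_common_shard = classic_idf(doc_freq=60, num_docs=125)

print("shard where the term is rare:  ", idf_rare_shard)
print("shard where the term is common:", idf_common_shard)
```

Two otherwise identical documents that land on these two shards would be scored with different IDF values for the same term, which is the kind of gap a single-shard reindex would remove.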

Resources