Elasticsearch: Filter (or Query) by Term Frequency - elasticsearch

How do I run an Elasticsearch query that only returns documents in which the term X is mentioned at least Y times?
For example, suppose you had a footer in all of your indexed documents that says something like copyright 2013. Suppose that when the user runs a search for the term copyright, you want to be smart and only show those documents that contain the word copyright at least twice (otherwise you'll return all documents). I know there are multiple ways of accomplishing this, but one way would be to run a filter that returns only those documents that use the term copyright at least twice. Does such a filter exist?
I could envision something like this, but I don't see anything comparable in the docs:
"filter" : {
  "term" : { "user" : "copyright" },
  "frequency" : { "gt" : 1 }
}
Considering that Elasticsearch stores term frequencies, I would expect that this would be possible to implement.

Use a script filter that reads the term frequency of copyright in the field user with _index['user']['copyright'].tf(). Note that the _index script variable and the filtered query used below exist only in older Elasticsearch releases:
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "_index['user'][term_to_lookup].tf() > occurrences",
          "params": {
            "term_to_lookup": "copyright",
            "occurrences": 1
          }
        }
      }
    }
  }
}
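To make the intended behaviour concrete, here is a minimal Python sketch of the check the script filter performs, run locally on in-memory documents rather than inside Elasticsearch. The lowercase word-split tokenizer and the helper names (term_frequency, filter_by_term_frequency) are assumptions for illustration only; a real analyzer may tokenize differently.

```python
import re
from collections import Counter


def term_frequency(text, term):
    """Count how often `term` occurs in `text` after a naive lowercase word split."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)[term.lower()]


def filter_by_term_frequency(docs, field, term, occurrences):
    """Keep documents whose `field` mentions `term` more than `occurrences` times."""
    return [d for d in docs if term_frequency(d.get(field, ""), term) > occurrences]


# Hypothetical sample documents; the second one mentions the term twice.
docs = [
    {"id": 1, "user": "Notes on licensing. Copyright 2013"},
    {"id": 2, "user": "How copyright law works. Copyright 2013"},
]
matching_ids = [d["id"] for d in filter_by_term_frequency(docs, "user", "copyright", 1)]
print(matching_ids)
```

As in the script above, the comparison is strict, so occurrences set to 1 keeps only documents that mention the term at least twice.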

Related

Elasticsearch. Painless script to search based on the last result

Let's see if someone can shed some light on this one, which seems to be a little hard.
We need to correlate data from multiple indices and various fields. We are trying a Painless script.
Example:
We run a search in an index to gather the queueids of mails sent by someone@domain.
Once we have the queueids, we need to store them in an array and iterate over it to run new searches that gather data such as email receivers, spam checks, postfix results and so on.
Problem: How can we store the data from one search and use it later in the second search?
We are testing something like:
GET here_an_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-15m",
              "lte": "now"
            }
          }
        }
      ],
      "filter": {
        "script": {
          "script": {
            "source": "doc['postfix_from'].value == params.from; qu = doc['postfix_queueid'].value; return qu",
            "params": {
              "from": "someona@mdomain"
            }
          }
        }
      }
    }
  }
}
And, of course, it throws an error.
"doc['postfix_from'].value ...",
"^---- HERE"
So, in a nutshell: is there any way to execute a search that looks up some field value based on a filter (like from:someone@dfomain) and use those values in later searches?
We have evaluated using script fields or nested documents, but because of architectural constraints and what those changes would entail, they cannot be used right now.
Thank you very much!
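One common way around this is to chain the two searches on the client side, since a script filter can only return true or false for each document and does not carry values over to a later request. The sketch below is only an illustration under stated assumptions: the field names come from the question (with the timestamp field assumed to be @timestamp), postfix_from is assumed to be mapped so that a term query matches it exactly, and fake_response is a hand-written stand-in for the JSON body of the first search response.

```python
def build_first_query(sender, minutes=15):
    """First search: recent mails from `sender`, returning only the queue id field."""
    return {
        "_source": ["postfix_queueid"],
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m", "lte": "now"}}},
                    {"term": {"postfix_from": sender}},
                ]
            }
        },
    }


def extract_queue_ids(response):
    """Collect the distinct queue ids from a search response body."""
    ids = {
        hit["_source"]["postfix_queueid"]
        for hit in response["hits"]["hits"]
        if "postfix_queueid" in hit.get("_source", {})
    }
    return sorted(ids)


def build_second_query(queue_ids):
    """Second search: every document that belongs to one of the collected queue ids."""
    return {"query": {"bool": {"filter": [{"terms": {"postfix_queueid": queue_ids}}]}}}


# Hypothetical response of the first search, reduced to the parts used here.
fake_response = {
    "hits": {
        "hits": [
            {"_source": {"postfix_queueid": "B2"}},
            {"_source": {"postfix_queueid": "A1"}},
            {"_source": {"postfix_queueid": "B2"}},
        ]
    }
}
queue_ids = extract_queue_ids(fake_response)
second_query = build_second_query(queue_ids)
print(queue_ids)
```

The bodies returned by build_first_query and build_second_query would be sent with whatever client is already in use; only the hand-off between the two steps is shown here.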

Elasticsearch : constant_score query vs bool.filter query

I am trying to achieve an exact match result using Elasticsearch (so I don't care about scoring here)
I see that there are 2 ways to do this :
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "exact_match_field": "hello world !"
        }
      }
    }
  }
}
or
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "exact_match_field": "hello world !"
        }
      }
    }
  }
}
Both work and give me the result I want. What's the difference between them? Are there performance benefits to using one over the other?
(I am using Elasticsearch V 5.6)
Thanks !
A constant_score query gives every matching document the same score, irrespective of scoring factors such as TF and IDF. Use it when you only care whether a document matched, not how well it matched, but still want each hit to carry a score; a bool query that contains only filter clauses returns a score of 0 instead.
A constant_score query takes a boost argument that is used as the score of every returned document, which matters when it is combined with other queries. By default boost is set to 1.
If you are interested, the link below gives more insight:
https://www.compose.com/articles/elasticsearch-query-time-strategies-and-techniques-for-relevance-part-ii/

Elasticsearch query based on two values

I am trying to use Elasticsearch to find documents with a rule based on two document properties.
Let's say the documents have the following structure:
{
  "customer_payment_timestamp": 14387930787,
  "customer_delivery_timestamp": 14387230787
}
I would like to query this kind of document and find all documents where customer_payment_timestamp is greater than customer_delivery_timestamp.
I checked the official documentation, but I couldn't find any relevant example regarding the query itself or a pre-mapped field. Is it even possible?
You can achieve this with a script filter like this:
POST index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": "doc.customer_payment_timestamp.value > doc.customer_delivery_timestamp.value"
        }
      }
    }
  }
}
Note: you need to make sure that dynamic scripting is enabled.
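As a rough sketch, the same request can be assembled in Python, alongside a local predicate that mirrors the comparison so the expected result can be checked without a cluster. The helper names are hypothetical, and the nested script object with source and lang keys is an assumption about the Elasticsearch version in use (older releases expect a different key for the script text).

```python
def build_payment_after_delivery_query():
    """Build a script-filter body comparing the two timestamp fields per document."""
    source = (
        "doc['customer_payment_timestamp'].value > "
        "doc['customer_delivery_timestamp'].value"
    )
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {"script": {"source": source, "lang": "painless"}}
                }
            }
        }
    }


def payment_after_delivery(doc):
    """Local stand-in for the script: True when payment is later than delivery."""
    return doc["customer_payment_timestamp"] > doc["customer_delivery_timestamp"]


# The first document is the one from the question; the second is made up.
docs = [
    {"customer_payment_timestamp": 14387930787, "customer_delivery_timestamp": 14387230787},
    {"customer_payment_timestamp": 14387000000, "customer_delivery_timestamp": 14387230787},
]
late_payments = [d for d in docs if payment_after_delivery(d)]
print(len(late_payments))
```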

Productsearch with Elasticsearch

I am relatively new to elasticsearch and I want to perform a search for products with brand and type names.
I already tried a bit but I think I am missing something important to have a solid search algorithm. Here is my approach:
A product looks e.g. like this:
{
  brandName: "Samsung",
  typeName: "PS-50Q7HX",
  ...
}
I will have a single input field. The user can search for a brand/type only or for a brand in combination with a type name. E.g.
Samsung | Samsung PS-50Q7HX | PS-50Q7HX
To tolerate mistyping in the typeName field I use an ngram tokenizer, which works great when I search for types only. But in combination with the brandName field I run into trouble. Using something like this does not work well (especially when I use an ngram tokenizer on the brandName field too):
{
  "query": {
    "multi_match": {
      "query": "Samsung PS 50Q 7HX",
      "type": "cross_fields",
      "fields": ["brandName", "typeName"]
    }
  }
}
Of course I know why this does not work well with two ngram tokenizers and a mixed field, but I am not sure how best to solve it.
I think the main problem is that I do not know whether the user entered a brand name or not. I thought about using a second index filled with all available brands, which I would use to perform a "pre-search" for a brand name that may be present in my query string. If I find a match, I can split the search string into type and brand name and perform a more specific search, like this one:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "brandName": "Samsung" } },
        { "match": { "typeName": "PS-50Q7HX" } }
      ]
    }
  }
}
Does this sound like a good approach? Or does anyone see a better way?
Any help is appreciated!
Thank you very much and best regards,
Stefan
To tolerate user typos you used an ngram analyzer, which is costly. You could use a stemming analyzer instead, which offers some flexible options for handling such mistakes.
In my opinion, instead of indexing this in two different fields, you could index it as a single field.
e.g. "FIELD_NAME": "Samsung|PS-50Q7HX"
Join the brand name and product name with a delimiter (I used |) and analyze the field by splitting on that delimiter, so the content is indexed as follows:
Samsung
PS-50Q7HX
Then you could search by the following query
{
  "query": {
    "query_string": {
      "query": "Samsung PS-50Q7HX",
      "default_operator": "or",
      "fields": [
        "FIELD_NAME"
      ]
    }
  }
}
This will retrieve the documents that have the brand name Samsung or the product name PS-50Q7HX. You could also use a prefix search, and if you set default_operator to and, the search will be more precise.
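The brand "pre-search" idea described in the question could be sketched roughly as follows. This is only an illustration: the in-memory known_brands set stands in for the second index of brands, the helper names are hypothetical, and the split only handles single-word brand names.

```python
def split_brand_and_type(user_input, known_brands):
    """Split free-text input into (brand, type_text) using a set of known brand names."""
    brands_by_lower = {b.lower(): b for b in known_brands}
    brand = None
    rest = []
    for token in user_input.split():
        # The first token that is a known brand is taken as the brand name.
        if brand is None and token.lower() in brands_by_lower:
            brand = brands_by_lower[token.lower()]
        else:
            rest.append(token)
    return brand, " ".join(rest)


def build_product_query(user_input, known_brands):
    """Build a bool query with one match clause per part that was recognised."""
    brand, type_text = split_brand_and_type(user_input, known_brands)
    must = []
    if brand:
        must.append({"match": {"brandName": brand}})
    if type_text:
        must.append({"match": {"typeName": type_text}})
    return {"query": {"bool": {"must": must}}}


known_brands = {"Samsung", "Sony"}  # stand-in for the brand index
query = build_product_query("samsung PS-50Q7HX", known_brands)
print(query)
```

When no brand is recognised, the whole input goes to typeName, which keeps the ngram matching on that field unchanged for type-only searches.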

Filter items which array contains any of given values

I have a set of documents like
{
  tags: ['a', 'b', 'c']
  // ... a bunch of properties
}
As stated in the title: is there a way to filter all documents containing any of the given tags using NEST?
For instance, the record above would match ['c','d']
Or should I build multiple "OR"s manually ?
elasticsearch 2.0.1:
There's also a terms query, which should save you some work. Here is an example from the docs:
{
  "terms": {
    "tags": [ "blue", "pill" ],
    "minimum_should_match": 1
  }
}
Under the hood it constructs a boolean should. So it's basically the same thing as above, but shorter.
There's also a corresponding terms filter.
So to summarize, your query could look like this:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "terms": {
        "tags": ["c", "d"]
      }
    }
  }
}
With a greater number of tags this could make quite a difference in length.
Edit: The bitset stuff below is maybe an interesting read, but the answer itself is a bit dated. Some of this functionality is changing around in 2.x. Also Slawek points out in another answer that the terms query is an easy way to DRY up the search in this case. Refactored at the end for current best practices. —nz
You'll probably want a Bool Query (or more likely a Filter alongside another query), with a should clause.
The bool query has three main properties: must, should, and must_not. Each of these accepts another query, or an array of queries. The clause names are fairly self-explanatory; in your case, the should clause may specify a list of filters, and a match against any one of them will return the document you're looking for.
From the docs:
In a boolean query with no must clauses, one or more should clauses must match a document. The minimum number of should clauses to match can be set using the minimum_should_match parameter.
Here's an example of what that Bool query might look like in isolation:
{
  "bool": {
    "should": [
      { "term": { "tag": "c" }},
      { "term": { "tag": "d" }}
    ]
  }
}
And here's another example of that Bool query as a filter within a more general-purpose Filtered Query:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "bool": {
        "should": [
          { "term": { "tag": "c" }},
          { "term": { "tag": "d" }}
        ]
      }
    }
  }
}
Whether you use Bool as a query (e.g., to influence the score of matches), or as a filter (e.g., to reduce the hits that are then being scored or post-filtered) is subjective, depending on your requirements.
It is generally preferable to use Bool in favor of an Or Filter, unless you have a reason to use And/Or/Not (such reasons do exist). The Elasticsearch blog has more information about the different implementations of each, and good examples of when you might prefer Bool over And/Or/Not, and vice-versa.
Elasticsearch blog: All About Elasticsearch Filter Bitsets
Update with a refactored query...
Now, with all of that out of the way, the terms query is a DRYer version of all of the above. It does the right thing with respect to the type of query under the hood, it behaves the same as the bool + should using the minimum_should_match options, and overall is a bit more terse.
Here's that last query refactored a bit:
{
  "filtered": {
    "query": {
      "match": { "title": "hello world" }
    },
    "filter": {
      "terms": {
        "tag": [ "c", "d" ],
        "minimum_should_match": 1
      }
    }
  }
}
Whilst this is an old question, I ran into this problem myself recently, and some of the answers here are now deprecated (as the comments point out). So for the benefit of others who may have stumbled here:
A term query can be used to find the exact term specified in the inverted index:
{
  "query": {
    "term": { "tags": "a" }
  }
}
From the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
Alternatively you can use a terms query, which will match all documents with any of the items specified in the given array:
{
  "query": {
    "terms": { "tags": ["a", "c"] }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
One gotcha to be aware of (which caught me out): how you map the field also makes a difference. If the field you're searching has been indexed as a text type, then Elasticsearch will perform a full-text search (i.e. using an analyzed string).
If you've indexed the field as a keyword, then a keyword search using a 'non-analyzed' string is performed. This can have a massive practical impact, as analyzed strings are pre-processed (lowercased, punctuation dropped, etc.). See https://www.elastic.co/guide/en/elasticsearch/guide/master/term-vs-full-text.html
To avoid these issues, the string field has been split into two new types: text, which should be used for full-text search, and keyword, which should be used for keyword search. (https://www.elastic.co/blog/strings-are-dead-long-live-strings)
For those looking at this in 2020: you may notice that the accepted answer is deprecated, but a similar approach is available using the combination of terms_set and minimum_should_match_script.
Please see the detailed answer here in the SO thread
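For orientation, a terms_set request along those lines might be built as in the sketch below. This is a hedged illustration, not the linked answer: the helper names are hypothetical, and using a constant script as the minimum_should_match_script source is an assumption made here instead of reading the count from a per-document field. The second helper applies the same "at least N of these values" rule to plain Python lists so the behaviour can be checked locally.

```python
def build_terms_set_query(field, values, minimum):
    """Build a terms_set body requiring at least `minimum` of `values` to match."""
    return {
        "query": {
            "terms_set": {
                field: {
                    "terms": list(values),
                    # Constant script; assumed here instead of a per-document field.
                    "minimum_should_match_script": {"source": str(int(minimum))},
                }
            }
        }
    }


def matches_at_least(doc_values, wanted, minimum):
    """Local stand-in for the same rule: count how many wanted values the doc has."""
    return len(set(doc_values) & set(wanted)) >= minimum


query = build_terms_set_query("tags", ["a", "c", "d"], 2)
print(query["query"]["terms_set"]["tags"]["minimum_should_match_script"])
print(matches_at_least(["a", "b", "c"], ["a", "c", "d"], 2))
```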
You should use a Terms Query:
{
  "query": {
    "terms": {
      "tags": ["c", "d"]
    }
  }
}
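The question asks about NEST, but the shape of the request is the same from any client. As a small Python sketch under that assumption, the helper below builds the terms query in filter context, and matches_any reproduces its "any of the given values" semantics on plain lists. Both helper names are hypothetical, and the sample documents are made up apart from the tags from the question.

```python
def build_tags_filter(tags):
    """Build a bool/filter body that matches documents having any of `tags`."""
    return {"query": {"bool": {"filter": {"terms": {"tags": list(tags)}}}}}


def matches_any(doc_tags, wanted):
    """Local stand-in for the terms query: True if at least one tag overlaps."""
    return not set(doc_tags).isdisjoint(wanted)


docs = [
    {"id": 1, "tags": ["a", "b", "c"]},
    {"id": 2, "tags": ["x", "y"]},
]
wanted = ["c", "d"]
hits = [d["id"] for d in docs if matches_any(d["tags"], wanted)]
print(hits)
```

Because the match is on exact terms, this only behaves as expected when the tags field is indexed without analysis, as noted in the answer above about text versus keyword.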
