How to return all documents where a string occurs in the document at least N times - elasticsearch

If I wanted to return all documents that contain the term beetlejuice, I could use a query like
{
"bool":{
"should":[
{
"term":{
"description":"beetlejuice"
}
}
]
}
}
What's not clear is how to return all documents where the description field contains the string beetlejuice at least 3 times within it. I see minimum_should_match, but I think that is to be used for separate queries in a bool. How can I craft a query to match when a word occurs at least N times within the document's description field?

You can use scripting for achieving what you desired.
Basically, all you need is the term frequency of the desired term in the document field and you can access the value using scripts.
_index['FIELD']['TERM'].tf()
Sample Filter Script :
"filter" : {
"script" : {
"script" : "_index['description']['beetlejuice'].tf() > N",
"params" : {
"N" : 2
}
}
}

Related

How to limit elasticsearch to a list of documents each identified by a unique keyword

I have an elasticsearch document repository with ~15M documents.
Each document has an unique 11-char string field (comes from a mongo DB) that is unique to the document. This field is indexed as keyword.
I'm using C#.
When I run a search, I want to be able to limit the search to a set of documents that I specify (via some list of the unique field ids).
My query text uses bool with must to supply a filter for the unique identifiers and additional clauses to actually search the documents. See example below.
To search a large number of documents, I generate multiple query strings and run them concurrently. Each query handles up to 64K unique ids (determined by the limit on terms).
In this case, I have 262,144 documents to search (list comes, at run time, from a separate mongo DB query). So my code generates 4 query strings (see example below).
I run them concurrently.
Unfortunately, this search takes over 22 seconds to complete.
When I run the same search but drop the terms node (so it searches all the documents), a single such query completes the search in 1.8 seconds.
An incredible difference.
So my question: Is there an efficient way to specify which documents are to be searched (when each document has a unique self-identifying keyword field)?
I want to be able to specify up to a few 100K of such unique ids.
Here's an example of my search specifying unique document identifiers:
{
"_source" : "talentId",
"from" : 0,
"size" : 10000,
"query" : {
"bool" : {
"must" : [
{
"bool" : {
"must" : [ { "match_phrase" : { "freeText" : "java" } },
{ "match_phrase" : { "freeText" : "unix" } },
{ "match_phrase" : { "freeText" : "c#" } },
{ "match_phrase" : { "freeText" : "cnn" } } ]
}
},
{
"bool" : {
"filter" : {
"bool" : {
"should" : [
{
"terms" : {
"talentId" : [ "goGSXMWE1Qg", "GvTDYS6F1Qg",
"-qa_N-aC1Qg", "iu299LCC1Qg",
"0p7SpteI1Qg", ... 4,995 more ... ]
}
}
]
}
}
}
}
]
}
}
}
#jarmod is right.
But if you don't wanna completely redo your architecture, is there some other single talent-related shared field you could query instead of thousands of talendIds? It could be one more simple match_phrase query.

Elasticsearch. Painless script to search based on the last result

Let's see if someone could shed a light on this one, which seems to be a little hard.
We need to correlate data from multiple index and various fields. We are trying painless script.
Example:
We make a search in an index to gather data about the queueid of mails sent by someone#domain
Once we have the queueids, we need to store the queueids in an array an iterate over it to make new searchs to gather data like email receivers, spam checks, postfix results and so on.
Problem: Hos can we store the data from one search and use it later in the second search?
We are testing something like:
GET here_an_index/_search
{
"query": {
"bool" : {
"must": [
{
"range": {
"#timestamp": {
"gte": "now-15m",
"lte": "now"
}
}
}
],
"filter" : {
"script" : {
"script" : {
"source" : "doc['postfix_from'].value == params.from; qu = doc['postfix_queueid'].value; return qu",
"params" : {
"from" : "someona#mdomain"
}
}
}
}
}
}
}
And, of course, it throws an error.
"doc['postfix_from'].value ...",
"^---- HERE"
So, in a nuttshell: is there any way ti execute a search looking for some field value based on a filter (like from:someone#dfomain) and use this values on later searchs?
We have evaluated using script fields or nested, but due to some architecture reasons and what those changes would entail, right now, can not be used.
Thank you very much!

Elasticsearch : constant_score query vs bool.filter query

I am trying to achieve an exact match result using Elasticsearch (so I don't care about scoring here)
I see that there are 2 ways to do this :
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"exact_match_field" : "hello world !"
}
}
}
}
}
or
{
"query": {
"bool": {
"filter": {
"term": {
"exact_match_field": "hello world !"
}
}
}
}
}
Both work and gives me the result I want. Whats the difference between them ? Are there performance benefits of using one vs the other ?
(I am using Elasticsearch V 5.6)
Thanks !
Constant score query gives an equal score to any matching document irrespective of any scoring factors like TF, IDF etc. This can be used when you don't care whether how much a doc matched but just if a doc matched or not and give a score too, unlike filter.
A constant_score query takes a boost argument that is set as the score for every returned document when combined with other queries. By default boost is set to 1.
If you are interested below link will give you more insight
https://www.compose.com/articles/elasticsearch-query-time-strategies-and-techniques-for-relevance-part-ii/

Understanding boosting in ElasticSearch

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example
"query":
{
"bool":{
"should":[
{
"match":{
"title":{
"query":"cat",
"boost":2
}
}
},
{
"match":{
"content":{
"query":"cat",
"boost":2
}
}
}
]
}
}
Does that mean that the IDFs of the term "cat" is going to be boosted / multiplied by a factor of 2?
Also, what happens if instead of search for one word I have a sentence? Would that mean that the IDFs of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but that seems a little confusing.
The boost is used when query with multi query clauses, example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the above query, it means clause1 is three times important than clause3, clause2 is the twice important than clause2, It's not simply multiply 3, 2, because when calculate score, because there is normalized for scores.
also if you just query with one query clause with boost, it's not useful.
An usage scenario for using boost:
A set of page document set with title and content field.
You want to search title and content with some terms, and you think title is more important than content when search these documents. so you can set title query clause boost more than content. Such as if your query hit one document by title field, and one hit document by content field, and you want to hit title field's document prior to the content field document. so boost can help you do it.

Can you refer to and filter on a script field in a query expression, after defining it?

I'm new to ElasticSearch and was wondering, once you define a script field with mvel syntax, can you subsequently filter on, or refer to it in the query body as if it was any other field?
I can't find any examples of this while same time I don't see any mention of whether this is possible on the docs page
http://www.elasticsearch.org/guide/reference/modules/scripting/
http://www.elasticsearch.org/guide/reference/api/search/script-fields/
The book ElasticSearch Server doesn't mention if this is possible or not either
As for 2018 and Elastic 6.2 it is still not possible to filter by fields defined with script_fields, however, you can define custom script filter for the same purpose. For example, lets assume that you've defined the following script field:
{
"script_fields" : {
"some_date_fld_year":"doc["some_date_fld"].empty ? null : doc["some_date_fld"].date.year"
}
}
you can filter by it with
{
"query": {
"bool" : {
"must" : {
"script" : {
"script" : {
"source": " (doc["some_date_fld"].empty ? null : doc["some_date_fld"].date.year) >= 2017",
"lang": "painless"
}
}
}
}
}
}
It's not possible for one simple reason: the script_fields are calculated during final stage of search (fetch phase) and only for the records that you retrieve (top 10 by default). The script filter is applied to all records that were not filtered out by preceding filters and it happens during query phase, which precedes the fetch phase. In other words, when filters are applied the script_fields don't exist yet.

Resources