We can search for ALL words in a specific document field like this:
{
  "query": {
    "match": {
      "title": {
        "query": "Black Nike Mens",
        "operator": "and"
      }
    }
  }
}
This searches for the words Black, Nike and Mens in the field title, so that only documents whose title field contains ALL of these words are returned.
But what I am trying to do is a little different.
I want the lookup reversed: if all the words of a document's title field are present in my search query, then that document should be returned.
For example:
Suppose there is a document with title "Nike Free Sparq Mens White" in the Elasticsearch database.
Now if I search with the query "Nike Free Sparq 09 - Mens - White/Black/Varsity Red", it should return this document, because all the words in the document's title exist in my query.
But if I search with the query "Nike Free Lebron - Mens - White/Black", it should NOT return the document, because my query is missing the word Sparq.
This is a sort of reverse-AND-operator search.
Is this possible? If yes, then how?
I finally got it to work but not with a direct method!
This is what I do:
Create a clean list of words from the source query, by:
converting to lower case
replacing any special characters and punctuation with spaces
removing duplicate words
Search using a normal match query with the OR operator, with the words joined into a single string.
The result now contains the most relevant hits.
Take those hits one by one and do a word-for-word check in PHP (or whatever programming language you use): for each hit, verify that every word of the document's title is present among the words of the source query (see the sketch below).
This worked well enough for me! Unless someone has a direct method in the Elasticsearch query language.
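A minimal sketch of steps 1 and 3 in PHP (the variable names $hits and $query, and the field title, are assumptions based on the example above):
<?php
// $hits  - array of hits returned by the OR match query
// $query - the raw source query string

// Normalize text into a unique list of lowercase words:
// special characters and punctuation become spaces, duplicates are dropped.
function wordSet(string $text): array {
    $clean = strtolower(preg_replace('/[^a-zA-Z0-9]+/', ' ', $text));
    return array_unique(array_filter(explode(' ', $clean)));
}

$queryWords = wordSet($query);

$results = [];
foreach ($hits as $hit) {
    $titleWords = wordSet($hit['_source']['title']);
    // Keep the hit only if every word of its title occurs in the query.
    if (empty(array_diff($titleWords, $queryWords))) {
        $results[] = $hit;
    }
}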
The Percolate query should help here. You'd register your documents as queries, making "Nike Free Sparq Mens White" a match query with an AND operator.
Then your search query becomes a document, e.g. one with "Nike Free Sparq 09 - Mens - White/Black/Varsity Red" as its content. You should get "Nike Free Sparq Mens White" back, because all of its terms match.
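A rough sketch of that setup (Elasticsearch 5+ percolator syntax; the index and field names are assumptions):
PUT products
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "query": { "type": "percolator" }
    }
  }
}

PUT products/_doc/1
{
  "query": {
    "match": {
      "title": {
        "query": "Nike Free Sparq Mens White",
        "operator": "and"
      }
    }
  }
}

GET products/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "title": "Nike Free Sparq 09 - Mens - White/Black/Varsity Red"
      }
    }
  }
}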
Unfortunately, this won't scale well (e.g. if you have millions of documents, it might get slow).
I am an Elasticsearch newbie.
How can one make Elasticsearch rank documents that more precisely match the input string higher?
For example, suppose we have the query
{
"query": {
"match": {
"name": "jones"
}
}
}
Suppose we have two documents:
Doc1: "name" : "jones"
Doc2: "name" : "jones jones jones jones jones"
I want Doc1 to be ranked more highly, since it is a more precise match. How can I do this?
(Hopefully, in the most general possible way -- e.g. what if everywhere above 'jones' were replaced with 'fred jones')
Perhaps there are two approaches:
Maybe you can tell ES, "hey, for this query a high term frequency should not be rewarded" (which seems to go against the core of ES, TF-IDF, because it very strongly wants to reward a high TF (term frequency)).
Maybe you can tell ES to "prefer shorter matches over longer ones" (maybe using script_score?).
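For the first idea, something like constant_score might work, since it takes term frequency (and every other scoring signal) out of the picture, so each matching document scores the same. Just a sketch, I have not verified it solves the ranking:
{
  "query": {
    "constant_score": {
      "filter": {
        "match": { "name": "jones" }
      }
    }
  }
}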
Surprised that I can't find answers to this question elsewhere. I must be missing something very fundamental.
I'm using Elasticsearch to query data that was originally exported out of several relational databases with a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:
Example:
I have a document with the full name and street name of a user, and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the fullname field (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should field.
That means, I want to query for John Doe Back Street with the following sample data:
{
"fullname" : "John Doe John and Jane",
"street" : "Main Street"
}
{
"fullname" : "John Doe",
"street" : "Back Street"
}
Long story short, I want to query for the main attribute fullname - John Doe and the secondary attribute street - Back Street, and I want the second document to be the best match, not the first one, which merely contains John multiple times.
Manipulating relevance in Elasticsearch is not the easiest task. Score calculation is based on three main parts:
Term frequency
Inverse document frequency
Field-length norm
In short:
the more often the term occurs in the field, the MORE relevant the document is
the more often the term occurs across the entire index, the LESS relevant it is
the shorter the field containing the term is, the MORE relevant the document is
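To see how these three parts contribute to a concrete score, you can ask Elasticsearch to explain the scoring (a sketch; the index name users is an assumption):
GET users/_search
{
  "explain": true,
  "query": {
    "match": { "fullname": "john doe" }
  }
}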
I recommend reading the materials below:
What Is Relevance?
Theory Behind Relevance Scoring
Controlling Relevance and subpages
If, in general, a match on fullname is more important in your case than a match on street, you can boost the importance of the former. Below is example code based on my working code:
{
"query": {
"multi_match": {
"query": "john doe",
"fields": [
"fullname^10",
"street"
]
}
}
}
In this example, a result from fullname is ten times (^10) more important than a result from street. You can try to manipulate the boost or use other ways to control relevance, but as I mentioned at the beginning, it is not the easiest task, and everything depends on your particular situation. This is mostly because of the "inverse document frequency" part, which considers terms from the entire index: each document added to the index will probably change the score of the same search query.
I know that I did not answer directly, but I hope this helped you understand how it works.
I have documents like this:
{
title:'...',
body: '...'
}
I want to get documents which are more than 90% similar to a specific document. I have used this query:
query = {
"query": {
"more_like_this" : {
"fields" : ["title", "body"],
"like" : "body of another document",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
How do I change this query to check for 90% similarity with the specified doc?
Take a look at the query formation parameter minimum_should_match. You should specify minimum_should_match:
minimum_should_match
After the disjunctive query has been formed, this parameter controls
the number of terms that must match. The syntax is the same as the
minimum should match. (Defaults to "30%").
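Applied to your query, it could look like this (a sketch; note that "90%" here means 90% of the extracted query terms must match, which is not exactly the same as 90% document similarity):
{
  "query": {
    "more_like_this": {
      "fields": ["title", "body"],
      "like": "body of another document",
      "min_term_freq": 1,
      "max_query_terms": 12,
      "minimum_should_match": "90%"
    }
  }
}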
The MLT query is formed like this:
The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer at the field, then
selects the top K terms with the highest tf-idf to form a disjunctive
query of these terms
So if your title field matters more, you can also boost it: when the title contains most of the high tf-idf terms extracted from the input, the document will then be ranked higher, since it is more relevant. For example, you can boost your title field by 1.5.
Refer to the documentation on the more_like_this query for reference.
I am very new to Elasticsearch. I need to search for words where a particular word matches.
For example, I have the words:
cricketnplay, cricket23, cricket, cricketlegend
When I search for the word cricket,
the result should list the exact match first, followed by the other matches:
cricket
cricket23
cricketlegend
cricketnplay
How do I write a query to get output like this? Please help.
Thanks in advance.
You need to search with the _search API.
GET /twitter/tweet/_search
{
"query" : {
"term" : { <field> : "cricket" }
}
}
This query will return all matching documents, sorted in descending order of score.
Read more about the _search API in the Elasticsearch documentation.
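If the partial matches such as cricket23 should also appear after the exact one, an option beyond the plain term query is to boost an exact term clause and add a prefix clause in a bool query. A sketch (the field name word is hypothetical, and prefix queries can be expensive on large indices):
GET /twitter/tweet/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "word": { "value": "cricket", "boost": 2.0 } } },
        { "prefix": { "word": "cricket" } }
      ]
    }
  }
}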
I recently started using Elasticsearch 2, and as I understand analyzed vs not_analyzed in the mapping, not_analyzed should be better in terms of storage (https://www.elastic.co/blog/elasticsearch-storage-the-true-story-2.0 and https://www.elastic.co/blog/elasticsearch-storage-the-true-story).
For testing purposes I created some indices with all the string fields analyzed (the default), and then I created some other indices with all the fields not_analyzed. My surprise came when I checked the size of the indices and saw that the ones with the not_analyzed strings were 40% bigger! I was inserting the same documents into each index (35000 docs).
Any idea why this is happening? My documents are simple JSON documents. I have 60 string fields in each document that I want to set as not_analyzed, and I tried both setting each field as not_analyzed individually and creating a dynamic template.
Edit: adding the mapping, although I think there is nothing special about it:
{
  "mappings": {
    "my_type": {
      "_ttl": { "enabled": true, "default": "7d" },
      "properties": {
        "field1": {
          "properties": {
            "field2": {
              "type": "string", "index": "not_analyzed"
            },
            ... more not_analyzed string fields here ...
          }
        }
      }
    }
  }
}
not_analyzed fields are still indexed. They just don't have any transformations applied to them beforehand ("analysis" - in Lucene parlance).
As an example:
(Doc 1) "The quick brown fox jumped over the lazy dog"
(Doc 2) "Lazy like the fox"
Simplified postings list created by Standard Analyzer (default for analyzed string fields - tokenized, lowercased, stopwords removed):
"brown": [1]
"dog": [1]
"fox": [1,2]
"jumped": [1]
"lazy": [1,2]
"over": [1]
"quick": [1]
30 characters worth of string data
Simplified postings list created by "index": "not_analyzed":
"The quick brown fox jumped over the lazy dog": [1]
"Lazy like the fox": [2]
62 characters worth of string data
Analysis causes input to get tokenized and normalized for the purpose of being able to look up documents using a term.
But as a result, the unit of text is reduced to a normalized term (vs an entire field with not_analyzed), and all the redundant (normalized) terms across all documents are collapsed into a single logical list saving you all the space that would normally be consumed by repeated terms and stopwords.
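You can see what analysis produces with the _analyze API. A sketch using the built-in stop analyzer, which lowercases and removes English stopwords like the postings list above (note that the stock standard analyzer keeps stopwords unless you configure a stopword list):
GET _analyze
{
  "analyzer": "stop",
  "text": "The quick brown fox jumped over the lazy dog"
}
This returns the tokens quick, brown, fox, jumped, over, lazy and dog.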
From the documentation, it looks like not_analyzed makes the field act like a "keyword" instead of a "full-text" field -- let's compare these two!
Full text
These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed.
Keyword
Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.
I'm not surprised that storing an entire string as a term, rather than breaking it into a list of terms, doesn't necessarily translate to saved space. Honestly, it probably depends on the index's analyzer and the string being indexed.
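As a sketch, this is how the two flavors look in a modern (7.x+) mapping, with made-up field names; in 2.x terms, keyword corresponds to a string with "index": "not_analyzed" and text to an analyzed string:
{
  "mappings": {
    "properties": {
      "sku": { "type": "keyword" },
      "body": { "type": "text" }
    }
  }
}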
As a side note, I just re-indexed about a million documents of production data and cut our index disk space usage by ~95%. The main difference I made was modifying what was actually saved in the source (AKA stored). We indexed PDFs for searching, but did not need them to be returned and so that saved us from saving this information in two different ways (analyzed and raw). There are some very real downsides to this, though, so be careful!
Doc1:
{
  "name": "my name is mayank kumar"
}
Doc2:
{
  "name": "mayank"
}
Doc3:
{
  "name": "Mayank"
}
We have 3 documents.
So if the field 'name' is 'not_analyzed' and we search for 'mayank', only the second document would be returned. If we search for 'Mayank', only the third document would be returned.
If the field 'name' is 'analyzed' by an analyzer that lowercases (just as an example), and we search for 'mayank', all 3 documents would be returned.
If we search for 'kumar', the first document would be returned. This happens because in the first document the field value gets tokenized as "my", "name", "is", "mayank", "kumar".
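You can verify this tokenization with the _analyze API (the standard analyzer both splits on word boundaries and lowercases):
GET _analyze
{
  "analyzer": "standard",
  "text": "my name is mayank kumar"
}
It returns the tokens my, name, is, mayank and kumar, which is also why 'Mayank' in Doc3 becomes the searchable term mayank when analyzed.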
'not_analyzed' is basically used for exact matching (and in wildcard matching). It takes less space on disk and less time during indexing.
'analyzed' is basically used for full-text search. It takes more space on disk (if the analyzed fields are big) and more time during indexing (more terms are produced from the analyzed fields).