Remove documents with lowest relevancy in match query - elasticsearch

I want to remove documents with lowest relevancy in match query. Is there any other way to do this beside score t?
Use case:
Suppose we have :-
index: office
doctype: employee
post(field): Account officer, account manager, accountant, chief acc etc which are different documents.
Now I search "account" in a match query against all the docs in the "post" field.
Let's say "chief acc" value for "post" field in above doc is 'least relavant'.
I want to exclude those very less relevant matches in search results list.
I tried by using score of results but I think that is not feasible. Is there any other way to achieve this beside score??

Yes you can do this by having a filtered query inside your query:
POST _search
{
"query":{
"filtered":{
"filter":{
"not":{
"term":{
"post":"chief acc"
}
}
}
}
}
}
If you're using ES 5.0 you have to use must_not filter instead of not:
"must_not" : {
"term" : { "post" : "chief acc" }
}
Maybe you could have a look at this SO as well. Hope it helps!

Related

How can we make few tokens to be phrase in elastic search query

I want to search part of query to be considered as phrase .For e.g. I want to search "Can you show me documents for Hospitality and Airline Industry"
Here I want Airline Industry to be considered as phrase.I dont find any such settings in multi_match .
Even when we try to use multi_match query using "Can you show me documents for Hospitality and \"Airline Industry\"" .Default analyser breaks it into separate tokens.I dont want to change settings of my analyser.Also I have found that we can do this in simple_query_string but that has consequences that we can not apply filter option as we have in multi_match boolean query because I want to apply filter on certain feilds as well.
search_text="Can you show me documents for Hospitality and Airline Industry" Now I Want to pass Airline Industry as a phrase to search my indexed document against 2 fields.
okay so say I have existing code like this.
If filter:
qry={
“query":{
“bool”:{
“must”:{
"multi_match":{
"query":search_text,
"type":"best_fields",
"fields":["TITLE1","TEXT"],
"tie_breaker":0.3,
}
},
“filter”:{“terms”:{“GRP_CD”:[“1234”,”5678”] }
}
}
else:
qry={
"query":{
"multi_match":{
"query":search_text',
"type":"best_fields",
"fields":["TITLE1",TEXT"],
"tie_breaker":0.3
}
}
}
'But then I have realised this code is not handling Airline Industry as a phrase even though I am passing search string like this
"Can you show me documents for Hospitality and \"Airline Industry\""
As per elastic search document I came to know there is this query which might handle this
qry={"query":{
"simple_query_string":{
"query":"Can you show me documents for Hospitality and \"Airline Industry\"",
"fields":["TITLE1","TEXT"] }
} }
But now my issue is what if user want to apply filter..with filter query as above I can not pass phrase and boolean query is not possible with simple_query_string'
You can always combine queries using boolean query. Lets understand this case by case. Before going to the cases I would like to clarify one thing which is about filter. The filter clause of boolean query behave just like a must clause but the difference is that any query (even another boolean query with a must/should clause(s)) inside filter clause have filter context. Filter context means, that part of query will not be considered for score calculation.
Now lets move on to cases:
Case 1: Only query and no filters.
{
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"query": "Can you show me documents for Hospitality and \"Airline Industry\"",
"fields": [
"TITLE1",
"TEXT"
]
}
}
]
}
}
}
Notice that the query is same as specified by you in the question. All I have done here is that I wrapped it in a bool query. This doesn't make any logical change to the query but doing so will make it easier to add queries to filter clause programmatically.
Case 2: Phrase query with filter.
{
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"query": "Can you show me documents for Hospitality and \"Airline Industry\"",
"fields": [
"TITLE1",
"TEXT"
]
}
}
],
"filter": [
{
"terms": {
"GRP_CD": [
"1234",
"5678"
]
}
}
]
}
}
}
This way you can combine query(query context) with the filters.

Understanding boosting in ElasticSearch

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example
"query":
{
"bool":{
"should":[
{
"match":{
"title":{
"query":"cat",
"boost":2
}
}
},
{
"match":{
"content":{
"query":"cat",
"boost":2
}
}
}
]
}
}
Does that mean that the IDFs of the term "cat" is going to be boosted / multiplied by a factor of 2?
Also, what happens if instead of search for one word I have a sentence? Would that mean that the IDFs of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but that seems a little confusing.
The boost is used when query with multi query clauses, example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the above query, it means clause1 is three times important than clause3, clause2 is the twice important than clause2, It's not simply multiply 3, 2, because when calculate score, because there is normalized for scores.
also if you just query with one query clause with boost, it's not useful.
An usage scenario for using boost:
A set of page document set with title and content field.
You want to search title and content with some terms, and you think title is more important than content when search these documents. so you can set title query clause boost more than content. Such as if your query hit one document by title field, and one hit document by content field, and you want to hit title field's document prior to the content field document. so boost can help you do it.

Analyzer to find , e.g: "starbucks" when mistakenly querying "star bucks"

How would I define an analyzer so a query recalls a document with term "starbucks" when mistakenly querying "star bucks"?
Or in general: how would I define an analyzer that is able to search for combined terms by omitting term-separators/ spaces, in the supplied query?
N-grams clearly don't work, since you'd have to know to split up the term 'starbucks' on indexing in 2 separate terms 'star' and 'bucks'. Splitting on syllables might be enough, but not sure if that's possible (or scales)
Thoughts?
You can use Fuzzy Search.
Here is a full working sample:
PUT test1
POST test1/a
{
"item1": "starbucks"
}
POST test1/a
{
"item1": "foo"
}
GET test1/a/_search
{
"query": {
"fuzzy": {
"item1": "star bucks"
}
}
}

Elasticsearch bulk or search

Background
I am working on an API that allows the user to pass in a list of details about a member (name, email addresses, ...) I want to use this information to match up with account records in my Elasticsearch database and return a list of potential matches.
I thought this would be as simple as doing a bool query on the fields I want, however I seem to be getting no hits.
I'm relatively new to Elasticsearch, my current _search request looks like this.
Example Query
POST /member/account/_search
{
"query" : {
"filtered" : {
"filter" : {
"bool" : {
"should" [{
"term" : {
"email": "jon.smith#gmail.com"
}
},{
"term" : {
"email": "samy#gmail.com"
}
},{
"term" : {
"email": "bo.blog#gmail.com"
}
}]
}
}
}
}
}
Question
How should I update this query to return records that match any of the email addresses?
Am I able to prioritise records that match email and another field? Example "family_name".
Will this be a problem if I need to do this against a few hundred emails addresses?
Well , you need to make the change in the index side rather than query side.
By default your email ID is broken into
jon.smith#gmail.com => [ jon , smith , gmail , com]
While indexing.
Now when you are searching using term query , it does not apply the analyzer and it tries to get the exact match of jon.smith#gmail.com , which as you can see , wont work.
Even if you use match query , then you will end up getting all document as matches.
Hence you need to change the mapping to index email ID as a single token , rather than tokenizing it.
So using not_analyzed would be the best solution here.
When you define email field as not_analyzed , the following happens while indexing.
jon.smith#gmail.com => [ jon.smith#gmail.com]
After changing the mapping and indexing all your documents , now you can freely run the above query.
I would suggest to use terms query as following -
{
"query": {
"terms": {
"email": [
"jon.smith#gmail.com",
"samy#gmail.com",
"bo.blog#gmail.com"
]
}
}
}
To answer the second part of your question - You are looking for boosting and would recommend to go through function score query

Elastic Search boost query corresponding to first search term

I am using PyElasticsearch (elasticsearch python client library). I am searching strings like Arvind Kejriwal India Today Economic Times and that gives me reasonable results. I was hoping I could increase weight of the first words more in the search query. How can I do that?
res = es.search(index="article-index", fields="url", body={
"query": {
"query_string": {
"query": "keywordstr",
"fields": [
"text",
"title",
"tags",
"domain"
]
}
}
})
I am using the above command to search right now.
split given query into multiple terms. In your example it will be Arvind, Kejriwal... Now form query string queries(or field query or any other which fits into the need) for each of the given terms. A query string query will look like this
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-query-string-query.html
{
"query_string" : {
"default_field" : "content",
"query" : "<one of the given term>",
"boost": <any number>
}
}
Now you have got multiple queries like above with different boost values(depending upon which have higher weight). Combine all of those queries into one query using BOOL query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
If you want all of the terms to be present in the result, query will be like this.
{
"bool" : {
"must" : [q1, q2, q3 ...]
}
}
you can use different options of bool query. for example you want any of 3 terms to present in result then query will be like
{
"bool" : {
"should" : [q1, q2,q3 ...]
},
"minimum_should_match" : 3,
}
theoretically:
split into terms using api
query against terms with different boosting
Lucene Query Syntax does the trick. Thanks
http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Boosting%20a%20Term

Resources