Favor exact matches over nGram in elasticsearch

I am trying to map a field as nGram and 'exact' match, and make the exact matches appear first in the search results. This is an answer to a similar question, but I am struggling to make it work.
No matter what boost value I specify for the 'exact' field I get the same results order each time. This is how my field mapping looks:
"name" : {
"type" : "multi_field",
"fields" : {
"name" : {
"type" : "string",
"boost" : 2.0,
"analyzer" : "ngram"
},
"exact" : {
"type" : "string",
"boost" : 4.0,
"analyzer" : "simple",
"include_in_all" : false
}
}
}
And this is what the query looks like:
{
"query": {
"filtered": {
"query": {
"query_string": {
"fields":["name","name.exact"],
"query":"Woods"
}
}
}
}
}

Understanding how the score is calculated
Elasticsearch can produce an explanation for every search result when the explain parameter is set to true:
POST <Index>/<Type>/_search?explain&format=yaml
{
"query" : " ....."
}
This produces a lot of output for every hit, which can be overwhelming, but it is worth taking some time to understand what it all means.
The explain output can be hard to read as JSON, so adding format=yaml makes it easier to read.
Understanding why a document is matched or not
You can run the query against a specific document, as below, to see an explanation of how the matching is done:
GET <Index>/<type>/<id>/_explain
{
"query": "....."
}

The multi_field mapping is correct, but the search query needs to be changed like this:
{
"query": {
"filtered": {
"query": {
"multi_match": { # changed from "query_string"
"fields": ["name","name.exact"],
"query": "Woods",
# added this so the engine does a "sum of" instead of a "max of"
# this is deprecated in the latest versions but works with 0.x
"use_dis_max": false
}
}
}
}
}
Now the results take the 'exact' match into account and add it to the score.
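To make the "sum of" versus "max of" distinction concrete, here is a minimal Python sketch. It does not call Elasticsearch; the per-field scores are invented numbers and the `rank` helper is hypothetical, so treat it only as an illustration of why the combining rule can change the order.

```python
# Invented per-field scores for two documents. Real values come from
# Elasticsearch's own scoring; these only illustrate the combining rule.
field_scores = {
    "Woods":     {"name": 3.0, "name.exact": 2.5},  # nGram and exact both match
    "Woodstock": {"name": 3.1, "name.exact": 0.0},  # only the nGram field matches
}

def rank(scores, combine):
    """Order document names by their combined per-field score, best first."""
    totals = {doc: combine(per_field.values()) for doc, per_field in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)

by_max = rank(field_scores, max)  # "max of": only the best field counts
by_sum = rank(field_scores, sum)  # "sum of": every matching field contributes

print(by_max)  # ['Woodstock', 'Woods']
print(by_sum)  # ['Woods', 'Woodstock']
```

With "max of", the exact match is ignored whenever the nGram field already scores higher, which is one way the ordering can appear insensitive to the exact field.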

Related

How to make use of `gt` and `fields` in the same query in Elasticsearch

In my previous question, I was introduced to the fields in a query_string query and how it can help me to search nested fields of a document.
{
"query": {
"query_string": {
"fields": ["*.id","id"],
"query": "2"
}
}
}
But it only works for matching, what if I want to do some comparison? After some reading and testing, it seems queries like range do not support fields. Is there any way I can perform a range query, e.g. on a date, over a field that can be scattered anywhere in the document hierarchy?
i.e. considering the following document:
{
"id" : 1,
"Comment" : "Comment 1",
"date" : "2016-08-16T15:22:36.967489",
"Reply" : [ {
"id" : 2,
"Comment" : "Inner comment",
"date" : "2016-08-16T16:22:36.967489"
} ]
}
Is there a query searching over the date field (like date > '2016-08-16T16:00:00.000000') which matches the given document, because of the nested field, without explicitly giving the address to Reply.date? Something like this (I know the following query is incorrect):
{
"query": {
"range" : {
"date" : {
"gte" : "2016-08-16T16:00:00.000000",
},
"fields": ["date", "*.date"]
}
}
}
The range query itself doesn't support it, however, you can leverage the query_string query (again) and the fact that you can wildcard fields and that it supports range queries in order to achieve what you need:
{
"query": {
"query_string": {
"query": "\\*date:[2016-08-16T16:00:00.000Z TO *]"
}
}
}
The above query will return your document because Reply.date matches *date.
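The idea of a wildcard field pattern combined with a range can be sketched in plain Python. This is not how Elasticsearch evaluates the query internally; `iter_fields` and `matches_range` are hypothetical helpers that walk the sample document from the question and apply a lower bound to every field whose dotted path matches the pattern.

```python
from datetime import datetime
from fnmatch import fnmatchcase

def iter_fields(node, path=""):
    """Yield (dotted_path, value) for every leaf of a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_fields(item, path)
    else:
        yield path, node

def matches_range(doc, field_pattern, lower_bound):
    """True if any field whose path matches the pattern is >= lower_bound."""
    bound = datetime.fromisoformat(lower_bound)
    return any(
        fnmatchcase(path, field_pattern) and datetime.fromisoformat(value) >= bound
        for path, value in iter_fields(doc)
    )

doc = {
    "id": 1,
    "Comment": "Comment 1",
    "date": "2016-08-16T15:22:36.967489",
    "Reply": [
        {"id": 2, "Comment": "Inner comment", "date": "2016-08-16T16:22:36.967489"}
    ],
}

print(matches_range(doc, "*date", "2016-08-16T16:00:00.000000"))  # True
print(matches_range(doc, "date", "2016-08-16T16:00:00.000000"))   # False
```

The first call succeeds only because of Reply.date; the top-level date is earlier than the bound.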

ElasticSearch - Search for complete phrase only

I am trying to create a search that returns exactly what I requested.
For instance, let's say I have 2 documents with a field named 'Val'.
The first document has a value of 'a - Copy'; the second has 'a - Copy (2)'.
My goal is to search for exactly the value 'a - Copy' and get only the first document in the results, not both of them with different similarity rankings.
When I try most of the usual queries, like:
GET test/_search
{
"query": {
"match": {
"Val": {
"query": "a - copy",
"type": "phrase"
}
}
}
}
or:
GET /test/doc/_search
{
"query": {
"query_string": {
"default_field": "Val",
"query": "a - copy"
}
}
}
I get both documents every time.
There is very good documentation on finding exact values in ES:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html
It shows you how to use the term filter and it mentions problems with analyzed fields, too.
In a nutshell, you need to run a term filter like this (I've put your values in):
GET /test/doc/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"term" : {
"Val" : "a - copy"
}
}
}
}
}
However, this doesn't work with analyzed fields. You won't get any results.
To prevent this from happening, we need to tell Elasticsearch that
this field contains an exact value by setting it to be not_analyzed.
There are multiple ways to achieve that, e.g. custom field mappings.
Yes, you are getting that because your field is, most likely, analyzed and split into tokens.
You need an analyzer similar to this one:
"custom_keyword_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
which uses the keyword tokenizer and the lowercase filter (I noticed you indexed upper case letters, but expect to search with lowercase letters).
And then use a term filter to search your documents.
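A rough Python sketch of why this helps. The `standard_like` function is only an approximation of an analyzed field (lowercase, split on non-word characters), and `keyword_lowercase` mimics the keyword tokenizer plus lowercase filter; neither is Elasticsearch code.

```python
import re

def standard_like(text):
    # Approximation of an analyzed field: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())

def keyword_lowercase(text):
    # Keyword tokenizer + lowercase filter: the whole value is one token.
    return [text.lower()]

def contains_phrase(tokens, phrase):
    # True if the phrase tokens appear consecutively inside tokens.
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

docs = {1: "a - Copy", 2: "a - Copy (2)"}
query = "a - copy"

# Phrase match on the analyzed field: both documents contain "a copy".
phrase_hits = [
    doc_id for doc_id, value in docs.items()
    if contains_phrase(standard_like(value), standard_like(query))
]

# Term lookup on the keyword-style field: only the identical value matches.
term_hits = [
    doc_id for doc_id, value in docs.items()
    if query in keyword_lowercase(value)
]

print(phrase_hits)  # [1, 2]
print(term_hits)    # [1]
```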

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta-tags, some of them containing a comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer that splits the indexed text into tokens on commas.
Then you can search. If you don't care about relevance, you can use a filter to search through your documents. My example shows how to search with a term filter.
Below is how to do this with the Sense plugin:
DELETE testindex
PUT testindex
{
"index" : {
"analysis" : {
"tokenizer" : {
"comma" : {
"type" : "pattern",
"pattern" : ","
}
},
"analyzer" : {
"comma" : {
"type" : "custom",
"tokenizer" : "comma"
}
}
}
}
}
PUT /testindex/_mapping/yourtype
{
"properties" : {
"contentType" : {
"type" : "string",
"analyzer" : "comma"
}
}
}
PUT /testindex/yourtype/1
{
"contentType" : "1,2,3"
}
PUT /testindex/yourtype/2
{
"contentType" : "3,4"
}
PUT /testindex/yourtype/3
{
"contentType" : "1,6"
}
GET /testindex/_search
{
"query": {"match_all": {}}
}
GET /testindex/_search
{
"filter": {
"term": {
"contentType": "6"
}
}
}
Hope it helps.
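As a plain-Python illustration of what the comma tokenizer achieves (not Elasticsearch code; `comma_tokens` and `term_lookup` are hypothetical helpers), the sketch below splits each value on commas, builds a small inverted index, and looks up single terms in it.

```python
from collections import defaultdict

# Same sample values as the documents indexed above.
docs = {1: "1,2,3", 2: "3,4", 3: "1,6"}

def comma_tokens(value):
    """Split a value on commas, dropping empty pieces."""
    return [token for token in value.split(",") if token]

# Inverted index: token -> set of document ids containing it.
index = defaultdict(set)
for doc_id, value in docs.items():
    for token in comma_tokens(value):
        index[token].add(doc_id)

def term_lookup(term):
    """Return the ids of documents whose token list contains the exact term."""
    return sorted(index.get(term, set()))

print(term_lookup("6"))    # [3]
print(term_lookup("3"))    # [1, 2]
print(term_lookup("1,6"))  # []
```

The last lookup returns nothing because the whole string is no longer stored as a single token, which is the reverse of the original problem.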
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n",
","
]
},
"text": "QUICK,brown, fox"
}

match or term query on a long property for exact match?

My document has the following mapping property:
"sid" : {"type" : "long", "store": "yes", "index": "not_analyzed"},
This property has only one long value for each record. I would like to query this property. I tried the following two queries:
{
"query" : {
"term" : {
"sid" : 10
}
}
}
{
"query" : {
"match" : {
"sid" : 10
}
}
}
Both queries work and return the target document. My question: which one is more efficient? And why?
You want to use a term query, and if you want to be even more efficient, use a filtered query so your results get cached.
GET index1/test/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"sid": 10
}
}
}
}
}
Both work the same way, as you mentioned. Unlike the match query, the term query matches documents whose field contains the exact term (not analyzed!). So in my opinion the term query is more efficient in your case, because no analysis has to be done. See: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

How should I query Elastic Search given my mapping and using keywords?

I have a very simple mapping which looks like this (I streamlined the example a bit):
{
"location" : {
"properties": {
"name": { "type": "string", "boost": 2.0, "analyzer": "snowball" },
"description": { "type": "string", "analyzer": "snowball" }
}
}
}
Now I index a lot of locations using some random values which are based on real English words.
I'd like to be able to search for locations that match any of the given keywords in either the name or the description field (name is more important, hence the boost I gave it). I tried a few different queries and they don't return any results.
{
"fields" : ["name", "description"],
"query" : {
"terms" : {
"name" : ["savage"],
"description" : ["savage"]
},
"from" : 0,
"size" : 500
}
}
Considering there are locations which have the word savaged in the description it should get me some results (savage is the stem of savaged). It yields 0 results using the above query. I've been using curl to query ES:
curl -XGET -d @query.json http://localhost:9200/myindex/locations/_search
If I use query string instead:
curl -XGET http://localhost:9200/fieldtripfinder/locations/_search?q=description:savage
I actually get one result (of course now it would be searching the description field only).
Basically I am looking for a query that will do a OR kind of search using multiple keywords and compare them to the values in both the name and the description field.
Snowball stems "savage" into "savag", which is why the term "savage" didn't return any results. However, when you specify "savage" in the URL, it gets analyzed and you get results. Depending on your intention, you can either use the correct stem ("savag") or have your terms analyzed by using a "match" query instead of "terms":
{
"fields" : ["name", "description"],
"query" : {
"bool" : {
"should" : [
{"match" : {"name" : "savage"}},
{"match" : {"description" : "savage"}}
]
}
},
"from" : 0,
"size" : 500
}
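The difference between looking up a raw term and analyzing the query text first can be sketched in Python. The `toy_stem` function below is a deliberately crude stand-in for Snowball (it only strips a trailing "ed" or "e"), and the sample description is made up; the point is just that the index stores stems, so an unanalyzed term misses them.

```python
def toy_stem(word):
    # Crude stand-in for the Snowball stemmer, for illustration only:
    # strip a trailing "ed" or "e" from words that are long enough.
    word = word.lower()
    for suffix in ("ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Made-up description text; at index time every token is stemmed.
description = "The village was savaged by storms"
indexed_terms = {toy_stem(token) for token in description.split()}

def term_query(term):
    # A term query looks the term up as given, without analysis.
    return term in indexed_terms

def match_query(text):
    # A match query analyzes the text first, so it is stemmed the same way.
    return any(toy_stem(token) in indexed_terms for token in text.split())

print(term_query("savage"))   # False
print(term_query("savag"))    # True
print(match_query("savage"))  # True
```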
