Percentage of matched terms in Elasticsearch - elasticsearch

I am using elasticsearch to find similar documents. Below is the query I am using:
{
"query": {
"more_like_this":{
"like": {
"_index": "docs",
"_type": "pdfs",
"_id": "pdf_1"
},
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms": 50,
"minimum_should_match": "50%"
}
}
}
I am extracting the text from PDF and storing in my index "docs". Below are the mappings for type "pdfs":
{
"properties": {
"content":{
"type": "string",
"analyzer": "my_analyzer"
}
}
}
In the result sets I am getting similar documents with their scores. Based on what I have read so far it is not possible to calculate percentage similarity based on score so I am not trying to do that. I am trying to figure out if it is possible to know:
"Out of 50 query terms from the source document how many terms are
matched in a document? or percentage of terms matched?"
As you can see, my query sets minimum_should_match to 50%, so I assume Elasticsearch filters documents somewhere based on the percentage of query terms they match. I want to get that percentage. I am fairly new to Elasticsearch; I have gone through the documentation but couldn't find out how to do it.
Any pointer/help is appreciated!
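The arithmetic being asked for can be sketched client-side once the query terms and the terms of a candidate document are known. This is only a minimal sketch: it assumes both term lists have already been obtained by some other means (for instance from the term vectors API or from explain output), the function name and the sample terms are hypothetical, and it is not something more_like_this returns by itself.

```python
def matched_term_percentage(query_terms, doc_terms):
    """Return (matched_count, percentage) of distinct query terms found in doc_terms."""
    query = set(query_terms)
    if not query:
        # No query terms: nothing can match, avoid dividing by zero.
        return 0, 0.0
    matched = query & set(doc_terms)
    return len(matched), 100.0 * len(matched) / len(query)


# Hypothetical analyzed terms, for illustration only.
query_terms = ["elasticsearch", "similar", "document", "score"]
doc_terms = ["document", "score", "pdf", "index"]

count, pct = matched_term_percentage(query_terms, doc_terms)
print(count, pct)
```

Both lists must come from the same analyzer (my_analyzer in the mapping above), otherwise the terms will not line up.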

Related

Elasticsearch Rank based on rarity of a field value

I'd like to know how can I rank lower items, which have fields that are frequently appearing among the results.
Say, we have a similar result set:
"name": "Red T-Shirt"
"store": "Zara"
"name": "Yellow T-Shirt"
"store": "Zara"
"name": "Red T-Shirt"
"store": "Bershka"
"name": "Green T-Shirt"
"store": "Benetton"
I'd like to rank the documents in such a manner that the documents containing frequently found fields,
"store" in this case, are deboosted to appear lower in the results.
This is to achieve a bit of variety, so that the search doesn't yield top results from the same store.
In the example above, if I search for "T-Shirt", I want to see one Zara T-Shirt at the top and the rest
of Zara T-Shirts should be appearing lower, after all other unique stores.
So far I tried to research for using aggregation buckets for sorting or script sorting, but without success.
Is it possible to achieve this inside of the search engine?
Many thanks in advance!
This is possible with a combination of diversified sampler aggregation and top hits aggregation, as learned from the Elastic forum. I don't know what the performance implications are, if used on a high-load production system. Here is a code example, use at your own risk:
{
"query": {}, // whatever query
"size": 0, // since we don't use hits
"aggs": {
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 100,
"field": "store"
},
"aggs": {
"keywords": {
"top_hits": {
"_source": {
"includes": [ "name", "store" ]
},
"size": 100
}
}
}
}
}
}
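If the aggregation turns out to be too heavy, the ordering described in the question can also be approximated after the search, on the client. The sketch below is a different technique from the diversified sampler above, not a replacement recommended by the answer: it assumes the hits are already sorted by score and have been reduced to plain dictionaries, and the function name diversify is hypothetical.

```python
def diversify(hits, field="store"):
    """Stable reorder: the first hit for each distinct field value, then all remaining hits."""
    seen = set()
    first, rest = [], []
    for hit in hits:
        value = hit.get(field)
        if value in seen:
            rest.append(hit)   # repeated store, push below the unique ones
        else:
            seen.add(value)
            first.append(hit)  # first time this store appears
    return first + rest


# The result set from the question, in its original order.
hits = [
    {"name": "Red T-Shirt", "store": "Zara"},
    {"name": "Yellow T-Shirt", "store": "Zara"},
    {"name": "Red T-Shirt", "store": "Bershka"},
    {"name": "Green T-Shirt", "store": "Benetton"},
]

reordered = diversify(hits)
for hit in reordered:
    print(hit["store"], "-", hit["name"])
```

This only reorders the page of hits that was fetched, so it does not bring in stores that never made it into the result set.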

Elastic Search comparing sentences with synonyms?

Is there an api within elastic search to compare the following two sentences?
The weather is great
The climate is good
The search described here https://www.elastic.co/guide/en/elasticsearch/guide/2.x/practical-scoring-function.html doesn't work since the sentences have largely different words
The following request, sent to the Painless execute API (POST /_scripts/painless/_execute), returns the score that Elasticsearch would compute. Replace test with the name of your index and field with the name of the field that uses the correct analyzer.
{
"script": {
"source": "_score"
},
"context": "score",
"context_setup": {
"index": "test",
"query": {
"match": {
"field": "The weather is great"
}
},
"document": {
"field": "The climate is good"
}
}
}
You will not get a score between 0.5 and 1, though. Elasticsearch is not built to perform pairwise string comparison; it is built to search within a collection of documents.
If you really want a score between 0.5 and 1, you will have to write a scripted similarity function.
But again, I don't think Elasticsearch fits your use case.
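To illustrate what a pairwise comparison with synonyms could look like outside Elasticsearch, here is a minimal sketch using Jaccard overlap on whitespace tokens. Everything in it is an assumption made for illustration: the synonym map is hand-written and hypothetical, the tokenization is naive, and this is not an Elasticsearch API (inside Elasticsearch, synonyms would normally be handled by a synonym token filter in the analyzer).

```python
def normalize(sentence, synonyms):
    """Lowercase, split on whitespace, and map each token to its canonical synonym."""
    return {synonyms.get(token, token) for token in sentence.lower().split()}


def jaccard(a, b, synonyms=None):
    """Jaccard similarity (0.0 to 1.0) between the token sets of two sentences."""
    synonyms = synonyms or {}
    tokens_a, tokens_b = normalize(a, synonyms), normalize(b, synonyms)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical synonym map: each key is rewritten to its canonical form.
SYNONYMS = {"climate": "weather", "good": "great"}

plain = jaccard("The weather is great", "The climate is good")
with_syn = jaccard("The weather is great", "The climate is good", SYNONYMS)
print(plain, with_syn)
```

Without the map only "the" and "is" overlap; with it, both sentences reduce to the same token set.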

Using Timelion in ElasticSearch/Kibana 5.0

I'm trying to visualize a timeseries in Timelion. I have a few hundred datapoints in elasticsearch with this sort of format - I've manually removed some fields which I never meant to use in the timeseries plot.
"_index": "foo-2016-11-06",
"_type": "bar",
"_id": "7239171989271733678",
"_score": 1,
"_source": {
"timestamp": "2016-11-06T15:27:37.123581+00:00",
"rank": 2,
}
What I want is to quite simply plot the change in rank over time. I found this post Kibana Timelion plugin how to specify a field in the elastic search which seems to describe the same thing and I understand I should be able to just do .es(metric='sum:rank').
My problem is that no matter how I define my timelion query (even just calling .es(*)), I end up just getting a horizontal line where y=0.
[Screenshot: timelion]
Things I've tried so far:
Changed timefield in timelion.json from #timefield to just timefield
Extending the timeseries window (even into the future)
Set default_index to _all in timelion.json
Queried specific indices that I know contain data
All of them give me the same outcome which you can see in the attached picture. Does anyone have any idea what might be going on here?
Set timelion.json as follows:
{
"quandl": {
"key": ""
},
"es": {
"timefield": "timestamp",
"default_index": "_all",
"allow_url_parameter": false
},
"graphite": {
"url": "https://www.hostedgraphite.com/UID/ACCESS_KEY/graphite"
},
"default_interval": "1h",
"max_buckets": 2000
}
Then set the granularity to 'Auto' and use the following Timelion query: .es(index='foo-2016-11-06', metric='max:rank').

Elasticsearch: document size and query performance

I have an ES index with medium size documents (15-30 Mb more or less).
Each document has a boolean field and most of the times users just want to know if a specific document ID has that field set to true.
Will document size affect the performance of this query?
"size": 1,
"query": {
"term": {
"my_field": True
}
},
"_source": [
"my_field"
]
And will a "size": 0 query result in better performance?
Adding "size": 0 to your query avoids some network transfer, which improves response time.
But as I understand your use case, you can use the Count API instead.
An example query:
curl -XPOST 'http://localhost:9200/test/_count' -d '{
"query": {
"bool": {
"must": [
{
"term": {
"id": xxxxx
}
},
{
"term": {
"bool_field": true
}
}
]
}
}
}'
With this query you only need to check whether the returned count is greater than zero: that tells you whether the document with that id has the boolean field set to the value (true or false) you specified for bool_field in the query. This will be quite fast.
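When the request is built from Python rather than typed by hand, the same body can be assembled as a dictionary and serialized, which also takes care of turning Python's True into JSON's true. This is a minimal sketch: the helper names build_count_body and flag_is_set are hypothetical, the id 1000 is a made-up example, and the only response detail relied on is the count field that the Count API returns.

```python
import json


def build_count_body(doc_id, field, value=True):
    """Build a _count request body matching one document id and one boolean field value."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"id": doc_id}},
                    {"term": {field: value}},
                ]
            }
        }
    }


def flag_is_set(count_response):
    """True when the parsed Count API response reports at least one matching document."""
    return count_response.get("count", 0) > 0


body = build_count_body(1000, "bool_field")
payload = json.dumps(body)  # Python True is serialized as JSON true
print(payload)
```

The payload string is what would be sent as the request body to the _count endpoint.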
Since Elasticsearch indexes your fields, document size will not be a big problem for query performance. Using size 0 does not affect query performance inside Elasticsearch, but it does reduce the time needed to return the result because less data is transferred over the network.
If you just want to check one boolean field for a specific document, you can simply use the Get API and retrieve only the field you want to check, like this:
curl -XGET 'http://localhost:9200/my_index/my_type/1000?fields=my_field'
In this case Elasticsearch will just retrieve the document with _id = 1000 and the field my_field. So you can check the boolean value.
{
"_index": "my_index",
"_type": "my_type",
"_id": "1000",
"_version": 9,
"found": true,
"fields": {
"my_field": [
true
]
}
}
Looking at your question, I see that you haven't mentioned the Elasticsearch version you are using. Many factors affect the performance of an Elasticsearch cluster.
However, assuming it is the latest Elasticsearch and considering that you are after a single value, the best approach is to change your query into a non-scoring, filtering query. Filters are quite fast in Elasticsearch and easily cached. Making a query non-scoring avoids the scoring phase entirely (calculating relevance, etc.).
To do this:
GET localhost:9200/test_index/test_partition/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"my_field" : true
}
}
}
}
}
Note that we are using the Search API. The constant_score query converts the term query into a filter, which should be inherently fast.
For more information, please refer to Finding exact values.
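Whichever search variant is used, the answer to "is the flag set?" comes down to whether the response reports any hits. A small sketch of reading that from a parsed response follows; it assumes the response has already been decoded into a dictionary, the helper name total_hits is hypothetical, and the sample responses are made up. It handles both shapes of hits.total: a plain number in older versions and an object with value and relation in 7.x and later.

```python
def total_hits(search_response):
    """Return the total hit count from a parsed _search response."""
    total = search_response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x and later: {"value": 1, "relation": "eq"}
        return total["value"]
    # Older versions report a plain integer.
    return total


# Made-up responses, trimmed to the part that matters here.
old_style = {"hits": {"total": 1, "hits": []}}
new_style = {"hits": {"total": {"value": 0, "relation": "eq"}, "hits": []}}

print(total_hits(old_style) > 0, total_hits(new_style) > 0)
```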

Unexpected Match query scoring on a FirstMiddleLast field

I am using a match query to search a fullName field which contains names in (first [middle] last) format. I have two documents, one with "Brady Holt" as the fullName and the other as "Brad von Holdt". When I search for "brady holt", the document with "Brad von Holdt" is scored higher than the document with "Brady Holt" even though it is an exact match. I would expect the document with "Brady Holt" to have the highest score. I am guessing it has something to do with the 'von' middle name causing the score to be higher?
These are my documents:
[
{
"id": 509631,
"fullName": "Brad von Holdt"
},
{
"id": 55425,
"fullName": "Brady Holt"
}
]
This is my query:
{
"query": {
"match": {
"fullName": {
"query": "brady holt",
"fuzziness": 1.0,
"prefix_length": 3,
"operator": "and"
}
}
}
}
This is the query result:
"hits": [
{
"_index": "demo",
"_type": "person",
"_id": "509631",
"_score": 2.4942014,
"_source": {
"id": 509631,
"fullName": "Brad von Holdt"
}
},
{
"_index": "demo",
"_type": "person",
"_id": "55425",
"_score": 2.1395948,
"_source": {
"id": 55425,
"fullName": "Brady Holt"
}
}
]
A good read on how Elasticsearch does scoring, and how to manipulate relevancy, can be found in the Elasticsearch Guide: What is Relevance?. In particular, you may want to experiment with the explain functionality of a search query.
The shortest answer for you here is that the score of a hit is the product of its best-matching term according to a TF/IDF calculation. The number of matching terms will affect which documents are matched, but it's the "best" term that determines a document's score. Your query doesn't have an "exact" match, per se: it has multiple matching terms, the scores of which are calculated independently.
Tuning relevancy can be a bit of a subtle art, and depends a lot on how the fields are being analyzed, the overall frequency distributions of various terms, the queries you're running, and even how you're sharding and distributing the index within a cluster (different shards will have different term frequencies).
(It may also be relevant, so to speak, that your example has two spellings of "Holt" and "Holdt".)
In any case, getting familiar with explain functionality and the underlying scoring mechanics is a helpful next step for you here.
Also, if you want an exact phrase match, you should read the ES guide on Phrase Matching.
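The explain output is a nested tree of value, description and details entries, which is easier to read once flattened into indented lines. The sketch below assumes the explanation for a hit has already been parsed into a dictionary; the helper name flatten_explanation is hypothetical, and the sample tree uses made-up numbers and descriptions rather than the real output for the documents above.

```python
def flatten_explanation(explanation, depth=0):
    """Turn a nested explanation tree into indented 'value  description' lines."""
    line = "{}{:.4f}  {}".format("  " * depth, explanation["value"], explanation["description"])
    lines = [line]
    for child in explanation.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines


# Made-up explanation tree, for illustration only.
sample = {
    "value": 2.5,
    "description": "sum of:",
    "details": [
        {"value": 1.5, "description": "weight(fullName:brad)", "details": []},
        {"value": 1.0, "description": "weight(fullName:holdt)", "details": []},
    ],
}

lines = flatten_explanation(sample)
print("\n".join(lines))
```

Comparing the flattened trees of the two hits side by side shows which terms contribute to each score and by how much.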
