Elasticsearch: document size and query performance - elasticsearch

I have an ES index with medium-sized documents (roughly 15-30 MB each).
Each document has a boolean field, and most of the time users just want to know whether a specific document ID has that field set to true.
Will document size affect the performance of this query?
"size": 1,
"query": {
"term": {
"my_field": True
}
},
"_source": [
"my_field"
]
And will a "size": 0 query result in better performance?

Adding "size":0 to your query, you will avoid some net transfer this behaviour will improve your performance time.
But as I understand your case of use, you can use count
An example query:
curl -XPOST 'http://localhost:9200/test/_count' -d '{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "id": "xxxxx"
          }
        },
        {
          "term": {
            "bool_field": true
          }
        }
      ]
    }
  }
}'
With this query you only need to check whether the returned count is greater than zero to know if the document with that id has the bool field set to the value you specified for bool_field. This will be quite fast.
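For reference, a minimal sketch of the same check from Python, assuming the elasticsearch-py client (7.x-style body argument) and a placeholder document id:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Count documents matching the id and the boolean flag; a count > 0
# means the flag is set to true for that document.
resp = es.count(index="test", body={
    "query": {
        "bool": {
            "must": [
                {"term": {"id": "xxxxx"}},        # placeholder document id
                {"term": {"bool_field": True}}
            ]
        }
    }
})
print(resp["count"] > 0)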

Since Elasticsearch indexes your fields, document size will not be a big problem for query performance. Using "size": 0 doesn't change how the query is executed inside Elasticsearch, but it does improve the time to get the response because less data is transferred over the network.
If you just want to check one boolean field of a specific document, you can simply use the Get API and retrieve only the field you want to check, like this:
curl -XGET 'http://localhost:9200/my_index/my_type/1000?fields=my_field'
In this case Elasticsearch will just retrieve the document with _id = 1000 and the field my_field. So you can check the boolean value.
{
"_index": "my_index",
"_type": "my_type",
"_id": "1000",
"_version": 9,
"found": true,
"fields": {
"my_field": [
true
]
}
}
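The same lookup from Python, as a sketch assuming the elasticsearch-py client (recent versions use _source filtering instead of the fields parameter shown in the curl call above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch only my_field of document 1000; the rest of the large _source
# is never transferred over the network.
doc = es.get(index="my_index", id="1000", _source_includes=["my_field"])
print(doc["_source"]["my_field"])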

You haven't mentioned which Elasticsearch version you are using, and there are a lot of factors that affect the performance of an Elasticsearch cluster.
However, assuming a recent Elasticsearch and considering that you are after a single value, the best approach is to change your query into a non-scoring, filtering query. Filters are quite fast in Elasticsearch and easily cached, and making a query non-scoring avoids the scoring phase entirely (calculating relevance, etc.).
To do this:
GET localhost:9200/test_index/test_partition/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"my_field" : True
}
}
}
}
}
Note that we are still using the search API; the constant_score query turns the term query into a filter, which should be inherently fast.
For more information, please refer to Finding exact values.

Related

How to rank ElasticSearch documents based on scores

I have an Elasticsearch index that contains thousands of documents, where each document represents a user.
Each document has a set of fields (is_verified: boolean, country: string, is_creator: boolean). I also have another service that calls ES search to look up documents. How can I rank the retrieved documents based on those fields? For example, a verified user that matches should come before an unverified one.
Is there some kind of document scoring applied while indexing the documents? If yes, can I modify it based on my criteria?
What should I read/look at to understand how to rank in Elasticsearch?
Thanks
I guess the sort option mentioned by Mikael is pretty straightforward and should cover your use cases. Check the Elastic docs for more information on that.
But in case you want to do really fancy sorting, you could use a bool query with different boost values to set your desired relevancy for each matched field. I tried to come up with a real-life example but honestly didn't find one. For the sake of completeness, the following snippet should give you an idea of how to achieve results similar to the sort API (but still, I would prefer using sort).
GET /yourindexname/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "Monica"
}
}
],
"should": [
{
"term": {
"is_verified": {
"value": true,
"boost": 2
}
}
},
{
"term": {
"is_creator": {
"value": true,
"boost": 2
}
}
}
]
}
}
}
Is there some kind of document scoring while indexing the documents? If yes, can I modify it based on my criteria?
I wouldn't assign a fixed score to a document while indexing, as the score should depend on the query. However, if you insist on having a predefined relevancy for each document, you could theoretically add a relevancy field holding that value and use it later in the query for ordering:
GET /yourindexname/_search
{
"query" : {
"match" : {
"name": "Monica"
}
},
"sort" : [
{
"relevancy": {
"order": "desc"
},
"_score"
}
]
}
You can consider using the Sort API inside your search queries. In the example below we search on the country field and sort the results by the boolean field is_verified. You can also add the other boolean field inside the sort brackets.
GET /yourindexname/_search
{
"query" : {
"match" : {
"country": "Iceland"
}
},
"sort" : [
{
"is_verified": {
"order": "desc"
}
}
]
}

Can I use Elasticsearch with data that needs authentication to view (e.g. logged-in users only)?

I want to implement searching on my website. Users should be able to search for products that they have in their shop. Obviously, the products returned should only be theirs, and the same applies when customers search on their website. How can I implement this with Elasticsearch? My backend will run the query, not the front end, but how will I limit the search results to only one user? Is it only possible through filtering with my own code? Does it have something like WHERE from SQL? Am I going about it the wrong way? Would it be better to use the full-text search from PostgreSQL?
I am using Go, by the way.
Best regards
Update: my use case, as requested:
A user is paired with an ID. He is in his dashboard and searches for a product he has in his shop. His request passes the session token cookie and I get his ID on my server. Then I need to get the products that match his query, and only his.
In SQL it would be SELECT * FROM products WHERE shop_id=ID, for example. Is it possible with Elasticsearch? Or is it more trouble than it's worth compared to implementing full-text search in PostgreSQL?
It can easily be achieved with Elasticsearch: define shop_id as a keyword field and later use it in the filter context of your query to make sure you only search the products belonging to a particular shop_id.
Using shop_id in the filter context also improves the performance of your search significantly, as filters are cached by default in Elasticsearch, as explained in the official doc:
In a filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Sample mapping and query according to your requirement:
Index mapping
{
"mappings" :{
"properties" : {
"product" : {
"type" : "text"
},
"shop_id" :{
"type" : "keyword"
}
}
}
}
Index sample docs for 2 diff shop ids
{
"product" : "foo",
"shop_id" : "stackoverflow"
}
{
"product" : "foo",
"shop_id" : "opster"
}
Search for foo product where shop_id is stackoverflow
{
"query": {
"bool": {
"must": [
{
"match": {
"product": "foo"
}
}
],
"filter": [
{
"term": {
"shop_id": "stackoverflow"
}
}
]
}
}
}
Search result
"hits": [
{
"_index": "productshop",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {. --> note only foo belong to `stackoverflow` returned
"product": "foo",
"shop_id": "stackoverflow"
}
}
]
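Since the query is built on the backend, here is a sketch (Python for brevity, using the same structure you would build from Go; names are illustrative) of injecting the shop_id resolved from the session into the filter, so a user can never search outside their own shop:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_my_products(shop_id, user_query):
    # shop_id comes from the authenticated session on the server,
    # never from user input, so results are always scoped to one shop.
    return es.search(index="productshop", body={
        "query": {
            "bool": {
                "must": [{"match": {"product": user_query}}],
                "filter": [{"term": {"shop_id": shop_id}}]
            }
        }
    })

hits = search_my_products("stackoverflow", "foo")["hits"]["hits"]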

Elastic Search comparing sentences with synonyms?

Is there an API within Elasticsearch to compare the following two sentences?
The weather is great
The climate is good
The search described here https://www.elastic.co/guide/en/elasticsearch/guide/2.x/practical-scoring-function.html doesn't work since the sentences have largely different words
The following request will give you the score that would be computed by Elasticsearch. Replace test with the name of your index and field with the name of a field that uses the correct analyzer.
{
"script": {
"source": "_score"
},
"context": "score",
"context_setup": {
"index": "test",
"query": {
"match": {
"field": "The weather is great"
}
},
"document": {
"field": "The climate is good"
}
}
}
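This body is meant for the Painless execute API with the score context; a sketch of sending it (the endpoint and the index/field names are assumptions, adjust to your setup):

import requests

body = {
    "script": {"source": "_score"},
    "context": "score",
    "context_setup": {
        "index": "test",
        "query": {"match": {"field": "The weather is great"}},
        "document": {"field": "The climate is good"}
    }
}
# POST to the Painless execute endpoint; the response contains the computed score.
resp = requests.post("http://localhost:9200/_scripts/painless/_execute", json=body)
print(resp.json())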
You will not get a score between 0.5 and 1, though. Elasticsearch is not built to perform pairwise string comparison; it is used to search within a collection of documents.
If you really want a score between 0.5 and 1 you will have to write a scripted similarity function.
But again, I don't think Elasticsearch fits your use case.

Elasticsearch query speed up with repeated used terms query filter

I need to find the co-occurrence counts between each single tag and a fixed set of tags as a whole. I have 10,000 different single tags, and there are 10k tags inside the fixed set. I loop through all single tags within the context of the fixed set of tags and a fixed time range. I have a total of 1 billion documents inside the index, with 20 shards.
Here is the Elasticsearch query (Elasticsearch 6.6.0):
es.search(index=index, size=0, body={
"query": {
"bool": {
"filter": [
{"range": {
"created_time": {
"gte": fixed_start_time,
"lte": fixed_end_time,
"format": "yyyy-MM-dd-HH"
}}},
{"term": {"tags": dynamic_single_tag}},
{"terms": {"tags": {
"index" : "fixed_set_tags_list",
"id" : 2,
"type" : "twitter",
"path" : "tag_list"
}}}
]
}
}, "aggs": {
"by_month": {
"date_histogram": {
"field": "created_time",
"interval": "month",
"min_doc_count": 0,
"extended_bounds": {
"min": two_month_start_time,
"max": start_month_start_time}
}}}
})
My question: is there any way to cache the fixed 10k-tag terms query and the time-range filter inside Elasticsearch so the query gets faster? The query above takes 1.5 s for a single tag.
What you are seeing is normal behavior for Elasticsearch aggregations (actually, pretty good performance given that you have 1 billion documents).
There are a couple of options you may consider: using a batch of filter aggregations, re-indexing with a subset of documents, and downloading the data out of Elasticsearch and computing the co-occurrences offline.
But probably it is worth trying to send those 10K queries and see if Elasticsearch built-in caching kicks in.
Let me explain in a bit more detail each of these options.
Using filter aggregation
First, let's outline what we are doing in the original ES query:
filter documents with created_time in a certain time window;
filter documents containing the desired tag dynamic_single_tag;
also filter documents that have at least one tag from the list fixed_set_tags_list;
count how many such documents there are for each month in a certain time period.
The performance is a problem because we've got 10K tags to make such queries for.
What we can do here is move the filter on dynamic_single_tag from the query into the aggregations:
POST myindex/_doc/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "terms": { ... } }
]
}
},
"aggs": {
"by tag C": {
"filter": {
"term": {
"tags": "C" <== here's the filter
}
},
"aggs": {
"by month": {
"date_histogram": {
"field": "created_time",
"interval": "month",
"min_doc_count": 0,
"extended_bounds": {
"min": "2019-01-01",
"max": "2019-02-01"
}
}
}
}
}
}
}
The result will look something like this:
"aggregations" : {
"by tag C" : {
"doc_count" : 2,
"by month" : {
"buckets" : [
{
"key_as_string" : "2019-01-01T00:00:00.000Z",
"key" : 1546300800000,
"doc_count" : 2
},
{
"key_as_string" : "2019-02-01T00:00:00.000Z",
"key" : 1548979200000,
"doc_count" : 0
}
]
}
}
Now, if you are asking how this can help performance, here is the trick: add more such filter aggregations, one per tag: "by tag D", "by tag E", etc.
The improvement will come from doing "batch" requests, combining many initial requests into one. It might not be practical to put all 10K of them in one query, but even batches of 100 tags per query can be a game changer.
(Side note: roughly the same behavior can be achieved via terms aggregation with include filter parameter.)
This method of course requires getting your hands dirty and writing a somewhat more complex query, but it comes in handy if you need to run such queries at arbitrary times with zero preparation.
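As a sketch of the batching idea (a hypothetical helper in Python, matching the client already used in the question), building one request that carries a filter aggregation per tag:

def build_batch_body(tag_batch, fixed_filters, month_min, month_max):
    # One filter aggregation (with the nested date_histogram) per tag,
    # so a single search answers the question for the whole batch.
    aggs = {}
    for tag in tag_batch:
        aggs["by tag " + tag] = {
            "filter": {"term": {"tags": tag}},
            "aggs": {
                "by month": {
                    "date_histogram": {
                        "field": "created_time",
                        "interval": "month",
                        "min_doc_count": 0,
                        "extended_bounds": {"min": month_min, "max": month_max}
                    }
                }
            }
        }
    return {
        "size": 0,
        "query": {"bool": {"filter": fixed_filters}},  # the range + terms-lookup filters
        "aggs": aggs
    }

# e.g. 100 tags per request instead of 100 separate searches:
# es.search(index=index, body=build_batch_body(tags[:100], fixed_filters, "2019-01-01", "2019-02-01"))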
Re-indexing the documents
The idea behind the second method is to reduce the set of documents beforehand via the reindex API. The reindex request might look like this:
POST _reindex
{
"source": {
"index": "myindex",
"type": "_doc",
"query": {
"bool": {
"filter": [
{
"range": {
"created_time": {
"gte": "fixed_start_time",
"lte": "fixed_end_time",
"format": "yyyy-MM-dd-HH"
}
}
},
{
"terms": {
"tags": {
"index": "fixed_set_tags_list",
"id": 2,
"type": "twitter",
"path": "tag_list"
}
}
}
]
}
}
},
"dest": {
"index": "myindex_reduced"
}
}
This query will create a new index, myindex_reduced, containing only the elements that satisfy the first two filter clauses.
At this point, the original query can be run against the reduced index without those two clauses.
The speed-up here comes from limiting the number of documents: the smaller the reduced index, the bigger the gain. So if fixed_set_tags_list leaves you with a small fraction of the 1 billion documents, this is an option definitely worth trying.
Downloading data and processing outside Elasticsearch
To be honest, this use case looks more like a job for pandas. If data analytics is what you are after, I would suggest using the scroll API to extract the data to disk and then process it with an arbitrary script.
In Python it can be as simple as using the scan() helper of the elasticsearch library.
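For illustration, a sketch using the scan helper (the index name, date range, and downstream processing are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# Stream every matching document; restrict _source to the fields needed
# for the offline co-occurrence computation.
query = {
    "_source": ["tags", "created_time"],
    "query": {"bool": {"filter": [
        {"range": {"created_time": {"gte": "2019-01-01-00", "lte": "2019-02-01-00",
                                    "format": "yyyy-MM-dd-HH"}}}
    ]}}
}
docs = [hit["_source"] for hit in scan(es, index="myindex", query=query)]
# ... compute co-occurrences with pandas or plain Python on `docs`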
Why not try the brute force approach?
Elasticsearch will already try to help you with your query via the request cache. It is applied only to pure-aggregation queries (size: 0), so it should work in your case.
But it will not, because the content of the query will always be different (the whole JSON of the query is used as the caching key, and we have a new tag in every query). A different level of caching will come into play.
Elasticsearch heavily relies on the filesystem cache, which means that under the hood the more frequently accessed blocks of the filesystem will get cached (practically loaded into RAM). For the end user it means that "warming up" will come gradually and with a volume of similar requests.
In your case, aggregations and filtering will occur on 2 fields: created_time and tags. This means that after doing maybe 10 or 100 requests with different tags, the response time will drop from 1.5 s to something more bearable.
To demonstrate my point, here is a Vegeta plot from my study of Elasticsearch performance under the same query with heavy aggregations sent at a fixed RPS:
As you can see, initially the request was taking ~10 s, and after 100 requests it dropped to a brilliant 200 ms.
I would definitely suggest trying this "brute force" approach: if it works, great; if it does not, it cost nothing.
Hope that helps!

Percentage of matched terms in Elasticsearch

I am using elasticsearch to find similar documents. Below is the query I am using:
{
"query": {
"more_like_this":{
"like": {
"_index": "docs",
"_type": "pdfs",
"_id": "pdf_1"
},
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms: 50,
"minimum_should_match": "50%"
}
}
}
I am extracting the text from PDFs and storing it in my index "docs". Below is the mapping for type "pdfs":
{
"properties": {
"content":{
"type": "string",
"analyzer": "my_analyzer"
}
}
}
In the result set I am getting similar documents with their scores. Based on what I have read so far, it is not possible to calculate percentage similarity from the score, so I am not trying to do that. I am trying to figure out whether it is possible to know:
"Out of the 50 query terms from the source document, how many terms are matched in a document? Or what percentage of terms matched?"
As you can see, in my query I am specifying minimum_should_match as 50%, so I am assuming that Elasticsearch filters documents somewhere based on what percentage of terms are matched in a document. I want to get that percentage. I am fairly new to Elasticsearch; so far I have gone through the documentation but couldn't find out how to do it.
Any pointer/help is appreciated!
