In Elasticsearch 1.7, is there a performance difference between using the type filter and using the term filter (for the field _type)?

Ex:
{
"type" : {
"value" : "my_type"
}
}
vs.
{
"term" : {
"_type" : "my_type"
}
}

Term filters are certainly fast, since they are cached and do not influence the score (constant_score). However, whether they are faster than type filters needs testing on your end.
I did some testing on ES 5.2 and found that type queries (which replace type filters) have almost equivalent performance to term filters.
Since it looks like type filter serves this exact purpose (filters documents matching the provided document/mapping type), I'm inclined to say type filters are faster. Of course, we need empirical results to be certain.
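One way to gather those empirical results is to time both filters against the same index. The Python sketch below assumes a caller-supplied `search` function (hypothetical, for example a thin wrapper around an HTTP client) that posts a request body to `_search` and returns the parsed response. The helper names `filtered_query` and `median_took` are illustrative and not part of any Elasticsearch client.

```python
import statistics

# The two filters from the question.
TYPE_FILTER = {"type": {"value": "my_type"}}
TERM_FILTER = {"term": {"_type": "my_type"}}


def filtered_query(filter_body):
    """Wrap a filter in the ES 1.x `filtered` query with a match_all query."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": filter_body,
            }
        }
    }


def median_took(search, body, runs=20):
    """Run `body` `runs` times and return the median `took` value (ms).

    `search` is an assumed callable: it takes a request body (dict) and
    returns the parsed search response, which carries a `took` field.
    """
    return statistics.median(search(body)["took"] for _ in range(runs))
```

Run each body a few times first so caches are warm, then compare `median_took(search, filtered_query(TYPE_FILTER))` with the same call for `TERM_FILTER`.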

Related

Is there a difference between using search terms and should when querying Elasticsearch

I am performing a refactor of the code to query an ES index, and I was wondering if there is any difference between the two snippets below:
"bool" : {
"should" : [ {
"terms" : {
"myType" : [ 1 ]
}
}, {
"terms" : {
"myType" : [ 2 ]
}
}, {
"terms" : {
"myType" : [ 4 ]
}
} ]
}
and
"terms" : {
"myType" : [ 1, 2, 4 ]
}
Please check this post from the Elastic discuss page, which answers your question. Copying it here for quick reference:
There's a few differences.
The simplest to see is the verbosity - terms queries just list an array while term queries require more JSON.
terms queries do not score matches based on IDF (the rareness) of matched terms - the term query does.
term queries can only have up to 1024 values due to Boolean's max clause count.
terms queries can have more terms.
By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. You can change this limit using the index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the
number of terms?
It depends. They execute differently. term queries do more expensive scoring but do so lazily. They may "skip" over docs during execution because other, more selective criteria may advance the stream of matching docs considered.
The terms query doesn't do expensive scoring but is more eager: it creates the equivalent of a single bitset, with a one or zero for every doc, by ORing all the potential matching docs up front. Many terms can share the same bitset, which is what provides the scalability in the number of terms.
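As a rough illustration that the two shapes from the question select the same documents, here is a Python sketch that builds both and checks them with a toy in-memory matcher. The matcher is an assumption made for this example: it ignores scoring and says nothing about how Elasticsearch actually executes either query.

```python
def terms_query(field, values):
    """Single `terms` query listing all values."""
    return {"terms": {field: list(values)}}


def should_query(field, values):
    """`bool`/`should` with one single-value `terms` clause per value."""
    return {"bool": {"should": [{"terms": {field: [v]}} for v in values]}}


def matches(query, doc):
    """Toy matcher for the two shapes above (no scoring, flat docs only)."""
    if "terms" in query:
        field, values = next(iter(query["terms"].items()))
        return doc.get(field) in values
    if "bool" in query:
        # With only `should` clauses, at least one clause has to match.
        return any(matches(clause, doc) for clause in query["bool"]["should"])
    raise ValueError("unsupported query shape")


docs = [{"myType": 1}, {"myType": 3}, {"myType": 4}]
same = all(
    matches(terms_query("myType", [1, 2, 4]), d)
    == matches(should_query("myType", [1, 2, 4]), d)
    for d in docs
)
```

Here `same` ends up `True`: the difference between the two forms lies in scoring, clause limits and execution strategy, not in which documents match.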

What is the difference between `constant_score + filter` and `term` query?

I have two queries in Elasticsearch:
{
"term" : {
"price" : 20
}
}
and
"constant_score" : {
"filter" : {
"term" : {
"price" : 20
}
}
}
They return the same result. I wonder what the main difference between them is. I read some articles about document scoring, and I believe both queries score documents. The constant_score query will use the default score of 1.0 as each matching document's score, so I don't see much difference between the two.
The results would be exactly the same.
However, the biggest difference is that the constant_score/filter version will cache the results of the term query, since it runs in a filter context. All future executions will leverage that cache. Also, one feature of the constant_score query is that the returned score is always equal to the given boost value (which defaults to 1).
The first query will be run outside of the filter context and hence not benefit from the filter cache.
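For reference, the two request bodies can be built side by side. This is a minimal Python sketch; the helper name `constant_score` is only illustrative, while `boost` is the query's own parameter.

```python
def constant_score(filter_query, boost=1.0):
    """Wrap a query so it runs in filter context; every hit scores `boost`."""
    return {"constant_score": {"filter": filter_query, "boost": boost}}


price_is_20 = {"term": {"price": 20}}

# Query context: a relevance score is computed for each hit.
scored_body = {"query": price_is_20}

# Filter context: no relevance scoring, the score is the boost value.
unscored_body = {"query": constant_score(price_is_20)}
```

Both bodies select the same documents; only the `_score` values and the caching behaviour described above differ.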

Using term or terms with one value in Elasticsearch queries

I am querying an Elasticsearch index using the values of a field. Sometimes, I have to extract all the documents having a field set to exactly one value; Some other times I have to retrieve all the documents having a field, set with one of the values in a list of values.
The latter use case contains the former. Can I use a single query using the terms construct?
POST /_search
{
"query": {
"terms" : { "user" : ["kimchy", "elasticsearch"]}
}
}
Or, in cases where I know I need to search for only a single value, is it better to use the term construct?
POST _search
{
"query": {
"term" : { "user" : "kimchy" }
}
}
Which approach is better regarding performance? Does Elasticsearch perform any optimization if the value in the terms construct is unique?
Thanks to all.
See this link. The terms query is automatically cached, while the term query is not. So the next time you run the same query, the took time will be lower. If you need to run the same query again and again, the terms query is a good choice. If not, there is not much difference between the two.
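If you settle on the terms construct for both cases, a small helper can hide the difference between one value and many. This is a Python sketch; `terms_for` is a made-up name for this example.

```python
def terms_for(field, values):
    """Build a `terms` query from a single value or an iterable of values."""
    # Treat strings and other scalars as a one-element list.
    if isinstance(values, (str, bytes)) or not hasattr(values, "__iter__"):
        values = [values]
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return {"query": {"terms": {field: values}}}


single = terms_for("user", "kimchy")
several = terms_for("user", ["kimchy", "elasticsearch"])
```

With this, the single-value case is just a `terms` query with a one-element array, so the calling code needs only one code path.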

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested into the search query. On return I get back the closest matched group that user should be in.
The field I am searching on is a nested field as follows:
`{"interests": [
{"topics":["python", "stackoverflow", "elasticsearch"]},
{"topics":["arts", "textiles"]}
]}`
However, how do you find out why a document matched?
Elasticsearch does have an explain function, which shows what the score is made up of using TF-IDF, but not specifically which terms were matched.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or in some other way.
The only way I can see to do this is to store the search and the retrieved document, and then process both to discover the words ES has most likely matched on.
EDIT - for some more clarity of the question
An example in my index is a group which has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
In my search, I am looking for Arts and many other things. The term I am searching for comes up in this list many times, and thus should always be a contributor.
What I want in the response is to say that these words were matched: ['arts', 'fine arts', 'art painting', 'arts and crafts'], along with the degree to which they match, i.e. 'arts' should be higher than the others, but all the others are also relevant.
Elasticsearch allows you to specify the _name field for all queries and
filters. This means that you can separate your query into different parts with
separate names, which will allow you to determine which parts matched.
For example:
{
"query" : {
"bool" : {
"should" : [
{"match" : { "interests.topics" : {"query" : "python", "_name" : "py-topic"} }},
{"match" : { "interests.topics" : {"query" : "arts", "_name" : "arts-topic"} }}
]
}
}
}
Then, in your response, each hit will include a matched_queries array listing
which named queries (or filters) matched, so you can determine whether the
py-topic query and/or the arts-topic query matched above.
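A short Python sketch of this approach: build one named clause per search term, then read the names back from each hit. The helper names and the `<term>-topic` naming scheme are assumptions made for this example; `matched_queries` is the per-hit field where the names come back.

```python
def named_should_query(field, terms):
    """Build one named `match` clause per term, named `<term>-topic`."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": t, "_name": f"{t}-topic"}}}
                    for t in terms
                ]
            }
        }
    }


def matched_names(response):
    """Map each hit's _id to the query names listed in `matched_queries`."""
    return {
        hit["_id"]: hit.get("matched_queries", [])
        for hit in response["hits"]["hits"]
    }


body = named_should_query("interests.topics", ["python", "arts"])
```

Note that this tells you which named clauses matched, not how strongly each one matched. For the degree of match you would still need to look at the score or the explain output.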

ElasticSearch - Statistical facet on length of string field

I would like to retrieve data about a string field like the min, max and average length (by counting the number of characters inside the string). My issue is that aggregations can only be used for numeric fields. Besides, I tried it using a simple statistical facet,
"query":{
"match_all": {}
},
"facets":{
"stat1":{
"statistical":{
"field":"title"}
}
}
but I get shard failures and SearchPhaseExecutionException. When trying with a script field the error returned is an OutOfMemoryError:
"query":{
"match_all": {}
},
"script_fields":{
"test1":{"script": "doc[\"title\"].value" }
}
Is it possible to retrieve such data about a simple "title" string field using curl? Thank you!
I haven't actually tried the following, but I believe it should work.
First some useful doc-references:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html.
In order to implement the statistical facet, the relevant field values
are loaded into memory from the index. This means that per shard,
there should be enough memory to contain them. Since by default,
dynamic introduced types are long and double, one option to reduce the
memory footprint is to explicitly set the types for the relevant
fields to either short, integer, or float when possible.
I'm not sure how to set the type of the script field to short, which is probably what you want in order to reduce memory. It should be possible, though.
ALSO: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-script-fields.html
It’s important to understand the difference between
doc['my_field'].value and _source.my_field. The first, using the doc
keyword, will cause the terms for that field to be loaded to memory
(cached), which will result in faster execution, but more memory
consumption. Also, the doc[...] notation only allows for simple valued
fields (can’t return a json object from it) and make sense only on
non-analyzed or single term based fields.
An alternative would be to use _source instead of doc, which avoids loading the field's terms into memory.
This gives:
{
"query" : {
"match_all" : {}
},
"facets" : {
"stat1" : {
"statistical" : {
"script" : "doc['title'].value.length()"
}
}
}
}
For the alternative, which isn't cached, replace the script value with "_source.title.length()".
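The same request body can be generated from Python, which makes it easy to switch between the two scripts. This is a sketch only: the helper name `title_length_facet` and its `cached` flag are invented for this example, and, like the answer above, it has not been run against a cluster. The facets API exists only in older Elasticsearch releases.

```python
import json


def title_length_facet(field="title", cached=True):
    """Build a statistical-facet request over the length of a string field.

    cached=True uses doc[...] (field values loaded into memory);
    cached=False reads from _source instead.
    """
    if cached:
        script = f"doc['{field}'].value.length()"
    else:
        script = f"_source.{field}.length()"
    return {
        "query": {"match_all": {}},
        "facets": {"stat1": {"statistical": {"script": script}}},
    }


# JSON payload that could be sent as the body of a _search request.
payload = json.dumps(title_length_facet(cached=False))
```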
