Short queries return not enough results - elasticsearch

Hey, I have a field in Elasticsearch that is analyzed with the alphanumeric_analyzer. Then I index data into that field that looks like this:
Test-00001
Test-00002
to
Test-01000
If I execute the following query, I get 250 results consistently, but they aren't necessarily Test-00001 to Test-00250.
{
  "query": {
    "match": {
      "filename_Analyzed": {
        "type": "phrase_prefix",
        "query": "0"
      }
    }
  }
}
I was expecting to get 1000 results, but I only get 250. Are my expectations correct, or is the search incorrect?
EDIT 1:
Gist for the mapping:
https://gist.github.com/goalie7960/8ffd1536269a901f18bc
EDIT 2:
If I double the number of shards, the number of results also doubles. So 5 shards = 250 results, 10 shards = 500 results, etc.
EDIT 3:
Here's a gist for the analyzer I am using. But I can also reproduce with the standard analyzer.
https://gist.github.com/goalie7960/b0bbbddf1cee29b4b5ed

Turns out the prefix query (phrase prefix) was exceeding the max expansions limit in Elasticsearch. A not-so-simple solution was to switch to ngram analysis, and that has fixed the problem. Yay.
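For context, max_expansions is applied per shard and defaults to 50 for phrase-prefix matching, which lines up with 250 hits on 5 shards and 500 on 10. A minimal sketch of the same search written with the dedicated match_phrase_prefix query and a raised limit (the index name and the value 2000 are illustrative only; the actual fix above was ngram analysis at index time):
GET /my_index/_search
{
  // "my_index" and the 2000 limit are illustrative; max_expansions defaults to 50 per shard
  "query": {
    "match_phrase_prefix": {
      "filename_Analyzed": {
        "query": "0",
        "max_expansions": 2000
      }
    }
  }
}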

Related

ElasticSearch 7.7 how can I increase the count of results of whole index

I understand that there is a hardcoded limit in Elasticsearch of 10k results per query. What I want to know is whether there is any way to search within this 10k limit but at the same time show the count of all results for this particular query.
So suppose there are 1M results matching a certain query; the count should show 1M instead of the 10k limit.
Thank you.
Yes, you can.
You need to add the below attribute to your search query:
{
  "track_total_hits": true
}
It will show you the total count along with the default results.
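For example, combined with an actual query (the index name here is just a placeholder):
GET /my_index/_search
{
  // "my_index" is a placeholder
  "track_total_hits": true,
  "query": {
    "match_all": {}
  }
}
With track_total_hits set to true, hits.total.value in the response holds the real count and hits.total.relation is "eq"; without it, the value is capped at 10,000 and the relation becomes "gte".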
Elasticsearch also supports a /_count API that returns the count of all hits for a query:
GET /index/_count
{
  // your search query here
  "query": {
    "match_all": {}
  }
}
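The response to this is just a count plus shard info, roughly like the following (the count value is made up for illustration):
{
  "count": 1000000,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  }
}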
You can add "from" and "size" to page through specific hits of the response.
Example
GET index/_search
{
  "from": 0,
  "size": 100,
  "query": {
    "match_all": {}
  }
}
In the query response returned from Elasticsearch, there is a field response['hits']['total']['value'] which also contains the count of hits, but as noted above it is capped at 10,000 unless track_total_hits is set.
NOTE: the /_count API doesn't support "from" and "size"; it gives you only the total count.
For more details, visit the Elasticsearch Count API documentation.

Is query context evaluated before filter context in elasticsearch? How to determine the order of evaluation?

I am using the below query :
GET customer/doc/_search?routing=123
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "location": "Delhi"
          }
        }
      ],
      "should": [
        {
          "match_phrase_prefix": {
            "phone": {
              "query": "650",
              "max_expansions": 100
            }
          }
        }
      ]
    }
  }
}
The problem is that my search on phone isn't working anymore. It used to work fine when I had less data; now every shard has data for multiple locations, and a search on phone requires me to type in 6 or 7 characters at times. (There may be matching phone numbers that have a different location but live on this shard.)
This is due to max_expansions, I am guessing. When I increase it to 500 it does return search results (though not all), but the query becomes slow.
Isn't there a way to force ES to apply the filter first (and restrict the dataset) and then apply the should clause, so that I get the matching results even with a small value of max_expansions?
Any help is appreciated.
It is due to max_expansions. Restricting the dataset is not exactly what you want to do (that is also not very straightforward; you may have to use a script, which will in turn slow down the query).
When you query with a prefix or wildcard expression, Lucene expands it into a set of actual terms from your inverted index term dictionary. When you restrict that expansion to 500 terms, it might miss a few.
I would consider generating prefixes during the indexing phase, as sketched below. Indexed prefixes avoid the costly expansion at query time.
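A hedged sketch of one way to do that, assuming a typeless (7.x-style) mapping and illustrative analyzer names: index the phone field with an edge_ngram analyzer so every prefix becomes its own term, and search it with the standard analyzer.
PUT /customer
{
  // illustrative sketch: analyzer/tokenizer names and gram sizes are assumptions
  "settings": {
    "analysis": {
      "tokenizer": {
        "phone_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "phone_prefix": {
          "type": "custom",
          "tokenizer": "phone_edge_ngram"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "location": { "type": "keyword" },
      "phone": {
        "type": "text",
        "analyzer": "phone_prefix",
        "search_analyzer": "standard"
      }
    }
  }
}
With this mapping, a short input like "650" matches the indexed edge n-grams directly, so the should clause can be a plain match query on phone with no expansion at query time, and the location filter keeps working as before. The built-in index_prefixes mapping option is another way to get a similar effect for prefix queries.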

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the previous link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
  "query": {
    "match": {
      "name": "George's system"
    }
  }
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?
With Elasticsearch there is no way to execute two nested queries like in a relational database, where one query uses the response of the other. The application-side join example means that you actually make two queries (two separate requests to Elasticsearch) from the application side, as sketched below.
With the first query you get the list of ids you need to filter on.
With the second query you pass the list of ids you got to a terms filter.
This works when you have no more than 1024 values for systemId, because the terms query has a limit on the number of terms.
Because this cannot be expressed as a single query, you can't visualize it in Kibana.
In such a case you have to sacrifice a little space and add the systemId to your mapping.
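A minimal sketch of the two requests (the _source filter and the systemId values in the second request are hypothetical; in practice the application copies the ids out of the first response):
GET /system/_search
{
  // step 1: collect the ids to join on ("systemId" is assumed to exist on system documents)
  "_source": ["systemId"],
  "query": {
    "match": {
      "name": "George's system"
    }
  }
}

GET /telemetry/_search
{
  // step 2: the ids below are placeholders copied from the first response
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "systemId": ["sys-001", "sys-042"]
        }
      }
    }
  }
}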
Good Luck!

How is Elastic Search sorting when no sort option specified and no search query specified

I wonder how Elasticsearch sorts (on what field) when no search query is specified (I just filter on documents) and no sort option is specified. It looks like the sorting is then random... The default sort order is _score, but the score is always 1 when you do not specify a search query...
You got it right. It's then more or less random, with the score being 1. You still get consistent results as far as I remember. It's the same as getting results in SQL without specifying ORDER BY.
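If you need a deterministic order, you can add an explicit sort to the filtered search, for example on a date field with _doc as a tie-breaker (the index and field names below are made up for illustration):
GET /my_index/_search
{
  // "status" and "created_at" are hypothetical fields for illustration
  "query": {
    "bool": {
      "filter": {
        "term": { "status": "active" }
      }
    }
  },
  "sort": [
    { "created_at": "desc" },
    "_doc"
  ]
}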
Just in case someone sees this post even though it was posted over 6 years ago:
When you want to know how Elasticsearch calculates its own score, known as _score, you can use the explain option.
I suppose your query (with a filter and without a search) might look more or less like this (but the point is setting the explain option to true):
POST /goods/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "maker_name": "nike"
        }
      }
    }
  }
}
Running this, you will notice that the _explanation of each hit reads as below:
"_explanation" : {
"value" : 1.0,
"description" : "ConstantScore(maker_name:nike)",
"details" : [ ]
}
which means ES gave a constant score to all of the hits.
So to answer the question: yes.
The results are sorted somewhat randomly because all the filtered results have the same (constant) score when there is no search query.
By the way, enabling the explain option is more helpful when you use search queries. You will see how ES calculates the score and will understand why it returns results in that order.
The score is mainly used for sorting. It is calculated by Lucene using several factors; for more info, refer to the Lucene scoring documentation.

ElasticSearch results aren't relevant

In ElasticSearch, I've created two documents with one field, "CategoryMajor"
In doc1, I set CategoryMajor to "Restaurants"
In doc2, I set CategoryMajor to "Restaurants Restaurants Restaurants Restaurants Restaurants"
If I perform a search for CategoryMajor:Restaurants, doc1 shows up as MORE RELEVANT than doc2. This is not typical Lucene behavior, which gives more relevance the more times a term shows up; doc2 should be MORE RELEVANT than doc1.
How do I fix this?
You can add &explain=true to your GET query to see that the score of doc2 is lowered by the "fieldNorm" factor. This is caused by the default Lucene similarity formula, which lowers the score for longer fields. Please read this document about the default Lucene similarity formula:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/search/Similarity.html
To disable this behaviour, add "omit_norms": true for the CategoryMajor field to your index mapping by sending a PUT request to:
http://localhost:9200/index/type/_mapping
with request body:
{
  "type": {
    "properties": {
      "CategoryMajor": {
        "type": "string",
        "omit_norms": "true"
      }
    }
  }
}
I'm not certain, but it may be necessary to delete your index, create it again, put the above mapping, and then reindex your documents. Reindexing after changing the mapping is necessary for sure :).
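Note that omit_norms and the string field type belong to older Elasticsearch versions; in 5.x and later the equivalent (a sketch, assuming a typeless 7.x-style index) is a text field with norms disabled:
PUT /index
{
  // sketch only: assumes a typeless (7.x-style) index named "index"
  "mappings": {
    "properties": {
      "CategoryMajor": {
        "type": "text",
        "norms": false
      }
    }
  }
}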
